UTF8 string class/en

From Lazarus wiki

Revision as of 15:11, 23 January 2022

In German: UTF8 string class

What is TbUtf8?

Under construction!!

The TbUtf8 library makes it easy to modify UTF8 strings.

Problem

In Lazarus (Free Pascal), strings are UTF8 encoded. However, the "String" type is nothing more than a dynamic byte array: Length returns the number of bytes in the array, not the number of characters. In UTF8, a single code point takes 1 to 4 bytes, and a character built from combining characters can take even more, for example 7 bytes. An example illustrates this: 'Thomas' has 6 characters and is 6 bytes long; 'Thömäs' also has 6 characters but is 8 bytes long.
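The difference between bytes and characters can be reproduced with a minimal sketch in plain Free Pascal. The helper CountCodePoints is a hypothetical name introduced here for illustration and is not part of TbUtf8. It assumes valid UTF8 input and counts code points, so a character built from combining characters would be counted as several. The non-ASCII string is written as explicit byte escapes so that the result does not depend on how the source file is saved.

```pascal
program Utf8LengthDemo;

{$mode objfpc}{$H+}

// Hypothetical helper, not part of TbUtf8.
// Counts UTF8 code points by skipping continuation bytes (bit pattern 10xxxxxx).
// Assumes S holds valid UTF8; combining characters are counted separately.
function CountCodePoints(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

const
  Plain = 'Thomas';
  // 'Thomas' with o-umlaut (C3 B6) and a-umlaut (C3 A4) as raw UTF8 bytes.
  Umlauts = 'Th'#$C3#$B6'm'#$C3#$A4's';

begin
  WriteLn('Plain:   ', Length(Plain), ' bytes, ',
    CountCodePoints(Plain), ' characters');
  WriteLn('Umlauts: ', Length(Umlauts), ' bytes, ',
    CountCodePoints(Umlauts), ' characters');
end.
```

The program should report 6 bytes and 6 characters for the first string, and 8 bytes and 6 characters for the second. In a Lazarus project, the UTF8Length function from the LazUTF8 unit can be used for the same purpose.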

Solution

TbUtf8 lets you easily modify and search UTF8 strings that contain special and combined characters, such as "üäößẶặǺǻǼǽǞǟǍǎḂḃÞþÇçĆćĊċ...". The core of the library is a UTF8 string class (TIbUtf8).

Benefits

  • TIbUtf8 is derived from TInterfacedObject, so it does not need to be released with Free.
  • All indexes are character based.
  • All returned characters are of type String.
  • Returns the number of characters in the string.
  • Returns the number of bytes in the string.
  • Deletes characters or character groups.
  • Inserts characters and character groups.
  • Appends characters and character groups.
  • Reads and writes characters and character groups.
  • Reads from and writes to files.
  • Reads from and writes to streams.
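
The first benefit relies on the reference counting that Free Pascal applies to objects held through an interface variable. The sketch below shows only that general mechanism. IText and TText are hypothetical names invented for this illustration and say nothing about the actual TbUtf8 API.

```pascal
program RefCountDemo;

{$mode objfpc}{$H+}

type
  // Hypothetical interface and class, for illustration only; not TbUtf8's API.
  IText = interface
    ['{6F1C2B7A-3D4E-4A5B-9C8D-1E2F3A4B5C6D}']
    function GetValue: string;
  end;

  TText = class(TInterfacedObject, IText)
  private
    FValue: string;
  public
    constructor Create(const AValue: string);
    destructor Destroy; override;
    function GetValue: string;
  end;

constructor TText.Create(const AValue: string);
begin
  inherited Create;
  FValue := AValue;
end;

destructor TText.Destroy;
begin
  // Printed when the last interface reference is dropped.
  WriteLn('TText destroyed automatically');
  inherited Destroy;
end;

function TText.GetValue: string;
begin
  Result := FValue;
end;

procedure UseText;
var
  T: IText; // interface variable: reference counted
begin
  T := TText.Create('Thomas');
  WriteLn(T.GetValue);
  // No Free call here: the object is released when T goes out of scope.
end;

begin
  UseText;
  WriteLn('Done');
end.
```

The destructor message should appear before "Done", which shows that the object is released at the end of UseText without an explicit call to Free. This only holds as long as the object is accessed through the interface variable rather than a class-typed variable.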

Disadvantages

  • Because UTF8 characters vary in byte length, there is no constant offset from one character to the next, so locating a character is much more complex. Iterating over the characters is about 20 times slower than with a plain String. (Convenience has its price.)
  • Slightly more memory is required.
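
The cost of character-based indexing can be seen in a minimal sketch that looks up the n-th code point of a plain String. SeqLen and CodePointAt are hypothetical helpers written for this illustration, not TbUtf8 functions, and they assume valid UTF8 input. Because the byte position of a character is not known in advance, the lookup has to walk the string from the first byte.

```pascal
program Utf8IndexDemo;

{$mode objfpc}{$H+}

// Hypothetical helper: byte length of the UTF8 sequence that starts with
// lead byte B. Assumes valid UTF8; an unexpected byte is treated as length 1.
function SeqLen(B: Byte): Integer;
begin
  if B < $80 then
    Result := 1
  else if (B and $E0) = $C0 then
    Result := 2
  else if (B and $F0) = $E0 then
    Result := 3
  else if (B and $F8) = $F0 then
    Result := 4
  else
    Result := 1;
end;

// Hypothetical helper: returns the CharIndex-th code point (1-based) of S
// as a string, or '' if the index is out of range.
// The scan always starts at the first byte of S.
function CodePointAt(const S: string; CharIndex: Integer): string;
var
  BytePos, Current: Integer;
begin
  Result := '';
  BytePos := 1;
  Current := 1;
  while BytePos <= Length(S) do
  begin
    if Current = CharIndex then
      Exit(Copy(S, BytePos, SeqLen(Ord(S[BytePos]))));
    Inc(BytePos, SeqLen(Ord(S[BytePos])));
    Inc(Current);
  end;
end;

const
  // 'Thomas' with o-umlaut (C3 B6) and a-umlaut (C3 A4) as raw UTF8 bytes.
  Umlauts = 'Th'#$C3#$B6'm'#$C3#$A4's';

begin
  WriteLn('Character 4: ', CodePointAt(Umlauts, 4));
  WriteLn('Character 3 occupies ', Length(CodePointAt(Umlauts, 3)), ' bytes');
end.
```

With a plain String, S[4] is a single memory access. Here the fourth character ('m') is found only after stepping over the two-byte sequence in front of it, which is why iterating by character index is slower.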