TbUtf8
Latest revision as of 22:26, 15 April 2022
About
TbUtf8 is a library to easily change UTF-8 encoded strings.
Problem
In Lazarus (Free Pascal), strings are UTF-8 encoded. However, the "String" type is nothing more than a dynamic byte array: Length returns the number of bytes in the array, not the number of characters. In UTF-8, a single character can be up to 4 bytes long, and a combined character (a base character plus combining marks) can take 7 bytes or more. An example illustrates this: 'Thomas' has 6 characters and is 6 bytes in size, while 'Thömäs' has 6 characters but is 8 bytes in size.
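The difference between byte count and character count can be reproduced with plain Free Pascal, without TbUtf8. The following is a minimal sketch, not code from the library: CountCodePoints is a hypothetical helper that counts UTF-8 code points by skipping continuation bytes, and the test string is written as explicit byte escapes so the result does not depend on the encoding of the source file. Note that it counts code points only; a combined character made of several code points would still be counted more than once.

```pascal
program Utf8LengthDemo;
{$mode objfpc}{$H+}

// Hypothetical helper (not part of TbUtf8): counts UTF-8 code points.
// Continuation bytes have the bit pattern 10xxxxxx and are skipped.
function CountCodePoints(const S: String): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  Plain, Umlauts: String;
begin
  Plain := 'Thomas';
  // 'Thömäs' as explicit UTF-8 bytes: ö = C3 B6, ä = C3 A4
  Umlauts := 'Th'#$C3#$B6'm'#$C3#$A4's';

  // Prints: 6 bytes, 6 code points
  WriteLn(Length(Plain), ' bytes, ', CountCodePoints(Plain), ' code points');
  // Prints: 8 bytes, 6 code points
  WriteLn(Length(Umlauts), ' bytes, ', CountCodePoints(Umlauts), ' code points');
end.
```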
Solution
With TbUtf8 you can easily modify and search UTF-8 strings that contain special and combined characters, such as "üäößẶặǺǻǼǽǞǟǍǎḂḃÞþÇçĆćĊċ...". Essentially, the library consists of a UTF-8 string class (TIbUtf8).
Benefits
- TIbUtf8 descends from TInterfacedObject, so an instance held through its interface is reference counted and does not need to be released with Free.
- All indexes are character based.
- All returned characters are of type String.
- Returns the number of characters in the string.
- Returns the number of bytes in the string.
- Deleting characters or character groups.
- Inserting characters or character groups.
- Appending characters or character groups.
- Reading and writing characters or character groups.
- Reading from and writing to a file.
- Reading from and writing to a stream.
Disadvantages
- Since UTF-8 does not have a constant offset from character to character, locating a character is much more complex than indexing into a byte array. Iterating over the characters is about 20 times slower than iterating over a plain String. (Convenience has its price.)
- Slightly more memory is required.
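To see why character-based indexing costs more than byte indexing, consider what it takes to find the n-th character in a UTF-8 string. The sketch below is an illustration in plain Free Pascal, not the way TbUtf8 is implemented: ByteOffsetOfChar is a hypothetical helper that has to walk the bytes from the start, because the position of a character depends on the width of every character before it.

```pascal
program Utf8IndexDemo;
{$mode objfpc}{$H+}

// Hypothetical helper (not part of TbUtf8): returns the 1-based byte offset
// of the CharIndex-th code point, or 0 if the string has fewer code points.
function ByteOffsetOfChar(const S: String; CharIndex: Integer): Integer;
var
  i, Seen: Integer;
begin
  Result := 0;
  Seen := 0;
  for i := 1 to Length(S) do
    // Any byte that is not a continuation byte (10xxxxxx) starts a code point.
    if (Ord(S[i]) and $C0) <> $80 then
    begin
      Inc(Seen);
      if Seen = CharIndex then
        Exit(i);
    end;
end;

var
  S: String;
begin
  // 'Thömäs' as explicit UTF-8 bytes: ö = C3 B6, ä = C3 A4
  S := 'Th'#$C3#$B6'm'#$C3#$A4's';
  WriteLn(ByteOffsetOfChar(S, 3)); // 3: 'ö' starts at byte 3
  WriteLn(ByteOffsetOfChar(S, 4)); // 5: 'm' follows the two-byte 'ö'
  WriteLn(ByteOffsetOfChar(S, 6)); // 8: 's' is the last byte
end.
```

With a plain String, S[n] is a single memory access; here every lookup is a scan, which is the kind of overhead the first point describes.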
Example
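    program TbUtf8Example;
    {$mode objfpc}{$H+}

    uses
      tb_utf8;  // assumed unit name, taken from the file name src/tb_utf8.pas

    var
      u: IbUtf8;
      i: Integer;
    begin
      u := TIbUtf8.Create('Thömäß');
      // Indexes are character based: the loop visits 6 characters, not 9 bytes.
      for i := 1 to u.NumberOfChars do begin
        case u.Chars[i] of
          'ö': u.Chars[i] := 'o';
          'ä': u.Chars[i] := 'a';
          'ß': u.Chars[i] := 's';
        end;
      end;
      if u.Text = 'Thomas' then begin
        WriteLn('That''s right!');
      end;
    end.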
Download
- GitLab repository: FpTuxe/TbUtf8 (https://gitlab.com/FpTuxe/tbutf8)
- Git clone:
    git clone https://gitlab.com/FpTuxe/tbutf8.git
Installation
Variant 1
- Start Lazarus and open your project.
- Open the unit via File -> Open: your workspace/tbutf8/src/tb_utf8.pas
- Add it to the project via Project -> Add Editor File to Project.
Variant 2
- Start Lazarus and open your project.
- Open the package via Package -> Open Package File (.lpk): your workspace/tbutf8/src/tbutf8.lpk
- In the package window, click Use -> Add to Project.
- Close the package window.
Functional Description
The functional description can be found in the project folder "doc/".