UTF8 string class/en: Difference between revisions
Latest revision as of 17:36, 23 January 2022
What is TbUtf8?
The TbUtf8 library makes it easy to modify UTF-8 strings.
Problem
In Lazarus (Free Pascal), strings are UTF-8 encoded. However, the "String" type is nothing more than a dynamic byte array: Length returns the number of bytes in the array, not the number of characters. In UTF-8, a single code point can be up to 4 bytes long, and a character combined from several code points can take even more, for example 7 bytes. An example illustrates this: 'Thomas' has 6 characters and is 6 bytes in size, while 'Thömäs' has 6 characters but is 8 bytes in size.
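The mismatch between bytes and characters can be checked with plain Free Pascal, without TbUtf8. The following is only a minimal sketch: CountCodePoints is a hypothetical helper (it is not part of TbUtf8) that counts every byte that is not a UTF-8 continuation byte. The sample string is written with byte escapes so that the result does not depend on the encoding of the source file. The helper counts code points, so a character combined from several code points would still be counted more than once.

```pascal
program ByteVsCharCount;

{$mode objfpc}{$H+}

// Hypothetical helper, not part of TbUtf8: counts UTF-8 code points
// by skipping continuation bytes (bit pattern 10xxxxxx).
function CountCodePoints(const S: AnsiString): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  Sample: AnsiString;
begin
  // 'Thömäs' spelled out as UTF-8 bytes: ö = C3 B6, ä = C3 A4
  Sample := 'Th'#$C3#$B6'm'#$C3#$A4's';
  WriteLn('Bytes:       ', Length(Sample));          // 8
  WriteLn('Code points: ', CountCodePoints(Sample)); // 6
end.
```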
Solution
With TbUtf8 you can now easily change and search UTF8 strings with special and combined characters, such as "üäößẶặǺǻǼǽǞǟǍǎḂḃÞþÇçĆćĊċ...". Essentially, the library consists of a UTF8 string class (TIbUtf8).
Benefits
- TIbUtf8 descends from TInterfacedObject, so an instance held in an interface variable is reference counted and does not need to be released with Free.
- All indexes are character based.
- All returned characters are of type String.
- Returns the number of characters in the string.
- Returns the number of bytes in the string.
- Delete characters or character groups.
- Insertion of characters and character groups.
- Appending characters and character groups.
- Reading / writing of characters and character groups.
- Read from file / write to a file.
- Read from stream / write to a stream.
Disadvantages
- Since UTF-8 does not have a constant offset from character to character, searching for characters is much more complex. Iterating over the characters is about 20 times slower than with a plain String (comfort has its price).
- Slightly more memory is required.
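Why character-based indexing costs more than byte indexing can be sketched in plain Free Pascal. This is an illustration only and is not taken from the TbUtf8 sources: ByteOffsetOfChar is a hypothetical helper that finds the byte position of the N-th code point. Because code points have variable width, each lookup has to walk the string from its first byte instead of jumping directly to an offset.

```pascal
program CharIndexCost;

{$mode objfpc}{$H+}

// Hypothetical helper, not taken from TbUtf8: returns the 1-based byte
// position where the CharIndex-th code point starts, or 0 if S has fewer
// code points. Every lookup walks the string from the first byte.
function ByteOffsetOfChar(const S: AnsiString; CharIndex: Integer): Integer;
var
  i, Seen: Integer;
begin
  Result := 0;
  Seen := 0;
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) <> $80 then  // lead byte or ASCII byte
    begin
      Inc(Seen);
      if Seen = CharIndex then
        Exit(i);
    end;
end;

var
  Sample: AnsiString;
begin
  // 'Thömäs' spelled out as UTF-8 bytes: ö = C3 B6, ä = C3 A4
  Sample := 'Th'#$C3#$B6'm'#$C3#$A4's';
  WriteLn(ByteOffsetOfChar(Sample, 4)); // 'm' is the 4th character, at byte 5
  WriteLn(ByteOffsetOfChar(Sample, 6)); // 's' is the 6th character, at byte 8
end.
```

A library that offers character-based indexes has to do this kind of scan, or keep extra bookkeeping to avoid it, which fits the speed and memory costs listed above.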
Example
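procedure Demo01;
var
  u: IbUtf8;
  i: Integer;
begin
  u := TIbUtf8.Create('Thömäß');
  // Indexes are character based, so i addresses characters, not bytes.
  for i := 1 to u.NumberOfChars do begin
    // Replace each special character with a plain ASCII letter.
    case u.Chars[i] of
      'ö': u.Chars[i] := 'o';
      'ä': u.Chars[i] := 'a';
      'ß': u.Chars[i] := 's';
    end;
  end;
  if u.Text = 'Thomas' then begin
    // A single quote inside a Pascal string literal must be doubled.
    WriteLn('That''s right!');
  end;
end;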
Download
- GitLab repository: FpTuxe/TbUtf8 (https://gitlab.com/FpTuxe/tbutf8)
- Git clone
git clone https://gitlab.com/FpTuxe/tbutf8.git
Installation
- Variant 1
  - Start Lazarus and open your project.
  - In Lazarus, choose File->Open and select your workspace/tbutf8/src/tb_utf8.pas
  - Choose Project->Add Editor File to Project
- Variant 2
  - Start Lazarus and open your project.
  - In Lazarus, choose Package->Open Package File (.lpk) and select your workspace/tbutf8/src/tbutf8.lpk
  - In the Package window, click Use->Add to Project
  - Close the Package window.
Functional Description
The functional description can be found in the project folder "doc/".