Difference between revisions of "UTF8 string class/en"

From Lazarus wiki

Latest revision as of 17:36, 23 January 2022

What is TbUtf8?

The TbUtf8 library makes it easy to manipulate UTF-8 strings.

Problem

In Lazarus (Free Pascal), strings are UTF-8 encoded. However, the "String" type is nothing more than a dynamic byte array: Length returns the number of bytes in the array, not the number of characters. In UTF-8, a single character can be up to 4 bytes long, and a combined character can take even 7 bytes. An example illustrates this: 'Thomas' has 6 characters and is 6 bytes in size, while 'Thömäs' also has 6 characters but is 8 bytes in size.
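The byte/character mismatch described above can be reproduced with plain Free Pascal, without TbUtf8. The following is a minimal sketch, not part of the library: CountCodePoints is a hypothetical helper written for this illustration, and the sample string is spelled out as explicit UTF-8 bytes so that the result does not depend on the encoding of the source file. It counts code points, so a base letter followed by a combining mark would count as two.

```pascal
program Utf8LengthDemo;
{$mode objfpc}{$H+}

// Hypothetical helper (not part of TbUtf8): counts UTF-8 code points
// by skipping continuation bytes, which have the bit pattern 10xxxxxx.
function CountCodePoints(const S: String): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

const
  // 'Thömäs' as explicit UTF-8 bytes: ö = C3 B6, ä = C3 A4
  Sample = 'Th'#$C3#$B6'm'#$C3#$A4's';

begin
  WriteLn('Bytes:       ', Length(Sample));          // Length counts bytes
  WriteLn('Code points: ', CountCodePoints(Sample)); // helper counts code points
end.
```

For this sample, Length reports 8 and the helper reports 6, matching the numbers given in the text.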

Solution

TbUtf8 lets you modify and search UTF-8 strings that contain special and combined characters, such as "üäößẶặǺǻǼǽǞǟǍǎḂḃÞþÇçĆćĊċ...". At its core, the library consists of a UTF-8 string class (TIbUtf8).

Benefits

  • TIbUtf8 is derived from TInterfacedObject and does not need to be cleaned up with Free.
  • All indexes are character based.
  • All returned characters are of type String.
  • Reports the number of characters in the string.
  • Reports the number of bytes in the string.
  • Deletes characters or character groups.
  • Inserts characters and character groups.
  • Appends characters and character groups.
  • Reads and writes characters and character groups.
  • Reads from a file and writes to a file.
  • Reads from a stream and writes to a stream.
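The first benefit relies on the reference counting that Free Pascal applies to objects held through an interface variable. The sketch below shows that general mechanism only; IGreeter, TGreeter, and LiveObjects are hypothetical names invented for this illustration and have nothing to do with the TbUtf8 sources.

```pascal
program RefCountDemo;
{$mode objfpc}{$H+}

type
  // Hypothetical interface and class, used only to show reference counting.
  IGreeter = interface
    ['{6F1C2B6E-3A57-4B0C-9C3D-2E1A7F4B8D10}']
    function Greeting: String;
  end;

  TGreeter = class(TInterfacedObject, IGreeter)
  public
    constructor Create;
    destructor Destroy; override;
    function Greeting: String;
  end;

var
  LiveObjects: Integer = 0;  // number of TGreeter instances currently alive

constructor TGreeter.Create;
begin
  inherited Create;
  Inc(LiveObjects);
end;

destructor TGreeter.Destroy;
begin
  Dec(LiveObjects);
  inherited Destroy;
end;

function TGreeter.Greeting: String;
begin
  Result := 'hello';
end;

procedure UseGreeter;
var
  G: IGreeter;  // interface-typed variable: reference counted
begin
  G := TGreeter.Create;
  WriteLn(G.Greeting, ', live objects: ', LiveObjects);
end;  // G goes out of scope here and the object is destroyed without Free

begin
  UseGreeter;
  WriteLn('After the call, live objects: ', LiveObjects);
end.
```

The important detail is that the variable is declared with the interface type, not the class type. A variable declared as the class type would not be reference counted in this way.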

Disadvantages

  • Because UTF-8 does not have a constant offset from one character to the next, searching for characters is much more complex. Iterating over the characters is about 20 times slower than with a plain String. (Comfort has its price.)
  • Slightly more memory is required.
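To see why character-based indexing costs more than byte indexing, consider a naive lookup of the byte position at which the n-th character starts. This is only a sketch of the general problem, under the assumption of a simple linear scan; it does not show how TbUtf8 is implemented, and CodePointByteOffset is a hypothetical name. It also works on code points, so it does not group combining marks with their base character.

```pascal
program Utf8IndexDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: returns the 1-based byte position where the
// CharIndex-th code point starts, or 0 if the string is too short.
// Every call scans from the beginning of the string.
function CodePointByteOffset(const S: String; CharIndex: Integer): Integer;
var
  I, Seen: Integer;
begin
  Result := 0;
  Seen := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then  // not a continuation byte: new code point
    begin
      Inc(Seen);
      if Seen = CharIndex then
        Exit(I);
    end;
end;

const
  // 'Thömäs' as explicit UTF-8 bytes: ö = C3 B6, ä = C3 A4
  Sample = 'Th'#$C3#$B6'm'#$C3#$A4's';

begin
  WriteLn(CodePointByteOffset(Sample, 4));  // byte position of 'm'
  WriteLn(CodePointByteOffset(Sample, 6));  // byte position of 's'
end.
```

With a plain String, S[I] is a direct byte access. With a scan like the one above, each character lookup walks the string from the start, which is one reason a character-based loop is slower than a byte-based one.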


Example

procedure Demo01;
var
  u: IbUtf8;
  i: Integer;
begin
  u := TIbUtf8.Create('Thömäß');
  // Indexes are character based, so each umlaut is a single element.
  for i := 1 to u.NumberOfChars do begin
    case u.Chars[i] of
      'ö': u.Chars[i] := 'o';
      'ä': u.Chars[i] := 'a';
      'ß': u.Chars[i] := 's';
    end;
  end;
  if u.Text = 'Thomas' then begin
    // A single quote inside a Pascal string literal must be doubled.
    WriteLn('That''s right!');
  end;
end;

Download

  • GitLab repository: FpTuxe/TbUtf8 (https://gitlab.com/FpTuxe/tbutf8)
  • Git clone:
git clone https://gitlab.com/FpTuxe/tbutf8.git

Installation

Variant 1
Start Lazarus and open your project.
In Lazarus, choose File -> Open and select "your workspace/tbutf8/src/tb_utf8.pas".
Then choose Project -> Add Editor File to Project.

Variant 2
Start Lazarus and open your project.
In Lazarus, choose Package -> Open Package File (.lpk) and select "your workspace/tbutf8/src/tbutf8.lpk".
In the Package window, click Use -> Add to Project.
Close the Package window.

Functional Description

The functional description can be found in the project folder "doc/".