UTF8 strings and characters

From Lazarus wiki
Revision as of 01:26, 1 February 2015 by JuhaManninen

Up to Lazarus 0.9.30, the UTF-8 handling routines lived in the LCL unit LCLProc. From Lazarus 0.9.31 on, the routines in LCLProc remain available for backwards compatibility, but the actual UTF-8 code is located in the unit lazutf8 of the lazutils package.

To operate on UTF-8 strings, use the routines from the unit lazutf8 instead of those from the Free Pascal unit SysUtils, because SysUtils is not yet prepared to deal with Unicode, while lazutf8 is. In most cases you can simply substitute a SysUtils routine with its lazutf8 equivalent, which has the same name with an added "UTF8" prefix (for example, UTF8UpperCase instead of UpperCase).
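As a rough illustration of that substitution, the sketch below calls SysUtils.UpperCase and UTF8UpperCase on the same text. It assumes the LazUtils package is on the unit search path. The sample word is made up and written as explicit UTF-8 byte escapes, so the string contents do not depend on the encoding of the source file. How the two results are displayed depends on the console.

```pascal
program UpperCaseDemo;
{$mode objfpc}{$H+}
uses
  SysUtils, LazUTF8; // LazUTF8 is part of the LazUtils package

var
  s: string;
begin
  // "über" as UTF-8 bytes: U+00FC is encoded as C3 BC
  s := #$C3#$BC'ber';
  // SysUtils.UpperCase converts only the ASCII letters a..z,
  // so the two bytes of the first character are left untouched
  WriteLn(UpperCase(s));
  // UTF8UpperCase also maps non-ASCII letters such as U+00FC
  WriteLn(UTF8UpperCase(s));
end.
```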

Also note that simply iterating over chars as if the string were an array of characters does not work with Unicode. This is not specific to UTF-8: one cannot assume that a character has a fixed size in Unicode. If you want to iterate over a UTF-8 string, there are basically two ways:

  • iterate over the bytes - useful for searching a substring or when looking only at the ASCII characters of the UTF-8 string, for example when parsing XML files.
  • iterate over the characters - useful for graphical components like SynEdit, for example when you want to know the third printed character on the screen.
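The byte-wise approach can be sketched as follows. CountAsciiChar is a hypothetical helper written for this illustration, not a lazutf8 routine. It relies on the fact that every byte of a multi-byte UTF-8 sequence has its high bit set, so such a byte can never compare equal to an ASCII character.

```pascal
program CountAsciiDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: counts occurrences of an ASCII character by
// walking the bytes of a UTF-8 string. All bytes of multi-byte
// sequences are >= #128, so they are never mistaken for c.
function CountAsciiChar(const AnUTF8String: string; c: Char): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(AnUTF8String) do
    if AnUTF8String[i] = c then
      Inc(Result);
end;

begin
  // "<café><b>", with U+00E9 written as the UTF-8 bytes C3 A9
  WriteLn(CountAsciiChar('<caf'#$C3#$A9'><b>', '<')); // prints 2
end.
```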

Searching a substring

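Due to the design of UTF-8 you can simply use the normal string functions for searching a substring. Even though UTF-8 is a multi-byte encoding, the first byte of a character can never be confused with a following byte, because lead bytes and continuation bytes use disjoint value ranges. So searching for a valid UTF-8 string with Pos always returns a position that lies on a character boundary. Note that Pos returns a byte position, not a character position: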

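uses lazutf8; // LCLProc for Lazarus 0.9.30 or lower
...
procedure Where(const SearchFor, aText: string);
var
  BytePos: LongInt;
  CharacterPos: LongInt;
begin
  // Pos works on bytes and returns 0 if the substring is not found
  BytePos:=Pos(SearchFor,aText);
  if BytePos=0 then
  begin
    writeln('The substring "',SearchFor,'" is not in the text "',aText,'"');
    exit;
  end;
  // Count the characters before the match, then add 1 for a 1-based position
  CharacterPos:=UTF8Length(PChar(aText),BytePos-1)+1;
  writeln('The substring "',SearchFor,'" is in the text "',aText,'"',
    ' at byte position ',BytePos,' and at character position ',CharacterPos);
end;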

Because Unicode allows more than one representation of the same text, Pos() (just like any comparison) might show unexpected behavior, e.g. when one of the strings contains decomposed characters (a base letter followed by a combining mark) while the other uses the precomposed code point for the same letter. This is not handled automatically by the RTL.
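A minimal sketch of that pitfall, using only the RTL. The constant names and the sample word are made up for this example, and the bytes are written as explicit escapes so the source file encoding does not matter.

```pascal
program NormalizationDemo;
{$mode objfpc}{$H+}
const
  // "é" as one precomposed code point, U+00E9 (UTF-8: C3 A9)
  Precomposed = #$C3#$A9;
  // "é" as 'e' followed by U+0301 COMBINING ACUTE ACCENT (UTF-8: CC 81)
  Decomposed = 'e'#$CC#$81;
var
  Haystack: string;
begin
  // "café" stored in decomposed form
  Haystack := 'caf' + Decomposed;
  // Both forms look the same on screen, but the bytes differ
  WriteLn(Pos(Precomposed, Haystack)); // prints 0: not found
  WriteLn(Pos(Decomposed, Haystack));  // prints 4: found at byte 4
end.
```

If both representations can occur in the input, the strings have to be brought into the same normalization form before they are compared or searched.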

Accessing UTF8 characters

UTF-8 characters vary in length, so the best way to access them in the order in which they appear is to iterate through the string, advancing by the length of each character. For iterating through the characters use this code:

uses lazutf8; // LCLProc for Lazarus 0.9.30 or lower
...
procedure DoSomethingWithString(const AnUTF8String: string);
var
  p, pEnd: PChar;
  CharLen: integer;
  FirstByte, SecondByte, ThirdByte: Char;
begin
  p:=PChar(AnUTF8String);
  pEnd:=p+Length(AnUTF8String);
  // A while loop with an end pointer also handles the empty string
  // and does not treat the terminating #0 as a character
  while p<pEnd do
  begin
    CharLen := UTF8CharacterLength(p);
    if CharLen=0 then break; // defensive: never loop without advancing

    // Here you have a pointer to the char and its length
    // You can access the bytes of the UTF-8 Char like this:
    if CharLen >= 1 then FirstByte := P[0];
    if CharLen >= 2 then SecondByte := P[1];
    if CharLen >= 3 then ThirdByte := P[2];

    inc(p,CharLen);
  end;
end;

Accessing the Nth UTF8 character

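Besides iterating, one might also want random access to the UTF-8 characters. UTF8Copy takes a 1-based character index. Because characters vary in length, it has to scan the string from the start to find the Nth character, so prefer the iteration above when processing a whole string.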

uses lazutf8; // LCLProc for Lazarus 0.9.30 or lower
...
function NthUTF8Char(const AnUTF8String: string; N: Integer): string;
begin
  // N is a 1-based character index, not a byte index
  Result := UTF8Copy(AnUTF8String, N, 1);
end;
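To show why the character-based routine matters here, the sketch below compares the byte-based Copy with UTF8Copy on a made-up sample word. It assumes the LazUtils package is available and prints only lengths, so the output does not depend on the console encoding.

```pascal
program NthCharDemo;
{$mode objfpc}{$H+}
uses
  LazUTF8; // part of the LazUtils package

var
  s: string;
begin
  // "naïve", with U+00EF written as the UTF-8 bytes C3 AF
  s := 'na'#$C3#$AF've';
  WriteLn(Length(s));                 // 6 bytes
  WriteLn(UTF8Length(s));             // 5 characters
  // Copy counts bytes: it returns only the first byte of the third character
  WriteLn(Length(Copy(s, 3, 1)));     // 1
  // UTF8Copy counts characters: it returns both bytes of the third character
  WriteLn(Length(UTF8Copy(s, 3, 1))); // 2
end.
```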

Showing character codepoints with UTF8CharacterToUnicode

The following demonstrates how to show the code point value of each character in a UTF-8 string. The value is returned as a 32-bit Cardinal:

uses lazutf8; // LCLProc for Lazarus 0.9.30 or lower
...
procedure IterateUTF8Characters(const AnUTF8String: string);
var
  p, pEnd: PChar;
  unicode: Cardinal;
  CharLen: integer;
begin
  p:=PChar(AnUTF8String);
  pEnd:=p+Length(AnUTF8String);
  // Stop at the end of the string, so the terminating #0 is not printed
  while p<pEnd do
  begin
    unicode:=UTF8CharacterToUnicode(p,CharLen);
    if CharLen=0 then break; // defensive: never loop without advancing
    writeln('Unicode=',unicode);
    inc(p,CharLen);
  end;
end;