Revision as of 15:01, 24 January 2012


Introduction

As of version 0.9.25, Lazarus has full Unicode support on all platforms except Gtk 1. This page contains instructions for Lazarus users, roadmaps, descriptions of basic concepts, and implementation details.

Instructions for users

Although Lazarus has Unicode widgetsets, it is important to note that not everything is Unicode. It is the developer's responsibility to know the encoding of their strings and to perform the appropriate conversion between libraries that expect different encodings.

Usually the encoding is defined per library (a dynamic library (DLL) or a Lazarus package). Each library uniformly expects one kind of encoding, which is usually either Unicode (UTF-8 for Lazarus) or ANSI (which means the system encoding, and may or may not be UTF-8).

The RTL and the FCL of FPC 2.4 expect ANSI strings. So does FPC 2.5.x at present.

You can convert between Unicode and ANSI using the UTF8ToAnsi and AnsiToUTF8 functions from the System unit, or the UTF8ToSys and SysToUTF8 functions from the FileUtil unit. The latter two are smarter but add more code to your program.

FPC is not Unicode aware

The Free Pascal Runtime Library (RTL) and the Free Pascal Component Library (FCL) in current FPC versions (up to 2.5.x) are ANSI, so you need to convert strings coming from Unicode libraries or going to Unicode libraries (such as the LCL).

Converting between ANSI and Unicode

Examples:

Say you get a string from a TEdit and you want to pass it to an RTL file routine:

<delphi>var
  MyString: string; // utf-8 encoded
begin
  MyString := MyTEdit.Text;
  SomeRTLRoutine(UTF8ToAnsi(MyString));
end;</delphi>

And in the other direction:

<delphi>var
  MyString: string; // ansi encoded
begin
  MyString := SomeRTLRoutine;
  MyTEdit.Text := AnsiToUTF8(MyString);
end;</delphi>

Important: UTF8ToAnsi returns an empty string if the UTF-8 string contains invalid characters.

Important: AnsiToUTF8 and UTF8ToAnsi require a widestring manager under Linux, BSD and Mac OS X. You can use the SysToUTF8 and UTF8ToSys functions (unit FileUtil) or add the widestring manager by adding cwstring as one of the first units to your program's uses section.
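
The second option can be sketched as follows. This is only an illustration under the assumption of a plain console program: the program name and the sample text are made up, and only cwstring, UTF8ToAnsi and AnsiToUTF8 come from the text above. SysUtils is listed last simply so that the uses clause stays valid on platforms where cwstring is not compiled in.

<delphi>program CwstringDemo; // hypothetical program name
{$mode objfpc}{$H+}
uses
  {$IFDEF UNIX}
  cwstring, // installs the widestring manager; list it as early as possible
  {$ENDIF}
  SysUtils;
var
  Utf8Text, AnsiText: string;
begin
  Utf8Text := #$C3#$9C'ber';           // 'Über' written as explicit UTF-8 bytes
  AnsiText := UTF8ToAnsi(Utf8Text);    // UTF-8 -> system encoding
  Utf8Text := AnsiToUTF8(AnsiText);    // system encoding -> UTF-8
  WriteLn('Bytes after round trip: ', Length(Utf8Text));
end.</delphi>

If the system encoding cannot represent a character, the round trip may not restore the original bytes, so the printed length depends on the environment.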

Widestrings and Ansistrings

When passing Ansistrings to Widestrings you have to convert the encoding.

<Delphi>var
  w: widestring;
begin
  w:='Über'; // wrong, because FPC will convert system codepage to UTF16
  w:=UTF8ToUTF16('Über'); // correct
  Button1.Caption:=UTF16ToUTF8(w);
end;</Delphi>

Dealing with UTF8 strings and characters

Until Lazarus 0.9.30 the UTF-8 handling routines were in the LCL in the unit LCLProc. In Lazarus 0.9.31+ the routines in LCLProc are still available for backwards compatibility but the real code to deal with UTF-8 is located in the lazutils package in the unit lazutf8.

To operate on UTF-8 strings, prefer the routines from the unit lazutf8 over those from the Free Pascal SysUtils unit, because SysUtils is not yet prepared to deal with Unicode, while lazutf8 is. Simply substitute the routines from SysUtils with their lazutf8 equivalents, which always have the same name except for an added "UTF8" prefix.
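
As a small illustration of that substitution, the following sketch compares byte-oriented RTL routines with their UTF8-prefixed counterparts. It assumes a project that has the LazUtils package available (use LCLProc instead of lazutf8 before Lazarus 0.9.31); the program name and sample string are made up for the example.

<delphi>program Utf8RoutinesDemo; // hypothetical program name
{$mode objfpc}{$H+}
uses
  SysUtils, lazutf8; // lazutf8 comes from the LazUtils package
var
  s: string;
begin
  s := #$C3#$BC'ber';                  // 'über' written as explicit UTF-8 bytes
  WriteLn(Length(s));                  // counts bytes: 5
  WriteLn(UTF8Length(s));              // counts characters: 4
  WriteLn(Length(Copy(s, 1, 1)));      // 1: only the first byte of the 'ü'
  WriteLn(Length(UTF8Copy(s, 1, 1)));  // 2: the complete two-byte 'ü'
end.</delphi>

The plain Copy call cuts the first character in half, which is exactly the kind of error the UTF8 variants avoid.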

Also note that simply iterating over chars as if the string were an array of characters does not work in Unicode. This is not specific to UTF-8: in Unicode one simply cannot assume that a character has a fixed size. If you want to iterate over the characters of a UTF-8 string, there are basically two ways:

iterate over the bytes - useful for searching a substring or when looking only at the ASCII characters of the UTF8 string. For example when parsing xml files.
iterate over the characters - useful for graphical components like synedit. For example when you want to know the third printed character on the screen.

Searching a substring

Due to the special nature of UTF8 you can simply use the normal string functions:

<Delphi>uses lazutf8; // LCLProc for Lazarus 0.9.30 or inferior
...
procedure Where(SearchFor, aText: string);
var
  BytePos: LongInt;
  CharacterPos: LongInt;
begin
  BytePos:=Pos(SearchFor,aText);
  CharacterPos:=UTF8Length(PChar(aText),BytePos-1);
  writeln('The substring "',SearchFor,'" is in the text "',aText,'"',
    ' at byte position ',BytePos,' and at character position ',CharacterPos);
end;</Delphi>

Accessing UTF8 characters

UTF-8 characters vary in length, so when you need to visit the characters in the order in which they appear, the best approach is to iterate over them. For iterating through the characters use this code:

<Delphi>uses lazutf8; // LCLProc for Lazarus 0.9.30 or inferior
...
procedure DoSomethingWithString(AnUTF8String: string);
var
  p: PChar;
  CharLen: integer;
  FirstByte, SecondByte, ThirdByte: Char;
begin
  p:=PChar(AnUTF8String);
  repeat
    CharLen := UTF8CharacterLength(p);

    // here you have a pointer to the char and its length
    // You can access the bytes of the UTF-8 Char like this:
    if CharLen >= 1 then FirstByte := P[0];
    if CharLen >= 2 then SecondByte := P[1];
    if CharLen >= 3 then ThirdByte := P[2];

    inc(p,CharLen);
  until (CharLen=0) or (p^ = #0);
end;</Delphi>

Accessing the Nth UTF8 character

Besides iterating, you may also want random access to UTF-8 characters.

<Delphi>uses lazutf8; // LCLProc for Lazarus 0.9.30 or inferior
...
var
  AnUTF8String, NthChar: string;
  N: Integer; // 1-based character index
begin
  NthChar := UTF8Copy(AnUTF8String, N, 1);
end;</Delphi>

Iterating over codepoints using UTF8CharacterToUnicode

The following demonstrates how to iterate over the 32-bit code point values of the characters in a UTF-8 string:

<Delphi>uses lazutf8; // LCLProc for Lazarus 0.9.30 or inferior
...
procedure IterateUTF8Characters(const AnUTF8String: string);
var
  p: PChar;
  unicode: Cardinal;
  CharLen: integer;
begin
  p:=PChar(AnUTF8String);
  repeat
    unicode:=UTF8CharacterToUnicode(p,CharLen);
    writeln('Unicode=',unicode);
    inc(p,CharLen);
  until (CharLen=0) or (unicode=0);
end;</Delphi>

UTF-8 String Copy, Length, LowerCase, etc

Nearly all operations which one might want to perform on UTF-8 strings are covered by the routines in the unit lazutf8 (unit LCLProc for Lazarus 0.9.30 or earlier). See the following list of routines taken from lazutf8.pas:

<Delphi>
function UTF8CharacterLength(p: PChar): integer;
function UTF8Length(const s: string): PtrInt;
function UTF8Length(p: PChar; ByteCount: PtrInt): PtrInt;
function UTF8CharacterToUnicode(p: PChar; out CharLen: integer): Cardinal;
function UnicodeToUTF8(u: cardinal; Buf: PChar): integer; inline;
function UnicodeToUTF8SkipErrors(u: cardinal; Buf: PChar): integer;
function UnicodeToUTF8(u: cardinal): shortstring; inline;
function UTF8ToDoubleByteString(const s: string): string;
function UTF8ToDoubleByte(UTF8Str: PChar; Len: PtrInt; DBStr: PByte): PtrInt;
function UTF8FindNearestCharStart(UTF8Str: PChar; Len: integer;
                                  BytePos: integer): integer;
// find the n-th UTF8 character, ignoring BIDI
function UTF8CharStart(UTF8Str: PChar; Len, CharIndex: PtrInt): PChar;
// find the byte index of the n-th UTF8 character, ignoring BIDI (byte len of substr)
function UTF8CharToByteIndex(UTF8Str: PChar; Len, CharIndex: PtrInt): PtrInt;
procedure UTF8FixBroken(P: PChar);
function UTF8CharacterStrictLength(P: PChar): integer;
function UTF8CStringToUTF8String(SourceStart: PChar; SourceLen: PtrInt) : string;
function UTF8Pos(const SearchForText, SearchInText: string): PtrInt;
function UTF8Copy(const s: string; StartCharIndex, CharCount: PtrInt): string;
procedure UTF8Delete(var s: String; StartCharIndex, CharCount: PtrInt);
procedure UTF8Insert(const source: String; var s: string; StartCharIndex: PtrInt);

function UTF8LowerCase(const AInStr: string; ALanguage: string=''): string;
function UTF8UpperCase(const AInStr: string; ALanguage: string=''): string;
function FindInvalidUTF8Character(p: PChar; Count: PtrInt;
                                  StopOnNonASCII: Boolean = false): PtrInt;
function ValidUTF8String(const s: String): String;

procedure AssignUTF8ListToAnsi(UTF8List, AnsiList: TStrings);

//compare functions
function UTF8CompareStr(const S1, S2: string): Integer;
function UTF8CompareText(const S1, S2: string): Integer;
</Delphi>

Dealing with directory and filenames

Lazarus controls and functions expect filenames and directory names in utf-8 encoding, but the RTL uses ansi strings for directories and filenames.

For example, consider a button, which sets the Directory property of the TFileListBox to the current directory. The RTL Function GetCurrentDir is ansi, and not unicode, so conversion is needed:

<delphi>procedure TForm1.Button1Click(Sender: TObject);
begin
  FileListBox1.Directory:=SysToUTF8(GetCurrentDir);
  // or use the functions from the FileUtil unit
  FileListBox1.Directory:=GetCurrentDirUTF8;
end;</delphi>

The unit FileUtil defines common file functions with UTF-8 strings:

<Delphi>// basic functions similar to the RTL but working with UTF-8 instead of the
// system encoding

// AnsiToUTF8 and UTF8ToAnsi need a widestring manager under Linux, BSD, Mac OS X
// but normally these OS use UTF-8 as system encoding so the widestringmanager
// is not needed.
function NeedRTLAnsi: boolean;// true if system encoding is not UTF-8
procedure SetNeedRTLAnsi(NewValue: boolean);
function UTF8ToSys(const s: string): string;// as UTF8ToAnsi but more independent of widestringmanager
function SysToUTF8(const s: string): string;// as AnsiToUTF8 but more independent of widestringmanager

// file operations
function FileExistsUTF8(const Filename: string): boolean;
function FileAgeUTF8(const FileName: string): Longint;
function DirectoryExistsUTF8(const Directory: string): Boolean;
function ExpandFileNameUTF8(const FileName: string): string;
function ExpandUNCFileNameUTF8(const FileName: string): string;
{$IFNDEF VER2_2_0}
function ExtractShortPathNameUTF8(Const FileName : String) : String;
{$ENDIF}
function FindFirstUTF8(const Path: string; Attr: Longint; out Rslt: TSearchRec): Longint;
function FindNextUTF8(var Rslt: TSearchRec): Longint;
procedure FindCloseUTF8(var F: TSearchrec);
function FileSetDateUTF8(const FileName: String; Age: Longint): Longint;
function FileGetAttrUTF8(const FileName: String): Longint;
function FileSetAttrUTF8(const Filename: String; Attr: longint): Longint;
function DeleteFileUTF8(const FileName: String): Boolean;
function RenameFileUTF8(const OldName, NewName: String): Boolean;
function FileSearchUTF8(const Name, DirList : String): String;
function FileIsReadOnlyUTF8(const FileName: String): Boolean;
function GetCurrentDirUTF8: String;
function SetCurrentDirUTF8(const NewDir: String): Boolean;
function CreateDirUTF8(const NewDir: String): Boolean;
function RemoveDirUTF8(const Dir: String): Boolean;
function ForceDirectoriesUTF8(const Dir: string): Boolean;

// environment
function ParamStrUTF8(Param: Integer): string;
function GetEnvironmentStringUTF8(Index : Integer): String;
function GetEnvironmentVariableUTF8(const EnvVar: String): String;
function GetAppConfigDirUTF8(Global: Boolean): string;</Delphi>

Mac OS X

The file functions of the FileUtil unit also take care of a Mac OS X peculiarity: OS X normalizes filenames. For example the filename 'ä.txt' can be encoded in Unicode with two different sequences (#$C3#$A4 and 'a'#$CC#$88). Under Linux and BSD you can create a filename with either encoding. OS X automatically converts the a umlaut to the three-byte sequence. This means:

<Delphi>if Filename1 = Filename2 then ... // is not sufficient under OS X
if AnsiCompareFileName(Filename1, Filename2) = 0 then ... // not sufficient under fpc 2.2.2, not even with cwstring
if CompareFilenames(Filename1, Filename2) = 0 then ... // this always works (unit FileUtil or FileProcs)</Delphi>
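
To see why a plain string comparison fails, the two encodings of 'ä.txt' mentioned above can be compared byte by byte. This is a minimal sketch that uses only the RTL and does not touch the file system; the program and variable names are made up for the example.

<delphi>program NormalizationDemo; // hypothetical program name
{$mode objfpc}{$H+}
var
  Composed, Decomposed: string;
begin
  Composed   := #$C3#$A4'.txt';     // U+00E4 as one code point (2 bytes)
  Decomposed := 'a'#$CC#$88'.txt';  // 'a' + combining diaeresis U+0308 (3 bytes)
  WriteLn(Length(Composed));        // 6 bytes
  WriteLn(Length(Decomposed));      // 7 bytes
  WriteLn(Composed = Decomposed);   // FALSE: the byte sequences differ
end.</delphi>

Both strings display as the same filename, yet they differ in length and content, which is why a normalization-aware comparison is needed.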

East Asian languages on Windows

The default font (Tahoma) for user interface controls under Windows XP can correctly display several languages, including Arabic, Russian and Western languages, but not East Asian languages such as Chinese, Japanese and Korean. By going to the Control Panel, choosing Regional Settings, clicking on the Languages tab and installing the East Asia Language Pack, the standard user interface font will start showing those languages correctly. Windows XP versions localized for those languages already have this language pack installed. Extended instructions here.

Free Pascal Particularities

UTF8 and source files - the missing BOM

When you create source files with Lazarus and type some non-ASCII characters, the file is saved in UTF-8. It does not use a BOM (Byte Order Mark). You can change the encoding via right click on source editor / File Settings / Encoding. The reason for the missing BOM is how FPC treats Ansistrings. For compatibility the LCL uses Ansistrings and for portability the LCL uses UTF-8.

Note: Some MS Windows text editors might treat the files as system codepage and show them as invalid characters. Do not add the BOM. If you add the BOM you have to change all string assignments.

For example:

<Delphi>Button1.Caption := 'Über';</Delphi>

When no BOM is given (and no codepage parameter was passed) the compiler treats the string as system encoding and copies each byte unconverted to the string. This is how the LCL expects strings.

<Delphi>// source file saved as UTF-8 without BOM
if FileExists('Über.txt') then ; // wrong, because FileExists expects system encoding
if FileExistsUTF8('Über.txt') then ; // correct</Delphi>

Unicode essentials

The Unicode standard maps integers from 0 to 10FFFF(h) to characters. Each such integer is called a code point. In other words, Unicode characters are in principle defined for code points from U+000000 to U+10FFFF (0 to 1 114 111).

There are three schemes for representing Unicode code points as unique byte sequences. These schemes are called Unicode transformation formats: UTF-8, UTF-16 and UTF-32. The conversions between all of them are possible. Here are their basic properties:

                           UTF-8 UTF-16 UTF-32
Smallest code point [hex] 000000 000000 000000
Largest code point  [hex] 10FFFF 10FFFF 10FFFF
Code unit size [bits]          8     16     32
Minimal bytes/character        1      2      4
Maximal bytes/character        4      4      4

UTF-8 has several important and useful properties:

It is interpreted as a sequence of bytes, so the concept of low- and high-order byte does not exist.
Unicode characters U+0000 to U+007F (ASCII) are encoded simply as bytes 00h to 7Fh (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
All characters above U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set.
No byte sequence of one character is contained within a longer byte sequence of another character. This allows easy search for substrings.
The first byte of a multibyte sequence that represents a non-ASCII character is always in the range C2h to F4h and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 80h to BFh. This allows easy resynchronization and robustness.
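
These byte ranges can be checked directly. The following is a minimal sketch, using only the RTL, that classifies each byte of a sample UTF-8 string as ASCII, lead byte or continuation byte; the program name and the sample string are made up for the example.

<delphi>program Utf8BytesDemo; // hypothetical program name
{$mode objfpc}{$H+}
var
  s: string;
  i: Integer;
  b: Byte;
begin
  s := 'a'#$C3#$A4#$E2#$82#$AC; // 'a', U+00E4 and U+20AC as explicit UTF-8 bytes
  for i := 1 to Length(s) do
  begin
    b := Ord(s[i]);
    if b < $80 then
      WriteLn(HexStr(b, 2), ' ASCII')
    else if (b and $C0) = $80 then
      WriteLn(HexStr(b, 2), ' continuation byte') // 80h..BFh
    else
      WriteLn(HexStr(b, 2), ' lead byte');        // starts a multibyte sequence
  end;
end.</delphi>

Because a continuation byte can never be mistaken for a lead byte or an ASCII byte, a reader that starts in the middle of a string only has to skip continuation bytes to find the next character boundary.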

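UTF-16 has the following most important properties: It uses a single 16-bit word to encode characters from U+0000 to U+FFFF (the range U+D800 to U+DFFF is reserved for surrogates and does not encode characters), and a pair of 16-bit words (a surrogate pair) to encode any of the remaining Unicode characters, U+10000 to U+10FFFF.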
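
How a code point above U+FFFF is split into two 16-bit words can be sketched with a few lines of arithmetic. The example below uses only the RTL; the program name, variable names and the sample code point are chosen for illustration.

<delphi>program SurrogateDemo; // hypothetical program name
{$mode objfpc}{$H+}
var
  CodePoint, Offset: Cardinal;
  HighSurrogate, LowSurrogate: Word;
begin
  CodePoint := $1F600;                              // example code point above U+FFFF
  Offset := CodePoint - $10000;                     // leaves a 20-bit value
  HighSurrogate := Word($D800 + (Offset shr 10));   // upper 10 bits
  LowSurrogate  := Word($DC00 + (Offset and $3FF)); // lower 10 bits
  WriteLn(HexStr(HighSurrogate, 4), ' ', HexStr(LowSurrogate, 4)); // D83D DE00
end.</delphi>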

Finally, any Unicode character can be represented as a single 32-bit unit in UTF-32.

For more, see: Unicode FAQ - Basic questions, Unicode FAQ - UTF-8, UTF-16, UTF-32 & BOM, Wikipedia: UTF-8 [1]

Implementation Details

Since the gtk1 interface was obsoleted in Lazarus 0.9.31, all LCL interfaces are Unicode capable, and Lazarus and the LCL use and accept only UTF-8 encoded strings, except in routines explicitly marked as accepting other encodings.

Unicode-enabling the win32 interface

Guidelines

First, and most importantly, all Unicode patches for the Win32 interface must be enclosed by IFDEF WindowsUnicodeSupport, to avoid breaking the existing ANSI interface. After this stabilizes, all ifdefs will be removed and only the Unicode part will remain. At that point all existing programs that use ANSI characters will need migration to Unicode.

Windows platforms <=Win9x are based on ISO code page standards and only partially support Unicode. Windows platforms starting with WinNT and Windows CE fully support Unicode. Win 9x and NT offer two parallel sets of API functions: the old ANSI enabled *A and the new, Unicode enabled *W. *W functions accept wide strings, i.e. UTF-16 encoded strings, as parameters. Windows CE only uses Wide API functions.

Wide functions present on Windows 9x

Some Wide API functions are present on Windows 9x. Here is a list of such functions: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/mslu/winprog/other_existing_unicode_support.asp

Conversion example:

<delphi>GetTextExtentPoint32(hdcNewBitmap, LPSTR(ButtonCaption), Length(ButtonCaption), TextSize);</delphi>

Becomes:

<delphi>{$ifdef WindowsUnicodeSupport}
  // WideCaption is assumed to be declared as a WideString
  WideCaption := Utf8Decode(ButtonCaption);
  GetTextExtentPoint32W(hdcNewBitmap, PWideChar(WideCaption), Length(WideCaption), TextSize);
{$else}
  GetTextExtentPoint32(hdcNewBitmap, LPSTR(ButtonCaption), Length(ButtonCaption), TextSize);
{$endif}</delphi>

Functions that need Ansi and Wide versions

First Conversion example:

<delphi>function TGDIWindow.GetTitle: String;
var
  l: Integer;
begin
  l := Windows.GetWindowTextLength(Handle);
  SetLength(Result, l);
  Windows.GetWindowText(Handle, @Result[1], l);
end;</delphi>

Becomes:

<delphi>function TGDIWindow.GetTitle: String;
var
  l: Integer;
  AnsiBuffer: string;
  WideBuffer: WideString;
begin
{$ifdef WindowsUnicodeSupport}
  if UnicodeEnabledOS then
  begin
    l := Windows.GetWindowTextLengthW(Handle);
    SetLength(WideBuffer, l);
    l := Windows.GetWindowTextW(Handle, @WideBuffer[1], l);
    SetLength(WideBuffer, l);
    Result := Utf8Encode(WideBuffer);
  end
  else
  begin
    l := Windows.GetWindowTextLength(Handle);
    SetLength(AnsiBuffer, l);
    l := Windows.GetWindowText(Handle, @AnsiBuffer[1], l);
    SetLength(AnsiBuffer, l);
    Result := AnsiToUtf8(AnsiBuffer);
  end;
{$else}
  l := Windows.GetWindowTextLength(Handle);
  SetLength(Result, l);
  Windows.GetWindowText(Handle, @Result[1], l);
{$endif}
end;</delphi>

Difference between revisions of "LCL Unicode Support/fr"



@@ Line 3: / Line 3: @@
 == Introduction ==
-Unicode support in Lazarus requires further development, mainly with regard to the Windows platform. Here is some basic information for those who would like to develop Unicode support in Lazarus further. Please correct, extend and update this page.
+As of version 0.9.25, Lazarus has full Unicode support on all platforms except Gtk 1. This page contains instructions for Lazarus users, roadmaps, descriptions of basic concepts, and implementation details.
-It helps if you have already heard of the Unicode standard and perhaps have some experience with WideStrings in Delphi. Previous experience with non-(Western) Latin scripts and their various character sets also helps.
+== Instructions for users ==
-Note: The implementation details are still being discussed, and the contents of this document may change.
+Although Lazarus has Unicode widgetsets, it is important to note that not everything is Unicode. It is the developer's responsibility to know the encoding of their strings and to perform the appropriate conversion between libraries that expect different encodings.
-== Implementation guidelines ==
+Usually the encoding is defined per library (a dynamic library (DLL) or a Lazarus package). Each library uniformly expects one kind of encoding, which is usually either Unicode (UTF-8 for Lazarus) or ANSI (which means the system encoding, and may or may not be UTF-8).
-=== Requirements ===
+The RTL and the FCL of FPC 2.4 expect ANSI strings. So does FPC 2.5.x at present.
-The spirit of Lazarus is: "Write once, compile everywhere."
+You can convert between Unicode and ANSI using the '''UTF8ToAnsi''' and '''AnsiToUTF8''' functions from the System unit, or the '''UTF8ToSys''' and '''SysToUTF8''' functions from the FileUtil unit. The latter two are smarter but add more code to your program.
-This means that, ideally, a Unicode-enabled application
-should have only one Unicode supporting source code version,
-without any conditional defines in respect to various target
-platforms.
-The "interface" part of the LCL should support Unicode for
+===FPC is not Unicode aware ===
-the target platforms which support it themselves, concealing
+The Free Pascal Runtime Library (RTL) and the Free Pascal Component Library (FCL) in current FPC versions (up to 2.5.x) are ANSI, so you need to convert strings coming from Unicode libraries or going to Unicode libraries (such as the LCL).
-at the same time all peculiarities from the application
-programmer.
-What concerns Lazarus, the internal string communication at
+===Converting between ANSI and Unicode===
-the boundaries "Application code <--> LCL", as well as "LCL
+Examples:
-<--> Widgetsets" is based on the classical (byte oriented)
-strings. Logically, their contents should be encoded according
-to the [[UTF-8]].
-=== Migrating to Unicode ===
+Say you get a string from a TEdit and you want to pass it to an RTL file routine:
-Most existing Lazarus use Ansi encodings, because that´s the default for Gtk1 and win32 interfaces today. This will change in the future and all widgetsets will support UTF-8, so all applications that pass strings directly to the interface (be written on code or on the object inspector) will need to be converted to utf-8.
+<delphi>var
+  MyString: string; // utf-8 encoded
+begin
+  MyString := MyTEdit.Text;
+  SomeRTLRoutine(UTF8ToAnsi(MyString));
+end;</delphi>
-When people develop software for non-fully working widgetsets like Gtk 2, Qt, WinCE (and futurely Win32U), they use the IDE compiled for more stable widgetsets, like Gtk and win32. To avoid inconsistencies (like passing iso characters for a utf-8 widgetset), it´s necessary to use an IDE working on the same encoding as the target widgetset. This means that we will need stable UTF-8 IDE before completing the migration to Unicode.
+And in the other direction:
+<delphi>var
+  MyString: string; // ansi encoded
+begin
+  MyString := SomeRTLRoutine;
+  MyTEdit.Text := AnsiToUTF8(MyString);
+end;</delphi>
-Currently we have various groups of widgetsets, according to the encoding:
+'''Important''': UTF8ToAnsi returns an empty string if the UTF-8 string contains invalid characters.
-*Interfaces that use ANSI encoding: win32 and gtk (1) interfaces.
+'''Important''': AnsiToUTF8 and UTF8ToAnsi require a widestring manager under Linux, BSD and Mac OS X. You can use the SysToUTF8 and UTF8ToSys functions (unit FileUtil) or add the widestring manager by adding cwstring as one of the first units to your program's uses section.
-*Interfaces that use UTF-8 encoding: gtk (1), gtk2, qt, fpGUI, carbon
+===Widestrings and Ansistrings===
-*Interfaces that currently use ANSI encoding, but need migration to UTF-8: win32, wince
+When passing Ansistrings to Widestrings you have to convert the encoding.
+<Delphi>var
+  w: widestring;
+begin
+  w:='Über'; // wrong, because FPC will convert system codepage to UTF16
+  w:=UTF8ToUTF16('Über'); // correct
+  Button1.Caption:=UTF16ToUTF8(w);
+end;</Delphi>
-Notice that gtk 1 is on both ANSI and UTF-8 groups. That´s because the encoding is controlled by an environment variable on Gtk 1.
+===Dealing with UTF8 strings and characters===
-As Lazarus is today, existing software will work, if recompiled for win32, wince or gtk interfaces, but will face encoding issues compiling for other widgetset. And new software, using UTF-8 will work when recompiled for any of the widgetsets on the Unicode group.
+Until Lazarus 0.9.30 the UTF-8 handling routines were in the LCL in the unit LCLProc. In Lazarus 0.9.31+ the routines in LCLProc are still available for backwards compatibility but the real code to deal with UTF-8 is located in the lazutils package in the unit lazutf8.
-One very important note is that you must use the IDE compiled for the same group you are targeting. This is because the IDE uses the encoding of the widgetset it was compiled to, and not the one of the target widgetset to write LFM and LRS files.
+To execute operations over UTF-8 strings one should prefer to use routines from the unit lazutf8 instead of routines from the SysUtils routine from Free Pascal, because SysUtils is not yet prepared to deal with Unicode, while lazutf8 is. Simply substitute the routines from SysUtils with their lazutf8 equivalent, which always has the same name except for an added "UTF8" prefix.
-== Roadmap ==
+Also note that simply iterating over chars as if the string was an array does not work in Unicode. This is not something specific to UTF-8 and one simply cannot suppose that a character will have a fixed size in Unicode. If you want to iterate over the characters of an UTF-8 string, there are basically two ways:
-Now that we have guidelines, it´s time to create a roadmap and put it into practice. For this, the following realistic plan was created. Our plan splits tasks in 2 groups. One for Primary tasks and another for Secondary tasks.
+*iterate over the bytes - useful for searching a substring or when looking only at the ASCII characters of the UTF8 string. For example when parsing xml files.
+*iterate over the characters - useful for graphical components like synedit. For example when you want to know the third printed character on the screen.
-All primary tasks must be fully implemented before we can say Lazarus is Unicode enabled, and as such will be the main attention of our effort.
+====Searching a substring====
-Secondary tasks are desirable, but won´t be implemented unless someone volunteers for them, or posts a bounty for them.
+Due to the special nature of UTF8 you can simply use the normal string functions:
+<Delphi>uses lazutf8; // LCLProc for Lazarus 0.9.30 or inferior
+...
+procedure Where(SearchFor, aText: string);
+var
+  BytePos: LongInt;
+  CharacterPos: LongInt;
+begin
+  BytePos:=Pos(SearchFor,aText);
+  CharacterPos:=UTF8Length(PChar(aText),BytePos-1);
+  writeln('The substring "',SearchFor,'" is in the text "',aText,'"',
+    ' at byte position ',BytePos,' and at character position ',CharacterPos);
+end;</Delphi>
-=== Primary Tasks ===
+====Accessing UTF8 characters====
+Unicode characters can vary in length, so the best solution for accessing them is to use an iteration when one intends to access the characters in the sequence in which they are. For iterating through the characters use this code:
-'''Make Win32 Widgetset support UTF-8'''
+<Delphi>uses lazutf8; // LCLProc for Lazarus 0.9.30 or inferior
+...
+procedure DoSomethingWithString(AnUTF8String: string);
+var
+  p: PChar;
+  CharLen: integer;
+  FirstByte, SecondByte, ThirdByte: Char;
+begin
+  p:=PChar(AnUTF8String);
+  repeat
+    CharLen := UTF8CharacterLength(p);
-Notes: On this step we will target all 32-bits Windows versions at the same time. All code produced on this effort will be isolated from the current win32 interface by IFDEFs, to avoid introducing bugs on this primary interface. After the transition time, the IFDEFs will be removed and only the Unicode code will remain.
+    // here you have a pointer to the char and it's length
+    // You can access the bytes of the UTF-8 Char like this:
+    if CharLen >= 1 then FirstByte := P[0];
+    if CharLen >= 2 then SecondByte := P[1];
+    if CharLen >= 3 then ThirdByte := P[2];
-Status: Partially implemented
+    inc(p,CharLen);
+  until (CharLen=0) or (p^ = #0);
+end;</Delphi>
+====Accessing the Nth UTF8 character====
-'''Update Gtk 2 keyboard functions so they work with UTF-8'''
+Besides iterating one might also want to execute a random access to UTF-8 Characters.
-Notes:
+<Delphi>uses lazutf8; // LCLProc for Lazarus 0.9.30 or inferior
+...
+var
+  AnUTF8String, NthChar: string;
+begin
+  NthChar := UTF8Copy(AnUTF8String, N, 1);
+</Delphi>
+====Iterating over codepoints using UTF8CharacterToUnicode====
-Status: Almost complete. Some pre-editing features of the gtk2 are not yet supported in custom controls. I don't know, which language needs them.
+The following demonstrates how to iterate the 32bit code point value of each character in an UTF8 string:
+<Delphi>uses lazutf8; // LCLProc for Lazarus 0.9.30 or inferior
+...
+procedure IterateUTF8Characters(const AnUTF8String: string);
+var
+  p: PChar;
+  unicode: Cardinal;
+  CharLen: integer;
+begin
+  p:=PChar(AnUTF8String);
+  repeat
+    unicode:=UTF8CharacterToUnicode(p,CharLen);
+    writeln('Unicode=',unicode);
+    inc(p,CharLen);
+  until (CharLen=0) or (unicode=0);
+end;</Delphi>
-'''Make sure the Lazarus IDE runs correctly with Win32 Unicode widgetset and supports UTF-8'''
+====UTF-8 String Copy, Length, LowerCase, etc====
-Notes:
+Nearly all operations which one might want to execute with UTF-8 strings are covered by the routines in the unit lazutf8 (unit LCLProc for Lazarus 0.9.30 or inferior). See the following list of routines take from lazutf8.pas:
-Status: Complete. Except for the character map, which still shows only 255 characters. But all modern OS provide nice unicode character maps anyway.
+<Delphi>
+function UTF8CharacterLength(p: PChar): integer;
+function UTF8Length(const s: string): PtrInt;
+function UTF8Length(p: PChar; ByteCount: PtrInt): PtrInt;
+function UTF8CharacterToUnicode(p: PChar; out CharLen: integer): Cardinal;
+function UnicodeToUTF8(u: cardinal; Buf: PChar): integer; inline;
+function UnicodeToUTF8SkipErrors(u: cardinal; Buf: PChar): integer;
+function UnicodeToUTF8(u: cardinal): shortstring; inline;
+function UTF8ToDoubleByteString(const s: string): string;
+function UTF8ToDoubleByte(UTF8Str: PChar; Len: PtrInt; DBStr: PByte): PtrInt;
+function UTF8FindNearestCharStart(UTF8Str: PChar; Len: integer;
+                                  BytePos: integer): integer;
+// find the n-th UTF8 character, ignoring BIDI
+function UTF8CharStart(UTF8Str: PChar; Len, CharIndex: PtrInt): PChar;
+// find the byte index of the n-th UTF8 character, ignoring BIDI (byte len of substr)
+function UTF8CharToByteIndex(UTF8Str: PChar; Len, CharIndex: PtrInt): PtrInt;
+procedure UTF8FixBroken(P: PChar);
+function UTF8CharacterStrictLength(P: PChar): integer;
+function UTF8CStringToUTF8String(SourceStart: PChar; SourceLen: PtrInt) : string;
+function UTF8Pos(const SearchForText, SearchInText: string): PtrInt;
+function UTF8Copy(const s: string; StartCharIndex, CharCount: PtrInt): string;
+procedure UTF8Delete(var s: String; StartCharIndex, CharCount: PtrInt);
+procedure UTF8Insert(const source: String; var s: string; StartCharIndex: PtrInt);
+function UTF8LowerCase(const AInStr: string; ALanguage: string=''): string;
+function UTF8UpperCase(const AInStr: string; ALanguage: string=''): string;
+function FindInvalidUTF8Character(p: PChar; Count: PtrInt;
+                                  StopOnNonASCII: Boolean = false): PtrInt;
+function ValidUTF8String(const s: String): String;
-'''Make sure the Lazarus IDE runs correctly with Gtk 2 widgetset and supports UTF-8'''
+procedure AssignUTF8ListToAnsi(UTF8List, AnsiList: TStrings);
-Notes:
+//compare functions
-Status: Complete. There are gtk2 intf bugs, but they have nothing to do with utf-8.
+function UTF8CompareStr(const S1, S2: string): Integer;
+function UTF8CompareText(const S1, S2: string): Integer;
+</Delphi>
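+As a rough illustration of how a few of these routines differ from their byte-based RTL counterparts, the following sketch compares Length with UTF8Length and then uses UTF8Pos and UTF8Copy. The program name and the sample string are assumptions made for the example; the string is written with byte escapes so that the result does not depend on the encoding of the source file:
+<delphi>program UTF8RoutinesDemo; // hypothetical demo program
+{$mode objfpc}{$H+}
+uses
+  lazutf8; // LCLProc for Lazarus 0.9.30 or earlier
+var
+  s: string;
+begin
+  s:=#$C3#$9C'ber'; // 'Über' in UTF-8: U+00DC takes two bytes
+  writeln('Bytes=',Length(s));                        // byte count: 5
+  writeln('Characters=',UTF8Length(s));               // code point count: 4
+  writeln('Pos=',UTF8Pos('ber',s));                   // character index, not byte index: 2
+  writeln('FirstCharBytes=',Length(UTF8Copy(s,1,1))); // the first character occupies 2 bytes
+end.</delphi>
+The point of the sketch is that indices passed to and returned by the UTF8 routines count characters, while Length, Pos and Copy from the RTL count bytes, so the two families should not be mixed on the same index.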
-=== Secondary Tasks ===
+===Dealing with directory and filenames===
+Lazarus controls and functions expect file names and directory names in UTF-8 encoding, but the RTL uses ANSI strings for directories and file names.
-'''Update Windows CE widgetset so it uses UTF-8'''
+For example, consider a button that sets the Directory property of a TFileListBox to the current directory. The RTL function [[doc:rtl/sysutils/getcurrentdir.html|GetCurrentDir]] returns an ANSI string, not a UTF-8 string, so a conversion is needed:
-Notes: String conversion routines are concentrated on the winceproc.pp file. Many tests are needed.
+<delphi>procedure TForm1.Button1Click(Sender: TObject);
+begin
+  FileListBox1.Directory:=SysToUTF8(GetCurrentDir);
+  // or use the functions from the FileUtil unit
+  FileListBox1.Directory:=GetCurrentDirUTF8;
+end;</delphi>
-Status: Not implemented
+The unit FileUtil defines common file functions with UTF-8 strings:
+<Delphi>// basic functions similar to the RTL but working with UTF-8 instead of the
+// system encoding
-'''Update Gtk 1 keyboard functions so they work with UTF-8'''
+// AnsiToUTF8 and UTF8ToAnsi need a widestring manager under Linux, BSD, Mac OS X
+// but normally these OS use UTF-8 as system encoding so the widestringmanager
+// is not needed.
+function NeedRTLAnsi: boolean;// true if system encoding is not UTF-8
+procedure SetNeedRTLAnsi(NewValue: boolean);
+function UTF8ToSys(const s: string): string;// as UTF8ToAnsi but more independent of widestringmanager
+function SysToUTF8(const s: string): string;// as AnsiToUTF8 but more independent of widestringmanager
-Notes:
+// file operations
+function FileExistsUTF8(const Filename: string): boolean;
+function FileAgeUTF8(const FileName: string): Longint;
+function DirectoryExistsUTF8(const Directory: string): Boolean;
+function ExpandFileNameUTF8(const FileName: string): string;
+function ExpandUNCFileNameUTF8(const FileName: string): string;
+{$IFNDEF VER2_2_0}
+function ExtractShortPathNameUTF8(Const FileName : String) : String;
+{$ENDIF}
+function FindFirstUTF8(const Path: string; Attr: Longint; out Rslt: TSearchRec): Longint;
+function FindNextUTF8(var Rslt: TSearchRec): Longint;
+procedure FindCloseUTF8(var F: TSearchrec);
+function FileSetDateUTF8(const FileName: String; Age: Longint): Longint;
+function FileGetAttrUTF8(const FileName: String): Longint;
+function FileSetAttrUTF8(const Filename: String; Attr: longint): Longint;
+function DeleteFileUTF8(const FileName: String): Boolean;
+function RenameFileUTF8(const OldName, NewName: String): Boolean;
+function FileSearchUTF8(const Name, DirList : String): String;
+function FileIsReadOnlyUTF8(const FileName: String): Boolean;
+function GetCurrentDirUTF8: String;
+function SetCurrentDirUTF8(const NewDir: String): Boolean;
+function CreateDirUTF8(const NewDir: String): Boolean;
+function RemoveDirUTF8(const Dir: String): Boolean;
+function ForceDirectoriesUTF8(const Dir: string): Boolean;
-Status: Not implemented
+// environment
+function ParamStrUTF8(Param: Integer): string;
+function GetEnvironmentStringUTF8(Index : Integer): String;
+function GetEnvironmentVariableUTF8(const EnvVar: String): String;
+function GetAppConfigDirUTF8(Global: Boolean): string;</Delphi>
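+A minimal sketch of how the search functions listed above might be combined to collect the entries of a directory. The program name and the helper ListDirUTF8 are hypothetical, and the sketch assumes that the names returned in the search record are already UTF-8, as the UTF8 suffix of the functions suggests:
+<delphi>program ListDirDemo; // hypothetical demo program
+{$mode objfpc}{$H+}
+uses
+  Classes, SysUtils, FileUtil;
+// Hypothetical helper: adds the entries of DirUTF8 (except '.' and '..') to Names
+procedure ListDirUTF8(const DirUTF8: string; Names: TStrings);
+var
+  Info: TSearchRec;
+begin
+  // the path delimiter is plain ASCII, so this RTL routine is safe on a UTF-8 path
+  if FindFirstUTF8(IncludeTrailingPathDelimiter(DirUTF8)+'*',faAnyFile,Info)=0 then
+    repeat
+      if (Info.Name<>'.') and (Info.Name<>'..') then
+        Names.Add(Info.Name);
+    until FindNextUTF8(Info)<>0;
+  FindCloseUTF8(Info); // always release the search record
+end;
+var
+  Names: TStringList;
+  i: Integer;
+begin
+  Names:=TStringList.Create;
+  try
+    ListDirUTF8(GetCurrentDirUTF8,Names);
+    for i:=0 to Names.Count-1 do
+      writeln(Names[i]); // the console may not display non-ASCII names correctly
+  finally
+    Names.Free;
+  end;
+end.</delphi>
+Because the strings stay in UTF-8 from the search record to the list, they can be assigned to LCL controls without any further conversion.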
+====Mac OS X====
-'''Complete RTL in synedit'''
+The file functions of the FileUtil unit also take care of a Mac OS X peculiarity: OS X normalizes filenames. For example, the 'ä' in the filename 'ä.txt' can be encoded in Unicode with two different sequences: the precomposed character (#$C3#$A4) or the letter 'a' followed by a combining diaeresis ('a'#$CC#$88). Under Linux and BSD you can create filenames with either encoding. OS X automatically converts the a-umlaut to the three-byte sequence. This means:
+<Delphi>if Filename1 = Filename2 then ... // is not sufficient under OS X
+if AnsiCompareFileName(Filename1, Filename2) = 0 then ... // not sufficient under FPC 2.2.2, not even with cwstring
+if CompareFilenames(Filename1, Filename2) = 0 then ... // this always works (unit FileUtil or FileProcs)</Delphi>
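+To make the two encodings concrete, the following sketch builds both forms of 'ä.txt' from their bytes and compares them with plain string operations. The program and variable names are made up for the example:
+<delphi>program NormalizationDemo; // hypothetical demo program
+{$mode objfpc}{$H+}
+var
+  Composed, Decomposed: string;
+begin
+  Composed:=#$C3#$A4'.txt';      // U+00E4 as a single code point (2 bytes)
+  Decomposed:='a'#$CC#$88'.txt'; // 'a' plus combining diaeresis U+0308 (1+2 bytes)
+  writeln(Length(Composed));     // 6 bytes
+  writeln(Length(Decomposed));   // 7 bytes
+  writeln(Composed=Decomposed);  // FALSE: a byte-wise comparison sees two different names
+end.</delphi>
+Both strings display as the same filename, which is why a comparison that is aware of the normalization is needed on OS X.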
+===East Asian languages on Windows===
+The default font (Tahoma) for user interface controls under Windows XP can correctly display several languages, including Arabic, Russian and Western languages, but not East Asian languages such as Chinese, Japanese and Korean. To fix this, go to the Control Panel, choose Regional Settings, click on the Languages tab and install the East Asia Language Pack; the standard user interface font will then display those languages correctly. Windows XP versions localized for those languages already have this language pack installed. Extended instructions are available [http://newton.uor.edu/Departments&Programs/AsianStudiesDept/Language/asianlanguageinstallation_XP.html here].
+== Free Pascal Particularities ==
+===UTF8 and source files - the missing BOM===
+When you create source files with Lazarus and type some non-ASCII characters, the file is saved as UTF-8. It does '''not use a BOM''' (Byte Order Mark). You can change the encoding by right-clicking in the source editor and choosing File Settings / Encoding. The reason for the missing BOM is how FPC treats AnsiStrings. For compatibility the LCL uses AnsiStrings, and for portability the LCL uses UTF-8.
+Note: Some MS Windows text editors might treat the files as being in the system codepage and show invalid characters. Do not add the BOM. If you add the BOM, you have to change all string assignments.
+For example:
+<Delphi>Button1.Caption := 'Über';</Delphi>
+When no BOM is given (and no codepage parameter was passed) the compiler treats the string as system encoding and copies each byte unconverted to the string. This is how the LCL expects strings.
+<Delphi>// source file saved as UTF-8 without BOM
+if FileExists('Über.txt') then ; // wrong, because FileExists expects system encoding
+if FileExistsUTF8('Über.txt') then ; // correct</Delphi>
-Notes: RTL means right to left as used for example by arabic
-Status: Not implemented.
 == Unicode essentials ==
@@ Line 158: / Line 316: @@
 [http://en.wikipedia.org/wiki/ISO-8859]
-== Lazarus component library architecture essentials ==
+= Implementation Details  =
-The LCL consists of two parts:
+Since the gtk1 interface was declared obsolete in Lazarus 0.9.31, all LCL interfaces are Unicode-capable, and Lazarus and the LCL use and accept only UTF-8 encoded strings, except in routines explicitly marked as accepting other encodings.
-# A target platform independent part, which implements a class hierarchy analogous to Delphi VCL;
-# "Interfaces" - a part that implements the interface to APIs of each target platform.
-The communication between the two parts is done by an abstract class TWidgetset. Each widgetset is implemented by its own class derived from TWidgetset.
-The GTK 1 widgetset is the oldest. In this widgetset the string encoding is determined by the LANG environment variable, which usually selects one of the ISO-8859-n group of single-byte encodings. Recently (as of Mandriva 2007, for example), many distributions have been shipping Gtk 1 configured for UTF-8. Our Gtk 1 interface lacks proper support for UTF-8 in the keyboard handling routines, so this is a big problem that increases the need for Lazarus to implement cross-platform Unicode support.
-Gtk2 widgetset only works with UTF-8 encoding and supports UTF-8 completely.
-The win32 interface is setup with ansi widgets and UTF-8 support is started, but not yet complete and therefore disabled by default. So it is currently not possible to use Unicode with win32.
-Qt interface is prepared for UTF-8. Qt itself uses UTF-16 as native encoding, but the lazarus interface for Qt converts from UTF-8 to UTF-16.
-Windows CE only supports UTF-16 as character encoding, but our interface for it currently converts strings from ISO to UTF-16 before calling the Windows API. This is very easy to fix, as all conversion code is concentrated in a few routines in the winceproc.pp file.
-For more, see: [[LCL Internals#Internals of the LCL|Internals of the LCL]]
 == Unicode-enabling the win32 interface  ==
-=== Compiling LCL-Win32 with Unicode ===
-To enable unicode on LCL for Windows go to the menu "Tools" --> "Configure Build Lazarus"
-Put -dWindowsUnicodeSupport on the "Options" field. Select all targets to NONE, and only LCL to Clean+Build. Select win32 as target widgetset. Click on "Build".
-Now you can recompile your existing applications and they will have Unicode mode enabled. Note that at the moment only a few parts of the software will be really unicode enabled and you may find bugs on those parts.
 === Guidelines ===
@@ Line 203: / Line 337: @@
 Conversion example:
-<pre>
+<delphi>GetTextExtentPoint32(hdcNewBitmap, LPSTR(ButtonCaption),
-  GetTextExtentPoint32(hdcNewBitmap, LPSTR(ButtonCaption),
+Length(ButtonCaption), TextSize);</delphi>
-Length(ButtonCaption), TextSize);
-</pre>
 Becomes:
-<pre>
+<delphi>{$ifdef WindowsUnicodeSupport}
-  {$ifdef WindowsUnicodeSupport}
+  GetTextExtentPoint32W(hdcNewBitmap, PWideChar(Utf8Decode(ButtonCaption)), Length(WideCaption), TextSize);
-    GetTextExtentPoint32W(hdcNewBitmap, PWideChar(Utf8Decode(ButtonCaption)), Length(WideCaption), TextSize);
+{$else}
-  {$else}
+  GetTextExtentPoint32(hdcNewBitmap, LPSTR(ButtonCaption), Length(ButtonCaption), TextSize);
-    GetTextExtentPoint32(hdcNewBitmap, LPSTR(ButtonCaption), Length(ButtonCaption), TextSize);
+{$endif}</delphi>
-  {$endif}
-</pre>
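+Note that the converted snippet passes Length(WideCaption) without showing where WideCaption comes from. One possible way to write the Unicode branch, sketched here with a hypothetical helper name (MeasureCaptionW) and assuming that the Windows unit declares GetTextExtentPoint32W with the same parameter style as the call above, is to decode the caption once into a local WideString and use that variable for both the pointer and the length:
+<delphi>uses Windows;
+...
+// Hypothetical helper (Windows only): measures a UTF-8 caption with the wide API
+function MeasureCaptionW(DC: HDC; const ButtonCaption: string;
+                         out TextSize: TSize): Boolean;
+var
+  WideCaption: WideString;
+begin
+  WideCaption:=Utf8Decode(ButtonCaption); // UTF-8 -> UTF-16, done only once
+  // the count is in UTF-16 code units, so it must come from the decoded string
+  Result:=GetTextExtentPoint32W(DC,PWideChar(WideCaption),
+                                Length(WideCaption),TextSize);
+end;</delphi>
+Taking the length from the decoded string matters because the byte length of the UTF-8 caption and the number of UTF-16 code units differ as soon as the caption contains non-ASCII characters.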
 ====Functions that need Ansi and Wide versions====
@@ Line 222: / Line 352: @@
 First Conversion example:
-<pre>
+<delphi>function TGDIWindow.GetTitle: String;
-function TGDIWindow.GetTitle: String;
 var
   l: Integer;
@@ Line 230: / Line 359: @@
     SetLength(Result, l);
     Windows.GetWindowText(Handle, @Result[1], l);
-end;
+end;</delphi>
-</pre>
 Becomes:
-<pre>
+<delphi>function TGDIWindow.GetTitle: String;
-function TGDIWindow.GetTitle: String;
 var
- l: Integer;
+  l: Integer;
- AnsiBuffer: string;
+  AnsiBuffer: string;
- WideBuffer: WideString;
+  WideBuffer: WideString;
 begin
 {$ifdef WindowsUnicodeSupport}
- if UnicodeEnabledOS then
+if UnicodeEnabledOS then
- begin
+begin
-   l := Windows.GetWindowTextLengthW(Handle);
+  l := Windows.GetWindowTextLengthW(Handle);
-   SetLength(WideBuffer, l);
+  SetLength(WideBuffer, l);
-   l := Windows.GetWindowTextW(Handle, @WideBuffer[1], l);
+  l := Windows.GetWindowTextW(Handle, @WideBuffer[1], l);
-   SetLength(WideBuffer, l);
+  SetLength(WideBuffer, l);
-   Result := Utf8Encode(WideBuffer);
+  Result := Utf8Encode(WideBuffer);
- end
+end
- else
+else
- begin
+begin
-   l := Windows.GetWindowTextLength(Handle);
+  l := Windows.GetWindowTextLength(Handle);
-   SetLength(AnsiBuffer, l);
+  SetLength(AnsiBuffer, l);
-   l := Windows.GetWindowText(Handle, @AnsiBuffer[1], l);
+  l := Windows.GetWindowText(Handle, @AnsiBuffer[1], l);
-   SetLength(AnsiBuffer, l);
+  SetLength(AnsiBuffer, l);
-   Result := AnsiToUtf8(AnsiBuffer);
+  Result := AnsiToUtf8(AnsiBuffer);
- end;
+end;
 {$else}
@@ Line 270: / Line 397: @@
 {$endif}
-end;
+end;</delphi>
-</pre>
-=== Roadmap ===
-What should already be working with Unicode:
-* TForm, TButton, TLabel
-* Most controls
-* Menus
-* LCLIntf.ExtTextOut and most other text related winapis
-* TStrings based controls. Examples: TComboBox, TListBox, etc
-Known problems with Unicode support:
-* SynEdit does not support RTL (right to left)
-* MessageBox buttons don't show unicode correctly when they are translated. Tested on the IDE. Could be a problem on the IDE however.
-List of units to be checked:
-*"win32callback.inc"
-*"win32def.pp"
-*"win32int.pp"
-*"win32lclintf.inc"
-*"win32lclintfh.inc"
-*"win32listsl.inc"
-*"win32listslh.inc"
-*"win32memostrings.inc"
-*"win32object.inc"
-*"win32proc.pp"
-*"win32winapi.inc"
-*"win32winapih.inc"
-*"win32wsactnlist.pp"
-*"win32wsarrow.pp"
-*"win32wsbuttons.pp"
-*"win32wscalendar.pp"
-*"win32wschecklst.pp"
-*"win32wsclistbox.pp"
-*"win32wscomctrls.pp"
-*"win32wscontrols.pp"
-*"win32wscustomlistview.inc"
-*"win32wsdbctrls.pp"
-*"win32wsdbgrids.pp"
-*"win32wsdialogs.pp"
-*<s>"win32wsdirsel.pp"</s> - Felipe
-*<s>"win32wseditbtn.pp"</s> - Felipe
-*<s>"win32wsextctrls.pp"</s> - Felipe
-*<s>"win32wsextdlgs.pp"</s> - Felipe
-*<s>"win32wsfilectrl.pp"</s> - Felipe
-*<s>"win32wsforms.pp"</s> - Felipe
-*<s>"win32wsgrids.pp"</s> - Felipe
-*<s>"win32wsimglist.pp"</s> - Felipe
-*<s>"win32wsmaskedit.pp"</s> - Felipe
-*<s>"win32wsmenus.pp"</s> - Felipe
-*<s>"win32wspairsplitter.pp"</s> - Felipe
-*<s>"win32wsspin.pp"</s> - Felipe
-*<s>"win32wsstdctrls.pp"</s> - Felipe
-*<s>"win32wstoolwin.pp"</s> - Felipe
-*<s>"winext.pas"</s> - Felipe
 === Screenshots ===
@@ Line 334: / Line 405: @@
 [[Image:Lazarus Unicode Test.png]]
-== See Also ==
+= See Also =
 * [[UTF-8]] - Description of UTF-8 strings
+[[Category:LCL]]