Difference between revisions of "LCL Unicode Support"

m (→‎Migration to Unicode: - added internal wiki links.)
(set up MW-redirect)
Tag: New redirect
 
(87 intermediate revisions by 15 users not shown)
{{LCL Unicode Support}}
#REDIRECT [[Unicode Support in Lazarus]]
 
 
== Introduction ==
 
 
 
As of 0.9.25, Lazarus has full Unicode support on all platforms except Gtk 1. This page contains instructions for Lazarus users, roadmaps, descriptions of basic concepts and implementation details.
 
 
 
== Instructions for users ==
 
 
 
In Unicode widgetsets it is important to note that not everything is Unicode. The Free Pascal Runtime Library (RTL) and the Free Pascal FCL are ANSI. It is the developer's responsibility to know the encoding of their strings and to convert properly between libraries that expect different encodings.

Usually the encoding is per-library. Each library uniformly expects one kind of encoding, usually either Unicode (UTF-8 for Lazarus) or ANSI (which means the system encoding, which may or may not be UTF-8). The RTL and the FCL of FPC 2.2.2 expect ANSI strings, and so does FPC 2.3.x currently.

You can convert between Unicode and ANSI using the UTF8ToAnsi and AnsiToUTF8 functions from the System unit or the UTF8ToSys and SysToUTF8 functions from the FileUtil unit. The latter two are smarter but pull more code into your program.
 
 
 
Examples:
 
 
 
Say you get a string from a TEdit and you want to pass it to some RTL file routine:

<delphi>
var
  MyString: string; // utf-8 encoded
begin
  MyString := MyTEdit.Text;
  SomeRTLRoutine(UTF8ToAnsi(MyString));
end;
</delphi>
 
 
 
And for the opposite direction:
 
 
 
<delphi>
var
  MyString: string; // ansi encoded
begin
  MyString := SomeRTLRoutine;
  MyTEdit.Text := AnsiToUTF8(MyString);
end;
</delphi>
 
 
 
'''Important''': UTF8ToAnsi will return an empty string if the UTF8 string contains invalid characters.
 
 
 
'''Important''': AnsiToUTF8 and UTF8ToAnsi require a widestring manager under Linux, BSD and Mac OS X. You can use the SysToUTF8 and UTF8ToSys functions (unit FileUtil) instead, or add the widestring manager by adding cwstring as one of the first units in your program's uses section.
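
The second option can be sketched as a small console program. This is only an illustration: the program name and the sample text are made up, and the result depends on the system encoding, so no output is shown.

<delphi>
program ConvertDemo; // hypothetical name, for illustration only
{$mode objfpc}{$H+}
uses
  {$IFDEF UNIX}
  cwstring, // installs the widestring manager; keep it among the first units
  {$ENDIF}
  SysUtils;
var
  Utf8Text, AnsiText: string;
begin
  Utf8Text := 'Gr'#$C3#$BC#$C3#$9F'e'; // "Grüße" written as UTF-8 bytes
  AnsiText := UTF8ToAnsi(Utf8Text);
  if AnsiText = '' then
    writeln('Conversion failed (invalid UTF-8?)')
  else
    writeln('Converted to ', Length(AnsiText), ' bytes in the system encoding');
end.
</delphi>

The check for an empty result follows the note above about invalid characters.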
 
 
 
 
 
===Dealing with UTF8 strings and characters===
 
 
 
If you want to iterate over the characters of a UTF8 string, there are basically two ways:

*iterate over the bytes - useful for searching for a substring or when looking only at the ASCII characters of the UTF8 string, for example when parsing XML files.
*iterate over the characters - useful for graphical components like SynEdit, for example when you want to know the third printed character on the screen.
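
The byte-wise approach can be sketched as follows. Every byte of a multibyte UTF8 character is 80h or above, so comparing single bytes against an ASCII character never gives a false match. The function name CountAsciiChar is made up for this illustration.

<delphi>
program CountDemo; // hypothetical name, for illustration only
{$mode objfpc}{$H+}

// Counts how often the ASCII character c occurs in a UTF8 string.
// c must be in the range #0..#127.
function CountAsciiChar(const UTF8Text: string; c: Char): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(UTF8Text) do
    if UTF8Text[i] = c then
      inc(Result);
end;

begin
  // '<p>ä</p>' with the a umlaut written as UTF-8 bytes
  writeln(CountAsciiChar('<p>'#$C3#$A4'</p>', '<')); // prints 2
end.
</delphi>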
 
 
 
====Searching a substring====
 
 
 
Due to the special nature of UTF8 you can simply use the normal string functions:
 
 
 
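<Delphi>
procedure Where(SearchFor, aText: string);
var
  BytePos: LongInt;
  CharacterPos: LongInt;
begin
  BytePos:=Pos(SearchFor,aText); // 1-based byte index, 0 if not found
  if BytePos<1 then begin
    writeln('The substring "',SearchFor,'" is not in the text "',aText,'"');
    exit;
  end;
  // count the characters before the match, then add 1 for a 1-based position
  CharacterPos:=UTF8Length(PChar(aText),BytePos-1)+1;
  writeln('The substring "',SearchFor,'" is in the text "',aText,'"',
    ' at byte position ',BytePos,' and at character position ',CharacterPos);
end;
</Delphi>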
 
 
 
 
 
====Accessing an UTF8 string as array====
 
 
 
An array consists of elements of the same size, but a UTF8 character can occupy 1 to 4 bytes. So you have to convert the UTF8 string either to an array of unicode values (longwords) or to an array of PChar with a pointer to each character.
 
 
 
====Iterating over characters using UTF8CharacterToUnicode====
 
 
 
The following demonstrates how to iterate over the 32-bit unicode value of each character in a UTF8 string:

<Delphi>
uses LCLProc;
...
procedure IterateUTF8Characters(const AnUTF8String: string);
var
  p: PChar;
  unicode: Cardinal;
  CharLen: integer;
begin
  p:=PChar(AnUTF8String);
  repeat
    unicode:=UTF8CharacterToUnicode(p,CharLen);
    writeln('Unicode=',unicode);
    inc(p,CharLen);
  until (CharLen=0) or (unicode=0);
end;
</Delphi>
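
The same loop can fill the array of unicode values mentioned in the previous section. This is a sketch only: the type TUnicodeArray and the function UTF8ToUnicodeArray are made-up names, the input is assumed to be valid UTF8, and UTF8Length and UTF8CharacterToUnicode are taken from LCLProc as in the examples on this page.

<delphi>
uses LCLProc;

type
  TUnicodeArray = array of Cardinal; // hypothetical helper type

// Returns one array element per character (not per byte).
// Assumes AnUTF8String contains valid UTF8.
function UTF8ToUnicodeArray(const AnUTF8String: string): TUnicodeArray;
var
  p: PChar;
  CharLen: integer;
  i: Integer;
begin
  SetLength(Result, UTF8Length(AnUTF8String));
  p := PChar(AnUTF8String);
  for i := 0 to High(Result) do
  begin
    Result[i] := UTF8CharacterToUnicode(p, CharLen);
    inc(p, CharLen); // advance by the byte length of this character
  end;
end;
</delphi>

After the conversion, the third character is simply element 2 of the array, at the cost of four bytes per character.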
 
 
 
===Dealing with directory and filenames===
 
 
 
Lazarus controls and functions expect filenames and directory names in utf-8 encoding, but the RTL uses ansi strings for directories and filenames.
 
 
 
For example, consider a button, which sets the Directory property of the TFileListBox to the current directory. The RTL Function [[doc:rtl/sysutils/getcurrentdir.html|GetCurrentDir]] is ansi, and not unicode, so conversion is needed:
 
 
 
<delphi>
procedure TForm1.Button1Click(Sender: TObject);
begin
  FileListBox1.Directory:=SysToUTF8(GetCurrentDir);
  // or use the functions from the FileUtil unit
  FileListBox1.Directory:=GetCurrentDirUTF8;
end;
</delphi>
 
 
 
The unit FileUtil defines common file functions with UTF-8 strings:
 
 
 
<Delphi>
// basic functions similar to the RTL but working with UTF-8 instead of the
// system encoding

// AnsiToUTF8 and UTF8ToAnsi need a widestring manager under Linux, BSD, Mac OS X
// but normally these OS use UTF-8 as system encoding so the widestringmanager
// is not needed.
function NeedRTLAnsi: boolean;// true if system encoding is not UTF-8
procedure SetNeedRTLAnsi(NewValue: boolean);
function UTF8ToSys(const s: string): string;// as UTF8ToAnsi but more independent of widestringmanager
function SysToUTF8(const s: string): string;// as AnsiToUTF8 but more independent of widestringmanager

// file operations
function FileExistsUTF8(const Filename: string): boolean;
function FileAgeUTF8(const FileName: string): Longint;
function DirectoryExistsUTF8(const Directory: string): Boolean;
function ExpandFileNameUTF8(const FileName: string): string;
function ExpandUNCFileNameUTF8(const FileName: string): string;
{$IFNDEF VER2_2_0}
function ExtractShortPathNameUTF8(Const FileName : String) : String;
{$ENDIF}
function FindFirstUTF8(const Path: string; Attr: Longint; out Rslt: TSearchRec): Longint;
function FindNextUTF8(var Rslt: TSearchRec): Longint;
procedure FindCloseUTF8(var F: TSearchrec);
function FileSetDateUTF8(const FileName: String; Age: Longint): Longint;
function FileGetAttrUTF8(const FileName: String): Longint;
function FileSetAttrUTF8(const Filename: String; Attr: longint): Longint;
function DeleteFileUTF8(const FileName: String): Boolean;
function RenameFileUTF8(const OldName, NewName: String): Boolean;
function FileSearchUTF8(const Name, DirList : String): String;
function FileIsReadOnlyUTF8(const FileName: String): Boolean;
function GetCurrentDirUTF8: String;
function SetCurrentDirUTF8(const NewDir: String): Boolean;
function CreateDirUTF8(const NewDir: String): Boolean;
function RemoveDirUTF8(const Dir: String): Boolean;
function ForceDirectoriesUTF8(const Dir: string): Boolean;

// environment
function ParamStrUTF8(Param: Integer): string;
function GetEnvironmentStringUTF8(Index : Integer): String;
function GetEnvironmentVariableUTF8(const EnvVar: String): String;
function GetAppConfigDirUTF8(Global: Boolean): string;
</Delphi>
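
A short sketch of how these functions might be used with a UTF-8 string, for example one taken from a TEdit. The function name DescribePath and its return values are made up for this illustration; only FileExistsUTF8 and DirectoryExistsUTF8 come from the list above.

<delphi>
uses FileUtil;

// Classifies a UTF-8 encoded path without converting it to the
// system encoding by hand. DescribePath is a hypothetical helper.
function DescribePath(const UTF8Path: string): string;
begin
  if DirectoryExistsUTF8(UTF8Path) then
    Result := 'directory'
  else if FileExistsUTF8(UTF8Path) then
    Result := 'file'
  else
    Result := 'missing';
end;
</delphi>

With the plain RTL the same check would need an explicit conversion, for example FileExists(UTF8ToSys(UTF8Path)).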
 
 
 
====Mac OS X====
 
 
 
The file functions of the FileUtil unit also take care of a Mac OS X peculiarity: OS X normalizes filenames. For example, the filename 'ä.txt' can be encoded in Unicode with two different sequences (#$C3#$A4 and 'a'#$CC#$88). Under Linux and BSD you can create a filename with either encoding. OS X automatically converts the 'ä' to the three-byte sequence. This means:
 
 
 
<Delphi>
if Filename1=Filename2 then ... // is not sufficient under OS X
if AnsiCompareFileName(Filename1,Filename2)=0 then ... // not sufficient under fpc 2.2.2, even not with cwstring
if CompareFilenames(Filename1,Filename2)=0 then ... // this always works (unit FileUtil or FileProcs)
</Delphi>
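
To see why a plain comparison fails, the two byte sequences from the example above can be compared directly. This is a small stand-alone sketch; the program and constant names are made up.

<delphi>
program NormalizationDemo; // hypothetical name, for illustration only
{$mode objfpc}{$H+}
const
  Composed   = #$C3#$A4'.txt';    // U+00E4, 2 bytes for the a umlaut
  Decomposed = 'a'#$CC#$88'.txt'; // U+0061 U+0308, 3 bytes for the a umlaut
begin
  writeln(Length(Composed));      // 6 bytes
  writeln(Length(Decomposed));    // 7 bytes
  writeln(Composed = Decomposed); // FALSE: the bytes differ although both display as 'ä.txt'
end.
</delphi>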
 
 
 
===East Asian languages on Windows===
 
 
 
The default font (Tahoma) for user interface controls under Windows XP can correctly display several languages, including Arabic, Russian and Western languages, but not East Asian languages such as Chinese, Japanese and Korean. After going to the Control Panel, choosing Regional Settings, clicking on the Languages tab and installing the East Asia Language Pack, the standard user interface font will simply start showing those languages correctly. Windows XP versions localized for those languages already have this language pack installed. Extended instructions [http://newton.uor.edu/Departments&Programs/AsianStudiesDept/Language/asianlanguageinstallation_XP.html here].
 
 
 
== Implementation guidelines ==
 
 
 
=== Requirements ===
 
 
 
The spirit of Lazarus is: "Write once, compile everywhere." This means that, ideally, a Unicode-enabled application should have only one Unicode-supporting source code version, without any conditional defines for the various target platforms.

The "interface" part of the LCL should support Unicode on the target platforms which support it themselves, while concealing all platform peculiarities from the application programmer.

As far as Lazarus is concerned, the internal string communication at the boundaries "Application code <--> LCL" and "LCL <--> Widgetsets" is based on classical (byte-oriented) strings. Logically, their contents should be encoded as [[UTF-8]].
 
 
 
 
 
=== Migration to Unicode ===
 
 
 
Most existing Lazarus applications use ANSI encodings, because that was the default for the Gtk1 and win32 interfaces until 0.9.24. With 0.9.25 all widgetsets will use [[UTF-8]] by default (with the exception of gtk1, which only supports [[UTF-8]] when the system encoding is [[UTF-8]], and even then with some limitations). So all applications that pass strings directly to the interface (whether written in code or in the Object Inspector) will need to be converted to UTF-8.
 
 
 
Currently we have various groups of widgetsets, according to the encoding:
 
 
 
*Interfaces that use ANSI encoding: gtk (1) on ansi systems.
 
 
 
*Interfaces that use [[UTF-8]] encoding: gtk (1) on UTF-8 systems, gtk2, qt, [[fpGUI]], carbon, win32, wince, all others
 
 
 
Notice that gtk 1 is in both the ANSI and the UTF-8 group. That's because the encoding is controlled by an environment variable on Gtk 1.
 
 
 
The IDE was extended to load/save/edit files of different encodings (one encoding per file). It has a built in heuristic to determine the encoding and you can change the encoding of a file at any time (Source Editor / Popup Menu / File Settings/ Encoding). So the IDE can open old files and projects and can be used to convert the encoding.
 
 
 
== Roadmap ==
 
 
 
Now that we have guidelines, it's time to create a roadmap and put it into practice. For this, the following realistic plan was created. Our plan splits the tasks into two groups: one for primary tasks and another for secondary tasks.

All primary tasks must be fully implemented before we can say Lazarus is Unicode enabled, and as such they will be the main focus of our effort.

Secondary tasks are desirable, but won't be implemented unless someone volunteers for them, or posts a bounty for them.
 
 
 
 
 
=== Primary Tasks ===
 
 
 
 
 
'''Make Win32 Widgetset support UTF-8'''
 
 
 
Notes: In this step we will target all 32-bit Windows versions at the same time. All code produced in this effort will be isolated from the current win32 interface by IFDEFs, to avoid introducing bugs into this primary interface. After the transition time, the IFDEFs will be removed and only the Unicode code will remain.
 
 
 
Status: Fully implemented
 
 
 
 
 
'''Update Gtk 2 keyboard functions so they work with UTF-8'''
 
 
 
Notes:
 
 
 
Status: Almost complete. Some pre-editing features of gtk2 are not yet supported in custom controls. It is not known which languages need them.
 
 
 
 
 
'''Make sure the Lazarus IDE runs correctly with Win32 Unicode widgetset and supports UTF-8'''
 
 
 
Notes:
 
 
 
Status: Complete, except for the character map, which still shows only 255 characters. But all modern OS provide nice unicode character maps anyway.
 
 
 
 
 
'''Make sure the Lazarus IDE runs correctly with Gtk 2 widgetset and supports UTF-8'''
 
 
 
Notes:
 
 
 
Status: Complete. There are gtk2 intf bugs, but they have nothing to do with utf-8.
 
 
 
=== Secondary Tasks ===
 
 
 
 
 
'''Update Windows CE widgetset so it uses UTF-8'''
 
 
 
Notes: String conversion routines are concentrated on the winceproc.pp file. Many tests are needed.
 
 
 
Status: Completed
 
 
 
 
 
'''Update Gtk 1 keyboard functions so they work with UTF-8'''
 
 
 
Notes:
 
 
 
Status: Not implemented
 
 
 
 
 
'''Complete RTL in synedit'''
 
 
 
Notes: RTL means right to left, as used for example by Arabic.
 
 
 
Status: Not implemented.
 
 
 
== Unicode essentials ==
 
 
 
The Unicode standard maps integers from 0 to 10FFFF(h) to characters. Each such mapping is called a code point. In other words, Unicode characters are in principle defined for code points from U+000000 to U+10FFFF (0 to 1 114 111).
 
 
 
There are three schemes for representing Unicode code points as unique byte sequences. These schemes are called Unicode transformation formats: UTF-8, UTF-16 and UTF-32. The conversions between all of them are possible. Here are their basic properties:
 
 
 
<pre>
                             UTF-8  UTF-16  UTF-32
Smallest code point [hex]   000000  000000  000000
Largest code point  [hex]   10FFFF  10FFFF  10FFFF
Code unit size [bits]            8      16      32
Minimal bytes/character          1       2       4
Maximal bytes/character          4       4       4
</pre>
 
 
 
'''UTF-8''' has several important and useful properties: It is interpreted as a sequence of bytes, so that the concept of lo- and hi-order byte does not exist. Unicode characters U+0000 to U+007F (ASCII) are encoded simply as bytes 00h to 7Fh (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8. All characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. No byte sequence of one character is contained within a longer byte sequence of another character. This allows easy search for substrings. The first byte of a multibyte sequence that represents a non-ASCII character is always in the range C2h to F4h and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 80h to BFh. This allows easy resynchronization and robustness.
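
How the first byte determines the length of a sequence can be sketched in a few lines. The function name UTF8SequenceLength is made up for this illustration; the byte ranges are those of UTF-8 restricted to code points up to U+10FFFF.

<delphi>
program LeadByteDemo; // hypothetical name, for illustration only
{$mode objfpc}{$H+}

// Returns the number of bytes of the UTF-8 sequence that starts with
// FirstByte, or 0 if FirstByte cannot start a valid sequence.
function UTF8SequenceLength(FirstByte: Byte): Integer;
begin
  case FirstByte of
    $00..$7F: Result := 1; // ASCII
    $C2..$DF: Result := 2;
    $E0..$EF: Result := 3;
    $F0..$F4: Result := 4;
  else
    Result := 0; // continuation byte (80h..BFh) or a byte that never starts a sequence
  end;
end;

begin
  writeln(UTF8SequenceLength($41)); // prints 1 ('A')
  writeln(UTF8SequenceLength($C3)); // prints 2 (first byte of 'ä')
  writeln(UTF8SequenceLength($A4)); // prints 0 (second byte of 'ä')
end.
</delphi>

A reader that lands in the middle of a character sees a result of 0 and can skip forward byte by byte until a start byte is found, which is the resynchronization property described above.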
 
 
 
'''UTF-16''' has the following most important properties: It uses a single 16-bit word to encode characters from U+0000 to U+D7FF and from U+E000 to U+FFFF, and a pair of 16-bit words (a surrogate pair) to encode the remaining Unicode characters, U+10000 to U+10FFFF.
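
The pair of 16-bit words is computed from the code point with simple bit arithmetic. The sketch below uses made-up names (ToSurrogatePair and its parameters) and only handles code points from U+10000 to U+10FFFF.

<delphi>
program SurrogateDemo; // hypothetical name, for illustration only
{$mode objfpc}{$H+}
uses
  SysUtils; // for IntToHex

// Splits a code point in the range U+10000..U+10FFFF into two 16-bit words.
procedure ToSurrogatePair(CodePoint: Cardinal; out HighWord, LowWord: Word);
begin
  Dec(CodePoint, $10000);                          // leaves a 20-bit value
  HighWord := Word($D800 or (CodePoint shr 10));   // upper 10 bits
  LowWord  := Word($DC00 or (CodePoint and $3FF)); // lower 10 bits
end;

var
  h, l: Word;
begin
  ToSurrogatePair($1F600, h, l);
  writeln(IntToHex(h, 4), ' ', IntToHex(l, 4)); // prints D83D DE00
end.
</delphi>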
 
 
 
Finally, any Unicode character can be represented as a single 32-bit unit in '''UTF-32'''.
 
 
 
For more, see:
* [http://www.unicode.org/faq/basic_q.html Unicode FAQ - Basic questions]
* [http://www.unicode.org/faq/utf_bom.html Unicode FAQ - UTF-8, UTF-16, UTF-32 & BOM]
* [http://en.wikipedia.org/wiki/UTF-8 Wikipedia: UTF-8]
* [http://en.wikipedia.org/wiki/ISO-8859 Wikipedia: ISO-8859]
 
 
 
== Lazarus component library architecture essentials ==
 
 
 
The LCL consists of two parts:
 
# A target platform independent part, which implements a class hierarchy analogous to Delphi VCL;
 
# "Interfaces" - a part that implements the interface to APIs of each target platform.
 
 
 
The communication between the two parts is done by an abstract class TWidgetset. Each widgetset is implemented by its own class derived from TWidgetset.
 
 
 
The GTK 1 widgetset is the oldest. In this widgetset the string encoding is determined by the LANG environment variable, which usually selects one of the ISO-8859-n group of single-byte encodings. Recently (as of Mandriva 2007, for example), many distributions have been shipping Gtk 1 configured for UTF-8. Our Gtk 1 interface lacks proper support for UTF-8 in the keyboard handling routines, so this is a big problem that increases the need for Lazarus to implement cross-platform Unicode support.
 
 
 
The Gtk2 widgetset only works with UTF-8 encoding and supports UTF-8 completely.

As of Lazarus 0.9.28 the Windows and Windows CE interfaces support Unicode fully.

The Qt interface is prepared for UTF-8. Qt itself uses UTF-16 as its native encoding, but the Lazarus interface for Qt converts from UTF-8 to UTF-16.
 
 
 
For more, see: [[LCL Internals#Internals of the LCL|Internals of the LCL]]
 
 
 
== Unicode-enabling the win32 interface  ==
 
 
 
 
 
 
 
=== Guidelines ===
 
 
 
First, and most importantly, all Unicode patches for the Win32 interface must be enclosed by IFDEF WindowsUnicodeSupport, to avoid breaking the existing ANSI interface. After this stabilizes, all ifdefs will be removed and only the Unicode part will remain. At that point all existing programs that use ANSI characters will need migration to Unicode.
 
 
 
Windows platforms up to and including Win9x are based on ISO code page standards and only partially support Unicode. Windows platforms starting with WinNT, as well as Windows CE, fully support Unicode. Win 9x and NT offer two parallel sets of API functions: the old ANSI-enabled *A and the new, Unicode-enabled *W. *W functions accept wide strings, i.e. UTF-16 encoded strings, as parameters. Windows CE only uses Wide API functions.
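
Since the LCL side uses UTF-8, a string has to be converted before it is passed to a *W function. The sketch below only shows the conversion itself with UTF8Decode from the System unit; the program name and the sample text are made up.

<delphi>
program WideDemo; // hypothetical name, for illustration only
{$mode objfpc}{$H+}
var
  Utf8Text: string;
  WideText: WideString;
begin
  Utf8Text := 'Gr'#$C3#$BC#$C3#$9F'e'; // "Grüße" written as UTF-8 bytes
  WideText := UTF8Decode(Utf8Text);    // UTF-16, as expected by *W functions
  writeln(Length(Utf8Text));           // length in bytes: 7
  writeln(Length(WideText));           // length in 16-bit words: 5
end.
</delphi>

Note that the two lengths differ, so a length parameter of a *W function must be taken from the converted string, not from the UTF-8 original.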
 
 
 
====Wide functions present on Windows 9x====
 
 
 
Some Wide API functions are present on Windows 9x. Here is a list of such functions: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/mslu/winprog/other_existing_unicode_support.asp
 
 
 
Conversion example:
 
 
 
<pre>
  GetTextExtentPoint32(hdcNewBitmap, LPSTR(ButtonCaption),
    Length(ButtonCaption), TextSize);
</pre>
 
 
 
Becomes:
 
 
 
<pre>
  {$ifdef WindowsUnicodeSupport}
    // WideCaption is a local WideString; its length is in UTF-16 units
    WideCaption := Utf8Decode(ButtonCaption);
    GetTextExtentPoint32W(hdcNewBitmap, PWideChar(WideCaption), Length(WideCaption), TextSize);
  {$else}
    GetTextExtentPoint32(hdcNewBitmap, LPSTR(ButtonCaption), Length(ButtonCaption), TextSize);
  {$endif}
</pre>
 
 
 
====Functions that need Ansi and Wide versions====
 
 
 
First Conversion example:
 
 
 
<pre>
function TGDIWindow.GetTitle: String;
var
  l: Integer;
begin
  l := Windows.GetWindowTextLength(Handle);
  SetLength(Result, l);
  Windows.GetWindowText(Handle, @Result[1], l);
end;
</pre>
 
 
 
Becomes:
 
 
 
<pre>
function TGDIWindow.GetTitle: String;
var
  l: Integer;
  AnsiBuffer: string;
  WideBuffer: WideString;
begin
{$ifdef WindowsUnicodeSupport}
  if UnicodeEnabledOS then
  begin
    l := Windows.GetWindowTextLengthW(Handle);
    SetLength(WideBuffer, l);
    // the count passed to GetWindowTextW includes the terminating null, hence l + 1
    if l > 0 then
      l := Windows.GetWindowTextW(Handle, @WideBuffer[1], l + 1);
    SetLength(WideBuffer, l);
    Result := Utf8Encode(WideBuffer);
  end
  else
  begin
    l := Windows.GetWindowTextLength(Handle);
    SetLength(AnsiBuffer, l);
    // same here: the count includes the terminating null
    if l > 0 then
      l := Windows.GetWindowText(Handle, @AnsiBuffer[1], l + 1);
    SetLength(AnsiBuffer, l);
    Result := AnsiToUtf8(AnsiBuffer);
  end;
{$else}
  // legacy ANSI code, unchanged
  l := Windows.GetWindowTextLength(Handle);
  SetLength(Result, l);
  Windows.GetWindowText(Handle, @Result[1], l);
{$endif}
end;
</pre>
 
 
 
=== Roadmap ===
 
 
 
==== What should already be working with Unicode: ====
 
 
 
* TForm, TButton, TLabel
 
* Most controls
 
* Menus
 
* LCLIntf.ExtTextOut and most other text related winapis
 
* TStrings based controls. Examples: TComboBox, TListBox, etc
 
* SynEdit shows and can input UTF-8 characters correctly
 
* Setting/Getting unicode strings to/from the ClipBoard
 
* Setting the Application Title in the project options to (for example) 'Minha Aplicação'.
 
* Double clicking words with non-ascii chars in the editor to select them
 
 
 
==== Known problems with Unicode support ====
 
 
 
* SynEdit does not support RTL (right to left)
 
* <s>Is OpenFileDialogCallBack tested with selection large numbers of files?
 
** Is this problem unicode specific? I think it's a generic problem. --[[User:Sekelsenmat|Sekelsenmat]] 13:40, 14 February 2008 (CET)
 
*** Maybe. I know I tested it with large number of files before the Unicode version was added. If it is a generic problem, then the the non-Unicode version got broken, when the Unicode version was added. [[User:Vincent|Vincent]] 21:45, 15 February 2008 (CET)
 
**** Associated bugtracker item: http://bugs.freepascal.org/view.php?id=10918</s> Fixed and implemented.
 
* <s>class function TWin32WSSelectDirectoryDialog.CreateHandle: Title, FileName and InitialDir should be made Unicode aware.
 
** Associated bugtracker item: http://bugs.freepascal.org/view.php?id=10919</s> Implemented.
 
 
 
'''Possible problems with Unicode support'''
 
 
 
Based on a code review, the following needs to be tested, because the code doesn't seem to be Unicode aware:
 
* <s>class procedure TWin32WSCustomComboBox.SetText</s>
 
* <s>TWin32WSCustomTrayIcon.Show: ATrayIcon.Hint is not Unicode aware</s>
 
* <s>TWin32WidgetSet.MessageBox doesn't call MessageBoxW.</s>
 
* TWin32WidgetSet.TextOut: Is Windows.TextOut supported on windows 9X?
 
** Yes, please see [[LCL_Unicode_Support#Wide_functions_present_on_Windows_9x]]. Although I never tested this.
 
* MessageBox buttons don't show unicode correctly when they are translated. Tested on the IDE. Could be a problem on the IDE however.
 
** Note: I couldn't reproduce using the portuguese translation --[[User:Sekelsenmat|Sekelsenmat]] 22:20, 12 January 2008 (CET)
 
* (list of unconfirmed problems, if confirmed can be moved to the list above)
 
 
 
 
 
 
 
=== Screenshots ===
 
 
 
[[Image:Lazarus Unicode Test.png]]
 
 
 
== See Also ==
 
 
 
* [[UTF-8]] - Description of UTF-8 strings
 

Latest revision as of 00:58, 25 October 2019