LCL Unicode Support

 
 
== Introduction ==
 
 
 
Lazarus's support for the Unicode standard needs further development, mostly with regard to the Windows platform. Here is some basic information for those who would like to further develop Lazarus Unicode support. Please correct, extend and update this page.
 
 
 
It will help if you have already heard of the Unicode standard and perhaps have some experience with WideStrings under Delphi. Previous use of non-(Western) Latin scripts and their various character sets will help too.
 
 
 
== Unicode essentials ==
 
 
 
The Unicode standard maps integers from 0 to 10FFFFh to characters. Each such mapping is called a code point. In other words, Unicode characters are in principle defined for code points from U+0000 to U+10FFFF (0 to 1,114,111).
 
 
 
There are three schemes for representing Unicode code points as unique byte sequences. These schemes are called Unicode transformation formats: UTF-8, UTF-16 and UTF-32. Conversion between any of them is possible without loss. Here are their basic properties:
 
                            UTF-8  UTF-16  UTF-32
 Smallest code point [hex]  000000 000000  000000
 Largest code point  [hex]  10FFFF 10FFFF  10FFFF
 Code unit size [bits]           8     16      32
 Minimal bytes/character         1      2       4
 Maximal bytes/character         4      4       4
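The figures in this table can be checked with a short snippet. (Python is used here purely for illustration; LCL code itself is Object Pascal.)

```python
# Check the bytes-per-character figures in the table above for a few
# sample characters, including an ASCII letter and the largest code point.
for ch in ["A", "\u00e9", "\u20ac", "\U0010FFFF"]:
    print("U+%06X: utf-8=%d, utf-16=%d, utf-32=%d bytes" % (
        ord(ch),
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),   # -le variants: no byte order mark
        len(ch.encode("utf-32-le")),
    ))
```

Running this prints 1/2/4 bytes for "A" and 4/4/4 bytes for U+10FFFF, matching the minimal and maximal rows of the table.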
 
 
 
UTF-8 has several important and useful properties: It is interpreted as a plain sequence of bytes, so the concept of low- and high-order bytes does not exist. Unicode characters U+0000 to U+007F (ASCII) are encoded simply as bytes 00h to 7Fh (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8. All characters above U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. The byte sequence of one character is never contained within the longer byte sequence of another character, which allows easy searching for substrings. The first byte of a multibyte sequence representing a non-ASCII character is always in the range C2h to F4h (in valid UTF-8) and indicates how many bytes follow for that character. All further bytes in a multibyte sequence are in the range 80h to BFh. This allows easy resynchronization and robustness.
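These byte-level properties can be observed directly; a minimal sketch (again in Python, for illustration only):

```python
# Encode a string containing an ASCII, a two-byte and a three-byte
# character and inspect the resulting UTF-8 byte sequence.
data = "a\u00e9\u20ac".encode("utf-8")   # 'a', 'é', '€'

# ASCII characters keep their single-byte encoding.
assert data[0] == 0x61

# Every byte of a non-ASCII character has the most significant bit set;
# lead bytes are >= C2h (in valid UTF-8), continuation bytes are 80h..BFh.
for b in data[1:]:
    assert b & 0x80                      # most significant bit set
    assert 0x80 <= b <= 0xBF or b >= 0xC2

print([hex(b) for b in data])
```

The printed sequence is `['0x61', '0xc3', '0xa9', '0xe2', '0x82', '0xac']`: one byte for 'a', two for 'é', three for '€'.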
 
 
 
UTF-16 has the following most important properties: It uses a single 16-bit unit to encode any character from U+0000 to U+FFFF (excluding the range U+D800 to U+DFFF, which is reserved for the encoding itself), and a pair of 16-bit units, called a surrogate pair, to encode any of the remaining Unicode characters.
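The surrogate-pair arithmetic can be sketched as follows (Python for illustration; the constants D800h and DC00h come from the UTF-16 definition):

```python
import struct

# A BMP character takes one 16-bit unit; a character above U+FFFF takes two.
assert len("\u20ac".encode("utf-16-le")) == 2      # U+20AC: one unit
assert len("\U0001F600".encode("utf-16-le")) == 4  # U+1F600: two units

# Computing the surrogate pair for a code point cp > FFFFh:
cp = 0x1F600
v = cp - 0x10000                 # remaining 20-bit value to split
hi = 0xD800 + (v >> 10)          # high (lead) surrogate: D83Dh
lo = 0xDC00 + (v & 0x3FF)        # low (trail) surrogate: DE00h
assert struct.pack("<HH", hi, lo) == "\U0001F600".encode("utf-16-le")
```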
 
 
 
Finally, any Unicode character can be represented as a
 
single 32-bit unit in UTF-32.
 
 
 
For more, see:
 
Unicode FAQ - Basic questions: http://www.unicode.org/faq/basic_q.html
 
Unicode FAQ - UTF-8, UTF-16, UTF-32 & BOM: http://www.unicode.org/faq/utf_bom.html
 
Wikipedia: UTF-8 http://en.wikipedia.org/wiki/UTF-8
 
 
 
 
 
== Lazarus component library architecture essentials ==
 
 
 
(This part is based on a mail by Marc Weustink.) The LCL consists of two parts: 1. a target-platform-independent part, which implements a class hierarchy analogous to the Delphi VCL; 2. the "interfaces" part, which implements the binding to the API of each target platform.
 
 
 
The communication between the two parts is done through an abstract class, TWidgetset. Each widgetset is implemented by its own class derived from TWidgetset.
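The two-part design can be sketched as follows. (A Python sketch for illustration only; all class and method names except TWidgetset are hypothetical, and the real LCL is Object Pascal.)

```python
from abc import ABC, abstractmethod

class TWidgetset(ABC):
    """Abstract class through which the platform-independent part
    talks to the platform-specific part."""
    @abstractmethod
    def create_button(self, caption: str) -> str: ...

class TGtkWidgetset(TWidgetset):       # hypothetical subclass
    def create_button(self, caption: str) -> str:
        return "gtk button: " + caption

class TWin32Widgetset(TWidgetset):     # hypothetical subclass
    def create_button(self, caption: str) -> str:
        return "win32 button: " + caption

def build_ui(ws: TWidgetset) -> str:
    # Platform-independent code sees only the abstract interface.
    return ws.create_button("OK")

print(build_ui(TGtkWidgetset()))    # gtk button: OK
print(build_ui(TWin32Widgetset()))  # win32 button: OK
```

The same `build_ui` code works unchanged with either widgetset, which is the point of routing all communication through the abstract class.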
 
 
 
The GTK widgetset is the oldest. In this widgetset the string encoding is determined by the LANG environment variable. If it is a UTF-8 variant, all strings passed to and from native controls/widgets are UTF-8 encoded. However, UTF-8 may affect keyboard handling under GTK1. GTK2 solves this problem, but the solution is not yet implemented in the LCL; the keyboard routines there still rely on GTK1 code.
 
 
 
The win32 interface is set up with ANSI widgets, so it is currently not possible to use Unicode with win32.
 
 
 
For more, see:
 
 
 
Internals of the LCL: http://wiki.lazarus.freepascal.org/index.php/LCL_Internals#Internals_of_the_LCL
 
 
 
 
 
== Unicode-enabling the win32 interface  ==
 
 
 
=== Requirements ===
 
 
 
The spirit of Lazarus is: "Write once, compile everywhere." This means that, ideally, a Unicode-enabled application should have only one Unicode-supporting source code version, without any conditional defines with respect to the various target platforms.
 
 
 
The "interfaces" part of the LCL should support Unicode for the target platforms which themselves support it, while concealing all platform peculiarities from the application programmer.
 
 
 
Windows platforms up to and including Win9x are based on ANSI code page standards and do not support Unicode. Windows platforms starting with WinNT support Unicode. In doing so, these platforms offer two parallel sets of API functions: the old, ANSI-enabled *A functions and the new, Unicode-enabled *W functions. The *W functions accept wide strings, i.e. UTF-16 encoded strings, as parameters.
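The re-encoding that an interface layer must perform at the *W boundary can be sketched as follows (Python for illustration; the sample string is arbitrary):

```python
# If strings are held internally as UTF-8, the interface layer must
# re-encode them as UTF-16 wide strings before calling a *W function.
utf8_bytes = "Gr\u00fc\u00dfe".encode("utf-8")         # internal form
wide = utf8_bytes.decode("utf-8").encode("utf-16-le")  # form for *W calls

assert len(utf8_bytes) == 7   # 'ü' and 'ß' take two bytes each in UTF-8
assert len(wide) == 10        # five 16-bit units in UTF-16
```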
 
 
 
As far as Lazarus is concerned, the internal string communication at the boundaries "application code <--> LCL" and "LCL <--> widgetsets" is based on classical (byte-oriented) strings. Logically, their contents should be encoded as UTF-8.
 
 
 
It is sound to assume that the existing WinXX application base internally uses not UTF-8 encoded strings but code-page-based ones. Any Unicode-enabling changes to the LCL and the widgetsets for win32 must not break this existing application base. At the same time, they should support applications internally based on Unicode UTF-8 encoded strings, both on the older Win9x platforms and on the Unicode-based platforms from WinNT onwards.
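The difference between the two internal representations can be illustrated as follows. (Python for illustration; cp1252 is used here as an example Windows ANSI code page, an assumption for demonstration.)

```python
# A legacy application holds 'café' as code-page bytes; a UTF-8 based
# application holds the same text differently, so an interface layer
# must know which representation it is handed.
ansi = "caf\u00e9".encode("cp1252")
utf8 = "caf\u00e9".encode("utf-8")

assert ansi == b"caf\xe9"        # single byte E9h for 'é'
assert utf8 == b"caf\xc3\xa9"    # two bytes C3h A9h for 'é'

# Converting from the code page to UTF-8 goes through Unicode:
assert ansi.decode("cp1252").encode("utf-8") == utf8
```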
 
 
 
 
 
=== A solution approach ===
 
 
 
=== Making progress ===
 

Latest revision as of 01:58, 25 October 2019