(183 intermediate revisions by 19 users not shown) |
Line 1: |
− | {{LCL Unicode Support}}
| + | #REDIRECT [[Unicode Support in Lazarus]] |
− | | |
− | == Introduction ==
| |
− | | |
− | Lazarus support of the Unicode standard needs further
| |
− | development, mostly with regard to the Windows platform. Here
| |
− | is some basic information for those who would like to
| |
− | further develop the Lazarus Unicode support.
| |
− | Please correct, extend and update this page.
| |
− | | |
− | It will help if you have already heard of the Unicode
| |
− | standard and if you perhaps have some experience with
| |
− | WideStrings under Delphi. Previous use of
| |
− | non-(western)Latin scripts and their various character sets
| |
− | will help too.
| |
− | | |
− | == Unicode essentials ==
| |
− | | |
− | The Unicode standard maps integers from 0 to 10FFFF(h) to
| |
− | characters. Each such integer is called a code point. In
| |
− | other words, Unicode characters are in principle defined for
| |
− | code points from U+000000 to U+10FFFF (0 to 1 114 111).
| |
− | | |
− | There are three schemes for representing Unicode code points
| |
− | as unique byte sequences. These schemes are called Unicode
| |
− | transformation formats: UTF-8, UTF-16 and UTF-32. Lossless
| |
− | conversion between any two of them is possible.
| |
− | Here are their basic properties:
| |
− |                            UTF-8   UTF-16  UTF-32
| |
− | Smallest code point [hex]  000000  000000  000000
| |
− | Largest code point [hex]   10FFFF  10FFFF  10FFFF
| |
− | Code unit size [bits]      8       16      32
| |
− | Minimal bytes/character    1       2       4
| |
− | Maximal bytes/character    4       4       4
| |
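The byte counts in the table can be checked with a short script. This is only an illustrative sketch in Python (not Lazarus code); the sample code points and the helper name `encoded_lengths` are assumptions chosen for the example.

```python
# Encode a few sample code points in each Unicode transformation format
# and report how many bytes each encoding needs.
SAMPLES = [0x41, 0xE9, 0x20AC, 0x10FFFF]  # 'A', 'é', '€', largest code point


def encoded_lengths(code_point):
    """Return the (UTF-8, UTF-16, UTF-32) byte counts for one code point."""
    ch = chr(code_point)
    # The "-le" codecs are used so that no byte order mark is prepended.
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )


for cp in SAMPLES:
    print("U+%06X" % cp, encoded_lengths(cp))
```

ASCII characters need one byte in UTF-8, while the largest code point needs four bytes in every format.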
− | | |
− | UTF-8 has several important and useful properties: It is
| |
− | interpreted as a sequence of bytes, so that the concept of
| |
− | lo- and hi-order byte does not exist. Unicode
| |
− | characters U+0000 to U+007F (ASCII) are encoded simply as
| |
− | bytes 00h to 7Fh (ASCII compatibility). This means that
| |
− | files and strings which contain only 7-bit ASCII characters
| |
− | have the same encoding under both ASCII and UTF-8. All
| |
− | characters >U+007F are encoded as a sequence of several
| |
− | bytes, each of which has the most significant bit set. No
| |
− | byte sequence of one character is contained within a longer
| |
− | byte sequence of another character. This allows easy search for substrings. The first byte of a
| |
− | multibyte sequence that represents a non-ASCII character is
| |
− | always in the range C2h to F4h and it indicates how many
| |
− | bytes follow for this character. All further bytes in a
| |
− | multibyte sequence are in the range 80h to BFh. This allows
| |
− | easy resynchronization and robustness.
| |
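The lead-byte and continuation-byte rules can be sketched as follows. This is an illustrative Python sketch, assuming the current UTF-8 definition (RFC 3629), in which sequences are at most four bytes long and lead bytes of multibyte sequences lie in the range C2h to F4h. The function names are hypothetical.

```python
def sequence_length(first_byte):
    """Return the length of the UTF-8 sequence that starts with first_byte.

    Returns 0 if the byte cannot start a sequence (it is a continuation
    byte or a value that never occurs in UTF-8 under RFC 3629).
    """
    if first_byte <= 0x7F:
        return 1  # plain ASCII
    if 0xC2 <= first_byte <= 0xDF:
        return 2
    if 0xE0 <= first_byte <= 0xEF:
        return 3
    if 0xF0 <= first_byte <= 0xF4:
        return 4
    return 0


def is_continuation(byte):
    """True for the second and later bytes of a multibyte sequence."""
    return 0x80 <= byte <= 0xBF


def resync(data, pos):
    """Advance pos to the start of the next character in data."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


data = "a€b".encode("utf-8")  # bytes 61 E2 82 AC 62
print(sequence_length(data[1]))  # length announced by the lead byte E2h
print(resync(data, 2))  # starting inside '€', skip to the next character
```

Because continuation bytes can never be mistaken for lead bytes, a reader that starts in the middle of a character only has to skip forward a few bytes to find a valid boundary.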
− | | |
− | UTF-16 has the following most important properties: It uses a
| |
− | single 16-bit word to encode any of the characters from U+0000
| |
− | to U+FFFF, and a pair of 16-bit words (a surrogate pair) to encode any of the
| |
− | remaining Unicode characters.
| |
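How a code point above U+FFFF is split into a pair of 16-bit words can be sketched like this. The arithmetic follows the UTF-16 definition; the function name `to_utf16_units` is hypothetical and the sketch does not validate its input (for example, it does not reject the surrogate range D800h to DFFFh).

```python
def to_utf16_units(code_point):
    """Return the list of 16-bit code units that encode code_point."""
    if code_point <= 0xFFFF:
        return [code_point]  # a single 16-bit word is enough
    offset = code_point - 0x10000  # 20-bit value
    high = 0xD800 + (offset >> 10)  # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # lower 10 bits
    return [high, low]


for cp in (0x20AC, 0x1D11E, 0x10FFFF):
    print("U+%06X" % cp, ["%04X" % unit for unit in to_utf16_units(cp)])
```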
− | | |
− | Finally, any Unicode character can be represented as a
| |
− | single 32-bit unit in UTF-32.
| |
− | | |
− | For more, see:
| |
− | Unicode FAQ - Basic questions: http://www.unicode.org/faq/basic_q.html
| |
− | Unicode FAQ - UTF-8, UTF-16, UTF-32 & BOM: http://www.unicode.org/faq/utf_bom.html
| |
− | Wikipedia: UTF-8 http://en.wikipedia.org/wiki/UTF-8
| |
− | | |
− | | |
− | == Lazarus component library architecture essentials ==
| |
− | | |
− | (This part is based on a mail by Marc Weustink.)
| |
− | The LCL consists of two parts: 1. A target platform independent
| |
− | part, which implements a class hierarchy analogous to Delphi
| |
− | VCL; 2. "Interfaces" - a part that implements the interface
| |
− | to APIs of each target platform.
| |
− | | |
− | The communication between the two parts is done by an
| |
− | abstract class TWidgetset. Each widgetset is implemented by
| |
− | its own class derived from TWidgetset.
| |
− | | |
− | The GTK widgetset is the oldest. In this widgetset the
| |
− | string encoding is determined by the LANG environment variable.
| |
− | If it is a UTF-8 variant, all strings from and to native
| |
− | controls/widgets are UTF-8 encoded. However, UTF-8 may affect
| |
− | keyboard handling for gtk1. On gtk2 this problem can be solved,
| |
− | but the solution is not implemented yet: the keyboard routines
| |
− | there still rely on gtk1 code.
| |
− | | |
− | The win32 interface is set up with ANSI widgets, so it is
| |
− | currently not possible to use Unicode with win32.
| |
− | | |
− | For more, see:
| |
− | | |
− | Internals of the LCL: http://wiki.lazarus.freepascal.org/index.php/LCL_Internals#Internals_of_the_LCL
| |
− | | |
− | | |
− | == Unicode-enabling the win32 interface ==
| |
− | | |
− | === Requirements ===
| |
− | | |
− | The spirit of Lazarus is: "Write once, compile everywhere."
| |
− | This means that, ideally, a Unicode-enabled application
| |
− | should have only one Unicode supporting source code version,
| |
− | without any conditional defines in respect to various target
| |
− | platforms.
| |
− | | |
− | The "interface" part of the LCL should support Unicode for
| |
− | the target platforms which support it themselves, concealing
| |
− | at the same time all peculiarities from the application
| |
− | programmer.
| |
− | | |
− | Windows platforms <=Win9x are based on ISO code page
| |
− | standards and do not support Unicode. Windows platforms
| |
− | starting with WinNT support Unicode. In doing that, these
| |
− | platforms offer two parallel sets of API functions: the old
| |
− | ANSI-enabled *A and the new, Unicode-enabled *W. *W
| |
− | functions accept wide strings, i.e. UTF-16 encoded strings,
| |
− | as parameters.
| |
− | | |
− | As far as Lazarus is concerned, the internal string communication at
| |
− | the boundaries "Application code <--> LCL", as well as "LCL
| |
− | <--> Widgetsets" is based on the classical (byte oriented)
| |
− | strings. Logically, their contents should be encoded according
| |
− | to UTF-8.
| |
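The conversion a Unicode-enabled widgetset would need at this boundary, from a byte-oriented UTF-8 string to the UTF-16 data expected by a *W function, can be sketched as follows. This is an illustrative Python sketch only; the helper name `utf8_to_wide` is hypothetical and is not part of the LCL, and a real widgetset would do the equivalent in Pascal.

```python
def utf8_to_wide(utf8_bytes):
    """Convert a UTF-8 byte string to null-terminated UTF-16 (little endian).

    Raises UnicodeDecodeError if the input is not valid UTF-8, for example
    when an application passes a code page encoded string instead.
    """
    text = utf8_bytes.decode("utf-8")
    # Two zero bytes form the 16-bit terminator of a wide C string.
    return text.encode("utf-16-le") + b"\x00\x00"


caption = "Grüße".encode("utf-8")  # what the application would hand over
print(utf8_to_wide(caption).hex(" "))
```

The decoding step is also where strings from existing code page based applications would fail, which is why such a change has to consider the existing application base.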
− | | |
− | It is reasonable to assume that the existing WinXX application
| |
− | base internally does not use UTF-8 encoded strings, but
| |
− | the ISO code page based ones. Any Unicode enabling changes
| |
− | to LCL and widget sets for win32 must not break the existing
| |
− | application base. At the same time they should support
| |
− | applications internally based on the Unicode UTF-8 encoded
| |
− | strings, both on older Win9x platforms, as well as on
| |
− | Unicode based >=WinNT platforms.
| |
− | | |
− | | |
− | === A solution approach ===
| |
− | | |
− | === Making progress ===
| |