Difference between revisions of "LCL Unicode Support"

Latest revision as of 01:58, 25 October 2019

Redirect to:

Unicode Support in Lazarus

@@ Line 1: / Line 1: @@
-{{LCL Unicode Support}}
+#REDIRECT [[Unicode Support in Lazarus]]
-== Introduction ==
-Lazarus support for the Unicode standard needs further
-development, mostly with regard to the Windows platform. Here
-is some basic information for those who would like to
-develop the Lazarus Unicode support further.
-Please correct, extend and update this page.
-It will help if you have already heard of the Unicode
-standard and perhaps have some experience with
-WideStrings under Delphi. Previous use of
-non-(western)Latin scripts and their various character sets
-will help too.
-== Unicode essentials ==
-The Unicode standard maps integers from 0 to 10FFFF(h) to
-characters. Each such integer is called a code point. In
-other words, Unicode characters are in principle defined for
-code points from U+000000 to U+10FFFF (0 to 1 114 111).
-There are three schemes for representing Unicode code points
-as unique byte sequences. These schemes are called Unicode
-transformation formats: UTF-8, UTF-16 and UTF-32. The
-conversions between all of them are possible.
-Here are their basic properties:
-                            UTF-8 UTF-16 UTF-32
- Smallest code point [hex] 000000 000000 000000
- Largest code point  [hex] 10FFFF 10FFFF 10FFFF
- Code unit size [bits]          8     16     32
- Minimal bytes/character        1      2      4
- Maximal bytes/character        4      4      4
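The byte counts in the table can be checked with a short script. This is only an illustrative sketch: the sample characters and the helper name `encoded_sizes` are chosen here for demonstration, and Python's built-in codecs stand in for whatever conversion routines an application would really use.

```python
# Byte length of sample characters in each Unicode transformation format.
# The little-endian codec variants are used so that no byte order mark is added.
SAMPLES = {
    "U+0041 LATIN CAPITAL LETTER A": "\u0041",
    "U+00E9 LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "U+20AC EURO SIGN": "\u20ac",
    "U+1F600 GRINNING FACE": "\U0001F600",
}


def encoded_sizes(ch):
    """Return (UTF-8, UTF-16, UTF-32) byte lengths of a single character."""
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))


for name, ch in SAMPLES.items():
    print(name, encoded_sizes(ch))
```

The first sample needs 1, 2 and 4 bytes respectively, while the last one, which lies above U+FFFF, needs 4 bytes in all three formats.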
-UTF-8 has several important and useful properties: It is
-interpreted as a sequence of bytes, so the concept of
-lo- and hi-order byte does not exist. Unicode
-characters U+0000 to U+007F (ASCII) are encoded simply as
-bytes 00h to 7Fh (ASCII compatibility). This means that
-files and strings which contain only 7-bit ASCII characters
-have the same encoding under both ASCII and UTF-8. All
-characters >U+007F are encoded as a sequence of several
-bytes, each of which has the most significant bit set. No
-byte sequence of one character is contained within a longer
-byte sequence of another character. This allows easy
-searching for substrings. The first byte of a
-multibyte sequence that represents a non-ASCII character is
-always in the range C2h to F4h and it indicates how many
-bytes follow for this character. All further bytes in a
-multibyte sequence are in the range 80h to BFh. This allows
-easy resynchronization and robustness.
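The resynchronization property can be sketched as follows. The helper names `is_continuation` and `char_start` are hypothetical, introduced only for this illustration; the sketch assumes well-formed UTF-8 input and does no validation.

```python
def is_continuation(byte):
    """True for bytes 80h..BFh, which never start a character."""
    return 0x80 <= byte <= 0xBF


def char_start(data, index):
    """Step back from any byte offset to the first byte of its character."""
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index


# 'a', U+00E9 and U+20AC encode as: 61 | C3 A9 | E2 82 AC
data = "a\u00e9\u20ac".encode("utf-8")

# Offset 5 is the last byte of the three-byte sequence that starts at offset 3.
print(char_start(data, 5))

# A plain byte-wise search finds a character only at a real character boundary.
print(data.find("\u00e9".encode("utf-8")))
```

Because continuation bytes are recognizable on their own, a reader that lands in the middle of a character never has to scan back more than three bytes.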
-UTF-16 has the following most important properties: It uses a
-single 16-bit word to encode any of the characters from U+0000
-to U+FFFF, and a pair of 16-bit words (a surrogate pair) to
-encode any of the remaining Unicode characters.
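How a code point above U+FFFF is split into a pair of 16-bit words (a surrogate pair) can be sketched like this. The function name `to_utf16_units` is made up for this example; the arithmetic follows the UTF-16 definition in the Unicode standard.

```python
def to_utf16_units(code_point):
    """Return the list of 16-bit code units that encode one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if code_point <= 0xFFFF:
        return [code_point]              # one 16-bit word is enough
    v = code_point - 0x10000             # 20 bits remain to be stored
    high = 0xD800 + (v >> 10)            # upper 10 bits
    low = 0xDC00 + (v & 0x3FF)           # lower 10 bits
    return [high, low]


print([hex(u) for u in to_utf16_units(0x20AC)])    # ['0x20ac']
print([hex(u) for u in to_utf16_units(0x1F600)])   # ['0xd83d', '0xde00']
```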
-Finally, any Unicode character can be represented as a
-single 32-bit unit in UTF-32.
-For more, see:
- Unicode FAQ - Basic questions: http://www.unicode.org/faq/basic_q.html
- Unicode FAQ - UTF-8, UTF-16, UTF-32 & BOM: http://www.unicode.org/faq/utf_bom.html
- Wikipedia: UTF-8 http://en.wikipedia.org/wiki/UTF-8
-== Lazarus component library architecture essentials ==
-(This part is based on a mail by Marc Weustink)
-The LCL consists of two parts: 1. A target platform independent
-part, which implements a class hierarchy analogous to Delphi
-VCL; 2. "Interfaces" - a part that implements the interface
-to APIs of each target platform.
-The communication between the two parts is done by an
-abstract class TWidgetset. Each widgetset is implemented by
-its own class derived from TWidgetset.
-The GTK widgetset is the oldest. In this widgetset the
-string encoding is determined by the LANG environment variable.
-If it is a UTF-8 variant, all strings passed to and from native
-controls/widgets are UTF-8 encoded. However, UTF-8 may affect
-keyboard handling for gtk1. gtk2 itself solves this problem,
-but the LCL does not make use of that yet: the keyboard
-routines for gtk2 still rely on the gtk1 code.
-The win32 interface is set up with ANSI widgets, so it is
-currently not possible to use Unicode with win32.
-For more, see:
- Internals of the LCL: http://wiki.lazarus.freepascal.org/index.php/LCL_Internals#Internals_of_the_LCL
-== Unicode-enabling the win32 interface  ==
-=== Requirements ===
-The spirit of Lazarus is: "Write once, compile everywhere."
-This means that, ideally, a Unicode-enabled application
-should have only one Unicode-supporting source code version,
-without any conditional defines with respect to the various
-target platforms.
-The "interface" part of the LCL should support Unicode for
-the target platforms which support it themselves, concealing
-at the same time all peculiarities from the application
-programmer.
-Windows platforms <=Win9x are based on ISO code page
-standards and do not support Unicode. Windows platforms
-starting with WinNT support Unicode. To do so, these
-platforms offer two parallel sets of API functions: the old
-ANSI *A functions and the new, Unicode-enabled *W functions.
-*W functions accept wide strings, i.e. UTF-16 encoded strings,
-as parameters.
-As far as Lazarus is concerned, the internal string
-communication at the boundaries "Application code <--> LCL",
-as well as "LCL <--> Widgetsets", is based on classical
-(byte-oriented) strings. Logically, their contents should be
-encoded as UTF-8.
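The conversion that such a design implies at the widgetset boundary, from UTF-8 byte strings inside the LCL to UTF-16 wide strings for the *W functions and back, can be sketched as below. This is an assumption-laden illustration and not LCL code: the helper names `utf8_to_wide` and `wide_to_utf8` are hypothetical, Python's codecs stand in for the real conversion routines, and the terminating NUL word is added because wide strings passed to *W functions are conventionally NUL-terminated.

```python
def utf8_to_wide(utf8_bytes):
    """Convert a UTF-8 byte string to NUL-terminated UTF-16LE bytes."""
    text = utf8_bytes.decode("utf-8")       # raises UnicodeDecodeError on bad input
    return text.encode("utf-16-le") + b"\x00\x00"


def wide_to_utf8(wide_bytes):
    """Convert NUL-terminated UTF-16LE bytes back to a UTF-8 byte string."""
    if wide_bytes.endswith(b"\x00\x00"):
        wide_bytes = wide_bytes[:-2]        # drop the terminating NUL word
    return wide_bytes.decode("utf-16-le").encode("utf-8")


caption = "A\u20ac".encode("utf-8")         # what the application side would hold
wide = utf8_to_wide(caption)                # what a *W function would receive
print(wide.hex())                           # 4100ac200000
print(wide_to_utf8(wide) == caption)        # True
```

Keeping the conversion inside the interface layer is what would let application code stay free of platform-specific conditional defines.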
-It is reasonable to assume that the existing WinXX application
-base does not use UTF-8 encoded strings internally, but
-ISO code page based ones. Any Unicode-enabling changes
-to the LCL and widgetsets for win32 must not break the existing
-application base. At the same time they should support
-applications internally based on Unicode UTF-8 encoded
-strings, both on older Win9x platforms and on
-Unicode-based >=WinNT platforms.
-=== A solution approach ===
-=== Making progress ===
