Difference between revisions of "FPC Unicode support"
(ref to recent status message on unicode)
(Updated with some info from recent fpc ml posts)
Revision as of 19:02, 22 December 2012
Introduction
The Free Pascal compiler and the RTL/FCL should natively support Unicode. Delphi has supported Unicode for several releases, and FPC must be compatible with Delphi's Unicode support.
FPC 2.7.x Unicode plans
There will be a Unicode RTL and an ANSI/legacy compatibility RTL. See http://www.mail-archive.com/fpc-pascal@lists.freepascal.org/msg30919.html
Libraries
The string architecture for the FCL and other libraries has not yet been decided. See http://www.mail-archive.com/fpc-pascal@lists.freepascal.org/msg30934.html
Warning: This page has not been updated for a long time. Since FPC 2.7 (current development version), extensive Unicode support has been implemented.
Please update this page with the latest status, e.g. from this post on the FPC dev list: http://www.mail-archive.com/fpc-devel@lists.freepascal.org/msg27659.html
Old/obsolete sections
These sections are kept for historical reference - please update the sections above with this information if it is still applicable.
Tiburon Unicode support
Currently we have some information about Tiburon's Unicode support implementation.
http://blogs.codegear.com/abauer/2008/01/09/38845
http://blogs.codegear.com/abauer/2008/07/16/38864
FPC Unicode support
FPC must have the following string types with transparent conversion between them (like current AnsiString <-> WideString conversion):
- shortstring
- ansistring
- widestring
- utf8string
- utf16string
- utf32string
- ucs2string (?)
- ucs4string (?)
Development and further maintenance of these string types must be as simple as possible. It must be easy to add new string types in the future if needed.
The compiler uses a generic structure and helper routines to handle all reference-counted string types.
String header:
type
  TRefStringRec = packed record
    Encoding: Word;     // encoding of the string
    ElementSize: Byte;  // size in bytes of one string element (1-4)
    Ref: SizeInt;       // number of references
    Len: SizeInt;       // number of elements in the string
  end;
Helper routines determine how to handle a string from its header.
An extra parameter with string type information is passed to some routines (like fpc_RefString_SetLength) so that they can properly initialize new strings.
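The idea that one helper can serve every string type by reading the header can be sketched as follows. This is only an illustration under stated assumptions: `MakeHeader` and `PayloadBytes` are hypothetical names that do not exist in the RTL, and the encoding values 1200 and 12000 are the Windows code page identifiers for UTF-16LE and UTF-32LE, chosen here as an assumed convention for the `Encoding` field.

```pascal
program RefStringHeaderSketch;
{$mode objfpc}{$H+}

type
  // Header layout copied from the proposal above.
  TRefStringRec = packed record
    Encoding: Word;     // encoding of the string
    ElementSize: Byte;  // size in bytes of one string element (1-4)
    Ref: SizeInt;       // number of references
    Len: SizeInt;       // number of elements in the string
  end;

// Hypothetical helper: builds the header of a fresh string with one reference.
function MakeHeader(AEncoding: Word; AElementSize: Byte; ALen: SizeInt): TRefStringRec;
begin
  Result.Encoding := AEncoding;
  Result.ElementSize := AElementSize;
  Result.Ref := 1;
  Result.Len := ALen;
end;

// Hypothetical helper: payload size in bytes, derived from the header alone.
// The same routine works for 1-, 2- and 4-byte element types.
function PayloadBytes(const Hdr: TRefStringRec): SizeInt;
begin
  Result := Hdr.Len * Hdr.ElementSize;
end;

begin
  // 1200 and 12000 are assumed encoding ids (UTF-16LE and UTF-32LE code pages).
  WriteLn(PayloadBytes(MakeHeader(1200, 2, 5)));   // prints 10
  WriteLn(PayloadBytes(MakeHeader(12000, 4, 5)));  // prints 20
end.
```

Because the element size travels with the string, a routine such as a generic SetLength would only need the type information when it creates a new header, which matches the role of the extra parameter described above.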
The widestring type on Windows targets remains non-refcounted and OLE compatible. A minimal number of helper routines is used for it. On non-Windows targets widestring is an alias for utf16string.
The compiler uses helpers for string type conversions like this:
procedure fpc_ansistring_to_utf16string(out dst: utf16string; const src: ansistring);
procedure fpc_utf32string_to_utf16string(out dst: utf16string; const src: utf32string);
The compiler generates the helper procedure name from the type names. The compiler does not perform any string conversion handling by itself.
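The naming scheme visible in the two declarations above, `fpc_<source type>_to_<destination type>`, can be illustrated with a small sketch. `ConversionHelperName` is a hypothetical function written for this page. It is not part of the compiler, and the compiler's real name generation may differ in detail.

```pascal
program HelperNameSketch;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical function: derives a conversion helper name from two type names,
// following the pattern fpc_<source>_to_<destination> shown above.
function ConversionHelperName(const SrcType, DstType: string): string;
begin
  Result := 'fpc_' + LowerCase(SrcType) + '_to_' + LowerCase(DstType);
end;

begin
  WriteLn(ConversionHelperName('AnsiString', 'UTF16String'));
  // prints fpc_ansistring_to_utf16string
  WriteLn(ConversionHelperName('utf32string', 'utf16string'));
  // prints fpc_utf32string_to_utf16string
end.
```

With such a scheme, adding a new string type would mainly mean supplying the matching helper procedures, since the compiler itself contains no conversion logic.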
Status of Unicode support in FPC so far
Currently FPC 2.3.x has a new type called UnicodeString. It is similar to WideString, except that UnicodeString is reference counted on all platforms.
All implementation work is currently done in a separate svn branch: http://svn.freepascal.org/svn/fpc/branches/cpstrnew
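A minimal sketch of what reference counting means for UnicodeString in practice, assuming a compiler version that provides the type. Assignment shares the string data, and a later write to one variable leaves the other untouched. The variable names are illustrative only.

```pascal
program UnicodeStringSketch;
{$mode objfpc}{$H+}

var
  A, B: UnicodeString;
begin
  // Build a three-element string; WideChar($00A0) is NO-BREAK SPACE.
  A := 'a';
  A := A + WideChar($00A0) + 'b';

  B := A;        // assignment shares the reference-counted data
  B[1] := 'x';   // writing through B gives B its own copy; A keeps its value

  WriteLn(Length(A));          // prints 3 (UTF-16 code units)
  WriteLn(Ord(A[2]) = $00A0);  // prints TRUE
  WriteLn(A[1] = 'a');         // prints TRUE
  WriteLn(B[1] = 'x');         // prints TRUE
end.
```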
User visible changes
Full support of code page aware strings is not possible without breaking some existing code. The following list tries to summarize the most important user visible changes.
- The string header has two new fields: encoding and element size. On 32-bit platforms this increases the header size by 4 bytes, and on 64-bit platforms by 8 bytes.
- WideCharLenToString, UnicodeCharLenToString, WideCharToString, UnicodeCharToString and OleStrToString now return a UnicodeString instead of an AnsiString.
- The type of the dest parameter of WideCharLenToString and UnicodeCharLenToString has been changed from AnsiString to UnicodeString.
- UTF8ToAnsi and AnsiToUTF8 now take a RawByteString.
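To show why the distinction between byte-oriented strings and UnicodeString matters for the routines listed above, here is a small round-trip sketch. It uses `UTF8Encode` and `UTF8Decode` from the System unit rather than `UTF8ToAnsi`/`AnsiToUTF8`, because the result of the latter depends on the system code page. `Utf8ByteCount` is a hypothetical helper name.

```pascal
program Utf8RoundTripSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: number of bytes the string occupies when encoded as UTF-8.
function Utf8ByteCount(const U: UnicodeString): SizeInt;
var
  S: UTF8String;
begin
  S := UTF8Encode(U);
  Result := Length(S);
end;

var
  U, Back: UnicodeString;
begin
  U := 'a';
  U := U + WideChar($20AC);            // 'a' followed by EURO SIGN (U+20AC)
  Back := UTF8Decode(UTF8Encode(U));   // encode to bytes and decode again

  WriteLn(Length(U));         // prints 2 (UTF-16 code units)
  WriteLn(Utf8ByteCount(U));  // prints 4 (1 byte for 'a', 3 bytes for U+20AC)
  WriteLn(Back = U);          // prints TRUE
end.
```

The element count and the byte count differ, which is why code that assumed one byte per character can break once results are returned as UnicodeString.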
Roadmap of RTL Unicode support with UnicodeString
Topic | Status | Comments | Assigned To |
---|---|---|---|
Locale Variables | Not implemented | The variables are all 1 byte in size and cannot hold UnicodeChar-sized values. For example, the Russian thousand separator is a no-break space ($00A0), which does not fit in the ThousandSeparator variable (standard Char type). | |
TStrings | Not implemented | There is no UnicodeString version of TStrings | |
TStringList | Not implemented | There is no UnicodeString version of TStringList | |
Pos() | Working | | |
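The locale variable limitation in the table can be made concrete with a short sketch. It assumes the separator is stored as UTF-8, in which case the no-break space needs two bytes while a one-byte character variable offers only one. `SeparatorBytes` is a hypothetical helper and `Sep` stands in for a Unicode-capable separator variable that the RTL does not currently provide.

```pascal
program SeparatorSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: bytes needed to store a separator as UTF-8.
function SeparatorBytes(const Sep: UnicodeString): SizeInt;
var
  Encoded: UTF8String;
begin
  Encoded := UTF8Encode(Sep);
  Result := Length(Encoded);
end;

var
  Sep: UnicodeString;  // stand-in for a Unicode-capable separator variable
begin
  Sep := WideChar($00A0);        // NO-BREAK SPACE
  WriteLn(SizeOf(AnsiChar));     // prints 1: room in a one-byte Char variable
  WriteLn(SeparatorBytes(Sep));  // prints 2: bytes needed in UTF-8
end.
```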
Roadmap of RTL Unicode support with UTF8String
Topic | Status | Comments | Assigned To |
---|---|---|---|
UTF8String | Not implemented | Needs a real implementation. It is currently just an alias for ansistring. | |
TStrings | Not implemented | There is no UTF8String version of TStrings | |
TStringList | Not implemented | There is no UTF8String version of TStringList |