Difference between revisions of "FPC Unicode support"

From Lazarus wiki
(→‎Old/obsolete sections: removed some superfluous stuff)
(Added a bunch of information about the behaviour of the new string types)

Revision as of 18:31, 4 January 2014

Introduction

Up to and including FPC 2.6.x, the RTL is based on those of Turbo Pascal and Delphi 7. This means it was primarily built around the shortstring, ansistring and pchar types. None of these types has any encoding information associated with it; they were implicitly assumed to be encoded in the "default system encoding" and were passed on to OS API calls without any conversion.

In Delphi 2009, Embarcadero switched the entire RTL over to the unicodestring type, which represents strings using UTF-16. Additionally, they made the ansistring type "codepage-aware". This means that from then on, ansistrings store the codepage according to which their data should be interpreted.

FPC's language-level support for these string types is already available in the current development version of the compiler (FPC 2.7.1/trunk). The RTL-level support is not yet complete. This page gives an overview of the codepage-related behaviour of these string types, the current level of support in the RTL, and possible ways in which this support may be improved in the future.

Codepages

A codepage defines how the individual bytes of a string should be interpreted, i.e., which letter, symbol or other graphic character corresponds to every byte or sequence of bytes.

Codepage identifiers

A codepage is always stored as a TSystemCodePage, which is an alias for Word. The value represents the corresponding code page as defined by Microsoft. Additionally, there are 3 special code page values:

  • CP_ACP: this value represents the currently set "default system code page". See #Codepage settings for more information.
  • CP_OEM (declared as CP_OEMCP in the System unit): this value represents the OEM code page. On Windows platforms this corresponds to the code page used by the console (e.g. cmd.exe windows). On other platforms this value is interpreted the same as CP_ACP.
  • CP_NONE: this value indicates that no code page information has been associated with the string data. The result of any explicit or implicit operation that converts this data to another code page is undefined.
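
As a minimal sketch of how these identifiers look in code, the program below maps the special values to readable names. It assumes the constants CP_ACP, CP_OEMCP, CP_NONE and CP_UTF8 as declared in the System unit of FPC 2.7.1 or later; the helper CodePageName is hypothetical and only serves the illustration.

```pascal
program CodePageIds;
{$mode objfpc}{$H+}

// Hypothetical helper: returns a readable name for the special
// identifiers and the plain number for every other code page.
function CodePageName(cp: TSystemCodePage): string;
begin
  case cp of
    CP_ACP:   Result := 'CP_ACP';
    CP_OEMCP: Result := 'CP_OEMCP';
    CP_NONE:  Result := 'CP_NONE';
  else
    Str(cp, Result);
  end;
end;

begin
  WriteLn(CodePageName(CP_ACP));
  WriteLn(CodePageName(CP_NONE));
  WriteLn(CodePageName(CP_UTF8)); // a regular Microsoft identifier
  WriteLn(CodePageName(1250));    // Windows-1250
end.
```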

Codepage settings

The system unit contains several global variables that indicate the default code page used for certain operations.

DefaultSystemCodePage

  • Purpose: determines how CP_ACP is interpreted
  • Initial value:
    • Windows: The result of the GetACP OS call
    • iOS: UTF-8
    • Unix (excluding iOS): Based on the currently set LANG or LC_CTYPE environment variables. This is usually UTF-8, but that is not guaranteed to be the case.
    • Other platforms: CP_ACP (these platforms currently do not support multiple code pages, and are hardcoded to use their OS-specific code page in all cases)
  • Modifications: you can modify this value by calling SetMultiByteConversionCodePage(CodePage: TSystemCodePage)
  • Notes: Since the value of this variable can be changed, it is not a good idea to use its value to determine the real OS "default system code page" (unless you do it at program startup and are certain no other unit has changed it in its initialisation code).
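
The following sketch shows one way to follow the advice in the note: it records the value as early as the main program can, changes the interpretation of CP_ACP, and restores it afterwards. The variable name StartupCodePage is an assumption made for the example; unit initialisation code that ran earlier may still have changed the value before it is recorded.

```pascal
program SysCodePage;
{$mode objfpc}{$H+}

var
  StartupCodePage: TSystemCodePage; // hypothetical name, not part of the RTL

begin
  // Record the value before the program itself changes it.
  StartupCodePage := DefaultSystemCodePage;
  WriteLn('At startup: ', StartupCodePage);

  // From here on, CP_ACP is interpreted as UTF-8.
  SetMultiByteConversionCodePage(CP_UTF8);
  WriteLn('After change: ', DefaultSystemCodePage);

  // Restore the recorded value.
  SetMultiByteConversionCodePage(StartupCodePage);
end.
```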

DefaultFileSystemCodePage

  • Purpose: defines the code page to which file/path names are translated before they are passed to OS API calls, if the RTL uses a single byte OS API for this purpose. This code page is also used for intermediate operations on file paths inside the RTL before making OS API calls.
  • Initial value:
    • Windows: UTF-8, because the RTL uses UTF-16 OS API calls.
    • OS X and iOS: UTF-8
    • Unix (excluding OS X and iOS): DefaultSystemCodePage, because the encoding of file names is undefined on Unix platforms (it's an untyped array of bytes that can be interpreted in any way; it is not guaranteed to be valid UTF-8)
  • Modifications: you can modify this value by calling SetMultiByteFileSystemCodePage(CodePage: TSystemCodePage)
  • Notes: the Unix/OS X/iOS settings only apply in case the cwstring widestring manager is installed, otherwise DefaultFileSystemCodePage will have the same value as DefaultSystemCodePage

DefaultRTLFileSystemCodePage

  • Purpose: defines the code page to which file/path names are translated before they are returned from RTL routines, e.g. the file/path names returned by the RawByteString versions of SysUtils.FindFirst and System.GetDir. The main reason for its existence is to enable the RTL to provide backward compatibility with earlier versions of FPC, as these always returned strings encoded in DefaultFileSystemCodePage.
  • Initial value:
    • Windows: DefaultSystemCodePage, for backward compatibility.
    • OS X and iOS: UTF-8, for backward compatibility (it was already always UTF-8 in the past, since that's what the OS file APIs return and we did not convert this data).
    • Unix (excluding OS X and iOS): DefaultSystemCodePage, for the same reason as with DefaultFileSystemCodePage. Setting this to a different value than DefaultFileSystemCodePage is a bad idea on these platforms, since any code page conversion can corrupt these strings as their initial encoding is unknown in practice.
  • Modifications: you can modify this value by calling SetMultiByteRTLFileSystemCodePage(CodePage: TSystemCodePage)
  • Notes: same as for DefaultFileSystemCodePage.
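
A short sketch of inspecting and changing these settings follows. It assumes the three variables and SetMultiByteRTLFileSystemCodePage as described above; restricting the change to Windows is a choice made for this example, based on the warning about Unix platforms.

```pascal
program FsCodePages;
{$mode objfpc}{$H+}

begin
  WriteLn('DefaultSystemCodePage:        ', DefaultSystemCodePage);
  WriteLn('DefaultFileSystemCodePage:    ', DefaultFileSystemCodePage);
  WriteLn('DefaultRTLFileSystemCodePage: ', DefaultRTLFileSystemCodePage);

  {$ifdef windows}
  // Ask the RTL to return file names as UTF-8 instead of the ANSI code page.
  // This is not done on Unix, where the two file system code pages
  // should stay identical.
  SetMultiByteRTLFileSystemCodePage(CP_UTF8);
  WriteLn('DefaultRTLFileSystemCodePage is now ', DefaultRTLFileSystemCodePage);
  {$endif}
end.
```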

Strings

String/character types

Shortstring

The codepage of a shortstring is implicitly CP_ACP and hence will always be equal to the current value of DefaultSystemCodePage.

PAnsiChar/AnsiChar

These types are the same as the old PChar/Char types. In all compiler modes except for {$mode delphiunicode}, PChar/Char are also still aliases for PAnsiChar/AnsiChar. Their codepage is implicitly CP_ACP and hence will always be equal to the current value of DefaultSystemCodePage.

UnicodeString/WideString

These types behave the same as in previous versions:

  • Widestring is the same as a "COM BSTR" on Windows, and an alias for UnicodeString on all other platforms. Its string data is encoded using UTF-16
  • UnicodeString is a reference-counted string encoded using UTF-16

Ansistring

AnsiStrings are reference-counted types with a maximum length of high(SizeInt) bytes. Additionally, they now also have codepage information associated with them.

The most important thing to understand about the new AnsiString type is that it has both a declared/static/preferred/default codepage (called static codepage from now on) and a dynamic code page. The static code page tells the compiler that when assigning something to that AnsiString, it should first convert the data to that static codepage (except if it is CP_NONE, see #RawByteString below). The dynamic code page is a property of the AnsiString which, similar to the length and the reference count, defines the actual codepage of the data currently held by that AnsiString.

Static code page

The static code page of an AnsiString can only be defined by declaring a new type as follows:

type
  CP866String = type AnsiString(866); // note the extra "type"

The static code page of a variable declared as plain AnsiString is CP_ACP. In effect, the AnsiString type is now semantically defined in the System unit as

type
  AnsiString = type AnsiString(CP_ACP);

Another predefined AnsiString(X) type in the System unit is UTF8String:

type
  UTF8String = type AnsiString(CP_UTF8);

Once you have defined such a custom AnsiString(X) type, you can use it to declare variables, parameters, fields etc as usual.
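
As an illustration, the sketch below declares such a type and uses it for a variable and a parameter. The function ByteLength866 is hypothetical. The cwstring unit is included on Unix on the assumption that a widestring manager is needed there for code page conversions; StringCodePage is used to inspect the result.

```pascal
program StaticCodePage;
{$mode objfpc}{$H+}

{$ifdef unix}
uses
  cwstring; // assumed to be required on Unix for code page conversions
{$endif}

type
  CP866String = type AnsiString(866);

// Hypothetical helper. The parameter has static code page 866, so an
// argument with a different static code page is converted by the caller.
function ByteLength866(const s: CP866String): SizeInt;
begin
  Result := Length(s);
end;

var
  u: UTF8String;
  d: CP866String;
begin
  u := 'plain ASCII text';
  d := u; // converted from UTF-8 to code page 866 on assignment
  WriteLn('code page of u: ', StringCodePage(u));
  WriteLn('code page of d: ', StringCodePage(d));
  WriteLn('bytes as CP866: ', ByteLength866(u));
end.
```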

Dynamic code page

If a string is assigned to a plain AnsiString (= AnsiString(CP_ACP)), the string data will generally first be converted to DefaultSystemCodePage (if required) and then be stored in that AnsiString. The dynamic code page of that AnsiString(CP_ACP) will then be the current value of DefaultSystemCodePage (e.g. 1250 for the Windows-1250 code page), even though its static code page is CP_ACP (which is a constant <> 1250). This is one example of how the static code page can differ from the dynamic code page. Subsequent sections will describe more such scenarios.
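
The difference can be observed with StringCodePage, which reports the dynamic code page. The following is a sketch only: MatchesSystemCodePage is a hypothetical helper, and the printed numbers depend on the platform and its settings.

```pascal
program DynamicCodePage;
{$mode objfpc}{$H+}

{$ifdef unix}
uses
  cwstring; // assumed to be required on Unix for code page conversions
{$endif}

// Hypothetical helper: true if the dynamic code page of s equals the
// current interpretation of CP_ACP.
function MatchesSystemCodePage(const s: RawByteString): Boolean;
begin
  Result := StringCodePage(s) = DefaultSystemCodePage;
end;

var
  u: UTF8String;
  a: AnsiString; // static code page CP_ACP
begin
  u := 'some text';
  a := u; // converted to DefaultSystemCodePage if required
  WriteLn('DefaultSystemCodePage:   ', DefaultSystemCodePage);
  WriteLn('dynamic code page of a:  ', StringCodePage(a));
  WriteLn('matches system codepage: ', MatchesSystemCodePage(a));
end.
```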

RawByteString

The RawByteString type is defined as

type
  RawByteString = type AnsiString(CP_NONE);

As mentioned earlier, the results of conversions from/to the CP_NONE codepage are undefined. As it does not make sense to define a type in the RTL whose behaviour is undefined, the behaviour of RawByteString is somewhat different than that of other AnsiString(X) types.

As a first approximation, RawByteString can be thought of as an "untyped AnsiString": assigning an AnsiString(X) to a RawByteString has exactly the same behaviour as assigning that AnsiString(X) to another AnsiString(X) variable with the same value of X: no codepage conversion or copying occurs, just the reference count is increased.

Probably less intuitive is that the same happens when a RawByteString is assigned to an AnsiString(X): no code page conversion or copying, just the reference count is increased. Note that this means that results from functions returning a RawByteString will never be converted to the destination's static code page. This is another way in which the dynamic code page of an AnsiString(X) can become different from its static code page.

This type is mainly used to declare const, constref and value parameters that accept any AnsiString(X) value without converting it to a predefined static code page. Note that if you do this, the routine itself should be able to handle strings with any possible dynamic code page.

var and out parameters can also be declared as RawByteString, but in this case the compiler will give an error if an AnsiString(X) whose static codepage is different from CP_NONE is passed in. You can add an explicit RawByteString() typecast around an argument to remove this error.
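
The sketch below illustrates both directions: a const RawByteString parameter that accepts any AnsiString(X) unchanged, and a RawByteString function result that is assigned without conversion. Describe and PassThrough are hypothetical routines written for this example.

```pascal
program RawByteDemo;
{$mode objfpc}{$H+}

// Accepts any AnsiString(X) without conversion, so the body has to cope
// with whatever dynamic code page it receives.
procedure Describe(const s: RawByteString);
begin
  WriteLn(Length(s), ' byte(s), dynamic code page ', StringCodePage(s));
end;

// Returns its argument as a RawByteString. Assigning this result to an
// AnsiString(X) does not convert it to X.
function PassThrough(const s: RawByteString): RawByteString;
begin
  Result := s;
end;

var
  u: UTF8String;
  a: AnsiString;
begin
  u := 'text';
  Describe(u);
  a := PassThrough(u); // no conversion: a keeps the dynamic code page of u
  Describe(a);
end.
```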

String concatenations

Normally, in Pascal the result type of an expression is independent of how its result is used afterwards. E.g. multiplying two longints on a 32 bit platform and assigning the result to an int64 will still perform the multiplication using 32 bit arithmetic, and only afterwards is the result converted to 64 bit.

Codepage-aware strings are the only exception to this rule: after concatenating two or more strings, the resulting string will be converted to the static code page of the destination. If the strings to be concatenated have differing code pages, the concatenation itself is performed in a way that does not result in data loss. Data loss may still occur when the result is converted to the destination code page, of course.

Assigning the result of a concatenation to a RawByteString is again special:

  • if all concatenated strings have the same code page, the result will have this code page too
  • in other cases the result will be converted to CP_ACP (we may add an option in the future to change this RawByteString behaviour, as it is not very practical).
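
A small sketch of these rules follows. The types CP866String and CP1252String and the routines Join and Report are declared only for this example, and the program merely prints the dynamic code pages so the behaviour can be checked on a given compiler version.

```pascal
program ConcatDemo;
{$mode objfpc}{$H+}

{$ifdef unix}
uses
  cwstring; // assumed to be required on Unix for code page conversions
{$endif}

type
  CP866String  = type AnsiString(866);
  CP1252String = type AnsiString(1252);

// The destination (the function result) has static code page UTF-8.
function Join(const a: CP866String; const b: CP1252String): UTF8String;
begin
  Result := a + b;
end;

procedure Report(const title: string; const s: RawByteString);
begin
  WriteLn(title, ': dynamic code page ', StringCodePage(s));
end;

var
  a: CP866String;
  b: CP1252String;
  same, mixed: RawByteString;
begin
  a := 'abc';
  b := 'def';
  Report('UTF8String destination', Join(a, b));
  same := a + a;  // all operands share one code page
  Report('RawByteString, same code pages', same);
  mixed := a + b; // operands differ, see the second rule above
  Report('RawByteString, mixed code pages', mixed);
end.
```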

String indexing

String indexing is unchanged. Every string element of a UnicodeString/WideString is two bytes and every string element of all other strings is one byte. The string indexing mechanism completely ignores codepages and composite code points.
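
To make this concrete, the sketch below stores U+00E9 followed by U+1F600 as UTF-8 bytes and decodes them to UTF-16. FromBytes is a hypothetical helper that builds the byte string without relying on the source file encoding; UTF8Decode is assumed to be the System unit routine that returns a UnicodeString.

```pascal
program IndexDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: builds a string from raw byte values, without
// any code page conversion.
function FromBytes(const bytes: array of Byte): RawByteString;
var
  i: SizeInt;
begin
  Result := '';
  SetLength(Result, Length(bytes));
  for i := 0 to High(bytes) do
    Result[i + 1] := AnsiChar(bytes[i]);
end;

var
  utf8: RawByteString;
  utf16: UnicodeString;
begin
  // U+00E9 (2 bytes in UTF-8) followed by U+1F600 (4 bytes in UTF-8).
  utf8 := FromBytes([$C3, $A9, $F0, $9F, $98, $80]);
  utf16 := UTF8Decode(utf8);
  WriteLn('Length(utf8):  ', Length(utf8));  // counts bytes
  WriteLn('Length(utf16): ', Length(utf16)); // counts 16-bit elements
  WriteLn('utf8[1] is byte ', Ord(utf8[1])); // first byte, not first character
end.
```

Neither length equals the number of characters a reader would count, which is why code that walks a string by index has to handle multi-element sequences itself.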

Older random notes about FPC 2.7.1 (trunk) status

Runtime Libraries

There will be a Unicode RTL and an ANSI/legacy compatibility RTL. See [1]

Current support via merged cpstrrtl branch

There is some support for Unicode in the RTL in current FPC trunk.

From http://www.mail-archive.com/fpc-devel@lists.freepascal.org/msg29827.html

  • merged cpstrrtl branch (includes unicode branch). In general, this adds support for arbitrarily encoded ansistrings to many routines related to file system access (and some others).

WARNING: while the parameters of many routines have been changed from "ansistring" to "rawbytestring" to avoid data loss due to conversions, this is not a panacea. If you pass a string concatenation to such a parameter and not all strings in this concatenation have the same code page, all strings and the result will be converted to DefaultSystemCodePage (= ansi code page by default). In particular, concatenating e.g. a Utf8String with a constant string and passing the result to a RawByteString parameter will convert the result into the DefaultSystemCodePage (unless the source code is compiled with {$modeswitch systemcodepage} or {$mode delphiunicode} *and* the ansi code page on the system you are compiling *on* happens to be UTF-8).

You can define and use alternative routines that explicitly accept Utf8String parameters to avoid this pitfall. Internally, all of these routines ensure that they never trigger this condition and ensure that no unnecessary/unwanted code page conversions occur.
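
A sketch of such an alternative routine follows. OpenByName and OpenByNameUtf8 are hypothetical names that stand in for an RTL routine with a RawByteString parameter and a user-defined UTF-8 wrapper; they only print the code page they receive.

```pascal
program RawParamPitfall;
{$mode objfpc}{$H+}

{$ifdef unix}
uses
  cwstring; // assumed to be required on Unix for code page conversions
{$endif}

// Hypothetical stand-in for a routine with a RawByteString parameter.
procedure OpenByName(const fileName: RawByteString);
begin
  WriteLn('dynamic code page received: ', StringCodePage(fileName));
end;

// Hypothetical wrapper: its parameter has static code page UTF-8, so a
// concatenation passed to it is converted to UTF-8 rather than to
// DefaultSystemCodePage.
procedure OpenByNameUtf8(const fileName: UTF8String);
begin
  OpenByName(fileName); // UTF8String to RawByteString: no conversion
end;

var
  dir: UTF8String;
begin
  dir := 'data/';
  OpenByName(dir + 'file.txt');     // may be converted to DefaultSystemCodePage
  OpenByNameUtf8(dir + 'file.txt'); // stays UTF-8
end.
```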

  • DefaultFileSystemCodePage variable that holds the code page used for communicating with the OS single byte file system APIs, and for the strings returned by those same APIs. Initialized with
    • the result of GetACP in the system unit of Windows platforms, except for WinCE which uses UTF-8 since its file system OS API calls already use the UTF-16 versions
    • CP_UTF8 on Unix platforms with FPCRTL_FILESYSTEM_UTF8 defined, and with DefaultSystemCodePage on other Unix platforms
    • DefaultSystemCodePage on Java/Android JVM targets
  • DefaultRTLFileSystemCodePage variable that holds the code page used to encode strings returned by RTL routines that return filenames obtained from OS API calls. By default the same as DefaultFileSystemCodePage on all platforms. Separate from DefaultFileSystemCodePage for clarity on platforms that may use either utf-16 or single byte OS API calls to send/receive file names (such as most Windows platforms)
  • new scpFileSystemSingleByte enum that can be passed to GetStandardCodePage() to get the default code page for OS single byte file system APIs, with implementations for Unix and Windows
  • SetMultiByteFileSystemCodePage() procedure to override the value of DefaultFileSystemCodePage
  • ToSingleByteFileSystemEncodedFileName() function to convert a string to DefaultFileSystemCodePage (does *not* take care of OS-specific quirks like Darwin always returning file names in decomposed UTF-8)
  • support for CP_OEMCP
  • textrec/filerec now store the filename by default using widechar. It is possible to switch back to ansichars using the FPC_ANSI_TEXTFILEREC define. In that case, from now on the filename will always be stored in DefaultFileSystemEncoding
  • fixed potential buffer overflows and non-null-terminated file names in textrec/filerec
  • when concatenating ansistrings, do not map CP_NONE (rawbytestring) to CP_ACP (defaultsystemcodepage), because if all input strings have the same code page then the result should also have that code page if it's assigned to a rawbytestring rather than getting defaultsystemcodepage
  • do not consider empty strings to determine the code page of the result in fpc_AnsiStr_Concat_multi(), because that will cause a different result than when using a sequence of fpc_AnsiStr_Concat() calls (it ignores empty strings to determine the result code page) and it's also slower
  • do not consider the run time code page of the destination string in fpc_AnsiStr_Concat(_multi)() because Delphi does not do so either. This was introduced in r19118, probably to hide another bug + test
    • never change the code page of a non-empty string when calling setlength on it
  • handle the fact that GetEnvironmentStringsA returns the environment in the OEM instead of in the Ansi code page (mantis #22524, #15233)
  • don't truncate environment variable strings in GetEnvironmentString(), its result is now ansistring/unicodestring depending on whether the RTL was compiled with FPC_RTL_UNICODE
  • unix:
    • made the ansistring parameters of the fp*() file system routine overloads constant, changed them to rawbytestring and added DefaultFileSystemCodePage conversions
    • unicodestring support for POpen(), and DefaultFileSystemCodePage support for POpen(RawByteString)
  • DefaultFileSystemCodePage support for dynlibs unit
  • rawbytestring/unicodestring overloads for:
    • system: fexpand, lowercase, uppercase, getdir, mkdir, chdir, rmdir, assign, erase, rename
    • objpas: AssignFile
    • sysutils: FileCreate, FileOpen, FileExists, DirectoryExists, FileSetDate, FileGetAttr, FileSetAttr, DeleteFile, RenameFile, FileSearch, ExeSearch, FindFirst, FindNext, FindClose, FileIsReadOnly, GetCurrentDir, SetCurrentDir, ChangeFileExt, ExtractFilePath, ExtractFileDrive, ExtractFileName, ExtractFileExt, ExtractFileDir, ExtractShortPathName, ExpandFileName, ExpandFileNameCase, ExpandUNCFileName, ExtractRelativepath, IncludeTrailingPathDelimiter, IncludeTrailingBackslash, ExcludeTrailingBackslash, ExcludeTrailingPathDelimiter, IncludeLeadingPathDelimiter, ExcludeLeadingPathDelimiter, IsPathDelimiter, DoDirSeparators, SetDirSeparators, GetDirs, ConcatPaths, GetEnvironmentVariable
      • the default string type used by FindFirst/Next depends on whether the RTL was compiled with FPC_RTL_UNICODE. To force the RawByteString version pass a TRawByteSearchRec, for the UnicodeString version pass a TUnicodeSearchRec.
  • paramstr(longint):unicodestring available for {$modeswitch unicodestrings}
  • pwidechar versions in sysutils of strecopy, strend, strcat, strcomp, strlcomp, stricomp, strlcat, strrscan, strlower, strupper, strlicomp, strpos, WideStrAlloc, StrBufSize, StrDispose + tests
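
As an example of selecting one of the new overloads, the sketch below forces the UnicodeString version of FindFirst/FindNext by passing a TUnicodeSearchRec, as described in the list above. CountEntries is a hypothetical helper; the overload signatures are assumed to match those of the merged branch.

```pascal
program ListDir;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}
  cwstring, // assumed to be required on Unix for code page conversions
  {$endif}
  SysUtils;

// Hypothetical helper: counts the directory entries that match a mask,
// using the UnicodeString overloads.
function CountEntries(const mask: UnicodeString): Integer;
var
  sr: TUnicodeSearchRec; // selects the UnicodeString version
begin
  Result := 0;
  if FindFirst(mask, faAnyFile, sr) = 0 then
  begin
    repeat
      Inc(Result);
    until FindNext(sr) <> 0;
    FindClose(sr);
  end;
end;

begin
  WriteLn('Entries in current directory: ', CountEntries('*'));
end.
```

Declaring the record as TRawByteSearchRec and the mask as RawByteString would select the single byte version instead.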

Other libraries

The string architecture for FCL etc libraries has not yet been decided. See [2].

Old/obsolete sections


Warning: These sections are kept for historical reference - please update the sections above with this information if it is still applicable. Since FPC 2.7 (current development version), extensive Unicode support has been implemented.

Tiburon Unicode support

Currently we have some information about Tiburon's Unicode support implementation.

http://blogs.codegear.com/abauer/2008/01/09/38845

http://blogs.codegear.com/abauer/2008/07/16/38864

FPC Unicode support

FPC must have the following string types with transparent conversion between them (like current AnsiString <-> WideString conversion):

  • shortstring
  • ansistring
  • widestring
  • utf8string
  • utf16string
  • utf32string
  • ucs2string (?)
  • ucs4string (?)

Development and further maintenance of these string types must be as simple as possible. New string types must be easily added in future if needed.

The compiler uses a generic structure and helper routines to handle all refcounted string types.

String header:

type
  TRefStringRec = packed record
    Encoding: word;    // encoding of string
    ElementSize: byte; // size in bytes of string's element (1-4)
    Ref: SizeInt;      // number of references
    Len: SizeInt;      // number of elements in string
  end;

Helper routines will know how to handle string from its header.

An extra parameter with string type information is passed to some routines (like fpc_RefString_SetLength) so that new strings can be initialized properly.

widestring type on Windows targets remains non-refcounted and OLE compatible. Minimal number of helper routines is used for it. On non-Windows targets widestring is alias to utf16string.

The compiler uses helpers for string type conversions like this:

procedure fpc_ansistring_to_utf16string(out dst: utf16string; const src: ansistring);
procedure fpc_utf32string_to_utf16string(out dst: utf16string; const src: utf32string);

The compiler generates helper procedure name from type names. The compiler does not perform any string conversion handling by itself.

Status of Unicode support in FPC so far

Currently FPC 2.3.x has a new type called UnicodeString. It is similar to the WideString type, the difference being that UnicodeString is reference counted on all platforms.

All implementation work is currently done in a separate svn branch: http://svn.freepascal.org/svn/fpc/branches/cpstrnew

User visible changes

Full support of code page aware strings is not possible without breaking some existing code. The following list tries to summarize the most important user visible changes.

  • The string header has two new fields: encoding and element size. On 32 Bit platforms this increases the header size by 4 and on 64 bit platforms by 8 bytes.
  • WideCharLenToString, UnicodeCharLenToString, WideCharToString, UnicodeCharToString and OleStrToString now return a UnicodeString instead of an AnsiString.
  • the type of the dest parameter of WideCharLenToString and UnicodeCharLenToString has been changed from Ansistring to Unicodestring
  • UTF8ToAnsi and AnsiToUTF8 take a RawByteString now

Roadmap of RTL Unicode support with UnicodeString

Topic | Status | Comments | Assigned To
Locale Variables | Not implemented | Variables are all 1 byte in size and can't hold UnicodeChar size values. E.g. the Russian thousand separator is a no-break space $00A0, which doesn't fit in the ThousandSeparator (standard Char type) variable. |
TStrings | Not implemented | There is no UnicodeString version of TStrings |
TStringList | Not implemented | There is no UnicodeString version of TStringList |
Pos() | Working | |

Roadmap of RTL Unicode support with UTF8String

Topic | Status | Comments | Assigned To
UTF8String | Not implemented | Needs a real implementation. Is currently just an alias for ansistring. |
TStrings | Not implemented | There is no UTF8String version of TStrings |
TStringList | Not implemented | There is no UTF8String version of TStringList |

See Also