Difference between revisions of "Character and string types"

m
(Ansi code page value)
(32 intermediate revisions by 10 users not shown)
Line 3: Line 3:
 
Free Pascal supports several '''[[Char|character]] and [[String|string]] types'''. They range from single ANSI characters to unicode strings and also include pointer types. Differences also apply to encodings and reference counting.
 
Free Pascal supports several '''[[Char|character]] and [[String|string]] types'''. They range from single ANSI characters to unicode strings and also include pointer types. Differences also apply to encodings and reference counting.
  
== AnsiChar ==
+
__TOC__
A variable of type '''AnsiChar''', also referred to as '''char''', is exactly 1 byte in size, and contains one ANSI character.
+
 
 +
== Character types ==
 +
=== AnsiChar ===
 +
 
 +
A variable of type '''AnsiChar''', also referred to as '''[[Char|char]]''', is exactly 1 byte in size, and contains one "ANSI" (local code page) character.
  
 
{| class="wikitable" style="text-align:center; width:25px"
 
{| class="wikitable" style="text-align:center; width:25px"
Line 12: Line 16:
  
 
==== Reference ====
 
==== Reference ====
* [http://www.freepascal.org/docs-html/ref/refsu7.html FPC AnsiChar documentation]
+
 
 +
* [http://www.freepascal.org/docs-html/ref/refsu6.html FPC AnsiChar documentation]
 
* [[Char|Usage Char]]
 
* [[Char|Usage Char]]
 +
* [[Wikipedia:Windows code page#ANSI code page]]
  
== WideChar ==
+
=== WideChar ===
  
A variable of type '''WideChar''', also referred to as '''UnicodeChar''', is exactly 2 bytes in size, and contains one (part of) [[LCL Unicode Support|Unicode]] character in UTF-16 encoding.
+
A variable of type '''WideChar''', also referred to as '''UnicodeChar''', is exactly 2 bytes in size, and usually contains one [[LCL Unicode Support|Unicode]] code point (normally a character) in UTF-16 encoding.
 
Note: it is impossible to encode all Unicode code points in 2 bytes. Therefore, 2 WideChars may be needed to encode a single code point.
 
Note: it is impossible to encode all Unicode code points in 2 bytes. Therefore, 2 WideChars may be needed to encode a single code point.
  
Line 26: Line 32:
  
 
==== References ====
 
==== References ====
* [http://www.freepascal.org/docs-html/ref/refsu8.html FPC WideChar documentation]
+
 
 +
* [http://www.freepascal.org/docs-html/ref/refsu7.html FPC WideChar documentation]
 
* [https://en.wikipedia.org/wiki/UTF-16 UTF-16 information on Wikipedia]
 
* [https://en.wikipedia.org/wiki/UTF-16 UTF-16 information on Wikipedia]
 
* [[doc:rtl/system/unicodechar.html|RTL UnicodeChar documentation]]
 
* [[doc:rtl/system/unicodechar.html|RTL UnicodeChar documentation]]
  
== Array of Char ==
+
== Character-derived types ==
  
Early Pascal implementations that were in use before 1978 didn't support a string type (with the exception of string constants). The only possibility to store strings in variables was the use of arrays of char. This approach has many disadvantages and is no longer recommended. It is, however, still supported to ensure backward-compatibility with ancient code.
+
=== Array of Char ===
 +
 
 +
Early Pascal implementations that were in use before 1978 did not support a string type (with the exception of string constants). The only possibility to store strings in variables was the use of arrays of char. This approach has many disadvantages and is no longer recommended. It is, however, still supported to ensure backward-compatibility with ancient code.
  
 
===Static Array of Char===
 
===Static Array of Char===
  
<syntaxhighlight>
+
<syntaxhighlight lang="pascal">
 
type
 
type
 
   TOldString4 = array[0..3] of char;
 
   TOldString4 = array[0..3] of char;
Line 47: Line 56:
 
   aOldString4[3] := 'd';
 
   aOldString4[3] := 'd';
 
end;
 
end;
</syntaxhighlight> The static array of char has now the content:
+
</syntaxhighlight>  
 +
 
 +
The static array of char has now the content:
  
 
{| class="wikitable" style="text-align:center; width:100px"
 
{| class="wikitable" style="text-align:center; width:100px"
Line 58: Line 69:
 
===Dynamic Array of Char===
 
===Dynamic Array of Char===
  
<syntaxhighlight>
+
<syntaxhighlight lang="pascal">
 
var
 
var
 
   aOldString: Array of Char;  
 
   aOldString: Array of Char;  
Line 68: Line 79:
 
   aOldString[3] := 'd';
 
   aOldString[3] := 'd';
 
end;
 
end;
</syntaxhighlight> The dynamic array of char has now the content:
+
</syntaxhighlight>  
 +
 
 +
The dynamic array of char has now the content:
  
 
{| class="wikitable" style="text-align:center; width:100px"
 
{| class="wikitable" style="text-align:center; width:100px"
Line 77: Line 90:
 
{{Note|Unassigned chars in dynamic arrays have a content #0, cause empty positions of all dynamic arrays are initially initialised with 0 (or #0, or nil, or ...)}}
 
{{Note|Unassigned chars in dynamic arrays have a content #0, cause empty positions of all dynamic arrays are initially initialised with 0 (or #0, or nil, or ...)}}
  
== PChar ==
+
=== PChar ===
  
A variable of type '''PChar''' is basically a pointer to a '''[[Character_and_string_types#AnsiChar|Char]]''' type, but allows additional operations. PChars can be used to access C-style [http://en.wikipedia.org/wiki/Null-terminated_string null-terminated strings], e.g. in interaction with certain OS libraries or third-party software.
+
A variable of type '''[[PChar]]''' is basically a pointer to a '''Char''' type, but allows additional operations. PChars can be used to access C-style [http://en.wikipedia.org/wiki/Null-terminated_string null-terminated strings], e.g. in interaction with certain OS libraries or third-party software.
  
 
{| class="wikitable" style="text-align:center; width:100px"
 
{| class="wikitable" style="text-align:center; width:100px"
Line 89: Line 102:
  
 
==== Reference ====
 
==== Reference ====
* [http://www.freepascal.org/docs-html/ref/refsu16.html FPC PChar documentation]
+
 
 +
* [http://www.freepascal.org/docs-html/ref/refsu12.html FPC PChar documentation]
 
* [[doc:rtl/sysutils/pcharfunctions.html|PChar related functions]]
 
* [[doc:rtl/sysutils/pcharfunctions.html|PChar related functions]]
  
 
+
=== PWideChar ===
== PWideChar ==
 
  
 
A variable of type '''PWideChar''' is a pointer to a [[#WideChar|WideChar]] variable.
 
A variable of type '''PWideChar''' is a pointer to a [[#WideChar|WideChar]] variable.
Line 105: Line 118:
  
 
==== Reference ====
 
==== Reference ====
 +
 
* [[doc:rtl/system/pwidechar.html|RTL PWideChar documentation]]
 
* [[doc:rtl/system/pwidechar.html|RTL PWideChar documentation]]
  
== String ==
+
== String types ==
  
 +
=== String ===
 
The type '''String''' may refer to '''[[#ShortString|ShortString]]''' or '''[[#AnsiString|AnsiString]]''', depending on the [http://www.freepascal.org/docs-html/prog/progsu25.html#x32-310001.2.25 {$H} switch]. If the switch is off ({$H-}) then any string declaration will define a '''ShortString'''. Its size will be 255 chars, if not otherwise specified. If it is on ({$H+}) '''string''' without length specifier will define an '''AnsiString''', otherwise a '''ShortString''' with specified length.
 
The type '''String''' may refer to '''[[#ShortString|ShortString]]''' or '''[[#AnsiString|AnsiString]]''', depending on the [http://www.freepascal.org/docs-html/prog/progsu25.html#x32-310001.2.25 {$H} switch]. If the switch is off ({$H-}) then any string declaration will define a '''ShortString'''. Its size will be 255 chars, if not otherwise specified. If it is on ({$H+}) '''string''' without length specifier will define an '''AnsiString''', otherwise a '''ShortString''' with specified length.
 
In '''mode delphiunicode''' '''String''' is '''[[#UnicodeString|UnicodeString]]'''.
 
In '''mode delphiunicode''' '''String''' is '''[[#UnicodeString|UnicodeString]]'''.
  
 
==== Reference ====
 
==== Reference ====
* [[String|Usage String]]
+
* [[String|Usage String]].
 
* [[doc:rtl/sysutils/stringfunctions.html|String functions]]
 
* [[doc:rtl/sysutils/stringfunctions.html|String functions]]
 
* [[doc:rtl/strutils/index-5.html|Reference for unit 'strutils': Procedures and functions]]
 
* [[doc:rtl/strutils/index-5.html|Reference for unit 'strutils': Procedures and functions]]
  
== ShortString ==
+
=== ShortString ===
  
Short strings have a maximum length of 255 characters with the implicit [[FPC Unicode support#Codepages|codepage]] CP_ACP. The length is stored in the character at index 0.
+
Short strings have a maximum length of 255 characters with the implicit [[FPC Unicode support#Codepages|codepage]] CP_ACP. The length is stored in the character at index 0. A short string of 255 characters uses 256 bytes of memory (one byte for the length specification and 255 bytes for characters).
  
 
{| class="wikitable" style="text-align:center; width:100px"
 
{| class="wikitable" style="text-align:center; width:100px"
Line 127: Line 142:
  
 
==== Reference ====
 
==== Reference ====
* [[doc:ref/refsu12.html|FPC AnsiString documentation]]
 
  
== AnsiString ==
+
* [https://www.freepascal.org/docs-html/ref/refsu9.html FPC Single-byte String types documentation]
 +
 
 +
=== AnsiString ===
  
 
Ansistrings are strings that have no length limit. They are [http://en.wikipedia.org/wiki/Reference_counting reference counted] and are guaranteed to be [http://en.wikipedia.org/wiki/Null-terminated_string null terminated]. Internally, a variable of type '''AnsiString''' is treated as a pointer: the actual content of the string is stored on the heap, as much memory as needed to store the string content is allocated.
 
Ansistrings are strings that have no length limit. They are [http://en.wikipedia.org/wiki/Reference_counting reference counted] and are guaranteed to be [http://en.wikipedia.org/wiki/Null-terminated_string null terminated]. Internally, a variable of type '''AnsiString''' is treated as a pointer: the actual content of the string is stored on the heap, as much memory as needed to store the string content is allocated.
Line 139: Line 155:
 
| colspan="4" style="width: 16%;" | RefCount || colspan="4" style="width: 16%;" | Length
 
| colspan="4" style="width: 16%;" | RefCount || colspan="4" style="width: 16%;" | Length
 
|}
 
|}
 +
 +
An AnsiString type may also have a compile-time code page since FPC 2.7.1; a missing value defaults to <tt>DefaultSystemCodePage</tt>. A value of <tt>CP_NONE</tt> results in <tt>'''RawBytestring'''</tt> and a value of <tt>CP_UTF8</tt> results in <tt>'''UTF8String'''</tt>.
  
 
==== Reference ====
 
==== Reference ====
* [http://www.freepascal.org/docs-html/ref/refsu12.html FPC AnsiString documentation]
 
  
== UnicodeString ==
+
* [https://www.freepascal.org/docs-html/ref/refsu9.html FPC Single-byte String types documentation]
 +
 
 +
=== UnicodeString ===
  
 
Like '''AnsiStrings''', '''UnicodeStrings''' are reference counted, null-terminated arrays, but they are implemented as arrays of '''[[#WideChar|WideChars]]''' instead of regular '''[[#Char|Chars]]'''.
 
Like '''AnsiStrings''', '''UnicodeStrings''' are reference counted, null-terminated arrays, but they are implemented as arrays of '''[[#WideChar|WideChars]]''' instead of regular '''[[#Char|Chars]]'''.
Line 157: Line 176:
  
 
==== Reference ====
 
==== Reference ====
* [http://www.freepascal.org/docs-html/ref/refsu13.html FPC UnicodeString documentation]
 
  
== UTF8String ==
+
* [http://www.freepascal.org/docs-html/ref/refsu10.html FPC Multi-byte String types documentation]
  
Currently, the type '''UTF8String''' is an alias to the type '''[[#AnsiString|AnsiString]]'''. It is meant to contain UTF8 encoded strings (i.e. unicode data) ranging from 1..4 bytes per character.
+
=== UTF8String ===
 +
 
 +
In FPC 2.6.5 and below the type '''UTF8String''' was an alias to the type '''[[#AnsiString|AnsiString]]'''. In FPC 2.7.1 and above it is defined as <syntaxhighlight lang="pascal" inline>UTF8String = type AnsiString(CP_UTF8);</syntaxhighlight>
 +
 
 +
It is meant to contain UTF-8 encoded strings (i.e. unicode data) ranging from 1..4 bytes per character.
 +
Note that '''String''' can also contain UTF-8 encoded characters.
  
 
==== Reference ====
 
==== Reference ====
* [[doc:rtl/system/utf8string.html|RTL UTF8String documentation]]
 
  
== UTF16String ==
+
* [https://www.freepascal.org/docs-html/ref/refsu9.html#x32-400003.2.4 FPC UTF8String documentation]
 +
 
 +
=== UTF16String ===
  
 
The type '''UTF16String''' is an alias to the type '''[[#WideString|WideString]]'''. In the LCL unit ''lclproc'' it is an alias to '''[[#UnicodeString|UnicodeString]]'''.
 
The type '''UTF16String''' is an alias to the type '''[[#WideString|WideString]]'''. In the LCL unit ''lclproc'' it is an alias to '''[[#UnicodeString|UnicodeString]]'''.
  
 
==== Reference ====
 
==== Reference ====
 +
 
* [[doc:lcl/lclproc/utf16string.html|LCL UTF16String documentation]]
 
* [[doc:lcl/lclproc/utf16string.html|LCL UTF16String documentation]]
  
== WideString ==
+
=== WideString ===
  
 
Variables of type '''[[Widestrings|WideString]]''' (used to represent unicode character strings in COM applications) resemble those of type '''UnicodeString''', but unlike them they are not reference-counted. On Windows they are allocated with a special windows function which allows them to be used for OLE automation.
 
Variables of type '''[[Widestrings|WideString]]''' (used to represent unicode character strings in COM applications) resemble those of type '''UnicodeString''', but unlike them they are not reference-counted. On Windows they are allocated with a special windows function which allows them to be used for OLE automation.
Line 187: Line 212:
  
 
==== Reference ====
 
==== Reference ====
* [http://www.freepascal.org/docs-html/ref/refsu14.html#x37-400003.2.8 FPC WideString documentation]
 
  
== PShortString ==
+
* [http://www.freepascal.org/docs-html/ref/refsu10.html FPC Multi-byte String types documentation]
 +
 
 +
== String-derived types ==
 +
 
 +
=== PShortString ===
  
 
A variable of type '''PShortString''' is a pointer that points to the first byte of a '''[[#ShortString|ShortString]]'''-type variable (which defines the length of the ShortString).
 
A variable of type '''PShortString''' is a pointer that points to the first byte of a '''[[#ShortString|ShortString]]'''-type variable (which defines the length of the ShortString).
Line 201: Line 229:
  
 
==== Reference ====
 
==== Reference ====
 +
 
* [[doc:rtl/system/pshortstring.html|RTL PShortString documentation]]
 
* [[doc:rtl/system/pshortstring.html|RTL PShortString documentation]]
  
== PAnsiString ==
+
=== PAnsiString ===
  
 
Variables of type '''PAnsiString''' are pointers to '''[[#AnsiString|AnsiString]]'''-type variables. However, unlike '''PShortString'''-type variables they don't point to the first byte of the header, but to the first '''char''' of the '''AnsiString'''.
 
Variables of type '''PAnsiString''' are pointers to '''[[#AnsiString|AnsiString]]'''-type variables. However, unlike '''PShortString'''-type variables they don't point to the first byte of the header, but to the first '''char''' of the '''AnsiString'''.
Line 215: Line 244:
  
 
==== Reference ====
 
==== Reference ====
 +
 
* [[doc:rtl/system/pansistring.html|RTL PAnsiString documentation]]
 
* [[doc:rtl/system/pansistring.html|RTL PAnsiString documentation]]
  
== PUnicodeString ==
+
=== PUnicodeString ===
  
 
Variables of type '''PUnicodeString''' are pointers to variables of type '''[[#UnicodeString|UnicodeString]]'''.
 
Variables of type '''PUnicodeString''' are pointers to variables of type '''[[#UnicodeString|UnicodeString]]'''.
Line 231: Line 261:
 
* [[doc:rtl/system/punicodestring.html|RTL PUnicodeString documentation]]
 
* [[doc:rtl/system/punicodestring.html|RTL PUnicodeString documentation]]
  
== PWideString ==
+
=== PWideString ===
  
 
Variables of type '''PWideString''' are pointers. They point to the first char of a '''[[#WideString|WideString]]'''-typed variable.
 
Variables of type '''PWideString''' are pointers. They point to the first char of a '''[[#WideString|WideString]]'''-typed variable.
Line 243: Line 273:
  
 
==== Reference ====
 
==== Reference ====
 +
 
* [[doc:rtl/system/pwidestring.html|RTL PWideString documentation]]
 
* [[doc:rtl/system/pwidestring.html|RTL PWideString documentation]]
  
 
== String constants ==
 
== String constants ==
  
If you use only English constants your strings work the same with all types, on all platforms and all compiler versions. Non English strings can be loaded via resourcestrings or from files. If you want to use non English strings in code then you should read further.
+
If you use only English (ASCII) constants your strings work the same with all types, on all platforms and all compiler versions. Non English strings can be loaded via resourcestrings or from files. If you want to use non English strings in code then you should read further.
  
 
There are various encodings for non English strings. By default Lazarus saves Pascal files as '''UTF-8 without BOM'''. UTF-8 supports the full Unicode range. That means all string constants are stored in UTF-8. Lazarus also supports to change the encoding of a file to other encoding, for example under Windows your local codepage. The Windows codepage is limited to your current language group.
 
There are various encodings for non English strings. By default Lazarus saves Pascal files as '''UTF-8 without BOM'''. UTF-8 supports the full Unicode range. That means all string constants are stored in UTF-8. Lazarus also supports to change the encoding of a file to other encoding, for example under Windows your local codepage. The Windows codepage is limited to your current language group.
  
 
{| class="wikitable sortable"
 
{| class="wikitable sortable"
! String Type, UTF-8 Source !! With or without {$codepage utf8} !! FPC 2.6.5 and below !! FPC 2.7.1 and above !! FPC 2.7.1+ with UTF8 as default CodePage
+
! String Type, UTF-8 Source !! With <code>{$codepage utf8}</code>? !! FPC &le; 2.6.5 !! FPC &ge; 2.7.1 !! FPC &ge; 2.7.1, [[Unicode_Support_in_Lazarus#RTL_with_default_codepage_UTF-8|UTF8 as default CodePage]]
 
|----
 
|----
|AnAnsiString:='ãü';||Without||Needs UTF8ToAnsi in RTL/WinAPI. Ok in LCL||Needs UTF8ToAnsi in RTL/WinAPI. Ok in LCL||Ok in RTL/W-WinAPI/LCL. Needs UTF8ToWinCP in A-WinAPI.
+
|<tt>AnAnsiString:='ãü';||No||class="partial"|Needs UTF8ToAnsi in RTL/WinAPI. Ok in LCL||class="partial"|Needs UTF8ToAnsi in RTL/WinAPI. Ok in LCL||class="working"|Ok in RTL/W-WinAPI/LCL. Needs UTF8ToWinCP in A-WinAPI.
 
|----
 
|----
|AnAnsiString:='ãü';||With||System cp ok in RTL/WinAPI. Needs SysToUTF8 in LCL||Ok in RTL/WinAPI/LCL. Mixing with other strings converts to system cp||Ok in RTL/W-WinAPI/LCL. Needs UTF8ToWinCP in A-WinAPI
+
|<tt>AnAnsiString:='ãü';||Yes||class="partial"|System cp ok in RTL/WinAPI. Needs SysToUTF8 in LCL||class="working"|Ok in RTL/WinAPI/LCL. Mixing causes conversion.||class="working"|Ok in RTL/W-WinAPI/LCL. Needs UTF8ToWinCP in A-WinAPI
 
|----
 
|----
|AnUnicodeString:='ãü';||Without||Wrong everywhere||Wrong everywhere||Wrong everywhere
+
|<tt>AnUnicodeString:='ãü';||No||class="not"|Wrong everywhere||class="not"|Wrong everywhere||class="not"|Wrong everywhere
 
|----
 
|----
|AnUnicodeString:='ãü';||With||System cp ok in RTL/WinAPI. Needs UTF8Encode in LCL||Ok in RTL/WinAPI/LCL. Mixing with other strings converts to system cp ||Ok in RTL/W-WinAPI/LCL. Needs UTF8ToWinCP in A-WinAPI
+
|<tt>AnUnicodeString:='ãü';||Yes||class="partial"|System cp ok in RTL/WinAPI. Needs UTF8Encode in LCL||class="working"|Ok in RTL/WinAPI/LCL. Mixing causes conversion.||class="working"|Ok in RTL/W-WinAPI/LCL. Needs UTF8ToWinCP in A-WinAPI
 
|----
 
|----
|AnUTF8String:='ãü';||Without||Same as AnsiString||Wrong everywhere||Wrong everywhere
+
|<tt>AnUTF8String:='ãü';||No||class="partial"|Same as AnsiString||class="not"|Wrong everywhere||class="not"|Wrong everywhere
 
|----
 
|----
|AnUTF8String:='ãü';||With||Same as AnsiString||Ok in RTL/WinAPI/LCL. Mixing with other strings converts to system cp ||Ok in RTL/W-WinAPI/LCL. Needs UTF8ToWinCP in A-WinAPI.
+
|<tt>AnUTF8String:='ãü';||Yes||class="partial"|Same as AnsiString||class="working"|Ok in RTL/WinAPI/LCL. Mixing causes conversion. ||class="working"|Ok in RTL/W-WinAPI/LCL. Needs UTF8ToWinCP in A-WinAPI.
 
|}
 
|}
  
*W-WinAPI = Windows API "W" functions, UTF-16
+
* W-WinAPI = Windows API "W" functions, UTF-16
*A-WinAPI = Windows API non "W" functions, 8bit system code page
+
* A-WinAPI = Windows API non "W" functions, 8bit system ("ANSI") code page
*System CP = The 8bit system code page of the OS. For example code page [http://en.wikipedia.org/wiki/Windows-1252 1252].
+
* System CP = The 8bit system code page of the OS. For example code page [http://en.wikipedia.org/wiki/Windows-1252 1252].
  
<syntaxhighlight>
+
<syntaxhighlight lang="pascal">
 
const  
 
const  
 
   c='ãü';
 
   c='ãü';
Line 285: Line 316:
 
end;
 
end;
 
</syntaxhighlight>
 
</syntaxhighlight>
 +
 +
The rules for conversion are laid out in a [https://freepascal.org/docs-html/ref/refsu9.html#x32-380003.2.4 "Code page conversions"] section in the FPC manual. The basic point is that assigning an AnsiString (including the <tt>CP_UTF8</tt> specialization) to another AnsiString converts what is in the source to match the code page of the target string. A quirk for compatibility with presumably fpc &le; 2.6.5 is that no such conversion will be done if one matches the source code CP and the other matches the system CP. In this case, forced (likely incorrect) interpretation as the target code page will occur.
  
 
== See also ==
 
== See also ==
 +
 
* [[FPC Unicode support]]
 
* [[FPC Unicode support]]
 
* [[LCL Unicode Support]]
 
* [[LCL Unicode Support]]
 
* [[TStringList-TStrings Tutorial]]
 
* [[TStringList-TStrings Tutorial]]
 
+
* [http://web.archive.org/web/20151125114234/http://www.codexterity.com/delphistrings.htm A Brief History of Strings]
[[Category: FPC]]
 
[[Category: RTL]]
 
[[Category: Data types]]
 
[[Category: Unicode]]
 

Revision as of 12:10, 31 March 2020


Free Pascal supports several character and string types. They range from single ANSI characters to Unicode strings and also include pointer types. The types also differ in their encodings and in whether they are reference counted.

Character types

AnsiChar

A variable of type AnsiChar, also referred to as char, is exactly 1 byte in size, and contains one "ANSI" (local code page) character.

a

Reference

WideChar

A variable of type WideChar, also referred to as UnicodeChar, is exactly 2 bytes in size, and usually contains one Unicode code point (normally a character) in UTF-16 encoding. Note: it is impossible to encode all Unicode code points in 2 bytes. Therefore, 2 WideChars may be needed to encode a single code point.

a
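The note above can be illustrated with a minimal sketch. It assumes a code point outside the Basic Multilingual Plane (U+1F600 is chosen here purely as an example) and stores its two UTF-16 code units by hand; the program and variable names are illustrative only.

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}
var
  u: UnicodeString;
begin
  // U+1F600 does not fit into a single 16-bit code unit, so UTF-16
  // represents it as a surrogate pair: $D83D (high) followed by $DE00 (low).
  SetLength(u, 2);
  u[1] := WideChar($D83D);
  u[2] := WideChar($DE00);

  WriteLn(SizeOf(WideChar)); // 2: one WideChar is always 2 bytes
  WriteLn(Length(u));        // 2: two WideChars for a single code point
end.
```

Length counts WideChar elements, not code points, so code that indexes such a string character by character needs to be aware of surrogate pairs.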

References

Character-derived types

Array of Char

Early Pascal implementations that were in use before 1978 did not support a string type (with the exception of string constants). The only possibility to store strings in variables was the use of arrays of char. This approach has many disadvantages and is no longer recommended. It is, however, still supported to ensure backward-compatibility with ancient code.

Static Array of Char

type
  TOldString4 = array[0..3] of char;
var
  aOldString4: TOldString4; 
begin
  aOldString4[0] := 'a';
  aOldString4[1] := 'b';
  aOldString4[2] := 'c';
  aOldString4[3] := 'd';
end;

The static array of char now has the content:

a b c d
Note: Unassigned chars can have any content, depending on what was in memory when the memory for the array was made available.

Dynamic Array of Char

var
  aOldString: Array of Char; 
begin
  SetLength(aOldString, 5);
  aOldString[0] := 'a';
  aOldString[1] := 'b';
  aOldString[2] := 'c';
  aOldString[3] := 'd';
end;

The dynamic array of char now has the content:

a b c d #0
Note: Unassigned chars in dynamic arrays have the content #0, because the empty positions of all dynamic arrays are initialised with 0 (or #0, or nil, depending on the element type).

PChar

A variable of type PChar is basically a pointer to a Char type, but allows additional operations. PChars can be used to access C-style null-terminated strings, e.g. in interaction with certain OS libraries or third-party software.

a b c #0
^
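A minimal sketch of the layout shown above, assuming an AnsiString as the backing storage (AnsiStrings are null terminated, so the cast to PChar yields a valid C-style string). StrLen is taken from the SysUtils unit; the program and variable names are illustrative.

```pascal
program PCharDemo;
{$mode objfpc}{$H+}
uses
  SysUtils; // StrLen
var
  s: AnsiString;
  p: PChar;
begin
  s := 'abc';
  p := PChar(s);        // points to the first char of s; no copy is made

  WriteLn(StrLen(p));   // 3: chars before the terminating #0
  WriteLn(p[1]);        // b: PChar indexing is zero-based
  WriteLn(Ord(p[3]));   // 0: the null terminator
end.
```

The pointer is only valid as long as s is alive and unmodified, which is the usual caveat when handing a PChar to external libraries.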

Reference

PWideChar

A variable of type PWideChar is a pointer to a WideChar variable.

a b c #0 #0
^

Reference

String types

String

The type String may refer to ShortString or AnsiString, depending on the {$H} switch. If the switch is off ({$H-}), any string declaration defines a ShortString; its size is 255 chars unless otherwise specified. If the switch is on ({$H+}), string without a length specifier defines an AnsiString, while string with a length specifier still defines a ShortString of that length. In mode delphiunicode, String is UnicodeString.
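The effect of the switch can be sketched as follows. The example relies on {$H} being a local directive that may be changed between declarations; the type names TShort, TLong and TFixed are made up for illustration.

```pascal
program HSwitchDemo;
{$mode objfpc}

{$H-}
type
  TShort = string;      // ShortString with the default length of 255

{$H+}
type
  TLong  = string;      // AnsiString (a pointer to heap data)
  TFixed = string[10];  // still a ShortString, with maximum length 10

begin
  WriteLn(SizeOf(TShort));                  // 256: length byte + 255 chars
  WriteLn(SizeOf(TFixed));                  // 11: length byte + 10 chars
  WriteLn(SizeOf(TLong) = SizeOf(Pointer)); // TRUE
end.
```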

Reference

ShortString

Short strings have a maximum length of 255 characters with the implicit codepage CP_ACP. The length is stored in the character at index 0. A short string of 255 characters uses 256 bytes of memory (one byte for the length specification and 255 bytes for characters).

#3 a b c
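A short sketch of the layout shown above: the length byte can be read through index 0, and the variable always occupies the full 256 bytes regardless of its current content. The program and variable names are illustrative.

```pascal
program ShortStringDemo;
var
  s: ShortString;
begin
  s := 'abc';
  WriteLn(Ord(s[0]));  // 3: the length is stored in the char at index 0
  WriteLn(Length(s));  // 3: the portable way to obtain the same value
  WriteLn(SizeOf(s));  // 256: one length byte plus 255 chars
end.
```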

Reference

AnsiString

AnsiStrings are strings that have no length limit. They are reference counted and are guaranteed to be null terminated. Internally, a variable of type AnsiString is treated as a pointer: the actual content of the string is stored on the heap, and only as much memory as is needed to store the content is allocated.

a b c #0
RefCount Length

An AnsiString type may also have a compile-time code page since FPC 2.7.1; a missing value defaults to DefaultSystemCodePage. A value of CP_NONE results in RawByteString and a value of CP_UTF8 results in UTF8String.
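The declaration syntax can be sketched as below, assuming a compiler version with code page support (FPC 3.0 or later, where the System unit provides StringCodePage and the CP_UTF8 constant). TCP1252String is a hypothetical type name; the values printed for the individual variables depend on the compiler version and platform, so none are stated here.

```pascal
program CodePageDemo;
{$mode objfpc}{$H+}
type
  // Hypothetical type name: an AnsiString whose declared code page is 1252.
  TCP1252String = type AnsiString(1252);
var
  a: AnsiString;     // no code page given: defaults to DefaultSystemCodePage
  w: TCP1252String;
  u: UTF8String;     // declared by the RTL as AnsiString(CP_UTF8)
begin
  a := 'abc';
  w := 'abc';
  u := 'abc';

  // StringCodePage reports the code page stored with the string data.
  WriteLn('AnsiString:    ', StringCodePage(a));
  WriteLn('TCP1252String: ', StringCodePage(w));
  WriteLn('UTF8String:    ', StringCodePage(u));
  WriteLn('CP_UTF8:       ', CP_UTF8);
end.
```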

Reference

UnicodeString

Like AnsiStrings, UnicodeStrings are reference counted, null-terminated arrays, but they are implemented as arrays of WideChars instead of regular Chars.

Note: The UnicodeString naming is a bit ambiguous, probably due to its use in Delphi on Windows, where the OS uses UTF-16 encoding; it is not the only string type that can hold Unicode string data (see also UTF8String).
a b c #0 #0
RefCount Length

Reference

UTF8String

In FPC 2.6.5 and below the type UTF8String was an alias to the type AnsiString. In FPC 2.7.1 and above it is defined as UTF8String = type AnsiString(CP_UTF8);

It is meant to contain UTF-8 encoded strings (i.e. Unicode data), in which each character takes 1 to 4 bytes. Note that String can also contain UTF-8 encoded characters.
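Because a character may take several bytes, Length on a UTF8String counts bytes rather than characters. The following sketch assumes the character U+00E4 and writes its two UTF-8 bytes explicitly, so that the result does not depend on the source file encoding; UTF8Decode from the System unit is used to convert to UTF-16 for comparison.

```pascal
program Utf8LengthDemo;
{$mode objfpc}{$H+}
var
  s: UTF8String;
begin
  // U+00E4 is encoded in UTF-8 as the two bytes $C3 $A4.
  SetLength(s, 2);
  s[1] := #$C3;
  s[2] := #$A4;

  WriteLn(Length(s));             // 2: Length counts bytes
  WriteLn(Length(UTF8Decode(s))); // 1: a single WideChar after decoding
end.
```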

Reference

UTF16String

The type UTF16String is an alias to the type WideString. In the LCL unit lclproc it is an alias to UnicodeString.

Reference

WideString

Variables of type WideString (used to represent Unicode character strings in COM applications) resemble those of type UnicodeString, but unlike them they are not reference counted. On Windows they are allocated with a special Windows function which allows them to be used for OLE automation.

WideStrings consist of COM compatible UTF16 encoded bytes on Windows machines (UCS2 on Windows 2000), and they are encoded as plain UTF16 on Linux, Mac OS X and iOS.

a b c #0 #0
Length

Reference

String-derived types

PShortString

A variable of type PShortString is a pointer that points to the first byte of a ShortString-type variable (which defines the length of the ShortString).

#3 a b c
^

Reference

PAnsiString

Variables of type PAnsiString are pointers to AnsiString-type variables. However, unlike PShortString-type variables they don't point to the first byte of the header, but to the first char of the AnsiString.

a b c #0
RefCount Length ^

Reference

PUnicodeString

Variables of type PUnicodeString are pointers to variables of type UnicodeString.

a b c #0 #0
RefCount Length ^

Reference

PWideString

Variables of type PWideString are pointers. They point to the first char of a WideString-typed variable.

a b c #0 #0
Length ^

Reference

String constants

If you use only English (ASCII) constants, your strings work the same with all types, on all platforms and with all compiler versions. Non-English strings can be loaded via resourcestrings or from files. If you want to use non-English strings in code, read on.

There are various encodings for non-English strings. By default Lazarus saves Pascal files as UTF-8 without BOM. UTF-8 supports the full Unicode range, so all string constants are stored in UTF-8. Lazarus can also change the encoding of a file to another encoding, for example your local code page under Windows. The Windows code page is limited to your current language group.

String Type, UTF-8 Source | With {$codepage utf8}? | FPC ≤ 2.6.5 | FPC ≥ 2.7.1 | FPC ≥ 2.7.1, UTF8 as default CodePage
AnAnsiString:='ãü'; | No | Needs UTF8ToAnsi in RTL/WinAPI. Ok in LCL | Needs UTF8ToAnsi in RTL/WinAPI. Ok in LCL | Ok in RTL/W-WinAPI/LCL. Needs UTF8ToWinCP in A-WinAPI.
AnAnsiString:='ãü'; | Yes | System cp ok in RTL/WinAPI. Needs SysToUTF8 in LCL | Ok in RTL/WinAPI/LCL. Mixing causes conversion. | Ok in RTL/W-WinAPI/LCL. Needs UTF8ToWinCP in A-WinAPI.
AnUnicodeString:='ãü'; | No | Wrong everywhere | Wrong everywhere | Wrong everywhere
AnUnicodeString:='ãü'; | Yes | System cp ok in RTL/WinAPI. Needs UTF8Encode in LCL | Ok in RTL/WinAPI/LCL. Mixing causes conversion. | Ok in RTL/W-WinAPI/LCL. Needs UTF8ToWinCP in A-WinAPI.
AnUTF8String:='ãü'; | No | Same as AnsiString | Wrong everywhere | Wrong everywhere
AnUTF8String:='ãü'; | Yes | Same as AnsiString | Ok in RTL/WinAPI/LCL. Mixing causes conversion. | Ok in RTL/W-WinAPI/LCL. Needs UTF8ToWinCP in A-WinAPI.
  • W-WinAPI = Windows API "W" functions, UTF-16
  • A-WinAPI = Windows API non "W" functions, 8bit system ("ANSI") code page
  • System CP = The 8bit system code page of the OS. For example code page 1252.
const 
  c='ãü';
  cstring: string = 'ãü'; // see AnAnsiString:='ãü';
var
  s: string;
  u: UnicodeString;
begin
  s:=c; // same as s:='ãü';
  s:=cstring; // does not change encoding
  u:=c; // same as u:='ãü';
  u:=cstring; // fpc 2.6.1: converts from system cp to UTF-16, fpc 2.7.1+: depends on encoding of cstring
end;

The rules for conversion are laid out in a "Code page conversions" section in the FPC manual. The basic point is that assigning an AnsiString (including the CP_UTF8 specialization) to another AnsiString converts what is in the source to match the code page of the target string. A quirk for compatibility with presumably fpc ≤ 2.6.5 is that no such conversion will be done if one matches the source code CP and the other matches the system CP. In this case, forced (likely incorrect) interpretation as the target code page will occur.

See also