Character and string types


Revision as of 13:45, 24 November 2014


Free Pascal supports several types of characters and strings. They range from single ANSI characters to Unicode strings and also include pointer types. The types also differ in encoding and in whether they are reference counted.

AnsiChar

A variable of type AnsiChar, also referred to as char, is exactly 1 byte in size, and contains one ANSI character.

a

Reference

WideChar

A variable of type WideChar, also referred to as UnicodeChar, is exactly 2 bytes in size and contains one UTF-16 code unit, that is, either a whole Unicode character or one half of a surrogate pair. Note: it is impossible to encode all Unicode code points in 2 bytes. Therefore, 2 WideChars may be needed to encode a single code point.

a
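The following is a minimal sketch of the surrogate-pair case described above. The program name and the choice of U+1D11E (musical symbol G clef) are only illustrative assumptions; the two code units are written numerically so that the result does not depend on the encoding of the source file.

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}

var
  u: UnicodeString;
begin
  // U+1D11E lies outside the Basic Multilingual Plane, so UTF-16
  // stores it as a surrogate pair: high unit $D834, low unit $DD1E.
  SetLength(u, 2);
  u[1] := WideChar($D834);
  u[2] := WideChar($DD1E);

  WriteLn(SizeOf(WideChar)); // 2 bytes per WideChar
  WriteLn(Length(u));        // 2 WideChars for a single code point
end.
```

Length counts WideChars (code units), not code points, which is why it reports 2 here.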

References

Array of Char

Early Pascal implementations that were in use before 1978 didn't support a string type (with the exception of string constants). The only possibility to store strings in variables was the use of arrays of char. This approach has many disadvantages and is no longer recommended. It is, however, still supported to ensure backward-compatibility with ancient code.

Static Array of Char

type
  TOldString4 = array[0..3] of char;
var
  aOldString4: TOldString4; 
begin
  aOldString4[0] := 'a';
  aOldString4[1] := 'b';
  aOldString4[2] := 'c';
  aOldString4[3] := 'd';
end;

The static array of char now has the content:

a b c d
Note: Unassigned chars can have any content, depending on what happened to be in memory when the memory for the array was made available.

Dynamic Array of Char

var
  aOldString: Array of Char; 
begin
  SetLength(aOldString, 5);
  aOldString[0] := 'a';
  aOldString[1] := 'b';
  aOldString[2] := 'c';
  aOldString[3] := 'd';
end;

The dynamic array of char now has the content:

a b c d #0
Note: Unassigned chars in dynamic arrays have the content #0, because the elements of all dynamic arrays are initialised with 0 (or #0, or nil, or ...) when they are allocated.
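The zero-initialisation of dynamic arrays can be checked with a short program. This is only a sketch; the program and variable names are made up for the illustration.

```pascal
program DynArrayInitDemo;
{$mode objfpc}

var
  a: array of Char;
  i: Integer;
begin
  SetLength(a, 5);      // allocates 5 elements, all set to #0
  a[0] := 'a';
  a[1] := 'b';

  // Print the ordinal value of every element.
  for i := 0 to High(a) do
    Write(Ord(a[i]), ' ');
  WriteLn;              // 97 98 0 0 0
end.
```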

PChar

A variable of type PChar is basically a pointer to a Char type, but allows additional operations. PChars can be used to access C-style null-terminated strings, e.g. in interaction with certain OS libraries or third-party software.

a b c #0
^
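As a rough sketch of the additional operations mentioned above, the program below obtains a PChar from an AnsiString, measures it with StrLen from the SysUtils unit, and then moves the pointer along the characters. The names are illustrative only.

```pascal
program PCharDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

var
  s: AnsiString;
  p: PChar;
begin
  s := 'abc';
  p := PChar(s);       // points to 'a'; the data ends with #0

  WriteLn(StrLen(p));  // 3, counted up to the terminating #0
  Inc(p);              // pointer arithmetic: now points to 'b'
  WriteLn(p^);         // b
  WriteLn(p[1]);       // c, indexed relative to the current position
end.
```

The pointer is only valid while s is alive and unchanged, because it refers to the string's own buffer.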

Reference


PWideChar

A variable of type PWideChar is a pointer to a WideChar variable.

a b c #0 #0
^

Reference

String

The type String may refer to ShortString or AnsiString, depending on the {$H} switch. If the switch is off ({$H-}), any string declaration defines a ShortString. Its size is 255 chars unless a length is specified. If the switch is on ({$H+}), string without a length specifier defines an AnsiString, while string with a length specifier still defines a ShortString of that length.
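The effect of the switch can be made visible with SizeOf. The sketch below declares hypothetical type names under both settings; {$H} is a local directive, so it can be changed between declarations.

```pascal
program HSwitchDemo;

{$H-}
type
  TStrOff   = string;      // ShortString with 255 chars
  TStrOff10 = string[10];  // ShortString with 10 chars

{$H+}
type
  TStrOn    = string;      // AnsiString
  TStrOn10  = string[10];  // still a ShortString with 10 chars

begin
  WriteLn(SizeOf(TStrOff));                    // 256: length byte + 255 chars
  WriteLn(SizeOf(TStrOff10));                  // 11: length byte + 10 chars
  WriteLn(SizeOf(TStrOn) = SizeOf(Pointer));   // TRUE: only a pointer is stored
  WriteLn(SizeOf(TStrOn10));                   // 11
end.
```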

Reference

ShortString

Short strings have a maximum length of 255 characters with the implicit codepage CP_ACP. The length is stored in the character at index 0.

#3 a b c
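A small sketch of the layout shown above: the character at index 0 holds the length. The program name is an assumption made for the example.

```pascal
program ShortStringDemo;

var
  s: ShortString;
begin
  s := 'abc';
  WriteLn(Ord(s[0]));   // 3: the length byte at index 0
  WriteLn(Length(s));   // 3: the same value through the usual function
  WriteLn(SizeOf(s));   // 256: the storage is fixed regardless of content
end.
```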

Reference

AnsiString

AnsiStrings are strings that have no length limit. They are reference counted and are guaranteed to be null terminated. Internally, a variable of type AnsiString is treated as a pointer: the actual content of the string is stored on the heap, and only as much memory as is needed to store the string content is allocated.

a b c #0
RefCount Length
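The pointer semantics and the null terminator can be observed indirectly, as in the following sketch. It compares the raw pointers of two variables before and after one of them is modified; the names are illustrative, and the reference count itself is not read directly.

```pascal
program AnsiStringDemo;
{$mode objfpc}{$H+}

var
  s1, s2: AnsiString;
begin
  s1 := 'abc';
  UniqueString(s1);    // make sure s1 owns a heap copy

  s2 := s1;            // no copy: both variables share one buffer
  WriteLn(Pointer(s1) = Pointer(s2));   // TRUE

  s2[1] := 'x';        // writing forces s2 to get its own copy
  WriteLn(Pointer(s1) = Pointer(s2));   // FALSE
  WriteLn(s1);         // abc, the original is untouched

  // The character after the last one is the guaranteed #0.
  WriteLn(PChar(s1)[Length(s1)] = #0);  // TRUE
end.
```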

Reference

UnicodeString

Like AnsiStrings, UnicodeStrings are reference counted, null-terminated arrays, but they are implemented as arrays of WideChars instead of regular Chars.

Note: The UnicodeString naming is a bit ambiguous, probably because of its use in Delphi on Windows, where the OS uses UTF-16 encoding; it is not the only string type that can hold Unicode string data (see also UTF8String).
a b c #0 #0
RefCount Length

Reference

UTF8String

Currently, the type UTF8String is an alias to the type AnsiString. It is meant to contain UTF-8 encoded strings (i.e. Unicode data), which use 1 to 4 bytes per character.
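Because a character may take several bytes, Length on a UTF8String counts bytes rather than characters. The sketch below stores the UTF-8 bytes of 'ãü' one by one, so that it does not depend on the source file encoding or on the {$codepage} setting, and then converts with UTF8Decode from the System unit. The program and variable names are illustrative.

```pascal
program Utf8LengthDemo;
{$mode objfpc}{$H+}

var
  s: UTF8String;
  u: UnicodeString;
begin
  SetLength(s, 4);
  s[1] := #$C3; s[2] := #$A3;   // U+00E3 (a with tilde) as 2 bytes
  s[3] := #$C3; s[4] := #$BC;   // U+00FC (u with diaeresis) as 2 bytes

  u := UTF8Decode(s);           // convert UTF-8 to UTF-16

  WriteLn(Length(s));           // 4 bytes
  WriteLn(Length(u));           // 2 WideChars, one per character here
end.
```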

Reference

UTF16String

The type UTF16String is an alias to the type WideString. In the LCL unit lclproc it is an alias to UnicodeString.

Reference

WideString

Variables of type WideString (used to represent Unicode character strings in COM applications) resemble those of type UnicodeString, but unlike them they are not reference counted. On Windows they are allocated with a special Windows function which allows them to be used for OLE automation.

WideStrings consist of COM-compatible UTF-16 encoded data on Windows machines (UCS-2 on Windows 2000), and they are encoded as plain UTF-16 on Linux, Mac OS X and iOS.

a b c #0 #0
Length

Reference

PShortString

A variable of type PShortString is a pointer that points to the first byte of a ShortString-type variable (which defines the length of the ShortString).

#3 a b c
^

Reference

PAnsiString

Variables of type PAnsiString are pointers to AnsiString-type variables. However, unlike PShortString-type variables they don't point to the first byte of the header, but to the first char of the AnsiString.

a b c #0
RefCount Length ^

Reference

PUnicodeString

Variables of type PUnicodeString are pointers to variables of type UnicodeString.

a b c #0 #0
RefCount Length ^

Reference

PWideString

Variables of type PWideString are pointers. They point to the first char of a WideString-typed variable.

a b c #0 #0
Length ^

Reference

String constants

UTF-8 encoded literals | {$codepage utf8} | FPC 2.6.5 and below | FPC 2.7.1 and above | FPC 2.7.1+ with UTF8 as default CodePage
AnAnsiString:='ãü'; | without | needs UTF8ToAnsi in RTL/WinAPI, ok in LCL | needs UTF8ToAnsi in RTL/WinAPI, ok in LCL | ok in RTL, W-WinAPI, LCL, needs UTF8ToWinCP in A-WinAPI
AnAnsiString:='ãü'; | with | system cp ok in RTL/WinAPI, needs SysToUTF8 in LCL | ok in RTL/WinAPI/LCL, mixing with other strings converts to system cp | ok in RTL, W-WinAPI, LCL, needs UTF8ToWinCP in A-WinAPI
AnUnicodeString:='ãü'; | without | wrong everywhere | wrong everywhere | wrong everywhere
AnUnicodeString:='ãü'; | with | system cp ok in RTL/WinAPI, needs UTF8Encode in LCL | ok in RTL/WinAPI/LCL, mixing with other strings converts to system cp | ok in RTL, W-WinAPI, LCL, needs UTF8ToWinCP in A-WinAPI
AnUTF8String:='ãü'; | without | same as AnsiString | wrong everywhere | wrong everywhere
AnUTF8String:='ãü'; | with | same as AnsiString | ok in RTL/WinAPI/LCL, mixing with other strings converts to system cp | ok in RTL, W-WinAPI, LCL, needs UTF8ToWinCP in A-WinAPI
  • W-WinAPI = Windows API "W" functions, UTF-16
  • A-WinAPI = Windows API non "W" functions, 8bit system code page
const 
  c='ãü';
  cstring: string = 'ãü'; // see AnAnsiString:='ãü';
var
  s: string;
  u: UnicodeString;
begin
  s:=c; // same as s:='ãü';
  s:=cstring; // does not change encoding
  u:=c; // same as u:='ãü';
  u:=cstring; // fpc 2.6.1: converts from system cp to UTF-16, fpc 2.7.1+: depends on encoding of cstring
end;

See also