Difference between revisions of "Talk:FPC Unicode support"

From Lazarus wiki
(Useless stuff see http://www.mail-archive.com/fpc-devel@lists.freepascal.org/msg30196.html)
 
 
(4 intermediate revisions by 4 users not shown)
Line 1: Line 1:
= Irrelevant stuff - looks like a wish list =
+
Given that so much behavior depends on the code page associated with a string, it would be helpful if this page mentioned
Removed this from the main page as it will only give cause for confusion.
 
Developers agree it is useless: http://www.mail-archive.com/fpc-devel@lists.freepascal.org/msg30196.html
 
  
IMO this can be deleted from this page as well.
+
* how to tell what code page the compiler (or Lazarus) has associated with a string, if that is possible
  
Thanks,
+
* how to correct the code page information, if that is possible
--[[User:BigChimp|BigChimp]] 16:08, 7 January 2014 (CET)
+
 
 +
If either of these is not possible, it would be helpful to mention that, too.
  
== FPC Unicode support ==
+
As a new user of FPC and Lazarus, I am finding it rather vexing that code to iterate over the characters of a UTF8-encoded file is insisting on iterating over the octets (or bytes, if you prefer that term) of the file, and not over the characters.  Assigning to a string declared with type UnicodeString might perhaps help, but the conversion code appears not to recognize the UTF8 in its input argument.  I would like to confirm that analysis by interrogating the string to find out what code page it thinks it is encoded in, and then if possible to fix the problem for the moment by telling it what code page it is actually encoded in -- not with a conversion routine but with a routine that corrects the erroneous assumption that this UTF8-encoded string is in an eight-bit character set.
  
FPC must have the following string types with transparent conversion between them (like current AnsiString <-> WideString conversion):
+
It does not help that many of the discussions of Unicode support refer to packages or units like lazUTF8 or LCLProc, which raise errors when I try to refer to them in a 'uses' clause, and for which in any case I cannot find documentation.  (If this suggestion sounds testy, it is because I have spent several hours searching without success for a way to make a simple function to calculate Levenshtein distance work in Free Pascal.)
  
* shortstring
+
[[User:Cmsmcq|Cmsmcq]] ([[User talk:Cmsmcq|talk]]) 02:13, 14 June 2020 (CEST)
* ansistring
+
:StringCodePage(SomeString) tells you what the compiler thinks is the codepage.
* widestring
+
:SetCodePage(var s : RawByteString; CodePage : TSystemCodePage; Convert : Boolean = True) is the procedure you are looking for to "correct" the codepage. Set Convert to False, otherwise the compiler will try to convert the string to the new codepage. --[[User:Bart|Bart]] ([[User talk:Bart|talk]]) 22:08, 14 June 2020 (CEST)
* utf8string
 
* utf16string
 
* utf32string
 
* ucs2string (?)
 
* ucs4string (?)
 
  
'''Development and further maintenance of these string types must be as simple as possible.''' New string types must be easily added in future if needed.
+
== Strange phrase in part 'dynamic cp' ==
  
Compiler uses generic structure and helper routines to handle all refcounted string types.
+
<pre>
 +
Dynamic code page
  
String header:
+
If a string with a declared code page SOURCE_CP is assigned to a string with declared code page DEST_CP , then
  
<syntaxhighlight>
+
    if (SOURCE_CP = CP_NONE) or (DEST_CP = CP_NONE), see RawByteString, otherwise
type
+
</pre>
  TRefStringRec = packed record
+
This phrase is incomplete, as I understand it... --[[User:Alextpp|Alextpp]] ([[User talk:Alextpp|talk]]) 17:00, 11 April 2022 (CEST)
    Encoding: word;    // encoding of string
+
:No, it continues in the next bullet point. This seemed clearer than putting everything in one giant sentence. Maybe it could be clarified by putting '...' at the end and also '...' at the start of the next bullet point?
    ElementSize: byte; // size in bytes of string's element (1-4)
+
:[[User:Jonas|Jonas]] ([[User talk:Jonas|talk]])
    Ref: SizeInt;      // number of references
 
    Len: SizeInt;      // number of elements in string
 
  end;
 
</syntaxhighlight>
 
 
 
Helper routines will know how to handle a string from its header.
 
 
 
An extra parameter with string type information is passed to some routines (like fpc_RefString_SetLength) so that new strings can be initialized properly.
 
 
 
The widestring type on Windows targets remains non-refcounted and OLE compatible. A minimal number of helper routines is used for it. On non-Windows targets widestring is an alias for utf16string.
 
 
 
The compiler uses helpers for string type conversions like this:
 
<syntaxhighlight>
 
procedure fpc_ansistring_to_utf16string(out dst: utf16string; const src: ansistring);
 
procedure fpc_utf32string_to_utf16string(out dst: utf16string; const src: utf32string);
 
</syntaxhighlight>
 
 
 
The compiler generates the helper procedure name from the type names. The compiler does not perform any string conversion handling by itself.
 

Latest revision as of 22:00, 11 April 2022

Given that so much behavior depends on the code page associated with a string, it would be helpful if this page mentioned

  • how to tell what code page the compiler (or Lazarus) has associated with a string, if that is possible
  • how to correct the code page information, if that is possible

If either of these is not possible, it would be helpful to mention that, too.

As a new user of FPC and Lazarus, I am finding it rather vexing that code to iterate over the characters of a UTF8-encoded file is insisting on iterating over the octets (or bytes, if you prefer that term) of the file, and not over the characters. Assigning to a string declared with type UnicodeString might perhaps help, but the conversion code appears not to recognize the UTF8 in its input argument. I would like to confirm that analysis by interrogating the string to find out what code page it thinks it is encoded in, and then if possible to fix the problem for the moment by telling it what code page it is actually encoded in -- not with a conversion routine but with a routine that corrects the erroneous assumption that this UTF8-encoded string is in an eight-bit character set.

It does not help that many of the discussions of Unicode support refer to packages or units like lazUTF8 or LCLProc, which raise errors when I try to refer to them in a 'uses' clause, and for which in any case I cannot find documentation. (If this suggestion sounds testy, it is because I have spent several hours searching without success for a way to make a simple function to calculate Levenshtein distance work in Free Pascal.)

Cmsmcq (talk) 02:13, 14 June 2020 (CEST)

StringCodePage(SomeString) tells you what the compiler thinks is the codepage.
SetCodePage(var s : RawByteString; CodePage : TSystemCodePage; Convert : Boolean = True) is the procedure you are looking for to "correct" the codepage. Set Convert to False, otherwise the compiler will try to convert the string to the new codepage. --Bart (talk) 22:08, 14 June 2020 (CEST)
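
A minimal sketch of how the two routines above might be used together, assuming FPC 3.0 or later (where AnsiStrings carry a code page). The program name and the variable names Raw and Wide are made up for illustration; the UTF-8 bytes are built by hand so that the example does not depend on the encoding of the source file. UTF8Decode is used for the final step as one possible way to get from the bytes to a UnicodeString; it is not the only one.

<syntaxhighlight lang="pascal">
program CodePageSketch;

{$mode objfpc}{$H+}

var
  Raw: RawByteString;   // hypothetical names, for illustration only
  Wide: UnicodeString;
begin
  // Build the UTF-8 bytes of U+00E9 by hand, so the result does not
  // depend on the encoding of this source file.
  SetLength(Raw, 2);
  Raw[1] := #$C3;
  Raw[2] := #$A9;

  // Ask which code page is currently recorded for the string.
  WriteLn('Code page before: ', StringCodePage(Raw));

  // Relabel the bytes as UTF-8. Convert = False leaves the bytes
  // untouched and only changes the recorded code page.
  SetCodePage(Raw, CP_UTF8, False);
  WriteLn('Code page after:  ', StringCodePage(Raw));

  // Decode to UTF-16; Length then counts UTF-16 code units, not bytes.
  Wide := UTF8Decode(Raw);
  WriteLn('Bytes: ', Length(Raw), ', UTF-16 code units: ', Length(Wide));
end.
</syntaxhighlight>

Note that even after decoding, indexing a UnicodeString walks UTF-16 code units, so characters outside the Basic Multilingual Plane still occupy two elements.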

Strange phrase in part 'dynamic cp'

Dynamic code page

If a string with a declared code page SOURCE_CP is assigned to a string with declared code page DEST_CP , then

    if (SOURCE_CP = CP_NONE) or (DEST_CP = CP_NONE), see RawByteString, otherwise

This phrase in incomplete, as I got it... --Alextpp (talk) 17:00, 11 April 2022 (CEST)

No, it continues in the next bullet point. This seemed clearer than putting everything one giant sentence. Maybe it could be clarified by putting '...' at the end and also '...' at the start of the next bullet point?
Jonas (talk)