Talk:FPC Unicode support
Given that so much behavior depends on the code page associated with a string, it would be helpful if this page mentioned:
- how to tell what code page the compiler (or Lazarus) has associated with a string, if that is possible
- how to correct the code page information, if that is possible
If either of these is not possible, it would be helpful to mention that, too.
As a new user of FPC and Lazarus, I am finding it rather vexing that code meant to iterate over the characters of a UTF-8-encoded file insists on iterating over the octets (or bytes, if you prefer that term) of the file instead. Assigning to a string declared with type UnicodeString might perhaps help, but the conversion code appears not to recognize the UTF-8 in its input argument. I would like to confirm that analysis by interrogating the string to find out what code page it thinks it is encoded in, and then if possible to fix the problem for the moment by telling it what code page it is actually encoded in -- not with a conversion routine but with a routine that corrects the erroneous assumption that this UTF-8-encoded string is in an eight-bit character set.
It does not help that many of the discussions of Unicode support refer to packages or units like lazUTF8 or LCLProc, which raise errors when I try to refer to them in a 'uses' clause, and for which in any case I cannot find documentation. (If this suggestion sounds testy, it is because I have spent several hours searching without success for a way to make a simple function to calculate Levenshtein distance work in Free Pascal.)
- StringCodePage(SomeString) tells you what the compiler thinks is the codepage.
- SetCodePage(var s : RawByteString; CodePage : TSystemCodePage; Convert : Boolean = True) is the procedure you are looking for to "correct" the codepage. Set Convert to False; otherwise the compiler will try to convert the string to the new codepage. --Bart (talk) 22:08, 14 June 2020 (CEST)
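
A minimal sketch of how the two routines mentioned above might be used together, assuming FPC 3.0 or later. The program name, the variable names and the hard-coded bytes are hypothetical stand-ins for data read from a UTF-8 file. The final step (UTF8Decode into a UnicodeString) goes beyond what was asked and is only one possible way to iterate over characters rather than bytes.

```pascal
program CodePageDemo;
{$mode objfpc}{$H+}

var
  Raw: RawByteString;
  U: UnicodeString;
  i: Integer;
begin
  // Hypothetical input: the two UTF-8 bytes of U+00E9 followed by 'a',
  // standing in for bytes read from a file.
  SetLength(Raw, 3);
  Raw[1] := #$C3;
  Raw[2] := #$A9;
  Raw[3] := 'a';

  // Ask which code page is currently recorded for the string.
  WriteLn('Code page before: ', StringCodePage(Raw));

  // Relabel the string as UTF-8. Convert = False leaves the bytes untouched.
  SetCodePage(Raw, CP_UTF8, False);
  WriteLn('Code page after:  ', StringCodePage(Raw));

  // Decode to UTF-16 so that indexing no longer walks over single bytes.
  U := UTF8Decode(Raw);
  WriteLn('Bytes: ', Length(Raw), ', UTF-16 code units: ', Length(U));

  for i := 1 to Length(U) do
    WriteLn(i, ': U+', HexStr(Ord(U[i]), 4));
end.
```

Note that a UnicodeString holds UTF-16 code units, so characters outside the Basic Multilingual Plane still occupy two elements (a surrogate pair). A loop like the one above is only a rough approximation of "one element per character".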