Difference between revisions of "Talk:FPC Unicode support"

(Removed irrelevant stuff. It's still in the history in case somebody really needs it...)
(suggest some additional information on the page)

Revision as of 02:13, 14 June 2020

Given that so much behavior depends on the code page associated with a string, it would be helpful if this page mentioned

  • how to tell what code page the compiler (or Lazarus) has associated with a string, if that is possible
  • how to correct the code page information, if that is possible

If either of these is not possible, it would be helpful to mention that, too.
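Both operations appear to be available in the system unit of FPC 3.0 and later. The following is only a minimal sketch, assuming `StringCodePage` to read the code page attached to a string and `SetCodePage` with `Convert` set to `False` to change the label without touching the bytes. The program name and the sample bytes are made up for illustration.

```pascal
program CodePageDemo;
{$mode objfpc}{$H+}

var
  Raw: RawByteString;
begin
  // Build the two UTF-8 bytes of U+00E9 by hand, so the result does not
  // depend on the encoding of this source file.
  SetLength(Raw, 2);
  Raw[1] := #$C3;
  Raw[2] := #$A9;

  // Report the code page the RTL currently associates with the string.
  // The value printed here depends on the platform and RTL settings.
  Writeln('Code page before: ', StringCodePage(Raw));

  // Relabel the string as UTF-8. With Convert = False only the code page
  // label changes; the bytes themselves are left as they are.
  SetCodePage(Raw, CP_UTF8, False);
  Writeln('Code page after:  ', StringCodePage(Raw));
end.
```

After the relabelling call, the second line should report 65001, the value of `CP_UTF8`. Passing `True` (the default) for the last argument would instead convert the bytes from the old code page to the new one, which is not what is wanted when the bytes are already UTF-8.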

As a new user of FPC and Lazarus, I am finding it rather vexing that code meant to iterate over the characters of a UTF-8-encoded file insists on iterating over the octets (or bytes, if you prefer that term) of the file, and not over the characters. Assigning to a string declared with type UnicodeString might perhaps help, but the conversion code appears not to recognize the UTF-8 in its input argument. I would like to confirm that analysis by interrogating the string to find out what code page it thinks it is encoded in, and then, if possible, to fix the problem for the moment by telling it what code page it is actually encoded in: not with a conversion routine, but with a routine that corrects the erroneous assumption that this UTF-8-encoded string is in an eight-bit character set.
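One possible way to iterate over characters rather than octets is to decode the bytes explicitly with `UTF8Decode`, which treats its argument as UTF-8 whatever code page is attached to it, and then walk the resulting `UnicodeString`. This is a sketch under that assumption, not a statement of how the page's authors would do it. The sample bytes are illustrative, and the loop visits UTF-16 code units, so a character outside the Basic Multilingual Plane would show up as two surrogate code units.

```pascal
program IterateDemo;
{$mode objfpc}{$H+}

var
  Raw: RawByteString;
  U: UnicodeString;
  I: SizeInt;
begin
  // Three bytes: 'a' followed by the UTF-8 encoding of U+00E9.
  SetLength(Raw, 3);
  Raw[1] := 'a';
  Raw[2] := #$C3;
  Raw[3] := #$A9;

  // Decode explicitly as UTF-8 instead of relying on the string's label.
  U := UTF8Decode(Raw);

  // Length(Raw) counts bytes (3); Length(U) counts UTF-16 code units (2).
  Writeln('Bytes: ', Length(Raw), ', code units: ', Length(U));

  // Visit each UTF-16 code unit and print its value in hexadecimal.
  for I := 1 to Length(U) do
    Writeln(I, ': U+', HexStr(LongInt(Ord(U[I])), 4));
end.
```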

It does not help that many of the discussions of Unicode support refer to packages or units like LazUTF8 or LCLProc, which raise errors when I try to refer to them in a 'uses' clause, and for which in any case I cannot find documentation. (If this suggestion sounds testy, it is because I have spent several hours searching without success for a way to make a simple function to calculate Levenshtein distance work in Free Pascal.)
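For what it is worth, a Levenshtein function can be sketched with the RTL alone, without LazUTF8 or LCLProc, by working on `UnicodeString` values. The names `Min3` and `Levenshtein` below are hypothetical helpers, not part of any FPC or Lazarus unit. The comparison is per UTF-16 code unit, so surrogate pairs and combining sequences are not treated as single characters; UTF-8 input would first need decoding, for example with `UTF8Decode`.

```pascal
program LevenshteinDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: smallest of three values.
function Min3(A, B, C: SizeInt): SizeInt;
begin
  Result := A;
  if B < Result then Result := B;
  if C < Result then Result := C;
end;

// Hypothetical helper: edit distance between A and B, counted in
// UTF-16 code units, using two rows of the dynamic-programming table.
function Levenshtein(const A, B: UnicodeString): SizeInt;
var
  Prev, Curr, Tmp: array of SizeInt;
  I, J, Cost: SizeInt;
begin
  SetLength(Prev, Length(B) + 1);
  SetLength(Curr, Length(B) + 1);
  // Distance from the empty prefix of A to each prefix of B.
  for J := 0 to Length(B) do
    Prev[J] := J;
  for I := 1 to Length(A) do
  begin
    Curr[0] := I;
    for J := 1 to Length(B) do
    begin
      if A[I] = B[J] then Cost := 0 else Cost := 1;
      Curr[J] := Min3(Prev[J] + 1,         // deletion
                      Curr[J - 1] + 1,     // insertion
                      Prev[J - 1] + Cost); // substitution or match
    end;
    // Swap the rows so that Prev holds the row just computed.
    Tmp := Prev;
    Prev := Curr;
    Curr := Tmp;
  end;
  Result := Prev[Length(B)];
end;

begin
  Writeln(Levenshtein('kitten', 'sitting')); // prints 3
end.
```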

Cmsmcq (talk) 02:13, 14 June 2020 (CEST)