not Delphi compatible enhancement for Unicode Support

Introduction

I feel that there are several shortcomings with the Delphi-style Code Aware Strings that can be overcome quite easily with fpc, and in a rather compatible way.

I myself don't have access to Delphi XE, so my impressions of it are derived from talks with my colleagues who use it.


Analysis

With Delphi XE, Embarcadero decided that the one-byte-per-character String type that uses an (ANSI-compatible) encoding restricted to a certain set of languages (usually defined by the system locale) is not versatile enough. So they introduced a “code-page aware” string type that allows each string to be held in a different encoding. A broad set of encoding variants is offered, including one-byte ANSI code pages, and Unicode in 1-, 2- and 4-byte variants. In principle each string variable in a project can be defined to use a different encoding. Hence the type “String” now comes in several “brands” (which in a variable or type definition can be denoted by a code-page number in brackets). When the brackets are left out, the default is 2-byte Unicode, which makes a lot of sense on Windows, which internally uses this encoding.
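
As an illustration, a minimal sketch of such brand declarations (the type name CP1251String is made up here, but the AnsiString(<code page>) declaration syntax is the one fpc and Delphi actually provide for one-byte brands):

 program BrandDemo;
 {$mode delphi}
 
 type
   // a one-byte brand: the number in brackets selects the code page
   CP1251String = type AnsiString(1251);   // Windows Cyrillic ANSI code page
 
 var
   A: CP1251String;
   U: UnicodeString;   // the default 2-byte Unicode brand ("String" in Delphi XE)
 begin
   A := 'Hello';
   U := A;             // the compiler inserts an automatic conversion CP1251 -> UTF-16
 end.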

Delphi provides automatic conversion of the string content between encodings whenever it seems necessary.

Unfortunately, Embarcadero forces a peculiar mix of static and dynamic “branding” of the string variables. Each variable gets a certain encoding brand through its type definition; on the other hand, each string's content additionally carries a (potentially) dynamically definable setting of both the count of bytes per character and the encoding style to be used to interpret the content whenever necessary. IMHO this is the cause of a major misconception. Presumably they feared being criticized for the run-time overhead that fully dynamic string encoding handling would require. Obviously, with the partly dynamic handling, the decision whether or not to do an automatic conversion when accessing a string can be made at compile time rather than at run time.
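
The dynamic part of the branding can be inspected at run time; a minimal sketch using the StringCodePage and StringElementSize functions of the System unit (available in both fpc and Delphi):

 program DynamicInfoDemo;
 {$mode delphi}
 
 var
   A: UTF8String;
 begin
   A := 'Hello';
   // every string's hidden header carries its encoding brand and element size
   WriteLn(StringCodePage(A));     // prints 65001 (CP_UTF8)
   WriteLn(StringElementSize(A));  // prints 1 (one byte per character)
 end.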

The problem

While simple assignment and compiler built-in functions such as “pos()” and “copy()” can decide for any instance what to do – i.e. whether or not to auto-convert the content based on the encoding brand of the arguments (by implicitly behaving as if the arguments were fully dynamically encoded) – non-built-in functions and properties need to be defined with a fixed string brand. Hence, when using them with differently encoded Strings, the compiler will introduce very time-consuming and loss-prone auto-conversion (unless the function itself is provided in several brands).
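
A minimal sketch of the effect (the routine name LogLine is made up; the silent conversion is the one the compiler inserts whenever the brands do not match):

 program FixedBrandDemo;
 {$mode delphi}
 
 // a non-built-in routine has to commit to one fixed brand for its parameter
 procedure LogLine(const Line: UTF8String);
 begin
   WriteLn(Line);
 end;
 
 var
   W: UnicodeString;   // UTF-16 brand
 begin
   W := 'Hello';
   LogLine(W);   // the compiler silently inserts a UTF-16 -> UTF-8 conversion here;
                 // with an ANSI-branded parameter this conversion could even lose characters
 end.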

This problem is much more urgent with fpc than with Delphi, as fpc is supposed to run on multiple OSes that might use different system-wide default encoding styles (e.g. UTF-8 on Linux and UTF-16 on Windows). While Delphi XE just creates Windows software, and projects that explicitly handle strings encoded differently from the system default (e.g. read from a file or retrieved from a network resource) are rather seldom done, fpc needs to be able to handle user software that is compilable for multiple OSes with different encoding defaults.

The most obvious candidate for pain in that respect is “TStrings”. Lots of string handling classes are derived from this basic class. Storing the strings and retrieving the content could easily – and with close to no overhead – be done in a fully dynamic way: each string carries its encoding brand anyway. But the fixed branding of the string type used in the Delphi compatible TStrings interface forces auto-conversion in all cases but one. Another example is the interface of a GUI connection library – such as Lazarus or mse: portable programs would benefit from a common interface for all OSes, but forcing any fixed encoding independent of the OS does not seem appropriate either, as permanent encoding conversion on the way between the user code and the OS could not be avoided. Of course the fpc RTL could be mentioned as well.
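
To make the TStrings case concrete (a minimal sketch; the point is that Add and the Strings property are declared with one fixed “string” brand, UTF-16 in Delphi XE, so every other brand is converted on the way in and again on the way out):

 program TStringsPainDemo;
 {$mode delphi}
 
 uses
   Classes;
 
 var
   List: TStringList;
   U8:   UTF8String;
 begin
   List := TStringList.Create;
   try
     U8 := 'Hello';
     List.Add(U8);    // converted to the fixed brand of TStrings' "string" parameter
     U8 := List[0];   // and converted back to UTF-8 on retrieval
     WriteLn(U8);
   finally
     List.Free;
   end;
 end.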

A possible solution

Thinking about this, I believe it should be possible to find a solution that of course is not the “Delphi Way” but is rather compatible both with Delphi XE and with earlier fpc and Delphi versions, so that normal user software would not have to be aware of this modification. IMHO it should be rather easy to do (on top of the Delphi compatible Code Aware String library) and not eat up any noticeable amount of CPU cycles.

Delphi just provides the encoding brand “RawByteString” as a string type for which the compiler does not force auto-conversion on assignment to/from another brand. But AFAIK, with this brand, auto-conversion is disabled for all assignments, and maybe some assignments are even forbidden. This obviously is not what we want here.
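
For reference, typical RawByteString usage looks like this (a minimal sketch; the routine name Dump is made up, the pass-through behaviour is the documented one):

 program RawByteStringDemo;
 {$mode delphi}
 
 // a RawByteString parameter accepts any one-byte brand without conversion;
 // inside the routine the string keeps its original encoding brand
 procedure Dump(const S: RawByteString);
 begin
   WriteLn(StringCodePage(S), ': ', Length(S), ' bytes');
 end;
 
 var
   U8: UTF8String;
 begin
   U8 := 'Hello';
   Dump(U8);   // no conversion; reports code page 65001
 end.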

Hence the suggestion is to introduce yet another brand of string encoding style, one that is fully dynamic and subject to auto-conversion. This might be called the “DynamicString” encoding brand and might be assigned the encoding brand number $F000 or $FF00 (to leave some room for encoding brands more similar to RawByteString, which AFAIK is $FFFE).
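
In declaration syntax the new brand could look roughly like this (purely hypothetical: the brand-number syntax compiles today, but the fully dynamic, auto-converting semantics described here would need additional compiler support):

 type
   DynamicString = type AnsiString($F000);  // hypothetical "fully dynamic" brand
 
 var
   D: DynamicString;
   A: UTF8String;
   W: UnicodeString;
 begin
   A := 'Hello';
   D := A;   // no conversion: D just takes over the UTF-8 content together with its brand
   W := D;   // conversion UTF-8 -> UTF-16, decided at run time from D's dynamic brand
 end;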

By this, all normal String handling performed by operators and compiler-built-in functions stays completely unchanged, unless we explicitly use the DynamicString brand. For “old-style” user or library functions using the traditional Delphi String brands, the calling behaviour also stays as defined by Delphi.

If one or more of the partners in an assignment is a DynamicString, the compiler needs to adhere to some simple rules and to generate code to check the dynamic encoding brand of the appropriate String(s). The rules are as follows (a run-time sketch follows after the cases):

If the target is Dynamic: Just assign the source (be it dynamic or not, or even a RawByteString).

If the target is not dynamic (but not a Raw String either), generate code to check the encoding brand of the source (this is just a single Word, so no noticeable overhead). If it is the same as the static (compile-time) brand of the target, just assign. If the encoding is different, have the appropriate conversion library function called (which would be necessary with a pure static encoding paradigm in a similar case as well).

If the target is a Raw String (static compile-time brand), generate code to check the encoding brand of the source (again just a single Word, so no noticeable overhead). If the source is a Raw String as well, just assign; if the encoding is different, behave as Delphi does (I don't know what it does when assigning something to a RawByteString).

One more case: if the source is a DynamicString and happens to hold “RAW” content, but the target is static and not RAW, behave as Delphi does when assigning Raw to any other encoding.
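
A minimal run-time sketch of these rules (hypothetical helper: AssignToStatic stands for the code the compiler would emit when the source may be a DynamicString and the target has the static brand TargetCP; CP_NONE marks a Raw String target):

 procedure AssignToStatic(var Target: RawByteString; const Source: RawByteString;
                          TargetCP: TSystemCodePage);
 begin
   if StringCodePage(Source) = TargetCP then
     Target := Source                      // same brand: plain reference assignment
   else if TargetCP = CP_NONE then
     Target := Source                      // Raw target: take the content as-is
   else
   begin
     Target := Source;                     // different brand: let the RTL convert;
     SetCodePage(Target, TargetCP, True);  // real generated code would call the
   end;                                    // appropriate conversion helper directly
 end;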

Now it's obvious how a “dynamic” version of TStrings would work.

As the type of the parameter of “Add” is dynamic, we would never need a conversion here. The string is just forwarded to the overridden Add procedure and stored there as it is, including its header that defines its encoding brand. With TStringList, there is a very slight overhead, as the size of the storage area is calculated with regard to the “bytes per character” setting: a dynamic multiplication instead of just a fixed factor of 2.
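
A rough sketch of how the interface could look (purely hypothetical, assuming the proposed DynamicString brand existed; only the parameter and result types differ from the Delphi declaration):

 type
   // hypothetical dynamic counterpart of TStrings
   TDynamicStrings = class
   public
     function Add(const S: DynamicString): Integer; virtual; abstract;
     function Get(Index: Integer): DynamicString; virtual; abstract;
     property Strings[Index: Integer]: DynamicString read Get; default;
   end;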

When retrieving a string from a TStrings based storage, the compiler would automatically check the dynamically defined encoding brand of the source and do an auto-conversion if necessary (see above).

Reworking the code for storing and retrieving with TStringList would be easy to do. Of course more complex classes and functions (such as sorting) would need some more effort.

Hence, if you set the same (static) encoding brand for all the variables you use with a TStrings based store, you get exactly the same functionality and no noticeable overhead compared with the Delphi way of always using UTF-16.

An advantage of the dynamic TStringList, besides the obvious enhanced versatility for (portable) code that is written with a String encoding brand other than UTF-16, is that you can force it to use any encoding you like by just assigning your strings to an appropriately branded variable before storing (no special handling is needed while retrieving). So you can e.g. keep a compact UTF-8 store while using Delphi-style UTF-16 (more easily handled regarding pos() and friends) in your user code. The conversion between UTF-8 and UTF-16 is fast and not prone to information loss.
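
A sketch of that usage pattern, assuming the hypothetical TDynamicStrings class sketched above existed (creation of a concrete descendant is omitted):

 var
   Store: TDynamicStrings;   // hypothetical dynamic list, see the sketch above
   U8:    UTF8String;
   W:     UnicodeString;
 begin
   W := 'Hello';
   U8 := W;          // one explicit conversion to UTF-8 before storing
   Store.Add(U8);    // stored as UTF-8, no further conversion inside the list
   W := Store[0];    // only on retrieval does the dynamic brand trigger UTF-8 -> UTF-16
 end;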

Three more RAW types

While we are defining a not Delphi compatible String type brand, of course we should also add the obviously missing types RawWordString, RawDWordString, and RawQWordString.

See Also