LCL Unicode Support/ja


Revision as of 03:02, 4 November 2006


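Introduction

Lazarus needs better support for the Unicode standard, mostly on the Windows platform. This page collects some basic information for those who want to improve Unicode support in Lazarus. Please feel free to correct, extend, and update it.

For those who already have experience with the Unicode standard or with WideString in Delphi, the information on this page should help when writing programs that support Unicode. Previous experience with non-(Western) Latin languages and with multiple character sets will also be helpful.

Note: Implementation details are still under discussion, so the content of this document may change.

Implementation guidelines

Requirements

The spirit of Lazarus is "Write once, compile everywhere." Ideally, this means that a Unicode-enabled application needs no target-specific conditional defines: a single Unicode-aware source code base can produce programs for all targets.

The "interface" part of the LCL should implement Unicode support for every target platform that itself supports Unicode. At the same time, it must hide the platform-dependent implementation from the application programmer.

As far as Lazarus is concerned, internal string communication at the boundary between application code and the LCL, as well as between the LCL and the widgetsets, is based on classic (byte-oriented) strings. Logically, their contents should be encoded as UTF-8.

Migration to Unicode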

Most existing Lazarus applications use ANSI encodings, because that is the default for the Gtk1 and win32 interfaces today. This will change in the future and all widgetsets will support UTF-8, so all applications that pass strings directly to the interface (whether written in code or in the Object Inspector) will need to be converted to UTF-8.

When people develop software for widgetsets that are not yet fully working, such as Gtk 2, Qt, and WinCE (and, in the future, Win32U), they use an IDE compiled for a more stable widgetset, such as Gtk or win32. To avoid inconsistencies (like passing ISO characters to a UTF-8 widgetset), it is necessary to use an IDE that works in the same encoding as the target widgetset. This means that we will need a stable UTF-8 IDE before completing the migration to Unicode.


Currently we have various groups of widgetsets, according to the encoding:

  • Interfaces that use ANSI encoding: win32 and gtk (1) interfaces.
  • Interfaces that use UTF-8 encoding: gtk (1), gtk2, qt, fpGUI
  • Interfaces that currently use ANSI encoding, but need migration to UTF-8: win32, wince


Notice that gtk 1 is in both the ANSI and the UTF-8 groups. That is because the encoding used by Gtk 1 is controlled by an environment variable.

As Lazarus is today, existing software will work if recompiled for the win32, wince or gtk interfaces, but will face encoding issues when compiled for any other widgetset. New software that uses UTF-8 will work when recompiled for any of the widgetsets in the Unicode group.

One very important note is that you must use an IDE compiled for the same group you are targeting. This is because the IDE writes LFM and LRS files in the encoding of the widgetset it was compiled for, not in the encoding of the target widgetset.

Roadmap

Now that we have guidelines, it is time to create a roadmap and put it into practice. For this, the following realistic plan was created. The plan splits the tasks into two groups: primary tasks and secondary tasks.

All primary tasks must be fully implemented before we can say Lazarus is Unicode enabled, and as such they will be the main focus of our effort.

Secondary tasks are desirable, but will not be implemented unless someone volunteers for them or posts a bounty for them.


Primary Tasks

Make Win32 Widgetset support UTF-8

Notes: In this step we will target all 32-bit Windows versions at the same time. All code produced in this effort will be isolated from the current win32 interface by IFDEFs, to avoid introducing bugs into this primary interface. After the transition period, the IFDEFs will be removed and only the Unicode code will remain.

Details about how to support Unicode on Win9x are being debated.

Status: Not implemented


Update Gtk 2 keyboard functions so they work with UTF-8

Notes:

Status: Not implemented


Make sure the Lazarus IDE runs correctly with Win32 Unicode widgetset and supports UTF-8

Notes:

Status: Not implemented


Make sure the Lazarus IDE runs correctly with Gtk 2 widgetset and supports UTF-8

Notes:

Status: Not implemented

Secondary Tasks

Update Windows CE widgetset so it uses UTF-8

Notes: String conversion routines are concentrated on the winceproc.pp file. Many tests are needed.

Status: Not implemented


Update Gtk 1 keyboard functions so they work with UTF-8

Notes:

Status: Not implemented



Unicode essentials

The Unicode standard maps integers from 0 to 10FFFF(h) to characters. Each such integer is called a code point. In other words, Unicode characters are in principle defined for code points from U+000000 to U+10FFFF (0 to 1 114 111).

There are three schemes for representing Unicode code points as unique byte sequences. These schemes are called Unicode transformation formats: UTF-8, UTF-16 and UTF-32. Lossless conversion between any two of them is possible. Here are their basic properties:

                           UTF-8 UTF-16 UTF-32
Smallest code point [hex] 000000 000000 000000
Largest code point  [hex] 10FFFF 10FFFF 10FFFF
Code unit size [bits]          8     16     32
Minimal bytes/character        1      2      4
Maximal bytes/character        4      4      4
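
The minimum and maximum sizes in the table follow from the code point ranges each format covers. The following Free Pascal program is a minimal sketch of that arithmetic; the function names Utf8Size and Utf16Size are made up for this illustration and are not part of the RTL or the LCL. UTF-32 needs no function, since it always uses 4 bytes.

```pascal
program UtfSizes;
{$mode objfpc}{$H+}

// Hypothetical helper: bytes needed to store one code point in UTF-8.
function Utf8Size(CodePoint: Cardinal): Integer;
begin
  if CodePoint <= $7F then
    Result := 1          // ASCII range
  else if CodePoint <= $7FF then
    Result := 2
  else if CodePoint <= $FFFF then
    Result := 3
  else
    Result := 4;         // U+10000 to U+10FFFF
end;

// Hypothetical helper: bytes needed to store one code point in UTF-16.
function Utf16Size(CodePoint: Cardinal): Integer;
begin
  if CodePoint <= $FFFF then
    Result := 2          // one 16-bit code unit
  else
    Result := 4;         // two 16-bit code units
end;

begin
  WriteLn('U+0041:  UTF-8 ', Utf8Size($41), ', UTF-16 ', Utf16Size($41), ', UTF-32 4');
  WriteLn('U+00E9:  UTF-8 ', Utf8Size($E9), ', UTF-16 ', Utf16Size($E9), ', UTF-32 4');
  WriteLn('U+20AC:  UTF-8 ', Utf8Size($20AC), ', UTF-16 ', Utf16Size($20AC), ', UTF-32 4');
  WriteLn('U+1F600: UTF-8 ', Utf8Size($1F600), ', UTF-16 ', Utf16Size($1F600), ', UTF-32 4');
end.
```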

UTF-8 has several important and useful properties:

  • It is interpreted as a sequence of bytes, so the concept of low- and high-order bytes does not exist.
  • Unicode characters U+0000 to U+007F (ASCII) are encoded simply as bytes 00h to 7Fh (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
  • All characters above U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set.
  • No byte sequence of one character is contained within a longer byte sequence of another character. This allows easy search for substrings.
  • The first byte of a multibyte sequence that represents a non-ASCII character is always in the range C2h to F4h and indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 80h to BFh. This allows easy resynchronization and robustness.
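
The lead-byte and continuation-byte ranges described above are enough to walk a UTF-8 string without decoding it. The following Free Pascal program is a small sketch under that assumption; Utf8SequenceLength, Utf8CodePointCount and BytesToString are hypothetical names chosen for this example, and the code does not validate malformed input.

```pascal
program Utf8Scan;
{$mode objfpc}{$H+}

// Hypothetical helper: builds a byte string from raw byte values, so the
// example does not depend on the encoding of this source file.
function BytesToString(const Bytes: array of Byte): AnsiString;
var
  i: Integer;
begin
  SetLength(Result, Length(Bytes));
  for i := 0 to High(Bytes) do
    Result[i + 1] := Chr(Bytes[i]);
end;

// Length in bytes of the sequence introduced by lead byte B,
// or 0 if B is a continuation byte or cannot start a sequence.
function Utf8SequenceLength(B: Byte): Integer;
begin
  if B <= $7F then
    Result := 1                              // 0xxxxxxx: ASCII
  else if (B >= $C2) and (B <= $DF) then
    Result := 2                              // 110xxxxx
  else if (B >= $E0) and (B <= $EF) then
    Result := 3                              // 1110xxxx
  else if (B >= $F0) and (B <= $F4) then
    Result := 4                              // 11110xxx
  else
    Result := 0;
end;

// Counts code points by counting every byte that is not a
// continuation byte (10xxxxxx, i.e. 80h to BFh).
function Utf8CodePointCount(const S: AnsiString): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  S: AnsiString;
begin
  // 'a' (61h), U+00E9 (C3h A9h), U+20AC (E2h 82h ACh)
  S := BytesToString([$61, $C3, $A9, $E2, $82, $AC]);
  WriteLn('bytes:       ', Length(S));
  WriteLn('code points: ', Utf8CodePointCount(S));
end.
```

Note that Length reports bytes, not characters, which is the main thing to keep in mind when classic strings carry UTF-8.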

UTF-16 has the following most important properties: It uses a single 16-bit word to encode characters from U+0000 to U+D7FF and from U+E000 to U+FFFF, and a pair of 16-bit words (a surrogate pair, taken from the reserved range D800h to DFFFh) to encode the remaining characters, U+10000 to U+10FFFF.
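
As an illustration of how a pair of 16-bit words is formed, the following Free Pascal sketch splits a code point above U+FFFF into its two surrogates. The function names are invented for this example; callers are assumed to pass only code points in the range U+10000 to U+10FFFF to the two surrogate functions.

```pascal
program Utf16Surrogates;
{$mode objfpc}{$H+}

// True if the code point does not fit into a single 16-bit word.
function NeedsSurrogatePair(CodePoint: Cardinal): Boolean;
begin
  Result := CodePoint > $FFFF;
end;

// First word of the pair: D800h plus the upper 10 bits of (CodePoint - 10000h).
function HighSurrogate(CodePoint: Cardinal): Word;
begin
  Result := Word($D800 + ((CodePoint - $10000) shr 10));
end;

// Second word of the pair: DC00h plus the lower 10 bits of (CodePoint - 10000h).
function LowSurrogate(CodePoint: Cardinal): Word;
begin
  Result := Word($DC00 + ((CodePoint - $10000) and $3FF));
end;

begin
  WriteLn('U+20AC needs a pair:  ', NeedsSurrogatePair($20AC));
  WriteLn('U+1F600 needs a pair: ', NeedsSurrogatePair($1F600));
  WriteLn('U+1F600 as UTF-16:    ',
    HexStr(LongInt(HighSurrogate($1F600)), 4), ' ',
    HexStr(LongInt(LowSurrogate($1F600)), 4));
end.
```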

Finally, any Unicode character can be represented as a single 32-bit unit in UTF-32.

For more, see: Unicode FAQ - Basic questions, Unicode FAQ - UTF-8, UTF-16, UTF-32 & BOM, Wikipedia: UTF-8

Lazarus component library architecture essentials

The LCL consists of two parts:

  1. A target platform independent part, which implements a class hierarchy analogous to Delphi VCL;
  2. "Interfaces" - a part that implements the interface to APIs of each target platform.

The communication between the two parts is done by an abstract class TWidgetset. Each widgetset is implemented by its own class derived from TWidgetset.

The GTK 1 widgetset is the oldest. In this widgetset the string encoding is determined by the LANG environment variable, which usually selects one of the ISO-8859-n group of single-byte encodings. Recently (as of Mandriva 2007, for example), many distributions have been shipping Gtk 1 configured for UTF-8. Our Gtk 1 interface lacks proper support for UTF-8 in the keyboard handling routines. This is a big problem, and it increases the need for Lazarus to implement cross-platform Unicode support.

The Gtk2 widgetset only works with UTF-8 encoding, but the keyboard code of the interface is still based on the old gtk 1 code, so it does not support UTF-8 completely.

The win32 interface is set up with ANSI widgets, so it is currently not possible to use Unicode with win32.

The Qt interface is prepared for UTF-8. Qt itself uses UCS-2 as its native encoding, but the Lazarus interface for Qt converts from UTF-8 to UCS-2.

Windows CE only supports UCS-2 as its character encoding, but our interface for it currently converts strings from ISO to UCS-2 before calling the Windows API. This is very easy to fix, as all conversion code is concentrated in a few routines in the winceproc.pp file.

For more, see: Internals of the LCL

Unicode-enabling the win32 interface

Guidelines

First, and most importantly, all Unicode patches for the Win32 interface must be enclosed by IFDEF WindowsUnicodeSupport, to avoid breaking the existing ANSI interface. After this stabilizes, all ifdefs will be removed and only the Unicode part will remain. At that point all existing programs that use ANSI characters will need migration to Unicode.

Windows platforms up to and including Win9x are based on code pages and only partially support Unicode. Windows platforms starting with WinNT, as well as Windows CE, fully support Unicode. Win9x and NT offer two parallel sets of API functions: the old ANSI-enabled *A functions and the new Unicode-enabled *W functions. The *W functions accept wide strings, i.e. UTF-16 encoded strings, as parameters. Windows CE only provides the wide API functions.
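
Because the LCL keeps strings as UTF-8 while the *W functions expect UTF-16, every call crosses an encoding boundary. The following Free Pascal program is a small sketch of that conversion using the RTL routines UTF8Decode and UTF8Encode, assuming a reasonably recent Free Pascal release. It calls no Windows API, so it can be tried on any platform; BytesToString is a hypothetical helper used only to build the input from raw bytes.

```pascal
program Utf8RoundTrip;
{$mode objfpc}{$H+}

// Hypothetical helper: builds a byte string from raw byte values, so the
// example does not depend on the encoding of this source file.
function BytesToString(const Bytes: array of Byte): AnsiString;
var
  i: Integer;
begin
  SetLength(Result, Length(Bytes));
  for i := 0 to High(Bytes) do
    Result[i + 1] := Chr(Bytes[i]);
end;

var
  Utf8Text: AnsiString;
  Wide: WideString;
begin
  // U+00E9 (C3h A9h) followed by U+20AC (E2h 82h ACh), as UTF-8 bytes
  Utf8Text := BytesToString([$C3, $A9, $E2, $82, $AC]);

  // UTF-8 to UTF-16: the form a *W function would take as PWideChar(Wide)
  Wide := UTF8Decode(Utf8Text);

  WriteLn('UTF-8 bytes:             ', Length(Utf8Text));
  WriteLn('UTF-16 code units:       ', Length(Wide));
  WriteLn('first code unit:         ', HexStr(LongInt(Ord(Wide[1])), 4));

  // UTF-16 back to UTF-8: what the interface would hand back to the LCL
  WriteLn('bytes after re-encoding: ', Length(UTF8Encode(Wide)));
end.
```

The byte count and the code unit count differ, so lengths passed to *W functions have to be taken from the converted wide string, not from the original UTF-8 string.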

Wide functions present on Windows 9x

Some Wide API functions are present on Windows 9x. Here is a list of such functions: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/mslu/winprog/other_existing_unicode_support.asp

Conversion example:

  GetTextExtentPoint32(hdcNewBitmap, LPSTR(ButtonCaption), Length(ButtonCaption), TextSize);

Becomes:

var
  WideCaption: WideString;

  ....

  {$ifdef WindowsUnicodeSupport}
    WideCaption := Utf8Decode(ButtonCaption);
    GetTextExtentPoint32W(hdcNewBitmap, PWideChar(WideCaption), Length(WideCaption), TextSize);
  {$else}
    GetTextExtentPoint32(hdcNewBitmap, LPSTR(ButtonCaption), Length(ButtonCaption), TextSize);
  {$endif}
end;

Functions that need Ansi and Wide versions

First Conversion example:

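function TGDIWindow.GetTitle: String;
var
 l: Integer;
begin
   // GetWindowTextLength excludes the terminating null, but the count passed
   // to GetWindowText includes it, so reserve one extra character.
   l := Windows.GetWindowTextLength(Handle);
   SetLength(Result, l + 1);
   l := Windows.GetWindowText(Handle, @Result[1], l + 1);
   SetLength(Result, l);
end;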

Becomes:

function TGDIWindow.GetTitle: String;
var
 l: Integer;
 AnsiBuffer: string;
 WideBuffer: WideString;
begin

{$ifdef WindowsUnicodeSupport}

 if UnicodeEnabledOS then
 begin
   // The count passed to GetWindowTextW includes the terminating null,
   // so reserve one extra character and trim to the length actually copied.
   l := Windows.GetWindowTextLengthW(Handle);
   SetLength(WideBuffer, l + 1);
   l := Windows.GetWindowTextW(Handle, @WideBuffer[1], l + 1);
   SetLength(WideBuffer, l);
   Result := Utf8Encode(WideBuffer);
 end
 else
 begin
   // No wide API available: read the ANSI text and convert it to UTF-8.
   l := Windows.GetWindowTextLength(Handle);
   SetLength(AnsiBuffer, l + 1);
   l := Windows.GetWindowText(Handle, @AnsiBuffer[1], l + 1);
   SetLength(AnsiBuffer, l);
   Result := AnsiToUtf8(AnsiBuffer);
 end;

{$else}

   // Existing ANSI-only path, kept until the transition is complete.
   l := Windows.GetWindowTextLength(Handle);
   SetLength(Result, l + 1);
   l := Windows.GetWindowText(Handle, @Result[1], l + 1);
   SetLength(Result, l);

{$endif}

end;

Roadmap

Below is a list of units that need to be checked:

  • "win32callback.inc"
  • "win32def.pp"
  • "win32int.pp"
  • "win32lclintf.inc"
  • "win32lclintfh.inc"
  • "win32listsl.inc"
  • "win32listslh.inc"
  • "win32memostrings.inc"
  • "win32object.inc"
  • "win32proc.pp"
  • "win32winapi.inc"
  • "win32winapih.inc"
  • "win32wsactnlist.pp"
  • "win32wsarrow.pp"
  • "win32wsbuttons.pp"
  • "win32wscalendar.pp"
  • "win32wschecklst.pp"
  • "win32wsclistbox.pp"
  • "win32wscomctrls.pp"
  • "win32wscontrols.pp"
  • "win32wscustomlistview.inc"
  • "win32wsdbctrls.pp"
  • "win32wsdbgrids.pp"
  • "win32wsdialogs.pp"
  • "win32wsdirsel.pp"
  • "win32wseditbtn.pp"
  • "win32wsextctrls.pp"
  • "win32wsextdlgs.pp"
  • "win32wsfilectrl.pp"
  • "win32wsforms.pp"
  • "win32wsgrids.pp"
  • "win32wsimglist.pp"
  • "win32wsmaskedit.pp"
  • "win32wsmenus.pp"
  • "win32wspairsplitter.pp"
  • "win32wsspin.pp"
  • "win32wsstdctrls.pp"
  • "win32wstoolwin.pp"
  • "winext.pas"