Difference between revisions of "UTF-8"

From Lazarus wiki

Revision as of 19:42, 22 July 2007

UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. Unicode characters U+0000 to U+007F are encoded simply as bytes 00h to 7Fh. This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
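The ASCII compatibility described above can be checked directly. The following is a minimal sketch in Python; the helper name `same_bytes_in_ascii_and_utf8` is hypothetical and chosen only for this illustration.

```python
def same_bytes_in_ascii_and_utf8(text: str) -> bool:
    """Return True if text is pure 7-bit ASCII and both encodings yield identical bytes."""
    try:
        ascii_bytes = text.encode("ascii")
    except UnicodeEncodeError:
        # The text contains a character above U+007F, so it has no ASCII encoding.
        return False
    return ascii_bytes == text.encode("utf-8")


print(same_bytes_in_ascii_and_utf8("Lazarus IDE"))   # True
print(same_bytes_in_ascii_and_utf8("caf\u00e9"))     # False: U+00E9 is not ASCII
print("Lazarus".encode("utf-8").hex(" "))            # 4c 61 7a 61 72 75 73
```

Every byte of the ASCII-only string stays in the range 00h to 7Fh, so a tool that only understands ASCII reads such a file unchanged.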

All characters above U+007F are encoded as a sequence of two to four bytes, each of which has its most significant bit set. No byte sequence of one character is contained within a longer byte sequence of another character, which makes searching for substrings straightforward. The first byte of a multibyte sequence is always in the range C2h to F4h, and its leading bits indicate how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 80h to BFh. Because the ranges for first bytes and following bytes do not overlap, a decoder that starts in the middle of a sequence can skip forward or backward to the next character boundary, which allows easy resynchronization and makes the encoding robust.
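The resynchronization property can be sketched as follows. This is an illustration only, assuming the first-byte ranges listed in the byte sequence table (C2h to F4h); the function names `sequence_length`, `is_continuation`, and `char_start` are hypothetical.

```python
def sequence_length(lead: int) -> int:
    """Return the total length of the sequence that starts with this byte, or 0 if it is not a valid first byte."""
    if lead <= 0x7F:
        return 1              # single-byte ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0                  # following byte (80h..BFh) or a value unused as first byte


def is_continuation(byte: int) -> bool:
    """Following bytes of a multibyte sequence have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def char_start(data: bytes, index: int) -> int:
    """Step back from an arbitrary byte offset to the first byte of its character."""
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index


# 'a' (61), U+00E9 (C3 A9), U+20AC (E2 82 AC)
data = "a\u00e9\u20ac".encode("utf-8")
start = char_start(data, 5)              # offset 5 is the last byte of U+20AC
print(start, sequence_length(data[start]))   # 3 3
```

Stepping back never needs more than three iterations on valid input, because a sequence has at most three following bytes.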


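{| border="1"
|+ UTF-8 byte sequences
! Code points !! 1st byte !! 2nd byte !! 3rd byte !! 4th byte !! Most significant bits of the first byte !! Notes
|-
| U+0000..U+007F || 00..7F || || || || 0 || [[ASCII]]
|-
| U+0080..U+07FF || C2..DF || 80..BF || || || 110 ||
|-
| U+0800..U+0FFF || E0 || A0..BF || 80..BF || || 1110 ||
|-
| U+1000..U+FFFF || E1..EF || 80..BF || 80..BF || || 1110 || U+D800..U+DFFF (surrogates) are not encoded, so after ED the 2nd byte is limited to 80..9F
|-
| U+10000..U+3FFFF || F0 || 90..BF || 80..BF || 80..BF || 11110 ||
|-
| U+40000..U+FFFFF || F1..F3 || 80..BF || 80..BF || 80..BF || 11110 ||
|-
| U+100000..U+10FFFF || F4 || 80..8F || 80..BF || 80..BF || 11110 ||
|}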
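The byte ranges in the table follow from distributing the bits of the code point over the sequence. The following minimal sketch encodes a single code point by hand; the function name `encode_code_point` is hypothetical, and the rejection of surrogates (U+D800..U+DFFF) is an assumption based on the current UTF-8 definition rather than something stated in the table.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative, not optimized)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(encode_code_point(0x20AC).hex(" "))     # e2 82 ac
print(encode_code_point(0x10FFFF).hex(" "))   # f4 8f bf bf
```

The results can be compared with Python's built-in encoder, for example `encode_code_point(0x20AC) == chr(0x20AC).encode("utf-8")`.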