Difference between revisions of "UTF-8"
Revision as of 19:42, 22 July 2007
UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. Unicode characters U+0000 to U+007F are encoded simply as bytes 00h to 7Fh. This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
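The ASCII compatibility described above can be checked directly. The following is a minimal sketch using only Python's built-in codecs; the sample string is an arbitrary choice made for illustration.

```python
# Pure 7-bit ASCII text encodes to the same bytes under ASCII and UTF-8.
text = "Hello, UTF-8"  # arbitrary sample containing only ASCII characters

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# Identical byte sequences, and every byte is in the range 00h..7Fh.
same_encoding = ascii_bytes == utf8_bytes
all_seven_bit = all(b <= 0x7F for b in utf8_bytes)

print(same_encoding, all_seven_bit)
```

As soon as the text contains a character above U+007F, the ASCII encoder raises `UnicodeEncodeError`, while the UTF-8 encoder emits a multibyte sequence for that character.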
All characters above U+007F are encoded as a sequence of two to four bytes, each of which has its most significant bit set. No byte sequence of one character is contained within a longer byte sequence of another character, so substrings can be searched for with ordinary byte-wise comparison. The first byte of a multibyte sequence is always in the range C2h to F4h, and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 80h to BFh. Because first bytes and continuation bytes occupy disjoint ranges, a decoder that starts in the middle of a stream, or that encounters a corrupted byte, can resynchronize at the start of the next character.
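As a rough illustration of how the first byte announces the sequence length and how a decoder can resynchronize, here is a minimal sketch. The function names `sequence_length` and `resync` are hypothetical helpers introduced for this example, and the lead-byte ranges assume the current four-byte limit of UTF-8 (first bytes C2h to F4h).

```python
def sequence_length(lead: int) -> int:
    """Return the total length of the sequence started by `lead`,
    or 0 if `lead` cannot start a sequence."""
    if lead <= 0x7F:
        return 1                      # single-byte ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2                      # 110xxxxx
    if 0xE0 <= lead <= 0xEF:
        return 3                      # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4                      # 11110xxx
    return 0                          # continuation byte or unused value


def resync(data: bytes, pos: int) -> int:
    """Skip continuation bytes (80h..BFh) and return the index of the
    next byte that can start a character (or len(data))."""
    while pos < len(data) and 0x80 <= data[pos] <= 0xBF:
        pos += 1
    return pos


# Sample: one character each of length 1, 2, 3 and 4 bytes.
data = "a\u00e9\u20ac\U0001F600".encode("utf-8")
start = resync(data, 4)               # index 4 lies inside the 3-byte sequence
print(start, sequence_length(data[start]))
```

Starting from an arbitrary offset, at most three continuation bytes have to be skipped before a character boundary is reached.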
Code points | 1st byte | 2nd byte | 3rd byte | 4th byte | Leading bits of 1st byte | Notes |
---|---|---|---|---|---|---|
U+0000..U+007F | 00..7F | | | | 0 | ASCII |
U+0080..U+07FF | C2..DF | 80..BF | | | 110 | |
U+0800..U+0FFF | E0 | A0..BF | 80..BF | | 1110 | |
U+1000..U+FFFF | E1..EF | 80..BF | 80..BF | | 1110 | Surrogates U+D800..U+DFFF are excluded, so ED is followed only by 80..9F |
U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF | 11110 | |
U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF | 11110 | |
U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF | 11110 | |
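The bit layout behind the table can be sketched as a small encoder. This is an illustrative sketch rather than a reference implementation: `encode_code_point` is a hypothetical name, and rejecting surrogates (U+D800..U+DFFF) and values above U+10FFFF is an assumption based on the current definition of UTF-8.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by packing its bits
    into a lead byte and 0 to 3 continuation bytes."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:
        return bytes([cp])                          # 0xxxxxxx
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6),             # 110xxxxx
                      0x80 | (cp & 0x3F)])          # 10xxxxxx
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),            # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),                # 11110xxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare against Python's built-in encoder at the range boundaries
# listed in the table.
boundaries = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFF, 0x1000, 0xFFFF,
              0x10000, 0x3FFFF, 0x40000, 0xFFFFF, 0x100000, 0x10FFFF]
matches = all(encode_code_point(cp) == chr(cp).encode("utf-8")
              for cp in boundaries)
print(matches)
```

The highest code point, U+10FFFF, comes out as F4 8F BF BF, which is why a first byte of F4 can only be followed by a second byte in the range 80h to 8Fh.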