Difference between revisions of "UTF-8"

Revision as of 19:42, 22 July 2007

UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. Unicode characters U+0000 to U+007F are encoded simply as bytes 00h to 7Fh. This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
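The ASCII compatibility described above can be checked with a short Python sketch. The sample string is an arbitrary choice made for illustration; any text restricted to U+0000..U+007F should behave the same way.

```python
# Pure 7-bit ASCII text should yield identical bytes under ASCII and UTF-8.
text = "Hello, UTF-8!"  # arbitrary sample string (assumption for illustration)

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

same = ascii_bytes == utf8_bytes                    # byte-for-byte comparison
all_below_80 = all(b <= 0x7F for b in utf8_bytes)   # every byte is in 00h..7Fh

print(same, all_below_80)
```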

All characters above U+007F are encoded as a sequence of two to four bytes, each of which has its most significant bit set. No byte sequence of one character is contained within a longer byte sequence of another character, which allows easy searching for substrings. The first byte of a multibyte sequence is always in the range C2h to F4h and indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 80h to BFh. Because first bytes and continuation bytes occupy disjoint ranges, a decoder that starts in the middle of a stream can find the next character boundary easily, which makes resynchronization simple and the encoding robust.
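The resynchronization property can be illustrated with a minimal Python sketch. It relies only on the fact that continuation bytes lie in 80h..BFh; the helper names `is_continuation` and `char_start` are hypothetical and chosen for this example.

```python
def is_continuation(byte: int) -> bool:
    """Return True for continuation bytes, i.e. bytes in 80h..BFh (10xxxxxx)."""
    return byte & 0xC0 == 0x80


def char_start(data: bytes, index: int) -> int:
    """Step backwards from index to the first byte of the character containing it."""
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index


# Sample data (assumption for illustration): 'a' = 61, 'é' = C3 A9, '€' = E2 82 AC
data = "a\u00e9\u20ac".encode("utf-8")

print(char_start(data, 2))  # index 2 is A9, a continuation byte of 'é'
print(char_start(data, 5))  # index 5 is AC, the last byte of '€'

# Substring search can work directly on the bytes, because the encoding of one
# character never occurs inside the encoding of another.
print("\u20ac".encode("utf-8") in data)
```

Starting from an arbitrary offset, at most three backward steps are needed to reach the first byte of a well-formed character.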

UTF-8 byte sequences
Code points	1st byte	2nd byte	3rd byte	4th byte	Leading bits of the first byte	Notes
U+0000..U+007F	00..7F				0	ASCII
U+0080..U+07FF	C2..DF	80..BF			110
U+0800..U+0FFF	E0	A0..BF	80..BF		1110
U+1000..U+FFFF	E1..EF	80..BF	80..BF		1110
U+10000..U+3FFFF	F0	90..BF	80..BF	80..BF	11110
U+40000..U+FFFFF	F1..F3	80..BF	80..BF	80..BF	11110
U+100000..U+10FFFF	F4	80..BF	80..BF	80..BF	11110
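The byte ranges in the table follow from a simple bit-packing scheme, sketched below in Python. The function name `encode_code_point` is hypothetical, and rejecting surrogate code points (U+D800..U+DFFF) is an assumption that mirrors Python's built-in encoder rather than something stated in the table.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode a single code point as UTF-8 by hand (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= cp <= 0xDFFF:
        # Assumption: surrogates are rejected, as Python's own encoder does.
        raise ValueError("surrogate code points are not encoded")
    if cp <= 0x7F:
        # One byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Compare against Python's built-in encoder at the boundaries of each table row.
for cp in (0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFF, 0x1000, 0xFFFF,
           0x10000, 0x3FFFF, 0x40000, 0xFFFFF, 0x100000, 0x10FFFF):
    assert encode_code_point(cp) == chr(cp).encode("utf-8")
```

The restricted second-byte ranges in the E0 and F0 rows (A0..BF and 90..BF) fall out of this scheme automatically: smaller second bytes would only arise for code points that already fit in a shorter sequence.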

@@ Line 2: / Line 2: @@
 All characters above U+007F are encoded as a sequence of two to four bytes, each of which has its most significant bit set. No byte sequence of one character is contained within a longer byte sequence of another character, which allows easy searching for substrings. The first byte of a multibyte sequence is always in the range C2h to F4h and indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 80h to BFh. Because first bytes and continuation bytes occupy disjoint ranges, a decoder that starts in the middle of a stream can find the next character boundary easily, which makes resynchronization simple and the encoding robust.
+{| border="1"
+|+ UTF-8 byte sequences
+! &nbsp; Code points
+!1st byte
+!2nd byte
+!3rd byte
+!4th byte
+!Leading bits of the first byte
+!Notes
+|-
+| &nbsp; U+0000..U+007F
+| &nbsp; 00..7F
+|
+|
+|
+| &nbsp; 0
+| &nbsp; [[ASCII]] &nbsp;
+|-
+| &nbsp; U+0080..U+07FF
+| &nbsp; C2..DF
+| &nbsp; 80..BF
+|
+|
+| &nbsp; 110
+|
+|-
+| &nbsp; U+0800..U+0FFF
+| &nbsp; E0
+| &nbsp; A0..BF
+| &nbsp; 80..BF
+|
+| &nbsp; 1110
+|
+|-
+| &nbsp; U+1000..U+FFFF
+| &nbsp; E1..EF
+| &nbsp; 80..BF
+| &nbsp; 80..BF
+|
+| &nbsp; 1110
+|
+|-
+| &nbsp; U+10000..U+3FFFF
+| &nbsp; F0
+| &nbsp; 90..BF
+| &nbsp; 80..BF
+| &nbsp; 80..BF
+| &nbsp; 11110
+|
+|-
+| &nbsp; U+40000..U+FFFFF
+| &nbsp; F1..F3
+| &nbsp; 80..BF
+| &nbsp; 80..BF
+| &nbsp; 80..BF
+| &nbsp; 11110
+|
+|-
+| &nbsp; U+100000..U+10FFFF
+| &nbsp; F4
+| &nbsp; 80..BF
+| &nbsp; 80..BF
+| &nbsp; 80..BF
+| &nbsp; 11110
+|
+|}
