UTF-8

From Lazarus wiki
Revision as of 14:43, 21 July 2007 by Djzepi (talk | contribs) (New page: UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. Unicode characters U+0000 to U+007F are encoded simply as bytes 00h to 7Fh. This means ...)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigationJump to search

UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. Unicode characters U+0000 to U+007F are encoded simply as bytes 00h to 7Fh. This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.

All characters > U+007F are encoded as a sequence of several bytes, each of which has the two most significant bits set. No byte sequence of one character is contained within a longer byte sequence of another character. This allows easy search for substrings. The first byte of a multibyte sequence that represents a non-ASCII character is always in the range C0h to FDh and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 80h to BFh. This allows easy resynchronization and robustness.