Difference between revisions of "UTF-8"

From Lazarus wiki
 
(12 intermediate revisions by 5 users not shown)

Latest revision as of 11:52, 26 June 2017


UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. Unicode characters U+0000 to U+007F are encoded simply as bytes 00h to 7Fh. This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.

All characters above U+007F are encoded as a sequence of two to four bytes, each of which has its most significant bit set. No byte sequence of one character is contained within a longer byte sequence of another character, which makes substring search straightforward. The first byte of a multibyte sequence is always in the range C2h to F4h, and its leading bits indicate how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 80h to BFh (leading bits 10). Because lead bytes and continuation bytes never overlap, a decoder that starts in the middle of a stream can resynchronize at the next lead byte, which makes the encoding robust.
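The separation between lead bytes and continuation bytes can be illustrated with a short sketch. It assumes well-formed input, and `CountCodePoints` is a hypothetical helper written for this page, not an RTL function: it counts code points by skipping every byte in the range 80h to BFh.

```pascal
program Utf8Resync;
{$mode objfpc}{$H+}

{ Hypothetical helper, not part of the RTL: counts code points by counting
  every byte that is not a continuation byte (bit pattern 10xxxxxx).
  Assumes S holds well-formed UTF-8. }
function CountCodePoints(const S: RawByteString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  S: RawByteString;
begin
  // Bytes for 'A', U+00E9 (C3 A9) and U+20AC (E2 82 AC), set one by one so
  // the example does not depend on the encoding of the source file.
  SetLength(S, 6);
  S[1] := AnsiChar($41);
  S[2] := AnsiChar($C3);
  S[3] := AnsiChar($A9);
  S[4] := AnsiChar($E2);
  S[5] := AnsiChar($82);
  S[6] := AnsiChar($AC);
  WriteLn(CountCodePoints(S)); // 3
end.
```

The same test, `(byte and $C0) = $80`, is what lets a reader step backwards from an arbitrary position to the start of the current character.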


UTF-8 byte sequences

  Code points          1st byte  2nd byte  3rd byte  4th byte  Leading bits of 1st byte  Notes
  U+0000..U+007F       00..7F                                  0                         ASCII
  U+0080..U+07FF       C2..DF    80..BF                        110                       UTF-8 Latin characters
  U+0800..U+0FFF       E0        A0..BF    80..BF              1110
  U+1000..U+FFFF       E1..EF    80..BF    80..BF              1110                      UTF-8_subscripts_and_superscripts
  U+10000..U+3FFFF     F0        90..BF    80..BF    80..BF    11110
  U+40000..U+FFFFF     F1..F3    80..BF    80..BF    80..BF    11110
  U+100000..U+10FFFF   F4        80..8F    80..BF    80..BF    11110
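As a rough illustration of how the byte ranges in the table arise, the following sketch encodes a single code point by hand. `EncodeCodePoint` and `PrintBytes` are hypothetical helpers written for this page, not RTL functions. The sketch also assumes that surrogate values (U+D800..U+DFFF) and values above U+10FFFF should be rejected, which it signals by returning an empty string.

```pascal
program Utf8EncodeSketch;
{$mode objfpc}{$H+}

{ Hypothetical helper, not part of the RTL: encodes one code point as UTF-8.
  Returns '' for surrogates and for values above U+10FFFF. }
function EncodeCodePoint(CP: Cardinal): RawByteString;
begin
  Result := '';
  if CP <= $7F then
  begin
    SetLength(Result, 1);
    Result[1] := AnsiChar(Byte(CP));                      // 0xxxxxxx
  end
  else if CP <= $7FF then
  begin
    SetLength(Result, 2);
    Result[1] := AnsiChar(Byte($C0 or (CP shr 6)));       // 110xxxxx
    Result[2] := AnsiChar(Byte($80 or (CP and $3F)));     // 10xxxxxx
  end
  else if CP <= $FFFF then
  begin
    if (CP >= $D800) and (CP <= $DFFF) then
      Exit;                                               // surrogate: reject
    SetLength(Result, 3);
    Result[1] := AnsiChar(Byte($E0 or (CP shr 12)));      // 1110xxxx
    Result[2] := AnsiChar(Byte($80 or ((CP shr 6) and $3F)));
    Result[3] := AnsiChar(Byte($80 or (CP and $3F)));
  end
  else if CP <= $10FFFF then
  begin
    SetLength(Result, 4);
    Result[1] := AnsiChar(Byte($F0 or (CP shr 18)));      // 11110xxx
    Result[2] := AnsiChar(Byte($80 or ((CP shr 12) and $3F)));
    Result[3] := AnsiChar(Byte($80 or ((CP shr 6) and $3F)));
    Result[4] := AnsiChar(Byte($80 or (CP and $3F)));
  end;
end;

{ Hypothetical helper: writes the bytes of S as two-digit hex values. }
procedure PrintBytes(const S: RawByteString);
var
  I: Integer;
begin
  for I := 1 to Length(S) do
    Write(HexStr(Ord(S[I]), 2), ' ');
  WriteLn;
end;

begin
  PrintBytes(EncodeCodePoint($41));     // 41
  PrintBytes(EncodeCodePoint($E9));     // C3 A9
  PrintBytes(EncodeCodePoint($20AC));   // E2 82 AC
  PrintBytes(EncodeCodePoint($1F600));  // F0 9F 98 80
end.
```

In real code the RTL conversion routines are usually preferable. The sketch is only meant to show where the lead-byte and continuation-byte ranges come from.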

UTF8 functions

FreePascal

The system unit contains some basic functions:

  • UnicodeToUtf8
  • Utf8ToUnicode
  • UTF8Encode
  • UTF8Decode
  • AnsiToUtf8
  • Utf8ToAnsi
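A minimal sketch of a round trip through `UTF8Encode` and `UTF8Decode` might look as follows. It assumes a reasonably recent Free Pascal version in `objfpc` mode, and the variable names are illustrative only. `AnsiToUtf8` and `Utf8ToAnsi` are left out because their results depend on the system code page.

```pascal
program Utf8RoundTrip;
{$mode objfpc}{$H+}

var
  Wide: UnicodeString;
  Encoded: UTF8String;
  Decoded: UnicodeString;
begin
  // Build the text from code units so the example does not depend on the
  // encoding of the source file: U+00E9 followed by 'A'.
  Wide := WideChar($00E9);
  Wide := Wide + WideChar($0041);

  // UTF-16 to UTF-8: U+00E9 takes two bytes, 'A' takes one.
  Encoded := UTF8Encode(Wide);
  WriteLn('UTF-16 code units: ', Length(Wide));     // 2
  WriteLn('UTF-8 bytes:       ', Length(Encoded));  // 3

  // UTF-8 back to UTF-16.
  Decoded := UTF8Decode(Encoded);
  WriteLn('Round trip equal:  ', Decoded = Wide);   // TRUE
end.
```

Note that `Length` counts code units (bytes for a UTF-8 string), not characters, so the two lengths differ even though both strings hold the same text.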


Lazarus

Lazarus also provides UTF-8 functions. For more details, see LCL Unicode Support.

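See also

  • Dealing with directory and filenames - UTF8 functions for files
  • LCL Unicode Support - UTF8 in graphical applications
  • Console mode Pascal: Unicode (UTF8) output - Showing UTF8 output in console mode/text mode programs
  • UTF8 strings and characters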