UTF-8
UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It can represent any character in the Unicode standard, yet its byte values and character assignments for the first 128 code points are backwards compatible with ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages[1], and other places where characters are stored or streamed.
UTF-8 encodes each character (code point) in one to four octets (8-bit bytes), with the 1-byte encoding used for the 128 US-ASCII characters. See the Description section below for details.
The Internet Engineering Task Force (IETF) requires all Internet protocols to identify the encoding used for character data, and the supported character encodings must include UTF-8.[2] The Internet Mail Consortium (IMC) recommends that all email programs be able to display and create mail using UTF-8.[3]
By early 1992 the search was on for a good byte-stream encoding of multi-byte character sets. The draft ISO 10646 standard contained a non-required annex called UTF that provided a byte-stream encoding of its 32-bit characters. This encoding was not satisfactory on performance grounds, but did introduce the notion that bytes in the ASCII range of 0–127 represent themselves in UTF, thereby providing backward compatibility.
In July 1992 the X/Open committee XoJIG was looking for a better encoding. Dave Prosser of Unix System Laboratories submitted a proposal for one that had faster implementation characteristics and introduced the improvement that 7-bit ASCII characters would only represent themselves; all multibyte sequences would include only 8-bit characters, i.e., those where the high bit was set.
In August 1992 this proposal was circulated by an IBM X/Open representative to interested parties. Ken Thompson of the Plan 9 operating system group at Bell Laboratories then made a crucial modification to the encoding to allow it to be self-synchronizing, meaning that it was not necessary to read from the beginning of the string in order to find character boundaries. Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. Over the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout, and then communicated their success back to X/Open.[4]
UTF-8 was first officially presented at the USENIX conference in San Diego, from January 25–29, 1993.
The bits of a Unicode character are distributed into the lower bit positions inside the UTF-8 bytes, with the lowest bit going into the last bit of the last byte:
| Unicode | Byte1 | Byte2 | Byte3 | Byte4 | example |
|---|---|---|---|---|---|
| U+000000-U+00007F | 0xxxxxxx | | | | '$' U+0024 → 00100100 → 0x24 |
| U+000080-U+0007FF | 110xxxxx | 10xxxxxx | | | '¢' U+00A2 → 11000010,10100010 → 0xC2,0xA2 |
| U+000800-U+00FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | | '€' U+20AC → 11100010,10000010,10101100 → 0xE2,0x82,0xAC |
| U+010000-U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | U+10ABCD → 11110100,10001010,10101111,10001101 → 0xF4,0x8A,0xAF,0x8D |
So the first 128 characters (US-ASCII) need one byte. The next 1920 characters need two bytes to encode. This includes Latin letters with diacritics and characters from Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Thaana alphabets. Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in the other planes of Unicode, which are rarely used in practice.
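The bit distribution in the table can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name `utf8_encode` is hypothetical, and the range checks reflect the RFC 3629 limits rather than the original six-byte scheme.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point using the bit layout from the table (illustrative sketch)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not encoded in UTF-8")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])
```

Calling `utf8_encode(0x20AC)` yields the three bytes 0xE2,0x82,0xAC shown for '€' in the table, and the result should agree with Python's built-in `chr(cp).encode("utf-8")` for every non-surrogate code point.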
By continuing the pattern given above it is possible to deal with much larger numbers. The original specification allowed for sequences of up to six bytes, covering numbers up to 31 bits (the original limit of the Universal Character Set). However, in November 2003 RFC 3629 restricted UTF-8 to the range covered by the formal Unicode definition, U+0000 to U+10FFFF.
With these restrictions, bytes in a UTF-8 sequence have the following meanings. Byte values whose notes read overlong, restricted, or invalid can never appear in a legal UTF-8 sequence. Values 00-7F each represent a character in a single byte. Values C2-F4 must only appear as the first byte of a multi-byte sequence, and values 80-BF can only appear as the second or later byte of a multi-byte sequence:
| binary | hex | decimal | notes |
|---|---|---|---|
| 00000000-01111111 | 00-7F | 0-127 | US-ASCII (single byte) |
| 10000000-10111111 | 80-BF | 128-191 | Second, third, or fourth byte of a multi-byte sequence |
| 11000000-11000001 | C0-C1 | 192-193 | Overlong encoding: start of a 2-byte sequence, but code point <= 127 |
| 11000010-11011111 | C2-DF | 194-223 | Start of 2-byte sequence |
| 11100000-11101111 | E0-EF | 224-239 | Start of 3-byte sequence |
| 11110000-11110100 | F0-F4 | 240-244 | Start of 4-byte sequence |
| 11110101-11110111 | F5-F7 | 245-247 | Restricted by RFC 3629: start of 4-byte sequence for codepoint above 10FFFF |
| 11111000-11111011 | F8-FB | 248-251 | Restricted by RFC 3629: start of 5-byte sequence |
| 11111100-11111101 | FC-FD | 252-253 | Restricted by RFC 3629: start of 6-byte sequence |
| 11111110-11111111 | FE-FF | 254-255 | Invalid: not defined by original UTF-8 specification |
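The byte ranges in this table translate directly into a lookup by value. The sketch below is one possible way to express it; the function name `classify_byte` and the returned labels are illustrative choices, not terms from any standard.

```python
def classify_byte(b: int) -> str:
    """Return the role a single byte value can play in UTF-8, per the table above."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "ascii"          # single-byte character
    if b <= 0xBF:
        return "continuation"   # second, third, or fourth byte
    if b <= 0xC1:
        return "invalid"        # would start an overlong 2-byte sequence
    if b <= 0xDF:
        return "lead-2"         # start of 2-byte sequence
    if b <= 0xEF:
        return "lead-3"         # start of 3-byte sequence
    if b <= 0xF4:
        return "lead-4"         # start of 4-byte sequence
    return "invalid"            # F5-FF: restricted by RFC 3629 or never defined
```

A byte-level check like this is necessary but not sufficient for validation: a decoder must still confirm that each lead byte is followed by the right number of continuation bytes and that the resulting code point is in range.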
Unicode also disallows the 2048 code points U+D800..U+DFFF (the UTF-16 surrogates), the 32 code points U+FDD0..U+FDEF (noncharacters), and all 34 code points of the form U+xxFFFE and U+xxFFFF (more noncharacters). See Table 3-7 in the Unicode 5.0 standard. UTF-8 reliably transforms these values, but they are not valid scalar values in Unicode, and thus their UTF-8 encodings may be considered invalid sequences.
The official name is "UTF-8", which is used in all documents relating to the encoding. Many documents, particularly those transmitted across the Internet, declare their character set by name near the start of the document; in this case, the correct name to use is "UTF-8". In addition, all standards conforming to the Internet Assigned Numbers Authority (IANA) list[5], which include CSS, HTML, XML, and HTTP headers[6], may also use the name "utf-8", as the declaration is case-insensitive. Despite this, alternative forms, usually "utf8" or "UTF8", are seen; these are incorrect and should be avoided, although most agents such as browsers understand them.
UTF-8 was designed to satisfy the following properties:
This means that a stream of UTF-8 characters that come exclusively from the ASCII set is indistinguishable from an ASCII-encoded stream, and so can be processed by a program that only handles ASCII characters. Furthermore, even if the stream contains non-ASCII multi-byte characters, a program written for ASCII processing might perform correctly, depending on the characteristics of the program. See Backward compatibility.
These properties add redundancy to UTF-8-encoded text. Redundancy makes it very unlikely that a random sequence of bytes will validate as UTF-8. Other encodings lack such a practical validity test, which leads to errors such as mojibake with Shift-JIS and ISO-8859-1 and requires the use of a byte-order mark in UTF-16. The chance of a random sequence of bytes being valid UTF-8 and not pure ASCII is 3.9% for a 2-byte sequence, 0.41% for a 3-byte sequence and 0.026% for a 4-byte sequence.[7] While natural languages encoded in traditional encodings are not random byte sequences, they are even less likely to pass a UTF-8 validity test and then be misinterpreted. For example, for ISO-8859-1 text to be mis-recognized as UTF-8, the only non-ASCII characters in it would have to be in sequences starting with either an accented letter or the multiplication symbol and ending with a symbol (pure ASCII text would pass a UTF-8 validity test, but it is UTF-8 by definition).
Redundancy means that UTF-8 text does not use memory as efficiently as possible. However, data compression is not one of the aims of the Unicode encodings. A modern compression algorithm such as the one used by gzip will compress any encoding of Unicode to about the same size and (if the source text is more than a few hundred characters) to less than one byte per character. For short items of text, where traditional algorithms do not perform well and size is important, the Standard Compression Scheme for Unicode could be considered instead. Size arguments (both for and against UTF-8) are listed in the advantages/disadvantages below.
The following implementations differ slightly from the UTF-8 specification and are therefore incompatible with it.
Many pieces of software added UTF-8 conversions for UCS-2 data and did not alter their UTF-8 conversion when UCS-2 was replaced with UTF-16, which supports surrogate pairs. As a result, each half of a UTF-16 surrogate pair is encoded as its own 3-byte UTF-8 sequence, giving 6 bytes rather than 4 for characters outside the Basic Multilingual Plane. This variant is known as CESU-8. Oracle databases use it, as do Java and Tcl as described below, and probably a great deal of Windows software whose programmers were unaware of the complexities of UTF-16. Although most usage is accidental, a supposed benefit is that binary sorting of CESU-8 preserves the UTF-16 binary sort order.
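To make the 6-byte versus 4-byte difference concrete, the following sketch produces CESU-8 style output by encoding each UTF-16 code unit separately. The function name `cesu8_encode` is hypothetical; it relies on Python's `surrogatepass` error handler, which lets the standard UTF-8 codec emit the 3-byte form of a lone surrogate.

```python
import struct

def cesu8_encode(text: str) -> bytes:
    """Encode text the CESU-8 way: one UTF-8 sequence per UTF-16 code unit."""
    utf16 = text.encode("utf-16-be")
    # Split the UTF-16 byte string into 16-bit code units.
    units = struct.unpack(f">{len(utf16) // 2}H", utf16)
    # Each unit, including each surrogate half, gets its own UTF-8 encoding.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)
```

For U+10400, standard UTF-8 gives the 4 bytes 0xF0,0x90,0x90,0x80, whereas this sketch gives the 6 bytes 0xED,0xA0,0x81,0xED,0xB0,0x80. Text confined to the Basic Multilingual Plane comes out identical in both.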
In Modified UTF-8 the null character (U+0000) is encoded as 0xc0,0x80 rather than 0x00. (0xc0,0x80 is not valid UTF-8 because it is not the shortest possible representation.) This means that the encoding of an array of Unicode containing the null character will not have a null byte in it, and thus will not be truncated if processed in a language such as C using traditional ASCIIZ string functions.
All known Modified UTF-8 implementations also treat the surrogate pairs as in CESU-8.
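A minimal sketch of the two Modified UTF-8 rules described here (the 0xC0,0x80 null and the CESU-8 style surrogate handling) might look as follows. The function name `modified_utf8_encode` is an assumption for illustration and is not the API of Java, Tcl, or any other implementation.

```python
def modified_utf8_encode(text: str) -> bytes:
    """Illustrative Modified UTF-8 encoder: no 0x00 bytes, surrogates encoded separately."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # Overlong 2-byte form of U+0000, so the output never contains a null byte.
            out += b"\xc0\x80"
        elif cp > 0xFFFF:
            # Split into a UTF-16 surrogate pair and encode each half as 3 bytes.
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)
            low = 0xDC00 | (v & 0x3FF)
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)
```

The output of `modified_utf8_encode("a\x00b")` contains no 0x00 byte, so a C-style string function would not truncate it, but a strict UTF-8 decoder will reject the 0xC0,0x80 pair.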
In normal usage, the Java programming language supports standard UTF-8 when reading and writing strings through InputStreamReader and OutputStreamWriter. However, it uses modified UTF-8 for object serialization, for the Java Native Interface, and for embedding constants in class files. Tcl also uses the same modified UTF-8 as Java for internal representation of Unicode data.
Many Windows programs (including Windows Notepad) add the bytes 0xEF,0xBB,0xBF at the start of any document saved as UTF-8. This is the UTF-8 encoding of the Unicode Byte Order Mark. This causes interoperability problems with software that does not expect the BOM. In particular:
Some Windows software (including Notepad) will sometimes misidentify UTF-8 (and thus plain ASCII) documents as UTF-16LE if this BOM is missing, a bug commonly known as "Bush hid the facts" after a particular phrase that can trigger it.
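Software that wants to tolerate the leading 0xEF,0xBB,0xBF bytes can detect and drop them before further processing. The sketch below shows one way to do this in Python; `strip_utf8_bom` is a hypothetical helper, while `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the standard library.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 encoded byte order mark, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# Alternatively, the "utf-8-sig" codec skips an optional BOM while decoding.
text = b"\xef\xbb\xbfhello".decode("utf-8-sig")
```

Decoding the same bytes with the plain `utf-8` codec keeps the BOM as a leading U+FEFF character, which is the kind of surprise that causes interoperability problems.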
The exact response required of a UTF-8 decoder on invalid input is not uniformly defined by the standards. In general, there are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence:
Decoders may also differ in what bytes are part of the error. The sequence 0xF0,0x20,0x20,0x20 might be considered a single 4-byte error, or a 1-byte error followed by 3 space characters.
It is possible for a decoder to behave in different ways for different types of invalid input.
RFC 3629 states that "Implementations of the decoding algorithm MUST protect against decoding invalid sequences."[8] The Unicode Standard requires a Unicode-compliant decoder to "…treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence."
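As an illustration of how one decoder exposes several of these behaviours, the sketch below runs the 0xF0,0x20,0x20,0x20 example through Python's built-in UTF-8 codec with different error policies. The wrapper `decode_with_policy` is a hypothetical helper; `strict`, `replace`, and `ignore` are standard Python error handler names. Other decoders may draw the error boundaries differently.

```python
from typing import Optional

def decode_with_policy(raw: bytes, policy: str) -> Optional[str]:
    """Decode UTF-8 with the given error policy; return None if the decoder raises."""
    try:
        return raw.decode("utf-8", errors=policy)
    except UnicodeDecodeError:
        return None

sample = b"\xf0\x20\x20\x20"
rejected = decode_with_policy(sample, "strict")    # decoder signals an error
replaced = decode_with_policy(sample, "replace")   # invalid byte becomes U+FFFD
dropped = decode_with_policy(sample, "ignore")     # invalid byte is silently skipped
```

CPython treats the lone 0xF0 as a 1-byte error followed by three space characters, so `replaced` holds U+FFFD plus three spaces and `dropped` holds just the three spaces.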
Overlong forms are one of the most troublesome types of UTF-8 data. The current RFC says they must not be decoded, but older specifications for UTF-8 only gave a warning, and many simpler decoders will happily decode them. Overlong forms have been used to bypass security validations in high-profile products, including Microsoft's IIS web server. Therefore, great care must be taken to avoid security issues if validation is performed before conversion from UTF-8, and it is generally much simpler to handle overlong forms before any input validation is done.
To maintain security in the case of invalid input, there are a few options. The first is to decode the UTF-8 before doing any input validation checks. The second is to use a decoder that, in the event of invalid input either returns an error or text that the application knows to be harmless. A third possibility is to never decode the UTF-8 at all, and only look for the byte patterns you wish to match, but this requires that you know that no other part of your system will attempt a decoding, a catch-22 that makes this simple solution difficult to use in many systems.
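The danger of overlong forms can be shown with a deliberately unsafe decoder. In the sketch below, `naive_decode_2byte` is a hypothetical example of the "simpler decoder" behaviour and should not be used in practice; `is_valid_utf8` leans on Python's strict built-in decoder, which rejects overlong forms.

```python
def naive_decode_2byte(b1: int, b2: int) -> int:
    """UNSAFE: combine the payload bits of a 2-byte sequence without any validity check."""
    return ((b1 & 0x1F) << 6) | (b2 & 0x3F)

def is_valid_utf8(raw: bytes) -> bool:
    """Return True only if the strict decoder accepts the whole byte string."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# 0xC0,0xAF is an overlong encoding of '/' (U+002F).
smuggled = chr(naive_decode_2byte(0xC0, 0xAF))
```

A filter that scans the raw bytes for 0x2F would not see a slash in 0xC0,0xAF, yet the naive decoder turns it into one afterwards. Checking with `is_valid_utf8` (or decoding strictly) before validation closes that gap.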
UTF-8 provides limited backward compatibility for programs that were written to process ASCII-encoded data. A program written to expect only single-byte ASCII characters might or might not correctly process a UTF-8 stream that contains multi-byte non-ASCII characters. It depends upon the characteristics of the program. Consider these three hypothetical programs, written to process ASCII data.
Program 1's purpose is to copy the first 10 lines of text from one file to another. Since the end of a line is marked by a character that is in the ASCII set, and since the program is not concerned about the contents of the lines themselves, then it should work properly on any file of UTF-8 data. It will simply copy all the bytes of data from the beginning of the line to the end, and is not concerned with the location of character boundaries.
Program 2's purpose is to copy the first 10 characters of text from one file to another. This program will give an incorrect result when there are non-ASCII characters in the first 10 bytes, because of its assumption that each character occupies only one byte. The program would incorrectly count the number of characters copied, and might even copy only part of a multi-byte character, producing an invalid UTF-8 stream.
Program 3's purpose is to display the contents of a file to the screen. A simple program of this type would merely read the contents of the file into memory, and then call an operating system function that will handle the presentation. However, the operating system depends on the calling program to identify the encoding method of the data. If the program tells the operating system that it is ASCII data, but instead it contains non-ASCII characters, the characters will not be properly displayed.
There are many more failure scenarios.
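Program 2's failure mode is easy to reproduce. The sketch below contrasts a byte-oriented truncation with a character-oriented one; both function names are hypothetical, and the second assumes its input is valid UTF-8.

```python
def first_n_bytes(raw: bytes, n: int) -> bytes:
    """What an ASCII-era program does: one byte is assumed to be one character."""
    return raw[:n]

def first_n_chars(raw: bytes, n: int) -> bytes:
    """Decode first, then truncate on character boundaries (assumes valid UTF-8)."""
    return raw.decode("utf-8")[:n].encode("utf-8")

sample = ("a€" * 6).encode("utf-8")   # 12 characters, 24 bytes
broken = first_n_bytes(sample, 10)    # ends in the middle of a '€'
intact = first_n_chars(sample, 10)    # 10 whole characters, 20 bytes
```

Here `broken` ends with a lone 0xE2 lead byte and no longer decodes as UTF-8, while `intact` holds exactly ten characters.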
A common criticism from newcomers to variable-length encodings such as UTF-8 is that the algorithm to find the number of characters between two points, or to advance a pointer by n characters, is not O(1) (constant time), which makes programs using such encodings slower. However, how often actual working software uses these algorithms is often vastly overestimated:
malloc(strlen(s) + 1) or pointer += length_of_word(*pointer). Changing the functions to return byte counts in place of character counts will get the exact same program with O(1) performance.
So while the number of octets in a UTF-8 string or substring is related in a more complex way to the number of code points than for UTF-32, it is very rare to encounter a situation where this makes a difference in practice, and this cannot be used as either an advantage or disadvantage of UTF-8.
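The distinction between byte counts and code point counts can be sketched as follows. Both helpers are hypothetical and assume well-formed UTF-8 input; they rely only on the fact that continuation bytes always have the bit pattern 10xxxxxx.

```python
def count_code_points(raw: bytes) -> int:
    """O(n) count: every byte that is not a continuation byte starts a character."""
    return sum(1 for b in raw if (b & 0xC0) != 0x80)

def next_boundary(raw: bytes, i: int) -> int:
    """Move forward from index i to the nearest character boundary at or after i."""
    while i < len(raw) and (raw[i] & 0xC0) == 0x80:
        i += 1
    return i

sample = "a¢€".encode("utf-8")       # 0x61, 0xC2 0xA2, 0xE2 0x82 0xAC
byte_length = len(sample)            # what memory allocation needs
char_count = count_code_points(sample)
```

The byte length is what a buffer allocation needs, and `len` provides it without scanning for characters. The code point count requires a pass over the data but is needed far less often. `next_boundary` also shows the self-synchronizing property: a reader dropped at an arbitrary offset finds the next character start by skipping at most three bytes.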
There are several current definitions of UTF-8 in various standards documents:
They supersede the definitions given in the following obsolete works:
They are all the same in their general mechanics, with the main differences being on issues such as allowed range of code point values and safe handling of invalid input.