What is UTF-8?
UTF-8 (8-bit Unicode Transformation Format) is a variable-width character encoding that can represent every character in the Unicode standard. It uses 1 to 4 bytes per character and is backward compatible with ASCII.
Quick Facts
| Full Name | 8-bit Unicode Transformation Format |
|---|---|
| Created | 1992 by Ken Thompson and Rob Pike |
How UTF-8 Works
UTF-8 was designed by Ken Thompson and Rob Pike in 1992 and has become the dominant character encoding for the web. Its key innovation is variable-width encoding: ASCII characters (U+0000-007F) use just 1 byte, making UTF-8 efficient for English text while still supporting every Unicode character. The high bits of each lead byte indicate how many continuation bytes follow, and every continuation byte starts with the bit prefix 10, so UTF-8 is self-synchronizing: a decoder can find the nearest character boundary without reading from the start of the stream. It's the default encoding for HTML5, JSON, and most modern systems.
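The variable-width behavior is easy to observe with Python's built-in codec; this short sketch prints the byte length and raw bytes for one character from each width class:

```python
# Each character encodes to a different number of bytes in UTF-8.
for ch in "Aé中😀":
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch!r} -> {len(encoded)} byte(s): {encoded.hex(' ').upper()}")
```

The ASCII letter comes out as a single byte identical to its ASCII code, which is what makes UTF-8 backward compatible with ASCII.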
Key Characteristics
- Variable-width: 1-4 bytes per character
- Backward compatible with ASCII (first 128 characters)
- Self-synchronizing encoding
- No byte-order issues (unlike UTF-16)
- Default encoding for HTML5, JSON, and web
- Efficient for ASCII-heavy text
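The self-synchronizing property follows from the fact that continuation bytes always match the pattern 10xxxxxx (0x80-0xBF), so a decoder dropped at an arbitrary byte offset can back up to a character boundary. A minimal Python sketch (the function name `char_start` is illustrative, not from any library):

```python
def char_start(data: bytes, i: int) -> int:
    """Back up from index i to the start of the character containing it.
    Continuation bytes match the pattern 10xxxxxx (0x80-0xBF)."""
    while i > 0 and data[i] & 0xC0 == 0x80:
        i -= 1
    return i

data = "a中b".encode("utf-8")   # bytes: 61 E4 B8 AD 62
# Index 2 lands mid-character (0xB8 is a continuation byte);
# the character 中 starts at byte offset 1.
print(char_start(data, 2))  # -> 1
```

Encodings without this property (such as Shift JIS) force a decoder to rescan from the beginning after any byte error.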
Common Use Cases
- Web page encoding (HTML, CSS, JavaScript)
- JSON and XML data files
- Database text storage
- Email and messaging systems
- Source code files
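For file-based use cases like these, it is good practice to name the encoding explicitly rather than rely on the platform default. A small sketch (the filename `demo.txt` is just an example):

```python
from pathlib import Path

# Write and read back a text file with an explicit UTF-8 encoding.
path = Path("demo.txt")
path.write_text("café 中 😀", encoding="utf-8")
print(path.read_text(encoding="utf-8"))  # -> café 中 😀
```

Passing `encoding="utf-8"` on both sides guarantees the same bytes are written and interpreted regardless of the operating system's locale settings.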
Example
UTF-8 Byte Patterns:

| Bytes | Code Point Range | Bit Pattern |
|---|---|---|
| 1 | U+0000-007F | 0xxxxxxx |
| 2 | U+0080-07FF | 110xxxxx 10xxxxxx |
| 3 | U+0800-FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
| 4 | U+10000-10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
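These bit patterns can be applied directly with shifts and masks. A minimal hand-rolled encoder as a sketch (in practice you would use `str.encode("utf-8")`; the name `utf8_encode` is illustrative):

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single Unicode code point using the UTF-8 bit patterns."""
    if cp < 0x80:                               # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                              # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6,
                      0x80 | cp & 0x3F])
    if cp < 0x10000:                            # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12,
                      0x80 | cp >> 6 & 0x3F,
                      0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18,              # 4 bytes: 11110xxx 10xxxxxx ...
                  0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F,
                  0x80 | cp & 0x3F])

print(utf8_encode(0x1F600).hex(" ").upper())  # -> F0 9F 98 80
```

Each continuation byte takes 6 payload bits (mask 0x3F) under the fixed 10 prefix (0x80), which is why 4 bytes suffice for the full U+10FFFF range: 3 + 6 + 6 + 6 = 21 bits.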
Encoding Examples:

| Char | Code Point | UTF-8 Bytes |
|---|---|---|
| A | U+0041 | 41 |
| é | U+00E9 | C3 A9 |
| 中 | U+4E2D | E4 B8 AD |
| 😀 | U+1F600 | F0 9F 98 80 |
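Each row can be checked by decoding the raw byte sequence back to its character, for example with Python's built-in codec:

```python
# Decode the raw byte sequences from the table back into characters.
rows = [("41", "A"), ("C3 A9", "é"), ("E4 B8 AD", "中"), ("F0 9F 98 80", "😀")]
for hexbytes, expected in rows:
    assert bytes.fromhex(hexbytes).decode("utf-8") == expected
print("all rows verified")  # -> all rows verified
```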
HTML Declaration:
<meta charset="UTF-8">
HTTP Header:
Content-Type: text/html; charset=utf-8