What is UTF-8?
UTF-8 (8-bit Unicode Transformation Format) is a variable-width character encoding that can represent every character in the Unicode standard. It uses 1 to 4 bytes per character and is backward compatible with ASCII.
Quick Facts
| Property | Value |
|---|---|
| Full Name | 8-bit Unicode Transformation Format |
| Created | 1992, by Ken Thompson and Rob Pike |
How It Works
UTF-8 was designed by Ken Thompson and Rob Pike in 1992 and has become the dominant character encoding on the web. Its key innovation is variable-width encoding: ASCII characters (code points 0-127) use just 1 byte, making UTF-8 efficient for English text while still supporting every Unicode character. The high bits of a sequence's first byte indicate how many continuation bytes follow, and every continuation byte begins with the bits 10, which makes UTF-8 self-synchronizing: a decoder can find character boundaries without reading from the start of the stream. UTF-8 is the default encoding for HTML5, JSON, and most modern systems.
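The bit patterns described above can be sketched as a toy encoder in Python. This is illustrative only (it skips validation such as rejecting surrogate code points); `utf8_encode` is a hypothetical helper, not a standard function:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point to UTF-8 bytes (illustrative sketch).

    Does no validation: real encoders must reject surrogates and cp > 0x10FFFF.
    """
    if cp < 0x80:           # 1 byte:  0xxxxxxx
        return bytes([cp])
    elif cp < 0x800:        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    elif cp < 0x10000:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    else:                   # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])

# U+20AC (€) falls in the 3-byte range; the result matches Python's built-in encoder.
print(utf8_encode(0x20AC).hex())        # e282ac
print("€".encode("utf-8").hex())        # e282ac
```

In practice you would always use the built-in `str.encode("utf-8")`; the sketch only makes the lead-byte/continuation-byte structure visible.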
Key Characteristics
- Variable-width: 1-4 bytes per character
- Backward compatible with ASCII (first 128 characters)
- Self-synchronizing encoding
- No byte-order issues (unlike UTF-16)
- Default encoding for HTML5, JSON, and web
- Efficient for ASCII-heavy text
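The self-synchronizing property in the list above follows from the fact that continuation bytes always match the pattern `10xxxxxx`. A minimal Python sketch (the helper name `prev_char_boundary` is my own):

```python
def prev_char_boundary(data: bytes, i: int) -> int:
    """Step backwards from byte index i to the start of the current character.

    Continuation bytes match 10xxxxxx (i.e. byte & 0xC0 == 0x80), so skipping
    them always lands on a lead byte, without decoding from the start.
    """
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

data = "a€b".encode("utf-8")        # bytes: 61 e2 82 ac 62
print(prev_char_boundary(data, 3))  # 1 -> the € sequence starts at offset 1
```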
Common Use Cases
- Web page encoding (HTML, CSS, JavaScript)
- JSON and XML data files
- Database text storage
- Email and messaging systems
- Source code files
Example
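A minimal round-trip example in Python (the sample string is my own choice):

```python
text = "Héllo, 世界! 🌍"
encoded = text.encode("utf-8")     # str -> bytes
# Character count and byte count differ: non-ASCII characters take 2-4 bytes.
print(len(text), len(encoded))
decoded = encoded.decode("utf-8")  # bytes -> str round trip
assert decoded == text
```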
Frequently Asked Questions
What is the difference between UTF-8 and Unicode?
Unicode is a character set that assigns unique code points to every character. UTF-8 is an encoding scheme that converts those Unicode code points into bytes for storage and transmission. UTF-8 is one of several Unicode encodings, alongside UTF-16 and UTF-32.
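The code-point/encoding distinction is easy to see in Python: `ord` gives the Unicode code point, while each encoding turns that same code point into different bytes:

```python
ch = "€"
print(hex(ord(ch)))           # the Unicode code point: 0x20ac
print(ch.encode("utf-8"))     # 3 bytes in UTF-8
print(ch.encode("utf-16-be")) # 2 bytes in UTF-16 (big-endian)
print(ch.encode("utf-32-be")) # 4 bytes in UTF-32 (big-endian)
```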
Why is UTF-8 the most popular encoding on the web?
UTF-8 is popular because it's backward compatible with ASCII, efficient for English text (1 byte per character), supports all Unicode characters, has no byte-order issues, and is self-synchronizing for error recovery.
How many bytes does UTF-8 use per character?
UTF-8 uses variable-width encoding: 1 byte for ASCII (U+0000-007F), 2 bytes for Latin/Greek/Cyrillic (U+0080-07FF), 3 bytes for most other characters including CJK (U+0800-FFFF), and 4 bytes for emoji and rare characters (U+10000-10FFFF).
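The four ranges can be checked directly in Python with one character from each:

```python
# One sample character per UTF-8 length class.
for ch in ("A", "é", "中", "😀"):
    b = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch!r}: {len(b)} byte(s) -> {b.hex()}")
```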
What is the difference between UTF-8 and UTF-16?
UTF-8 uses 1-4 bytes per character and is ASCII-compatible, while UTF-16 uses 2 or 4 bytes and is byte-order sensitive, so files typically signal endianness with a byte-order mark (BOM). UTF-8 is more efficient for ASCII-heavy content, while UTF-16 can be more compact for CJK text.
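The efficiency trade-off is easy to measure in Python (sample strings are my own; `utf-16-le` is used so no BOM is counted):

```python
ascii_text = "hello world"      # ASCII-only sample
cjk_text = "こんにちは世界"       # CJK sample, 7 characters
for label, s in (("ASCII", ascii_text), ("CJK", cjk_text)):
    # ASCII: 1 byte/char in UTF-8 vs 2 in UTF-16.
    # CJK:   3 bytes/char in UTF-8 vs 2 in UTF-16.
    print(label, len(s.encode("utf-8")), len(s.encode("utf-16-le")))
```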
How do I detect if a file is UTF-8 encoded?
UTF-8 files can optionally start with a BOM (EF BB BF), but this is discouraged. Detection usually involves checking for valid UTF-8 byte sequences or relying on metadata like HTTP headers, HTML charset declarations, or file system attributes.
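A common detection heuristic in Python is simply to attempt a strict decode, since invalid byte sequences raise an error. This is a sketch, not a full charset detector (pure ASCII or some Latin-1 data can also pass the check):

```python
BOM_UTF8 = b"\xef\xbb\xbf"

def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: strict decoding succeeds only on valid UTF-8 sequences."""
    try:
        data.decode("utf-8")  # default errors="strict" rejects bad sequences
        return True
    except UnicodeDecodeError:
        return False

sample = "naïve".encode("utf-8")
print(sample.startswith(BOM_UTF8))   # False: no BOM, which is the common case
print(looks_like_utf8(sample))       # True
print(looks_like_utf8(b"\xff\xfe"))  # False: invalid in UTF-8 (a UTF-16 BOM)
```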