What is UTF-8?

UTF-8 (8-bit Unicode Transformation Format) is a variable-width character encoding that can represent every character in the Unicode standard. It uses 1 to 4 bytes per character and is backward compatible with ASCII, meaning any valid ASCII document is also valid UTF-8. UTF-8 achieves this through a clever byte-prefix system: single-byte characters start with 0, two-byte sequences start with 110, three-byte with 1110, and four-byte with 11110, with continuation bytes always starting with 10 — enabling self-synchronization and error recovery at any point in a byte stream.
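
The byte-prefix system described above can be inspected directly. This is a small illustrative sketch in Python (assuming a Python 3 interpreter), printing each byte of a 1-, 2-, 3-, and 4-byte character in binary so the 0, 110, 1110, 11110, and 10 prefixes are visible:

```python
# Print the UTF-8 bytes of sample characters in binary.
# 'A' is 1 byte, 'é' is 2, '€' is 3, '😀' is 4.
for ch in ("A", "é", "€", "😀"):
    bits = [f"{b:08b}" for b in ch.encode("utf-8")]
    print(ch, bits)
```

Note how every byte after the first in a multi-byte sequence begins with 10, which is what makes mid-stream resynchronization possible.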

Quick Facts

Full Name: 8-bit Unicode Transformation Format
Created: 1992 by Ken Thompson and Rob Pike
Specification: RFC 3629

How It Works

UTF-8 was designed by Ken Thompson and Rob Pike in September 1992 at Bell Labs and has become the dominant character encoding for the World Wide Web, used by over 98% of all websites as of 2024.

Its key innovation is variable-width encoding: ASCII characters (U+0000 to U+007F) use just 1 byte; Latin, Greek, and Cyrillic scripts (U+0080 to U+07FF) use 2 bytes; most CJK characters and common symbols (U+0800 to U+FFFF) use 3 bytes; and emoji and supplementary characters (U+10000 to U+10FFFF) use 4 bytes. This makes UTF-8 highly efficient for ASCII-heavy content such as source code and English text while still supporting the full Unicode repertoire of over 149,000 characters. UTF-8 is also self-synchronizing: you can jump to any byte in a stream and find the next character boundary by looking for a byte that doesn't start with the bits 10, which makes error recovery straightforward. Unlike UTF-16 and UTF-32, UTF-8 has no byte-order issues (no BOM required) and produces no null bytes for ASCII text, keeping it compatible with C string functions.

UTF-8 is the mandatory encoding for JSON (RFC 8259), the default for HTML5, the required encoding for YAML, TOML, and Rust source files, and the recommended encoding for XML, HTTP headers, and email (MIME). Compared to ASCII (which supports only 128 characters), UTF-8 extends coverage to all of Unicode while remaining backward compatible. Compared to UTF-16 (used internally by JavaScript, Java, and Windows), UTF-8 is more space-efficient for Latin-script text and avoids surrogate-pair complications. Compared to UTF-32 (a fixed 4 bytes per character), UTF-8 is far more space-efficient at the cost of variable-width complexity.
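
The ASCII-compatibility and no-null-bytes properties are easy to verify. A minimal check in Python (assuming a Python 3 interpreter):

```python
# ASCII backward compatibility: any ASCII text encodes to the
# exact same bytes in UTF-8, with no null bytes and no BOM.
text = "Hello, UTF-8!"
utf8_bytes = text.encode("utf-8")
ascii_bytes = text.encode("ascii")
assert utf8_bytes == ascii_bytes   # byte-for-byte identical
assert b"\x00" not in utf8_bytes   # safe for C string functions
print(utf8_bytes)
```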

Key Characteristics

  • Variable-width: 1-4 bytes per character
  • Backward compatible with ASCII (first 128 characters)
  • Self-synchronizing encoding
  • No byte-order issues (unlike UTF-16)
  • Default encoding for HTML5, JSON, and web
  • Efficient for ASCII-heavy text
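
The self-synchronizing property can be sketched as a boundary scan. This is an illustrative Python helper (the function name `next_boundary` is hypothetical, not from any library), which skips continuation bytes (pattern 10xxxxxx) to find the next character start:

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Advance pos to the next character boundary by skipping
    continuation bytes (those matching the bit pattern 10xxxxxx)."""
    while pos < len(data) and (data[pos] & 0b1100_0000) == 0b1000_0000:
        pos += 1
    return pos

data = "héllo".encode("utf-8")   # b'h\xc3\xa9llo'
# Byte 2 is the continuation byte of 'é'; the next boundary is byte 3 ('l').
print(next_boundary(data, 2))    # 3
```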

Common Use Cases

  1. Web page encoding: the default and recommended encoding for HTML5, CSS, and JavaScript files — declared via <meta charset="UTF-8"> or HTTP Content-Type header
  2. JSON data format: UTF-8 is the only encoding allowed by RFC 8259, making it mandatory for all JSON documents, APIs, and configuration files
  3. Database text storage: modern databases (PostgreSQL, MySQL with utf8mb4, MongoDB, SQLite) use UTF-8 to store multilingual text, emoji, and special characters correctly
  4. Email and messaging (MIME): UTF-8 is the standard encoding for email bodies and headers, replacing legacy encodings like ISO-8859-1 and Windows-1252
  5. Source code files: most programming languages (Python 3, Rust, Go, Ruby) default to UTF-8 for source files, and major style guides mandate it
  6. Version control and file systems: Git treats files as UTF-8 by default, and modern operating systems (Linux, macOS) use UTF-8 as their native filesystem encoding
  7. Internationalization (i18n): enabling applications to support users worldwide with a single encoding that handles Latin, Cyrillic, Arabic, CJK, Devanagari, emoji, and all other Unicode scripts

Example

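
A minimal, self-contained illustration in Python (assuming a Python 3 interpreter) that round-trips a multilingual string through UTF-8:

```python
# Round-trip: encode a multilingual string to UTF-8 bytes and back.
text = "café 日本語 😀"
encoded = text.encode("utf-8")
decoded = encoded.decode("utf-8")
assert decoded == text
print(len(text), "characters,", len(encoded), "bytes")  # 10 characters, 20 bytes
```

The 10 characters expand to 20 bytes: the ASCII letters and spaces take 1 byte each, é takes 2, each CJK character takes 3, and the emoji takes 4.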

Frequently Asked Questions

What is the difference between UTF-8 and Unicode?

Unicode is a character set that assigns unique code points to every character. UTF-8 is an encoding scheme that converts those Unicode code points into bytes for storage and transmission. UTF-8 is one of several Unicode encodings, alongside UTF-16 and UTF-32.
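
The code-point/encoding distinction can be shown in a couple of lines of Python (a sketch, assuming a Python 3 interpreter):

```python
# A Unicode code point is a number; UTF-8 is one way to
# serialize that number into bytes.
ch = "€"
print(hex(ord(ch)))                  # 0x20ac  (the Unicode code point)
print(ch.encode("utf-8").hex())      # e282ac  (its UTF-8 byte sequence)
print(ch.encode("utf-16-be").hex())  # 20ac    (same code point in UTF-16)
```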

Why is UTF-8 the most popular encoding on the web?

UTF-8 is popular because it's backward compatible with ASCII, efficient for English text (1 byte per character), supports all Unicode characters, has no byte-order issues, and is self-synchronizing for error recovery.

How many bytes does UTF-8 use per character?

UTF-8 uses variable-width encoding: 1 byte for ASCII (U+0000-007F), 2 bytes for Latin/Greek/Cyrillic (U+0080-07FF), 3 bytes for most other characters including CJK (U+0800-FFFF), and 4 bytes for emoji and rare characters (U+10000-10FFFF).
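
One sample character from each width class, checked in Python (assuming a Python 3 interpreter):

```python
# Byte lengths for characters from each UTF-8 width class.
samples = {"A": 1, "ñ": 2, "中": 3, "🚀": 4}
for ch, expected in samples.items():
    n = len(ch.encode("utf-8"))
    assert n == expected
    print(f"{ch!r} U+{ord(ch):04X} -> {n} byte(s)")
```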

What is the difference between UTF-8 and UTF-16?

UTF-8 uses 1-4 bytes per character and is ASCII-compatible, while UTF-16 uses 2 or 4 bytes and needs a byte-order mark (BOM) or an external declaration to signal endianness. UTF-8 is more efficient for ASCII-heavy content, while UTF-16 can be more compact for CJK text (2 bytes versus UTF-8's 3 for most CJK characters).
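
The size trade-off is easy to measure. A quick comparison in Python (assuming a Python 3 interpreter):

```python
# Size comparison: UTF-8 wins for ASCII text, UTF-16 for CJK.
english = "hello world" * 10    # 110 ASCII characters
japanese = "こんにちは" * 10     # 50 CJK characters
print(len(english.encode("utf-8")), len(english.encode("utf-16-le")))    # 110 220
print(len(japanese.encode("utf-8")), len(japanese.encode("utf-16-le")))  # 150 100
```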

How do I detect if a file is UTF-8 encoded?

UTF-8 files can optionally start with a BOM (EF BB BF), but this is discouraged. Detection usually involves checking for valid UTF-8 byte sequences or relying on metadata like HTTP headers, HTML charset declarations, or file system attributes.
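
A common heuristic is simply to attempt a strict decode. An illustrative sketch in Python (the helper name `looks_like_utf8` is hypothetical):

```python
# A simple UTF-8 validity check: attempt a strict decode.
def looks_like_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8("héllo".encode("utf-8")))    # True
print(looks_like_utf8("héllo".encode("latin-1")))  # False (bare 0xE9 byte)
```

Note this only proves the bytes are *valid* UTF-8, not that UTF-8 was the intended encoding; pure ASCII, for instance, is valid in many encodings at once.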

What is the difference between MySQL utf8 and utf8mb4?

MySQL's legacy 'utf8' charset (aliased as utf8mb3) only supports up to 3-byte UTF-8 sequences, meaning it cannot store characters above U+FFFF including emoji and some rare CJK characters. 'utf8mb4' is the true UTF-8 implementation supporting the full 4-byte range. Always use utf8mb4 for new MySQL databases to ensure complete Unicode support, and set the collation to utf8mb4_unicode_ci or utf8mb4_0900_ai_ci for correct sorting.

Why do some UTF-8 files have a BOM and should I use one?

The UTF-8 BOM (Byte Order Mark, bytes EF BB BF) is an optional signature at the start of a file that explicitly identifies it as UTF-8. Unlike UTF-16 where the BOM indicates byte order, UTF-8 has no byte-order ambiguity, so the BOM is unnecessary. Most style guides and standards (including JSON RFC 8259 and the W3C) recommend against using a UTF-8 BOM, as it can cause issues with Unix tools, shell scripts, and some parsers. However, some Windows applications (like Notepad) add it by default.

How does UTF-8 handle invalid byte sequences?

When a decoder encounters invalid UTF-8 bytes (e.g., unexpected continuation bytes, overlong encodings, or sequences encoding values above U+10FFFF), behavior depends on the implementation. The Unicode standard recommends replacing each maximal invalid subsequence with the U+FFFD replacement character (�). Python uses 'strict' mode by default (raising UnicodeDecodeError) but supports 'replace', 'ignore', and 'surrogateescape' error handlers. Web browsers typically use U+FFFD replacement for robustness.
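
The error handlers mentioned above can be compared side by side in Python (assuming a Python 3 interpreter):

```python
# Error-handling strategies for invalid UTF-8 input.
bad = b"abc\xff\xfedef"  # 0xFF and 0xFE can never appear in valid UTF-8
print(bad.decode("utf-8", errors="replace"))  # 'abc\ufffd\ufffddef'
print(bad.decode("utf-8", errors="ignore"))   # 'abcdef'
try:
    bad.decode("utf-8")                       # default is errors='strict'
except UnicodeDecodeError as e:
    print("strict mode raised:", e.reason)
```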
