What is UTF-8?

UTF-8 (8-bit Unicode Transformation Format) is a variable-width character encoding that can represent every character in the Unicode standard. It uses 1 to 4 bytes per character and is backward compatible with ASCII.

Quick Facts

  • Full Name: 8-bit Unicode Transformation Format
  • Created: 1992, by Ken Thompson and Rob Pike
  • Specification: RFC 3629

How It Works

UTF-8 was designed by Ken Thompson and Rob Pike in 1992 and has become the dominant character encoding on the web. Its key innovation is variable-width encoding: ASCII characters (0-127) use just 1 byte, making UTF-8 efficient for English text while still supporting every Unicode character. Each multi-byte character begins with a lead byte whose bit pattern indicates how many continuation bytes follow. UTF-8 is also self-synchronizing: a decoder can find character boundaries from any position without reading from the start. It is the default encoding for HTML5, JSON, and most modern systems.
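The lead-byte bit patterns can be sketched in a few lines of Python. The helper `utf8_seq_len` is illustrative, not part of any standard library:

```python
def utf8_seq_len(first_byte: int) -> int:
    """Return the length of a UTF-8 sequence, read from its first byte."""
    if first_byte < 0x80:           # 0xxxxxxx -> 1-byte ASCII
        return 1
    if first_byte >> 5 == 0b110:    # 110xxxxx -> 2-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:   # 1110xxxx -> 3-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:  # 11110xxx -> 4-byte sequence
        return 4
    raise ValueError("continuation byte (10xxxxxx) or invalid lead byte")

# The lead byte alone predicts the full sequence length.
for ch in "Aé中😀":
    encoded = ch.encode("utf-8")
    assert utf8_seq_len(encoded[0]) == len(encoded)
```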

Key Characteristics

  • Variable-width: 1-4 bytes per character
  • Backward compatible with ASCII (first 128 characters)
  • Self-synchronizing encoding
  • No byte-order issues (unlike UTF-16)
  • Default encoding for HTML5, JSON, and web
  • Efficient for ASCII-heavy text
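The self-synchronizing property above follows from the byte layout: continuation bytes always match `10xxxxxx` and lead bytes never do, so a decoder dropped at an arbitrary offset can resync by skipping continuation bytes. A minimal sketch in Python (`next_boundary` is an illustrative helper):

```python
def next_boundary(buf: bytes, i: int) -> int:
    """Advance i to the next character boundary in UTF-8 bytes."""
    while i < len(buf) and (buf[i] & 0b11000000) == 0b10000000:
        i += 1  # skip continuation bytes (10xxxxxx)
    return i

data = "héllo".encode("utf-8")  # b'h\xc3\xa9llo'
# Byte 2 is the continuation byte of 'é'; the next boundary is byte 3 ('l').
assert next_boundary(data, 2) == 3
```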

Common Use Cases

  1. Web page encoding (HTML, CSS, JavaScript)
  2. JSON and XML data files
  3. Database text storage
  4. Email and messaging systems
  5. Source code files

Example

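A minimal Python round trip through UTF-8, showing how code-point count and byte count differ for mixed-script text:

```python
# Encode a mixed-script string to UTF-8 bytes and back.
text = "Hello, 世界! 🌍"
encoded = text.encode("utf-8")     # str -> bytes
decoded = encoded.decode("utf-8")  # bytes -> str

assert decoded == text
assert len(text) == 12     # 12 Unicode code points...
assert len(encoded) == 19  # ...but 19 bytes: ASCII 1 each, CJK 3 each, emoji 4
```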

Frequently Asked Questions

What is the difference between UTF-8 and Unicode?

Unicode is a character set that assigns unique code points to every character. UTF-8 is an encoding scheme that converts those Unicode code points into bytes for storage and transmission. UTF-8 is one of several Unicode encodings, alongside UTF-16 and UTF-32.
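The distinction is easy to see in Python: the code point is just a number, and each Unicode encoding serializes that same number to different bytes:

```python
ch = "€"                    # EURO SIGN
codepoint = ord(ch)         # Unicode assigns the code point...
assert codepoint == 0x20AC  # ...U+20AC

# ...and each UTF encoding turns that number into different bytes.
assert ch.encode("utf-8") == b"\xe2\x82\xac"          # 3 bytes, variable width
assert ch.encode("utf-32-be") == b"\x00\x00\x20\xac"  # 4 bytes, fixed width
```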

Why is UTF-8 the most popular encoding on the web?

UTF-8 is popular because it's backward compatible with ASCII, efficient for English text (1 byte per character), supports all Unicode characters, has no byte-order issues, and is self-synchronizing for error recovery.

How many bytes does UTF-8 use per character?

UTF-8 uses variable-width encoding: 1 byte for ASCII (U+0000-007F), 2 bytes for Latin/Greek/Cyrillic (U+0080-07FF), 3 bytes for most other characters including CJK (U+0800-FFFF), and 4 bytes for emoji and rare characters (U+10000-10FFFF).
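One sample character from each width tier confirms these ranges:

```python
# One character from each UTF-8 width tier.
samples = {
    "A": 1,   # U+0041, ASCII (U+0000-007F)
    "ß": 2,   # U+00DF, Latin-1 Supplement (U+0080-07FF)
    "中": 3,  # U+4E2D, CJK (U+0800-FFFF)
    "😀": 4,  # U+1F600, emoji (U+10000-10FFFF)
}
for ch, expected_bytes in samples.items():
    assert len(ch.encode("utf-8")) == expected_bytes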

What is the difference between UTF-8 and UTF-16?

UTF-8 uses 1-4 bytes per character and is ASCII-compatible, while UTF-16 uses 2 or 4 bytes and is byte-order sensitive, so streams often begin with a byte-order mark (BOM) to signal endianness. UTF-8 is more compact for ASCII-heavy content, while UTF-16 can be more compact for CJK text.
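The size trade-off is easy to measure (UTF-16-LE is used here to keep the comparison BOM-free):

```python
ascii_text = "hello" * 100   # 500 code points, all ASCII
cjk_text = "你好世界" * 100    # 400 code points, all CJK

# UTF-8 wins on ASCII (1 byte vs 2); UTF-16 wins on CJK (2 bytes vs 3).
assert len(ascii_text.encode("utf-8")) == 500
assert len(ascii_text.encode("utf-16-le")) == 1000
assert len(cjk_text.encode("utf-8")) == 1200
assert len(cjk_text.encode("utf-16-le")) == 800
```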

How do I detect if a file is UTF-8 encoded?

UTF-8 files can optionally start with a BOM (EF BB BF), but this is discouraged. Detection usually involves checking for valid UTF-8 byte sequences or relying on metadata like HTTP headers, HTML charset declarations, or file system attributes.
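A simple detection sketch in Python, combining the BOM check with strict decoding (`looks_like_utf8` is an illustrative helper, not a library function):

```python
import codecs

def looks_like_utf8(data: bytes) -> bool:
    """Heuristic check: UTF-8 BOM present, or strictly valid UTF-8 bytes."""
    if data.startswith(codecs.BOM_UTF8):  # EF BB BF (optional, discouraged)
        return True
    try:
        data.decode("utf-8")              # strict sequence validation
        return True
    except UnicodeDecodeError:
        return False

assert looks_like_utf8("héllo".encode("utf-8"))
assert not looks_like_utf8("héllo".encode("latin-1"))  # lone 0xE9 is invalid UTF-8
```

Note the limits of the heuristic: pure-ASCII bytes pass under any ASCII-compatible encoding, so a positive result means "valid as UTF-8", not "was written as UTF-8".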
