What is UTF-8?

UTF-8 (8-bit Unicode Transformation Format) is a variable-width character encoding that can represent every character in the Unicode standard. It uses 1 to 4 bytes per character and is backward compatible with ASCII.

Quick Facts

Full Name: 8-bit Unicode Transformation Format
Created: 1992 by Ken Thompson and Rob Pike
Specification: Official Specification

How UTF-8 Works

UTF-8 was designed by Ken Thompson and Rob Pike in 1992 and has become the dominant character encoding for the web. Its key innovation is variable-width encoding: ASCII characters (code points 0-127) use just 1 byte, so UTF-8 is efficient for English text while still supporting all Unicode characters.

The high bits of the first byte of each character indicate how many bytes the character occupies, and every following (continuation) byte starts with the bits 10. Because lead bytes and continuation bytes never look alike, UTF-8 is self-synchronizing: you can find character boundaries from any position without reading from the start. It's the default encoding for HTML5, JSON, and most modern systems.
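
The self-synchronizing property can be illustrated with a short Python sketch. The helper names (sequence_length, is_continuation, char_start) are made up for this example, and the code only inspects the prefix bits: it does not reject bytes such as 0xC0 or 0xF5 that never occur in valid UTF-8, so it should be read as an illustration rather than a validator.

```python
def sequence_length(lead: int) -> int:
    """Return the total byte length of a sequence, judged from its first byte."""
    if lead & 0x80 == 0x00:   # 0xxxxxxx: single-byte (ASCII)
        return 1
    if lead & 0xE0 == 0xC0:   # 110xxxxx: starts a 2-byte sequence
        return 2
    if lead & 0xF0 == 0xE0:   # 1110xxxx: starts a 3-byte sequence
        return 3
    if lead & 0xF8 == 0xF0:   # 11110xxx: starts a 4-byte sequence
        return 4
    raise ValueError(f"0x{lead:02X} is not a lead byte")


def is_continuation(byte: int) -> bool:
    """Continuation bytes always have the form 10xxxxxx."""
    return byte & 0xC0 == 0x80


def char_start(data: bytes, index: int) -> int:
    """Step back from an arbitrary offset to the first byte of its character."""
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index


data = "a\u00e9\u4e2d\U0001F600".encode("utf-8")  # 1 + 2 + 3 + 4 = 10 bytes

start = char_start(data, 8)            # offset 8 is in the middle of the emoji
length = sequence_length(data[start])  # read the length from the lead byte
print(start, length)
print(data[start:start + length].decode("utf-8"))
```

Starting from offset 8, the loop skips back over at most three continuation bytes before it reaches a lead byte, which is why no scan from the beginning of the buffer is needed.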

Key Characteristics

  • Variable-width: 1-4 bytes per character
  • Backward compatible with ASCII (first 128 characters)
  • Self-synchronizing encoding
  • No byte-order issues (unlike UTF-16)
  • Default encoding for HTML5, JSON, and the web
  • Efficient for ASCII-heavy text
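
A few of these characteristics can be checked with Python's built-in codecs. This is a small sketch; the sample string is arbitrary and chosen only because it is pure ASCII.

```python
text = "Hello, world"  # 12 ASCII characters

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")  # little-endian, no byte-order mark
utf16_be = text.encode("utf-16-be")  # big-endian, no byte-order mark

print(len(utf8))                     # 12: one byte per ASCII character
print(len(utf16_le))                 # 24: two bytes per character
print(utf8 == text.encode("ascii"))  # True: ASCII text is already valid UTF-8
print(utf16_le == utf16_be)          # False: UTF-16 output depends on byte order
```

UTF-8 has no little-endian or big-endian variant because it is defined as a sequence of single bytes, so the same text always produces the same bytes.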

Common Use Cases

  1. Web page encoding (HTML, CSS, JavaScript)
  2. JSON and XML data files
  3. Database text storage
  4. Email and messaging systems
  5. Source code files

Example

UTF-8 Byte Patterns:

Bytes  Range           Pattern
1      U+0000-007F     0xxxxxxx
2      U+0080-07FF     110xxxxx 10xxxxxx
3      U+0800-FFFF     1110xxxx 10xxxxxx 10xxxxxx
4      U+10000-10FFFF  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
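
The table can be turned into a small encoder. The following Python sketch uses a hypothetical function name, encode_code_point, and exists only to show how the payload bits are distributed; in practice Python's built-in str.encode("utf-8") does this work.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using the byte patterns above."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogates are reserved for UTF-16 and are not valid in UTF-8.
        raise ValueError("surrogate code points cannot be encoded")

    if cp <= 0x7F:      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(encode_code_point(0x4E2D).hex(" ").upper())  # E4 B8 AD
```

Each continuation byte carries 6 payload bits (cp & 0x3F), and the lead byte carries whatever bits remain after shifting, which is why the range boundaries fall at 7, 11, 16, and 21 bits.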

Encoding Examples:

Char  Unicode   UTF-8 Bytes
A     U+0041    41
é     U+00E9    C3 A9
中    U+4E2D    E4 B8 AD
😀    U+1F600   F0 9F 98 80
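
The rows above can be reproduced with Python's built-in codec. The helper name utf8_hex is made up for this sketch, and bytes.hex with a separator argument assumes Python 3.8 or later.

```python
def utf8_hex(char: str) -> str:
    """Return the UTF-8 bytes of a string as space-separated uppercase hex."""
    return char.encode("utf-8").hex(" ").upper()


for char in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    print(f"U+{ord(char):04X}  {utf8_hex(char)}")

# Decoding goes the other way: bytes back to text.
print(bytes.fromhex("E4 B8 AD").decode("utf-8") == "\u4e2d")  # True
```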

HTML Declaration:
<meta charset="UTF-8">

HTTP Header:
Content-Type: text/html; charset=utf-8
