What is UTF-8?

UTF-8 (8-bit Unicode Transformation Format) is a variable-width character encoding that can represent every character in the Unicode standard. It uses 1 to 4 bytes per character and is backward compatible with ASCII, meaning any valid ASCII document is also valid UTF-8. UTF-8 achieves this through a clever byte-prefix system: single-byte characters start with 0, two-byte sequences start with 110, three-byte with 1110, and four-byte with 11110, with continuation bytes always starting with 10 — enabling self-synchronization and error recovery at any point in a byte stream.
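
The byte-prefix system described above can be inspected directly. This is a small illustrative sketch in Python (assuming a Python 3 interpreter), printing each byte of a 1-, 2-, 3-, and 4-byte character in binary so the 0, 110, 1110, 11110, and 10 prefixes are visible:

```python
# Print the UTF-8 bytes of sample characters in binary.
# 'A' is 1 byte, 'é' is 2, '€' is 3, '😀' is 4.
for ch in ("A", "é", "€", "😀"):
    bits = [f"{b:08b}" for b in ch.encode("utf-8")]
    print(ch, bits)
```

Note how every byte after the first in a multi-byte sequence begins with 10, which is what makes mid-stream resynchronization possible.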

Quick Facts

Full Name: 8-bit Unicode Transformation Format
Created: 1992 by Ken Thompson and Rob Pike
Specification: RFC 3629

How It Works

UTF-8 was designed by Ken Thompson and Rob Pike in September 1992 at Bell Labs and has become the dominant character encoding for the World Wide Web, used by over 98% of all websites as of 2024.

Its key innovation is variable-width encoding: ASCII characters (U+0000 to U+007F) use just 1 byte; Latin, Greek, and Cyrillic scripts (U+0080 to U+07FF) use 2 bytes; most CJK characters and common symbols (U+0800 to U+FFFF) use 3 bytes; and emoji and supplementary characters (U+10000 to U+10FFFF) use 4 bytes. This makes UTF-8 highly efficient for ASCII-heavy content such as source code and English text while still supporting the full Unicode repertoire of over 149,000 characters. UTF-8 is also self-synchronizing: you can jump to any byte in a stream and find the next character boundary by looking for a byte that doesn't start with the bits 10, which makes error recovery straightforward. Unlike UTF-16 and UTF-32, UTF-8 has no byte-order issues (no BOM required) and produces no null bytes for ASCII text, keeping it compatible with C string functions.

UTF-8 is the mandatory encoding for JSON (RFC 8259), the default for HTML5, the required encoding for YAML, TOML, and Rust source files, and the recommended encoding for XML, HTTP headers, and email (MIME). Compared to ASCII (which supports only 128 characters), UTF-8 extends coverage to all of Unicode while remaining backward compatible. Compared to UTF-16 (used internally by JavaScript, Java, and Windows), UTF-8 is more space-efficient for Latin-script text and avoids surrogate-pair complications. Compared to UTF-32 (a fixed 4 bytes per character), UTF-8 is far more space-efficient at the cost of variable-width complexity.
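
The ASCII-compatibility and no-null-bytes properties are easy to verify. A minimal check in Python (assuming a Python 3 interpreter):

```python
# ASCII backward compatibility: any ASCII text encodes to the
# exact same bytes in UTF-8, with no null bytes and no BOM.
text = "Hello, UTF-8!"
utf8_bytes = text.encode("utf-8")
ascii_bytes = text.encode("ascii")
assert utf8_bytes == ascii_bytes   # byte-for-byte identical
assert b"\x00" not in utf8_bytes   # safe for C string functions
print(utf8_bytes)
```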

Key Characteristics

  • Variable-width: 1-4 bytes per character
  • Backward compatible with ASCII (first 128 characters)
  • Self-synchronizing encoding
  • No byte-order issues (unlike UTF-16)
  • Default encoding for HTML5, JSON, and web
  • Efficient for ASCII-heavy text
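
The self-synchronizing property can be sketched as a boundary scan. This is an illustrative Python helper (the function name `next_boundary` is hypothetical, not from any library), which skips continuation bytes (pattern 10xxxxxx) to find the next character start:

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Advance pos to the next character boundary by skipping
    continuation bytes (those matching the bit pattern 10xxxxxx)."""
    while pos < len(data) and (data[pos] & 0b1100_0000) == 0b1000_0000:
        pos += 1
    return pos

data = "héllo".encode("utf-8")   # b'h\xc3\xa9llo'
# Byte 2 is the continuation byte of 'é'; the next boundary is byte 3 ('l').
print(next_boundary(data, 2))    # 3
```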

Common Use Cases

  1. Web page encoding: the default and recommended encoding for HTML5, CSS, and JavaScript files — declared via <meta charset="UTF-8"> or HTTP Content-Type header
  2. JSON data format: UTF-8 is the only encoding allowed by RFC 8259, making it mandatory for all JSON documents, APIs, and configuration files
  3. Database text storage: modern databases (PostgreSQL, MySQL with utf8mb4, MongoDB, SQLite) use UTF-8 to store multilingual text, emoji, and special characters correctly
  4. Email and messaging (MIME): UTF-8 is the standard encoding for email bodies and headers, replacing legacy encodings like ISO-8859-1 and Windows-1252
  5. Source code files: most programming languages (Python 3, Rust, Go, Ruby) default to UTF-8 for source files, and major style guides mandate it
  6. Version control and file systems: Git treats files as UTF-8 by default, and modern operating systems (Linux, macOS) use UTF-8 as their native filesystem encoding
  7. Internationalization (i18n): enabling applications to support users worldwide with a single encoding that handles Latin, Cyrillic, Arabic, CJK, Devanagari, emoji, and all other Unicode scripts

Example

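
A minimal, self-contained illustration in Python (assuming a Python 3 interpreter) that round-trips a multilingual string through UTF-8:

```python
# Round-trip: encode a multilingual string to UTF-8 bytes and back.
text = "café 日本語 😀"
encoded = text.encode("utf-8")
decoded = encoded.decode("utf-8")
assert decoded == text
print(len(text), "characters,", len(encoded), "bytes")  # 10 characters, 20 bytes
```

The 10 characters expand to 20 bytes: the ASCII letters and spaces take 1 byte each, é takes 2, each CJK character takes 3, and the emoji takes 4.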

Frequently Asked Questions

What is the difference between UTF-8 and Unicode?

Unicode is a character set that assigns unique code points to every character. UTF-8 is an encoding scheme that converts those Unicode code points into bytes for storage and transmission. UTF-8 is one of several Unicode encodings, alongside UTF-16 and UTF-32.
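
The code-point/encoding distinction can be shown in a couple of lines of Python (a sketch, assuming a Python 3 interpreter):

```python
# A Unicode code point is a number; UTF-8 is one way to
# serialize that number into bytes.
ch = "€"
print(hex(ord(ch)))                  # 0x20ac  (the Unicode code point)
print(ch.encode("utf-8").hex())      # e282ac  (its UTF-8 byte sequence)
print(ch.encode("utf-16-be").hex())  # 20ac    (same code point in UTF-16)
```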

Why is UTF-8 the most popular encoding on the web?

UTF-8 is popular because it's backward compatible with ASCII, efficient for English text (1 byte per character), supports all Unicode characters, has no byte-order issues, and is self-synchronizing for error recovery.

How many bytes does UTF-8 use per character?

UTF-8 uses variable-width encoding: 1 byte for ASCII (U+0000-007F), 2 bytes for Latin/Greek/Cyrillic (U+0080-07FF), 3 bytes for most other characters including CJK (U+0800-FFFF), and 4 bytes for emoji and rare characters (U+10000-10FFFF).
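
One sample character from each width class, checked in Python (assuming a Python 3 interpreter):

```python
# Byte lengths for characters from each UTF-8 width class.
samples = {"A": 1, "ñ": 2, "中": 3, "🚀": 4}
for ch, expected in samples.items():
    n = len(ch.encode("utf-8"))
    assert n == expected
    print(f"{ch!r} U+{ord(ch):04X} -> {n} byte(s)")
```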

What is the difference between UTF-8 and UTF-16?

UTF-8 uses 1-4 bytes per character and is ASCII-compatible, while UTF-16 uses 2 or 4 bytes and needs a byte-order mark (BOM) or an external declaration to signal endianness. UTF-8 is more efficient for ASCII-heavy content, while UTF-16 can be more compact for CJK text (2 bytes versus UTF-8's 3 for most CJK characters).
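
The size trade-off is easy to measure. A quick comparison in Python (assuming a Python 3 interpreter):

```python
# Size comparison: UTF-8 wins for ASCII text, UTF-16 for CJK.
english = "hello world" * 10    # 110 ASCII characters
japanese = "こんにちは" * 10     # 50 CJK characters
print(len(english.encode("utf-8")), len(english.encode("utf-16-le")))    # 110 220
print(len(japanese.encode("utf-8")), len(japanese.encode("utf-16-le")))  # 150 100
```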

How do I detect if a file is UTF-8 encoded?

UTF-8 files can optionally start with a BOM (EF BB BF), but this is discouraged. Detection usually involves checking for valid UTF-8 byte sequences or relying on metadata like HTTP headers, HTML charset declarations, or file system attributes.
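
A common heuristic is simply to attempt a strict decode. An illustrative sketch in Python (the helper name `looks_like_utf8` is hypothetical):

```python
# A simple UTF-8 validity check: attempt a strict decode.
def looks_like_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8("héllo".encode("utf-8")))    # True
print(looks_like_utf8("héllo".encode("latin-1")))  # False (bare 0xE9 byte)
```

Note this only proves the bytes are *valid* UTF-8, not that UTF-8 was the intended encoding; pure ASCII, for instance, is valid in many encodings at once.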

What is the difference between MySQL utf8 and utf8mb4?

MySQL's legacy 'utf8' charset (aliased as utf8mb3) only supports up to 3-byte UTF-8 sequences, meaning it cannot store characters above U+FFFF including emoji and some rare CJK characters. 'utf8mb4' is the true UTF-8 implementation supporting the full 4-byte range. Always use utf8mb4 for new MySQL databases to ensure complete Unicode support, and set the collation to utf8mb4_unicode_ci or utf8mb4_0900_ai_ci for correct sorting.

Why do some UTF-8 files have a BOM and should I use one?

The UTF-8 BOM (Byte Order Mark, bytes EF BB BF) is an optional signature at the start of a file that explicitly identifies it as UTF-8. Unlike UTF-16 where the BOM indicates byte order, UTF-8 has no byte-order ambiguity, so the BOM is unnecessary. Most style guides and standards (including JSON RFC 8259 and the W3C) recommend against using a UTF-8 BOM, as it can cause issues with Unix tools, shell scripts, and some parsers. However, some Windows applications (like Notepad) add it by default.

How does UTF-8 handle invalid byte sequences?

When a decoder encounters invalid UTF-8 bytes (e.g., unexpected continuation bytes, overlong encodings, or sequences encoding values above U+10FFFF), behavior depends on the implementation. The Unicode standard recommends replacing each maximal invalid subsequence with the U+FFFD replacement character (�). Python uses 'strict' mode by default (raising UnicodeDecodeError) but supports 'replace', 'ignore', and 'surrogateescape' error handlers. Web browsers typically use U+FFFD replacement for robustness.
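
The error handlers mentioned above can be compared side by side in Python (assuming a Python 3 interpreter):

```python
# Error-handling strategies for invalid UTF-8 input.
bad = b"abc\xff\xfedef"  # 0xFF and 0xFE can never appear in valid UTF-8
print(bad.decode("utf-8", errors="replace"))  # 'abc\ufffd\ufffddef'
print(bad.decode("utf-8", errors="ignore"))   # 'abcdef'
try:
    bad.decode("utf-8")                       # default is errors='strict'
except UnicodeDecodeError as e:
    print("strict mode raised:", e.reason)
```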
