What is Unicode?
Unicode is a universal character encoding standard that assigns a unique number (a code point) to each character it encodes. Its goal is to cover every writing system used in human communication, along with symbols and emoji.
Quick Facts
| Full Name | Unicode Standard |
|---|---|
| Created | 1991 (Unicode 1.0) |
| Specification | Official Specification |
How It Works
Work on Unicode began in 1987 to solve the problem of incompatible character encodings. Before Unicode, different systems used different encodings (ASCII, ISO-8859, GB2312, and others), so text often displayed incorrectly when moved between platforms. Unicode assigns each character a unique code point, written as U+ followed by four to six hexadecimal digits (for example, U+0041 for 'A'). Recent versions of the standard define more than 150,000 characters across over 160 scripts (Unicode 16.0 lists 154,998 characters and 168 scripts). Code points can be stored in several encoding forms: UTF-8 (variable-width, 1 to 4 bytes, the web standard), UTF-16 (variable-width, 2 or 4 bytes, used internally by Windows and Java), and UTF-32 (fixed-width, 4 bytes).
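To make the difference between the encoding forms concrete, here is a minimal Python sketch that counts how many bytes one character occupies in each form. The helper name `encoded_sizes` is made up for this illustration; the `-le` codec variants are used only so that Python does not prepend a byte order mark to the result.

```python
def encoded_sizes(ch):
    """Return the byte count of a single character in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),  # -le: no byte order mark added
        "utf-32": len(ch.encode("utf-32-le")),  # -le: no byte order mark added
    }

# 'A', 'é', '中', '😀' written as escapes so the source file encoding does not matter
for ch in ("A", "\u00e9", "\u4e2d", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

UTF-32 always reports 4 bytes, UTF-16 reports 2 bytes for everything except the emoji (which needs a surrogate pair, 4 bytes), and UTF-8 grows from 1 to 4 bytes as the code point gets larger.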
Key Characteristics
- Universal standard that aims to cover all writing systems
- More than 150,000 characters across over 160 scripts in recent versions
- Code points written as U+ followed by four to six hexadecimal digits
- Multiple encoding forms: UTF-8, UTF-16, UTF-32
- Backward compatible with ASCII (first 128 code points)
- Includes emojis, symbols, and historic scripts
Common Use Cases
- Multilingual text processing
- Web content internationalization
- Database character storage
- Cross-platform text compatibility
- Emoji support in applications
Example
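The following is an illustrative Python sketch rather than the page's original example. It lists, for each character in a string, its code point, its official Unicode name, and its UTF-8 bytes. The function name `describe` is an assumption made for this illustration; `unicodedata` is part of the Python standard library.

```python
import unicodedata

def describe(text):
    """Return (character, code point, name, UTF-8 bytes in hex) per code point."""
    rows = []
    for ch in text:
        rows.append((
            ch,
            f"U+{ord(ch):04X}",                 # code point in U+XXXX notation
            unicodedata.name(ch, "<unnamed>"),  # fallback for unnamed code points
            ch.encode("utf-8").hex(" "),        # bytes as space-separated hex
        ))
    return rows

# 'A', '中', '😀'
for row in describe("A\u4e2d\U0001F600"):
    print(*row, sep="\t")
```

Note that `bytes.hex` accepts a separator argument only on Python 3.8 and later.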
Frequently Asked Questions
What is the difference between Unicode and UTF-8?
Unicode is the standard that assigns unique code points to characters (like U+0041 for 'A'). UTF-8 is one encoding form for storing those code points as bytes. UTF-8 uses 1 to 4 bytes per code point, is backward compatible with ASCII, and is the dominant encoding on the web. Other encoding forms include UTF-16 and UTF-32.
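The 1 to 4 byte rule follows directly from the size of the code point. As a rough sketch, the hypothetical helper `utf8_width` below computes the expected byte count from the code point's range and checks it against Python's own UTF-8 encoder. It assumes a valid scalar value, so surrogates (U+D800 to U+DFFF) are out of scope.

```python
def utf8_width(code_point):
    """Number of bytes UTF-8 uses for one code point (surrogates excluded)."""
    if code_point < 0x80:      # U+0000..U+007F: same single byte as ASCII
        return 1
    if code_point < 0x800:     # U+0080..U+07FF
        return 2
    if code_point < 0x10000:   # U+0800..U+FFFF
        return 3
    return 4                   # U+10000..U+10FFFF

for cp in (0x41, 0xE9, 0x4E2D, 0x1F600):
    actual = len(chr(cp).encode("utf-8"))
    assert utf8_width(cp) == actual
    print(f"U+{cp:04X} -> {actual} byte(s)")

# ASCII-only text produces identical bytes in ASCII and UTF-8
assert "Hello".encode("ascii") == "Hello".encode("utf-8")
```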
Why does emoji display differently across platforms?
Unicode defines the code point and name for each emoji, but not its exact artwork. Each platform vendor (Apple, Google, Microsoft, Samsung) draws its own glyphs and ships them in its own emoji font, so the same code point can look different from one device to another. An emoji added in a newer Unicode version than a device's font supports typically shows up as a placeholder box instead of the intended image.
What is a Unicode code point and how is it written?
A code point is the unique number assigned to each character in Unicode, written as U+ followed by 4-6 hexadecimal digits. For example, U+0041 is 'A', U+4E2D is '中', and U+1F600 is '😀'. The first 128 code points (U+0000 to U+007F) match ASCII for compatibility.
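Converting between a character and its U+ notation takes only a few lines. The sketch below uses two hypothetical helpers, `format_code_point` and `parse_code_point`, named for this illustration only.

```python
def format_code_point(ch):
    """Format one character as U+XXXX, padded to at least four hex digits."""
    return f"U+{ord(ch):04X}"

def parse_code_point(notation):
    """Turn a string such as 'U+4E2D' into the character it denotes."""
    if not notation.upper().startswith("U+"):
        raise ValueError(f"not in U+XXXX notation: {notation!r}")
    # chr() raises ValueError for values above U+10FFFF
    return chr(int(notation[2:], 16))

print(format_code_point("A"))            # U+0041
print(format_code_point("\U0001F600"))   # U+1F600
print(parse_code_point("U+4E2D"))        # 中
```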
How do I handle Unicode in databases?
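Use a UTF-8 character set for your databases, tables, and columns, and pick a collation that belongs to that character set. In MySQL, use utf8mb4 rather than utf8: the legacy utf8 character set stores at most 3 bytes per character, so it cannot hold code points above U+FFFF, which includes most emoji. Make sure the client connection also uses UTF-8 (utf8mb4 in MySQL). In PostgreSQL, the default encoding is taken from the locale when the cluster is initialized, which is commonly UTF8; UTF8 is the recommended encoding for international applications.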
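One way to see why the 3-byte limit matters is to check whether a string contains any code point above U+FFFF, since those are the ones that need 4 bytes in UTF-8. The sketch below is a plain Python check with a made-up helper name, `needs_four_byte_utf8`. It does not talk to any database and only illustrates which strings a 3-byte character set could not store.

```python
def needs_four_byte_utf8(text):
    """True if any code point lies above U+FFFF (needs 4 bytes in UTF-8)."""
    return any(ord(ch) > 0xFFFF for ch in text)

samples = ["caf\u00e9", "\u4e2d\u6587", "hello \U0001F600"]
for s in samples:
    print(repr(s), needs_four_byte_utf8(s))
```

Only the last sample contains a 4-byte character. Accented Latin letters and common CJK characters fit within 3 bytes.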
What are combining characters and normalization in Unicode?
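Some characters can be represented in more than one way: 'é' can be the single precomposed code point U+00E9 or the sequence 'e' plus a combining acute accent (U+0065 U+0301). The two look identical but compare as different strings. Unicode normalization converts text to a standard form; NFC (composed) and NFD (decomposed) are the most common. Normalize text to one form before comparing it, and consider doing the same before storing it, so that equivalent strings match.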
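The effect is easy to reproduce with Python's standard `unicodedata` module. In this minimal sketch the helper `same_text` is a hypothetical name, and NFC is chosen as the comparison form only as an example; any single form applied to both sides would work.

```python
import unicodedata

composed = "\u00e9"      # 'é' as one precomposed code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# The strings render the same but are not equal code point for code point
assert composed != decomposed

# NFC composes, NFD decomposes
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_text("caf\u00e9", "cafe\u0301"))  # True
```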