Character encoding is the foundation of how computers process text. Understanding encoding principles helps solve garbled text issues and plays a crucial role in web security and internationalization. This article provides an in-depth explanation of various encoding methods.
Character Encoding Basics
Why Do We Need Character Encoding?
Computers can only process numbers (binary), while humans use text. Character encoding establishes the mapping between characters and numbers:
Character 'A' → Number 65 → Binary 01000001
Character '中' → Number 20013 → Binary ...
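The mapping can be inspected directly in JavaScript. The sketch below is illustrative only; `describeChar` is a hypothetical helper name, not part of any library.

```javascript
// Hypothetical helper: show the number and the binary digits behind a character.
function describeChar(char) {
  const codePoint = char.codePointAt(0); // numeric value assigned to the character
  return {
    char,
    decimal: codePoint,
    // Pad to at least 8 digits so ASCII values read as one full byte
    binary: codePoint.toString(2).padStart(8, '0'),
  };
}

console.log(describeChar('A'));      // { char: 'A', decimal: 65, binary: '01000001' }
console.log(describeChar('\u4E2D')); // '中': decimal 20013, binary '100111000101101'
```

Note that the binary string here is the raw code point value, not the bytes stored on disk; how those bits are laid out in bytes depends on the encoding (UTF-8, UTF-16, and so on).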
Evolution of Encoding
ASCII (1963) → Extended ASCII → ISO-8859 → Unicode (1991) → UTF-8/UTF-16
- ASCII: 7-bit, 128 characters
- Extended ASCII: 8-bit, 256 characters
- ISO-8859: regional encodings
- Unicode: unified encoding
ASCII Encoding
ASCII Basics
ASCII (American Standard Code for Information Interchange) is the most fundamental character encoding:
- Range: 0-127 (7 bits)
- Characters: 128
- Includes: English letters, digits, punctuation, control characters
ASCII Table
| Range | Type | Examples |
|---|---|---|
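| 0-31 | Control characters | NUL, TAB, LF, CR |
| 32-47 | Space and punctuation | Space, !, ", # |
| 48-57 | Digits | 0-9 |
| 58-64 | Punctuation | :, ;, <, =, >, ?, @ |
| 65-90 | Uppercase | A-Z |
| 91-96 | Punctuation | [, \, ], ^, _, ` |
| 97-122 | Lowercase | a-z |
| 123-126 | Other symbols | {, \|, }, ~ |
| 127 | Control character | DEL |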
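The ranges in the table translate into simple numeric comparisons. The following is a minimal sketch; `classifyASCII` and its category labels are made up for illustration.

```javascript
// Hypothetical helper: name the ASCII range that a numeric code falls into.
function classifyASCII(code) {
  if (!Number.isInteger(code) || code < 0 || code > 127) return 'not ASCII';
  if (code <= 31 || code === 127) return 'control'; // 127 is DEL
  if (code >= 48 && code <= 57) return 'digit';
  if (code >= 65 && code <= 90) return 'uppercase';
  if (code >= 97 && code <= 122) return 'lowercase';
  return 'punctuation or symbol'; // everything else, including space (32)
}

console.log(classifyASCII('7'.charCodeAt(0))); // "digit"
console.log(classifyASCII(10));                // "control" (LF)

// Uppercase and lowercase letters differ by exactly 32, a single bit (0x20)
console.log('a'.charCodeAt(0) ^ 0x20);         // 65, the code for 'A'
```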
ASCII Conversion Implementation
class ASCIIConverter {
static charToCode(char) {
return char.charCodeAt(0);
}
static codeToChar(code) {
return String.fromCharCode(code);
}
static stringToASCII(str) {
return Array.from(str).map(char => ({
char,
decimal: char.charCodeAt(0),
hex: char.charCodeAt(0).toString(16).toUpperCase(),
binary: char.charCodeAt(0).toString(2).padStart(8, '0')
}));
}
static asciiToString(codes) {
return codes.map(code => String.fromCharCode(code)).join('');
}
static isASCII(str) {
return /^[\x00-\x7F]*$/.test(str);
}
static toUpperCase(char) {
const code = char.charCodeAt(0);
if (code >= 97 && code <= 122) {
return String.fromCharCode(code - 32);
}
return char;
}
static toLowerCase(char) {
const code = char.charCodeAt(0);
if (code >= 65 && code <= 90) {
return String.fromCharCode(code + 32);
}
return char;
}
}
// Usage
console.log(ASCIIConverter.stringToASCII('Hello'));
// [
// { char: 'H', decimal: 72, hex: '48', binary: '01001000' },
// { char: 'e', decimal: 101, hex: '65', binary: '01100101' },
// ...
// ]
console.log(ASCIIConverter.asciiToString([72, 101, 108, 108, 111]));
// "Hello"
Python ASCII Implementation
class ASCIIConverter:
    @staticmethod
    def char_to_code(char: str) -> int:
        return ord(char)

    @staticmethod
    def code_to_char(code: int) -> str:
        return chr(code)

    @staticmethod
    def string_to_ascii(s: str) -> list:
        return [
            {
                'char': char,
                'decimal': ord(char),
                'hex': hex(ord(char))[2:].upper(),
                'binary': bin(ord(char))[2:].zfill(8)
            }
            for char in s
        ]

    @staticmethod
    def ascii_to_string(codes: list) -> str:
        return ''.join(chr(code) for code in codes)

    @staticmethod
    def is_ascii(s: str) -> bool:
        return all(ord(char) < 128 for char in s)


# Usage
print(ASCIIConverter.string_to_ascii('Hello'))
print(ASCIIConverter.ascii_to_string([72, 101, 108, 108, 111]))
Unicode Encoding
Unicode Basics
Unicode is a character set standard that aims to assign a unique code point to every character in every writing system:
- Range: U+0000 to U+10FFFF
- Characters: Over 140,000
- Notation: U+XXXX (hexadecimal)
Unicode Planes
| Plane | Range | Name | Content |
|---|---|---|---|
| 0 | U+0000-U+FFFF | BMP | Common characters |
| 1 | U+10000-U+1FFFF | SMP | Emoji, ancient scripts |
| 2 | U+20000-U+2FFFF | SIP | Extended CJK |
| 14 | U+E0000-U+EFFFF | SSP | Special purpose |
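Because every plane spans exactly 0x10000 code points, the plane number can be read off by shifting the code point right by 16 bits. A minimal sketch, with `planeOf` as a hypothetical helper name:

```javascript
// Hypothetical helper: return the Unicode plane (0-16) of a code point.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a Unicode code point');
  }
  return codePoint >> 16; // same as Math.floor(codePoint / 0x10000)
}

console.log(planeOf('A'.codePointAt(0)));         // 0  (BMP)
console.log(planeOf('\u{1F600}'.codePointAt(0))); // 1  (SMP, grinning face emoji)
console.log(planeOf(0x20000));                    // 2  (SIP)
console.log(planeOf(0xE0001));                    // 14 (SSP)
```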
Unicode Conversion Implementation
class UnicodeConverter {
static charToCodePoint(char) {
return char.codePointAt(0);
}
static codePointToChar(codePoint) {
return String.fromCodePoint(codePoint);
}
static stringToUnicode(str) {
const result = [];
for (const char of str) {
const codePoint = char.codePointAt(0);
result.push({
char,
codePoint,
unicode: `U+${codePoint.toString(16).toUpperCase().padStart(4, '0')}`,
utf8: this.toUTF8Bytes(codePoint),
utf16: this.toUTF16(codePoint)
});
}
return result;
}
static toUTF8Bytes(codePoint) {
const bytes = [];
if (codePoint <= 0x7F) {
bytes.push(codePoint);
} else if (codePoint <= 0x7FF) {
bytes.push(0xC0 | (codePoint >> 6));
bytes.push(0x80 | (codePoint & 0x3F));
} else if (codePoint <= 0xFFFF) {
bytes.push(0xE0 | (codePoint >> 12));
bytes.push(0x80 | ((codePoint >> 6) & 0x3F));
bytes.push(0x80 | (codePoint & 0x3F));
} else {
bytes.push(0xF0 | (codePoint >> 18));
bytes.push(0x80 | ((codePoint >> 12) & 0x3F));
bytes.push(0x80 | ((codePoint >> 6) & 0x3F));
bytes.push(0x80 | (codePoint & 0x3F));
}
return bytes.map(b => b.toString(16).toUpperCase().padStart(2, '0'));
}
static toUTF16(codePoint) {
if (codePoint <= 0xFFFF) {
return [codePoint.toString(16).toUpperCase().padStart(4, '0')];
}
// Surrogate pair
const offset = codePoint - 0x10000;
const high = 0xD800 + (offset >> 10);
const low = 0xDC00 + (offset & 0x3FF);
return [
high.toString(16).toUpperCase(),
low.toString(16).toUpperCase()
];
}
static escapeUnicode(str) {
return Array.from(str)
.map(char => {
const code = char.codePointAt(0);
if (code > 0xFFFF) {
return `\\u{${code.toString(16).toUpperCase()}}`;
}
return `\\u${code.toString(16).toUpperCase().padStart(4, '0')}`;
})
.join('');
}
static unescapeUnicode(str) {
return str.replace(/\\u\{([0-9A-Fa-f]+)\}|\\u([0-9A-Fa-f]{4})/g,
(match, p1, p2) => {
const codePoint = parseInt(p1 || p2, 16);
return String.fromCodePoint(codePoint);
}
);
}
}
// Usage
console.log(UnicodeConverter.stringToUnicode('Hello👋'));
console.log(UnicodeConverter.escapeUnicode('Hello World'));
UTF-8 Encoding
UTF-8 Principles
UTF-8 is a variable-length encoding for Unicode:
| Unicode Range | UTF-8 Bytes | Format |
|---|---|---|
| U+0000-U+007F | 1 | 0xxxxxxx |
| U+0080-U+07FF | 2 | 110xxxxx 10xxxxxx |
| U+0800-U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx |
| U+10000-U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
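The table reduces to three threshold comparisons. The sketch below uses a hypothetical helper, `utf8Length`, and cross-checks it against the built-in `TextEncoder`, which always produces UTF-8.

```javascript
// Hypothetical helper: number of UTF-8 bytes needed for one code point,
// following the ranges in the table above.
function utf8Length(codePoint) {
  if (codePoint <= 0x7F) return 1;
  if (codePoint <= 0x7FF) return 2;
  if (codePoint <= 0xFFFF) return 3;
  return 4;
}

// Compare with what the platform encoder actually emits
const encoder = new TextEncoder();
for (const char of ['A', '\u00E9', '\u4E2D', '\u{1F600}']) {
  const predicted = utf8Length(char.codePointAt(0));
  const actual = encoder.encode(char).length;
  console.log(char, predicted, actual); // the two numbers agree: 1, 2, 3, 4
}
```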
UTF-8 Encoding Example
For the Chinese character "中" (U+4E2D):
1. Code point: 0x4E2D = 0100 1110 0010 1101
2. Range: U+0800-U+FFFF, needs 3 bytes
3. Template: 1110xxxx 10xxxxxx 10xxxxxx
4. Fill in:
- 1110 0100 (E4)
- 10 111000 (B8)
- 10 101101 (AD)
5. Result: E4 B8 AD
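The same steps can be reproduced with bit operations and checked against the built-in encoder. This is a sketch for the three-byte case only; `encodeThreeByte` is a hypothetical name.

```javascript
// Hypothetical helper: apply the 3-byte template 1110xxxx 10xxxxxx 10xxxxxx.
// Only valid for code points in U+0800-U+FFFF.
function encodeThreeByte(codePoint) {
  return [
    0xE0 | (codePoint >> 12),         // 1110 + top 4 bits
    0x80 | ((codePoint >> 6) & 0x3F), // 10 + middle 6 bits
    0x80 | (codePoint & 0x3F),        // 10 + low 6 bits
  ];
}

const toHex = bytes =>
  bytes.map(b => b.toString(16).toUpperCase().padStart(2, '0')).join(' ');

console.log(toHex(encodeThreeByte(0x4E2D)));                         // "E4 B8 AD"
console.log(toHex(Array.from(new TextEncoder().encode('\u4E2D')))); // "E4 B8 AD"
```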
UTF-8 Codec Implementation
class UTF8Codec {
static encode(str) {
const bytes = [];
for (const char of str) {
const codePoint = char.codePointAt(0);
if (codePoint <= 0x7F) {
bytes.push(codePoint);
} else if (codePoint <= 0x7FF) {
bytes.push(0xC0 | (codePoint >> 6));
bytes.push(0x80 | (codePoint & 0x3F));
} else if (codePoint <= 0xFFFF) {
bytes.push(0xE0 | (codePoint >> 12));
bytes.push(0x80 | ((codePoint >> 6) & 0x3F));
bytes.push(0x80 | (codePoint & 0x3F));
} else {
bytes.push(0xF0 | (codePoint >> 18));
bytes.push(0x80 | ((codePoint >> 12) & 0x3F));
bytes.push(0x80 | ((codePoint >> 6) & 0x3F));
bytes.push(0x80 | (codePoint & 0x3F));
}
}
return new Uint8Array(bytes);
}
static decode(bytes) {
let result = '';
let i = 0;
while (i < bytes.length) {
let codePoint;
const byte1 = bytes[i];
if ((byte1 & 0x80) === 0) {
codePoint = byte1;
i += 1;
} else if ((byte1 & 0xE0) === 0xC0) {
codePoint = ((byte1 & 0x1F) << 6) | (bytes[i + 1] & 0x3F);
i += 2;
} else if ((byte1 & 0xF0) === 0xE0) {
codePoint = ((byte1 & 0x0F) << 12) |
((bytes[i + 1] & 0x3F) << 6) |
(bytes[i + 2] & 0x3F);
i += 3;
} else {
codePoint = ((byte1 & 0x07) << 18) |
((bytes[i + 1] & 0x3F) << 12) |
((bytes[i + 2] & 0x3F) << 6) |
(bytes[i + 3] & 0x3F);
i += 4;
}
result += String.fromCodePoint(codePoint);
}
return result;
}
static toHexString(bytes) {
return Array.from(bytes)
.map(b => b.toString(16).toUpperCase().padStart(2, '0'))
.join(' ');
}
}
// Usage
const encoded = UTF8Codec.encode('Hello World');
console.log(UTF8Codec.toHexString(encoded));
const decoded = UTF8Codec.decode(encoded);
console.log(decoded);
HTML Entity Encoding
What Are HTML Entities?
HTML entities are escape sequences that represent special characters in HTML:
<!-- Named entities -->
&lt; → <
&gt; → >
&amp; → &
&quot; → "
&nbsp; → non-breaking space
<!-- Numeric entities -->
&#60; → < (decimal)
&#x3C; → < (hexadecimal)
Why Use HTML Entities?
- Avoid parsing errors: < and > would be parsed as tags
- Prevent XSS attacks: Escape user input
- Display special characters: Copyright ©, trademark ™, etc.
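For the last case, a numeric entity can always be derived from a character's code point, even when no named entity exists or is easy to remember. A minimal sketch; `toNumericEntities` is a hypothetical helper name.

```javascript
// Hypothetical helper: build the decimal and hexadecimal numeric entities
// for the first code point of a string.
function toNumericEntities(char) {
  const codePoint = char.codePointAt(0);
  return {
    decimal: `&#${codePoint};`,
    hex: `&#x${codePoint.toString(16).toUpperCase()};`,
  };
}

console.log(toNumericEntities('<'));      // { decimal: '&#60;', hex: '&#x3C;' }
console.log(toNumericEntities('\u00A9')); // ©: { decimal: '&#169;', hex: '&#xA9;' }
console.log(toNumericEntities('\u2122')); // ™: { decimal: '&#8482;', hex: '&#x2122;' }
```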
HTML Entity Encoding Implementation
class HTMLEntityEncoder {
static namedEntities = {
'&': '&amp;',
'<': '&lt;',
'>': '&gt;',
'"': '&quot;',
"'": '&#39;',
'/': '&#x2F;',
'`': '&#x60;',
'=': '&#x3D;'
};
static reverseEntities = {
'amp': '&',
'lt': '<',
'gt': '>',
'quot': '"',
'apos': "'",
'nbsp': '\u00A0',
'copy': '©',
'reg': '®',
'trade': '™',
'euro': '€',
'pound': '£',
'yen': '¥',
'cent': '¢'
};
static encode(str, options = {}) {
const { mode = 'named', encodeAll = false } = options;
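// The u flag makes the regex match whole code points, so characters outside
// the BMP (such as emoji) are not split into two surrogate halves
return str.replace(/[&<>"'`=\/]|[^\x00-\x7F]/gu, char => {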
if (this.namedEntities[char]) {
return this.namedEntities[char];
}
if (encodeAll || char.charCodeAt(0) > 127) {
const code = char.codePointAt(0);
return mode === 'hex'
? `&#x${code.toString(16).toUpperCase()};`
: `&#${code};`;
}
return char;
});
}
static decode(str) {
// Decode in a single pass so that already-decoded output is never decoded
// again (otherwise "&amp;#60;" would become "<" instead of "&#60;")
return str.replace(/&(?:#x([0-9A-Fa-f]+)|#(\d+)|([a-zA-Z]+));/g,
(match, hex, dec, name) => {
if (hex) return String.fromCodePoint(parseInt(hex, 16));
if (dec) return String.fromCodePoint(parseInt(dec, 10));
return this.reverseEntities[name.toLowerCase()] || match;
}
);
}
static encodeForAttribute(str) {
return str.replace(/[&<>"']/g, char => this.namedEntities[char]);
}
static encodeForHTML(str) {
return str.replace(/[&<>]/g, char => this.namedEntities[char]);
}
}
// Usage
console.log(HTMLEntityEncoder.encode('<script>alert("XSS")</script>'));
// "&lt;script&gt;alert(&quot;XSS&quot;)&lt;&#x2F;script&gt;"
console.log(HTMLEntityEncoder.decode('&lt;div&gt;Hello&lt;/div&gt;'));
// "<div>Hello</div>"
Common HTML Entity Reference
| Character | Named | Decimal | Hex | Description |
|---|---|---|---|---|
| < | &lt; | &#60; | &#x3C; | Less than |
| > | &gt; | &#62; | &#x3E; | Greater than |
| & | &amp; | &#38; | &#x26; | Ampersand |
| " | &quot; | &#34; | &#x22; | Double quote |
| ' | &apos; | &#39; | &#x27; | Single quote |
| © | &copy; | &#169; | &#xA9; | Copyright |
| ® | &reg; | &#174; | &#xAE; | Registered |
| ™ | &trade; | &#8482; | &#x2122; | Trademark |
| € | &euro; | &#8364; | &#x20AC; | Euro |
| £ | &pound; | &#163; | &#xA3; | Pound |
| ¥ | &yen; | &#165; | &#xA5; | Yen |
| (non-breaking space) | &nbsp; | &#160; | &#xA0; | Non-breaking space |
URL Encoding
URL Encoding Principles
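URL encoding (percent-encoding) makes special characters safe to transmit in URLs by replacing each unsafe byte with % followed by two hexadecimal digits:
Space → %20 (or + in form-encoded query strings)
Non-ASCII text (e.g. Chinese) → each UTF-8 byte written as %XX, so 中 becomes %E4%B8%AD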
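To make the byte-level rule concrete, the sketch below percent-encodes a string by hand. `percentEncode` is a hypothetical helper that keeps only the RFC 3986 unreserved characters; the built-in `encodeURIComponent` is slightly more lenient and also leaves `!'()*` untouched.

```javascript
// Hypothetical helper: percent-encode every UTF-8 byte of a string except
// the unreserved characters A-Z a-z 0-9 - . _ ~
function percentEncode(str) {
  const unreserved = /^[A-Za-z0-9\-._~]$/;
  let out = '';
  for (const byte of new TextEncoder().encode(str)) {
    const ch = String.fromCharCode(byte);
    out += byte < 0x80 && unreserved.test(ch)
      ? ch
      : '%' + byte.toString(16).toUpperCase().padStart(2, '0');
  }
  return out;
}

console.log(percentEncode('\u4E2D'));      // "%E4%B8%AD"
console.log(percentEncode('Hello World')); // "Hello%20World"
console.log(encodeURIComponent('\u4E2D')); // "%E4%B8%AD" (built-in, same bytes)
```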
URL Encoding Implementation
class URLEncoder {
static encode(str) {
return encodeURIComponent(str);
}
static decode(str) {
return decodeURIComponent(str);
}
static encodeQueryParam(params) {
return Object.entries(params)
.map(([key, value]) =>
`${encodeURIComponent(key)}=${encodeURIComponent(value)}`
)
.join('&');
}
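static decodeQueryParam(queryString) {
const params = {};
const query = queryString.replace(/^\?/, '');
if (query === '') return params; // nothing to parse
for (const pair of query.split('&')) {
// Split on the first '=' only, so values may themselves contain '='
const index = pair.indexOf('=');
const key = index === -1 ? pair : pair.slice(0, index);
const value = index === -1 ? '' : pair.slice(index + 1);
params[decodeURIComponent(key)] = decodeURIComponent(value);
}
return params;
}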
}
// Usage
console.log(URLEncoder.encode('Hello World'));
// "Hello%20World"
console.log(URLEncoder.encodeQueryParam({
name: 'John Doe',
message: 'Hello World!'
}));
// "name=John%20Doe&message=Hello%20World!"
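As an alternative to hand-written helpers, the standard `URLSearchParams` class handles query strings directly. One difference worth knowing, shown in the sketch below: it uses form encoding, so spaces become `+` rather than `%20`, and `decodeURIComponent` alone does not reverse that. The parameter names and values are illustrative only.

```javascript
// Build a query string with the built-in class (form encoding: space -> "+")
const params = new URLSearchParams({ name: 'John Doe', city: 'Z\u00FCrich' });
const query = params.toString();
console.log(query); // "name=John+Doe&city=Z%C3%BCrich"

// Parse it back; a leading "?" is ignored and "+" is decoded as a space
const parsed = new URLSearchParams('?name=John+Doe&expr=a%3Db');
console.log(parsed.get('name')); // "John Doe"
console.log(parsed.get('expr')); // "a=b"

// decodeURIComponent does not treat "+" as a space
console.log(decodeURIComponent('John+Doe')); // "John+Doe"
```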
Practical Applications
1. XSS Prevention
function sanitizeHTML(input) {
return HTMLEntityEncoder.encode(input);
}
function createSafeElement(tag, text) {
const element = document.createElement(tag);
element.textContent = text; // Auto-escapes
return element;
}
// Unsafe
element.innerHTML = userInput; // ❌ XSS risk
// Safe
element.textContent = userInput; // ✅ Auto-escape
element.innerHTML = sanitizeHTML(userInput); // ✅ Manual escape
2. Internationalization Text Processing
function normalizeText(str) {
// NFD: Decomposition
// NFC: Composition
// NFKD: Compatibility decomposition
// NFKC: Compatibility composition
return str.normalize('NFC');
}
function compareStrings(a, b, locale = 'en-US') {
return a.localeCompare(b, locale);
}
// Full-width/Half-width conversion
function toHalfWidth(str) {
return str.replace(/[\uFF01-\uFF5E]/g, char =>
String.fromCharCode(char.charCodeAt(0) - 0xFEE0)
).replace(/\u3000/g, ' ');
}
function toFullWidth(str) {
return str.replace(/[\x21-\x7E]/g, char =>
String.fromCharCode(char.charCodeAt(0) + 0xFEE0)
).replace(/ /g, '\u3000');
}
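Why normalization matters is easiest to see with a character that has two valid representations. The sketch below compares a precomposed "é" with its decomposed form; `equalsNormalized` is a hypothetical helper name.

```javascript
// Hypothetical helper: compare two strings after normalizing both to NFC.
function equalsNormalized(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

const composed = '\u00E9';    // "é" as a single code point
const decomposed = 'e\u0301'; // "e" followed by a combining acute accent

console.log(composed === decomposed);                // false: different code units
console.log(equalsNormalized(composed, decomposed)); // true
console.log(decomposed.normalize('NFC').length);     // 1
console.log(composed.normalize('NFD').length);       // 2

// The compatibility forms also fold presentation variants
console.log('\uFB01'.normalize('NFKC')); // "fi" (from the single "fi" ligature)
console.log('\uFF21'.normalize('NFKC')); // "A"  (from full-width "A")
```

NFKC overlaps with the full-width conversion above, but it changes more than width (ligatures, superscripts, and so on), so it is better suited to search and comparison than to text that will be shown back to the user.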
3. File Encoding Detection
async function detectEncoding(file) {
const buffer = await file.arrayBuffer();
const bytes = new Uint8Array(buffer);
// Detect BOM
if (bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
return 'UTF-8';
}
if (bytes[0] === 0xFF && bytes[1] === 0xFE) {
return 'UTF-16LE';
}
if (bytes[0] === 0xFE && bytes[1] === 0xFF) {
return 'UTF-16BE';
}
// Try UTF-8 decoding
try {
new TextDecoder('utf-8', { fatal: true }).decode(bytes);
return 'UTF-8';
} catch {
return 'unknown';
}
}
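The final step of the detection relies on `TextDecoder` throwing when `fatal: true` is set and the bytes are not well-formed UTF-8. The sketch below isolates that check in a hypothetical helper, `isValidUTF8`. Treat the result as a heuristic: valid UTF-8 bytes can also be valid in other encodings, most obviously for pure ASCII input.

```javascript
// Hypothetical helper: true if the bytes form well-formed UTF-8.
function isValidUTF8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false; // decode() throws a TypeError on malformed input
  }
}

console.log(isValidUTF8(Uint8Array.of(0xE4, 0xB8, 0xAD))); // true  ("中" in UTF-8)
console.log(isValidUTF8(Uint8Array.of(0xE9)));             // false ("é" in Latin-1)
console.log(isValidUTF8(Uint8Array.of(0x48, 0x69)));       // true  (ASCII "Hi")
```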
Common Issues and Solutions
1. Garbled Text
// Problem: UTF-8 file opened with wrong encoding
// Solution: Specify correct encoding
const decoder = new TextDecoder('utf-8');
const text = decoder.decode(buffer);
// Problem: Database garbled text
// Solution: Ensure consistent connection encoding
// SET NAMES utf8mb4;
2. Emoji Handling
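// Problem: Incorrect emoji length
// Family emoji: four emoji joined by three zero-width joiners (U+200D)
const family = '👨\u200D👩\u200D👧\u200D👦';
family.length; // 11 (UTF-16 code units, not characters)
// Spread or Array.from counts code points, which is still not one character
[...family].length; // 7 (4 emoji + 3 ZWJ)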
// Get actual character count
function getCharacterCount(str) {
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
return [...segmenter.segment(str)].length;
}
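The same segmenter is useful for truncating text without cutting an emoji in half. A minimal sketch, assuming an environment that supports `Intl.Segmenter`; `truncateGraphemes` is a hypothetical helper name.

```javascript
// Hypothetical helper: keep at most maxGraphemes user-perceived characters.
function truncateGraphemes(str, maxGraphemes, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let out = '';
  let count = 0;
  for (const { segment } of segmenter.segment(str)) {
    if (count === maxGraphemes) break;
    out += segment;
    count += 1;
  }
  return out;
}

// Family emoji written with explicit escapes: 4 emoji joined by 3 ZWJ (U+200D)
const familyEmoji = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';
const message = 'Hi' + familyEmoji + '!';

// slice() counts UTF-16 code units and leaves a lone surrogate behind
console.log(message.slice(0, 3).length);                                  // 3
// Grapheme-aware truncation keeps the whole family emoji as one unit
console.log(truncateGraphemes(message, 3) === 'Hi' + familyEmoji);        // true
```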
3. Surrogate Pair Issues
// Problem: Characters outside BMP
const emoji = '😀';
emoji.length; // 2 (surrogate pair)
emoji.charCodeAt(0); // 55357 (high surrogate)
emoji.charCodeAt(1); // 56832 (low surrogate)
// Solution: Use codePointAt
emoji.codePointAt(0); // 128512 (correct code point)
String.fromCodePoint(128512); // '😀'
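The arithmetic behind `codePointAt` can be written out by hand: subtract the surrogate bases, combine the two 10-bit halves, and add 0x10000. The sketch below uses a hypothetical helper, `combineSurrogates`.

```javascript
// Hypothetical helper: rebuild a code point from a UTF-16 surrogate pair.
function combineSurrogates(high, low) {
  if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF) {
    throw new RangeError('Not a valid surrogate pair');
  }
  // High surrogate carries the top 10 bits, low surrogate the bottom 10 bits
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

console.log(combineSurrogates(55357, 56832));                                   // 128512
console.log(combineSurrogates(55357, 56832) === '\u{1F600}'.codePointAt(0));    // true
console.log(combineSurrogates(0xD83D, 0xDE00).toString(16).toUpperCase());      // "1F600"
```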
Summary
Character encoding is fundamental to text processing. Key points:
- ASCII: Basic encoding, English only
- Unicode: Unified character set with code points for all characters
- UTF-8: Variable-length encoding, ASCII-compatible, most widely used
- HTML Entities: Safely display special characters in HTML
- URL Encoding: Safely transmit special characters in URLs