Character encoding is the foundation of how computers process text. Understanding how encodings work helps you diagnose garbled text, and it matters for web security and internationalization. This article walks through ASCII, Unicode, UTF-8, HTML entities, and URL encoding, with JavaScript and Python implementations.

Character Encoding Basics

Why Do We Need Character Encoding?

Computers can only process numbers (binary), while humans use text. Character encoding establishes the mapping between characters and numbers:

Character 'A' → Number 65 → Binary 01000001
Character '中' → Number 20013 → Binary ...
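
The mapping above can be inspected directly in JavaScript. This is a minimal sketch; the helper name `describeChar` is made up for illustration and is not part of any library:

```javascript
// Hypothetical helper: show the number behind a single character.
function describeChar(char) {
  const codePoint = char.codePointAt(0);
  return {
    char,
    decimal: codePoint,
    hex: codePoint.toString(16).toUpperCase(),
    binary: codePoint.toString(2) // no leading zeros
  };
}

console.log(describeChar('A'));  // decimal 65, hex '41', binary '1000001'
console.log(describeChar('中')); // decimal 20013, hex '4E2D'
```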

Evolution of Encoding

ASCII (1963) → Extended ASCII → ISO-8859 → Unicode (1991) → UTF-8/UTF-16
     ↓              ↓              ↓            ↓
   7-bit/128     8-bit/256     Regional     Unified
   characters    characters    encoding     encoding

ASCII Encoding

ASCII Basics

ASCII (American Standard Code for Information Interchange) is the most fundamental character encoding:

  • Range: 0-127 (7 bits)
  • Characters: 128
  • Includes: English letters, digits, punctuation, control characters

ASCII Table

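Range     Type                  Examples
0-31      Control characters    NUL, TAB, LF, CR
32-47     Space, punctuation    Space, !, ", #
48-57     Digits                0-9
58-64     Punctuation           :, ;, <, =, >, ?, @
65-90     Uppercase             A-Z
91-96     Punctuation           [, \, ], ^, _, `
97-122    Lowercase             a-z
123-126   Other symbols         {, |, }, ~
127       Control character     DEL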
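
The ranges in the table translate directly into range checks. The following is a rough sketch; `classifyAscii` and its category labels are illustrative names chosen here, not a standard API:

```javascript
// Hypothetical helper: classify an ASCII code by the ranges in the table.
function classifyAscii(code) {
  if (!Number.isInteger(code) || code < 0 || code > 127) return 'not ASCII';
  if (code <= 31 || code === 127) return 'control';   // 127 is DEL
  if (code === 32) return 'space';
  if (code >= 48 && code <= 57) return 'digit';
  if (code >= 65 && code <= 90) return 'uppercase';
  if (code >= 97 && code <= 122) return 'lowercase';
  return 'punctuation or symbol';                      // everything else in 33-126
}

console.log(classifyAscii('A'.charCodeAt(0))); // "uppercase"
console.log(classifyAscii('7'.charCodeAt(0))); // "digit"
console.log(classifyAscii(10));                // "control" (LF)
```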

ASCII Conversion Implementation

class ASCIIConverter {
  static charToCode(char) {
    return char.charCodeAt(0);
  }

  static codeToChar(code) {
    return String.fromCharCode(code);
  }

  static stringToASCII(str) {
    return Array.from(str).map(char => ({
      char,
      decimal: char.charCodeAt(0),
      hex: char.charCodeAt(0).toString(16).toUpperCase(),
      binary: char.charCodeAt(0).toString(2).padStart(8, '0')
    }));
  }

  static asciiToString(codes) {
    return codes.map(code => String.fromCharCode(code)).join('');
  }

  static isASCII(str) {
    return /^[\x00-\x7F]*$/.test(str);
  }

  static toUpperCase(char) {
    const code = char.charCodeAt(0);
    if (code >= 97 && code <= 122) {
      return String.fromCharCode(code - 32);
    }
    return char;
  }

  static toLowerCase(char) {
    const code = char.charCodeAt(0);
    if (code >= 65 && code <= 90) {
      return String.fromCharCode(code + 32);
    }
    return char;
  }
}

// Usage
console.log(ASCIIConverter.stringToASCII('Hello'));
// [
//   { char: 'H', decimal: 72, hex: '48', binary: '01001000' },
//   { char: 'e', decimal: 101, hex: '65', binary: '01100101' },
//   ...
// ]

console.log(ASCIIConverter.asciiToString([72, 101, 108, 108, 111]));
// "Hello"

Python ASCII Implementation

class ASCIIConverter:
    @staticmethod
    def char_to_code(char: str) -> int:
        return ord(char)
    
    @staticmethod
    def code_to_char(code: int) -> str:
        return chr(code)
    
    @staticmethod
    def string_to_ascii(s: str) -> list:
        return [
            {
                'char': char,
                'decimal': ord(char),
                'hex': hex(ord(char))[2:].upper(),
                'binary': bin(ord(char))[2:].zfill(8)
            }
            for char in s
        ]
    
    @staticmethod
    def ascii_to_string(codes: list) -> str:
        return ''.join(chr(code) for code in codes)
    
    @staticmethod
    def is_ascii(s: str) -> bool:
        return all(ord(char) < 128 for char in s)

# Usage
print(ASCIIConverter.string_to_ascii('Hello'))
print(ASCIIConverter.ascii_to_string([72, 101, 108, 108, 111]))

Unicode Encoding

Unicode Basics

Unicode is a character set standard that aims to assign a unique code point to every character in every writing system:

  • Range: U+0000 to U+10FFFF (1,114,112 code points)
  • Characters: Over 140,000 assigned
  • Notation: U+XXXX (hexadecimal, at least four digits)

Unicode Planes

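Plane   Range              Name                                       Content
0       U+0000-U+FFFF      BMP (Basic Multilingual Plane)             Common characters
1       U+10000-U+1FFFF    SMP (Supplementary Multilingual Plane)     Emoji, ancient scripts
2       U+20000-U+2FFFF    SIP (Supplementary Ideographic Plane)      Extended CJK
14      U+E0000-U+EFFFF    SSP (Supplementary Special-purpose Plane)  Special purpose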
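
Because each plane spans exactly 0x10000 code points, the plane number is the code point shifted right by 16 bits. A small sketch, where `planeOf` is an illustrative name:

```javascript
// Hypothetical helper: return the Unicode plane (0-16) of a code point.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a valid Unicode code point');
  }
  return codePoint >> 16; // each plane holds 0x10000 code points
}

console.log(planeOf('A'.codePointAt(0)));          // 0 (BMP)
console.log(planeOf('\u{1F600}'.codePointAt(0)));  // 1 (SMP, an emoji)
console.log(planeOf(0x20000));                     // 2 (SIP)
console.log(planeOf(0xE0001));                     // 14 (SSP)
```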

Unicode Conversion Implementation

class UnicodeConverter {
  static charToCodePoint(char) {
    return char.codePointAt(0);
  }

  static codePointToChar(codePoint) {
    return String.fromCodePoint(codePoint);
  }

  static stringToUnicode(str) {
    const result = [];
    for (const char of str) {
      const codePoint = char.codePointAt(0);
      result.push({
        char,
        codePoint,
        unicode: `U+${codePoint.toString(16).toUpperCase().padStart(4, '0')}`,
        utf8: this.toUTF8Bytes(codePoint),
        utf16: this.toUTF16(codePoint)
      });
    }
    return result;
  }

  static toUTF8Bytes(codePoint) {
    const bytes = [];
    if (codePoint <= 0x7F) {
      bytes.push(codePoint);
    } else if (codePoint <= 0x7FF) {
      bytes.push(0xC0 | (codePoint >> 6));
      bytes.push(0x80 | (codePoint & 0x3F));
    } else if (codePoint <= 0xFFFF) {
      bytes.push(0xE0 | (codePoint >> 12));
      bytes.push(0x80 | ((codePoint >> 6) & 0x3F));
      bytes.push(0x80 | (codePoint & 0x3F));
    } else {
      bytes.push(0xF0 | (codePoint >> 18));
      bytes.push(0x80 | ((codePoint >> 12) & 0x3F));
      bytes.push(0x80 | ((codePoint >> 6) & 0x3F));
      bytes.push(0x80 | (codePoint & 0x3F));
    }
    return bytes.map(b => b.toString(16).toUpperCase().padStart(2, '0'));
  }

  static toUTF16(codePoint) {
    if (codePoint <= 0xFFFF) {
      return [codePoint.toString(16).toUpperCase().padStart(4, '0')];
    }
    // Surrogate pair
    const offset = codePoint - 0x10000;
    const high = 0xD800 + (offset >> 10);
    const low = 0xDC00 + (offset & 0x3FF);
    return [
      high.toString(16).toUpperCase(),
      low.toString(16).toUpperCase()
    ];
  }

  static escapeUnicode(str) {
    return Array.from(str)
      .map(char => {
        const code = char.codePointAt(0);
        if (code > 0xFFFF) {
          return `\\u{${code.toString(16).toUpperCase()}}`;
        }
        return `\\u${code.toString(16).toUpperCase().padStart(4, '0')}`;
      })
      .join('');
  }

  static unescapeUnicode(str) {
    return str.replace(/\\u\{([0-9A-Fa-f]+)\}|\\u([0-9A-Fa-f]{4})/g, 
      (match, p1, p2) => {
        const codePoint = parseInt(p1 || p2, 16);
        return String.fromCodePoint(codePoint);
      }
    );
  }
}

// Usage
console.log(UnicodeConverter.stringToUnicode('Hello👋'));
console.log(UnicodeConverter.escapeUnicode('Hello World'));

UTF-8 Encoding

UTF-8 Principles

UTF-8 is a variable-length encoding for Unicode that uses one to four bytes per code point:

Unicode Range       UTF-8 Bytes   Format
U+0000-U+007F       1             0xxxxxxx
U+0080-U+07FF       2             110xxxxx 10xxxxxx
U+0800-U+FFFF       3             1110xxxx 10xxxxxx 10xxxxxx
U+10000-U+10FFFF    4             11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
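
The table can be read as a lookup from code point to byte count. The sketch below uses illustrative helper names (`utf8ByteLength`, `utf8Length`) and assumes the input string contains no unpaired surrogates:

```javascript
// Hypothetical helper: bytes needed to encode one code point in UTF-8.
function utf8ByteLength(codePoint) {
  if (codePoint <= 0x7F) return 1;
  if (codePoint <= 0x7FF) return 2;
  if (codePoint <= 0xFFFF) return 3;
  return 4;
}

// Total UTF-8 size of a string, iterating by code point (not UTF-16 unit).
function utf8Length(str) {
  let total = 0;
  for (const char of str) {
    total += utf8ByteLength(char.codePointAt(0));
  }
  return total;
}

console.log(utf8Length('A'));         // 1
console.log(utf8Length('\u00E9'));    // 2 (é)
console.log(utf8Length('中'));        // 3
console.log(utf8Length('\u{1F600}')); // 4 (emoji)
```

For well-formed strings the result should agree with `new TextEncoder().encode(str).length`.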

UTF-8 Encoding Example

For the Chinese character "中" (U+4E2D):

1. Code point: 0x4E2D = 0100 1110 0010 1101
2. Range: U+0800-U+FFFF, needs 3 bytes
3. Template: 1110xxxx 10xxxxxx 10xxxxxx
4. Fill in:
   - 1110 0100 (E4)
   - 10 111000 (B8)
   - 10 101101 (AD)
5. Result: E4 B8 AD
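
The same steps can be checked with a few bit operations and compared against the platform's built-in `TextEncoder`. This is only a verification sketch for the three-byte case worked through above:

```javascript
const codePoint = 0x4E2D; // '中'

// Three-byte template: 1110xxxx 10xxxxxx 10xxxxxx
const bytes = [
  0xE0 | (codePoint >> 12),          // top 4 bits
  0x80 | ((codePoint >> 6) & 0x3F),  // middle 6 bits
  0x80 | (codePoint & 0x3F)          // low 6 bits
];

const hex = bytes
  .map(b => b.toString(16).toUpperCase().padStart(2, '0'))
  .join(' ');
console.log(hex); // "E4 B8 AD"

// Cross-check against the built-in encoder
const builtIn = Array.from(new TextEncoder().encode('中'));
console.log(builtIn.length === 3 && builtIn.every((b, i) => b === bytes[i])); // true
```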

UTF-8 Codec Implementation

class UTF8Codec {
  static encode(str) {
    const bytes = [];
    for (const char of str) {
      const codePoint = char.codePointAt(0);
      
      if (codePoint <= 0x7F) {
        bytes.push(codePoint);
      } else if (codePoint <= 0x7FF) {
        bytes.push(0xC0 | (codePoint >> 6));
        bytes.push(0x80 | (codePoint & 0x3F));
      } else if (codePoint <= 0xFFFF) {
        bytes.push(0xE0 | (codePoint >> 12));
        bytes.push(0x80 | ((codePoint >> 6) & 0x3F));
        bytes.push(0x80 | (codePoint & 0x3F));
      } else {
        bytes.push(0xF0 | (codePoint >> 18));
        bytes.push(0x80 | ((codePoint >> 12) & 0x3F));
        bytes.push(0x80 | ((codePoint >> 6) & 0x3F));
        bytes.push(0x80 | (codePoint & 0x3F));
      }
    }
    return new Uint8Array(bytes);
  }

  static decode(bytes) {
    let result = '';
    let i = 0;
    
    while (i < bytes.length) {
      let codePoint;
      const byte1 = bytes[i];
      
      if ((byte1 & 0x80) === 0) {
        codePoint = byte1;
        i += 1;
      } else if ((byte1 & 0xE0) === 0xC0) {
        codePoint = ((byte1 & 0x1F) << 6) | (bytes[i + 1] & 0x3F);
        i += 2;
      } else if ((byte1 & 0xF0) === 0xE0) {
        codePoint = ((byte1 & 0x0F) << 12) | 
                    ((bytes[i + 1] & 0x3F) << 6) | 
                    (bytes[i + 2] & 0x3F);
        i += 3;
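      } else if ((byte1 & 0xF8) === 0xF0) {
        codePoint = ((byte1 & 0x07) << 18) | 
                    ((bytes[i + 1] & 0x3F) << 12) | 
                    ((bytes[i + 2] & 0x3F) << 6) | 
                    (bytes[i + 3] & 0x3F);
        i += 4;
      } else {
        // Stray continuation byte (10xxxxxx) or invalid lead byte (11111xxx)
        throw new RangeError(`Invalid UTF-8 lead byte 0x${byte1.toString(16)} at index ${i}`);
      }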
      
      result += String.fromCodePoint(codePoint);
    }
    
    return result;
  }

  static toHexString(bytes) {
    return Array.from(bytes)
      .map(b => b.toString(16).toUpperCase().padStart(2, '0'))
      .join(' ');
  }
}

// Usage
const encoded = UTF8Codec.encode('Hello World');
console.log(UTF8Codec.toHexString(encoded));

const decoded = UTF8Codec.decode(encoded);
console.log(decoded);

HTML Entity Encoding

What Are HTML Entities?

HTML entities (character references) are escape sequences that represent reserved or special characters in HTML:

<!-- Named entities -->
&lt;    → <
&gt;    → >
&amp;   → &
&quot;  → "
&nbsp;  → non-breaking space

<!-- Numeric entities -->
&#60;   → < (decimal)
&#x3C;  → < (hexadecimal)
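
A numeric entity is just the character's code point written in decimal or hexadecimal. A minimal sketch, with `toNumericEntity` as an illustrative name:

```javascript
// Hypothetical helper: build a numeric character reference for one character.
function toNumericEntity(char, radix = 10) {
  const codePoint = char.codePointAt(0);
  return radix === 16
    ? `&#x${codePoint.toString(16).toUpperCase()};`
    : `&#${codePoint};`;
}

console.log(toNumericEntity('<'));      // "&#60;"
console.log(toNumericEntity('<', 16));  // "&#x3C;"
console.log(toNumericEntity('©'));      // "&#169;"
```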

Why Use HTML Entities?

  1. Avoid parsing errors: < and > would be parsed as tags
  2. Prevent XSS attacks: Escape user input
  3. Display special characters: Copyright ©, trademark ™, etc.

HTML Entity Encoding Implementation

class HTMLEntityEncoder {
  static namedEntities = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&quot;',
    "'": '&#39;',
    '/': '&#x2F;',
    '`': '&#x60;',
    '=': '&#x3D;'
  };

  static reverseEntities = {
    'amp': '&',
    'lt': '<',
    'gt': '>',
    'quot': '"',
    'apos': "'",
    'nbsp': '\u00A0',
    'copy': '©',
    'reg': '®',
    'trade': '™',
    'euro': '€',
    'pound': '£',
    'yen': '¥',
    'cent': '¢'
  };

  static encode(str, options = {}) {
    const { mode = 'named', encodeAll = false } = options;
    // The `u` flag matches whole code points, so characters outside the BMP
    // (such as emoji) are not split into two surrogate halves.
    const pattern = encodeAll ? /[\s\S]/gu : /[&<>"'`=\/]|[^\x00-\x7F]/gu;

    return str.replace(pattern, char => {
      if (this.namedEntities[char]) {
        return this.namedEntities[char];
      }

      // Everything else that matched becomes a numeric character reference
      const code = char.codePointAt(0);
      return mode === 'hex'
        ? `&#x${code.toString(16).toUpperCase()};`
        : `&#${code};`;
    });
  }

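  static decode(str) {
    // Single pass over the input, so "&amp;#60;" decodes to "&#60;"
    // rather than being decoded a second time into "<"
    return str.replace(/&(?:#x([0-9A-Fa-f]+)|#(\d+)|([a-zA-Z]+));/g,
      (match, hex, dec, name) => {
        if (name) {
          return this.reverseEntities[name.toLowerCase()] || match;
        }
        const codePoint = hex ? parseInt(hex, 16) : parseInt(dec, 10);
        // Leave out-of-range references untouched instead of throwing
        return codePoint <= 0x10FFFF ? String.fromCodePoint(codePoint) : match;
      }
    );
  }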

  static encodeForAttribute(str) {
    return str.replace(/[&<>"']/g, char => this.namedEntities[char]);
  }

  static encodeForHTML(str) {
    return str.replace(/[&<>]/g, char => this.namedEntities[char]);
  }
}

// Usage
console.log(HTMLEntityEncoder.encode('<script>alert("XSS")</script>'));
// "&lt;script&gt;alert(&quot;XSS&quot;)&lt;&#x2F;script&gt;"

console.log(HTMLEntityEncoder.decode('&lt;div&gt;Hello&lt;/div&gt;'));
// "<div>Hello</div>"

Common HTML Entity Reference

Character   Named     Decimal   Hex        Description
<           &lt;      &#60;     &#x3C;     Less than
>           &gt;      &#62;     &#x3E;     Greater than
&           &amp;     &#38;     &#x26;     Ampersand
"           &quot;    &#34;     &#x22;     Double quote
'           &apos;    &#39;     &#x27;     Single quote
©           &copy;    &#169;    &#xA9;     Copyright
®           &reg;     &#174;    &#xAE;     Registered
™           &trade;   &#8482;   &#x2122;   Trademark
€           &euro;    &#8364;   &#x20AC;   Euro
£           &pound;   &#163;    &#xA3;     Pound
¥           &yen;     &#165;    &#xA5;     Yen
(space)     &nbsp;    &#160;    &#xA0;     Non-breaking space

URL Encoding

URL Encoding Principles

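URL encoding (percent-encoding) represents characters that are reserved or not allowed in a URL as % followed by two hexadecimal digits:

Space → %20 (or + in form-encoded query strings)
Non-ASCII text such as Chinese → each UTF-8 byte written as %XX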
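
The built-in functions show both rules in action. This is a short illustration using only standard globals (`encodeURIComponent`, `decodeURIComponent`, `URLSearchParams`); the variable names are arbitrary:

```javascript
// Non-ASCII characters become one %XX per UTF-8 byte
const encodedChar = encodeURIComponent('中');
console.log(encodedChar); // "%E4%B8%AD"

// encodeURIComponent always uses %20 for a space
const percentEncoded = encodeURIComponent('a b&c');
console.log(percentEncoded); // "a%20b%26c"

// URLSearchParams follows form encoding, where a space becomes "+"
const formEncoded = new URLSearchParams({ q: 'a b&c' }).toString();
console.log(formEncoded); // "q=a+b%26c"

// Only the form-aware parser turns "+" back into a space
console.log(decodeURIComponent('a+b'));              // "a+b"
console.log(new URLSearchParams('q=a+b').get('q'));  // "a b"
```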

URL Encoding Implementation

class URLEncoder {
  static encode(str) {
    return encodeURIComponent(str);
  }

  static decode(str) {
    return decodeURIComponent(str);
  }

  static encodeQueryParam(params) {
    return Object.entries(params)
      .map(([key, value]) => 
        `${encodeURIComponent(key)}=${encodeURIComponent(value)}`
      )
      .join('&');
  }

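  static decodeQueryParam(queryString) {
    const params = {};
    const query = queryString.replace(/^\?/, '');
    if (query === '') return params;

    for (const pair of query.split('&')) {
      // Split on the first "=" only, so values that contain "=" stay intact
      const index = pair.indexOf('=');
      const key = index === -1 ? pair : pair.slice(0, index);
      const value = index === -1 ? '' : pair.slice(index + 1);
      params[decodeURIComponent(key)] = decodeURIComponent(value);
    }

    return params;
  }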
}

// Usage
console.log(URLEncoder.encode('Hello World'));
// "Hello%20World"

console.log(URLEncoder.encodeQueryParam({
  name: 'John Doe',
  message: 'Hello World!'
}));
// "name=John%20Doe&message=Hello%20World!"

Practical Applications

1. XSS Prevention

function sanitizeHTML(input) {
  return HTMLEntityEncoder.encode(input);
}

function createSafeElement(tag, text) {
  const element = document.createElement(tag);
  element.textContent = text;  // Inserted as plain text, never parsed as HTML
  return element;
}

// Unsafe
element.innerHTML = userInput;  // ❌ XSS risk

// Safe
element.textContent = userInput;  // ✅ Inserted as plain text, never parsed as HTML
element.innerHTML = sanitizeHTML(userInput);  // ✅ Escaped manually before parsing

2. Internationalization Text Processing

function normalizeText(str) {
  // NFD: Decomposition
  // NFC: Composition
  // NFKD: Compatibility decomposition
  // NFKC: Compatibility composition
  return str.normalize('NFC');
}

function compareStrings(a, b, locale = 'en-US') {
  return a.localeCompare(b, locale);
}

// Full-width/Half-width conversion
function toHalfWidth(str) {
  return str.replace(/[\uFF01-\uFF5E]/g, char => 
    String.fromCharCode(char.charCodeAt(0) - 0xFEE0)
  ).replace(/\u3000/g, ' ');
}

function toFullWidth(str) {
  return str.replace(/[\x21-\x7E]/g, char =>
    String.fromCharCode(char.charCodeAt(0) + 0xFEE0)
  ).replace(/ /g, '\u3000');
}
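
To see why normalization matters before comparing strings, consider a character that has two valid representations. The snippet below is a small illustration using the standard `String.prototype.normalize`; note that NFKC also folds full-width forms, but it changes other compatibility characters too, so it is not a drop-in replacement for `toHalfWidth`:

```javascript
const composed = '\u00E9';     // "é" as a single code point
const decomposed = 'e\u0301';  // "e" followed by a combining acute accent

// They render the same but are different strings
console.log(composed === decomposed); // false

// After normalizing both to NFC they compare equal
const sameAfterNFC = composed.normalize('NFC') === decomposed.normalize('NFC');
console.log(sameAfterNFC); // true
console.log(decomposed.normalize('NFC').length); // 1
console.log(composed.normalize('NFD').length);   // 2

// Compatibility normalization maps full-width letters to ASCII
const folded = '\uFF21\uFF22\uFF23'.normalize('NFKC');
console.log(folded); // "ABC"
```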

3. File Encoding Detection

async function detectEncoding(file) {
  const buffer = await file.arrayBuffer();
  const bytes = new Uint8Array(buffer);
  
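  // Detect BOM. UTF-32LE (FF FE 00 00) begins with the UTF-16LE BOM,
  // so it has to be checked before UTF-16LE.
  if (bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return 'UTF-8';
  }
  if (bytes[0] === 0xFF && bytes[1] === 0xFE && bytes[2] === 0x00 && bytes[3] === 0x00) {
    return 'UTF-32LE';
  }
  if (bytes[0] === 0xFF && bytes[1] === 0xFE) {
    return 'UTF-16LE';
  }
  if (bytes[0] === 0xFE && bytes[1] === 0xFF) {
    return 'UTF-16BE';
  }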
  
  // Try UTF-8 decoding
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return 'UTF-8';
  } catch {
    return 'unknown';
  }
}

Common Issues and Solutions

1. Garbled Text

// Problem: UTF-8 file opened with wrong encoding
// Solution: Specify correct encoding
const decoder = new TextDecoder('utf-8');
const text = decoder.decode(buffer);

// Problem: Garbled text coming back from a database
// Solution: Make the connection encoding match the stored data (MySQL example)
// SET NAMES utf8mb4;
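
Garbled text can be reproduced on purpose by decoding UTF-8 bytes with the wrong encoding. The sketch below assumes a runtime whose `TextDecoder` supports the `windows-1252` label (browsers do, as do Node.js builds with full ICU):

```javascript
// "中" is three bytes in UTF-8: E4 B8 AD
const utf8Bytes = new TextEncoder().encode('中');

// Wrong: each byte is interpreted as its own single-byte character
const garbled = new TextDecoder('windows-1252').decode(utf8Bytes);
const garbledCodes = Array.from(garbled, c => c.codePointAt(0).toString(16));
console.log(garbledCodes); // [ 'e4', 'b8', 'ad' ] : three unrelated characters

// Right: decode with the encoding the bytes were written in
const restored = new TextDecoder('utf-8').decode(utf8Bytes);
console.log(restored); // "中"
```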

2. Emoji Handling

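// Problem: .length counts UTF-16 code units, not visible characters
'👨‍👩‍👧‍👦'.length;  // 11 (8 surrogate halves + 3 zero-width joiners)

// Spread or Array.from counts code points, which is closer but still not 1
[...'👨‍👩‍👧‍👦'].length;  // 7 (4 emoji + 3 zero-width joiners)

// Solution: count grapheme clusters with Intl.Segmenter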
function getCharacterCount(str) {
  const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
  return [...segmenter.segment(str)].length;
}

3. Surrogate Pair Issues

// Problem: Characters outside BMP
const emoji = '😀';
emoji.length;  // 2 (surrogate pair)
emoji.charCodeAt(0);  // 55357 (high surrogate)
emoji.charCodeAt(1);  // 56832 (low surrogate)

// Solution: Use codePointAt
emoji.codePointAt(0);  // 128512 (correct code point)
String.fromCodePoint(128512);  // '😀'
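
The relationship between the two surrogates and the code point is plain arithmetic. As a rough sketch (the function name `combineSurrogates` is illustrative), the values above can be combined by hand:

```javascript
// Hypothetical helper: rebuild a code point from a UTF-16 surrogate pair.
function combineSurrogates(high, low) {
  if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF) {
    throw new RangeError('Not a valid surrogate pair');
  }
  // High surrogate carries the top 10 bits, low surrogate the bottom 10 bits
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

console.log(combineSurrogates(55357, 56832)); // 128512 (U+1F600)
```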

Summary

Character encoding is fundamental to text processing. Key points:

  1. ASCII: Basic encoding, English only
  2. Unicode: Unified character set with code points for all characters
  3. UTF-8: Variable-length encoding, ASCII-compatible, most widely used
  4. HTML Entities: Safely display special characters in HTML
  5. URL Encoding: Safely transmit special characters in URLs
