Text Encoding Complete Guide: HTML Entities, ASCII, Unicode, and Character Encoding

2024-01-17 - QubitTool Technical Team

Character encoding is the foundation of how computers process text. Understanding encoding principles helps solve garbled text issues and plays a crucial role in web security and internationalization. This article provides an in-depth explanation of various encoding methods.

Character Encoding Basics

Why Do We Need Character Encoding?

Computers can only process numbers (binary), while humans use text. Character encoding establishes the mapping between characters and numbers:

code

Character 'A' → Number 65 → Binary 01000001
Character '中' → Number 20013 → Binary ...
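
The mapping can be checked directly in JavaScript. The following is a minimal sketch; `describeChar` is a hypothetical helper name introduced here for illustration, not part of any library.

```javascript
// Hypothetical helper: show the number and the binary form behind one character.
function describeChar(char) {
  const codePoint = char.codePointAt(0);
  return {
    char,
    decimal: codePoint,
    // Pad to at least 8 digits so ASCII values read as one full byte
    binary: codePoint.toString(2).padStart(8, '0')
  };
}

console.log(describeChar('A'));  // { char: 'A', decimal: 65, binary: '01000001' }
console.log(describeChar('中').decimal);  // 20013
```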

Evolution of Encoding

code

ASCII (1963) → Extended ASCII → ISO-8859 → Unicode (1991) → UTF-8/UTF-16
     ↓              ↓              ↓            ↓
   7-bit/128     8-bit/256     Regional     Unified
   characters    characters    encoding     encoding

ASCII Encoding

ASCII Basics

ASCII (American Standard Code for Information Interchange) is the most fundamental character encoding:

Range: 0-127 (7 bits)
Characters: 128
Includes: English letters, digits, punctuation, control characters

ASCII Table

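Range	Type	Examples
0-31	Control characters	NUL, TAB, LF, CR
32-47	Space and punctuation	Space, !, ", #
48-57	Digits	0-9
58-64	Punctuation	:, ;, <, =, >, ?, @
65-90	Uppercase	A-Z
91-96	Punctuation	[, \, ], ^, _, `
97-122	Lowercase	a-z
123-126	Punctuation	{, |, }, ~
127	Control character	DEL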
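
The ranges in the table translate directly into numeric comparisons. A minimal sketch, assuming a hypothetical helper named `classifyASCII`; note that it treats code 127 (DEL) as a control character and groups every remaining printable code as punctuation or symbol.

```javascript
// Hypothetical helper: name the class of an ASCII code (0-127).
function classifyASCII(code) {
  if (!Number.isInteger(code) || code < 0 || code > 127) return 'not ASCII';
  if (code <= 31 || code === 127) return 'control';
  if (code === 32) return 'space';
  if (code >= 48 && code <= 57) return 'digit';
  if (code >= 65 && code <= 90) return 'uppercase';
  if (code >= 97 && code <= 122) return 'lowercase';
  return 'punctuation or symbol';
}

console.log(classifyASCII('7'.charCodeAt(0)));  // "digit"
console.log(classifyASCII('~'.charCodeAt(0)));  // "punctuation or symbol"
```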

ASCII Conversion Implementation

javascript

class ASCIIConverter {
  static charToCode(char) {
    return char.charCodeAt(0);
  }

  static codeToChar(code) {
    return String.fromCharCode(code);
  }

  static stringToASCII(str) {
    return Array.from(str).map(char => ({
      char,
      decimal: char.charCodeAt(0),
      hex: char.charCodeAt(0).toString(16).toUpperCase(),
      binary: char.charCodeAt(0).toString(2).padStart(8, '0')
    }));
  }

  static asciiToString(codes) {
    return codes.map(code => String.fromCharCode(code)).join('');
  }

  static isASCII(str) {
    return /^[\x00-\x7F]*$/.test(str);
  }

  static toUpperCase(char) {
    const code = char.charCodeAt(0);
    if (code >= 97 && code <= 122) {
      return String.fromCharCode(code - 32);
    }
    return char;
  }

  static toLowerCase(char) {
    const code = char.charCodeAt(0);
    if (code >= 65 && code <= 90) {
      return String.fromCharCode(code + 32);
    }
    return char;
  }
}

// Usage
console.log(ASCIIConverter.stringToASCII('Hello'));
// [
//   { char: 'H', decimal: 72, hex: '48', binary: '01001000' },
//   { char: 'e', decimal: 101, hex: '65', binary: '01100101' },
//   ...
// ]

console.log(ASCIIConverter.asciiToString([72, 101, 108, 108, 111]));
// "Hello"

Python ASCII Implementation

python

class ASCIIConverter:
    @staticmethod
    def char_to_code(char: str) -> int:
        return ord(char)
    
    @staticmethod
    def code_to_char(code: int) -> str:
        return chr(code)
    
    @staticmethod
    def string_to_ascii(s: str) -> list:
        return [
            {
                'char': char,
                'decimal': ord(char),
                'hex': hex(ord(char))[2:].upper(),
                'binary': bin(ord(char))[2:].zfill(8)
            }
            for char in s
        ]
    
    @staticmethod
    def ascii_to_string(codes: list) -> str:
        return ''.join(chr(code) for code in codes)
    
    @staticmethod
    def is_ascii(s: str) -> bool:
        return all(ord(char) < 128 for char in s)

# Usage
print(ASCIIConverter.string_to_ascii('Hello'))
print(ASCIIConverter.ascii_to_string([72, 101, 108, 108, 111]))

Unicode Encoding

Unicode Basics

Unicode is a character set standard that assigns a unique number, called a code point, to each character across the world's writing systems:

Range: U+0000 to U+10FFFF
Characters: Over 140,000
Notation: U+XXXX (hexadecimal)

Unicode Planes

Plane	Range	Name	Content
0	U+0000-U+FFFF	BMP	Common characters
1	U+10000-U+1FFFF	SMP	Emoji, ancient scripts
2	U+20000-U+2FFFF	SIP	Extended CJK
14	U+E0000-U+EFFFF	SSP	Special purpose
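
Because every plane spans 0x10000 (65,536) code points, the plane number can be read off the code point itself. A minimal sketch; `planeOf` is a hypothetical helper name.

```javascript
// Hypothetical helper: return the plane (0-16) that a code point belongs to.
// Shifting right by 16 bits is the same as integer division by 0x10000.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError(`Not a Unicode code point: ${codePoint}`);
  }
  return codePoint >> 16;
}

console.log(planeOf('A'.codePointAt(0)));   // 0 (BMP)
console.log(planeOf('😀'.codePointAt(0)));  // 1 (SMP)
console.log(planeOf(0x20000));              // 2 (SIP)
```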

Unicode Conversion Implementation

javascript

class UnicodeConverter {
  static charToCodePoint(char) {
    return char.codePointAt(0);
  }

  static codePointToChar(codePoint) {
    return String.fromCodePoint(codePoint);
  }

  static stringToUnicode(str) {
    const result = [];
    for (const char of str) {
      const codePoint = char.codePointAt(0);
      result.push({
        char,
        codePoint,
        unicode: `U+${codePoint.toString(16).toUpperCase().padStart(4, '0')}`,
        utf8: this.toUTF8Bytes(codePoint),
        utf16: this.toUTF16(codePoint)
      });
    }
    return result;
  }

  static toUTF8Bytes(codePoint) {
    const bytes = [];
    if (codePoint <= 0x7F) {
      bytes.push(codePoint);
    } else if (codePoint <= 0x7FF) {
      bytes.push(0xC0 | (codePoint >> 6));
      bytes.push(0x80 | (codePoint & 0x3F));
    } else if (codePoint <= 0xFFFF) {
      bytes.push(0xE0 | (codePoint >> 12));
      bytes.push(0x80 | ((codePoint >> 6) & 0x3F));
      bytes.push(0x80 | (codePoint & 0x3F));
    } else {
      bytes.push(0xF0 | (codePoint >> 18));
      bytes.push(0x80 | ((codePoint >> 12) & 0x3F));
      bytes.push(0x80 | ((codePoint >> 6) & 0x3F));
      bytes.push(0x80 | (codePoint & 0x3F));
    }
    return bytes.map(b => b.toString(16).toUpperCase().padStart(2, '0'));
  }

  static toUTF16(codePoint) {
    if (codePoint <= 0xFFFF) {
      return [codePoint.toString(16).toUpperCase().padStart(4, '0')];
    }
    // Surrogate pair
    const offset = codePoint - 0x10000;
    const high = 0xD800 + (offset >> 10);
    const low = 0xDC00 + (offset & 0x3FF);
    return [
      high.toString(16).toUpperCase(),
      low.toString(16).toUpperCase()
    ];
  }

  static escapeUnicode(str) {
    return Array.from(str)
      .map(char => {
        const code = char.codePointAt(0);
        if (code > 0xFFFF) {
          return `\\u{${code.toString(16).toUpperCase()}}`;
        }
        return `\\u${code.toString(16).toUpperCase().padStart(4, '0')}`;
      })
      .join('');
  }

  static unescapeUnicode(str) {
    return str.replace(/\\u\{([0-9A-Fa-f]+)\}|\\u([0-9A-Fa-f]{4})/g, 
      (match, p1, p2) => {
        const codePoint = parseInt(p1 || p2, 16);
        return String.fromCodePoint(codePoint);
      }
    );
  }
}

// Usage
console.log(UnicodeConverter.stringToUnicode('Hello👋'));
console.log(UnicodeConverter.escapeUnicode('Hello World'));

UTF-8 Encoding

UTF-8 Principles

UTF-8 is a variable-length encoding for Unicode that uses one to four bytes per code point:

Unicode Range	UTF-8 Bytes	Format
U+0000-U+007F	1	0xxxxxxx
U+0080-U+07FF	2	110xxxxx 10xxxxxx
U+0800-U+FFFF	3	1110xxxx 10xxxxxx 10xxxxxx
U+10000-U+10FFFF	4	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
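
The table can be read as a lookup from code point to byte count. The sketch below uses a hypothetical helper, `utf8ByteLength`, and cross-checks it against the built-in `TextEncoder`, which always produces UTF-8.

```javascript
// Hypothetical helper: number of UTF-8 bytes needed for one code point,
// following the four ranges in the table above.
function utf8ByteLength(codePoint) {
  if (codePoint <= 0x7F) return 1;
  if (codePoint <= 0x7FF) return 2;
  if (codePoint <= 0xFFFF) return 3;
  return 4;
}

// Compare with what TextEncoder actually emits for sample characters.
for (const char of ['A', 'é', '中', '😀']) {
  const predicted = utf8ByteLength(char.codePointAt(0));
  const actual = new TextEncoder().encode(char).length;
  console.log(char, predicted, actual);  // both numbers match: 1, 2, 3, 4
}
```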

UTF-8 Encoding Example

For the Chinese character "中" (U+4E2D):

code

1. Code point: 0x4E2D = 0100 1110 0010 1101
2. Range: U+0800-U+FFFF, needs 3 bytes
3. Template: 1110xxxx 10xxxxxx 10xxxxxx
4. Fill in:
   - 1110 0100 (E4)
   - 10 111000 (B8)
   - 10 101101 (AD)
5. Result: E4 B8 AD
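
The same steps can be replayed in code by working on the bit string rather than with shifts and masks. This is an illustrative sketch for the three-byte range only; `encodeThreeByteByHand` is a hypothetical name.

```javascript
// Hypothetical helper: encode one code point from U+0800-U+FFFF by filling
// the template 1110xxxx 10xxxxxx 10xxxxxx with its 16 bits.
function encodeThreeByteByHand(codePoint) {
  if (codePoint < 0x0800 || codePoint > 0xFFFF) {
    throw new RangeError('This sketch only covers the three-byte range');
  }
  const bits = codePoint.toString(2).padStart(16, '0');
  // Split the 16 bits into groups of 4, 6, and 6
  const groups = [bits.slice(0, 4), bits.slice(4, 10), bits.slice(10, 16)];
  const bytes = [
    parseInt('1110' + groups[0], 2),
    parseInt('10' + groups[1], 2),
    parseInt('10' + groups[2], 2)
  ];
  return bytes.map(b => b.toString(16).toUpperCase()).join(' ');
}

console.log(encodeThreeByteByHand(0x4E2D));  // "E4 B8 AD"
```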

UTF-8 Codec Implementation

javascript

class UTF8Codec {
  static encode(str) {
    const bytes = [];
    for (const char of str) {
      const codePoint = char.codePointAt(0);
      
      if (codePoint <= 0x7F) {
        bytes.push(codePoint);
      } else if (codePoint <= 0x7FF) {
        bytes.push(0xC0 | (codePoint >> 6));
        bytes.push(0x80 | (codePoint & 0x3F));
      } else if (codePoint <= 0xFFFF) {
        bytes.push(0xE0 | (codePoint >> 12));
        bytes.push(0x80 | ((codePoint >> 6) & 0x3F));
        bytes.push(0x80 | (codePoint & 0x3F));
      } else {
        bytes.push(0xF0 | (codePoint >> 18));
        bytes.push(0x80 | ((codePoint >> 12) & 0x3F));
        bytes.push(0x80 | ((codePoint >> 6) & 0x3F));
        bytes.push(0x80 | (codePoint & 0x3F));
      }
    }
    return new Uint8Array(bytes);
  }

  static decode(bytes) {
    let result = '';
    let i = 0;
    
    while (i < bytes.length) {
      let codePoint;
      const byte1 = bytes[i];
      
      if ((byte1 & 0x80) === 0) {
        codePoint = byte1;
        i += 1;
      } else if ((byte1 & 0xE0) === 0xC0) {
        codePoint = ((byte1 & 0x1F) << 6) | (bytes[i + 1] & 0x3F);
        i += 2;
      } else if ((byte1 & 0xF0) === 0xE0) {
        codePoint = ((byte1 & 0x0F) << 12) | 
                    ((bytes[i + 1] & 0x3F) << 6) | 
                    (bytes[i + 2] & 0x3F);
        i += 3;
      } else {
        codePoint = ((byte1 & 0x07) << 18) | 
                    ((bytes[i + 1] & 0x3F) << 12) | 
                    ((bytes[i + 2] & 0x3F) << 6) | 
                    (bytes[i + 3] & 0x3F);
        i += 4;
      }
      
      result += String.fromCodePoint(codePoint);
    }
    
    return result;
  }

  static toHexString(bytes) {
    return Array.from(bytes)
      .map(b => b.toString(16).toUpperCase().padStart(2, '0'))
      .join(' ');
  }
}

// Usage
const encoded = UTF8Codec.encode('Hello World');
console.log(UTF8Codec.toHexString(encoded));

const decoded = UTF8Codec.decode(encoded);
console.log(decoded);
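
The hand-written decoder above assumes well-formed input and does not reject truncated or overlong sequences. When input may be malformed, one option is to lean on the built-in `TextDecoder` in fatal mode. A minimal sketch, with `decodeStrict` as a hypothetical wrapper name:

```javascript
// Hypothetical wrapper: return the decoded string, or null if the bytes
// are not valid UTF-8. With { fatal: true } the decoder throws a TypeError
// instead of silently substituting U+FFFD.
function decodeStrict(bytes) {
  try {
    return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
  } catch (err) {
    return null;
  }
}

console.log(decodeStrict(new Uint8Array([0xE4, 0xB8, 0xAD])));  // "中"
console.log(decodeStrict(new Uint8Array([0xE4, 0xB8])));        // null (truncated sequence)
```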

HTML Entity Encoding

What Are HTML Entities?

HTML entities (character references) are escape sequences that represent special characters in HTML, either by name or by code point:

html

<!-- Named entities -->
&lt;    → <
&gt;    → >
&amp;   → &
&quot;  → "
&nbsp;  → non-breaking space

<!-- Numeric entities -->
&#60;   → < (decimal)
&#x3C;  → < (hexadecimal)
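
Numeric entities are simply the character's code point written in decimal or hexadecimal. A minimal sketch; `toNumericEntities` is a hypothetical helper name.

```javascript
// Hypothetical helper: build both numeric forms for the first character of a string.
function toNumericEntities(char) {
  const codePoint = char.codePointAt(0);
  return {
    decimal: `&#${codePoint};`,
    hex: `&#x${codePoint.toString(16).toUpperCase()};`
  };
}

console.log(toNumericEntities('<'));  // { decimal: '&#60;', hex: '&#x3C;' }
console.log(toNumericEntities('©'));  // { decimal: '&#169;', hex: '&#xA9;' }
```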

Why Use HTML Entities?

Avoid parsing errors: < and > would be parsed as tags
Prevent XSS attacks: Escape user input
Display special characters: Copyright ©, trademark ™, etc.

HTML Entity Encoding Implementation

javascript

class HTMLEntityEncoder {
  static namedEntities = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&quot;',
    "'": '&#39;',
    '/': '&#x2F;',
    '`': '&#x60;',
    '=': '&#x3D;'
  };

  static reverseEntities = {
    'amp': '&',
    'lt': '<',
    'gt': '>',
    'quot': '"',
    'apos': "'",
    'nbsp': '\u00A0',
    'copy': '©',
    'reg': '®',
    'trade': '™',
    'euro': '€',
    'pound': '£',
    'yen': '¥',
    'cent': '¢'
  };

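  static encode(str, options = {}) {
    const { mode = 'named', encodeAll = false } = options;
    // The `u` flag makes the regex match whole code points, so characters
    // outside the BMP are not split into two surrogate halves.
    // With encodeAll, every character is matched, not only the risky ones.
    const pattern = encodeAll
      ? /[\s\S]/gu
      : /[&<>"'`=\/]|[^\x00-\x7F]/gu;

    return str.replace(pattern, char => {
      if (this.namedEntities[char]) {
        return this.namedEntities[char];
      }

      // Anything else that matched becomes a numeric character reference
      const code = char.codePointAt(0);
      return mode === 'hex'
        ? `&#x${code.toString(16).toUpperCase()};`
        : `&#${code};`;
    });
  }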

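  static decode(str) {
    // Decode in a single pass so that output is never decoded twice
    // (e.g. "&amp;lt;" must become "&lt;", not "<").
    return str.replace(/&(?:#x([0-9A-Fa-f]+)|#(\d+)|([a-zA-Z]+));/g,
      (match, hex, dec, name) => {
        if (name !== undefined) {
          const key = name.toLowerCase();
          return Object.hasOwn(this.reverseEntities, key)
            ? this.reverseEntities[key]
            : match;
        }
        const code = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
        // Leave references beyond U+10FFFF untouched instead of throwing
        return code <= 0x10FFFF ? String.fromCodePoint(code) : match;
      }
    );
  }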

  static encodeForAttribute(str) {
    return str.replace(/[&<>"']/g, char => this.namedEntities[char]);
  }

  static encodeForHTML(str) {
    return str.replace(/[&<>]/g, char => this.namedEntities[char]);
  }
}

// Usage
console.log(HTMLEntityEncoder.encode('<script>alert("XSS")</script>'));
// "&lt;script&gt;alert(&quot;XSS&quot;)&lt;&#x2F;script&gt;"

console.log(HTMLEntityEncoder.decode('&lt;div&gt;Hello&lt;/div&gt;'));
// "<div>Hello</div>"

Common HTML Entity Reference

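Character	Named	Decimal	Hex	Description
<	`&lt;`	`&#60;`	`&#x3C;`	Less than
>	`&gt;`	`&#62;`	`&#x3E;`	Greater than
&	`&amp;`	`&#38;`	`&#x26;`	Ampersand
"	`&quot;`	`&#34;`	`&#x22;`	Double quote
'	`&apos;`	`&#39;`	`&#x27;`	Single quote
©	`&copy;`	`&#169;`	`&#xA9;`	Copyright
®	`&reg;`	`&#174;`	`&#xAE;`	Registered
™	`&trade;`	`&#8482;`	`&#x2122;`	Trademark
€	`&euro;`	`&#8364;`	`&#x20AC;`	Euro
£	`&pound;`	`&#163;`	`&#xA3;`	Pound
¥	`&yen;`	`&#165;`	`&#xA5;`	Yen
	`&nbsp;`	`&#160;`	`&#xA0;`	Non-breaking space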

URL Encoding

URL Encoding Principles

URL encoding (percent-encoding) represents characters that are reserved or unsafe in URLs as a % sign followed by two hexadecimal digits:

code

Space → %20 (or + in form-encoded query strings)
Non-ASCII characters → each UTF-8 byte written as %XX (e.g. 中 → %E4%B8%AD)
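
To make the byte-level rule concrete, the sketch below percent-encodes a string by hand: it converts the text to UTF-8 bytes and escapes every byte outside the RFC 3986 unreserved set. `percentEncode` is a hypothetical name; in practice `encodeURIComponent` does this job, although it leaves a few extra characters such as `!` and `*` unescaped.

```javascript
// Hypothetical helper: percent-encode everything except the RFC 3986
// unreserved characters (letters, digits, "-", ".", "_", "~").
function percentEncode(str) {
  const unreserved = /^[A-Za-z0-9\-._~]$/;
  let out = '';
  for (const byte of new TextEncoder().encode(str)) {
    const char = String.fromCharCode(byte);
    out += unreserved.test(char)
      ? char
      : '%' + byte.toString(16).toUpperCase().padStart(2, '0');
  }
  return out;
}

console.log(percentEncode('中'));   // "%E4%B8%AD"
console.log(percentEncode('a b'));  // "a%20b"
```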

URL Encoding Implementation

javascript

class URLEncoder {
  static encode(str) {
    return encodeURIComponent(str);
  }

  static decode(str) {
    return decodeURIComponent(str);
  }

  static encodeQueryParam(params) {
    return Object.entries(params)
      .map(([key, value]) => 
        `${encodeURIComponent(key)}=${encodeURIComponent(value)}`
      )
      .join('&');
  }

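  static decodeQueryParam(queryString) {
    const params = {};
    const query = queryString.replace(/^\?/, '');

    for (const pair of query.split('&')) {
      if (pair === '') continue;  // Skip empty segments such as "a=1&&b=2"
      // Split on the first "=" only, so values may themselves contain "="
      const index = pair.indexOf('=');
      const key = index === -1 ? pair : pair.slice(0, index);
      const value = index === -1 ? '' : pair.slice(index + 1);
      // Form encoding writes spaces as "+", so restore them before decoding
      params[decodeURIComponent(key.replace(/\+/g, ' '))] =
        decodeURIComponent(value.replace(/\+/g, ' '));
    }

    return params;
  }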
}

// Usage
console.log(URLEncoder.encode('Hello World'));
// "Hello%20World"

console.log(URLEncoder.encodeQueryParam({
  name: 'John Doe',
  message: 'Hello World!'
}));
// "name=John%20Doe&message=Hello%20World!"
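
As an alternative to assembling query strings by hand, the built-in `URLSearchParams` handles both directions. A minimal sketch; `buildQuery` and `parseQuery` are hypothetical wrapper names. Note that `URLSearchParams` uses form encoding, so spaces come out as `+` rather than `%20`, and that converting to a plain object keeps only the last value of a repeated key.

```javascript
// Hypothetical wrappers around the built-in URLSearchParams.
function buildQuery(params) {
  return new URLSearchParams(params).toString();
}

function parseQuery(queryString) {
  // A leading "?" is ignored by URLSearchParams
  return Object.fromEntries(new URLSearchParams(queryString));
}

console.log(buildQuery({ name: 'John Doe' }));  // "name=John+Doe"
console.log(parseQuery('?a=1&b=x%3Dy'));        // { a: '1', b: 'x=y' }
```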

Practical Applications

1. XSS Prevention

javascript

function sanitizeHTML(input) {
  return HTMLEntityEncoder.encode(input);
}

function createSafeElement(tag, text) {
  const element = document.createElement(tag);
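  element.textContent = text;  // Inserted as plain text, never parsed as HTML
  return element;
}

// Unsafe: the string is parsed as HTML, so injected markup can run
element.innerHTML = userInput;  // ❌ XSS risk

// Safe: the string is inserted as plain text and never parsed as HTML
element.textContent = userInput;  // ✅ No HTML parsing
// Safe for element content: special characters are escaped before parsing
element.innerHTML = sanitizeHTML(userInput);  // ✅ Manual escape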
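
When markup has to be assembled as a string, one way to make escaping hard to forget is a tagged template that escapes every interpolated value. The sketch below is illustrative only: `escapeHTML` and `safeHTML` are hypothetical names, and the approach assumes values land in element content or quoted attribute values, not inside script, style, or URL contexts.

```javascript
// Hypothetical helper: escape the five characters that matter in
// element content and quoted attribute values.
function escapeHTML(value) {
  const map = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
  return String(value).replace(/[&<>"']/g, ch => map[ch]);
}

// Hypothetical template tag: literal parts pass through unchanged,
// interpolated values are escaped.
function safeHTML(strings, ...values) {
  return strings.reduce(
    (out, part, i) => out + part + (i < values.length ? escapeHTML(values[i]) : ''),
    ''
  );
}

const untrusted = '<img src=x onerror=alert(1)>';
console.log(safeHTML`<p>Hello, ${untrusted}!</p>`);
// "<p>Hello, &lt;img src=x onerror=alert(1)&gt;!</p>"
```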

2. Internationalization Text Processing

javascript

function normalizeText(str) {
  // NFD: Decomposition
  // NFC: Composition
  // NFKD: Compatibility decomposition
  // NFKC: Compatibility composition
  return str.normalize('NFC');
}

function compareStrings(a, b, locale = 'en-US') {
  return a.localeCompare(b, locale);
}

// Full-width/Half-width conversion
function toHalfWidth(str) {
  return str.replace(/[\uFF01-\uFF5E]/g, char => 
    String.fromCharCode(char.charCodeAt(0) - 0xFEE0)
  ).replace(/\u3000/g, ' ');
}

function toFullWidth(str) {
  return str.replace(/[\x21-\x7E]/g, char =>
    String.fromCharCode(char.charCodeAt(0) + 0xFEE0)
  ).replace(/ /g, '\u3000');
}
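
Normalization matters because the same visible text can be stored as different code point sequences. A short sketch of the effect; `equalsNormalized` is a hypothetical helper name.

```javascript
const composed = '\u00E9';     // "é" as a single code point
const decomposed = 'e\u0301';  // "e" followed by a combining acute accent

console.log(composed === decomposed);  // false, although both render as "é"

// Hypothetical helper: compare two strings after normalizing both.
function equalsNormalized(a, b, form = 'NFC') {
  return a.normalize(form) === b.normalize(form);
}

console.log(equalsNormalized(composed, decomposed));  // true

// Compatibility forms also fold presentation variants such as ligatures
// and full-width letters.
console.log('\uFB01'.normalize('NFKC'));  // "fi"
console.log('\uFF21'.normalize('NFKC'));  // "A"
```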

3. File Encoding Detection

javascript

async function detectEncoding(file) {
  const buffer = await file.arrayBuffer();
  const bytes = new Uint8Array(buffer);
  
  // Detect BOM
  if (bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return 'UTF-8';
  }
  if (bytes[0] === 0xFF && bytes[1] === 0xFE) {
    return 'UTF-16LE';
  }
  if (bytes[0] === 0xFE && bytes[1] === 0xFF) {
    return 'UTF-16BE';
  }
  
  // Try UTF-8 decoding
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return 'UTF-8';
  } catch {
    return 'unknown';
  }
}

Common Issues and Solutions

1. Garbled Text

javascript

// Problem: UTF-8 file opened with wrong encoding
// Solution: Specify correct encoding
const decoder = new TextDecoder('utf-8');
const text = decoder.decode(buffer);

// Problem: Database garbled text
// Solution: Ensure consistent connection encoding
// SET NAMES utf8mb4;
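
Garbled text is easy to reproduce, which helps when diagnosing it. The sketch below misreads UTF-8 bytes as Latin-1 (each byte becomes the character with the same number) and then reverses the mistake. Both function names are hypothetical, and the repair assumes the garbled string contains only characters up to U+00FF; text that passed through Windows-1252 instead would need a different byte mapping.

```javascript
// Hypothetical helper: decode bytes as Latin-1, one character per byte.
function misreadAsLatin1(utf8Bytes) {
  return Array.from(utf8Bytes, byte => String.fromCharCode(byte)).join('');
}

// Hypothetical helper: undo the mistake by turning each character back
// into a byte and decoding the bytes as UTF-8.
function repairLatin1Mojibake(garbled) {
  const bytes = Uint8Array.from(garbled, ch => ch.charCodeAt(0));
  return new TextDecoder('utf-8').decode(bytes);
}

const garbled = misreadAsLatin1(new TextEncoder().encode('中'));
console.log(garbled.length);                 // 3 (one character per UTF-8 byte)
console.log(repairLatin1Mojibake(garbled));  // "中"
```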

2. Emoji Handling

javascript

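// Problem: .length counts UTF-16 code units, not visible characters
'👨‍👩‍👧‍👦'.length;  // 11 (4 emoji × 2 code units + 3 zero-width joiners)

// Spread or Array.from counts code points, which is still not what the user sees
[...'👨‍👩‍👧‍👦'].length;  // 7 (4 emoji + 3 zero-width joiners)

// Solution: count grapheme clusters with Intl.Segmenter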
function getCharacterCount(str) {
  const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
  return [...segmenter.segment(str)].length;
}

3. Surrogate Pair Issues

javascript

// Problem: Characters outside BMP
const emoji = '😀';
emoji.length;  // 2 (surrogate pair)
emoji.charCodeAt(0);  // 55357 (high surrogate)
emoji.charCodeAt(1);  // 56832 (low surrogate)

// Solution: Use codePointAt
emoji.codePointAt(0);  // 128512 (correct code point)
String.fromCodePoint(128512);  // '😀'
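
The same issue shows up when cutting strings: slicing by index can split a surrogate pair in half. A minimal sketch of a code-point-aware alternative, with `truncateByCodePoints` as a hypothetical helper name. It keeps surrogate pairs intact but can still split multi-code-point sequences such as ZWJ emoji.

```javascript
// Hypothetical helper: keep at most `max` code points.
// Array.from iterates by code point, so surrogate pairs stay together.
function truncateByCodePoints(str, max) {
  return Array.from(str).slice(0, max).join('');
}

const sample = 'a😀b';
console.log(sample.slice(0, 2));               // "a" plus a lone high surrogate
console.log(truncateByCodePoints(sample, 2));  // "a😀"
```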

Summary

Character encoding is fundamental to text processing. Key points:

ASCII: Basic encoding, English only
Unicode: Unified character set with code points for all characters
UTF-8: Variable-length encoding, ASCII-compatible, most widely used
HTML Entities: Safely display special characters in HTML
URL Encoding: Safely transmit special characters in URLs

For quick encoding conversions, try our online tools:

HTML Entity Encoder - HTML entity encoding/decoding
ASCII Unicode Converter - Character encoding conversion
URL Encoder - URL encoding/decoding

Related Resources

Base64 Encoder - Base64 encoding/decoding
Base Converter - Number base conversion
JSON Escaper - JSON string escaping
