What is a Token?
A token is the fundamental unit of text that Large Language Models (LLMs) process: a piece of text that may be a whole word, a subword, a single character, or a punctuation mark. Tokenization is the process of breaking text down into these discrete units, converting human-readable text into the numerical representations that neural networks can understand and process.
Quick Facts
| Full Name | Token (LLM) |
|---|---|
| Created | 2010s (modern subword tokenization with BPE) |
How It Works
In the context of LLMs, tokens serve as the atomic building blocks for text processing. Modern tokenization algorithms like Byte Pair Encoding (BPE), WordPiece, and SentencePiece build a vocabulary of subword units that balances vocabulary size against representation efficiency. BPE iteratively merges the most frequent adjacent symbol pairs, while WordPiece chooses merges based on training-data likelihood. This subword approach allows models to handle rare words, morphological variations, and multiple languages effectively.

The number of tokens directly impacts model context windows (e.g., GPT-4 Turbo's 128K tokens), API pricing, and computational requirements. Different models use different tokenizers, so the same text can yield different token counts. Context window sizes have expanded dramatically, from GPT-3's 2K tokens to GPT-4 Turbo's 128K tokens and Claude 3's 200K tokens, with some models supporting 1M+ tokens. This expansion enables processing of entire codebases, long documents, and extended conversations without truncation.
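The BPE merge loop described above can be sketched in a few lines of plain Python. This is a toy illustration of the core idea (count adjacent symbol pairs, merge the most frequent one, repeat), not a production tokenizer:

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Toy BPE: start from characters and repeatedly merge the most
    frequent adjacent symbol pair. Returns the learned merge rules."""
    # Represent each word as a tuple of symbols, weighted by frequency.
    vocab = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the weighted vocabulary.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word, replacing occurrences of the best pair.
        new_vocab = Counter()
        for word, freq in vocab.items():
            symbols, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    symbols.append(merged)
                    i += 2
                else:
                    symbols.append(word[i])
                    i += 1
            new_vocab[tuple(symbols)] += freq
        vocab = new_vocab
    return merges
```

On a corpus like `"low low low lower lowest"`, the first merges learned are `("l", "o")` and then `("lo", "w")`, showing how frequent fragments grow into whole-word tokens.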
Key Characteristics
- Subword units that balance vocabulary size and coverage efficiency
- Determines context window limits (e.g., 4K, 8K, 128K tokens)
- Primary billing unit for commercial LLM APIs
- Language-agnostic representation enabling multilingual support
- Variable length mapping where one word may equal 1-4 tokens
- Affects model latency and memory consumption during inference
Common Use Cases
- Estimating API costs for LLM-powered applications
- Optimizing prompts to fit within context window limits
- Building custom tokenizers for domain-specific vocabularies
- Analyzing token efficiency across different languages
- Implementing text chunking strategies for RAG systems
Example
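A common token-centric task from the use cases above is splitting long documents into chunks that fit a token budget, as in RAG pipelines. A minimal sketch using the rough English heuristic of ~0.75 words per token; a production system should count with the target model's actual tokenizer:

```python
def chunk_text(text, max_tokens=200, overlap_tokens=20):
    """Split text into overlapping chunks sized by an approximate
    token budget (heuristic: ~0.75 words per token in English)."""
    words = text.split()
    words_per_chunk = max(1, int(max_tokens * 0.75))
    # Slide the window forward, keeping some overlap between chunks
    # so sentences straddling a boundary appear in both neighbors.
    step = max(1, words_per_chunk - int(overlap_tokens * 0.75))
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + words_per_chunk]))
        if start + words_per_chunk >= len(words):
            break
    return chunks
```

The overlap parameter trades a little extra token usage for retrieval robustness: facts near a chunk boundary stay retrievable from at least one chunk.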
Frequently Asked Questions
How do I count tokens before making an API call?
Use tokenizer libraries specific to your model. For OpenAI models, use the 'tiktoken' Python library. For Hugging Face models, use their tokenizers library. Many API providers also offer online tokenizer tools. Token counts vary between models, so always use the correct tokenizer for your target model.
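A small sketch of this pattern: count with tiktoken when it is installed, and fall back to the rough 4-characters-per-token estimate otherwise. The fallback heuristic is an approximation for English text only:

```python
def count_tokens(text, model="gpt-4"):
    """Count tokens for an OpenAI model with tiktoken if available;
    otherwise fall back to a rough ~4-characters-per-token estimate."""
    try:
        import tiktoken  # pip install tiktoken
        enc = tiktoken.encoding_for_model(model)
        return len(enc.encode(text))
    except ImportError:
        # Heuristic only -- real counts vary by tokenizer and language.
        return max(1, len(text) // 4)
```

Counting before the call lets you reject or truncate oversized prompts client-side instead of paying for a failed request.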
Why do different languages have different token counts for similar content?
Tokenizers are typically trained on English-dominant datasets, so English text tokenizes more efficiently. Languages like Chinese, Japanese, Korean, or Arabic often require more tokens per character or word. This affects both cost and context window usage for non-English applications.
What is a context window and why does it matter?
The context window is the maximum number of tokens a model can process in a single request, including both input and output. Larger context windows (like GPT-4 Turbo's 128K or Claude 3's 200K tokens) allow processing longer documents but may increase latency and cost. Manage context carefully for optimal results.
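Because the window covers input and output together, a request only fits if the prompt plus the reserved output budget stays under the limit. A minimal sketch of that check (the window sizes in the usage note are examples, not guarantees for any specific model):

```python
def fits_in_context(prompt_tokens, max_output_tokens, context_window):
    """A request fits only if prompt AND reserved output fit together."""
    return prompt_tokens + max_output_tokens <= context_window

def input_budget(context_window, reserved_output_tokens):
    """Tokens left for the prompt after reserving room for the response."""
    return max(0, context_window - reserved_output_tokens)
```

For example, with a 128K window and 4,096 tokens reserved for output, a 127K-token prompt does not fit, even though it is under the window size on its own.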
How are tokens related to LLM pricing?
Most LLM APIs charge per token processed, typically with separate rates for input and output tokens. Output tokens usually cost more than input tokens. Understanding tokenization helps estimate costs: approximately 1 token equals 4 characters or 0.75 words in English. Optimize prompts to reduce unnecessary token usage.
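The heuristic above (roughly 1 token per 4 English characters) gives a quick back-of-the-envelope cost estimate. The per-1K-token prices in this sketch are hypothetical placeholders; check your provider's current pricing:

```python
def estimate_cost_usd(prompt, expected_output_tokens,
                      input_price_per_1k, output_price_per_1k):
    """Rough cost estimate using the ~4-characters-per-token heuristic.
    Prices are hypothetical per-1K-token rates, not real provider rates."""
    input_tokens = max(1, len(prompt) // 4)
    return (input_tokens / 1000) * input_price_per_1k \
         + (expected_output_tokens / 1000) * output_price_per_1k
```

For instance, a 4,000-character prompt (~1,000 tokens) with 500 expected output tokens at $0.01/$0.03 per 1K tokens comes to about $0.025; for billing-critical decisions, count with the model's real tokenizer instead.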
What is Byte Pair Encoding (BPE) in tokenization?
BPE is the most common tokenization algorithm for LLMs. It starts with individual characters and iteratively merges the most frequent adjacent pairs into new tokens. This creates a vocabulary of subword units that efficiently represents common words while handling rare words through character combinations.