TL;DR
Tokens are the basic units that large language models use to process text, and context windows determine the maximum number of tokens a model can handle at once. This guide covers the tokenization process, mainstream tokenization algorithms (BPE, WordPiece, SentencePiece), context window comparisons across models, long-context technologies (RoPE, ALiBi), and practical strategies for token counting and cost optimization.
Introduction
When using ChatGPT, Claude, or other large language models, have you ever wondered: Why do conversations sometimes get truncated? Why does the same amount of Chinese and English text consume different numbers of tokens? Why are API costs hard to estimate?
The answers to these questions relate to two core concepts: Token and Context Window. Understanding them not only helps you use AI tools more efficiently but can also significantly reduce API costs.
In this guide, you'll learn:
- The definition of tokens and the tokenization process
- How BPE, WordPiece, and SentencePiece tokenization algorithms work
- What context windows are and why they matter
- Context window size comparisons across major LLMs
- Long-context technologies: RoPE, ALiBi, Sliding Window
- How to count tokens and estimate costs
- Practical strategies for optimizing context usage
What is a Token
A token is the smallest unit that large language models use to process text. From the model's perspective, text is not a sequence of characters or words, but a sequence of tokens.
Difference Between Tokens, Characters, and Words
| Concept | Definition | Example |
|---|---|---|
| Character | Smallest text unit | H, e, l, l, o |
| Word | Text separated by spaces | Hello, world |
| Token | Basic unit for model processing | Hello, wor, ld |
Token granularity falls between characters and words. Common words are typically one token, while rare or long words are split into multiple tokens.
Why Use Tokens Instead of Words
- Controllable vocabulary size: English has hundreds of thousands of words, while token vocabularies typically have only 30-50K entries
- Handle unknown words: Any text can be decomposed into combinations of known tokens
- Cross-language support: The same tokenizer can process multiple languages
- Subword sharing: Related words share subword tokens, like "run", "running", "runner"
Tokenization Process Explained
Tokenization is the process of converting raw text into a sequence of tokens.
```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

text = "Hello, world! 你好,世界!"
tokens = enc.encode(text)

print(f"Original text: {text}")
print(f"Token count: {len(tokens)}")
print(f"Token IDs: {tokens}")
print(f"Decoded: {[enc.decode([t]) for t in tokens]}")
```
Example output (illustrative; exact IDs and splits depend on the encoding version):

```
Original text: Hello, world! 你好,世界!
Token count: 11
Token IDs: [9906, 11, 1917, 0, 220, 57668, 53901, 3922, 244, 98220, 6447]
Decoded: ['Hello', ',', ' world', '!', ' ', '你', '好', ',', '世', '界', '!']
```
Token Efficiency Across Languages
Token efficiency differs significantly between Chinese and English:
```python
import tiktoken

def compare_token_efficiency(texts):
    enc = tiktoken.encoding_for_model("gpt-4")
    for text in texts:
        tokens = enc.encode(text)
        chars = len(text)
        ratio = chars / len(tokens)
        print(f"Text: {text}")
        print(f"Characters: {chars}, Tokens: {len(tokens)}, Efficiency: {ratio:.2f} chars/token\n")

compare_token_efficiency([
    "The quick brown fox jumps over the lazy dog.",
    "敏捷的棕色狐狸跳过了懒惰的狗。",
    "Transformer architecture revolutionized NLP.",
    "Transformer架构彻底改变了自然语言处理领域。"
])
```
As a rule of thumb with GPT-style encodings, English averages about 4 characters per token, while Chinese typically costs 1-1.5 tokens per character (rarer characters may split into several byte-level tokens).
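When a tokenizer isn't at hand, these ratios give a quick back-of-the-envelope estimator. The sketch below is illustrative only (the thresholds are assumptions, not any library's API); use the real tokenizer for anything billing-related:

```python
def rough_token_estimate(text):
    """Very rough token estimate without a tokenizer.

    Assumes ~4 characters per token for non-CJK text and ~1 token per
    CJK character (rare characters can cost more). A heuristic only.
    """
    cjk = sum(1 for ch in text if '\u4e00' <= ch <= '\u9fff')
    other = len(text) - cjk
    return cjk + (other + 3) // 4  # ceil(other / 4)

print(rough_token_estimate("The quick brown fox jumps over the lazy dog."))  # 11
```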
Mainstream Tokenization Algorithms
BPE (Byte Pair Encoding)
BPE is the tokenization algorithm used by GPT models, building vocabulary by iteratively merging the most frequent character pairs.
```python
def simple_bpe_demo(num_merges=4):
    """Simplified BPE demo: repeatedly merge the most frequent adjacent pair"""
    corpus = ["low", "lower", "newest", "widest"]
    # Start from character-level tokens plus an end-of-word marker
    word_freqs = {}
    for word in corpus:
        chars = tuple(word) + ("</w>",)
        word_freqs[chars] = word_freqs.get(chars, 0) + 1
    print(f"Initial tokenization: {list(word_freqs.keys())}")
    for step in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pair_counts = {}
        for word, freq in word_freqs.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] = pair_counts.get(pair, 0) + freq
        best = max(pair_counts, key=pair_counts.get)
        # Replace every occurrence of the best pair with its merged symbol
        new_freqs = {}
        for word, freq in word_freqs.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1]); i += 2
                else:
                    merged.append(word[i]); i += 1
            new_freqs[tuple(merged)] = freq
        word_freqs = new_freqs
        print(f"Merge {step + 1}: {best} -> '{''.join(best)}'")

simple_bpe_demo()
```
BPE advantages:
- Balances pros and cons of character-level and word-level tokenization
- Effectively handles unknown words
- Controllable vocabulary size
WordPiece
WordPiece is the tokenization algorithm used by BERT, similar to BPE but with a different merging strategy.
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "unbelievable"
tokens = tokenizer.tokenize(text)
print(f"WordPiece tokenization: {tokens}")
```
WordPiece uses the ## prefix to mark subword tokens that are not word-initial.
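At inference time, WordPiece splits a word by greedy longest-match-first against the vocabulary, falling back to [UNK] when no match covers the remainder. A minimal sketch with a toy vocabulary (the vocabulary entries here are hypothetical, chosen only to illustrate the splitting):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first splitting, as used at WordPiece inference time.
    Continuation pieces carry the '##' prefix."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # mark non-word-initial pieces
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no vocabulary entry covers this span
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary, for illustration only
vocab = {"un", "##believ", "##able", "believ", "##e"}
print(wordpiece_tokenize("unbelievable", vocab))  # ['un', '##believ', '##able']
```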
SentencePiece
SentencePiece is a language-agnostic tokenization tool that treats text as a sequence of Unicode characters.
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('model.model')  # path to a trained SentencePiece model file

text = "This is a test."
tokens = sp.encode_as_pieces(text)
print(f"SentencePiece tokenization: {tokens}")
```
SentencePiece characteristics:
- No dependency on pre-tokenization (like space splitting)
- Supports both BPE and Unigram algorithms
- Widely used in multilingual models
What is a Context Window
A context window is the maximum number of tokens a large language model can process at once. It determines how much information the model can "see."
Components of a Context Window
The context window contains all input and output tokens:
- System Prompt: Instructions defining model behavior
- Conversation History: Previous dialogue turns
- User Input: Current question or request
- Model Output: The model's generated response
Why Context Windows Matter
| Scenario | Small Context Window | Large Context Window |
|---|---|---|
| Long Document Analysis | Requires chunking | Can process entire document at once |
| Multi-turn Dialogue | Easily loses early context | Maintains complete conversation memory |
| Code Understanding | Can only see partial code | Understands complete codebase |
| RAG Applications | Limited retrieval results | Can include more relevant documents |
Context Window Comparison Across Models
| Model | Context Window | Release Date | Notes |
|---|---|---|---|
| GPT-3.5 Turbo | 4K / 16K | 2023 | 16K version costs more |
| GPT-4 | 8K / 32K / 128K | 2023-2024 | 128K applies to the Turbo version |
| GPT-4o | 128K | 2024 | Multimodal support |
| Claude 3 Opus | 200K | 2024 | ~150K words |
| Claude 3.5 Sonnet | 200K | 2024 | Performance/cost balance |
| Gemini 1.5 Pro | 1M / 2M | 2024 | Largest mainstream window as of 2024 |
| LLaMA 3 | 8K (3) / 128K (3.1) | 2024 | Open-source model |
| Qwen 2.5 | 128K | 2024 | Chinese-optimized |
Context Window vs Actual Available Length
Note that the stated context window size doesn't equal actual available length:
```python
def calculate_available_context(total_window, system_prompt_tokens,
                                history_tokens, max_output_tokens):
    """Calculate the actually available input tokens"""
    available = total_window - system_prompt_tokens - history_tokens - max_output_tokens
    return max(0, available)

total = 128000
system = 500
history = 10000
max_output = 4096

available = calculate_available_context(total, system, history, max_output)
print(f"Total window: {total}")
print(f"System prompt: {system}")
print(f"History: {history}")
print(f"Reserved output: {max_output}")
print(f"Available input: {available}")
```
Long Context Technologies
As demand for longer contexts grows, researchers have developed various techniques to extend model context processing capabilities.
RoPE (Rotary Position Embedding)
RoPE encodes position information through rotation matrices, supporting position extrapolation.
```python
import numpy as np

def rope_embedding(x, position, d_model):
    """
    RoPE positional encoding
    x: input vector of length d_model
    position: position index
    d_model: model dimension (must be even)
    """
    freqs = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
    angles = position * freqs
    cos_vals = np.cos(angles)
    sin_vals = np.sin(angles)
    # Rotate consecutive (even, odd) coordinate pairs
    x_even = x[0::2]
    x_odd = x[1::2]
    rotated_even = x_even * cos_vals - x_odd * sin_vals
    rotated_odd = x_even * sin_vals + x_odd * cos_vals
    result = np.zeros_like(x)
    result[0::2] = rotated_even
    result[1::2] = rotated_odd
    return result
```
RoPE advantages:
- Natural encoding of relative position information
- Supports length extrapolation
- Computationally efficient
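The relative-position claim can be checked numerically: the inner product of two RoPE-rotated vectors depends only on the distance between their positions, not on the absolute positions themselves. A small self-contained check, using the same formulation as the function above:

```python
import numpy as np

def rope(x, position, d_model):
    # Rotate consecutive (even, odd) coordinate pairs by position-dependent angles
    freqs = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
    angles = position * freqs
    out = np.empty_like(x)
    out[0::2] = x[0::2] * np.cos(angles) - x[1::2] * np.sin(angles)
    out[1::2] = x[0::2] * np.sin(angles) + x[1::2] * np.cos(angles)
    return out

rng = np.random.default_rng(0)
d = 8
q, k = rng.normal(size=d), rng.normal(size=d)

# Both score pairs use a position difference of 3, so the scores match
s1 = rope(q, 5, d) @ rope(k, 2, d)
s2 = rope(q, 105, d) @ rope(k, 102, d)
print(np.isclose(s1, s2))  # True
```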
ALiBi (Attention with Linear Biases)
ALiBi encodes position information by adding linear biases to attention scores.
```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    """Calculate the ALiBi bias matrix, shape (num_heads, seq_len, seq_len)"""
    slopes = np.array([2 ** (-8 * i / num_heads) for i in range(1, num_heads + 1)])
    positions = np.arange(seq_len)
    bias = -np.abs(positions[:, None] - positions[None, :])
    alibi = slopes[:, None, None] * bias[None, :, :]
    return alibi
```
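To see how the bias enters attention, add it to the raw scores before the softmax; more distant positions then receive a head-specific penalty. A self-contained numpy sketch (random vectors stand in for real query/key projections):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def alibi_bias(seq_len, num_heads):
    # Head-specific slopes times negative absolute distance
    slopes = np.array([2 ** (-8 * i / num_heads) for i in range(1, num_heads + 1)])
    bias = -np.abs(np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :])
    return slopes[:, None, None] * bias[None, :, :]

rng = np.random.default_rng(0)
heads, seq, d = 2, 5, 4
q = rng.normal(size=(heads, seq, d))
k = rng.normal(size=(heads, seq, d))

scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
weights = softmax(scores + alibi_bias(seq, heads))  # distant positions are penalized
print(weights.shape)  # (2, 5, 5)
```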
Sliding Window Attention
Sliding window attention limits each token to only attend to tokens within a fixed range, reducing computational complexity.
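With a window of size w, each query position attends only to itself and the previous w-1 tokens, so attention cost grows linearly in sequence length instead of quadratically. A minimal mask sketch:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Causal sliding-window mask: position i attends to positions [i - window + 1, i]."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(6, 3)
print(mask.astype(int))
```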
Token Counting and Cost Estimation
Accurate token counting is crucial for cost control.
Counting Tokens with tiktoken
```python
import tiktoken

def count_tokens(text, model="gpt-4"):
    """Count tokens in text"""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

def estimate_cost(input_tokens, output_tokens, model="gpt-4"):
    """Estimate API call cost in USD (per-1K-token 2024 list prices; check current pricing)"""
    pricing = {
        "gpt-4": {"input": 0.03, "output": 0.06},
        "gpt-4-turbo": {"input": 0.01, "output": 0.03},
        "gpt-4o": {"input": 0.005, "output": 0.015},
        "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
        "claude-3-opus": {"input": 0.015, "output": 0.075},
        "claude-3-sonnet": {"input": 0.003, "output": 0.015},
    }
    if model not in pricing:
        return None
    input_cost = (input_tokens / 1000) * pricing[model]["input"]
    output_cost = (output_tokens / 1000) * pricing[model]["output"]
    return input_cost + output_cost

text = """
Large Language Models (LLMs) are deep learning-based natural language processing models
that learn statistical patterns and semantic knowledge from large-scale text data through pre-training.
"""

tokens = count_tokens(text)
cost = estimate_cost(tokens, 500, "gpt-4o")
print(f"Input tokens: {tokens}")
print(f"Estimated output tokens: 500")
print(f"Estimated cost: ${cost:.4f}")
```
Token Counter Utility Class
```python
import tiktoken

class TokenCounter:
    """Token counting and cost estimation utility"""

    def __init__(self, model="gpt-4"):
        self.model = model
        try:
            self.encoding = tiktoken.encoding_for_model(model)
        except KeyError:
            self.encoding = tiktoken.get_encoding("cl100k_base")

    def count(self, text):
        """Count tokens"""
        return len(self.encoding.encode(text))

    def count_messages(self, messages):
        """Approximate token count for chat messages (OpenAI-style message overhead)"""
        total = 0
        for message in messages:
            total += 4  # per-message formatting overhead
            for key, value in message.items():
                total += self.count(value)
                if key == "name":
                    total -= 1
        total += 2  # reply priming overhead
        return total

    def truncate_to_limit(self, text, max_tokens):
        """Truncate text to the specified token count"""
        tokens = self.encoding.encode(text)
        if len(tokens) <= max_tokens:
            return text
        return self.encoding.decode(tokens[:max_tokens])

    def split_by_tokens(self, text, chunk_size, overlap=0):
        """Split text into chunks of chunk_size tokens with optional overlap"""
        tokens = self.encoding.encode(text)
        chunks = []
        start = 0
        while start < len(tokens):
            end = min(start + chunk_size, len(tokens))
            chunks.append(self.encoding.decode(tokens[start:end]))
            if end == len(tokens):  # stop here, or overlap would re-read the tail forever
                break
            start = end - overlap
        return chunks

counter = TokenCounter("gpt-4")
long_text = "This is a very long text..." * 100
chunks = counter.split_by_tokens(long_text, chunk_size=100, overlap=20)
print(f"Split into {len(chunks)} chunks")
```
Strategies for Optimizing Context Usage
1. Streamline System Prompts
```python
verbose_prompt = """
You are a very helpful AI assistant. Your task is to help users answer various questions.
You should always maintain a friendly and professional attitude. When users ask questions,
you need to carefully analyze the problem and then provide detailed, accurate answers.
If you're not sure about an answer, please honestly tell the user.
"""

concise_prompt = """
You are a professional AI assistant. Provide accurate, concise answers. Be honest when uncertain.
"""

counter = TokenCounter()
print(f"Verbose prompt: {counter.count(verbose_prompt)} tokens")
print(f"Concise prompt: {counter.count(concise_prompt)} tokens")
```
2. Conversation History Management
```python
def manage_conversation_history(messages, max_tokens, counter):
    """
    Manage conversation history within token limits
    Strategy: keep system messages and the most recent conversations
    """
    system_messages = [m for m in messages if m.get("role") == "system"]
    other_messages = [m for m in messages if m.get("role") != "system"]
    system_tokens = counter.count_messages(system_messages)
    available_tokens = max_tokens - system_tokens
    kept_messages = []
    current_tokens = 0
    # Walk backwards from the newest message, keeping as many as fit
    for message in reversed(other_messages):
        msg_tokens = counter.count_messages([message])
        if current_tokens + msg_tokens <= available_tokens:
            kept_messages.insert(0, message)
            current_tokens += msg_tokens
        else:
            break
    return system_messages + kept_messages
```
3. Document Chunking Strategies
```python
import re

def smart_chunk_document(text, max_chunk_tokens=1000, overlap_tokens=100):
    """
    Smart document chunking: split at sentence boundaries.
    Overlap is approximated by carrying trailing sentences (up to
    overlap_tokens) into the next chunk.
    """
    counter = TokenCounter()
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks = []
    current_chunk = []
    current_tokens = 0
    for sentence in sentences:
        sentence_tokens = counter.count(sentence)
        if current_tokens + sentence_tokens > max_chunk_tokens and current_chunk:
            chunks.append(' '.join(current_chunk))
            # Carry trailing sentences into the next chunk as overlap
            overlap, overlap_count = [], 0
            for s in reversed(current_chunk):
                overlap_count += counter.count(s)
                if overlap_count > overlap_tokens:
                    break
                overlap.insert(0, s)
            current_chunk = overlap + [sentence]
            current_tokens = counter.count(' '.join(current_chunk))
        else:
            current_chunk.append(sentence)
            current_tokens += sentence_tokens
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks
```
4. Compress History with Summaries
```python
def compress_history_with_summary(messages, summarizer_fn, threshold_tokens=2000):
    """
    When history is too long, compress the earlier half of the conversation into a summary
    """
    counter = TokenCounter()
    total_tokens = counter.count_messages(messages)
    if total_tokens <= threshold_tokens:
        return messages
    system_msgs = [m for m in messages if m["role"] == "system"]
    other_msgs = [m for m in messages if m["role"] != "system"]
    split_point = len(other_msgs) // 2
    old_msgs = other_msgs[:split_point]
    recent_msgs = other_msgs[split_point:]
    old_text = "\n".join([f"{m['role']}: {m['content']}" for m in old_msgs])
    summary = summarizer_fn(old_text)
    summary_msg = {"role": "system", "content": f"Previous conversation summary: {summary}"}
    return system_msgs + [summary_msg] + recent_msgs
```
Tool Recommendations
The following tools can improve efficiency when working with tokens and context-related tasks:
- JSON Formatter - Format API responses and model configurations
- Text Diff Tool - Compare differences in tokenization results
- Base64 Encoder/Decoder - Handle encoding conversions for embedding vectors
- Text Analyzer - Quickly count text characters and words
- Regex Tester - Test regex patterns for text splitting
Summary
Understanding tokens and context windows is fundamental to efficiently using large language models:
- Tokens are the basic units of LLMs: Different from characters or words, tokens are determined by tokenization algorithms
- Tokenization algorithms have distinct characteristics: BPE, WordPiece, SentencePiece are suited for different scenarios
- Context windows limit total tokens: Including input, output, and conversation history
- Long-context technologies continue to evolve: RoPE, ALiBi, and other technologies extend model capabilities
- Cost optimization requires accurate counting: Use tools like tiktoken for precise estimation
- Multiple strategies can optimize usage: Streamline prompts, manage history, smart chunking
Mastering this knowledge enables you to better control API costs and design more efficient AI applications.
FAQ
Why does Chinese consume more tokens than English?
This relates to the training data of tokenization algorithms. Models like GPT are primarily trained on English corpora, where common English words are typically single tokens, while Chinese characters often require multiple tokens. For example, "人工智能" (artificial intelligence) might need 3-4 tokens, while "AI" only needs 1 token. Using Chinese-optimized models (like Qwen) can improve Chinese token efficiency.
Is a larger context window always better?
Not necessarily. A larger context window means: 1) Higher API costs (charged per token); 2) Longer response times; 3) Potential attention dilution (the model may struggle to focus on key information). Choose an appropriately sized context window based on actual needs, and use techniques like Retrieval-Augmented Generation (RAG) to optimize information utilization.
How can I estimate the token count of a text?
The most accurate method is to use the corresponding model's tokenizer. For OpenAI models, use the tiktoken library; for other models, use their respective tokenizers. For rough estimates, English averages about 4 characters per token, while Chinese typically costs around 1-1.5 tokens per character. Online tools like OpenAI's Tokenizer page can also quickly calculate tokens.
Does conversation history consume the context window?
Yes, each turn's input and output accumulates in the context. This is why long conversations may cause early content to be truncated. Consider implementing conversation history management strategies: keep recent conversations, use summaries to compress history, or reset context at appropriate times.
Are tokens interchangeable between different models?
No. Different models use different tokenizers, and the same text may have different token counts across models. For example, GPT-4 uses cl100k_base encoding, while BERT uses WordPiece. When switching models, you need to recalculate token counts and costs.