TL;DR

Tokens are the basic units that large language models use to process text, and context windows determine the maximum number of tokens a model can handle at once. This guide covers the tokenization process, mainstream tokenization algorithms (BPE, WordPiece, SentencePiece), context window comparisons across models, long-context technologies (RoPE, ALiBi), and practical strategies for token counting and cost optimization.

Introduction

When using ChatGPT, Claude, or other large language models, have you ever wondered: Why do conversations sometimes get truncated? Why does text of the same apparent length count differently in Chinese and English? Why are API costs hard to estimate?

The answers to these questions come down to two core concepts: the token and the context window. Understanding them not only helps you use AI tools more efficiently but can also significantly reduce API costs.

In this guide, you'll learn:

  • The definition of tokens and the tokenization process
  • How BPE, WordPiece, and SentencePiece tokenization algorithms work
  • What context windows are and why they matter
  • Context window size comparisons across major LLMs
  • Long-context technologies: RoPE, ALiBi, Sliding Window
  • How to count tokens and estimate costs
  • Practical strategies for optimizing context usage

What is a Token

A token is the smallest unit that large language models use to process text. From the model's perspective, text is not a sequence of characters or words, but a sequence of tokens.

mermaid
graph LR
    subgraph "Tokenization Process"
        A["Raw Text"] --> B["Tokenizer"]
        B --> C["Token Sequence"]
        C --> D["Token IDs"]
        D --> E["Embedding Vectors"]
    end
    subgraph "Example"
        T1["Hello world"] --> T2["Hello, world"]
        T2 --> T3["[15496, 995]"]
    end

Difference Between Tokens, Characters, and Words

| Concept | Definition | Example |
|---|---|---|
| Character | Smallest text unit | H, e, l, l, o |
| Word | Text separated by spaces | Hello, world |
| Token | Basic unit for model processing | Hello, wor, ld |

Token granularity falls between characters and words. Common words are typically one token, while rare or long words are split into multiple tokens.

Why Use Tokens Instead of Words

  1. Controllable vocabulary size: English has hundreds of thousands of words, while token vocabularies typically have only 30K-100K entries
  2. Handle unknown words: Any text can be decomposed into combinations of known tokens
  3. Cross-language support: The same tokenizer can process multiple languages
  4. Subword sharing: Related words share subword tokens, like "run", "running", "runner"

Tokenization Process Explained

Tokenization is the process of converting raw text into a sequence of tokens.

python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

text = "Hello, world! 你好,世界!"
tokens = enc.encode(text)

print(f"Original text: {text}")
print(f"Token count: {len(tokens)}")
print(f"Token IDs: {tokens}")
print(f"Decoded: {[enc.decode([t]) for t in tokens]}")

Example output (exact IDs and splits vary with the tokenizer version; some Chinese characters are split into byte-level tokens that don't decode to valid characters on their own):

code
Original text: Hello, world! 你好,世界!
Token count: 11
Token IDs: [9906, 11, 1917, 0, 220, 57668, 53901, 3922, 244, 98220, 6447]
Decoded: ['Hello', ',', ' world', '!', ' ', '你好', ',', '世', '界', '!']

Token Efficiency Across Languages

Token efficiency differs significantly between Chinese and English:

python
def compare_token_efficiency(texts):
    enc = tiktoken.encoding_for_model("gpt-4")
    for text in texts:
        tokens = enc.encode(text)
        chars = len(text)
        ratio = chars / len(tokens)
        print(f"Text: {text}")
        print(f"Characters: {chars}, Tokens: {len(tokens)}, Efficiency: {ratio:.2f} chars/token\n")

compare_token_efficiency([
    "The quick brown fox jumps over the lazy dog.",
    "敏捷的棕色狐狸跳过了懒惰的狗。",
    "Transformer architecture revolutionized NLP.",
    "Transformer架构彻底改变了自然语言处理领域。"
])

Generally, English averages about 4 characters per token, while Chinese averages about 1.5-2 characters per token.
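These averages suggest a quick rule of thumb. The sketch below is a rough estimator based only on the ratios stated above (illustrative only; for billing-accurate counts, always use the real tokenizer):

```python
def rough_token_estimate(text):
    # Heuristic from the averages above: ~4 chars/token for ASCII text,
    # ~1.5 chars/token for CJK and other non-ASCII text. Illustrative only.
    ascii_chars = sum(1 for c in text if ord(c) < 128)
    other_chars = len(text) - ascii_chars
    return round(ascii_chars / 4 + other_chars / 1.5)

print(rough_token_estimate("The quick brown fox jumps over the lazy dog."))  # 11
print(rough_token_estimate("敏捷的棕色狐狸跳过了懒惰的狗。"))  # 10
```

This is good enough for a quick budget check, but real tokenizers split on subwords, not characters, so the true count can differ noticeably.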

Mainstream Tokenization Algorithms

BPE (Byte Pair Encoding)

BPE is the tokenization algorithm used by GPT models, building vocabulary by iteratively merging the most frequent character pairs.

mermaid
graph TB
    subgraph "BPE Training Process"
        S1["Initial: All Characters"] --> S2["Count Character Pair Frequencies"]
        S2 --> S3["Merge Most Frequent Pair"]
        S3 --> S4["Update Vocabulary"]
        S4 --> S5{"Target Vocabulary Size Reached?"}
        S5 -->|No| S2
        S5 -->|Yes| S6["Complete"]
    end

python
import collections

def simple_bpe_demo(num_merges=4):
    """Simplified BPE demonstration: learn merges from a tiny corpus"""
    corpus = ["low", "lower", "newest", "widest"]

    # Initial vocabulary: individual characters plus an end-of-word marker
    vocab = set("".join(corpus)) | {"</w>"}
    print(f"Initial vocabulary: {sorted(vocab)}")

    # Each word starts as a sequence of single-character symbols
    word_freqs = collections.Counter(tuple(word) + ("</w>",) for word in corpus)
    print(f"Initial tokenization: {list(word_freqs)}")

    for step in range(num_merges):
        # Count adjacent symbol-pair frequencies across the corpus
        pairs = collections.Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        a, b = max(pairs, key=pairs.get)

        # Merge the most frequent pair everywhere it appears
        merged = collections.Counter()
        for symbols, freq in word_freqs.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        word_freqs = merged
        vocab.add(a + b)
        print(f"Merge {step + 1}: ('{a}', '{b}') -> '{a + b}'")

simple_bpe_demo()

BPE advantages:

  • Balances pros and cons of character-level and word-level tokenization
  • Effectively handles unknown words
  • Controllable vocabulary size
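Once merges are learned, encoding a new word simply replays them in order. A toy sketch (the merge list is hand-picked for illustration, not GPT's actual merge table) shows how an unseen word decomposes into known subwords, which is exactly how BPE handles unknown words:

```python
def bpe_encode(word, merges):
    # Replay learned merges in order (toy merge list, not GPT's real table)
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merges, as a BPE trainer might learn from the corpus above
merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]
print(bpe_encode("lowest", merges))  # ['low', 'est', '</w>']
print(bpe_encode("lows", merges))    # ['low', 's', '</w>'] -- unseen word still encodable
```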

WordPiece

WordPiece is the tokenization algorithm used by BERT, similar to BPE but with a different merging strategy.

python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = "unbelievable"
tokens = tokenizer.tokenize(text)
print(f"WordPiece tokenization: {tokens}")

WordPiece uses the ## prefix to mark subword tokens that are not word-initial.
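The lookup itself is greedy longest-match-first: at each position, take the longest vocabulary entry that matches, prefixing non-initial pieces with ##. A minimal sketch with a toy vocabulary (illustrative only; BERT's real vocabulary has ~30K entries):

```python
def wordpiece_tokenize(word, vocab):
    # Greedy longest-match-first lookup; non-initial pieces get a "##" prefix
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece matched: fall back to the unknown token
        tokens.append(match)
        start = end
    return tokens

# Toy vocabulary for illustration (not BERT's actual vocabulary)
vocab = {"un", "believ", "##believ", "##able"}
print(wordpiece_tokenize("unbelievable", vocab))  # ['un', '##believ', '##able']
```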

SentencePiece

SentencePiece is a language-agnostic tokenization tool that treats text as a sequence of Unicode characters.

python
import sentencepiece as spm

# Requires a trained SentencePiece model file: train one with
# spm.SentencePieceTrainer or use one shipped with a model release
sp = spm.SentencePieceProcessor()
sp.load('model.model')

text = "This is a test."
tokens = sp.encode_as_pieces(text)
print(f"SentencePiece tokenization: {tokens}")

SentencePiece characteristics:

  • No dependency on pre-tokenization (like space splitting)
  • Supports both BPE and Unigram algorithms
  • Widely used in multilingual models

What is a Context Window

A context window is the maximum number of tokens a large language model can process at once. It determines how much information the model can "see."

mermaid
graph LR
    subgraph "Context Window Diagram"
        direction TB
        W["Context Window (e.g., 128K tokens)"]
        W --> P["System Prompt"]
        W --> H["Conversation History"]
        W --> I["User Input"]
        W --> O["Model Output"]
    end

Components of a Context Window

The context window contains all input and output tokens:

  1. System Prompt: Instructions defining model behavior
  2. Conversation History: Previous dialogue turns
  3. User Input: Current question or request
  4. Model Output: The model's generated response

Why Context Windows Matter

| Scenario | Small Context Window | Large Context Window |
|---|---|---|
| Long Document Analysis | Requires chunking | Can process entire document at once |
| Multi-turn Dialogue | Easily loses early context | Maintains complete conversation memory |
| Code Understanding | Can only see partial code | Understands complete codebase |
| RAG Applications | Limited retrieval results | Can include more relevant documents |

Context Window Comparison Across Models

mermaid
graph TB
    subgraph "2024-2026 Context Window Evolution"
        G3["GPT-3.5 4K/16K"] --> G4["GPT-4 8K/32K/128K"]
        G4 --> G4T["GPT-4 Turbo 128K"]
        C2["Claude 2 100K"] --> C3["Claude 3 200K"]
        G1["Gemini 1.0 32K"] --> G15["Gemini 1.5 1M/2M"]
    end

| Model | Context Window | Release Date | Notes |
|---|---|---|---|
| GPT-3.5 Turbo | 4K / 16K | 2023 | 16K version costs more |
| GPT-4 | 8K / 32K / 128K | 2023-2024 | 128K is the Turbo version |
| GPT-4o | 128K | 2024 | Multimodal support |
| Claude 3 Opus | 200K | 2024 | ~150K words |
| Claude 3.5 Sonnet | 200K | 2024 | Performance/cost balance |
| Gemini 1.5 Pro | 1M / 2M | 2024 | Currently largest window |
| LLaMA 3 | 8K / 128K | 2024 | Open-source model |
| Qwen 2.5 | 128K | 2024 | Chinese-optimized |

Context Window vs Actual Available Length

Note that the stated context window size doesn't equal actual available length:

python
def calculate_available_context(total_window, system_prompt_tokens, 
                                 history_tokens, max_output_tokens):
    """Calculate actually available input tokens"""
    available = total_window - system_prompt_tokens - history_tokens - max_output_tokens
    return max(0, available)

total = 128000
system = 500
history = 10000
max_output = 4096

available = calculate_available_context(total, system, history, max_output)
print(f"Total window: {total}")
print(f"System prompt: {system}")
print(f"History: {history}")
print(f"Reserved output: {max_output}")
print(f"Available input: {available}")

Long Context Technologies

As demand for longer contexts grows, researchers have developed various techniques to extend model context processing capabilities.

RoPE (Rotary Position Embedding)

RoPE encodes position information through rotation matrices, supporting position extrapolation.

python
import numpy as np

def rope_embedding(x, position, d_model):
    """
    RoPE positional encoding
    x: Input vector
    position: Position index
    d_model: Model dimension
    """
    freqs = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
    
    angles = position * freqs
    
    cos_vals = np.cos(angles)
    sin_vals = np.sin(angles)
    
    x_even = x[0::2]
    x_odd = x[1::2]
    
    rotated_even = x_even * cos_vals - x_odd * sin_vals
    rotated_odd = x_even * sin_vals + x_odd * cos_vals
    
    result = np.zeros_like(x)
    result[0::2] = rotated_even
    result[1::2] = rotated_odd
    
    return result
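A quick numerical check confirms RoPE's key property: the dot product between a rotated query and key depends only on their relative offset, not their absolute positions. The rotation is restated here in compact form so the check runs standalone:

```python
import numpy as np

def rope(x, position, d_model):
    # Compact restatement of the interleaved rotation above, for a standalone check
    freqs = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
    angles = position * freqs
    cos_vals, sin_vals = np.cos(angles), np.sin(angles)
    out = np.empty_like(x, dtype=float)
    out[0::2] = x[0::2] * cos_vals - x[1::2] * sin_vals
    out[1::2] = x[0::2] * sin_vals + x[1::2] * cos_vals
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Position pairs (3, 5) and (10, 12) share the same offset of 2,
# so the query-key dot products are identical
s1 = rope(q, 3, 8) @ rope(k, 5, 8)
s2 = rope(q, 10, 8) @ rope(k, 12, 8)
print(np.isclose(s1, s2))  # True
```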

RoPE advantages:

  • Natural encoding of relative position information
  • Supports length extrapolation
  • Computationally efficient

ALiBi (Attention with Linear Biases)

ALiBi encodes position information by adding linear biases to attention scores.

python
def alibi_bias(seq_len, num_heads):
    """
    Calculate ALiBi bias matrix
    """
    # Head-specific slopes: geometric sequence 2^(-8/n), 2^(-16/n), ...
    slopes = np.array([2 ** (-8 * i / num_heads) for i in range(1, num_heads + 1)])

    # Distance penalty: 0 on the diagonal, increasingly negative with distance
    positions = np.arange(seq_len)
    bias = -np.abs(positions[:, None] - positions[None, :])

    # One bias matrix per head, shape (num_heads, seq_len, seq_len)
    alibi = slopes[:, None, None] * bias[None, :, :]

    return alibi

print(alibi_bias(seq_len=5, num_heads=4).shape)  # (4, 5, 5)

Sliding Window Attention

Sliding window attention limits each token to only attend to tokens within a fixed range, reducing computational complexity.

mermaid
graph LR
    subgraph "Sliding Window Attention"
        T1["Token 1"] --> W1["Window 1-3"]
        T2["Token 2"] --> W2["Window 1-4"]
        T3["Token 3"] --> W3["Window 1-5"]
        T4["Token 4"] --> W4["Window 2-6"]
        T5["Token 5"] --> W5["Window 3-7"]
    end
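The corresponding attention mask can be built in a few lines: each token attends only to itself and the previous window-1 tokens, so the number of attended positions per token is bounded by the window size rather than the sequence length (a minimal sketch):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    # True where attention is allowed: causal, limited to the last `window` positions
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=6, window=3)
print(mask.astype(int))
```

Each row of the mask has at most `window` True entries, which is what makes the attention cost linear in sequence length.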

Token Counting and Cost Estimation

Accurate token counting is crucial for cost control.

Counting Tokens with tiktoken

python
import tiktoken

def count_tokens(text, model="gpt-4"):
    """
    Count tokens in text
    """
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    
    return len(encoding.encode(text))

def estimate_cost(input_tokens, output_tokens, model="gpt-4"):
    """
    Estimate API call cost (USD)
    """
    # Example list prices in USD per 1K tokens (2024; check providers for current rates)
    pricing = {
        "gpt-4": {"input": 0.03, "output": 0.06},
        "gpt-4-turbo": {"input": 0.01, "output": 0.03},
        "gpt-4o": {"input": 0.005, "output": 0.015},
        "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
        "claude-3-opus": {"input": 0.015, "output": 0.075},
        "claude-3-sonnet": {"input": 0.003, "output": 0.015},
    }
    
    if model not in pricing:
        return None
    
    input_cost = (input_tokens / 1000) * pricing[model]["input"]
    output_cost = (output_tokens / 1000) * pricing[model]["output"]
    
    return input_cost + output_cost

text = """
Large Language Models (LLMs) are deep learning-based natural language processing models
that learn statistical patterns and semantic knowledge from large-scale text data through pre-training.
"""

tokens = count_tokens(text)
cost = estimate_cost(tokens, 500, "gpt-4o")

print(f"Input tokens: {tokens}")
print(f"Estimated output tokens: 500")
print(f"Estimated cost: ${cost:.4f}")

Token Counter Utility Class

python
class TokenCounter:
    """Token counting and cost estimation utility"""
    
    def __init__(self, model="gpt-4"):
        self.model = model
        try:
            self.encoding = tiktoken.encoding_for_model(model)
        except KeyError:
            self.encoding = tiktoken.get_encoding("cl100k_base")
    
    def count(self, text):
        """Count tokens"""
        return len(self.encoding.encode(text))
    
    def count_messages(self, messages):
        """Approximate token count for chat-format messages"""
        total = 0
        for message in messages:
            total += 4  # approximate per-message overhead (role and formatting tokens; varies by model)
            for key, value in message.items():
                total += self.count(value)
                if key == "name":
                    total -= 1  # "name" is counted in place of "role"
        total += 2  # the reply is primed with assistant tokens
        return total
    
    def truncate_to_limit(self, text, max_tokens):
        """Truncate text to specified token count"""
        tokens = self.encoding.encode(text)
        if len(tokens) <= max_tokens:
            return text
        return self.encoding.decode(tokens[:max_tokens])
    
    def split_by_tokens(self, text, chunk_size, overlap=0):
        """Split text into chunks by token count, with optional overlap"""
        tokens = self.encoding.encode(text)
        chunks = []
        start = 0
        while start < len(tokens):
            end = min(start + chunk_size, len(tokens))
            chunks.append(self.encoding.decode(tokens[start:end]))
            if end == len(tokens):
                break  # done; stepping back by overlap here would loop forever
            start = end - overlap
        return chunks

counter = TokenCounter("gpt-4")
long_text = "This is a very long text..." * 100
chunks = counter.split_by_tokens(long_text, chunk_size=100, overlap=20)
print(f"Split into {len(chunks)} chunks")

Strategies for Optimizing Context Usage

1. Streamline System Prompts

python
verbose_prompt = """
You are a very helpful AI assistant. Your task is to help users answer various questions.
You should always maintain a friendly and professional attitude. When users ask questions,
you need to carefully analyze the problem and then provide detailed, accurate answers.
If you're not sure about an answer, please honestly tell the user.
"""

concise_prompt = """
You are a professional AI assistant. Provide accurate, concise answers. Be honest when uncertain.
"""

counter = TokenCounter()
print(f"Verbose prompt: {counter.count(verbose_prompt)} tokens")
print(f"Concise prompt: {counter.count(concise_prompt)} tokens")

2. Conversation History Management

python
def manage_conversation_history(messages, max_tokens, counter):
    """
    Manage conversation history within token limits
    Strategy: Keep system messages and most recent conversations
    """
    system_messages = [m for m in messages if m.get("role") == "system"]
    other_messages = [m for m in messages if m.get("role") != "system"]
    
    system_tokens = counter.count_messages(system_messages)
    available_tokens = max_tokens - system_tokens
    
    kept_messages = []
    current_tokens = 0
    
    for message in reversed(other_messages):
        msg_tokens = counter.count_messages([message])
        if current_tokens + msg_tokens <= available_tokens:
            kept_messages.insert(0, message)
            current_tokens += msg_tokens
        else:
            break
    
    return system_messages + kept_messages
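The same retention strategy works with any counting function. A self-contained toy (using whitespace word counts as a stand-in for real token counts, and a simplified restatement of the function above) shows the newest-first trimming in action:

```python
def trim_history(messages, max_units, count_fn):
    # Keep system messages, then add the most recent other messages that still fit
    system = [m for m in messages if m["role"] == "system"]
    budget = max_units - sum(count_fn(m) for m in system)
    kept = []
    for m in reversed([m for m in messages if m["role"] != "system"]):
        cost = count_fn(m)
        if cost > budget:
            break
        kept.insert(0, m)
        budget -= cost
    return system + kept

def count_words(m):
    # Stand-in for a real token counter
    return len(m["content"].split())

msgs = [
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "first question about tokens"},
    {"role": "assistant", "content": "first answer"},
    {"role": "user", "content": "second question"},
]
print([m["content"] for m in trim_history(msgs, max_units=7, count_fn=count_words)])
# ['Be concise.', 'first answer', 'second question']
```

Note that the oldest message is dropped first: once a message doesn't fit, everything older than it is discarded as well.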

3. Document Chunking Strategies

python
def smart_chunk_document(text, max_chunk_tokens=1000):
    """
    Smart document chunking: split at sentence boundaries,
    carrying the last two sentences forward as overlap
    """
    import re
    
    counter = TokenCounter()
    sentences = re.split(r'(?<=[.!?])\s+', text)
    
    chunks = []
    current_chunk = []
    current_tokens = 0
    
    for sentence in sentences:
        sentence_tokens = counter.count(sentence)
        
        if current_tokens + sentence_tokens > max_chunk_tokens:
            if current_chunk:
                chunks.append(' '.join(current_chunk))
            
            overlap_text = ' '.join(current_chunk[-2:]) if len(current_chunk) >= 2 else ''
            current_chunk = [overlap_text, sentence] if overlap_text else [sentence]
            current_tokens = counter.count(' '.join(current_chunk))
        else:
            current_chunk.append(sentence)
            current_tokens += sentence_tokens
    
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    
    return chunks

4. Compress History with Summaries

python
def compress_history_with_summary(messages, summarizer_fn, threshold_tokens=2000):
    """
    When history is too long, use summaries to compress earlier conversations
    """
    counter = TokenCounter()
    total_tokens = counter.count_messages(messages)
    
    if total_tokens <= threshold_tokens:
        return messages
    
    system_msgs = [m for m in messages if m["role"] == "system"]
    other_msgs = [m for m in messages if m["role"] != "system"]
    
    split_point = len(other_msgs) // 2
    old_msgs = other_msgs[:split_point]
    recent_msgs = other_msgs[split_point:]
    
    old_text = "\n".join([f"{m['role']}: {m['content']}" for m in old_msgs])
    summary = summarizer_fn(old_text)
    
    summary_msg = {"role": "system", "content": f"Previous conversation summary: {summary}"}
    
    return system_msgs + [summary_msg] + recent_msgs

Tool Recommendations

The following tools can improve efficiency when working with tokens and context-related tasks:

  • tiktoken: OpenAI's tokenizer library for fast local token counting
  • OpenAI's online Tokenizer: paste text in the browser to inspect its tokenization
  • Hugging Face transformers: tokenizers for BERT, LLaMA, and other open models
  • SentencePiece: language-agnostic tokenizer training and inference

Summary

Understanding tokens and context windows is fundamental to efficiently using large language models:

  1. Tokens are the basic units of LLMs: Different from characters or words, tokens are determined by tokenization algorithms
  2. Tokenization algorithms have distinct characteristics: BPE, WordPiece, SentencePiece are suited for different scenarios
  3. Context windows limit total tokens: Including input, output, and conversation history
  4. Long-context technologies continue to evolve: RoPE, ALiBi, and other technologies extend model capabilities
  5. Cost optimization requires accurate counting: Use tools like tiktoken for precise estimation
  6. Multiple strategies can optimize usage: Streamline prompts, manage history, smart chunking

Mastering this knowledge enables you to better control API costs and design more efficient AI applications.

FAQ

Why does Chinese consume more tokens than English?

This relates to the training data of tokenization algorithms. Models like GPT are primarily trained on English corpora, where common English words are typically single tokens, while Chinese characters often require multiple tokens. For example, "人工智能" (artificial intelligence) might need 3-4 tokens, while "AI" only needs 1 token. Using Chinese-optimized models (like Qwen) can improve Chinese token efficiency.

Is a larger context window always better?

Not necessarily. A larger context window means: 1) Higher API costs (charged per token); 2) Longer response times; 3) Potential attention dilution (the model may struggle to focus on key information). Choose an appropriately sized context window based on actual needs, and use techniques like Retrieval-Augmented Generation (RAG) to optimize information utilization.

How can I estimate the token count of a text?

The most accurate method is to use the corresponding model's tokenizer. For OpenAI models, use the tiktoken library; for other models, use their respective tokenizers. For rough estimates, English averages about 4 characters = 1 token, while Chinese averages about 1.5-2 characters = 1 token. Online tools like OpenAI's Tokenizer can also quickly calculate tokens.

Does conversation history consume the context window?

Yes, each turn's input and output accumulates in the context. This is why long conversations may cause early content to be truncated. Consider implementing conversation history management strategies: keep recent conversations, use summaries to compress history, or reset context at appropriate times.

Are tokens interchangeable between different models?

No. Different models use different tokenizers, and the same text may have different token counts across models. For example, GPT-4 uses cl100k_base encoding, while BERT uses WordPiece. When switching models, you need to recalculate token counts and costs.