What is a Tokenizer?
A tokenizer is the component that converts text into the token IDs a language model can process and decodes generated token IDs back into text.
How It Works
A tokenizer is the boundary between human-readable text and the integer sequence consumed by an LLM. It determines how words, punctuation, whitespace, code, numbers, and non-English text are split into tokens. Tokenization affects context-window usage, billing, latency, prompt length, streaming output, and evaluation. Two models with similar parameter counts can behave differently on the same prompt if their tokenizers split the text differently, especially for code, structured data, rare terms, and multilingual content.
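The text-to-ID mapping described above can be sketched with a toy greedy longest-match tokenizer. This is a minimal illustration, not the algorithm of any particular model: the `PIECES` vocabulary, the ID layout (0 to 255 for raw bytes, 256 and up for learned pieces), and the function names are all assumptions made for the example.

```python
# Hypothetical vocabulary of multi-byte pieces; real vocabularies are learned
# from data and contain tens of thousands of entries.
PIECES = [b"token", b"izer", b"ization", b" the", b" a", b"s"]

# IDs 0..255 are reserved for raw bytes, so pieces start at 256.
PIECE_TO_ID = {piece: 256 + i for i, piece in enumerate(PIECES)}
ID_TO_PIECE = {i: piece for piece, i in PIECE_TO_ID.items()}
MAX_LEN = max(len(piece) for piece in PIECES)


def encode(text: str) -> list[int]:
    """Convert text to token IDs using greedy longest-match with byte fallback."""
    data = text.encode("utf-8")
    ids = []
    pos = 0
    while pos < len(data):
        # Try the longest candidate first, shrinking until a piece matches.
        for length in range(min(MAX_LEN, len(data) - pos), 0, -1):
            piece = data[pos:pos + length]
            if piece in PIECE_TO_ID:
                ids.append(PIECE_TO_ID[piece])
                pos += length
                break
        else:
            # No piece matched: emit the raw byte value as its own token.
            ids.append(data[pos])
            pos += 1
    return ids


def decode(ids: list[int]) -> str:
    """Convert token IDs back to text by concatenating their byte pieces."""
    raw = b"".join(ID_TO_PIECE[i] if i >= 256 else bytes([i]) for i in ids)
    return raw.decode("utf-8")


print(encode("tokenization"))  # [256, 258]: two tokens for a known word
print(encode("zebra"))         # five byte-level tokens for an unknown word
print(decode(encode("tokenizers")))
```

With this vocabulary, a word made of known pieces costs two tokens, while an equally short unknown word costs one token per byte. The same effect, at larger scale, is why rare terms and non-English text can consume more of the context window.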
Key Characteristics
- Maps text to token IDs and token IDs back to text
- Affects context length, cost, latency, and prompt budgeting
- Can split the same word or symbol differently across model families
- Important for multilingual text, code, JSON, numbers, and rare vocabulary
- Must be the same tokenizer the model was trained with for reliable inference
Common Use Cases
- Estimating prompt length before sending a request to an LLM
- Comparing model cost for English, Chinese, code, and structured prompts
- Debugging why a prompt exceeds the context window
- Designing chunk sizes for RAG based on actual token counts
- Measuring generated output length in tokens
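Several of these use cases reduce to a budget check: does the prompt, plus the room reserved for the answer, fit in the context window? The sketch below assumes a hypothetical `count_tokens` stand-in that counts words and punctuation marks. For real numbers, swap in the target model's own tokenizer.

```python
import re


def count_tokens(text: str) -> int:
    """Stand-in counter (an assumption): each word or punctuation mark is one token."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def fits_context(prompt: str, context_window: int, reserved_output: int) -> bool:
    """True if the prompt plus the reserved output budget fits in the window."""
    return count_tokens(prompt) + reserved_output <= context_window


def overflow(prompt: str, context_window: int, reserved_output: int) -> int:
    """Number of tokens by which the request exceeds the window (0 if it fits)."""
    return max(0, count_tokens(prompt) + reserved_output - context_window)


prompt = "Hello, world!"
print(count_tokens(prompt))                                       # 4
print(fits_context(prompt, context_window=10, reserved_output=6))  # True
print(overflow(prompt, context_window=10, reserved_output=7))      # 1
```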
Example
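A minimal sketch of encoding, counting, and decoding, assuming the simplest possible tokenizer: one token per UTF-8 byte. The function names `byte_encode` and `byte_decode` are illustrative. Production tokenizers merge frequent byte sequences into larger tokens, so their counts are usually lower, but the relative gap between scripts often points in the same direction.

```python
def byte_encode(text: str) -> list[int]:
    """Encode text as a list of token IDs, one per UTF-8 byte."""
    return list(text.encode("utf-8"))


def byte_decode(ids: list[int]) -> str:
    """Decode byte-level token IDs back into text."""
    return bytes(ids).decode("utf-8")


samples = {
    "english": "Hello, world!",
    "chinese": "你好，世界！",
    "json": '{"id": 42}',
}

for name, text in samples.items():
    ids = byte_encode(text)
    # The round trip must reproduce the original text exactly.
    assert byte_decode(ids) == text
    print(f"{name}: {len(text)} characters -> {len(ids)} tokens")
```

Under this assumption, the English sample uses 13 tokens for 13 characters, while the Chinese sample uses 18 tokens for 6 characters, because each of its characters occupies three bytes in UTF-8.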
Frequently Asked Questions
Why does tokenization matter for LLM cost?
Most LLM APIs bill by input and output tokens. The tokenizer determines how many tokens a prompt and answer contain.
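The arithmetic can be sketched as follows. The per-million-token prices are placeholders chosen for the example, not the rates of any real provider, and `estimate_cost` is a hypothetical helper.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimate request cost when input and output tokens are priced separately."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


# Placeholder prices: 2.00 per million input tokens, 8.00 per million output tokens.
cost = estimate_cost(1200, 300, 2.0, 8.0)
print(f"{cost:.4f}")  # 0.0048
```

Because the token counts come from the tokenizer, the same text sent to two models with identical prices can still cost different amounts.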
Is a token the same as a word?
No. A token can be a word, part of a word, punctuation, whitespace, or a byte-like unit depending on the tokenizer.
Can two models tokenize the same text differently?
Yes. Different tokenizer vocabularies and algorithms can produce different token counts for the same text.
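A quick way to see this is to run two deliberately simple tokenizers over the same string. Both functions below are toy stand-ins, not the tokenizers of any real model: one splits on whitespace, the other treats every character as a token.

```python
def whitespace_tokenize(text: str) -> list[str]:
    """Toy tokenizer: one token per whitespace-separated chunk."""
    return text.split()


def char_tokenize(text: str) -> list[str]:
    """Toy tokenizer: one token per character, including spaces."""
    return list(text)


text = "x = foo(bar, 42)"
print(whitespace_tokenize(text))       # ['x', '=', 'foo(bar,', '42)']
print(len(whitespace_tokenize(text)))  # 4
print(len(char_tokenize(text)))        # 16
```

Real tokenizers sit between these two extremes, and where each one lands for a given input depends on its vocabulary and splitting rules.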
Why is tokenizer choice important for RAG?
RAG chunk sizes and context budgets should be based on the tokenizer used by the target model, not only on characters or words.
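One way to apply this is to chunk over the token sequence rather than over characters. The sketch below assumes the document has already been tokenized with the target model's tokenizer. Here `text.split()` stands in for that step, and `chunk_by_tokens` is a hypothetical helper.

```python
def chunk_by_tokens(tokens: list, max_tokens: int, overlap: int = 0) -> list[list]:
    """Split a token sequence into windows of at most max_tokens tokens.

    Consecutive windows share `overlap` tokens so context is not cut abruptly.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in the range [0, max_tokens)")

    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        # Stop once a window has reached the end of the sequence.
        if start + max_tokens >= len(tokens):
            break
    return chunks


# Stand-in tokenization (an assumption): whitespace-separated words.
tokens = "a b c d e f g h i j".split()
for chunk in chunk_by_tokens(tokens, max_tokens=4, overlap=1):
    print(chunk)
```

With ten tokens, a window of four, and an overlap of one, this produces three chunks, each starting on the last token of the previous one.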