What is BM25?

BM25 is a probabilistic lexical ranking function that scores documents based on query term matches, term frequency saturation, inverse document frequency, and document length normalization.

How It Works

BM25 is a durable baseline for information retrieval because it is simple, interpretable, and surprisingly strong for exact text matching. It improves on raw term frequency by reducing the marginal value of repeated terms and normalizing for document length. In AI search and RAG, BM25 is frequently used as the sparse retrieval branch in hybrid search, especially when queries include exact names, error codes, legal phrases, configuration keys, or other tokens where embeddings alone may be unreliable.
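A commonly cited formulation of the scoring function makes these mechanics concrete. The notation is introduced here for illustration and is an assumption, not something the text above defines: f(q, D) is the frequency of query term q in document D, |D| is the document length in tokens, avgdl is the average document length in the collection, N is the number of documents, n(q) is the number of documents containing q, and k_1 and b are tuning parameters (k_1 between roughly 1.2 and 2.0 and b = 0.75 are typical defaults). Implementations differ in details, most visibly in the IDF variant.

```latex
\mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q) \cdot
  \frac{f(q, D)\,(k_1 + 1)}
       {f(q, D) + k_1 \left(1 - b + b \, \frac{|D|}{\mathrm{avgdl}}\right)}

\mathrm{IDF}(q) = \ln\!\left(\frac{N - n(q) + 0.5}{n(q) + 0.5} + 1\right)
```

As f(q, D) grows, the fraction approaches k_1 + 1 instead of growing without bound, which is the saturation effect. The parameter b controls how strongly documents longer than average are penalized: b = 0 disables length normalization and b = 1 applies it fully.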

Key Characteristics

Ranks documents using lexical term matching rather than learned semantic vectors
Uses inverse document frequency to reward informative terms
Applies term-frequency saturation so repeated words do not dominate without limit
Normalizes by document length to avoid unfairly favoring long documents
Remains a strong baseline and hybrid-search component for RAG systems
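The saturation and length-normalization characteristics can be seen in isolation with a small sketch. The function name `tf_component` and the parameter values (k1 = 1.2, b = 0.75) are assumptions chosen for illustration; `length_ratio` stands for the document length divided by the average document length.

```python
def tf_component(tf, k1=1.2, b=0.75, length_ratio=1.0):
    """Term-frequency part of a BM25 score for a single term.

    length_ratio is document length / average document length
    (hypothetical parameter name, used here for illustration).
    """
    norm = k1 * (1.0 - b + b * length_ratio)
    return tf * (k1 + 1.0) / (tf + norm)


term_counts = (1, 2, 5, 20, 100)

# A document of average length: the value rises quickly, then flattens
# as it approaches the upper bound k1 + 1.
average_doc = [tf_component(tf) for tf in term_counts]

# A document twice the average length: the same counts earn less credit.
long_doc = [tf_component(tf, length_ratio=2.0) for tf in term_counts]

for tf, avg, lng in zip(term_counts, average_doc, long_doc):
    print(f"tf={tf:>3}  average-length={avg:.3f}  double-length={lng:.3f}")
```

Going from 1 to 5 occurrences raises the value noticeably, while going from 20 to 100 barely moves it, and every value for the longer document is lower than its average-length counterpart.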

Common Use Cases

Searching developer docs for exact API names and parameters
Retrieving support articles by error code
Providing a lexical baseline for RAG evaluation
Combining BM25 hits with dense-retrieval hits through rank fusion
Serving compliance or legal search where exact wording matters

Example

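The following is a minimal sketch of BM25 ranking over a tiny in-memory collection, not a production implementation. The `tokenize` and `bm25_scores` helpers, the sample documents, and the parameter values (k1 = 1.5, b = 0.75) are assumptions made for illustration; real search engines add analyzers, stemming, and inverted indexes.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, with no stemming
    # or punctuation handling.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document, in the same order as docs."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: how many documents contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    query_terms = tokenize(query)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        norm = k1 * (1.0 - b + b * len(tokens) / avgdl)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative IDF variant: rare terms get larger weights.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + norm)
        scores.append(score)
    return scores


docs = [
    "connection refused error ECONNREFUSED when starting the server",
    "how to configure the server port and host",
    "general troubleshooting guide for network problems",
]
scores = bm25_scores("ECONNREFUSED error", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Only the first document contains the query terms, so it is the only one with a non-zero score. The other two documents discuss related topics but share no query term, which illustrates why BM25 is described as lexical rather than semantic.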

Frequently Asked Questions

Why is BM25 still used with LLMs?

LLMs in RAG pipelines depend on retrieval to ground their answers in source text. BM25 is strong at surfacing literal evidence, exact terms, and identifiers, so it remains a useful retrieval component.

Is BM25 semantic search?

No. BM25 is lexical: it scores documents by which query terms they contain and how those terms are weighted, so on its own it does not match synonyms or paraphrases. Semantic search usually relies on embeddings.

When does BM25 outperform dense retrieval?

It often performs better for exact strings, error codes, product names, legal phrases, and queries where the exact wording is the signal.

How is BM25 used in hybrid search?

BM25 produces sparse lexical candidates, dense retrieval produces semantic candidates, and the rankings are fused or reranked.
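One widely used way to fuse the two rankings is Reciprocal Rank Fusion (RRF), sketched below. The function name, the document IDs, and the two input rankings are hypothetical; k = 60 is a commonly used default for the smoothing constant, not a requirement.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    where rank starts at 1 for the top hit.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of the sparse and dense retrieval branches.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

RRF uses only rank positions, so BM25 scores and vector similarities never need to be put on a common scale. Documents that both branches retrieve rise above documents found by only one.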

Related Terms

Sparse Retrieval

Sparse Retrieval is a lexical search method that represents queries and documents with sparse term-weight vectors and retrieves results by matching explicit terms.

Hybrid Search

Hybrid Search is a technique in information retrieval and RAG (Retrieval-Augmented Generation) systems that combines multiple search algorithms. The most common combination fuses dense vector retrieval, which captures contextual and conceptual meaning, with sparse keyword retrieval (typically BM25), which focuses on exact lexical matching and finding specific entities. The system runs both searches in parallel and then merges their results using a fusion algorithm such as Reciprocal Rank Fusion (RRF). This lets the system capture user intent while reducing the risk of missing documents that contain specific product names, IDs, or industry jargon.

Dense Retrieval

Dense Retrieval is a semantic search method that represents queries and documents as dense embedding vectors and retrieves results by vector similarity.

Retriever

Retriever is a query-to-context component that receives a user or agent query and returns relevant documents, chunks, records, passages, or tool-readable context for downstream reasoning and generation.
