When building Knowledge Base QA systems based on Large Language Models (LLMs), RAG (Retrieval-Augmented Generation) has become the industry standard. However, as the number of ingested documents grows, developers run into the same critical pain point: the retrieved content is simply not what the user wants.

Traditional Naive RAG over-relies on Dense Embedding for semantic similarity calculations. While this method can capture "similar meaning" snippets, it often suffers severe "semantic drift" when handling queries containing proper nouns, product models, or precise numerical values.

This article will take you beyond simple vector comparisons, deeply exploring how to build a high-accuracy enterprise-grade RAG retrieval pipeline by introducing Hybrid Search and Rerank models.

1. Why Does Pure Vector Retrieval (Dense Retrieval) Fail?

Suppose your knowledge base contains user manuals for multiple phones. The user asks: "How do I reset the network settings on an iPhone 15 Pro Max?"

  • Flaws of Pure Vector Retrieval: When the Embedding model encodes, it might consider "iPhone 14 Pro Max" or "restore factory settings" to be semantically very close to the query. The final recalled Top-5 results might be flooded with instructions for other phone models, leading the LLM to generate incorrect guidance.
  • Poor Matching for Long-Tail Entities: For some extremely obscure proper nouns (like an internal system code SYS-REQ-009X), if the Embedding model hasn't seen it during pre-training, it cannot map it to the correct vector space, leading to complete failure.

To compensate for this flaw, we need to introduce traditional Lexical Search (e.g., BM25).

2. Understanding the Hybrid Search Architecture

The core concept of Hybrid Search is: Combining the generalization capability of Dense Retrieval with the precise matching capability of Sparse Retrieval (like BM25).

2.1 Architecture Design

In modern vector databases that support hybrid search (such as Pinecone, Weaviate, Milvus, or Elasticsearch), a single query triggers two recall paths simultaneously:

  1. Dense Path: Uses an Embedding model (like text-embedding-3-small) to calculate the query's vector and find the semantically closest Chunks.
  2. Sparse/BM25 Path: Tokenizes the query and scores term matches with BM25 (a TF-IDF-family ranking function), finding Chunks that contain the exact keywords (like iPhone 15 Pro Max).
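
To make the sparse path concrete, here is a minimal, pure-Python sketch of BM25 scoring (the toy corpus and whitespace tokenization are illustrative assumptions; in production you would use Elasticsearch or a library such as rank_bm25):

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score each document in `corpus` against `query` with BM25."""
    tokenized = [doc.lower().split() for doc in corpus]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    # Document frequency: how many documents contain each term
    df = Counter()
    for doc in tokenized:
        for term in set(doc):
            df[term] += 1
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF (Lucene-style), always non-negative
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term-frequency saturation (k1) and length normalization (b)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            )
        scores.append(score)
    return scores

corpus = [
    "Reset network settings on iPhone 15 Pro Max",
    "Restore factory settings on iPhone 14 Pro Max",
    "Battery care tips for Android phones",
]
print(bm25_scores("iPhone 15 network reset", corpus))
```

Note how the exact tokens "15", "network", and "reset" push the correct manual to the top, while the semantically similar iPhone 14 document scores far lower, which is precisely what the dense path cannot guarantee.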

2.2 RRF (Reciprocal Rank Fusion)

Since there are two recall paths, how do we merge their results into a unified Top-K list? The most commonly used algorithm in the industry is RRF (Reciprocal Rank Fusion).

The calculation formula for RRF is very simple: it does not rely on the absolute value of the Score, but looks at the Rank of the document in its respective list.

$$ \mathrm{RRF}(d) = \frac{1}{k + \mathrm{rank}_{dense}(d)} + \frac{1}{k + \mathrm{rank}_{sparse}(d)} $$ (where $k$ is usually 60)

Through RRF, documents that rank high in both recall paths are given the highest final scores.
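
The formula translates directly into a few lines of Python. A minimal sketch (the document IDs and the two ranked lists are illustrative):

```python
def rrf_fuse(dense_ranked, sparse_ranked, k=60):
    """Fuse two ranked lists of doc IDs with Reciprocal Rank Fusion."""
    scores = {}
    for ranked in (dense_ranked, sparse_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            # Each list contributes 1 / (k + rank); absolute scores are ignored
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, highest first
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7", "d2"]   # semantic recall order
sparse = ["d1", "d9", "d3", "d5"]  # BM25 recall order
print(rrf_fuse(dense, sparse))     # d1 and d3 top the fused list
```

Documents that appear near the top of both lists (d1 and d3 here) accumulate two large reciprocal terms and dominate the fused ranking, exactly as the formula predicts.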

3. Understanding the Rerank Mechanism: Introducing the Cross-Encoder

Hybrid Search solves the "recall omission" problem, but it also brings another challenge: the number of recalled documents increases (for example, Dense recalls 50, BM25 recalls 50; after merging and deduplication there may be 80 left). If we feed all 80 to the LLM, it is not only expensive but also triggers severe "Lost in the Middle" effects.

At this point, we need a more powerful referee—a Rerank model (usually a Cross-Encoder).

3.1 Bi-Encoder vs Cross-Encoder

  • Bi-Encoder (existing Embedding models): Encodes the Query and Document separately, then calculates cosine similarity. Extremely fast, suitable for initial screening from millions of documents.
  • Cross-Encoder (Rerank models): Concatenates the Query and Document together (e.g., [CLS] Query [SEP] Document [SEP]), and inputs them to the model for deep interactive calculation simultaneously. It can capture extremely subtle logical correlations between the two, with extremely high accuracy, but at a huge computational cost.
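
The cost gap is easiest to see as a count of model forward passes. This back-of-the-envelope sketch (the function names and the 1M-document corpus are illustrative; real latency also depends on sequence length and batching) shows why the Cross-Encoder is only applied to a small shortlist:

```python
def bi_encoder_passes(n_docs, n_queries):
    # Documents are embedded ONCE, offline; each query adds one encoder call,
    # after which similarity is cheap vector math against the stored index.
    return n_docs + n_queries

def cross_encoder_passes(n_docs, n_queries):
    # Every (query, document) pair needs its own joint forward pass,
    # so the cost scales multiplicatively and cannot be precomputed.
    return n_docs * n_queries

print(bi_encoder_passes(1_000_000, 100))     # 1,000,100 passes
print(cross_encoder_passes(1_000_000, 100))  # 100,000,000 passes
```

This is why the two models are combined rather than chosen between: the Bi-Encoder narrows millions of documents down to dozens, and the Cross-Encoder spends its expensive joint passes only on that shortlist.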

3.2 Classic Two-Stage Retrieval Pipeline

```mermaid
graph TD
    Query["User Query"] --> Dense["Dense Retrieval (Top 50)"]
    Query --> Sparse["Sparse Retrieval / BM25 (Top 50)"]
    Dense --> RRF["RRF Fusion"]
    Sparse --> RRF
    RRF --> Rerank["Cross-Encoder Rerank (Top 80)"]
    Rerank --> TopK["Final Top 5 Contexts"]
    TopK --> LLM["LLM Generation"]
```

4. Practical Guide: Building Advanced RAG with Hybrid Search + Rerank

Below we will use Python with LangChain to demonstrate the core configuration logic (LlamaIndex offers equivalent building blocks).

```python
from langchain.retrievers import EnsembleRetriever
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain_openai import OpenAIEmbeddings

# `docs` is assumed to be your pre-chunked list of Document objects

# 1. Prepare two retrievers
# Dense Retriever (Vector)
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embeddings)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 50})

# Sparse Retriever (BM25 Keywords)
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 50

# 2. RRF Hybrid Retrieval (Ensemble)
# EnsembleRetriever fuses the two result lists with RRF; here both
# paths are weighted equally
hybrid_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, bm25_retriever], weights=[0.5, 0.5]
)

# 3. Introduce a BGE-Reranker model for reranking
# The local open-source bge-reranker-base offers a good balance of speed and quality
model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
compressor = CrossEncoderReranker(model=model, top_n=5)  # keep only the Top 5

# 4. Build the final two-stage retrieval pipeline
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=hybrid_retriever
)

# 5. Execute retrieval (`invoke` replaces the deprecated `get_relevant_documents`)
query = "How to reset the network settings on an iPhone 15 Pro Max?"
final_docs = compression_retriever.invoke(query)

for i, doc in enumerate(final_docs):
    print(f"Rank {i + 1}: {doc.page_content[:100]}...")
```

5. FAQ

Q: How much latency does Rerank add? How do I balance accuracy and performance? A: Latency depends on the parameter size of the Rerank model and the number of input documents (the longer the Chunk, the slower the calculation). For real-time QA systems, it is recommended to:

  1. Don't recall too many in the initial hybrid retrieval (e.g., control it to 30-50).
  2. Choose a lightweight Rerank model (such as bge-reranker-v2-m3 or Cohere's lightweight version).
  3. If extreme speed is required, use commercial Rerank APIs (such as Jina AI or Cohere Rerank) to shift the computational pressure to the cloud.

Q: How should I tune the weights for Hybrid Search? A: There is no static formula. If your documents contain a lot of IDs, proper nouns, or code snippets, you should appropriately increase the weight of Sparse (BM25); if queries are mostly colloquial, vague descriptions, increase the weight of Dense (Vector).
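
The effect of shifting weight toward one path can be illustrated with a weighted variant of RRF, which is the same fusion idea EnsembleRetriever applies internally (the document IDs and ranked lists below are illustrative):

```python
def weighted_rrf(dense_ranked, sparse_ranked, w_dense=0.5, w_sparse=0.5, k=60):
    """Reciprocal Rank Fusion where each recall path carries its own weight."""
    scores = {}
    for weight, ranked in ((w_dense, dense_ranked), (w_sparse, sparse_ranked)):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["guide_14", "faq", "guide_15"]   # vector path prefers a near-miss model
sparse = ["guide_15", "guide_14"]         # BM25 nails the exact model number

print(weighted_rrf(dense, sparse, 0.5, 0.5)[0])  # guide_14
print(weighted_rrf(dense, sparse, 0.3, 0.7)[0])  # guide_15
```

With equal weights the vector path's near-miss wins; boosting the sparse weight to 0.7 lets the exact keyword match take the top spot, which is the behavior you want for ID-heavy or model-number-heavy corpora.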

Conclusion

When building enterprise-grade RAG systems, relying on "brute force" to stuff more documents into the Prompt often backfires. By implementing a Hybrid Search + RRF + Cross-Encoder Rerank two-stage retrieval pipeline, we can vastly improve the precision of recall without significantly increasing LLM Token costs, thereby fundamentally mitigating hallucination problems.