Semantic search is fundamentally transforming how we access information. From Google's intelligent understanding to e-commerce platform recommendations, from enterprise knowledge base Q&A to RAG system context retrieval, semantic search technology has permeated every aspect of our digital lives. This guide will help you deeply understand the core principles of semantic search and teach you step-by-step how to build a high-quality semantic search system.

TL;DR

  • Semantic search is based on semantic understanding rather than keyword matching, capable of understanding query intent and contextual meaning
  • Core technology: Embedding models convert text into vectors, achieving semantic matching through vector similarity
  • Embedding model selection: Use all-MiniLM-L6-v2 for general scenarios, BGE series for Chinese, OpenAI text-embedding-3 for high precision
  • Search strategies: Pure semantic search suits Q&A scenarios, hybrid search (semantic + keyword) suits general search
  • Performance optimization: Vector database indexing, query caching, and chunking strategies are key

Semantic search is a search technology based on natural language understanding that not only matches keywords but also understands the true intent and contextual meaning of queries.

mermaid
graph TB
  subgraph SG_Keyword_Search["Keyword Search"]
    Q1["Query: How to improve code quality"] --> K1[Keyword Extraction]
    K1 --> K2["Exact Match: code AND quality"]
    K2 --> K3["Results: Documents containing these words"]
  end
  subgraph SG_Semantic_Search["Semantic Search"]
    Q2["Query: How to improve code quality"] --> S1[Semantic Understanding]
    S1 --> S2[Vector Representation]
    S2 --> S3[Similarity Calculation]
    S3 --> S4["Results: Semantically related documents"]
  end
  K3 -.-> R1["May miss: Code review best practices"]
  S4 --> R2["Can find: Code review best practices, Software engineering methodologies, Refactoring tips guide"]

| Comparison | Keyword Search | Semantic Search |
| --- | --- | --- |
| Matching Method | Exact vocabulary matching | Semantic similarity matching |
| Synonym Handling | Requires manual configuration | Automatic understanding |
| Query Understanding | Literal meaning | Deep intent |
| Long-tail Queries | Poor performance | Good performance |
| Implementation Complexity | Low | Medium |
| Computational Resources | Low | Higher |

What Problems Can Semantic Search Solve

Problem 1: Synonyms and Near-synonyms

When a user searches for "automobile", keyword search cannot find documents containing "car" or "vehicle". Semantic search understands the semantic relationship between these words.

Problem 2: Query Intent Understanding

When a user searches for "Python handle Excel", the real intent might be looking for pandas or openpyxl tutorials, not just documents containing these keywords.

Problem 3: Long-tail Queries

When a user searches "why is my program running slow", this natural-language query rarely yields good results with keyword search: few relevant documents contain those exact words. Semantic search can match it to content about profiling and performance optimization.

How Semantic Search Works

The core of semantic search is converting text into vector representations, then measuring semantic relevance through vector similarity.

Core Process

mermaid
flowchart LR
  subgraph SG_Indexing_Phase["Indexing Phase"]
    D[Document Collection] --> C[Text Chunking]
    C --> E1[Embedding Model]
    E1 --> V1[Vector Collection]
    V1 --> DB[("Vector Database")]
  end
  subgraph SG_Query_Phase["Query Phase"]
    Q[User Query] --> E2[Embedding Model]
    E2 --> V2[Query Vector]
    V2 --> S[Similarity Search]
    DB --> S
    S --> R[Ranked Results]
  end

Key Steps Explained

1. Text Vectorization (Embedding)

Embedding models map text to high-dimensional vector space, where semantically similar texts are closer in the vector space.

python
from sentence_transformers import SentenceTransformer

# Load a lightweight, general-purpose embedding model (384-dimensional vectors)
model = SentenceTransformer('all-MiniLM-L6-v2')

texts = [
    "Semantic search is based on vector similarity",
    "Vector search uses embedding representations",
    "The weather is nice today"
]

# encode() returns one embedding per input text
embeddings = model.encode(texts)
print(f"Vector dimensions: {embeddings.shape}")  # (3, 384)

2. Vector Similarity Calculation

The most commonly used similarity metric is cosine similarity, which calculates the cosine of the angle between two vectors.

python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (norm(vec1) * norm(vec2))

similarity_01 = cosine_similarity(embeddings[0], embeddings[1])
similarity_02 = cosine_similarity(embeddings[0], embeddings[2])

print(f"Semantic search vs Vector search: {similarity_01:.4f}")
print(f"Semantic search vs Weather: {similarity_02:.4f}")

3. Vector Indexing and Retrieval

For fast retrieval in large-scale data, specialized vector indexing algorithms (such as HNSW, IVF) are needed.
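
To make the idea concrete, here is a minimal numpy-only sketch of the IVF approach: vectors are partitioned into clusters, and a query probes only the few nearest clusters instead of scanning everything. This is illustrative only (the function names `build_ivf_index` and `ivf_search` are invented for this sketch); production systems use libraries such as FAISS or a vector database.

```python
import numpy as np

rng = np.random.default_rng(42)

def build_ivf_index(vectors, n_clusters=4, n_iters=5):
    """Toy IVF: partition vectors into clusters with a few k-means steps."""
    centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)]
    for _ in range(n_iters):
        dists = np.linalg.norm(vectors[:, None] - centroids[None, :], axis=2)
        assignments = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = vectors[assignments == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # final assignment against the settled centroids
    assignments = np.linalg.norm(vectors[:, None] - centroids[None, :], axis=2).argmin(axis=1)
    inverted_lists = {c: np.where(assignments == c)[0] for c in range(n_clusters)}
    return centroids, inverted_lists

def ivf_search(query, vectors, centroids, inverted_lists, top_k=3, nprobe=2):
    """Probe only the nprobe nearest clusters, then brute-force within them."""
    centroid_dists = np.linalg.norm(centroids - query, axis=1)
    probe = np.argsort(centroid_dists)[:nprobe]
    candidates = np.concatenate([inverted_lists[c] for c in probe])
    cand_dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    order = np.argsort(cand_dists)[:top_k]
    return candidates[order], cand_dists[order]

vectors = rng.normal(size=(200, 8)).astype(np.float32)
centroids, lists = build_ivf_index(vectors)
ids, dists = ivf_search(vectors[0], vectors, centroids, lists)
print(ids, dists)  # vectors[0] finds itself at distance 0
```

The trade-off is the usual one for approximate nearest neighbor search: probing fewer clusters is faster but may miss vectors assigned to unprobed clusters; `nprobe` tunes that recall/latency balance. HNSW makes a similar trade via graph traversal parameters.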

Embedding Model Selection Guide

Choosing the right embedding model is key to building a high-quality semantic search system.

| Model | Dimensions | Language Support | Features | Use Cases |
| --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 | 384 | Primarily English | Lightweight, fast | Prototyping, resource-constrained |
| all-mpnet-base-v2 | 768 | Primarily English | Balanced performance | General English search |
| BGE-base-en-v1.5 | 768 | English | High quality | English semantic search |
| BGE-M3 | 1024 | Multilingual | 100+ languages | Multilingual scenarios |
| text-embedding-3-small | 1536 | Multilingual | API-based | High quality requirements |
| text-embedding-3-large | 3072 | Multilingual | Highest precision | Precision-first scenarios |

Model Selection Decision Tree

mermaid
graph TD
  A[Choose Embedding Model] --> B{Primary Language?}
  B -->|English| C{Performance Requirements?}
  B -->|Chinese| D[BGE-base-zh-v1.5]
  B -->|Multilingual| E[BGE-M3]
  C -->|Lightweight & Fast| F[all-MiniLM-L6-v2]
  C -->|Balanced| G[all-mpnet-base-v2]
  C -->|High Precision| H{Budget?}
  H -->|Have Budget| I[text-embedding-3-large]
  H -->|Cost Control| J[text-embedding-3-small]

Local Models vs API Models

| Consideration | Local Models | API Models |
| --- | --- | --- |
| Latency | Depends on hardware | Network latency |
| Cost | One-time hardware investment | Pay per call |
| Privacy | Data stays local | Data sent to cloud |
| Maintenance | Self-managed | No maintenance needed |
| Quality | Depends on model choice | Usually higher |

Vector Similarity Calculation Explained

Cosine Similarity

Cosine similarity is the most commonly used metric in semantic search, focusing on vector direction rather than magnitude.

python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_product = norm(vec1) * norm(vec2)
    return dot_product / norm_product

def batch_cosine_similarity(query_vec, doc_vecs):
    query_norm = norm(query_vec)
    doc_norms = norm(doc_vecs, axis=1)
    dot_products = np.dot(doc_vecs, query_vec)
    return dot_products / (doc_norms * query_norm)

Euclidean Distance

Euclidean distance calculates the straight-line distance between two points in vector space. Smaller distance means more similar.

python
def euclidean_distance(vec1, vec2):
    return np.sqrt(np.sum((vec1 - vec2) ** 2))

def euclidean_to_similarity(distance, scale=1.0):
    return 1 / (1 + distance * scale)

Dot Product (Inner Product)

When vectors are normalized, dot product equals cosine similarity but computes faster.

python
def dot_product_similarity(vec1, vec2):
    return np.dot(vec1, vec2)

def normalize_vectors(vectors):
    norms = norm(vectors, axis=1, keepdims=True)
    return vectors / norms
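
This equivalence is easy to verify empirically: after L2 normalization, the dot product of two vectors equals their cosine similarity on the raw vectors (a small numpy check; the variable names are illustrative):

```python
import numpy as np
from numpy.linalg import norm

rng = np.random.default_rng(0)
vectors = rng.normal(size=(3, 384))

# cosine similarity on the raw vectors
cosine = np.dot(vectors[0], vectors[1]) / (norm(vectors[0]) * norm(vectors[1]))

# dot product after L2 normalization
normalized = vectors / norm(vectors, axis=1, keepdims=True)
dot = np.dot(normalized[0], normalized[1])

print(abs(cosine - dot))  # effectively zero, up to floating-point error
```

This is why many vector databases let you normalize once at indexing time and then use the cheaper inner-product metric at query time.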

Which Metric to Choose

| Metric | Pros | Cons | Use Cases |
| --- | --- | --- | --- |
| Cosine Similarity | Not affected by vector length | Slightly slower | Text semantic similarity |
| Euclidean Distance | Intuitive | Affected by vector length | Image feature matching |
| Dot Product | Fastest computation | Requires normalization | Large-scale retrieval |

Comparison of Three Search Methods

mermaid
graph LR
  subgraph SG_Full_Text_Search["Full-Text Search"]
    FT1["BM25/TF-IDF"] --> FT2[Keyword Weights]
    FT2 --> FT3[Exact Match Ranking]
  end
  subgraph SG_Semantic_Search["Semantic Search"]
    SS1[Embedding] --> SS2[Vector Representation]
    SS2 --> SS3[Similarity Ranking]
  end
  subgraph SG_Hybrid_Search["Hybrid Search"]
    HS1[Full-Text Search] --> HS3[Score Fusion]
    HS2[Semantic Search] --> HS3
    HS3 --> HS4[Combined Ranking]
  end

| Search Type | Advantages | Disadvantages | Best Scenarios |
| --- | --- | --- | --- |
| Full-Text Search | Exact matching, fast | Cannot understand semantics | Exact lookup, code search |
| Semantic Search | Understands intent, handles synonyms | May miss exact matches | Q&A systems, recommendations |
| Hybrid Search | Balances precision and semantics | Complex implementation | General search engines |

Hybrid Search Implementation

python
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi
import numpy as np

class HybridSearch:
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.documents = []
        self.embeddings = None
        self.bm25 = None
    
    def index(self, documents):
        self.documents = documents
        self.embeddings = self.model.encode(documents)
        
        tokenized = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)
    
    def search(self, query, top_k=5, semantic_weight=0.5):
        query_embedding = self.model.encode(query)
        semantic_scores = np.dot(self.embeddings, query_embedding)
        # min-max normalize each score set so the two scales are comparable
        semantic_scores = (semantic_scores - semantic_scores.min()) / (semantic_scores.max() - semantic_scores.min() + 1e-8)
        
        bm25_scores = np.array(self.bm25.get_scores(query.lower().split()))
        bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() + 1e-8)
        
        # weighted linear fusion; semantic_weight=1.0 is pure semantic search
        hybrid_scores = semantic_weight * semantic_scores + (1 - semantic_weight) * bm25_scores
        
        top_indices = np.argsort(hybrid_scores)[::-1][:top_k]
        
        return [
            {"document": self.documents[i], "score": hybrid_scores[i]}
            for i in top_indices
        ]

search_engine = HybridSearch()
search_engine.index([
    "Python is a popular programming language",
    "Machine learning requires large training datasets",
    "Semantic search understands query intent",
    "Vector databases store embedding vectors",
    "Natural language processing analyzes text"
])

results = search_engine.search("How to process text data", top_k=3)
for r in results:
    print(f"Score: {r['score']:.4f} | {r['document']}")
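
Min-max score fusion, as used above, is one option; Reciprocal Rank Fusion (RRF) is a widely used alternative that fuses ranks instead of raw scores, so no normalization is needed and the two retrievers' score scales never have to be reconciled. A minimal sketch (the constant k=60 is the value commonly used in practice):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    so documents ranked highly by multiple retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# doc "b" is ranked well by both retrievers, so it wins overall
semantic_ranking = ["a", "b", "c"]
bm25_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([semantic_ranking, bm25_ranking])
print(fused)
```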

Building a Semantic Search System in Practice

Complete Semantic Search System

python
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.utils import embedding_functions
import numpy as np
from typing import List, Dict

class SemanticSearchEngine:
    def __init__(self, model_name='all-MiniLM-L6-v2', persist_dir='./search_db'):
        self.model = SentenceTransformer(model_name)
        self.client = chromadb.PersistentClient(path=persist_dir)
        
        self.embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
            model_name=model_name
        )
        
        self.collection = self.client.get_or_create_collection(
            name="documents",
            embedding_function=self.embedding_fn,
            metadata={"hnsw:space": "cosine"}
        )
    
    def add_documents(self, documents: List[Dict], batch_size=100):
        for i in range(0, len(documents), batch_size):
            batch = documents[i:i+batch_size]
            
            self.collection.add(
                documents=[doc['content'] for doc in batch],
                metadatas=[doc.get('metadata', {}) for doc in batch],
                ids=[doc['id'] for doc in batch]
            )
        
        print(f"Indexed {len(documents)} documents")
    
    def search(self, query: str, top_k: int = 5, filter_metadata: Dict = None) -> List[Dict]:
        where_filter = filter_metadata if filter_metadata else None
        
        results = self.collection.query(
            query_texts=[query],
            n_results=top_k,
            where=where_filter
        )
        
        search_results = []
        for i in range(len(results['documents'][0])):
            search_results.append({
                'id': results['ids'][0][i],
                'content': results['documents'][0][i],
                'metadata': results['metadatas'][0][i] if results['metadatas'] else {},
                'score': 1 - results['distances'][0][i]  # cosine distance -> similarity
            })
        
        return search_results
    
    def batch_search(self, queries: List[str], top_k: int = 5) -> List[List[Dict]]:
        results = self.collection.query(
            query_texts=queries,
            n_results=top_k
        )
        
        all_results = []
        for q_idx in range(len(queries)):
            query_results = []
            for i in range(len(results['documents'][q_idx])):
                query_results.append({
                    'id': results['ids'][q_idx][i],
                    'content': results['documents'][q_idx][i],
                    'score': 1 - results['distances'][q_idx][i]
                })
            all_results.append(query_results)
        
        return all_results

search_engine = SemanticSearchEngine()

documents = [
    {"id": "1", "content": "Semantic search returns relevant results by understanding query intent", "metadata": {"category": "search"}},
    {"id": "2", "content": "Vector embeddings convert text into numerical representations", "metadata": {"category": "embedding"}},
    {"id": "3", "content": "HNSW algorithm enables efficient approximate nearest neighbor search", "metadata": {"category": "algorithm"}},
    {"id": "4", "content": "RAG systems combine retrieval and generation to improve answer quality", "metadata": {"category": "rag"}},
    {"id": "5", "content": "Hybrid search combines the advantages of keyword and semantic search", "metadata": {"category": "search"}}
]

search_engine.add_documents(documents)

results = search_engine.search("How to implement intelligent search", top_k=3)
print("\nSearch Results:")
for r in results:
    print(f"  [{r['score']:.4f}] {r['content']}")

Text Chunking Strategies

For long documents, chunking is required before indexing.

python
from typing import List

class TextChunker:
    def __init__(self, chunk_size=500, chunk_overlap=50):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
    
    def chunk_by_size(self, text: str) -> List[str]:
        chunks = []
        start = 0
        
        while start < len(text):
            end = start + self.chunk_size
            
            if end < len(text):
                # prefer to break at a sentence boundary, then a word boundary
                break_point = text.rfind('.', start, end)
                if break_point == -1:
                    break_point = text.rfind(' ', start, end)
                if break_point > start:
                    end = break_point + 1
            
            chunk = text[start:end].strip()
            if chunk:
                chunks.append(chunk)
            if end >= len(text):
                break
            # step back by the overlap, but always advance to avoid an infinite loop
            start = max(end - self.chunk_overlap, start + 1)
        
        return chunks
    
    def chunk_by_paragraph(self, text: str) -> List[str]:
        paragraphs = text.split('\n\n')
        chunks = []
        current_chunk = ""
        
        for para in paragraphs:
            if len(current_chunk) + len(para) <= self.chunk_size:
                current_chunk += para + "\n\n"
            else:
                if current_chunk:
                    chunks.append(current_chunk.strip())
                current_chunk = para + "\n\n"
        
        if current_chunk:
            chunks.append(current_chunk.strip())
        
        return chunks

chunker = TextChunker(chunk_size=300, chunk_overlap=30)

long_text = """
Semantic search is a core technology in modern information retrieval. It returns more relevant results by understanding the semantic meaning of queries, not just matching keywords.

Traditional keyword search relies on exact vocabulary matching. If a user searches for "automobile", the system will only return documents containing the word "automobile", not documents containing "car" or "vehicle".

Semantic search solves this problem through vector embedding technology. Embedding models convert text into high-dimensional vectors, where semantically similar texts are closer in vector space. This way, even if queries and documents use different vocabulary, they can be retrieved as long as they are semantically similar.
"""

chunks = chunker.chunk_by_size(long_text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk[:50]}...")

Search Optimization Techniques

1. Query Optimization

python
class QueryOptimizer:
    def __init__(self, model):
        self.model = model
    
    def expand_query(self, query: str, expansions: List[str]) -> str:
        return f"{query} {' '.join(expansions)}"
    
    def rewrite_query(self, query: str) -> str:
        rewrites = {
            "how do i": "how to",
            "whats": "what is",
            "cant": "cannot"
        }
        for old, new in rewrites.items():
            query = query.replace(old, new)
        return query
    
    def multi_query_search(self, queries: List[str], search_fn, top_k=5):
        all_results = {}
        
        for query in queries:
            results = search_fn(query, top_k=top_k)
            for r in results:
                doc_id = r['id']
                if doc_id not in all_results:
                    all_results[doc_id] = r
                    all_results[doc_id]['query_count'] = 1
                else:
                    all_results[doc_id]['score'] = max(all_results[doc_id]['score'], r['score'])
                    all_results[doc_id]['query_count'] += 1
        
        sorted_results = sorted(
            all_results.values(),
            key=lambda x: (x['query_count'], x['score']),
            reverse=True
        )
        
        return sorted_results[:top_k]

2. Result Reranking

python
class ResultReranker:
    def __init__(self, cross_encoder_model='cross-encoder/ms-marco-MiniLM-L-6-v2'):
        from sentence_transformers import CrossEncoder
        self.cross_encoder = CrossEncoder(cross_encoder_model)
    
    def rerank(self, query: str, results: List[Dict], top_k: int = 5) -> List[Dict]:
        pairs = [[query, r['content']] for r in results]
        
        scores = self.cross_encoder.predict(pairs)
        
        for i, score in enumerate(scores):
            results[i]['rerank_score'] = float(score)
        
        reranked = sorted(results, key=lambda x: x['rerank_score'], reverse=True)
        
        return reranked[:top_k]

3. Caching Strategy

python
import hashlib

class SearchCache:
    def __init__(self, max_size=1000):
        self.cache = {}
        self.max_size = max_size
    
    def _hash_query(self, query: str) -> str:
        return hashlib.md5(query.encode()).hexdigest()
    
    def get(self, query: str):
        key = self._hash_query(query)
        return self.cache.get(key)
    
    def set(self, query: str, results):
        if len(self.cache) >= self.max_size:
            # evict the oldest inserted entry (FIFO eviction, not true LRU)
            oldest_key = next(iter(self.cache))
            del self.cache[oldest_key]
        
        key = self._hash_query(query)
        self.cache[key] = results
    
    def cached_search(self, query: str, search_fn):
        cached = self.get(query)
        if cached is not None:
            return cached
        
        results = search_fn(query)
        self.set(query, results)
        return results
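
For simple cases, Python's built-in functools.lru_cache gives the same effect with true least-recently-used eviction (the SearchCache above evicts the oldest-inserted entry instead). A sketch with a placeholder search function standing in for the real, expensive call:

```python
from functools import lru_cache

call_count = 0  # counts how often the underlying search actually runs

@lru_cache(maxsize=1000)
def cached_search(query: str):
    """Placeholder for a real (expensive) search call."""
    global call_count
    call_count += 1
    return (f"results for: {query}",)  # tuple: cached values should be immutable

cached_search("semantic search")
cached_search("semantic search")  # cache hit: the body does not run again
print(call_count)  # 1
```

The main caveat is that lru_cache keys on the function arguments, so it only works when the search function's inputs are hashable and fully determine the result; if the index is updated, call `cached_search.cache_clear()` to invalidate stale entries.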

Useful Tools

When building semantic search systems, these tools can improve development efficiency:

💡 When developing AI search applications, you often need to handle various data format conversions. Visit QubitTool for more developer tools.

FAQ

Are semantic search and vector search the same thing?

Vector search is the technical implementation of semantic search. Semantic search is the goal (retrieving based on semantic understanding), while vector search is the means (implemented through vector similarity). Semantic search is usually based on vector search but may also combine other technologies like knowledge graphs.

How to evaluate semantic search effectiveness?

Common evaluation metrics include: 1) Recall@K: proportion of relevant documents retrieved; 2) Precision@K: proportion of relevant documents in returned results; 3) MRR (Mean Reciprocal Rank): reciprocal of the first relevant result's rank; 4) NDCG: comprehensive metric considering ranking positions. It's recommended to build annotated datasets for quantitative evaluation.
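
A quantitative evaluation needs only annotated pairs of queries and relevant document ids; the metrics themselves are a few lines each. A minimal sketch (function names are illustrative):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant docs that appear in the top-k retrieved."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    return len(set(retrieved[:k]) & set(relevant)) / k

def mean_reciprocal_rank(all_retrieved, all_relevant):
    """Average of 1 / rank of the first relevant result, over all queries."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)

retrieved = ["d3", "d1", "d7"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, 3))             # 1 of 2 relevant found -> 0.5
print(precision_at_k(retrieved, relevant, 3))          # 1 of top 3 relevant -> 0.333...
print(mean_reciprocal_rank([retrieved], [relevant]))   # first relevant at rank 2 -> 0.5
```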

Is semantic search suitable for all scenarios?

No. Keyword search may be more appropriate for: 1) Exact lookup (like order numbers, product codes); 2) Code search (requires exact syntax matching); 3) Technical terminology retrieval (terms have fixed spellings). Best practice is to use hybrid search, combining the advantages of both.

How to handle cold start problems in semantic search?

Cold start refers to situations where new documents or new domains lack training data. Solutions: 1) Use pre-trained general embedding models; 2) Fine-tune models on domain data; 3) Combine keyword search as fallback; 4) Use user feedback for continuous optimization.

How to optimize semantic search latency?

Optimization strategies include: 1) Use lightweight embedding models (like all-MiniLM-L6-v2); 2) Vector database index optimization (adjust HNSW parameters); 3) Query result caching; 4) Batch processing requests; 5) Use GPU acceleration for embedding computation; 6) Pre-compute results for popular queries.

Summary

Semantic search is the core technology for building intelligent information retrieval systems. By converting text into vector representations, we can achieve a search experience that truly understands user intent.

Key Takeaways

✅ Semantic search is based on vector similarity, understanding synonyms and query intent
✅ Embedding model selection needs to balance language, performance, and cost
✅ Hybrid search combines keyword and semantic search, suitable for general scenarios
✅ Text chunking, query optimization, and result reranking are key to improving quality
✅ Vector databases are essential components for large-scale semantic search

💡 Start Practicing: Use QubitTool developer tools to accelerate your AI search application development!