Semantic search is fundamentally transforming how we access information: from web search and e-commerce recommendations to enterprise knowledge-base Q&A and context retrieval in RAG systems, it has permeated nearly every corner of our digital lives. This guide explains the core principles of semantic search and walks you step by step through building a high-quality semantic search system.
TL;DR
- Semantic search is based on semantic understanding rather than keyword matching, capable of understanding query intent and contextual meaning
- Core technology: Embedding models convert text into vectors, achieving semantic matching through vector similarity
- Embedding model selection: all-MiniLM-L6-v2 for general English scenarios, BGE-M3 for multilingual content, OpenAI text-embedding-3 for the highest precision
- Search strategies: Pure semantic search suits Q&A scenarios, hybrid search (semantic + keyword) suits general search
- Performance optimization: Vector database indexing, query caching, and chunking strategies are key
What Is Semantic Search
Semantic search is a search technology based on natural language understanding that not only matches keywords but also understands the true intent and contextual meaning of queries.
Semantic Search vs Keyword Search
| Comparison | Keyword Search | Semantic Search |
|---|---|---|
| Matching Method | Exact vocabulary matching | Semantic similarity matching |
| Synonym Handling | Requires manual configuration | Automatic understanding |
| Query Understanding | Literal meaning | Deep intent |
| Long-tail Queries | Poor performance | Good performance |
| Implementation Complexity | Low | Medium |
| Computational Resources | Low | Higher |
What Problems Can Semantic Search Solve
Problem 1: Synonyms and Near-synonyms
When a user searches for "automobile", keyword search cannot find documents containing "car" or "vehicle". Semantic search understands the semantic relationship between these words.
Problem 2: Query Intent Understanding
When a user searches for "Python handle Excel", the real intent might be looking for pandas or openpyxl tutorials, not just documents containing these keywords.
Problem 3: Long-tail Queries
When a user searches "why is my program running slow", this natural-language query rarely yields good results with keyword search.
How Semantic Search Works
The core of semantic search is converting text into vector representations, then measuring semantic relevance through vector similarity.
Core Process
1. Offline indexing: split documents into chunks, encode each chunk into an embedding vector, and store the vectors in a vector index.
2. Online retrieval: encode the user's query with the same model, find the nearest document vectors by similarity, and return the ranked results.
Key Steps Explained
1. Text Vectorization (Embedding)
Embedding models map text to high-dimensional vector space, where semantically similar texts are closer in the vector space.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

texts = [
    "Semantic search is based on vector similarity",
    "Vector search uses embedding representations",
    "The weather is nice today"
]

embeddings = model.encode(texts)
print(f"Vector dimensions: {embeddings.shape}")  # (3, 384)
```
2. Vector Similarity Calculation
The most commonly used similarity metric is cosine similarity, which calculates the cosine of the angle between two vectors.
```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (norm(vec1) * norm(vec2))

similarity_01 = cosine_similarity(embeddings[0], embeddings[1])
similarity_02 = cosine_similarity(embeddings[0], embeddings[2])
print(f"Semantic search vs Vector search: {similarity_01:.4f}")
print(f"Semantic search vs Weather: {similarity_02:.4f}")
```
3. Vector Indexing and Retrieval
For fast retrieval in large-scale data, specialized vector indexing algorithms (such as HNSW, IVF) are needed.
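Before reaching for an approximate index, it helps to see the exact search that these algorithms approximate. Below is a minimal NumPy sketch of brute-force top-k retrieval, the O(n) correctness baseline that HNSW and IVF trade a little accuracy to beat; the data here is random and purely illustrative.

```python
import numpy as np

def top_k_exact(query_vec, doc_vecs, k=3):
    """Exact nearest-neighbor search by brute force.

    HNSW and IVF return approximately this result in sub-linear time;
    this linear scan is the baseline their recall is measured against.
    """
    # Normalize so the dot product equals cosine similarity
    doc_norm = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    q_norm = query_vec / np.linalg.norm(query_vec)
    scores = doc_norm @ q_norm
    # argpartition finds the k largest in O(n); then sort just those k
    top = np.argpartition(scores, -k)[-k:]
    return top[np.argsort(scores[top])[::-1]], scores

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 64))
query = rng.normal(size=64)
indices, scores = top_k_exact(query, docs, k=3)
print(indices)
```

At a few thousand documents this scan is fast enough; approximate indexes pay off once collections reach hundreds of thousands of vectors.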
Embedding Model Selection Guide
Choosing the right embedding model is key to building a high-quality semantic search system.
Popular Embedding Models Comparison
| Model | Dimensions | Language Support | Features | Use Cases |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Primarily English | Lightweight, fast | Prototyping, resource-constrained |
| all-mpnet-base-v2 | 768 | Primarily English | Balanced performance | General English search |
| BGE-base-en-v1.5 | 768 | English | High quality | English semantic search |
| BGE-M3 | 1024 | Multilingual | 100+ languages | Multilingual scenarios |
| text-embedding-3-small | 1536 | Multilingual | API-based | High quality requirements |
| text-embedding-3-large | 3072 | Multilingual | Highest precision | Precision-first scenarios |
Model Selection Decision Tree
- English content, limited resources or fast prototyping: all-MiniLM-L6-v2
- English content, quality first: all-mpnet-base-v2 or BGE-base-en-v1.5
- Multilingual content: BGE-M3
- Highest precision, API acceptable: text-embedding-3-small or text-embedding-3-large
Local Models vs API Models
| Consideration | Local Models | API Models |
|---|---|---|
| Latency | Depends on hardware | Network latency |
| Cost | One-time hardware investment | Pay per call |
| Privacy | Data stays local | Data sent to cloud |
| Maintenance | Self-managed | No maintenance needed |
| Quality | Depends on model choice | Usually higher |
Vector Similarity Calculation Explained
Cosine Similarity
Cosine similarity is the most commonly used metric in semantic search, focusing on vector direction rather than magnitude.
```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_product = norm(vec1) * norm(vec2)
    return dot_product / norm_product

def batch_cosine_similarity(query_vec, doc_vecs):
    # Compare one query vector against a matrix of document vectors at once
    query_norm = norm(query_vec)
    doc_norms = norm(doc_vecs, axis=1)
    dot_products = np.dot(doc_vecs, query_vec)
    return dot_products / (doc_norms * query_norm)
```
Euclidean Distance
Euclidean distance calculates the straight-line distance between two points in vector space. Smaller distance means more similar.
```python
def euclidean_distance(vec1, vec2):
    return np.sqrt(np.sum((vec1 - vec2) ** 2))

def euclidean_to_similarity(distance, scale=1.0):
    # Map a distance in [0, inf) to a similarity score in (0, 1]
    return 1 / (1 + distance * scale)
```
Dot Product (Inner Product)
When vectors are normalized to unit length, the dot product equals cosine similarity but is faster to compute.
```python
def dot_product_similarity(vec1, vec2):
    return np.dot(vec1, vec2)

def normalize_vectors(vectors):
    # Scale each row to unit length so dot product equals cosine similarity
    norms = norm(vectors, axis=1, keepdims=True)
    return vectors / norms
```
Which Metric to Choose
| Metric | Pros | Cons | Use Cases |
|---|---|---|---|
| Cosine Similarity | Not affected by vector length | Slightly slower | Text semantic similarity |
| Euclidean Distance | Intuitive | Affected by vector length | Image feature matching |
| Dot Product | Fastest computation | Requires normalization | Large-scale retrieval |
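The dot-product row of the table can be verified numerically: after normalizing two random vectors to unit length, their dot product and cosine similarity agree up to floating-point error. A small self-contained check:

```python
import numpy as np
from numpy.linalg import norm

rng = np.random.default_rng(42)
a, b = rng.normal(size=8), rng.normal(size=8)

cosine = np.dot(a, b) / (norm(a) * norm(b))
dot_normalized = np.dot(a / norm(a), b / norm(b))

# The two values are identical up to floating-point error
print(np.isclose(cosine, dot_normalized))  # True
```

This is why large-scale systems normalize vectors once at indexing time and then use the cheaper dot product at query time.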
Semantic Search vs Full-Text Search vs Hybrid Search
Comparison of Three Search Methods
| Search Type | Advantages | Disadvantages | Best Scenarios |
|---|---|---|---|
| Full-Text Search | Exact matching, fast | Cannot understand semantics | Exact lookup, code search |
| Semantic Search | Understands intent, handles synonyms | May miss exact matches | Q&A systems, recommendations |
| Hybrid Search | Balances precision and semantics | Complex implementation | General search engines |
Hybrid Search Implementation
```python
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi
import numpy as np

class HybridSearch:
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.documents = []
        self.embeddings = None
        self.bm25 = None

    def index(self, documents):
        self.documents = documents
        self.embeddings = self.model.encode(documents)
        tokenized = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

    def search(self, query, top_k=5, semantic_weight=0.5):
        # Semantic scores: dot product between query and document embeddings
        query_embedding = self.model.encode(query)
        semantic_scores = np.dot(self.embeddings, query_embedding)
        # Min-max normalize both score sets so they are on a comparable scale
        semantic_scores = (semantic_scores - semantic_scores.min()) / (semantic_scores.max() - semantic_scores.min() + 1e-8)
        bm25_scores = np.array(self.bm25.get_scores(query.lower().split()))
        bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() + 1e-8)
        # Weighted combination of semantic and keyword relevance
        hybrid_scores = semantic_weight * semantic_scores + (1 - semantic_weight) * bm25_scores
        top_indices = np.argsort(hybrid_scores)[::-1][:top_k]
        return [
            {"document": self.documents[i], "score": float(hybrid_scores[i])}
            for i in top_indices
        ]

search_engine = HybridSearch()
search_engine.index([
    "Python is a popular programming language",
    "Machine learning requires large training datasets",
    "Semantic search understands query intent",
    "Vector databases store embedding vectors",
    "Natural language processing analyzes text"
])
results = search_engine.search("How to process text data", top_k=3)
for r in results:
    print(f"Score: {r['score']:.4f} | {r['document']}")
```
Building a Semantic Search System in Practice
Complete Semantic Search System
```python
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.utils import embedding_functions
from typing import List, Dict

class SemanticSearchEngine:
    def __init__(self, model_name='all-MiniLM-L6-v2', persist_dir='./search_db'):
        self.model = SentenceTransformer(model_name)
        self.client = chromadb.PersistentClient(path=persist_dir)
        self.embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
            model_name=model_name
        )
        self.collection = self.client.get_or_create_collection(
            name="documents",
            embedding_function=self.embedding_fn,
            metadata={"hnsw:space": "cosine"}  # use cosine distance in the HNSW index
        )

    def add_documents(self, documents: List[Dict], batch_size=100):
        # Index documents in batches to limit memory usage
        for i in range(0, len(documents), batch_size):
            batch = documents[i:i+batch_size]
            self.collection.add(
                documents=[doc['content'] for doc in batch],
                metadatas=[doc.get('metadata', {}) for doc in batch],
                ids=[doc['id'] for doc in batch]
            )
        print(f"Indexed {len(documents)} documents")

    def search(self, query: str, top_k: int = 5, filter_metadata: Dict = None) -> List[Dict]:
        results = self.collection.query(
            query_texts=[query],
            n_results=top_k,
            where=filter_metadata  # optional metadata filter, e.g. {"category": "search"}
        )
        search_results = []
        for i in range(len(results['documents'][0])):
            search_results.append({
                'id': results['ids'][0][i],
                'content': results['documents'][0][i],
                'metadata': results['metadatas'][0][i] if results['metadatas'] else {},
                'score': 1 - results['distances'][0][i]  # cosine distance -> similarity
            })
        return search_results

    def batch_search(self, queries: List[str], top_k: int = 5) -> List[List[Dict]]:
        results = self.collection.query(
            query_texts=queries,
            n_results=top_k
        )
        all_results = []
        for q_idx in range(len(queries)):
            query_results = []
            for i in range(len(results['documents'][q_idx])):
                query_results.append({
                    'id': results['ids'][q_idx][i],
                    'content': results['documents'][q_idx][i],
                    'score': 1 - results['distances'][q_idx][i]
                })
            all_results.append(query_results)
        return all_results

search_engine = SemanticSearchEngine()

documents = [
    {"id": "1", "content": "Semantic search returns relevant results by understanding query intent", "metadata": {"category": "search"}},
    {"id": "2", "content": "Vector embeddings convert text into numerical representations", "metadata": {"category": "embedding"}},
    {"id": "3", "content": "HNSW algorithm enables efficient approximate nearest neighbor search", "metadata": {"category": "algorithm"}},
    {"id": "4", "content": "RAG systems combine retrieval and generation to improve answer quality", "metadata": {"category": "rag"}},
    {"id": "5", "content": "Hybrid search combines the advantages of keyword and semantic search", "metadata": {"category": "search"}}
]
search_engine.add_documents(documents)

results = search_engine.search("How to implement intelligent search", top_k=3)
print("\nSearch Results:")
for r in results:
    print(f"  [{r['score']:.4f}] {r['content']}")
```
Text Chunking Strategies
For long documents, chunking is required before indexing.
```python
from typing import List

class TextChunker:
    def __init__(self, chunk_size=500, chunk_overlap=50):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def chunk_by_size(self, text: str) -> List[str]:
        chunks = []
        start = 0
        while start < len(text):
            end = start + self.chunk_size
            if end < len(text):
                # Prefer breaking at a sentence boundary, then at a word boundary
                break_point = text.rfind('.', start, end)
                if break_point == -1:
                    break_point = text.rfind(' ', start, end)
                if break_point > start:
                    end = break_point + 1
            chunks.append(text[start:end].strip())
            if end >= len(text):
                break
            # Step back by the overlap, but always advance to avoid an infinite loop
            start = max(end - self.chunk_overlap, start + 1)
        return chunks

    def chunk_by_paragraph(self, text: str) -> List[str]:
        paragraphs = text.split('\n\n')
        chunks = []
        current_chunk = ""
        for para in paragraphs:
            if len(current_chunk) + len(para) <= self.chunk_size:
                current_chunk += para + "\n\n"
            else:
                if current_chunk:
                    chunks.append(current_chunk.strip())
                current_chunk = para + "\n\n"
        if current_chunk:
            chunks.append(current_chunk.strip())
        return chunks

chunker = TextChunker(chunk_size=300, chunk_overlap=30)

long_text = """
Semantic search is a core technology in modern information retrieval. It returns more relevant results by understanding the semantic meaning of queries, not just matching keywords.

Traditional keyword search relies on exact vocabulary matching. If a user searches for "automobile", the system will only return documents containing the word "automobile", not documents containing "car" or "vehicle".

Semantic search solves this problem through vector embedding technology. Embedding models convert text into high-dimensional vectors, where semantically similar texts are closer in vector space. This way, even if queries and documents use different vocabulary, they can be retrieved as long as they are semantically similar.
"""

chunks = chunker.chunk_by_size(long_text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk[:50]}...")
```
Tips for Optimizing Semantic Search
1. Query Optimization
```python
from typing import Dict, List

class QueryOptimizer:
    def __init__(self, model):
        self.model = model

    def expand_query(self, query: str, expansions: List[str]) -> str:
        # Append synonyms or related terms to broaden the match
        return f"{query} {' '.join(expansions)}"

    def rewrite_query(self, query: str) -> str:
        # Simple rule-based rewrites; assumes a lowercased query
        rewrites = {
            "how do i": "how to",
            "whats": "what is",
            "cant": "cannot"
        }
        for old, new in rewrites.items():
            query = query.replace(old, new)
        return query

    def multi_query_search(self, queries: List[str], search_fn, top_k=5):
        # Run several query variants and merge results: documents matched
        # by more variants rank first, ties broken by best score
        all_results: Dict[str, Dict] = {}
        for query in queries:
            results = search_fn(query, top_k=top_k)
            for r in results:
                doc_id = r['id']
                if doc_id not in all_results:
                    all_results[doc_id] = r
                    all_results[doc_id]['query_count'] = 1
                else:
                    all_results[doc_id]['score'] = max(all_results[doc_id]['score'], r['score'])
                    all_results[doc_id]['query_count'] += 1
        sorted_results = sorted(
            all_results.values(),
            key=lambda x: (x['query_count'], x['score']),
            reverse=True
        )
        return sorted_results[:top_k]
```
2. Result Reranking
```python
from typing import Dict, List

class ResultReranker:
    def __init__(self, cross_encoder_model='cross-encoder/ms-marco-MiniLM-L-6-v2'):
        from sentence_transformers import CrossEncoder
        self.cross_encoder = CrossEncoder(cross_encoder_model)

    def rerank(self, query: str, results: List[Dict], top_k: int = 5) -> List[Dict]:
        # A cross-encoder scores each (query, document) pair jointly:
        # slower than bi-encoder retrieval, but more accurate for the final ranking
        pairs = [[query, r['content']] for r in results]
        scores = self.cross_encoder.predict(pairs)
        for i, score in enumerate(scores):
            results[i]['rerank_score'] = float(score)
        reranked = sorted(results, key=lambda x: x['rerank_score'], reverse=True)
        return reranked[:top_k]
```
3. Caching Strategy
```python
import hashlib

class SearchCache:
    def __init__(self, max_size=1000):
        self.cache = {}
        self.max_size = max_size

    def _hash_query(self, query: str) -> str:
        return hashlib.md5(query.encode()).hexdigest()

    def get(self, query: str):
        return self.cache.get(self._hash_query(query))

    def set(self, query: str, results):
        if len(self.cache) >= self.max_size:
            # Evict the oldest entry (dicts preserve insertion order)
            oldest_key = next(iter(self.cache))
            del self.cache[oldest_key]
        self.cache[self._hash_query(query)] = results

    def cached_search(self, query: str, search_fn):
        cached = self.get(query)
        if cached is not None:  # empty result lists are valid cache hits
            return cached
        results = search_fn(query)
        self.set(query, results)
        return results
```
Useful Tools
When building semantic search systems, these tools can improve development efficiency:
- JSON Formatter - Process JSON data returned by search APIs
- Text Diff Tool - Compare search result differences between queries
- Random Data Generator - Generate test document datasets
💡 When developing AI search applications, you often need to handle various data format conversions. Visit QubitTool for more developer tools.
FAQ
Are semantic search and vector search the same thing?
Vector search is the technical implementation of semantic search. Semantic search is the goal (retrieving based on semantic understanding), while vector search is the means (implemented through vector similarity). Semantic search is usually based on vector search but may also combine other technologies like knowledge graphs.
How to evaluate semantic search effectiveness?
Common evaluation metrics include: 1) Recall@K: proportion of relevant documents retrieved; 2) Precision@K: proportion of relevant documents in returned results; 3) MRR (Mean Reciprocal Rank): reciprocal of the first relevant result's rank; 4) NDCG: comprehensive metric considering ranking positions. It's recommended to build annotated datasets for quantitative evaluation.
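Recall@K and MRR need only a few lines of plain Python. The sketch below uses hypothetical document IDs and hand-annotated relevance sets, the kind of annotated dataset the answer above recommends building:

```python
def recall_at_k(relevant_ids, retrieved_ids, k):
    """Fraction of relevant documents that appear in the top-k results."""
    top_k = set(retrieved_ids[:k])
    hits = sum(1 for doc_id in relevant_ids if doc_id in top_k)
    return hits / len(relevant_ids)

def mean_reciprocal_rank(all_relevant, all_retrieved):
    """Average of 1/rank of the first relevant result across queries."""
    total = 0.0
    for relevant, retrieved in zip(all_relevant, all_retrieved):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(all_relevant)

# Two hypothetical annotated queries: relevance judgments and ranked results
relevant = [{"d1", "d3"}, {"d2"}]
retrieved = [["d3", "d5", "d1"], ["d4", "d2", "d1"]]
print(recall_at_k(relevant[0], retrieved[0], k=2))  # 0.5
print(mean_reciprocal_rank(relevant, retrieved))    # (1/1 + 1/2) / 2 = 0.75
```

Tracking these numbers before and after a change (a new embedding model, different chunk sizes) turns tuning into a measurable process rather than guesswork.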
Is semantic search suitable for all scenarios?
No. Keyword search may be more appropriate for: 1) Exact lookup (like order numbers, product codes); 2) Code search (requires exact syntax matching); 3) Technical terminology retrieval (terms have fixed spellings). Best practice is to use hybrid search, combining the advantages of both.
How to handle cold start problems in semantic search?
Cold start refers to situations where new documents or new domains lack training data. Solutions: 1) Use pre-trained general embedding models; 2) Fine-tune models on domain data; 3) Combine keyword search as fallback; 4) Use user feedback for continuous optimization.
How to optimize semantic search latency?
Optimization strategies include: 1) Use lightweight embedding models (like all-MiniLM-L6-v2); 2) Vector database index optimization (adjust HNSW parameters); 3) Query result caching; 4) Batch processing requests; 5) Use GPU acceleration for embedding computation; 6) Pre-compute results for popular queries.
Summary
Semantic search is the core technology for building intelligent information retrieval systems. By converting text into vector representations, we can achieve a search experience that truly understands user intent.
Key Takeaways
✅ Semantic search is based on vector similarity, understanding synonyms and query intent
✅ Embedding model selection needs to balance language, performance, and cost
✅ Hybrid search combines keyword and semantic search, suitable for general scenarios
✅ Text chunking, query optimization, and result reranking are key to improving quality
✅ Vector databases are essential components for large-scale semantic search
Further Reading
- Vector Embeddings Complete Guide - Deep understanding of Embedding technology
- Vector Database Complete Guide - Choosing the right vector storage solution
- RAG Retrieval-Augmented Generation Guide - Building retrieval-based AI applications
💡 Start Practicing: Use QubitTool developer tools to accelerate your AI search application development!