Vector embeddings are a cornerstone technology of modern AI applications. From semantic understanding in search engines to personalized matching in recommendation systems, and from knowledge retrieval in RAG systems to cross-domain understanding in multimodal AI, embedding technology is everywhere. Mastering it is a prerequisite for building serious AI applications.

TL;DR

  • Vector embeddings convert text, images, and other data into dense numerical vectors that capture semantic information
  • Similarity calculation measures semantic relationships between vectors using cosine similarity or Euclidean distance
  • Popular models include OpenAI text-embedding-3, Sentence-Transformers, and BGE
  • Core applications: semantic search, recommendation systems, clustering analysis, RAG knowledge retrieval
  • Dimension selection requires balancing precision and performance, typically 256-1536 dimensions

What Are Vector Embeddings

Vector embeddings are a technique for mapping high-dimensional discrete data (such as text and images) to a low-dimensional continuous vector space. In this vector space, semantically similar content is mapped to nearby positions.

mermaid
graph LR
    A[Raw Data] --> B[Embedding Model]
    B --> C[Vector Representation]
    subgraph SG_Input["Input"]
        A1["Text: Cats are cute"]
        A2["Text: Kittens are adorable"]
        A3["Text: Nice weather today"]
    end
    subgraph SG_Vector_Space["Vector Space"]
        C1["0.23, 0.87, ..."]
        C2["0.25, 0.85, ..."]
        C3["0.91, 0.12, ..."]
    end
    A1 --> B
    A2 --> B
    A3 --> B
    B --> C1
    B --> C2
    B --> C3

Why Vector Embeddings Matter

Traditional text processing methods (like keyword matching and TF-IDF) cannot understand semantics. For example, "car" and "automobile" are completely different words in traditional methods, but vector embeddings can capture their semantic similarity.

| Method | Pros | Cons |
| --- | --- | --- |
| Keyword Matching | Simple and fast | Cannot understand synonyms |
| TF-IDF | Considers term frequency | Ignores word order and semantics |
| One-Hot Encoding | Easy to implement | Curse of dimensionality, no semantics |
| Vector Embeddings | Captures semantic relationships | Requires computational resources |
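
The gap between the last two rows can be made concrete with a toy example: one-hot vectors for two synonyms are orthogonal (similarity 0), while dense embeddings place them close together. The dense vectors below are hand-crafted purely for illustration, not output from a real model.

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot: every word gets its own axis, so synonyms share nothing.
vocab = ["car", "automobile", "banana"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
print(cos(one_hot["car"], one_hot["automobile"]))  # 0.0

# Toy dense embeddings: synonyms point in nearly the same direction.
dense = {
    "car":        np.array([0.90, 0.10, 0.00]),
    "automobile": np.array([0.85, 0.15, 0.05]),
    "banana":     np.array([0.00, 0.10, 0.95]),
}
print(round(cos(dense["car"], dense["automobile"]), 3))  # 0.996
```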

Evolution of Embedding Technology

Word2Vec: Pioneer of Word Embeddings

In 2013, Google's Word2Vec pioneered the era of word embeddings. It's based on a simple but profound hypothesis: semantically similar words tend to appear in similar contexts.

python
from gensim.models import Word2Vec

sentences = [
    ["machine", "learning", "is", "a", "branch", "of", "AI"],
    ["deep", "learning", "is", "a", "subset", "of", "machine", "learning"],
    ["neural", "networks", "are", "the", "foundation", "of", "deep", "learning"]
]

# Train a small Word2Vec model (100-dim vectors, context window of 5)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

# Find the words closest to "learning" in the learned vector space
similar_words = model.wv.most_similar("learning", topn=3)
print(similar_words)

# Look up the raw vector for a single word
word_vector = model.wv["machine"]
print(f"Vector dimension: {len(word_vector)}")

Two training modes of Word2Vec:

  • CBOW (Continuous Bag of Words): Predicts the center word from context
  • Skip-gram: Predicts context from the center word

mermaid
graph TB
    subgraph SG_CBOW["CBOW"]
        C1[Context Word 1] --> P1[Predict]
        C2[Context Word 2] --> P1
        C3[Context Word 3] --> P1
        P1 --> T1[Target Word]
    end
    subgraph SG_Skip_gram["Skip-gram"]
        T2[Center Word] --> P2[Predict]
        P2 --> O1[Context Word 1]
        P2 --> O2[Context Word 2]
        P2 --> O3[Context Word 3]
    end
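
Skip-gram's training data is easy to picture: each center word is paired with every word in its window. A minimal sketch of that pair-generation step (data preparation only, not the model itself):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for Skip-gram."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["deep", "learning", "uses", "neural", "networks"]
pairs = skipgram_pairs(tokens)
print(len(pairs))    # 14
print(pairs[:2])     # the first pairs, both centered on "deep"
```

CBOW simply flips the direction: the same windows are used, but the context words jointly predict the center word.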

From Word Embeddings to Sentence Embeddings

The limitation of Word2Vec is that it can only generate word-level embeddings. How do we represent a sentence or paragraph?

Early approach: Simple averaging of word vectors

python
import numpy as np

def sentence_embedding_average(sentence, word2vec_model):
    words = sentence.split()
    vectors = [word2vec_model.wv[w] for w in words if w in word2vec_model.wv]
    if vectors:
        return np.mean(vectors, axis=0)
    return np.zeros(word2vec_model.vector_size)

Modern approach: Transformer-based sentence embedding models

  • BERT: Bidirectional Transformer encoder
  • Sentence-BERT: Optimized for sentence similarity
  • Sentence-Transformers: Easy-to-use sentence embedding library

OpenAI Embedding Models

OpenAI provides powerful text embedding APIs. The latest text-embedding-3 series supports flexible dimension selection.

python
from openai import OpenAI

client = OpenAI()

def get_embedding(text, model="text-embedding-3-small", dimensions=None):
    params = {"input": text, "model": model}
    if dimensions:
        params["dimensions"] = dimensions
    
    response = client.embeddings.create(**params)
    return response.data[0].embedding

text = "Vector embeddings are the core technology of AI applications"
embedding = get_embedding(text, dimensions=256)
print(f"Embedding dimension: {len(embedding)}")

| Model | Dimensions | Performance | Price | Use Case |
| --- | --- | --- | --- | --- |
| text-embedding-3-small | 512-1536 | Good | $0.02/1M tokens | Cost-sensitive applications |
| text-embedding-3-large | 256-3072 | Excellent | $0.13/1M tokens | High precision requirements |
| text-embedding-ada-002 | 1536 | Good | $0.10/1M tokens | Legacy system compatibility |
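
The text-embedding-3 models also allow shortening after the fact: per OpenAI's embeddings guide, you can truncate a stored vector and re-normalize it to unit length. A sketch in plain numpy (no API call involved):

```python
import numpy as np

def shorten_embedding(embedding, target_dim):
    """Truncate an embedding, then renormalize to unit length."""
    truncated = np.asarray(embedding[:target_dim], dtype=float)
    return truncated / np.linalg.norm(truncated)

# Stand-in for an already-stored 1536-dim embedding
full = np.random.default_rng(0).normal(size=1536)
short = shorten_embedding(full, 256)
print(len(short), round(float(np.linalg.norm(short)), 4))  # 256 1.0
```

This is useful when vectors are already stored and re-calling the API with a `dimensions` parameter is not an option.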

Sentence-Transformers

The open-source Sentence-Transformers library provides rich pre-trained models with local deployment support.

python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "Vector embedding technology is very important",
    "Embedding is the foundation of AI",
    "The weather is really nice today"
]

embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

For multilingual scenarios, the following models perform well:

| Model | Source | Features |
| --- | --- | --- |
| BGE Series | BAAI | Bilingual, excellent performance |
| M3E | Moka AI | Chinese-optimized, open source |
| multilingual-e5 | Microsoft | Supports 100+ languages |

python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-base-en-v1.5')

texts = ["What is vector embedding?", "Embedding technology explained"]
embeddings = model.encode(texts, normalize_embeddings=True)

Similarity Calculation Methods

Cosine Similarity

Cosine similarity is the most commonly used vector similarity metric, calculating the cosine of the angle between two vectors.

python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (norm(vec1) * norm(vec2))

vec_a = np.array([0.1, 0.2, 0.3, 0.4])
vec_b = np.array([0.15, 0.25, 0.28, 0.38])
vec_c = np.array([0.9, 0.1, 0.05, 0.02])

print(f"Similarity between A and B: {cosine_similarity(vec_a, vec_b):.4f}")
print(f"Similarity between A and C: {cosine_similarity(vec_a, vec_c):.4f}")

Euclidean Distance

Euclidean distance calculates the straight-line distance between two points in vector space.

python
def euclidean_distance(vec1, vec2):
    return np.sqrt(np.sum((vec1 - vec2) ** 2))

distance_ab = euclidean_distance(vec_a, vec_b)
distance_ac = euclidean_distance(vec_a, vec_c)

print(f"Distance between A and B: {distance_ab:.4f}")
print(f"Distance between A and C: {distance_ac:.4f}")

Which Metric to Choose?

mermaid
graph TD
    A[Choose Similarity Metric] --> B{Are vectors normalized?}
    B -->|Yes| C["Cosine Similarity = Dot Product"]
    B -->|No| D{Focus on direction or distance?}
    D -->|Direction| E[Cosine Similarity]
    D -->|Distance| F[Euclidean Distance]
    E --> G[Suitable for text similarity]
    F --> H[Suitable for clustering analysis]
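
The "normalized ⇒ cosine = dot product" case also explains why the two metrics often agree: for unit vectors, ‖a − b‖² = 2(1 − cos(a, b)), so ranking by cosine similarity and ranking by Euclidean distance produce the same order. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(42)
a, b = rng.normal(size=4), rng.normal(size=4)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)  # normalize to unit length

cos_sim = float(np.dot(a, b))          # cosine similarity (dot product of unit vectors)
sq_dist = float(np.sum((a - b) ** 2))  # squared Euclidean distance

print(round(sq_dist, 6) == round(2 * (1 - cos_sim), 6))  # True
```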

Practical Application Scenarios

Semantic Search System

Traditional search relies on keyword matching, while semantic search understands query intent.

python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

documents = [
    "Python is a popular programming language",
    "Machine learning requires large amounts of data",
    "Deep learning uses neural networks",
    "Natural language processing analyzes text",
    "Vector databases store embedding vectors"
]

doc_embeddings = model.encode(documents, convert_to_tensor=True)

def semantic_search(query, top_k=3):
    query_embedding = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, doc_embeddings)[0]
    top_results = scores.argsort(descending=True)[:top_k]
    
    results = []
    for idx in top_results:
        results.append({
            "document": documents[idx],
            "score": scores[idx].item()
        })
    return results

query = "How to process text data"
results = semantic_search(query)
for r in results:
    print(f"Similarity: {r['score']:.4f} - {r['document']}")

Recommendation System

Embedding-based recommendation systems can discover latent associations between content.

python
class EmbeddingRecommender:
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.items = []
        self.embeddings = None
    
    def add_items(self, items):
        self.items = items
        self.embeddings = self.model.encode(items, convert_to_tensor=True)
    
    def recommend(self, user_history, top_k=5):
        history_embedding = self.model.encode(
            user_history, 
            convert_to_tensor=True
        ).mean(dim=0)
        
        scores = util.cos_sim(history_embedding, self.embeddings)[0]
        top_indices = scores.argsort(descending=True)[:top_k]
        
        return [self.items[i] for i in top_indices]

recommender = EmbeddingRecommender()
recommender.add_items([
    "Python Machine Learning in Practice",
    "Deep Learning Getting Started Guide",
    "Web Development Best Practices",
    "Data Science Handbook",
    "Algorithms and Data Structures"
])

user_history = ["Python Programming Basics", "Data Analysis Introduction"]
recommendations = recommender.recommend(user_history, top_k=3)
print("Recommendations:", recommendations)

Text Clustering

Use embedding vectors for text clustering to automatically discover topics.

python
from sklearn.cluster import KMeans
import numpy as np

texts = [
    "Python is the most popular programming language",
    "JavaScript is used for web development",
    "Machine learning has changed the AI field",
    "Deep learning requires GPU acceleration",
    "React is a frontend framework",
    "Neural networks simulate the brain"
]

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(texts)

kmeans = KMeans(n_clusters=2, random_state=42)
clusters = kmeans.fit_predict(embeddings)

for i, (text, cluster) in enumerate(zip(texts, clusters)):
    print(f"Cluster {cluster}: {text}")

Embedding Dimension Selection and Optimization

Impact of Dimensions on Performance

mermaid
graph LR
    A[Low Dimension 128-256] --> B["Small storage, fast computation, lower precision"]
    C[Medium Dimension 512-768] --> D["Balanced choice, suitable for most scenarios"]
    E[High Dimension 1024-3072] --> F["High precision, large storage, slow computation"]

Dimension Selection Recommendations

| Scenario | Recommended Dimension | Reason |
| --- | --- | --- |
| Large-scale retrieval | 256-512 | Storage and computation efficiency |
| Precise matching | 768-1536 | Higher semantic precision |
| Real-time applications | 256-384 | Low latency requirements |
| Research experiments | 1024+ | Exploring performance limits |

Dimensionality Reduction Techniques

When you need to reduce storage or speed up computation, dimensionality reduction techniques can be used.

python
from sklearn.decomposition import PCA
import numpy as np

# PCA can only learn as many components as it has samples
# (n_components <= min(n_samples, n_features)), so fit it on a
# sufficiently large corpus of embeddings, then reuse it for new ones.
rng = np.random.default_rng(42)
embeddings_768 = rng.normal(size=(1000, 768))  # stand-in for a real corpus of 768-dim embeddings

pca = PCA(n_components=256)
embeddings_256 = pca.fit_transform(embeddings_768)

print(f"Original dimension: {embeddings_768.shape}")
print(f"After reduction: {embeddings_256.shape}")
print(f"Variance retained: {sum(pca.explained_variance_ratio_):.4f}")

Vector Database Integration

When handling large-scale embedding vectors, specialized vector databases are needed.

| Database | Features | Use Case |
| --- | --- | --- |
| Pinecone | Fully managed, easy to use | Quick deployment |
| Milvus | Open source, feature-rich | Self-hosted deployment |
| Weaviate | GraphQL API | Complex queries |
| Chroma | Lightweight | Prototyping |
| Qdrant | Rust implementation, high performance | Production environment |

Chroma Quick Start

python
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.Client()

ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

collection = client.create_collection(
    name="documents",
    embedding_function=ef
)

collection.add(
    documents=[
        "Vector embeddings are the foundation of AI",
        "Semantic search understands query intent",
        "Recommendation systems provide personalized content"
    ],
    ids=["doc1", "doc2", "doc3"]
)

results = collection.query(
    query_texts=["How to implement intelligent search"],
    n_results=2
)
print(results)

Useful Tools

When developing embedding applications, these tools can improve efficiency:

💡 When developing AI applications, you often need to handle various data format conversions. Visit QubitTool for more developer tools.

FAQ

What's the difference between vector embeddings and word vectors?

Word vectors (Word Embeddings) are a type of vector embedding specifically for words. Vector embedding is a broader concept that can be applied to sentences, paragraphs, images, and various other data types. Modern embedding models typically generate sentence or document-level embeddings directly.

How to choose the right embedding model?

Choosing an embedding model requires considering: 1) Language support (choose models that support your target language); 2) Performance requirements (precision vs. speed); 3) Deployment method (API calls vs. local deployment); 4) Budget. It's recommended to start with Sentence-Transformers for prototyping, then evaluate commercial APIs for production.

Can embedding vectors be used across models?

No. Embedding vectors generated by different models exist in different vector spaces and cannot be directly compared. If you need to switch models, you must regenerate all embedding vectors. This is why careful consideration is needed when choosing a model.

How to handle embeddings for very long texts?

Most embedding models have input length limits (e.g., 512 or 8192 tokens). Methods for handling long texts: 1) Truncate to maximum length; 2) Chunk and average or concatenate embeddings; 3) Use models that support long texts like BGE-M3; 4) Extract key paragraphs for embedding.
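
Option 2 (chunking) can be sketched with a simple word-count splitter with overlap; the sizes below are illustrative, and production systems usually chunk by tokens rather than words:

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into overlapping word-count chunks."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

long_text = " ".join(f"word{i}" for i in range(120))
chunks = chunk_words(long_text)
print(len(chunks), len(chunks[0].split()))  # 3 50
```

The resulting chunk embeddings can then be averaged into one document vector, or stored individually with a pointer back to the source document.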

How to update embedding vectors?

When source data changes, embedding vectors need to be regenerated. Recommendations: 1) Establish a data change tracking mechanism; 2) Use incremental update strategies; 3) Periodically rebuild completely to ensure consistency; 4) Consider using vector databases that support real-time updates.
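
Point 1 (change tracking) can be as simple as storing a content hash alongside each vector and re-embedding only when the hash changes. A minimal sketch, where `embed` is a stand-in for any real embedding call:

```python
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def embed(text):
    # Stand-in for a real embedding call (local model or API).
    return [float(len(text))]

store = {}  # doc_id -> {"hash": ..., "vector": ...}

def upsert(doc_id, text):
    """Re-embed only when the document content actually changed."""
    h = content_hash(text)
    if doc_id in store and store[doc_id]["hash"] == h:
        return False  # unchanged: skip the (expensive) embedding call
    store[doc_id] = {"hash": h, "vector": embed(text)}
    return True

print(upsert("doc1", "vector embeddings"))   # True  (new document)
print(upsert("doc1", "vector embeddings"))   # False (unchanged, skipped)
print(upsert("doc1", "vector embeddings!"))  # True  (content changed)
```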

Summary

Vector embedding technology is the core infrastructure of modern AI applications. By converting text, images, and other data into semantic vectors, we can achieve:

  • Semantic understanding: Go beyond keyword matching to truly understand content meaning
  • Similarity calculation: Quickly find semantically similar content
  • Knowledge retrieval: Provide precise context for RAG systems
  • Personalized recommendations: Intelligent recommendations based on semantic similarity

Key Takeaways

✅ Vector embeddings map data to semantic vector space
✅ Cosine similarity is the most commonly used similarity metric
✅ Model selection requires balancing precision, speed, and cost
✅ Vector databases are essential components for large-scale applications
✅ Dimension selection needs to be weighed according to the scenario

💡 Start Practicing: Use QubitTool developer tools to accelerate your AI application development!