Understanding RAG: A Practical Guide for Engineers - Part 2: Vectors, Embeddings, and How RAG Actually Works

Part 1 covered what RAG is and why it matters. Now let’s get into the mechanics - what vectors actually are, how embeddings capture meaning, and how the pieces fit together.

This is the explanation I needed when I started. Technical enough to be useful, without assuming a background in machine learning.

Vectors Are Just Lists of Numbers

That’s it. When someone says “embedding vector,” they mean something like:

[0.23, -0.15, 0.87, 0.42, -0.63, 0.91, ...]

These lists typically have hundreds or thousands of numbers (1536 and 3072 are the default sizes for OpenAI’s text-embedding-3-small and text-embedding-3-large), but conceptually it’s just a list.

Why Numbers Instead of Text?

Computers can’t “understand” text the way we do. When you read “authentication,” you instantly know what it means and how it relates to other concepts - login, OAuth, security, tokens. Computers need a different representation.

Vectors are how we represent meaning as numbers. Each position in the list captures some aspect of meaning. The details of what each number represents are learned during training, but the key insight is simple:

Similar meanings → similar numbers

"authentication"     → [0.8, 0.2, 0.1, 0.15, ...]
"login"              → [0.75, 0.25, 0.12, 0.18, ...]  ← Close to "authentication"
"OAuth"              → [0.72, 0.28, 0.09, 0.16, ...]  ← Also close
"database schema"    → [0.1, 0.05, 0.9, 0.82, ...]    ← Very different

Measure the mathematical distance between these vectors, and “authentication” and “login” come out close together, while “authentication” and “database schema” come out far apart. This is how a computer represents that “authentication” and “login” are related but “authentication” and “database schema” aren’t.
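One common way to measure that closeness is cosine similarity. The sketch below applies it to the four toy numbers shown above. It assumes cosine similarity as the metric (the article does not name one), and real embeddings have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors copied from the example above.
authentication = [0.8, 0.2, 0.1, 0.15]
login = [0.75, 0.25, 0.12, 0.18]
database_schema = [0.1, 0.05, 0.9, 0.82]

sim_login = cosine_similarity(authentication, login)
sim_schema = cosine_similarity(authentication, database_schema)

print(f"authentication vs login:           {sim_login:.3f}")
print(f"authentication vs database schema: {sim_schema:.3f}")
```

The first score lands close to 1.0 and the second is much lower, which is the "close together" versus "far apart" relationship expressed as a number.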

The Technical Term: Embeddings

Converting text into these number vectors is called “embedding,” and the resulting vector is an “embedding.”

Models like OpenAI’s text-embedding-3-small are trained on billions of documents to learn these representations. You send text, they return a vector:

Input:  "POST request requires authentication token"
Output: [0.023, -0.145, 0.872, ..., 0.234]  # 1536 numbers

The model learned these representations by reading massive amounts of text and learning which words and phrases tend to appear in similar contexts. Words that appear in similar contexts get similar vectors.
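To make "similar contexts, similar vectors" concrete, here is a deliberately tiny count-based sketch. It is not how models like text-embedding-3-small are trained; the three-sentence corpus and the co-occurrence counting are assumptions chosen only to show the idea.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up corpus; real models learn from vastly more text.
corpus = [
    "users login with a password",
    "users authenticate with a password",
    "tables store rows in a schema",
]

# For each word, count the other words that share a sentence with it.
contexts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i, word in enumerate(words):
        for j, other in enumerate(words):
            if i != j:
                contexts[word][other] += 1

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

same_context = cosine_similarity(contexts["login"], contexts["authenticate"])
different_context = cosine_similarity(contexts["login"], contexts["schema"])

print(f"login vs authenticate: {same_context:.2f}")
print(f"login vs schema:       {different_context:.2f}")
```

"login" and "authenticate" never appear in the same sentence here, yet they end up with matching context counts because the words around them are the same.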

How Vectors Enable Semantic Search

This is where it gets useful.

Traditional Keyword Search

Document: "POST /api/auth requires Bearer token in Authorization header"
Query:    "authentication POST endpoint"

Match? YES - both contain "POST" (the document says "auth" and "Authorization", not the exact word "authentication")

Document: "POST /api/auth requires Bearer token in Authorization header"
Query:    "How do I login to the API?"

Match? NO - doesn't contain "login", and "api" appears only inside the path /api/auth

Traditional search looks for exact word matches. Different words, no match - even if the meaning is identical.
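The two cases above can be reproduced with a naive exact-word matcher. This is a minimal sketch, assuming whitespace tokenization and case-sensitive comparison. Real search engines add lowercasing, stemming, and ranking, but they still cannot connect "login" to "Bearer token". The function names are hypothetical.

```python
def tokenize(text):
    """Split on whitespace and strip trailing punctuation; case-sensitive on purpose."""
    return {word.strip("?.,!") for word in text.split()}

def shared_keywords(query, document):
    """Return the query words that appear verbatim in the document."""
    return tokenize(query) & tokenize(document)

document = "POST /api/auth requires Bearer token in Authorization header"

first = shared_keywords("authentication POST endpoint", document)
second = shared_keywords("How do I login to the API?", document)

print(first)   # only the words both texts share
print(second)  # empty: no word in common
```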

Vector/Semantic Search

Document: "POST /api/auth requires Bearer token in Authorization header"
→ Vector: [0.34, 0.12, 0.89, ...]

Query: "How do I login to the API?"
→ Vector: [0.36, 0.14, 0.87, ...]

Distance between vectors: SMALL (similar meaning)
Match? YES ✓

The vectors for “POST /api/auth requires Bearer token” and “How do I login to the API?” are mathematically similar, even though the words are different. The system treats them as being about the same thing.

This is what makes RAG work. It doesn’t match words - it matches meaning.

The Three Stages of RAG

Now let’s walk through exactly how RAG works. Understanding this helps you evaluate systems and understand where they fail.

[Figure: RAG three stages overview]

Stage 1: Indexing (Preparation)

This happens when you first set up your RAG system, and again whenever the source documents change.

Step 1: Split documents into chunks

Full document: "The /api/users endpoint accepts GET and POST requests.
                GET returns user list. POST creates new user with
                email and name fields required..."

Split into chunks:
- Chunk 1: "The /api/users endpoint accepts GET and POST requests."
- Chunk 2: "GET returns user list."
- Chunk 3: "POST creates new user with email and name fields required."

Why chunk? Long documents may not fit in the LLM’s context window, embedding models have their own input limits, and smaller chunks give more precise retrieval.
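A minimal chunker for the example above might look like this. It is a sketch under simple assumptions: sentences end in ".", "!" or "?", and the max_chars limit of 60 is an arbitrary value picked so that each sentence becomes its own chunk. Production chunkers usually count tokens and add overlap.

```python
import re

def chunk_text(text, max_chars=60):
    """Split text into sentences, then pack consecutive sentences into chunks
    of at most max_chars. A single sentence longer than max_chars is kept whole."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow the chunk, so start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

document = (
    "The /api/users endpoint accepts GET and POST requests. "
    "GET returns user list. "
    "POST creates new user with email and name fields required."
)

chunks = chunk_text(document)
for number, chunk in enumerate(chunks, start=1):
    print(f"Chunk {number}: {chunk}")
```

Raising max_chars merges neighbouring sentences into larger chunks, which trades retrieval precision for more context per chunk.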

Step 2: Convert each chunk to a vector

Chunk: "The /api/users endpoint accepts GET and POST requests"
  ↓ (send to embedding model)
Vector: [0.12, 0.45, 0.89, -0.23, 0.67, ...]  # 1536 numbers

Step 3: Store in a vector database

Vector Database:
┌─────────────────────────────────────────────┬──────────────────────────┐
│ Text Chunk                                  │ Vector                   │
├─────────────────────────────────────────────┼──────────────────────────┤
│ "/api/users accepts GET and POST requests"  │ [0.12, 0.45, 0.89, ...]  │
│ "GET returns user list"                     │ [0.67, 0.23, 0.11, ...]  │
│ "POST creates user with email, name"        │ [0.34, 0.78, 0.91, ...]  │
└─────────────────────────────────────────────┴──────────────────────────┘

Vector databases (Pinecone, Weaviate, Chroma, pgvector) are optimized for finding similar vectors quickly, even with millions of them.
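Conceptually, the table above is just a list of (text, vector) pairs plus a way to find the nearest one. The class below is a toy stand-in, not the API of Pinecone, Weaviate, Chroma, or pgvector. The class and method names are hypothetical, and only the first three numbers of each vector are used, so distances are illustrative.

```python
import math

class InMemoryVectorStore:
    """Toy stand-in for a vector database: keeps (text, vector) pairs in a list."""

    def __init__(self):
        self.rows = []

    def add(self, text, vector):
        """Store a chunk alongside its embedding."""
        self.rows.append((text, vector))

    def nearest(self, query_vector):
        """Return the stored text with the smallest Euclidean distance to the query."""
        best = min(self.rows, key=lambda row: math.dist(row[1], query_vector))
        return best[0]

store = InMemoryVectorStore()
store.add("/api/users accepts GET and POST requests", [0.12, 0.45, 0.89])
store.add("GET returns user list", [0.67, 0.23, 0.11])
store.add("POST creates user with email, name", [0.34, 0.78, 0.91])

closest = store.nearest([0.15, 0.43, 0.91])
print(closest)
```

A linear scan like this is fine for a few thousand vectors. Dedicated vector databases exist because it stops being fine at millions.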

Stage 2: Retrieval (Search)

This happens every time a user asks a question.

Step 1: Convert the question to a vector

User question: "What HTTP methods does the users endpoint support?"
  ↓ (send to same embedding model)
Query vector: [0.15, 0.43, 0.91, -0.21, 0.69, ...]

Step 2: Find the most similar vectors in the database

Query vector:     [0.15, 0.43, 0.91, ...]

Comparing to database:
Chunk A: [0.12, 0.45, 0.89, ...]  ← Distance: 0.05 (VERY CLOSE!)
Chunk B: [0.67, 0.23, 0.11, ...]  ← Distance: 0.82 (far)
Chunk C: [0.34, 0.78, 0.91, ...]  ← Distance: 0.54 (medium)

Return: Chunk A ("/api/users accepts GET and POST requests")

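Conceptually, the vector database calculates the distance between the query vector and every stored vector, then returns the closest matches. In practice, most vector databases use approximate nearest-neighbor indexes so they don’t have to compare against every vector.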

Notice: the question was “What HTTP methods does the users endpoint support?” and the chunk says “/api/users accepts GET and POST requests” - different wording, but the vectors are very close. The system treats them as being about the same thing.

Step 3: Return the top-k chunks

Usually you retrieve multiple chunks (k=3 to k=10), not just one, to give the LLM more context.
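Top-k retrieval can be sketched as a brute-force scan that keeps the k smallest distances. The example assumes Euclidean distance and reuses the first three numbers of the vectors shown earlier, so the distances will not match the illustrative values above. The top_k function is a hypothetical helper, not a database API.

```python
import heapq
import math

def top_k(query_vector, rows, k=3):
    """Return the k (distance, text) pairs closest to the query, nearest first."""
    scored = ((math.dist(vector, query_vector), text) for text, vector in rows)
    return heapq.nsmallest(k, scored)

rows = [
    ("/api/users accepts GET and POST requests", [0.12, 0.45, 0.89]),
    ("GET returns user list", [0.67, 0.23, 0.11]),
    ("POST creates user with email, name", [0.34, 0.78, 0.91]),
]

results = top_k([0.15, 0.43, 0.91], rows, k=2)
for distance, text in results:
    print(f"{distance:.2f}  {text}")
```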

Stage 3: Generation (Creating the Answer)

Step 1: Build a prompt with the retrieved context

prompt = f"""
You are a helpful assistant. Answer the question based on the context below.

Context:
{retrieved_chunk_1}
{retrieved_chunk_2}
{retrieved_chunk_3}

Question: {user_question}

Answer:
"""

Step 2: Send to the LLM

Context: "/api/users accepts GET and POST requests. GET returns user list.
          POST creates user with email and name fields required..."

Question: "What HTTP methods does the users endpoint support?"

LLM generates: "The /api/users endpoint supports GET and POST requests.
                GET retrieves the user list, while POST creates a new user."

The LLM doesn’t “know” your API’s endpoints from its training. But because we retrieved that information and included it in the prompt, it can generate an accurate answer.
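The prompt template from Step 1 hard-codes three chunks. A small helper can build the same prompt for however many chunks retrieval returns. This is a sketch: build_prompt is a hypothetical name, and the call to the LLM itself is left out because it depends on the provider you use.

```python
def build_prompt(chunks, question):
    """Assemble a RAG prompt from any number of retrieved chunks."""
    context = "\n".join(chunks)
    return (
        "You are a helpful assistant. "
        "Answer the question based on the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )

retrieved = [
    "/api/users accepts GET and POST requests",
    "GET returns user list",
    "POST creates user with email, name",
]

prompt = build_prompt(retrieved, "What HTTP methods does the users endpoint support?")
print(prompt)
```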

Understanding the Failure Modes

Now that you understand how RAG works mechanically, you can understand where it fails:

Bad chunking → Context gets split poorly → Important info spread across chunks → LLM doesn’t get complete picture

Poor embeddings → Similar concepts get different vectors → Relevant chunks aren’t retrieved → LLM has no context to work with

Wrong top-k value → Too few chunks (miss important context) or too many chunks (irrelevant info confuses LLM)

Outdated vector database → Someone updates a document but vectors aren’t updated → Old information gets retrieved

The retrieval quality determines everything. If you retrieve the wrong chunks, the best LLM in the world can’t save you.
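The "outdated vector database" failure is the easiest of these to detect mechanically. One possible approach, sketched below, is to store a hash of each chunk's text at indexing time and compare it against the current text later. The chunk ids and function names are hypothetical, and the sketch does not handle chunks that were deleted from the source.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_stale_chunks(indexed_hashes, current_chunks):
    """Return ids of chunks that are new or whose text changed since indexing."""
    return [
        chunk_id
        for chunk_id, text in current_chunks.items()
        if indexed_hashes.get(chunk_id) != content_hash(text)
    ]

# Hashes recorded when the chunks were last embedded.
indexed_hashes = {
    "users-1": content_hash("The /api/users endpoint accepts GET and POST requests."),
    "users-2": content_hash("GET returns user list."),
}

# Current state of the documentation: one chunk edited, one chunk added.
current_chunks = {
    "users-1": "The /api/users endpoint accepts GET and POST requests.",
    "users-2": "GET returns a paginated user list.",
    "users-3": "POST creates new user with email and name fields required.",
}

stale = find_stale_chunks(indexed_hashes, current_chunks)
print(stale)  # ids that need to be re-embedded
```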

Vector Databases You’ll Encounter

Now that you understand what vectors are and how they’re used, here are the options you’ll see:

Managed/Cloud:

Pinecone - Easiest to get started, fully managed
Weaviate Cloud - Good balance of features and ease
Qdrant Cloud - Fast, good for large scale

Self-hosted:

Chroma - Simple, great for prototyping
Weaviate - Feature-rich, open source
Milvus - Built for scale
pgvector - Postgres extension (use what you know)

What to consider:

Scale: How many vectors? (thousands vs. millions)
Latency: How fast do queries need to be?
Cost: Managed services charge by storage + queries
Features: Filtering, metadata, hybrid search support
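For the scale question, a back-of-the-envelope estimate of raw vector storage helps. The sketch assumes each number is stored as a 4-byte float32 and counts the vectors only. Index structures, metadata, and the chunk text add to this, and some databases compress vectors.

```python
def index_size_gb(num_vectors, dimensions, bytes_per_value=4):
    """Raw size of the vectors alone in gigabytes, assuming float32 values."""
    return num_vectors * dimensions * bytes_per_value / 1e9

small = index_size_gb(10_000, 1536)
large = index_size_gb(1_000_000, 1536)

print(f"10,000 vectors:    {small:.3f} GB")
print(f"1,000,000 vectors: {large:.3f} GB")
```

At 1536 dimensions, ten thousand vectors fit comfortably in memory on a laptop, while a million vectors take roughly 6 GB before any index overhead.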

The Key Insights

If you’ve made it this far, here’s what should click:

Vectors are coordinates in a high-dimensional space. Similar meanings cluster together.
Embeddings capture semantic meaning. “Authentication” and “login” get similar vectors because they appear in similar contexts in training data.
Semantic search is finding nearby vectors. The math finds similar meaning, not similar words.
RAG is search + prompting. Find relevant chunks (retrieval), give them to the LLM (generation). No retraining.
Quality depends on retrieval. If you retrieve the wrong chunks, the whole system fails.

What’s Next

You now understand how RAG works mechanically - vectors, embeddings, the three stages, where failures happen.

But here’s what the tutorials don’t tell you: naive RAG isn’t good enough for production.

The basic version I just described typically achieves 50-60% accuracy. That’s not good enough for anything users depend on.

In Part 3, I’ll cover what actually matters when you’re building or evaluating RAG systems for real:

Hybrid search: Why vector-only search fails and what to do about it
Chunking strategies: The difference between 60% and 90% accuracy
Evaluation metrics: How to measure quality objectively
Data freshness: Handling updates without re-indexing everything
Cost modeling: Why production costs surprise everyone

These are the details that separate a demo from something users trust.

Next: Part 3 - RAG in Production: What Actually Matters

I’m building Stache, an open-source RAG platform, and writing this series to share what I’ve learned. Connect on LinkedIn or check out the project on GitHub.