The Problem RAG Solves
Large Language Models (LLMs) like GPT-4 are incredibly powerful, but they have a critical limitation: they can only answer based on what they were trained on. This leads to two major issues:
- Hallucinations: The model might confidently provide incorrect information
- Limited Knowledge: The model doesn't know about your specific data, documents, or recent information
RAG solves this by giving the LLM a "brain" made of your specific data.
The Three Steps of RAG
Step 1: RETRIEVAL (The "Librarian")
What happens:
- When you ask a question, we DON'T send it to the AI immediately
- Instead, we first ask Elasticsearch to find the exact "pages" from our data that match the intent of the question
How it works:
- Your question is converted into a vector (a list of numbers representing meaning) using OpenAI's embedding model
- Elasticsearch performs a KNN (K-Nearest Neighbors) search to find documents with similar meaning
- Returns the top 5 most relevant document chunks
Key Teaching Point:
"Elastic isn't looking for words; it's looking for meaning using Vector Search. This is semantic search, not keyword search."
// Convert question to vector
const questionEmbedding = await generateEmbedding(question);
// Search Elasticsearch using KNN
const searchResponse = await elasticClient.search({
index: INDEX_NAME,
knn: {
field: "embedding",
query_vector: questionEmbedding,
k: 5, // Return top 5 most similar documents
num_candidates: 100,
},
});
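For completeness, `generateEmbedding` above is just a thin wrapper around OpenAI's embeddings API. Here is a minimal sketch; the model name is an assumption (the post doesn't say which embedding model is used), but any OpenAI embedding model returns the same shape of result:

import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Convert text into a vector of numbers that captures its meaning
// (model name is an assumption; the important part is the returned number[])
async function generateEmbedding(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return response.data[0].embedding;
}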
Step 2: AUGMENTATION (The "Context Window")
What happens:
- Our application combines the search results from Elasticsearch with your original question
- This creates a "context window" that tells the LLM exactly what information to use
How it works:
- Take the retrieved document chunks
- Format them with their source information
- Combine them into a single context string
- Add your question to this context
Key Teaching Point:
"We are 'augmenting' the AI. We're telling it: 'Here is the data you need. Only use this info to answer the question.' This prevents hallucinations."
// Pull the matching documents out of the Elasticsearch response
const relevantDocs = searchResponse.hits.hits.map((hit: any) => hit._source);

// Format context from retrieved documents
const context = relevantDocs
  .map((doc: any) => `[Source: ${doc.title}]\n${doc.content}`)
  .join("\n\n---\n\n");
Step 3: GENERATION (The "Speaker")
What happens:
- OpenAI's GPT-4 takes the context and your question
- It generates a human-sounding, accurate response based ONLY on the provided context
How it works:
- Send a system prompt that instructs the LLM to only use the provided context
- Include the retrieved context + your question in the user message
- GPT-4 generates a response that summarizes and synthesizes the context
Key Teaching Point:
"The AI isn't guessing anymore. It's summarizing the high-quality data that Elastic provided. This is why RAG is so powerful—it combines the reasoning ability of LLMs with the accuracy of your own data."
const completion = await openai.chat.completions.create({
model: "gpt-4",
messages: [
{
role: "system",
content: `You are a helpful assistant that answers questions based on the provided context... Only use information from the provided context.`,
},
{
role: "user",
content: `Context from knowledge base:\n\n${context}\n\nQuestion: ${question}`,
},
],
});
How Elastic Acts as "Long-Term Memory"
Traditional LLM (Without RAG):
User Question → GPT-4 → Answer (based on training data only)
Problem: Limited to what GPT-4 was trained on, can't access your documents
RAG with Elastic:
User Question
→ Generate Embedding
→ Elasticsearch Vector Search (finds your documents)
→ Combine Context + Question
→ GPT-4 → Answer (grounded in your data)
Solution: Elastic acts as the "long-term memory" that stores and retrieves your specific knowledge
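Putting the three steps together, the whole pipeline fits in one function. This is a minimal sketch that reuses the `generateEmbedding`, `elasticClient`, `openai`, and `INDEX_NAME` pieces from the snippets above:

async function answerQuestion(question: string): Promise<string> {
  // Step 1: RETRIEVAL - embed the question and search Elasticsearch
  const questionEmbedding = await generateEmbedding(question);
  const searchResponse = await elasticClient.search({
    index: INDEX_NAME,
    knn: {
      field: "embedding",
      query_vector: questionEmbedding,
      k: 5,
      num_candidates: 100,
    },
  });

  // Step 2: AUGMENTATION - format the retrieved chunks into a context string
  const context = searchResponse.hits.hits
    .map((hit: any) => `[Source: ${hit._source.title}]\n${hit._source.content}`)
    .join("\n\n---\n\n");

  // Step 3: GENERATION - ask GPT-4 to answer using only that context
  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "system",
        content: "You are a helpful assistant. Only use information from the provided context.",
      },
      {
        role: "user",
        content: `Context from knowledge base:\n\n${context}\n\nQuestion: ${question}`,
      },
    ],
  });

  return completion.choices[0].message.content ?? "";
}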
Why Elastic is Perfect for This:
- Vector Storage: Elasticsearch's dense_vector field type stores embeddings efficiently
- KNN Search: Fast similarity search using cosine similarity
- Scalability: Can handle millions of documents
- Real-time: Documents are searchable immediately after indexing
- Hybrid Search: Can combine vector search with traditional keyword search
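To make the dense_vector point concrete, here is one possible index mapping for this setup. The field names mirror the snippets above; the 1536 dimension count is an assumption that matches common OpenAI embedding models:

await elasticClient.indices.create({
  index: INDEX_NAME,
  mappings: {
    properties: {
      title: { type: "text" },
      content: { type: "text" },
      embedding: {
        type: "dense_vector",
        dims: 1536,           // must match the embedding model's output size
        index: true,          // enable KNN search on this field
        similarity: "cosine", // rank neighbors by cosine similarity
      },
    },
  },
});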
The Complete Flow
┌─────────────────┐
│ User Question │
│ "What is RAG?" │
└────────┬────────┘
│
▼
┌─────────────────────────┐
│ Step 1: RETRIEVAL │
│ ───────────────────── │
│ 1. Generate embedding │
│ 2. Search Elasticsearch │
│ 3. Get top 5 docs │
└────────┬────────────────┘
│
▼
┌─────────────────────────┐
│ Step 2: AUGMENTATION │
│ ───────────────────── │
│ Combine: │
│ - Retrieved docs │
│ - Original question │
└────────┬────────────────┘
│
▼
┌─────────────────────────┐
│ Step 3: GENERATION │
│ ───────────────────── │
│ GPT-4 generates answer │
│ based on context │
└────────┬────────────────┘
│
▼
┌─────────────────┐
│ Final Answer │
│ (with sources) │
└─────────────────┘
🎓 Key Concepts Explained
What are Embeddings?
Embeddings are numerical representations of text that capture semantic meaning. Similar concepts have similar embeddings (vectors that are close together in high-dimensional space).
Example:
- "dog" and "puppy" → Similar vectors (close in space)
- "dog" and "airplane" → Different vectors (far apart in space)
What is KNN (K-Nearest Neighbors)?
KNN finds the K most similar vectors to your query vector. In our case:
- K = 5 means we get the 5 most semantically similar documents
- Uses cosine similarity to measure "closeness"
- Returns documents that mean similar things, not just contain similar words
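Cosine similarity is what makes "closeness" measurable. Here is a rough sketch of the math Elasticsearch applies under the hood, plus the dog/puppy/airplane comparison from the embeddings example (the printed scores are illustrative, not measured):

// Cosine similarity: close to 1 means "same meaning", lower means less related
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const [dog, puppy, airplane] = await Promise.all(
  ["dog", "puppy", "airplane"].map(generateEmbedding)
);

console.log(cosineSimilarity(dog, puppy));    // high, since the concepts are related
console.log(cosineSimilarity(dog, airplane)); // noticeably lower, unrelated concepts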
Why Chunk Documents?
Large documents are split into smaller chunks because:
- Context Limits: LLMs have token limits
- Precision: Smaller chunks allow more precise retrieval
- Relevance: You get exactly the relevant section, not the entire document
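A chunker can be as simple as fixed-size slices with a bit of overlap, so sentences that straddle a boundary are not lost. This is a naive sketch of the idea; real pipelines often split on paragraphs or sentences instead:

function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  // Step forward by (chunkSize - overlap) so consecutive chunks share some text
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}

// Each chunk is then embedded and indexed as its own document
// (longDocument is a placeholder for your raw text)
const chunks = chunkText(longDocument);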
Why This Demonstrates Elastic's Mission
Elastic is the Search AI Company. This RAG application showcases:
- Vector Search: Moving beyond keyword matching to semantic understanding
- Production-Ready: Elasticsearch Serverless handles scale and reliability
- Developer Experience: Simple API, powerful capabilities
- Real-World Use Cases: RAG is one of the most important AI applications today