The Problem RAG Solves
Large Language Models (LLMs) like GPT-4 are incredibly powerful, but they have a critical limitation: they can only answer based on what they were trained on. This leads to two major issues:
- Hallucinations: The model might confidently provide incorrect information
- Limited Knowledge: The model doesn't know about your specific data, documents, or recent information
RAG solves this by giving the LLM a "brain" made of your specific data.
The Three Steps of RAG
Step 1: RETRIEVAL (The "Librarian")
What happens:
- When you ask a question, we DON'T send it to the AI immediately
- Instead, we first ask Elasticsearch to find the exact "pages" from our data that match the intent of the question
How it works:
- Your question is converted into a vector (a list of numbers representing meaning) using OpenAI's embedding model
- Elasticsearch performs a KNN (K-Nearest Neighbors) search to find documents with similar meaning
- Returns the top 5 most relevant document chunks
Key Teaching Point:
"Elastic isn't looking for words; it's looking for meaning using Vector Search. This is semantic search, not keyword search."
// Convert question to vector
const questionEmbedding = await generateEmbedding(question);
// Search Elasticsearch using KNN
const searchResponse = await elasticClient.search({
index: INDEX_NAME,
knn: {
field: "embedding",
query_vector: questionEmbedding,
k: 5, // Return top 5 most similar documents
num_candidates: 100,
},
});
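For completeness, `generateEmbedding` above is just a thin wrapper around OpenAI's embeddings API. Here is a minimal sketch; the model name is an assumption (the post doesn't say which embedding model is used), but any OpenAI embedding model returns the same shape of result:

import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Convert text into a vector of numbers that captures its meaning
// (model name is an assumption; the important part is the returned number[])
async function generateEmbedding(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return response.data[0].embedding;
}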
Step 2: AUGMENTATION (The "Context Window")
What happens:
- Our application combines the search results from Elasticsearch with your original question
- This creates a "context window" that tells the LLM exactly what information to use
How it works:
- Take the retrieved document chunks
- Format them with their source information
- Combine them into a single context string
- Add your question to this context
Key Teaching Point:
"We are 'augmenting' the AI. We're telling it: 'Here is the data you need. Only use this info to answer the question.' This prevents hallucinations."
// Pull the matching documents out of the Elasticsearch response
const relevantDocs = searchResponse.hits.hits.map((hit: any) => hit._source);

// Format context from retrieved documents
const context = relevantDocs
  .map((doc: any) => `[Source: ${doc.title}]\n${doc.content}`)
  .join("\n\n---\n\n");
Step 3: GENERATION (The "Speaker")
What happens:
- OpenAI's GPT-4 takes the context and your question
- It generates a human-sounding, accurate response based ONLY on the provided context
How it works:
- Send a system prompt that instructs the LLM to only use the provided context
- Include the retrieved context + your question in the user message
- GPT-4 generates a response that summarizes and synthesizes the context
Key Teaching Point:
"The AI isn't guessing anymore. It's summarizing the high-quality data that Elastic provided. This is why RAG is so powerful—it combines the reasoning ability of LLMs with the accuracy of your own data."
const completion = await openai.chat.completions.create({
model: "gpt-4",
messages: [
{
role: "system",
content: `You are a helpful assistant that answers questions based on the provided context... Only use information from the provided context.`,
},
{
role: "user",
content: `Context from knowledge base:\n\n${context}\n\nQuestion: ${question}`,
},
],
});
How Elastic Acts as "Long-Term Memory"
Traditional LLM (Without RAG):
User Question → GPT-4 → Answer (based on training data only)
Problem: Limited to what GPT-4 was trained on, can't access your documents
RAG with Elastic:
User Question
→ Generate Embedding
→ Elasticsearch Vector Search (finds your documents)
→ Combine Context + Question
→ GPT-4 → Answer (grounded in your data)
Solution: Elastic acts as the "long-term memory" that stores and retrieves your specific knowledge
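Putting the three steps together, the whole pipeline fits in one function. This is a minimal sketch that reuses the `generateEmbedding`, `elasticClient`, `openai`, and `INDEX_NAME` pieces from the snippets above:

async function answerQuestion(question: string): Promise<string> {
  // Step 1: RETRIEVAL - embed the question and search Elasticsearch
  const questionEmbedding = await generateEmbedding(question);
  const searchResponse = await elasticClient.search({
    index: INDEX_NAME,
    knn: {
      field: "embedding",
      query_vector: questionEmbedding,
      k: 5,
      num_candidates: 100,
    },
  });

  // Step 2: AUGMENTATION - format the retrieved chunks into a context string
  const context = searchResponse.hits.hits
    .map((hit: any) => `[Source: ${hit._source.title}]\n${hit._source.content}`)
    .join("\n\n---\n\n");

  // Step 3: GENERATION - ask GPT-4 to answer using only that context
  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "system",
        content: "You are a helpful assistant. Only use information from the provided context.",
      },
      {
        role: "user",
        content: `Context from knowledge base:\n\n${context}\n\nQuestion: ${question}`,
      },
    ],
  });

  return completion.choices[0].message.content ?? "";
}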
Why Elastic is Perfect for This:
- Vector Storage: Elasticsearch's dense_vector field type stores embeddings efficiently
- KNN Search: Fast similarity search using cosine similarity
- Scalability: Can handle millions of documents
- Real-time: Documents are searchable immediately after indexing
- Hybrid Search: Can combine vector search with traditional keyword search
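To make the dense_vector point concrete, here is one possible index mapping for this setup. The field names mirror the snippets above; the 1536 dimension count is an assumption that matches common OpenAI embedding models:

await elasticClient.indices.create({
  index: INDEX_NAME,
  mappings: {
    properties: {
      title: { type: "text" },
      content: { type: "text" },
      embedding: {
        type: "dense_vector",
        dims: 1536,           // must match the embedding model's output size
        index: true,          // enable KNN search on this field
        similarity: "cosine", // rank neighbors by cosine similarity
      },
    },
  },
});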
The Complete Flow
┌─────────────────┐
│ User Question │
│ "What is RAG?" │
└────────┬────────┘
│
▼
┌─────────────────────────┐
│ Step 1: RETRIEVAL │
│ ───────────────────── │
│ 1. Generate embedding │
│ 2. Search Elasticsearch │
│ 3. Get top 5 docs │
└────────┬────────────────┘
│
▼
┌─────────────────────────┐
│ Step 2: AUGMENTATION │
│ ───────────────────── │
│ Combine: │
│ - Retrieved docs │
│ - Original question │
└────────┬────────────────┘
│
▼
┌─────────────────────────┐
│ Step 3: GENERATION │
│ ───────────────────── │
│ GPT-4 generates answer │
│ based on context │
└────────┬────────────────┘
│
▼
┌─────────────────┐
│ Final Answer │
│ (with sources) │
└─────────────────┘
🎓 Key Concepts Explained
What are Embeddings?
Embeddings are numerical representations of text that capture semantic meaning. Similar concepts have similar embeddings (vectors that are close together in high-dimensional space).
Example:
- "dog" and "puppy" → Similar vectors (close in space)
- "dog" and "airplane" → Different vectors (far apart in space)
What is KNN (K-Nearest Neighbors)?
KNN finds the K most similar vectors to your query vector. In our case:
- K = 5 means we get the 5 most semantically similar documents
- Uses cosine similarity to measure "closeness"
- Returns documents that mean similar things, not just contain similar words
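Cosine similarity is what makes "closeness" measurable. Here is a rough sketch of the math Elasticsearch applies under the hood, plus the dog/puppy/airplane comparison from the embeddings example (the printed scores are illustrative, not measured):

// Cosine similarity: close to 1 means "same meaning", lower means less related
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const [dog, puppy, airplane] = await Promise.all(
  ["dog", "puppy", "airplane"].map(generateEmbedding)
);

console.log(cosineSimilarity(dog, puppy));    // high, since the concepts are related
console.log(cosineSimilarity(dog, airplane)); // noticeably lower, unrelated concepts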
Why Chunk Documents?
Large documents are split into smaller chunks because:
- Context Limits: LLMs have token limits
- Precision: Smaller chunks allow more precise retrieval
- Relevance: You get exactly the relevant section, not the entire document
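A chunker can be as simple as fixed-size slices with a bit of overlap, so sentences that straddle a boundary are not lost. This is a naive sketch of the idea; real pipelines often split on paragraphs or sentences instead:

function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  // Step forward by (chunkSize - overlap) so consecutive chunks share some text
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}

// Each chunk is then embedded and indexed as its own document
// (longDocument is a placeholder for your raw text)
const chunks = chunkText(longDocument);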
Why This Demonstrates Elastic's Mission
Elastic is the Search AI Company. This RAG application showcases:
- Vector Search: Moving beyond keyword matching to semantic understanding
- Production-Ready: Elasticsearch Serverless handles scale and reliability
- Developer Experience: Simple API, powerful capabilities
- Real-World Use Cases: RAG is one of the most important AI applications today