Deep Learning · LLMs · 30 min

RAG & Vector Databases

Giving LLMs access to external knowledge through retrieval-augmented generation


Why This Matters

LLMs know only what they learned during training — that knowledge has a cutoff date and cannot be updated without retraining, and the model sometimes hallucinates facts that sound plausible but are wrong. Retrieval-Augmented Generation (RAG) addresses this by fetching relevant information from external sources and including it in the prompt, so the model generates answers grounded in real data.

RAG is the most popular architecture for building production AI applications. It powers chatbots that answer questions about your company's documents, search engines that summarize results, and coding assistants that understand your codebase. If you build anything with LLMs in production, you will likely build a RAG system.

Define Terms

Visual Model

User Query → Embed Query → Vector DB (document embeddings) → Top-K Results (retrieved docs) → LLM + Context (query + retrieved chunks) → Response (grounded answer)

Documents (chunked + embedded) → Vector DB

The full process at a glance.

RAG: embed documents into a vector database, retrieve relevant chunks for each query, and augment the LLM prompt with real evidence.

Code Example

Code
// Simple RAG pipeline simulation

// Step 1: Chunk documents (the demo below embeds whole docs; use this helper for the chunking exercise)
function chunkText(text, chunkSize) {
  const words = text.split(" ");
  const chunks = [];
  for (let i = 0; i < words.length; i += chunkSize) {
    chunks.push(words.slice(i, i + chunkSize).join(" "));
  }
  return chunks;
}

// Step 2: Simple embedding (in reality, use an API)
function simpleEmbed(text) {
  // Toy embedding: character frequency vector
  const vec = new Array(26).fill(0);
  for (const ch of text.toLowerCase()) {
    const idx = ch.charCodeAt(0) - 97;
    if (idx >= 0 && idx < 26) vec[idx]++;
  }
  const mag = Math.sqrt(vec.reduce((s, v) => s + v * v, 0)) || 1;
  return vec.map(v => v / mag); // normalize
}

// Step 3: Vector store (in-memory)
class SimpleVectorStore {
  constructor() { this.items = []; }

  add(text) {
    this.items.push({ text, vector: simpleEmbed(text) });
  }

  search(query, k = 3) {
    const qVec = simpleEmbed(query);
    const scored = this.items.map(item => ({
      text: item.text,
      score: item.vector.reduce((s, v, i) => s + v * qVec[i], 0)
    }));
    scored.sort((a, b) => b.score - a.score);
    return scored.slice(0, k);
  }
}

// Build the pipeline
const docs = [
  "Python is a programming language used for web development and data science.",
  "JavaScript runs in the browser and powers interactive web pages.",
  "Neural networks learn patterns from data using layers of neurons.",
  "Transformers use attention mechanisms to process sequences in parallel.",
  "Vector databases store embeddings for fast similarity search."
];

const store = new SimpleVectorStore();
docs.forEach(doc => store.add(doc));

// Search
const results = store.search("How do neural networks learn?", 2);
console.log("Top results:");
results.forEach(r => console.log(` [${r.score.toFixed(3)}] ${r.text}`));

// Step 4: Build prompt
const context = results.map(r => r.text).join("\n");
const prompt = `Context:\n${context}\n\nQuestion: How do neural networks learn?\nAnswer:`;
console.log("\nFinal prompt:\n", prompt);

Interactive Experiment

Try these exercises:

  • Add more documents to the vector store on different topics. Does the search still find relevant results?
  • Try a query that does not match any document well. What is the similarity score? How would you set a threshold to say "no relevant results found"?
  • Change the chunk size from full documents to 5-word chunks. How does this affect search quality?
  • What happens if you search for a query in French when all documents are in English? (With real embeddings, multilingual models handle this.)
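For the threshold exercise above, here is one minimal sketch. It reuses the same toy character-frequency embedding from the code example; the 0.5 cutoff is an arbitrary value chosen for this toy setup, not a standard — with real embedding models you would tune the threshold empirically on your own data.

```javascript
// Threshold sketch: reject weak matches instead of always returning top-k.
// Same toy embedding as the main example (normalized letter frequencies).
function simpleEmbed(text) {
  const vec = new Array(26).fill(0);
  for (const ch of text.toLowerCase()) {
    const idx = ch.charCodeAt(0) - 97;
    if (idx >= 0 && idx < 26) vec[idx]++;
  }
  const mag = Math.sqrt(vec.reduce((s, v) => s + v * v, 0)) || 1;
  return vec.map(v => v / mag);
}

function searchWithThreshold(query, docs, k = 3, minScore = 0.5) {
  const qVec = simpleEmbed(query);
  const scored = docs.map(text => ({
    text,
    // dot product of unit vectors = cosine similarity
    score: simpleEmbed(text).reduce((s, v, i) => s + v * qVec[i], 0)
  }));
  scored.sort((a, b) => b.score - a.score);
  const top = scored.slice(0, k).filter(r => r.score >= minScore);
  return top.length > 0 ? top : null; // null signals "no relevant results found"
}

const docs = [
  "Neural networks learn patterns from data using layers of neurons.",
  "Vector databases store embeddings for fast similarity search."
];

console.log(searchWithThreshold("How do neural networks learn?", docs, 2));
console.log(searchWithThreshold("zzzzzz", docs, 2)); // → null (no letter overlap)
```

Note that with this toy embedding, any two English sentences score fairly high (common letters dominate), which is exactly why a "no results" query like "zzzzzz" is a useful sanity check for your threshold.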

Quick Quiz

Coding Challenge

Build a Document Search

Write a function called `searchDocs` that takes a query string and an array of document strings, and returns the top-k most relevant documents using cosine similarity on simple character-frequency vectors. Each document should be embedded as a 26-dimensional vector of normalized character frequencies (a-z only).
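If you get stuck, here is one possible sketch (try the challenge yourself first). It follows the same pattern as the code example above: embed each string as a normalized 26-dimensional letter-frequency vector, then rank by dot product, which equals cosine similarity since the vectors are unit length.

```javascript
// One possible searchDocs sketch — not the only valid solution.
function embed(text) {
  const vec = new Array(26).fill(0);
  for (const ch of text.toLowerCase()) {
    const idx = ch.charCodeAt(0) - 97; // map 'a'..'z' to 0..25
    if (idx >= 0 && idx < 26) vec[idx]++;
  }
  const mag = Math.sqrt(vec.reduce((s, v) => s + v * v, 0)) || 1;
  return vec.map(v => v / mag); // normalize to unit length
}

function searchDocs(query, docs, k = 3) {
  const qVec = embed(query);
  return docs
    .map(text => ({
      text,
      score: embed(text).reduce((s, v, i) => s + v * qVec[i], 0)
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

console.log(searchDocs("fast similarity search", [
  "Vector databases store embeddings for fast similarity search.",
  "JavaScript runs in the browser and powers interactive web pages."
], 1)); // top hit: the vector-databases sentence
```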


Real-World Usage

RAG is the dominant pattern for building production AI applications:

  • Enterprise chatbots: Companies use RAG to let employees ask questions about internal documentation, policies, and knowledge bases.
  • Legal research: Law firms use RAG to search case law and contracts, generating summaries grounded in actual legal text.
  • Customer support: RAG-powered bots retrieve relevant help articles and compose accurate, source-cited answers.
  • Code assistants: Tools like Cursor and Cody index your codebase in a vector database and retrieve relevant files when you ask questions.
  • Medical AI: RAG systems search clinical databases and research papers to provide evidence-based answers for healthcare professionals.

Connections