🧠 RAG Fundamentals Q3 / 19

What are the main components of a RAG pipeline?

AI-Powered Answer

A Retrieval-Augmented Generation (RAG) pipeline enhances the capabilities of large language models (LLMs) by giving them access to external, up-to-date, and domain-specific information. This typically involves two core phases: an offline indexing phase and an online retrieval and generation phase.

1. Indexing Phase (Offline Data Preparation)

This phase runs once, or periodically as the knowledge base changes, to prepare the external data for search. It transforms raw documents into a format the retrieval component can query efficiently; minimal code sketches of the chunking, embedding, and storage steps follow the list.

  • Document Loading: Ingesting data from various sources (e.g., text files, PDFs, databases, web pages) into the system.
  • Document Splitting (Chunking): Breaking large documents into smaller, more manageable segments or 'chunks'. This keeps each retrieved piece of context focused on one topic and ensures it fits within the LLM's context window.
  • Embedding: Converting each text chunk into a numerical vector representation (an 'embedding') using an embedding model. These embeddings capture the semantic meaning of the text.
  • Vector Storage: Storing these embeddings along with their corresponding original text chunks in a vector database or vector store. This specialized database allows for efficient similarity searches based on vector distance.
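
A minimal sketch of the chunking step is below. The fixed-size, character-based splitter and the `chunk_size`/`overlap` defaults are illustrative choices, not recommendations; production systems often split on sentence, paragraph, or section boundaries instead.

```python
def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping, character-based chunks."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():  # skip whitespace-only tail fragments
            chunks.append(chunk)
    return chunks

# Example: 500-character chunks, each sharing 50 characters with its neighbor
# so that sentences cut at a boundary still appear intact in one chunk.
# chunks = split_into_chunks(open("my_document.txt").read())
```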
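
The embedding and storage steps can be sketched with plain NumPy, as below. The `embed` function is a toy feature-hashed bag-of-words stand-in for a real embedding model (it measures word overlap, not semantics), and the "vector store" is just a matrix of unit-norm vectors kept next to the original chunks; `dim=256` is arbitrary.

```python
import zlib
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy embedding: feature-hashed bag of words, L2-normalized.
    A stand-in for a real embedding model, which would capture
    semantic meaning rather than mere word overlap."""
    v = np.zeros(dim)
    for token in text.lower().split():
        v[zlib.crc32(token.encode()) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

chunks = [
    "RAG pipelines index documents ahead of time.",
    "Embeddings map text chunks to vectors for similarity search.",
]
# The simplest possible vector store: a (num_chunks, dim) matrix kept
# alongside the original chunks. A real vector database adds persistence
# and approximate-nearest-neighbor indexing on top of this idea.
index = np.stack([embed(c) for c in chunks])
```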

2. Retrieval and Generation Phase (Online Runtime)

This phase executes at query time, whenever a user submits a request. It fetches relevant information from the prepared knowledge base and uses it to ground the LLM's response; code sketches of the retrieval and augmentation steps follow the list.

  • Query Embedding: The user's input query is transformed into a vector embedding using the same embedding model employed during the indexing phase. This allows for semantic comparison with the stored document chunks.
  • Retrieval: The query embedding is used to search the vector store for the most semantically similar document chunks. The retrieval component identifies and fetches the top-k relevant chunks based on a similarity metric (e.g., cosine similarity).
  • Context Augmentation: The retrieved chunks are combined with the user's original query to form an augmented prompt. This enriched prompt gives the LLM specific, factual context directly relevant to the question.
  • Generation: The augmented prompt is fed to a large language model, which uses the provided context together with its parametric knowledge to produce a coherent, accurate, and contextually grounded answer to the user's query.
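
Retrieval reduces to a single matrix-vector product when all embeddings are unit-normalized, because cosine similarity then equals the dot product. The sketch below assumes the toy `embed` function and the `chunks`/`index` pair from the indexing sketch above; `k=3` is an arbitrary default.

```python
import numpy as np

def retrieve(query: str, chunks: list[str], index: np.ndarray, k: int = 3):
    """Return the top-k chunks most similar to the query.
    Assumes embed() and the (chunks, index) pair from the indexing
    sketch; all vectors are unit-norm, so the dot product below is
    exactly cosine similarity."""
    q = embed(query)                        # same model as at indexing time
    scores = index @ q                      # one similarity score per chunk
    top = np.argsort(scores)[::-1][:k]      # indices of the k highest scores
    return [(chunks[i], float(scores[i])) for i in top]
```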
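
Context augmentation is plain string assembly, and the final generation step depends entirely on which LLM client you use, so it appears below only as a commented, hypothetical placeholder. The prompt template is one common pattern, not a canonical format.

```python
def build_augmented_prompt(query: str, retrieved_chunks: list[str]) -> str:
    """Combine retrieved chunks and the user's query into one prompt."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Hypothetical final call; substitute whatever LLM client you actually use:
# answer = llm_client.generate(build_augmented_prompt(query, top_chunks))
```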