How are documents converted into embeddings in Naive RAG?
In Naive RAG (Retrieval-Augmented Generation), the process of converting raw documents into numerical representations called embeddings is fundamental for efficient retrieval. This conversion allows the system to search for semantically similar document chunks based on a user's query.
1. Document Preprocessing and Chunking
Before documents can be converted into embeddings, they typically undergo preprocessing: cleaning the text, removing irrelevant formatting (HTML tags, special characters), and sometimes normalizing content. The most critical step for large documents is 'chunking', splitting the document into smaller, manageable segments. This is necessary because embedding models have input token limits, and smaller chunks allow for more granular retrieval. Common chunking strategies are listed below, followed by a sketch of the simplest one.
- Fixed-size chunking: Documents are split into chunks of a predefined character or token length, often with a configurable overlap between consecutive chunks to maintain context across boundaries.
- Semantic chunking: Attempts to split documents based on semantic boundaries, such as paragraphs, sections, or by using language models to identify coherent blocks of text, aiming to keep related information together.
- Recursive chunking: A common strategy that tries different splitting delimiters (e.g., sections, paragraphs, sentences) recursively until chunks are small enough to fit within the embedding model's context window.
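As a concrete illustration of the first strategy, here is a minimal character-based sketch of fixed-size chunking with overlap. The `chunk_size` and `overlap` values are arbitrary choices for illustration; production systems often use library splitters (e.g., LangChain's RecursiveCharacterTextSplitter) that also handle token counting.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap between neighbors."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    # The final chunk may be shorter than chunk_size.
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

document = "Retrieval-Augmented Generation combines retrieval with generation. " * 40
chunks = chunk_text(document)
print(f"{len(chunks)} chunks; chunk 1 starts: {chunks[0][:50]!r}")
```

The overlap means each chunk repeats the tail of its predecessor, so a sentence that straddles a chunk boundary still appears intact in at least one chunk.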
2. The Embedding Model
Once documents are divided into chunks, each individual chunk is fed into an embedding model. These are typically deep learning models based on transformer architectures (like BERT, RoBERTa, or specialized Sentence Transformers) that are trained to produce dense vector representations of text.
- Input: Each text chunk from the preprocessing step serves as input to the embedding model.
- Transformation: The model processes the text and transforms it into a fixed-size numerical vector (an array of numbers). This vector captures the semantic meaning and contextual information of the chunk in a high-dimensional space. Texts with similar meanings will have embeddings that are closer together in this vector space.
- Examples: Popular embedding models used in RAG include Sentence-BERT (SBERT), OpenAI's text-embedding-ada models, various models from Hugging Face Transformers, and proprietary models offered by cloud providers. A minimal sketch using one such open model follows.
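The sketch below uses the open-source sentence-transformers library; 'all-MiniLM-L6-v2' is just one freely available model (it outputs 384-dimensional vectors) and stands in here for whatever embedding model a given system uses.

```python
from sentence_transformers import SentenceTransformer

# 'all-MiniLM-L6-v2' is an illustrative model choice, not a recommendation.
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Embeddings map text to dense numerical vectors.",
    "Vector databases support fast similarity search.",
]
embeddings = model.encode(chunks)  # NumPy array of shape (num_chunks, 384)
print(embeddings.shape)
```

Note that every chunk maps to a vector of the same fixed length regardless of the chunk's text length, which is what makes the vectors directly comparable.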
3. Storing Embeddings
The resulting embedding vectors, along with their corresponding original text chunks and any relevant metadata (e.g., source document, page number, section title), are then stored in a specialized database known as a vector database or vector index. This storage is optimized for efficient similarity search via Approximate Nearest Neighbor (ANN) algorithms such as HNSW, which is crucial for the rapid retrieval of relevant document chunks during the RAG process.
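As an illustration of such storage, this sketch builds a FAISS index over placeholder vectors; cosine similarity is obtained by taking inner products over L2-normalized vectors, and a plain Python list stands in for the metadata store a real vector database would provide. FAISS's IndexFlatIP performs exact search; ANN index types such as IndexHNSWFlat trade a little accuracy for much faster lookups at scale.

```python
import faiss
import numpy as np

embedding_dim = 384  # must match the embedding model's output dimension
# Placeholder vectors standing in for real chunk embeddings.
embeddings = np.random.rand(100, embedding_dim).astype("float32")
faiss.normalize_L2(embeddings)  # normalize in place: inner product == cosine

index = faiss.IndexFlatIP(embedding_dim)  # exact inner-product (cosine) search
index.add(embeddings)

# Chunk texts and metadata live in a parallel store, keyed by vector position.
metadata = [{"chunk_id": i, "source": "doc.pdf", "page": 1} for i in range(100)]
print(index.ntotal)  # -> 100
```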
When a user poses a query, that query undergoes the same embedding process using the *exact same embedding model*. This query embedding is then used to find the most semantically similar document chunk embeddings stored in the vector database, enabling the system to retrieve the most relevant information to answer the user's question.
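The sketch below ties the pieces together at query time, using plain NumPy as a stand-in for a vector database; the example chunks and query are invented for illustration. The key point is that the query is encoded with the same model as the chunks, so both live in the same vector space.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# The SAME model must embed both the document chunks and the query.
model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

chunks = [
    "The Eiffel Tower is located in Paris.",
    "Photosynthesis converts sunlight into chemical energy.",
    "Naive RAG retrieves text chunks to ground the model's answer.",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

query_vec = model.encode(["Where is the Eiffel Tower?"], normalize_embeddings=True)[0]
scores = chunk_vecs @ query_vec  # cosine similarity via dot product of unit vectors
best = int(np.argmax(scores))
print(f"Top chunk: {chunks[best]} (score={scores[best]:.3f})")
```

In a full RAG pipeline, the top-scoring chunks retrieved this way are inserted into the language model's prompt as context for generating the final answer.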