Understanding RAG Architecture for GenAI Development: Best Practices and Common Pitfalls
Image created with AI, credit Microsoft Designer
“RAG is a technique that combines the power of large language models with external knowledge retrieval that allows you to take generative AI from generalized to specific—it’s the pattern that orchestrates the AI to do something very focused without jeopardizing data security.”
Generative AI (GenAI) is revolutionizing industries, enabling businesses to automate tasks, generate content, and create personalized experiences at scale. But to unlock its full potential, you need the right infrastructure—and that’s where our core concept of RAG (Retrieval-Augmented Generation) comes in. This blog will help you understand what RAG really is, the steps involved, why infrastructure matters, and how to build a system that works seamlessly while avoiding the most common mistakes.
What is GenAI and RAG, and Why Do They Matter?
At its core, Generative AI (GenAI) refers to artificial intelligence systems that produce content based on learned data patterns. This content can range from text to images to music, depending on the model’s training. While powerful on its own, GenAI often struggles with precision when tasked with delivering highly specific or specialized content it wasn’t previously trained on.
That’s where RAG (Retrieval-Augmented Generation) comes into play. It augments the generative capabilities of AI by allowing it to pull specific, relevant data points from external sources to create a targeted response. This concept is called “grounding”: anchoring the language model’s outputs to factual, retrieved information rather than relying solely on its pre-trained knowledge.
This architecture combines a pre-trained LLM with your organization’s data or “special sauce” (see what I did there with schema sauce?) and ensures the AI can retrieve detailed information that isn’t out in the public, making it far more accurate for business use cases like customer service, finding the right HR document, or summarizing an email chain.
In short, while GenAI produces content, RAG ensures that content is specific, relevant, and useful without letting your sensitive data out into the wild. If you want your AI to provide tailored results rather than generalized responses, RAG is essential.
Best Practices for Building GenAI/RAG Infrastructure
Building a successful GenAI/RAG system isn’t just about deploying a large language model and hoping for the best. It requires a structured approach to ensure scalability, performance, and security.
1. Collect, Clean, Prep Your Data
Gather relevant documents, databases, or other data sources you want to ground your responses in. Remove irrelevant information, formatting issues, or noise. Correct errors or inconsistencies in the data. For PDFs, images, or other non-plain text formats, extract the text content (if necessary).
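To make this step concrete, here is a minimal sketch of pulling text out of a PDF and doing some light cleanup. It assumes the pypdf library and a hypothetical file name; use whatever parser fits your actual file types.

```python
# Minimal text extraction and cleanup sketch (assumes `pip install pypdf`).
import re

from pypdf import PdfReader

def extract_clean_text(path: str) -> str:
    reader = PdfReader(path)
    # Pull text out of every page; extract_text() can return None for image-only pages.
    pages = [page.extract_text() or "" for page in reader.pages]
    text = "\n".join(pages)
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # squash excessive blank lines
    return text.strip()

# Hypothetical file name for illustration:
# handbook_text = extract_clean_text("employee_handbook.pdf")
```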
2. Chunk Your Data into Tokens
No we aren’t at Chuck-e-Cheese where you chunk your dollar into tokens and use them to win prizes… wait or are we?
Whether it is structured data (rows and columns from tables) or unstructured data (media or documents), it needs to be chunked up into bite-sized pieces the LLM can ingest.
Chunking is the process of dividing larger pieces of information into smaller, manageable units. This is crucial for RAG systems to effectively retrieve and process information.
Different models often require different chunk sizes due to their varying context window sizes. Here’s why:
- Context Window: The maximum number of tokens a model can process at once.
- Varies by model: e.g., GPT-3 (4096 tokens), GPT-4 (8192 or 32768 tokens), BERT (512 tokens).
- Chunk Size Considerations:
- Should be smaller than the model’s context window.
- Leaves room for prompts and generated text.
- Typically aim for 50-75% of the context window size.
Context windows are measured in tokens, not words. A token is a piece of text that the model treats as a single unit. It can be a word, part of a word, or a character, depending on the tokenization method used.
Examples:
- In simple word-based tokenization: “hello” and “world” are separate tokens.
- In subword tokenization: “unhappy” might be split into “un” and “happy”.
- Punctuation marks are often separate tokens.
Purpose:
- Tokens help convert text into a format that machine learning models can process.
- They serve as the basic units for model input and output.
Once you have defined the parameters and your data is sufficiently chunked into the right number of tokens to fit the context window, it’s time for the next step UNLESS you want to get really fancy.
After the data is chunked, you can add metadata to the chunks (e.g. data source, date, category) to give the LLM even more properties to make inferences from. This metadata lives alongside the vector embeddings, which we will dive into in the next step.
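Here is a minimal chunking sketch to tie the ideas above together. It assumes the tiktoken tokenizer, and the chunk size, overlap, and metadata values are illustrative; tune them to the context window of the models you end up choosing.

```python
# Minimal token-based chunking sketch (assumes `pip install tiktoken`).
# Chunk size and overlap are illustrative; aim for 50-75% of your model's context window.
import tiktoken

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by many recent OpenAI models
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        start += chunk_size - overlap  # slide with overlap so ideas aren't cut mid-thought
    return chunks

# Attach metadata alongside each chunk for richer filtering later (hypothetical values).
document_text = "Employees accrue 20 days of PTO per year..."  # stand-in for a real document
chunks = [
    {"text": c, "source": "employee_handbook.pdf", "category": "HR"}
    for c in chunk_text(document_text)
]
```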
3. Vectorize Your Data
To make RAG work, your data needs to be in a format that the language model can understand. This is done by converting the raw data into vector embeddings, which represent the data in a mathematical form the AI can process. Ensuring compatibility between your vectorized data and the model is critical—mismatched formats will prevent the system from functioning properly.
Vector embeddings are like giving objects special number tags that describe their features. These tags help computers organize and compare things by how similar their features are, just like you might group toys based on their color, size, and shape.
You use an Embedding Model to process the chunks into vector embeddings. The choice of embedding model should be compatible with your retrieval system and somewhat aligned with your LLM, but it doesn’t necessarily have to be the same as your LLM.
Common choices include models like:
- Sentence transformers (e.g., all-MiniLM-L6-v2)
- OpenAI’s text-embedding-ada-002
- Models from Hugging Face’s transformers library
The LLM itself (like GPT-3 or GPT-4) is typically not used for creating these embeddings, as it would be computationally expensive and unnecessary.
So, in summary: You choose an embedding model, then use it to vectorize your chunks. These vectors are what get stored and later used for retrieval when working with your chosen LLM in the RAG system.
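As a minimal sketch of this step, here is what vectorizing chunks can look like with one of the sentence-transformer models mentioned above; the example texts are stand-ins for your real chunks.

```python
# Minimal embedding sketch (assumes `pip install sentence-transformers`).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast model; 384-dimensional vectors

texts = [  # stand-ins for the chunks you produced in the previous step
    "Employees accrue 20 days of PTO per year.",
    "Expense reports are due by the 5th of each month.",
]
embeddings = model.encode(texts, normalize_embeddings=True)  # one vector per chunk
print(embeddings.shape)  # (2, 384)
```

These vectors (plus any metadata) are what you will store in the vector database in the next step.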
Yes… the amount of different types of models, new terms, and components is dizzying! Don’t worry… we are ALMOST there 🙂
4. Choose Your Vector Database
The decision on which vector database to use typically happens before or during the Index Creation step. Here’s when and how you might make this choice:
- When to choose:
- Early in the project planning phase.
- After you’ve determined your data volume and query patterns.
- Before you start large-scale data processing and ingestion.
- Factors influencing the choice:
- Scale: Expected data volume and query load.
- Performance requirements: Query latency and throughput needs.
- Integration: Compatibility with your existing tech stack.
- Managed vs. Self-hosted: Your team’s operational capacity.
- Cost: Both in terms of hosting and potential licensing fees.
- Features: Support for metadata filtering, multimodal data, etc.
- Scalability: Ability to handle growing data and user base.
- Common options and their strengths:
- DataStax Astra DB (My favorite! And my place of work): Fully managed, optimized for production-scale deployments offering both cloud and self-hosted options. Supports structured data and vector data.
- Pinecone: the household name for SaaS vector databases, great early on for GenAI projects at small scale, but once you get to enterprise scale and real-time performance requirements it doesn’t deliver. Supports vector data.
- Milvus: Open-source, highly scalable, good for large-scale deployments. Supports vector data.
- Microsoft Cosmos DB: Azure-native DB with all the Microsoft integrations, fully managed, and globally distributed in the cloud. Supports structured data and vector data.
- Qdrant: Rust-based, known for high performance, supports filtering. Supports structured data and vector data.
- Elasticsearch with vector plugin: Good if you’re already using Elasticsearch.
- FAISS: Not a full database, but excellent for high-performance vector search, often used with other storage solutions.
Now before you choose, it’s often beneficial to run a small-scale proof of concept with a few options. This helps in understanding real-world performance and integration challenges. That is why I love Astra DB: the open source tool Langflow is directly integrated, making it quick and easy to run POCs with different models and test RAG outputs with your own data. More to come on this! We will talk about “abstraction layers” like Langflow and Langchain later, rather than painstakingly building multiple POCs from hand-written code.
5. Index Creation
Now that your data is in a readable format for an LLM, it’s time for indexing! You can apply the concept of indexing from what you may already know in Microsoft Excel or traditional databases. The purpose of indexing is to speed up data retrieval operations and quickly locate specific rows based on column values.
Given VLOOKUP was the first sophisticated function I learned in Excel, putting things in the context of VLOOKUP always helps me understand what exactly is happening.
VLOOKUP function:
- It’s a built-in Excel function for looking up data in a table or range.
- VLOOKUP performs a linear search through the first column of the specified range.
How VLOOKUP works:
- It starts at the top of the first column and scans down until it finds a match or reaches the end.
- This is essentially a brute-force search method.
Performance implications:
- For small datasets, VLOOKUP is quick and efficient.
- For large datasets, it can become slow as it has to scan through many rows.
Now instead of VLOOKUP, if you want to get REALLY fancy and you’re working with a ton of data, you will want to switch to the INDEX-MATCH combination:
- INDEX retrieves a value from a specified position in a range
- MATCH finds the position of a lookup value within a range
Why it’s faster:
a. Column independence:
- VLOOKUP requires the lookup column to be the leftmost in the table
- INDEX-MATCH can look up and return values from any columns
- This flexibility often means less data manipulation and fewer calculations
b. Search efficiency:
- MATCH function uses a binary search algorithm for sorted data
- This is much faster than VLOOKUP’s linear search, especially in large datasets
c. Range reference:
- VLOOKUP references the entire table for each lookup
- INDEX-MATCH only references the specific columns needed
Performance impact:
- For small datasets: Negligible difference
- For large datasets: INDEX-MATCH can be significantly faster
- The performance gap widens as the dataset size increases
Memory usage:
- INDEX-MATCH typically uses less memory
- This is because it doesn’t need to reference the entire table for each lookup
OK, now time to apply this concept to Index Creation for vector embeddings!
Purpose of Index Creation:
- Enable fast similarity searches over large numbers of vector embeddings
- Efficiently retrieve the most relevant chunks when given a query
Why it’s Necessary:
- Brute-force comparison of a query against millions of vectors is too slow
- Indexes use specialized data structures to speed up similarity searches
How Index Creation Works:
- Organize vectors into a searchable structure (e.g., trees, graphs, or quantized representations)
- Optimize for approximate nearest neighbor (ANN) search
- Balance between search speed and accuracy
Key Concepts in Indexing:
- Dimensionality reduction: Compress high-dimensional vectors
- Clustering: Group similar vectors together
- Quantization: Represent vectors with fewer bits
Process:
A. Choose an indexing method based on your needs (speed, accuracy, scalability)
B. Initialize the index with parameters suitable for your data
C. Add your vector embeddings to the index
D. Optionally, train the index for better performance
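For the curious, here is a minimal sketch of those steps using FAISS (one of the options from the vector database list); the dimensions, cluster count, and random vectors are purely illustrative stand-ins for your real embeddings.

```python
# Minimal approximate nearest neighbor (ANN) index sketch with FAISS
# (assumes `pip install faiss-cpu numpy`). Random vectors stand in for real embeddings.
import numpy as np
import faiss

dim = 384                                   # must match your embedding model's output size
embeddings = np.random.rand(10_000, dim).astype("float32")

# A/B. Choose and initialize an index: IVF clusters vectors so queries scan only a few clusters.
nlist = 100                                 # number of clusters
quantizer = faiss.IndexFlatL2(dim)          # exact index used to assign vectors to clusters
index = faiss.IndexIVFFlat(quantizer, dim, nlist)

# D. Train first: IVF has to learn cluster centroids before vectors can be added.
index.train(embeddings)

# C. Add your vectors to the index.
index.add(embeddings)

# Query time: retrieve the 5 chunks closest to a query embedding.
query = np.random.rand(1, dim).astype("float32")
index.nprobe = 10                           # clusters to scan per query: speed vs. accuracy knob
distances, ids = index.search(query, 5)
print(ids[0])                               # positions of the most similar vectors
```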
However, it’s important to note that in practice, these steps often overlap or are handled seamlessly by the vector database system. Here’s why:
- Many modern vector databases handle indexing internally (like DataStax Astra DB):
- When you ingest data, the database often automatically creates and updates its index.
- You don’t always need to create the index structure separately.
- Some systems allow for real-time indexing (like DataStax Astra DB):
- You can ingest data continuously, and the index updates in real-time.
- The process can be streamlined:
- Some vector databases offer APIs that accept raw text, automatically handle chunking and embedding, and then ingest the resulting vectors.
- Some tools like Unstructured.io pull data from different data file types and do the chunking and embedding on the fly then ingest the resulting vectors.
By creating an efficient index, you’re essentially building a “smart phonebook” for your vector embeddings. This allows your RAG system to quickly find the most relevant information without having to exhaustively search through every single vector, greatly enhancing the speed and efficiency of the retrieval process.
6. Quality Check
Just like it’s crucial to have quality data, it’s also crucial to have strong quality checks in place for the data you have pulled into your vector database and indexed. You can do this by:
- Manual review of retrieved results for sample queries.
- Automated testing with predefined query-result pairs (see the sketch after this list).
- Fine-tune chunk size if context is insufficient or redundant.
- Experiment with different embedding models if semantic matching is poor.
- Adjust index parameters (e.g., number of clusters, search depth) for better precision/recall balance.
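For the automated piece, even a simple recall-style check against a handful of hand-labeled query-result pairs goes a long way. Here is a minimal sketch, assuming a hypothetical search function that returns chunk IDs from your vector database:

```python
# Minimal retrieval quality check: does the expected chunk show up in the top-k results?
# `search` is a stand-in for whatever query call your vector database exposes.
test_cases = [  # hypothetical hand-labeled query/result pairs
    {"query": "How many PTO days do employees get?", "expected_chunk_id": "handbook_014"},
    {"query": "When are expense reports due?", "expected_chunk_id": "finance_003"},
]

def recall_at_k(search, k: int = 5) -> float:
    hits = 0
    for case in test_cases:
        retrieved_ids = search(case["query"], k)       # top-k chunk IDs from your retriever
        if case["expected_chunk_id"] in retrieved_ids:
            hits += 1
    return hits / len(test_cases)

# Example usage once your retriever exists:
# print(f"recall@5 = {recall_at_k(my_search):.0%}")
```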
This is typically the “uh-oh” point for many projects as you realize your data isn’t as clean as you expected it to be. Common issues that get surfaced are:
- Inconsistent formatting: Even within the same dataset, you might find multiple date formats, inconsistent spacing, or varying representations of the same information
- HTML artifacts: Remnants of web scraping (stray tags or entities), incomplete tag stripping, or encoded characters that slipped through
- Duplicate content with slight variations: The same information presented with minor differences that create nearly identical vectors and noise in your results
- Missing context: Chunks that made sense in the original document but lose critical context when separated
- OCR errors: If dealing with scanned documents, misread characters or formatting that creates nonsensical content
- Boilerplate pollution: Headers, footers, and standard disclaimers that add noise to your vector space
- Versioning confusion: Multiple versions of the same document without clear indicators of which is most current
- Inconsistent metadata: Missing or contradictory tags, categories, or other metadata that should help filter results
The challenge is that these issues often don’t become apparent until you’re actually testing your retrieval results, and fixing them usually requires revisiting your entire data processing pipeline. This can be a MAJOR bummer, but you aren’t alone! Many organizations face these issues. Overcoming them will take dedicated teamwork.
7. Choose Your Language Model
We have made it to the point where we can finally start using in-vogue terms like LLM! What does LLM actually mean, and what choices do we have at this point?
Large Language Models (LLMs) like GPT-4, Claude, and PaLM are massive models with 100B+ parameters. They are excellent at complex reasoning, understanding context, and generating human-like responses (not sentient!). LLMs are typically accessed through cloud APIs and have higher latency and cost compared to smaller models. Because they are API-based, it is easy to spin up keys and plug the model into your RAG architecture.
Best for: complex reasoning tasks, creative content generation, nuanced understanding
Small Language Models (SLMs) are models like Mistral 7B, Llama 2 13B, and Microsoft’s Phi models. These models can run on edge devices or local hardware, and yes, even your laptop (a quick sketch of running one locally follows the list of use cases below)! They have faster inference times and lower cost, which makes them enticing, although they have limited context windows and reasoning capabilities compared to LLMs.
Best for: edge processing, real-time applications, specific domain tasks
Examples of edge processing use cases:
- Local chat bots without internet connectivity
- IoT devices requiring natural language processing
- Privacy-sensitive applications where data can’t leave the device
- Mobile applications where low latency is crucial
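To show how approachable SLMs are, here is a minimal sketch of running one locally with the Hugging Face transformers library. The model name and generation settings are illustrative; pick whatever fits your hardware and task.

```python
# Minimal local SLM sketch (assumes `pip install transformers torch`).
from transformers import pipeline

# microsoft/phi-2 is a small model that can run on a laptop (slowly, on CPU).
generator = pipeline("text-generation", model="microsoft/phi-2")

prompt = "Summarize this policy in one sentence: employees accrue 20 days of PTO per year."
result = generator(prompt, max_new_tokens=60, do_sample=False)
print(result[0]["generated_text"])
```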
Foundation Models are pre-trained models that serve as a base for fine-tuning. These models can be adapted to specific domains or tasks through additional training; examples include GPT, BERT, RoBERTa, and T5. They provide a starting point for creating specialized models, which (shocker) is expensive and requires access to GPU compute. Most organizations don’t go down the route of training or fine-tuning their own models because of the resource intensiveness; that work is typically done by ISVs and SaaS companies.
Now you may ask, where do I find these models? How do I compare them? Here are the most common.
Model Hubs and Marketplaces:
- Hugging Face Hub: Largest repository of open-source models
- Azure AI Studio: Enterprise-ready models with Azure integration
- Amazon SageMaker: AWS’s collection of pre-trained models
- Google Cloud Model Garden: Collection of Google’s foundation models
- Anthropic’s Claude Models: Specialized for safety and reasoning
- GitHub repositories of major AI labs
- Meta’s open source models
Hugging Face is the most common with tons of analytics on each model to help you choose the right one.
Considerations for Model Selection:
- Inference costs and pricing models
- Hardware requirements and deployment constraints
- Licensing and usage restrictions
- Community support and documentation
- Fine-tuning capabilities and requirements
- Model update frequency and maintenance
- Ethical considerations and bias evaluations (IMPORTANT!!)
8. Test, Tune, Iterate
Now we can FINALLY put it all together!! Testing, tuning, and iterating is the name of the game: put your chosen architecture to the test and make sure the components you chose chunk your data effectively and provide the most relevant response outputs. This is where tools like Langchain and Langflow help the most. Here are the things to consider for each step.
Test: It is best to use a diverse set of queries covering expected use cases and to run these sets programmatically when testing outputs of your RAG system. Make sure to include edge cases and potential failure modes so you can see how the LLM handles them. An easy example is asking the same prompt while telling the RAG system you are a boy versus a girl. Do you get a different response? This is checking for bias. How about if you ask the same question twice? Or the same question, phrased slightly differently? This is checking for consistency.
Here are evaluation metrics for you to track (a small sketch of a programmatic test loop follows the list):
- Relevance of retrieved information.
- Quality and accuracy of LLM responses.
- Response time and system efficiency.
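Here is a minimal sketch of that kind of programmatic test loop, assuming a hypothetical rag_answer function that wraps your retrieve-then-generate pipeline; the queries and checks are illustrative, not exhaustive.

```python
# Minimal programmatic test loop for a RAG pipeline. `rag_answer` is a stand-in for your own
# retrieve-then-generate function.
import time

test_queries = [
    "How many PTO days do new employees get?",
    "How many PTO days do new employees get?",                # repeated on purpose: consistency check
    "What's the PTO allowance for someone who just joined?",  # same intent, different wording
]

def run_tests(rag_answer):
    results = []
    for query in test_queries:
        start = time.perf_counter()
        answer = rag_answer(query)
        latency = time.perf_counter() - start                 # response time / efficiency metric
        results.append({"query": query, "answer": answer, "latency_s": round(latency, 2)})
    # Identical queries should produce (near-)identical answers.
    consistent = results[0]["answer"] == results[1]["answer"]
    return results, consistent
```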
Tune: Refine each component based on test results. This might involve adjusting chunking strategies, trying different embedding models, adding guard rails, or fine-tuning prompts.
Iterate: go back to testing and try again! Once you have a good blueprint for your architecture, you can plan for a production push.
Common Mistakes to Avoid in GenAI/RAG Deployment
Mistakes during the design and deployment of RAG infrastructure can lead to inefficiencies, security vulnerabilities, or outright system failures. Here are some of the most common mistakes—and how to avoid them.
1. Misaligned Data and Models
Perhaps the most frequent error is failing to properly align your data with the model. For example, if your vector embeddings don’t match what the language model expects, the system won’t be able to interpret the data correctly. Think of it as trying to fit a square peg into a round hole—without alignment, nothing works as it should.
2. Inefficient Orchestration
Poor orchestration can cripple your system. If the vector database and the LLM aren’t communicating smoothly, you’ll experience delays and incomplete responses. Ensure your orchestration engine is robust and capable of handling complex queries without losing speed or accuracy.
3. Overlooking Scalability
Many teams fail to plan for the inevitable growth of data and demand. Without scalable infrastructure, you could end up with bottlenecks that slow down performance or require expensive retrofits. Always plan for expansion, even if your current needs are modest.
4. Neglecting Security
With sensitive data being processed and retrieved, neglecting security is a surefire way to invite trouble. Weak encryption or poor access controls can expose your system to malicious attacks or data breaches. Prioritize security from the outset to protect both the system and the data it handles.
Overarching Concepts for Your RAG System
Optimize Orchestration
Orchestration is the glue that holds the RAG system together. It manages the flow between the language model and the vector database, ensuring that queries are processed smoothly. Whether you use Python scripts or specialized orchestration tools, the key is seamless communication between components to avoid bottlenecks or delays in response time.
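To make that flow concrete, here is a minimal orchestration sketch in plain Python: embed the query, retrieve the closest chunks, assemble a grounded prompt, and call the LLM. The embedding model, the LLM model name, and the search helper are stand-ins for whichever components you chose in the earlier steps.

```python
# Minimal RAG orchestration sketch (assumes `pip install sentence-transformers openai` and an
# OPENAI_API_KEY in the environment). The `search` helper is a stand-in for your vector database.
from openai import OpenAI
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
llm = OpenAI()

def search(query_vector, k: int = 5) -> list[str]:
    """Stand-in for your vector database query; should return the top-k chunk texts."""
    raise NotImplementedError("wire this up to the vector database you chose")

def answer(question: str) -> str:
    query_vector = embedder.encode([question])[0]      # 1. embed the question
    context_chunks = search(query_vector, k=5)         # 2. retrieve the most relevant chunks
    context = "\n\n".join(context_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = llm.chat.completions.create(            # 3. generate a grounded response
        model="gpt-4o-mini",                           # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

An abstraction layer like Langflow or Langchain can handle this same flow visually or declaratively, but it helps to see how little glue code is actually involved.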
Scale Thoughtfully
As data and usage grow, so will the demands on your infrastructure. Start with a system that can scale easily, whether that means leveraging cloud-based storage, high-performance databases, or advanced orchestration tools. A scalable infrastructure will prevent performance degradation as your datasets expand.
Prioritize Security
RAG systems often deal with sensitive data, so building in security measures from the beginning is non-negotiable. This includes encryption, secure access controls, and data governance policies to ensure compliance with privacy regulations. Don’t leave security as an afterthought; it should be baked into your infrastructure from day one.
A New Age of AI Infrastructure
Building and maintaining GenAI/RAG infrastructure requires more than just technical expertise. It involves a deep understanding of how data flows through the system, how to scale efficiently, and how to secure sensitive information. By mastering the best practices outlined here, and avoiding the common pitfalls, you’ll be well-positioned to deploy a system that maximizes the potential of both generative AI and RAG.
Conclusion: Is Your Data Ready for the Future?
PHEW, that was a lot! In a world where AI systems are becoming integral to business operations, having a strong infrastructure is critical. Whether you’re just getting started with GenAI or looking to optimize your RAG setup, the key is to focus on alignment, orchestration, scalability, and security. Want to dive deeper into optimizing your AI systems? Get in touch with us to explore how we can help you build a future-proof infrastructure tailored to your needs.
Disclaimer: I am an employee of DataStax, the company that develops Astra DB and Langflow. While I am compensated as a salaried employee, I do not receive any additional commission or affiliate marketing income for discussing or promoting Astra DB. The views and opinions expressed in this blog post are my own and are based on my professional experience and knowledge of the product. I strive to provide accurate and helpful information, but I encourage readers to conduct their own research and make informed decisions based on their specific needs and circumstances.