Understanding RAG Architecture for GenAI Development: Best Practices and Common Pitfalls
Image created with AI, credit Microsoft Designer
“RAG is a technique that combines the power of large language models with external knowledge retrieval that allows you to take generative AI from generalized to specific—it’s the pattern that orchestrates the AI to do something very focused without jeopardizing data security.”
Generative AI (GenAI) is revolutionizing industries, enabling businesses to automate tasks, generate content, and create personalized experiences at scale. But to unlock its full potential, you need the right infrastructure—and that’s where our core concept of RAG (Retrieval-Augmented Generation) comes in. This blog will help you understand what RAG really is, the steps involved, why infrastructure matters, and how to build a system that works seamlessly while avoiding the most common mistakes.
What is GenAI and RAG, and Why Do They Matter?
At its core, Generative AI (GenAI) refers to artificial intelligence systems that produce content based on learned data patterns. This content can range from text to images to music, depending on the model’s training. While powerful on its own, GenAI often struggles with precision when tasked with delivering highly specific or specialized content it wasn’t previously trained on.
That’s where RAG (Retrieval-Augmented Generation) comes into play. It augments the generative capabilities of AI by allowing it to pull specific, relevant data points from external sources to create a targeted response. This concept is called “grounding”: anchoring the language model’s outputs to factual, retrieved information rather than relying solely on its pre-trained knowledge.
This architecture combines a pre-trained LLM with your organization’s data or “special sauce” (see what I did there with schema sauce?) and ensures the AI can retrieve detailed information that isn’t out in the public, making it far more accurate for business use cases like customer service, finding the right HR document, or summarizing an email chain.
In short, while GenAI produces content, RAG ensures that content is specific, relevant, and useful without letting your sensitive data out into the wild. If you want your AI to provide tailored results rather than generalized responses, RAG is essential.
Best Practices for Building GenAI/RAG Infrastructure
Building a successful GenAI/RAG system isn’t just about deploying a large language model and hoping for the best. It requires a structured approach to ensure scalability, performance, and security.
1. Collect, Clean, Prep Your Data
Gather relevant documents, databases, or other data sources you want to ground your responses in. Remove irrelevant information, formatting issues, or noise. Correct errors or inconsistencies in the data. For PDFs, images, or other non-plain text formats, extract the text content (if necessary).
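To make this step concrete, here is a minimal sketch of pulling text out of a PDF and doing some light cleanup. It assumes the pypdf library and a hypothetical file name; use whatever parser fits your actual file types.

```python
# Minimal text extraction and cleanup sketch (assumes `pip install pypdf`).
import re

from pypdf import PdfReader

def extract_clean_text(path: str) -> str:
    reader = PdfReader(path)
    # Pull text out of every page; extract_text() can return None for image-only pages.
    pages = [page.extract_text() or "" for page in reader.pages]
    text = "\n".join(pages)
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # squash excessive blank lines
    return text.strip()

# Hypothetical file name for illustration:
# handbook_text = extract_clean_text("employee_handbook.pdf")
```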
2. Chunk Your Data into Tokens
No we aren’t at Chuck-e-Cheese where you chunk your dollar into tokens and use them to win prizes… wait or are we?
Whether it is structured data (rows and columns from tables) or unstructured data (media or documents), it needs to be chunked up into bite-sized pieces the LLM can ingest.
Chunking is the process of dividing larger pieces of information into smaller, manageable units. This is crucial for RAG systems to effectively retrieve and process information.
Different models often require different chunk sizes due to their varying context window sizes. Here’s why:
- Context Window: The maximum number of tokens a model can process at once.
- Varies by model: e.g., GPT-3 (4096 tokens), GPT-4 (8192 or 32768 tokens), BERT (512 tokens).
- Chunk Size Considerations:
- Should be smaller than the model’s context window.
- Leaves room for prompts and generated text.
- Typically aim for 50-75% of the context window size.
Context windows are measured in tokens, not words. A token is a piece of text that the model treats as a single unit. It can be a word, part of a word, or a character, depending on the tokenization method used.
Examples:
- In simple word-based tokenization: “hello” and “world” are separate tokens.
- In subword tokenization: “unhappy” might be split into “un” and “happy”.
- Punctuation marks are often separate tokens.
Purpose:
- Tokens help convert text into a format that machine learning models can process.
- They serve as the basic units for model input and output.
Once you have defined the parameters and your data is sufficiently chunked into the right number of tokens to fit the context window, it’s time for the next step UNLESS you want to get really fancy.
After the data is chunked, you can add metadata to the chunks (e.g. data source, date, category) to give the LLM even more properties to make inferences from. This metadata lives alongside the vector embeddings, which we will dive into in the next step.
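Here is a minimal chunking sketch to tie the ideas above together. It assumes the tiktoken tokenizer, and the chunk size, overlap, and metadata values are illustrative; tune them to the context window of the models you end up choosing.

```python
# Minimal token-based chunking sketch (assumes `pip install tiktoken`).
# Chunk size and overlap are illustrative; aim for 50-75% of your model's context window.
import tiktoken

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by many recent OpenAI models
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        start += chunk_size - overlap  # slide with overlap so ideas aren't cut mid-thought
    return chunks

# Attach metadata alongside each chunk for richer filtering later (hypothetical values).
document_text = "Employees accrue 20 days of PTO per year..."  # stand-in for a real document
chunks = [
    {"text": c, "source": "employee_handbook.pdf", "category": "HR"}
    for c in chunk_text(document_text)
]
```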
3. Vectorize Your Data
To make RAG work, your data needs to be in a format that the language model can understand. This is done by converting the raw data into vector embeddings, which represent the data in a mathematical form the AI can process. Ensuring compatibility between your vectorized data and the model is critical—mismatched formats will prevent the system from functioning properly.
Vector embeddings are like giving objects special number tags that describe their features. These tags help computers organize and compare things by how similar their features are, just like you might group toys based on their color, size, and shape.
You use an Embedding Model to process the chunks into vector embeddings. The choice of embedding model should be compatible with your retrieval system and somewhat aligned with your LLM, but it doesn’t necessarily have to be the same as your LLM.
Common choices include models like:
- Sentence transformers (e.g., all-MiniLM-L6-v2)
- OpenAI’s text-embedding-ada-002
- Models from Hugging Face’s transformers library
The LLM itself (like GPT-3 or GPT-4) is typically not used for creating these embeddings, as it would be computationally expensive and unnecessary.
So, in summary: You choose an embedding model, then use it to vectorize your chunks. These vectors are what get stored and later used for retrieval when working with your chosen LLM in the RAG system.
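As a minimal sketch of this step, here is what vectorizing chunks can look like with one of the sentence-transformer models mentioned above; the example texts are stand-ins for your real chunks.

```python
# Minimal embedding sketch (assumes `pip install sentence-transformers`).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast model; 384-dimensional vectors

texts = [  # stand-ins for the chunks you produced in the previous step
    "Employees accrue 20 days of PTO per year.",
    "Expense reports are due by the 5th of each month.",
]
embeddings = model.encode(texts, normalize_embeddings=True)  # one vector per chunk
print(embeddings.shape)  # (2, 384)
```

These vectors (plus any metadata) are what you will store in the vector database in the next step.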
Yes… the amount of different types of models, new terms, and components is dizzying! Don’t worry… we are ALMOST there 🙂
4. Choose Your Vector Database
The decision on which vector database to use typically happens before or during the Index Creation step. Here’s when and how you might make this choice:
- When to choose:
- Early in the project planning phase.
- After you’ve determined your data volume and query patterns.
- Before you start large-scale data processing and ingestion.
- Factors influencing the choice:
- Scale: Expected data volume and query load.
- Performance requirements: Query latency and throughput needs.
- Integration: Compatibility with your existing tech stack.
- Managed vs. Self-hosted: Your team’s operational capacity.
- Cost: Both in terms of hosting and potential licensing fees.
- Features: Support for metadata filtering, multimodal data, etc.
- Scalability: Ability to handle growing data and user base.
- Common options and their strengths:
- DataStax Astra DB (My favorite! And my place of work): Fully managed, optimized for production-scale deployments offering both cloud and self-hosted options. Supports structured data and vector data.
- Pinecone: the household name for SaaS vector databases, great early on for GenAI projects at small scale, but once you get to enterprise scale and real-time performance requirements it doesn’t deliver. Supports vector data.
- Milvus: Open-source, highly scalable, good for large-scale deployments. Supports vector data.
- Microsoft Cosmos DB: Azure-native DB with all the Microsoft integrations, fully managed, and globally distributed in the cloud. Supports structured data and vector data.
- Qdrant: Rust-based, known for high performance, supports filtering. Supports structured data and vector data.
- Elasticsearch with vector plugin: Good if you’re already using Elasticsearch.
- FAISS: Not a full database, but excellent for high-performance vector search, often used with other storage solutions.
Now before you choose, it’s often beneficial to run a small-scale proof of concept with a few options. This helps in understanding real-world performance and integration challenges. That is why I love Astra DB: the open source tool Langflow is directly integrated, making it quick and easy to run POCs with different models and test RAG outputs with your own data. More to come on this! We will talk about “abstraction layers” like Langflow and Langchain later, rather than painstakingly building multiple POCs from hand-written code.
5. Index Creation
Now that your data is in a readable format for an LLM, it’s time for indexing! You can apply the concept of indexing from what you may already know in Microsoft Excel or traditional databases. The purpose of indexing is to speed up data retrieval operations and quickly locate specific rows based on column values.
Given VLOOKUP was the first sophisticated function I learned in Excel, putting things in the context of VLOOKUP always helps me understand what exactly is happening.
VLOOKUP function:
- It’s a built-in Excel function for looking up data in a table or range.
- VLOOKUP performs a linear search through the first column of the specified range.
How VLOOKUP works:
- It starts at the top of the first column and scans down until it finds a match or reaches the end.
- This is essentially a brute-force search method.
Performance implications:
- For small datasets, VLOOKUP is quick and efficient.
- For large datasets, it can become slow as it has to scan through many rows.
Now instead of VLOOKUP, if you want to get REALLY fancy and you’re working with a ton of data, you will want to switch to the INDEX-MATCH combination:
- INDEX retrieves a value from a specified position in a range
- MATCH finds the position of a lookup value within a range
Why it’s faster:
a. Column independence:
- VLOOKUP requires the lookup column to be the leftmost in the table
- INDEX-MATCH can look up and return values from any columns
- This flexibility often means less data manipulation and fewer calculations
b. Search efficiency:
- MATCH function uses a binary search algorithm for sorted data
- This is much faster than VLOOKUP’s linear search, especially in large datasets
c. Range reference:
- VLOOKUP references the entire table for each lookup
- INDEX-MATCH only references the specific columns needed
Performance impact:
- For small datasets: Negligible difference
- For large datasets: INDEX-MATCH can be significantly faster
- The performance gap widens as the dataset size increases
Memory usage:
- INDEX-MATCH typically uses less memory
- This is because it doesn’t need to reference the entire table for each lookup
OK, now time to apply this concept to Index Creation for vector embeddings!
Purpose of Index Creation:
- Enable fast similarity searches over large numbers of vector embeddings
- Efficiently retrieve the most relevant chunks when given a query
Why it’s Necessary:
- Brute-force comparison of a query against millions of vectors is too slow
- Indexes use specialized data structures to speed up similarity searches
How Index Creation Works:
- Organize vectors into a searchable structure (e.g., trees, graphs, or quantized representations)
- Optimize for approximate nearest neighbor (ANN) search
- Balance between search speed and accuracy
Key Concepts in Indexing:
- Dimensionality reduction: Compress high-dimensional vectors
- Clustering: Group similar vectors together
- Quantization: Represent vectors with fewer bits
Process:
A. Choose an indexing method based on your needs (speed, accuracy, scalability)
B. Initialize the index with parameters suitable for your data
C. Add your vector embeddings to the index
D. Optionally, train the index for better performance
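For the curious, here is a minimal sketch of those steps using FAISS (one of the options from the vector database list); the dimensions, cluster count, and random vectors are purely illustrative stand-ins for your real embeddings.

```python
# Minimal approximate nearest neighbor (ANN) index sketch with FAISS
# (assumes `pip install faiss-cpu numpy`). Random vectors stand in for real embeddings.
import numpy as np
import faiss

dim = 384                                   # must match your embedding model's output size
embeddings = np.random.rand(10_000, dim).astype("float32")

# A/B. Choose and initialize an index: IVF clusters vectors so queries scan only a few clusters.
nlist = 100                                 # number of clusters
quantizer = faiss.IndexFlatL2(dim)          # exact index used to assign vectors to clusters
index = faiss.IndexIVFFlat(quantizer, dim, nlist)

# D. Train first: IVF has to learn cluster centroids before vectors can be added.
index.train(embeddings)

# C. Add your vectors to the index.
index.add(embeddings)

# Query time: retrieve the 5 chunks closest to a query embedding.
query = np.random.rand(1, dim).astype("float32")
index.nprobe = 10                           # clusters to scan per query: speed vs. accuracy knob
distances, ids = index.search(query, 5)
print(ids[0])                               # positions of the most similar vectors
```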
However, it’s important to note that in practice, these steps often overlap or are handled seamlessly by the vector database system. Here’s why:
- Many modern vector databases handle indexing internally (like DataStax Astra DB):
- When you ingest data, the database often automatically creates and updates its index.
- You don’t always need to create the index structure separately.
- Some systems allow for real-time indexing (like DataStax Astra DB):
- You can ingest data continuously, and the index updates in real-time.
- The process can be streamlined:
- Some vector databases offer APIs that accept raw text, automatically handle chunking and embedding, and then ingest the resulting vectors.
- Some tools like Unstructured.io pull data from different data file types and do the chunking and embedding on the fly then ingest the resulting vectors.
By creating an efficient index, you’re essentially building a “smart phonebook” for your vector embeddings. This allows your RAG system to quickly find the most relevant information without having to exhaustively search through every single vector, greatly enhancing the speed and efficiency of the retrieval process.
6. Quality Check
Just like it’s crucial to have quality data, it’s also crucial to have strong quality checks in place for the data you have pulled into your vector database and indexed. You can do this by:
- Manual review of retrieved results for sample queries.
- Automated testing with predefined query-result pairs (see the sketch after this list).
- Fine-tune chunk size if context is insufficient or redundant.
- Experiment with different embedding models if semantic matching is poor.
- Adjust index parameters (e.g., number of clusters, search depth) for better precision/recall balance.
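For the automated piece, even a simple recall-style check against a handful of hand-labeled query-result pairs goes a long way. Here is a minimal sketch, assuming a hypothetical search function that returns chunk IDs from your vector database:

```python
# Minimal retrieval quality check: does the expected chunk show up in the top-k results?
# `search` is a stand-in for whatever query call your vector database exposes.
test_cases = [  # hypothetical hand-labeled query/result pairs
    {"query": "How many PTO days do employees get?", "expected_chunk_id": "handbook_014"},
    {"query": "When are expense reports due?", "expected_chunk_id": "finance_003"},
]

def recall_at_k(search, k: int = 5) -> float:
    hits = 0
    for case in test_cases:
        retrieved_ids = search(case["query"], k)       # top-k chunk IDs from your retriever
        if case["expected_chunk_id"] in retrieved_ids:
            hits += 1
    return hits / len(test_cases)

# Example usage once your retriever exists:
# print(f"recall@5 = {recall_at_k(my_search):.0%}")
```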
This is typically the “uh-oh” point for many projects as you realize your data isn’t as clean as you expected it to be. Common issues that get surfaced are:
- Inconsistent formatting: Even within the same dataset, you might find multiple date formats, inconsistent spacing, or varying representations of the same information
- HTML artifacts: Remnants of web scraping (stray tags or entities), incomplete tag stripping, or encoded characters that slipped through
- Duplicate content with slight variations: The same information presented with minor differences that create nearly identical vectors and noise in your results
- Missing context: Chunks that made sense in the original document but lose critical context when separated
- OCR errors: If dealing with scanned documents, misread characters or formatting that creates nonsensical content
- Boilerplate pollution: Headers, footers, and standard disclaimers that add noise to your vector space
- Versioning confusion: Multiple versions of the same document without clear indicators of which is most current
- Inconsistent metadata: Missing or contradictory tags, categories, or other metadata that should help filter results
The challenge is that these issues often don’t become apparent until you’re actually testing your retrieval results, and fixing them usually requires revisiting your entire data processing pipeline. This can be a MAJOR bummer, but you aren’t alone! Many organizations face these issues. Overcoming them will take dedicated teamwork.
7. Choose Your Language Model
We have made it to the point where we can finally start using in-vogue terms like LLM! What does LLM actually mean, and what choices do we have at this point?
Large Language Models (LLMs) like GPT-4, Claude, and PaLM are massive models with 100B+ parameters. They are excellent at complex reasoning, understanding context, and generating human-like responses (not sentient!). LLMs are typically accessed through cloud APIs and have higher latency and cost compared to smaller models. Because they are API-based, it is easy to spin up keys and plug the model into your RAG architecture.
Best for: complex reasoning tasks, creative content generation, nuanced understanding
Small Language Models (SLMs) are models like Mistral 7B, Llama 2 13B, and Microsoft’s Phi models. These models can run on edge devices or local hardware, and yes, even your laptop (a quick sketch of running one locally follows the list of use cases below)! They have faster inference times and lower cost, which makes them enticing, although they have limited context windows and reasoning capabilities compared to LLMs.
Best for: edge processing, real-time applications, specific domain tasks
Examples of edge processing use cases:
- Local chat bots without internet connectivity
- IoT devices requiring natural language processing
- Privacy-sensitive applications where data can’t leave the device
- Mobile applications where low latency is crucial
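To show how approachable SLMs are, here is a minimal sketch of running one locally with the Hugging Face transformers library. The model name and generation settings are illustrative; pick whatever fits your hardware and task.

```python
# Minimal local SLM sketch (assumes `pip install transformers torch`).
from transformers import pipeline

# microsoft/phi-2 is a small model that can run on a laptop (slowly, on CPU).
generator = pipeline("text-generation", model="microsoft/phi-2")

prompt = "Summarize this policy in one sentence: employees accrue 20 days of PTO per year."
result = generator(prompt, max_new_tokens=60, do_sample=False)
print(result[0]["generated_text"])
```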
Foundation Models are pre-trained models that serve as a base for fine-tuning. These models can be adapted to specific domains or tasks through additional training; examples include GPT, BERT, RoBERTa, and T5. They provide a starting point for creating specialized models, which (shocker) is expensive and requires access to GPU compute. Most organizations don’t go down the route of training or fine-tuning their own models because of the resource intensiveness; that work is typically done by ISVs and SaaS companies.
Now you may ask, where do I find these models? How do I compare them? Here are the most common.
Model Hubs and Marketplaces:
- Hugging Face Hub: Largest repository of open-source models
- Azure AI Studio: Enterprise-ready models with Azure integration
- Amazon SageMaker: AWS’s collection of pre-trained models
- Google Cloud Model Garden: Collection of Google’s foundation models
- Anthropic’s Claude Models: Specialized for safety and reasoning
- GitHub repositories of major AI labs
- Meta’s open source models
Hugging Face is the most common with tons of analytics on each model to help you choose the right one.
Considerations for Model Selection:
- Inference costs and pricing models
- Hardware requirements and deployment constraints
- Licensing and usage restrictions
- Community support and documentation
- Fine-tuning capabilities and requirements
- Model update frequency and maintenance
- Ethical considerations and bias evaluations (IMPORTANT!!)
8. Test, Tune, Iterate
Now we can FINALLY put it all together!! Testing, tuning, and iterating is the name of the game: put your chosen architecture to the test and make sure the components you chose chunk your data effectively and provide the most relevant response outputs. This is where tools like Langchain and Langflow help the most. Here are the things to consider for each step.
Test: It is best to use a diverse set of queries covering expected use cases and to run these sets programmatically when testing outputs of your RAG system. Make sure to include edge cases and potential failure modes so you can see how the LLM handles them. An easy example is asking the same prompt while telling the RAG system you are a boy versus a girl. Do you get a different response? This is checking for bias. How about if you ask the same question twice? Or the same question, phrased slightly differently? This is checking for consistency.
Here are evaluation metrics for you to track (a small sketch of a programmatic test loop follows the list):
- Relevance of retrieved information.
- Quality and accuracy of LLM responses.
- Response time and system efficiency.
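Here is a minimal sketch of that kind of programmatic test loop, assuming a hypothetical rag_answer function that wraps your retrieve-then-generate pipeline; the queries and checks are illustrative, not exhaustive.

```python
# Minimal programmatic test loop for a RAG pipeline. `rag_answer` is a stand-in for your own
# retrieve-then-generate function.
import time

test_queries = [
    "How many PTO days do new employees get?",
    "How many PTO days do new employees get?",                # repeated on purpose: consistency check
    "What's the PTO allowance for someone who just joined?",  # same intent, different wording
]

def run_tests(rag_answer):
    results = []
    for query in test_queries:
        start = time.perf_counter()
        answer = rag_answer(query)
        latency = time.perf_counter() - start                 # response time / efficiency metric
        results.append({"query": query, "answer": answer, "latency_s": round(latency, 2)})
    # Identical queries should produce (near-)identical answers.
    consistent = results[0]["answer"] == results[1]["answer"]
    return results, consistent
```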
Tune: Refine each component based on test results. This might involve adjusting chunking strategies, trying different embedding models, adding guard rails, or fine-tuning prompts.
Iterate: go back to testing and try again! Once you have a good blueprint for your architecture, you can plan for a production push.
Common Mistakes to Avoid in GenAI/RAG Deployment
Mistakes during the design and deployment of RAG infrastructure can lead to inefficiencies, security vulnerabilities, or outright system failures. Here are some of the most common mistakes—and how to avoid them.
1. Misaligned Data and Models
Perhaps the most frequent error is failing to properly align your data with the model. For example, if your vector embeddings don’t match what the language model expects, the system won’t be able to interpret the data correctly. Think of it as trying to fit a square peg into a round hole—without alignment, nothing works as it should.
2. Inefficient Orchestration
Poor orchestration can cripple your system. If the vector database and the LLM aren’t communicating smoothly, you’ll experience delays and incomplete responses. Ensure your orchestration engine is robust and capable of handling complex queries without losing speed or accuracy.
3. Overlooking Scalability
Many teams fail to plan for the inevitable growth of data and demand. Without scalable infrastructure, you could end up with bottlenecks that slow down performance or require expensive retrofits. Always plan for expansion, even if your current needs are modest.
4. Neglecting Security
With sensitive data being processed and retrieved, neglecting security is a surefire way to invite trouble. Weak encryption or poor access controls can expose your system to malicious attacks or data breaches. Prioritize security from the outset to protect both the system and the data it handles.
Overarching Concepts for Your RAG System
Optimize Orchestration
Orchestration is the glue that holds the RAG system together. It manages the flow between the language model and the vector database, ensuring that queries are processed smoothly. Whether you use Python scripts or specialized orchestration tools, the key is seamless communication between components to avoid bottlenecks or delays in response time.
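To make that flow concrete, here is a minimal orchestration sketch in plain Python: embed the query, retrieve the closest chunks, assemble a grounded prompt, and call the LLM. The embedding model, the LLM model name, and the search helper are stand-ins for whichever components you chose in the earlier steps.

```python
# Minimal RAG orchestration sketch (assumes `pip install sentence-transformers openai` and an
# OPENAI_API_KEY in the environment). The `search` helper is a stand-in for your vector database.
from openai import OpenAI
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
llm = OpenAI()

def search(query_vector, k: int = 5) -> list[str]:
    """Stand-in for your vector database query; should return the top-k chunk texts."""
    raise NotImplementedError("wire this up to the vector database you chose")

def answer(question: str) -> str:
    query_vector = embedder.encode([question])[0]      # 1. embed the question
    context_chunks = search(query_vector, k=5)         # 2. retrieve the most relevant chunks
    context = "\n\n".join(context_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = llm.chat.completions.create(            # 3. generate a grounded response
        model="gpt-4o-mini",                           # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

An abstraction layer like Langflow or Langchain can handle this same flow visually or declaratively, but it helps to see how little glue code is actually involved.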
Scale Thoughtfully
As data and usage grow, so will the demands on your infrastructure. Start with a system that can scale easily, whether that means leveraging cloud-based storage, high-performance databases, or advanced orchestration tools. A scalable infrastructure will prevent performance degradation as your datasets expand.
Prioritize Security
RAG systems often deal with sensitive data, so building in security measures from the beginning is non-negotiable. This includes encryption, secure access controls, and data governance policies to ensure compliance with privacy regulations. Don’t leave security as an afterthought; it should be baked into your infrastructure from day one.
A New Age of AI Infrastructure
Building and maintaining GenAI/RAG infrastructure requires more than just technical expertise. It involves a deep understanding of how data flows through the system, how to scale efficiently, and how to secure sensitive information. By mastering the best practices outlined here, and avoiding the common pitfalls, you’ll be well-positioned to deploy a system that maximizes the potential of both generative AI and RAG.
Conclusion: Is Your Data Ready for the Future?
PHEW, that was a lot! In a world where AI systems are becoming integral to business operations, having a strong infrastructure is critical. Whether you’re just getting started with GenAI or looking to optimize your RAG setup, the key is to focus on alignment, orchestration, scalability, and security. Want to dive deeper into optimizing your AI systems? Get in touch with us to explore how we can help you build a future-proof infrastructure tailored to your needs.
Disclaimer: I am an employee of DataStax, the company that develops Astra DB and Langflow. While I am compensated as a salaried employee, I do not receive any additional commission or affiliate marketing income for discussing or promoting Astra DB. The views and opinions expressed in this blog post are my own and are based on my professional experience and knowledge of the product. I strive to provide accurate and helpful information, but I encourage readers to conduct their own research and make informed decisions based on their specific needs and circumstances.