9 min read 10 June

Building multitenant RAG the AWS-native way

A practical guide to designing multitenant Retrieval-Augmented Generation systems using AWS-native services — from vector stores to tenancy patterns.

Amazon Bedrock
Amazon OpenSearch
Amazon S3

Paweł Grabiński AI Engineer

If you’ve ever tried to build a multitenant application, you know the feeling — it starts as a clean architecture decision and quickly spirals into a maze of access controls, data isolation questions, and cost trade-offs. Now add vector search and RAG into the mix, and things get interesting fast.

I recently gave a webinar on this exact topic, and after the session I realized the content deserved a more permanent home. Building multitenant RAG on AWS doesn’t have to be painful — but it does require making the right choices early. I did the research, so you don’t have to.

In this guide, I’ll walk you through the full picture: what RAG actually is (and why it’s not going anywhere), what AWS gives you to build it, which vector stores to pick, and how to design multitenancy patterns that balance security, cost, and performance.

First things first: RAG is not dead

[Image: Batman slapping Robin meme. Robin: "RAG is dead..." Batman: "RAG is agentic!"]

Every few months a post appears on LinkedIn proclaiming that RAG is dead. These claims usually fall into two categories: either the implementation architecture has changed (true), or context windows are now large enough to make retrieval unnecessary (not true). Neither means RAG is going away.

RAG — Retrieval-Augmented Generation — is, at its core, a system design pattern. It means retrieving any data that may be relevant to the task at hand and using it to improve generation results. That’s it. As long as generative models benefit from additional context, RAG will exist. Only the implementation details change over time.

The “context window” argument deserves a closer look. Yes, models now routinely offer 200K tokens, and some go up to a million. But stuffing everything into the context is slow, expensive, and suffers from context rot: the well-documented phenomenon where models lose accuracy as context grows. And what if you have thousands of documents? You simply can’t fit an entire corporate knowledge base into a prompt. One way or another, you end up having to choose what goes into the context, and that choice is retrieval.

How RAG has evolved

The original RAG pipeline was beautifully simple: take a user query, retrieve relevant documents via vector similarity search, and feed them to an LLM for generation. A straight line from question to answer.

That simplicity was also its limitation. What if the query was ambiguous? What if relevant information was spread across multiple documents that reference each other? What if you needed five results for one question and fifty for another?

Enter agentic RAG. Modern RAG systems use ReAct-style agents that can reason about what to search for, execute multiple retrieval steps, adjust their queries based on intermediate results, and decide when they have enough context to generate an answer. For example, an agent asked about a specific department’s processes might first retrieve the company’s organizational structure, identify the relevant business unit, and then search specifically within that unit’s documentation. This iterative, intelligent retrieval finds information that the simple linear pipeline never could — like a contract amendment that references an original agreement three levels deep.
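To make the loop concrete, here is a toy sketch in Python. It is not a real agent framework: the in-memory `CORPUS`, the keyword-based `search`, and the hard-coded `plan_next_query` are all hypothetical stand-ins for a vector store and for the LLM's reasoning step. What it shows is the shape of the control flow: plan, retrieve, look at what came back, and decide whether to search again.

```python
from typing import Optional

# Hypothetical in-memory corpus standing in for a vector store.
CORPUS = {
    "org-chart": "Payments is part of the Finance business unit.",
    "finance-handbook": "Finance refunds are approved by the unit controller.",
}

def search(query: str) -> list[str]:
    """Toy retrieval: return documents containing any query term."""
    terms = query.lower().split()
    return [text for text in CORPUS.values()
            if any(term in text.lower() for term in terms)]

def plan_next_query(question: str, context: list[str]) -> Optional[str]:
    """Stand-in for the LLM's reasoning step.

    Returns the next search query, or None once the context looks sufficient.
    A real agent would decide this by prompting a model with the question
    and the context gathered so far.
    """
    if not context:
        return "payments"   # step 1: find out where the department sits
    if not any("refunds" in doc.lower() for doc in context):
        return "refunds"    # step 2: look for the process itself
    return None             # enough context to answer

def agentic_retrieve(question: str, max_steps: int = 5) -> list[str]:
    """Iteratively retrieve until the planner is satisfied or steps run out."""
    context: list[str] = []
    for _ in range(max_steps):
        query = plan_next_query(question, context)
        if query is None:
            break
        for doc in search(query):
            if doc not in context:  # avoid duplicate context entries
                context.append(doc)
    return context

result = agentic_retrieve("How does the Payments department approve refunds?")
```

The `max_steps` cap is a design choice worth keeping in a real system too: it bounds cost and latency when the planner never decides it has enough.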

So when someone says “RAG is dead,” what they really mean is that the naive 2023-era implementation has matured. RAG itself is stronger and more capable than ever.

What you need to build RAG

At a high level, you need two things: a generation model and a retrieval engine.

Generation is the straightforward part these days. Pick the latest generation of foundation models and you’ll be fine. Anthropic’s Claude on Amazon Bedrock is a solid default — strong quality and the security guarantees of running within AWS. There are always trade-offs between cost and capability across model families, but the differences between generations far outweigh the differences between providers within the same generation.

Retrieval is where the real engineering happens. You need an embedding model to convert text (or any data) into vector representations — sequences of numbers that encode semantic meaning. Similar vectors indicate similar meaning, which is the core assumption that makes semantic search work. It’s a leap of faith, but one that holds up remarkably well in practice. A good rule of thumb: pick the latest generation of embedding models with mid-sized vector dimensions, and you’ll be in good shape.
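The "similar vectors, similar meaning" assumption is easy to see with cosine similarity, a common way to compare embeddings. The sketch below uses hand-made three-dimensional vectors purely for illustration; the values are invented, and real embedding models produce hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy "embeddings"; a real system would get these from an embedding model.
EMBEDDINGS = {
    "invoice payment terms": [0.9, 0.1, 0.0],
    "billing due dates": [0.8, 0.2, 0.1],
    "office party photos": [0.0, 0.1, 0.9],
}

def most_similar(query_vec: list[float], k: int = 2) -> list[str]:
    """Rank stored texts by similarity to the query vector, best first."""
    ranked = sorted(
        EMBEDDINGS,
        key=lambda text: cosine_similarity(query_vec, EMBEDDINGS[text]),
        reverse=True,
    )
    return ranked[:k]

# A query vector pointing in the "finance" direction of this toy space.
top = most_similar([1.0, 0.0, 0.0])
```

Here the two finance-related texts rank above the unrelated one. A vector store does the same ranking, only with approximate nearest-neighbor indexes so it scales beyond a brute-force loop.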

Then you need somewhere to store those vectors and a way to search through them efficiently. This is where AWS-native choices become critical.

Going AWS-native with Bedrock Knowledge Bases

In theory, you can deploy any vector store on AWS — spin up an EC2 instance and run whatever you like. But to truly go AWS-native, the smart starting point is Amazon Bedrock Knowledge Bases.

Think of Knowledge Bases as an orchestration layer (similar to what LlamaIndex provides in the open-source world) that streamlines the entire data ingestion pipeline. They handle document parsing (with default, custom, or even foundation-model-powered parsers), chunking (fixed-length, semantic, hierarchical parent-child, or fully custom), embedding, and synchronization with your vector store. Synchronization is incremental: each sync job picks up changes in your data sources (additions, edits, deletions) and updates the vector store without reprocessing unchanged documents.
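If you have never looked inside a chunker, fixed-length chunking with overlap is the simplest strategy to picture. The sketch below is my own illustration, not what Knowledge Bases run internally: it counts whitespace-separated words instead of model tokens, and `chunk_fixed` with its parameters is a hypothetical helper.

```python
def chunk_fixed(text: str, max_words: int = 50, overlap: int = 10) -> list[str]:
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears whole in at least one chunk.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_fixed(sample, max_words=4, overlap=1)
```

With ten words, a chunk size of four, and an overlap of one, this yields three chunks, and each chunk starts with the last word of the previous one. Semantic and parent-child strategies replace the fixed window with boundaries derived from the content or the document structure.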

Data sources can be S3 buckets (where you can store anything), but Knowledge Bases also integrate with popular enterprise tools like SharePoint, Confluence, and Salesforce. Setting it up is mostly a matter of credentials and permissions.

A recent addition worth noting is Bedrock Data Automation — essentially intelligent document processing and OCR capabilities that let you reliably extract structured information from complex documents, including tables and non-text elements.

Choosing your vector store

Knowledge Bases support several vector store backends, but the decision usually comes down to two primary choices and a handful of alternatives.

The main contenders: OpenSearch vs. S3 Vectors

Amazon OpenSearch is the flagship option, available as either a managed cluster or a serverless collection. It delivers strong retrieval performance and — crucially — supports not just semantic search but also lexical (keyword) search and hybrid combinations of both. This matters more than you might think. Imagine searching project documentation for a specific technology name: semantic search might find conceptually related content, but a keyword search will give you the complete list of every project that mentions it. OpenSearch lets you do both.
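When the store sits behind a Knowledge Base, switching between semantic and hybrid search is a per-query setting. Here is a minimal sketch of building such a request in Python. The field names follow the Bedrock Agent Runtime Retrieve API as I understand it, so verify them against the current documentation; `build_retrieve_request` and the Knowledge Base ID are hypothetical. Nothing here calls AWS; the function only assembles the payload.

```python
def build_retrieve_request(kb_id: str, query: str,
                           hybrid: bool = True, top_k: int = 5) -> dict:
    """Assemble keyword arguments for a Knowledge Base retrieve call."""
    return {
        "knowledgeBaseId": kb_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {
                "numberOfResults": top_k,
                # HYBRID combines keyword and vector matching;
                # SEMANTIC uses vector similarity only.
                "overrideSearchType": "HYBRID" if hybrid else "SEMANTIC",
            }
        },
    }

# "KB_ID_PLACEHOLDER" is a made-up ID for illustration.
request = build_retrieve_request("KB_ID_PLACEHOLDER", "projects using Kafka")

# With boto3 installed and credentials configured, the payload would be sent as:
#   import boto3
#   client = boto3.client("bedrock-agent-runtime")
#   response = client.retrieve(**request)
```

Keeping the request builder separate from the network call makes it easy to unit-test the search settings without touching AWS.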

The trade-off is cost. OpenSearch isn’t pay-per-use: even the “serverless” option bills a minimum number of OpenSearch Compute Units (OCUs) whether or not you run any queries. It’s often one of the largest line items in a PoC budget. A single collection can serve multiple Knowledge Bases, each with its own index, which spreads that fixed cost across them.

Amazon S3 Vectors is the newer entrant — a fully serverless, pay-per-use vector store with pricing that mirrors S3 itself: you pay for storage and requests. It’s significantly cheaper than OpenSearch, making it excellent for PoCs, batch workloads, and applications where sub-second latency isn’t critical. The retrieval is slower, and it doesn’t support lexical search on its own. But here’s an interesting option: you can combine S3 Vectors for storage with OpenSearch for retrieval, getting near-OpenSearch performance with much better pricing for large-scale, infrequently accessed data.

The rule of thumb: OpenSearch when you need low latency, S3 Vectors when you need low cost.

The alternatives

Aurora PostgreSQL with pgvector shines when you need to combine semantic search with traditional SQL queries — for example, filtering by structured metadata fields before or alongside vector similarity. It used to be the budget option, but S3 Vectors has largely taken that spot.

Pinecone rivals or exceeds OpenSearch in retrieval benchmarks and offers hybrid search. It is currently cloud-only, with a bring-your-own-cloud option coming soon. Its networking and security controls are less granular than OpenSearch’s, which matters in enterprise environments.

Redis Enterprise is the ultra-low-latency choice — think voicebots where every millisecond counts because a user is waiting on the other end of a phone call. It’s RAM-only, which limits scale, but for real-time multi-query agent scenarios it can be the right tool.

Amazon Neptune Analytics is for graph RAG — when your documents are densely interconnected with references, amendments, and cross-links (think legal contracts). It’s powerful but complex, and many teams report after implementation that a simpler approach would have sufficed. Proceed with caution.

Multitenancy patterns for RAG

Now for the main event. Multitenancy isn’t unique to RAG, but the way it maps to vector stores and Knowledge Bases creates specific trade-offs worth understanding deeply.

There are three established patterns: Pool, Bridge, and Silo. They range from fully shared infrastructure to fully isolated environments.

Multitenant architectural patterns — Pool, Bridge, and Silo

Pool pattern: shared everything

In the Pool pattern, all tenants share the same application, the same vector store instance, and the same index. Separation is handled entirely through application logic — specifically, metadata filtering. When you ingest documents, you tag each chunk with a tenant identifier (a department ID, user ID, or organization ID) as metadata. At query time, you filter results to only return chunks matching the current tenant.

Bedrock Knowledge Bases supports metadata ingestion and filtering out of the box with any compatible vector store. The key detail: make sure tenant IDs are stored as filterable metadata, not embedded into the vector itself.
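In practice this comes down to two small pieces: a metadata file stored next to each document at ingestion time, and a filter attached to every query. The sketch below shows both in Python. The shapes follow the Knowledge Bases metadata file convention for S3 sources and the Retrieve API filter syntax as I understand them, so check them against the current documentation; the `tenant_id` attribute name and all helper functions are my own hypothetical choices.

```python
import json

TENANT_KEY = "tenant_id"  # hypothetical metadata attribute name

def metadata_sidecar(tenant_id: str) -> str:
    """Contents of the '<document>.metadata.json' file placed next to a document in S3."""
    return json.dumps({"metadataAttributes": {TENANT_KEY: tenant_id}})

def tenant_filter(tenant_id: str) -> dict:
    """Filter that restricts results to chunks tagged with this tenant."""
    if not tenant_id:
        # Fail closed: an empty tenant must never turn into an unfiltered query.
        raise ValueError("tenant_id is required; never query the pool unfiltered")
    return {"equals": {"key": TENANT_KEY, "value": tenant_id}}

def pooled_retrieve_request(kb_id: str, query: str,
                            tenant_id: str, top_k: int = 5) -> dict:
    """Build a retrieve payload that always carries the tenant filter."""
    return {
        "knowledgeBaseId": kb_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {
                "numberOfResults": top_k,
                "filter": tenant_filter(tenant_id),
            }
        },
    }

sidecar = metadata_sidecar("acme")
request = pooled_retrieve_request("KB_ID_PLACEHOLDER", "pricing policy", "acme")
```

The important design choice is that callers cannot build a request without a tenant: the filter is injected in one place, derived from the authenticated session rather than from user input.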

Cost: Since all tenants share a single instance, its fixed cost is spread across them, so there’s little financial incentive to pick the cheapest store. Go with OpenSearch or another high-performance option.

Scale consideration: A single index for all tenants means search latency grows as data grows. Vertical scaling helps up to a point, but very large deployments (think 10,000+ employees) may eventually need to move to a pattern with more separation.

Risk: One incorrectly defined metadata filter can leak data between tenants. This pattern requires thorough testing of your access control logic.
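One way to test for this is a leak check that runs every tenant against a set of queries and flags any returned chunk belonging to someone else. The sketch below uses a toy in-memory store; `Chunk`, `STORE`, and both retrieval functions are hypothetical, and in a real test suite `retrieve_fn` would wrap your actual retrieval path.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Chunk:
    tenant_id: str
    text: str

# Toy shared index containing chunks from two tenants.
STORE = [
    Chunk("acme", "Acme pricing policy"),
    Chunk("globex", "Globex pricing policy"),
    Chunk("acme", "Acme onboarding guide"),
]

def retrieve(query: str, tenant_id: str) -> list[Chunk]:
    """Correct retrieval: keyword match restricted to the caller's tenant."""
    return [c for c in STORE
            if c.tenant_id == tenant_id and query.lower() in c.text.lower()]

def retrieve_unfiltered(query: str, tenant_id: str) -> list[Chunk]:
    """Buggy retrieval: the tenant filter was forgotten."""
    return [c for c in STORE if query.lower() in c.text.lower()]

def find_leaks(retrieve_fn: Callable[[str, str], list[Chunk]],
               tenants: list[str], queries: list[str]) -> list[tuple[str, str, Chunk]]:
    """Return (tenant, query, chunk) for every chunk served to the wrong tenant."""
    leaks = []
    for tenant in tenants:
        for query in queries:
            for chunk in retrieve_fn(query, tenant):
                if chunk.tenant_id != tenant:
                    leaks.append((tenant, query, chunk))
    return leaks

tenants = ["acme", "globex"]
queries = ["pricing", "onboarding"]
safe_leaks = find_leaks(retrieve, tenants, queries)
buggy_leaks = find_leaks(retrieve_unfiltered, tenants, queries)
```

The filtered version produces no leaks, while the unfiltered one is caught immediately. Running a check like this in CI against a seeded test index is cheap insurance for the Pool pattern.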

Bridge pattern: shared compute, separated data

The Bridge pattern keeps a shared application layer but physically separates data. In AWS terms, this means creating a separate Knowledge Base per tenant, each with its own index in the shared OpenSearch cluster or collection.

This is often the sweet spot. You get physical data separation (queries go to a specific Knowledge Base ID, making accidental cross-tenant access much harder) while keeping infrastructure costs reasonable — it’s still one cluster serving all tenants.

The caveat: AWS imposes quotas on the number of Knowledge Bases you can create. For business-unit-level tenancy (tens of tenants), this is rarely a problem. For per-user tenancy (thousands of tenants), you’ll likely need to request quota increases — which AWS may or may not grant depending on the scale.
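On the application side, the Bridge pattern mostly means a routing table from tenant to Knowledge Base ID. A minimal sketch in Python follows; the mapping, the IDs, and the helper names are hypothetical, and in a real system the table would live in a database or parameter store rather than in code. As before, the request shape follows the Retrieve API as I understand it.

```python
# Hypothetical tenant-to-Knowledge-Base mapping with made-up IDs.
TENANT_KB_IDS = {
    "finance": "KBFINANCE01",
    "legal": "KBLEGAL0001",
}

class UnknownTenantError(KeyError):
    """Raised when a tenant has no Knowledge Base assigned."""

def kb_id_for(tenant_id: str) -> str:
    """Resolve the tenant's Knowledge Base, failing closed on unknown tenants."""
    try:
        return TENANT_KB_IDS[tenant_id]
    except KeyError:
        raise UnknownTenantError(tenant_id) from None

def bridge_retrieve_request(tenant_id: str, query: str, top_k: int = 5) -> dict:
    """Build a retrieve payload routed to the tenant's own Knowledge Base."""
    return {
        "knowledgeBaseId": kb_id_for(tenant_id),
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    }

request = bridge_retrieve_request("finance", "travel expense limits")
```

Note that there is no shared fallback Knowledge Base: an unknown tenant raises an error instead of silently querying someone else's data.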

Silo pattern: fully isolated

In the Silo pattern, each tenant gets their own vector store instance, their own application environment, and their own everything. Maximum isolation, maximum security, and maximum compliance guarantees.

The trade-off is obvious: cost. Running a separate OpenSearch collection per tenant means paying infrastructure costs per tenant. For a SaaS product with hundreds of users, this becomes prohibitively expensive.

This is where S3 Vectors truly shines for the Silo pattern. Because it’s pay-per-use, a tenant who creates an account but rarely uses the system costs you almost nothing. For enterprise applications where data isolation is non-negotiable — insurance, legal, healthcare — S3 Vectors gives you full silo isolation at a fraction of the cost that OpenSearch would demand.

Complexity note: Deploying and managing separate environments per tenant is operationally heavier than pool or bridge. The mapping and routing infrastructure needs to be robust. But if your compliance requirements demand it, the architectural clarity of true isolation can actually simplify security auditing.

Putting it all together: pattern-store matching

Here’s the practical decision framework:

Pool + OpenSearch — Best for small-to-medium scale with simple tenant structures. Low cost (single instance), fast retrieval, but requires careful metadata filtering and thorough security testing.

Bridge + OpenSearch — The recommended default for most enterprise use cases. Physical data separation via separate Knowledge Bases, shared infrastructure costs, and strong performance. Watch the Knowledge Base quotas.

Silo + S3 Vectors — The go-to for compliance-heavy industries needing full tenant isolation. Pay-per-use pricing makes it economically viable even at scale. Accept the higher latency as a trade-off for isolation and cost efficiency.

Silo + OpenSearch — When you need both full isolation and low latency, and the budget supports it. Think high-value enterprise clients where each tenant justifies dedicated infrastructure.
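The four combinations above can be written down as a small decision function. This is only a sketch of the framework, not a substitute for judgment: the parameter names are mine, and the case of needing full isolation and low latency without the budget for dedicated infrastructure is not covered by the list, so treating it as Silo + S3 Vectors is my assumption.

```python
def recommend(full_isolation: bool, low_latency: bool,
              dedicated_budget: bool = False, small_and_simple: bool = False) -> str:
    """Map requirements to a pattern and vector store pairing."""
    if full_isolation:
        if low_latency and dedicated_budget:
            return "Silo + OpenSearch"
        # Assumption: without budget for dedicated clusters, accept higher latency.
        return "Silo + S3 Vectors"
    if small_and_simple:
        return "Pool + OpenSearch"
    # Default for most enterprise use cases.
    return "Bridge + OpenSearch"

default_choice = recommend(full_isolation=False, low_latency=True)
compliance_choice = recommend(full_isolation=True, low_latency=False)
```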

Conclusion: your next steps

Building multitenant RAG on AWS isn’t as daunting as it might seem — the key is making informed choices about where to put the complexity. AWS-native services like Bedrock Knowledge Bases, OpenSearch, and S3 Vectors give you a solid foundation, and the Pool/Bridge/Silo framework gives you a clear mental model for the trade-offs.

If you’re starting a new project, my recommendations:

Start with Bridge + OpenSearch for most use cases — it gives you physical data separation with manageable costs and complexity.
Use Bedrock Knowledge Bases to handle the ingestion pipeline — don’t reinvent parsing, chunking, and synchronization.
Consider S3 Vectors for Silo patterns when compliance requires full isolation — the pay-per-use model changes the economics dramatically.
Plan your tenancy model early — the choice between per-user, per-team, or per-organization tenancy drives everything downstream.

Have questions about building multitenant RAG on AWS? Reach out to us at hello@chaosgears.com or visit chaosgears.com. I’m also always happy to connect on LinkedIn.
