Retrieval-augmented generation (RAG) puts relevant documents in the prompt before the model answers. It's the right answer when the model needs information it doesn't have, but it's also the most over-engineered part of most AI apps.
Before reaching for RAG
- Does the corpus fit in context? Frontier models hold 200K to 2M tokens. If your knowledge base is under 100K tokens, just paste it in and use prompt caching so the repeated prefix stays cheap.
- Is the question really about retrieval? Many "RAG" problems are actually classification or extraction. Don't retrieve when you can deterministically look up.
- Do you need fresh data? A web search tool is often a better fit than a vector store.
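The first check above can be approximated before any infrastructure exists. The sketch below is a rough illustration only: the 4-characters-per-token ratio is a common rule of thumb rather than a property of any particular tokenizer, and `estimate_tokens`, `fits_in_context`, and the sample documents are hypothetical names and data introduced here.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. Real tokenizers vary by model and language."""
    return int(len(text) / chars_per_token)


def fits_in_context(docs: list[str], budget_tokens: int = 100_000) -> bool:
    """True if the whole corpus is probably small enough to paste into the prompt."""
    total = sum(estimate_tokens(d) for d in docs)
    return total <= budget_tokens


# Hypothetical two-document knowledge base.
docs = [
    "Refunds are processed within five business days.",
    "Support is available on weekdays from nine to five.",
]

if fits_in_context(docs):
    # Keep the corpus as a stable prefix so it is a good candidate for prompt caching.
    prompt_prefix = "\n\n".join(docs)
```

If the estimate lands near the budget, counting with the target model's actual tokenizer is the safer move before deciding either way.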
The minimal pipeline
- Chunk the source documents. Start with paragraphs or fixed token windows. Refine later.
- Embed each chunk with a model like text-embedding-3-large or Voyage voyage-3.
- Store in any vector database: Postgres with pgvector, Qdrant, Pinecone, Turbopuffer.
- Retrieve the top K chunks by cosine similarity to the query.
- Rerank with a cross-encoder (Cohere Rerank, Voyage Rerank). Big quality win for small cost.
- Stuff the top reranked chunks into the prompt, labelling each one with its source so the answer can cite it.
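The steps above can be sketched end to end in plain Python. This is a toy under stated assumptions, not a production pipeline: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, the "vector database" is an in-memory list, and the rerank step is omitted because it needs a hosted or trained cross-encoder. All function names and the sample document are hypothetical.

```python
import math
import re
import zlib

DIM = 1024  # toy embedding size (assumption; real models have fixed sizes of their own)


def chunk_paragraphs(doc: str) -> list[str]:
    """Split on blank lines: the simplest paragraph chunker."""
    return [p.strip() for p in re.split(r"\n\s*\n", doc) if p.strip()]


def toy_embed(text: str) -> list[float]:
    """Hashed bag-of-words, L2-normalised. Placeholder for a real embedding model."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(word.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product, which equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query: str, index: list[dict], k: int = 3) -> list[dict]:
    """Return the k chunks most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item["vector"]), reverse=True)
    return ranked[:k]


def build_prompt(query: str, hits: list[dict]) -> str:
    """Number each chunk so the model can cite it as [n]."""
    sources = "\n".join(f"[{i}] {h['text']}" for i, h in enumerate(hits, start=1))
    return (
        "Answer using only the sources below and cite them as [n].\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )


doc = (
    "Refunds are processed within five business days.\n\n"
    "Support is available on weekdays from nine to five.\n\n"
    "Shipping to Europe takes about one week."
)
index = [{"text": c, "vector": toy_embed(c)} for c in chunk_paragraphs(doc)]
hits = retrieve("When are refunds processed?", index, k=2)
prompt = build_prompt("When are refunds processed?", hits)
```

Swapping `toy_embed` for a real embedding call and the list for a vector store leaves the shape of the pipeline unchanged, which is why starting this small is usually safe.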
What moves the needle
- Hybrid search. Combine BM25 (keyword) with dense retrieval. Most corpora benefit.
- Reranking. Often more impact than swapping embedding models.
- Better chunking. Paragraph-aware, structure-aware, or per-document logic beats fixed windows.
- Query rewriting. Use the model to expand or rephrase the query before searching.
- Metadata filters. Filter by user, doc type, or date before running semantic search.
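Hybrid search needs some way to merge the keyword ranking and the dense ranking. One common option is reciprocal rank fusion (RRF), shown below as an illustration; the text above does not prescribe a fusion method, so treat the choice of RRF, the constant `k = 60`, and the chunk IDs as assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one.

    Each list contributes 1 / (k + rank) to an ID's score, with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


bm25_ranking = ["c3", "c1", "c7"]   # hypothetical keyword-search output
dense_ranking = ["c1", "c4", "c3"]  # hypothetical vector-search output

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['c1', 'c3', 'c4', 'c7']
```

RRF only looks at ranks, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales. Chunks that appear in both lists, like `c1` and `c3` here, rise above those found by only one retriever.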
What rarely moves the needle
- Swapping vector databases.
- Changing the embedding model from a recent good one to a different recent good one.
- HyDE, multi-query, and other clever query expansions on small corpora.
Reading