RAG

Retrieval-augmented generation (RAG) puts relevant documents in the prompt before the model answers. It's the right tool when the model needs information it doesn't have, but it's also the most over-engineered part of most AI apps.

Before reaching for RAG

Does the corpus fit in context? Frontier models have context windows of roughly 200K to 2M tokens. If your knowledge base is under 100K tokens, just paste it in and use prompt caching on that prefix so repeated calls stay cheap.
Is the question really about retrieval? Many "RAG" problems are actually classification or extraction. Don't retrieve when a deterministic lookup (an exact key, an ID, a database query) already gives you the answer.
Do you need fresh data? A web search tool is often a better fit than a vector store.
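
A quick way to answer the first question is to estimate the corpus size before building anything. The sketch below is only illustrative: the 4-characters-per-token ratio is a rough heuristic for English text (an assumption, not a tokenizer), and the function names and the 100K default budget are hypothetical choices that mirror the rule of thumb above.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. ~4 chars/token is an assumed heuristic, not a real tokenizer."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(docs: list[str], budget_tokens: int = 100_000) -> bool:
    """True if the whole corpus is small enough to paste into the prompt."""
    total = sum(estimate_tokens(d) for d in docs)
    return total <= budget_tokens


if __name__ == "__main__":
    docs = [
        "Refunds are accepted within 30 days of purchase.",
        "Standard shipping takes 3 to 5 business days.",
    ]
    if fits_in_context(docs):
        print("Small corpus: paste it into the prompt and cache the prefix.")
    else:
        print("Too large for the budget: consider retrieval.")
```

For a real decision, swap the heuristic for the tokenizer of the model you actually call, since token counts vary by model and language.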

The minimal pipeline

Chunk the source documents. Start with paragraphs or fixed token windows. Refine later.
Embed each chunk with a model like OpenAI's text-embedding-3-large or Voyage's voyage-3.
Store in any vector database: Postgres with pgvector, Qdrant, Pinecone, Turbopuffer.
Retrieve the top K chunks by cosine similarity between the query embedding and each chunk embedding.
Rerank with a cross-encoder (Cohere Rerank, Voyage Rerank). Big quality win for small cost.
Stuff the top reranked chunks into the prompt with clear citations.
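
The steps above can be sketched end to end in a few dozen lines. This is a toy under stated assumptions, not a production pipeline: `embed` is a bag-of-words stand-in for a real embedding model, the "vector store" is a Python list, the rerank step is left out because it needs a real cross-encoder, and every function name and the sample corpus are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_paragraphs(doc_id: str, text: str) -> list[dict]:
    """Split on blank lines; each chunk keeps a source id for citation."""
    paras = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [{"id": f"{doc_id}#{i}", "text": p} for i, p in enumerate(paras)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: word-count vector (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, index: list[dict], k: int = 3) -> list[dict]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(index, key=lambda c: cosine(q, c["vec"]), reverse=True)[:k]


def build_prompt(query: str, chunks: list[dict]) -> str:
    """Put retrieved chunks in the prompt, each tagged with a citable id."""
    sources = "\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return (
        "Answer using only the sources below. Cite source ids in brackets.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )


# Hypothetical two-document corpus.
corpus = {
    "refunds": "Refunds are accepted within 30 days of purchase.\n\n"
               "Refunds go back to the original payment method.",
    "shipping": "Standard shipping takes 3 to 5 business days.",
}

# "Store": an in-memory list standing in for a vector database.
index = []
for doc_id, text in corpus.items():
    for chunk in chunk_paragraphs(doc_id, text):
        chunk["vec"] = embed(chunk["text"])
        index.append(chunk)

top = retrieve("How long does shipping take?", index, k=2)
prompt = build_prompt("How long does shipping take?", top)

if __name__ == "__main__":
    print(prompt)
```

In a real system, `embed` would call the embedding model, the list would be a vector database, and a reranker would reorder `top` before `build_prompt` runs. The shape of the code stays the same.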

What moves the needle

Hybrid search. Combine BM25 (keyword) with dense retrieval. Most corpora benefit, especially ones where queries contain exact terms such as product codes, names, or error strings.
Reranking. Often more impact than swapping embedding models.
Better chunking. Paragraph-aware, structure-aware, or per-document logic beats fixed windows.
Query rewriting. Use the model to expand or rephrase the query before searching.
Metadata filters. Filter by user, document type, or date before running semantic search.
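
Two of these are easy to show in isolation: metadata filtering and hybrid search. The text above does not say how to combine the BM25 and dense results; the sketch below uses reciprocal rank fusion (RRF), which is one common choice and an assumption here. The constant `k=60` is the conventional default, and the chunk ids, metadata fields, and function names are hypothetical.

```python
from collections import defaultdict


def filter_by_metadata(chunks: list[dict], **required) -> list[dict]:
    """Keep only chunks whose metadata matches every required key/value."""
    return [
        c for c in chunks
        if all(c["meta"].get(key) == value for key, value in required.items())
    ]


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


# Hypothetical chunks with metadata; filter first, then search what is left.
chunks = [
    {"id": "a", "meta": {"user": "u1", "type": "faq"}},
    {"id": "b", "meta": {"user": "u1", "type": "faq"}},
    {"id": "c", "meta": {"user": "u2", "type": "faq"}},
]
allowed = filter_by_metadata(chunks, user="u1")

# Hypothetical outputs of a keyword search and a dense search.
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])

if __name__ == "__main__":
    print([c["id"] for c in allowed])
    print(fused)
```

RRF only needs ranks, not scores, so it avoids having to normalize BM25 scores against cosine similarities. The fused list is a good candidate set to hand to a reranker.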

What rarely moves the needle

Swapping vector databases.
Changing the embedding model from a recent good one to a different recent good one.
HyDE, multi-query, and other clever query expansions on small corpora.
