The demo works. Leadership is excited. Then you ship it and retrieval quality falls off a cliff. This is the gap between notebook RAG and production RAG — and it's wider than most teams think.
In a notebook, you control the questions. In production, users ask things you didn't anticipate, in ways your chunker didn't expect, about documents your embedding model hasn't really learned. The naive cosine-similarity-over-chunks approach falls apart fast.
The fixes aren't glamorous: hybrid search, reranking, query rewriting, semantic chunking that respects document structure, and — crucially — an evaluation harness so you can tell when something regressed.
(Full post coming soon.)