RAG as an experimentation platform — Toward RAGOps

RAG is often framed as a solved recipe: ingest, embed, retrieve, prompt, deploy. In real projects, it does not behave like a solved recipe at all.

Most “working” RAG systems are a stack of assumptions:

chunk size and overlap,
retriever and reranker combinations,
fusion strategy,
prompt structure and context budget.

Each decision nudges quality, latency, and cost in a different direction. The problem is not that these knobs exist; the problem is that they are usually tuned by instinct and then forgotten.
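One way to keep these knobs from being forgotten is to write them down as a single immutable record and derive an identifier from it. The sketch below is illustrative only: `RagConfig`, its field names, and its default values are hypothetical and not taken from the project.

```python
from dataclasses import dataclass, asdict
from typing import Optional
import hashlib
import json


@dataclass(frozen=True)
class RagConfig:
    """Hypothetical record of the knobs listed above (names and defaults are assumptions)."""
    chunk_size: int = 512            # size of each chunk, in tokens
    chunk_overlap: int = 64          # tokens shared between neighbouring chunks
    retriever: str = "hybrid"        # e.g. "bm25", "faiss", or "hybrid"
    reranker: Optional[str] = None   # name of a reranker, if one is used
    fusion: str = "rrf"              # how rankings from several retrievers are merged
    context_budget: int = 4000       # max tokens of retrieved context in the prompt

    def fingerprint(self) -> str:
        """Short, stable hash that ties a result to the exact configuration that produced it."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


baseline = RagConfig()
variant = RagConfig(chunk_size=256, chunk_overlap=32)

print(baseline.fingerprint(), variant.fingerprint())
```

Storing the fingerprint next to every evaluation result means a number can always be traced back to the assumptions behind it.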

This project started from one premise: if we want reliable RAG, we need an experimentation layer, not just a pipeline.

What This Project Tries To Fix

Instead of optimizing for a single “best setup,” I built a workspace that makes retrieval behavior inspectable:

swap chunking strategies,
compare BM25 (lexical), FAISS-backed dense, and hybrid retrieval,
try rank fusion methods like reciprocal rank fusion (RRF),
inspect what was retrieved, why it ranked where it ranked, and where it failed.
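To make the fusion step concrete, here is a minimal sketch of reciprocal rank fusion. It assumes each retriever returns an ordered list of document IDs. The constant `k=60` is the value commonly used for RRF, and the document IDs are made up for illustration; none of this is the project's actual code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Only ranks are used, never raw retriever scores.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical outputs of a lexical and a dense retriever for one query.
bm25_ranking = ["d3", "d1", "d7"]
dense_ranking = ["d1", "d9", "d3"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Because RRF ignores raw scores, it sidesteps the problem that BM25 scores and vector similarities live on different scales. Documents that both retrievers rank highly rise to the top.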

Early RAGOps prototype

The point is to make retrieval decisions explicit. When results improve, I want to know exactly what changed. When results degrade, I want a reproducible trail back to the failure mode.
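A rough sketch of what "knowing exactly what changed" could look like in practice: diff two configurations, and keep one serialized record per run. The helper names, config keys, query, and document IDs below are all hypothetical.

```python
import json
from typing import Any, Dict, List, Tuple


def diff_configs(before: Dict[str, Any], after: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {knob: (old, new)} for every knob whose value differs between two runs."""
    changed = {}
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key), after.get(key)
        if old != new:
            changed[key] = (old, new)
    return changed


def run_record(config: Dict[str, Any], query: str, retrieved_ids: List[str]) -> str:
    """Serialize one retrieval run as a JSON line that can be appended to a log file."""
    record = {"config": config, "query": query, "retrieved_ids": retrieved_ids}
    return json.dumps(record, sort_keys=True)


# Illustrative configurations only.
baseline = {"chunk_size": 512, "chunk_overlap": 64, "retriever": "bm25"}
candidate = {"chunk_size": 256, "chunk_overlap": 64, "retriever": "hybrid"}

changes = diff_configs(baseline, candidate)
line = run_record(candidate, "how do I rotate API keys?", ["doc-12", "doc-3"])

print(changes)
print(line)
```

With a log of such lines, a regression can be traced by diffing the configuration of the last good run against the first bad one.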

This is less about “building a smarter chatbot” and more about building better engineering discipline around RAG.

Where This Is Going

RAGOps, as I see it, is not a product pitch yet. It is an operating model:

treat retrieval as an iterative system,
evaluate changes against concrete baselines,
make tradeoffs legible across quality, latency, and cost,
preserve enough observability that future changes stay auditable.
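As a small illustration of evaluating a change against a baseline, the sketch below scores two toy retrievers on the same labelled queries and reports recall@k alongside mean latency. The function names, the lookup-table "retrievers", and the labels are invented for the example. Cost is not measured here and would need its own accounting.

```python
import time
from statistics import mean


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant doc IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def evaluate(retrieve, labelled_queries, k=3):
    """Run one retriever over every labelled query; report quality and latency together."""
    recalls, latencies = [], []
    for query, relevant in labelled_queries:
        start = time.perf_counter()
        retrieved = retrieve(query)
        latencies.append(time.perf_counter() - start)
        recalls.append(recall_at_k(retrieved, relevant, k))
    return {"recall_at_k": mean(recalls), "mean_latency_s": mean(latencies)}


# Toy labelled set: each query is paired with the doc IDs judged relevant.
labelled = [("q1", {"a", "b"}), ("q2", {"c"})]

# Stand-ins for two pipeline variants (plain lookup tables, not real retrievers).
baseline_index = {"q1": ["a", "x", "y"], "q2": ["z", "y", "x"]}
variant_index = {"q1": ["a", "b", "x"], "q2": ["c", "z", "y"]}

baseline_report = evaluate(lambda q: baseline_index[q], labelled)
variant_report = evaluate(lambda q: variant_index[q], labelled)

print("baseline:", baseline_report)
print("variant: ", variant_report)
```

Running both variants through the same function on the same labelled set is what makes the comparison fair: only the retriever differs between the two reports.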

A Marimo interface became a practical environment for this workflow. It behaves more like a lab notebook than a polished app: ingest data, run retrieval experiments, inspect outputs, iterate, repeat.

My main takeaway: RAG failures are usually not model failures first. They are systems and developer-experience (DX) failures first. If teams cannot reason about retrieval quality, reproduce evaluations, or compare pipeline variants fairly, production reliability becomes guesswork.

This work is intentionally incomplete. The goal is not to claim that RAG is solved. The goal is to make RAG less opaque to build, test, and scale in real systems.