Refactoring A0X's collective brain — from ColBERT to OpenRouter+Cohere
We replaced a self-hosted ColBERT ONNX retrieval pipeline with OpenRouter embeddings + Cohere rerank. The numbers, the reasoning, and honest reflections on production RAG cost vs. performance tradeoffs.
Production RAG systems fail in a way that local prototypes don’t. In development, you benchmark on a curated test set with 50 queries, get great numbers, and ship. In production, the query distribution shifts, the data grows, the model provider has an outage, and you’re paying $800/month in embedding costs you didn’t plan for.
This is the story of one such migration.
The numbers
We ran both pipelines in shadow mode for 2 weeks before cutover — every production query went to both, we logged both results, and 3 engineers rated differences on 400 sampled queries.
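A shadow-mode harness along these lines can be sketched in a few lines. This is a minimal illustration, not the actual A0X code: `ShadowRunner`, the `Retriever` signature, and the choice to sample only queries where the two pipelines disagree are all assumptions made for the example. A production version would likely call the shadow pipeline asynchronously so it adds no latency to the served request.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical signature: a retriever maps a query to a ranked list of doc ids.
Retriever = Callable[[str], List[str]]


@dataclass
class ShadowRunner:
    primary: Retriever  # pipeline whose results are actually served
    shadow: Retriever   # candidate pipeline; its results are only logged
    log: List[dict] = field(default_factory=list)

    def retrieve(self, query: str) -> List[str]:
        served = self.primary(query)
        try:
            candidate = self.shadow(query)
        except Exception:
            # A shadow failure must never affect the served result.
            candidate = []
        self.log.append({
            "query": query,
            "primary": served,
            "shadow": candidate,
            "differs": served != candidate,
        })
        return served

    def sample_for_review(self, n: int, seed: int = 0) -> List[dict]:
        """Pick up to n logged queries where the pipelines disagreed."""
        differing = [row for row in self.log if row["differs"]]
        return random.Random(seed).sample(differing, min(n, len(differing)))
```

Human raters can then work through the output of `sample_for_review(400)` without knowing which pipeline produced which ranking.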
Pipeline: old vs. new
ColBERT ONNX pipeline
- Step 1: BM25 (Pinecone sparse). Keyword-based first stage: fast, exact, no semantic understanding
- Step 2: ColBERT ONNX MaxSim. Local inference on Cloud Run: semantically rich but expensive at volume
- Step 3: Top-5 to agent. 340ms median latency · P@3: 0.81 code, 0.61 Spanish
- BM25 sparse first stage — exact, no semantics
- ColBERT ONNX reranker — locally deployed on Cloud Run
- All retrieval on one GCP instance
- ~$520/mo infrastructure cost
- 340ms median latency
- Good on: Cadence/Solidity code queries (P@3: 0.81)
- Weak on: Spanish conversational (P@3: 0.61)
OpenRouter + Cohere pipeline
- Dense embedding first stage (OpenRouter)
- Cohere cross-encoder reranker, trained on a massive, recent dataset
- Code queries routed to BM25 fast path (no reranker)
- ~$225/mo total (API costs replace infra)
- 210ms median latency
- Strong on: Spanish conversational (P@3: 0.74)
- Code: 0.82 after BM25 fast path
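To put the two lists side by side, the relative changes can be worked out directly from the figures quoted above. The helper name `pct_reduction` is just for this example; the inputs are the numbers from the comparison.

```python
def pct_reduction(old: float, new: float) -> float:
    """Relative reduction from old to new, in percent, rounded to 0.1."""
    return round(100 * (old - new) / old, 1)


cost_saving = pct_reduction(520, 225)     # monthly cost, USD
latency_saving = pct_reduction(340, 210)  # median latency, ms
spanish_gain = round(0.74 - 0.61, 2)      # absolute P@3 change, Spanish
code_gain = round(0.82 - 0.81, 2)         # absolute P@3 change, code

print(cost_saving, latency_saving, spanish_gain, code_gain)
# prints: 56.7 38.2 0.13 0.01
```

In other words: roughly 57% cheaper and 38% faster at the median, with a 0.13 gain in Spanish P@3 and code precision essentially unchanged.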
Starting state: ColBERT ONNX
The A0X collective brain — the shared knowledge system used across all agents — originally ran a two-stage retrieval pipeline:
- First stage (BM25): keyword-based retrieval using Pinecone’s sparse vector support. Fast, exact, bad at semantics.
- Second stage (ColBERT MaxSim): a locally-deployed ColBERT ONNX model for dense reranking. Semantically rich, expensive to run.
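ColBERT MaxSim is academically elegant. It uses late interaction: each query token embedding is compared against every document token embedding, the best match per query token is kept, and those maxima are summed. On complex queries this tends to give better relevance judgments than bi-encoders.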
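For readers who have not seen late interaction written out, here is a minimal pure-Python sketch of MaxSim scoring. It is illustrative only: the original service ran an ONNX model, and this example assumes the token embeddings have already been computed (and normalized) elsewhere. The function names `maxsim` and `rerank` are made up for the example.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late-interaction score: for each query token, take the similarity of
    its best-matching document token, then sum over query tokens.
    Assumes doc_tokens is non-empty."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rerank(query_tokens: List[Vector], docs: Dict[str, List[Vector]]) -> List[str]:
    """Order candidate doc ids by descending MaxSim score."""
    return sorted(docs, key=lambda doc_id: maxsim(query_tokens, docs[doc_id]),
                  reverse=True)
```

The cost profile follows from the loop structure: every rerank touches every query token against every document token for every candidate, which is why this stage dominates compute at volume.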
The problem: running ONNX inference on every retrieval at production volume on a single GCP Cloud Run instance was not economical. The brain-api service was consuming more GCP compute budget than the main backend.
Query type breakdown
Shadow mode revealed a sharp split by query type. The new pipeline was significantly better on Spanish-language queries and significantly worse on technical Cadence/Solidity code queries. The fix was a query classifier that routes code-heavy queries to a specialized code retrieval pipeline (BM25 + exact match, no reranker). After adding the BM25 fast path, code query precision went from 0.61 to 0.82.
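The routing idea can be sketched with a simple heuristic classifier. This is an assumption-heavy illustration: the post does not say how the real classifier works, and the regex signals below (code punctuation, a handful of Solidity/Cadence-flavoured tokens, hex addresses, method calls) are a hypothetical starting point rather than the production rules.

```python
import re
from typing import Callable, List

# Hypothetical signals that a query is code-heavy.
_CODE_SIGNALS = [
    re.compile(r"[{};]|=>|<-"),                                       # code punctuation
    re.compile(r"\b(pragma|uint256|mapping|msg\.sender|pub fun)\b"),  # language tokens
    re.compile(r"\b0x[0-9a-fA-F]{6,}\b"),                             # hex addresses
    re.compile(r"\b\w+\.\w+\("),                                      # method calls
]

Retriever = Callable[[str], List[str]]


def is_code_query(query: str, min_signals: int = 1) -> bool:
    """True when at least min_signals of the patterns match the query."""
    hits = sum(1 for pattern in _CODE_SIGNALS if pattern.search(query))
    return hits >= min_signals


def route(query: str, code_path: Retriever, semantic_path: Retriever) -> List[str]:
    """Send code-heavy queries to the BM25 fast path, everything else to
    the dense embedding + rerank path."""
    return code_path(query) if is_code_query(query) else semantic_path(query)
```

A heuristic like this is cheap enough to run on every request; a learned classifier could replace `is_code_query` later without changing `route`.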
What went wrong
Embedding model outputs are not compatible between providers. When we switched from ColBERT embeddings to OpenRouter/mistral-embed, all previously stored knowledge chunks needed re-embedding. We had ~180K chunks. The re-embedding job took 8 hours and cost ~$45.
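A re-embedding job of this kind is essentially a batched loop over stored chunks. The sketch below assumes a caller-supplied `embed_batch` function that wraps whatever embedding endpoint is in use; that function, the `(chunk_id, text)` tuple shape, and the batch size of 64 are all hypothetical.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Chunk = Tuple[str, str]  # (chunk_id, text); assumed shape for this example
EmbedBatch = Callable[[List[str]], List[List[float]]]


def batched(items: Iterable[Chunk], size: int) -> Iterator[List[Chunk]]:
    """Yield lists of up to `size` items, preserving order."""
    batch: List[Chunk] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def reembed(chunks: Iterable[Chunk], embed_batch: EmbedBatch,
            batch_size: int = 64) -> Iterator[Tuple[str, List[float]]]:
    """Re-embed every chunk with the new model, one provider call per batch."""
    for batch in batched(chunks, batch_size):
        vectors = embed_batch([text for _, text in batch])
        if len(vectors) != len(batch):
            raise ValueError("provider returned a different number of vectors")
        for (chunk_id, _), vector in zip(batch, vectors):
            yield chunk_id, vector
```

For scale, the figures above work out to about 6 chunks per second (180,000 chunks over 8 hours) and roughly $0.00025 per chunk ($45 over 180,000 chunks).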
The Cohere rate limits surprised us. At peak query volume (Friday afternoons), we hit the limit twice in the first week. Fix: retry with exponential backoff and queue excess requests. This should have been in place on day one.
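The retry half of that fix looks roughly like the sketch below. It is a generic pattern, not the code from the commit: `RateLimited` is a hypothetical exception standing in for whatever error the client raises on an HTTP 429, and the delay parameters are illustrative. Queuing of excess requests is not shown.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for the provider client's rate-limit error."""


def with_backoff(call: Callable[[], T], max_attempts: int = 5,
                 base_delay: float = 0.5, max_delay: float = 8.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying on RateLimited with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries so concurrent clients do not retry in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
    raise ValueError("max_attempts must be at least 1")
```

Usage would be something like `with_backoff(lambda: rerank_client(query, docs))`, where `rerank_client` is whatever function wraps the rerank request.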
On production RAG cost
The honest lesson: you cannot predict production RAG costs from prototypes. The query distribution, volume, data size, and infrastructure load all interact in ways that are hard to model in advance.
What you can do:
- Run shadow mode before cutover
- Log query costs at the provider level (OpenRouter and Cohere both have per-request cost metadata)
- Build retrieval quality metrics into your monitoring — don’t just watch latency, watch precision
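As one way to make the last point concrete, precision at k can be tracked as a rolling average alongside latency. The sketch assumes the standard definition of P@k (relevant results in the top k, divided by k) and assumes relevance labels come from somewhere, such as human ratings or click feedback; the post does not specify either. `RollingPrecision` and its window size are hypothetical.

```python
from collections import deque
from typing import Deque, List, Set


def precision_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int = 3) -> float:
    """Fraction of the top-k results that are relevant (standard P@k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k


class RollingPrecision:
    """Mean P@k over the most recent `window` labelled queries."""

    def __init__(self, window: int = 400, k: int = 3) -> None:
        self.k = k
        self._scores: Deque[float] = deque(maxlen=window)

    def record(self, ranked_ids: List[str], relevant_ids: Set[str]) -> None:
        self._scores.append(precision_at_k(ranked_ids, relevant_ids, self.k))

    def mean(self) -> float:
        return sum(self._scores) / len(self._scores) if self._scores else 0.0
```

Keeping one `RollingPrecision` per query type (code, Spanish conversational, and so on) would surface the kind of per-type regression that shadow mode caught here.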
RAG architecture is not a one-time decision. It evolves as your query distribution and scale evolve.
The commit that captures the change: 5f1063ce feat(rag-api): replace ColBERT ONNX with OpenRouter embed + Cohere rerank.