Refactoring A0X's collective brain — from ColBERT to OpenRouter+Cohere

−57%

Total retrieval cost — $520/mo → $225/mo

+7 precision points · −38% latency

Production RAG systems fail in a way that local prototypes don’t. In development, you benchmark on a curated test set with 50 queries, get great numbers, and ship. In production, the query distribution shifts, the data grows, the model provider has an outage, and you’re paying $800/month in embedding costs you didn’t plan for.

This is the story of one such migration.

The numbers

+7.0%

Precision@3

0.71 → 0.76 (human labels)

−38%

Median latency

340ms → 210ms

−84%

Infrastructure cost

$520/mo → $85/mo GCP

−57%

Total retrieval cost

$520/mo → $225/mo all-in

We ran both pipelines in shadow mode for 2 weeks before cutover — every production query went to both, we logged both results, and 3 engineers rated differences on 400 sampled queries.

Pipeline: old vs. new

ColBERT ONNX (old)

OpenRouter + Cohere (new)

ColBERT ONNX pipeline

1

BM25 — Pinecone sparse sparse retrieval

Keyword-based first stage — fast, exact, no semantic understanding
2

ColBERT ONNX MaxSim ~$520/mo GCP

Local inference on Cloud Run — semantically rich but expensive at volume
3

Top-5 to agent output

340ms median latency · P@3: 0.81 code, 0.61 Spanish

BM25 sparse first stage — exact, no semantics
ColBERT ONNX reranker — locally deployed on Cloud Run
All retrieval on one GCP instance
~$520/mo infrastructure cost
340ms median latency
Good on: Cadence/Solidity code queries (P@3: 0.81)
Weak on: Spanish conversational (P@3: 0.61)

OpenRouter + Cohere pipeline

flowchart TD NQ["Query"] --> CLS["Query classifier\ncode vs. general"] --> OR["OpenRouter embed\nmistral/mistral-embed"] --> ANN["ANN search\ntop-20"] --> CO["Cohere rerank-v3.0"] --> N5["top-5 to agent"] CLS -->|"code query"| BM2["BM25 fast path\nexact match"] --> C5["top-5 direct"]

Dense embedding first stage (OpenRouter)
Cohere cross-encoder reranker — trained on massive recent dataset
Code queries routed to BM25 fast path (no reranker)
~$225/mo total (API costs replace infra)
210ms median latency
Strong on: Spanish conversational (P@3: 0.74)
Code: 0.82 after BM25 fast path

Starting state: ColBERT ONNX

The A0X collective brain — the shared knowledge system used across all agents — originally ran a two-stage retrieval pipeline:

First stage (BM25): keyword-based retrieval using Pinecone’s sparse vector support. Fast, exact, bad at semantics.
Second stage (ColBERT MaxSim): a locally-deployed ColBERT ONNX model for dense reranking. Semantically rich, expensive to run.

ColBERT MaxSim is academically elegant — late interaction, query token embeddings compared to document token embeddings, better relevance judgments than bi-encoders on complex queries.

The problem: running ONNX inference on every retrieval at production volume on a single GCP Cloud Run instance was not economical. The brain-api service was consuming more GCP compute budget than the main backend.

Query type breakdown

scale: Precision@3 (higher = better)

Code queries — Cohere (BM25 fast path) After BM25 fast path added

0.82

Code queries — ColBERT ColBERT strong on code

0.81

Mixed English/code — Cohere Cohere wins mixed

0.76

English conversational — Cohere Cohere wins conversational

0.78

Spanish conversational — Cohere Big win over ColBERT

0.74

Spanish conversational — ColBERT Biggest ColBERT weakness

0.61

Shadow mode revealed a distribution shift. The new pipeline was significantly better on Spanish-language queries and significantly worse on technical Cadence/Solidity code queries. Fix: a query classifier that routes code-heavy queries to a specialized code retrieval pipeline (BM25 + exact match, no reranker). After adding the BM25 fast path, code query precision went from 0.61 to 0.82.

What went wrong

Embedding model outputs are not compatible between providers. When we switched from ColBERT embeddings to OpenRouter/mistral-embed, all previously stored knowledge chunks needed re-embedding. We had ~180K chunks. The re-embedding job took 8 hours and cost ~$45.

The Cohere rate limits surprised us. At peak query volume (Friday afternoons), we hit the limit twice in the first week. Fix: retry with exponential backoff and queue excess requests. Should have been Day 1.

On production RAG cost

The honest lesson: you cannot predict production RAG costs from prototypes. The query distribution, volume, data size, and infrastructure load all interact in ways that are hard to model in advance.

What you can do:

Run shadow mode before cutover
Log query costs at the provider level (OpenRouter and Cohere both have per-request cost metadata)
Build retrieval quality metrics into your monitoring — don’t just watch latency, watch precision

RAG architecture is not a one-time decision. It evolves as your query distribution and scale evolve.

The commit that captures the change: 5f1063ce feat(rag-api): replace ColBERT ONNX with OpenRouter embed + Cohere rerank.