Forking SimpleMem for A0X — extending someone else's memory system
What I learned extending an open-source agent memory system and wiring it into a production multi-agent platform. HuggingFace embeddings, multi-tenant LanceDB, and +8.5% F1 over the SimpleMem baseline.
The hardest part of forking an open-source project isn’t the technical changes. It’s deciding what to keep.
SimpleMem (aiming-lab/SimpleMem) is a clean Python library for agent long-term memory. It has a good design: a retrieval layer using ColBERT MaxSim scoring, storage backed by LanceDB, and a straightforward API — store(memory), retrieve(query, k).
When we needed a memory system for A0X, it was the most sensible starting point — but it had two problems for production: it was single-tenant, and its embeddings strategy was fixed to the bundled ONNX ColBERT model.
Production requirements
A0X runs many agents for many users. A single memory store that co-mingles agent contexts is a correctness bug — agent A shouldn’t surface memories from agent B’s conversations. SimpleMem’s architecture had no tenant concept at all.
The ColBERT ONNX model was also a problem. ColBERT MaxSim is high-quality for reranking, but running it as a first-stage retriever at production query volume on a single GCP Cloud Run instance was expensive. We wanted cheaper dense embeddings for retrieval and ColBERT (or Cohere) for reranking only.
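To make the cost argument concrete, here is a minimal pure-Python sketch of MaxSim late-interaction scoring. It is an illustration only, not SimpleMem's ONNX implementation; the function names and the toy token vectors are assumptions made up for this example.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """For each query token, take its best-matching document token, then sum."""
    if not doc_tokens:
        return 0.0
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy per-token embeddings (hypothetical values, two dimensions for readability).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[0.5, 0.5]]

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

The work per candidate grows with (query tokens × document tokens), on top of encoding every token, which is why restricting this kind of scoring to a short candidate list is attractive.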
Before / after metrics
Both pipelines were evaluated against 200 (query, memory) pairs from production A0X logs, human-labeled for relevance.
Pipeline: old vs. new
ColBERT ONNX baseline
Single-stage pipeline:
1. Query: raw query string
2. ColBERT ONNX MaxSim: full model inference on every query (expensive)
3. LanceDB (global): no namespace, all agents share one store
4. Top-K results: 180ms median latency
- Global store — no tenant isolation
- ColBERT ONNX runs on every query
- ~1.2 GB per instance
- 180ms median latency
- Single embedding strategy, no override
BGE + Cohere fork
Two-stage pipeline:
1. Query: raw query string
2. BGE-small (33M params, CPU): fast dense embedding, cheap first-stage retrieval
3. LanceDB (namespaced): per-agent namespace isolation, so agent A never sees agent B's memories
4. Cohere rerank (cross-encoder): quality reranker on the top-20 candidates only
5. Top-5 to agent: 95ms median latency
- Namespace-isolated per agent
- BGE for cheap retrieval, Cohere for quality rerank
- ~0.4 GB per instance
- 95ms median latency
- Pluggable embeddings backend
What PR #12 changed
1. Multi-tenancy via namespace isolation
Every store and retrieve call now accepts an optional namespace. LanceDB partitions by namespace — writes go to a namespaced table, retrieves query only that table.
memory.store(
    "jessexbt remembered the user prefers concise answers",
    namespace="jessexbt",
)
results = memory.retrieve(
    "user preferences",
    k=5,
    namespace="jessexbt",
)
Global retrieval (no namespace) is still supported for cases where cross-agent context is intentional. The API is backward compatible: existing code that passes no namespace gets global retrieval.
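The routing rule can be sketched with an in-memory stand-in for the tables. This is not the fork's actual code: the class, the table-naming scheme, and the keyword match used in place of vector search are all assumptions for illustration, as is the choice to map un-namespaced calls to a single shared table.

```python
from collections import defaultdict
from typing import Dict, List, Optional

GLOBAL_TABLE = "memories"  # hypothetical name for the shared, un-namespaced table


def table_name(namespace: Optional[str] = None) -> str:
    """Map a namespace to a table; None keeps the original single-table behavior."""
    return GLOBAL_TABLE if namespace is None else f"{GLOBAL_TABLE}__{namespace}"


class NamespacedStore:
    """In-memory stand-in for a table-per-namespace layout (not LanceDB itself)."""

    def __init__(self) -> None:
        self._tables: Dict[str, List[str]] = defaultdict(list)

    def store(self, memory: str, namespace: Optional[str] = None) -> None:
        # Writes land only in the table chosen by the namespace.
        self._tables[table_name(namespace)].append(memory)

    def retrieve(self, query: str, k: int = 5,
                 namespace: Optional[str] = None) -> List[str]:
        # Reads touch only that same table; keyword overlap stands in for vector search.
        rows = self._tables.get(table_name(namespace), [])
        terms = query.lower().split()
        hits = [m for m in rows if any(t in m.lower() for t in terms)]
        return hits[:k]


store = NamespacedStore()
store.store("the user prefers concise answers", namespace="jessexbt")
store.store("the user prefers long explanations", namespace="other_agent")
store.store("shared fact: user timezone is UTC")  # legacy call, no namespace

jesse_hits = store.retrieve("user prefers", k=5, namespace="jessexbt")
legacy_hits = store.retrieve("user timezone")  # old two-argument style still works
```

Because `namespace` defaults to `None`, callers written against the original `store(memory)` and `retrieve(query, k)` signatures keep working unchanged.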
2. HuggingFace embeddings as first-stage retriever
import os

memory = SimpleMem(
    embeddings_backend="huggingface",
    hf_model="BAAI/bge-small-en-v1.5",
    reranker="cohere",
    cohere_api_key=os.environ["COHERE_API_KEY"],  # read the key from the environment
)
BAAI/bge-small-en-v1.5 has 33M parameters, runs fast on CPU, and gives strong retrieval quality for conversational agent memory.
3. Cohere reranking
After HuggingFace retrieval returns top-K candidates, an optional Cohere rerank pass re-orders them. Cheap first-stage, quality second-stage.
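The retrieve-then-rerank shape looks roughly like the sketch below. To keep it runnable without model downloads or API keys, a bag-of-words cosine stands in for the BGE embeddings and a phrase-match function stands in for the Cohere reranker; every name here is hypothetical and none of it is the fork's real code.

```python
import math
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    """Toy bag-of-words vector standing in for a dense embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def first_stage(query: str, memories: List[str], n_candidates: int = 20) -> List[str]:
    """Cheap stage: rank everything by embedding similarity, keep a shortlist."""
    q = embed(query)
    ranked = sorted(memories, key=lambda m: cosine(q, embed(m)), reverse=True)
    return ranked[:n_candidates]


def retrieve(query: str, memories: List[str],
             rerank: Callable[[str, str], float],
             k: int = 5, n_candidates: int = 20) -> List[str]:
    """Expensive stage: score only the shortlist, then return the top k."""
    candidates = first_stage(query, memories, n_candidates)
    reranked = sorted(candidates, key=lambda m: rerank(query, m), reverse=True)
    return reranked[:k]


rerank_calls: List[str] = []


def toy_rerank(query: str, memory: str) -> float:
    """Placeholder for a cross-encoder: rewards an exact phrase match."""
    rerank_calls.append(memory)
    return 1.0 if query.lower() in memory.lower() else 0.0


memories = [
    "the user likes answers that are concise",
    "user prefers concise answers",
    "weather is sunny",
]
top = retrieve("concise answers", memories, toy_rerank, k=1, n_candidates=2)
```

The reranker is called once per shortlisted candidate rather than once per stored memory, which is where the cost saving comes from.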
On working with someone else’s architecture
The most instructive thing about this project wasn’t the technical changes. It was the constraint of not being able to break the existing API contract.
SimpleMem has a clean retrieve(query, k) interface. We couldn’t add a required namespace argument. Making it optional with a sensible default required thinking about backward compatibility in a way I wouldn’t have if building from scratch.
That discipline — “what’s the minimal change that achieves the goal without breaking existing contracts?” — is something I’ve tried to carry into subsequent work.
What I’d do differently
The multi-tenant isolation should have been a separate PR. The embeddings and reranking changes are a different concern from multi-tenancy. Reviewing one large PR is harder than reviewing two focused ones.
LanceDB’s partitioning API changed between versions. I developed against 0.3.x, and the production upgrade to 0.4.x broke the namespace isolation. Pin dependency versions.
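Alongside a pinned requirement (for example a `lancedb>=0.3,<0.4` line in the requirements file), a startup guard can make a mismatched version fail loudly instead of silently changing behavior. The sketch below is one way to do that; the helper names are hypothetical, and it assumes plain `major.minor.patch` version strings.

```python
from importlib import metadata
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string like '0.3.6'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def is_supported(version: str, pinned: Tuple[int, int] = (0, 3)) -> bool:
    """True only when the installed release is in the series we tested against."""
    return major_minor(version) == pinned


def check_installed(package: str = "lancedb") -> None:
    """Raise at startup if the installed package is outside the tested series."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return  # package not installed here, nothing to verify
    if not is_supported(installed):
        raise RuntimeError(
            f"{package} {installed} is untested; expected the 0.3.x series"
        )
```

Calling `check_installed()` once at service startup would surface an accidental upgrade before any namespaced read or write happens.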
The eval set should live in the repo. The 200-pair eval set lives on my local machine. Anyone making retrieval changes can’t reproduce the benchmark without it. It should be in tests/eval/ with a benchmark script.
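A benchmark script for that directory could start from something like the sketch below. The exact metric definition behind the reported F1 is not spelled out here, so this assumes set-based precision and recall per query, averaged across queries; the data shapes and function names are likewise assumptions.

```python
from typing import Dict, Iterable, List, Set


def f1_score(retrieved: Iterable[str], relevant: Set[str]) -> float:
    """Set-based F1 between retrieved memory ids and human-labeled relevant ids."""
    retrieved_set = set(retrieved)
    true_positives = len(retrieved_set & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(retrieved_set)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)


def mean_f1(results: Dict[str, List[str]], labels: Dict[str, Set[str]]) -> float:
    """Average per-query F1 over every labeled query (missing results score 0)."""
    scores = [f1_score(results.get(q, []), rel) for q, rel in labels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical labels and retrieval output, keyed by query text.
labels = {"user preferences": {"m1"}, "timezone": {"m7"}}
results = {"user preferences": ["m1", "m2"], "timezone": ["m7"]}

score = mean_f1(results, labels)
```

Running the same script against the baseline and the fork, with the labeled pairs checked into the repo, would let anyone reproduce the comparison.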