Forking SimpleMem for A0X — extending someone else's memory system

+8.5% F1 improvement over the SimpleMem baseline · 47% latency reduction · 67% memory reduction

The hardest part of forking an open-source project isn’t the technical changes. It’s deciding what to keep.

SimpleMem (aiming-lab/SimpleMem) is a clean Python library for agent long-term memory. It has a good design: a retrieval layer using ColBERT MaxSim scoring, storage backed by LanceDB, and a straightforward API — store(memory), retrieve(query, k).

When we needed a memory system for A0X, SimpleMem was the most sensible starting point. It had two problems for production, though: it was single-tenant, and its embedding strategy was fixed to the bundled ONNX ColBERT model.

Production requirements

A0X runs many agents for many users. A single memory store that co-mingles agent contexts is a correctness bug — agent A shouldn’t surface memories from agent B’s conversations. SimpleMem’s architecture had no tenant concept at all.

The ColBERT ONNX model was also a problem. ColBERT MaxSim is high-quality for reranking, but running it as a first-stage retriever at production query volume on a single GCP Cloud Run instance was expensive. We wanted cheaper dense embeddings for retrieval and ColBERT (or Cohere) for reranking only.

Before / after metrics

F1@5: 0.720 → 0.782 (+8.5%)
Precision@1: 0.680 → 0.740 (+8.8%)
Median latency: 180ms → 95ms (−47%)
Memory/instance: 1.2 GB → 0.4 GB (−67%)

Evaluated against 200 (query, memory) pairs from production A0X logs, human-labeled for relevance.
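For readers who want to compute numbers like these on their own labeled pairs, here is a minimal sketch of Precision@1 and F1@k. The post does not spell out the exact metric definitions, so the formulas below (F1@k as the harmonic mean of precision@k and recall@k, averaged over queries) are an assumption, and all function names are illustrative.

```python
from typing import Callable, List, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled with memories labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for memory_id in ranked[:k] if memory_id in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant memories that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for memory_id in ranked[:k] if memory_id in relevant)
    return hits / len(relevant)


def f1_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Harmonic mean of precision@k and recall@k (assumed definition)."""
    p = precision_at_k(ranked, relevant, k)
    r = recall_at_k(ranked, relevant, k)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


Run = Tuple[Sequence[str], Set[str]]  # (ranked ids returned, ids labeled relevant)


def mean_metric(metric: Callable[[Sequence[str], Set[str], int], float],
                runs: List[Run], k: int) -> float:
    """Average a per-query metric over all evaluated queries."""
    if not runs:
        return 0.0
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)
```

Precision@1 is then `mean_metric(precision_at_k, runs, 1)` and F1@5 is `mean_metric(f1_at_k, runs, 5)`.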

Pipeline: old vs. new

Baseline (SimpleMem): ColBERT ONNX, single-stage pipeline

1. Query input: raw query string.
2. ColBERT ONNX MaxSim (~1.2 GB RAM): full model inference on every query, which is expensive.
3. LanceDB (global, no isolation): no namespace, so all agents share one store.
4. Top-K results output: 180ms median latency.

Global store, no tenant isolation
ColBERT ONNX runs on every query
~1.2 GB per instance
180ms median latency
Single embedding strategy, no override

Fork (A0X production): BGE + Cohere, two-stage pipeline

1. Query input: raw query string.
2. BGE-small (33M params, CPU, ~0.4 GB RAM): fast dense embedding for cheap first-stage retrieval.
3. LanceDB (namespaced, tenant-safe): per-agent namespace isolation, so agent A never sees agent B's memories.
4. Cohere rerank (cross-encoder, precision boost): quality reranker on the top-20 candidates only.
5. Top-5 to agent output: 95ms median latency.

Namespace-isolated per agent
BGE for cheap retrieval, Cohere for quality rerank
~0.4 GB per instance
95ms median latency
Pluggable embeddings backend

What PR #12 changed

1. Multi-tenancy via namespace isolation

Every store and retrieve call now accepts an optional namespace argument. LanceDB partitions by namespace: writes go to a namespaced table, and retrieves query only that table.

memory.store(
    "jessexbt remembered the user prefers concise answers",
    namespace="jessexbt"
)
results = memory.retrieve(
    "user preferences",
    k=5,
    namespace="jessexbt"
)

A global retrieval (no namespace) is still supported for cases where cross-agent context is intentional. The API is backward compatible: existing code that passes no namespace gets global retrieval.
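The routing rule can be sketched with a toy in-memory store. This is not the fork's LanceDB code: the class name, the table-naming scheme, and the word-overlap scoring are all hypothetical stand-ins. It also assumes that a call without a namespace reads and writes a single shared table, as the pre-fork library did. Whether the fork's global retrieval also searches namespaced tables is not stated here.

```python
from collections import defaultdict
from typing import Dict, List, Optional

GLOBAL_TABLE = "memories"  # hypothetical name for the shared, pre-fork table


def table_name(namespace: Optional[str]) -> str:
    """Map an optional namespace to the table that holds its memories."""
    return GLOBAL_TABLE if namespace is None else f"{GLOBAL_TABLE}__{namespace}"


class NamespacedMemory:
    """Toy in-memory stand-in for a namespaced memory store."""

    def __init__(self) -> None:
        self._tables: Dict[str, List[str]] = defaultdict(list)

    def store(self, memory: str, namespace: Optional[str] = None) -> None:
        # Writes land only in the table for this namespace.
        self._tables[table_name(namespace)].append(memory)

    def retrieve(self, query: str, k: int = 5,
                 namespace: Optional[str] = None) -> List[str]:
        # Reads touch only the table for this namespace.
        rows = self._tables.get(table_name(namespace), [])
        terms = set(query.lower().split())
        # Word overlap stands in for real embedding similarity.
        ranked = sorted(rows,
                        key=lambda m: len(terms & set(m.lower().split())),
                        reverse=True)
        return ranked[:k]
```

Because `namespace` defaults to `None`, a caller written against the original `store(memory)` and `retrieve(query, k)` signatures keeps working unchanged.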

2. HuggingFace embeddings as first-stage retriever

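import os

# First-stage retrieval with a HuggingFace model, second-stage rerank with Cohere.
memory = SimpleMem(
    embeddings_backend="huggingface",
    hf_model="BAAI/bge-small-en-v1.5",
    reranker="cohere",
    cohere_api_key=os.environ["COHERE_API_KEY"],  # read the key from the environment
)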

BAAI/bge-small-en-v1.5 has 33M parameters, runs fast on CPU, and offers strong retrieval quality for conversational agent memory.
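One way a switch like embeddings_backend could dispatch is a small registry of embedder factories. The sketch below is an assumption about the shape of such a design, not the fork's actual code. The registry, the function names, and the toy hashing embedder (which stands in for a real model so the example runs without downloads) are all hypothetical.

```python
import hashlib
import math
from typing import Callable, Dict, List

Embedder = Callable[[str], List[float]]


def toy_hash_embedder(dim: int = 64) -> Embedder:
    """Deterministic bag-of-words embedder, a stand-in for a real model."""
    def embed(text: str) -> List[float]:
        vec = [0.0] * dim
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]  # unit length, so dot product = cosine
    return embed


# Hypothetical registry: backend name -> factory that builds an embedder.
BACKENDS: Dict[str, Callable[[], Embedder]] = {"toy": toy_hash_embedder}


def get_embedder(backend: str) -> Embedder:
    """Look up a backend by name and build its embedder."""
    try:
        return BACKENDS[backend]()
    except KeyError:
        raise ValueError(f"unknown embeddings backend: {backend!r}") from None


def top_k(query: str, memories: List[str], embed: Embedder, k: int) -> List[str]:
    """Rank memories by cosine similarity to the query and keep the best k."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, embed(m))), m) for m in memories]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:k]]
```

A real "huggingface" entry would register a factory that wraps the BGE model behind the same `Embedder` signature, so the retrieval code does not need to know which model is in use.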

3. Cohere reranking

After the HuggingFace first stage returns its candidates (the top 20 in the A0X pipeline), an optional Cohere rerank pass reorders them, and the top 5 go to the agent. The first stage is cheap; the second stage supplies the quality.
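The retrieve-then-rerank flow can be sketched as a function that takes both stages as callables. The names and signatures below are illustrative assumptions, and the word-overlap reranker is a toy stand-in for the Cohere call, not a model of how Cohere scores documents.

```python
from typing import Callable, List, Optional, Sequence

FirstStage = Callable[[str, int], List[str]]       # (query, n) -> n candidates
Reranker = Callable[[str, Sequence[str]], List[str]]  # reorder best-first


def two_stage_retrieve(query: str,
                       first_stage: FirstStage,
                       reranker: Optional[Reranker] = None,
                       candidates: int = 20,
                       k: int = 5) -> List[str]:
    """Fetch a wide candidate pool cheaply, then rerank only that pool."""
    pool = first_stage(query, candidates)
    if reranker is not None:  # the rerank pass is optional
        pool = reranker(query, pool)
    return pool[:k]


def overlap_reranker(query: str, candidates: Sequence[str]) -> List[str]:
    """Toy reranker: order candidates by words shared with the query."""
    terms = set(query.lower().split())
    return sorted(candidates,
                  key=lambda c: len(terms & set(c.lower().split())),
                  reverse=True)
```

The reranker only ever sees the candidate pool, so a memory the first stage misses cannot be recovered later. That is the trade-off behind choosing a pool of 20 for a final top 5.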

On working with someone else’s architecture

The most instructive thing about this project wasn’t the technical changes. It was the constraint of not being able to break the existing API contract.

SimpleMem has a clean retrieve(query, k) interface. We couldn’t add a required namespace argument. Making it optional with a sensible default required thinking about backward compatibility in a way I wouldn’t have if building from scratch.

That discipline — “what’s the minimal change that achieves the goal without breaking existing contracts?” — is something I’ve tried to carry into subsequent work.

What I’d do differently

The multi-tenant isolation should have been a separate PR. The embeddings and reranking changes are a different concern from multi-tenancy. Reviewing one large PR is harder than reviewing two focused ones.

LanceDB’s partitioning API changed between versions. I developed against 0.3.x, and the production upgrade to 0.4.x broke the namespace isolation. Pin dependency versions.
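Alongside a pin in the requirements file, a startup guard can fail fast when the installed version drifts outside the series the code was tested against. This is a minimal sketch with hypothetical helper names, and it assumes plain numeric major and minor components in the version string.

```python
from importlib import metadata
from typing import Tuple


def minor_series(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '0.3.4'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def check_pinned(package: str, expected: Tuple[int, int]) -> None:
    """Raise at startup if the installed package left the tested series."""
    # metadata.version raises PackageNotFoundError if the package is absent.
    installed = metadata.version(package)
    if minor_series(installed) != expected:
        raise RuntimeError(
            f"{package} {installed} is outside the tested "
            f"{expected[0]}.{expected[1]}.x series"
        )
```

Calling `check_pinned("lancedb", (0, 3))` at import time would have turned the silent isolation breakage into an immediate error. The matching requirements pin would be `lancedb==0.3.*`.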

The eval set should live in the repo. The 200-pair eval set lives on my local machine. Anyone making retrieval changes can’t reproduce the benchmark without it. It should be in tests/eval/ with a benchmark script.
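A benchmark script under tests/eval/ could be as small as the sketch below. The JSONL layout (one object per line with a query and its relevant memory ids), the function names, and the retrieve signature are all assumptions, since the real eval set's format is not described here.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def load_eval_pairs(path: Path) -> List[Dict]:
    """Read one JSON object per line: {"query": str, "relevant": [ids]}."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def precision_at_1(pairs: List[Dict],
                   retrieve: Callable[[str, int], List[str]]) -> float:
    """Share of queries whose top result is labeled relevant."""
    if not pairs:
        return 0.0
    hits = 0
    for pair in pairs:
        top = retrieve(pair["query"], 1)
        if top and top[0] in pair["relevant"]:
            hits += 1
    return hits / len(pairs)
```

A CI job could compare the score against the last recorded baseline and fail when a retrieval change regresses it.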