Dakera scores 87.6% on the full LoCoMo benchmark — 50 sessions, 1,540 questions, no LLM reranking or synthesis. Here is every number, broken down by category, with the evaluation code so you can verify it yourself.
LoCoMo tests four distinct recall challenges. Dakera leads on three; temporal inference is our hardest category and an active area of improvement.
All numbers below are Dakera's own results on the full LoCoMo dataset (v0.11.54, May 2026). Reproducible via dakera-bench.
| Category | Score | Questions | Description |
|---|---|---|---|
| Overall | 87.6% | 1,540 (full dataset, 50 sessions) | Standard single-pass evaluation, no LLM reranking |
| Cat1 — Single-hop recall | 86.9% | 282 questions | Direct facts from recent or distant sessions |
| Cat2 — Multi-hop reasoning | 85.4% | 321 questions | Cross-session reasoning chains requiring multiple recall steps |
| Cat3 — Temporal inference | 73.9% | 92 questions | Time-anchored questions — our most challenging category, actively improving |
| Cat4 — Open-domain | 91.0% | 841 questions | Mixed topics and entities spanning multiple sessions |
All scores use the full LoCoMo dataset — 50 simulated long-term conversations, 1,540 questions. We do not use sampled subsets. Standard single-pass retrieval: no LLM reranking, no post-processing synthesis step. Results are scored by an LLM judge using the LoCoMo evaluation framework.
Version: Dakera v0.11.54, evaluated May 2026. The benchmark harness is open source at github.com/Dakera-AI/dakera-bench. Run it yourself with one command.
We publish our evaluation methodology in full so you can reproduce, audit, or challenge the results.
Dataset: Full LoCoMo benchmark — 50 simulated long-term conversations, 1,540 questions across all four categories, no sampled subsets. Two percentage points on a 100-question eval is statistical noise; on 1,540 questions it represents a real signal.
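To make the noise-vs-signal claim concrete, here is a quick sketch of the uncertainty on a measured accuracy at both sample sizes. This is the standard binomial approximation, not part of the benchmark harness:

```python
import math

def accuracy_stderr(p: float, n: int) -> float:
    """Standard error of a measured accuracy p over n questions
    (normal approximation to the binomial)."""
    return math.sqrt(p * (1 - p) / n)

p = 0.876  # overall LoCoMo score

# A ~95% confidence half-width is roughly 2 standard errors.
noise_100 = 2 * accuracy_stderr(p, 100)    # about ±6.6 points
noise_1540 = 2 * accuracy_stderr(p, 1540)  # about ±1.7 points

print(f"n=100: ±{noise_100 * 100:.1f}pp, n=1540: ±{noise_1540 * 100:.1f}pp")
```

A 2-point gap sits well inside the ±6.6-point band at n=100, but outside the ±1.7-point band at n=1,540.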
Ingestion: conversations are written through the `POST /v1/memory` API. Session boundaries are preserved, with no preprocessing or summarization.
Recall: each question is issued as a `POST /v1/recall` query. Dakera returns its top-k memories using HNSW + BM25 hybrid retrieval with temporal re-ranking.
The full evaluation script and dataset ingestion pipeline are documented in our benchmark methodology post. Reproducibility is a first-class requirement: if you find a discrepancy, open an issue on GitHub.
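A minimal client sketch of that ingest-then-recall loop. The endpoint paths come from the methodology above; the base URL, payload field names, and message shape are assumptions for illustration, not the documented API schema:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumed self-hosted instance

def memory_payload(session_id: str, messages: list) -> dict:
    """Build a POST /v1/memory body; field names are illustrative."""
    return {"session_id": session_id, "messages": messages}

def recall_payload(query: str, k: int = 5) -> dict:
    """Build a POST /v1/recall body; field names are illustrative."""
    return {"query": query, "top_k": k}

def post(path: str, payload: dict) -> dict:
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Ingest one LoCoMo session, preserving the session boundary:
# post("/v1/memory", memory_payload("locomo-s01", [{"role": "user", "text": "..."}]))
# Then issue a benchmark question against the stored memories:
# post("/v1/recall", recall_payload("What city did the speaker say they moved to?"))
```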
Benchmark scores translate directly to agent behavior. Memory recall failures cause agents to repeat questions, forget context, and give inconsistent answers.
At 87.6% recall, your agent remembers what the user told it — across sessions. No "as I mentioned earlier" failures.
91.0% open-domain recall means agents carry context across days, weeks, and months — not just within a single conversation.
Sub-10ms recall at P99. No LLM rerank post-pass. Your agent gets the right memory fast enough to use it in real-time.
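The retrieval path described above (HNSW + BM25 hybrid with temporal re-ranking, no LLM rerank pass) can be sketched as a single score-fusion step. Everything here is illustrative: the weights, the exponential recency decay, and the BM25 normalization are assumptions, not Dakera's actual scoring function:

```python
import math
import time

def fuse_scores(
    vector_score: float,   # cosine similarity from the HNSW index, in [0, 1]
    bm25_score: float,     # raw BM25 score, unbounded
    age_seconds: float,    # how old the memory is
    alpha: float = 0.7,    # weight on the dense channel (assumed)
    half_life_days: float = 30.0,  # recency half-life (assumed)
) -> float:
    """Blend dense and lexical scores, then decay by memory age."""
    bm25_norm = bm25_score / (1.0 + bm25_score)  # squash to [0, 1)
    base = alpha * vector_score + (1 - alpha) * bm25_norm
    decay = 0.5 ** (age_seconds / (half_life_days * 86400))
    return base * decay

def rerank(candidates: list, k: int = 5) -> list:
    """Sort candidate memories by fused score and keep the top k."""
    now = time.time()
    return sorted(
        candidates,
        key=lambda c: fuse_scores(c["vector"], c["bm25"], now - c["ts"]),
        reverse=True,
    )[:k]
```

Because the fusion is pure arithmetic over precomputed index scores, it adds microseconds, not an LLM round-trip, which is what keeps P99 latency in single-digit milliseconds.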
Dakera is open core — the engine, SDKs, CLI, and MCP server are MIT-licensed and self-hostable. Spin up a local instance, ingest the LoCoMo dataset, and reproduce these results against your own deployment.