Woz - Engineering

W5 Memory Rebuild, Game Plan

Replacing a half-dead dual memory system with a single file-native one. Markdown is the source of truth; every index or database is a disposable cache. Boring first, benchmarked, staged, private, reversible.

Gemini: approved GPT: approved + edits folded in Karpathy: approve, simplify Phase 4 ChaseAI: build it, add a health signal

Where we landed on Hermes

Hermes is out.

Hermes Agent (MaxHermes / MiniMax, the Telegram self-improving-skills agent) is declined, evaluated 20+ times and not adopted. Its one good idea, an agent that writes its own reusable skills from completed work, survives as Phase 8: deferred to the very end, built file-native (Claudeception / UniM0cha patterns), and only after the boring core proves out on the benchmark. The product is rejected; the capability is parked at the back of the line where it cannot do damage.

The one rule

Markdown living-docs are the single source of truth.

Every database, index, or cache is a disposable, rebuildable copy. If a cache disappears we lose speed, never knowledge. This is the rule whose violation killed the Postgres brain (recall returned nothing, 0 graph links from 9,315 jobs, only 35 of 378 memories ever read). It stays at the top.

What we are building, and not

Building (the four goals)

Same-session save: a lesson learned today is folded into the right doc today, staged for your approval
Passive recall: relevant notes are pushed in front of the agent automatically, so it cannot forget to look
Raw auto-ingestion: drop a transcript or email, the agent files and backlinks it
Self-improving skills (last): the agent writes its own reusable skills, only after the core is proven

Deferred / cut from v1

Vector index (sqlite-vec, fastembed): grep wins at 23 docs; build only if the benchmark fails
vstash: a Phase 7 candidate, verify its Claude Code hook claim first
claude-obsidian fork, Claudeception, UniM0cha, HyDE, bi-temporal frontmatter, daily cron
Direct canonical auto-fold: always staged, never automatic into the source of truth

The lean build order

Delete and measuremostly done
Cut the noise hooks, em-dash blocker to silent autofix, password to Vaultwarden, sync fix shipped. Remaining: strip dead boot references and the ~90-directive wall down to the load-bearing 15, fold stranded notes.
Passive recall, grep only (two hooks)next
session_context.py on SessionStart (MEMORY.md + STATUS.md + hot.md, stable). prompt_recall.py on UserPromptSubmit (greps the actual prompt). Under 8,000 chars total, under 500ms. No embeddings.
Recall benchmark
25 to 50 test prompts with expected doc + heading + facts. Pass = 22 of 25. This gate decides whether we ever need vectors. Judging recall by vibes is how the last system died quietly.
Staged auto-fold + /memory-review
Tiny Stop hook logs an NDJSON row. A scheduled worker writes folds only to _staging. /memory-review is the control panel: approve, reject, or edit before anything touches a canonical doc.
Docker migration gate, then kill Postgres
The 5 agent containers are idle poll loops barely using memory, so this may be near-trivial. Verify each recalls from files, then drop the dead Postgres brain.
Raw ingest, small and no fork
A 50-line /ingest: read raw/_inbox, summarize, route to the best living-doc, staged append with a source citation, archive the original. Contradiction callouts once routing is proven.
Vector tierdeferred
Only if the benchmark fails. Evaluate vstash before hand-building. Keep it disposable, keep grep as a permanent fallback.
Self-improving skillsdeferred
Patch-before-mint, mirror-on-write to ~/.claude/skills, human review for the first 10, never auto-skill from pure research.

Reviews

GeminiApproved

Called the architecture "a necessary correction" and validated the single-source-of-truth pivot. Key pushes, all folded in: do not build the vector index for 23 docs (grep + index wins per Karpathy's 50-to-500 rule); staging must be permanent, not a phase; mirror-on-write, not symlinks (Obsidian Sync breaks symlinks); and private financial figures must never be summarized into the synced wiki. Verified the 10,000-char injection cap, async hook support (2.1.0), and that vstash is real.

GPTApproved with edits

"Approve the architecture, reject the complexity of v1." Independently reached the same cuts as Gemini, and added the sharpest single fix: SessionStart runs before the prompt exists, so passive recall must split into SessionStart (stable context) plus UserPromptSubmit (prompt-aware grep). Also: a recall benchmark is the missing piece; use a scheduled fold worker, not async-from-Stop; make /memory-review a real command; and an explicit no-vector-until-benchmark-fails gate. All folded in.

Andrej KarpathyApprove, cut Phase 4

"One of the few AI-memory plans I've read that doesn't immediately reach for a vector database and a graph and three frameworks before it has a single user. You killed a Postgres brain and replaced it with markdown files and grep. That's the right direction." The single-source-of-truth rule at the top is "the whole ballgame," and the no-vector-until-the-benchmark-fails gate is "exactly the discipline almost everyone skips. 23 docs. You can grep 23 docs faster than you can import the embedding library."

Where it still over-builds: Phase 4 is heavier than it needs to be. NDJSON capture, a launchd worker, a lock file, staging and rejected dirs, a review panel, all before a single fact has been auto-folded. "Collapse it: the Stop hook appends a candidate note to one _staging/inbox.md. That's it. The cron worker is solving a throughput problem you do not have." Also: drop session_context.py as a separate hook, let UserPromptSubmit do all the recall with a no-prompt-yet branch.

On safety: staging plus human review is correct, "and I'd go further. Keep the human in that loop indefinitely for anything load-bearing, not just the first 20. A wrong fact re-injected as trusted context is the worst failure mode in the whole system and it's silent."

Bottom line: "Ship Phases 1-3, gut half of Phase 4, and you'll have a memory system better than 95% of the ones with a Series A."

Simulated, channeling Karpathy's known positions on LLMs, context, and markdown knowledge bases. Not an actual review by him.

ChaseAIBuild it

"90% of AI-memory plans die because they're built for the architecture diagram, not for the human at 7am who just wants their tool to remember stuff. This one gets it." Two things stood out: passive recall ("memory surfaces without you asking, no can-you-check-my-notes dance, that's the single biggest unlock for a busy operator") and /memory-review as a control panel ("you're not trusting a robot to silently rewrite your source of truth, you get a queue, you approve or kill. That's how you build trust with a non-technical user").

The worry: "Eight phases, launchd workers, NDJSON queues, lock files, Docker shims. For a solo operator this is a LOT of moving parts. My worry isn't the build, it's month three, when a launchd job silently dies and nobody notices the folds stopped." The fix he wants: a dead-simple health signal, one line in the daily note, "memory captured 4 things yesterday, 2 waiting for your review." If the system goes quiet, you should feel it.

On Obsidian: vault-as-truth is "100% correct," and the symlink-to-mirror fix is "a scar, not a theory." Borrow from kepano's obsidian-skills: "lean harder on aliases and frontmatter for recall routing before you ever think about vectors." Two adoption nudges: a 2-minute onboarding that shows one real recall happening live, and default /memory-review to a gentle morning-note nudge so reviewing becomes a habit.

Bottom line: "Boring-first, benchmarked, and reversible, this is the rare memory system a real operator will actually stick with. Build it."

Simulated, channeling Chase's practical Obsidian + Claude Code creator voice. Not an actual review by him.

What both new reviews agree on (new signal)

Cut Phase 4 down, and make memory visible.

Gemini and GPT validated the architecture. Karpathy and ChaseAI, from opposite ends (minimalist builder vs everyday operator), independently land on the same two changes: (1) collapse Phase 4, drop the NDJSON queue, launchd worker, and lock file in favor of appending candidates to one staging file you review when you review; and (2) add a health signal, one line in the daily note showing what memory captured and what is waiting, so a silently-dead worker can never strand folds the way the old system stranded everything. Both are easy folds into the plan.