Hermes Memory Architecture¶

Filesystem-based memory with progressive disclosure, replacing the deprecated built-in MEMORY.md hot storage.

Current Architecture¶

~/.hermes/memory/
├── system/              # ALWAYS-PINNED (read every session start)
│   ├── identity.md      # Core rules, instruction lock
│   └── active_context.md # Current priorities, routing
├── backlog.yaml         # PINNED — active task backlog
├── backlog_done.yaml    # PINNED — completed tasks (7-day archive)
├── knowledge/           # ON-DEMAND (read when relevant via context map)
│   ├── bio-bridge.md
│   ├── google-setup.md
│   └── ...              # Other project/topic files
├── context_map.yaml     # Directory index (scanned for retrieval)
└── scripts/
    └── generate_map.py  # Rebuild context_map.yaml from files

How It Works¶

System Files (pinned)¶

Read at the start of every session. Core identity and current priorities. Small (~1.5KB total).

Context Map (retrieval index)¶

context_map.yaml lists all files with descriptions and keywords. Scanning this map tells the agent what knowledge is available without reading every file.

Knowledge Files (on-demand)¶

When a topic comes up, check the context map for relevant files, then read the specific file(s) needed. Do NOT preload all knowledge files.

Retrieval Protocol¶

1. Check the context map: read_file ~/.hermes/memory/context_map.yaml
2. Match the topic against the descriptions and keywords.
3. Read only the matching files from knowledge/.
4. If nothing matches, proceed with what you already know.
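The protocol above can be sketched as a keyword match over the parsed map. This is a minimal illustration, not the actual Hermes code: the schema of context_map.yaml is not shown in this note, so the `description` and `keywords` field names, the sample descriptions, and the `match_files` helper are all assumptions.

```python
from typing import Dict, List

# Hypothetical stand-in for the parsed contents of context_map.yaml.
# Field names and descriptions are assumed for illustration only.
CONTEXT_MAP: Dict[str, dict] = {
    "knowledge/bio-bridge.md": {
        "description": "Bio bridge project notes",
        "keywords": ["bio", "bridge"],
    },
    "knowledge/google-setup.md": {
        "description": "Google account and API setup",
        "keywords": ["google", "oauth", "setup"],
    },
}


def match_files(topic: str, context_map: Dict[str, dict]) -> List[str]:
    """Return paths whose description or keywords share a word with the topic."""
    topic_words = set(topic.lower().split())
    hits = []
    for path, entry in context_map.items():
        words = set(entry["description"].lower().split())
        words.update(keyword.lower() for keyword in entry["keywords"])
        if topic_words & words:
            hits.append(path)
    return hits


# Step 4 of the protocol: an empty list means "proceed with what you know".
print(match_files("google oauth", CONTEXT_MAP))
```

Only the paths returned here would then be read; everything else in knowledge/ stays unloaded.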

Retrieval Tools¶

search_knowledge.py — BM25-based search over descriptions and keywords
generate_map.py — Rebuilds context_map.yaml from filesystem
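For reference, BM25 ranking over short description/keyword strings can be sketched as below. The internals of search_knowledge.py are not shown in this note, so this is the textbook Okapi BM25 formula with commonly used defaults (k1 = 1.5, b = 0.75) and naive whitespace tokenization, all of which are assumptions.

```python
import math
from collections import Counter
from typing import Dict


def bm25_scores(query: str, docs: Dict[str, str],
                k1: float = 1.5, b: float = 0.75) -> Dict[str, float]:
    """Score each document against the query with Okapi BM25.

    Assumes `docs` is non-empty and maps a file name to its
    description-plus-keywords text.
    """
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq: Counter = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    scores = {}
    for name, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores[name] = score
    return scores
```

A document that contains none of the query terms scores exactly 0.0, which gives a natural cutoff for "no match found".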

Security Rules¶

NEVER store secrets: no API keys, tokens, or passwords.
system/ files are read-only to the agent main loop.
Writes are atomic: write to a .tmp file, then mv it into place.
Permissions: 700 on directories, 600 on files.
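The atomic-write and permission rules can be combined in one helper. This is a sketch assuming a POSIX filesystem; `atomic_write` is a hypothetical name, and the real memory scripts may do this differently.

```python
import os
from pathlib import Path


def atomic_write(path: Path, text: str) -> None:
    """Write text to path via a sibling .tmp file, then rename into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    os.chmod(path.parent, 0o700)  # directory rule: 700

    tmp = path.with_name(path.name + ".tmp")
    # Create the temp file with mode 600 so it is never world-readable.
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(text)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the bytes reach disk first
    os.chmod(tmp, 0o600)  # enforce 600 even if the umask or an old file differed

    # os.replace is atomic when tmp and path are on the same filesystem,
    # so readers see either the old file or the new one, never a partial write.
    os.replace(tmp, path)
```

Keeping the .tmp file next to the target (rather than in /tmp) is what keeps the rename on a single filesystem.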

Vault (Permanent Archive)¶

An Obsidian vault at ~/hermes-vault/ holds permanently validated facts, project histories, and decisions. It uses wikilinks, YAML frontmatter, and the graph view, and is managed by the pkm-archive skill.

Migration Status¶

Built-in MEMORY.md deprecated (cleared to ~8% with deprecation pointers)
All content migrated to filesystem memory
Cron jobs updated to use read_file/write_file instead of memory tool
Git-tracked at ~/.hermes/memory/

Related¶

[[Context Portfolio]]
[[Hermes Infrastructure]]
[[Hermes Vision]]
[[Multica Evaluation]]

Document RAG UI and citation surfaces — June 2026¶

Document agents need user-visible verification surfaces, not just hidden retrieval logs. The open-source document RAG/agent UI-kit pattern is valuable because it exposes source documents, bounding-box citations, intake status, and answer grounding in the same workflow.

Hermes implication: for PDF/DOCX/XLSX workflows, prefer interfaces that let Adam inspect the cited page/region beside the answer. This is especially relevant for document extraction, forms, LocalStack onboarding material, and any workflow where the failure mode is "confident answer from the wrong source".

Source: "We built an open source UI kit for document RAG/agents" (2026-06-18).

Staleness checks for docs, skills, and context — June 2026¶

Treedocs' "documentation that automatically checks for staleness" maps to a recurring Hermes failure mode: stale skills, stale docs, stale context packs, and unchanged instructions after architecture shifts.

Durable rule: any context surface that agents rely on should have an explicit freshness mechanism. Useful checks include linked file existence, referenced command smoke tests, last-updated age, owner/review date, and assertions that can be probed deterministically.

Hermes implication: skill maintenance should evolve from manual cleanup to scheduled staleness detection that creates Linear follow-ups when a skill's commands, paths, or assumptions rot.
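Two of the checks named in the durable rule, linked file existence and last-updated age, are simple to probe deterministically. The sketch below is illustrative only: `staleness_issues` and its parameters are hypothetical names, and it uses file modification time as a stand-in for a proper review date.

```python
import time
from pathlib import Path
from typing import List, Optional


def staleness_issues(doc: Path, linked_paths: List[Path],
                     max_age_days: float,
                     now: Optional[float] = None) -> List[str]:
    """Return human-readable staleness findings for one document."""
    now = time.time() if now is None else now
    issues = []

    # Check 1: every path the document refers to must still exist.
    for target in linked_paths:
        if not target.exists():
            issues.append(f"missing link target: {target}")

    # Check 2: the document itself must have been touched recently enough.
    age_days = (now - doc.stat().st_mtime) / 86400
    if age_days > max_age_days:
        issues.append(f"last updated {age_days:.0f} days ago (limit {max_age_days:g})")

    return issues
```

A scheduled job could run this over every skill file and open a follow-up only when the returned list is non-empty.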

Source: "Show HN: Treedocs: Documentation that automatically checks for staleness" (2026-06-23).

2026-06-12 — One Obsidian vault → one local brain → 4 AI agents sharing the same memory ¶

Source: slack-intake
Domains: agentic_engineering, hermes_system
Why it was promoted: This is directly applicable to Personal Context Management and Hermes memory architecture. It should update the architecture note/design backlog rather than remain raw-only.
Raw source: /home/adam/.hermes/context-inbox/raw/intake/2026-06-13/onebrain-reddit-hermesagent.md

Summary: A Hermes user describes a shared local-memory stack: Obsidian as the only durable store, gbrain as a local PGLite + Ollama/nomic-embed-text + knowledge-graph brain layer, and one HTTP MCP endpoint used by Claude, Codex, Grok, and Hermes. The useful takeaway is the separation of responsibilities: always-on instructions/rules in the agent context, small stable preferences in memory.md, and factual/project/research knowledge in the vault/brain. The second brain improves recall but cannot enforce behaviour unless the always-on layer tells the agent when to search it.

Reusable context: High relevance to Adam's Hermes/Mnemosyne/Obsidian direction. Strong design lessons: keep Obsidian as canonical source of truth; make indexes rebuildable; expose a shared MCP query surface to all agents; exclude secrets and config-mirror notes before indexing; leash autonomous enrichment with kill-switch/watchdog; and use a dedicated embedding model rather than a chat model. PGLite single-writer locking is a gbrain-specific pitfall; our Mnemosyne/SQLite/Postgres choices should be reviewed for equivalent write-lock risks.

2026-06-13 — Larger Context Windows Don’t Fix RAG — So I Built a System That Does ¶

Source: unknown
Domains: agentic_engineering, analytics_engineering, hermes_system
Why it was promoted: High-signal durable technical story suitable for synthesis into the context library.

Summary: Critique of context windows vs improved RAG systems with proposed alternatives.

Reusable context:

What happened: The article highlights a key failure in current RAG systems: larger context windows do not resolve accuracy issues, particularly for analytical queries. It proposes a "QueryRouter" system that routes queries by intent ("Computation" or "Retrieval") to address this "Error Observability Collapse."

Why it matters: This is critical for analytics engineers and those working with AI/ML tooling, as it underscores that LLMs are not reliable computational engines for aggregations. Relying solely on RAG for analytical questions leads to polished but incorrect results.

What to do: Evaluate implementing a query classification layer (like the proposed QueryRouter) in your AI/analytics stack to direct computational queries to deterministic engines (e.g., dbt, Snowflake) and factual retrieval queries to RAG.
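As a rough illustration of the routing idea, a classification layer can start as a keyword heuristic in front of the two back ends. The article's actual QueryRouter implementation is not described here, so `route_query` and the hint list are assumptions; a production router would more likely use a trained classifier or an LLM call.

```python
import re

# Hypothetical aggregation cues that suggest the question needs a
# deterministic engine rather than document retrieval.
COMPUTATION_HINTS = re.compile(
    r"\b(sum|total|average|mean|median|count|how many|max|min|percent)\b",
    re.IGNORECASE,
)


def route_query(question: str) -> str:
    """Return "computation" for aggregation-style questions, else "retrieval"."""
    return "computation" if COMPUTATION_HINTS.search(question) else "retrieval"
```

Questions routed to "computation" would be answered by SQL against the warehouse, and only "retrieval" questions would reach the RAG pipeline.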

2026-06-13 — Megathread Summary: I Asked Multiple Reddit Communities How to Build a Living Memory /Context Engine for Business. Here's what everyone had to say.¶

Source: unknown
Domains: agentic_engineering, analytics_engineering, hermes_system
Why it was promoted: High-signal durable technical story suitable for synthesis into the context library.

Reusable context:

What happened: A Reddit megathread summarized community discussions on building a "living memory" or context engine for businesses, focusing on design philosophies like "Query-First Design," architectural choices such as append-only event logs and hybrid search, and memory management strategies including significance scoring.

Why it matters: This research directly informs the development of advanced AI tooling and agent frameworks by providing practical insights into managing and synthesizing enterprise knowledge, which is critical for analytics engineers integrating AI with data platforms and orchestration tools.

What to do: Evaluate hybrid search (vector + relational/graph) solutions and append-only event log architectures for future knowledge management systems within your data stack.

2026-06-13 — Parse PDFs for RAG Locally with Docling: Rich Tables, No Cloud Upload ¶

Source: unknown
Domains: agentic_engineering, hermes_system
Why it was promoted: High-signal durable story with actionable implications for the context library.

Summary: Local-first RAG parser (Docling) for complex table data.

Reusable context:

What happened: IBM Research released Docling, an open-source tool for local PDF parsing, offering high-fidelity extraction of text, tables, and images, particularly beneficial for Retrieval Augmented Generation (RAG) pipelines without relying on cloud services.

Why it matters: Docling addresses data privacy and compliance concerns for analytics engineers by enabling local processing of sensitive documents. It enhances developer productivity through a unified API that consistently handles various parsing engines.

What to do: Evaluate Docling for your RAG pipelines, especially for scenarios requiring on-premise PDF processing and complex table extraction, to maintain data sovereignty and improve parsing quality.

2026-06-15 — Building a CPU LLM engine in C99 - stuck at 1.90 tok/s on DeepSeek MoE while llama.cpp does 13.79. Potential root cause identified. Implementation is not.¶

Source: Reddit r/LocalLLaMA
Domains: analytics_engineering, hermes_system
Why it was promoted: High-signal durable technical story suitable for synthesis into the context library.

Reusable context:

What happened: A developer building a CPU LLM engine in C99 encountered a severe performance bottleneck, achieving only 1.90 tok/s on DeepSeek MoE compared to llama.cpp's 13.79 tok/s. The root cause was identified as memory bandwidth contention due to dequantizing Q4_K weights to F32 before computation.

Why it matters: This issue underscores the critical role of low-level optimization and efficient data handling in achieving performant local LLM inference, particularly for complex models like MoE. It demonstrates that basic implementation choices can drastically impact AI/ML tooling efficiency on data platforms.

What to do: When developing or evaluating AI/ML tooling for local LLM inference, prioritize solutions that utilize highly optimized, fused matvec kernels (e.g., ggml_vec_dot_q4_K_q8_K in llama.cpp) to minimize memory bandwidth usage and leverage specialized CPU instructions.

2026-06-15 — I built an open-source Knowledge Graph pipeline with hybrid retrieval to improve LLM multi-hop reasoning [P]¶

Source: Reddit r/MachineLearning
Domains: analytics_engineering, hermes_system
Why it was promoted: High-signal durable technical story suitable for synthesis into the context library.

Reusable context:

What happened: An open-source Knowledge Graph pipeline, "GraphRAG Studio," was developed using Django/React. It extracts entities and builds knowledge graphs from text, employing hybrid retrieval (dense vector + BM25) and graph traversal to enhance LLM multi-hop reasoning and address limitations of standard vector search.

Why it matters: This directly impacts AI/ML tooling and developer productivity by offering a robust solution for complex LLM applications. It provides a blueprint for integrating structured knowledge into RAG systems, which is crucial for analytics engineers building advanced data platforms that leverage LLMs for deeper insights and more reliable query responses.

What to do: Evaluate GraphRAG Studio for potential integration into your LLM-powered data analytics workflows to improve multi-hop reasoning capabilities.
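One common way to merge a dense-vector ranking with a BM25 ranking is reciprocal rank fusion (RRF). Whether GraphRAG Studio uses RRF is not stated in the source; the sketch below only illustrates the general hybrid-retrieval idea, with the conventional constant k = 60.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    with rank starting at 1. Ties are broken alphabetically by ID.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_ranking = ["a", "b", "c"]  # hypothetical vector-search order
bm25_ranking = ["b", "c", "a"]   # hypothetical keyword-search order
print(reciprocal_rank_fusion([dense_ranking, bm25_ranking]))  # ['b', 'a', 'c']
```

RRF needs only ranks, not comparable scores, which is why it is convenient when the two retrievers score on different scales.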

2026-06-15 — I'm still surprised on how good the kv quantization has become ¶

Source: unknown
Domains: analytics_engineering, hermes_system
Why it was promoted: High-signal durable story with actionable implications for the context library.

Summary: Advancements in KV cache quantization efficiency.

Reusable context:

What happened: Recent advancements in KV quantization have significantly improved the efficiency of Large Language Model (LLM) deployments. By compressing the Key-Value (KV) cache, these techniques reduce memory footprint and increase inference throughput, particularly beneficial for LLMs with expanding context windows.

Why it matters: These developments offer analytics engineers and data platform specialists tangible benefits, including reduced infrastructure costs, increased concurrency for AI workloads, and the capability to deploy more sophisticated LLMs with longer context windows within existing hardware constraints.

What to do: Explore and integrate LLM serving frameworks like vLLM or TensorRT-LLM that leverage advanced KV quantization formats (e.g., FP8, NVFP4) to optimize memory utilization and enhance the performance of your AI deployments.
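The memory saving is easy to estimate with back-of-envelope arithmetic. The model dimensions below are hypothetical, and the formula ignores the small overhead of quantization scale factors.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: float, batch: int = 1) -> float:
    """Approximate KV cache size for one model configuration.

    The factor of 2 covers the separate key and value tensors per layer.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value * batch


GIB = 1024 ** 3
# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 32,768-token context.
fp16 = kv_cache_bytes(32, 8, 128, 32_768, bytes_per_value=2)
fp8 = kv_cache_bytes(32, 8, 128, 32_768, bytes_per_value=1)
print(fp16 / GIB, fp8 / GIB)  # 4.0 2.0
```

Halving the bytes per value halves the cache, so the same GPU can hold roughly twice the context or twice the concurrent sequences.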

2026-06-15 — I made a private on-device LLM app for Android (notes + recall, nothing leaves the phone)¶

Source: unknown
Domains: agentic_engineering, hermes_system
Why it was promoted: High-signal durable story with actionable implications for the context library.

Summary: New private, on-device Android LLM app for notes and recall, ensuring data privacy.

Reusable context:

What happened: A developer created an Android app featuring a private, on-device Large Language Model (LLM) for note-taking, audio transcription, and semantic recall, functioning entirely offline without cloud interaction.

Why it matters: This demonstrates the viability of decentralized, privacy-first AI/ML tooling and local Retrieval-Augmented Generation (RAG) on mobile hardware, significantly impacting developer productivity and data security by eliminating cloud dependencies.

What to do: Evaluate the feasibility of deploying local LLMs for sensitive data processing within your current stack.

2026-06-15 — Pretrained to Imagine, Fine-Tuned to Act: The Rise of World-Action Models ¶

Source: unknown
Domains: agentic_engineering, analytics_engineering, hermes_system
Why it was promoted: High-signal durable story with actionable implications for the context library.

Summary: New research proposing the shift from pretrained world models to fine-tuned action models for agents.

Reusable context:

What happened: NVIDIA's blog introduces World-Action Models (WAMs), a new paradigm for robotic foundation models that predict future world states and actions using pretrained video backbones, shifting from passive world models to active ones. This approach aims to close the "grounding gap" between language instructions and physical execution.

Why it matters: WAMs represent a significant evolution in AI/ML tooling for agents, offering improved data efficiency and zero-shot imagination for complex robotic tasks. However, their high training costs, slow inference, and substantial GPU memory requirements present critical challenges for data platforms and developer productivity.

What to do: Research emerging agent frameworks and hardware solutions optimizing for WAMs' computational demands and explore hybrid VLA+WAM architectures for future AI deployments.

2026-06-16 — Fine-Tuning Biological Foundation Models with LoRA Using NVIDIA BioNeMo Recipes ¶

Source: NVIDIA Technical Blog
Domains: analytics_engineering, hermes_system
Why it was promoted: High-signal durable technical story suitable for synthesis into the context library.

Reusable context:

What happened: NVIDIA BioNeMo Recipes now support LoRA for efficient fine-tuning of large biological foundation models like ESM2 and Evo2, enabling state-of-the-art results on standard workstation hardware.

Why it matters: This significantly democratizes access to and adaptation of massive protein and DNA models for AI/ML tooling and data platforms by reducing memory footprints and increasing throughput for fine-tuning.

What to do: Evaluate BioNeMo Recipes to fine-tune biological models for specific tasks on commodity hardware within your AI/ML pipelines.

2026-06-16 — Making ast.walk 220x Faster ¶

Source: unknown
Domains: agentic_engineering, hermes_system
Why it was promoted: High-signal durable technical story suitable for synthesis into the context library.

Summary: Technical deep dive into optimizing Python's 'ast.walk' by 220x.

Reusable context:

What happened: Reflex significantly optimized Python's ast.walk function, achieving a 220x speed improvement. This was accomplished through iterative Python optimizations and a critical porting of the AST traversal logic to Rust using PyO3, alongside direct memory access and precomputed metadata.

Why it matters: This performance boost is crucial for developer tools, linters, static analyzers, and AI/ML code generation, which heavily rely on efficient AST traversal. Faster code processing directly enhances developer productivity and the responsiveness of AI-powered coding assistants.

What to do: Evaluate ast.walk usage in your Python-based developer tools or AI/ML code processing pipelines for potential performance bottlenecks and consider similar optimization techniques.

2026-06-17 — GLM 5.2 Performance Benchmarks ¶

Source: unknown
Domains: agentic_engineering, analytics_engineering, hermes_system
Why it was promoted: High-signal durable technical story suitable for synthesis into the context library.

Summary: Performance benchmarks for the new GLM 5.2 model.

Reusable context:

What happened: GLM-5.2, a 753B parameter reasoning model with a 1M token context window, achieved the #1 ranking on the Artificial Analysis Intelligence Index, demonstrating strong performance in agentic tool use and terminal tasks.

Why it matters: This model's capabilities are highly relevant for AI/ML tooling and data platforms, especially for complex Retrieval Augmented Generation (RAG) and long-horizon agentic workflows, despite its higher cost and verbosity as an open-weights model.

What to do: Evaluate GLM-5.2 for potential integration into your AI/ML stack, focusing on its advanced reasoning and extensive context window for agentic applications, while considering its cost-performance trade-offs.

2026-06-17 — Hermes Architecture EXPLAINED: Memory, Context & Gateways ¶

Source: unknown
Domains: agentic_engineering, hermes_system
Why it was promoted: High-signal durable story with actionable implications for the context library.

Summary: Exploration of the Hermes architecture components: memory, context, and gateway patterns.

Reusable context: Not captured. The source is a YouTube video, and the web_fetch tool used at intake only processes text-based web pages, so no transcript or summary of the video is available.

2026-06-23 — Same model, same prompt, 4 different agents ¶

Source: Reddit r/LocalLLaMA
Domains: agentic_engineering, analytics_engineering, hermes_system
Why it was promoted: High-signal durable technical story suitable for synthesis into the context library.

Reusable context:

What happened: A user tested the exact same LLM and prompt across four different agent frameworks, demonstrating significant variance in outputs, tool usage, and reliability due to differences in framework orchestration logic.

Why it matters: For analytics engineers evaluating AI tooling, this highlights that the agent framework's architecture (how it handles memory, planning, and tool calling) impacts results as much as the underlying model choice, complicating reproducibility in data pipelines.

What to do: Standardize evaluation criteria across agent frameworks before committing to one for dbt/Snowflake integrations or MWAA orchestration, ensuring you test identical prompts across multiple frameworks to isolate framework-induced variance.

2026-06-23 — When RAG Users Ask Vague Questions: Clarify Once, Learn the Default ¶

Source: Towards Data Science
Domains: agentic_engineering, analytics_engineering, hermes_system
Why it was promoted: High-signal durable technical story suitable for synthesis into the context library.

Reusable context:

What happened: The article proposes a RAG pattern where the system asks clarifying questions when users submit vague queries, but then learns and stores the user's clarification as a default preference, so the same ambiguity doesn't trigger repeated clarification loops.

Why it matters: For analytics engineers building LLM-powered data assistants (e.g., natural-language SQL interfaces over Snowflake), this pattern directly addresses UX friction: users abandon tools that ask too many follow-up questions. The approach uses Pydantic for structured clarification capture and persistent preference storage, which maps cleanly onto existing data pipelines and metadata tables you likely already manage in dbt/Snowflake.

What to do: Prototype this "clarify-once, learn-default" pattern in your next RAG agent build: store clarification preferences in a Snowflake table keyed by user ID, and wire the retrieval step to inject stored defaults before the LLM generates a response. Evaluate whether frameworks like LangGraph or LlamaIndex support this natively or if you need a custom Pydantic-based implementation.
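The control flow of "clarify once, learn the default" fits in a few lines. This sketch swaps in a standard-library dataclass and an in-memory dict where the article uses Pydantic and persistent storage; `PreferenceStore`, `handle_query`, and the ambiguity key are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class PreferenceStore:
    """In-memory stand-in for a preferences table keyed by user ID."""
    defaults: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def resolve(self, user_id: str, ambiguity: str) -> Optional[str]:
        return self.defaults.get((user_id, ambiguity))

    def learn(self, user_id: str, ambiguity: str, choice: str) -> None:
        self.defaults[(user_id, ambiguity)] = choice


def handle_query(store: PreferenceStore, user_id: str, ambiguity: str,
                 ask_user: Callable[[str], str]) -> str:
    """Return the stored default if one exists; otherwise ask once and remember."""
    stored = store.resolve(user_id, ambiguity)
    if stored is not None:
        return stored
    choice = ask_user(ambiguity)
    store.learn(user_id, ambiguity, choice)
    return choice
```

The resolved value would be injected into the retrieval step before generation, so the second occurrence of the same ambiguity never reaches the user.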

2026-06-25 — Retrieval Is Filtering, Not Search: A Mental Model for Enterprise RAG ¶

Source: Towards Data Science
Domains: agentic_engineering, analytics_engineering, hermes_system
Why it was promoted: High-signal durable technical story suitable for synthesis into the context library.

Reusable context:

What happened: The article argues that enterprise RAG should be modeled as a multi-stage filtering pipeline, not a single semantic search step, where metadata, access controls, and business rules progressively narrow the corpus before LLM-based retrieval, improving precision and reducing hallucination risk.

Why it matters: Analytics engineers already think in staged transformations (dbt models, CTEs). This mental model maps directly: treat retrieval like a dbt DAG where each layer filters by metadata, row-level security, or freshness before semantic ranking. That makes RAG more deterministic and auditable, especially over Snowflake-hosted documents.

What to do: When evaluating RAG frameworks or building agents on MWAA, design retrieval as a filter chain (metadata → permissions → recency → semantic) rather than a single vector search call, and instrument each stage for observability the way you would a dbt model.
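The filter chain can be sketched as four explicit stages that each record how many candidates survive. The `Doc` fields, the `filter_chain` signature, and the pluggable `rank` function (a stand-in for a real semantic scorer) are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List, Set, Tuple


@dataclass
class Doc:
    doc_id: str
    department: str
    allowed_roles: Set[str]
    updated: date
    text: str


def filter_chain(docs: List[Doc], department: str, role: str, not_before: date,
                 rank: Callable[[Doc], float],
                 top_k: int = 3) -> Tuple[List[Doc], List[Tuple[str, int]]]:
    """Narrow the corpus stage by stage, recording survivor counts per stage."""
    stage_counts = []

    candidates = [d for d in docs if d.department == department]
    stage_counts.append(("metadata", len(candidates)))

    candidates = [d for d in candidates if role in d.allowed_roles]
    stage_counts.append(("permissions", len(candidates)))

    candidates = [d for d in candidates if d.updated >= not_before]
    stage_counts.append(("recency", len(candidates)))

    # Only the survivors are scored by the (expensive) semantic ranker.
    ranked = sorted(candidates, key=rank, reverse=True)[:top_k]
    stage_counts.append(("semantic", len(ranked)))

    return ranked, stage_counts
```

The per-stage counts are the observability hook: a sudden drop to zero at "permissions" or "recency" explains an empty answer without inspecting the ranker at all.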
