Read-only Memory
Adding memory to LLMs and agents lets you do a few interesting things:
Create agents with infinite memory for past conversations.
Read documents in chunks.
Have the assistant remember user preferences.
Inspired by the MemGPT paper from just over a year ago, I built an LLM system with read-only local memory, allocated within the context window.
This is the first in a two-part series. In the next part, I'll look at how to add read/write memory to store user preferences and allow documents to be read in chunks (much as Cursor does when reading long files).
Enjoyed the video or have Qs? Drop a comment below or reply to this email with a 🤙.
Cheers, Ronan
🛠 Explore Fine-tuning, Inference, Vision, Audio, and Evaluation Tools
💡 Consulting (Technical Assistance OR Market Insights)
Adding Read-Only Memory to LLMs and LLM Agents
Large language models (LLMs) can be enhanced with memory systems that allow them to access information beyond their context window. This video examines how to implement a read-only memory system that enables an LLM to retrieve and reference past conversations.
Core Memory Architecture
The system consists of two main components:
Local memory: The LLM's context window, limited to a fixed token count (e.g., 2,500 tokens)
Disk memory: A database storing the complete conversation history
The local memory is further divided into:
System message (500 tokens)
Read-only retrieval block (500 tokens)
Recent chat history (remaining tokens)
Memory Management System
The implementation uses a First-In-First-Out (FIFO) approach:
Recent conversations stay in local memory until pushed out by newer ones
All conversations are saved to the disk database
The LLM can query past conversations through search commands
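As a rough sketch of that FIFO behaviour (the function name and the 1,500-token budget below are illustrative, derived from the context figures later in this post, not taken from the original code):

```python
LOCAL_HISTORY_BUDGET = 1500  # tokens left for recent chat after the system and read-only blocks

def add_turn(recent_turns, turn):
    """Append a new conversation turn and evict the oldest turns (FIFO) when over budget.

    The same turn is also persisted to the disk database (not shown here),
    so turns evicted from local memory are never lost."""
    recent_turns.append(turn)  # turn is a dict with a precomputed "tokens" field
    while sum(t["tokens"] for t in recent_turns) > LOCAL_HISTORY_BUDGET:
        recent_turns.pop(0)  # the oldest turn falls out of the context first
    return recent_turns
```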
Search and Retrieval
The system implements a paginated search mechanism:
LLM issues search commands using XML-style tags:
<fetch_memory>query</fetch_memory>
Results are returned in pages of 3 conversation turns
LLM can request subsequent pages using
<fetch_memory_page>2</fetch_memory_page>
Search uses keyword matching (can be upgraded to BM25 or embeddings)
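One way to pull those commands out of the model's reply is a small regex helper like the following (the tag names match the ones above; the helper itself is an illustrative sketch):

```python
import re

def extract_memory_command(reply: str):
    """Return ("search", query) or ("page", n) if the reply contains a memory tag, else None."""
    match = re.search(r"<fetch_memory>(.*?)</fetch_memory>", reply, re.DOTALL)
    if match:
        return ("search", match.group(1).strip())
    match = re.search(r"<fetch_memory_page>(\d+)</fetch_memory_page>", reply)
    if match:
        return ("page", int(match.group(1)))
    return None

# extract_memory_command("<fetch_memory>postgres migration</fetch_memory>")
# -> ("search", "postgres migration")
```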
Technical Implementation Details
Context Management:
Total context = 2,500 tokens
System message = 500 tokens
Read-only memory = 500 tokens
User/Assistant messages = 250 tokens each
Remaining space allocated to recent chat history
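As a quick sanity check on that budget (a sketch; 250 tokens is treated here as a per-message cap):

```python
TOTAL_CONTEXT = 2500
SYSTEM_MESSAGE = 500
READ_ONLY_BLOCK = 500
PER_MESSAGE_CAP = 250

recent_history_budget = TOTAL_CONTEXT - SYSTEM_MESSAGE - READ_ONLY_BLOCK  # 1500 tokens
max_recent_messages = recent_history_budget // PER_MESSAGE_CAP            # about 6 messages (3 turns)
```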
Database Structure:
Conversations stored in JSON format
Each entry includes:
User message
Assistant response
Timestamp
Token count
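The exact schema isn't shown above, so the field names in this sketch are assumptions, but an entry along these lines captures the four listed fields:

```python
import json
import time

entry = {
    "user": "What did we decide about the search upgrade?",   # user message
    "assistant": "We agreed to try BM25 before embeddings.",  # assistant response
    "timestamp": time.time(),                                  # when the turn was saved
    "tokens": 42,                                               # token count for the turn
}

# One option: append each turn as a JSON line to the conversation database file.
with open("conversations.jsonl", "a") as f:
    f.write(json.dumps(entry) + "\n")
```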
Search Implementation:
Case-insensitive keyword matching
Results grouped into conversation turns
Pagination with 3 turns per page
Search state maintained for multi-page retrieval
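A minimal version of that search might look like this (case-insensitive substring matching with three turns per page; `entries` follows the schema sketched above, and the caller keeps the last query around so a page request can re-run it):

```python
PAGE_SIZE = 3  # conversation turns per page

def search_memory(entries, query, page=1):
    """Case-insensitive keyword search over stored turns, returned one page at a time."""
    q = query.lower()
    hits = [e for e in entries
            if q in e["user"].lower() or q in e["assistant"].lower()]
    start = (page - 1) * PAGE_SIZE
    page_hits = hits[start:start + PAGE_SIZE]
    has_more = start + PAGE_SIZE < len(hits)
    return page_hits, has_more
```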
Command Processing
The system processes LLM commands through:
Regular expression extraction of search queries
Token counting for context management
Result formatting and injection into read-only memory
Confirmation messages back to the LLM
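Putting those steps together, a command-processing pass could look roughly like the sketch below. It reuses the illustrative helpers from earlier sections and assumes a `count_tokens` function is available; none of the names are from the original code.

```python
READ_ONLY_BUDGET = 500  # tokens reserved for the read-only retrieval block

def process_reply(reply, entries, state, count_tokens):
    """Handle a memory command in a model reply; build the read-only block and a confirmation."""
    command = extract_memory_command(reply)
    if command is None:
        return None  # no memory request in this reply

    kind, value = command
    if kind == "search":
        state["query"], state["page"] = value, 1  # remember the query for later page requests
    else:
        state["page"] = value

    hits, has_more = search_memory(entries, state["query"], state["page"])

    # Format the results and trim them to fit the read-only token budget.
    lines = [f"User: {h['user']}\nAssistant: {h['assistant']}" for h in hits]
    block = "\n---\n".join(lines)
    while lines and count_tokens(block) > READ_ONLY_BUDGET:
        lines.pop()
        block = "\n---\n".join(lines)

    more = " (more pages available)" if has_more else ""
    confirmation = f"Retrieved page {state['page']} for '{state['query']}'{more}."
    return block, confirmation
```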
Future Enhancements
The system can be improved through:
Postgres implementation with BM25 search
Vector embeddings for semantic search
Date-based retrieval
Rate limiting and throttling
Multi-user support
Authentication
Technical Requirements
The implementation requires:
Python environment
Anthropic's Claude API
JSON for storage (or Postgres for production)
Regular expressions for command parsing
Token counting utilities
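For reference, here is one minimal way to assemble those pieces into a request with the official `anthropic` Python SDK; the model name and the tag wrapping the retrieved memories are placeholders, not part of the original implementation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask(system_message, read_only_block, recent_turns, user_message):
    """Send the system message, retrieved memories, and recent history in a single request."""
    messages = []
    for t in recent_turns:
        messages.append({"role": "user", "content": t["user"]})
        messages.append({"role": "assistant", "content": t["assistant"]})
    messages.append({"role": "user", "content": user_message})

    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model name
        max_tokens=512,
        system=f"{system_message}\n\n<read_only_memory>\n{read_only_block}\n</read_only_memory>",
        messages=messages,
    )
    return response.content[0].text
```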
This memory system lets LLMs maintain conversational context beyond their standard context window, within a clean, modular architecture that can be extended to more complex use cases.
You're tackling an interesting topic that I run into every day too, especially whenever I read the next “RAG is dead” post.
The main problem we have right now when working with knowledge (especially in the enterprise) is that you don't want to send the document to the LLM every time just to answer one question.
That's just horribly inefficient.
There must be a way to keep the document close to the LLM (like your disk idea), so I can send several questions along with chat history.
You mentioned how Cursor does it. Do you have any interesting links?