"Why does my AI keep forgetting what I told it?" This is the #1 frustration with LLM applications. The culprit: context windows. Here's what's actually happening and how to fix it.
The Problem: Context Windows Are Finite
Every LLM has a context window: the maximum amount of text it can "see" at once. GPT-4 Turbo gives you 128K tokens; Claude gives you 200K. Sounds like a lot, right?
It's not. Here's why:
The Math Problem
- System prompt: ~500-2000 tokens
- Tool definitions: ~1000-5000 tokens
- Conversation history: grows with every message
- Retrieved documents (RAG): ~1000-10000 tokens
- Current user message + response: ~500-2000 tokens
After a few exchanges, you're already using 20-50K tokens. After an hour of work? You hit the limit.
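You can watch the budget drain with a quick back-of-the-envelope check. Here's a rough sketch using the tiktoken library; the fixed-overhead figures are illustrative, not measured:

```python
# Back-of-the-envelope token accounting (overhead figures are illustrative).
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
CONTEXT_WINDOW = 128_000

conversation = [
    {"role": "user", "content": "Refactor the auth module to use JWTs."},
    {"role": "assistant", "content": "Sure, here's a plan: ..."},
    # ...imagine a few hundred more messages after an hour of work
]

fixed_overhead = 1_500 + 3_000 + 8_000  # system prompt + tools + RAG docs
history = sum(len(enc.encode(m["content"])) for m in conversation)

used = fixed_overhead + history
print(f"{used} / {CONTEXT_WINDOW} tokens ({used / CONTEXT_WINDOW:.1%})")
```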
What Happens When You Hit the Limit?
The typical solution is truncation: drop old messages to make room for new ones. The result? Your AI forgets:
- What you discussed an hour ago
- Decisions you made together
- Your preferences and context
- Errors it already helped you fix
Every session starts from scratch. You repeat yourself constantly.
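For reference, this is roughly what that truncation looks like; a minimal sketch, where count_tokens is a stand-in for whatever tokenizer you use:

```python
# Naive sliding-window truncation: drop the oldest messages until we fit.
def truncate(messages: list[dict], max_tokens: int, count_tokens) -> list[dict]:
    kept = list(messages)
    # The oldest messages go first, along with any decisions or fixes they held.
    while kept and sum(count_tokens(m["content"]) for m in kept) > max_tokens:
        kept.pop(0)
    return kept
```

Nothing in the dropped messages survives. The model isn't "forgetting" in any deep sense; the text is simply no longer in the prompt.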
Why RAG Isn't Enough
Retrieval-Augmented Generation (RAG) helps by fetching relevant documents. But it has limits:
RAG is good for:
- ✅ Static documents
- ✅ Knowledge bases
- ✅ FAQ-style retrieval
RAG can't:
- ❌ Remember conversations
- ❌ Learn preferences over time
- ❌ Track decisions and outcomes
- ❌ Build relationships between facts
RAG retrieves documents. Memory retrieves experiences. They solve different problems.
The Solution: Persistent Memory
Persistent memory is a separate system that stores what your AI learns—outside the context window. When needed, relevant memories are retrieved and injected into the prompt.
How It Works
1. Store: Important facts, decisions, and learnings go into memory
2. Index: Memories are embedded as vectors for semantic search
3. Retrieve: When relevant, memories are pulled into the prompt
4. Forget: Old, unused memories decay naturally (like human memory)
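Here's a toy version of those four steps; a minimal sketch using numpy, with a placeholder embed() standing in for a real embedding model:

```python
# Toy persistent-memory store: store -> index -> retrieve -> forget.
import time
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: hash words into a fixed-size vector. A real system
    # would call an embedding model here.
    vec = np.zeros(256)
    for word in text.lower().split():
        vec[hash(word) % 256] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

class MemoryStore:
    def __init__(self, half_life_days: float = 30.0):
        self.items: list[list] = []           # [vector, text, last_used]
        self.half_life = half_life_days * 86_400

    def store(self, text: str) -> None:
        # Store + index: embed once at write time.
        self.items.append([embed(text), text, time.time()])

    def recall(self, query: str, limit: int = 3) -> list[str]:
        # Retrieve: rank by similarity, discounted by a recency decay.
        q, now = embed(query), time.time()
        def score(item):
            vec, _, last_used = item
            recency = 0.5 ** ((now - last_used) / self.half_life)
            return float(vec @ q) * recency
        top = sorted(self.items, key=score, reverse=True)[:limit]
        for item in top:
            item[2] = now                     # recalling a memory refreshes it
        return [text for _, text, _ in top]

    def forget(self, threshold: float = 0.01) -> None:
        # Forget: drop memories whose decay factor has fallen too low.
        now = time.time()
        self.items = [i for i in self.items
                      if 0.5 ** ((now - i[2]) / self.half_life) > threshold]
```

A production store would persist to disk and use an approximate-nearest-neighbor index, but the store/index/retrieve/forget loop has the same shape.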
Memory vs RAG vs Fine-tuning
| Approach | Best For | Limitation |
|---|---|---|
| RAG | Static knowledge bases | Doesn't learn or remember |
| Fine-tuning | Permanent behavior changes | Expensive, slow, can't undo |
| Memory | Dynamic, personal context | Requires retrieval at runtime |
Most applications need all three: RAG for documents, fine-tuning for core behaviors, memory for personalization and learning.
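Composed together, that might look like the following hypothetical prompt assembly, where rag_search and memory are stand-ins for your retrieval layer and memory store:

```python
# Hypothetical prompt assembly: RAG for documents, memory for personal
# context. Core behavior (tone, domain conventions) lives in the
# fine-tuned model itself, not in the prompt.
def build_prompt(user_msg: str, rag_search, memory) -> str:
    docs = rag_search(user_msg, limit=3)         # static knowledge
    memories = memory.recall(user_msg, limit=3)  # dynamic, personal context
    return (
        "Relevant documents:\n" + "\n".join(docs) + "\n\n"
        "Context from memory:\n" + "\n".join(memories) + "\n\n"
        f"User question: {user_msg}"
    )
```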
Adding Memory in Practice
```python
from shodh_memory import Memory

memory = Memory()

# Store what you learn
memory.remember("User prefers TypeScript over JavaScript", memory_type="Decision")
memory.remember("Project deadline is January 15th", memory_type="Context")
memory.remember("Auth bug was caused by expired JWT", memory_type="Error")

# Later, retrieve relevant context
context = memory.recall("What's the deadline?", limit=3)
# Returns: "Project deadline is January 15th"

# Inject into your LLM prompt
prompt = f"""Context from memory:
{context}

User question: When do we need to ship?"""

# Your LLM now has persistent context
```

The Result
With persistent memory, your AI:
- Remembers preferences — No more "I prefer dark mode" every session
- Tracks decisions — "We decided to use PostgreSQL because..."
- Learns from errors — "Last time this failed because..."
- Builds context over time — Gets smarter the more you use it
Get Started
```bash
pip install shodh-memory
```

No cloud accounts. No API keys. Runs entirely on your machine.