How to build AI agents with memory — a complete playbook

I watched a live walkthrough from Google for Developers featuring Sita Lakshmi Sangameswaran and Kimberly Milam, and I’m reporting what I learned so you can build agents that remember — not just chat. In this article I break down the full playbook they presented: why memory matters, the memory services inside the Agent Development Kit (ADK), how Vertex AI Memory Bank works, how to generate and retrieve memories, how to customize what your agent keeps, and how to set time-to-live (TTL) policies so your memory stays useful and affordable.
This is written in a straight-from-the-field, informal news-report style: I’ll tell you the facts, explain the trade-offs, and share practical guidance so you can apply memory to your own AI agents. If you’re building conversational agents, customer assistants, or any multi-session experience, read on — memory will change how users experience your product and how much you spend to operate it.
Table of Contents
- 📰 Opening: why forgetful agents are expensive news
- 🧠 Agentic memory concepts — how I think about memory
- ⚙️ ADK memory services explained — the interfaces I use
- 💻 Building the agent — my step-by-step code workflow
- 📂 In-memory vs Memory Bank — what I found when I looked under the hood
- ⏱ Memory generation in Memory Bank — asynchronous by design
- 🔎 Scopes and search — how I retrieve the right memories
- 🔁 Consolidation in action — how duplicates are avoided
- 🔧 How I automate memory generation — using callbacks and direct sources
- 🧩 Using memories in the agent — preload, load, and custom callbacks
- 🛠 Custom memory extraction — managed topics and custom topics
- ⏳ Setting TTL — how long memories should live
- 💡 Practical tips — what I recommend you do first
- 🔐 Privacy and cost considerations I weigh before deployment
- 📈 When memory reduces costs — the ROI story I observed
- 🗂 Operational pitfalls I warn teams about
- 🧭 Advanced patterns I use in production agents
- 📚 Learning resources and next steps I recommend
- 🔍 FAQ — the questions I keep getting
- 📣 Final verdict — my summary report
- 📌 Resources and next actions I recommend
- 📝 Closing note — an invitation from my reporting desk
- ❓ More FAQs — quick-fire answers I keep handy
- 📬 Final call-to-action
📰 Opening: why forgetful agents are expensive news
Here’s the headline: the most expensive bug in many AI agents is not a misfired prompt or a slow model — it’s the agent’s lack of memory. When an agent starts every conversation from scratch, it wastes tokens, frustrates users, and repeats the same “learning” over and over again. The good news is that adding the right memory architecture makes agents smarter over time and reduces cost.
I’ll summarize the core problem in plain terms: imagine a user tells your agent their dietary preferences, past purchases, or family details. If your agent forgets all of that at the end of the session, it has to ask again (annoying) or infer the same thing repeatedly (expensive). The fix is persistent memory — a curated store of facts and preferences that the agent can access across sessions.
This article covers volatile vs persistent memory, how ADK orchestrates memory with memory services, the behavior differences between simple in-memory storage and Vertex AI Memory Bank, memory generation and consolidation mechanics, memory retrieval strategies, callbacks and tools that automate memory saves and preloads, and how to tune what gets persisted using custom topics and TTL rules.
🧠 Agentic memory concepts — how I think about memory
When I explain memory to product teams I use a simple mental model: every agent has short-term (volatile) memory and long-term (persistent) memory.
- Short-term / volatile memory: fast, local, and ephemeral. ADK offers an in-memory session service that keeps the conversation history only for the current runtime. It’s great for development and prototypes because it requires no infra and gives you quick stateful behavior. But it disappears when the process restarts or when the user ends the session.
- Long-term / persistent memory: durable storage for facts, preferences, and other signals you want your agent to recall across sessions. In production you should use a managed memory service (like Vertex AI Memory Bank) or a custom persistence layer. Persistent memory should be curated and compact to reduce cost and retrieval noise.
Key concept: memory is not just storage; it is a pipeline of generation, consolidation, and retrieval. The stage where you extract and compress facts is as important as where you store them.
⚙️ ADK memory services explained — the interfaces I use
ADK exposes a common base memory service interface that all memory integrations implement. The two core responsibilities are:
- add_session_to_memory(session) — takes a session (turn-by-turn conversation), extracts meaningful facts, and persists them to storage.
- search_memory — given a scope (app / user key) and a query, finds relevant memories for context in the next turn.
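The shape of that interface can be sketched in plain Python. This is a simplified illustration modeled on ADK's base memory service, not the exact SDK signatures — the toy `DictMemoryService` mimics the behavior of the in-memory implementation described below:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Session:
    """A turn-by-turn conversation for one app/user pair."""
    app_name: str
    user_id: str
    events: list = field(default_factory=list)  # e.g. [{"role": "user", "text": "..."}]


class BaseMemoryService(ABC):
    """Simplified stand-in for ADK's base memory service interface."""

    @abstractmethod
    def add_session_to_memory(self, session: Session) -> None:
        """Extract meaningful facts from a session and persist them."""

    @abstractmethod
    def search_memory(self, app_name: str, user_id: str, query: str) -> list[str]:
        """Return memories relevant to `query` within the (app, user) scope."""


class DictMemoryService(BaseMemoryService):
    """Toy implementation: raw events keyed by (app, user), word-overlap search."""

    def __init__(self):
        self._store: dict[tuple[str, str], list[str]] = {}

    def add_session_to_memory(self, session: Session) -> None:
        key = (session.app_name, session.user_id)
        self._store.setdefault(key, []).extend(e["text"] for e in session.events)

    def search_memory(self, app_name: str, user_id: str, query: str) -> list[str]:
        words = set(query.lower().split())
        entries = self._store.get((app_name, user_id), [])
        return [e for e in entries if words & set(e.lower().split())]
```

Any provider that implements these two responsibilities can plug into the same orchestration code — which is exactly the point of the common interface.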
Why this matters: ADK gives a consistent integration point so you can switch underlying memory providers without changing agent orchestration logic. But the implementations differ widely in behavior.
The two memory services I’ll compare:
- In-memory memory service — stores raw session events as a dictionary on the VM. No LLM-based extraction or consolidation. Search works by simple keyword matching over the verbose turn data. It’s fast and simple but uncurated and expensive at scale.
- Vertex AI Memory Bank service — a managed layer that calls the Agent Engine Memory Bank. Memory Bank runs extraction and consolidation pipelines, stores compact facts, supports similarity search and scopes, and processes generation asynchronously (non-blocking by default).
💻 Building the agent — my step-by-step code workflow
I’ll lay out the typical workflow I followed with ADK (you don’t need the exact code here to understand the pattern):
- Set up your GCP project and location — ADK needs to know where to send model and memory requests.
- Define your ADK agent — in my example it’s a conversational agent that answers user questions; tools are optional and can be added to do things like fetching external data or running actions.
- Define a session service — for prototyping I used in-memory sessions; for production use a persistent session DB (Agent Engine sessions or your own DB).
- Define a runner — the runner orchestrates calls to the agent and session service.
- Optionally define a memory service — you can start without one and call the memory service manually to see differences. ADK also supports wiring a memory service so the runner can orchestrate memory work.
- Call add_session_to_memory or generate_memories — this is where the memory service extracts and stores facts.
- Retrieve memories via preload or search APIs — insert retrieval results into the system message or the turn context before the model runs.
Notice the separation: sessions capture raw dialog, memory services curate and persist useful facts, and retrieval injects curated context back into the agent’s prompt.
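That separation can be shown end to end in a few lines. Everything here is an illustrative stand-in — `extract_facts` replaces the LLM-based extraction a real memory service performs, and the prompt format is invented:

```python
# Minimal end-to-end sketch of the session -> memory -> prompt pipeline.
# All names are illustrative stand-ins, not the ADK API.

def extract_facts(events):
    """Stand-in for LLM-based extraction: keep only turns flagged as facts."""
    return [e["text"] for e in events if e.get("is_fact")]

def build_prompt(system, memories, user_turn):
    """Inject retrieved memories into the system message before the model runs."""
    memory_block = "\n".join(f"- {m}" for m in memories)
    return f"{system}\nKnown about this user:\n{memory_block}\n\nUser: {user_turn}"

# 1) The session captures raw dialog.
session = [
    {"text": "What types of questions do you answer?"},
    {"text": "I have a three-year-old niece", "is_fact": True},
]
# 2) The memory service curates and persists useful facts.
memory_store = extract_facts(session)
# 3) Retrieval injects curated context back into the prompt.
prompt = build_prompt("You are a gift assistant.", memory_store, "Any present ideas?")
print(prompt)
```

Note that the chit-chat turn never reaches the prompt: curation happens at the memory boundary, not inside the agent.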
📂 In-memory vs Memory Bank — what I found when I looked under the hood
I did a direct comparison between two simple sessions to illustrate how differently the services behave:
- Chit-chat session — user asks “What types of questions do you answer?” and gets a verbose model reply. This exchange is unlikely to contain long-term facts worth storing.
- Recommendation session — user mentions “I have a three-year-old niece” and “I like the idea of a bike for a present.” The model responds with suggestions like a balance bike for a three-year-old. Those are actual facts you might want persisted.
Here’s what I observed:
- In-memory memory service stored the full turn-by-turn conversational content (both user and agent text) in a dictionary keyed by app name and user ID. Search worked by matching overlapping words. The stored data was verbose and not condensed, leading to larger token footprints and duplicate content if the same conversation was saved multiple times. Retrieval returned big chunks of dialog, which is expensive and often not actionable.
- Vertex AI Memory Bank accepted the same raw session but ran LLM-based extraction and consolidation in the background. It produced compact facts like “User has a three-year-old niece” and “User is considering a bike for a birthday present.” It filtered out the chit-chat because it didn’t match the definition of meaningful content. Consolidation prevented duplicate memories by updating an existing memory instead of appending redundant entries.
In short: in-memory = raw, verbose, duplicate-prone; Memory Bank = curated, compact, consolidated.
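The difference in retrieval footprint is easy to quantify. Here the "curated facts" are hand-written to mirror what I saw Memory Bank produce, and word-overlap search stands in for both naive keyword matching and similarity search:

```python
# Contrast raw-session storage (in-memory service) with curated facts
# (Memory Bank style). The extraction here is done by hand for illustration.

raw_session = (
    "User: I have a three-year-old niece and I like the idea of a bike "
    "for a present. Agent: Great choice! For a three-year-old, a balance "
    "bike is usually recommended because it builds coordination..."
)
curated_facts = [
    "User has a three-year-old niece",
    "User is considering a bike as a present",
]

# Naive word-overlap search over raw text returns the whole verbose chunk.
def overlap_search(corpus, query):
    q = set(query.lower().split())
    return [c for c in corpus if q & set(c.lower().split())]

verbose_hits = overlap_search([raw_session], "bike present")
compact_hits = overlap_search(curated_facts, "bike present")

# The curated hit is a fraction of the raw chunk's size (word count as a rough
# proxy for tokens).
print(len(verbose_hits[0].split()), len(compact_hits[0].split()))
```

Every token in that verbose hit is billed on every future prompt that includes it, which is where the cost difference compounds.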
⏱ Memory generation in Memory Bank — asynchronous by design
Important operational note: Memory Bank runs memory generation asynchronously. That means when you call generate_memories from ADK, the API call sends the raw session to Memory Bank and returns immediately. The heavy lifting — multiple chained LLM calls to extract facts and consolidate them — happens in the background. This is deliberate because memory generation can be latency-intensive and is usually not needed for the current model turn.
Practical consequences:
- Your client does not block on memory extraction by default.
- Memories might not be available immediately after the call returns — wait a few seconds or design your UX assuming the memory will be available on the next interaction.
- If you need blocking behavior (for example, you want to show a confirmation to the user that a memory was saved), you can call the underlying Agent Engine SDK and explicitly wait for completion.
When Memory Bank extracts memories, it applies a couple of key steps under the hood:
- Memory extraction — the system decides which parts of the session contain meaningful, persistent facts (for example: user preferences, explicit instructions to remember, identity facts). Managed topics (predefined categories) guide this extraction.
- Consolidation — new memories are compared to existing ones. Duplicate or overlapping content is merged; contradictory info can cause updates. The result is a curated corpus that evolves rather than just grows.
🔎 Scopes and search — how I retrieve the right memories
Memory Bank stores memories within a scope key that determines isolation. ADK builds scope from the app name and the user ID by default. If you interact with Memory Bank directly, you must use the same scope you configured in ADK — otherwise you won’t find the memories.
Search options:
- Similarity search — Memory Bank supports vector-based similarity search. Provide a query (e.g., the current user turn) and get back memories sorted by distance.
- Retrieve all — Omitting similarity parameters returns all memories for the scope. Use with care — results may be large.
When Memory Bank returns data, ADK wraps it into a SearchMemoryResponse object, but the raw Memory Bank response can contain more diagnostic fields. If you want full transparency into memory entries and metadata, calling Memory Bank (Agent Engine SDK) directly may be better than going through ADK.
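The scope-mismatch failure mode is worth internalizing, because it fails silently. A dict-backed sketch (the key structure is illustrative; the point is that write and read must agree on the convention):

```python
# Scope keys determine memory isolation. ADK builds scope from app name +
# user ID by default; a direct Memory Bank call must use the same convention
# or it will find nothing.

def adk_scope(app_name, user_id):
    return {"app_name": app_name, "user_id": user_id}

memory_store = {}

def write_memory(scope, fact):
    key = tuple(sorted(scope.items()))
    memory_store.setdefault(key, []).append(fact)

def read_memories(scope):
    key = tuple(sorted(scope.items()))
    return memory_store.get(key, [])

write_memory(adk_scope("gift-agent", "user-42"), "User has a three-year-old niece")

# Same convention: found. Different convention (user ID only): silently empty.
print(read_memories(adk_scope("gift-agent", "user-42")))
print(read_memories({"user_id": "user-42"}))
```

No error is raised on the mismatched read — you just get an empty result, which is exactly why inconsistent scope conventions are so hard to debug.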
🔁 Consolidation in action — how duplicates are avoided
Here’s a real-life pattern I tested: I uploaded the same fact twice (e.g., “I like the idea of getting a bike for my niece”). Instead of creating two separate memory entries, Memory Bank consolidated them and updated the existing memory. That’s the consolidation mechanism preserving a single, updated fact rather than a noisy timeline of repeats.
Consolidation matters for cost and clarity. When the agent later searches memories, it sees one clear, compact fact instead of multiple noisy duplicates. That's why Memory Bank outperforms a naïve database of dialogues: it does LLM-powered post-processing at write time so retrieval is cheaper and higher quality.
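The write-time consolidation logic can be modeled in a few lines. Memory Bank does this with LLM calls; word overlap below is a crude stand-in, but the control flow — update in place when a new fact overlaps an existing memory, append only when it's genuinely new — is the same idea:

```python
# Toy consolidation: before appending a new fact, check for overlap with an
# existing memory and update it in place instead of adding a duplicate.

def consolidate(memories, new_fact, threshold=0.5):
    new_words = set(new_fact.lower().split())
    for i, existing in enumerate(memories):
        old_words = set(existing.lower().split())
        overlap = len(new_words & old_words) / max(len(new_words | old_words), 1)
        if overlap >= threshold:
            memories[i] = new_fact  # merge: update the existing memory
            return memories
    memories.append(new_fact)       # genuinely new fact
    return memories

mems = []
consolidate(mems, "User likes the idea of a bike for their niece")
consolidate(mems, "User likes the idea of a bike for their niece")  # duplicate
print(mems)  # one entry, not two
```

The pattern I observed in the demo matches this: the second upload updated the single existing entry rather than creating a noisy timeline of repeats.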
🔧 How I automate memory generation — using callbacks and direct sources
Saving a session to memory can be done manually via add_session_to_memory or generate_memories calls. But for production agents you want automation. I use callbacks in ADK to automatically trigger memory generation at the end of each agent interaction. The callback has access to the full invocation context (turn-by-turn conversation) so it can call add_session_to_memory for that user without me wiring the calls by hand in every handler.
Two patterns to consider:
- Full-session generation — send the entire session history. Convenient but can be duplicative because previous turns are sent multiple times across turns.
- Direct memory source / last-turn only — extract a pre-computed fact from the callback context and pass it as a direct memory source. This prevents duplicate uploads because you're only sending new information (the last turn's content). The trade-off is you might lose context from earlier turns that would matter for extraction.
In practice I combine both patterns: for most turns I send only the latest user utterance; for certain events (explicit “remember this” user instructions or forms completed), I generate a direct memory source and mark it for priority processing.
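The two upload patterns differ only in what the callback forwards. This sketch uses invented names (ADK's actual hook is an after-agent callback receiving an invocation context), but it shows why last-turn-only avoids re-uploading earlier turns:

```python
# Sketch of the callback pattern: a hook that runs after every agent turn and
# forwards either the full session or just the last turn to the memory service.

saved_payloads = []

def memory_callback(session_events, last_turn_only=True):
    if last_turn_only:
        payload = [session_events[-1]]   # only new information is uploaded
    else:
        payload = list(session_events)   # full history: simpler, duplicative
    saved_payloads.append(payload)

events = ["I have a niece", "She just turned three", "Maybe a bike as a gift"]
for turn in range(1, len(events) + 1):
    memory_callback(events[:turn])

print(saved_payloads)
```

With `last_turn_only=False`, the first event would have been uploaded three times across the three turns — the duplication the direct-memory-source pattern exists to prevent.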
🧩 Using memories in the agent — preload, load, and custom callbacks
There are several strategies for adding retrieved memories into the model prompt:
- Preload memory tool — this tool runs before the agent executes and injects memories into the system instructions. It’s not invoked by the model; it’s the runner that adds memories to the prompt every time. Use this when you want the model to always have some persistent facts available right away.
- Load memory tool (agent-invoked) — the model can decide to call the tool if it needs memory. This is more selective: the prompt instructs the LLM that a memory tool exists, and the model chooses whether to call it during reasoning. Use this when memory is sometimes relevant but not always.
- Custom callbacks — build your own callback to call Memory Bank, shape the response format, and insert memory into any part of the prompt (system message, assistant message, or user message). This gives full control over scope keys, formatting, filtering, and placement.
I experimented with preloading and model-invoked loads:
- Preload example: a new session starts; the system prompt contains facts like “the user’s niece recently turned three” and “the user planned to get her a doll for Christmas.” When I ask the agent “What did I get my niece for Christmas?” it answers correctly because the info was injected into the system message before the model ran.
- Load example: the model’s reasoning flow decides when to call the memory tool. If it determines memory is not relevant to the current question, it won’t call the tool, reducing token use. If it needs the memory, it will call and the runner will fetch matching memories and return them to the LLM.
For advanced use cases I build custom callbacks to set the scope differently (e.g., just user ID, not app name + user ID) and format results as structured bullet points or key-value items. This is useful if your prompt expects facts in a domain-specific schema (e.g., “User: has_allergies=true; allergies=[peanuts]”).
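The formatting half of a custom callback is simple but high-leverage. The schema below is invented for illustration — the point is that you control the shape before it ever reaches the prompt:

```python
# Shaping retrieved memories into a domain-specific schema before placing
# them in the system message (the kind of control a custom callback gives you).

def format_memories(memories):
    lines = [f"{k}={v}" for k, v in sorted(memories.items())]
    return "User profile: " + "; ".join(lines)

profile = {"has_allergies": True, "allergies": ["peanuts"]}
system_suffix = format_memories(profile)
print(system_suffix)
```

A fixed, predictable schema like this is much easier for the model to consume reliably than free-form memory text.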
🛠 Custom memory extraction — managed topics and custom topics
Memory Bank ships with managed topics that define categories of extractable things: personal info, user preferences, key conversation events, task outcomes, and explicit remember/forget instructions. Those are the default signals Memory Bank uses to decide what to save.
If the managed topics cover your needs, great. But if you need domain-specific memory — for instance, extracting structured user feedback about a café experience — Memory Bank lets you define custom topics:
- You provide a label, description, and few-shot examples of both positive extraction cases and negative (no-op) cases.
- Memory Bank will use your examples to decide what to persist and how to condense it.
- You can update the agent engine configuration dynamically to iterate on your extraction rules.
I tested a “customer feedback” custom topic: when a user says “You should offer more milk options,” Memory Bank extracted a memory condensing that to “The coffee shop should offer more milk options.” When another user said “You should have almond milk,” Memory Bank consolidated the new info into the existing memory, updating it to “The coffee shop should have more milk options such as almond milk.”
Custom topics are powerful because they let you use memories for many use cases beyond personal details — feature requests, bug reports, satisfaction signals, and any structured business feedback.
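Conceptually, a custom topic is a label, a description, and few-shot examples of what to persist and what to ignore. The field names below are illustrative, not the exact Memory Bank configuration schema — consult the Agent Engine docs for the real shape:

```python
# Hypothetical shape of a custom-topic definition, modeled on the
# "customer feedback" topic described above.

customer_feedback_topic = {
    "label": "customer_feedback",
    "description": "Actionable feedback about the coffee shop's offerings.",
    "examples": [
        {   # positive case: persist, condensed
            "input": "You should offer more milk options",
            "memory": "The coffee shop should offer more milk options",
        },
        {   # negative (no-op) case: nothing worth persisting
            "input": "What time do you open?",
            "memory": None,
        },
    ],
}
```

The negative examples matter as much as the positive ones — they are how you teach the extraction model to leave chit-chat alone.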
⏳ Setting TTL — how long memories should live
Retention is operationally and ethically important. Not every memory should live forever.
Memory Bank supports TTL at multiple granularities:
- Default TTL on the memory bank — set a global retention period for generated memories (e.g., 30 days). Useful when you want a baseline lifecycle.
- Granular TTL per operation — define TTL behavior per operation type (e.g., create vs update) so you can control whether updates reset the clock. Example: set “create” TTL to one year but do not refresh TTL on “update” so consolidation will not extend retention.
- Per-memory TTL — you can set TTL on an individual memory resource if you want absolute control.
Use cases for TTL:
- Short-lived facts (e.g., “I’m traveling next week”) are relevant only for a window and should expire automatically.
- Long-lived facts (e.g., “I’m allergic to peanuts”) should persist and may even have different retention/archival rules.
- Privacy compliance: TTL plus deletion policies supports data minimization and regulatory needs.
When I set a 30-day TTL in the demo, memories created had expiration timestamps one month from the creation date. With granular TTL configuration I ensured that updates to the memory did not reset the TTL (so duplication or consolidation won’t accidentally keep the memory alive forever).
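The create-vs-update distinction is the subtle part, so here is a local model of it. This is not the Memory Bank API, just the arithmetic: a create stamps an expiration, and a non-refreshing update changes content without moving that stamp:

```python
# TTL mechanics: a memory created with a 30-day TTL gets an expiration stamp;
# with "do not refresh on update" configured, consolidation updates the
# content without extending retention.
from datetime import datetime, timedelta, timezone

def create_memory(fact, ttl_days):
    now = datetime.now(timezone.utc)
    return {"fact": fact, "expires_at": now + timedelta(days=ttl_days)}

def update_memory(memory, new_fact, refresh_ttl=False, ttl_days=30):
    memory["fact"] = new_fact
    if refresh_ttl:  # the granular TTL config decides whether the clock resets
        memory["expires_at"] = datetime.now(timezone.utc) + timedelta(days=ttl_days)
    return memory

m = create_memory("Considering a bike as a present", ttl_days=30)
original_expiry = m["expires_at"]
update_memory(m, "Bought a balance bike", refresh_ttl=False)
print(m["expires_at"] == original_expiry)  # update did not extend retention
```

Without the no-refresh rule, a frequently consolidated memory would effectively never expire — exactly the runaway-retention problem TTLs exist to prevent.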
💡 Practical tips — what I recommend you do first
From my tests and the walkthrough, here’s the short checklist I recommend when adding memory to an agent:
- Prototype with in-memory services to validate whether memory helps your UX (fast, cheap, no infra). Don’t use it for production because it’s volatile and uncurated.
- Design your memory schema — decide what categories you want: personal profile, preferences, recent tasks, product feedback, etc.
- Use a managed service like Vertex AI Memory Bank when you need high-quality extraction, consolidation, and lifecycle management.
- Pick your retrieval strategy — preload for always-available facts, model-invoked load for selective access, or custom callbacks for specialized behavior.
- Set TTLs from day one to avoid runaway data accumulation and to implement privacy-by-design.
- Use scopes consistently (ADK uses app name + user ID by default) so memories are found reliably.
- Monitor memory quality and iterate on custom topics or extraction examples to reduce false positives and improve consolidation logic.
🔐 Privacy and cost considerations I weigh before deployment
Memory changes the threat model and cost profile of agents. A few guidelines I follow:
- Minimize — store only what you need. Extraction and consolidation help here, but define strict extraction rules and custom topics if necessary.
- Encrypt and control access — apply principle of least privilege for memory reads/writes and protect PII in storage and in transit.
- TTL and deletion policies — enforce automatic expiry and explicit deletion endpoints to comply with data requests.
- Cost control — smaller memory payloads = lower retrieval tokens and cheaper vector searches. Consolidation reduces duplicates so storage and retrieval costs remain bounded.
📈 When memory reduces costs — the ROI story I observed
The economic benefit is simple to explain: a curated memory store reduces the number of tokens the model needs in future prompts to recall context. Instead of re-sending long past conversations, you provide concise facts. That reduces prompt size, lowers model compute usage, and speeds up response times.
Memory Bank also performs extraction at write time rather than forcing the model to process raw dialog at read time. This shifts compute to the background and reduces the repeated cost of extracting the same facts every time you search. Consolidation keeps the corpus compact so retrieval is cheaper and more accurate over time.
🗂 Operational pitfalls I warn teams about
In my testing I repeatedly saw a few traps teams fall into:
- Storing raw dialog (the in-memory/dedicated DB anti-pattern): This creates noisy, heavy datasets that are expensive and hard to search.
- Not setting TTLs: Memory grows without bound.
- Inconsistent scope keys: If your app uses different scope conventions in different components, you’ll “lose” memories when retrieval expects a different key.
- Expecting instantaneous memory availability: Because Memory Bank often generates memories asynchronously, assume some delay or explicitly wait when needed.
🧭 Advanced patterns I use in production agents
These are higher-order patterns that I found useful beyond basic usage:
- Event-driven memory creation: For structured events like order completion or profile updates, use direct memory source inputs and mark them with specific operation types so TTL and consolidation are predictable.
- Hybrid retrieval: Combine similarity search with explicit key-value lookups. Use structured memory fields for critical facts (e.g., allergy flag) and similarity search for soft preferences.
- Memory summaries: Periodically run jobs that condense collections of memories into higher-level summaries (e.g., “User prefers vegan bakery items”) that are cheaper to pass to models.
- Human-in-the-loop review: For high-risk memory topics (medical preferences, legal instructions), queue extracted memories for human verification before persistence.
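The hybrid retrieval pattern from the list above can be sketched quickly. Word overlap stands in for vector similarity search, and the structured fields are invented for illustration — the design point is that critical facts bypass fuzzy retrieval entirely:

```python
# Hybrid retrieval: critical facts come from an exact key-value lookup,
# soft preferences from a similarity-style search.

structured = {"allergy_flag": True, "allergies": ["peanuts"]}
soft_memories = [
    "User prefers vegan bakery items",
    "User usually visits on weekends",
]

def soft_search(query, memories):
    q = set(query.lower().split())
    return [m for m in memories if q & set(m.lower().split())]

def retrieve(query):
    # Structured facts are always attached; soft ones only when relevant.
    return {"critical": structured, "relevant": soft_search(query, soft_memories)}

result = retrieve("any vegan items today?")
print(result)
```

An allergy flag should never depend on whether a similarity threshold happened to fire — that is the whole argument for the hybrid split.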
📚 Learning resources and next steps I recommend
If you want to go deeper, start with these steps:
- Read ADK memory docs to understand the base memory service APIs and runner behavior.
- Experiment with Memory Bank using a small agent engine and Gemini model for extraction behavior.
- Iterate on custom topics with few-shot examples to improve domain-specific extraction.
- Set up TTL policies and test different retention scenarios.
These steps map to the hands-on path I took: prototype fast, validate the UX and extraction quality, then move to managed Memory Bank and implement robust retention and privacy rules before scaling.
🔍 FAQ — the questions I keep getting
Q: Why not just use a regular database (Postgres, Firestore) for memory?
A: You can store anything in a database, but it’s often the raw dialogue. Databases don’t automatically extract, dedupe, consolidate, and surface concise facts for prompts. Memory Bank performs extraction and consolidation at write time, delivering a curated set of facts that your agent can use immediately. That reduces token usage and avoids re-processing the same raw content on every query.
Q: If I tell my agent “remember X,” does Memory Bank automatically persist it?
A: You should distinguish agent instructions from memory extraction rules. The agent can collect a user instruction and call a memory API, but Memory Bank’s extraction logic decides what’s meaningful to persist. If you need the agent to remember something because the user said “remember this,” implement a custom topic or direct memory source that forces persistence, or orchestrate a save operation explicitly from the agent.
Q: Can Memory Bank work with non-Gemini models?
A: Memory Bank’s current integrations are optimized for Gemini models (you provide a Gemini model name when creating a memory bank). If you want to use other models, you may need to handle extraction with your own pipeline or check for updates to Memory Bank’s supported models.
Q: How soon are memories available after I call generate_memories?
A: Memory generation is asynchronous by default, so expect a short processing delay. If you need synchronous behavior, use the Agent Engine SDK directly with a blocking option. Design your UX and agent flow with the expectation that newly generated memories will appear in the next interaction, not the immediate one, unless you explicitly wait.
Q: How does consolidation handle conflicting facts?
A: Consolidation tries to merge duplicates and update facts, but conflict resolution may vary based on your custom topic rules and Memory Bank configuration. For critical facts, validate updates with a human-in-the-loop. You can also tune consolidation rules or mark certain memory types as immutable.
Q: Can I control what kind of information Memory Bank extracts?
A: Yes. Use managed topics for standard categories and custom topics for domain-specific extraction. Provide few-shot examples to guide the extraction model. You can update the agent engine’s memory bank configuration dynamically to refine extraction behavior.
Q: How do I set TTLs and granular retention?
A: Memory Bank supports default TTLs for the memory bank and more granular TTL policies per operation (create, update, etc.). You can also set TTL at the memory resource level. Use a shorter TTL for ephemeral facts and a longer TTL for permanent information. Granular TTLs let you control whether consolidation resets expiration times.
Q: What are the best retrieval strategies?
A: Use preload tools when certain facts should always be visible to the model. Use model-invoked tools when memory is conditionally needed. Use custom callbacks when you need special scope handling, formatting, or security restrictions. A hybrid approach often yields the best UX and cost trade-offs.
Q: Can Memory Bank deduplicate across multiple users or apps?
A: Memory Bank uses scope isolation to partition data. If you need cross-user aggregation (for analytics or product insights), build a separate analytics pipeline that reads memory entries (with proper privacy controls) and aggregates them outside of Memory Bank’s per-user scopes.
📣 Final verdict — my summary report
Memory is not a feature — it’s an architectural shift. I’ve seen firsthand how memory transforms agent quality: more personalized responses, less repetitive questioning, and lower long-term token costs. The Agent Development Kit gives you the integration points and runtime orchestration that make memory practical, while Vertex AI Memory Bank adds managed extraction, consolidation, similarity search, and lifecycle management that you won’t get from a naïve database.
What’s the trade-off? You exchange a bit of upfront configuration and modeling thinking for cleaner, cheaper, and more useful memories. And you get built-in features for privacy and retention via TTLs. For production agents that must scale and provide coherent cross-session experiences, Memory Bank or a similar managed memory service is a sensible investment.
If you want to get started quickly: prototype with in-memory services to validate user value, then migrate to Memory Bank as you add extraction rules, custom topics, and TTL policies. Use callbacks to automate saves and preload tools to simplify retrieval. Monitor and iterate on extraction quality — that’s where your real ROI is earned.
📌 Resources and next actions I recommend
To apply these ideas today, I suggest:
- Create a small ADK agent with an in-memory session and an in-memory memory service to validate the UX quickly.
- Add a callback to automatically call add_session_to_memory or generate_memories after each turn so you can observe the pipeline without manual steps.
- Spin up a Vertex AI Memory Bank and a test agent engine with a Gemini model to see extraction and consolidation in action.
- Iterate on a custom topic for any business-specific memory you want to extract (feedback, product preferences, etc.).
- Set TTL policies appropriate to each memory type and test them.
📝 Closing note — an invitation from my reporting desk
Building memory-powered agents is one of the quickest ways to increase user satisfaction and reduce repeated compute. I’ve walked you through the ADK’s memory model, in-memory vs Memory Bank behavior, extraction and consolidation, retrieval strategies, automation with callbacks, custom topics, and TTL patterns. Use these recommendations as a practical blueprint to get from prototype to production.
"When your agent starts conversation from scratch every single time, you're not just frustrating your users, but you're also paying more for your agent to learn the same facts again and again."
That quote captures the economics and UX of memory in one sentence — and it’s the reason I’ll always recommend thinking about memory early in agent design. If you build something interesting or run into edge cases, consider sharing your experience with the community — the more real-world feedback we have, the better memory systems will become.
❓ More FAQs — quick-fire answers I keep handy
Can I set different retention for different memory types?
Yes — use granular TTL rules in Memory Bank or set TTL per-memory on creation. Choose conservative retention for sensitive data.
Is Memory Bank GDPR/CCPA-friendly?
Memory Bank supports TTL and deletion operations, which you should use as part of your compliance program. Also design consent flows and data minimization practices around memory extraction.
Will memory slow my agent down?
Memory generation is asynchronous to avoid blocking the main flow. Retrieval adds minimal cost if you restrict the size of memory payloads and use similarity search thresholds. Preload increases prompt size, so use it judiciously.
How do I debug what Memory Bank extracted?
Use the Agent Engine SDK to query memory entries directly and inspect metadata and raw extraction outputs. ADK’s search response is a filtered view; the underlying SDK is more verbose for debugging.
What if a user asks me to forget something?
Implement explicit forget flows that call Memory Bank delete operations or mark memories as expired. You can also create a custom topic for user-driven forget instructions.
How should I design prompts to use memory effectively?
Always format memory content clearly in the system message or tool response. Use concise bullet points or JSON-like key-value pairs when the agent expects structured input. Avoid dumping raw dialogues into prompts.
📬 Final call-to-action
Memory is a differentiator for agents. Start small, measure user benefit and cost impact, and iterate. If you want to talk through a specific use case — I’d be glad to help map memory requirements to an ADK + Memory Bank architecture.
Happy building — and remember: the best agent is the one that remembers the right things at the right time.