Build Hour: Agentic Tool Calling
🚀 Introduction — What this Build Hour covered and why it matters
I'm Ilan Bigio from the Developer Experience team at OpenAI, and in this Build Hour I walked through a practical, hands-on approach to building agentic systems that not only reason but also act. Sarah Urbanis joined me to kick things off and to frame the session for startup builders, product teams, and engineers who want to scale what they do using OpenAI’s newest APIs and SDKs. The session was hosted by OpenAI and focused on what’s new in 2025: the Responses API, the Agents SDK, hosted tools, Codex, and several other capabilities that make long-horizon, goal-directed automation achievable.
My goal in this write-up is to walk you through the same concepts, architecture patterns, and demo-driven techniques I used during the Build Hour so you can reproduce, adapt, and improve on them. I’ll explain the underlying ideas behind agentic tool calling, break down the components of a task-oriented agent system, show the end-to-end architecture I built during the demo, and summarize practical guidance from our Q&A. I’ll also share recommendations for evaluation, failure handling, and product UX so you can ship robust agent-driven features safely.
📦 Overview — What changed in 2025 and why it matters for builders
2025 brought a number of developments that collectively shift the way we think about models and what they can handle. The headline items that matter for agentic systems are:
- Responses API — A unified API capable of running complex flows, integrating hosted tools, and handling longer-running operations.
- Agents SDK — An SDK that wraps common agent loops and makes it straightforward to express tools, local function calls, and higher level coordination patterns.
- Hosted Tools and MCP — The ability to register remote tools and microservices with the system so a single API call can orchestrate interactions across multiple systems.
- Codex — A capability to operate directly on repositories, run tests, and make changes with persistent runtime environments, ideal for developer workflows and long-running code tasks.
- Model advances — Models trained to reason, plan, and learn using reinforcement-based approaches that emphasize final outcomes over step-by-step demonstrations.
Taken together, these pieces let us build “agents” that are not only good at thinking but also good at doing — calling tools, invoking functions, interacting with other services, and handling long-horizon objectives with robustness.
🧭 Agentic Tool Calling — Definition and core idea
Agentic tool calling is a paradigm where a model combines its internal reasoning (chain-of-thought) with access to external tools and functions. The model figures out how to reach a goal and then executes the steps it needs — often by calling tools — to get to that goal. The key difference from older approaches is that we train and evaluate on outcomes, not on prescribed step-by-step routes. The model is rewarded for obtaining the correct result and learns to invent and refine plans and actions to reach it.
“We train the models on solutions, not the steps that make them up.”
That sentence captures the leap: models learn to discover the steps themselves. When combined with tools and function calling, reasoning becomes doing.
🧠 Why reasoning matters — How models learned to plan and why that matters
Last year’s work introduced models that can reason implicitly. Instead of telling a model how to solve a task step-by-step through in-context examples, we let the model attempt to solve the problem and then graded whether the outcome was correct. Through reinforcement learning techniques, the model began to refine its internal chain of thought and strategies that led to correct outcomes. That’s crucial because:
- Emergent planning: Models can invent multi-step strategies rather than being constrained to mimic templates.
- Robustness: When a tool fails or data is missing, the model can recover and try alternative strategies.
- Long-horizon consistency: Agents can maintain a coherent plan across many steps (potentially tens or hundreds of function calls).
In practice, this means the model can do things like fetch user data, call a billing service, check logs, and apply corrections — all as part of a coherent plan when given the right tools and the correct goal description.
🛠️ Tools + Function Calling — From reasoning to action
Tools give the model access to actions that it cannot natively perform. Consider tools as functions that the model can call. Each function has:
- A name and input schema
- An implementation (local or hosted) that performs work
- An output schema
The Agents SDK (and the Responses API) can convert these functions into schemas the model understands and then let the model choose which tool to call. Critically, the model can call tools as part of its chain of thought — not after the fact. That allows it to reason about the world, decide to gather additional evidence using tools, and then proceed.
For example, when dealing with a customer who complains about double charges, an agent can:
- Call get_user_data(user_id) to fetch account information.
- Call get_recent_orders(user_id) to inspect recent transactions.
- Call issue_refund(order_id) if the billing data shows a duplicate charge.
- Call notify_user(ticket_id, message) to update the customer.
Because the model reasons about the goal (“resolve the duplicate charge and notify the customer”), it can choose which combination of the above actions is necessary and in what order — while also handling failures and retries.
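A minimal sketch of those four tools as plain Python functions. The implementations below are mocks that return canned data so the flow can be exercised locally; real versions would call your billing, orders, and notification services:

```python
# Mock tools for the duplicate-charge scenario. Real implementations
# would hit internal services; these return canned data.

def get_user_data(user_id: str) -> dict:
    """Fetch account information for a user."""
    return {"user_id": user_id, "plan": "monthly", "email": "user@example.com"}

def get_recent_orders(user_id: str) -> list:
    """Return recent transactions; the last two form a duplicate pair."""
    return [
        {"order_id": "ord_1", "amount": 20.0, "period": "2025-05"},
        {"order_id": "ord_2", "amount": 20.0, "period": "2025-05"},
    ]

def issue_refund(order_id: str) -> dict:
    """Refund a single order and return a receipt."""
    return {"order_id": order_id, "refunded": True}

def notify_user(ticket_id: str, message: str) -> dict:
    """Record an outbound customer notification."""
    return {"ticket_id": ticket_id, "sent": True, "message": message}
```

Given typed functions like these, the agent runtime can expose each one as a callable tool and let the model decide the order in which to invoke them.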
🗂️ Tasks as a primitive — Treating work as long-running, goal-oriented units
Agents shift our mental model from short chat turns to long-horizon tasks. A task is a self-contained unit of work with:
- An identifier and state
- A description or goal
- Context and inputs (user messages, relevant facts)
- Tools it’s allowed to use
- A lifecycle and events stream
Thinking in terms of tasks lets you design systems that:
- Run reliably in the background
- Scale across many concurrent jobs
- Expose progress and traceability to users and admins
- Delegate and hand off work between specialized agents
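The properties above map naturally onto a small data structure. Here is one possible shape (the field names are my own, not taken from any specific SDK):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Task:
    """A long-running, goal-oriented unit of work."""
    task_id: str
    goal: str                         # desired end state, not a list of steps
    status: str = "pending"           # pending | running | done | failed
    context: list = field(default_factory=list)        # inputs, user messages
    allowed_tools: list = field(default_factory=list)  # tool names the agent may use
    todos: list = field(default_factory=list)          # model-authored plan items
    events: list = field(default_factory=list)         # lifecycle event stream

    def publish(self, event: dict) -> None:
        """Append an event; a real system would also push it to subscribers."""
        self.events.append(event)
```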
Below I’ll describe the specific components I used to build a demonstration task system: agent definition, infrastructure to run tasks, product UX to display results, and evaluation/tracing to measure correctness.
🎯 Designing agents — Goal specification, tools and delegation
When designing an agent you should consider four high-level areas: the agent itself, the infrastructure that runs it, the product that surfaces it to users, and how you evaluate it.
Agent design considerations:
- Goal specification: Instead of enumerating steps, define the desired end state. Tell the agent what success looks like and what constraints it must respect.
- Tool declaration: List the tools the agent can use and their semantics. That provides the agent with the capabilities it needs to act in your environment.
- Delegation: Allow the agent to kick off other tasks (or agents) asynchronously if appropriate. This enables non-blocking interactions and distributed responsibility.
- Human-in-the-loop: Determine whether certain decisions must be escalated to humans and create handoff patterns.
When you specify tools, you are giving the agent actions it can perform. The agent will decide whether and when to use them. For reliability, you can either:
- Wrap complex sequences inside a single function (so they run sequentially in predictable order), or
- Expose granular tools and rely on the model to orchestrate them — which gives more flexibility but requires more thoughtful failure handling.
As a rule of thumb: if you need deterministic sequencing, implement it in code (a single function or service). If you want flexibility and emergent behavior, expose tools and rely on the agent’s reasoning.
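To make the trade-off concrete, here is the deterministic option sketched out: the whole refund sequence is wrapped in one composite function that the model calls once, with ordering and branching enforced in code (the helper implementations are illustrative mocks):

```python
def resolve_duplicate_charge(user_id: str, ticket_id: str) -> dict:
    """Composite tool: runs the refund sequence in a fixed order.
    The model calls this once instead of orchestrating four tools itself."""
    orders = get_recent_orders(user_id)                       # 1. gather evidence
    duplicates = [o for o in orders[1:] if o["period"] == orders[0]["period"]]
    if not duplicates:
        return {"resolved": False, "reason": "no duplicate found"}
    receipt = issue_refund(duplicates[0]["order_id"])         # 2. act
    notify_user(ticket_id, f"Refunded order {receipt['order_id']}.")  # 3. inform
    return {"resolved": True, "refunded_order": receipt["order_id"]}

# Mock dependencies so the sketch is self-contained:
def get_recent_orders(user_id):
    return [{"order_id": "a", "period": "2025-05"},
            {"order_id": "b", "period": "2025-05"}]

def issue_refund(order_id):
    return {"order_id": order_id, "refunded": True}

def notify_user(ticket_id, message):
    return {"sent": True}
```

The granular alternative is simply to register the three helpers directly as tools and let the model decide the sequence — more flexible, but every failure mode now needs handling in the prompt and the tool contracts.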
🏗️ Infrastructure — Running agents at scale and handling state
Agents require infrastructure patterns that differ from a simple synchronous request/response. Key elements include:
- Task queue / worker pool: A job queue that accepts tasks and workers that execute them asynchronously.
- Event streaming: A mechanism to stream task events (progress updates, outputs) to interested clients — typically implemented with Server-Sent Events (SSE) or WebSockets.
- State persistence: Storage for task objects (ID, status, items list, to-dos) so tasks can be resumed, audited, and retried.
- Runtime isolation: Containers or sandboxes for tasks that need side-effects (like repository modifications or code execution).
- Monitoring & tracing: Logs and traces for debugging agent decisions and tool calls.
In the demo I built during the session, I used an asyncio-based worker pattern and an SSE events endpoint to stream events from the worker back to the front end. The same pattern works in production with more robust queues (e.g., Redis + RQ, Celery, AWS SQS + Lambda, Kubernetes jobs).
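A stripped-down version of that asyncio worker pattern, using in-process queues (the production alternatives listed above replace these with a real broker):

```python
import asyncio

async def worker(task_queue: asyncio.Queue, event_queue: asyncio.Queue) -> None:
    """Pull tasks off the queue, 'run' them, and publish progress events."""
    while True:
        task = await task_queue.get()
        await event_queue.put({"task_id": task["id"], "status": "running"})
        await asyncio.sleep(0)          # stand-in for the actual agent run
        await event_queue.put({"task_id": task["id"], "status": "done"})
        task_queue.task_done()

async def run_demo(n_tasks: int = 3, n_workers: int = 2) -> list:
    task_queue, event_queue = asyncio.Queue(), asyncio.Queue()
    workers = [asyncio.create_task(worker(task_queue, event_queue))
               for _ in range(n_workers)]
    for i in range(n_tasks):
        await task_queue.put({"id": f"task_{i}"})
    await task_queue.join()             # wait until every task is processed
    for w in workers:
        w.cancel()
    events = []
    while not event_queue.empty():
        events.append(event_queue.get_nowait())
    return events
```

In the demo, the event queue fed the SSE endpoint; here it is just drained at the end so the pattern can be run standalone.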
🧩 Product — Presenting progress to users and surfacing agent activity
One of the most important parts of shipping an agent is the user experience. Long-running agents need signals that tell users what’s happening and why they should trust the output. I found the following UI patterns particularly effective:
- Progress as to-dos: Have the model produce an initial plan expressed as to-do items and then check them off as it completes them. This gives users an intuitive progress bar and a breakdown of what happened.
- Streaming reasoning: Stream chain-of-thought or intermediate steps in real-time (if appropriate for transparency), but consider filtering out raw function calls to keep the UI clean.
- Task list & detail view: Expose an overall task list and a detail view with a timeline of events, logs, and final outputs.
- Non-blocking delegation: Allow the conversational UI to remain responsive while background tasks are running — users can continue to interact with the assistant while jobs proceed.
These patterns reduce user anxiety, make agent behavior auditable, and improve trust.
✅ Evaluation & tracing — Grading end states and monitoring performance
With tasks, evaluation should focus on outcomes rather than every intermediate step. Key approaches are:
- LLM-based graders: Provide a rubric and use a model to grade final outputs. Collect graded examples and iterate on your rubric.
- Reinforcement fine-tuning: If you have a high-quality golden set, use reinforcement fine-tuning (RFT) to teach the grader behavior more robustly.
- Telemetry & tracing: Capture events, tool calls, and errors so you can evaluate across dimensions like latency, success rate, and recovery behavior.
- Human-in-the-loop review: For higher-stakes tasks, maintain human review gates and audit logs.
Tracing is particularly important for debugging agent behavior. If your agent calls a remote tool and the tool fails, trace the call, the tool’s error, and the model’s subsequent actions. That makes it easier to retrain or to patch the agent’s prompt and tool contract.
💻 Live demo recap — Building an agentic task system from scratch
During the Build Hour I implemented a minimal but complete task system. Below I’ll recount the same steps and explain design decisions so you can implement a similar system.
Demo goals
The demo aimed to process a backlog of tickets (e.g., customer support items) with an agent that can:
- Read ticket content
- Query user data
- Call internal services (orders, refunds, runbooks)
- Produce a final, customer-ready resolution
- Report progress in the UI via to-dos and event streams
High-level architecture
The demo used four primary components:
- Front end: A small UI that lists tasks and connects to the events endpoint to receive streaming updates.
- Task API / backend: Endpoints to create tasks and serve an SSE stream for events.
- Task runner / worker: An asyncio worker that executes the agent using the Agents SDK and publishes events.
- Tools & mocks: A set of mock service endpoints (get_user_data, get_order_details, issue_refund, write_document, etc.) used as tools the model can call.
Defining tools
Tools were defined as simple functions with typed inputs and outputs. In Python, the Agents SDK automatically converts such functions into a schema the model can call. I implemented functions like:
- get_user_data(username) — returns mock account and order history.
- get_order_details(order_id) — returns details, price, and whether a charge was duplicated.
- issue_refund(order_id) — mock refund function that returns a refund receipt.
- write_document(doc) — a tool to create internal runbook documentation.
- get_runbook_by_category(category) — fetches internal policies to consult.
Each tool was registered with the agent so that when the model decided it needed information or to take an action, it would call the appropriate function.
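The “function to schema” conversion the SDK performs can be approximated with the standard library, which is useful for understanding what the model actually sees. This is a simplified sketch; the real SDK also parses docstrings for parameter descriptions and handles far more types:

```python
import inspect

# Minimal mapping from Python annotations to JSON-schema type names.
TYPE_MAP = {str: "string", int: "integer", float: "number", bool: "boolean"}

def function_to_schema(fn) -> dict:
    """Derive a JSON-schema-style tool description from a function signature."""
    params = {}
    for name, p in inspect.signature(fn).parameters.items():
        params[name] = {"type": TYPE_MAP.get(p.annotation, "string")}
    return {
        "name": fn.__name__,
        "description": (fn.__doc__ or "").strip(),
        "parameters": {"type": "object",
                       "properties": params,
                       "required": list(params)},
    }

def issue_refund(order_id: str) -> dict:
    """Mock refund function that returns a refund receipt."""
    return {"order_id": order_id, "refunded": True}
```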
Agent prompt & policy
Rather than instructing the agent step-by-step, I wrote goal-oriented instructions with a couple of explicit execution directives. The most important instruction was:
“Get all the context you need upfront, then execute the task to completion without asking for more.”
That directive encourages the agent to first gather details (user context, relevant orders, policies) and then to run the full resolution as a single, uninterrupted job — ideal for background tasks.
I also included a smaller policy that told the agent to always produce a plan as to-dos and then check them off as it completed each step. Those to-dos would be stored in the task object and streamed back to the UI.
Front end → task creation
The front end sends a POST to the /tasks endpoint with the user's input and an optional previous_response_id (to preserve conversation history without including full context each time). The backend creates a task object and schedules an asyncio background worker. The /tasks endpoint returns the task_id almost immediately so the UI can show the new task in the task list.
Events endpoint and streaming
To provide a real-time view, the frontend opens a Server-Sent Events (SSE) connection to an /events endpoint. The /events endpoint reads from a queue of published events and streams them to the client. Any component in the backend (workers, task lifecycle hooks) can publish structured events into this queue. Events are encoded as SSE messages, decoded on the client, and used to update task state in the UI.
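SSE itself is a plain-text protocol: each message is a `data:` line (typically JSON) followed by a blank line. Here is a sketch of the encoding side, independent of any web framework:

```python
import asyncio
import json

def encode_sse(event: dict) -> str:
    """Encode one event as a Server-Sent Events message."""
    return f"data: {json.dumps(event)}\n\n"

async def event_stream(event_queue: asyncio.Queue):
    """Async generator a framework endpoint (FastAPI, aiohttp, ...) can
    return as the body of a text/event-stream response."""
    while True:
        event = await event_queue.get()
        if event is None:            # sentinel: close the stream
            return
        yield encode_sse(event)
```

On the client side, `EventSource` (or a fetch-based reader) decodes each `data:` payload back into JSON and applies it to the task state in the UI.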
Worker loop and agents runner
Each worker instantiates an Agents SDK runner. It passes:
- The agent definition (prompt + tools)
- Input items (user message and context)
- The previous_response_id if available
- A context object pointing at the persistent task object
The runner streams events (response events, tool calls, chain-of-thought segments). The worker intercepts these events and publishes them to the global events queue for the front end to receive.
Task object and to-dos
The task object contains an ID, a list of items representing the conversation/actions, a to-dos field that the model can populate, and a status. I used a pattern where the agent had two additional functions for task management:
- add_todos(task_id, todos) — the model supplies a list of to-do text items; this function appends them to the task and publishes an update event.
- set_todo_state(task_id, todo_id, state) — the model marks to-dos as complete and publishes changes.
Because these functions operate on external state (the task object), the model can plan and then update the front end without having the task details exposed directly in the model context. This pattern creates a clean separation between the model’s reasoning and your persisted system state.
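A sketch of those two task-management tools operating on a shared task store — in-memory here, whereas the demo persisted the task object and pushed each update event to the SSE queue:

```python
import itertools

TASKS: dict = {}      # task_id -> task object
EVENTS: list = []     # stand-in for the published events queue
_ids = itertools.count(1)

def add_todos(task_id: str, todos: list) -> list:
    """Append model-supplied to-do items to a task and publish an update."""
    task = TASKS.setdefault(task_id, {"todos": []})
    for text in todos:
        task["todos"].append({"id": next(_ids), "text": text, "state": "pending"})
    EVENTS.append({"type": "todos_updated", "task_id": task_id})
    return task["todos"]

def set_todo_state(task_id: str, todo_id: int, state: str) -> dict:
    """Mark a to-do complete (or failed) and publish the change."""
    todo = next(t for t in TASKS[task_id]["todos"] if t["id"] == todo_id)
    todo["state"] = state
    EVENTS.append({"type": "todo_state", "task_id": task_id, "todo_id": todo_id})
    return todo
```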
A running example: refunding a duplicate charge
Here’s a condensed narrative of what happened when we asked the agent to investigate a ticket stating “I was charged twice for my monthly subscription.”
- The agent called get_user_data to load account and recent orders.
- It discovered two charges for the same subscription period (duplicate).
- Following the plan in the to-dos, it called get_order_details and then issue_refund to create a refund for the duplicate charge.
- The agent wrote an internal runbook document summarizing the steps it took with write_document so the team could audit it later.
- The agent updated the to-dos as each step completed and published events so the front end displayed real-time progress.
- At the end it wrote a final message to the user and created an internal ticket log entry.
Because the agent had the directive to gather all required context, it mostly ran unattended. The front end could remain responsive and let the user continue other interactions while the background task executed.
🔁 Delegation and background tasks — Non-blocking, multi-agent workflows
Delegation is the idea that a top-level assistant (e.g., a chat interface model) can kick off background tasks that are executed by other agent processes. That allows the conversational interface to remain usable while long-running work continues. The pattern I used looked like this:
- The primary agent collects context and asks follow-ups so it has what it needs.
- When ready, it calls a function like start_task(payload) that returns immediately with a task_id.
- A background worker picks up that task_id, executes the agent runner (which may use a different model family tuned for longer jobs or tool-heavy flows), and streams events back to the task queue.
- The primary agent (or UI) can query the task or open an SSE connection to receive updates. When the task completes, the primary agent can summarise results to the user.
This approach decouples interactive chat from long-running work, enabling multi-agent compositions where different agents handle different responsibilities or have access to different tools.
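The start_task handoff itself can be sketched in a few lines: it registers the task, schedules the background run, and returns the task_id immediately so the conversation is never blocked (run_agent is a stand-in for the real agent runner):

```python
import asyncio
import uuid

TASKS: dict = {}

async def run_agent(task_id: str) -> None:
    """Background worker: stand-in for the Agents SDK runner."""
    TASKS[task_id]["status"] = "running"
    await asyncio.sleep(0)                      # the long-running work goes here
    TASKS[task_id]["status"] = "done"

def start_task(payload: dict) -> str:
    """Tool the primary agent calls: returns immediately with a task_id."""
    task_id = uuid.uuid4().hex
    TASKS[task_id] = {"payload": payload, "status": "queued"}
    asyncio.create_task(run_agent(task_id))     # non-blocking handoff
    return task_id
```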
💡 Practical guidance and Q&A highlights
During the session we answered a number of practical questions. I'll summarize the most actionable advice here:
How should I orchestrate sequential and conditional tool calling? 🧩
If you need strict sequencing and conditional logic, use code. Wrap the logic into a single function that the model calls — that way you get deterministic behavior, better performance (lower latency / fewer API calls), and easier debugging. If you want the model to plan its own sequence, expose granular tools and rely on the model’s reasoning, but be prepared to handle increased variability.
How should memory be managed for long-horizon tasks? 🧠
There are multiple approaches:
- Context persistence: Keep the conversational or chain-of-thought history as a sequence of items in your task object. When needed, stitch it into future calls or reference it via previous_response_id.
- External memory stores: Use a vector store / embeddings to persist facts for later retrieval. This is useful for recall tasks or personalized features.
- Explicit memory API: Allow the agent to save facts or preferences via tools and retrieve them via a "get_memory" tool when needed.
In demo systems, you can combine strategies: store immediate chain-of-thought in task items and make longer-term facts available through a memory database.
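The explicit-memory approach is the simplest to sketch: two tools backed by a keyed store. A dict with substring matching stands in here for the vector store and embedding similarity search you would use in production:

```python
MEMORY: dict = {}   # user_id -> list of remembered facts

def save_memory(user_id: str, fact: str) -> dict:
    """Tool the agent calls to persist a fact or preference."""
    MEMORY.setdefault(user_id, []).append(fact)
    return {"saved": True, "count": len(MEMORY[user_id])}

def get_memory(user_id: str, query: str = "") -> list:
    """Tool the agent calls to recall facts; substring match stands in
    for the similarity search a vector store would provide."""
    facts = MEMORY.get(user_id, [])
    return [f for f in facts if query.lower() in f.lower()] if query else facts
```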
How many tools is too many? 🔢
There isn’t a strict limit, but practically I try to keep toolsets under ~20 where possible. If you find yourself with dozens of tools, ask whether you can combine or specialize them into sub-agents and use handoffs so each agent has a focused toolset.
Can I mix OpenAI-hosted functions with my custom functions? ⚙️
Absolutely. A powerful pattern is to use hosted services (OpenAI hosted functions / code interpreter / MCP servers) for computation and combine them with your internal APIs (databases, billing, CRM). For example, fetch transaction data from your internal API and then use a hosted code execution tool to do analysis on it.
Does the Responses API support MCP and background mode? 🔌
Yes. As of the recent updates, the Responses API supports registering remote MCP servers and hosted tools. If your workflow uses only remote tools and no local function calls, you can set background=true and fire off a single Responses API call that runs for a long time, performs many remote calls, and then returns or posts results when complete.
📐 Patterns & best practices — How I recommend building agentic systems
Here’s a distilled checklist and set of patterns that I recommend you follow when launching agentic features:
Design and scope
- Start small: Pick a single repeatable workflow (refunds, account updates, runbook creation) and instrument everything around it.
- Define success clearly: Create a precise end-state definition you can programmatically check or grade with an LLM-based rubric.
- Choose models by responsibility: Use different models for interactive chat (low-latency, cheaper) and for background agent runners (capable of long contexts and tool-heavy flows).
Tools and APIs
- Prefer composite functions for deterministic logic: If you require exact ordering or conditional branching that must be enforced, implement it server-side.
- Expose only necessary tools: Restrict tools to the minimum set needed for the task and enforce permissions and auditing for higher-risk operations (billing, deletion).
- Use schemas: Define clear input/output schemas for tools so the model can call them reliably and failures are understandable.
Runtime and operations
- Use a robust task queue: Redis, SQS, or Kubernetes Jobs paired with a worker pool will scale better than ad-hoc asyncio in production.
- Persist task state: Store task objects in a database so you can resume, audit, and replay runs.
- Event streaming: Use SSE or WebSockets for live UI updates and for cross-system notifications.
- Trace everything: Capture tool calls, inputs/outputs, and model reasoning as events so you can diagnose issues and improve agents.
Safety and guardrails
- Role-based permissions: Secure tools with authentication and enforce least privilege for agents.
- Human review: For irreversible or sensitive operations, include an approval step or human-in-the-loop process.
- Rate limiting & quotas: Protect systems from runaway agent loops and costly mistakes.
📚 Resources and next steps
If you want to follow along with the demo code or replicate the environment I used, these resources are where I recommend you start:
- Build Hours GitHub repo: A curated set of demos and code samples that mirror the live sessions.
- OpenAI Developers site: Official docs, SDK references, and examples for the Responses API and Agents SDK.
- Practical guide for building agents: A step-by-step guide I co-authored to help teams design, implement, and evaluate agentic systems.
- Upcoming Build Hours: Regular live sessions that dive deeper into ImageGen, advanced agents patterns, and integrations.
🏁 Final thoughts — Why I’m excited and what I’d build next
Agentic tool calling represents a major change in how we build intelligent systems. Instead of wiring together brittle if/then rules, we can provide goal-oriented instructions, a curated set of tools, and a runtime that supports long-horizon execution with robust tracing and evaluation. The result is agents that are:
- Goal-directed: They focus on end states, not just step-by-step templates.
- Resourceful: They fetch data, execute functions, and compose services to reach objectives.
- Recoverable: They handle tool failures and can retry or pivot strategies.
- Auditable: They produce event streams and to-dos that give users and teams transparency into what happened.
If I were building next, I’d start with a high-impact internal workflow that needs auditing and can benefit from automation — for example:
- Automated billing reconciliation and refunds with human approval gates.
- Developer workflows that run tests, patch code, and open PRs using a Codex-backed runtime.
- Customer success automation that triages inbound tickets, runs checks, and drafts responses for human review.
All of those use cases benefit from task primitives, a modular toolset, background workers, and a UI that surfaces progress and results.
💬 Closing — Thanks and where to engage next
Thanks for reading. I enjoyed building the demo live and answering practical questions about orchestration, memory, tool design, and evaluation. If you want to explore the code, check the Build Hours GitHub and the OpenAI developer docs. I’m excited to see what you build with these patterns — whether it’s a simple internal automation or a full multi-agent product feature. Happy building!