Build Hour: Voice Agents — How Real-Time Speech-to-Speech Models Are Changing Conversational AI


I recently attended and reported on an OpenAI Build Hour session titled "Build Hour: Voice Agents," where Christine from the Startup Marketing team hosted a deep, hands-on exploration of voice agents with Brian Fioca and Prashant Mital from the Applied AI/solutions architecture teams. The event showcased the state of modern voice AI, walked through new platform features, and demonstrated a live build of a voice-powered workspace with agent handoffs, guardrails, and evaluation patterns. I’m writing this as a first-person news-style account to summarize the key announcements, explain the technical approaches, and share practical advice for teams that want to deploy voice agents in production.

📰 Executive Summary

Voice agents have moved beyond simple transcription. With recent model and SDK updates, they can natively understand audio, reason about context, stream audio back to users, and call tools or services in real time. At the Build Hour, the team presented a few core concepts and practical patterns:

  • Why voice agents matter now — accessibility, expressiveness, and personalization.
  • Two primary architectures: the chained pipeline (speech-to-text → LLM → text-to-speech) and end-to-end speech-to-speech models.
  • Real-time features and SDK updates, including a TypeScript Agents SDK, real-time traces in the platform, and a June 3rd snapshot with a speed parameter.
  • A live demo where Brian built a voice-powered workspace manager and a designer agent that could hand off tasks to an estimator agent.
  • Operational best practices: handoffs, guardrails, voice activity detection (VAD), evaluation (evals), and testing strategies.

Below I unpack all of these points in detail and add practical recommendations based on what I learned during the session.

🎙️ What I Mean by “Agent”

When I use the word “agent” in this piece, I mean a software application composed of three parts:

  1. An AI model (the core reasoning or language engine).
  2. A set of instructions or a prompt that shapes the model’s behavior.
  3. Connections to external tools or services that extend the agent’s capabilities.

In practice, an agent runs in an execution environment with its own lifecycle. That means the agent can decide when it has completed a task and stop executing—this is very different from one-off LLM calls and allows for richer flows such as routing, delegation, and multi-step tool usage.
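Stripped to its essentials, that three-part definition is a small loop: the model looks at the conversation so far and decides, each step, whether to call a tool or finish. The sketch below is purely illustrative — the `AgentConfig` shape and the mock model are my own, not the Agents SDK's actual API:

```typescript
// Illustrative sketch of the agent loop — not the real Agents SDK API.
type ToolCall = { tool: string; args: string };
type ModelStep = { toolCall?: ToolCall; done?: boolean; reply?: string };

interface AgentConfig {
  instructions: string;                             // 2. the prompt
  model: (history: string[]) => ModelStep;          // 1. the reasoning engine
  tools: Record<string, (args: string) => string>;  // 3. external connections
}

// The agent owns its lifecycle: it loops until the model says it is done.
function runAgent(agent: AgentConfig, userInput: string): string {
  const history = [agent.instructions, userInput];
  for (let step = 0; step < 10; step++) {           // safety cap on steps
    const next = agent.model(history);
    if (next.toolCall) {
      const result = agent.tools[next.toolCall.tool](next.toolCall.args);
      history.push(`tool:${next.toolCall.tool} -> ${result}`);
    } else if (next.done) {
      return next.reply ?? "";
    }
  }
  return "step limit reached";
}

// Mock model: call the clock tool once, then answer using its result.
const demoAgent: AgentConfig = {
  instructions: "You tell the time.",
  model: (history) =>
    history.some((h) => h.startsWith("tool:clock"))
      ? { done: true, reply: "It is " + history[history.length - 1] }
      : { toolCall: { tool: "clock", args: "" } },
  tools: { clock: () => "12:00" },
};
```

The point of the sketch is the termination decision: the model, not the caller, chooses when the task is complete — which is what distinguishes an agent from a one-off LLM call.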

🔎 Why Voice Agents Are an Inflection Point

I left the Build Hour convinced that voice agents are at a major tipping point for three big reasons, and I’ll explain them here in the same order the team highlighted:

1. Flexibility and Ambiguity Handling

Older voice systems were deterministic and brittle: they relied heavily on precise transcriptions and simple rule-based responses. The new generation of speech models is far more flexible: these models can interpret ambiguous utterances, handle broader intent sets, and adapt dynamically to user goals during a live, multi-turn conversation.

That flexibility is crucial in real-world interactions where users don’t speak like they write. They pause, change their mind mid-sentence, or use elliptical phrases. Modern speech-to-speech models handle those nuances better than prior pipelines because they operate directly on audio and can act on that information instantly.

2. Accessibility and Real-World Interaction

Voice is inherently accessible. I know I use voice features when I’m commuting, walking a dog, or doing chores where typing is inconvenient. The Build Hour reinforced an obvious point: users are discovering voice “wow” moments every day in products like ChatGPT’s advanced voice mode and others. That increases user expectations; soon people will expect voice in many of their favorite apps.

3. Personalization, Emotion, and Nonverbal Cues

Voice carries subtleties that text-based systems lose: tone, cadence, sarcasm, frustration, excitement, and more. Because speech-to-speech models operate directly on audio tokens instead of relying on lossy transcriptions, they preserve cues that help tailor responses and escalate or de-escalate appropriately. That’s why the team described voice agents as “APIs to the real world” — they’re a new channel for capturing human state and addressing last-mile integration problems with more nuance.

⚙️ Two Architectures for Voice Applications

Throughout the session, Brian and Prashant contrasted two common architectures. I’ll summarize each, note the trade-offs, and give examples of when to pick one over the other.

Architecture A: Chained Pipeline (Speech → Text → LLM → Speech)

This approach is straightforward: you transcribe user audio into text using a speech-to-text model, send the text to a text-only LLM (e.g., GPT-4 variants) to decide on a response, and then generate speech with a text-to-speech model for playback.

Benefits:

  • Modularity: you can mix and match different models for different parts of the pipeline.
  • Reusability: existing text-only agents and infrastructure can be repurposed by wrapping them with speech-to-text and text-to-speech layers.
  • Control: text-based systems make it easier to log, audit, and apply guardrails at the textual layer.

Drawbacks:

  • Loss of audio cues: tone and cadence are lost in transcription, which can reduce personalization.
  • Latency: multiple model calls and conversions can increase response time unless carefully optimized.
  • Potential semantic loss: transcription is inherently lossy and might miss hesitations or subtle content, causing a model to misinterpret intent.
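The chained architecture is, at its core, function composition. A minimal mock makes both the benefit (swap any stage independently) and the drawback (each stage only sees its input format) concrete — the `stt`, `llm`, and `tts` functions here are stand-ins for real model calls, not actual API bindings:

```typescript
// Chained architecture: speech → text → LLM → speech.
// Each stage is a stand-in for a real model call (a transcription model,
// a text LLM, and a text-to-speech model).
type Audio = { samples: string };                      // toy audio container

const stt = (audio: Audio): string => audio.samples;   // transcribe
const llm = (text: string): string =>                  // decide on a reply
  text.includes("remodel") ? "Let's start with a budget tab." : "How can I help?";
const tts = (text: string): Audio => ({ samples: text }); // synthesize

// The whole pipeline is just composition — which is why each stage can be
// swapped independently (modularity), but also why tone and cadence are
// gone by the time the LLM sees anything (semantic loss).
function chainedPipeline(userAudio: Audio): Audio {
  return tts(llm(stt(userAudio)));
}
```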

Architecture B: End-to-End Speech-to-Speech Models

This new generation of models ingests audio, reasons over it natively (no transcription step required), and emits audio tokens to play back responses. These models power advanced voice features like real-time voice mode in ChatGPT and OpenAI’s Realtime API.

Benefits:

  • Speed: by eliminating transcription and the extra text-to-speech step, these models are fast and support real-time streaming interactions.
  • Emotional intelligence: they preserve and react to vocal nuances—tone, cadence, and emotion—yielding more natural interactions.
  • Simplified developer experience: fewer components, especially for conversational flows that prioritize expressiveness.

Drawbacks and Mitigations:

  • Reasoning limits: some speech-to-speech models may not have the same depth of reasoning as the best text LLMs. The solution is delegation—have the speech model call out to slower, smarter text models for high-stakes or complex tasks.
  • Debugging challenges: since these models operate in audio token space, you need audio logging and traces to debug; the platform’s new traces feature helps with that.

🧰 New Real-Time Platform Features and SDK Updates

The Build Hour announced several platform and SDK updates that dramatically reduce friction for building voice agents. I’ll walk through each one and why it matters.

TypeScript Agents SDK with Real-Time Support

Until recently, the Agents SDK's full feature set was available only in Python. The team released a TypeScript version that matches the Python primitives (including handoffs) and adds first-class support for the Realtime API.

Why this matters:

  • Developer ergonomics: many real-time front-end applications are built in JavaScript/TypeScript. A native SDK drastically shortens the path between prototyping and shipping.
  • WebRTC integration: the SDK automatically chooses WebRTC in browser contexts and WebSockets on servers, simplifying real-time transport handling.
  • Single-line conversion: the SDK allows turning any agent into a real-time agent with a single constructor change—very powerful for converting text-first workflows to voice.

Realtime Traces in the Platform Dashboard

One significant operational improvement is that Realtime sessions can now be logged to the OpenAI platform’s traces tab. If your app uses the Agents SDK, all input and output audio gets stored in traces along with associated tool calls.

Why this matters:

  • Debugging made simpler: to diagnose a bad completion from a speech-to-speech model, you need the audio. The platform now captures the audio so you can replay the exact conversation.
  • Human-in-the-loop review: teams can review real session audio and associated tool calls to tune prompts, guardrails, or behaviors.
  • Future eval integration: the team plans to make traces convertible into evals so high-quality or misbehaving conversations can be used to build evaluation suites automatically.

June 3 Snapshot and a Speed Parameter

The team shipped a new snapshot for the Realtime API that improves instruction adherence and tool-calling accuracy. They also introduced a speed parameter, giving finer control over how fast the model speaks—important when designing a UI that mixes streaming audio and readable text.

🧩 Handoffs: A Core Primitive for Multi-Agent Workflows

Handoffs let one agent delegate control to another during a conversation. I found this pattern to be one of the most practical takeaways from the Build Hour. Here’s what I learned and how I’d apply it.

Why handoffs matter

Real-world systems are naturally modular. In the demo, there were at least three distinct roles:

  • Workspace manager — good at manipulating UI state, creating tabs, and enforcing structure.
  • Designer agent — domain expertise in interior design and ideation.
  • Estimator agent — focused on calculations, budgets, and schedule logic.

Rather than stuffing all instructions into a single monolith, the team recommended breaking agents into role-aligned components. Handoffs enable:

  • Domain specialization — each agent can have a narrow prompt and toolset optimized for its task.
  • Clear responsibilities — designers don’t accidentally become estimators if you don’t want them to, and vice versa.
  • Composable systems — you can assemble networks of agents that route tasks between them deterministically or via runtime decisions.

How handoffs work in practice

The workspace manager agent used a tool interface that allowed other agents—like the designer—to call a makeWorkspaceChanges tool. From the designer’s perspective, it calls a single tool with a brief description of the desired change. The workspace manager then performs the low-level actions (creating tabs, setting tab content) and returns a confirmation.

Practical pattern: keep a thin “manager” interface that owns state and makes changes. Let domain agents decide what should change and hand off the change requests. This separation reduces the cognitive load of each prompt and avoids duplicating logic across agents.
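That thin-manager pattern can be sketched in a few lines. The `WorkspaceManager` class and its method shapes below are my own illustration of the idea, not the demo's actual code (the demo's tool was named makeWorkspaceChanges, which I mirror here):

```typescript
// Thin manager that owns workspace state; domain agents only describe
// what should change via a single tool call.
class WorkspaceManager {
  tabs: Map<string, string> = new Map();

  // The one tool exposed to other agents.
  makeWorkspaceChanges(request: { createTab?: string; setContent?: [string, string] }): string {
    if (request.createTab) this.tabs.set(request.createTab, "");
    if (request.setContent) this.tabs.set(request.setContent[0], request.setContent[1]);
    return `ok: ${this.tabs.size} tab(s)`;
  }
}

// A domain agent never touches tabs directly — it hands off a description
// of the change and lets the manager perform the low-level actions.
function designerAgent(manager: WorkspaceManager): void {
  manager.makeWorkspaceChanges({ createTab: "inspiration" });
  manager.makeWorkspaceChanges({ setContent: ["inspiration", "warm woods, matte black"] });
}
```

Because only the manager mutates state, checkpointing or rolling back changes (relevant for the guardrail discussion below) has a single choke point.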

🛡️ Guardrails, Output Moderation, and Safety

Deploying voice agents publicly brings legitimate safety concerns: you don’t want agents going off-script or providing disallowed content. The Agents SDK supports output guardrails that inspect the transcript in near real time and take action if constraints are violated.

How real-time guardrails work

Because transcription tokens are produced faster than the audio plays back, the system can run guardrails on the transcript while the agent is still speaking. If the guardrail detects a violation, it can interrupt the agent mid-speech, inform the agent why the interruption happened, and instruct the agent to apologize or pivot to a safe behavior.

Example from the session: a designer agent that was constrained to interior design topics was prompted to create a “workspace for formulating and manufacturing a new food product.” The guardrail tripped, the agent apologized, and it continued the conversation within allowed boundaries. That’s the kind of feedback loop that prevents agents from drifting into sensitive or off-topic content.

Operational recommendations

  • Define precise guardrails for your domain. Narrow is better than broad when it comes to safety constraints.
  • Consider checkpointing workspace changes. If a guardrail triggers after making edits, you should be able to roll back or hide those edits until the interaction is validated.
  • Give the guardrail a reason. If you send a structured message back to the agent indicating why it was interrupted, the agent can apologize and adjust its response gracefully.
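The interrupt-with-a-reason loop from those recommendations can be sketched as follows. The topic check and message shapes are illustrative stand-ins, not the SDK's guardrail interface:

```typescript
// Run a guardrail over the transcript as it streams; on a violation,
// interrupt playback and hand the agent a structured reason.
type GuardrailResult = { tripped: boolean; reason?: string };

// Toy domain check: an interior-design agent should not drift into
// product manufacturing (mirroring the session's example).
function checkTranscript(partial: string): GuardrailResult {
  return partial.includes("manufacturing")
    ? { tripped: true, reason: "off-topic: manufacturing is outside interior design" }
    : { tripped: false };
}

// Called on each transcript delta; interrupts mid-speech with the reason
// so the agent can apologize and pivot gracefully.
function onTranscriptDelta(transcriptSoFar: string, interrupt: (reason: string) => void): void {
  const result = checkTranscript(transcriptSoFar);
  if (result.tripped) interrupt(result.reason ?? "policy violation");
}
```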

🔍 Traces, Evals, and Testing for Voice Agents

Operational maturity requires reliable testing and evaluation. The Build Hour emphasized multiple levels of verification for agents:

Integration tests and automated scenarios

Brian described writing an integration test that stubs tools and runs a predictable script—e.g., “make three tabs”—to verify the agent calls the expected tools. This is the equivalent of unit/integration tests for traditional back-end code and is essential for catching regressions early.

Key advice:

  • Start small. Build basic tests that assert fundamental behaviors first (tool calls, tab creation).
  • Automate test runs against agent prompts and flows as part of CI/CD.
  • Incrementally add complex scenarios and mock external services to test long flows.
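A “make three tabs”-style integration test might look like the sketch below. The agent here is a scripted stand-in (a real test would drive the actual agent with a stubbed tool), and all names are hypothetical:

```typescript
// Integration-test style check: stub the tool, run a scripted command,
// and assert the agent made the expected calls.
type TabTool = (name: string) => void;

// Stand-in for the agent under test: parses "make N tabs" and calls the tool.
function scriptedAgent(command: string, createTab: TabTool): void {
  const match = command.match(/make (\d+) tabs/);
  const n = match ? parseInt(match[1], 10) : 0;
  for (let i = 1; i <= n; i++) createTab(`tab-${i}`);
}

// The test stubs the tool and records calls instead of touching real UI state.
function runTabTest(): string[] {
  const calls: string[] = [];
  scriptedAgent("make 3 tabs", (name) => calls.push(name));
  return calls;
}
```

Recording tool calls into an array, as here, is the voice-agent equivalent of a mock-object assertion: the test verifies behavior (which tools were called) without depending on model output wording.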

Evals and human review

OpenAI’s evals ecosystem enables teams to grade agent behavior. The Build Hour showed how traces in the platform could be turned into evals—either manually today or automatically in the near future—letting you sample real conversations to capture failure modes or exemplars.

Examples of evaluations to run:

  • Correct tool usage: Did the agent call the right tool for a given user intent?
  • Follow-the-script checks: Did the agent respect conversation states and transitions?
  • User satisfaction proxies: Did the agent reflect empathy or apologize appropriately after errors?

Human-in-the-loop review

Capturing audio and the full conversation trace lets QA teams review problematic exchanges. The traces UI in the dashboard allows playback of both user audio and agent audio alongside logs of tool calls, enabling faster root cause analysis than trying to infer from text alone.

🧪 Demo Walkthrough: Building a Voice-Powered Workspace

Brian led the live demo. I took notes and replayed the flow carefully; here’s an annotated walkthrough of what he did and why it’s instructive.

Initial workspace manager

Brian started with a simple workspace interface similar to a note-taking or tabs app. Initially, he typed commands to create tabs for inspiration, project plan, and budget. That’s the “old way”—type, wait, edit. Even with a basic agent, it felt faster, but it wasn’t conversationally satisfying.

Adding voice and basic tool calls

Switching to voice, he invoked a real-time agent that responded with a robotic voice. It understood commands like “set up a workspace for a small kitchen remodel” and used its tool calls to create tabs. That demonstrated how voice can replace typing for rapid ideation, but it also exposed UX gaps: the agent’s voice didn’t narrate intermediate steps, and interruptions were not obvious in the interface.

Improving narration and UX

Brian edited the agent’s prompt to add filler phrases and instruct the model to narrate actions prior to tool calls. That improved transparency and helped users understand what the agent was doing when it called tools. Tip: If a function runs long, mention it in the function description so the agent can say, “This might take a moment, please hold on.”

Specialized designer agent and offloading to smarter models

Brian then introduced a second, specialized agent—an expert interior designer. The designer agent had a narrower prompt space, a different set of conversation states (greeting, gather inspiration, set requirements, handoff), and tools optimized for its responsibilities.

Important pattern: the designer agent didn’t try to perform budgeting calculations or complex reasoning. Instead, it called a makeWorkspaceChanges tool, which was implemented by a workspace manager agent that could run on a more capable text model (GPT-4.1 style) or on the server. This is the responder/thinker pattern: let a fast real-time voice agent handle conversation and delegate heavy reasoning to a slower, smarter model.
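The responder/thinker pattern reduces to a small shape in code. This sketch is shown synchronously for brevity — in production the thinker would be an awaited call to a more capable text model on a server — and the function names are my own:

```typescript
// Responder/thinker pattern: a fast voice agent keeps the conversation
// going while delegating heavy reasoning to a smarter model.
// (Synchronous for brevity; in production the thinker call is async.)
function slowThinker(question: string): string {
  // Stand-in for a GPT-4.1-class model doing the heavy reasoning.
  return `considered answer to: ${question}`;
}

function fastResponder(question: string, narrate: (line: string) => void): string {
  narrate("One moment while I work that out."); // filler keeps the thread alive
  const answer = slowThinker(question);         // delegate the hard part
  narrate(answer);                              // speak the result
  return answer;
}
```

The narration callback is the key design choice: the voice agent always has something to say, so the user never hears dead air while the smarter model is working.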

Search the web and tool callbacks

The designer agent used a web search tool via a GPT-4-based model to fetch “latest trends for contemporary kitchens in 2025.” The voice agent narrated “Let me search the web, one moment please,” and after a pause, summarized the results and updated the inspiration tab by calling the workspace changes tool.

That’s a powerful demonstration: a voice-first client can orchestrate background work via tool calls without losing the conversational thread. From a user perspective, it feels like a single coherent assistant doing everything; from an engineering perspective, you’ve achieved separation of concerns.

Estimator agent and extension possibilities

Brian mentioned an estimator agent as the natural handoff target to calculate budget and scheduling. He also described giving the estimator access to the Code Interpreter (or a code execution environment) to run precise calculations—e.g., uploading a bill of materials or running a Python script to compute totals. This shows how you can sequence domain experts into a multi-agent workflow and leverage execution tools for accuracy.

📱 Mobile, WebRTC, and Client Architecture

A frequent question in the Q&A was about mobile app integration. The answer rests on a few foundational points I captured from Prashant and Brian.

Use WebRTC for client → inference server

WebRTC is the recommended transport because it connects the client directly to the real-time inference server, eliminating intermediate hops. That improves latency and reliability and is particularly beneficial for mobile apps where responsiveness matters.

Client-side vs server-side responsibilities

In the demo architecture, the manager and designer agents can run on the client, while heavier tools and text models (like the workspace editor or external services) run on a server. The client gets an ephemeral Realtime API token from the server, limiting exposure if the device is compromised. This pattern enables truly serverless voice workflows for simple use cases and a hybrid approach for complex scenarios:

  • Keep conversational, latency-sensitive agents on the client (via WebRTC + real-time models).
  • Keep data-sensitive logic, heavy reasoning, or systems requiring persistent API keys on the server.
  • Route calls back to server-only tools (e.g., image generation, billing, or database updates) as needed.

Practical mobile tips

  • Use ephemeral tokens with TTLs to protect API keys.
  • Choose codecs and sampling rates wisely to balance audio quality and bandwidth usage—especially on mobile networks.
  • Design for intermittent connectivity: support local caching of conversation state and queuing of tool calls when offline.
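The ephemeral-token pattern from the tips above is simple to sketch: the server mints a short-lived token, the client uses it for the Realtime connection, and the long-lived API key never leaves the server. The token shape and TTL here are illustrative, not the platform's actual format:

```typescript
// Ephemeral-token pattern: short-lived credentials for the client.
// Shape and TTL are illustrative, not the platform's actual token format.
interface EphemeralToken {
  value: string;
  expiresAt: number; // epoch milliseconds
}

// Server side: mint a token with a short TTL to limit exposure if the
// device is compromised.
function mintToken(now: number, ttlMs: number): EphemeralToken {
  return { value: `ek_${now.toString(36)}`, expiresAt: now + ttlMs };
}

// Client side: check validity before (re)connecting; re-request on expiry.
function isValid(token: EphemeralToken, now: number): boolean {
  return now < token.expiresAt;
}
```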

🗣️ Voice Activity Detection (VAD), Semantic VAD, and Latency Controls

One technical question I found particularly interesting was about VAD and configuration choices for real-time interactions.

Basic vs semantic VAD

Historically, VAD simply observed audio amplitude and silence to detect turn ends. The newer Semantic VAD considers content as well, which prevents the model from prematurely cutting off during natural pauses in speech like “My name is…” followed by a pause. Semantic VAD improves the user experience in multi-turn conversations where people often pause to think.

Temperature and creativity vs determinism

Temperature remains an important parameter. For voice agents, the documentation and speakers recommended a constrained temperature range (roughly 0.8–1.1 for real-time models, depending on the snapshot). The higher the temperature, the more creative and less deterministic the agent is. Keep it low when you want adherence to scripts and instructions, and raise it for ideation or brainstorming.

Speed parameter

The new speed parameter lets you tune how quickly the model speaks. This is crucial when you stream text and audio simultaneously and want to allow users to read ahead. A slower voice helps users follow, but faster speech can feel more natural in certain contexts. The best practice is to experiment with mixed modes and user preferences (give control to the user when possible).
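Pulling the knobs from this section together, a session configuration might look roughly like the object below. The field names follow my reading of the Realtime API docs (semantic VAD via `turn_detection`, plus `temperature` and `speed`), so treat them as assumptions to verify against the current API reference before shipping:

```typescript
// Sketch of a Realtime session config combining the knobs discussed above.
// Field names are my reading of the Realtime API docs — verify before use.
const sessionConfig = {
  turn_detection: { type: "semantic_vad" }, // content-aware turn detection
  temperature: 0.8,                         // low end of the recommended 0.8–1.1
  speed: 1.0,                               // 1.0 = default speaking rate
  voice: "alloy",                           // illustrative voice choice
};
```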

🔁 Speech-to-Tool Scenarios and Use Cases

During the Q&A, there was a discussion around whether speech-to-speech models are appropriate if most interactions end up as tool calls. I summarized the conversation and added my own take.

When speech-to-speech + tools makes sense

If your product goal is a voice-first experience—phone-based support, home automation, or on-the-go assistants—speech-to-speech agents make a lot of sense. The agent can verbally narrate actions while delegating the heavy lifting to tools:

  • Home automation: “Set thermostat to 72 degrees”—voice agent calls tool to alter thermostat settings and narrates status.
  • Customer support flows: agent identifies intent, calls backend tools for account lookup, and returns a spoken summary with escalations if needed.
  • Phone-based workflows: integrate Twilio to let users call up agents, initiate background actions, and get verbal updates.

When you might prefer chained architectures

If your primary requirement is deep reasoning, long-form documentation generation, or heavy multimodal outputs (e.g., generating a detailed legal contract or complex Python scripts), a text-first chain with a powerful LLM for reasoning may be preferable. In practice, you often want both: a fast voice agent for the conversational front-end and delegation to a text model for the heavy thinking.

✍️ Prompting, Meta Prompts, and Designing Agent Personas

Prompting remains a critical design lever. I found the team’s emphasis on meta prompts particularly helpful. Brian referenced a meta prompt created by a coworker (Noah) designed to bootstrap voice agent personas. It includes identity, demeanor, tasks, and conversation states—essentially a template for agent personality and behavior.

Guidelines for good prompts

  • Start with a clear role and identity. Define what the agent is responsible for and what it’s not allowed to do.
  • Use meta prompts to factor common behaviors: greeting style, error handling, narration around tool calls, and transitions between conversation states.
  • Include one-shot or few-shot examples for tricky scenarios to guide preferred behavior.
  • Don’t shy away from long prompts in production. In real usage, prompts can run to hundreds of tokens to capture brand voice and detailed instructions.

Design for interruptions and collaboration

Voice interactions are often collaborative and interruptible. Prompt your agents to support interruptions gracefully—acknowledge and adopt new input without losing context. Where possible, stream partial text to the UI so users who prefer reading can “read ahead” and interrupt at will. The Agents SDK supports this out of the box when push-to-talk is disabled.

🔧 Production Considerations and Operational Playbook

It’s one thing to prototype a voice agent in a demo; it’s another to run it in production. Based on what I learned, here’s a playbook for getting to launch:

  1. Start with small, well-defined agents: pick a narrow role and build it end-to-end.
  2. Instrument everything: capture traces (audio in/out, tool calls, transcript) for every session to enable debugging and audit.
  3. Automate tests: write integration and scenario tests that assert high-level behaviors (e.g., tool calls, state transitions).
  4. Build eval suites: convert curated traces into evals to continuously measure regressions and improvements.
  5. Implement guardrails and rollback: ensure content moderation layers and state checkpointing are in place so that unsafe or irrelevant changes can be undone.
  6. Use ephemeral tokens for clients to reduce risk of API key leaks; TTLs limit exposure.
  7. Monitor for latency and cost: real-time models and high-bandwidth audio codecs can be expensive—profile your usage and tune codecs and model choices.
  8. Plan for escalation: use handoff-to-human workflows or supervised escalation when agents detect high user frustration or high-stakes queries.

📣 Q&A Highlights — Practical Answers from the Session

I compiled several practical answers that Brian and Prashant shared in the Q&A portion, framed as short takeaway points.

Which real-time model and parameters did you use in the demo?

Brian used the June 3 snapshot of the real-time model because it offered improved instruction-following and tool-calling accuracy. For transcription, the demo used Whisper-1. Codec choice depended on the demo environment; the team used high-fidelity audio because it was a live stream. In production, balance fidelity and bandwidth based on the use case.

How should I set temperature for real-time models?

Use a conservative range for real-time models; experimentation is essential. The recommended guide was roughly 0.8–1.1 for those snapshots, but the exact numbers depend on your desired balance of creativity vs determinism.

How do you handle interruptions and VAD?

Semantic VAD is the newer, more robust option. It understands content and prevents premature cut-offs during pauses. The Agents SDK supports interruption handling out of the box—agents are designed to be interruptible and continue the conversation or pivot appropriately.

How to build a voice-first mobile app?

Use WebRTC to directly connect the mobile client to the Realtime inference server. Keep latency-sensitive agents client-side and heavyweight reasoning server-side. Use ephemeral tokens that your server issues to the client with short TTLs for security. That design minimizes server hops and keeps the experience snappy.

Are speech-to-speech models useful when mostly doing tool calls?

Yes—especially when the user experience benefits from voice narration or when the emotional state and pronunciation cues matter. Even when most actions are tool calls, real-time voice agents can narrate progress and maintain conversational continuity. For deeply analytical tasks, mix in text-based models for background computation and reasoning.

How important are prompts?

Very important. The team recommended long, detailed prompts that include persona, conversation flow, and example dialogues. Use meta prompts to generate consistent, brand-aligned agent behaviors and add few-shot examples for edge cases.

🔮 Looking Ahead — Opportunities and Future Work

The Build Hour made it clear that the voice agent stack is rapidly evolving. From the improvements they shared—TypeScript Agents SDK, traces, model snapshots, and guardrail integrations—it's obvious the platform is moving toward easier, safer, and faster voice applications.

Some promising directions I would watch:

  • Eval automation from traces: turning production traces into evaluation datasets will accelerate iteration cycles and boost quality assurance.
  • Multi-agent orchestration patterns: more complex networks of agents with robust routing, state-sharing, and failure modes will enable richer application functionality.
  • Deeper multimodal integrations: voice + image generation + code execution in a single flow (e.g., generate a mood board, compute budget, and produce shopping links) will unlock new product classes.
  • Enhanced personalization and memory: refining how voice agents track and recall personal user preferences will make agents more delightful and sticky.

🏁 Conclusion

As someone who’s observed this space closely, I found the Build Hour content both practical and forward-looking. The new real-time models and tooling make voice agents more viable than ever, and the recommended architectural patterns—end-to-end speech-to-speech for conversational front-ends, delegation to smarter text models for heavy reasoning, and handoffs to domain-specific agents—are sound engineering practice.

If you’re building voice experiences, my top takeaways are:

  • Design agents by role: make them narrow and composable.
  • Instrument and test aggressively: traces and evals are your friends.
  • Plan for safety: guardrails and checkpointing reduce risk.
  • Optimize user experience: semantic VAD, narration of tool calls, and speed controls matter.
  • Use the right architecture: pick speech-to-speech for expressiveness and chained models for deep reasoning, and combine them when appropriate.

I encourage you to explore the OpenAI resources the team shared—particularly the Build Hours code repository and the voice agents guide—if you want to get hands-on quickly. I’ll be watching how these agents evolve, and I expect to see increasingly fluid, expressive voice experiences in mainstream apps very soon.

Resources I referenced during my coverage:

  • OpenAI Build Hours repository: https://github.com/openai/build-hours
  • Voice agents guide: https://platform.openai.com/docs/guides/voice-agents
  • Sign up for upcoming Build Hours: https://webinar.openai.com/buildhours

Thanks to Christine, Brian Fioca, and Prashant Mital for the demo and for sharing best practices. If you’re building voice agents, I’d love to hear what you’re trying to solve—I’m actively tracking implementations and would be happy to cover real-world case studies in future reports.


AI World Vision

AI and Technology News