Orchestrating Agents at Scale: Building, Deploying, and Optimizing Agentic Workflows with AgentKit

Today I want to walk you through a new suite of tools we introduced at OpenAI that I helped build: AgentKit. In the original presentation, I (James, a tech lead on OpenAI's forward-deployed engineering team) demonstrated how AgentKit brings together three core capabilities for designing, shipping, and improving agentic workflows: the Agent Builder, ChatKit, and integrated evals and tracing. My colleague Rohan joined me to show how you can export a workflow as code and self-host components when you need to.
This article is a deep-dive, written in first person, into what AgentKit does, how I used it to build a real workflow for maintenance engineers, and how you can take those same building blocks to run agentic systems safely and reliably at scale. I'll cover the end-to-end flow: from visually designing a workflow, to embedding a polished UI, to exporting and self-hosting the code, to monitoring, grading, and automatically optimizing models and prompts. I’ll also give practical tips and best practices I’ve learned while working on these systems.
🔧 What is AgentKit — a concise overview
AgentKit is a set of building blocks that makes it fast and intuitive to create, run, and improve agentic workflows. Agentic workflows—what many people call "agents"—are sequences of model calls, tool invocations, and logic that collaborate to solve a task. AgentKit brings three main capabilities together:
- Agent Builder — a visual canvas for dragging and dropping nodes (LLM agents, tools, logical nodes like loops and conditionals) to construct workflows. You can run workflows hosted on OpenAI's platform or export them as JavaScript/Python code.
- ChatKit — a front-end framework with pre-built components for agentic workflows. Instead of building streaming chat UIs, progress visualizers, or widgets from scratch, you can embed ChatKit components into your apps.
- Evals & Tracing — integrated observability and evaluation tools that automatically create traces for every workflow run and let you define graders to score outputs, or automatically optimize prompts based on grader feedback.
Put simply: AgentKit helps you click, connect, create — and then monitor and improve. It's built to support quick iteration in the cloud, and for teams that require it, straightforward export and self-hosting on private infrastructure.
🛠 Live demo: an agentic workflow for semi-truck maintenance
To make things concrete during our demo, I used an example that our team could imagine using in the real world: a semi-truck manufacturer that receives thousands of maintenance inquiries every day. The goal was to build a workflow that helps maintenance engineers diagnose issues and provide step-by-step instructions and the exact parts needed for repairs.
Here's the scenario I walked the audience through: a maintenance engineer (or field technician) reports that fuel economy is very poor on a particular mini-truck. The workflow should:
- Interpret and expand the user's input into a semantically rich query.
- Search the company's repair manuals (PDFs uploaded into a vector store) for the best-matching procedure.
- Extract the procedure ID and call a parts management service to get the parts associated with that procedure.
- Synthesize a clear, usable set of instructions for the engineer to follow.
- Run a guardrail that checks whether the instructions and parts are grounded in the company’s source data (to prevent hallucination).
- Return a polished response through a chat-like UI with an attractive widget showing instructions and parts.
I built that exact flow in the Agent Builder during the demo. On the app side, ChatKit provided the chat interface on the right and a custom widget displaying instructions and parts on the left. When a user enters “fuel economy has been super low on their mini truck,” the workflow executes the nodes and streams back the results in real time. It looks like a chat, but there's a full pipeline running behind it: query expansion → file search → metadata extraction → parts lookup (via an MCP tool) → synthesis → hallucination guardrail → summary.
One of the things I emphasized is how the UI components made with ChatKit are pre-built, streaming-aware, and production-ready. Building dynamic UIs for complex agentic workflows is normally very hard and error-prone. ChatKit handles the streaming, token-level updates, widget rendering, and other UX details so engineers can focus on the workflow logic.
The components used in the demo
- Query Expansion Agent — refines the user's phrase into a richer search query that yields better retrieval results.
- File Search Node — searches a hosted vector store of repair manual PDFs and retrieves the top matching document.
- Data Transform — extracts metadata (procedure ID) from the retrieved document.
- MCP Tool — calls a GraphQL API to return all parts associated with the procedure ID. In production this maps to the team's parts database.
- Synthesis Agent — composes human-readable repair instructions from the retrieved content and tool outputs.
- Hallucination Guardrail — validates the final output against the repair manuals to reduce incorrect recommendations.
- Summary/Widget — formats the final response and displays it in a visually useful way to the maintenance engineer.
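To make the plumbing concrete, here is a rough Python sketch of how those components chain together when written by hand. Every helper name here (expand_query, search_manuals, lookup_parts, synthesize) is a hypothetical stand-in for the corresponding node, the retrieval and parts lookups are stubs returning canned data, and the model names are illustrative; the code the Agent Builder actually exports looks different in detail.

```python
# Hand-rolled sketch of the demo pipeline. Helper names, stubs, and model
# choices are illustrative; the exported workflow code will differ in detail.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def expand_query(user_text: str) -> str:
    """Query Expansion Agent: turn a terse complaint into a rich search query."""
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": "Rewrite the user's issue as a detailed repair-manual search query."},
            {"role": "user", "content": user_text},
        ],
    )
    return resp.choices[0].message.content


def search_manuals(query: str) -> dict:
    """File Search Node: stand-in for retrieval over the hosted vector store."""
    return {"procedure_id": "PROC-1234", "text": "Replace clogged fuel filter ..."}


def lookup_parts(procedure_id: str) -> list[dict]:
    """MCP Tool: stand-in for the GraphQL parts-management service."""
    return [{"part": "Fuel filter", "sku": "FF-88"}]


def synthesize(document: dict, parts: list[dict], query: str) -> str:
    """Synthesis Agent: compose human-readable repair instructions."""
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": "Write step-by-step repair instructions grounded ONLY in the provided manual excerpt and parts list."},
            {"role": "user", "content": f"Query: {query}\nManual: {document['text']}\nParts: {parts}"},
        ],
    )
    return resp.choices[0].message.content


def run_pipeline(user_text: str) -> str:
    query = expand_query(user_text)
    document = search_manuals(query)
    parts = lookup_parts(document["procedure_id"])
    return synthesize(document, parts, query)


if __name__ == "__main__":
    print(run_pipeline("Fuel economy has been super low on the truck."))
```

The hosted workflow replaces the stubs with the real file-search and MCP tools and adds the guardrail and widget steps, but the shape of the data flow is the same.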
When we needed to tweak the output (I wanted a subtler or funnier Dev Day pun in the summary), I was able to edit the summary prompt in the Agent Builder, attach a pre-built widget from Widget Studio, and publish—deploying an updated workflow in seconds. A simple page refresh in the app pulled in the new behavior immediately because the workflow was running as a hosted process in the platform.
🧩 Building workflows in the Agent Builder (visual authoring)
The Agent Builder is the "glue" piece where conversations about behavior, prompt engineering, and business logic come together. I described it as a canvas where developers or subject-matter experts can drag components into place and wire them together.
Key capabilities of the Agent Builder:
- Node picker — add LLM agents, tool wrappers (file search, MCP), and control-flow nodes (while, if/else).
- Live previews — run the workflow in the browser to see how it behaves before deploying.
- Prompt editing — each agent exposes prompts or instructions that you can edit inline; prompts support variables tied to upstream outputs.
- Guardrails — add verification nodes that assert outputs are grounded in your data (e.g., check for hallucination using the same vector store you searched).
- Widgets — attach a visual component to the final node so the UI can render the output in a friendly, actionable format.
- Publish & Deploy — once satisfied, click Publish to deploy the workflow to OpenAI-hosted production, or export it as code.
I intentionally demonstrated an edit-and-deploy iteration: delete a few nodes, tweak the prompt to make the output more digestible, upload a widget from Widget Studio, and publish. The whole loop took only a few seconds between editing and seeing the updated UI. This is the kind of rapid iteration that frees teams up to focus on quality of content and behavior rather than plumbing and infra.
Why visual composition helps
Agentic systems can quickly become complex. As workflows grow, understanding the flow of data between agents, tooling, and checks becomes essential. Visual composition helps in three ways:
- Readability: Stakeholders can understand the high-level pipeline without reading dozens of lines of code.
- Faster iteration: It's quicker to swap nodes, change prompts, or insert a guardrail graphically than to chase down configuration across repositories.
- Collaboration: Non-engineers (support leads, domain experts) can help craft or validate prompts and graders in context.
That said, visual builders don't replace code entirely. Which brings us to the next big capability: exporting and self-hosting.
🖥️ ChatKit — building polished, streaming-ready UIs
ChatKit is our front-end layer for agentic workflows. It's a set of pre-built components for chat interfaces, streaming token updates, and custom widgets so you don't have to reinvent the UI parts of an agentic application.
Core features of ChatKit:
- Streaming support: Token-by-token streaming so users see progress in real time.
- Widgets and visual components: Render the output of workflows using configurable components (e.g., instructions + parts list for a repair job).
- Integration points: ChatKit speaks a simple protocol — receive messages, send outputs — so you can run a ChatKit front-end against a hosted OpenAI backend or your own server implementing the ChatKit API.
- Pre-built UX: Chat, conversation history, tool usage indicators, and other elements that are non-trivial to implement robustly.
During the demo, I used ChatKit for the engineer-facing interface. The chat on the right streamed the trace of the running workflow, while the widget rendered the final instructions and parts. The ease of plugging in a widget and having it rendered responsively was a highlight: we built the widget in Widget Studio, uploaded it, previewed it, and published it from the Agent Builder.
Embedded components and consistent experiences
One of the practical benefits of ChatKit is consistency: teams across an organization can reuse the same chat components and widgets to provide a unified UX. This reduces cognitive load for users and reduces engineering effort across multiple teams wanting similar behaviors (customer support, field operations, internal tooling).
🧾 Exporting workflows and self-hosting (Rohan’s demo)
In many enterprises, you can't put all your logic or data on a public service for compliance, latency, or integration reasons. Rohan showed how to take a workflow built in the Agent Builder and run it on your own infrastructure.
Here's how the export and self-host flow works in practice:
- From the Agent Builder, click the Code button to export the workflow as JavaScript or Python. The export uses the OpenAI Agents SDK — an open-source library we released earlier.
- Paste the exported code into your own editor or repo. The exported code includes all agents, tools, and orchestration that the visual builder described.
- Replace any hosted tools with local equivalents. For example, swap calls to the hosted MCP server with a GraphQL endpoint running in your private cloud or a local SSE server.
- Run the workflow locally or in your private cloud. Modify ChatKit client configuration to point at your ChatKit-compatible backend endpoint (e.g., /api/chatkit) instead of openai.com.
Rohan demonstrated this live: he exported Python code for the same maintenance workflow, switched the MCP tool to talk to a local SSE server, updated the ChatKit endpoint to the local backend, and restarted the application. The chat and widget continued to work unchanged from the front-end perspective, but now everything ran in his local environment.
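As a rough illustration of the kind of swap Rohan made, here is a sketch of replacing a hosted parts lookup with a call to a GraphQL service in your own environment, assuming the open-source Agents SDK's Agent/Runner/function_tool interface. The endpoint URL, the GraphQL query shape, and the agent wiring are assumptions for illustration, not the exact exported code.

```python
# Sketch: swap the hosted MCP parts tool for a local GraphQL call.
# Assumes the open-source Agents SDK (pip install openai-agents) and a
# hypothetical parts service running at http://localhost:4000/graphql.
import httpx
from agents import Agent, Runner, function_tool

PARTS_ENDPOINT = "http://localhost:4000/graphql"  # your private-cloud service


@function_tool
def get_parts_for_procedure(procedure_id: str) -> list[dict]:
    """Return the parts associated with a repair procedure ID."""
    query = """
    query Parts($id: ID!) {
      procedure(id: $id) { parts { name sku quantity } }
    }
    """
    resp = httpx.post(PARTS_ENDPOINT, json={"query": query, "variables": {"id": procedure_id}})
    resp.raise_for_status()
    return resp.json()["data"]["procedure"]["parts"]


synthesis_agent = Agent(
    name="Synthesis Agent",
    instructions="Given a procedure ID, look up its parts and write repair instructions.",
    tools=[get_parts_for_procedure],
)

if __name__ == "__main__":
    result = Runner.run_sync(synthesis_agent, "Procedure PROC-1234: poor fuel economy fix")
    print(result.final_output)
```

Because only the tool implementation changes, the rest of the exported workflow and the ChatKit front-end are untouched.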
What this unlocks for teams
Self-hosting with exported code enables:
- Data locality: Workflows that need access to private cloud resources (internal databases, proprietary document stores) can run without exposing those resources externally.
- Compliance: Teams with strict data governance can keep data on-premises while still using the Agent Builder for authoring.
- Debugging and extension: Developers can add custom instrumentation, replace tools, or integrate richer business logic in code while preserving the original, designer-friendly workflow.
The important point is that you don't lose front-end features like streaming tokens, reasoning summaries, or widgets when self-hosting. AgentKit was designed so the visual-level and UX-level investments remain valuable regardless of where the back-end runs.
📊 Observability: traces, graders, and evaluating at scale
When you run agentic workflows for a few users, everything may appear smooth. But when you scale to hundreds, thousands, or millions of users, problems emerge: model outputs that hallucinate, incorrect tool usage, slow paths, or logic bugs. To manage that, AgentKit includes integrated tracing and an evals workflow that automatically creates and surfaces traces for every run.
Each time a workflow executes — whether in ChatKit, the preview, or deployed — a trace is generated. A trace is a full audit of what happened during the run: the agents invoked, the order of operations, timing, model used, tokens in/out, tool calls, and outputs. Within the Agent Builder, clicking the Evaluate button opens a traces view that summarizes executions and allows you to dive in.
What traces show you
- Agent timeline: A visual sequence of agents and nodes that executed and how long each took.
- Model-level details: Which model was called, token counts, inputs and outputs for that call.
- Tool usage: When and what tools were invoked (e.g., file search, MCP), including parameters and responses.
- Guardrail results: Whether a guardrail check passed or failed and why.
But traces are only as actionable as your ability to interpret them. That's why we built a simple way to ask a grader — another model — to judge whether a trace produced a correct or desirable result. In the demo I added a grader: "Was the final output correct and readable?" The grader can inspect the entire trace and make a pass/fail call. In production you can add more granular graders that check business rules, policy compliance, or steps being taken correctly (for example, whether a refund agent asked for supervisor approval before issuing a refund).
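Conceptually, a grader of this kind is just another model call that reads the trace and returns pass or fail. Here is a minimal sketch of that idea in plain Python; the hosted graders in AgentKit are configured in the UI rather than written like this, and the trace shape shown is illustrative, not the platform's schema.

```python
# Minimal model-graded check: did the run produce a correct, readable answer?
# The trace dict shape here is illustrative, not the platform's trace schema.
import json
from openai import OpenAI

client = OpenAI()


def grade_trace(trace: dict) -> bool:
    """Ask a model to judge the final output of a workflow run."""
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": (
                "You are a strict grader. Given a workflow trace, answer PASS if the "
                "final output is correct, grounded in the retrieved manual text, and "
                "readable by a field technician. Otherwise answer FAIL."
            )},
            {"role": "user", "content": json.dumps(trace)},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")


example_trace = {
    "input": "Fuel economy has been super low.",
    "retrieved_manual": "Replace clogged fuel filter (procedure PROC-1234) ...",
    "final_output": "1. Park and chock the truck. 2. Replace fuel filter FF-88. 3. Road-test.",
}
print(grade_trace(example_trace))
```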
✅ Building evals (grading) and optimizing workflows
Evals are the next level of observability. An eval links inputs, ground truth, outputs, and graders so you can measure performance against known examples. In the Agent Builder, you can open a visual eval builder for any agent and configure:
- Example inputs (the user messages or prompts you want to test).
- Ground truth content for comparison (expected answers, canonical procedure IDs, or golden transcripts).
- Model outputs — the actual outputs the system produced during the run.
- Graders — automated judges that evaluate correctness, formatting, readability, or other criteria.
In the demo, I showed two graders: a formatting grader and a correctness grader. The left pane showed the agent prompt (with variables). The right columns showed the sample inputs, the ground truth for those inputs, and the model outputs for that agent. The graders can reference any of these columns to make decisions.
When I first ran the eval against a handful of examples, the formatting grader passed only 40% of the time. I edited the prompt to be clearer about expected formatting and regenerated outputs for the same examples. After rerunning the grader, the formatting score improved to 80%. That improvement wasn't guesswork — it was empirical. The graders applied the same criteria to historical or new examples so changes are measured consistently.
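The measure-edit-remeasure loop is easy to reason about in code. The sketch below runs a formatting grader over a small example set and reports a pass rate; the grader here is a deliberately simple heuristic stand-in for the model-based graders you would configure in the eval builder.

```python
# Tiny eval harness: run a grader over examples and report an aggregate pass rate.
# The formatting grader is a heuristic stand-in for a model-based grader.
def formatting_grader(output: str) -> bool:
    """Pass if the output is a numbered list with at least three steps."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    numbered = [line for line in lines if line[:2].rstrip(".").isdigit()]
    return len(numbered) >= 3


examples = [
    {"input": "Low fuel economy", "output": "1. Check air filter\n2. Replace fuel filter\n3. Verify injectors"},
    {"input": "Brakes squeal", "output": "Replace the brake pads and test drive."},
]

passed = sum(formatting_grader(ex["output"]) for ex in examples)
print(f"formatting pass rate: {passed / len(examples):.0%}")
```

Rerunning exactly this kind of harness before and after a prompt change is what turns "the output looks better" into a measured improvement.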
Automatic prompt optimization
One of the features I found especially compelling during the demo was the Optimize button. For complex grading tasks, manually tuning prompts can be difficult and time-consuming, especially if you have many graders or subtle failure modes. The Optimize feature takes the existing examples, the grader results, and the reasoning for why examples passed or failed, and synthesizes improved prompt instructions automatically.
Optimization works by looking at:
- Your current prompt and variables.
- Ground truth examples and the outputs that failed.
- Grader judgments and any human-provided explanations for pass/fail outcomes.
Then it suggests a prompt revision aimed at improving grader outcomes. In many cases for hard-to-tune agents, this can yield significant gains and save a lot of manual prompting work.
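You can think of the core idea as one more model call that reads the failures and proposes a revision. The sketch below shows that idea in its simplest form; the platform's Optimize feature is more sophisticated, and this structure is an assumption for illustration, not its implementation.

```python
# Naive prompt optimizer: ask a model to revise a prompt given failing examples
# and grader explanations. Illustrative only; not the platform's Optimize logic.
from openai import OpenAI

client = OpenAI()


def suggest_prompt_revision(current_prompt: str, failures: list[dict]) -> str:
    failure_report = "\n\n".join(
        f"Input: {f['input']}\nOutput: {f['output']}\nGrader said: {f['reason']}"
        for f in failures
    )
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": "Rewrite the prompt so the listed failures would pass the grader. Return only the revised prompt."},
            {"role": "user", "content": f"Current prompt:\n{current_prompt}\n\nFailures:\n{failure_report}"},
        ],
    )
    return resp.choices[0].message.content


failures = [
    {"input": "Low fuel economy", "output": "Replace the filter.",
     "reason": "Not a numbered list; fewer than three steps."},
]
print(suggest_prompt_revision("Write repair instructions for the technician.", failures))
```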
🔍 Best practices for orchestrating agents at scale
From building this system and helping customers deploy AgentKit workflows, I've compiled a set of practical best practices you should consider when building agentic systems for real-world teams.
1. Start small; iterate quickly
Begin with a minimal workflow that solves a specific pain point. Use the Agent Builder to prototype rapidly and ship a production-ready flow. Collect traces and real user inputs to guide the next set of improvements.
2. Make guardrails explicit
Put guardrail nodes near points where hallucinations or incorrect tool outputs can have significant consequences. Validate that outputs reference the same canonical sources you used to retrieve content (e.g., vector store or internal docs).
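One cheap way to make this explicit is a groundedness check that refuses to return part numbers or procedure codes that never appear in the retrieved source text. The sketch below is a simple lexical version of that idea, assuming you keep the retrieved chunks alongside the draft answer; a model-based hallucination guardrail (like the one in the demo) is stronger but follows the same shape.

```python
import re


def grounded_parts(answer: str, source_chunks: list[str]) -> bool:
    """Fail if the answer cites a part or procedure code absent from the sources."""
    source_text = " ".join(source_chunks)
    cited_codes = set(re.findall(r"\b[A-Z]{2,}-\d+\b", answer))  # e.g. FF-88, PROC-1234
    missing = {code for code in cited_codes if code not in source_text}
    return not missing


sources = ["Procedure PROC-1234: replace fuel filter FF-88 and prime the system."]
print(grounded_parts("Fit part FF-88 per PROC-1234.", sources))  # True: grounded
print(grounded_parts("Fit part XX-99 per PROC-1234.", sources))  # False: XX-99 not in sources
```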
3. Keep prompts modular and testable
Design prompts as modular components exposed on agents. That lets you evaluate each agent independently with its own graders and ground truth. When a workflow fails, you can often identify the weakest agent rather than reworking the entire pipeline.
4. Grade early and often
Use graders even in development, with small example sets. Over time, expand your dataset as real user inputs arrive so your evaluations represent production behavior.
5. Leverage exports for compliance and integrations
If you have compliance or latency requirements, export the workflow and run key components on-premises or in a private cloud. Keep the front-end UX consistent by pointing ChatKit at your local API endpoint.
6. Monitor traces for performance bottlenecks
Traces show much more than correctness: they reveal latency, tool failures, and costly model calls. Use tracing to identify agents that are slow or frequently failing and prioritize optimization there.
7. Version workflows and prompts
Treat workflows as artifacts: version them, test changes in staging, and deploy with clear rollbacks. The Agent Builder's publish workflow + code export make it easy to snapshot a workflow at release time.
8. Involve domain experts in grader creation
Graders are only as useful as the metrics they encode. Domain experts should author grader logic (or the ground truth examples) so the evals reflect the real criteria that matter for your business.
🔁 Common use cases and real-world examples
AgentKit applies to many domains where complex, multi-step reasoning and tool use are needed. Below are several examples that illustrate the breadth of its usefulness.
Maintenance and field operations
The demo workflow is a perfect example: leveraging internal repair manuals, parts databases, and synthesis agents to give field technicians precise instructions and parts lists. Guardrails ensure that safety-critical instructions are grounded.
Customer support and help centers
Support teams can wire together retrieval over knowledge bases, policy checkers, and escalation tools (e.g., open a ticket) into a single workflow. ChatKit provides the chat experience, and graders enforce policy compliance or SLA behaviors (like checking if supervisor approval was sought).
eCommerce and fulfillment
Workflows can query inventory systems, run rules engines for discounts, and compose responses to customers about order status. Guardrails help prevent incorrect price substitutions or incorrect stock statements.
Developer and data operations
Teams can create agents that triage infrastructure alerts by searching documentation, running diagnostics tools, and suggesting fixes. The visual builder allows SREs to articulate logic without committing to large code changes early.
Legal and compliance workflows
Agents can search internal policies, check clauses, and apply compliance rules to draft responses or redline documents. Exporting workflows ensures that sensitive data stays in the private cloud if necessary.
🧭 A practical, step-by-step guide to get started
If you're ready to try AgentKit (or a similar orchestration approach) in your organization, here's a pragmatic checklist that will help you ship faster.
- Define a narrow first scope: Pick one clear task with measurable success criteria (e.g., reduce ticket resolution time for a specific issue by X%).
- Collect representative data: Assemble example user inputs and the canonical answers (ground truth) you want the agent to return.
- Prototype in the Agent Builder: Compose a minimal pipeline with retrieval, a synthesis agent, and a guardrail. Add a simple widget if UI makes the result more usable.
- Test locally & iterate: Use previews and small eval suites to ensure agents behave. Collect traces.
- Publish to staging: Deploy the workflow to a staging environment with ChatKit and let a small set of users exercise it.
- Author graders: Build formatting and correctness graders that reflect your business rules.
- Optimize: Use the Optimize feature for prompts you struggle to tune manually. Re-run graders on updated outputs.
- Export if needed: If data or compliance requires it, export the workflow code and self-host. Ensure ChatKit endpoints are pointed to your environment.
- Monitor in production: Use traces and aggregate grader results to spot regressions and new error modes.
- Maintain versioning and rollbacks: Keep a release plan for workflows, so you can revert any changes that cause regressions.
🔐 Security, privacy, and compliance considerations
Agentic workflows often interface with internal tooling and sensitive data. Here are some practical policies and patterns I recommend adopting:
- Least privilege for tools: Ensure tools invoked by agents have granular permissions and audit logs.
- Data minimization: Only send the necessary data into model prompts or to external services.
- Configurable guardrails: Use guardrail nodes to enforce policy checks (PII detection, legal clauses, or safety constraints) before content is returned to users.
- Local hosting for sensitive resources: Export flows and run them in private clouds when data must not leave your environment.
- Trace retention policies: Decide how long to keep traces, and scrub sensitive tokens or payloads from long-term storage if necessary (a minimal scrubbing sketch follows this list).
- Human-in-the-loop: For high-risk decisions, require human review before any action that has significant consequences (financial transactions, legal advice, safety-critical instructions).
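For the trace-retention point above, a simple pre-storage scrubber is often enough to start with. This is a minimal sketch of redacting obvious sensitive fields before a trace is persisted; the field names and patterns are placeholders for whatever your payloads actually contain.

```python
# Redact sensitive keys and obvious PII patterns before persisting a trace.
# SENSITIVE_KEYS and the regexes are placeholders for your own data shapes.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
SENSITIVE_KEYS = {"api_key", "authorization", "customer_email"}


def scrub(value):
    """Recursively redact sensitive keys and PII-looking strings in a trace payload."""
    if isinstance(value, dict):
        return {k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else scrub(v) for k, v in value.items()}
    if isinstance(value, list):
        return [scrub(v) for v in value]
    if isinstance(value, str):
        return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", value))
    return value


trace = {"input": "Call me at +1 555 010 0000", "customer_email": "tech@example.com"}
print(scrub(trace))  # {'input': 'Call me at [PHONE]', 'customer_email': '[REDACTED]'}
```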
⚙️ Developer tooling and integrations
AgentKit integrates well with standard developer workflows. The exported code uses the OpenAI Agents SDK, allowing you to:
- Integrate with CI/CD pipelines — test workflows as part of your pipeline with automated evals.
- Add custom monitoring and logs — instrument traces with your observability stack.
- Extend tools — implement bespoke tools (e.g., internal ticketing systems, inventory services) and plug them into the workflow easily.
- Automate deployment — script the publish/export process so changes to workflows can be managed like any other deployment artifact.
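As one example of the CI/CD point, you can wrap a small golden dataset and a grader in an ordinary test so workflow or prompt changes fail the build when the pass rate regresses. The threshold, golden examples, and run_workflow stub below are placeholders for your own wiring, not anything AgentKit mandates.

```python
# Example CI gate (pytest): fail the build if the grader pass rate regresses.
# The golden examples, threshold, and run_workflow stub are placeholders.
PASS_RATE_THRESHOLD = 0.8

GOLDEN_EXAMPLES = [
    {"input": "Fuel economy has been super low."},
    {"input": "Brakes squeal when stopping."},
]


def run_workflow(user_input: str) -> str:
    # Replace with a call to your exported workflow (or the hosted endpoint).
    return "1. Inspect the air filter\n2. Replace the fuel filter\n3. Road-test the truck"


def formatting_grader(output: str) -> bool:
    """Pass if the output is a numbered list with at least three steps."""
    return sum(1 for line in output.splitlines() if line.strip()[:2].rstrip(".").isdigit()) >= 3


def test_grader_pass_rate():
    results = [formatting_grader(run_workflow(ex["input"])) for ex in GOLDEN_EXAMPLES]
    pass_rate = sum(results) / len(results)
    assert pass_rate >= PASS_RATE_THRESHOLD, f"pass rate {pass_rate:.0%} below threshold"
```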
🧠 On the human side: collaboration and ownership
Building successful agentic workflows is not just technical work. It requires cross-functional collaboration:
- Product managers define user-facing goals and success metrics.
- Domain experts author ground truth and grader logic.
- Engineers integrate tools, export workflows, and run self-hosted components.
- Designers choose UX components and craft helpful widgets.
- Operators monitor traces and handle incident response.
The Agent Builder and ChatKit were designed with this collaboration in mind: visual artifacts make it easier for non-engineers to participate, while exports let engineers take over when deeper integration is required.
📈 Measuring success: metrics to track
Decide upfront what success looks like. Useful metrics include:
- Task completion rate: Percentage of interactions where the agent provided a correct and usable answer.
- Time-to-resolution: For support or maintenance contexts, how long did it take the user to resolve their issue using the agent?
- Human escalation rate: How often did the agent escalate to a human?
- Grader pass rate: Aggregate results from your graders over time.
- Average model cost: Tokens consumed and downstream service calls to keep an eye on cost growth.
- Latency per agent: From traces, which agents create the longest tail latencies?
Tracking these metrics helps you prioritize whether it's a prompt to update, a tool to optimize, or an architectural change (like moving a call on-premises) that's needed.
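If you export or log trace records, several of these metrics fall out of a few lines of aggregation. The record shape below is illustrative; adapt the field names to whatever your tracing store actually emits.

```python
# Aggregate a few of the metrics above from logged trace records.
# The record shape is illustrative, not the platform's trace schema.
traces = [
    {"grader_pass": True,  "escalated": False, "tokens": 2300, "latency_ms": {"file_search": 420, "synthesis": 1800}},
    {"grader_pass": False, "escalated": True,  "tokens": 3100, "latency_ms": {"file_search": 390, "synthesis": 2600}},
]

pass_rate = sum(t["grader_pass"] for t in traces) / len(traces)
escalation_rate = sum(t["escalated"] for t in traces) / len(traces)
avg_tokens = sum(t["tokens"] for t in traces) / len(traces)
worst_synthesis = max(t["latency_ms"]["synthesis"] for t in traces)

print(f"grader pass rate:        {pass_rate:.0%}")
print(f"escalation rate:         {escalation_rate:.0%}")
print(f"avg tokens per run:      {avg_tokens:.0f}")
print(f"worst synthesis latency: {worst_synthesis} ms")
```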
🔮 The future of agent orchestration
AgentKit is an early expression of what I think is a necessary platform for engineering teams adopting agentic workflows at scale. In my view, future work will center on:
- More powerful optimization helpers: Automated multi-objective optimization that balances correctness, formatting, latency, and cost.
- Stronger cross-team governance: Organizational tools to control who can publish workflows and what guardrails are required for certain domains.
- Smarter traces: Built-in anomaly detection that surfaces unexpected behavior without a user having to write a grader.
- Run-time policy enforcement: Live prevention of disallowed responses or actions by hooking into guardrails at the model-call layer.
This is an exciting space because the number of real-world domains that can be improved by agentic orchestration is vast. But it's also a space that requires careful tooling to ensure outputs are reliable, safe, and auditable.
📣 Wrapping up: how I recommend you proceed
Here’s how I’d advise teams to evaluate AgentKit-style tooling in their orgs:
- Identify a high-impact use case that’s appropriately scoped (customer support, internal ops, manufacturing maintenance).
- Prototype in a hosted environment using visual authoring (Agent Builder) and ChatKit to validate UX and user acceptance fast.
- Instrument with graders and traces from day one so you understand baseline behavior.
- If data residency or compliance matters, export and self-host early, not as an afterthought.
- Iterate using automated optimization where prompt tuning is hard, and include domain experts in grader design.
We built AgentKit to shorten the loop between idea and production and to give teams the tools they need to operate agentic systems responsibly at scale. I’ve seen firsthand how adding things like guardrails and eval-driven optimization prevents regressions and improves user outcomes faster than ad-hoc manual tuning.
“Your agentic workflow is only as strong as the weakest agent.”
I shared that line in the demo and I’ll repeat it here because it captures the essence of why traceability, per-agent evaluation, and targeted optimization are so important: a single weak agent can degrade the entire user experience. By breaking systems into agents, grading them independently, and optimizing each one, you raise the bar for the whole workflow.
📚 Additional resources and next steps
If you want to explore further:
- Try visually composing a small workflow in Agent Builder and connect it to a ChatKit demo UI to feel the end-to-end experience.
- Author a few graders early — even simple formatting checks provide immediate signal.
- If you have compliance needs, practice exporting a workflow and running the exported code in a sandboxed internal environment.
- Think about instrumentation for traces and how you will monitor grader metrics over time.
Rohan and I will be available in the OpenAI Discord and at Dev Day for questions, and we’re excited to see what people build with these tools. AgentKit is designed to make building agentic workflows approachable, safe, and maintainable — whether you run them in the cloud or in your own data center.
🎯 Final thoughts
AgentKit brings together visual design, polished front-end components, and a tight feedback loop of tracing and grading. The combination lets teams prototype quickly, ship safely, and improve iteratively. From the maintenance engineer solving a truck repair to support agents handling complex customer workflows, these tools make it practical to deliver reliable agentic experiences.
If you’re exploring agentic solutions, my single piece of advice is this: measure each agent, validate outputs against real ground truth, and automate the hard parts of prompt optimization when you can. That approach will keep your workflows dependable as you scale.
Thanks for reading — and if you want to discuss specific use cases or tooling integration, I’m happy to help. We can dive into prompts, grader design, or the export/self-host workflow together.