Context Engineering & Coding Agents with Cursor

I recently gave a talk, published on OpenAI's channel, about how building software has evolved and how Cursor is driving the next wave of that evolution. In the presentation I explained how we moved from simple autocomplete to fully autonomous coding agents, how context engineering shapes agent behavior, and what we’re exploring next to make agents trustworthy, productive teammates. Michael, Cursor's CEO, closed the session with a vision for where software engineering is headed. In this article I’ll report on that talk, expand on the technical and product details I covered, and offer practical guidance for teams and engineers who want to adopt coding agents today.

From Punch Cards to AI: A Short History 🧭

I opened by placing the current moment in historical context. That matters because the arc of software tooling helps explain why today's tools feel so transformative. Programming used to be an arcane art: punch cards and line terminals in the 1960s made writing code a specialist's job. In the late 1970s and early 1980s many of us "grew up" on BASIC running on Apple IIs and Commodore 64s; those environments made programming more approachable, but still tactile.

The 1980s introduced graphical user interfaces at scale, yet most programming stayed rooted in text-based terminals. It wasn’t until the 1990s and 2000s that visual tools for building software became mainstream: front-end editors like FrontPage and Dreamweaver gave beginners a drag-and-drop path to websites, while full-featured IDEs such as Visual Studio enabled professionals to work across massive codebases with tooling for debugging, refactoring, and navigation. Sublime Text and other editors made it easier for power users to bend code to their will.

Now, with AI, we’re accelerating yet another shift. The transition from terminals to GUIs unfolded over decades. The shift to coding with AI is happening on a hyper-compressed timeline: decades of progress being speedrun into a few years. That acceleration brings both opportunity and nuance: models get better, UX evolves quickly, and the design space for "how we build software" is actively being rewritten.

Next Action Prediction: Tab and the Rise of Autocomplete ✨

One of the earliest lessons we took to heart is that small changes in user experience can dramatically amplify model usefulness. GitHub Copilot popularized the idea that autocomplete can be an actual productivity multiplier. At Cursor we launched Tab in 2023 to explore how far next-action prediction could go — from predicting the next token to predicting the next line, and eventually to predicting where your cursor will move next.

Tab became a workhorse: it now handles over 400 million requests per day. That volume gave us a reliable signal about what users accept and reject, and it let us move beyond off-the-shelf models to specialized models trained specifically for next-action prediction.

Balancing speed, quality, and UX

Autocomplete is deceptively hard to design well because it sits inside the developer's flow. Two factors dominate the UX trade space:

  • Latency: If a suggestion is slower than ~200 milliseconds it interrupts flow. Fast suggestions are critical.
  • Relevance: Super-fast but low-quality suggestions are worse than slightly slower but useful ones.

To manage that tradeoff we now show fewer suggestions, and each one appears only when the model is confident it will be accepted. The model itself is also continuously updated with user feedback.
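
To make that concrete, here is a minimal sketch of the gating idea: show at most one suggestion, and only when it clears both a latency bar and a confidence bar. The `Candidate` shape and the specific thresholds are illustrative assumptions, not Cursor's internals.

```typescript
// Hypothetical shape of a completion candidate returned by a model server.
interface Candidate {
  text: string;
  confidence: number; // estimated probability of acceptance, in [0, 1]
  latencyMs: number;  // time the server took to produce this candidate
}

// Show at most one suggestion, and only when it is fast enough to stay in
// the developer's flow and confident enough to be worth the interruption.
function pickSuggestion(
  candidates: Candidate[],
  maxLatencyMs = 200,
  minConfidence = 0.6,
): Candidate | null {
  const usable = candidates
    .filter(c => c.latencyMs <= maxLatencyMs && c.confidence >= minConfidence)
    .sort((a, b) => b.confidence - a.confidence);
  return usable[0] ?? null; // show nothing rather than a weak suggestion
}
```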

Online reinforcement learning and fast feedback

We use data from accepts and rejects as reinforcement signals. Positive feedback reinforces behaviors that lead to accepted suggestions; negative feedback attenuates ones that lead to rejections. Crucially, we run a near-real-time loop: accept a suggestion, and within roughly 30 minutes that feedback can be applied to the Tab model using online RL.

That speed is more than just a neat engineering trick — it materially improves the quality of suggestions that developers see. The cycle of suggestion → accept/reject → model update makes Tab adapt to our users’ patterns fast.
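
A rough sketch of what that loop can look like is below. The event shape, the simple +1/-1 reward scheme, and the drain interval are all stand-ins for illustration, not our actual training pipeline.

```typescript
// Illustrative accept/reject event emitted by the editor.
interface FeedbackEvent {
  suggestionId: string;
  prompt: string;
  completion: string;
  accepted: boolean;
  timestamp: number;
}

// Turn raw feedback into (sample, reward) pairs for a policy update.
// Accepted suggestions get positive reward, rejected ones negative.
function toRewardBatch(events: FeedbackEvent[]) {
  return events.map(e => ({
    input: e.prompt,
    output: e.completion,
    reward: e.accepted ? 1 : -1,
  }));
}

// A near-real-time loop: drain recent feedback every few minutes and hand it
// to whatever trainer applies the online RL update (not shown here).
async function feedbackLoop(
  drain: () => Promise<FeedbackEvent[]>,
  applyUpdate: (batch: ReturnType<typeof toRewardBatch>) => Promise<void>,
  intervalMs = 5 * 60 * 1000,
) {
  while (true) {
    const events = await drain();
    if (events.length > 0) await applyUpdate(toRewardBatch(events));
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
}
```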

When Autocomplete Isn't Enough: Enter Coding Agents 🤖

Autocomplete helps where the bottleneck is the user's typing speed. Most people type around 40 words per minute. Autocomplete can help fill repetitive patterns and common idioms, but what if the models could write larger chunks of code, reason across files, or perform multi-step edits?

This is where coding agents come in. Agents let you talk to models directly in the editor, have them create or update entire blocks of code, and orchestrate multi-file edits. The evolution inside Cursor looks like this:

  1. Inline suggestions: pass current line + file context to a model; get a diff-style edit back.
  2. Composer (early 2023): a conversational UI for multi-file edits and higher-level prompts.
  3. Fully autonomous agents (2024): models that self-gather context and call tools, allowing longer interactions and broad edits.

We focused intensely on giving users control over the level of autonomy an agent has. A developer should be able to choose: do I want a small suggestion, an inline diff, or a multi-file autonomous refactor?

Context Engineering: Intentional Context and Retrieval 🧠

As models become more capable, the old tricks of prompt engineering give way to a broader discipline I call context engineering. Better outputs are rarely the result of clever token-level prompts alone. Instead, it’s about supplying the model with the right, intentional context — a minimal, high-quality set of tokens that help the model reason correctly.

Models tend to degrade at recalling information as the context length increases. Throwing the entire repository into the prompt is both expensive and counterproductive. So the question becomes: how do we retrieve the fewest, best bits of context a model needs to do its job?
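
One way to think about it is as packing a fixed token budget with the highest-value chunks. The sketch below is illustrative; the chunk shape and the crude character-based token estimate are assumptions, not how Cursor's retrieval actually works.

```typescript
// A retrieved chunk of context (file excerpt, doc snippet, etc.) with a
// relevance score from whatever retrieval system produced it.
interface ContextChunk {
  source: string;  // e.g. a file path
  text: string;
  score: number;   // higher = more relevant
}

// Very rough token estimate; real systems use the model's tokenizer.
const estimateTokens = (s: string) => Math.ceil(s.length / 4);

// Pack the highest-scoring chunks into a fixed token budget instead of
// dumping the whole repository into the prompt.
function assembleContext(chunks: ContextChunk[], budgetTokens: number): string {
  const picked: string[] = [];
  let used = 0;
  for (const chunk of [...chunks].sort((a, b) => b.score - a.score)) {
    const cost = estimateTokens(chunk.text);
    if (used + cost > budgetTokens) continue;
    picked.push(`// ${chunk.source}\n${chunk.text}`);
    used += cost;
  }
  return picked.join("\n\n");
}
```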

Retrieval is the backbone of context engineering

We build systems that retrieve high-quality context automatically. For example, if a user asks an agent to update the top navigation, you don't want the agent to blindly search for "top nav" strings across the repo. Instead, semantic retrieval can find the file named header.tsx even though the request used different words. This saves tokens, reduces latency, and improves correctness.

Searching at Scale: Grep, Semantic Search, and Embeddings 🔍

Most coding agents start with a tool like grep or ripgrep to find literal matches across files. Those tools are fast and precise for string matches, but they miss semantic equivalence and synonyms. As models improved at tool-calling, their "grep + reasoning" workflows got better, but we found a hybrid approach works best.

We index repositories and compute embeddings to enable semantic search. That gives us two major advantages:

  • Accuracy: The agent can find relevant files even when names and words don’t exactly match the human’s phrasing.
  • Performance shift: We pay compute and latency up front during indexing, not at inference time. That means faster, cheaper responses when the agent runs.

We moved from off-the-shelf embedding models to a custom embedding model trained on our data, and we A/B test search performance continuously. Users interacting with semantic search send fewer follow-up questions and consume fewer tokens compared to grep-only workflows.

Takeaway: you likely want both grep and semantic search. Grep is precise for literal matches; semantic search finds the conceptual equivalents. Indexing and embeddings let you trade offline compute for much better online performance.
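
Here is a minimal sketch of the hybrid idea, assuming a precomputed index of chunks with embeddings and a query embedding produced by some model (not shown): literal matches come first, and semantic neighbors fill the remaining slots.

```typescript
// One entry in a precomputed index: a file chunk plus its embedding vector.
interface IndexedChunk {
  path: string;
  text: string;
  embedding: number[];
}

// Cosine similarity between two vectors of the same length.
const cosine = (a: number[], b: number[]) => {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
};

// Hybrid search: exact string matches (the grep-style pass) are kept first,
// then the gaps are filled with the closest semantic matches.
function hybridSearch(
  query: string,
  queryEmbedding: number[], // produced by an embedding model (not shown)
  index: IndexedChunk[],
  k = 10,
): IndexedChunk[] {
  const literal = index.filter(c =>
    c.text.toLowerCase().includes(query.toLowerCase()),
  );
  const semantic = index
    .filter(c => !literal.includes(c))
    .map(c => ({ chunk: c, sim: cosine(queryEmbedding, c.embedding) }))
    .sort((a, b) => b.sim - a.sim)
    .map(x => x.chunk);
  return [...literal, ...semantic].slice(0, k);
}
```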

UX and Interfaces: CLIs, Web, and Beyond 💻

There's been a lot of exploration around the right interface for agents. CLIs are minimal and scriptable; OpenAI's Codex demos, Anthropic's Claude Code, and Cursor's own CLI have all shown the value of a terse harness. I like the terminal because it opens a surface where agents can run anywhere: local machines, the web, phones, or as Slack bots responding to bug reports.

But I don't think CLIs are the final form factor. The terminal is an excellent tool for power users because it's scriptable; you can chain agents into pipelines and integrate them into existing tooling. We use such scripts internally to auto-generate docs and update code. But for many users the ideal experience is embedded: an agent that lives inside your editor or project board, with controls for how much autonomy it has.

Examples of where an agent can live:

  • Editor integration (foreground, fast, interactive)
  • Background agents (cloud sandboxes doing long horizon tasks)
  • Chat surfaces (Slack, issue trackers, or the CLI)
  • Mobile surfaces (lightweight drivers for review/triage)

The right interface depends on the task. Fast, short edits belong in the editor. Long-running investigations fit cloud sandboxes. The challenge is giving developers the right primitives to orchestrate across these surfaces.

Specialized Agents and Bugbot 🐞

General-purpose agents are useful, but specialization adds value. Last year we built an internal tool called Bugbot that focused on reading and reviewing code to find logic bugs, not just writing code.

We dogfooded Bugbot for about six months and were pleasantly surprised: it found issues our human reviewers missed. Based on that performance we released a public beta. In a memorable twist, Bugbot once flagged a bug that later took Bugbot itself down, and we initially ignored its report; that taught us to treat agent comments seriously.

The lesson here is simple: task-specific agents that have a clear objective (find logic bugs, enforce style, triage on-call tickets) can outperform general-purpose agents when they have the right tools and context.

Longer Horizon Tasks: Planning, To-Dos, and Verification 📋

Newer models can sustain and reason over longer tasks, but to get good results you can't just say "plan better." The product needs to support planning as a first-class activity. We make agents do research and plan before touching files, and that planning yields two big wins:

  • Higher-quality code: when an agent plans and gathers context it reduces blind edits and rework.
  • Verifiability: a plan is a place for human reviewers to step in, correct assumptions, and guide direction before code changes.

To-do lists and notes

We give agents a structured to-do list that persists across the agent run. This helps prevent token waste and keeps the agent from forgetting its objectives. The agent's notes become a running idea log the user can inspect at any time. For smaller projects I personally want my to-do list to be sourced from the codebase itself — a single source of truth where tasks and implementation are linked. We're actively exploring how to connect these dots so a to-do is not just ephemeral metadata but a tracked change inside the repository.
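
Here is a minimal sketch of what a repo-tracked to-do might look like. The shape and the `.agent/todos.json` path are hypothetical; the point is that tasks become explicit, inspectable state rather than something buried in the conversation history.

```typescript
import { existsSync, readFileSync, writeFileSync } from "node:fs";

// One structured to-do the agent carries across its run.
interface AgentTodo {
  id: string;
  description: string;
  status: "pending" | "in_progress" | "done";
  notes: string[];        // running idea log the user can inspect
  relatedFiles: string[]; // links the task back to the codebase
}

const TODO_PATH = ".agent/todos.json"; // tracked in the repo, not ephemeral

function loadTodos(): AgentTodo[] {
  return existsSync(TODO_PATH)
    ? (JSON.parse(readFileSync(TODO_PATH, "utf8")) as AgentTodo[])
    : [];
}

function saveTodos(todos: AgentTodo[]): void {
  writeFileSync(TODO_PATH, JSON.stringify(todos, null, 2));
}
```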

Self-checks and testing

An agent should be able to run tests, execute code, and verify behavior. We’re experimenting with giving agents "computer use": running browsers to inspect DOM snapshots, tracking network requests, and confirming that changes passed the intended checkpoints. That loop (edit → run → verify) is critical if we expect agents to be trusted to make substantial changes.
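
A bare-bones version of that loop looks like the sketch below, where `applyEdit` stands in for the agent producing a change and the test command is just an example.

```typescript
import { spawnSync } from "node:child_process";

// One pass of the edit -> run -> verify loop. `applyEdit` stands in for
// whatever produces the code change (an agent call, in practice) and receives
// the previous failure output so it can try to fix it.
function editRunVerify(
  applyEdit: (previousFailure?: string) => void,
  testCommand: string[] = ["npm", "test"],
  maxAttempts = 3,
): boolean {
  let failure: string | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    applyEdit(failure);
    const result = spawnSync(testCommand[0], testCommand.slice(1), {
      encoding: "utf8",
    });
    if (result.status === 0) return true; // checks passed; ready for review
    failure = `${result.stdout}\n${result.stderr}`;
  }
  return false; // give up and hand the failure log back to a human
}
```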

Managing Multiple Agents: Parallelism, Isolation, and Competition ⚖️

When agents get better, it's tempting to run many of them in parallel: one agent to update docs, another to refactor, another to add a feature. But parallelism introduces a set of engineering and UX challenges.

First, it’s easy to become unproductive if you try to juggle too many agents at once. I don’t recommend immediately trying to run nine CLIs in parallel — you'll spend more time coordinating than creating. Running multiple agents also forces you to think about environment setup, port management, and file conflicts.

Local vs cloud agents

There are tradeoffs between running agents in the cloud and locally:

  • Cloud sandboxes: great for long horizon tasks because they give you isolated VMs, deterministic environments, and easy rollback. But they take longer to boot and may require extra configuration for database or secrets access.
  • Local agents: give you instant access to services and local dev databases, but isolation becomes a concern. If multiple agents try to modify the same files you need strategies like Git worktrees to separate their contexts.

People are already building personal scripts and hacks to manage these environments. We’ve been dogfooding native primitives in Cursor to manage worktrees, ports, and environment configuration so that parallel runs feel less like a systems administration project and more like product functionality.
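
For the local case, the core of worktree-based isolation is small. This sketch uses plain `git worktree` commands; the directory and branch naming scheme is just an example.

```typescript
import { execFileSync } from "node:child_process";

// Give each parallel agent its own worktree and branch so concurrent edits
// never touch the same checkout.
function createAgentWorktree(taskSlug: string): string {
  const dir = `../agents/${taskSlug}`;
  const branch = `agent/${taskSlug}`;
  execFileSync("git", ["worktree", "add", dir, "-b", branch], {
    stdio: "inherit",
  });
  return dir; // launch the agent with this directory as its working root
}

// Example: three agents, three isolated checkouts.
for (const task of ["update-docs", "refactor-auth", "add-billing-page"]) {
  createAgentWorktree(task);
}
```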

Competition among agents

Another idea we're exploring is agent competition. Imagine launching the same prompt across several models (high reasoning vs. medium vs. low) or different providers, and then picking the best output. Cursor will soon let users fan any prompt out to N agents and compare the results. This lets us effectively ensemble diverse strategies and benefit from cross-model strengths.
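
In its simplest form, competition is a fan-out plus a scoring function. Everything in this sketch (the runner interface, the scorer) is a placeholder for real model calls, not a Cursor API.

```typescript
// A runner wraps one model or configuration (high reasoning, a different
// provider, etc.) behind a common interface.
type AgentRunner = (prompt: string) => Promise<string>;

// Fan the same prompt out to every runner, score the candidates (for example
// by whether tests pass, or by a reviewer model), and keep the best one.
async function compete(
  prompt: string,
  runners: AgentRunner[],
  score: (output: string) => Promise<number>,
): Promise<{ output: string; score: number }> {
  const outputs = await Promise.all(runners.map(run => run(prompt)));
  const scored = await Promise.all(
    outputs.map(async output => ({ output, score: await score(output) })),
  );
  return scored.reduce((best, candidate) =>
    candidate.score > best.score ? candidate : best,
  );
}
```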

Safety, Trust, and Human-in-the-Loop 🛡️

Trust is a prerequisite for adoption. Engineers need to know that agents won't unintentionally run harmful shell commands or make irreversible changes without consent. We baked several guardrails into Cursor to keep humans in control:

  • When an agent requests to run shell commands, we ask the user: run once or add to allow list?
  • Allow lists and deny lists are stored in code so teams can share safety rules explicitly.
  • Custom hooks let teams integrate external checks at every step of the agent run (for example, run a shell script after the agent finishes).

This design emphasizes the human-in-the-loop model: agents accelerate work, but humans remain the final arbiters of intent.
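
To illustrate the idea of safety rules that live in code, here is a hypothetical allow/deny list with a pre-execution check. This is a sketch of the pattern, not Cursor's actual configuration format.

```typescript
// A code-reviewed policy file for agent shell access, shared by the team.
export const shellPolicy = {
  allow: ["npm test", "npm run lint", "git status", "git diff"],
  deny: ["rm -rf", "git push --force"],
};

// A simple check the harness can run before executing any command; anything
// not explicitly allowed falls back to asking the user.
export function classify(command: string): "allow" | "deny" | "ask" {
  if (shellPolicy.deny.some(d => command.includes(d))) return "deny";
  if (shellPolicy.allow.some(a => command.startsWith(a))) return "allow";
  return "ask";
}
```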

Extensibility: Custom Commands, Rules, and Shared Workflows 🔧

Agents become more useful when they are extensible and shareable across teams. We support:

  • Custom commands: package a prompt and its parameters into a reusable command that anyone on the team can call.
  • Rules: include important context in every agent conversation — for example, commit standards or code formatting rules.

Engineers on my team have been packaging our commit guidelines into a custom slash command called /commit. When you run an agent against a Linear ticket, the agent uses that rule set and the ticket metadata to produce commits that conform to our standards. Power users frequently invent these workflows in user space; once they prove useful, we bring them into the product as built-in features. Plans, memories, and rules are all examples of features that began as user-created patterns and became core functionality.
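
As an illustration of what packaging a command can look like, a /commit definition bundles the prompt, the rules it depends on, and its parameters. The shape below is hypothetical, not Cursor's command format.

```typescript
// A reusable team command: prompt, required context, and parameters packaged
// as shareable code. File paths and fields are illustrative.
export const commitCommand = {
  name: "commit",
  description: "Write a commit message that follows our team guidelines",
  rules: ["rules/commit-guidelines.md"], // context included in every run
  parameters: { ticketId: "string" },    // e.g. the Linear ticket to reference
  prompt: [
    "Summarize the staged changes.",
    "Follow the commit guidelines in the attached rules.",
    "Reference ticket {ticketId} and note any migration or testing steps.",
  ].join("\n"),
};
```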

Product Lessons: What We Learned Shipping Agents 📝

Running agents in the wild taught us lessons about product, infrastructure, and culture. Here are the things I talk about most often when advising teams:

  1. Start with narrow objectives: task-specific agents (linting, bug detection, tests) outperform general ones early.
  2. Retrieve high-quality context: index your repo and use semantic search to save tokens and improve accuracy.
  3. Keep humans in the loop: explicit checkpoints for shell runs and code writes maintain trust.
  4. Package and share workflows: codify your conventions into commands and rules so teams benefit from shared knowledge.
  5. Use both offline and online compute: index and embed offline to make online inference fast and cheap.

What’s Next: The Future of Software Engineering (Michael’s Vision) 🚀

Michael, our CEO, closed the talk with a vision worth repeating. He framed Cursor’s mission as automating the mechanical parts of coding so engineers can be more ambitious, inventive, and fulfilled. His picture of the near future is compelling:

"Imagine waking up in the morning, opening Cursor, and seeing that all of your tedious work has already been handled. On-call issues were fixed and triaged overnight. Boilerplate you never wanted to write was generated, tested, and ready to merge."

Michael emphasized agents that deeply understand the codebase, team style, and product sense — agents that propose ideas at a high level, break down complex projects into reviewable pieces, and "show their work" when they fail so developers never start from scratch.

He argued that this is possible sooner than many expect, and that a future where building software feels less like toil and more like play is within reach. I agree — but getting there requires careful engineering and product design, not just better models.

Practical Advice: How to Start Using Agents Today 🧩

If you're curious and want to adopt agents in your workflow, here’s a practical guide based on our experience with Tab, Composer, Bugbot, and Cursor’s agent harness.

1) Start small and instrument everything

Pick one pain point — pull request descriptions, bug triage, unit test generation — and build a focused agent to handle it. Instrument accept/reject signals and logging so you can measure impact and iterate.

2) Index your codebase and build semantic search

Before you push an agent into production, invest in indexing and embeddings. That upfront compute improves runtime performance and reduces token costs. Combine grep for literal matches with semantic search for conceptual matches.

3) Keep humans in the loop for side effects

Guard anything that can have side effects (running shell commands, writing to the repository). Use permission prompts, allow lists, and code-stored policies so teams explicitly share trust boundaries.

4) Use to-dos and plans as checkpoints

Require agents to produce a plan before executing large changes. That plan becomes the place for human reviewers to provide early course corrections and maintain oversight.

5) Pack workflows into shareable commands

Create custom commands and rules for your team’s style guide and conventions. Package them as slash commands or CLI verbs so everyone benefits from the same encoded knowledge.

6) Evaluate with automated metrics and human review

Measure token usage, suggestion acceptance, CI pass rates, and user sentiment. Combine these quantitative metrics with qualitative reviews from engineers to detect drift or regressions early.

7) Experiment with agent isolation

For parallel agents, use Git worktrees or containerized sandboxes. Consider cloud-based sandboxes for long-running tasks and local runs for fast, interactive tasks.

Real-World Examples and Case Studies 📂

To ground these recommendations, here are a few concrete examples of how agents are used in production at Cursor and in teams we work with.

Automated documentation generation

We use an internal script where an agent reads public functions, tests, and comments, then generates documentation in a docs/ directory and opens a PR. The PR includes a summary, the changed files, and the tests run. This saves hours per week for engineers maintaining user-facing docs.
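
A sketch of that kind of pipeline is below, assuming a hypothetical `agent` CLI whose binary name and arguments are illustrative rather than Cursor's actual interface; the PR-opening step is omitted.

```typescript
import { execFileSync } from "node:child_process";
import { readdirSync } from "node:fs";

// A hypothetical `agent` CLI is assumed here; the binary name and its
// arguments are illustrative only.
const AGENT_BIN = "agent";

// Walk the source tree and ask the agent to draft docs for each module,
// the kind of scripted pipeline the terminal form factor makes easy.
for (const file of readdirSync("src")) {
  if (!file.endsWith(".ts")) continue;
  const prompt = `Read src/${file} and write user-facing docs to docs/${file.replace(".ts", ".md")}.`;
  const output = execFileSync(AGENT_BIN, [prompt], { encoding: "utf8" });
  console.log(`docs updated for ${file}:\n${output}`);
}
```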

Bug triage and on-call assistance

Bugbot and similar agents scan crash logs, run quick hypothesis tests (e.g., "is this path reachable with the current inputs?"), and suggest repro steps. On-call engineers wake up to a prefilled PR or a proposed fix, which they review and merge if it holds up.

Commit message enforcement

A custom /commit command ensures that commit messages follow our guidelines and that related ticket IDs and changelog entries are present. The agent formats the message and lists any testing or migration notes automatically.

Large refactors with verification

For larger refactors, we launch an agent that first produces a plan (say, breaking the work into five tasks), then runs in a sandbox to make changes, runs the full test suite, and produces a diff for human review. The agent includes notes on risky files and a checklist of the tests it ran.

Limitations and Open Problems ⚠️

Agents are powerful, but they are not magic. Here are some limitations I emphasized in the talk that teams should keep in mind:

  • Context fragility: Models still make errors when they lack the right context or when the context is noisy.
  • Tool integration complexity: Getting agents to reliably call external tools, manage credentials, and run in production-like environments is non-trivial.
  • Parallel change conflicts: Multiple agents making changes to the same files can create complex merge conflicts if not coordinated.
  • Model drift: As models change under the hood, agent behavior can shift — continuous evaluation is necessary.
  • Security and secrets: Any system that executes code or shell commands must be carefully sandboxed and audited.

These challenges are surmountable, but they require product thinking: design explicit feedback loops, invest in instrumentation, and treat agents as part of your development infrastructure, not just a toy feature.

Metrics That Matter 📊

When evaluating agents, we look at a mix of developer-centric and system-level metrics. If you’re trying to decide whether to adopt agents or measure ROI, consider these:

  • Accept rate for agent suggestions and edits (the single best proxy for usefulness)
  • Time saved on common tasks (documentation, PR writing, triage)
  • Token usage and cost per completed task
  • CI pass/fail rates for agent-generated code
  • Number of follow-up questions users ask after agent runs (lower is better)

For Tab specifically, acceptance rates and latency were critical. We adjusted the product to show fewer suggestions but with higher confidence so that the accept rate increased even as suggestion volume decreased. That kind of targeted UX decision is why instrumentation matters: you can co-optimize for both speed and relevance.
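
As a sketch of how such metrics can be computed from a simple event log (the event schema here is illustrative, not a real telemetry format):

```typescript
// Minimal event log entries an agent harness might emit.
interface RunEvent {
  runId: string;
  kind: "suggestion_shown" | "suggestion_accepted" | "follow_up_question";
}

// Two of the metrics discussed above: accept rate, and average follow-up
// questions per run (lower is better).
function summarize(events: RunEvent[]) {
  const count = (k: RunEvent["kind"]) =>
    events.filter(e => e.kind === k).length;
  const runs = new Set(events.map(e => e.runId)).size || 1;
  return {
    acceptRate:
      count("suggestion_accepted") / Math.max(count("suggestion_shown"), 1),
    followUpsPerRun: count("follow_up_question") / runs,
  };
}
```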

How We Ship: Iteration, Dogfooding, and Listening 🚢

We ship features incrementally. We dogfood early versions internally, gather feedback, and iterate quickly. Many of our major product ideas started as internal scripts and workflows. Power users are often the first to discover interesting patterns, which then become product features when they generalize.

We also actively solicit user feedback. If you try Cursor’s agent features, you’ll see that we expose feedback controls and telemetry so we can understand what works and what doesn’t. That loop — build → internal dogfood → public beta → iterate — is how we’ve been able to introduce higher-autonomy features while keeping safety and trust intact.

Final Thoughts: From Toil to Play 🎮

When I close conversations about agents, I come back to the same idea: our goal is to make building software less like perpetual busywork and more like exploration and creativity. Agents are a way to offload the tedious, repetitive parts of engineering — triage, boilerplate, rote refactors — so humans can focus on the hard design problems and the joy of building.

But getting to that future responsibly requires integrating models, product design, and safety engineering. It’s not enough to have a smarter model. You need the right context retrieval, an expressive but safe harness, clear human checkpoints, and a culture that packages and shares workflows across teams.

If you're experimenting with agents, remember: start small, index your codebase, guard side effects, and measure impact. Build plans into every major change, and treat agent output as explainable artifacts that humans can inspect and guide. Do that, and you'll find agents that extend your ambition rather than replace your judgement.

If you're curious about Cursor's agent features or want to discuss patterns for integrating agents into your team, come find me — I'd love to hear what you build.

