Automatic code reviews with OpenAI Codex
Engineering teams need better ways to catch bugs and speed up verification work, and automatic code reviews are a practical step in that direction. In this post I explain how automatic code reviews with OpenAI Codex work, why they matter, how to set them up, and how teams can use them in practice. I also share how we trained the models to be useful reviewers, how the tool differs from static analysis, and how you can customize behavior using a simple agents file. Finally, I cover workflows that combine cloud reviews and local reviews in the Codex CLI, plus my recommended best practices and caveats.
📰 Quick summary
Codex now includes a code review capability that automatically reviews pull requests in GitHub and can also run locally in the Codex CLI. The model has repository awareness, can execute tests and commands to validate hypotheses, and is trained to focus on catching real bugs with high precision so it produces useful, actionable feedback. Teams can customize how the agent behaves using an agents.md file in their repository or by adding instructions in PR comments. The goal is to scale human verification as AI-generated code proliferates, while keeping developers in control.
🔧 Why automatic code review matters
Software development teams already use code review as a vital safety and quality gate. As AI agents generate more code and help developers write faster, the human verification bottleneck becomes more obvious. Humans are required to read diffs, reason about context, imagine runtime behaviors, and run tests. That work is time consuming and prone to oversight, especially in large repositories where a single change can interact with many components.
I view automatic code review as a force multiplier. When a well-trained model can reliably find issues that matter, it frees engineers to focus on higher-level design decisions and edge cases that require judgement. The capability is especially helpful in catching configuration errors, subtle integration bugs, or cases where the change touches code the author is not deeply familiar with. It works both as a pre-PR safety net and as a reviewer after a PR is opened.
🛠️ How to enable automatic reviews in Codex
Turning automatic code review on is simple. You can enable the feature in your Codex web settings. Once enabled, any new pull requests opened in the repositories that Codex monitors will be automatically reviewed. The agent will post findings as comments on the PR and can respond to additional instructions you leave for it in comments.
You can also start a review earlier. If you do not want human reviewers looking at a work in progress, you can trigger the review when the PR is still a draft. That makes Codex effectively an early checkpoint that advises the author while the change is being shaped. The model will collect context and will produce feedback that the author can act on before marking the PR ready for human attention.
In addition to cloud-based reviews, Codex supports local command-line reviews via the Codex CLI. When you are still working locally and want a quick check before you push or open a PR, you can run a review right from your terminal. This local option can execute tests and scripts on your machine so you can validate behavior earlier in the development workflow.
🔎 What the model looks at and how it differs from static analysis
Static analyzers are powerful tools. They shine at detecting syntactic problems, obvious type errors, unreachable code, and a number of well defined rule violations. They are fast, deterministic, and useful as a first line of defense. However, the Codex code review agent is designed to go beyond static analysis in several important ways.
- Repository awareness. Codex does not only inspect the diff. It can access the whole repository, follow imports, and reason about how a change interacts with broader code. That context is essential for catching issues that are not local to a single file.
- Interactive hypothesis testing. The model can write and run code snippets or tests to check whether a suspected issue actually reproduces. Instead of only outputting a conjecture, it can validate that conjecture by executing commands or by running tests it generated to confirm the behavior.
- Runtime and configuration checks. Codex can find issues that only reveal themselves when code runs or when the environment is configured incorrectly. It can detect, for example, misconfigured training loops, runtime exceptions that static analysis misses, or incorrect assumptions about the environment.
- Steerable output. You can ask Codex to focus on specific areas or to ignore certain categories of issues. That steerability makes it less noisy than a generic static analyzer that always reports everything it finds.
Because of these capabilities, Codex often identifies problems that a pure static checker would miss. For instance, a code change that removes a React prop might look harmless in isolation, but when considered against the broader component tree and CSS expectations, removing that prop can break styling or behavior. Codex can track down those connections.
🎯 How we trained Codex to produce high-signal reviews
Training a model to act as an effective code reviewer requires a different focus than training a model to generate code. For code generation, recall and the ability to produce useful code fragments are important. For code review, precision matters more. False positives are frustrating for engineers and erode trust. Therefore I prioritized making the model conservative and precise about when it flags something as an issue.
We incorporated specific training tasks and datasets that emphasize identifying real bugs developers care about. That meant collecting examples of real-world bugs, labels indicating whether a finding was actionable, and data that taught the model how to form hypotheses and verify them. We also put effort into minimizing incorrect comments. In model evaluations, this generation of Codex showed a much lower rate of incorrect comments than earlier model generations.
But the real test of any tool is how it performs in practice. Rather than relying solely on internal metrics, we deployed the tool across code bases to let engineers use it in real work. That interaction with real repositories surfaces the practical tradeoffs between recall and precision and helps identify where the model needs to be more careful or more helpful.
⚙️ How verification scales with AI capabilities
As AI becomes better at producing code, the need for scalable human verification grows. Human verification is the current bottleneck in many workflows. If AI writes more code than humans can reasonably review, bugs will slip through. My goal is to bring verification capacity closer to the speed at which AI can produce code.
To achieve this, the model was trained to act like a diligent reviewer that can take time to run tests and double check assumptions. The review agent was also designed to be composable so it can call tools, execute commands, and generate targeted tests. In practice this reduces the need for a human to mentally simulate every scenario and run every test themselves.
💡 Examples of Codex catching real issues
In practice, Codex has already prevented several serious kinds of issues. I want to share some concrete examples to show the types of problems the model finds and how it integrates into developer workflows.
- Training run configuration bugs. Codex found misconfigurations in training scripts that would have caused long training runs to fail or behave incorrectly. These bugs are not always visible from the PR diff alone because they depend on runtime settings and how code paths are exercised during training.
- Cross file and component interactions. In one case, the model flagged a change made to a VS Code extension where a prop was removed from a React component. This removal appeared innocuous in the changed file but created a subtle styling bug when combined with CSS elsewhere. The model traced the usage and identified the issue.
- Repository unfamiliarity. Engineers sometimes contribute to code bases they do not fully understand. The model allowed contributors to confidently make changes by surfacing issues that only a deep inspection of the repository would reveal.
- Automation to fix issues. After producing a finding, the model can be instructed to open a new task to fix the issue. That capability turns review findings into actionable work and helps reduce the friction between identifying and resolving bugs.
📚 Customizing behavior with AGENTS.md
One of the most important features I built is the ability to steer the code review agent using a repository-based agents file. When you place an agents.md file in your code base, Codex will look for it and use the instructions inside to influence how it reviews a PR. This offers a flexible, human-readable way to encode team norms and review guidelines.
For example, you can add directions that tell the model to focus on security sensitive sections, to ignore certain stylistic choices, or to prefer a particular coding style. You can also tell the agent which problems you consider noise so it will suppress those findings. The file can include preferences about the tone of the feedback as well, so you can ask the model to be concise, exhaustive, or to use a friendly voice.
Because the model is steerable, you can also pass custom instructions at review time by adding them in a comment. A team lead can leave a short request such as "Review this PR and focus on performance implications" and the agent will prioritize that concern in its analysis. That kind of steerability makes Codex a flexible teammate that respects your team's unique needs.
What to include in agents.md
An agents.md should be concise and easy to maintain. Here are some suggestions about what to include:
- Priority rules. Tell the agent which issues are high priority for this repository. For example, "Be strict about unsafe deserialization" or "Pay special attention to concurrency handling in core modules".
- Noise filters. List warnings or stylistic issues you do not want to be flagged. For example, "Ignore line length warnings in generated files".
- Testing guidance. Indicate test commands or locations the agent should use when it wants to run tests. For example, "Run tests with npm test in the root, and run integration tests with npm run integration".
- Tone and format. Specify how the agent should present feedback. For example, "Use concise bullet points and include a suggested fix when possible".
- Repository layout hints. If your repo uses an unusual layout, add notes so the agent can find important places. For example, "Main server code is in services/server".
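Putting those suggestions together, a minimal agents.md might look like the sketch below. The section headings, commands, and paths are illustrative, not a prescribed schema; the file is free-form instructions that the agent reads as context.

```markdown
# Review guidance for Codex

## Priorities
- Be strict about unsafe deserialization and input validation.
- Pay special attention to concurrency handling in core modules.

## Noise filters
- Ignore line length warnings in generated files.

## Testing
- Run unit tests with `npm test` from the repository root.
- Run integration tests with `npm run integration`.

## Tone and format
- Use concise bullet points and include a suggested fix when possible.

## Repository layout
- Main server code is in `services/server`.
```

Keeping the file short makes it easy to iterate: when the agent flags something you consider noise, add a line to the noise filters and move on.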
The agents.md file is a small but powerful lever that helps teams integrate AI code reviewers into existing processes without changing how people work. It is also a safe place to encode constraints, like telling the agent not to run certain commands or not to push fixes automatically to protected branches.
💬 Integrating Codex into your PR workflows
Codex integrates into normal pull request workflows in two ways. First, automatic reviews run in the cloud and post comments on PRs. Second, local reviews run in the Codex CLI as a pre-commit or pre-push check.
Here is a typical flow that I recommend for teams adopting Codex gradually:
- Enable automatic reviews. Turn the feature on in Codex web settings so PRs are reviewed automatically. Start with one or two repositories to gain familiarity.
- Use early draft reviews. Encourage authors to trigger a review while the PR is still a draft. That helps catch obvious issues before humans spend review time, and it allows the author to iterate quickly.
- Customize agents.md. Add team specific guidance to an agents.md file so the agent aligns with your priorities. Iterate on the file as you learn what it flags and what you want suppressed.
- Run local reviews. Incorporate the Codex CLI review into local developer workflows. Developers can run a review from the terminal before pushing to catch environment specific issues.
- Human verification. Treat Codex findings as highly useful but still require a human to sign off on changes before merging. The agent is a powerful reviewer but not an infallible replacement for human judgement.
- Feedback loop. Use the Codex output to improve tests and CI checks so that recurring issues are captured automatically. When you notice a false positive or a missed class of bugs, update agents.md and your test suite.
This incremental approach helps teams get value quickly while building trust in the model. Over time, as the model's feedback becomes more aligned with your standards, you can rely on it to reduce mundane review overhead and free developers for deeper work.
🔁 From finding issues to fixing them
It is one thing to identify a bug and another to fix it. Codex helps with both. After the agent posts a finding on a PR, you can ask it to take the next step and open a new task to fix the problem. This workflow turns a passive review bot into an active collaborator that can produce code changes and propose patches.
For example, if the agent detects a missing null check that could lead to an exception, you can ask it to create a branch with a fix, run tests, and open a follow-up PR with the change. That reduces the friction between spotting a problem and creating a concrete solution. It also helps engineers, especially those pressed for time, fix issues quickly instead of deferring the work.
That said, I recommend reviewing any automated fix to ensure it matches the intended design. The agent can produce plausible changes, and human oversight catches design-level concerns or unintended consequences that might not be obvious to the model.
🖥️ Codex CLI and local reviews
Code review is most useful when it happens early. That is why I built a local review mode in the Codex CLI. Developers can run a command such as /review to ask the model to evaluate the current uncommitted changes in the working tree. This gives quick feedback before a PR is created and helps maintain quality with less back and forth.
Local reviews can run things on your own machine. That allows the agent to execute tests, evaluate environment dependent behavior, and run scripts that are not accessible in a cloud environment. The combination of local execution and model reasoning is a powerful way to detect bugs that require runtime validation.
Some tips for using local reviews effectively:
- Run reviews often. Make running a local review part of your inner development loop. A 30-second check can prevent hours of rework later.
- Limit potentially destructive commands. Configure the agent to avoid running commands that modify global state unintentionally. Use agents.md to note unsafe commands and to provide safe alternatives.
- Make test commands explicit. If your project requires a special command to run tests, include that in agents.md so the agent knows how to validate behavior locally.
- Use pre-push hooks. Consider integrating Codex CLI reviews into pre-push or pre-commit hooks to guard repository quality.
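As one illustration of the pre-push idea, here is a minimal hook sketch. The `codex review` invocation is an assumption about how a non-interactive review might be triggered, not a documented command; check your installed CLI and substitute the real invocation. Everything else is plain POSIX shell.

```shell
#!/bin/sh
# Sketch of a .git/hooks/pre-push hook. The `codex review` call below
# is an assumed non-interactive entry point; adapt it to whatever your
# version of the Codex CLI actually provides.

run_local_review() {
  # Skip quietly on machines where the CLI is not installed, so the
  # hook never blocks contributors who have not set up Codex yet.
  if ! command -v codex >/dev/null 2>&1; then
    echo "codex CLI not found; skipping local review" >&2
    return 0
  fi
  # Ask the agent to review the current working tree before pushing.
  codex review
}

run_local_review
```

Because the hook exits zero when the CLI is absent, it can be shared across a team without breaking pushes for anyone who has not installed Codex.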
🧭 Steering the review: instructions and context
Codex is designed to be steerable. You can instruct the agent while requesting a review to focus on specific goals. For instance, you can say "Review this PR and focus on security and input validation" or "Please prioritize performance impacts over stylistic suggestions". The model will adapt its attention and reporting to match those instructions.
You can also include repository level context in agents.md to help the agent navigate the code base better. For complex monorepos or unusual layouts, those hints are invaluable. They cut down on irrelevant findings and make the agent more precise.
When writing instructions, be explicit about what you want and what you want to avoid. Clear instructions produce better, more actionable feedback.
⚖️ Precision, recall, and the human in the loop
In code review, there is an important tradeoff between precision and recall. A high recall system reports many potential issues but risks overwhelming developers with noise. A high precision system reports fewer things but those things are more likely to matter. For a code review assistant, precision matters a lot because developers need to trust and act on the findings.
I prioritized high precision in training the review agent. That means the model will often refrain from flagging suspicious but non-actionable items, and it will avoid noisy commentary. In practice we saw a much lower rate of incorrect comments than with previous models. However, that design choice also means the agent might miss some low-probability bugs that a noisier, high-recall tool would catch. That is why I recommend combining Codex with a complementary suite of automated checks and tests.
Human oversight remains essential. The model helps scale human verification but does not fully replace it. Engineers still need to evaluate design, security context, and edge cases where human judgement is required.
🔐 Security, privacy, and safety considerations
Any tool that reads and reasons about source code raises questions about security and privacy. I designed Codex reviews with several considerations in mind to help teams adopt the tool responsibly.
- Repository access control. Only repositories that you grant Codex access to will be reviewed by the agent. Access is controlled through your Codex and GitHub integration settings.
- Constraints in agents.md. You can instruct the agent not to run certain commands or not to submit changes automatically. Encoded constraints reduce the risk of accidental operations.
- Human verification. By default, I recommend treating Codex findings as advisory until a human approves changes. This prevents automated fixes from slipping into protected branches without review.
- Testing dangerous or sensitive code. For repositories handling secret keys, cryptographic material, or other sensitive artifacts, restrict the agent's ability to run environment altering commands and consider local reviews in isolated environments.
- Audit logs. Keep review logs and the sequence of agent actions for auditing. That allows teams to trace what the agent did and why it produced certain findings.
Responsible deployment is about balancing automation benefits with careful controls. With the right settings and conservative defaults, teams can use Codex to accelerate verification without compromising security.
📈 Adoption strategies and best practices
Adopting automatic code review across a team is not a flip of a switch. I recommend a staged approach and a set of best practices to get the most value with minimal disruption.
- Start small. Enable the feature in a single repository and gather feedback. Use that pilot to tune agents.md and to see how the model performs in your code base.
- Educate developers. Teach team members how to interpret Codex findings and how to ask the agent for clarification or for an automated fix. Make sure they know how to adjust agents.md and leave instructions on PRs.
- Set review conventions. Define what constitutes an actionable Codex finding. For example, agree that only findings marked as critical by the team require immediate blocking action.
- Iterate on agents.md. Use real examples to refine your repository specific guidance. Suppress recurring noise and add checks for frequent failure modes.
- Integrate with CI. Where possible, convert recurring problems that Codex finds into CI checks that run deterministically. This reduces the need to have the model re-flag the same issues repeatedly.
- Measure value. Track metrics such as time to fix critical bugs, false positive rates, and developer satisfaction. Use those metrics to guide how much trust you place in the agent over time.
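As a sketch of converting a recurring finding into a deterministic CI check: suppose Codex repeatedly flags calls to eval on user-controlled data in JavaScript files. The rule, file pattern, and directory below are hypothetical examples, not Codex output, but they show how a model finding can become a cheap, reproducible check.

```shell
#!/bin/sh
# Deterministic CI check distilled from a hypothetical recurring Codex
# finding: fail the build if any JavaScript source calls eval(...).
# The pattern and paths are illustrative; substitute your own rule.

check_no_eval() {
  dir="$1"
  # grep -rn exits 0 when a match exists; a match means the check fails.
  if grep -rn --include="*.js" "eval(" "$dir" >/dev/null 2>&1; then
    echo "CI check failed: eval( found under $dir" >&2
    return 1
  fi
  return 0
}
```

A CI step can then run `check_no_eval src`; once the deterministic check exists, the model no longer needs to re-flag that class of issue on every PR.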
🚧 Known limitations and areas to watch
Codex is a powerful reviewer but it is not perfect. Here are some limitations and areas where teams should exercise caution:
- False negatives. Because we focused on high precision, the agent may fail to flag some low-probability bugs. Pair Codex with robust test coverage and additional static analysis tools for defense in depth.
- Complex architectural judgments. Some design level decisions require human understanding of product goals and tradeoffs. The agent can suggest issues but a human should make final calls on architectural changes.
- Context gaps. If repository documentation is missing or the code base has unusual structure, the agent may misunderstand intentions. Use agents.md to provide extra context.
- Reliance on testability. For some runtime behaviors that are hard to test in isolation, the agent may not be able to validate a hypothesis fully. That is particularly true when external systems or hardware dependencies are involved.
- Safety sensitive code. For code that can cause physical harm or has legal impacts, use extra human review gates and limit automated fixes.
Understanding these limitations helps teams build a complementary tooling stack that maximizes safety and efficiency.
📣 Real world impact at OpenAI
At OpenAI we have already seen concrete benefits from using Codex code review internally. The agent caught issues that would have caused delays in training runs and it found configuration bugs that are not visible from the diff alone. It also enabled contributors who are not deep experts in a particular code base to make changes with higher confidence.
One practical example involved a contribution to a VS Code extension. The agent determined that the change introduced a bug that could not have been easily spotted without full repository context. After the agent flagged the issue, a human corrected it and the change proceeded without causing downstream problems. In another situation, the model helped avoid a training run misconfiguration that would have cost many hours of compute.
These examples show how Codex can act as a second pair of eyes that scales verification and reduces the risk of costly mistakes.
🔭 What comes next
Automatic code review is only the beginning of integrating AI into the engineering workflow responsibly. My near term priorities for improving the review experience include:
- Better local integrations. Expanding the capabilities of the Codex CLI to run more thorough local validations and to fit smoothly into developer tooling.
- Richer agents.md features. Making the agents format more expressive while keeping it easy to author and maintain, so teams can encode complex review policies without writing code.
- Improved verification tooling. Enhancing the agent's ability to produce reproducible tests and to report evidence supporting its findings so humans can validate results faster.
- Stronger auditability. Building better logging and traceability so teams can audit what the agent did and why it produced each finding.
- Tighter CI integration. Allowing teams to convert high value checks from the agent into deterministic CI rules when appropriate.
These developments will further reduce the friction of using an AI reviewer and increase trust in its findings.
📝 Sample agents.md snippets
To make the idea of agents.md concrete, below are some example entries you can put in a repository to steer Codex. You would add these as a file named agents.md at the repository root or in a relevant package folder.
Example 1. Prioritize security and avoid noisy style findings

    Priority: security
    Focus areas: input validation, authentication, deserialization
    Ignore: minor linting issues such as single quote vs double quote
    Tone: concise bullet points with suggested fixes

Example 2. Local testing commands and unsafe command restrictions

    Test command: npm run test:unit
    Integration tests: npm run test:integration
    Do not run: scripts/deploy.sh, scripts/reset-database.sh
    Repository layout: server code in packages/server, front end in packages/web

Example 3. Style and format

    Treat generated files in generated/ as read only
    Prefer idiomatic Python for server side changes
    If suggesting a change include a minimal test demonstrating the fix
These examples are short and easy to change. In practice teams evolve agents.md as they learn what kinds of findings are helpful and what noise they want to suppress.
🧩 Frequently asked questions
Will Codex replace human reviewers?
No. I see Codex as a powerful assistant that scales verification, but it does not replace the human judgement needed for architectural, product, and safety decisions. Codex reduces the mundane parts of review and helps surface non-obvious bugs. Humans should remain in the loop to approve changes and make high-level decisions.
How does Codex handle private repositories?
Codex reviews repositories you have authorized. Access is controlled through your Codex and GitHub integration settings. For sensitive code, you can restrict the agent's permissions and disable automated fixes. You can also prefer local reviews in isolated environments where control is entirely within your infrastructure.
What about intellectual property and data privacy?
When you use Codex, the same data handling and privacy practices that apply to other Codex features are in effect. You should evaluate how your organization wants to manage source code access and whether to enable cloud reviews or keep reviews local. Document constraints in agents.md when needed.
Can Codex fix issues automatically?
Yes, Codex can be instructed to propose fixes and open follow-up tasks or PRs. I recommend reviewing any automated fix before merging. You can configure the agent to never push changes to protected branches so automation never bypasses human review.
How do I get started?
Enable code review in your Codex settings, add agents.md to a repository to guide the agent, and try both cloud based and local CLI reviews. Start with a small pilot and iterate on agents.md and review conventions as you learn which findings are most valuable.
✅ Closing thoughts
I built automatic code reviews with a clear objective. As models grow more capable of writing code, verification must scale too. A review agent that can access the full repository, run tests, and validate hypotheses helps reduce the burden on human reviewers and improves safety and quality.
Codex is not a replacement for human expertise. It is a teammate that follows instructions, runs checks, produces evidence, and helps developers move faster while catching problems earlier. When integrated carefully with agents.md, local CLI reviews, and CI, the agent becomes a practical tool that fits into existing workflows.
My advice for teams is to start small, tune the agent to your priorities, make local reviews part of the inner loop, and keep humans in the loop for critical decisions. With those guardrails, automatic code review can accelerate development while reducing risky surprises.
For teams eager to try this in practice, enable the feature, author a simple agents.md to guide behavior, and run a few reviews to see what kinds of issues the model surfaces in your code base. The goal is to make verification more scalable, not to outsource responsibility. I look forward to seeing how teams use Codex to ship better and safer software.



