
Codex in Production: Self-Improving Agents

OpenAI's May 27 engineering write-up is one of the clearest signals that agent development is shifting from one-off prompt tuning to continuous, eval-driven production loops.

June 1, 2026


TL;DR

The practical change is not "better prompts." It is better systems. Teams capture expert corrections as structured evidence, turn recurring failures into targeted evals, and use Codex inside bounded tasks to investigate, propose, and validate fixes before human-reviewed shipping.

What Actually Changed

In its May 27 engineering post, OpenAI details a six-month collaboration with Thrive Holdings and Crete (a network of 30+ accounting firms) to develop Tax AI. Their goal was to move a complex tax preparation workflow from mostly manual correction to a measurable self-improvement loop. Instead of treating production failures as isolated incidents, the product records them in a format that can be replayed, grouped, and tested.
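As a rough illustration of what "a format that can be replayed, grouped, and tested" might look like, here is a minimal Python sketch. The `Correction` schema, its field names, the reason codes, and the sample data are all hypothetical and invented for this example; none of them are taken from the OpenAI post.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Correction:
    """One expert correction captured as structured evidence (hypothetical schema)."""
    return_id: str     # which draft return the correction belongs to
    field_name: str    # which field the expert changed
    predicted: str     # what the system drafted
    corrected: str     # what the expert entered instead
    reason_code: str   # reviewer-assigned failure category


def group_by_failure(corrections):
    """Group corrections by (field_name, reason_code) so repeats stand out."""
    groups = defaultdict(list)
    for c in corrections:
        groups[(c.field_name, c.reason_code)].append(c)
    return dict(groups)


def recurring(groups, min_count=2):
    """Keep only groups seen at least min_count times, largest first."""
    kept = [(key, items) for key, items in groups.items() if len(items) >= min_count]
    return sorted(kept, key=lambda pair: len(pair[1]), reverse=True)


# Made-up sample data, for illustration only.
corrections = [
    Correction("r-001", "wages", "52000", "52500", "missed_second_source"),
    Correction("r-002", "wages", "41000", "48000", "missed_second_source"),
    Correction("r-003", "filing_status", "single", "head_of_household", "dependent_not_detected"),
]
top = recurring(group_by_failure(corrections))
```

With this sample data, only the `("wages", "missed_second_source")` group survives the `min_count` filter. The one-off correction stays recorded but is not yet treated as a recurring failure.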

The reported results are compelling: Tax AI drafts returns with up to 97% accuracy, saves practitioners about a third of their time, and increases throughput by about 50%. The pilot processed 7,000 returns across participating Crete firms during tax season.

The self-improvement trajectory is the most concrete evidence that the approach works. At launch, only a quarter of returns reached 75% correct field completion; within six weeks, 86% did. Better prompts alone do not produce that curve. It comes from building expert corrections, product traces, and targeted evals into the product itself, so that failure data becomes part of the product architecture rather than support noise in a ticket queue.
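The article does not spell out how "75% correct field completion" is measured. One plausible reading, sketched below in Python as an assumption, is to compare each drafted return with its expert-reviewed final version field by field, then report the share of returns at or above the threshold. The function names and the dict-based return format are hypothetical.

```python
def field_accuracy(drafted, final):
    """Fraction of fields in the reviewed return that the draft got right."""
    if not final:
        return 0.0
    correct = sum(1 for name, value in final.items() if drafted.get(name) == value)
    return correct / len(final)


def share_reaching(returns, threshold=0.75):
    """Share of (drafted, final) pairs whose field accuracy is at least threshold."""
    if not returns:
        return 0.0
    hits = sum(1 for drafted, final in returns
               if field_accuracy(drafted, final) >= threshold)
    return hits / len(returns)


# Made-up data: four returns with four fields each.
final = {"a": 1, "b": 2, "c": 3, "d": 4}
returns = [
    ({"a": 1, "b": 2, "c": 3, "d": 4}, final),  # 4 of 4 fields correct
    ({"a": 1, "b": 2, "c": 3, "d": 0}, final),  # 3 of 4 fields correct
    ({"a": 1, "b": 0, "c": 0, "d": 0}, final),  # 1 of 4 fields correct
    ({"a": 0, "b": 0, "c": 0, "d": 0}, final),  # 0 of 4 fields correct
]
share = share_reaching(returns)  # two of the four returns meet the 75% bar
```

Tracking this share per week, rather than a single average accuracy, is what would make a curve like the reported move from a quarter to 86% visible.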


The Three-Part Loop

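The pattern described in the article moves beyond static AI models, and its three-part loop is reusable well outside tax or finance: capture expert corrections as structured evidence, turn recurring failures into targeted evals, and use Codex inside bounded tasks to investigate, propose, and validate fixes before human review.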

The important constraint is scope. A correction does not automatically become a code change. It first has to be reviewed, grouped into an actionable finding, and converted into a bounded task with explicit success criteria. Codex is operating inside that environment, not acting as an unbounded autopilot over the whole codebase.
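The scope constraint can be made concrete with a small Python sketch. It assumes a bounded task is described by the finding it addresses, the file patterns the agent may touch, and the evals that must pass before a human looks at the change. `BoundedTask`, the path patterns, and the eval names are hypothetical and do not describe how Codex or Tax AI are actually configured.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class BoundedTask:
    """A reviewed finding turned into a scoped unit of agent work (hypothetical)."""
    finding: str           # the grouped, human-reviewed finding
    allowed_paths: tuple   # glob patterns the agent may modify
    required_evals: tuple  # eval names that must pass (explicit success criteria)


def within_scope(task, changed_files):
    """True only if every changed file matches at least one allowed pattern."""
    return all(
        any(fnmatchcase(path, pattern) for pattern in task.allowed_paths)
        for path in changed_files
    )


def ready_for_human_review(task, changed_files, eval_results):
    """A proposal moves on only if it stayed in scope and every required eval passed."""
    evals_pass = all(eval_results.get(name) is True for name in task.required_evals)
    return within_scope(task, changed_files) and evals_pass


task = BoundedTask(
    finding="wages under-reported when a return has two income documents",
    allowed_paths=("extractors/*",),
    required_evals=("wages_multi_source",),
)
ok = ready_for_human_review(task, ["extractors/wages.py"], {"wages_multi_source": True})
out_of_scope = ready_for_human_review(task, ["billing/invoice.py"], {"wages_multi_source": True})
```

In this sketch, passing the check only makes a change eligible for human review. It does not ship anything by itself.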


What Not to Overclaim

"Self-improving" is easy to read as "the agent rewrites itself." That is not what this case study shows. It shows a governed improvement loop where production evidence makes the next engineering task legible.


Why This Matters for Developers

Most teams have already discovered that "just adjust the prompt" stops working once real users and edge cases arrive. This case study gives a concrete alternative, especially for extraction, classification, mapping, and workflow automation products.

If your AI feature quality currently depends on one senior engineer manually triaging incidents, this architecture is likely the next maturity step: turn the incident stream into a product trace, turn repeated failures into evals, and make agent-assisted fixes pass through the same review gates as any other production change.
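To show how repeated failures can become evals, the Python sketch below turns each recorded correction into a regression case and compares a candidate fix against the current behavior. The case format and the two toy extractors are made up for illustration; a real product would replay stored traces rather than lists of numbers.

```python
def build_eval_cases(corrections):
    """Each correction becomes a case: the original input plus the expert's value."""
    return [{"input": c["input"], "expected": c["corrected"]} for c in corrections]


def pass_rate(extract, cases):
    """Fraction of cases where the extractor reproduces the expert's value."""
    if not cases:
        return 0.0
    passed = sum(1 for case in cases if extract(case["input"]) == case["expected"])
    return passed / len(cases)


def first_amount(amounts):
    """Toy stand-in for current behavior: reads only the first amount."""
    return amounts[0]


def sum_amounts(amounts):
    """Toy stand-in for a proposed fix: adds up every amount."""
    return sum(amounts)


# Made-up corrections for one recurring failure group.
corrections = [
    {"input": [100, 50], "corrected": 150},
    {"input": [200], "corrected": 200},
    {"input": [10, 20, 30], "corrected": 60},
]
cases = build_eval_cases(corrections)
baseline_rate = pass_rate(first_amount, cases)
candidate_rate = pass_rate(sum_amounts, cases)
no_regression = candidate_rate >= baseline_rate
```

A check like `no_regression` is only one gate. The proposed change would still go through the same code review as any other production change.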


Bottom Line

The shift underway is from model-centric development to loop-centric development. Winning teams in 2026 will not be those with the most prompts, but those with the tightest safe cycle between production evidence, evals, scoped agent work, and human review.
