The practical change is not "better prompts." It is better systems. Teams capture expert corrections as structured evidence, turn recurring failures into targeted evals, and use Codex inside bounded tasks to investigate, propose, and validate fixes before human-reviewed shipping.
What Actually Changed
In its May 27 engineering post, OpenAI details a six-month collaboration with Thrive Holdings and Crete (a network of 30+ accounting firms) to develop Tax AI. Their goal was to move a complex tax preparation workflow from mostly manual correction to a measurable self-improvement loop. Instead of treating production failures as isolated incidents, the product records them in a format that can be replayed, grouped, and tested.
The reported results are compelling: Tax AI drafts returns with up to 97% accuracy, saves practitioners about a third of their time, and increases throughput by about 50%. The pilot processed 7,000 returns across participating Crete firms during tax season.
The self-improvement trajectory is the most concrete evidence that the approach works. At launch, only a quarter of returns reached 75% correct field completion; within six weeks, 86% did. That curve does not come from better prompts alone. It comes from building expert corrections, product traces, and targeted evals into the product itself, so that failure data becomes part of the product architecture rather than support noise in a ticket queue.
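To make the threshold metric concrete, here is a minimal sketch of how a "share of returns at or above 75% correct field completion" figure could be computed. It assumes a hypothetical per-return record of how many drafted fields an expert had to correct; `ReturnResult`, `field_completion`, and `share_meeting_threshold` are illustrative names, and the sample data is invented for the example, not taken from the OpenAI post.

```python
from dataclasses import dataclass


@dataclass
class ReturnResult:
    """Hypothetical record: drafted fields vs. fields an expert had to correct."""
    return_id: str
    total_fields: int
    corrected_fields: int


def field_completion(result: ReturnResult) -> float:
    """Fraction of fields the draft got right (no expert correction needed)."""
    if result.total_fields == 0:
        return 0.0
    return (result.total_fields - result.corrected_fields) / result.total_fields


def share_meeting_threshold(results: list[ReturnResult], threshold: float = 0.75) -> float:
    """Share of returns whose field completion is at or above the threshold."""
    if not results:
        return 0.0
    passing = sum(1 for r in results if field_completion(r) >= threshold)
    return passing / len(results)


# Invented sample data, for illustration only.
sample = [
    ReturnResult("r1", 40, 4),   # 36/40 = 0.90 -> meets threshold
    ReturnResult("r2", 40, 10),  # 30/40 = 0.75 -> meets threshold
    ReturnResult("r3", 40, 20),  # 20/40 = 0.50 -> below threshold
    ReturnResult("r4", 40, 30),  # 10/40 = 0.25 -> below threshold
]
print(share_meeting_threshold(sample))  # 0.5
```

Tracking this share week over week, rather than a single average accuracy, is what lets a team say "a quarter at launch, 86% six weeks later."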
The Three-Part Loop
The article describes a three-part loop for improving an AI product after launch, rather than treating the deployed model as static. The pattern is reusable outside tax and finance:
- Practitioners steer what matters: Domain experts correct the system during real work, and those corrections reveal which failures are worth fixing.
- Production creates evidence: The system captures the provenance of a task, from source documents to extracted fields, downstream mappings, and final expert corrections, so teams can see where failures happened.
- Codex works on bounded findings: Reviewed failure patterns become scoped tasks with traces, evals, editable code surfaces, and regression checks that Codex can use to investigate and propose changes.
The important constraint is scope. A correction does not automatically become a code change. It first has to be reviewed, grouped into an actionable finding, and converted into a bounded task with explicit success criteria. Codex operates inside that environment; it does not act as an unbounded autopilot over the whole codebase.
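The path from correction to bounded task can be sketched in a few dataclasses. This is an illustration under assumptions, not the schema Tax AI uses: `Correction`, `BoundedTask`, `to_bounded_tasks`, the `min_occurrences` cutoff, and the idea of grouping by field name are all hypothetical choices made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Correction:
    """One expert correction plus the provenance needed to replay it (hypothetical schema)."""
    trace_id: str
    source_document: str
    field_name: str
    extracted_value: str
    corrected_value: str


@dataclass
class BoundedTask:
    """A scoped unit of agent work: evidence, editable surface, success criteria."""
    finding: str
    trace_ids: list[str]
    editable_paths: list[str]
    success_criteria: list[str] = field(default_factory=list)


def group_by_field(corrections: list[Correction]) -> dict[str, list[Correction]]:
    """Group corrections by the field they touched (a simple stand-in for real clustering)."""
    groups: dict[str, list[Correction]] = defaultdict(list)
    for c in corrections:
        groups[c.field_name].append(c)
    return dict(groups)


def to_bounded_tasks(corrections, reviewed_fields, editable_paths, min_occurrences=2):
    """Only human-reviewed, recurring patterns become tasks; one-offs are left alone."""
    tasks = []
    for field_name, items in group_by_field(corrections).items():
        if field_name not in reviewed_fields or len(items) < min_occurrences:
            continue
        tasks.append(BoundedTask(
            finding=f"Recurring corrections on '{field_name}'",
            trace_ids=[c.trace_id for c in items],
            editable_paths=list(editable_paths),
            success_criteria=[
                f"Replayed traces produce the corrected value for '{field_name}'",
                "Existing regression evals still pass",
            ],
        ))
    return tasks


# Invented example data.
corrections = [
    Correction("t1", "w2.pdf", "wages", "5000", "50000"),
    Correction("t2", "w2.pdf", "wages", "1200", "12000"),
    Correction("t3", "1099.pdf", "interest", "10", "100"),
]
tasks = to_bounded_tasks(
    corrections,
    reviewed_fields={"wages", "interest"},
    editable_paths=["extractors/"],
)
print([t.finding for t in tasks])  # ["Recurring corrections on 'wages'"]
```

The single `interest` correction is reviewed but not recurring, so it produces no task. That filter is the code-level version of "a correction does not automatically become a code change."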
What Not to Overclaim
"Self-improving" is easy to read as "the agent rewrites itself." That is not what this case study shows. It shows a governed improvement loop where production evidence makes the next engineering task legible.
- The accuracy numbers refer to correct field completion thresholds, not a blanket guarantee that every tax judgment is correct.
- The loop depends on practitioners, eval infrastructure, source traces, and human engineering review.
- The strongest fit is high-volume, repeatable work where outputs can be compared, grouped, and regression-tested.
Why This Matters for Developers
Most teams have already discovered that "just adjust the prompt" stops working once real users and edge cases arrive. This case study gives a concrete alternative, especially for extraction, classification, mapping, and workflow automation products:
- Store production evidence in a form your eval pipeline can consume.
- Promote repeated failure patterns into regression gates.
- Separate writable code context from read-only production artifacts.
- Require human review before merge, even when agent suggestions pass evals.
If your AI feature quality currently depends on one senior engineer manually triaging incidents, this architecture is likely the next maturity step: turn the incident stream into a product trace, turn repeated failures into evals, and make agent-assisted fixes pass through the same review gates as any other production change.
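The review gates listed above can be expressed as one check that runs before merge. The sketch below assumes a hypothetical `ProposedChange` record and hypothetical directory names; it is one way to combine a writable-surface check, required regression evals, and human approval, not a description of how the Tax AI pipeline is implemented.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class ProposedChange:
    """Hypothetical summary of an agent-proposed change awaiting merge."""
    changed_paths: list[str]
    eval_results: dict[str, bool]  # eval name -> passed
    human_approved: bool = False


def is_within(path: str, roots: list[str]) -> bool:
    """True if path sits under one of the writable roots (rejects '..' segments)."""
    p = PurePosixPath(path)
    if ".." in p.parts:
        return False
    return any(p.is_relative_to(PurePosixPath(root)) for root in roots)


def merge_blockers(change: ProposedChange, writable_roots: list[str],
                   required_evals: list[str]) -> list[str]:
    """Return every reason the change cannot merge; an empty list means it may."""
    blockers = []
    for path in change.changed_paths:
        if not is_within(path, writable_roots):
            blockers.append(f"path outside writable surface: {path}")
    for name in required_evals:
        # A missing result counts as a failure, so new gates cannot be skipped silently.
        if not change.eval_results.get(name, False):
            blockers.append(f"regression eval not passing: {name}")
    if not change.human_approved:
        blockers.append("missing human review")
    return blockers


# Invented example: evals pass and paths are in scope, but no human has reviewed yet.
change = ProposedChange(
    changed_paths=["src/extractors/w2.py"],
    eval_results={"wages_regression": True},
)
print(merge_blockers(change, ["src/extractors"], ["wages_regression"]))
# ['missing human review']
```

Passing evals alone never clears the gate here: `human_approved` is checked independently, which mirrors the requirement that agent suggestions go through the same review as any other production change.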
Bottom Line
The shift underway is from model-centric development to loop-centric development. Winning teams in 2026 will not be those with the most prompts, but those with the tightest safe cycle between production evidence, evals, scoped agent work, and human review.