Greetings post-DevDay! I met a bunch of readers at the event; hope to see more of you at the next one!

Lessons in Hyper Engineering

TL;DR: Most devs are sleeping on Codex. It has a lot more potential than you think.

TDD is becoming the norm for OpenAI hyper engineers

Test-Driven Development is a winning strategy

OpenAI engineers still aren't confident in 1-shot prompting. It can work, but large systems quickly degrade without tests. TDD is the standard for OpenAI Codex devs for two major reasons:

  • Generating tests is a form of prompting. Tests spell out exactly what needs to be fulfilled, so codegen models don't wander outside the expected bounds (see the sketch after this list)

  • The models are now just good enough to write the tests and then make them pass on their own.
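To make that concrete, here's a minimal sketch of tests-as-prompts. Everything in it is hypothetical: the slugger module and its slugify() helper don't exist until Codex writes them. The failing tests ARE the prompt:

```python
# Tests-as-prompts sketch: hand these failing tests to Codex and
# its job is to write the code that makes them pass.
import pytest
from slugger import slugify  # hypothetical module the agent must create

def test_lowercases_and_hyphenates():
    assert slugify("Hello World") == "hello-world"

def test_strips_punctuation():
    assert slugify("Rock & Roll!") == "rock-roll"

def test_rejects_empty_input():
    with pytest.raises(ValueError):
        slugify("")
```

Each assertion doubles as a hard constraint in the prompt, which is why this beats prose instructions: the model can't declare success until pytest agrees.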

Plan engineering > Prompt engineering

Planning files work surprisingly well (screenshot from Aaron Friel’s segment)

I had a long discussion with Aaron Friel, OpenAI's #1 internal user of Codex. He averages more than 150M tokens per day.

A lot of codegen systems like Devin, Cursor, and Claude Code keep plans in-memory/in-context. But what ends up working best is creating a markdown plan file instead and having Codex update it as it works. For whatever reason, this is shockingly effective (sketch below).
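Friel didn't share his exact template, but a plan file in this style might look something like the following; the migration tasks here are invented for illustration. The agent checks items off and appends notes as it goes:

```markdown
# PLAN.md: migrate auth to session tokens

## Tasks
- [x] 1. Add a session_token column to the users table
- [ ] 2. Issue tokens on login (update the /login handler)
- [ ] 3. Validate tokens in the auth middleware
- [ ] 4. Delete the legacy cookie path once tests pass

## Agent notes
- Step 1 done; migration lives in migrations/0042_add_session_token.py
```

One plausible reason this beats an in-context plan: the file is durable state. It survives compaction, restarts, and handoffs between sessions, so the agent can always re-read where it left off.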

Codex has a specialized model for Code Review

Codex code review in action

The most important thing to understand about Codex is that it is not just a model. Codex is a suite of products:

  • CLI (Claude Code competitor)

  • VSCode extension (Cursor competitor)

  • Background agent manager (ChatGPT web app/mobile)

  • Standalone model (GPT-5-Codex)

  • Code review model (unreleased)

The biggest lesson here is that OpenAI uses specialized models for specialized tasks. We don't have much information on it, but Codex Review was trained specifically on solving technical bugs.

Model trained specifically to handle merging changes and technical bugs

Free tokens!

OpenAI is leaning hard into Codex's PR review bot. It's essentially a direct competitor to CodeRabbit and Greptile, but if you have a ChatGPT plan, it's free. Supposedly, 100% of code at OpenAI is reviewed by Codex. (Whether people actually read the comments went unsaid.) At this point, there's no reason not to take the subsidy.
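To try it on your own repo: with the Codex GitHub integration enabled, my understanding is that a plain mention in a PR comment is enough to trigger a review:

```
@codex review
```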

Reviews are the bottleneck

One of the more interesting bits was watching Daniel Edrisian live-debug an issue with Codex. Notably absent from the demo was an IDE. Just Sublime Merge!

As a few of you might have seen, I launched a new OSS product called Bottleneck for speeding up code review. Several hyper engineers have found it useful so far; check it out!

Sleeper Agents

OpenAI didn't talk about it much, but a few engineers did afterwards. Background agents are still one of the biggest productivity unlocks. A lot of devs internally fire off four agents at a time to solve issues (a rough sketch of the pattern below).
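Nobody showed their harness, so treat this as a guess at the shape of the pattern rather than OpenAI's actual setup: one git worktree per task so parallel runs don't stomp on each other's working copy, using the Codex CLI's non-interactive `codex exec` mode to launch each agent. The task list is made up for illustration:

```python
# Rough sketch: launch several background Codex agents in parallel,
# each in its own git worktree so their edits stay isolated.
# Assumes the `codex` CLI is installed and supports `codex exec <prompt>`.
import subprocess
from pathlib import Path

TASKS = {  # hypothetical issues to farm out
    "fix-flaky-test": "Make test_retry_backoff deterministic.",
    "null-crash": "Fix the NoneType crash in parser.py from issue #142.",
}

procs = []
for branch, prompt in TASKS.items():
    worktree = Path("..") / "agents" / branch
    # One worktree + branch per agent; parallel agents in a single
    # checkout would trip over each other's uncommitted changes.
    subprocess.run(
        ["git", "worktree", "add", "-b", branch, str(worktree)],
        check=True,
    )
    procs.append(subprocess.Popen(["codex", "exec", prompt], cwd=worktree))

for p in procs:
    p.wait()  # collect all agents, then review each branch's diff
```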

Currently there's a bug in the ChatGPT mobile app where you can only run one Codex agent at a time. I asked an engineer about this directly:

“I’m not allowed to say anything at this time.”

More to come?

Addendum: Hyper spenders

OpenAI revealed that several startups were extreme power users of the GPT API. For context, GPT-5 costs $10 per 1M output tokens, and a trillion tokens is a million of those 1M blocks, so 1T tokens = $10M in spend.

Some major categories:

  • Customer support: Shopify, Decagon, Zendesk

  • CodeGen: Cognition, Genspark, Warp.dev, JetBrains

  • SRE: DataDog, Outtake AI, CodeRabbit

  • Sales: Rox, Hubspot

  • Wrappers/Routers: OpenRouter, Sider AI, Perplexity

  • Note taking: Abridge, Read AI
