From SPEC.md to /goal: My Codex + GPT-5.5 Workflow
TL;DR
- Claude is the best model for creating a SPEC.md. The AskUserQuestionTool interview flow forces decisions and produces a complete spec in 10-15 minutes.
- GPT-5.5 (via Codex) then reads, challenges, and tightens that spec before any code runs.
- /goal kicks off long-running execution with a structured prompt body: XML blocks, a reading list, working rules, anti-pattern fences, and a completeness contract.
- The Codex ~/.codex/config.toml matters: model_reasoning_effort = "high" for execution, plan_mode_reasoning_effort = "xhigh" for planning, model_auto_compact_token_limit for sessions that run long.
- The full prompt for a /goal run is 600+ words. Not a one-liner. A contract.
This is the second post in a two-part series. If you have not read the first one, start there: /goal: The Six-Hour Codex Run That Survived a Five-Hour Pause. That post covers what /goal is, what changed in Codex CLI v0.128.0, and what a real six-hour run looks like. This post is about the workflow that feeds into that kind of run: the spec side, the prompt engineering side, the config that makes it not fall apart.
This pipeline took me a while to land on. I kept trying to do everything in one model, in one session, with one prompt. It never held up past 30 minutes of autonomous work. Two models, deliberately sequenced, is what actually works.
Why Two Models
Claude is the better interviewer. GPT-5.5 is the better executor.
That sentence sounds reductive, but it holds up in practice. Claude’s AskUserQuestionTool flow is genuinely the best UX I’ve found for forcing decisions out of my own head. It asks 20 to 40+ multiple-choice questions. You click. Out comes a spec that reflects choices you actually made rather than assumptions the model filled in.
GPT-5.5 with xhigh reasoning is better at long-horizon execution. It holds a complex goal in mind across hundreds of steps without drifting. It catches when a sub-task contradicts an earlier decision. It parallelizes reads and searches naturally, using multi_tool_use.parallel, rather than reading files one at a time. It knows when it’s done and says so clearly, triggering task_complete cleanly rather than trailing off.
So I use both. I still use Claude for spec creation, since the UX is better. Then I use GPT-5.5 to refine and validate. Then /goal to execute.
If you’ve read about the Ralph Wiggum Loop, the impulse is the same: long autonomous runs need a spec that survives contact with the model. A shaky spec means the model fills gaps in the wrong direction. A tight spec means the model does something useful while you sleep.
Step 1: SPEC.md with Claude
Start with the roughest version of your idea. Five bullet points. A paragraph. It does not need to be good.
Then open Claude Code and run the interview:
read this @SPEC.md and interview me in detail using the
AskUserQuestionTool about literally anything: technical
implementation, UI & UX, concerns, tradeoffs, etc. but
make sure the questions are not obvious
be very in-depth and continue interviewing me continually
until it's complete, then write the spec to the file
Claude will ask 20, 30, sometimes 40+ questions. You click through them. When it’s done, the SPEC.md contains every decision you made, documented, with the rationale baked in.
I wrote about this approach in detail in The Interview Method. The short version: clicking through structured questions is faster than writing a spec from scratch, and the questions surface edge cases you would have hit mid-implementation anyway.
The output of this step is a SPEC.md you actually trust. That trust is the foundation for everything that follows.
Step 2: Refine with GPT-5.5
The SPEC.md is good. Not good enough.
Claude interviews well, but it tends to accept your answers at face value. It builds what you described. GPT-5.5, given the spec and a sharp prompt, will read it more skeptically. It challenges. It finds the places where the spec is technically possible but architecturally painful. It asks what happens in the failure cases you didn’t enumerate.
My refinement prompt is direct:
Read SPEC.md. Identify every ambiguous requirement.
For each one: explain the ambiguity, give me two distinct
interpretations, and recommend the one that will produce
a more maintainable result. Then tighten the spec in place.
Do not change decisions that are already clear.
Do not add scope. Remove anything that can't be verified.
After this pass, I have a spec that GPT-5.5 itself has read critically and approved. When the same model then runs /goal, it’s not encountering the spec cold. It already stress-tested it.
This step takes 5 minutes. It has saved me multiple 2-hour debugging sessions where the model implemented exactly what the spec said, and what the spec said turned out to be wrong.
Step 3: Kick Off with /goal
/goal is a Codex feature for long-running autonomous execution. You give it a structured prompt body and it runs, making decisions, checking its own work, compacting context when needed, and stopping when the completeness contract is satisfied.
The /goal prompt is not short. Mine run 600+ words. Here is the structure I use:
/goal
<goal>
[What the task produces. Specific. Measurable. No vague verbs.]
</goal>
<context>
[Files to read before starting. 10+ entries. The model loads these
before doing anything else.]
</context>
<constraints>
[Architecture rules. Things the model must not do. Anti-pattern fences.]
</constraints>
<done_when>
[Explicit, verifiable criteria. Test suites green. Specific file
outputs present. No specific failure modes observed. The model
uses this block to decide when to trigger task_complete.]
</done_when>
<workflow>
[Step-by-step execution order. Which things run in parallel.
Where verification gates live.]
</workflow>
<verification_loop>
[Self-check protocol. What the model runs after each major step
to confirm it hasn't broken anything.]
</verification_loop>
<execution_rules>
[Working rules. Verbatim, concrete. Examples below.]
</execution_rules>
<output_contract>
[What the model must produce as deliverables. Formats. Locations.
Completion signal.]
</output_contract>
The XML block structure is not cargo culting. Each block answers a question the model would otherwise have to guess at. <done_when> is the most important one. Without an explicit termination contract, models tend to either stop too early or keep polishing indefinitely.
I started using these XML tags manually back in April before /goal shipped as a feature. The structure works with or without the runtime feature. With /goal, it also activates the persistence cue: once direction is given, the model gathers context, plans, implements, tests, and refines without waiting for prompts at each step.
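To make the skeleton concrete, here is a filled-in example for an entirely hypothetical task (every file name and criterion below is invented for illustration; the remaining blocks follow the same pattern):

```
/goal
<goal>
Add retry-with-backoff handling for HTTP 429 responses to the API
client in src/client.ts. All existing tests stay green; new tests
cover the retry path.
</goal>
<context>
Read first, in parallel: SPEC.md, src/client.ts, src/config.ts,
tests/client.test.ts, docs/rate-limits.md
</context>
<constraints>
Do not add string-matching patches to pass one test.
No long-lived feature flags. No new runtime dependencies.
</constraints>
<done_when>
npm test exits 0. New retry tests are present and passing.
No 429 error surfaces unhandled in a manual smoke run.
</done_when>
```

Notice how nothing in <done_when> requires judgment: each criterion is a command exit code or an observable artifact.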
Prompt Patterns for GPT-5.5
These are the patterns I’ve found worth baking into every /goal run. They come directly from how GPT-5.5 behaves at high and xhigh reasoning effort, and from the OpenAI Cookbook: GPT-5 Codex Prompting Guide.
- Pre-load the reading list. List 10+ files explicitly in <context>. The model reads them before doing anything. Saves it from making assumptions about the codebase structure in the early steps when it should be learning.
- Write the working rules verbatim. Put them in <execution_rules>. Concrete, imperative sentences. "Check git status before edits." "Preserve unrelated user changes." "Prefer rg over grep." "Use apply_patch for manual edits." "Do not create long-lived feature flags." "Delete legacy paths once replacement tests pass." The model follows these consistently when they are explicit.
- Fence off the anti-patterns. Put this in <constraints>: "Do not add string-matching patches to pass one test. If a test exposes a bad loop, fix the underlying evaluator, not the symptom." This one line has prevented at least three hacks that would have passed CI and broken production.
- Bake in architecture rules. Examples: "Deterministic logic only at protocol/runtime boundaries. Semantic interpretation must not be regex-driven." These keep the model from taking shortcuts that satisfy the test suite but violate the design. If the rule isn't in the prompt, the model optimizes for green tests.
- Write measurable <done_when> criteria. "All test suites green. Manual review of three representative outputs confirms no hallucinated citations. No 429 errors in the final 100 requests." Vague criteria produce vague stopping behavior. The model uses this block to evaluate whether it's actually done.
- Use multi_tool_use.parallel for batched reads. Tell the model explicitly: "Read all context files in parallel before starting." GPT-5.5 does this naturally at xhigh, but stating it in the prompt locks it in even for sub-tasks mid-run.
- Separate final-answer length from reasoning quality. Add a line like "Reasoning may be thorough; final output should be concise. Code blocks only, no commentary unless explicitly requested." At xhigh effort, the model can produce very long reasoning chains. You want the quality of that reasoning applied to the task, not surfaced in the output.
The Skills Repo
I maintain a public repository of agent skills at github.com/yz5e/agent-skills. Several of them are directly relevant to this workflow.
gpt-5-5-prompt-writer codifies the XML-block structuring, reasoning and verbosity tuning, and verification cycles described above. When you want a polished /goal prompt without writing it from scratch, this skill produces a draft that already has the structure in place. Worth using when the task is complex enough that you’re not sure you’ve covered all the blocks.
forge is an interview-driven prompt creation skill for either gpt-5.5 or Claude Opus 4.7. It asks you questions about the task, just like the AskUserQuestionTool flow, and produces a complete structured prompt. Good when you know what you want to build but aren’t sure how to frame it as a /goal prompt.
code-simplifier and opus-4-7-prompt-writer are complementary. The first reviews code produced by a long run and flags anything that looks overcomplicated. The second is the Claude-side equivalent of gpt-5-5-prompt-writer.
Skills can be symlinked into ~/.codex/skills for Codex CLI use. The repo has install instructions.
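The symlinking step can be scripted. A minimal sketch, assuming a one-directory-per-skill layout in the cloned repo; SKILLS_REPO is a placeholder for wherever you cloned agent-skills:

```shell
# Placeholder paths -- substitute your actual clone location.
SKILLS_REPO="${SKILLS_REPO:-$HOME/src/agent-skills}"
CODEX_SKILLS="${CODEX_SKILLS:-$HOME/.codex/skills}"

mkdir -p "$CODEX_SKILLS"
# One symlink per skill directory, named after the skill.
for skill in "$SKILLS_REPO"/*/; do
  [ -d "$skill" ] || continue   # glob matched nothing; skip
  ln -sfn "${skill%/}" "$CODEX_SKILLS/$(basename "$skill")"
done
```

Symlinks (rather than copies) mean a `git pull` in the repo updates every installed skill at once.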
The Config
The ~/.codex/config.toml file does real work here. This is the config I use for autonomous /goal sessions:
model = "gpt-5.5"
model_context_window = 1050000
model_auto_compact_token_limit = 997500
model_reasoning_effort = "high"
plan_mode_reasoning_effort = "xhigh"
approval_policy = "never"
sandbox_mode = "danger-full-access"
[features]
goals = true
A few notes on each setting.
model_reasoning_effort = "high" for execution tasks. plan_mode_reasoning_effort = "xhigh" for the planning pass. These map to the reasoning effort tiers: low, medium, high, xhigh. Default is medium. For anything you’d let run for multiple hours, medium is not enough. The model will take cheaper paths on hard sub-problems.
model_auto_compact_token_limit = 997500 is what allows a /goal session to run past what would otherwise be a hard context limit; 997,500 is 95% of the 1,050,000-token model_context_window, so compaction kicks in just before the window fills. The model compacts its context automatically when it approaches the limit, preserving the essential state and continuing. Without this, a six-hour run would hard-stop somewhere in hour two.
approval_policy = "never" and sandbox_mode = "danger-full-access" should only be used in directories you have explicitly marked as trusted in your config. These settings give the model full write access to the filesystem. Do not use them in a directory you would not let an unsupervised process touch. I apply them only to specific project paths.
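That per-path trust scoping lives in the same config.toml. A sketch, assuming Codex's [projects] table syntax with a trust_level key (the path is a made-up example; check your CLI version's docs for the exact key names):

```toml
# Mark only specific project paths as trusted for full-access runs.
[projects."/home/me/code/my-app"]
trust_level = "trusted"
```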
The Mindset Shift
Old: Write a prompt, get an answer, iterate manually. New: Write a contract, hand it to the model, review the deliverable.
The shift is from prompt-to-response to spec-to-execution. Once you stop thinking of /goal as a better chat and start thinking of it as a way to delegate a defined scope of work, the workflow clicks into place. The spec is not prep work before the real work. The spec is the work. The model does the implementation.
Conclusion
Two models. One pipeline. Claude interviews you and produces a spec. GPT-5.5 reads that spec critically and tightens it. Then /goal runs with a structured prompt that bakes in the working rules, the architecture constraints, and the termination criteria.
The config is not optional. xhigh reasoning in plan mode, high in execution, model_auto_compact_token_limit set high enough to survive a long run. The prompt is not short. 600+ words, structured XML blocks, explicit anti-pattern fences. The spec is not a rough idea. It is a document that survived a GPT-5.5 challenge pass.
This is a workflow, not a hack. It works because each piece handles the thing it does best. Neither model is doing everything.
The first time a six-hour run completes cleanly and you open the output, it feels like something shifted. It did.
Read the companion post if you have not already: /goal: The Six-Hour Codex Run That Survived a Five-Hour Pause.