Model sweep — what 48 calls across DeepSeek thinking modes taught me

48 calls

Structured sweep — 8 agent prompts × 3 thinking modes × 2 runs

The interesting axis isn't pass-rate. It's reasoning-token cost you pay for and never see.

Third post in the FlowCode series. The overview covered the role split; the verification-lock post covered the pattern that stopped agent loops. This one is the data post — what a structured sweep across DeepSeek’s V4 Flash thinking modes revealed about the cost of reasoning tokens you don’t see in the response.

The setup

I picked 8 prompts that represent the real work different FlowCode roles do:

Prompt	Role	What it tests
writer-reacts-to-lint	Writer	Can the model react to a lint error and propose a fix without claiming DONE?
writer-plans-tests	Writer	Can it decompose “add tests” into specific cases?
reviewer-adversarial	Reviewer	Does it try to refute the writer’s claims rather than rubber-stamp?
orchestrator-area	Orchestrator	Can it pick the right area lead for a fuzzy brief?
end-run-discipline	Writer	Does it actually stop at the end of an iteration?
multi-message	Any	Can it handle multi-turn context without re-asking?
brain-search-before-retry	Any	Does it consult the knowledge base before guessing again?
need-file	Any	Does it ask for a file it doesn’t have instead of hallucinating?

Each prompt ran twice in each of three thinking modes: non-think, think-high, think-max. Total: 48 calls. Captured: pass/fail, completion tokens, reasoning tokens (the hidden cost), latency, sampled reasons.

The headline cost — reasoning chars per response

Reasoning chars are completion tokens you pay for and never see in the response. The model “thought” them; the API charges them; your code reads zero of them.

scale: average reasoning_chars across 2 runs per cell — non-think is always 0

writer-reacts-to-lint high → max

797 → 5,704

writer-plans-tests high → max

10,872 → 9,055

reviewer-adversarial high → max

2,561 → 5,653

orchestrator-area high → max

2,472 → 2,295

end-run-discipline high → max

437 → 731

multi-message high → max

5,014 → 6,611

brain-search-before-retry high → max

1,203 → 1,492

need-file high → max

9,297 → 11,393

Two patterns jump out:

Think-max rarely buys you correctness past think-high. On 4 of 8 tasks, max produced more reasoning but the same pass-rate.
Some tasks are bottomless. need-file and writer-plans-tests produced 9-11k reasoning chars per response. That’s cost you’re paying even when the visible answer is two sentences.

Pass-rate by mode

11/16

Non-think pass-rate

Across all 8 prompts × 2 runs

16/16

Think-high pass-rate

Sweet spot

16/16

Think-max pass-rate

Same as high, ~2x cost

Non-think failed on 5 calls — orchestrator-area (1/2) and need-file (1/2) being the worst. Both are tasks where the model needs to make a judgment about missing context. Without reasoning, it guesses; sometimes the guess is wrong.

What the sweep actually showed about the gateway

The numbers are useful. The gateway-level findings are more useful, because they’re the kind of footgun that costs money silently:

What works

What silently bites

What works

Top-level thinking field is what the opencode-go gateway respects.
Non-think mode passes most mechanical tasks (lint reaction, end-run discipline).
Think-high is the sweet spot — full pass-rate at half the cost of think-max.
DeepSeek V4 Flash is competitive with the bigger models on these prompts.

What silently bites

extra_body: { thinking: ... } is silently dropped by the gateway. The model still produces reasoning tokens internally — you pay for them but didn’t ask for them.
The model always returns reasoning_content unless explicitly disabled at top-level. Even Think-High burns ~4K reasoning chars per call.
MiniMax via the same gateway uses x-api-key (Anthropic-native), NOT Authorization: Bearer. Bearer returns 401.
MiniMax sometimes returns empty text and stuffs the entire answer into thinking blocks. If your code only reads content[].text, you ship empty.

How I’d assign models to roles after the sweep

The point of FlowCode’s role split is that different roles have different cost/quality tradeoffs. After the sweep:

Role	Recommended mode	Reasoning
Writer (reactive)	Non-think for lint fixes, Think-High for planning	Mechanical fixes don’t need reasoning; planning does
Auditor	Think-High	Adversarial reading needs reasoning, but think-max doesn’t add value
Verifier	Non-think	Verifier just runs commands and parses output. No reasoning needed.
Reviewer	Think-High	Adversarial refutation needs reasoning; max is overkill
Orchestrator	Think-High	Routing decisions benefit from reasoning, especially area-picking

For a typical iteration: orchestrator + planner + writer + auditor + verifier + reviewer. Two of those (verifier, sometimes writer) can run non-think. The rest stay think-high. Think-max stays off by default — flip on only when an iteration loops past two failures.

What I’d take to any agent framework

The full sweep results live in MODEL_SWEEP_REPORT.md in the opencode-flow repo. The methodology is reproducible against any OpenAI-compatible gateway with thinking-mode support.

This wraps the FlowCode mini-series. The next post pivots back to ZK: deeper benchmarks on Pedersen + Groth16 in Cadence.