Model sweep — what 48 calls across DeepSeek thinking modes taught me
Structured sweep across DeepSeek V4 Flash non-think / think-high / think-max modes on 8 agent tasks. Pass-rate stays high; the real cost is reasoning tokens you pay for but never see. Plus two gateway footguns: top-level vs extra_body thinking, and MiniMax x-api-key vs Bearer.
Third post in the FlowCode series. The overview covered the role split; the verification-lock post covered the pattern that stopped agent loops. This one is the data post — what a structured sweep across DeepSeek’s V4 Flash thinking modes revealed about the cost of reasoning tokens you don’t see in the response.
The setup
I picked 8 prompts that represent the real work different FlowCode roles do:
| Prompt | Role | What it tests |
|---|---|---|
| writer-reacts-to-lint | Writer | Can the model react to a lint error and propose a fix without claiming DONE? |
| writer-plans-tests | Writer | Can it decompose “add tests” into specific cases? |
| reviewer-adversarial | Reviewer | Does it try to refute the writer’s claims rather than rubber-stamp? |
| orchestrator-area | Orchestrator | Can it pick the right area lead for a fuzzy brief? |
| end-run-discipline | Writer | Does it actually stop at the end of an iteration? |
| multi-message | Any | Can it handle multi-turn context without re-asking? |
| brain-search-before-retry | Any | Does it consult the knowledge base before guessing again? |
| need-file | Any | Does it ask for a file it doesn’t have instead of hallucinating? |
Each prompt ran twice in each of three thinking modes: non-think, think-high, think-max. Total: 48 calls. Captured: pass/fail, completion tokens, reasoning tokens (the hidden cost), latency, sampled reasons.
The headline cost — reasoning chars per response
Reasoning chars are completion tokens you pay for and never see in the response. The model “thought” them; the API charges them; your code reads zero of them.
Two patterns jump out:
- Think-max rarely buys you correctness past think-high. On 4 of 8 tasks, max produced more reasoning but the same pass-rate.
- Some tasks are bottomless.
need-fileandwriter-plans-testsproduced 9-11k reasoning chars per response. That’s cost you’re paying even when the visible answer is two sentences.
Pass-rate by mode
Non-think failed on 5 calls — orchestrator-area (1/2) and need-file (1/2) being the worst. Both are tasks where the model needs to make a judgment about missing context. Without reasoning, it guesses; sometimes the guess is wrong.
What the sweep actually showed about the gateway
The numbers are useful. The gateway-level findings are more useful, because they’re the kind of footgun that costs money silently:
What works
- Top-level
thinkingfield is what the opencode-go gateway respects. - Non-think mode passes most mechanical tasks (lint reaction, end-run discipline).
- Think-high is the sweet spot — full pass-rate at half the cost of think-max.
- DeepSeek V4 Flash is competitive with the bigger models on these prompts.
What silently bites
extra_body: { thinking: ... }is silently dropped by the gateway. The model still produces reasoning tokens internally — you pay for them but didn’t ask for them.- The model always returns reasoning_content unless explicitly disabled at top-level. Even Think-High burns ~4K reasoning chars per call.
- MiniMax via the same gateway uses
x-api-key(Anthropic-native), NOTAuthorization: Bearer. Bearer returns 401. - MiniMax sometimes returns empty
textand stuffs the entire answer intothinkingblocks. If your code only readscontent[].text, you ship empty.
How I’d assign models to roles after the sweep
The point of FlowCode’s role split is that different roles have different cost/quality tradeoffs. After the sweep:
| Role | Recommended mode | Reasoning |
|---|---|---|
| Writer (reactive) | Non-think for lint fixes, Think-High for planning | Mechanical fixes don’t need reasoning; planning does |
| Auditor | Think-High | Adversarial reading needs reasoning, but think-max doesn’t add value |
| Verifier | Non-think | Verifier just runs commands and parses output. No reasoning needed. |
| Reviewer | Think-High | Adversarial refutation needs reasoning; max is overkill |
| Orchestrator | Think-High | Routing decisions benefit from reasoning, especially area-picking |
For a typical iteration: orchestrator + planner + writer + auditor + verifier + reviewer. Two of those (verifier, sometimes writer) can run non-think. The rest stay think-high. Think-max stays off by default — flip on only when an iteration loops past two failures.
What I’d take to any agent framework
The full sweep results live in MODEL_SWEEP_REPORT.md in the opencode-flow repo. The methodology is reproducible against any OpenAI-compatible gateway with thinking-mode support.
This wraps the FlowCode mini-series. The next post pivots back to ZK: deeper benchmarks on Pedersen + Groth16 in Cadence.