Model sweep — what 48 calls across DeepSeek thinking modes taught me

Structured sweep across DeepSeek V4 Flash non-think / think-high / think-max modes on 8 agent tasks. Pass-rate stays high; the real cost is reasoning tokens you pay for but never see. Plus two gateway footguns: top-level vs extra_body thinking, and MiniMax x-api-key vs Bearer.

May 19, 2026 ·6 min read
#LLM#DeepSeek#MiniMax#multi-agent#cost
48 calls
Structured sweep — 8 agent prompts × 3 thinking modes × 2 runs
The interesting axis isn't pass-rate. It's reasoning-token cost you pay for and never see.

Third post in the FlowCode series. The overview covered the role split; the verification-lock post covered the pattern that stopped agent loops. This one is the data post — what a structured sweep across DeepSeek’s V4 Flash thinking modes revealed about the cost of reasoning tokens you don’t see in the response.

The setup

I picked 8 prompts that represent the real work different FlowCode roles do:

PromptRoleWhat it tests
writer-reacts-to-lintWriterCan the model react to a lint error and propose a fix without claiming DONE?
writer-plans-testsWriterCan it decompose “add tests” into specific cases?
reviewer-adversarialReviewerDoes it try to refute the writer’s claims rather than rubber-stamp?
orchestrator-areaOrchestratorCan it pick the right area lead for a fuzzy brief?
end-run-disciplineWriterDoes it actually stop at the end of an iteration?
multi-messageAnyCan it handle multi-turn context without re-asking?
brain-search-before-retryAnyDoes it consult the knowledge base before guessing again?
need-fileAnyDoes it ask for a file it doesn’t have instead of hallucinating?

Each prompt ran twice in each of three thinking modes: non-think, think-high, think-max. Total: 48 calls. Captured: pass/fail, completion tokens, reasoning tokens (the hidden cost), latency, sampled reasons.

The headline cost — reasoning chars per response

Reasoning chars are completion tokens you pay for and never see in the response. The model “thought” them; the API charges them; your code reads zero of them.

scale: average reasoning_chars across 2 runs per cell — non-think is always 0
writer-reacts-to-lint high → max
797 → 5,704
writer-plans-tests high → max
10,872 → 9,055
reviewer-adversarial high → max
2,561 → 5,653
orchestrator-area high → max
2,472 → 2,295
end-run-discipline high → max
437 → 731
multi-message high → max
5,014 → 6,611
brain-search-before-retry high → max
1,203 → 1,492
need-file high → max
9,297 → 11,393

Two patterns jump out:

  1. Think-max rarely buys you correctness past think-high. On 4 of 8 tasks, max produced more reasoning but the same pass-rate.
  2. Some tasks are bottomless. need-file and writer-plans-tests produced 9-11k reasoning chars per response. That’s cost you’re paying even when the visible answer is two sentences.

Pass-rate by mode

11/16
Non-think pass-rate
Across all 8 prompts × 2 runs
16/16
Think-high pass-rate
Sweet spot
16/16
Think-max pass-rate
Same as high, ~2x cost

Non-think failed on 5 calls — orchestrator-area (1/2) and need-file (1/2) being the worst. Both are tasks where the model needs to make a judgment about missing context. Without reasoning, it guesses; sometimes the guess is wrong.

What the sweep actually showed about the gateway

The numbers are useful. The gateway-level findings are more useful, because they’re the kind of footgun that costs money silently:

What works
What silently bites

What works

  • Top-level thinking field is what the opencode-go gateway respects.
  • Non-think mode passes most mechanical tasks (lint reaction, end-run discipline).
  • Think-high is the sweet spot — full pass-rate at half the cost of think-max.
  • DeepSeek V4 Flash is competitive with the bigger models on these prompts.

What silently bites

  • extra_body: { thinking: ... } is silently dropped by the gateway. The model still produces reasoning tokens internally — you pay for them but didn’t ask for them.
  • The model always returns reasoning_content unless explicitly disabled at top-level. Even Think-High burns ~4K reasoning chars per call.
  • MiniMax via the same gateway uses x-api-key (Anthropic-native), NOT Authorization: Bearer. Bearer returns 401.
  • MiniMax sometimes returns empty text and stuffs the entire answer into thinking blocks. If your code only reads content[].text, you ship empty.

How I’d assign models to roles after the sweep

The point of FlowCode’s role split is that different roles have different cost/quality tradeoffs. After the sweep:

RoleRecommended modeReasoning
Writer (reactive)Non-think for lint fixes, Think-High for planningMechanical fixes don’t need reasoning; planning does
AuditorThink-HighAdversarial reading needs reasoning, but think-max doesn’t add value
VerifierNon-thinkVerifier just runs commands and parses output. No reasoning needed.
ReviewerThink-HighAdversarial refutation needs reasoning; max is overkill
OrchestratorThink-HighRouting decisions benefit from reasoning, especially area-picking

For a typical iteration: orchestrator + planner + writer + auditor + verifier + reviewer. Two of those (verifier, sometimes writer) can run non-think. The rest stay think-high. Think-max stays off by default — flip on only when an iteration loops past two failures.

What I’d take to any agent framework

The full sweep results live in MODEL_SWEEP_REPORT.md in the opencode-flow repo. The methodology is reproducible against any OpenAI-compatible gateway with thinking-mode support.


This wraps the FlowCode mini-series. The next post pivots back to ZK: deeper benchmarks on Pedersen + Groth16 in Cadence.

Available for contracts