Qwen3.7-Max Launches with Top Agent Benchmarks: What Routing Teams Need to Know
Alibaba Cloud's Qwen3.7-Max claims the top spot on Terminal Bench 2.0, outperforms Claude Opus 4.6 on GPQA Diamond, and completes a 35-hour autonomous agent task on novel hardware. Here's the operator angle for teams routing Qwen traffic.

Alibaba Cloud released Qwen3.7-Max and Qwen3.7-Plus at the Alibaba Cloud Summit on May 20, 2026 — just one month after Qwen3.6. Preview versions had quietly appeared on Arena AI and Qwen Chat the day before, continuing the team's pattern of soft-launching ahead of official announcement. The release is notable not just for the benchmark numbers, but for the operational model Alibaba is pushing: positioning Qwen as an agent-era infrastructure play, not a single-metric leaderboard chase.
What the benchmarks say
The numbers that matter for routing decisions:
- GPQA Diamond: 92.4 — above Claude Opus 4.6 (91.3). This is a measure of PhD-level reasoning accuracy, relevant for research, code review, and complex agentic subtasks.
- Terminal Bench 2.0-Terminus: 69.7 — reported as one of the highest public scores, above DeepSeek-v4-pro-Max and Claude Opus 4.6. This benchmark directly measures long-horizon coding-agent capability: real terminal sessions, file operations, multi-step debugging.
- HLE: 41.4 and HMMT 2026 Feb: 97.1 — strong on hard mathematical reasoning.
- SWE-Pro and SWE-Multilingual: leading positions on both. SWE benchmarks measure the ability to resolve real GitHub issues — the metric most directly tied to Cursor, Claude Code, and OpenAI-compatible coding workflows.
- Kernel Bench GPU: 1.98× acceleration on GPU kernel optimization tasks.
- Arena AI blind evaluation: ranked above Kimi-K2.6, DeepSeek-v4-pro, and GLM-5.1 among domestic Chinese models.
One concrete demonstration stood out: Qwen3.7-Max completed a 35-hour continuous autonomous task on the T-Head ZW-M890 PPU — hardware the model had never previously encountered. This is not a toy demo. It represents the kind of long-horizon, hardware-adaptive agent behavior that matters when you are running coding agents on internal or non-standard infrastructure.
Why this matters for AI engineering teams
The one-month cadence is a signal, not just a marketing decision. Qwen3.6 shipped in April. Qwen3.7-Max shipped in May. For teams building routing policies around Qwen, this pace means your model pinning strategy needs to be explicit. If you are sending traffic to qwen-plus or qwen-max by model family rather than a specific version, your quality baseline can shift between runs. If your product depends on consistent SWE or reasoning behavior, version pinning via your routing config is the right answer.
Agent capability is now the differentiation axis. The Terminal Bench and SWE scores are more operationally meaningful than traditional NLP benchmarks. Teams running Cursor, Claude Code, or custom agent pipelines that use Qwen through OpenAI-compatible routing should evaluate whether Qwen3.7-Max changes their default model selection for agentic tasks — especially long-horizon ones where a 69.7 vs. a lower Terminal Bench score compounds over many steps.
Chinese model reliability is a routing design question. Qwen, GLM, DeepSeek, and Kimi are shipping significant model updates on monthly or sub-monthly cycles. The Arena AI result — where Qwen3.7-Max ranks first among domestic models — matters if you are making routing decisions that include Chinese providers as primary or fallback paths. The competitive ordering shifts frequently, and a static routing config that was optimal in April may route sub-optimally in May.
The router/operator angle
Model version routing, not just model family routing. Alibaba's DashScope OpenAI-compatible endpoint (dashscope-intl.aliyuncs.com/compatible-mode/v1) typically surfaces model names like qwen-max, qwen-plus, and qwen-turbo. With monthly model refreshes, the underlying capability behind qwen-max changes without a name change. If you need reproducible agent behavior — especially for SWE-style tasks — route to an explicit versioned model name where Alibaba exposes one, or treat each monthly release as a provider event that triggers a re-evaluation of your routing weights.
Fallback chain design for long-horizon agent tasks. Terminal Bench scores measure 35+ step autonomous sessions. If your routing policy falls back to a weaker model mid-session (e.g., on a timeout or quota event), the fallback model's lower agent capability can cause cascading failures in a way that doesn't happen with single-turn tasks. Consider keeping long-horizon agent calls on a dedicated routing path with a high-quality fallback (Qwen3.7-Max → Qwen3.7-Plus, not → a lightweight model) and reserve aggressive cost optimization for shorter-lived tasks.
SWE-Multilingual as a provider selection signal for polyglot codebases. If your codebase is multilingual (Python + Go + Rust, or Chinese + English documentation), SWE-Multilingual performance matters. Qwen3.7-Max's reported top position on that benchmark suggests it may outperform alternatives on mixed-language file operations and cross-language refactors — worth running your own benchmark against your actual codebase before committing routing budget.
Domestic provider routing considerations. The Arena AI blind evaluation places Qwen3.7-Max above Kimi-K2.6, DeepSeek-v4-pro, and GLM-5.1. For teams that have a multi-domestic-model routing setup, this is a data point for adjusting quality-based routing weights — though self-reported Arena rankings should be validated against your task distribution before changing production routing policy.
What to watch or try
- Re-evaluate your Qwen model slot: If you are currently routing agentic tasks to DeepSeek or GLM as the primary domestic provider, the Terminal Bench and SWE scores make Qwen3.7-Max worth a comparative eval against your actual workload.
- Check your model version pinning: Determine whether your current routing config sends traffic to a versioned Qwen model ID or the floating
qwen-maxalias. If the latter, document that your quality baseline may shift with future Alibaba refreshes. - Test long-horizon agent runs explicitly: The 35-hour autonomous task result is a capability claim — test it against your own agent scaffolding with a 20–50 step benchmark before relying on it for production agentic workloads.
- Watch the cadence: Qwen3.8 or equivalent is likely within 4–6 weeks given the current release pace. If you are building a routing policy now, design it to accommodate model version updates without manual intervention.
Related
Latest AI News →
Qwen-MT Turbo: Alibaba's Dedicated Translation API Introduces extra_body Routing Parameters That Standard Proxies May Drop
Alibaba Cloud's new Qwen-MT turbo model arrives via OpenAI-compatible endpoints, but its translation controls live inside extra_body — a pattern that breaks any middleware that strips non-standard fields. Here's what routing teams need to watch.

DeepSeek Now Speaks Anthropic: What the New Dual-Format API Means for Your Routing Layer
DeepSeek's API now accepts Anthropic SDK format at api.deepseek.com/anthropic — meaning Claude Code, the Anthropic Python/TS SDK, and any Anthropic-native client can now route requests to DeepSeek V4 models without an OpenAI wrapper.

Kimi K2.6: Moonshot's Latest Open-Source Model Sets a New Bar for Long-Horizon Coding Agents
Moonshot AI releases Kimi K2.6 with state-of-the-art long-horizon coding, multimodal input (text, images, video), 256K context, and a fully OpenAI-compatible API — directly affecting how engineering teams route coding-agent workloads.