Qwen GSPO: Sequence-Level RL Stabilizes Large-Scale Language Model Training

AI engineering teams running in-house reinforcement learning (RL) post-training, including reinforcement learning from human feedback (RLHF), face a fundamental stability challenge: the token-level importance ratios used by algorithms like GRPO introduce high-variance noise that can cause training collapse, especially for Mixture-of-Experts (MoE) models. Qwen's newly introduced Group Sequence Policy Optimization (GSPO) addresses this by moving optimization from the token level to the sequence level, which Qwen reports makes large-scale RL training more stable and simpler to support in infrastructure.

What happened

Qwen researchers published GSPO (Group Sequence Policy Optimization), a new RL algorithm for training language models. Unlike previous approaches such as GRPO, which compute an importance ratio and apply clipping for each token, GSPO defines the importance ratio from the likelihood of the full sequence and performs clipping, rewarding, and optimization at the sequence level.
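To make the sequence-level idea concrete, here is a minimal Python sketch, not Qwen's implementation. It assumes the length-normalized sequence ratio described in the GSPO paper (the geometric mean of the per-token ratios) and group-normalized rewards as advantages. The function names, the plain-list inputs, and the `clip_eps` default are illustrative assumptions; a real trainer would work on tensors with autograd and would tune the clipping range.

```python
import math
from statistics import mean, pstdev


def sequence_ratio(new_logps, old_logps):
    """Length-normalized sequence-level importance ratio for one response.

    new_logps / old_logps: per-token log-probabilities of the same response
    under the current policy and the old (sampling) policy.
    """
    if not new_logps or len(new_logps) != len(old_logps):
        raise ValueError("log-prob lists must be non-empty and equal length")
    avg_log_ratio = sum(n - o for n, o in zip(new_logps, old_logps)) / len(new_logps)
    return math.exp(avg_log_ratio)


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def gspo_objective(new_logps_group, old_logps_group, rewards, clip_eps=0.2):
    """Clipped sequence-level surrogate objective, averaged over the group.

    clip_eps is a placeholder value (assumption), not a recommended setting.
    """
    advantages = group_advantages(rewards)
    terms = []
    for new_lp, old_lp, adv in zip(new_logps_group, old_logps_group, advantages):
        s = sequence_ratio(new_lp, old_lp)
        s_clipped = min(max(s, 1.0 - clip_eps), 1.0 + clip_eps)
        # One clipping decision per response, not one per token.
        terms.append(min(s * adv, s_clipped * adv))
    return mean(terms)
```

The design point to notice is that each response contributes a single ratio and a single clipping decision, so all of its tokens are kept or clipped together.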

The algorithm eliminates the need for Routing Replay, a workaround strategy previously required for GRPO to converge properly on MoE models due to expert activation volatility. Routing Replay caches activated experts from the old policy and replays these routing patterns during optimization, incurring additional memory and communication overhead that limits actual MoE model capacity.

GSPO has been successfully applied to large-scale RL training of the latest Qwen3 models (Instruct, Coder, Thinking), demonstrating continuous performance improvement through increased training compute.

Why it matters for AI engineering teams

For teams building and fine-tuning their own models, GSPO represents an infrastructure-level improvement that affects three critical operational concerns:

Training stability: According to Qwen, GSPO keeps training stable and resolves the stability problems that have affected RL training on MoE models. For teams scaling their RL efforts, this lowers the risk of a model collapse that wastes compute and time.

Infrastructure overhead: By removing the dependency on Routing Replay, GSPO eliminates the need to cache and replay routing patterns. This reduces memory footprint, communication costs, and implementation complexity for teams deploying MoE models with RLHF pipelines.

Precision tolerance: Because GSPO works with sequence-level likelihoods, it is more tolerant of precision discrepancies than token-level approaches. This raises an interesting possibility: using the likelihoods returned by inference engines directly for optimization, rather than recomputing them with the training engine. For training-inference disaggregated frameworks or partial rollout scenarios, this could simplify infrastructure significantly.
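The intuition behind the precision argument can be checked numerically. The sketch below is an illustration under stated assumptions, not a result from the paper: it simulates a small zero-mean numerical discrepancy between the log-probabilities reported by a hypothetical inference engine and those recomputed by a training engine, with both engines holding identical weights. The noise scale, sequence length, and all names are made up for the example. Because a length-normalized sequence ratio is the geometric mean of the per-token ratios, it can never deviate from 1 by more than the worst single token does.

```python
import math
import random


def token_ratios(train_logps, infer_logps):
    """Per-token importance ratios (token-level view)."""
    return [math.exp(t - i) for t, i in zip(train_logps, infer_logps)]


def sequence_ratio(train_logps, infer_logps):
    """Length-normalized sequence ratio (geometric mean of token ratios)."""
    diffs = [t - i for t, i in zip(train_logps, infer_logps)]
    return math.exp(sum(diffs) / len(diffs))


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical log-probs from the inference engine for a 512-token response.
infer_logps = [rng.uniform(-3.0, -0.1) for _ in range(512)]
# The training engine reports the same values plus small zero-mean noise
# (assumption: independent per-token noise with standard deviation 0.01).
train_logps = [lp + rng.gauss(0.0, 0.01) for lp in infer_logps]

# The policies are identical, so every ratio "should" be exactly 1.
tok_dev = max(abs(r - 1.0) for r in token_ratios(train_logps, infer_logps))
seq_dev = abs(sequence_ratio(train_logps, infer_logps) - 1.0)

print(f"worst token-level deviation from 1: {tok_dev:.5f}")
print(f"sequence-level deviation from 1:    {seq_dev:.5f}")
```

Note the limit of this argument: averaging damps independent noise, but a systematic bias shared by every token would pass through to the sequence ratio unchanged.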

The router/operator angle

While TheRouter itself does not perform model training, GSPO's implications affect how AI engineering teams think about provider selection and infrastructure planning:

Cost-aware training decisions: Teams considering in-house RLHF should factor in the infrastructure savings from sequence-level optimization. The elimination of Routing Replay means lower hardware requirements and more efficient use of existing compute. When comparing the cost of fine-tuning versus buying higher-end pre-trained models, these infrastructure savings can tip the balance.

Provider transparency on training methods: As model providers increasingly use RL post-training to improve reasoning and instruction-following capabilities, teams should ask about the training stability and infrastructure practices behind these improvements. GSPO-based training suggests more reliable fine-tuning and more predictable model behavior, which matters for applications requiring consistent output patterns.

MoE architecture viability: More stable MoE RL with GSPO lowers a key barrier to adopting MoE architectures for production RL post-training. Teams that previously avoided MoE because of RL instability can now consider these architectures, which activate only a subset of their parameters per token, for their fine-tuning needs.

What TheRouter users should watch or try

Monitor the adoption of GSPO or similar sequence-level RL algorithms by major model providers. Widespread adoption could signal more fine-tuned models with stable reasoning characteristics, which affects routing decisions when selecting fine-tuned variants for specific use cases.

If your team evaluates in-house RLHF, benchmark GSPO-based implementations against token-level RL to measure stability gains and infrastructure savings. The decision framework for RLHF should now include sequence-level optimization as a key technical dimension alongside compute budget and data quality.
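One simple diagnostic to log during such a benchmark is how often each scheme's ratio leaves its clipping range. The helpers below are a simplified sketch with hypothetical names: they only count ratios outside `[1 - clip_eps, 1 + clip_eps]` and ignore the sign of the advantage, which a full clipped objective also takes into account. The two schemes operate on different scales, so each should be measured with its own `clip_eps`.

```python
import math


def token_clip_fraction(new_logps_group, old_logps_group, clip_eps):
    """Fraction of tokens whose per-token ratio is outside the clip range."""
    out_of_range = 0
    total = 0
    for new_lp, old_lp in zip(new_logps_group, old_logps_group):
        for n, o in zip(new_lp, old_lp):
            ratio = math.exp(n - o)
            if ratio < 1.0 - clip_eps or ratio > 1.0 + clip_eps:
                out_of_range += 1
            total += 1
    return out_of_range / total


def sequence_clip_fraction(new_logps_group, old_logps_group, clip_eps):
    """Fraction of responses whose length-normalized ratio is outside the range."""
    out_of_range = 0
    total = 0
    for new_lp, old_lp in zip(new_logps_group, old_logps_group):
        avg_log_ratio = sum(n - o for n, o in zip(new_lp, old_lp)) / len(new_lp)
        ratio = math.exp(avg_log_ratio)
        if ratio < 1.0 - clip_eps or ratio > 1.0 + clip_eps:
            out_of_range += 1
        total += 1
    return out_of_range / total
```

Tracking both fractions over training steps, alongside reward curves, gives a rough view of how aggressively each method discards off-policy samples.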

For production systems that rely on fine-tuned models, assess whether providers using sequence-level RL exhibit more predictable behavior across training runs. Consistency in fine-tuning outcomes can simplify model qualification processes and reduce the need for frequent re-validation.

Read the GSPO technical paper for algorithmic details and comparative benchmarks against GRPO.
