<aside> 💡
Author: Linfeng Song, Sidi Lu and Zhenwen Liang (Tencent AI Lab)
This is only a progress report.
</aside>
Reinforcement Learning with Verifiable Rewards (RLVR) is an emerging paradigm that bridges the gap between reinforcement learning (RL) and verifiable reasoning. Unlike conventional RL from human feedback (RLHF), which relies on preference signals, RLVR leverages automated verifiers to provide objective reward signals that directly reflect the correctness and completeness of an LLM's outputs. This approach is particularly suited for complex reasoning domains such as mathematical problem solving and code synthesis, where explicit supervision is scarce but correctness can be checked algorithmically.
Currently, the major RLVR approaches treat the reasoning outcome as a binary reward signal: a trajectory with a correct/wrong outcome receives a reward of +1/0 (or +1/-1), and the typically adopted training algorithms are PPO, GRPO, or their variants such as DAPO. Although RLVR has demonstrated large performance gains compared to large-scale fine-tuning (Wang et al., 2024; Yu et al., 2024) or training on tree search (Tian et al., 2024), subsequent studies reveal the entropy collapse phenomenon, where the sampling diversity of an LLM diminishes and the model consistently exhibits high certainty in its outputs regardless of their factual or logical correctness.
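As a concrete illustration, the binary outcome reward can be sketched as follows; `is_correct` is a hypothetical stand-in for an automated verifier (e.g., an exact-match or math-answer checker), not a reference to any specific implementation.

```python
def is_correct(pred: str, ref: str) -> bool:
    # Placeholder verifier: exact string match after normalization.
    # A real verifier would parse and check the final answer symbolically.
    return pred.strip() == ref.strip()


def outcome_reward(trajectory_answer: str, reference_answer: str,
                   scheme: str = "0/1") -> float:
    """Binary outcome reward as used in typical RLVR setups."""
    correct = is_correct(trajectory_answer, reference_answer)
    if scheme == "0/1":
        return 1.0 if correct else 0.0
    # Alternative +1/-1 scheme.
    return 1.0 if correct else -1.0
```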
To validate our hypothesis that the sparse, binary nature of outcome rewards drives this collapse, we analyze the inter-sample similarity among trajectories generated by an LLM when conditioned on the same input question. To approximate the RL training dynamics, we employ algorithm-specific loss signals (e.g., from PPO or GRPO) to compute per-trajectory gradients and measure pairwise trajectory distances using the L1 distance between their gradient representations.
The analysis reveals that the trajectories corresponding to the same question naturally cluster into two distinct groups, regardless of whether the training algorithm is PPO or GRPO. This indicates that RLVR with binary rewards resembles pairwise preference learning, such as online DPO over pairs of correct/incorrect trajectories, and it further suggests that the entropy collapse phenomenon primarily stems from the sparse and coarse-grained nature of binary outcome rewards. Our observation is also consistent with a recent work (Wu et al., 2025) that mainly compares GRPO-2 (GRPO with a response group size of two) against DPO.
Reinforcement Learning with Verifiable Rewards (RLVR) aims to optimize a language model's reasoning process by directly leveraging sparse binary outcome rewards $R_{out} \in \{0,1\}$. Regarding the training algorithm, there are two main categories, PPO and GRPO, which differ mainly in how advantages are calculated. GRPO uses the group mean of the outcome rewards as the baseline for advantage calculation:
$$ A_i=\frac{r_i-\operatorname{mean}(r_1,\ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G) + \epsilon} $$
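A minimal sketch of this group-normalized advantage, assuming `rewards` holds the $G$ outcome rewards sampled for one question:

```python
import numpy as np


def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: subtract the group mean and divide by
    the group standard deviation (plus eps for numerical stability)."""
    baseline = rewards.mean()
    scale = rewards.std() + eps
    return (rewards - baseline) / scale


# Example: a group of G = 4 rollouts with binary outcome rewards.
print(grpo_advantages(np.array([1.0, 0.0, 0.0, 1.0])))
```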
On the other hand, PPO uses a critic model as the value function ($V_\phi$) and applies token-level Generalized Advantage Estimation (GAE) over question $q$ and output sequence $o_i$:
$$ A_{i,t} = \sum_{l=t}^{|o_i|} (\gamma\lambda)^{l-t} \delta_{i,l}, \qquad \delta_{i,l} = r_{i,l} + \gamma V_{\phi}(q, o_{i, \leq l+1}) - V_{\phi}(q, o_{i, \leq l}) $$
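A minimal sketch of this token-level GAE recursion, assuming per-token rewards (typically zero everywhere except the final token under sparse outcome rewards) and critic values with one extra bootstrapped entry; the $\gamma$, $\lambda$ defaults and the concrete arrays below are illustrative only:

```python
import numpy as np


def gae_advantages(rewards: np.ndarray, values: np.ndarray,
                   gamma: float = 1.0, lam: float = 0.95) -> np.ndarray:
    """Token-level Generalized Advantage Estimation, computed backwards.

    `values` has one extra entry so that values[t + 1] is defined at the
    last token (bootstrapped value, 0 for a terminated sequence).
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv


# Example: sparse outcome reward of 1.0 placed on the final token.
r = np.array([0.0, 0.0, 0.0, 1.0])
v = np.array([0.4, 0.5, 0.6, 0.7, 0.0])  # len(values) == len(rewards) + 1
print(gae_advantages(r, v))
```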
There are several possible ways to measure trajectory-to-trajectory similarity. We choose to compare inter-trajectory similarity based on their actual gradients, as this is directly relevant to how the model's behavior changes during training. Given the gradients $g^W_{t_i}$ and $g^W_{t_j}$ from trajectories $t_i$ and $t_j$ for a specific model weight $W$, the corresponding distance is based on the L1 norm:
$$ \text{dist}_W(t_i, t_j, q, \theta, \mathcal{A}) = \| g^W_{t_i} - g^W_{t_j} \|_1 $$
where $q$, $\theta$ and $\mathcal{A}$ represent the input question, the model parameters and the RL objective, respectively. Thus, the overall distance is the macro-level average of the distances over each $W$:
$$ \text{dist}(t_i,t_j,q,\theta,\mathcal{A})=\frac{1}{|\theta_W|}\sum_{W \in \theta_W} \text{dist}_W(t_i,t_j,q,\theta,\mathcal{A}) $$
where $\theta_W$ denotes the set of block-wise model parameters (i.e., the individual weight tensors). The standard implementation incurs substantial memory overhead, as the gradient tensors occupy memory of the same order as the model parameters. Performing a reliable analysis further requires a non-trivial number of sampled trajectories, resulting in an overall memory consumption proportional to $n \cdot |\theta|$, where $n$ denotes the number of trajectories per input question and $|\theta|$ the number of model parameters. To mitigate this issue, we introduce a sparse gradient representation that retains only the top 10% of elements with the largest absolute values in each gradient tensor. Although this approximation introduces additional variance, it makes any observed regularities or conclusions more robust, since they persist even under this reduced-precision representation.
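A minimal PyTorch-style sketch of this procedure; `loss` is assumed to be the PPO/GRPO objective evaluated on a single trajectory, and a real implementation would store the sparsified gradients as index/value pairs rather than zero-filled dense tensors to actually realize the memory savings:

```python
import torch


def sparse_grads(model, loss, keep_ratio: float = 0.1) -> dict:
    """Per-weight gradients for one trajectory, keeping only the
    top-`keep_ratio` fraction of entries by absolute value."""
    model.zero_grad(set_to_none=True)
    loss.backward()
    sparse = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        flat = p.grad.detach().flatten()
        k = max(1, int(keep_ratio * flat.numel()))
        idx = flat.abs().topk(k).indices
        kept = torch.zeros_like(flat)
        kept[idx] = flat[idx]
        sparse[name] = kept
    return sparse


def trajectory_distance(grads_i: dict, grads_j: dict) -> float:
    """Macro-average over weight tensors of the L1 distance between
    two trajectories' sparsified gradients."""
    keys = grads_i.keys() & grads_j.keys()
    dists = [(grads_i[n] - grads_j[n]).abs().sum().item() for n in keys]
    return sum(dists) / len(dists)
```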
Our analysis is based on the Qwen3-4B base model, from which we select checkpoints after a given number of training steps on the DAPO-17K data with the batch size set to 256. We then investigate PPO and GRPO on the same training set, removing the KL loss term and keeping the batch size and learning rate identical across all runs to ensure a rigorous comparison. Specifically, the learning rate and the number of rollouts are set to 1e-6 and 32, respectively.
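For reference, the shared settings described above can be summarized as follows; the field names are illustrative and not tied to any particular training framework:

```python
# Illustrative summary of the shared training configuration described above;
# field names are hypothetical and do not correspond to a specific framework.
TRAIN_CONFIG = {
    "base_model": "Qwen3-4B-Base",
    "dataset": "DAPO-17K",
    "algorithms": ["PPO", "GRPO"],
    "batch_size": 256,
    "learning_rate": 1e-6,
    "rollouts_per_question": 32,
    "kl_loss_coef": 0.0,  # KL loss term removed for all runs
}
```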