Paper of the Week: Unpacking GRPO and the Evolution of RL in LLMs
Welcome to our new weekly series! Behind the scenes, our engineering and research team has been meeting for weeks to dissect the latest machine learning papers. We realized there is too much value in these discussions to keep them to ourselves, so we have decided to pull back the curtain. Moving forward, we will be publishing overviews of our internal "Paper of the Week" deep dives to help you stay up to speed on the fast-moving world of AI research.
Here is the breakdown of our very first published discussion.
Date: March 30
Participants: Thibault, JD, William, Peter, Thomas
In this week’s paper breakdown, the team dove into the mechanics of Group Relative Policy Optimization (GRPO), contrasting it with the previous standard, OpenAI's Proximal Policy Optimization (PPO). The discussion highlighted how modern RL methods sidestep heavy memory costs while continuing to lean heavily on high-quality data.
Here is a comprehensive overview of the team's observations.
1. The Bottleneck of Traditional PPO
To understand the value of GRPO, Peter walked the team through the evolution of Reinforcement Learning (RL) in language models.
In early policy gradient methods (like REINFORCE), models suffered from high variance. Because an entire output was simply rated "good" or "bad," the training gradient became incredibly noisy, leading to slow and unstable training.
PPO (Proximal Policy Optimization) tackled this by pairing the policy with a Value Model (or Critic).
- The Value Model estimates the expected reward achievable from a given state, providing a learned baseline.
- The model then calculates an "advantage"—how much better or worse a specific action was compared to that expected baseline.
- The Catch: Training a separate Value Model requires almost as much memory and compute as the base model itself, creating a massive bottleneck for scaling (see the sketch below).
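To make the critic's role concrete, here is a minimal, hypothetical sketch of PPO-style advantage estimation, simplified to one scalar reward per sampled output (real PPO computes per-token advantages, typically with GAE). The `ValueHead` class and all numbers are illustrative stand-ins, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Hypothetical critic: maps a state representation to a scalar value estimate."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden_state).squeeze(-1)

# Illustrative values: one hidden state per generated sequence.
hidden_size = 16
critic = ValueHead(hidden_size)
hidden_states = torch.randn(4, hidden_size)   # 4 sampled outputs
rewards = torch.tensor([0.9, 0.2, 0.7, 0.4])  # scores from a reward model

# PPO-style advantage: observed reward minus the critic's learned baseline.
# The critic itself must be trained (e.g. regressing toward observed returns),
# which is where the extra memory and compute overhead comes from.
baseline = critic(hidden_states)
advantage = rewards - baseline.detach()
value_loss = nn.functional.mse_loss(baseline, rewards)
print(advantage, value_loss.item())
```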
2. Enter GRPO: Eliminating the Value Model
Thomas highlighted GRPO’s primary innovation: it completely removes the need for a separate Value Model.
Instead of relying on a Critic to establish a baseline, GRPO leverages the model itself to generate a group of different responses for the same prompt. The algorithm then evaluates this group:
- It calculates the reward for each sampled output.
- It establishes the baseline using the mean of those rewards, $\bar{r} = \frac{1}{G}\sum_{i=1}^{G} r_i$.
- It calculates the advantage using a standard z-score normalization: $A_i = \frac{r_i - \bar{r}}{\sigma}$ (where $r_i$ is the reward of the $i$-th output and $\sigma$ is the standard deviation of the group).
If an output has a high reward but the group's standard deviation is massive (meaning the model is highly uncertain), the gradient step is naturally dampened. This stabilizes training and saves significant memory overhead.
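As a rough illustration of the group-relative advantage described above, here is a minimal sketch. The group size and reward values are made up, and GRPO's full objective (clipped probability ratio, KL penalty) is omitted.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: z-score of each reward within its group.

    rewards: shape (group_size,), one scalar reward per sampled output
             for the same prompt.
    """
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)

# Example: 4 completions sampled for the same prompt.
rewards = torch.tensor([0.9, 0.2, 0.7, 0.4])
print(grpo_advantages(rewards))
# A tight group (small std) amplifies the differences between outputs;
# a wide, uncertain group shrinks every advantage, damping the gradient step.
```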
PPO vs. GRPO at a Glance
| Feature | PPO (Proximal Policy Optimization) | GRPO (Group Relative Policy Optimization) |
|---|---|---|
| Advantage Estimation | Requires a separate Value Model (Critic). | Uses group sampling and z-score normalization. |
| Memory Overhead | High (requires training a second neural network). | Low (eliminates the Value Model entirely). |
| Baseline Metric | Learned expected reward for a specific state. | The mean reward of the sampled group of outputs. |
3. The Unsung Hero: Data and Supervised Fine-Tuning (SFT)
Despite the mathematical elegance of GRPO, the team noted a crucial detail hidden in the paper's benchmarks. While RL optimization is vital, Supervised Fine-Tuning (SFT) on high-quality data remains the biggest driver of performance.
- The researchers built a massive, highly filtered mathematical dataset (filtering Common Crawl for math-heavy content).
- SFT on this dataset provided a massive performance leap (roughly a 20% jump).
- GRPO acted as the "cherry on top," providing an additional ~5.3% optimization on top of the SFT gains.
As the team concluded: GRPO works beautifully, but only if the base model has already been exposed to high-quality reasoning traces during pre-training or SFT. If the model is generating complete nonsense, group sampling won't save it.
4. Broader Implications and Next Steps
The conversation concluded with a look at the broader enterprise implications of these architectures:
- Catastrophic Forgetting: Fine-tuning heavily on one domain (e.g., math) often causes the model to degrade in others (e.g., languages). The KL divergence penalty in GRPO helps mitigate this by anchoring the policy to a frozen reference model (a minimal sketch follows after this list), though architectural setups like Mixture of Experts (MoE) or modular LoRAs are ultimately required for multi-domain enterprise use.
- Looking Ahead to sDPO: Next week, the team will tackle self-correction and sDPO. Initial readings suggest that models can act as their own judges to distill complex "Chain of Thought" reasoning back into their weights—potentially reducing reasoning token usage from 5,000 tokens down to under 1,000 without sacrificing accuracy.
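To illustrate the KL anchoring mentioned in the catastrophic-forgetting point above, here is a minimal sketch of a per-token KL penalty against a frozen reference model, using the common unbiased estimator exp(Δ) − Δ − 1. The log-probabilities and the `beta` weight are hypothetical values, not the paper's settings.

```python
import torch

def kl_penalty(logprobs_policy: torch.Tensor, logprobs_ref: torch.Tensor) -> torch.Tensor:
    """Per-token KL estimate between the current policy and the frozen reference model.

    Uses the unbiased estimator exp(ref - policy) - (ref - policy) - 1, which is
    always non-negative; averaging it over tokens gives the penalty that keeps
    the fine-tuned policy close to the reference weights.
    """
    log_ratio = logprobs_ref - logprobs_policy
    return torch.exp(log_ratio) - log_ratio - 1.0

# Illustrative: token log-probs under the trained policy and the reference model.
logp_policy = torch.tensor([-1.2, -0.8, -2.0])
logp_ref = torch.tensor([-1.0, -0.9, -1.5])
beta = 0.04  # hypothetical penalty weight
loss_kl = beta * kl_penalty(logp_policy, logp_ref).mean()
print(loss_kl.item())
```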