SymPy-RLVR: Teaching a Small Model to Do Math with Verifiable Rewards
The setup was simple on paper. I had a Qwen2.5-1.5B model SFT-trained on math problems, scoring 22% on GSM8K. I wanted to push it higher with reinforcement learning. The question was: what’s the reward signal?
The standard answer is a learned reward model — train a separate model to judge outputs. The problem: learned reward models get gamed. They’re soft targets. A model will find the corners of the reward surface that score high without actually solving the problem, which has a name now — specification gaming. I wrote about this in the context of Mermaid diagram generation in an earlier post.
For math, there’s a cleaner path. Math has verifiable answers. You don’t need a judge to tell you if x = 4 is correct when the ground truth is x = 4. You can check it symbolically. This is the core premise of RLVR[1] — replace the learned reward model with a deterministic verifier. The verifier is the environment. The model learns to satisfy it.
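To make "deterministic verifier" concrete, here is a minimal sketch of a symbolic equivalence check built on SymPy. The function name and the simplify-the-difference strategy are illustrative, not the pipeline's actual code:

```python
import sympy

def check_answer(model_output: str, ground_truth: str) -> bool:
    """Return True iff the two expressions are symbolically equal."""
    try:
        # parse both strings, then ask SymPy whether their difference is zero
        diff = sympy.simplify(
            sympy.sympify(model_output) - sympy.sympify(ground_truth)
        )
        return diff == 0
    except (sympy.SympifyError, TypeError):
        # unparsable output is simply wrong
        return False
```

Note that this accepts any algebraically equivalent form — `8/2` matches a ground truth of `4` — which is exactly what a string comparison can't do.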
This post documents what it actually took to build that pipeline end to end, including the parts that broke in interesting ways.
The Ground Truth Problem
The first instinct was to train on GSM8K. It has 8.5k training questions with verified answers. Clean, well-studied, benchmark-aligned.
The problem: contamination risk. Any model trained on or near GSM8K data will have partial exposure to its distribution. The benchmark stops being a clean signal once you start closing the loop.
The alternative was synthesis — generate new math questions, verify the answers programmatically. But synthesized question ground truths from an LLM aren’t reliable. Ask a model to generate a hard combinatorics problem and its stated answer is often wrong. You can’t use the model’s own answer as ground truth.
The fix: separate generation from verification. Use a capable LLM (Grok-4) to generate questions, then use an independent verification pass where Grok-4 calls SymPy functions as tools to derive the answer from scratch — never trusting the generation’s stated answer.
```python
import json

async def resolve(question: str, max_steps: int = 10) -> str:
    chat = client.chat.create(
        model="grok-4-fast-reasoning",
        messages=[system(SYSTEM_PROMPT), user(question)],
        tools=TOOLS,  # sympy_solve, sympy_integrate, sympy_factorize, ...
    )
    for _ in range(max_steps):
        response = await chat.sample()
        if not response.tool_calls:
            # keep only the first line of the final answer
            return response.content.strip().splitlines()[0].strip()
        chat.append(response)
        for tc in response.tool_calls:
            args = json.loads(tc.function.arguments)
            result = TOOL_REGISTRY[tc.function.name](args)
            chat.append(tool_result(result, tool_call_id=tc.id))
    return "unresolved"
```

The key detail: `.strip().splitlines()[0].strip()`. Without that, Grok would return the answer followed by `## Explanation` and a paragraph of prose. The verifier would then fail to parse `42\n\n## Explanation: This is a linear equation...` as a number. That one line cost me an afternoon.
For hard questions, the resolver runs twice. Both runs must agree, otherwise the question is discarded. Double-verification isn’t free — it doubles API cost on hard questions — but the alternative is training on wrong ground truths, which poisons the reward signal.
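The agreement gate is simple enough to sketch as a generic wrapper, with the resolver passed in as a coroutine. The name `verified_answer` and the discard-on-disagreement return value are my choices, not the repo's:

```python
import asyncio

async def verified_answer(question, resolve, n_runs: int = 2):
    """Run the resolver n_runs times; keep the answer only if all runs agree."""
    answers = [await resolve(question) for _ in range(n_runs)]
    if len(set(answers)) == 1 and answers[0] != "unresolved":
        return answers[0]
    return None  # disagreement or failure: discard the question entirely
```

Discarding rather than retrying is the conservative choice: a question the resolver can't pin down twice isn't worth the risk of a wrong label.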
The Verifier Architecture
The SymPy tool suite covers most of what a math curriculum needs: symbolic solving, calculus, combinatorics, number theory, primality, factorization. The tools are called atomically — Grok calls `sympy_solve({"expr": "x**2 - 4", "var": "x"})` and gets back `[-2, 2]`. It doesn’t write SymPy code. It uses SymPy as a calculator, tool by tool, to build toward an answer.
This was a deliberate choice over code generation. A code-generation approach asks the model to write a full Python script, execute it, and parse stdout. That introduces execution sandboxing, import surface, and a much wider failure mode space. Atomic tool calling is narrower and more debuggable.
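A plausible shape for two of those atomic tools — the tool names come from the post, but these implementations are my assumptions about what sits behind them:

```python
import sympy

def sympy_solve(args: dict) -> str:
    """Solve expr == 0 for var; return the sorted root list as a string."""
    expr = sympy.sympify(args["expr"])
    var = sympy.Symbol(args["var"])
    return str(sorted(sympy.solve(expr, var)))

def sympy_factorize(args: dict) -> str:
    """Prime factorization as {prime: exponent}."""
    return str(sympy.factorint(int(args["n"])))

# dispatch table keyed by the tool name in the model's tool call
TOOL_REGISTRY = {
    "sympy_solve": sympy_solve,
    "sympy_factorize": sympy_factorize,
}
```

Each tool takes a JSON-decoded argument dict and returns a string, which keeps the tool-result message format uniform regardless of which tool ran.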
The synthesis pipeline generates questions at four difficulty tiers — easy, medium, hard, olympiad — with random topic injection per question to prevent duplicate patterns:
```python
TOPICS = {
    "easy": ["linear equations", "percentages", "simple interest", ...],
    "medium": ["quadratic equations", "combinations", "sequences", ...],
    "hard": ["integral calculus", "number theory", "probability distributions", ...],
    "olympiad": ["modular arithmetic", "generating functions", ...],
}
```

Temperature is fixed at 1.0 during generation to maximize question diversity. The topic injection handles the rest.
GRPO from Scratch
GRPO[2] is the training algorithm. The core idea: for each question, sample completions from the current policy, score each one, normalize rewards group-relative, and use the resulting advantages to update the policy via a clipped ratio loss.
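A minimal sketch of those two steps in plain Python — the real loop operates on token-level tensors, but the arithmetic is the same:

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Group-relative normalization: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_loss(cur_lp, old_lp, advantages, clip_eps=0.2):
    """PPO-style clipped ratio objective, averaged over completions."""
    total = 0.0
    for cur, old, adv in zip(cur_lp, old_lp, advantages):
        ratio = math.exp(cur - old)
        clipped = min(max(ratio, 1 - clip_eps), 1 + clip_eps)
        total += -min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

Two properties worth noticing: a group where every completion scores the same produces all-zero advantages (no gradient), and on the first step, when `cur_lp == old_lp`, every ratio is exactly 1. Both facts matter later.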
I built this from scratch instead of using TRL. Partly to understand it better, partly because TRL’s GRPO abstractions were adding overhead I didn’t want for a single-GPU LoRA setup.
The first working version had two bugs that took time to find:
Bug 1: Empty optimizer parameter list. The LoRA adapter was loaded with `PeftModel.from_pretrained(...).eval()` — inference mode, all weights frozen. Calling `requires_grad_(True)` on the LoRA parameters after the fact was the fix. The error was clear once it appeared:
```
ValueError: optimizer got an empty parameter list
```

Bug 2: NaN loss on step 1. This one was subtler. The `old_log_probs` were computed from generation scores (which had temperature=0.8 applied), while `cur_log_probs` were computed from a raw forward pass (no temperature). For low-probability tokens, this mismatch produced large differences in log space. exp(large_number) → overflow → NaN loss → corrupted weights → NaN logits on step 2 → CUDA assert.
The fix: compute old_log_probs from a forward pass under torch.no_grad(), same as cur_log_probs. Both use the same temperature-free distribution. The ratio starts near 1.0 on step 1 as it should.
```python
with torch.no_grad():
    old_log_probs = compute_log_probs(model, input_ids, gen_tokens, pad_id)
    ref_log_probs = compute_log_probs(ref_model, input_ids, gen_tokens, pad_id)
```

A related fix: `log_softmax` on bfloat16 logits can underflow on extreme values. Upcasting before the softmax — `F.log_softmax(logits.float(), dim=-1)` — stabilizes training.
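For reference, a plausible shape for `compute_log_probs` under those constraints — upcast before the softmax, mask padding, sum per sequence. This is my reconstruction, not the repo's exact function:

```python
import torch
import torch.nn.functional as F

def compute_log_probs(model, input_ids, gen_tokens, pad_id):
    """Sum of log-probs of the generated tokens under `model`.

    Assumes input_ids = prompt + completion, with the completion
    (gen_tokens) at the end of the sequence.
    """
    logits = model(input_ids).logits.float()          # upcast before softmax
    logp = F.log_softmax(logits[:, :-1, :], dim=-1)   # position t predicts t+1
    n_gen = gen_tokens.shape[1]
    gen_logp = logp[:, -n_gen:, :]                    # slots predicting the completion
    tok_logp = gen_logp.gather(-1, gen_tokens.unsqueeze(-1)).squeeze(-1)
    mask = (gen_tokens != pad_id).float()             # ignore padding tokens
    return (tok_logp * mask).sum(dim=-1)
```

Because both `old_log_probs` and `cur_log_probs` go through this same temperature-free path, the importance ratio is exactly 1.0 on step 1.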
Reward Modeling
The reward function went through three revisions.
V1 — Binary correctness.
```python
return 0.9 * (1.0 if sympy_match else 0.0) + 0.1 * format_score
```

Problem: if all completions score 0 (the model can’t solve the question), advantages collapse to zero, the gradient is zero, and the step is wasted. For a 1.5B model on hard questions, this happens constantly.
V2 — Dense additive. Eight signals: correctness (with proximity decay), self-consistency, reasoning depth, number grounding, format, parsability, length sweet spot, repetition penalty. All additive with fixed weights.
Problem: a wrong answer with perfect format and reasoning could score ~0.45. GRPO had weak signal distinguishing correct from wrong. The model learned to write well-structured wrong answers.
V3 — Correctness-gated.
```python
# secondary signals: format, reasoning depth, grounding, etc.
secondary = 0.25*sc + 0.20*rd + 0.20*ng + 0.15*f + 0.10*p + 0.05*ls + 0.05*rp

# correctness scales both the base reward and the secondary contribution
base = 0.1 + 0.5 * c              # wrong: 0.1 | correct: 0.6
secondary_weight = 0.2 + 0.2 * c  # wrong: 0.2 | correct: 0.4
return base + secondary_weight * secondary
```

Now:
- Correct + good secondary: 0.6–1.0
- Wrong + perfect secondary: 0.1–0.3
- The gap between correct and wrong is always at least 0.3
The secondary signals still differentiate completions that are wrong but show good reasoning from ones that output gibberish. This matters because GRPO needs variance within the group to compute meaningful advantages — even on questions the model can’t solve.
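A quick numeric check of the gate's bounds, mirroring the snippet above (`c` is correctness in {0, 1}, `secondary` in [0, 1]):

```python
def v3_reward(c: float, secondary: float) -> float:
    """Correctness-gated reward from the V3 formula."""
    base = 0.1 + 0.5 * c              # wrong: 0.1 | correct: 0.6
    secondary_weight = 0.2 + 0.2 * c  # wrong: 0.2 | correct: 0.4
    return base + secondary_weight * secondary

# worst correct answer (0.6) still beats the best wrong answer (0.3)
print(v3_reward(1, 0.0), v3_reward(0, 1.0))
```

The gap guarantee falls directly out of the arithmetic: the worst correct completion scores `0.1 + 0.5 = 0.6`, the best wrong one `0.1 + 0.2 = 0.3`.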
Curriculum Learning
Training followed three sequential stages on synthesized data[3], each picking up from the previous run’s LoRA adapter:
```
easy (500 q) → medium (300 q) → hard (100 q)
```

Each stage is a separate GRPO run — the model fully processes easy before touching medium. This is curriculum learning in the strict sense, not an ordered mixed dataset.
The hyperparameters were adjusted per stage:
| Hyperparameter | Easy | Medium | Hard |
|---|---|---|---|
| `alpha` (learning rate) | 5e-6 | 3e-6 | 1e-6 |
| `kl_beta` | 0.02 | 0.05 | 0.10 |
| `max_new_tokens` | 256 | 384 | 512 |
| `temperature` | 1.2 | 1.0 | 1.0 |
KL beta increases with difficulty to keep the policy anchored to the SFT reference as questions get harder. Lower learning rate as training progresses to prevent catastrophic forgetting.
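The schedule in the table is easy to express as data, with each stage resuming from the previous stage's adapter. The `train_stage` driver and the adapter-path plumbing here are my assumptions, not the repo's interface:

```python
# per-stage hyperparameters, matching the table above
STAGES = [
    {"name": "easy",   "n_questions": 500, "lr": 5e-6, "kl_beta": 0.02,
     "max_new_tokens": 256, "temperature": 1.2},
    {"name": "medium", "n_questions": 300, "lr": 3e-6, "kl_beta": 0.05,
     "max_new_tokens": 384, "temperature": 1.0},
    {"name": "hard",   "n_questions": 100, "lr": 1e-6, "kl_beta": 0.10,
     "max_new_tokens": 512, "temperature": 1.0},
]

def run_curriculum(train_stage, adapter_path=None):
    """Run stages strictly in order; each resumes from the previous adapter."""
    for stage in STAGES:
        adapter_path = train_stage(stage, resume_from=adapter_path)
    return adapter_path
```

Keeping the stages as plain data makes it trivial to restart the curriculum from any stage by slicing `STAGES`.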
One thing `advantages_std` revealed: early in training it was stuck at 0.999. Within each group of completions, one completion was scoring high and the rest were near zero — the model had collapsed to near-deterministic behavior. Raising the rollout temperature to 1.2 for the easy stage diversified the completions and brought `advantages_std` into a healthier 0.6–0.9 range.
Results So Far
| Stage | GSM8K (100 samples) |
|---|---|
| SFT baseline | 22% |
| Post easy GRPO | 30% |
| Post medium GRPO | 36% |
+14 points from a 1.5B LoRA model, trained entirely on synthesized questions, zero GSM8K supervision. Hard training is still running.
The results feel honest because the model has never seen GSM8K. The improvement is from learning reasoning structure and arithmetic habits — not from memorizing benchmark patterns. That was the point of synthesizing data rather than using GSM8K directly.
What’s Still Hard
Correctness ceiling on medium. The model gets to ~0.3 correctness on medium questions and plateaus. The format, reasoning depth, and length signals all stay healthy — the model is showing its work, grounding in question numbers, writing in the right XML structure. The arithmetic is where it fails. For a 1.5B model, multi-step word problems with 3–4 dependent computations exceed reliable capacity. DeepSeek and Qwen research suggests 7B+ is where multi-step arithmetic becomes consistently reliable without tool use.
Loss divergence. On medium training with KL beta too low, train loss increased from 0.002 → 0.004 over the run. Rewards didn’t improve. The policy was drifting away from the SFT reference rather than converging. The fix was stronger KL penalty (0.05 → 0.1) and lower learning rate. But it’s a reminder that GRPO loss is not like SFT loss — increasing loss doesn’t necessarily mean things are getting worse, but when rewards aren’t improving alongside it, something is off.
SIGKILL is undefeatable. SIGTERM and SIGINT are handled — the training loop saves the LoRA adapter to disk and logs it to MLflow before exiting. SIGKILL cannot be caught. If a RunPod pod gets preempted with kill -9, the checkpoint is lost. The mitigation is periodic checkpointing, which I haven’t implemented yet.
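The graceful-shutdown path can be sketched as a small installer; `save_adapter` here stands in for the actual checkpoint-and-MLflow-log code:

```python
import signal
import sys

def install_handlers(save_adapter):
    """Save the LoRA adapter on SIGTERM/SIGINT, then exit cleanly."""
    def handler(signum, frame):
        save_adapter()   # flush adapter weights to disk before exiting
        sys.exit(0)
    signal.signal(signal.SIGTERM, handler)
    signal.signal(signal.SIGINT, handler)
    # SIGKILL cannot be trapped by design; only periodic
    # checkpointing mitigates a kill -9.
```

One caveat: Python only runs signal handlers on the main thread, so this has to be installed there, before the training loop starts.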
The Honest Summary
The simplest framing: this is a small model being taught to do math through practice, not memorization. The verifier is the teacher. The curriculum is the syllabus. GRPO is the practice loop.
What made this worth building from scratch rather than wrapping TRL was understanding why each piece exists. The KL penalty isn’t a regularizer in the abstract — it’s the thing that keeps a 1.5B model from forgetting how to format XML after being rewarded for arithmetic. The temperature on rollout isn’t a sampling hyperparameter — it’s what determines whether GRPO has enough signal to learn anything at all. The reward design isn’t a scoring function — it’s the specification of what “getting better at math” actually means.
The verifier is the hardest part to get right. It always is.
Code at github.com/dunkeln/sympy-rlvr.
References
- [1] DeepSeek-AI et al. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning". arXiv, 2025. source
- [2] Shao et al. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". arXiv, 2024. source
- [3] Qwen Team. "Qwen2.5: A Party of Foundation Models". arXiv, 2024. source