Train a tiny model on Grade School Math

Starmind Pico

Dumb model trained on simple math = smarter model?


Background

In 2023, the TinyGSM paper from Microsoft showed that training a Small Language Model on Grade School Math helps it outperform much larger models on the GSM8K math benchmark.

Starmind Pico

This was achieved by generating synthetic data with a Large Language Model like GPT-3.5 and fine-tuning small models on that data. The fine-tuned models solve math problems by writing code.
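To make this concrete, TinyGSM-style training examples pair a word problem with a short Python solution, roughly in the shape below. This is a made-up illustrative example, not one taken from the dataset:

```python
def simple_math_problem() -> int:
    """
    A bakery sells 24 muffins per tray. It bakes 7 trays and
    sells all but 13 muffins. How many muffins were sold?
    """
    muffins_baked = 24 * 7              # 168 muffins in total
    muffins_left_over = 13
    muffins_sold = muffins_baked - muffins_left_over
    return muffins_sold                 # 155
```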

Prompts


My approach: quality & diversity

Since 2023, there have been a few changes: there are wayyyy more Large Language Models & they are orders of magnitude better than GPT-3.5.

After reading a BabyAGI paper about how Teacher-model diversity improves the fine-tuned Student model, I decided to replicate the TinyGSM experiment with multiple high-quality Teacher models.


Step 0: How do we pay for all this?

Coins

The original TinyGSM dataset contains 1.8B tokens (12 million question-answer pairs) and cost $3600 to generate (roughly $2 per million tokens). Adding in the compute for fine-tuning, it gets real expensive real fast.

Unlike Microsoft, I’m a startup founder. Therefore, I will try to do this for free!

Here is my strategy to pay for all this:


Step 1: Synthetic data generation

The data generation process is pretty straightforward. We first gather a list of Large Language Models (LLMs), then ask each one a math question, get the answer, check its correctness, and store the result in a database.

Synthetic data generation illustration
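A minimal sketch of that loop is below, assuming an OpenAI-compatible chat endpoint and a local SQLite database. The model names, prompt, and answer check are simplified placeholders, not the exact pipeline used here:

```python
import re
import sqlite3
from openai import OpenAI  # any OpenAI-compatible endpoint works

TEACHERS = ["gpt-4.1-mini", "o4-mini"]          # placeholder teacher models
PROMPT = ("Solve the problem by writing a Python function that returns "
          "the final numeric answer.\n\nProblem: {question}")

client = OpenAI()                                # reads OPENAI_API_KEY
db = sqlite3.connect("tinygsm.db")
db.execute("""CREATE TABLE IF NOT EXISTS samples
              (teacher TEXT, question TEXT, solution TEXT, correct INTEGER)""")

def last_number(text: str) -> str | None:
    """Crude correctness proxy: grab the last number mentioned in the text."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text)
    return nums[-1] if nums else None

def generate(question: str, reference_answer: str) -> None:
    """Ask every teacher model the question and store each answer with a correctness flag."""
    for teacher in TEACHERS:
        resp = client.chat.completions.create(
            model=teacher,
            messages=[{"role": "user", "content": PROMPT.format(question=question)}],
        )
        solution = resp.choices[0].message.content
        correct = last_number(solution) == reference_answer
        db.execute("INSERT INTO samples VALUES (?, ?, ?, ?)",
                   (teacher, question, solution, int(correct)))
        db.commit()
```

The last-number heuristic is just a cheap first pass; actually executing the generated code is what the filtering step later is for.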

To take advantage of the diversity and quality of 2025's Language Models, we will use a set of models:

After running the Synthetic Data Generator for 7 days straight on my Mac Mini, I ended up with 12 TinyGSM sub-datasets.

Dataset

There are 3 types of datasets:

Overall, $1000 in cloud credits was spent.


Step 2: Data filtering

Using the Data Filtering Engine, all 12 datasets were analyzed for correctness, structure, code runnability & reasoning quality.

Filter
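I won't reproduce the internals of the Data Filtering Engine here, but the core runnability and correctness check can be sketched as: run each candidate solution in a subprocess with a timeout and compare its printed result to the reference answer. The expected function name is an assumption carried over from the earlier illustrative example:

```python
import subprocess
import sys

TIMEOUT_S = 5  # kill anything that loops forever

def solution_is_valid(solution_code: str, reference_answer: str) -> bool:
    """Return True if the generated code runs, prints a value, and matches the label."""
    # Assumes solutions define `simple_math_problem()`; a real pipeline should
    # also sandbox this, since the generated code is untrusted.
    harness = solution_code + "\n\nprint(simple_math_problem())\n"
    try:
        run = subprocess.run(
            [sys.executable, "-c", harness],
            capture_output=True, text=True, timeout=TIMEOUT_S,
        )
    except subprocess.TimeoutExpired:
        return False                      # not runnable within the time budget
    if run.returncode != 0:
        return False                      # syntax error or raised exception
    try:
        return float(run.stdout.strip()) == float(reference_answer)
    except ValueError:
        return False                      # printed something non-numeric
```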

Overall, here are the findings:

Azure Violation (sidenote: Microsoft does not like it when you extract o4-mini reasoning.)

More details here.


Step 3: Model fine-tuning

With the datasets figured out, let's fine-tune a few language models!

Workout
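As a rough sketch of this step (not the exact setup used here), supervised fine-tuning on the filtered question/solution pairs can be done with Hugging Face's trl. The dataset path, hyperparameters, and checkpoint id below are placeholders, and the base model is the Gemma 3 270M discussed next:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder file: one JSON line per sample, with a "text" field that
# contains the question (as a docstring) followed by its Python solution.
dataset = load_dataset("json", data_files="filtered_tinygsm.jsonl", split="train")

config = SFTConfig(
    output_dir="gemma3-270m-gsm",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    num_train_epochs=1,
    logging_steps=50,
)

trainer = SFTTrainer(
    model="google/gemma-3-270m",   # assumed Hugging Face id for Gemma 3 270M
    args=config,
    train_dataset=dataset,
)
trainer.train()
```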

We will pick the good ones and use them to fine-tune a Gemma3-270M model. We pick this base model for 2 reasons: