Training a language model to run on RP2040

Starmind Pico: Is it possible to train a language model to run on the RP2040? Yes. Dumb, but fast.


Background

I'd had a Tiny2040 (RP2040) lying around the office for the last 8 months. During a recent hackathon, I decided to whip it out for a weekend project and thought to myself: I bet I can make a language model run on this.

[Photo: the robotics team I helped out at the hackathon]

What I set out to do:

What I did:

[Photo: Tiny2040]

Model Architecture

I analyzed five factors of the language model architecture and how each affects inference speed and output quality (a sketch of the resulting sweep grid follows the list):

1. Dimension size

2. Layer depth

3. Attention head count

4. FFN ratio

5. Vocab size
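
As a minimal sketch of how five factors turn into a pile of configurations to benchmark (the exact ranges I tested aren't listed above, so these values are illustrative only):

import itertools

# Illustrative sweep grid; the actual ranges tested aren't given in the post
grid = {
    'dim': [2, 4, 8, 16],
    'n_layers': [1, 2, 3],
    'n_heads': [2, 4, 8],
    'ffn_ratio': [8, 16, 32],
    'vocab_size': [64, 128, 256],
}
configs = [dict(zip(grid, combo)) for combo in itertools.product(*grid.values())]
print(len(configs), "configurations to benchmark")  # 324 with these values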

Architecture Impact Hierarchy (Most to Least Critical)

  1. Dimension Size: 40-50% speed loss per doubling - the ultimate performance killer (see the FLOPs sketch after this list)
  2. Layer Depth: 25-40% speed loss per additional layer
  3. Attention Heads: 20-25% speed loss per doubling (but surprisingly cheap at small scales)
  4. FFN Ratio: 15-20% speed loss per doubling
  5. Vocabulary Size: 8-12% speed loss per doubling (minimal impact)
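
To see where that ranking comes from, here's a back-of-envelope multiply-accumulate count per generated token for a minimal decoder-only transformer. This is a sketch (the ctx_len default and the layer breakdown are my assumptions), not measured RP2040 cycles:

# Back-of-envelope MACs per generated token; a sketch, not measured cycles
def macs_per_token(dim, n_layers, hidden_dim, vocab_size, ctx_len=64):
    attn_proj = 4 * dim * dim        # Q, K, V, and output projections
    attn_mix = 2 * ctx_len * dim     # QK^T scores plus weighted sum of V
    ffn = 2 * dim * hidden_dim       # FFN up- and down-projection
    logits = dim * vocab_size        # final unembedding matmul
    return n_layers * (attn_proj + attn_mix + ffn) + logits

# Doubling dim quadruples attn_proj and doubles every other term, while
# doubling vocab only grows the single logits matmul. Head count doesn't
# appear at all: at fixed dim it mostly reshapes the same math, which is
# why it's surprisingly cheap at small scales.
print(macs_per_token(dim=8, n_layers=2, hidden_dim=256, vocab_size=256))
print(macs_per_token(dim=16, n_layers=2, hidden_dim=256, vocab_size=256))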

Microcontroller-Specific Insights

Memory vs Speed Trade-offs

Production-Ready Findings

Practical Deployment Models

Architecture Templates for Production

# Maximum speed (~1K params); ranges are (min, max) tuples
optimal_speed = {
    'vocab_size': (32, 64),
    'dim': (1, 2),             # ultra-narrow
    'hidden_dim': (64, 128),   # 64x-128x FFN ratio
    'n_layers': 1,
    'n_heads': (6, 8),
    'expected_speed': '20-32 tok/s',
}

# Balanced production (~8K params)
balanced_production = {
    'vocab_size': (256, 512),
    'dim': (6, 8),
    'hidden_dim': (192, 256),  # 32x FFN ratio
    'n_layers': (2, 3),
    'n_heads': 8,
    'expected_speed': '2-5 tok/s',
}
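
As a rough sanity check on the 1K and 8K labels, here's a parameter-count estimate for those templates. It assumes a minimal decoder-only layout with tied embeddings and no bias or norm parameters; param_count is an illustrative helper, not code from the project:

# Rough parameter count under a minimal decoder-only layout (an assumption)
def param_count(vocab_size, dim, hidden_dim, n_layers):
    embed = vocab_size * dim     # token embeddings, assumed tied with output head
    attn = 4 * dim * dim         # Q, K, V, output projections per layer
    ffn = 2 * dim * hidden_dim   # FFN up/down projections per layer
    return embed + n_layers * (attn + ffn)

print(param_count(64, 2, 128, 1))    # 656: under the ~1K budget
print(param_count(256, 8, 256, 2))   # 10752: same order as the ~8K label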

Interesting things

Training Models

I trained the models on an H100 rented on Prime Intellect. Each model takes ~2 minutes to train.

All initial training took me 12 hours (about $20 total).
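
For scale, some quick arithmetic on those numbers (this assumes training time dominated the 12 hours, which is my inference rather than a logged figure):

runs = 12 * 60 / 2            # ~360 two-minute training runs fit in 12 hours
cost_per_run = 20 / runs      # ~$0.06 of GPU time per model
print(int(runs), f"runs at ${cost_per_run:.2f} each")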

Results

Conclusion

Due to memory fragmentation on the RP2040, the maximum vocabulary size we can use is 256, which is smaller than even the TinyStories models use.
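
To put numbers on that constraint, here's the back-of-envelope math (assuming float32 weights and the RP2040's 264 KB of SRAM; these are my own estimates, not profiler output):

VOCAB, DIM = 256, 8
embedding_bytes = VOCAB * DIM * 4   # token embedding table: 8 KB
logits_bytes = VOCAB * 4            # per-step logits buffer: 1 KB
print(embedding_bytes // 1024, "KB table,", logits_bytes, "B logits")
# At vocab 4096 the table alone would be 128 KB - a single contiguous
# allocation a fragmented heap on a 264 KB part often can't provide.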

After $20+ of GPU time spent on pre-training, the models generate a few coherent words, which is promising. However, I decided to move on from this project.

People Seem to Love It on Reddit

https://www.reddit.com/r/LocalLLaMA/comments/1n1hro7/how_to_train_a_language_model_to_run_on_rp2040/

Resources