Training a language model to run on RP2040
Is it possible to train a language model to run on the RP2040? Yes. Dumb, but fast.
Background
A TinyPico (RP2040) had been lying around my office for the last 8 months. During a recent hackathon, I decided to whip it out for a weekend project and thought to myself: I bet I can make a language model run on this.
(pic of the robotics team I helped out)
What I set out to do:
- Find a Language Model architecture that runs well on RP2040
- Train a model
- Run the model on the Pico
What I did:
- Tested different hyper-parameters & parameter counts (1-28k) on the Pico itself
- Trained the best-performing configurations on the TinyStories dataset
- Finding: due to memory fragmentation on the RP2040, I could not fit a vocabulary of more than 256 tokens in SRAM
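Fragmentation, not raw capacity, is what bites here: the RP2040 has 264 KB of SRAM in total, but the largest single block the allocator will grant can be much smaller than the total free RAM. Below is a minimal probe sketch of that idea (not the exact code from this project); on the MicroPython port you would compare its result against gc.mem_free(), which reports total free bytes rather than the largest contiguous block.

import gc

def largest_free_block(start=264 * 1024):
    """Probe roughly the biggest single bytearray the allocator will grant.
    On a fragmented heap this can be far smaller than the total free RAM
    (the RP2040 has 264 KB of SRAM in total)."""
    gc.collect()
    size = start
    while size > 0:
        try:
            buf = bytearray(size)   # succeeds only if a contiguous block exists
            del buf
            return size
        except MemoryError:
            size //= 2              # halve and retry
    return 0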

Model Architecture
I analyzed 5 factors of the language model architecture and how each affects inference speed and model quality:
1. Dimension size
2. Layer depth
3. Attention head count
4. FFN ratio
5. Vocab size

Architecture Impact Hierarchy (Most to Least Critical)
- Dimension Size: 40-50% speed loss per doubling - the ultimate performance killer
- Layer Depth: 25-40% speed loss per additional layer
- Attention Heads: 20-25% speed loss per doubling (but surprisingly cheap at small scales)
- FFN Ratio: 15-20% speed loss per doubling
- Vocabulary Size: 8-12% speed loss per doubling (minimal impact)
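A rough way to sanity-check that ordering is to count multiply-accumulates per generated token. The sketch below assumes a plain decoder block (Q/K/V/O projections plus a two-matrix FFN) with a tied embedding/classifier and ignores attention over the cached context; it is a back-of-the-envelope model, not the measurement setup behind the percentages above.

def macs_per_token(vocab_size, dim, hidden_dim, n_layers):
    """Very rough multiply-accumulate count for generating one token.
    Assumes Q/K/V/O projections (4*dim^2) and a two-matrix FFN
    (2*dim*hidden_dim) per layer, plus a vocab_size*dim classifier.
    Head count is left out on purpose: splitting dim across heads barely
    changes the raw math, so its measured cost is presumably per-head
    loop and bookkeeping overhead."""
    per_layer = 4 * dim * dim + 2 * dim * hidden_dim
    return n_layers * per_layer + vocab_size * dim

# At these ultra-narrow widths the dim^2 terms are tiny, so doubling dim
# roughly doubles the work, while doubling vocab_size only grows the
# (already small) classifier term:
print(macs_per_token(256, 8, 256, 2))    # baseline
print(macs_per_token(256, 16, 256, 2))   # dim doubled -> ~2x the MACs
print(macs_per_token(512, 8, 256, 2))    # vocab doubled -> ~1.2x the MACs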
Microcontroller-Specific Insights
Memory vs Speed Trade-offs
- Quantization paradox: Saves memory but slows down inference due to de-quantization overhead
- RP2040 bottleneck: Computation speed, not memory bandwidth
- KV caching: 3-7x speed improvement for multi-token generation
- Chunked loading: Eliminates large memory allocation failures
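The chunked-loading point is worth a concrete illustration: give each weight tensor its own modest allocation and fill it from flash in small reads, rather than asking a fragmented heap for one checkpoint-sized block. A minimal sketch, assuming a hypothetical weight file of raw little-endian float32 values:

import struct
from array import array

def load_tensor(f, n_floats, chunk_floats=256):
    """Read one float32 tensor from an open binary weight file in ~1 KB
    chunks. Many small-to-medium allocations survive a fragmented heap
    where a single model-sized allocation would fail."""
    out = array('f')
    remaining = n_floats
    while remaining > 0:
        n = min(chunk_floats, remaining)
        raw = f.read(4 * n)
        if len(raw) < 4 * n:
            raise EOFError('weight file truncated')
        out.extend(struct.unpack('<%df' % n, raw))
        remaining -= n
    return out

# Usage sketch: one small array per tensor instead of one giant buffer.
# with open('model.bin', 'rb') as f:
#     tok_emb = load_tensor(f, 256 * 8)    # e.g. vocab_size * dim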
Production-Ready Findings
Practical Deployment Models
- 1K param models: 15-32 tok/s - real-time capable for interactive applications
- 8K param models: 2-3 tok/s - best balance of capability and reliability
- 10K param models: Near memory limits but can achieve 14.5 tok/s with optimal architecture
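Throughput numbers like these are just wall-clock timings of the per-token step. A minimal harness looks something like the sketch below; generate_token is a stand-in for whatever produces one token, and on the Pico's MicroPython port you would swap time.monotonic() for time.ticks_ms()/time.ticks_diff().

import time

def tokens_per_second(generate_token, n_tokens=64):
    """Crude throughput measurement: run the single-token step n_tokens
    times and report tokens per second."""
    t0 = time.monotonic()
    for _ in range(n_tokens):
        generate_token()
    return n_tokens / (time.monotonic() - t0)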
Architecture Templates for Production
# Maximum Speed (1K params)
optimal_speed = {
    'vocab_size': (32, 64),      # (low, high) = the band that worked well
    'dim': (1, 2),               # ultra-narrow
    'hidden_dim': (64, 128),     # 64x-128x FFN ratio
    'n_layers': 1,
    'n_heads': (6, 8),
    'expected_speed': '20-32 tok/s',
}

# Balanced Production (8K params)
balanced_production = {
    'vocab_size': (256, 512),
    'dim': (6, 8),
    'hidden_dim': (192, 256),    # 32x FFN ratio
    'n_layers': (2, 3),
    'n_heads': 8,
    'expected_speed': '2-5 tok/s',
}
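The (low, high) tuples above are the bands that worked, not values the training code accepts directly; to get one concrete run, collapse each band to a single value. An illustrative helper (not from the original sweep code):

def pick(value):
    """Collapse a (low, high) band into one concrete value (low end here);
    plain ints and strings pass through unchanged."""
    return value[0] if isinstance(value, tuple) else value

# Illustrative only: one concrete config drawn from the balanced template.
config = {k: pick(v) for k, v in balanced_production.items()
          if k != 'expected_speed'}
# -> {'vocab_size': 256, 'dim': 6, 'hidden_dim': 192, 'n_layers': 2, 'n_heads': 8}

Taking the low end of each band also keeps the vocabulary at the 256-token cap the Pico can actually hold.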
Interesting things
- 176 architectural variants tested - most comprehensive microcontroller transformer study ever
- Ultra-narrow dimensions work at all scales (1K-10K parameters)
- Mathematical optimization provides elegant parameter allocation
- Speed scaling defies conventional wisdom - larger models can be faster with optimal architecture
Training Models
I trained the models on an H100 rented on Prime Intellect. Each model takes ~2 minutes to train.
All initial training took me 12 hours (about $20 total).
Results
- Fastest model: 32.0 tokens/second (1D architecture)
- Most balanced: 2-3 tokens/second (8K parameters)
- Memory limit: 256 vocabulary tokens maximum
- Architecture variants tested: 176 different configurations
Conclusion
Due to memory fragmentation on the RP2040, the maximum vocabulary size we can use is 256, which is smaller than even the vocabularies used by the TinyStories models.
After $20+ of GPU time spent on pre-training, the models generate a few coherent words, which is promising. However, I decided to move on from this project.
People Seem to Love It on Reddit
https://www.reddit.com/r/LocalLLaMA/comments/1n1hro7/how_to_train_a_language_model_to_run_on_rp2040/
Resources