Slowrun: The Insane AI Benchmark Where Compute Never Ends
What if everything you believed about training large language models was backwards?
For years, we've been obsessed with speed. Faster iterations. Cheaper training. Quicker time-to-market. The entire AI industry has been locked in a frantic sprint, measuring success in tokens-per-second and wall-clock minutes. But what if that obsession with velocity has blinded us to something far more powerful?
Enter Slowrun, the viral benchmark that is flipping AI training on its head—and making Andrej Karpathy himself take notice.
Imagine a world where you have infinite compute, but only 100 million tokens of data. No rushing. No shortcuts. Just you, the data, and as much processing power as you need to squeeze every last drop of learning from every single token. This isn't a speedrun. This is a Slowrun. And it's exposing secrets about model optimization that the speed-obsessed mainstream has completely missed.
If you're still training models the old way—racing against the clock, praying your loss curves converge in time—you're leaving massive performance gains on the table. The Slowrun benchmark is here to prove it. Ready to discover what happens when time stops being your enemy?
What is Slowrun?
Slowrun is a revolutionary language modeling benchmark created by the team at qlabs-eng that operates in what they call the "infinite compute, fixed data regime." The premise is deceptively simple yet profoundly different from anything else in the AI benchmarking landscape: take exactly 100 million tokens from FineWeb, remove all compute and time constraints, and let researchers battle for the lowest possible validation loss.
The name itself is a deliberate contrast to modded-nanogpt, the well-known speedrun competition where participants optimize for the fastest training time on fixed hardware, with data treated as effectively unlimited. Slowrun inverts that paradigm entirely. Here, data is precious and finite. Compute is abundant and unlimited. The goal isn't to finish first; it's to learn the most from every token.
This benchmark emerged from a critical insight in the AI research community: when speed ceases to be the binding constraint, the entire algorithmic landscape transforms. Suddenly, techniques that seemed prohibitively expensive become not just viable but essential. Large models with heavy regularization. Sophisticated optimizers that require more iterations. Evolutionary search strategies. Architectural innovations that demand more computation per forward pass.
The project builds directly on Nanochat, Karpathy's minimalist framework, and carries forward many ideas from the modded-nanogpt ecosystem—but with a crucial philosophical pivot. As the creators note, the speedrun contest did produce genuine data efficiency gains. Using less data is indeed one path to faster training. However, by making wall-clock time the supreme constraint, it systematically filters out an entire class of algorithms that could unlock superior learning outcomes.
Andrej Karpathy's public endorsement on X (formerly Twitter) turbocharged Slowrun's visibility in the developer community. When one of modern AI's most influential educators signals that something matters, the industry listens. The repository has since become a magnet for cutting-edge contributions from researchers pushing the boundaries of what's possible when compute constraints dissolve.
Key Features That Make Slowrun Revolutionary
The Slowrun benchmark isn't just another leaderboard—it's a meticulously structured research environment with multiple competitive tracks designed to isolate different aspects of the optimization problem.
Four Distinct Competition Tracks define the Slowrun ecosystem. The Limited Compute Track caps participants at a single 8xH100 node for one hour—already 100x the compute used by the original Nanochat 1-epoch baseline. The Tiny Track squeezes the challenge into just 15 minutes on the same hardware. The Two Hour Track extends the window for more elaborate strategies. And the Unlimited Compute Track removes virtually all hardware and time restrictions, enabling truly extravagant experiments like 210-hour training runs across multiple 8xH100 nodes.
Rigorous Reproducibility Standards set Slowrun apart from casual benchmarks. Every world record submission must include the exact training script, achieved validation loss, training time, and contributor attribution. The maintainers verify each result through pull request review before adding entries to the official leaderboard. This creates an immutable, auditable history of algorithmic progress.
Real-World Data Foundation grounds the benchmark in practical relevance. By using FineWeb's 100 million tokens—a high-quality web corpus—Slowrun ensures that techniques proven here transfer meaningfully to actual language modeling scenarios, not synthetic toy problems.
Open Scientific Collaboration drives innovation velocity. The repository's PR-based submission system means every breakthrough becomes immediately available for others to build upon. When someone discovers that Stochastic Weight Averaging (SWA) drops loss by 0.012, or that Interleaved Head Attention (IHA) unlocks another 0.009 improvement, the entire community gains that knowledge.
Architectural Flexibility encourages radical experimentation. The baseline uses a 2.7B parameter transformer, but successful entries have modified everything from activation functions (SwiGLU) to attention mechanisms (exclusive self-attention, gating per head) to training dynamics (layer looping, multi-token prediction loss).
Use Cases Where Slowrun Shines
The Slowrun benchmark isn't merely an academic exercise—it addresses concrete, high-stakes scenarios that working AI engineers and researchers encounter daily.
Data-Constrained Domain Adaptation represents perhaps the most immediate practical application. Consider fine-tuning a medical LLM on a carefully curated dataset of 100M clinician notes. You cannot simply "get more data"—patient privacy regulations, collection costs, and annotation bottlenecks make that impossible. Slowrun's techniques for maximizing learning from limited tokens directly transfer to this scenario. The heavy regularization strategies, careful optimizer selection, and architectural innovations all apply.
Expensive Compute, Cheap Inference describes the operational reality of many production AI systems. Companies like Netflix or Spotify can afford substantial upfront training costs if they yield models that run efficiently at serving time. Slowrun's exploration of whether larger, heavily regularized models generalize better becomes directly relevant. The unlimited track's findings about ensemble methods and distillation strategies offer concrete recipes for this tradeoff.
Algorithmic Research and Publication benefits enormously from Slowrun's structured environment. Rather than running ad-hoc experiments with inconsistent baselines, researchers can test novel optimizers, architectural components, or training procedures against a standardized, competitive benchmark with immediate community validation. A paper claiming improvement on Slowrun carries more weight than one using a private, unverifiable setup.
Education and Skill Development provides perhaps the most accessible use case. For engineers transitioning into AI, or students building intuition about deep learning, Slowrun offers a concrete, gamified environment where every incremental improvement is measurable and celebrated. The detailed commit history showing exactly how loss dropped from 3.402 to 3.195 serves as an unparalleled educational resource—each 0.001 improvement represents a specific, documented technique.
Step-by-Step Installation & Setup Guide
Getting started with Slowrun is refreshingly straightforward, thanks to its foundation on clean, minimal code. Here's how to reproduce the current limited-compute record and begin your own optimization journey.
Prerequisites
You'll need access to an 8xH100 GPU node for competitive entries. For experimentation and smaller-scale validation, the code runs on any CUDA-capable hardware, though times will vary dramatically.
Installation Commands
# Clone the repository
git clone https://github.com/qlabs-eng/slowrun.git && cd slowrun
# Install dependencies
pip install -r requirements.txt
# Prepare the FineWeb dataset
python prepare_data.py
The prepare_data.py script downloads and tokenizes the 100M token FineWeb subset, creating the fixed training and validation splits that all benchmark entries must use. This ensures absolute comparability between submissions.
Running the Baseline
# Launch distributed training on 8 GPUs
torchrun --standalone --nproc_per_node=8 train.py
This command launches the limited-compute training run from the repository root. On 8xH100 hardware, expect approximately 47 minutes to reach the baseline 3.402 validation loss. The torchrun launcher spawns one process per GPU and exports the rank variables; the training script uses them to select a device and set up distributed communication.
Track Selection
The repository organizes tracks into separate directories:
# Run each command from the repository root; the subshell keeps your working directory unchanged.
# Tiny track (15 minute limit)
(cd tiny/ && torchrun --standalone --nproc_per_node=8 train.py)
# Two hour track
(cd two_hour/ && torchrun --standalone --nproc_per_node=8 train.py)
# Unlimited compute track
(cd unlimited/ && torchrun --standalone --nproc_per_node=8 train.py)
Each directory contains appropriately configured training scripts with track-specific hyperparameters, model sizes, and training durations.
Submission Process
To claim a world record:
- Fork the repository
- Implement your improvements in the relevant train.py
- Document your changes with clear comments
- Run full verification to confirm loss and timing
- Open a pull request with your results
The maintainers review submissions for correctness before updating the official leaderboard.
REAL Code Examples from the Repository
Let's examine actual techniques from the Slowrun world record history that drove validation loss from 3.402 down to 3.195—a 6.1% relative improvement that required dozens of clever innovations.
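The percentage quoted above is a relative reduction in validation loss. As a quick sanity check, the arithmetic can be sketched with the two loss values from this article (the helper name is just for illustration):

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Fractional reduction in loss relative to the baseline."""
    return (baseline - new) / baseline


# Loss values quoted in this article for the limited-compute track.
limited = relative_improvement(3.402, 3.195)
print(f"{limited:.1%}")  # prints 6.1%
```

The same helper can be applied to any pair of leaderboard entries, which makes it easy to compare individual records on a common scale.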
Example 1: Reproducing the Current Record
The entry point that started it all—reproducible, minimal, and powerful:
# Clone and setup—exact commands from the README
git clone https://github.com/qlabs-eng/slowrun.git && cd slowrun
pip install -r requirements.txt # Installs torch, numpy, and dependencies
python prepare_data.py # Downloads and tokenizes FineWeb 100M
torchrun --standalone --nproc_per_node=8 train.py # Distributed training launch
The torchrun launcher is critical here. It spawns one worker process per GPU across the 8 H100s and sets environment variables for process rank, local rank, and world size. The training script reads those variables to initialize distributed training and coordinate gradient synchronization.
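To make the rank variables concrete, here is a minimal sketch of how a training script might read them. RANK, LOCAL_RANK, and WORLD_SIZE are the names torchrun exports; the helper function and its single-process defaults are assumptions for illustration, not code taken from the Slowrun repository.

```python
import os


def read_dist_env(env=os.environ):
    """Read the rank variables that torchrun exports to every worker process."""
    rank = int(env.get("RANK", 0))              # global index of this process
    local_rank = int(env.get("LOCAL_RANK", 0))  # index on this node; usually selects the GPU
    world_size = int(env.get("WORLD_SIZE", 1))  # total number of processes
    return rank, local_rank, world_size


rank, local_rank, world_size = read_dist_env()
is_main = rank == 0  # common convention: only rank 0 logs and saves checkpoints
```

Falling back to rank 0 and world size 1 lets the same script run as a plain single-process job when it is started without torchrun.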
Example 2: The Shuffling Innovation (Record #2)
One of the earliest breakthroughs was devastatingly simple:
# From commit 106a290: Add shuffling every epoch
# This modification to the data loader reshuffles training batches each epoch
# Previously, data was consumed in fixed order across all epochs
# Shuffling breaks spurious correlations in batch ordering, improving generalization
This change dropped validation loss from 3.402 to 3.376 at essentially no additional compute cost. The insight: even with "infinite" compute, data presentation order matters. With a fixed order, the model sees the same batches in the same sequence every epoch and can fit to that ordering. Reshuffling each epoch changes batch composition, so the gradient noise differs on every pass over the same 100M tokens.
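The repository's actual data loader is not reproduced here, but the idea can be sketched in a few lines. The function below is a hypothetical illustration: it re-draws the sequence order each epoch from a seed derived from the epoch number, so the order changes between epochs yet stays reproducible.

```python
import random


def epoch_batches(num_sequences, batch_size, epoch, seed=0, shuffle=True):
    """Yield lists of sequence indices for one epoch.

    With shuffle=True the order is re-drawn each epoch from a seed derived
    from (seed, epoch), so runs stay reproducible.
    """
    order = list(range(num_sequences))
    if shuffle:
        random.Random(seed * 1_000_003 + epoch).shuffle(order)
    for start in range(0, num_sequences, batch_size):
        yield order[start:start + batch_size]


first = list(epoch_batches(8, 4, epoch=0))
second = list(epoch_batches(8, 4, epoch=1))
```

In a multi-GPU run, every rank would need to derive the permutation from the same seed and epoch so that the workers agree on which sequences belong to which batch.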
Example 3: Architectural Surgery—Value Projections (Record #3)
A deeper architectural modification that exemplifies Slowrun's philosophy:
# From commit b261fba: Change value embed tables to projections from x0
# Instead of learned value embeddings V = W_v @ tokens,
# project from the initial token representation: V = W_proj @ x0
# This reduces parameters, increases sharing, and acts as structural regularization
This change brought validation loss down to 3.349, from the 3.402 baseline and the 3.376 reached by shuffling. It illustrates a core Slowrun principle: when data rather than compute is the bottleneck, architectural efficiency and inductive biases often outperform raw parameter count. The projection mechanism forces the model to reuse representations, creating an implicit simplicity bias that generalizes better.
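As a rough illustration of the difference, the sketch below contrasts a per-layer lookup table with a projection of the initial embedding x0. It is written in plain Python with toy sizes; the function names and dimensions are assumptions, and the real change in the commit may differ in detail.

```python
def table_value_embed(table, token_id):
    """Per-layer value embedding: one learned row per vocabulary entry."""
    return table[token_id]


def projected_value_embed(w_proj, x0):
    """Value embedding computed as a linear projection of the initial embedding x0."""
    return [sum(w * x for w, x in zip(row, x0)) for row in w_proj]


def param_counts(vocab_size, d_model):
    """Parameters per layer for each variant."""
    return {"table": vocab_size * d_model, "projection": d_model * d_model}


# Illustrative sizes only, not the benchmark's actual configuration.
counts = param_counts(vocab_size=50_000, d_model=1_024)
```

The parameter saving holds whenever the model width is smaller than the vocabulary size, since a d_model by d_model matrix replaces a vocab_size by d_model table in each layer that uses value embeddings.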
Example 4: Advanced Regularization—Stochastic Depth (Record #15)
A sophisticated training technique from the modern deep learning toolkit:
# From commit 038fa4b: Add stochastic depth training
# During training, randomly drop entire transformer layers with probability p
# This acts as a strong regularizer, preventing over-reliance on any single layer
# At inference, use all layers with scaled activations (expected value)
Stochastic depth contributed to dropping loss from 3.230 to 3.227, part of the relentless march toward optimal performance. In the data-limited regime, aggressive regularization paradoxically enables larger models—exactly the non-intuitive finding that Slowrun was designed to surface.
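A minimal sketch of the mechanism described in the comments above, using plain Python lists in place of tensors. The function name, the per-call drop decision, and the toy branch are assumptions for illustration; the repository's implementation may choose drop probabilities per layer or per sample.

```python
import random


def residual_block(x, branch, drop_prob, training, rng=random):
    """Apply x + branch(x) with stochastic depth on the branch.

    Training: skip the branch entirely with probability drop_prob.
    Inference: always run the branch, scaled by its survival probability
    so the output matches the training-time expectation.
    """
    if training:
        if rng.random() < drop_prob:
            return list(x)  # identity: the layer is dropped for this step
        return [xi + bi for xi, bi in zip(x, branch(x))]
    keep_prob = 1.0 - drop_prob
    return [xi + keep_prob * bi for xi, bi in zip(x, branch(x))]


def double(v):
    """Stand-in for an attention or MLP branch."""
    return [2.0 * vi for vi in v]


eval_out = residual_block([1.0, 2.0], double, drop_prob=0.25, training=False)
# eval_out == [2.5, 5.0]
```

A common alternative rescales the surviving branch by 1 / keep_prob during training and leaves inference untouched; both variants keep the expected output consistent between the two modes.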
Example 5: The Unlimited Track's Ensemble Strategy (Record #10)
Where compute truly explodes, techniques become radically different:
# From commit dceb3e9: Use probability averaging over logit averaging, train 20 models
# Instead of averaging model logits (log-space), average predicted probabilities
# This preserves calibration better and handles peaky distributions more robustly
# 20 independently trained models, 210 hours total on 8xH100 nodes
This 3.024 validation loss, about an 11% improvement over the 3.402 baseline, required 210 hours of compute across multiple nodes. The ensemble approach exemplifies how unlimited compute unlocks entirely different algorithmic paradigms. No speedrun could ever consider this. Slowrun makes it not just possible but competitive.
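The two ensembling rules are easy to confuse, so here is a small self-contained sketch of both on toy logits. The helper names and the two-token example are illustrative assumptions, not code from the record submission.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_probabilities(per_model_logits):
    """Ensemble by averaging each model's softmax output."""
    probs = [softmax(row) for row in per_model_logits]
    n = len(probs)
    return [sum(col) / n for col in zip(*probs)]


def average_logits(per_model_logits):
    """Ensemble by averaging logits, then applying a single softmax."""
    n = len(per_model_logits)
    mean = [sum(col) / n for col in zip(*per_model_logits)]
    return softmax(mean)


# Two toy models that disagree about a 2-token vocabulary.
logits = [[4.0, 0.0], [0.0, 1.0]]
p_prob = average_probabilities(logits)   # roughly [0.625, 0.375]
p_logit = average_logits(logits)         # roughly [0.818, 0.182]
```

Averaging logits is equivalent to a renormalized geometric mean of the models' distributions, so a single confident model can dominate the result. Averaging probabilities is an arithmetic mean, which keeps more mass on tokens that any one model considers plausible.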
Advanced Usage & Best Practices
Mastering Slowrun requires internalizing its counterintuitive optimization philosophy. Here are pro strategies from top leaderboard contributors.
Embrace Regularization Aggressively. The baseline's weight decay of 1.6 wasn't a typo: it is 16x a typical setting of 0.1. In data-constrained regimes, strong regularization enables larger models without overfitting. Current records push this even further with scheduled weight decay, stochastic depth, and architectural constraints. When data is fixed, your main path to better generalization is controlling model complexity more intelligently.
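To get a feel for how strong a weight decay of 1.6 is, the sketch below assumes decoupled (AdamW-style) decay, where each step multiplies the weight by (1 - lr * weight_decay). Whether the Slowrun optimizer applies decay in exactly this form is an assumption, and the learning rate and step count are hypothetical values chosen only for comparison.

```python
def decayed_step(weight, update, lr, weight_decay):
    """One parameter update with decoupled (AdamW-style) weight decay.

    The weight is first shrunk by a factor (1 - lr * weight_decay),
    then moved along the optimizer's update direction.
    """
    return weight * (1.0 - lr * weight_decay) - lr * update


def shrink_factor(lr, weight_decay, steps):
    """Multiplicative shrinkage of a weight after `steps` updates with zero gradient."""
    return (1.0 - lr * weight_decay) ** steps


# Hypothetical learning rate and step count, chosen only to compare the two settings.
typical = shrink_factor(lr=1e-3, weight_decay=0.1, steps=1000)  # about 0.90
heavy = shrink_factor(lr=1e-3, weight_decay=1.6, steps=1000)    # about 0.20
```

Under these assumed settings, a weight that receives no gradient signal keeps about 90% of its magnitude with a decay of 0.1 but only about 20% with 1.6, which shows how strongly the heavier setting pulls unused parameters toward zero.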
Exploit Layer-Wise Dynamics. Multiple record-breaking entries involve layer looping—repeating middle layers (15-20) before final layers during late training. This suggests different network depths contain learnable features at different optimization timescales. Experiment with asymmetric training schedules that treat layers differently.
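One way to picture layer looping is as a change to the order in which layer indices execute in a forward pass. The sketch below is a hypothetical helper on a toy 6-layer network; the record entries may implement looping differently.

```python
def layer_schedule(num_layers, loop_start, loop_end, loops):
    """Return the order in which layer indices run in one forward pass.

    Layers loop_start..loop_end (inclusive) run `loops` times in total;
    all other layers run once.
    """
    head = list(range(0, loop_start))
    middle = list(range(loop_start, loop_end + 1))
    tail = list(range(loop_end + 1, num_layers))
    return head + middle * loops + tail


# Toy 6-layer network: layers 2 and 3 run twice before the final layers.
order = layer_schedule(num_layers=6, loop_start=2, loop_end=3, loops=2)
# order == [0, 1, 2, 3, 2, 3, 4, 5]
```

Because the looped layers reuse the same weights, this adds compute per forward pass without adding parameters. Switching `loops` from 1 to a larger value partway through a run would correspond to enabling looping only during late training.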
Master the Art of Averaging. From EMA to SWA to checkpoint averaging to probability ensembles, averaging model states consistently produces gains. The community has progressed through simple averaging (3.252), SWA (3.236), weighted checkpoint averaging (3.248), and finally probability-space ensemble averaging (3.024). Each refinement extracts more value from the same training runs.
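The averaging variants named above share one core operation: combining several sets of parameters into one. A minimal sketch, using plain dicts of floats as stand-ins for model state (the function names and toy values are assumptions for illustration):

```python
def ema_update(ema, params, decay):
    """Exponential moving average of parameters, updated after each step."""
    return {k: decay * ema[k] + (1.0 - decay) * params[k] for k in params}


def average_checkpoints(checkpoints, weights=None):
    """Weighted average of parameter dicts; uniform weights give SWA-style averaging."""
    if weights is None:
        weights = [1.0] * len(checkpoints)
    total = sum(weights)
    return {
        k: sum(w * ckpt[k] for w, ckpt in zip(weights, checkpoints)) / total
        for k in checkpoints[0]
    }


ckpts = [{"w": 1.0}, {"w": 2.0}, {"w": 3.0}]
uniform = average_checkpoints(ckpts)                               # {"w": 2.0}
late_heavy = average_checkpoints(ckpts, weights=[1.0, 1.0, 2.0])   # {"w": 2.25}
```

All of these average in weight space and yield a single model. Probability ensembling differs in kind: it keeps every model and averages their predictions, which costs more at inference time.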
Schedule Everything. The most recent records introduce weight decay schedules, learning rate schedule tuning, and context window scheduling. Static hyperparameters waste potential. Dynamic schedules adapt regularization strength to training phase, matching model capacity to remaining optimization difficulty.
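As a sketch of what scheduling a hyperparameter means in code, the helper below linearly interpolates a value over training. The schedule shape and the endpoints are hypothetical; the article does not specify which shapes or directions the record entries use.

```python
def linear_schedule(step, total_steps, start, end):
    """Linearly interpolate a hyperparameter from `start` to `end` over training."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + frac * (end - start)


# Hypothetical endpoints: begin at the baseline's 1.6 and move toward 0.8.
wd_values = [linear_schedule(s, 1000, start=1.6, end=0.8) for s in (0, 500, 1000)]
```

The same function can drive a learning rate or a context length by calling it once per step and passing the result to the optimizer or data loader.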
Read the Commit History Like a Textbook. Every entry in the world record table links to the exact diff. Study them sequentially. The progression from simple to complex reveals which techniques combine synergistically and which interfere. This is arguably the most valuable educational resource in modern optimization practice.
Comparison with Alternatives
| Dimension | Slowrun | Modded-NanoGPT (Speedrun) | Standard Benchmarks (GLUE, etc.) |
|---|---|---|---|
| Primary Metric | Lowest validation loss | Fastest wall-clock time | Task-specific accuracy |
| Compute Assumption | Unlimited in principle, capped per track | Fixed hardware constraint | Varies, often implicit |
| Data Regime | Fixed 100M tokens | Effectively infinite | Fixed task datasets |
| Optimization Target | Maximum learning per token | Maximum tokens per second | Maximum task performance |
| Valid Techniques | Ensembles, heavy regularization, evolutionary search | Efficient kernels, optimal batching, compilation | Architecture search, pretraining scale |
| Community Structure | Collaborative, PR-based records | Competitive, individual submissions | Leaderboard aggregators |
| Research Insight | Data efficiency, regularization science | Systems optimization, training dynamics | Transfer learning, task structure |
| Hardware Requirements | Flexible, scales with ambition | Strictly defined | Varies widely |
Why choose Slowrun? If your research or application involves limited data with available compute—domain adaptation, private data scenarios, expensive annotation—Slowrun's insights transfer directly. If you're studying what models can learn given enough processing, rather than how fast they learn, Slowrun is the only benchmark designed for your question.
The speedrun paradigm implicitly assumes data is free and compute is precious. This describes some scenarios well—large-scale pretraining with web-scale data, for instance. But many critical applications invert these economics. Slowrun fills that gap.
FAQ
What hardware do I need to compete in Slowrun?
The limited track requires a single 8xH100 node for one hour. For experimentation, any CUDA GPU works but times scale proportionally. The unlimited track has no hardware restrictions—use as many nodes for as long as you wish.
How is Slowrun different from just training longer?
Training longer with fixed hyperparameters yields diminishing returns. Slowrun's leaderboard shows that algorithmic innovations—not just more steps—drive progress. The 3.402→3.195 improvement came from dozens of distinct techniques, not merely extended training.
Can techniques from Slowrun transfer to larger-scale training?
This is an open empirical question and actively researched. The 100M token size was chosen precisely because it's "large enough that winning techniques may work at larger scale." Several innovations (Muon optimizer, heavy weight decay) have already shown transfer.
Why did Andrej Karpathy endorse this project?
Karpathy's endorsement reflects Slowrun's philosophical alignment with rigorous, minimal experimental design—hallmarks of his own educational and research approach. The project extends his Nanochat framework with genuine scientific innovation.
How do I submit a new world record?
Fork the repository, implement your changes with clear documentation, verify your loss and timing, then open a pull request. Maintainers review for correctness before updating the official leaderboard.
What's the current lowest validation loss achieved?
As of the latest records: 3.195 (limited track), 3.332 (tiny track), 3.144 (two hour track), and 3.001 (unlimited track). Check the repository for real-time updates.
Is Slowrun only for large labs with massive compute?
No—the tiny track (15 minutes on 8xH100) and the educational value of studying commit history make Slowrun accessible. Many innovations came from individual contributors, not well-resourced teams.
Conclusion
The Slowrun benchmark is more than a leaderboard—it's a fundamental reframing of what it means to train language models well. In a field obsessed with velocity, it dares to ask: what if we optimized for depth instead?
The results speak with devastating clarity. From 3.402 to 3.001, an 11.8% improvement, achieved not by bigger data or faster chips, but by smarter algorithms applied with patience. Stochastic depth. Layer looping. Probability averaging. Techniques that speed-obsessed training would never discover.
This is the secret that top researchers are now racing to exploit. When compute constraints dissolve, the optimization landscape transforms completely. Regularization becomes your superpower. Architectural innovation becomes your edge. And the community's collective intelligence, captured in every PR and commit message, becomes your competitive advantage.
The Slowrun repository at https://github.com/qlabs-eng/slowrun isn't just code—it's a living textbook on the future of efficient learning. Whether you're adapting models to precious private data, researching optimization fundamentals, or simply building intuition for what transformers can truly achieve, Slowrun offers something no speedrun ever could: the time to get it right.
Stop racing. Start learning. Your next breakthrough might be waiting in the data you've already got.
Clone the repository. Study the commits. Break a record. The infinite compute era starts now.