A new hands-on workshop lets developers build and train a small transformer-based language model from scratch, without relying on pre-trained weights or proprietary datasets. The project, based on Andrej Karpathy’s nanoGPT, strips down the training pipeline to its essentials, enabling users to create a ~10-million-parameter GPT model on commodity hardware in under an hour.
Overview
The workshop guides users through writing every component of a GPT training pipeline in Python, including tokenization, model architecture, training loops, and text generation. The goal is to demystify how large language models (LLMs) work by having participants implement each piece themselves, rather than relying on black-box calls such as Hugging Face's AutoModel.from_pretrained().
The default configuration trains a ~10-million-parameter model on Shakespeare’s works, producing text in a similar style. Training runs on Apple Silicon (MPS), NVIDIA GPUs (CUDA), or CPU, and also works in Google Colab. The project is designed to be completed in a single session, with no prior machine learning experience required—just comfort reading Python code.
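In PyTorch (which nanoGPT builds on), hardware selection typically comes down to a few lines like the following generic sketch; the helper name is illustrative, not code from the workshop:

```python
import torch

def pick_device() -> torch.device:
    """Pick the fastest available backend: CUDA, then Apple MPS, then CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
print(f"Training on: {device}")
```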
What you’ll build
The workshop is divided into six parts, each covering a core component of the pipeline:
Tokenizer
- Implements a character-level tokenizer (vocab size = 65) to convert text into numerical IDs.
- Explains why byte-pair encoding (BPE) fails on small datasets like Shakespeare (~1MB).
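As a rough sketch of what this part involves, a character-level tokenizer fits in a dozen lines; the sample text and variable names below are illustrative, not the workshop's exact code:

```python
# In the workshop, `text` is the full Shakespeare corpus; a short sample keeps this self-contained.
text = "To be, or not to be: that is the question."

chars = sorted(set(text))                        # unique characters (65 for the full corpus)
stoi = {ch: i for i, ch in enumerate(chars)}     # char -> integer ID
itos = {i: ch for ch, i in stoi.items()}         # integer ID -> char

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

ids = encode("to be")
print(ids)               # one integer ID per character
print(decode(ids))       # round-trips back to "to be"
```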
Transformer architecture
- Builds the full GPT model, including embeddings, self-attention, layer normalization, and feed-forward layers.
- Uses residual connections and multi-head attention (e.g., 6 heads for the default config).
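For a sense of the attention component, here is a sketch of a causal multi-head self-attention module in PyTorch, sized to the default configuration from the table below (6 heads, 384-dimensional embeddings, block size 256); the workshop's actual module may differ in details such as dropout and weight initialization:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal mask (no peeking at future tokens)."""

    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # project to queries, keys, values
        self.proj = nn.Linear(n_embd, n_embd)      # output projection
        # Lower-triangular mask so position t only attends to positions <= t.
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # Reshape to (batch, heads, time, head_dim).
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)        # scaled dot-product scores
        att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))  # block future positions
        att = F.softmax(att, dim=-1)
        out = att @ v                                                 # weighted sum of values
        out = out.transpose(1, 2).contiguous().view(B, T, C)          # re-merge heads
        return self.proj(out)

x = torch.randn(1, 256, 384)   # (batch, block_size, n_embd) for the default config
attn = CausalSelfAttention(n_embd=384, n_head=6, block_size=256)
print(attn(x).shape)           # torch.Size([1, 256, 384])
```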
Training loop
- Implements forward pass, loss calculation, backpropagation, and the AdamW optimizer.
- Includes gradient clipping and learning rate scheduling.
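The training loop boils down to a few repeated steps. The sketch below shows them with a stand-in bigram model and random token data so it runs on its own; the real workshop trains the full GPT on the Shakespeare corpus, and the hyperparameter values here are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for the GPT: a bigram model, just to show the loop mechanics.
vocab_size, block_size, batch_size = 65, 256, 32
model = nn.Embedding(vocab_size, vocab_size)    # logits for the next token given the current one
data = torch.randint(0, vocab_size, (10_000,))  # fake token stream; real code uses the corpus

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1_000)

def get_batch():
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])  # targets are shifted by one
    return x, y

for step in range(200):
    xb, yb = get_batch()
    logits = model(xb)                                                # forward pass
    loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))  # loss calculation
    optimizer.zero_grad(set_to_none=True)
    loss.backward()                                                   # backpropagation
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)           # gradient clipping
    optimizer.step()
    scheduler.step()                                                  # learning rate scheduling
    if step % 50 == 0:
        print(f"step {step}: loss {loss.item():.3f}")
```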
Text generation
- Covers autoregressive decoding, temperature sampling, and top-k filtering.
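Generation is a loop: feed the context, sample the next token, append it, repeat. A sketch of that loop with temperature and top-k sampling follows; the function signature is illustrative (nanoGPT implements a similar routine as a method on the model), and the stand-in model only exists to make the snippet self-contained:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size, temperature=1.0, top_k=None):
    """Autoregressive decoding: repeatedly predict the next token and append it."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]           # crop context to the model's block size
        logits = model(idx_cond)                  # (batch, time, vocab)
        logits = logits[:, -1, :] / temperature   # last position only; scale by temperature
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = float("-inf")  # drop everything outside the top-k
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_token], dim=1)
    return idx

# Demo with a stand-in "model" (an embedding table); real use passes the trained GPT.
dummy = torch.nn.Embedding(65, 65)
context = torch.zeros((1, 1), dtype=torch.long)
out = generate(dummy, context, max_new_tokens=20, block_size=256, temperature=0.8, top_k=10)
print(out.shape)   # torch.Size([1, 21])
```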
Putting it all together
- Trains the model on real data, visualizes loss curves, and experiments with scaling.
Competition (optional)
- Encourages users to find larger datasets, scale up the model, and produce the best AI-generated poem they can.
Model configurations
The workshop provides three preset configurations, all using character-level tokenization and a block size of 256:
| Config | Parameters | Layers (n_layer) | Attention Heads (n_head) | Embedding Dim (n_embd) | Train Time (M3 Pro) |
|---|---|---|---|---|---|
| Tiny | ~0.5M | 2 | 2 | 128 | ~5 min |
| Small | ~4M | 4 | 4 | 256 | ~20 min |
| Medium | ~10M | 6 | 6 | 384 | ~45 min |
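In code, each preset amounts to a handful of hyperparameters. A hypothetical dataclass capturing the table (the workshop's actual config format may differ) could look like this:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    vocab_size: int = 65    # character-level vocabulary
    block_size: int = 256   # context length shared by all presets
    n_layer: int = 6
    n_head: int = 6
    n_embd: int = 384

tiny   = GPTConfig(n_layer=2, n_head=2, n_embd=128)   # ~0.5M parameters
small  = GPTConfig(n_layer=4, n_head=4, n_embd=256)   # ~4M parameters
medium = GPTConfig(n_layer=6, n_head=6, n_embd=384)   # ~10M parameters
```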
Tradeoffs
- Character-level vs. BPE tokenization: The workshop uses character-level tokenization (vocab size = 65) because BPE (e.g., GPT-2’s 50,257-token vocabulary) requires much larger datasets to learn meaningful patterns. Part 5 of the workshop covers switching to BPE for bigger datasets (see the sketch after this list).
- Hardware requirements: While the default config trains in under an hour on a MacBook Pro (M3), scaling up to larger models or datasets will require more compute.
- Performance: The ~10M-parameter model is far smaller than state-of-the-art LLMs but serves as a practical introduction to transformer training.
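For reference, one common way to switch to BPE is OpenAI's tiktoken library, which exposes GPT-2's 50,257-token vocabulary; whether the workshop uses tiktoken specifically is an assumption:

```python
# Swapping the character tokenizer for GPT-2's byte-pair encoding via tiktoken.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.n_vocab)                      # 50257 tokens instead of 65 characters

ids = enc.encode("To be, or not to be")
print(ids)                              # far fewer IDs than characters
print(enc.decode(ids))                  # round-trips back to the original string
```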
How to get started
- Local setup (recommended)
- Install uv (Astral’s Python package manager):

  ```
  # macOS / Linux
  curl -LsSf https://astral.sh/uv/install.sh | sh

  # Windows
  powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
  ```
- Install