Coding

Train Your Own LLM from Scratch

A hands-on workshop based on Andrej Karpathy's nanoGPT walks developers through training a small transformer-based language model entirely from scratch, with no pre-trained weights or proprietary datasets. By implementing the tokenizer, model architecture, training loop, and text generation themselves, participants can train a ~10-million-parameter GPT on Shakespeare's works using commodity hardware in under an hour. The goal is to demystify how large language models actually work. AI-assisted, human-reviewed.

A new hands-on workshop lets developers build and train a small transformer-based language model from scratch, without relying on pre-trained weights or proprietary datasets. The project, based on Andrej Karpathy’s nanoGPT, strips down the training pipeline to its essentials, enabling users to create a ~10-million-parameter GPT model on commodity hardware in under an hour.

Overview

The workshop guides users through writing every component of a GPT training pipeline in Python, including tokenization, model architecture, training loops, and text generation. The goal is to demystify how large language models (LLMs) work by having participants implement each piece themselves, rather than relying on black-box calls such as Hugging Face's AutoModel.from_pretrained().

The default configuration trains a ~10-million-parameter model on Shakespeare’s works, producing text in a similar style. Training runs on Apple Silicon (MPS), NVIDIA GPUs (CUDA), or CPU, and also works in Google Colab. The project is designed to be completed in a single session, with no prior machine learning experience required—just comfort reading Python code.
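
Device selection in a pipeline like this typically follows a simple preference order. The snippet below is a minimal sketch of that logic, not the workshop's actual code:

    # Prefer an NVIDIA GPU (CUDA), then Apple Silicon's MPS backend, then fall back to CPU.
    import torch

    if torch.cuda.is_available():
        device = "cuda"
    elif torch.backends.mps.is_available():
        device = "mps"
    else:
        device = "cpu"
    print(f"training on {device}")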

What you’ll build

The workshop is divided into six parts, each covering a core component of the pipeline; a condensed code sketch of these pieces follows the list:

  1. Tokenizer

    • Implements a character-level tokenizer (vocab size = 65) to convert text into numerical IDs.
    • Explains why byte-pair encoding (BPE) fails on small datasets like Shakespeare (~1MB).
  2. Transformer architecture

    • Builds the full GPT model, including embeddings, self-attention, layer normalization, and feed-forward layers.
    • Uses residual connections and multi-head attention (e.g., 6 heads for the default config).
  3. Training loop

    • Implements forward pass, loss calculation, backpropagation, and the AdamW optimizer.
    • Includes gradient clipping and learning rate scheduling.
  4. Text generation

    • Covers autoregressive decoding, temperature sampling, and top-k filtering.
  5. Putting it all together

    • Trains the model on real data, visualizes loss curves, and experiments with scaling.
  6. Competition (optional)

    • Encourages users to find larger datasets, scale up the model, and produce the best AI-generated poem they can.
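
The list above maps onto a surprisingly small amount of code. The sketch below is a condensed, illustrative version of those core pieces, not the workshop's actual implementation: a character-level tokenizer, a single causal self-attention head, one AdamW training step with gradient clipping, and autoregressive sampling with temperature and top-k filtering. It assumes PyTorch, and every name in it is hypothetical.

    # Minimal sketch of the pipeline's core pieces (illustrative only).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # --- 1. Character-level tokenizer: one integer ID per unique character ---
    text = "To be, or not to be"
    chars = sorted(set(text))                      # toy vocabulary (65 chars for the full Shakespeare text)
    stoi = {ch: i for i, ch in enumerate(chars)}   # char -> id
    itos = {i: ch for ch, i in stoi.items()}       # id -> char
    encode = lambda s: [stoi[c] for c in s]
    decode = lambda ids: "".join(itos[i] for i in ids)

    # --- 2. One causal self-attention head (shown stand-alone; the full model stacks several) ---
    class CausalSelfAttentionHead(nn.Module):
        def __init__(self, n_embd, head_size, block_size):
            super().__init__()
            self.key = nn.Linear(n_embd, head_size, bias=False)
            self.query = nn.Linear(n_embd, head_size, bias=False)
            self.value = nn.Linear(n_embd, head_size, bias=False)
            # lower-triangular mask so each position only attends to the past
            self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

        def forward(self, x):                      # x: (batch, time, n_embd)
            B, T, C = x.shape
            k, q, v = self.key(x), self.query(x), self.value(x)
            att = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
            att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
            att = F.softmax(att, dim=-1)
            return att @ v                         # (batch, time, head_size)

    # --- 3. One training step: forward, loss, backprop, AdamW, gradient clipping ---
    vocab_size, block_size = len(chars), 8
    model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))  # stand-in for the GPT
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    ids = torch.tensor(encode(text))
    xb, yb = ids[:block_size].unsqueeze(0), ids[1:block_size + 1].unsqueeze(0)  # next-char targets
    logits = model(xb)
    loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    optimizer.zero_grad()

    # --- 4. Autoregressive generation with temperature and top-k filtering ---
    @torch.no_grad()
    def generate(model, ids, max_new_tokens, temperature=1.0, top_k=None):
        for _ in range(max_new_tokens):
            logits = model(ids[:, -block_size:])[:, -1, :] / temperature
            if top_k is not None:
                kth = torch.topk(logits, top_k).values[:, -1, None]   # k-th largest logit
                logits[logits < kth] = float("-inf")                  # drop everything below it
            probs = F.softmax(logits, dim=-1)
            ids = torch.cat([ids, torch.multinomial(probs, 1)], dim=1)
        return ids

    print(decode(generate(model, xb[:, :1], max_new_tokens=20)[0].tolist()))

The workshop's real model differs in the details (multiple heads and layers, layer normalization, residual connections, learning-rate scheduling), but each of its six parts elaborates on one of the steps sketched here.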

Model configurations

The workshop provides three preset configurations, all using character-level tokenization and a block size of 256:

Config    Parameters    Layers (n_layer)    Attention heads (n_head)    Embedding dim (n_embd)    Train time (M3 Pro)
Tiny      ~0.5M         2                   2                           128                       ~5 min
Small     ~4M           4                   4                           256                       ~20 min
Medium    ~10M          6                   6                           384                       ~45 min
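
As a rough sanity check on the parameter counts above: most of a GPT's weights sit in its attention and feed-forward layers, which works out to roughly 12 × n_layer × n_embd² parameters, plus token and position embeddings. The snippet below applies that standard rule of thumb to the three presets; it is an approximation, not the workshop's exact accounting.

    # Rough parameter-count estimate for each preset (rule of thumb, not exact):
    # each transformer layer holds ~4*n_embd^2 attention weights and ~8*n_embd^2
    # feed-forward weights.
    presets = {"Tiny": (2, 128), "Small": (4, 256), "Medium": (6, 384)}  # name: (n_layer, n_embd)
    vocab_size, block_size = 65, 256
    for name, (n_layer, n_embd) in presets.items():
        core = 12 * n_layer * n_embd ** 2                  # attention + feed-forward weights
        embeddings = (vocab_size + block_size) * n_embd    # token + position embeddings
        print(f"{name}: ~{(core + embeddings) / 1e6:.1f}M parameters")

This puts Tiny and Medium close to the table's figures; the remaining gaps come from details the rule of thumb ignores, such as biases, layer norms, and whether the output head shares weights with the embedding.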

Tradeoffs

  • Character-level vs. BPE tokenization: The workshop uses character-level tokenization (vocab size = 65) because BPE (e.g., GPT-2’s 50,257-token vocabulary) requires much larger datasets to learn meaningful patterns. Part 5 of the workshop covers switching to BPE for bigger datasets; a small comparison follows these bullets.
  • Hardware requirements: While the default config trains in under an hour on a MacBook Pro (M3), scaling up to larger models or datasets will require more compute.
  • Performance: The ~10M-parameter model is far smaller than state-of-the-art LLMs but serves as a practical introduction to transformer training.
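
To make the tokenization tradeoff concrete, the snippet below encodes the same line of Shakespeare with a character-level vocabulary and with GPT-2's BPE vocabulary. It is an illustration only and assumes the tiktoken package, which is not part of the workshop's setup.

    # Character-level vs. BPE tokenization of the same sentence (illustrative only).
    import tiktoken  # OpenAI's BPE tokenizer; assumed installed separately

    sentence = "Friends, Romans, countrymen, lend me your ears;"

    # Character-level: tiny vocabulary, longer sequences.
    char_vocab = sorted(set(sentence))
    char_ids = [char_vocab.index(c) for c in sentence]

    # GPT-2 BPE: 50,257-token vocabulary, shorter sequences -- but each token's
    # statistics must be learned, which takes far more than ~1MB of text.
    enc = tiktoken.get_encoding("gpt2")
    bpe_ids = enc.encode(sentence)

    print(len(char_ids), "character tokens vs.", len(bpe_ids), "BPE tokens")
    print("char vocab size:", len(char_vocab), " BPE vocab size:", enc.n_vocab)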

How to get started

  1. Local setup (recommended)
    • Install uv (Astral’s Python package manager):
      # macOS / Linux
      curl -LsSf https://astral.sh/uv/install.sh | sh
      # Windows
      powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
      