Train your own Large Language Model using HPC: NHR Süd-West
This workshop covers the key technical steps of building your own language model: selecting a dataset, training a tokenizer suited to the target language, initializing a model with appropriate training configurations, and conducting an initial pretraining run. We will use the LLäMmlein resources (dataset, reference models, training configurations) as a basis, with the goal of producing a small, functional mini-LLM, including basic metrics, checkpoints, and the ability to compare different training parameters.
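The tokenizer step can be sketched with the Hugging Face `tokenizers` library. The tiny in-memory corpus, vocabulary size, and special tokens below are illustrative placeholders, not the LLäMmlein setup; in the workshop, the tokenizer is trained on the actual dataset.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Tiny stand-in corpus; the workshop uses the LLäMmlein dataset instead.
corpus = [
    "Ein kleines Sprachmodell lernt aus Text.",
    "Der Tokenizer zerlegt Text in Teilwoerter.",
    "Training auf HPC-Systemen skaliert auf viele GPUs.",
]

# Byte-pair-encoding tokenizer with whitespace pre-tokenization.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=512,  # real runs use far larger vocabularies (e.g. 32k)
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)
tokenizer.train_from_iterator(corpus, trainer)

encoding = tokenizer.encode("Ein Tokenizer fuer Deutsch.")
print(encoding.tokens)
```

The trained tokenizer can be serialized with `tokenizer.save("tokenizer.json")` and reused for the pretraining run.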
Using the LLäMmlein configurations, a small model (~10M parameters, decoder architecture) will be initialized and a short pretraining run will be conducted with the previously generated tokenizer (mixed precision, logging, checkpoints). Participants will then analyze the loss curve and throughput. Optional experiments include adjusting batch size, learning rate schedules, or dataset variants to study effects on stability and sample efficiency.
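The training step above can be sketched in PyTorch. `TinyDecoder`, its hyperparameters, and the random token batch are assumptions for illustration, not the actual LLäMmlein configuration; weight tying between the input embedding and the output head keeps the parameter count near the ~10M target.

```python
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    """Minimal decoder-only model (~10M parameters with tied embeddings)."""
    def __init__(self, vocab_size=32000, d_model=256, n_layers=4,
                 n_heads=4, max_len=256):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.head.weight = self.tok.weight  # weight tying

    def forward(self, idx):
        T = idx.size(1)
        h = self.tok(idx) + self.pos(torch.arange(T, device=idx.device))
        # Causal mask: each position may only attend to earlier positions.
        causal = torch.triu(
            torch.full((T, T), float("-inf"), device=idx.device), 1)
        return self.head(self.blocks(h, mask=causal))

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyDecoder().to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

# One mixed-precision step on a random stand-in batch (next-token loss).
batch = torch.randint(0, 32000, (2, 64), device=device)
with torch.autocast(device_type=device, enabled=(device == "cuda")):
    logits = model(batch[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), batch[:, 1:].reshape(-1))
scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()

n_params = sum(p.numel() for p in model.parameters())
torch.save({"model": model.state_dict(), "loss": loss.item()}, "ckpt.pt")
```

In a real run, the loss of each step would be logged (e.g. to a CSV or TensorBoard) so that participants can plot and analyze the loss curve afterwards.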
Additionally, scaling techniques can be practiced on a small number of GPUs (e.g., 4–8).
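One common scaling technique is data parallelism with PyTorch's `DistributedDataParallel` (DDP). A minimal sketch, under assumptions: in a real job, `torchrun --nproc_per_node=4 train.py` starts one process per GPU and sets `RANK`, `WORLD_SIZE`, and `MASTER_ADDR`; the environment defaults and the tiny linear model below only make the sketch runnable as a single CPU process.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun normally provides these; defaults allow a single-process demo.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

# "gloo" runs on CPU; real multi-GPU jobs use the "nccl" backend.
dist.init_process_group(backend="gloo")

model = torch.nn.Linear(16, 16)  # stand-in for the mini-LLM
ddp_model = DDP(model)  # gradients are all-reduced across ranks

out = ddp_model(torch.randn(4, 16))
out.sum().backward()  # backward triggers the gradient synchronization

world = dist.get_world_size()
dist.destroy_process_group()
```

With DDP, each rank processes a different shard of the data (typically via `DistributedSampler`), so the effective batch size grows with the number of GPUs.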