How to train your own Large Language Model?
We cordially invite you to the workshop "How to train your own LLM?", part of the NHR SW LLM Workshop series organized by Saarland University. The workshop will be held both online and in person on December 3rd from 1 pm to 5 pm at Goethe University Frankfurt, Uni Campus Riedberg.
Register here: link
This workshop covers the key technical steps toward creating your own language model: selecting a dataset, training a suitable tokenizer for the target language, initializing a model with appropriate training configurations, and conducting an initial pretraining run. We will use the LLäMmlein resources (dataset, reference models, training configurations) as a basis, with the goal of creating a small, functional mini-LLM, including basic metrics, checkpoints, and the ability to compare different training parameters.
Introductory Talk – How Are Large Language Models Trained?
A brief overview of the typical components of LLM training: data collection and filtering, tokenizer design, model architecture and scaling, training infrastructure (HPC/GPU clusters), optimization and scheduling strategies, and evaluation metrics. The focus will be on best practices, common challenges (data quality, stability, cost), and how the LLäMmlein workflow fits into the broader landscape of modern LLM development.
Block 1 – Tokenizer
by Israel A. Azime and Paloma García de Herreros García
Participants will work with a sample of the German LLäMmlein dataset and train a German-specific SentencePiece/BPE tokenizer. After training, a brief token analysis will be conducted using various German validation texts.
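As a rough illustration of this step, the sketch below trains a BPE tokenizer with the SentencePiece Python package and runs a quick token count on a validation sentence. The file names, vocabulary size, and example text are placeholders, not the workshop's actual settings.

```python
import sentencepiece as spm

# Train a BPE tokenizer on a (hypothetical) German text sample,
# one sentence or document per line.
spm.SentencePieceTrainer.train(
    input="llaemmlein_sample_de.txt",  # placeholder path to the dataset sample
    model_prefix="german_bpe",         # writes german_bpe.model / german_bpe.vocab
    vocab_size=32000,                  # illustrative size; the workshop may use a different value
    model_type="bpe",
    character_coverage=1.0,            # keep full coverage for German umlauts and ß
)

# Quick token analysis on a validation sentence.
sp = spm.SentencePieceProcessor(model_file="german_bpe.model")
text = "Die Universität des Saarlandes veranstaltet einen LLM-Workshop."
pieces = sp.encode(text, out_type=str)
print(pieces)
print(f"{len(pieces)} tokens for {len(text)} characters "
      f"({len(text) / len(pieces):.2f} chars/token)")
```

The same kind of chars-per-token comparison can then be repeated across different validation texts to see how well the vocabulary covers German.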
Block 2 – Mini-LLM Pretraining
by Israel A. Azime and Paloma García de Herreros García
Using the LLäMmlein configurations, a small model (~10M parameters, decoder-only architecture) will be initialized and a short pretraining run will be conducted with the previously trained tokenizer (mixed precision, logging, checkpoints). Participants will then analyze the loss curve and training throughput. Optional experiments include adjusting the batch size, learning rate schedule, or dataset variant to study the effects on stability and sample efficiency.
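For orientation, here is a minimal sketch of what such a run could look like using Hugging Face Transformers and plain PyTorch. The hyperparameters, dummy batches, and checkpoint paths are illustrative assumptions, not the LLäMmlein training setup itself.

```python
import torch
from transformers import LlamaConfig, LlamaForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"

# Small decoder-only configuration, roughly in the ~10M-parameter range
# (illustrative values, not the actual LLäMmlein settings).
config = LlamaConfig(
    vocab_size=32000,            # should match the tokenizer trained in Block 1
    hidden_size=192,
    intermediate_size=768,
    num_hidden_layers=4,
    num_attention_heads=4,
    max_position_embeddings=512,
    tie_word_embeddings=True,    # share input/output embeddings to keep the model small
)
model = LlamaForCausalLM(config).to(device)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=3e-4, total_steps=200, pct_start=0.1  # warmup + decay schedule
)

model.train()
for step in range(200):
    # Dummy batch of token ids (batch_size x sequence_length); in the workshop
    # this would come from the tokenized LLäMmlein sample instead.
    batch = torch.randint(0, config.vocab_size, (8, 512), device=device)

    # Mixed-precision forward/backward pass; labels=input_ids gives the
    # standard next-token-prediction (causal LM) loss.
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        loss = model(input_ids=batch, labels=batch).loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()

    if step % 20 == 0:
        print(f"step {step:4d}  loss {loss.item():.3f}")    # basic logging
    if step > 0 and step % 100 == 0:
        model.save_pretrained(f"checkpoints/step_{step}")   # periodic checkpoint
```

The optional experiments mentioned above amount to varying the batch shape, the OneCycleLR settings, or the input data in this loop and comparing the resulting loss curves and tokens-per-second throughput.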
Additionally, Block 2 offers the opportunity to practice scaling techniques on a small number of GPUs (e.g., 4–8).
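One common way to spread such a run over several GPUs is data parallelism. The sketch below shows a PyTorch DistributedDataParallel variant of the pretraining loop, launched with torchrun; the configuration mirrors the previous sketch and is again only an assumed setup, not the workshop's exact scaling recipe.

```python
# Launch with, e.g.:  torchrun --nproc_per_node=4 pretrain_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import LlamaConfig, LlamaForCausalLM

def main():
    # torchrun starts one process per GPU and sets LOCAL_RANK, WORLD_SIZE, etc.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    config = LlamaConfig(
        vocab_size=32000, hidden_size=192, intermediate_size=768,
        num_hidden_layers=4, num_attention_heads=4,
        max_position_embeddings=512, tie_word_embeddings=True,
    )
    model = LlamaForCausalLM(config).to(device)
    model = DDP(model, device_ids=[local_rank])  # averages gradients across ranks

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    model.train()
    for step in range(100):
        # Each rank would normally read a different shard of the tokenized
        # dataset (e.g. via DistributedSampler); dummy token ids stand in here.
        batch = torch.randint(0, config.vocab_size, (8, 512), device=device)
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            loss = model(input_ids=batch, labels=batch).loss
        loss.backward()          # DDP all-reduces the gradients during backward
        optimizer.step()
        optimizer.zero_grad()
        if local_rank == 0 and step % 20 == 0:
            print(f"step {step:3d}  loss {loss.item():.3f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

With 4–8 GPUs this setup multiplies the effective batch size by the number of ranks, which is exactly the kind of throughput-versus-stability trade-off the workshop encourages participants to measure.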