Train your own Large Language Model using HPC: NHR Süd-West
Previous participation in the “How to Train Your Own Large Language Model Using HPC” workshop is required.
This 4-week hands-on workshop covers the key technical steps toward creating your own language model: selecting a dataset, training a suitable tokenizer for the target language, initializing a model with appropriate training configurations, and conducting an initial pretraining run. We will use the LLäMmlein resources as the basis (dataset, reference models, training configurations), with the goal of creating a small, functional mini-LLM, including basic metrics, checkpoints, and the ability to compare different training parameters.
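To make the tokenizer step concrete, here is a minimal, self-contained sketch of byte-pair-encoding (BPE) merge learning, the algorithm behind most LLM tokenizers. This toy `train_bpe` function is illustrative only; the workshop will use proper tokenizer tooling rather than this simplified version:

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int):
    """Learn BPE merge rules from a whitespace-split corpus (toy illustration)."""
    # Represent each word as a tuple of characters, weighted by word frequency.
    vocab = Counter()
    for word in corpus.split():
        vocab[tuple(word)] += 1
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the vocabulary.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair everywhere it occurs.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges
```

For example, on the corpus `"low low low lower lowest"` the first learned merges build up the frequent subword `low` character pair by character pair.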
Additionally, scaling techniques can be practiced using a few GPUs (e.g., 4–8).
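As a rough illustration of the multi-GPU scaling mentioned above, a training script can be launched with one process per GPU using PyTorch's `torchrun`. The script name `pretrain.py` and its role are placeholders, not part of the workshop material:

```shell
# Single node, 4 GPUs: spawn one worker process per GPU.
# `pretrain.py` is a hypothetical training entry point.
torchrun --standalone --nproc_per_node=4 pretrain.py
```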
Week 1: Dataset exploration, cleaning, and processing for pretraining
Week 2: Tokenizer training and evaluation
Week 3: Pretraining LLMs
Week 4: Evaluation and publishing
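The pretraining-and-evaluation cycle in weeks 3 and 4 can be sketched at toy scale with a count-based bigram language model: fit on a corpus, then measure average negative log-likelihood (the same loss tracked when pretraining a real LLM). The function names here are made up for illustration:

```python
import math
from collections import Counter, defaultdict

def train_bigram(text: str):
    """Fit a character-bigram model: P(next_char | current_char)."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return {
        a: {b: n / sum(c.values()) for b, n in c.items()}
        for a, c in counts.items()
    }

def avg_nll(model, text: str) -> float:
    """Average negative log-likelihood in nats; unseen bigrams get a small floor probability."""
    nll = 0.0
    for a, b in zip(text, text[1:]):
        p = model.get(a, {}).get(b, 1e-6)
        nll -= math.log(p)
    return nll / (len(text) - 1)
```

Evaluating `avg_nll` on held-out text, rather than the training text, is the toy analogue of the validation-loss curves used to compare training parameters.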
Reference Repositories:
- LLäMmlein - https://github.com/LSX-UniWue/LLaMmlein
- LlamaFactory - https://github.com/hiyouga/LlamaFactory