Train your own Large Language Model using HPC: NHR Süd-West

Previous participation in the “How to Train Your Own Large Language Model Using HPC” workshop is required.

This 4-week hands-on workshop covers the key technical steps toward creating your own language model: selecting a dataset, generating a suitable tokenizer for the target language, initializing a model with appropriate training configurations, and conducting an initial pretraining run. We will use the LLäMmlein resources (dataset, reference models, training configurations) as a basis, with the goal of creating a small, functional mini-LLM, including basic metrics, checkpoints, and the ability to compare different training parameters.

Additionally, scaling techniques can be practiced using a few GPUs (e.g., 4–8).
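As an illustration of the kind of multi-GPU launch involved, a data-parallel training run on a single node with 4 GPUs might be started with PyTorch's `torchrun`; the script name `pretrain.py` is a placeholder, not part of the LLäMmlein materials:

```shell
# launch one worker process per GPU on a single node (sketch; pretrain.py is hypothetical)
torchrun --standalone --nproc_per_node=4 pretrain.py
```

On an HPC cluster, this command would typically be wrapped in a batch script for the site's scheduler (e.g., Slurm).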

Week 1: Dataset exploration, cleaning, and processing for pretraining
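To give a flavor of the cleaning step, here is a minimal sketch of a corpus filter that drops very short documents and exact duplicates. This is a toy stand-in for real filtering pipelines; the function name and thresholds are illustrative, not taken from the workshop materials:

```python
def clean_corpus(docs, min_chars=200):
    """Toy cleaning pass: drop short documents and exact duplicates.

    Real pretraining pipelines add many more filters (language ID,
    quality heuristics, near-duplicate detection, PII removal).
    """
    seen = set()
    cleaned = []
    for doc in docs:
        text = doc.strip()
        # skip documents that are too short or already seen verbatim
        if len(text) < min_chars or text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned
```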

Week 2: Tokenizer training and evaluation
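To illustrate what tokenizer training involves, the following is a minimal, self-contained sketch of byte-pair-encoding (BPE) merge learning in plain Python. It follows the classic BPE training procedure; production tokenizers (e.g., the Hugging Face `tokenizers` library) are far more efficient, and the function names here are illustrative:

```python
import re
from collections import Counter

def pair_counts(words):
    # count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, words):
    # replace the pair's two symbols with one merged symbol in every word;
    # the lookarounds keep the match aligned to symbol boundaries
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), w): f for w, f in words.items()}

def train_bpe(corpus, num_merges):
    # represent each word as space-separated characters plus an end marker
    words = Counter(" ".join(word) + " </w>" for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(best, words)
        merges.append(best)
    return merges
```

The learned merge list, applied in order, defines how unseen text is segmented into subword tokens.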

Week 3: Pretraining LLMs

Week 4: Evaluation and publishing
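One of the basic metrics mentioned above is perplexity, commonly reported when evaluating pretrained language models. A minimal sketch of the computation from per-token negative log-likelihoods (the function name is illustrative):

```python
import math

def perplexity(neg_log_likelihoods):
    """Perplexity = exp(mean per-token negative log-likelihood).

    Lower is better; a perplexity of k roughly means the model is as
    uncertain as a uniform choice among k tokens at each step.
    """
    return math.exp(sum(neg_log_likelihoods) / len(neg_log_likelihoods))
```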

Reference Repositories:

  • LLäMmlein - https://github.com/LSX-UniWue/LLaMmlein
  • LlamaFactory - https://github.com/hiyouga/LlamaFactory

Register here

by Israel A. Azime and Paloma García-de-Herreros