LLM Pre-training and Fine-tuning

I designed and delivered this technical workshop to give engineering teams a comprehensive understanding of how large language models are built from scratch. As LLM deployment accelerates across organizations, understanding the complete training pipeline has become critical for making informed decisions about model selection, computational resources, and custom training approaches.
The workshop opens with the evolution from encoder-only models like BERT to today's decoder-only architectures that power ChatGPT and LLaMA. I walk through the complete pre-training pipeline, starting with massive data collection from web crawls, books, and code repositories, then diving deep into the often-overlooked data cleaning phase, which typically shrinks raw datasets by roughly 90% through language detection, quality filtering, deduplication, and PII removal.
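
To give a feel for what that cleaning phase looks like in code, here is a minimal sketch of a filtering pass. The helper heuristics and thresholds are illustrative stand-ins, not the rules of any real dataset; production pipelines use a trained language-ID model, perplexity- or classifier-based quality filters, fuzzy deduplication, and far richer PII patterns.

```python
import hashlib
import re

def looks_english(text: str) -> bool:
    # Crude stand-in for a real language-ID model: require a high ratio of ASCII letters.
    letters = sum(c.isascii() and c.isalpha() for c in text)
    return len(text) > 0 and letters / len(text) > 0.6

def passes_quality(text: str) -> bool:
    # Toy quality filter: minimum length, and not dominated by repeated lines.
    lines = [line for line in text.splitlines() if line.strip()]
    return len(text) > 200 and len(set(lines)) / max(len(lines), 1) > 0.5

def scrub_pii(text: str) -> str:
    # Redact obvious e-mail addresses; real pipelines cover many more patterns.
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "<EMAIL>", text)

def clean_corpus(docs):
    seen = set()
    for doc in docs:
        if not looks_english(doc) or not passes_quality(doc):
            continue
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()  # exact-duplicate check
        if digest in seen:
            continue
        seen.add(digest)
        yield scrub_pii(doc)
```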
The core technical section covers causal language modeling through next-token prediction, showing how models progress from random tokens to coherent text. Using real examples like StarCoder and LLaMA-2, I demonstrate the computational scale involved and break down the memory requirements that push even a 7B model past 140GB of GPU memory once optimizer states, gradients, and activations are counted alongside the parameters themselves.
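
That 140GB figure falls out of simple arithmetic. Here is the back-of-the-envelope version, assuming standard Adam with mixed precision; the per-parameter byte counts are the commonly cited breakdown, and the activation budget is a placeholder that in practice depends on batch size, sequence length, and checkpointing.

```python
params = 7e9  # a 7B-parameter model

# Bytes per parameter under mixed-precision training with Adam:
fp16_weights = 2 * params  # working copy of the weights
fp16_grads   = 2 * params  # gradients
fp32_master  = 4 * params  # full-precision master weights kept by the optimizer
adam_moment1 = 4 * params  # Adam first moment (m)
adam_moment2 = 4 * params  # Adam second moment (v)

state = fp16_weights + fp16_grads + fp32_master + adam_moment1 + adam_moment2
print(f"weights + grads + optimizer state: {state / 1e9:.0f} GB")  # ~112 GB

activations = 30e9  # placeholder; varies widely with batch size and sequence length
print(f"with activations:                  {(state + activations) / 1e9:.0f} GB")  # ~140 GB
```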
I cover the practical training challenges that make LLM development expensive: mixed-precision training to handle numerical instability, distributed training with sharding techniques such as ZeRO and FSDP, and the Chinchilla scaling laws, which showed that most models of the time were trained on far too little data for their size. These insights shifted the field toward longer training runs on smaller, more efficient models.
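
To give a flavor of the mixed-precision piece, here is a minimal PyTorch training step with a toy linear model standing in for a transformer; the shapes and hyperparameters are placeholders. The gradient scaler is what keeps small fp16 gradients from underflowing.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()  # toy stand-in for a transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()

for step in range(10):
    batch = torch.randn(8, 1024, device="cuda")
    target = torch.randn(8, 1024, device="cuda")

    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(batch), target)

    scaler.scale(loss).backward()  # scale the loss up before backpropagating
    scaler.step(optimizer)         # unscale gradients, skip the step on inf/NaN
    scaler.update()                # adjust the loss scale for the next step
```

Sharding approaches like ZeRO and FSDP layer on top of this, splitting parameters, gradients, and optimizer state across GPUs so that no single device has to hold the full memory breakdown shown earlier.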
The fine-tuning section explains why base models need supervised fine-tuning to become conversational assistants, contrasting raw next-token prediction with instruction-following behavior. I explore different approaches from domain-specific continued pre-training to multi-task instruction tuning, tracing the evolution from T5's task prefixes through modern imitation learning techniques that use larger models to generate training data for smaller ones.
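
To show what instruction tuning means at the data level, here is a small sketch of how a single supervised example can be assembled. The prompt template is an illustrative format rather than any specific model's chat template, and the code assumes a Hugging Face-style tokenizer; the key idea is that prompt tokens are masked out so the loss covers only the response.

```python
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"

def build_sft_example(tokenizer, instruction: str, response: str) -> dict:
    # Tokenize prompt and response separately so we know where the response starts.
    prompt_ids = tokenizer.encode(PROMPT_TEMPLATE.format(instruction=instruction))
    response_ids = tokenizer.encode(response, add_special_tokens=False)
    response_ids = response_ids + [tokenizer.eos_token_id]

    input_ids = prompt_ids + response_ids
    # -100 is the ignore index for cross-entropy, so prompt tokens contribute no loss:
    # the model is trained to predict only the response.
    labels = [-100] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}
```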
The workshop culminates in parameter-efficient fine-tuning (PEFT), and LoRA in particular, as the practical path for teams without massive budgets. I provide a detailed walkthrough of low-rank adaptation, showing how small adapter matrices can match full fine-tuning performance while training over 95% fewer parameters. The mathematical intuition, training process, and inference-time merging are all covered with concrete examples, along with practical guidance on hyperparameters and on when to reach for QLoRA, which applies LoRA on top of a quantized base model.
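
For those who want the LoRA mechanics in code, here is a minimal adapter layer; it is a sketch of the idea rather than a drop-in replacement for a library such as PEFT, and the rank and scaling defaults are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update W + (alpha / r) * B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the pre-trained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    def merge(self) -> nn.Linear:
        # At inference time the adapter can be folded into the frozen weight,
        # so the merged layer adds no extra latency.
        with torch.no_grad():
            self.base.weight += self.scale * (self.B @ self.A)
        return self.base
```

For a 4096-by-4096 projection, rank-8 adapters add roughly 65K trainable parameters against the 16.8M frozen ones, which is where the parameter savings come from.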
My goal is to demystify the engineering realities behind LLM development while providing teams with strategic knowledge for choosing between existing models and custom training approaches.