This article was originally published on IBM Developer.
Training and fine-tuning large language models (LLMs) is becoming a central requirement for modern AI applications. As these models grow in size—from billions to hundreds of billions of parameters—the demands on computational resources have increased dramatically. Fine-tuning such models on a single GPU is no longer realistic due to memory limitations and training inefficiencies.
Sharding is the process of splitting a model’s data or components across multiple devices—such as GPUs or nodes—so that the training workload is distributed. By dividing the model’s parameters, gradients, and optimizer states into smaller “shards,” each device only needs to manage a fraction of the total, making it possible to train models that would not otherwise fit in memory. Sharding also enables parallel training, which speeds up the process and improves scalability.
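To make this concrete, the short sketch below uses PyTorch's Fully Sharded Data Parallel (FSDP) wrapper, one common way to shard parameters, gradients, and optimizer states across GPUs. It is a minimal illustration under assumptions not stated here: a multi-GPU host, a placeholder Hugging Face model ID, and a launch via `torchrun` so the distributed process group can be initialized.

```python
# Minimal sharding sketch with PyTorch FSDP (illustrative only; real setups
# usually add an auto-wrap policy, mixed precision, and a full training loop).
import os
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

# Assumes launch via: torchrun --nproc_per_node=<num_gpus> train.py
torch.distributed.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = AutoModelForCausalLM.from_pretrained("gpt2")        # placeholder model ID
model = FSDP(model.cuda())        # parameters and gradients are sharded across ranks
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # optimizer state follows the shards
```

At rest, each rank stores only its own shard; full layers are materialized transiently during the forward and backward passes.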
In this article, we explore the importance of sharding for scalable LLM fine-tuning, describe various sharding strategies, and provide practical guidance based on industry-standard tools.
Why training and fine-tuning LLMs require sharding
Training LLMs involves handling substantial amounts of data and computation during each pass through the network. These passes are generally referred to as:
- Forward pass: When data flows through the model to generate predictions.
- Backward pass: When the model computes how far off its predictions were (the loss) and, through backpropagation, derives the gradients used to adjust its internal weights.
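As a concrete (and deliberately tiny) illustration, the PyTorch sketch below runs one such iteration; the linear model and random tensors are stand-ins introduced here, not part of the original article.

```python
import torch

model = torch.nn.Linear(768, 2)                   # tiny stand-in for an LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

inputs = torch.randn(8, 768)                      # a batch of 8 examples
labels = torch.randint(0, 2, (8,))

logits = model(inputs)                            # forward pass: produce predictions
loss = torch.nn.functional.cross_entropy(logits, labels)   # how wrong were they?
loss.backward()                                   # backward pass: gradients via backpropagation
optimizer.step()                                  # update weights from gradients + optimizer state
optimizer.zero_grad()
```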
Each training iteration requires tracking and updating several core components:
- Model parameters: These are the learnable weights of the neural network that determine the model’s behaviour. They are updated during training to minimize prediction errors.
- Gradients: These represent the rate of change of the loss with respect to each model parameter. Gradients are computed during the backward pass and guide how the model updates its parameters.
- Optimizer states: These are internal values maintained by optimization algorithms like Adam or SGD. They control how each parameter is updated based on its gradient and the history of previous updates; Adam, for instance, tracks running averages of gradients and squared gradients for every parameter.
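The following self-contained sketch (again a toy model introduced only for illustration) shows where each component lives in PyTorch after a single training step: gradients attach to the parameters themselves, while Adam keeps two additional tensors per parameter in the optimizer's state.

```python
import torch

model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.Adam(model.parameters())

# One training step so that gradients and optimizer states actually exist.
model(torch.randn(4, 1024)).sum().backward()
optimizer.step()

n_params = sum(p.numel() for p in model.parameters())        # model parameters
n_grads  = sum(p.grad.numel() for p in model.parameters())   # gradients (same size)
n_state  = sum(t.numel() for s in optimizer.state.values()   # Adam's exp_avg / exp_avg_sq
               for t in s.values() if torch.is_tensor(t))
print(n_params, n_grads, n_state)   # optimizer state is roughly twice the parameter count
```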
While inference can be managed on a single GPU using techniques like offloading or quantization, training requires all three of these components to reside in GPU memory simultaneously. With an optimizer like Adam, this multiplies the memory requirement several times over compared to inference, before activations are even counted. Without sharding, even relatively modest models (7B–13B parameters) can exceed the memory of a single high-end GPU.
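A back-of-envelope estimate makes the gap clear. The figures below assume mixed-precision training with Adam, using the commonly cited breakdown of roughly 16 bytes per parameter; they are illustrative rather than measured.

```python
params = 7e9                                 # a 7B-parameter model

fp16_weights = params * 2                    # bf16/fp16 copy used for forward/backward
fp16_grads   = params * 2                    # gradients in the same precision
fp32_master  = params * 4                    # fp32 master weights kept by the optimizer
adam_moments = params * 4 * 2                # Adam momentum + variance in fp32

total_gb = (fp16_weights + fp16_grads + fp32_master + adam_moments) / 1e9
print(f"~{total_gb:.0f} GB before activations")   # ~112 GB, well beyond a single 80 GB GPU
```

Sharding that state across, say, eight GPUs brings the per-device share down to roughly 14 GB plus activations, which is the core idea behind ZeRO-style training.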
Moreover, sharding enables:
- Larger batch sizes, improving convergence and model generalization.
- Distributed compute workloads, reducing training time.
- Better scalability across infrastructure.
Continue reading on IBM Developer to see a DeepSpeed ZeRO example of scalable fine-tuning...