The generative AI boom is putting immense pressure on infrastructure teams to optimize compute and storage for training large language models (LLMs). In fact, the US AI infrastructure market size alone was valued at US$ 42.0 billion in 2023 and is projected to grow at a CAGR of 25.6% from 2024 to 2030. This rapid growth demands scalable, high-performance I/O configurations that can handle the unique needs of LLM workloads.
Large language models are, at their core, algorithms that consume vast amounts of data to learn statistical relationships between tokens represented as vector embeddings. Traditional AI models were trained on smaller, problem-specific datasets for hundreds of epochs. Due to the sheer scale of the data, however, LLM training is usually limited to single-digit epochs, constrained by high GPU costs and data-pipeline inefficiencies.
How Can Kubernetes PVC Be a Game Changer?
Kubernetes' Persistent Volume Claims (PVCs) offer a powerful mechanism to boost performance during LLM training and inference. They optimize storage, I/O throughput, and resource availability, making them critical in high-scale AI pipelines.
For LLMs to consume the embeddings, the data has to be loaded into memory or a buffer. At petabyte scale, I/O throttling bottlenecks are imminent, eating up GPU hours. Without proper checkpointing, training failures can result in the loss of intermediate progress, effectively resetting the model and incurring unnecessary compute costs. Persistent volumes excel at:
- Fast data loading: Backed by a high-performance storage class, PVCs deliver fast sequential and random reads, reducing data-load times and speeding up model iterations.
- Checkpointing: State and metadata can be persisted through periodic checkpoints, which can be used to restore training in the event of failure.
- Scalability: Kubernetes' de facto strength is scaling distributed training, where the load is shared across nodes while speeding up the process.
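To illustrate the checkpointing point above, here is a minimal sketch in Python. The file layout and state dictionary are hypothetical; in a Kubernetes pod, `checkpoint_dir` would live on the PVC mount (e.g. /data/checkpoints) so that it survives pod restarts.

```python
import json
import os
import tempfile

def save_checkpoint(state, checkpoint_dir, step):
    """Atomically persist training state so a crash never leaves a half-written file."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    final_path = os.path.join(checkpoint_dir, f"ckpt-{step}.json")
    # Write to a temp file on the same filesystem, then atomically rename.
    fd, tmp_path = tempfile.mkstemp(dir=checkpoint_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, final_path)  # atomic on POSIX filesystems
    return final_path

def load_latest_checkpoint(checkpoint_dir):
    """Restore the most recent checkpoint after a pod restart, if any exists."""
    try:
        ckpts = sorted(
            (p for p in os.listdir(checkpoint_dir) if p.startswith("ckpt-")),
            key=lambda p: int(p.split("-")[1].split(".")[0]),
        )
    except FileNotFoundError:
        return None
    if not ckpts:
        return None
    with open(os.path.join(checkpoint_dir, ckpts[-1])) as f:
        return json.load(f)
```

A real training loop would serialize the model and optimizer state (e.g. with PyTorch's own save utilities) rather than JSON, but the atomic write-then-rename pattern is the part that protects intermediate progress.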
Kubernetes PVC Config Flavors
We have established access to persistent volumes with high IOPS and checkpointing capabilities. The same capability can usually be applied in more than one way, each suited to different workloads. The overall efficiency and performance of LLM training can be boosted by leveraging the following Kubernetes PVC configuration flavors.
1. Local Caching on High-IOPS SSD
A high-performance SSD with high input/output operations per second (IOPS) is the crucial component of this flavor. Using the SSD, we cache the data locally before the LLM consumes it. This approach has two advantages: the data is readily available to the model, keeping GPU utilization near 100%, and the cache lets the forward and backward passes proceed without stalling on I/O.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-high-iops-pv
spec:
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: premium-ssd
  csi:
    driver: disk.csi.azure.com
    volumeHandle: llm-high-iops-disk
    volumeAttributes:
      iops: "20000"
      throughput: "500"
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - "us-west-2a"
We have provisioned a 1Ti volume using the premium SSD storage class, configured via driver-specific attributes for roughly 20,000 IOPS and ~500 MB/s of throughput. The node-affinity rule pins the volume to a single zone, keeping the disk close to the training node and reducing network latency.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-high-iops-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Ti
  storageClassName: premium-ssd
Now we can create the claim to bind the 1Ti high-performance volume. Training can then begin: a container packaged with a PyTorch transformer model runs on the GPU and mounts the PVC at /data for dataset loads and checkpointing.
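As a sketch, a training pod for this flavor might look like the following; the image and the /app/train.py entrypoint are placeholders, and only the claim name comes from the PVC above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-high-iops-training-pod
spec:
  containers:
    - name: trainer
      image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
      command: ["python", "/app/train.py"]
      resources:
        limits:
          nvidia.com/gpu: 1
      volumeMounts:
        - mountPath: "/data"   # datasets and checkpoints live here
          name: llm-storage
  volumes:
    - name: llm-storage
      persistentVolumeClaim:
        claimName: llm-high-iops-pvc
```

Because the PVC uses ReadWriteOnce, this pod must be scheduled onto a node in the same zone as the disk; the PV's node affinity handles that automatically.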
2. Distributed File System for Multi-Node Training
The true potential of Kubernetes lies in its distributed orchestration capabilities. Leveraged well, they reduce cost and improve performance. In this flavor, we use Kubernetes' distributed nature to train LLMs in parallel on shared storage.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-12345678
  directoryPerms: "700"
Using a shared storage class backed by the AWS EFS CSI driver enables parallel access from many nodes; with the file system set to elastic throughput mode, EFS scales automatically and can deliver aggregate throughput of around 3 GB/s.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-efs-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 2Ti
  storageClassName: efs-sc
The persistent volume claim requests 2Ti of storage with the ReadWriteMany access mode, so the elastic file system can be mounted simultaneously by pods on multiple distributed nodes.
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-distributed-training
spec:
  parallelism: 4
  completions: 4  # one completion per worker; "completions: 1" would run only a single pod
  completionMode: Indexed  # gives each worker a stable index for rank assignment
  template:
    spec:
      containers:
        - name: trainer
          image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
          command: ["mpirun", "--allow-run-as-root", "python", "/app/train.py"]
          resources:
            requests:
              memory: "64Gi"
              cpu: "16"
            limits:
              memory: "128Gi"
              cpu: "32"
              nvidia.com/gpu: 1
          volumeMounts:
            - mountPath: "/data"
              name: llm-storage
          env:
            - name: DATA_DIR
              value: "/data"
      volumes:
        - name: llm-storage
          persistentVolumeClaim:
            claimName: llm-efs-pvc
      restartPolicy: OnFailure
Each worker pod requests 64Gi of memory, 16 CPUs, and 1 GPU, with the PVC mounted at /data. The DATA_DIR environment variable points every training script at this shared directory so all workers read from the same dataset.
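For the workers to cooperate rather than duplicate effort, each one should process a disjoint shard of the dataset under DATA_DIR. A minimal sketch of deterministic file sharding is below; the RANK and WORLD_SIZE environment variable names are an assumption (launchers such as mpirun, torchrun, or an Indexed Job expose equivalent values under varying names).

```python
import os

def shard_files(files, rank, world_size):
    """Deterministically assign each worker a disjoint slice of the dataset files."""
    return [f for i, f in enumerate(sorted(files)) if i % world_size == rank]

def worker_shard(files):
    # Hypothetical env var names; the actual launcher decides what is set.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    return shard_files(files, rank, world_size)
```

Sorting before striping is what makes the assignment deterministic: every worker computes the same global order, so the shards are disjoint and together cover the whole dataset.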
3. NVMe-Backed Ephemeral Storage with Pre-Staging
Non-Volatile Memory Express (NVMe) is another excellent choice, providing blazing-fast I/O with low latency. Combined with ephemeral storage (storage that exists only for the lifecycle of the container/pod), it can further improve AI/ML workload performance.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-staging-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Ti
  storageClassName: fast-ssd
In this flavor, the fast-ssd claim above acts as the durable staging source, while an NVMe-backed ephemeral volume lives only as long as the pod and can deliver up to 3 GB/s of throughput. During pre-staging, we copy all the required data into this NVMe-backed ephemeral storage. Because the staged data then resides on the node's local disk, the model can access it with significantly lower latency.
apiVersion: v1
kind: Pod
metadata:
  name: llm-nvme-training-pod
spec:
  initContainers:
    - name: data-stager
      image: busybox
      command: ["sh", "-c", "cp -r /data-staging/* /data-nvme/"]
      volumeMounts:
        - mountPath: "/data-staging"
          name: llm-storage
        - mountPath: "/data-nvme"
          name: nvme-storage
  containers:
    - name: trainer
      image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
      command: ["python", "/app/train.py"]
      resources:
        requests:
          memory: "64Gi"
          cpu: "16"
        limits:
          memory: "128Gi"
          cpu: "32"
          nvidia.com/gpu: 1
      volumeMounts:
        - mountPath: "/data"
          name: nvme-storage
      env:
        - name: DATA_DIR
          value: "/data"
  volumes:
    - name: llm-storage
      persistentVolumeClaim:
        claimName: llm-staging-pvc
    - name: nvme-storage
      ephemeral:
        volumeClaimTemplate:
          metadata:
            name: nvme-claim
          spec:
            accessModes:
              - ReadWriteOnce
            storageClassName: local-nvme
            resources:
              requests:
                storage: 1Ti
  nodeSelector:
    disktype: nvme
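Because the init container's cp can fail partway (for example, if the ephemeral volume fills up), it is worth verifying the staged copy before training starts. A small sketch follows; the two directory arguments stand in for /data-staging and /data-nvme, and the verification step itself is an addition, not part of the manifest above.

```python
import hashlib
import os

def file_digest(path):
    """SHA-256 of a file, read in chunks so large shards don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_staging(src_dir, dst_dir):
    """True iff every file under src_dir exists in dst_dir with identical contents."""
    for root, _, names in os.walk(src_dir):
        for name in names:
            src = os.path.join(root, name)
            dst = os.path.join(dst_dir, os.path.relpath(src, src_dir))
            if not os.path.exists(dst) or file_digest(src) != file_digest(dst):
                return False
    return True
```

In practice this check could run as a second init container (or at the top of train.py), exiting nonzero so Kubernetes retries the pod rather than training on incomplete data.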
Closing Thoughts
Memory and storage are key components of AI training: data is continuously loaded onto and off of these resources to find patterns and serve inference. The scale and complexity involved make LLM training difficult to reason about end to end. Without proper storage provisioning, training performance can degrade significantly, leading to slower convergence and inefficient GPU utilization. This guide offers modern approaches to tackling those storage limitations, boosting performance with high IOPS.