Kubernetes PVC Flavors To Implement High-Performance I/O for LLM Training

The generative AI boom is putting immense pressure on infrastructure teams to optimize compute and storage for training large language models (LLMs). In fact, the US AI infrastructure market size alone was valued at US$ 42.0 billion in 2023 and is projected to grow at a CAGR of 25.6% from 2024 to 2030. This rapid growth demands scalable, high-performance I/O configurations that can handle the unique needs of LLM workloads.

Large language models are, at their core, algorithms that consume vast amounts of data to learn statistical relationships represented as vector embeddings. Traditional AI models were trained on smaller, problem-specific datasets for hundreds of epochs. At LLM scale, however, training is usually limited to single-digit epochs, constrained by high GPU costs and data-pipeline inefficiencies.

How Can Kubernetes PVC Be a Game Changer?

Kubernetes' Persistent Volume Claims (PVCs) offer a powerful mechanism to boost performance during LLM training and inference. They optimize storage, I/O throughput, and resource availability, making them critical in high-scale AI pipelines.

For an LLM to consume the embeddings, the data has to be loaded into memory or a buffer. At petabyte scale, I/O bottlenecks are inevitable, throttling the pipeline and eating up expensive GPU hours. Without proper checkpointing, a training failure can wipe out intermediate progress, effectively resetting the model and incurring unnecessary compute costs. Persistent volumes help in three ways:

  1. Backed by a high-performance storage class, PVCs deliver fast sequential and random reads, shortening data loads and keeping GPUs fed.

  2. Model state and metadata can be persisted as periodic checkpoints, which can restore training after a failure.

  3. Kubernetes' native scalability supports distributed training, where the load is shared across nodes to speed up the process.
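The checkpointing pattern in point 2 can be sketched in a few lines of Python. This is a minimal illustration, not the document's actual training code: the `/data/checkpoints` path assumes the PVC is mounted at `/data`, and the helper names are invented for the example. A real PyTorch run would use `torch.save`/`torch.load` the same way.

```python
import os
import pickle
import tempfile

# Assumed PVC mount path; adjust to your volumeMounts.
CHECKPOINT_DIR = "/data/checkpoints"

def save_checkpoint(state, step):
    """Atomically persist training state so a failed run can resume."""
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    path = os.path.join(CHECKPOINT_DIR, f"step-{step:08d}.pkl")
    # Write to a temp file first, then rename: a crash mid-write
    # never leaves a truncated checkpoint behind.
    fd, tmp = tempfile.mkstemp(dir=CHECKPOINT_DIR)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)
    return path

def latest_checkpoint():
    """Restore the most recent checkpoint, or None on a fresh start."""
    if not os.path.isdir(CHECKPOINT_DIR):
        return None
    files = sorted(f for f in os.listdir(CHECKPOINT_DIR) if f.endswith(".pkl"))
    if not files:
        return None
    with open(os.path.join(CHECKPOINT_DIR, files[-1]), "rb") as f:
        return pickle.load(f)
```

Because checkpoints land on the persistent volume rather than pod-local storage, they survive pod rescheduling and node failures.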

Kubernetes PVC Config Flavors

We now have access to persistent volumes with high IOPS and checkpointing capabilities. The same capability can usually be configured in more than one way, each suited to a different workload pattern. The following Kubernetes PVC configuration flavors can each boost the overall efficiency and performance of LLM training.

1. Local Caching on High-IOPS SSD

A high-performance SSD with high input/output operations per second (IOPS) is the crucial component of this flavor. Using the SSD, we cache the data locally before the LLM consumes it. This approach has two advantages: the data is readily available to the training process, keeping GPU utilization near 100%, and the low-latency reads let the model iterate and back-propagate much faster.
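The cache warm-up step could look like the following sketch. The paths and the `.bin` shard convention are illustrative assumptions, not part of the manifests below: the idea is simply to copy any missing shards from slow shared storage onto the high-IOPS SSD before the first epoch begins.

```python
import shutil
from pathlib import Path

# Illustrative paths: a slow shared dataset source and the
# high-IOPS SSD mount backing the PVC described below.
REMOTE_DATA = Path("/mnt/shared/dataset")
LOCAL_CACHE = Path("/data/cache")

def warm_cache(remote=REMOTE_DATA, cache=LOCAL_CACHE):
    """Copy missing shards onto the local SSD before training starts,
    so the data loader never waits on network storage mid-epoch."""
    cache.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(remote.glob("*.bin")):
        dst = cache / src.name
        # Skip shards that are already cached and complete.
        if not dst.exists() or dst.stat().st_size != src.stat().st_size:
            shutil.copy2(src, dst)
            copied.append(src.name)
    return copied
```

Re-running the warm-up is cheap: already-cached shards are skipped, so restarts do not repeat the full copy.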

apiVersion: v1
kind: PersistentVolume
metadata:
 name: llm-high-iops-pv
spec:
 capacity:
   storage: 1Ti
 accessModes:
   - ReadWriteOnce
 persistentVolumeReclaimPolicy: Retain
 storageClassName: premium-ssd
 csi:
   driver: disk.csi.azure.com 
   volumeHandle: llm-high-iops-disk
   volumeAttributes:
     iops: "20000" 
     throughput: "500" 
 nodeAffinity:
   required:
     nodeSelectorTerms:
     - matchExpressions:
       - key: topology.kubernetes.io/zone
         operator: In
         values:
         - "us-west-2a"

We have provisioned a 1TB volume using the premium SSD storage class, configured for roughly 20k IOPS and ~500 MB/s throughput. Because the disk is attached to nodes in a single zone (note the node affinity), reads avoid most network-storage latency.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
 name: llm-high-iops-pvc
spec:
 accessModes:
   - ReadWriteOnce
 resources:
   requests:
     storage: 1Ti
 storageClassName: premium-ssd

Now we can create a claim that binds to the 1TB high-performance volume. Training can then be initiated: a container packaged with a PyTorch transformer model runs on the GPU, mounting the PVC at /data for dataset loads and checkpointing.
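Before committing GPU hours, it can be worth sanity-checking that the mounted volume actually delivers something close to the provisioned throughput. A rough sequential-read probe (the file path is whatever dataset file sits under the mount; nothing here is specific to the manifests above):

```python
import os
import time

def measure_read_throughput(path, block=1 << 20):
    """Rough sequential-read benchmark: read a file in 1 MiB blocks
    and report MB/s. Run it against a file on the mounted PVC."""
    size = os.path.getsize(path)
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(block):
            pass
    elapsed = time.perf_counter() - start
    return size / elapsed / 1e6  # MB/s
```

If the number comes back far below the ~500 MB/s the disk was provisioned for, the bottleneck is worth finding before training starts, not after.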

2. Distributed File System for Multi-Node Training

The true potential of Kubernetes lies in its distributed orchestration capabilities. Leveraged well, they let organizations cut costs and improve performance across their workloads. In this flavor, we will use that distributed nature to train LLMs in parallel on shared storage.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
 name: efs-sc
provisioner: efs.csi.aws.com
parameters:
 provisioningMode: efs-ap
 fileSystemId: fs-12345678 
 directoryPerms: "700"
 throughputMode: elastic 

Using a shared storage class such as the AWS EFS CSI driver enables parallelism across nodes, with elastic throughput scaling into the gigabytes-per-second range as demand grows.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
 name: llm-efs-pvc
spec:
 accessModes:
   - ReadWriteMany 
 resources:
   requests:
     storage: 2Ti
 storageClassName: efs-sc

The persistent volume claim requests 2TB of storage with the ReadWriteMany access mode, so the elastic file system can be mounted simultaneously by pods running on different nodes.

apiVersion: batch/v1
kind: Job
metadata:
 name: llm-distributed-training
spec:
 parallelism: 4
 completions: 4
 template:
   spec:
     containers:
     - name: trainer
       image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
       command: ["mpirun", "--allow-run-as-root", "python", "/app/train.py"]
       resources:
         requests:
           memory: "64Gi"
           cpu: "16"
         limits:
           memory: "128Gi"
           cpu: "32"
           nvidia.com/gpu: 1
       volumeMounts:
       - mountPath: "/data"
         name: llm-storage
       env:
       - name: DATA_DIR
         value: "/data"
     volumes:
     - name: llm-storage
       persistentVolumeClaim:
         claimName: llm-efs-pvc
     restartPolicy: OnFailure

Each of the four worker pods requests 64GB RAM, 16 CPUs, and 1 GPU, with the PVC mounted at /data. The DATA_DIR environment variable points every training script at the shared directory.
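Because every worker sees the same files under DATA_DIR, each one needs a disjoint slice of the dataset. A common pattern, sketched here with invented names (the MPI rank variables are standard Open MPI environment variables, but the helper itself is illustrative):

```python
import os
from pathlib import Path

def shard_for_worker(data_dir, rank, world_size):
    """Deterministically assign each worker a disjoint slice of the
    shared dataset on the ReadWriteMany volume."""
    files = sorted(Path(data_dir).glob("*.bin"))
    # Round-robin split: worker r takes files r, r+world_size, ...
    return files[rank::world_size]

# Inside each pod, rank and world size would come from the MPI launcher:
#   rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", 0))
#   world = int(os.environ.get("OMPI_COMM_WORLD_SIZE", 1))
```

Sorting before slicing matters: it guarantees every worker computes the same global ordering, so the shards are disjoint and cover the whole dataset.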

3. NVMe-Backed Ephemeral Storage with Pre-Staging

Non-Volatile Memory Express (NVMe) is another excellent choice, providing blazing-fast I/O with low latency. Combined with ephemeral storage (storage that exists only for the lifecycle of the pod), it can significantly accelerate AI/ML workloads.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
 name: llm-staging-pvc
spec:
 accessModes:
   - ReadWriteOnce
 resources:
   requests:
     storage: 1Ti
 storageClassName: fast-ssd

In this flavor, the PVC above (fast-ssd) holds the persistent copy of the dataset, while the pod below adds an ephemeral volume backed by node-local NVMe capable of multiple GB/s of throughput. During pre-staging, an init container loads all the required data into this NVMe-backed ephemeral storage. Since the data then resides on the same node as the trainer, the model can access it with very low latency.

apiVersion: v1
kind: Pod
metadata:
 name: llm-nvme-training-pod
spec:
 initContainers:
 - name: data-stager
   image: busybox
   command: ["sh", "-c", "cp -r /data-staging/* /data-nvme/"]
   volumeMounts:
   - mountPath: "/data-staging"
     name: llm-storage
   - mountPath: "/data-nvme"
     name: nvme-storage
 containers:
 - name: trainer
   image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
   command: ["python", "/app/train.py"]
   resources:
     requests:
       memory: "64Gi"
       cpu: "16"
     limits:
       memory: "128Gi"
       cpu: "32"
       nvidia.com/gpu: 1
   volumeMounts:
   - mountPath: "/data"
     name: nvme-storage
   env:
   - name: DATA_DIR
     value: "/data"
 volumes:
 - name: llm-storage
   persistentVolumeClaim:
     claimName: llm-staging-pvc
 - name: nvme-storage
   ephemeral:
     volumeClaimTemplate:
       metadata:
         name: nvme-claim
       spec:
         accessModes:
           - ReadWriteOnce
         storageClassName: local-nvme
         resources:
           requests:
             storage: 1Ti
 nodeSelector:
   disktype: nvme
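The busybox `cp -r` in the init container does the job, but a small Python stager gives you idempotent restarts and a count of what was copied. This is a hypothetical drop-in alternative, assuming the same `/data-staging` and `/data-nvme` mounts as the pod spec above:

```python
import shutil
from pathlib import Path

def stage(src="/data-staging", dst="/data-nvme"):
    """Mirror the persistent dataset onto node-local NVMe before the
    trainer starts, skipping files that are already staged."""
    src, dst = Path(src), Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    staged = 0
    for item in src.rglob("*"):
        target = dst / item.relative_to(src)
        if item.is_dir():
            target.mkdir(parents=True, exist_ok=True)
        elif not target.exists():
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(item, target)
            staged += 1
    return staged
```

Because already-staged files are skipped, a restarted pod on the same node re-runs the init step in seconds instead of re-copying the full dataset.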

Closing Thoughts

Memory and storage are key components of AI training: data is continuously loaded and unloaded from these resources to find patterns and run inference. The scale and complexity of LLM training make it difficult to reason about end to end, and without proper storage provisioning, performance can degrade significantly, leading to slower convergence and inefficient GPU utilization. This guide has outlined modern, practical ways to tackle those storage limitations and deliver the IOPS the workload demands.
