Deploying Large Language Models on Kubernetes: Model Loading and Caching Strategies for vLLM
Running large language models on Kubernetes creates a challenge that traditional application deployments rarely encounter: how can tens or even hundreds of gigabytes of model weights be loaded into inference pods efficiently? At 16-bit precision, a model with 7 billion parameters needs around 14 GB of storage, while a model with 70 billion parameters can require more than 140 GB. At this scale, pod startup, autoscaling behavior, and storage architecture need to be planned very differently.
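The sizes quoted above follow from a simple rule of thumb: parameter count multiplied by bytes per parameter. The sketch below assumes 16-bit weights (2 bytes per parameter) and decimal gigabytes, and it ignores tokenizer files and other small artifacts. The function name is illustrative only.

```python
def estimate_weight_size_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate on-disk size of model weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    # 2 bytes per parameter corresponds to FP16/BF16 weights (an assumption).
    for label, params in [("7B", 7e9), ("70B", 70e9)]:
        size = estimate_weight_size_gb(params)
        print(f"{label}: roughly {size:.0f} GB at 16-bit precision")
```

Quantized checkpoints use fewer bytes per parameter, so passing a smaller `bytes_per_param` gives a correspondingly smaller estimate.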
This tutorial gives an introductory overview of common approaches for loading and caching model weights for vLLM pods on Kubernetes. The aim is to help you understand the available options, compare their tradeoffs, and choose a suitable architecture for your environment. This is not a complete deep dive, because topics such as model versioning, multi-tenant environments, and detailed performance tuning each deserve dedicated guidance. Treat this guide as a practical starting point for further evaluation.
Key Takeaways
Before looking at the individual strategies in detail, these are the most important points to keep in mind:
- Startup latency matters: Cold starts for larger models can take 10 minutes or more, which directly affects how quickly your system can scale and react to traffic peaks.
- Storage efficiency differs greatly: Some approaches duplicate model files for every pod, while others use shared storage. For models that are tens or hundreds of gigabytes in size, this has a major impact on cost.
- Control your model sources: Public model hubs are convenient during development, but production deployments should mirror required models to storage that you control in order to avoid external runtime dependencies.
- There is no universal best option: The right strategy depends on your scale, infrastructure, update frequency, and operational preferences.
- Test with realistic conditions: Performance can vary significantly depending on the provider, storage system, and configuration. Validate your chosen setup against your actual requirements.
Key vLLM Model Loading and Caching Considerations
Before comparing specific strategies, it is important to understand the main factors that influence the choice of a vLLM model loading and caching approach:
- Startup latency describes how much time passes before a pod can answer its first request. For autoscaling workloads, this determines how fast the system can respond to sudden traffic increases.
- Storage efficiency considers whether model weights are copied separately to each node or pod, or whether a single shared copy is used. Since model files can be very large, unnecessary duplication can quickly become expensive.
- Network bandwidth includes both external traffic to model registries, such as public model hubs that may rate-limit heavy download activity, and internal network traffic inside the Kubernetes cluster.
- Operational complexity refers to the number of moving parts in the setup. More components usually mean more possible failure points and more effort when troubleshooting problems.
- Scaling behavior answers what happens when a new node is added or when pods are rescheduled. A key question is whether every new node must download the complete model again from the beginning.
- Model update and rollback looks at how easily a new model version can be deployed or an older version can be restored. Some approaches make this simple, while others require more coordination.
- Performance characteristics can differ widely depending on infrastructure provider, storage backend, and configuration. Whichever strategy you choose, test it under realistic conditions to confirm that it meets your needs.
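To make the storage efficiency factor concrete, the following sketch compares how much capacity the same model consumes when it is copied per pod, per node, or kept once on shared storage. It is a back-of-the-envelope calculation under the assumption that every copy is a full, uncompressed copy of the weights; the function and key names are illustrative.

```python
def total_storage_gb(model_gb: float, pods: int, nodes: int) -> dict[str, float]:
    """Total capacity consumed by model weights under three placement schemes."""
    return {
        "per_pod": model_gb * pods,    # every replica keeps its own copy
        "per_node": model_gb * nodes,  # one copy per node, shared by its pods
        "shared": model_gb,            # a single copy on shared storage
    }


if __name__ == "__main__":
    # Example inputs (assumed): a 140 GB model, 16 replicas spread over 4 nodes.
    for scheme, gigabytes in total_storage_gb(140, pods=16, nodes=4).items():
        print(f"{scheme}: {gigabytes:.0f} GB")
```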
Model Sources: Understanding vLLM Model Source Options
Before choosing a loading strategy, you should decide where your models will come from. The model source affects reliability, security, and operational complexity:
- Public model hubs are often the easiest starting point. vLLM integrates natively with popular model repositories, which makes development straightforward. You can specify a model identifier and vLLM handles the download process. However, a public model hub is an external dependency that you do not control. If that external service has problems during a traffic spike or node failure, your cluster may be unable to scale or recover properly.
- Self-hosted object storage, such as S3-compatible storage, gives you more control. You download models once into your own storage environment, and your cluster pulls them from infrastructure that you manage. This requires an additional setup step, but it removes external dependencies during runtime.
- Shared filesystems, such as NFS-based storage or other ReadWriteMany PVC providers, allow you to create a central model repository that all pods in the cluster can access. In this setup, model files are downloaded once to the shared volume and then reused by all inference pods. This can significantly improve storage efficiency. Actual performance depends on the shared filesystem implementation and the underlying storage infrastructure.
- HTTP endpoints provide flexibility for custom environments, including internal model registries, artifact servers, or CDNs. KServe’s StorageInitializer supports HTTP sources in addition to object storage, making it easier to integrate model loading with existing infrastructure.
The production principle is simple: keep control over your model sources. For development and experimentation, downloading directly from public model hubs is convenient. For production, mirror the models you rely on to storage that you operate. An external service should not determine whether your system can scale up, replace a failed pod, or recover after a node issue.
Once the model sources are clear, the next step is to examine the strategies for loading these models into your pods.
Model Caching Strategy Survey
vLLM Native Download
With this method, vLLM downloads the model during startup. You configure vLLM with a model identifier or path, and it retrieves the model weights before it begins serving requests.
To local storage: Each pod downloads the model weights to local ephemeral storage, such as emptyDir, or to a pod-specific persistent volume. This is the simplest setup because there is no shared infrastructure to operate and no coordination between pods. vLLM’s native integration with public model repositories means this often works without much additional configuration.
The advantage is simplicity. For development, testing, or single-replica deployments, this approach can get you started quickly. The disadvantages appear at scale: every pod downloads the model independently, increasing bandwidth usage and placing more load on the model source. For larger models, cold starts can take 10 minutes or longer.
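The cold-start figure above is largely a bandwidth calculation. The sketch below estimates download time from model size and link speed. It assumes that downloads achieve a fixed fraction of line rate (the default of 0.7 is an illustrative guess, not a measurement) and that pods downloading at the same time split one bottleneck link equally.

```python
def download_minutes(
    model_gb: float,
    link_gbit_per_s: float,
    efficiency: float = 0.7,
    concurrent_pods: int = 1,
) -> float:
    """Estimated minutes for one pod to download model_gb gigabytes."""
    if link_gbit_per_s <= 0 or concurrent_pods < 1 or not 0 < efficiency <= 1:
        raise ValueError("link speed, pod count, and efficiency must be positive")
    # Convert gigabits to gigabytes, then share the link between concurrent pods.
    per_pod_gbyte_per_s = link_gbit_per_s / 8 * efficiency / concurrent_pods
    return model_gb / per_pod_gbyte_per_s / 60


if __name__ == "__main__":
    for pods in (1, 4, 16):
        minutes = download_minutes(140, link_gbit_per_s=10, concurrent_pods=pods)
        print(f"{pods} pod(s) downloading at once: about {minutes:.1f} min each")
```

Even in this simplified model, the per-pod download time grows linearly with the number of pods that start at once, which is why independent per-pod downloads become painful during scale-out.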
To shared storage: All pods can use a shared filesystem, such as NFS or a ReadWriteMany PVC, as their cache directory. The first pod downloads the model, and later pods find it already available in the shared cache.
This improves efficiency because the model is downloaded once and reused across pods. However, it introduces concurrency risks. If multiple pods start at the same time, they may all try to download the same model, which can lead to file corruption or race conditions. You need a mechanism that ensures only one pod writes to the cache at a time, such as file locks, readiness checks, or initially scaling to a single replica.
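One way to implement the file-lock option is an advisory lock on the shared volume combined with a completion marker. The sketch below is a minimal illustration, not a hardened implementation: the marker and lock file names are hypothetical, the `download` callable stands in for whatever tool fetches the weights, and it assumes the shared filesystem honors `flock`, which depends on the NFS version and mount options and should be verified for your storage.

```python
import fcntl
from pathlib import Path
from typing import Callable


def ensure_model_cached(cache_dir: Path, download: Callable[[Path], None]) -> bool:
    """Populate cache_dir once; return True if this caller performed the download."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    marker = cache_dir / ".complete"          # hypothetical "download finished" marker
    lock_path = cache_dir / ".download.lock"  # hypothetical lock file name

    with open(lock_path, "w") as lock_file:
        # Blocks until no other process holds the exclusive lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if marker.exists():
                return False  # another pod already completed the download
            download(cache_dir)
            marker.touch()    # written last, so an interrupted download is retried
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

A pod would call this before starting vLLM, for example from an entrypoint wrapper. Pods that lose the race wait on the lock and then find the marker instead of starting a second download.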
For shared storage setups, the central job approach described below is usually more robust. It separates the download process from pod startup entirely, avoids concurrency problems, and gives you explicit control over when models are populated.
Init Container Download
Init containers allow you to separate the download logic from the inference runtime. An init container runs before the main vLLM container and downloads the model to a shared volume, usually an emptyDir. After the download is complete, the main container starts with the model files already available.
This separation can be useful. The init container can use specialized download tools, implement retry behavior, or access private registries with credentials that the inference container does not need. The main vLLM container then starts with the model already in place, which simplifies its configuration.
Several tools and platforms can support this pattern:
Custom init containers using tools such as huggingface-cli, s3cmd, cloud provider CLIs, or similar utilities can download models before the main container starts. This gives you full control over the download process. For example, you could use an S3-compatible CLI to fetch model files from your own object storage. This approach allows you to customize the logic for your requirements, but it also means that you need to build and maintain your own init container image.
Object storage and persistent volumes can also be used with this init container pattern. An init container can download model files from S3-compatible object storage directly into a shared persistent volume, so that a single copy of the weights lives on storage you operate. Because the init container fetches the weights before vLLM starts, the inference container finds them already in place, and pods that share the volume can avoid repeated downloads. This pattern works well in managed Kubernetes environments and keeps model storage and deployment under your control.
The vLLM Production Stack supports this workflow through its Helm chart. With an initContainer, models can be downloaded before the primary vLLM container is started. During this initialization phase, the deployment can also mount a PVC. The chart allows different storage modes: ReadWriteOnce for single-node setups and ReadWriteMany for multi-node deployments where pods must access shared storage across nodes. In horizontally scaled production environments, shared ReadWriteMany storage, for example NFS, is typically the better choice. It is also useful to pre-fill the shared storage so that multiple pods do not try to download the same model at startup.
The drawback is that, unless shared storage is used, an init container still downloads the model separately for each pod. The extra container also makes the pod specification more complex, but the clearer separation between initialization and runtime can be worthwhile.
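The init container pattern can be sketched as a pod manifest built in Python and printed as JSON, which kubectl accepts alongside YAML. Everything specific here is an assumption for illustration: the bucket URI, the resource names, the `amazon/aws-cli` image for downloading, a `vllm/vllm-openai` image that has the `vllm` CLI on its PATH, and a cluster that exposes GPUs as `nvidia.com/gpu`. Credentials and any `--endpoint-url` setting for S3-compatible storage are omitted.

```python
import json


def vllm_pod_with_init_download(model_uri: str, model_dir: str = "/models/model") -> dict:
    """Pod manifest: an init container syncs weights into an emptyDir, then vLLM serves them."""
    volume_mount = {"name": "model-store", "mountPath": "/models"}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "vllm-init-download"},  # hypothetical name
        "spec": {
            "volumes": [{"name": "model-store", "emptyDir": {}}],
            "initContainers": [{
                "name": "fetch-model",
                "image": "amazon/aws-cli",  # assumed image; its entrypoint is `aws`
                "args": ["s3", "sync", model_uri, model_dir],
                "volumeMounts": [volume_mount],
            }],
            "containers": [{
                "name": "vllm",
                "image": "vllm/vllm-openai",  # assumed image
                "command": ["vllm", "serve", model_dir],  # serve from the local path
                "volumeMounts": [volume_mount],
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            }],
        },
    }


if __name__ == "__main__":
    # "s3://example-models/llama" is a placeholder bucket path.
    print(json.dumps(vllm_pod_with_init_download("s3://example-models/llama"), indent=2))
```

Both containers mount the same volume, so the main container only starts once the init container has exited successfully and the files are in place. Replacing the `emptyDir` with a PVC turns this into the shared-storage variant.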
Job-Based Pre-Population
Instead of downloading models during pod startup, dedicated Kubernetes Jobs can pre-populate storage before inference pods are started.
Per-Node Job
With this approach, a Job or DaemonSet downloads models to node-local storage, typically a hostPath volume on the node’s local SSD, before inference pods are scheduled onto that node.
The main advantage is performance. Once the model has been downloaded, pods on that node can load it from fast local storage. There is no network filesystem bottleneck and no dependency on shared storage. The download happens once per node, regardless of how many pods run there.
The complexity is in coordination. Inference pods need to know whether the model is already available on a node. This requires scheduling constraints such as node selectors, taints and tolerations, or custom logic to ensure that pods are only scheduled on nodes where the download Job has finished. hostPath volumes also have security implications that must be evaluated. In addition, you need a strategy for cleaning up local storage when nodes are removed.
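A node selector is the simplest of these scheduling constraints. The sketch below assumes that the per-node download Job labels a node once the weights are in place; the label key and volume name are hypothetical, and the function only edits a pod spec dictionary rather than talking to the cluster.

```python
import copy

CACHE_LABEL = "models.example.com/llama-70b-cached"  # hypothetical node label


def pin_to_cached_nodes(pod_spec: dict, host_dir: str, mount_path: str = "/models") -> dict:
    """Return a copy of pod_spec restricted to labelled nodes, with the
    node-local model directory mounted read-only into every container."""
    spec = copy.deepcopy(pod_spec)
    spec.setdefault("nodeSelector", {})[CACHE_LABEL] = "true"
    spec.setdefault("volumes", []).append({
        "name": "node-model-cache",
        # type "Directory" requires the path to exist on the node already.
        "hostPath": {"path": host_dir, "type": "Directory"},
    })
    for container in spec.get("containers", []):
        container.setdefault("volumeMounts", []).append({
            "name": "node-model-cache",
            "mountPath": mount_path,
            "readOnly": True,
        })
    return spec
```

Mounting the directory read-only limits what an inference pod can do to the node's filesystem, which addresses part of the hostPath security concern but does not remove it.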
If you want a Kubernetes-native implementation of this approach without building all the tooling yourself, KServe’s LocalModelCache, also known as the LocalModel CRD, is designed for this use case. It pre-caches models on nodes: the controller manages the download jobs and handles node affinity automatically. If you are evaluating the per-node strategy, LocalModelCache is worth serious consideration.
Central Job to Shared Storage
Another option is to use a single Job to download models to shared storage, such as an NFS-backed PVC, which is then accessed by all pods.
This centralizes model management. A single download prepares the model storage for the whole cluster. It scales to any number of pods and makes it easier to manage several model versions in one central location.
The tradeoff is reliance on shared storage performance. If the shared storage cannot provide enough read throughput, model loading can become a bottleneck. You also need a lifecycle strategy for storage, including cleanup of older model versions and capacity management.
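A central pre-population Job might look like the following sketch, again built as a Python dictionary and printed as JSON for kubectl. The Job name, PVC name, bucket URI, and the `amazon/aws-cli` image are assumptions for illustration, and the PVC is assumed to exist already with ReadWriteMany access. Credentials are omitted.

```python
import json


def model_download_job(source_uri: str, pvc_name: str, target_dir: str) -> dict:
    """batch/v1 Job that syncs one model from object storage into a shared PVC."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": "populate-model-cache"},  # hypothetical name
        "spec": {
            "backoffLimit": 3,  # retry a failed download a few times
            "template": {
                "spec": {
                    "restartPolicy": "OnFailure",
                    "volumes": [{
                        "name": "shared-models",
                        "persistentVolumeClaim": {"claimName": pvc_name},
                    }],
                    "containers": [{
                        "name": "download",
                        "image": "amazon/aws-cli",  # assumed image; entrypoint is `aws`
                        "args": ["s3", "sync", source_uri, target_dir],
                        "volumeMounts": [
                            {"name": "shared-models", "mountPath": "/models"}
                        ],
                    }],
                }
            },
        },
    }


if __name__ == "__main__":
    job = model_download_job(
        "s3://example-models/llama-70b",  # placeholder bucket path
        pvc_name="model-cache",           # placeholder PVC name
        target_dir="/models/llama-70b",
    )
    print(json.dumps(job, indent=2))
```

Using a separate target directory per model version keeps several versions side by side on the volume, which makes rollbacks a matter of pointing inference pods at a different path.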
Baking Models into Container Images
A seemingly simple option is to include model weights directly in the container image. The Dockerfile copies the model files into the image, and every pod receives them through the normal image pull process.
This approach is easy to understand. The deployment is self-contained and does not need external runtime dependencies. Images are immutable, so rollbacks can be as simple as deploying an earlier image tag.
However, the practical disadvantages are significant. Container images larger than 100 GB are difficult to handle because they are slow to build, slow to push, and slow to pull. Registry storage costs can rise sharply. Some container registries also enforce layer size limits or total image size limits that may make this approach impossible for larger models. Image pulls that take 20 minutes or more on cold nodes create scaling delays that are usually unacceptable for production workloads.
This approach should only be considered for very small models or development scenarios where image size is acceptable. For most production deployments, other strategies are more suitable.
CSI-Based Lazy Loading
Some teams use specialized CSI drivers that mount object storage directly and stream data on demand. Examples include JuiceFS and SeaweedFS.
These drivers expose object storage as a filesystem to pods. The advantage is transparency: the application does not need to know that model files are coming from object storage. Some implementations can begin serving before the full model has been downloaded by streaming weights as needed.
Performance can vary significantly depending on implementation and workload. This approach also adds infrastructure complexity because another component must be deployed, monitored, and debugged. For most teams, this is an advanced option that should be evaluated carefully against simpler alternatives. It may be the right fit for certain scenarios, but it is rarely the best starting point.
Comparison Summary
| Strategy | Cold Start Time | Storage Efficiency | Complexity | Pros | Cons | Best For | Example Tools / Techniques |
|---|---|---|---|---|---|---|---|
| vLLM native local download | Slow, often several minutes because each pod downloads the model | Low, because models are duplicated on each pod’s local disk | Low | Simplest setup, works closely with vLLM defaults, no shared storage required | Uses extra disk for every replica, causes slow multi-pod startup, and consumes significant bandwidth in larger clusters | Development, testing, one replica, or single-node production | vLLM --download-dir option pointing at storage local to each pod |
| vLLM native shared cache | Medium, because later pods can use the shared cache more quickly | High, because one shared copy can serve many pods | Medium | Fast for repeated deployments and simpler model access across multiple pods | Requires shared storage such as NFS or another ReadWriteMany PVC, can be slower if storage is shared with other applications, and may create race conditions during the first download | Small clusters and quick launches | Shared ReadWriteMany PVC or NFS mounts |
| Init container | Slow to medium, depending on storage and parallelization | Medium to high when combined with shared storage | Medium | Clean separation of download logic, repeatable workflow, flexible scripting and source options | Adds another container and pod specification complexity, still downloads per pod without shared storage, and needs coordination in larger clusters | CI/CD pipelines and setups that benefit from separation of concerns | Init containers with s3cmd, huggingface-cli, cloud storage CLIs, and shared PVCs |
| Per-node job | Fast once the model has been downloaded to the node | Medium, because there is one copy per node rather than one per pod | High | High throughput from node-local SSDs, reduced network hot spots, and no dependency on shared storage | Requires job coordination and node affinity, increases infrastructure complexity, needs hostPath storage, and introduces security considerations | Large clusters and high-performance workloads | KServe LocalModelCache, DaemonSets, Job plus hostPath |
| Central job | Medium, faster than per-pod downloads but usually slower than local node storage | High, because one copy can serve all pods in the cluster | Medium | Centralized management, easier version upgrades, one download, and no download race conditions between pods | Depends on shared storage performance and requires operational handling for cleanup and versioning | Clusters running many replicas or several models | Kubernetes Job to PVC or NFS, Argo Workflows, custom model-populator |
| Container image | Very slow, because pulling very large images can take 20 minutes or more on cold nodes | Low, because every image version and every node that pulls it holds a full copy of the weights | Low to medium | Simple in principle, straightforward rollback and version control through image tags, no runtime network dependency | Huge images, slow builds, pushes, and pulls, registry size limits, slow model updates, and poor scalability for large models | Tiny models, demos, prototypes, or air-gapped environments | Custom images with COPY and multi-stage Docker builds |
| CSI lazy loading | Varies depending on the driver and access pattern | High, because data can be streamed from object storage | High | Models can stream directly from external or object storage, disk usage can be reduced, and infrequent or random access can be efficient | Adds infrastructure to manage and debug, performance may be unpredictable under heavy load, tuning can be complex, and caching behavior varies | Advanced production scenarios with object storage | JuiceFS, Mountpoint for S3, SeaweedFS, Alluxio CSI |
Making the Decision: Selecting the Right Model Loading and Caching Strategy for vLLM on Kubernetes
Start with the simplest approach and only add complexity when necessary. vLLM’s native download to local storage may be sufficient for development or low-scale production. There is no need to build a complex caching architecture before you have confirmed that it is actually required.
Consider your scale. The decision changes significantly when moving from one replica to ten or more. A strategy that feels unnecessary for a single pod may become essential when dozens of pods run across multiple nodes.
Think about how often models change. If models are updated frequently during experimentation, centralized approaches that make model swaps easier can reduce operational effort. If a model is deployed and then left running for months, the simplicity of per-pod downloads may be more attractive than maximum efficiency.
Match the strategy to your existing infrastructure. If your team already uses object storage heavily, using it as the model source together with a central job approach may fit naturally. If you already operate reliable NFS infrastructure, shared storage may be the easiest path.
Whatever strategy you select, validate it under realistic conditions. Measure cold-start times, monitor bandwidth usage, and test behavior during scaling events. The best strategy is the one that satisfies your real operational requirements, not the one that looks strongest on paper.
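Cold-start time can be measured by polling a pod until it reports ready. The sketch below is a minimal harness that assumes vLLM's OpenAI-compatible server is reachable at a URL you supply, for example through a port-forward, and that its `/health` endpoint returns HTTP 200 once the model is loaded. The function names and default intervals are illustrative. The timer starts when the script starts, so launch it together with the pod to capture the full startup.

```python
import time
import urllib.request
from typing import Callable


def http_ready(url: str, timeout_s: float = 2.0) -> bool:
    """True if url answers with HTTP 200; any connection error counts as not ready."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError, timeouts, and refused connections
        return False


def seconds_until_ready(
    probe: Callable[[], bool],
    deadline_s: float = 1800.0,
    interval_s: float = 5.0,
) -> float:
    """Poll probe() until it returns True; return elapsed seconds or raise TimeoutError."""
    start = time.monotonic()
    while True:
        if probe():
            return time.monotonic() - start
        if time.monotonic() - start > deadline_s:
            raise TimeoutError(f"not ready after {deadline_s:.0f}s")
        time.sleep(interval_s)


if __name__ == "__main__":
    # Assumed address: a local port-forward to the vLLM pod on port 8000.
    elapsed = seconds_until_ready(lambda: http_ready("http://localhost:8000/health"))
    print(f"ready after {elapsed:.0f}s")
```

Running the same measurement for each candidate strategy, on cold and warm nodes, gives comparable numbers for the decision above.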
Frequently Asked Questions
How long does it take to load a vLLM model on Kubernetes?
Model loading time depends heavily on model size, network bandwidth, and the storage strategy. For a 7B parameter model of around 14 GB, the first startup from a public model hub can take 5 to 10 minutes. Larger 70B models of 140 GB or more can take 20 minutes or longer. Shared storage or pre-populated caches can reduce loading time for later pods to seconds, while node-local storage usually provides the fastest warm starts.
What is the best storage strategy for vLLM on Kubernetes?
The best strategy depends on scale and infrastructure. For a single-replica deployment, vLLM’s native download to local storage is the simplest option. For multiple replicas, shared storage such as NFS or ReadWriteMany PVCs combined with a central pre-population job often provides the best balance between efficiency and simplicity. For high-performance requirements using local SSDs, per-node jobs can provide the fastest warm starts. Baking models into container images should generally be avoided for anything larger than small development models.
How can race conditions be prevented when multiple pods start at the same time?
When using shared storage, only one pod should write to the cache at a time. Possible solutions include file locks, readiness mechanisms, or initially scaling to one replica to populate the cache. The central job approach is usually more reliable because it separates model download from pod startup entirely and avoids concurrency issues. For production deployments that require horizontal scaling, shared storage should be pre-populated before inference pods are deployed.
Can S3-compatible object storage be used for vLLM model storage?
Yes, S3-compatible object storage can work well as a model source for vLLM deployments. Models can be downloaded once to object storage, and the cluster can then pull them from there. This gives you control over model availability and removes external dependencies. The approach works well with init containers or job-based pre-population strategies.
What is the difference between init containers and job-based pre-population?
Init containers download models as part of pod startup and run before the main vLLM container starts. This keeps the download logic close to the deployment, but the download still happens per pod unless shared storage is used. Job-based pre-population uses dedicated Kubernetes Jobs to download models to storage before inference pods are started. This fully separates downloading from pod startup. Jobs are better suited for shared storage scenarios because they avoid race conditions and provide explicit control over when models are populated.
Conclusion
For vLLM on Kubernetes, model loading and caching do not have one universally correct solution. The best option depends on factors such as deployment size, available infrastructure, operational requirements, and how much complexity your team is willing to manage.
This overview introduced the core strategies, but further topics remain, including performance optimization, serving multiple models, CI/CD workflows, and production hardening. Treat this guide as a foundation, try the approaches that fit your environment, and improve the design based on practical results.
For concrete implementation guidance, refer to the vLLM documentation, the KServe storage and LocalModel resources, and the vLLM Helm chart examples. Since the community already provides many useful tools for these patterns, it is usually better to build on existing solutions before creating your own.


