Deploy vLLM on Kubernetes with Shared NFS Model Storage
Your vLLM pods may be downloading the same large model files every time they start.
If you run LLM inference on Kubernetes, you may have chosen the simple setup: configure vLLM to pull the model from Hugging Face during pod startup. That works, but it creates problems later:
- A pod fails at 2 AM. The replacement pod cannot serve any requests until it has spent several minutes downloading many gigabytes of model weights from Hugging Face.
- You need to scale out during a traffic increase. Every new pod downloads the model separately, uses the same network bandwidth, and slows down your ability to react to demand.
- Hugging Face is rate-limiting requests or has an outage. Your pods cannot start successfully.
A better approach is to download the model once into shared storage and let every pod load the model from there. This avoids repeated downloads, removes external runtime dependencies, and gives every new pod fast access to the model files.
In this guide, you will deploy vLLM on Kubernetes using managed NFS storage for model files.
The example uses one H100 GPU node to keep the setup easy to follow. However, the pattern can scale to as many nodes as required. That is the main benefit: once the model is stored on NFS, adding GPU capacity gives new pods direct model access instead of triggering another long download.
Key Takeaways
Avoid repeated downloads: Download LLM models once to shared NFS storage instead of fetching them every time a pod starts. This can reduce startup delays from minutes to seconds.
Enable faster scaling: New vLLM replicas can start serving traffic more quickly because they load models directly from NFS rather than waiting for multi-gigabyte downloads.
Reduce external dependencies: Keep models inside your own infrastructure so pod restarts and scaling events do not depend on Hugging Face availability.
Support concurrent access: NFS with ReadWriteMany access allows several pods to read the same model files at the same time, which is ideal for horizontal scaling.
Build a production-ready pattern: This storage approach works for many LLM deployments and can scale from a single-node setup to larger multi-GPU clusters.
The Problem: Why Downloading Models on Every Startup Hurts
Let’s look closely at the cost of downloading models during pod startup.
Model files are large. Mistral-7B-Instruct-v0.3, the model used in this guide, is around 15 GB. Larger models such as Llama 70B can exceed 140 GB. Whenever a pod starts and downloads the model from Hugging Face, that amount of data must travel over the internet again.
Every pod restart causes another download. Pods can crash. Nodes can be maintained or replaced. Deployments happen regularly. With the download-on-startup approach, all of these events trigger another complete model download. If an inference pod restarts three times in one day, the same model files are downloaded three times.
Scaling turns into a bandwidth bottleneck. When a Horizontal Pod Autoscaler adds replicas during a traffic spike, every new pod downloads the model at the same time. Three new pods mean three parallel multi-gigabyte downloads, all competing for bandwidth. Instead of immediately increasing serving capacity, your platform waits for downloads to finish.
Hugging Face becomes a runtime dependency. This is the less obvious risk. During normal operation, Hugging Face availability may not seem like an issue. But imagine a failure at 2 AM: a GPU node becomes unavailable, Kubernetes schedules a replacement pod, and Hugging Face is rate-limiting your IP address or experiencing issues. Your recovery from an infrastructure failure now depends on an external service outside your control.
The principle is simple: control your dependencies. External services such as Hugging Face should be used for the initial model acquisition, not as a required dependency every time a pod starts. Whether a pod starts because of scaling, deployment, or failure recovery, it should load from infrastructure you control.
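To put rough numbers on the bandwidth bottleneck described above, here is a minimal sketch that estimates how long each pod waits for its download. The 1 Gbit/s effective throughput is an illustrative assumption, not a measurement, and the function name is hypothetical. Real downloads also depend on Hugging Face throttling and protocol overhead.

```python
def download_seconds(model_gb: float, link_gbps: float, parallel_pods: int = 1) -> float:
    """Estimate seconds per pod to download a model when pods share one link."""
    if model_gb <= 0 or link_gbps <= 0 or parallel_pods < 1:
        raise ValueError("size, bandwidth, and pod count must be positive")
    per_pod_gbps = link_gbps / parallel_pods  # pods split the available bandwidth
    model_gigabits = model_gb * 8             # 1 byte = 8 bits
    return model_gigabits / per_pod_gbps


if __name__ == "__main__":
    # Assumed values: a 15 GB model and 1 Gbit/s of usable bandwidth.
    for pods in (1, 3):
        secs = download_seconds(model_gb=15, link_gbps=1.0, parallel_pods=pods)
        print(f"{pods} pod(s): about {secs / 60:.0f} min per download")
```

Under these assumptions, one pod waits about two minutes, and three pods starting together wait about six minutes each, because they share the same link.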
The Solution: Download Once and Run Inference Everywhere
The pattern is simple:
- Download once: A Kubernetes Job downloads the model from Hugging Face to an NFS share.
- Store on NFS: The model files remain available on managed NFS storage.
- Load from NFS: Each vLLM pod mounts the NFS share and loads the model directly from that path.
This solves the problems described above.
ReadWriteMany access: NFS allows multiple pods to read from the same storage location at the same time. Whether you run one replica or ten, they can all use the same model files.
Persistence: Model files remain available across pod restarts, node replacements, and cluster upgrades. You download the model once and keep it until you intentionally remove it.
In-region access without external runtime dependency: The NFS share is located in the same region as the Kubernetes cluster. Model loading happens through the private network, which is fast, reliable, and independent of third-party availability.
Managed infrastructure: The NFS service is operated by the infrastructure provider, so you do not have to maintain NFS servers yourself.
The scaling advantage is especially important. When you add a new GPU node later and deploy another vLLM replica, the pod can access the model immediately. It starts, mounts NFS, loads the model into GPU memory, and begins serving requests. Startup time is limited to model loading, not model downloading plus model loading.
With the per-pod download approach, adding a new replica means waiting several minutes for yet another download before the new capacity becomes usable.
Architecture Overview
The data flow has two separate phases.
One-time setup:
Hugging Face → Download Job → NFS Share
Every pod startup:
NFS Share → vLLM Pod → GPU Memory → Ready to serve
Compare that with the download-every-time pattern, where every startup includes the Hugging Face download:
Hugging Face → vLLM Pod → GPU Memory → Ready to serve
Prerequisites
Before you begin, you need the following:
Cloud Account with H100 GPU Quota
H100 GPUs usually require quota approval. If you do not already have access, request GPU capacity through your cloud provider’s control panel or support process.
Hugging Face Account with Model Access
This guide uses mistralai/Mistral-7B-Instruct-v0.3. It is a non-gated model, so no approval is required and you can start directly.
You also need a Hugging Face access token with read permissions.
kubectl Installed Locally
Install kubectl for your operating system by following the official Kubernetes installation instructions.
Basic Kubernetes Knowledge
You should be familiar with kubectl commands and basic Kubernetes concepts such as pods, deployments, services, persistent volumes, and persistent volume claims. If you are new to managed Kubernetes, start with your provider’s introductory Kubernetes guide.
Step 1: Set Up the Infrastructure
You need three infrastructure resources: a VPC, a Kubernetes cluster with a GPU node, and an NFS share. All resources must be created in the same region.
Select a Region
Not every cloud region provides both H100 GPU nodes and managed NFS storage. Choose a region where both services are available. Check your cloud provider’s control panel or documentation for the latest supported regions.
Create a VPC
The VPC provides private networking between the Kubernetes cluster and the NFS share.
- Navigate to the networking section of your cloud control panel.
- Create a new VPC.
- Select the same region you plan to use for the Kubernetes cluster and NFS share.
- Choose a descriptive name, for example vllm-vpc.
- Use the default IP range or define your own range.
- Create the VPC.
Create a Kubernetes Cluster
- Navigate to the Kubernetes section of your cloud control panel.
- Create a new cluster.
- Select the same region as the VPC.
- Choose the VPC you created earlier.
- Configure the node pools:
  - Management pool: Keep standard compute nodes for system workloads.
  - GPU node pool: Add a GPU node pool, select an H100 GPU node type, and set the node count to 1.
- Create the cluster and wait for provisioning to finish.
Connect kubectl to the Cluster
After the cluster is running, configure kubectl access using your cloud provider’s kubeconfig instructions.
Verify access:
kubectl get nodes
Expected output, with node names varying by environment:
NAME             STATUS   ROLES    AGE   VERSION
mgmt-xxxxx       Ready    <none>   5m    v1.34.1
mgmt-yyyyy       Ready    <none>   5m    v1.34.1
gpu-h100-zzzzz   Ready    <none>   5m    v1.34.1
You should see at least two types of nodes: management nodes and one GPU node.
Step 2: Create an NFS Share
This step creates the persistent storage location that will hold the model files.
- Navigate to the storage section of your cloud control panel.
- Create a new NFS share.
- Configure the share:
  - Name: Use a descriptive name such as llm-models.
  - Size: 100 GB is enough for this tutorial. If you need more capacity later, resize the NFS share.
  - VPC: Select the same VPC as the Kubernetes cluster.
- Create the NFS share and wait until its status becomes active.
Note the Mount Source
After the share becomes active, note the Mount Source value from the NFS overview. It contains the host IP address and mount path required for the Kubernetes configuration.
The Mount Source uses the format <HOST>:<PATH>, for example:
10.100.32.2:/2633050/7d1686e4-9212-420f-a593-ab544993d99b
You will split it into two values for the PersistentVolume configuration:
Host: The IP address before the colon, for example 10.100.32.2.
Path: Everything after the colon, for example /2633050/7d1686e4-9212-420f-a593-ab544993d99b.
This matters because the NFS share is now your persistent model library. Any model stored here can be accessed by every pod in the cluster now and in the future. You do not need to download it again when pods restart or when you scale up.
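If you prefer to script the split of the Mount Source, a minimal sketch could look like the following. The function and class names are illustrative. It assumes the host is an IPv4 address or hostname, because an IPv6 literal would contain additional colons.

```python
from typing import NamedTuple


class MountSource(NamedTuple):
    host: str
    path: str


def split_mount_source(mount_source: str) -> MountSource:
    """Split '<HOST>:<PATH>' at the first colon into host and path."""
    host, sep, path = mount_source.strip().partition(":")
    if not sep or not host or not path.startswith("/"):
        raise ValueError(f"expected <HOST>:<PATH>, got {mount_source!r}")
    return MountSource(host=host, path=path)


if __name__ == "__main__":
    src = split_mount_source("10.100.32.2:/2633050/7d1686e4-9212-420f-a593-ab544993d99b")
    print("server:", src.host)
    print("path:", src.path)
```

The two values map to the server and path fields of the PersistentVolume in the next step.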
Step 3: Connect Kubernetes to NFS
Next, create the Kubernetes resources that allow pods to access the NFS share.
Create the Namespace
First, create a dedicated namespace for the vLLM resources:
# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: vllm
Apply it:
kubectl apply -f namespace.yaml
Create the PersistentVolume
The PersistentVolume tells Kubernetes how to connect to the NFS share. Using the Mount Source from Step 2, replace <NFS_HOST> with the IP address and <NFS_MOUNT_PATH> with the path:
# pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: vllm-models-pv
  labels:
    pv-name: vllm-models-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany # Allows multiple pods to read simultaneously
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: <NFS_HOST> # IP from Mount Source, for example 10.100.32.2
    path: <NFS_MOUNT_PATH> # Path from Mount Source, for example /2633050/7d1686e4-...
Important details:
- ReadWriteMany: This access mode lets pods on multiple nodes mount the NFS share at the same time, so every vLLM pod can read the same model files. Many block storage systems only support ReadWriteOnce.
- Retain reclaim policy: If the PVC is deleted, the data on the NFS share remains intact. This helps protect downloaded models from accidental removal.
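As an alternative to editing the placeholders by hand, the same PersistentVolume can be generated from the two Mount Source values. The sketch below mirrors pv.yaml and prints the manifest as JSON, which kubectl also accepts. The function name and the script name in the usage note are illustrative, and the host and path are the example values from Step 2.

```python
import json


def build_pv_manifest(nfs_host: str, nfs_path: str, size: str = "100Gi") -> dict:
    """Build the same PersistentVolume as pv.yaml from the Mount Source parts."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {
            "name": "vllm-models-pv",
            "labels": {"pv-name": "vllm-models-pv"},
        },
        "spec": {
            "capacity": {"storage": size},
            "accessModes": ["ReadWriteMany"],
            "persistentVolumeReclaimPolicy": "Retain",
            "nfs": {"server": nfs_host, "path": nfs_path},
        },
    }


if __name__ == "__main__":
    manifest = build_pv_manifest(
        "10.100.32.2", "/2633050/7d1686e4-9212-420f-a593-ab544993d99b"
    )
    print(json.dumps(manifest, indent=2))
```

If you save this as a script, for example make_pv.py, you can pipe its output to kubectl apply -f - instead of applying pv.yaml.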
Create the PersistentVolumeClaim
The PersistentVolumeClaim is the resource that pods reference when they need access to the storage:
# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models-pvc
  namespace: vllm
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: "" # Empty string binds to a pre-provisioned PV
  resources:
    requests:
      storage: 100Gi
  selector:
    matchLabels:
      pv-name: vllm-models-pv
Apply both resources:
kubectl apply -f pv.yaml
kubectl apply -f pvc.yaml
Check whether the PVC is bound:
kubectl get pvc -n vllm
Expected output:
NAME              STATUS   VOLUME           CAPACITY   ACCESS MODES   STORAGECLASS   AGE
vllm-models-pvc   Bound    vllm-models-pv   100Gi      RWX                           10s
The status should be Bound. If it remains Pending, verify that the PV configuration matches the PVC selector and that the NFS host and path are correct.
Kubernetes can now access the NFS share. Any pod mounting vllm-models-pvc can use the shared storage.
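In automation, you may want to check the binding programmatically rather than reading the table. The following sketch inspects a PVC object in the shape that kubectl prints with -o json and reads the status.phase and spec.volumeName fields. The function name is illustrative, and the embedded sample is a trimmed stand-in for real output.

```python
def pvc_binding(pvc: dict) -> tuple:
    """Return (phase, bound volume name) from a PVC object as printed by kubectl -o json."""
    phase = pvc.get("status", {}).get("phase", "Unknown")
    volume = pvc.get("spec", {}).get("volumeName", "")
    return phase, volume


if __name__ == "__main__":
    # Trimmed sample; real input would come from:
    #   kubectl get pvc vllm-models-pvc -n vllm -o json
    sample = {
        "metadata": {"name": "vllm-models-pvc", "namespace": "vllm"},
        "spec": {"volumeName": "vllm-models-pv"},
        "status": {"phase": "Bound"},
    }
    phase, volume = pvc_binding(sample)
    if phase == "Bound":
        print(f"PVC is bound to {volume}")
    else:
        print(f"PVC is {phase}; check the PV labels, NFS host, and path")
```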
Step 4: Download the Model Once
This is the central part of the download-once pattern. You will use a Kubernetes Job to download the model onto NFS. Unlike a Deployment, a Job runs to completion and then stops.
Create the Hugging Face Token Secret
First, create a secret with your Hugging Face token. Replace <YOUR_HUGGINGFACE_TOKEN> with your token:
# hf-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
  namespace: vllm
type: Opaque
stringData:
  HF_TOKEN: <YOUR_HUGGINGFACE_TOKEN>
Apply the secret:
kubectl apply -f hf-secret.yaml
Deploy the Model Download Job
Now create the Job that downloads the model:
# model-download-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
  namespace: vllm
spec:
  ttlSecondsAfterFinished: 300 # Clean up job after 5 minutes
  template:
    spec:
      restartPolicy: Never
      securityContext:
        runAsUser: 1000
        runAsGroup: 1000
      containers:
        - name: download
          image: python:3.11-slim
          command:
            - /bin/sh
            - -c
            - |
              pip install --target=/tmp/pip huggingface_hub &&
              export PYTHONPATH=/tmp/pip &&
              python -c "from huggingface_hub import snapshot_download; snapshot_download('mistralai/Mistral-7B-Instruct-v0.3', local_dir='/models/Mistral-7B-Instruct-v0.3')"
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: HF_TOKEN
            - name: HOME
              value: /tmp
            - name: HF_HOME
              value: /tmp/hf_home
          volumeMounts:
            - name: nfs-storage
              mountPath: /models
      volumes:
        - name: nfs-storage
          persistentVolumeClaim:
            claimName: vllm-models-pvc
This Job performs the following tasks:
- It installs the Hugging Face Hub client.
- It downloads Mistral-7B-Instruct-v0.3 to /models/Mistral-7B-Instruct-v0.3 on the NFS share.
- It uses local_dir so the files are stored directly instead of inside the Hugging Face cache structure.
- It runs as a non-root user for better security.
Apply the Job:
kubectl apply -f model-download-job.yaml
Monitor the Download
Watch the Job output:
kubectl logs job/model-download -n vllm -f
You will see the Hugging Face Hub client being installed and then the progress of the model download. Depending on network conditions, the download usually takes around 5 to 10 minutes.
Wait for the Job to complete:
kubectl wait --for=condition=complete job/model-download -n vllm --timeout=15m
Expected output:
job.batch/model-download condition met
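Verify that the Job succeeded. Run this soon after completion, because ttlSecondsAfterFinished: 300 deletes the finished Job after five minutes: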
kubectl get jobs -n vllm
NAME             STATUS     COMPLETIONS   DURATION   AGE
model-download   Complete   1/1           3m         5m
This is the critical point: the download happens once. Every pod deployed from now on, whether today, tomorrow, or next month, uses these same model files. Pod restarts no longer require another download.
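Because every future pod depends on these files, it can be worth a quick sanity check of the directory before deploying vLLM. The sketch below is one possible check, run from any pod that mounts the PVC at /models. It assumes the model directory contains a config.json and one or more .safetensors weight files, which is the usual Hugging Face layout. The function name is illustrative.

```python
from pathlib import Path


def check_model_dir(model_dir: str) -> list[str]:
    """Return a list of problems; an empty list means the directory looks usable."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not (root / "config.json").is_file():
        problems.append("config.json is missing")
    weights = list(root.glob("*.safetensors"))
    if not weights:
        problems.append("no .safetensors weight files found")
    elif any(w.stat().st_size == 0 for w in weights):
        problems.append("at least one weight file is empty")
    return problems


if __name__ == "__main__":
    issues = check_model_dir("/models/Mistral-7B-Instruct-v0.3")
    print("model directory looks complete" if not issues else "\n".join(issues))
```

This does not verify checksums. It only catches an obviously incomplete or empty download.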
Step 5: Deploy vLLM
Now deploy vLLM. The pod mounts the NFS share and loads the model directly. It does not need to download the model from Hugging Face during startup.
Create the Deployment
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  namespace: vllm
  labels:
    app: vllm
spec:
  replicas: 1 # Single replica for this tutorial
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0 # Delete old pod before creating new pod because of GPU constraints
      maxUnavailable: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model
            - /models/Mistral-7B-Instruct-v0.3
            - --served-model-name
            - Mistral-7B-Instruct-v0.3
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: VLLM_PORT
              value: "8000"
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: HF_TOKEN
          volumeMounts:
            - name: model-cache
              mountPath: /models
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 30
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: vllm-models-pvc
Important configuration details:
- replicas: 1: This tutorial uses one GPU node. Increase this value after adding more GPU nodes.
- maxSurge: 0: With only one GPU available, the old and new pods cannot run at the same time. This setting removes the old pod before the new one is created during updates.
- tolerations: These allow the pod to be scheduled on GPU-tainted nodes.
- nvidia.com/gpu: 1: This requests one GPU for the pod.
- volumeMounts: The NFS PVC is mounted at /models, where the model files are stored.
- --model /models/Mistral-7B-Instruct-v0.3: This points vLLM directly to the model path on NFS.
A note about container images: this guide pulls the vLLM image from Docker Hub for simplicity. In production, mirror the image to a container registry you control and reference it from there. This follows the same dependency-control principle because Docker Hub then becomes a one-time source rather than a runtime dependency.
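The liveness probe settings in the Deployment also limit how long a pod may spend loading the model. The sketch below gives a rough estimate of when Kubernetes would restart a container whose /health endpoint never succeeds. It assumes the Kubernetes default failureThreshold of 3, which the manifest does not override, and it ignores probe timeouts and scheduling jitter. Treat the result as an approximation. The function name is illustrative.

```python
def liveness_restart_window(initial_delay: int, period: int, failure_threshold: int = 3) -> tuple:
    """Approximate (earliest, latest) seconds after container start at which
    a container that never passes its liveness probe is restarted."""
    if initial_delay < 0 or period < 1 or failure_threshold < 1:
        raise ValueError("invalid probe settings")
    # The first probe runs at or shortly after the initial delay; the restart
    # follows the last of `failure_threshold` consecutive failures.
    earliest = initial_delay + (failure_threshold - 1) * period
    latest = initial_delay + failure_threshold * period
    return earliest, latest


if __name__ == "__main__":
    # Values from the livenessProbe in vllm-deployment.yaml.
    earliest, latest = liveness_restart_window(initial_delay=60, period=30)
    print(f"restart after roughly {earliest} to {latest} seconds without a healthy response")
```

With the values from this guide, the window is roughly 120 to 150 seconds. If a larger model needs longer than that to load from NFS, consider raising initialDelaySeconds or adding a startupProbe.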
Create the Service
# vllm-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm
  namespace: vllm
  labels:
    app: vllm
spec:
  type: ClusterIP
  ports:
    - port: 8000
      targetPort: 8000
      protocol: TCP
      name: http
  selector:
    app: vllm
Apply both files:
kubectl apply -f vllm-deployment.yaml
kubectl apply -f vllm-service.yaml
Wait Until the Pod Is Ready
kubectl wait --for=condition=ready pod -l app=vllm -n vllm --timeout=10m
Check the pod status:
kubectl get pods -n vllm -o wide
Expected output:
NAME                   READY   STATUS    RESTARTS   AGE   IP             NODE
vllm-xxxxxxxxx-xxxxx   1/1     Running   0          2m    10.108.1.123   gpu-h100-xxxxx
Notice what is not happening: there is no model download step. vLLM starts, mounts NFS, and loads the model directly into GPU memory.
If the pod had to download the model first, you would wait several additional minutes.
Step 6: Test the Inference Endpoint
Now verify the deployment by sending requests to the vLLM endpoint.
Port-Forward to the Service
Because this example uses a ClusterIP service, use port-forwarding to access it locally:
kubectl port-forward svc/vllm -n vllm 8000:8000
Keep this command running in one terminal. Open another terminal for the next commands.
List Available Models
curl -s http://localhost:8000/v1/models | jq .
Expected output:
{
  "object": "list",
  "data": [
    {
      "id": "Mistral-7B-Instruct-v0.3",
      "object": "model",
      "created": 1234567890,
      "owned_by": "vllm",
      "root": "/models/Mistral-7B-Instruct-v0.3",
      "parent": null,
      "max_model_len": 32768,
      "permission": [...]
    }
  ]
}
Send a Chat Completion Request
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 50
  }' | jq .
Expected output:
{
  "id": "chatcmpl-xxxxxxxxxxxxxxxx",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "Mistral-7B-Instruct-v0.3",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris..."
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "total_tokens": 60,
    "completion_tokens": 50
  }
}
You now have a working LLM inference endpoint backed by shared NFS storage.
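The same request can be sent from application code. The following is a minimal sketch using only the Python standard library. It assumes the port-forward above is still running on localhost:8000, and the helper names are illustrative. In practice, you might prefer an OpenAI-compatible client library.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumes the kubectl port-forward is active


def build_chat_request(prompt: str, model: str = "Mistral-7B-Instruct-v0.3",
                       max_tokens: int = 50) -> urllib.request.Request:
    """Build the same POST request as the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 60.0) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    try:
        print(ask("What is the capital of France?"))
    except OSError as exc:  # covers connection errors when the port-forward is down
        print(f"request failed: {exc}")
```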
In a production deployment, expose the service through Gateway API or a LoadBalancer service instead of port-forwarding. Many managed Kubernetes environments support Gateway API integrations that make HTTPS routing straightforward.
For more advanced vLLM deployment strategies and model caching approaches, review vLLM Kubernetes model loading and caching patterns.
Keep in mind that this tutorial focuses on model storage. A production LLM deployment also requires authentication, rate limiting, efficient request routing across replicas, observability, and other operational considerations that are outside the scope of this guide.
The Scaling Story: What Happens When You Add More GPUs?
This is where the NFS-based pattern becomes valuable.
Imagine you have been running one vLLM replica and traffic is increasing. You need more capacity.
Scenario: Scale to Multiple Replicas
First, add another GPU node to the cluster through your cloud provider’s control panel. After it is ready, scale the deployment:
kubectl scale deployment vllm -n vllm --replicas=2
Watch the new pod start:
kubectl get pods -n vllm -w
NAME                   READY   STATUS              RESTARTS   AGE
vllm-xxxxxxxxx-xxxxx   1/1     Running             0          30m
vllm-xxxxxxxxx-yyyyy   0/1     Pending             0          5s
vllm-xxxxxxxxx-yyyyy   0/1     ContainerCreating   0          10s
vllm-xxxxxxxxx-yyyyy   1/1     Running             0          45s
Notice the timing. The new pod moves from Pending to Running in about 45 seconds. That is the model loading time, without an additional multi-minute download.
With the per-pod download method, this would take significantly longer. The new pod would need to download the full model from Hugging Face while the first pod continues handling all traffic alone.
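To measure the startup time in your own cluster instead of reading it from the watch output, you can compare a pod's creation timestamp with the time its Ready condition became true. The sketch below works on a pod object in the shape that kubectl get pod -o json prints. The function names and the sample timestamps are illustrative. If a pod has lost and regained readiness, the value reflects the most recent transition.

```python
from datetime import datetime

K8S_TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # timestamp format used in Kubernetes object metadata


def _parse(ts: str) -> datetime:
    return datetime.strptime(ts, K8S_TIME_FORMAT)


def seconds_to_ready(pod: dict):
    """Return seconds from pod creation to Ready, or None if the pod is not Ready."""
    created = _parse(pod["metadata"]["creationTimestamp"])
    for cond in pod.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready" and cond.get("status") == "True":
            return (_parse(cond["lastTransitionTime"]) - created).total_seconds()
    return None


if __name__ == "__main__":
    # Trimmed sample with made-up timestamps.
    sample = {
        "metadata": {"creationTimestamp": "2025-01-01T12:00:00Z"},
        "status": {"conditions": [
            {"type": "Ready", "status": "True", "lastTransitionTime": "2025-01-01T12:00:45Z"},
        ]},
    }
    print(f"time to ready: {seconds_to_ready(sample):.0f} s")
```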
Scenario: Add More GPU Nodes Later
The same pattern works when you add GPU nodes tomorrow, next week, or next month. The model is already stored on NFS. New pods can access it immediately:
- Add a GPU node to the cluster.
- Scale the deployment:
kubectl scale deployment vllm -n vllm --replicas=3
- The new pod starts and becomes ready in roughly 45 seconds.
- There are no downloads, no unnecessary waiting, and no bandwidth competition.
For multi-replica deployments, place a load balancer in front of vLLM. Use your Kubernetes provider’s Gateway API or load balancing documentation to configure this. To understand the storage foundation in more detail, review Kubernetes PersistentVolume and PersistentVolumeClaim concepts.
Cleanup
GPU nodes can be expensive. When testing is complete, remove the resources to avoid unnecessary charges.
Use your cloud provider’s control panel to delete the resources:
- Delete the NFS share.
- Delete the Kubernetes cluster.
- Delete the VPC.
Delete the resources in this order to avoid dependency conflicts.
Frequently Asked Questions
How does NFS compare with other storage options for LLM models?
NFS provides ReadWriteMany access, which means multiple pods can read the same model files at the same time. This is important for horizontally scaling LLM inference workloads. Block storage options often support only ReadWriteOnce, which limits usage to one pod per volume. Object storage can also be used in some architectures, but it usually requires additional tooling to mount as a filesystem and may have higher latency for model loading than NFS.
Can this pattern be used with LLM inference frameworks other than vLLM?
Yes. This approach works with any LLM inference framework that can load models from a filesystem path, including TensorRT-LLM, llama.cpp, and similar tools. The main requirement is that the inference container can mount the NFS PersistentVolumeClaim and read the model files from that mount location. Adjust the model path in the deployment configuration to point to the NFS-mounted directory.
What happens if the NFS share runs out of space?
Managed NFS shares can usually be resized through the control panel or API. After resizing, the additional capacity becomes available without downtime. This tutorial starts with 100 GB, which is enough for Mistral-7B-Instruct-v0.3 at around 15 GB. Larger models such as Llama 70B, which can exceed 140 GB, require more storage. Choose the initial NFS size based on your model requirements and increase it as needed.
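A quick capacity check before adding a model can be sketched as follows. The sizes are the approximate figures quoted in this guide, and the function name is illustrative. Actual usage depends on the exact files in each model repository.

```python
def remaining_gb(share_gb: float, model_sizes_gb: dict) -> float:
    """Return free space left on the share after storing all listed models."""
    return share_gb - sum(model_sizes_gb.values())


if __name__ == "__main__":
    models = {
        "Mistral-7B-Instruct-v0.3": 15,  # approximate size from this guide
        "Llama 70B": 140,                # approximate size from this guide
    }
    free = remaining_gb(share_gb=100, model_sizes_gb=models)
    if free < 0:
        print(f"share is too small by about {-free:.0f} GB; resize before downloading")
    else:
        print(f"about {free:.0f} GB would remain free")
```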
Is loading from NFS slower than loading from local storage?
For model loading, NFS performance is usually sufficient because loading happens once per pod startup, not during every inference request. The model files are loaded into GPU memory when the pod starts, and inference then uses the in-memory model rather than repeatedly reading from NFS. If you frequently reload models or perform checkpointing during training, local NVMe storage may provide better performance. For inference workloads where models load once and remain in memory, NFS offers scaling benefits without a noticeable performance impact.
How can I update or replace a model stored on NFS?
There are several options for updating a model. You can download a new model version into a different directory on the same NFS share, for example /models/Mistral-7B-Instruct-v0.4, and then update the vLLM deployment to use the new path. This lets you test the new version while keeping the old model available for rollback.
Alternatively, you can remove the old model directory and download the new version into the same path. Because the download happens through a Kubernetes Job, you can rerun the Job with updated parameters. The NFS share remains available across pod restarts, so model updates stay available after deployment changes.
Conclusion
You have moved from downloading the model every time a pod starts to downloading once and running inference everywhere.
You built a model storage pattern that can serve as a foundation for production LLM deployments.
You created shared model storage that scales with the Kubernetes cluster.
You removed external runtime dependencies because Hugging Face is used as a one-time source, not as a requirement for every pod startup.
The principle behind this approach is to control your dependencies. External services should be used for initial acquisition, not as runtime requirements. When infrastructure needs to recover from failures, react to traffic spikes, or perform routine deployments, it should rely on components under your control.


