Content

1 Key Takeaways
2 The Problem: Why Downloading Models on Every Startup Hurts
3 The Solution: Download Once and Run Inference Everywhere
4 Architecture Overview
5 Prerequisites
6 Step 1: Set Up the Infrastructure
7 Step 2: Create the Managed NFS Share
8 Step 3: Connect Kubernetes to NFS
9 Step 4: Download the Model Once
10 Step 5: Deploy vLLM
11 Step 6: Test the Inference Endpoint
12 The Scaling Story: What Happens When You Add More GPUs?
13 Cleanup
14 Frequently Asked Questions
15 Conclusion

Vijona

44 minutes ago

Deploy vLLM on Kubernetes with Shared NFS Model Storage

Your vLLM pods may be downloading the same large model files every time they start.

If you run LLM inference on Kubernetes, you may have chosen the simple setup: configure vLLM to pull the model from Hugging Face during pod startup. That works, but it creates problems later:

A pod fails at 2 AM. The replacement pod cannot serve any requests until it has spent several minutes downloading many gigabytes of model weights from Hugging Face.
You need to scale out during a traffic increase. Every new pod downloads the model separately, competes with the others for the same network bandwidth, and slows down your ability to react to demand.
Hugging Face is rate-limiting requests or has an outage. Your pods cannot start successfully.

A better approach is to download the model once into shared storage and let every pod load the model from there. This avoids repeated downloads, removes external runtime dependencies, and gives every new pod fast access to the model files.

In this guide, you will deploy vLLM on Kubernetes using managed NFS storage for model files.

The example uses one H100 GPU node to keep the setup easy to follow. However, the pattern can scale to as many nodes as required. That is the main benefit: once the model is stored on NFS, adding GPU capacity gives new pods direct model access instead of triggering another long download.

Key Takeaways

Avoid repeated downloads: Download LLM models once to shared NFS storage instead of fetching them every time a pod starts. This removes the multi-minute download from every pod start, so startup time is bounded by model loading alone.

Enable faster scaling: New vLLM replicas can start serving traffic more quickly because they load models directly from NFS rather than waiting for multi-gigabyte downloads.

Reduce external dependencies: Keep models inside your own infrastructure so pod restarts and scaling events do not depend on Hugging Face availability.

Support concurrent access: NFS with ReadWriteMany access allows several pods to read the same model files at the same time, which is ideal for horizontal scaling.

Build a production-ready pattern: This storage approach works for many LLM deployments and can scale from a single-node setup to larger multi-GPU clusters.

The Problem: Why Downloading Models on Every Startup Hurts

Let’s look closely at the cost of downloading models during pod startup.

Model files are large. Mistral-7B-Instruct-v0.3, the model used in this guide, is around 15 GB. Larger models such as Llama 70B can exceed 140 GB. Whenever a pod starts and downloads the model from Hugging Face, that amount of data must travel over the internet again.

Every pod restart causes another download. Pods can crash. Nodes can be maintained or replaced. Deployments happen regularly. With the download-on-startup approach, all of these events trigger another complete model download. If an inference pod restarts three times in one day, the same model files are downloaded three times.

Scaling turns into a bandwidth bottleneck. When a Horizontal Pod Autoscaler adds replicas during a traffic spike, every new pod downloads the model at the same time. Three new pods mean three parallel multi-gigabyte downloads, all competing for bandwidth. Instead of immediately increasing serving capacity, your platform waits for downloads to finish.

Hugging Face becomes a runtime dependency. This is the less obvious risk. During normal operation, Hugging Face availability may not seem like an issue. But imagine a failure at 2 AM: a GPU node becomes unavailable, Kubernetes schedules a replacement pod, and Hugging Face is rate-limiting your IP address or experiencing issues. Your recovery from an infrastructure failure now depends on an external service outside your control.

The principle is simple: control your dependencies. External services such as Hugging Face should be used for the initial model acquisition, not as a required dependency every time a pod starts. Whether a pod starts because of scaling, deployment, or failure recovery, it should load from infrastructure you control.
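To put rough numbers on the bandwidth cost described above, here is a minimal back-of-the-envelope sketch. The function name `download_minutes` and the 1 Gbit/s link speed are assumptions chosen for illustration; the model sizes are the ones quoted in this section. It ignores protocol overhead and rate limiting, so real downloads will be slower.

```python
def download_minutes(model_gb: float, link_gbit_per_s: float, parallel_pods: int = 1) -> float:
    """Minutes until every pod has finished downloading.

    Assumes all pods share one link evenly, so the total data volume
    is what determines when the last download completes.
    """
    total_gigabits = model_gb * 8 * parallel_pods  # GB -> Gbit, once per pod
    return total_gigabits / link_gbit_per_s / 60   # seconds -> minutes


if __name__ == "__main__":
    # 15 GB model (Mistral-7B-Instruct-v0.3) over an assumed 1 Gbit/s link
    for pods in (1, 3):
        print(f"{pods} pod(s): {download_minutes(15, 1.0, pods):.1f} min")
```

Under these assumptions a single pod waits about 2 minutes, and three pods scaling out together wait about 6 minutes before any of the new capacity is usable. A 140 GB model on the same link would take close to 19 minutes for one pod.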

The Solution: Download Once and Run Inference Everywhere

The pattern is simple:

Download once: A Kubernetes Job downloads the model from Hugging Face to an NFS share.
Store on NFS: The model files remain available on managed NFS storage.
Load from NFS: Each vLLM pod mounts the NFS share and loads the model directly from that path.
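The "download once" step can be sketched as a small idempotent script that the Job's container might run. This is a sketch under assumptions, not the exact Job used later in this guide: the names `ensure_model`, `MARKER`, and the injectable `fetch` parameter are hypothetical, while `snapshot_download` is the download function provided by the `huggingface_hub` library.

```python
from pathlib import Path
from typing import Callable, Optional

MARKER = ".download-complete"  # hypothetical marker file name


def _fetch_from_hub(repo_id: str, target: Path) -> None:
    # Imported lazily so the skip path works without the library installed.
    from huggingface_hub import snapshot_download

    snapshot_download(repo_id=repo_id, local_dir=str(target))


def ensure_model(
    repo_id: str,
    target: Path,
    fetch: Optional[Callable[[str, Path], None]] = None,
) -> bool:
    """Download repo_id into target unless a previous run already finished.

    Returns True if a download ran, False if it was skipped.
    """
    marker = target / MARKER
    if marker.exists():
        return False  # model already on the share, nothing to do
    target.mkdir(parents=True, exist_ok=True)
    (fetch or _fetch_from_hub)(repo_id, target)
    marker.touch()  # written last, so an interrupted download is retried
    return True
```

Called as `ensure_model("mistralai/Mistral-7B-Instruct-v0.3", Path("/models/mistral-7b"))`, with `/models` being an assumed mount point for the NFS share, a rerun of the Job becomes a no-op once the marker file exists. If the model requires authentication, a Hugging Face token would also have to be available to the Job.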

This solves the problems described above.

ReadWriteMany access: NFS volumes support the ReadWriteMany access mode, so pods on different nodes can mount the same volume and read from it at the same time. Whether you run one replica or ten, they can all use the same model files.

Persistence: Model files remain available across pod restarts, node replacements, and cluster upgrades. You download the model once and keep it until you intentionally remove it.

In-region access without external runtime dependency: The NFS share is located in the same region as the Kubernetes cluster. Model loading happens through the private network, which is fast, reliable, and independent of third-party availability.

Managed infrastructure: The NFS service is operated by the infrastructure provider, so you do not have to maintain NFS servers yourself.
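As an illustration of the ReadWriteMany point above, the following sketch builds a PersistentVolumeClaim manifest and prints it as JSON, which `kubectl apply -f -` accepts alongside YAML. The claim name, size, and storage class name are placeholders, not values from this guide; the storage class in particular depends on how your provider exposes its NFS service.

```python
import json


def model_pvc(name: str, size: str, storage_class: str) -> dict:
    """Return a PersistentVolumeClaim manifest that many pods can mount at once."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteMany lets pods on different nodes share the volume.
            "accessModes": ["ReadWriteMany"],
            "storageClassName": storage_class,  # placeholder, provider-specific
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    # All three arguments are hypothetical example values.
    print(json.dumps(model_pvc("model-store", "100Gi", "nfs"), indent=2))
```

The download Job needs write access to this claim, while the vLLM pods only read from it, so the serving pods can mount the volume read-only.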

The scaling advantage is especially important. When you add a new GPU node later and deploy another vLLM replica, the pod can access the model immediately. It starts, mounts NFS, loads the model into GPU memory, and begins serving requests. Startup time is limited to model loading, not model downloading plus model loading.

With the per-pod download approach, adding a new replica means waiting several minutes for yet another download before the new capacity becomes usable.
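One way to keep the serving pods strictly on the "load from NFS" path is a small preflight check before starting the server. The sketch below is an assumption about how such a check could look, not part of vLLM itself: `model_ready` and `vllm_command` are hypothetical helper names, and the check assumes the weights are stored as `.safetensors` files next to a `config.json`. The resulting command uses the `vllm serve` CLI with a local directory as the model argument.

```python
from pathlib import Path


def model_ready(model_dir: Path) -> bool:
    """True if the directory looks like a complete Hugging Face model snapshot."""
    has_config = (model_dir / "config.json").is_file()
    has_weights = any(model_dir.glob("*.safetensors"))  # assumed weight format
    return has_config and has_weights


def vllm_command(model_dir: Path, port: int = 8000) -> list[str]:
    """Build the server command, raising if the model files are missing."""
    if not model_ready(model_dir):
        raise FileNotFoundError(f"model files missing under {model_dir}")
    # A local path as the model argument makes vLLM load from disk.
    return ["vllm", "serve", str(model_dir), "--port", str(port)]
```

Failing fast here means a pod that cannot see the model on NFS reports an error right away instead of silently falling back to a download. Setting the `HF_HUB_OFFLINE=1` environment variable on the pod is a further safeguard, since it tells the Hugging Face libraries not to contact the Hub.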

Architecture Overview

The data flow has two separate phases.

One-time setup:
