How to Size and Configure GPUs for vLLM Inference

Effective GPU sizing and configuration for vLLM inference begins with a solid understanding of the two main phases of large language model processing: prefill and decode. Each phase stresses the hardware in a different way and therefore influences GPU selection, memory planning, and deployment strategy.

This guide explains how vLLM behaves during runtime, defines important concepts such as memory requirements, quantization, and tensor parallelism, and outlines practical methods for aligning GPU choices with real-world workloads. By understanding how these elements interact, you can better predict performance bottlenecks and make informed, cost-conscious decisions when deploying large language models on GPU-based infrastructure.

Key Takeaways

  • Prefill and decode phases shape hardware requirements: The prefill phase processes input prompts and is typically limited by memory bandwidth, directly affecting Time-To-First-Token. The decode phase generates output tokens and is usually compute-bound, determining token generation speed.
  • VRAM capacity defines the absolute ceiling: Both model weights and the KV cache must fit into available GPU memory. A 70B model in FP16 requires around 140 GB for weights alone, making quantization critical for many single-GPU deployments.
  • The KV cache expands during runtime: Unlike fixed model weights, the KV cache grows based on context length and the number of concurrent users. A 70B model with a 32k context and 10 concurrent users requires roughly 112 GB for an FP16 cache or 56 GB for an FP8 cache.
  • Quantization is the most important optimization lever: Reducing precision from FP16 to INT4 lowers memory consumption by about 75%, allowing larger models to run on smaller GPUs. FP8 quantization often provides the strongest balance between speed and quality on modern hardware.
  • Tensor parallelism makes larger models possible: When a model is too large for one GPU, tensor parallelism splits weights across several GPUs and combines their VRAM, but this introduces communication overhead. If a model fits on one GPU, single-GPU execution is generally faster.

The Anatomy of vLLM Runtime Behavior: Prefill vs. Decode

The Prefill Phase: “Reading”

The prefill phase is the first step in processing a request. vLLM receives the full input prompt, including the user query, system prompt, and any retrieval-augmented context, and processes it in a highly parallel manner.

  • What happens: The model reads the supplied context and fills the Key-Value cache with the mathematical representation of that context.
  • The bottleneck: Since this phase processes thousands of tokens at once, it is usually constrained by memory bandwidth. The limiting factor is how quickly the GPU can move large weight matrices from VRAM into the compute cores.
  • Real-world impact: This phase determines Time-To-First-Token. For example, when a user asks the model to summarize a very large 100k-token document, the prefill phase is what causes the waiting time before the first generated word appears.

The Decode Phase: “Writing”

After the prefill phase has finished, vLLM enters an autoregressive generation loop to produce the response.

  • What happens: The model generates one token, adds it to the sequence, and then runs the model again to generate the next token. For a single request, this process is sequential by nature.
  • The challenge: Loading very large model weights from VRAM simply to calculate one token for one user is inefficient. In that situation, the GPU can spend more time moving data than performing calculations.
  • The solution: continuous batching: Modern inference engines such as vLLM avoid processing requests one at a time. Instead, they use continuous batching, where requests dynamically enter and leave the batch. vLLM can interleave the prefill work of new requests with the decode steps of existing requests within the same GPU cycle.
  • The bottleneck: When batching works effectively, the decode phase becomes compute-bound. The goal is to perform as many parallel token calculations as possible and maximize total throughput.
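
The effect of continuous batching can be illustrated with a toy simulation. This is only a sketch and not vLLM's scheduler: it ignores prefill cost and memory limits, and the function names and the `max_batch` parameter are illustrative assumptions. It counts how many decode steps are needed when a batch must finish completely before the next one starts (static batching) versus when a finished request frees its slot immediately (continuous batching).

```python
from collections import deque


def static_batching_steps(output_lengths, max_batch):
    """Each batch runs until its longest request is finished."""
    steps = 0
    for i in range(0, len(output_lengths), max_batch):
        steps += max(output_lengths[i:i + max_batch])
    return steps


def continuous_batching_steps(output_lengths, max_batch):
    """Finished requests leave the batch and waiting requests join right away."""
    waiting = deque(output_lengths)
    running = []  # remaining tokens to generate for each active request
    steps = 0
    while waiting or running:
        # Fill free slots before the next decode step.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step produces one token for every active request.
        running = [remaining - 1 for remaining in running]
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


# Hypothetical workload: one long answer and three short ones, two slots.
lengths = [8, 2, 2, 2]
print(static_batching_steps(lengths, max_batch=2))      # 10
print(continuous_batching_steps(lengths, max_batch=2))  # 8
```

In this toy case the short requests no longer wait for the long one to finish, which is why the total number of decode steps drops.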

Linking Runtime Phases to Workloads and Hardware

Selecting the right hardware depends heavily on understanding which runtime phase dominates your workload.

Runtime Phase | Primary Action | Primary Hardware Constraint | Dominant Use Cases
Prefill | Processes long inputs in parallel. | Memory bandwidth in TB/s, which is crucial for fast Time-To-First-Token. | RAG, long document summarization, and large few-shot prompting.
Decode | Generates output tokens sequentially. | Compute performance in TFLOPS, which is crucial for fast token generation. | Interactive chat, customer service, real-time code generation, and multi-turn agentic workflows.

KV Cache at Runtime

During inference, vLLM depends heavily on a KV cache to avoid repeating work that has already been completed.

  • The mechanics: In a transformer model, each token is converted into key and value vectors inside the attention layers. Without a cache, the model would need to reprocess the entire history, from token 0 to token t, just to generate token t+1.
  • The solution: The KV cache stores key and value vectors once and reuses them during generation.
  • Prefill: vLLM calculates key and value vectors for all prompt tokens and stores them immediately.
  • Decode: For every new token, vLLM retrieves the previous history from the cache and only computes the key and value vectors for the newly generated token.
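  • The benefit: Without the cache, every decode step would recompute keys and values for the entire history, so the work per token grows with everything generated so far. With the cache, each step computes keys and values only for the newest token and reuses the stored entries for the rest.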
  • The trade-off: The performance benefit comes at the cost of memory usage. Each generated token adds more entries to the cache.
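
To make the saving concrete, the following sketch counts how many per-token key/value computations are needed with and without a cache. It is a simplified accounting model and not a measurement of vLLM: the function names are illustrative, and it assumes that without a cache every generation step reprocesses the full sequence seen so far.

```python
def kv_computations_without_cache(prompt_len, new_tokens):
    """Every step recomputes keys and values for the whole history."""
    total = 0
    seq_len = prompt_len
    for _ in range(new_tokens):
        total += seq_len  # reprocess all tokens seen so far
        seq_len += 1      # the generated token extends the history
    return total


def kv_computations_with_cache(prompt_len, new_tokens):
    """Prefill handles the prompt once; each later step adds one entry."""
    # The last generated token never needs to be fed back in.
    return prompt_len + max(new_tokens - 1, 0)


# Hypothetical request: 1,000-token prompt, 100 generated tokens.
print(kv_computations_without_cache(1000, 100))  # 104950
print(kv_computations_with_cache(1000, 100))     # 1099
```

The cached variant also shows the trade-off: the 1,099 entries it computed all stay resident in VRAM for the lifetime of the request.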

At runtime, KV cache memory grows dynamically based on several factors:

  • Prompt length and output length: Longer conversations require more VRAM.
  • Concurrency: Every active request requires its own separate cache.
  • Model size: Deeper models with more layers and wider models with more heads consume more cache memory per token.

This scaling behavior explains why two workloads using the same model can have very different hardware requirements. A 70B model may initially fit on a GPU, but if the KV cache becomes too large during a long conversation, the system can run out of VRAM and fail. Understanding memory behavior is therefore essential for production deployments.

Sizing Fundamentals: How Models, Precision, and Hardware Determine Fit

After understanding how vLLM behaves at runtime, the next step is determining whether a specific model can run on a given GPU and what level of context length or concurrency it can realistically support.

This section explains the formulas and decision logic needed to calculate static memory requirements, estimate KV cache growth, and troubleshoot fit-related issues in a structured way.

GPU Hardware Characteristics and Constraints

Before calculating model size, it is important to understand the hardware container the model must fit into. Different GPUs create different limits for feasibility and performance.

Common Data Center GPU VRAM Capacities

The following values represent the hard memory limits of common GPUs used for inference workloads.

GPU Comparison for vLLM Inference and Training

GPU Model | VRAM Capacity | Peak Dense TFLOPS (FP16 / FP8) | Primary Applications and Advantages
NVIDIA L40S | 48 GB | 362 / 733 | Cost-effective inference for small-to-medium quantized models, typically in the 7B to 70B range.
NVIDIA A100 | 40 GB / 80 GB | 312 / N/A | A previous high-end standard, with the 80 GB version well suited for workloads that require strong memory bandwidth.
NVIDIA H100 | 80 GB | 989 / 1,979 | A current high-end standard with very high bandwidth, ideal for long-context applications.
NVIDIA H200 | 141 GB | 989 / 1,979 | A major capacity improvement that supports larger batch sizes or 70B+ models with fewer GPUs.
NVIDIA B300 | 288 GB | ~2,250 / 4,500 | Designed for maximum density and capable of fitting very large models, such as Llama 405B, with minimal GPU parallelism.
AMD MI300X | 192 GB | 1,307 / 2,614 | Offers very large memory capacity, making it well suited for very large unquantized models or extremely large batch sizes.
AMD MI325X | 256 GB | 1,307 / 2,614 | Optimized for capacity and highly suitable for serving 70B+ models with very long context requirements.
AMD MI350X | 288 GB | 2,300 / 4,600 | A high-performance flagship designed for massive-scale workloads and positioned against top-tier next-generation GPU platforms.

Even when a model fits into VRAM, GPU architecture still has a significant impact on vLLM performance. The most important metrics are:

Metric | Measured In | Impact on vLLM
VRAM Capacity | GB | Determines whether the model can run at all. It sets the maximum possible limit for model size and context window.
Memory Bandwidth | TB/s | Controls prefill speed and is especially important for RAG and long-context summarization. High bandwidth helps reduce Time-To-First-Token.
Compute | TFLOPS | Determines decode speed and is especially important for chat workloads. Higher TFLOPS improve token-per-second generation.
Interconnect | GB/s | Determines the cost of parallelism. Any interconnect introduces latency. Even high-speed interconnects such as NVLink add synchronization overhead when tensor parallelism is used, reducing performance compared with single-GPU execution.

Model Weight Footprint: Static Memory

Before vLLM can handle requests, the model weights have to be placed in GPU VRAM. How much memory they occupy is determined by the model’s parameter count and the precision format used.

Formula for Static Weights

The approximate VRAM requirement in GB for model weights can be calculated as follows:

VRAM (GB) ≈ Parameters (Billions) × Bytes per Parameter

The following table shows how this applies to a Llama 3.1 70B model at different precision levels.

Precision | Bytes per Parameter | Example: Llama 3.1 70B VRAM
FP16 / BF16 | 2 bytes | 70 × 2 = 140 GB
FP8 / INT8 | 1 byte | 70 × 1 = 70 GB
INT4 | 0.5 bytes | 70 × 0.5 = 35 GB

Precision is the single most powerful lever for feasibility. Quantizing a 70B model from FP16 to INT4 reduces its static memory footprint by 75%, changing the deployment from something that may be impossible on a single node into something that can fit on a single high-memory GPU. This makes quantization essential for cost-efficient deployments on cloud GPU instances.
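
The static-weight formula is simple enough to script. The sketch below is a direct transcription of it; the function name and the dictionary of precision labels are illustrative choices, and the result covers weights only, not the KV cache or runtime overhead.

```python
# Bytes per parameter for the precision levels in the table above.
BYTES_PER_PARAM = {
    "fp16": 2.0,
    "bf16": 2.0,
    "fp8": 1.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_vram_gb(params_billions, precision):
    """Approximate VRAM (GB) needed for model weights alone."""
    return params_billions * BYTES_PER_PARAM[precision]


for precision in ("fp16", "fp8", "int4"):
    print(precision, weight_vram_gb(70, precision), "GB")
```

For a 70B model this reproduces the 140 GB, 70 GB, and 35 GB figures from the table.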

KV Cache Requirements: Dynamic Memory

Model weights determine whether a model can start, but the KV cache determines whether the deployment can scale. KV cache requirements are often underestimated, which can lead to out-of-memory failures under load.

To size a deployment accurately, you need to estimate how much memory the cache will consume based on expected context length and concurrency.

The Field Rule of Thumb for Quick Estimation

In most sizing conversations, the exact formula is too detailed to work through on the spot. A simpler approach is to use a per-token memory multiplier, which is usually accurate enough for initial sizing decisions.

Simplified KV Cache Formula:

Total KV Cache (MB) = Total Tokens × Multiplier

(Where Total Tokens = Context Length × Concurrency.)

Standard Multipliers

Model Size | Standard Multiplier (FP16 Cache) | Quantized Multiplier (FP8 Cache)
Small Models (7B – 14B) | 0.15 MB / token | 0.075 MB / token
Large Models (70B – 80B) | 0.35 MB / token | 0.175 MB / token

Example Calculation

Suppose a deployment must run Llama 3 70B with a 32k context and 10 concurrent users.

  • Calculate total tokens: 32,000 × 10 = 320,000 tokens.
  • Apply the standard multiplier: 320,000 × 0.35 MB = 112,000 MB, or 112 GB.
  • Check the FP8 option: With FP8 cache enabled, the cache requirement is cut roughly in half to about 56 GB.

Verdict

  • FP16 cache: 112 GB cache + 140 GB weights = 252 GB total, requiring approximately 4 H100-class GPUs.
  • FP8 cache: 56 GB cache + 140 GB weights = 196 GB total, which fits on about 3 H100-class GPUs. If the weights are also quantized to FP8 (70 GB), the total drops to 126 GB, which can fit on 2 H100-class GPUs with limited headroom.
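
The rule-of-thumb calculation can be captured in a few lines. This is a minimal sketch of the multiplier method using the values from the Standard Multipliers table; the function name and the "small"/"large" labels are illustrative, and the result uses decimal units (1 GB = 1,000 MB) as in the worked example.

```python
# Per-token multipliers in MB, keyed by (model size class, cache precision).
KV_MB_PER_TOKEN = {
    ("small", "fp16"): 0.15,   # 7B to 14B models
    ("small", "fp8"): 0.075,
    ("large", "fp16"): 0.35,   # 70B to 80B models
    ("large", "fp8"): 0.175,
}


def kv_cache_gb(context_len, concurrency, size_class, cache_dtype="fp16"):
    """Rule-of-thumb KV cache size in GB for a given workload."""
    total_tokens = context_len * concurrency
    return total_tokens * KV_MB_PER_TOKEN[(size_class, cache_dtype)] / 1000


# The example above: 70B-class model, 32k context, 10 concurrent users.
print(round(kv_cache_gb(32_000, 10, "large", "fp16"), 1))  # 112.0
print(round(kv_cache_gb(32_000, 10, "large", "fp8"), 1))   # 56.0
```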

Exact Calculation and Tools

For detailed validation or edge cases, use the formal formula or an online calculator.

Online Tool: LMCache KV Calculator

Formal Formula:

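Total KV Cache (GB) = (2 × n_layers × d_model × n_seq_len × n_batch × precision_bytes) / 1024³

For models that use grouped-query attention, replace d_model with n_kv_heads × head_dim, because only the key and value heads are cached. Using the full d_model for such models overestimates the cache size.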
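
As a cross-check against the rule of thumb, the formal formula can be evaluated directly. In this sketch the parameter `kv_dim` stands for the width of the cached keys or values per layer: d_model for models with standard multi-head attention, or n_kv_heads × head_dim for models with grouped-query attention. The Llama 3 70B values used below (80 layers, 8 key/value heads, head dimension 128) are taken from the publicly released model configuration and are an assumption of this example, not something stated in this guide.

```python
def kv_cache_gib(n_layers, kv_dim, seq_len, batch, precision_bytes):
    """Exact KV cache size; the factor 2 covers keys and values."""
    total_bytes = 2 * n_layers * kv_dim * seq_len * batch * precision_bytes
    return total_bytes / 1024**3


# Assumed Llama 3 70B shape: 80 layers, 8 KV heads x 128 dims = 1024.
fp16 = kv_cache_gib(n_layers=80, kv_dim=8 * 128, seq_len=32_000,
                    batch=10, precision_bytes=2)
fp8 = kv_cache_gib(n_layers=80, kv_dim=8 * 128, seq_len=32_000,
                   batch=10, precision_bytes=1)
print(round(fp16, 2))  # 97.66
print(round(fp8, 2))   # 48.83
```

Because the division uses 1024³, the result is in binary gigabytes: 97.66 of these is about 105 decimal GB, which is in the same range as the 112 GB rule-of-thumb estimate.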

When Tensor Parallelism Is Required

Tensor parallelism is a method that shards a model’s individual weight matrices across multiple GPUs. In practice, it allows vLLM to treat several GPUs as one larger device with pooled VRAM.

Why Use Tensor Parallelism?

Tensor parallelism is mainly a feasibility tool rather than a performance optimization. It is usually enabled when:

  • The weights do not fit: The model is too large for one GPU, such as a Llama 3 70B model on a 24 GB card.
  • The KV cache has no room: The model weights fit, but leave almost no available memory for long contexts or high concurrency.

The Performance Tax of Parallelism

Tensor parallelism provides access to much more combined memory, but it also creates communication overhead. After each layer of computation, the GPUs must synchronize their partial results.

  • If the model fits on one GPU: Running it on a single GPU is almost always faster than using two GPUs, because there is no synchronization overhead.
  • Interconnect dependency: Tensor parallelism depends heavily on fast GPU-to-GPU bandwidth. If GPUs communicate only through standard PCIe instead of a high-speed interconnect such as NVLink, inference speed can drop significantly because of synchronization latency. For multi-GPU deployments, container orchestration platforms can be used to manage vLLM workloads reliably.

For more detail on how tensor parallelism shards models and affects latency, refer to Hugging Face: Tensor Parallelism Concepts.
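
A rough GPU count can be derived from the total memory requirement. The sketch below divides the requirement by the usable VRAM per GPU and rounds up; the function name, the 4 GB per-GPU overhead default, and the optional rounding to a power of two are assumptions of this example. The rounding option exists because tensor-parallel degrees typically need to divide the model's attention-head count evenly, which in practice favors sizes such as 2, 4, or 8.

```python
import math


def min_gpus(total_required_gb, vram_per_gpu_gb,
             overhead_per_gpu_gb=4.0, power_of_two=False):
    """Smallest GPU count whose usable VRAM covers weights plus cache."""
    usable = vram_per_gpu_gb - overhead_per_gpu_gb
    if usable <= 0:
        raise ValueError("overhead exceeds per-GPU VRAM")
    n = max(1, math.ceil(total_required_gb / usable))
    if power_of_two:
        n = 1 << (n - 1).bit_length()  # round up to 1, 2, 4, 8, ...
    return n


# The 70B example from the KV cache section, on 80 GB GPUs.
print(min_gpus(252, 80))                     # 4 (FP16 cache)
print(min_gpus(196, 80))                     # 3 (FP8 cache)
print(min_gpus(196, 80, power_of_two=True))  # 4
```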

Putting the Numbers to the Test: Sizing Scenarios

Before moving into advanced configurations, it helps to apply the formulas from earlier sections to realistic scenarios. This validates the concept of fit and highlights practical constraints that simple calculations often miss.

The Hidden VRAM Tax

A common mistake is to calculate weights plus cache and assume that all VRAM can be used. In practice, 100% utilization is not possible.

  • CUDA context and runtime: The GPU driver, PyTorch, and vLLM runtime reserve memory during initialization, often around 2 to 4 GB.
  • Activation buffers: Temporary storage is required for intermediate calculations during the forward pass.
  • Safe sizing rule: Always reserve about 4 to 5 GB of VRAM as unusable overhead. If your calculation leaves only 0.5 GB free, the server is likely to crash.

Scenario A: The Easy Fit for Standard Chat

Hardware: 1 × NVIDIA L40S with 48 GB VRAM

Model: Llama 3 8B in FP16

Math:

  • Weights: 8B parameters × 2 bytes = 16 GB
  • Overhead: -4 GB
  • Remaining cache memory: 48 – 16 – 4 = 28 GB

Cache capacity:

28,000 MB / 0.15 MB per token ≈ 186,000 tokens.

Verdict: Excellent fit.

This configuration can handle large workloads, such as 60 concurrent users with a 3k context each.

Result: High throughput at low cost.

Scenario B: The Weight Failure for a Large Model on One GPU

Hardware: 1 × NVIDIA H100 with 80 GB VRAM

Model: Llama 3 70B in FP16

Math:

  • Weights: 70B parameters × 2 bytes = 140 GB

Verdict: Hard fail.

The model weights require 140 GB, which physically exceeds the 80 GB GPU capacity.

Solution: Use tensor parallelism with 2 GPUs or apply quantization.

Scenario C: The Cache Trap Where the Model Fits but Cannot Run Well

Hardware: 1 × NVIDIA H100 with 80 GB VRAM

Model: Llama 3 70B quantized to FP8

Math:

  • Weights: 70B parameters × 1 byte = 70 GB
  • Overhead: -4 GB
  • Remaining cache memory: 80 – 70 – 4 = 6 GB

Cache capacity:

6,000 MB / 0.175 MB per token for FP8 ≈ 34,000 tokens total.

Verdict: Risky and poorly balanced.

The model loads, but there is almost no memory left for actual workload execution.

Impact: With 10 concurrent users, each user only receives about 3.4k context. If a user submits a long document of 4k tokens, the system can run out of memory.

Lesson: Fitting the weights does not mean the workload fits. This scenario usually needs a second GPU or a smaller model.

Scenario D: The Solution with Tensor Parallelism

This scenario improves the cache trap by adding a second GPU, showing how tensor parallelism pools memory resources.

Hardware: 2 × NVIDIA H100 with 80 GB each, providing 160 GB total VRAM

Model: Llama 3 70B quantized to FP8

Math:

  • Total VRAM: 160 GB
  • Weights: -70 GB, distributed across both GPUs
  • Overhead: -8 GB, assuming about 4 GB per GPU
  • Remaining cache memory: 160 – 70 – 8 = 82 GB

Cache capacity:

82,000 MB / 0.175 MB per token for FP8 ≈ 468,000 tokens total.

Verdict: Production-ready.

Adding a second GPU increases available cache space from a risky 6 GB to a much stronger 82 GB.

Impact: With 10 concurrent users, each user can now receive roughly 46k context. The out-of-memory risk is removed, and the deployment can comfortably support RAG or long-document workloads.
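
The four scenarios follow the same arithmetic, so they can be reproduced with one helper. This is a sketch under the same assumptions as the scenarios (about 4 GB of overhead per GPU, decimal units, rule-of-thumb multipliers); the function name and parameters are illustrative.

```python
def cache_token_capacity(total_vram_gb, weights_gb, mb_per_token,
                         n_gpus=1, overhead_per_gpu_gb=4.0):
    """Tokens of KV cache that fit after weights and overhead are placed."""
    free_gb = total_vram_gb - weights_gb - n_gpus * overhead_per_gpu_gb
    if free_gb <= 0:
        return 0  # the weights alone do not fit
    return int(free_gb * 1000 / mb_per_token)


scenarios = {
    "A: 8B FP16 on 1 x L40S": cache_token_capacity(48, 16, 0.15),
    "B: 70B FP16 on 1 x H100": cache_token_capacity(80, 140, 0.35),
    "C: 70B FP8 on 1 x H100": cache_token_capacity(80, 70, 0.175),
    "D: 70B FP8 on 2 x H100": cache_token_capacity(160, 70, 0.175, n_gpus=2),
}
for name, tokens in scenarios.items():
    print(f"{name}: {tokens:,} tokens, {tokens // 10:,} per user at 10 users")
```

Dividing the capacity by the target concurrency gives the context each user can receive, which is how the 3.4k and 46k figures in Scenarios C and D are obtained.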

Quantization: The Art of Squeezing Models

As the sizing scenarios show, VRAM is often the main bottleneck for LLM inference. Quantization reduces the numerical precision used to represent data, trading a small amount of accuracy for large improvements in memory efficiency and speed.

It is important to distinguish between the two major types of quantization used with vLLM, because they solve different problems.

Type 1: Model Weight Quantization as the Static Fix

This method compresses the large, static weight matrices of the pretrained model before the model is loaded.

  • Goal: Make a model fit onto a GPU when the full-precision weights would exceed available VRAM.
  • vLLM implementation: Although vLLM can quantize weights dynamically at startup, it is often more efficient to load a model that has already been quantized with optimized formats such as AWQ or GPTQ. These formats typically preserve accuracy better and provide faster decode speeds than generic on-the-fly conversion.
  • Impact: Static VRAM usage can be reduced by 50% with FP8 or INT8 and by 75% with INT4 or AWQ, leaving much more memory available for the KV cache.

Type 2: KV Cache Quantization as the Dynamic Fix

This method compresses the intermediate key and value states stored in memory during sequence generation.

  • Goal: Allow a model to support higher concurrency or longer context windows.
  • vLLM implementation: KV cache quantization is controlled through the runtime flag --kv-cache-dtype.
  • Recommendation: On modern GPUs with FP8 tensor core support, such as NVIDIA H100, L40S, or AMD MI300X, FP8 KV cache is strongly recommended. It can nearly double available context capacity with minimal impact on model quality.
  • Impact: It halves the per-token memory requirement described earlier, reducing the multiplier for a 70B model from about 0.35 MB per token to roughly 0.175 MB per token.
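
The two quantization types map to separate launch options. The sketch below only assembles a `vllm serve` command line as a list of strings; it does not start a server. The flags `--kv-cache-dtype`, `--quantization`, `--tensor-parallel-size`, and `--max-model-len` are available in recent vLLM releases, but accepted values differ between versions, so check `vllm serve --help` for the release you run. The helper name and the model identifier are placeholders.

```python
import shlex


def build_vllm_serve_command(model, kv_cache_dtype="auto", quantization=None,
                             tensor_parallel_size=1, max_model_len=None):
    """Assemble a vLLM launch command; nothing is executed here."""
    cmd = ["vllm", "serve", model,
           "--kv-cache-dtype", kv_cache_dtype,
           "--tensor-parallel-size", str(tensor_parallel_size)]
    if quantization is not None:
        # Type 1: weight quantization, e.g. a pre-quantized AWQ checkpoint.
        cmd += ["--quantization", quantization]
    if max_model_len is not None:
        cmd += ["--max-model-len", str(max_model_len)]
    return cmd


# Placeholder model ID; substitute the checkpoint you actually deploy.
cmd = build_vllm_serve_command(
    "your-org/your-70b-model",
    kv_cache_dtype="fp8",       # Type 2: KV cache quantization
    tensor_parallel_size=2,
    max_model_len=32768,
)
print(shlex.join(cmd))
```

Keeping the two settings as separate arguments mirrors the distinction above: weight quantization decides whether the model loads, while the KV cache data type decides how much context and concurrency fit afterwards.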

vLLM GPU Precision Formats

Quantization formats are not all equal. The best format depends on hardware architecture and the desired trade-off between model size and accuracy.

Precision / Format | Bytes per Parameter | Accuracy Impact | Best Hardware Support | Recommended Use Case
FP16 / BF16 | 2 | None, reference quality | All modern GPUs | The gold standard. Use it whenever VRAM capacity allows.
FP8 | 1 | Negligible | H100, H200, L40S, MI300X | The modern default. It offers an excellent balance of speed and quality on newer hardware and is especially useful for the KV cache.
AWQ / GPTQ (INT4 variants) | ~0.5 | Low to medium | A100, L40S, consumer GPUs | The squeeze option. Essential for running very large models on older or smaller GPUs, with strong decode performance.
Generic INT8 | 1 | Medium | Older GPUs such as V100 or T4 | A legacy option that is generally replaced by FP8 on newer hardware or AWQ for more aggressive compression.

Strategic Application and Trade-Offs

Choosing when to apply quantization requires balancing practical deployment limits against workload sensitivity. Quantization is powerful, but it comes with trade-offs that must be considered during planning.

Key Considerations: Accuracy and Hardware

Before choosing a deployment approach, consider these two foundational constraints:

  • Accuracy vs. compression: Aggressive quantization such as INT4 can reduce quality on tasks involving complex reasoning or code generation. FP8 is generally safe for most chat and RAG workloads.
  • Hardware compatibility: The chosen precision format must match the GPU’s capabilities. For example, FP8 quantization requires GPUs with FP8 tensor cores, such as NVIDIA Ada or Hopper architectures or AMD CDNA3 architectures, to realize performance benefits.

When Quantization Should Be Recommended

Considering these trade-offs, quantization is useful in many real-world deployment scenarios and is often the default choice in enterprise environments.

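  • Large models that do not fit in FP16: INT4 (about 35 GB for a 70B model) is often the only practical way to serve 70B-class models on a single 48 GB GPU, while INT8 or FP8 (about 70 GB) fits on a single 80 GB GPU but leaves little room for the KV cache.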
  • High-concurrency workloads: Lower VRAM usage leaves more space for the KV cache, allowing more active sequences and longer prompts.
  • RAG and enterprise chat: These workloads typically tolerate small accuracy changes without noticeably affecting the user experience.
  • Cost-optimization efforts: Quantization allows workloads to run on smaller and less expensive GPU options while maintaining acceptable performance. This is especially useful when balancing performance and cost on GPU-based cloud infrastructure.

When Quantization Should Be Avoided

Quantization is not suitable for every workload. Some tasks are highly sensitive to precision loss and should remain in FP16 or BF16 whenever possible.

  • Code generation and debugging: Lower precision can reduce structured reasoning quality and lead to subtle syntax or logic errors.
  • Math, finance, and scientific queries: Tasks requiring exact calculations benefit from higher precision formats to reduce rounding errors.
  • Evaluation, benchmarking, or regression testing: Even small accuracy drift can invalidate comparisons between model versions or deployment configurations.
  • Agentic workflows with multi-step reasoning: Small errors can compound across multiple steps and reduce overall reliability and task completion quality.

Putting It All Together: From Requirements to a Deployment Plan

So far, this guide has covered vLLM runtime behavior, memory fundamentals, and quantization strategies.

This section combines those concepts into a repeatable decision framework. It moves from theory to practice and provides a structured workflow for evaluating feasibility, choosing hardware, and building a deployment plan.

Step 1: Use a Sizing Questionnaire

To size a vLLM deployment accurately, you need specific technical details from the workload description. Broad goals such as “fast inference” are not precise enough. Use these five questions to collect the required information:

  • What maximum context length must be supported? This determines KV cache size and out-of-memory risk.
  • What is the target concurrency? This multiplies the KV cache requirement.
  • What latency is acceptable for TTFT and tokens per second? This helps determine whether high bandwidth, such as H100-class hardware, is needed or whether strong general capacity, such as L40S-class hardware, is sufficient.
  • Is model accuracy critical for math or code, or is good-enough quality acceptable for chat? This determines whether INT4 or FP8 quantization can be used to reduce cost.
  • Is there a strict budget limit? This helps decide between maximum performance and price-performance optimization.

Step 2: Select Model Size and Precision

Once the requirements are known, choose the smallest model and highest precision that meet the required quality level.

  • Precision is the lever: Lower precision formats such as INT4 or FP8 make larger models feasible on lower-cost hardware.
  • The 70B rule: A 70B model in FP16 requires multi-GPU or very high-memory hardware. The same model in INT4 can fit on a single GPU.

Guidance:

  • Chat or assistant workloads: Use INT4 or FP8.
  • Code or reasoning workloads: Use FP16 or FP8 and avoid INT4.
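
The selection rule can be written as a small lookup. This sketch encodes one reading of the guidance above: each workload type has an ordered list of acceptable precisions, and the highest one whose weights fit the memory budget wins. The function name, the workload labels, and the idea of passing a weights-only budget are assumptions of the example; KV cache headroom still has to be checked separately.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

# Acceptable precisions per workload, from highest to lowest precision.
ALLOWED_PRECISIONS = {
    "chat": ["fp8", "int4"],
    "code": ["fp16", "fp8"],  # INT4 is avoided for code and reasoning
}


def pick_precision(params_billions, weight_budget_gb, workload):
    """Highest allowed precision whose weights fit the budget, else None."""
    for precision in ALLOWED_PRECISIONS[workload]:
        if params_billions * BYTES_PER_PARAM[precision] <= weight_budget_gb:
            return precision
    return None  # choose a smaller model or add GPUs


# Budgets below assume roughly 4 GB of overhead per GPU.
print(pick_precision(70, 76, "chat"))   # fp8  (1 x 80 GB GPU)
print(pick_precision(70, 44, "chat"))   # int4 (1 x 48 GB GPU)
print(pick_precision(70, 44, "code"))   # None
print(pick_precision(70, 152, "code"))  # fp16 (2 x 80 GB GPUs)
```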

Step 3: Run a Hardware Feasibility Check

Validate the deployment using the memory calculations from earlier sections.

  • Static fit for weights: Does parameters × precision fit into VRAM? If not, quantize the model or add GPUs with tensor parallelism.
  • Dynamic fit for cache: Is there enough room for context × concurrency × multiplier? If not, reduce concurrency, shorten context length, or enable FP8 KV cache.
  • Workload fit for bandwidth: Long RAG and summarization workloads require high bandwidth, while standard chat requires strong compute performance.
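
The static and dynamic checks can be chained into a single pass/fail test. The following sketch is one possible way to do it, reusing the rule-of-thumb multiplier and an assumed 4 GB of overhead per GPU; the function name and the returned status strings are illustrative. It does not cover the bandwidth check, which depends on measured latency rather than capacity.

```python
def feasibility_check(weights_gb, total_vram_gb, n_gpus, context_len,
                      concurrency, mb_per_token, overhead_per_gpu_gb=4.0):
    """Return 'static_fail', 'dynamic_fail', or 'ok' for a planned setup."""
    usable_gb = total_vram_gb - n_gpus * overhead_per_gpu_gb
    if weights_gb > usable_gb:
        # Remedy: quantize the model or add GPUs with tensor parallelism.
        return "static_fail"
    cache_needed_gb = context_len * concurrency * mb_per_token / 1000
    if cache_needed_gb > usable_gb - weights_gb:
        # Remedy: reduce concurrency or context, or enable FP8 KV cache.
        return "dynamic_fail"
    return "ok"


# 70B FP16 on one 80 GB GPU: the weights alone do not fit.
print(feasibility_check(140, 80, 1, 4_000, 10, 0.35))     # static_fail
# 70B FP8 on one 80 GB GPU, 10 users with 4k context each.
print(feasibility_check(70, 80, 1, 4_000, 10, 0.175))     # dynamic_fail
# 70B FP8 on two 80 GB GPUs, 10 users with 32k context each.
print(feasibility_check(70, 160, 2, 32_000, 10, 0.175))   # ok
```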

Step 4: Recommend the GPU Strategy

After confirming feasibility, select the GPU configuration. The following cheat sheet summarizes common scenarios.

Common Configuration Outcomes

Workload Scenario | Recommended Configuration | Rationale
Standard Chat (8B-14B) | NVIDIA L40S with 48 GB | Best value. It provides strong decode compute, and 48 GB easily fits small models plus a large cache.
Large Chat (70B, Cost-Sensitive) | L40S INT4 or A100 INT4 | The squeeze approach. Quantization allows a 70B model to fit on one card, avoiding multi-GPU complexity.
High-Performance Chat (70B) | NVIDIA H100 FP8 or AMD MI300X FP16/FP8 | The modern standard. H100-class hardware uses FP8 to fit and accelerate inference. AMD MI300X provides 192 GB VRAM, enabling 70B models with large batch sizes on one card.
Massive Context / RAG | NVIDIA H200, AMD MI300X, or AMD MI325X | These are strong options for bandwidth and capacity. With 192 GB on MI300X or 256 GB on MI325X, extreme context lengths such as 128k+ become more practical without requiring 4 to 8 GPUs.
Uncompromised Quality (70B FP16) | 2 × H100 with tensor parallelism or 1 × AMD MI300X | NVIDIA requires two cards to fit 140 GB of weights. AMD MI300X can fit the full 70B FP16 model on a single GPU, avoiding tensor parallelism latency.
Ultra-Scale / Next-Generation (405B+) | NVIDIA B300 or AMD MI350X | Designed for frontier-scale model density. MI350X with 288 GB competes with next-generation high-end GPU platforms for fitting 400B+ mixture-of-experts models efficiently.

Step 5: Validate with Metrics

No theoretical sizing plan is perfect. Always validate with real metrics.

  • Check TTFT: If it is high, prefill is likely bottlenecked by memory bandwidth.
  • Check inter-token latency: If it is high, the batch size may be too aggressive and compute may be saturated.
  • Check KV cache usage: If usage is consistently above 90%, the deployment is at risk of out-of-memory failures. Enable chunked prefill or reduce concurrency.
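
These checks can be computed from per-request timestamps. The sketch below assumes you already record when a request was sent and when each streamed token arrived, however your client or monitoring stack collects them; the function names and the 0.90 threshold parameter are illustrative.

```python
def latency_metrics(request_start, token_times):
    """Return (TTFT, mean inter-token latency) in seconds."""
    if not token_times:
        raise ValueError("no tokens were received")
    ttft = token_times[0] - request_start
    # Gaps between consecutive tokens after the first one.
    gaps = [later - earlier
            for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_itl


def kv_cache_at_risk(usage_fraction, threshold=0.90):
    """True when KV cache usage is above the warning threshold."""
    return usage_fraction > threshold


# Hypothetical trace: request sent at t=0.0, tokens arrive every 0.25 s.
ttft, itl = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
print(ttft, itl)               # 0.5 0.25
print(kv_cache_at_risk(0.95))  # True
```

Tracking these values per request, rather than as a single average, makes it easier to see whether long prompts or peak concurrency are responsible for a regression.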

Frequently Asked Questions

1. How much GPU memory is required for LLM inference?

GPU memory planning depends on four main factors: the model’s parameter count, the precision format, the context length, and the number of parallel users. For weight memory alone, FP16 usually needs roughly 2 GB for each billion parameters. That means a 70B model uses about 140 GB in FP16, while INT4 quantization can reduce the weight memory to around 35 GB. The KV cache must also be considered, since it increases with longer contexts and higher concurrency. For example, a 70B model running with a 32k context and 10 simultaneous users may need around 112 GB for an FP16 KV cache or about 56 GB when using FP8.

2. What is the difference between tensor parallelism and pipeline parallelism in vLLM?

Tensor parallelism shards model weight matrices across multiple GPUs within each layer, allowing the GPUs to work on the same computation at the same time. This pools VRAM but requires synchronization after every layer, which adds communication overhead. Pipeline parallelism distributes model layers across GPUs sequentially, with different GPUs handling different layers. Tensor parallelism is normally used when a model is too large for one GPU, while pipeline parallelism is more common in training scenarios. For inference, tensor parallelism is the standard approach when models exceed single-GPU capacity.

3. When should quantization be used for vLLM deployments?

Quantization is recommended when models do not fit into available VRAM, when higher concurrency or longer context windows are required, or when cost optimization is important. FP8 quantization is ideal for modern hardware such as H100, L40S, or MI300X and usually causes minimal accuracy loss. INT4 quantization is useful for fitting large models on smaller GPUs, but it should be avoided for code generation, mathematics, and scientific workloads where precision is important. For chat and RAG workloads, quantization is often the preferred default.

4. How can KV cache memory requirements be calculated?

For quick estimation, use the per-token multiplier method: multiply total tokens, calculated as context length × concurrency, by the model-specific multiplier. For small models in the 7B to 14B range, use 0.15 MB per token for FP16 cache or 0.075 MB for FP8 cache. For large models in the 70B to 80B range, use 0.35 MB per token for FP16 cache or 0.175 MB for FP8 cache. For exact calculations, use the formula: Total KV Cache (GB) = (2 × n_layers × d_model × n_seq_len × n_batch × precision_bytes) / 1024³, or use tools such as the LMCache KV Calculator.

5. Can vLLM run on cloud GPU instances?

Yes, vLLM can be deployed on cloud GPU instances. Many providers offer GPU-based virtual machines with NVIDIA or AMD GPUs that meet vLLM requirements. When deploying, ensure that the selected GPU has enough VRAM for the model size and expected workload. For cost-effective deployments, consider quantized models such as INT4 or FP8 to fit larger models on smaller GPU instances. For multi-GPU deployments, high-speed GPU interconnects are important for efficient tensor parallelism.

Practical Use Cases of vLLM GPU Inference

Building on the foundational understanding of how model size, precision, GPU architecture, KV cache, and batching affect performance, these concepts can be applied to practical vLLM workloads.

For each use case, three key questions help determine the optimal setup:

  • Workload definition: What are the defining characteristics, such as prompt length, output length, concurrency, and latency sensitivity?
  • Sizing priorities: Which factors create the bottleneck, such as weights vs. KV cache or bandwidth vs. compute?
  • Configuration pattern: Which specific flags and hardware choices perform reliably?

Use Case 1: Interactive Chat and Assistants

  • Focus: Low latency and decode-bound performance.
  • Goal: Smooth streaming and fast typing speed for users.
  • Key constraint: Compute performance in TFLOPS and inter-token latency.

Use Case 2: High-Volume Batch Processing

  • Focus: Maximum throughput and compute-bound execution.
  • Goal: Process millions of tokens per hour for offline workloads such as summarization.
  • Key constraint: Total system throughput measured in tokens per second.

Use Case 3: RAG and Long-Context Reasoning

  • Focus: Context capacity and memory-bound performance.
  • Goal: Fit very large documents or long histories into memory without failures.
  • Key constraint: VRAM capacity and memory bandwidth for fast prefill.

Conclusion

Correctly sizing and configuring GPUs for vLLM requires understanding the core trade-offs between model size, precision, context length, and concurrency. Prefill and decode have different hardware needs: prefill depends heavily on memory bandwidth, while decode depends on compute throughput. Quantization is the main lever for fitting larger models onto available hardware, and tensor parallelism makes it possible to go beyond single-GPU limits.

The most important factor in a successful deployment is matching workload characteristics to the right hardware configuration. Interactive chat applications prioritize compute for fast token generation, while RAG and long-context workloads require large VRAM capacity and high memory bandwidth. By following the sizing framework described in this guide, you can evaluate feasibility systematically, choose suitable hardware, and optimize a vLLM deployment for production workloads.

Source: digitalocean.com
