FlashAttention 4: Practical Guide for LLM Inference Optimization
Scaled-dot-product attention (SDPA) consumes a major share of inference time and energy in large language models (LLMs). In many workloads, most operator executions are concentrated inside this single primitive. Attention works by taking queries (Q), multiplying them with keys (K), applying softmax normalization, and then multiplying the result by values (V) to generate the final output. The process is heavily memory-bound because Q, K, and V must be read repeatedly from high-bandwidth memory, while intermediate tiles are also written back multiple times.
FlashAttention approaches reduce this limitation by keeping more data on-chip for longer and making better use of GPU streaming multiprocessors (SMs). First released as open source in 2022, FlashAttention redesigned the attention operation so that softmax can be calculated on the fly. Only a small tile of the Q×K score matrix needs to remain on-chip at any moment.
This article provides a practical technical overview of FlashAttention 4 (FA4). It explains what has changed compared with earlier versions and offers guidance for adoption and benchmarking. The goal is to help LLM infrastructure engineers, kernel developers, and ML platform teams decide whether FA4 is suitable for their stack.
Key Takeaways
- FA4 is built as a Blackwell-first attention kernel designed mainly for SM100 GPU deployment. It introduces a warp-specialized 5-stage pipeline that improves overlap and on-chip reuse compared with earlier FlashAttention generations.
- Softmax efficiency is improved through two main methods: running software-based exp2() calculations directly on CUDA cores to reduce SFU contention, and using adaptive online rescaling to avoid unnecessary rescale operations while preserving numerical stability.
- Several important features are still missing. FA4 currently focuses on the forward pass and does not yet provide complete implementations for backward pass, variable-length sequences, and GQA/MQA, limiting its use in training and some model architectures.
- FA4 should be adopted carefully. Start with Blackwell inference, protect usage with benchmarks, correctness validation, and fallbacks, and continue using FA3 for Hopper and FA2 for Ampere/Ada as stable defaults for broad production environments.
Why Attention Kernels Still Drive LLM Cost
Although modern GPUs deliver very high FLOPS, transformer models frequently remain limited by memory bandwidth because attention must load Q, K, and V from global memory multiple times. To compute the attention output for each query token, the token may need to perform dot products against thousands of key tokens, normalize and rescale the result, and finally retrieve the corresponding values. Basic implementations calculate the full Q×K matrix and store it in memory. This quickly exceeds the capacity of on-chip memory and causes performance to suffer because of repeated off-chip reads.
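To make the cost of materializing the full score matrix concrete, the arithmetic can be sketched as below. The head count (32), element size (2 bytes, as for FP16 or BF16), and sequence lengths are illustrative assumptions rather than figures from any specific model.

```python
def score_matrix_bytes(seq_len, num_heads, batch_size=1, bytes_per_element=2):
    """Bytes needed to store the full seq_len x seq_len score matrix for every head."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


GIB = 1024 ** 3

# Hypothetical configuration: 32 heads, 2-byte elements, batch size 1.
for n in (2048, 8192, 32768):
    size = score_matrix_bytes(seq_len=n, num_heads=32)
    print(f"seq_len={n:>6}: {size / GIB:.2f} GiB")
```

Because the size grows with the square of the sequence length, quadrupling the context multiplies the score-matrix footprint by sixteen, which is why the matrix cannot stay on-chip and has to spill to HBM in a naive implementation.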
FlashAttention addresses this issue by tiling the attention matrix. Instead of materializing the full Q×K matrix, it streams small Q and K tiles into shared memory, calculates softmax for each tile, and keeps only the running maximum and sum for every query row. This greatly reduces memory traffic and can shift the kernel closer to being compute-bound.
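The tiling idea can be illustrated with a minimal pure-Python sketch for a single query row. It is a teaching model under simplifying assumptions (plain lists, one query, a hypothetical `tile_size`), not the kernel's actual implementation, and it compares the streamed result against a naive version that materializes every score.

```python
import math
import random


def attention_row_reference(q, keys, values):
    """Naive attention for one query: materialize every score, then softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def attention_row_tiled(q, keys, values, tile_size=4):
    """Stream K/V in tiles, keeping only a running max, a running sum, and the output."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile_size):
        k_tile = keys[start:start + tile_size]
        v_tile = values[start:start + tile_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new reference maximum.
        # On the first tile this is exp(-inf) == 0.0, and the state is all zeros anyway.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


rng = random.Random(0)
q = [rng.uniform(-1, 1) for _ in range(8)]
keys = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(10)]
values = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(10)]
reference = attention_row_reference(q, keys, values)
tiled = attention_row_tiled(q, keys, values, tile_size=4)
print(max(abs(a - b) for a, b in zip(reference, tiled)))
```

The tiled version never holds more than `tile_size` scores at once, yet it should agree with the reference up to floating-point rounding, which is the property that lets the real kernels avoid writing the score matrix to global memory.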
FlashAttention Evolution at a Glance
Understanding the evolution of FlashAttention helps clarify why FA4 matters. Each release used newer hardware capabilities to improve speed, efficiency, or both. FA4 is specifically tuned for Blackwell-generation GPUs. The following table summarizes the main improvements and performance impact of each generation.
| Version (Year) | Target GPU Architecture | Key Innovations | Performance Highlights |
|---|---|---|---|
| FlashAttention (v1) – 2022 | Ampere (NVIDIA A100) and earlier | IO-aware exact attention using tiling and on-chip buffering to reduce off-chip memory traffic. | 2–4× faster than baseline PyTorch attention; up to 10–20× lower memory usage compared with naïve implementations. |
| FlashAttention-2 – 2023 | Ampere (A100, RTX 30), Ada (RTX 40) | Improved parallelism, better work partitioning, higher occupancy, refined warp scheduling, and MQA/GQA support. | Approximately 2× faster than v1; reaches about 50–73% of A100 theoretical FLOPs, around 225 TFLOPs/s. |
| FlashAttention-3 – 2024 | Hopper (H100 / H800) | Uses asynchronous compute and data movement, Tensor Memory Accelerator, warp specialization, interleaved GEMM/softmax pipelines, and FP8 support with improved numerical behavior. | 1.5–2× faster than v2 on H100; around 740 TFLOPs/s FP16, roughly 75% utilization, and up to about 1.2 PFLOPs/s in FP8. |
| FlashAttention-4 – 2025 | Blackwell (for example B200, SM10.x) | Further pipeline specialization for Blackwell concurrency, software approximations for exponentials, optimized online softmax, and a kernel architecture tuned for Blackwell through CUDA/CUTLASS/DSL. | About 20–22% faster than cuDNN attention on Blackwell in reported benchmarks. |
What Is New in FlashAttention 4
FlashAttention 4 builds on the tile-based execution approach used in earlier FlashAttention releases. Instead of processing the full Q, K, and V tensors at once, the kernel works on smaller blocks. This improves reuse of data stored on-chip and reduces expensive global memory access. Cooperative thread arrays (CTAs) are responsible for producing one or more output tiles by loading the corresponding Q, K, and V blocks from global memory and transforming them into the final attention result.
Inside FA4, the attention computation is organized as a deeply pipelined workflow. Multiple parts of the operation run in an overlapping manner, which helps hide memory and execution latency while improving GPU utilization. Unlike FlashAttention 3, which mainly overlaps loading and compute in two stages, FA4 divides the work into several parallel pipeline stages. Different warp groups are assigned separate responsibilities as each tile moves through the attention computation.
In this warp-specialized design, 32-thread warps take on dedicated roles, including:
- Data loading and movement: moving Q, K, and V tiles from global memory into faster on-chip memory.
- Computation: producing partial attention scores with tensor-core matrix multiply-accumulate operations.
- Softmax and normalization: computing exponentials and normalizing the attention weights.
- Rescaling and reduction: applying numerical corrections and merging partial results before accumulation.
- Epilogue and storage: writing the finished output tiles back to global memory.
The main goal is to keep the asynchronous work queues for each warp group busy. This allows the GPU scheduler to switch quickly between groups as soon as the required operands are ready.
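The benefit of overlapping the roles listed above can be sketched with an idealized toy model. It assumes every stage takes exactly one time unit per tile and that stages never stall, which real hardware does not guarantee; the stage names mirror the roles above and the tile count is hypothetical.

```python
STAGES = ["load", "compute", "softmax", "rescale", "store"]


def sequential_steps(num_tiles, num_stages):
    """Time units if each tile runs all stages before the next tile starts."""
    return num_tiles * num_stages


def pipelined_steps(num_tiles, num_stages):
    """Time units if a new tile can enter the pipeline every step (ideal overlap)."""
    return num_stages + num_tiles - 1 if num_tiles > 0 else 0


def schedule(num_tiles, stages):
    """For each time step, list the (stage, tile_index) pairs that are active."""
    timeline = []
    for t in range(pipelined_steps(num_tiles, len(stages))):
        active = [(name, t - i) for i, name in enumerate(stages) if 0 <= t - i < num_tiles]
        timeline.append(active)
    return timeline


timeline = schedule(num_tiles=8, stages=STAGES)
for t, active in enumerate(timeline):
    print(t, active)
print("sequential:", sequential_steps(8, len(STAGES)), "pipelined:", pipelined_steps(8, len(STAGES)))
```

Once the pipeline is full, every stage works on a different tile in the same step. In this idealized model the gap between the two step counts widens as the number of tiles grows, which is consistent with larger gains being expected at long sequence lengths.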
Software Exponentials via CUDA Cores
One of the most notable changes in FA4 is how it computes the exponential function used by softmax. Traditionally, kernels rely on dedicated GPU special function units (SFUs) for expensive mathematical operations such as exp(). Because each SM has only a limited number of SFUs, they can easily become a queueing bottleneck.
FA4 avoids this bottleneck by letting regular CUDA cores compute approximate exponentials in software. More specifically, FA4 includes a custom polynomial approximation of exp2(x) whose precision is intended to be comparable to the hardware unit. It approximates the exponential with a cubic polynomial, allowing the work to run in parallel across many CUDA cores instead of being concentrated on a small number of SFUs.
The result is less waiting for SFUs during softmax execution. By moving exponential work to general-purpose cores, FA4 enables the warp handling softmax to keep better pace with the rest of the pipeline.
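The general technique can be sketched in Python as follows. This is only an illustration: it splits the input into an integer part and a fractional part, and it uses cubic Taylor coefficients chosen for simplicity. The coefficients and range-reduction details that FA4 actually uses are not given here, so everything below should be read as an assumption.

```python
import math

LN2 = math.log(2.0)
# Cubic Taylor coefficients of 2**f = exp(f * ln 2) around f = 0.
# Illustrative choice only; a tuned kernel would likely use fitted coefficients.
C1 = LN2
C2 = LN2 ** 2 / 2.0
C3 = LN2 ** 3 / 6.0


def exp2_poly(x):
    """Approximate 2**x with range reduction plus a cubic polynomial."""
    n = math.floor(x + 0.5)                 # nearest integer, so f lies in [-0.5, 0.5)
    f = x - n
    p = 1.0 + f * (C1 + f * (C2 + f * C3))  # Horner evaluation: multiplies and adds only
    return math.ldexp(p, n)                 # p * 2**n, an exponent adjustment


def max_relative_error(samples=10001, lo=-20.0, hi=20.0):
    """Largest relative error of exp2_poly over an evenly spaced grid."""
    worst = 0.0
    for i in range(samples):
        x = lo + (hi - lo) * i / (samples - 1)
        exact = 2.0 ** x
        worst = max(worst, abs(exp2_poly(x) - exact) / exact)
    return worst


print(exp2_poly(3.0), exp2_poly(0.5), max_relative_error())
```

The polynomial needs only fused multiply-add style arithmetic, which is the kind of work general-purpose cores handle well. Softmax's natural exponential can be expressed through the same routine, since exp(x) equals exp2(x · log2(e)).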
Smarter Online Softmax Rescaling in FA4
Another FA4 improvement concerns how softmax accumulation preserves numerical stability. Like earlier FlashAttention versions, FA4 uses online softmax, meaning it accumulates partial maximum values and sums while streaming through the sequence to prevent overflow. Earlier versions tracked the largest logit seen so far as the running reference and, whenever a new maximum appeared, rescaled the accumulated sum and output to maintain stable softmax behavior.
FA4 introduces adaptive rescaling. Instead of rescaling every time a new maximum appears, it only rescales when the new maximum is significantly larger than the previous one. This reportedly reduces rescale frequency by around a factor of 10.
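A minimal sketch of the idea, using scalar values in place of output tiles, is shown below. The `threshold` parameter and its value of 8.0 are hypothetical, chosen only to make the effect visible; the criterion the real kernel applies is not specified here.

```python
import math


def online_softmax_average(scores, values, threshold=0.0):
    """Softmax-weighted average computed in one streaming pass.

    threshold == 0.0 rescales on every new maximum (eager behaviour).
    threshold > 0.0 rescales only when the maximum grows by more than threshold.
    Returns (result, number_of_rescales).
    """
    ref_max = float("-inf")
    running_sum = 0.0
    acc = 0.0
    rescales = 0
    for s, v in zip(scores, values):
        if s > ref_max + threshold:
            if ref_max != float("-inf"):
                rescales += 1                  # do not count the initial assignment
            correction = math.exp(ref_max - s)
            running_sum *= correction
            acc *= correction
            ref_max = s
        # With a stale reference, the weight is at most exp(threshold): still bounded.
        w = math.exp(s - ref_max)
        running_sum += w
        acc += w * v
    return acc / running_sum, rescales


scores = [0.01 * i for i in range(1000)]       # slowly increasing logits
values = [float(i % 7) for i in range(1000)]
eager_result, eager_rescales = online_softmax_average(scores, values, threshold=0.0)
lazy_result, lazy_rescales = online_softmax_average(scores, values, threshold=8.0)
print(eager_rescales, lazy_rescales, abs(eager_result - lazy_result))
```

The final answer is unchanged because the numerator and denominator always share the same reference value, so it cancels in the division. Only the number of rescale operations differs, and the threshold bounds how large the intermediate weights can grow before a rescale is forced.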
Compatibility and Current Status
The following table summarizes expected and currently available compatibility for FlashAttention 4 across the software and hardware stack.
| Aspect | Current Status | Notes / Source |
|---|---|---|
| Hardware Target | FA4 kernels are designed for Blackwell GPUs such as SM10.x / B200. | FA4 is optimized for Blackwell-generation GPUs. FA3 and earlier versions support older architectures, while Blackwell support is being added. |
| Blackwell Support in Official Releases | Work in progress. | Blackwell support has not yet been merged into released versions, and users are tracking support for the sm_120 architecture. |
| Forward Pass (FA4) | Functional / available. | The FA4 forward attention kernel has been committed and can run on Blackwell with suitable builds. |
| Backward Pass (FA4) | Incomplete / limited. | The Blackwell backward pass still lacks important features such as variable-length sequence and GQA support. |
| Variable-Length Sequence Support | Not fully supported in backward. | This is reported as missing in the backward implementation and remains under development. |
| GQA / Grouped-Query Attention | Not yet supported in backward. | The backward pass lacks support for grouped-query attention and similar variants. |
| Framework Integration (vLLM) | Not supported yet. | vLLM currently recognizes only FA2 or FA3 and produces an error when FA4 is selected. |
| Framework Integration (PyTorch SDPA) | Not yet included in stable releases. | PyTorch scaled dot-product attention backends have not yet shipped FA4 support. |
| CUDA Toolkit / Driver Requirements | CUDA 12.8+ is commonly used for Blackwell builds. | Blackwell and SM10 builds are generally compiled with recent CUDA toolkits, usually CUDA 12.8 or newer, although exact requirements may change as development continues. |
FA4’s forward pass is already available in the public source tree. Public discussions in FlashAttention-related repositories show that users are still requesting a complete backward pass, variable-length sequence support, and grouped-query attention (GQA) support, which indicates that these gaps remain open.
Variable-length sequence support allows efficient batching of sequences with different lengths without relying on padding. GQA, or grouped-query attention, lets several query heads share the same key/value heads and is commonly used in LLMs to reduce memory usage. Earlier FlashAttention versions supported MQA/GQA configurations. FA4’s main forward kernel, however, may assume that queries and keys/values use the same number of heads. As a result, grouped heads may not yet be handled correctly. Models using grouped-query attention, including some Llama-style variants, may not be fully supported by FA4 until this capability is implemented.
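Both concepts can be sketched in a few lines of Python. The helper names are hypothetical; the sketch assumes the common convention in which cumulative sequence lengths start with a leading zero, and that query heads are assigned to key/value heads in contiguous groups.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Cumulative offsets marking where each sequence starts in a packed batch."""
    return [0] + list(accumulate(seq_lens))


def padded_vs_packed_tokens(seq_lens):
    """Token counts processed with padding to the longest sequence vs. packing."""
    padded = len(seq_lens) * max(seq_lens)
    packed = sum(seq_lens)
    return padded, packed


def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Index of the key/value head shared by a given query head under GQA."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


print(build_cu_seqlens([3, 5, 2]))
print(padded_vs_packed_tokens([3, 5, 2]))
print([kv_head_for_query_head(h, num_q_heads=8, num_kv_heads=2) for h in range(8)])
```

Setting `num_kv_heads` to 1 corresponds to MQA, and setting it equal to `num_q_heads` recovers standard multi-head attention. A kernel that assumes the latter case would index key/value heads incorrectly for grouped configurations, which is why GQA models need explicit validation.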
FA4 ecosystem integration is also still limited because the kernel is new. For example, the vLLM serving engine, which is optimized for LLM inference, currently supports FlashAttention 2 and FlashAttention 3 but not FA4. Selecting FA4 in vLLM can produce an unsupported-version error because the version selector only allows versions 2 or 3.
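Since an unsupported selection can surface as a runtime error, one defensive pattern is to try backends in order of preference and fall through on failure. The sketch below is generic Python: the two stub functions are hypothetical stand-ins, not calls into vLLM, PyTorch, or FlashAttention.

```python
def run_with_fallback(backends, *args, **kwargs):
    """Try each (name, fn) pair in order; return (name, result) from the first success."""
    errors = []
    for name, fn in backends:
        try:
            return name, fn(*args, **kwargs)
        except (NotImplementedError, RuntimeError, ValueError) as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no attention backend succeeded: " + "; ".join(errors))


def fa4_forward_stub(x):
    """Hypothetical stand-in for an FA4 path that is unavailable in this build."""
    raise NotImplementedError("FA4 path unavailable in this hypothetical build")


def sdpa_stub(x):
    """Hypothetical stand-in for a known-good fallback backend."""
    return [2 * v for v in x]


used, out = run_with_fallback([("fa4", fa4_forward_stub), ("sdpa", sdpa_stub)], [1, 2, 3])
print(used, out)
```

Returning the name of the backend that actually ran makes silent fallbacks visible in logs and metrics, which matters when a fallback changes latency without changing results.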
How to Adopt FlashAttention: Decision Guide
Use the following decision guide to choose the most appropriate FlashAttention version for your workload.
| Situation | Recommendation | Reason |
|---|---|---|
| You use Ampere or Ada GPUs such as A100 or RTX 30/40. | Use FlashAttention-2 (FA2). | FA4 is Blackwell-only and will not compile or run on SM8.x. FA2 remains the best-supported option for Ampere and Ada and is widely integrated into modern stacks. |
| You use Hopper GPUs such as H100 or H200. | Use FlashAttention-3 (FA3). | FA3 was designed for Hopper and commonly provides strong gains, often cited as 1.5–2× faster than FA2, while offering mature feature coverage for backward pass, variable-length sequences, and GQA/MQA in common implementations. FA4 is not intended for Hopper. |
| You do not have Blackwell hardware such as B200 or B100. | Do not use FA4. | FA4 is tailored for SM10.0 and will not compile or run correctly on SM8.x or SM9.x. Use the newest FlashAttention version supported by your GPU instead, such as FA2 for Ampere/Ada or FA3 for Hopper. |
| You have Blackwell hardware and your workload is inference-only. | Test FA4 forward first, with a fallback. | FA4 is currently forward-first, making inference the best fit. The largest gains are expected with long sequences and standard attention patterns. Keep cuDNN or SDPA fallbacks enabled to avoid failures or regressions when features are missing. |
| You have Blackwell hardware and your workload requires training. | Prefer a mature training path until FA4 backward support is available. | Training needs a backward pass. If FA4 only handles forward, backward may fall back to another kernel and dominate step time. Options include training on Hopper with FA3, accepting slower backward execution on Blackwell for now, or waiting for FA4 backward support. |
| Your model requires variable-length (ragged) batching through cu_seqlens. | Avoid FA4 unless padding or fallback behavior is acceptable. | If FA4 lacks variable-length support in your stack, you may encounter errors or silent fallbacks. Possible workarounds include padding to uniform lengths, bucketing by length, or disabling FA4 for variable-length cases. |
| Your model uses GQA or MQA, such as grouped-query heads in Llama-style models. | Validate carefully and be prepared to disable FA4. | If FA4 does not support GQA for your execution path, especially in backward and sometimes in forward, you may see incorrect behavior or fallbacks. Use a known-good backend such as FA3, cuDNN, or SDPA until GQA is confirmed in your environment. |
| You want FP4 or other ultra-low precision features. | Do not assume FA4 enables this yet. | Early FA4 adoption should begin with BF16 or FP16. FP4 depends on kernel and framework support and should not be planned for early deployments unless your toolchain documents it explicitly. |
| You need maximum stability and battle-tested behavior. | Choose the most mature option for your GPU. | FA4 is new and still evolving, with missing features and possible edge-case bugs. For low-risk production requirements, prioritize mature kernels and introduce FA4 only behind a feature flag with strict correctness checks. |
| You can experiment and optimize aggressively. | Adopt FA4 incrementally on Blackwell. | A practical approach is to enable FA4 for a narrow workload slice, such as inference, long-sequence buckets, non-GQA, and fixed-length inputs. Measure throughput and latency, verify numerical correctness, and then expand usage as support improves. |
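
The main branches of the table above can be condensed into a small selection function. This is a simplified sketch: the function name, the returned labels, and the conservative choice to route any training, variable-length, or GQA workload on Blackwell to a fallback are assumptions for illustration, and the compute-capability mapping (8.x for Ampere/Ada, 9.x for Hopper, 10.x for the Blackwell parts FA4 targets) should be checked against your hardware.

```python
def choose_attention_backend(sm_major, training=False, varlen=False, gqa=False):
    """Pick a backend label from the GPU's major compute capability and workload needs."""
    if sm_major == 10:
        # Blackwell SM10.x: FA4 only for the narrow slice it currently covers well.
        if training or varlen or gqa:
            return "fallback"
        return "fa4"
    if sm_major == 9:
        return "fa3"   # Hopper
    if sm_major == 8:
        return "fa2"   # Ampere and Ada
    return "fallback"  # anything else: cuDNN, SDPA, or another known-good path


for case in [
    dict(sm_major=8),
    dict(sm_major=9, training=True),
    dict(sm_major=10),
    dict(sm_major=10, gqa=True),
]:
    print(case, "->", choose_attention_backend(**case))
```

In a PyTorch environment, the major version could be obtained from `torch.cuda.get_device_capability()`, which returns a `(major, minor)` tuple. Keeping the policy in one function also gives a single place to relax the Blackwell restrictions as FA4 gains features.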
FAQs
What problem does FlashAttention 4 solve?
FA4 reduces the heavy HBM memory traffic caused by attention by computing attention tile by tile on-chip and avoiding materialization of the full QK⊤ matrix.
What is the main architectural difference between FA4 and FA3?
FA4 uses a warp-specialized pipeline with approximately five stages: load, compute, softmax and normalize, rescale and reduce, and store. This increases overlap and on-chip reuse on Blackwell SM10.x GPUs.
Why does FA4 calculate exponentials on CUDA cores instead of SFUs?
SFUs are limited in number and can become a bottleneck during softmax. FA4 therefore computes a software approximation of exp2() on CUDA cores to reduce pressure on SFUs.
What changes with adaptive online rescaling in softmax?
Instead of rescaling every time a new maximum appears, FA4 rescales only when the maximum increases by a significant amount. This lowers rescale overhead while preserving numerical stability.
When should engineers use FA4 in production today?
FA4 is best introduced first for Blackwell inference-only workloads with fixed-length, standard attention behind a feature flag. cuDNN, SDPA, and FA3/FA2 on older GPUs should remain available as fallbacks until backward, variable-length, and GQA support mature.
Conclusion
FlashAttention 4 extends the original FlashAttention goal of reducing attention’s memory-bandwidth requirements by performing more Q, K, and V work on-chip. Unlike earlier versions, FA4 does this with a Blackwell-specific warp-specialized 5-stage pipeline and two major softmax improvements: software exponentials on CUDA cores to reduce SFU bottlenecks, and adaptive online rescaling to remove redundant stability calculations.
FA4 has demonstrated meaningful forward-pass throughput improvements over cuDNN at long sequence lengths, but it remains forward-first and Blackwell-only. It also still has important feature gaps, including an incomplete backward pass and limited variable-length sequence and GQA support, while broader ecosystem integration is not yet fully mature.
The practical takeaway is to evaluate FA4 for Blackwell inference as a promising but carefully controlled optimization path. Adoption should be protected by rigorous benchmarking, correctness checks, and reliable fallbacks. Training workloads and broader hardware environments should continue to rely on mature FA3 and FA2 paths for production use.


