Serverless LLM Inference Performance: Metrics That Matter in Production
When teams compare serverless LLM inference models and platforms, the discussion often gets reduced to one figure: median tokens per second. This number is simple to publish, easy to compare, and for certain workloads it can be the exact metric worth optimizing. However, it is only one data point. By itself, it captures just a limited part of what “performance” means once an inference workload is running in production.
The reason is that each workload is affected by different bottlenecks. A nightly batch summarization process depends on sustained throughput, so median tokens per second is a reasonable benchmark for that scenario. A user-facing chat application, on the other hand, depends far more on how quickly the first token appears and how consistently that happens, rather than the steady generation rate. A production service that handles real traffic is shaped by its slowest requests, failure rate, and cost per completed response. None of these are reflected in a median throughput number. If you optimize for the wrong metric, you may end up with a system that performs well in benchmarks but disappoints in real usage.
This article explains the metrics that truly matter for production serverless inference, what each metric measures, and which workloads should prioritize it. The aim is to help you choose the measurements that fit your specific use case.
Key Takeaways
Benchmark results show that there is no universally fastest provider. Performance depends strongly on the specific model and workload. A provider may be very fast for one model, such as Llama 3.3 70B, but much slower for another, such as Gemma 4. Therefore, speed claims only make sense when the tested model and use case are clearly stated.
Reliability is just as important as speed and is often overlooked in benchmarks. Some providers restrict certain models to dedicated endpoints, while others show inconsistent behavior or long cold starts. A fast response time is not useful if the model cannot be accessed dependably.
For production systems, consistent time-to-first-token is usually more valuable than an occasional very low first-token latency. It is better to have predictable response behavior across the model catalog than a setup where the same request can range from under one second to 24 seconds. Users experience the slow outliers, not just the average case.
The most important cost factor is often the usefulness of the answer relative to the task. This is influenced more by choosing the right model than by comparing provider price lists alone.
Throughput: Tokens per Second
Throughput describes the steady rate at which a model generates tokens after it has begun responding. It is the metric most public benchmarks emphasize first. For some workloads, this is appropriate. A batch process that rewrites product data overnight, a pipeline that produces summaries or embeddings in large volumes, or any offline task where no user is actively waiting is limited by sustained tokens per second. In these cases, ranking platforms by throughput can guide the right decision.
Throughput benchmarks often measure a single active request, which does not represent typical production traffic. Real applications handle many requests in parallel, so it is more useful to measure aggregate throughput under concurrent load and observe how gracefully per-request speed degrades as demand increases. Model architecture also plays a major role. Mixture-of-experts models activate only a subset of their parameters for each token, so they can produce tokens much faster than dense models with similar or even higher parameter counts. Throughput therefore depends on the chosen model as well as the provider.
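The gap between per-request and aggregate throughput can be sketched in a few lines. This is a minimal illustration, assuming each completed request is logged with a start time, an end time, and an output token count. The `RequestRecord` type and the sample numbers are hypothetical and are not benchmark data.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    start_s: float       # time the request was sent
    end_s: float         # time the last token arrived
    output_tokens: int   # tokens generated for this request


def per_request_tps(record: RequestRecord) -> float:
    """Tokens per second seen by one request.

    Simplification: the duration includes time to first token.
    """
    return record.output_tokens / (record.end_s - record.start_s)


def aggregate_tps(records: list[RequestRecord]) -> float:
    """Total tokens divided by the wall-clock window covering all requests."""
    window_s = max(r.end_s for r in records) - min(r.start_s for r in records)
    return sum(r.output_tokens for r in records) / window_s


# Hypothetical numbers: one request alone, then four overlapping requests.
solo = [RequestRecord(0.0, 5.0, 500)]
concurrent = [RequestRecord(0.0, 8.0, 500) for _ in range(4)]

print(per_request_tps(solo[0]))        # single-stream rate
print(per_request_tps(concurrent[0]))  # per-request rate under load
print(aggregate_tps(concurrent))       # total rate under load
```

In this made-up case each request slows down under load while the platform as a whole generates more tokens per second, which is the trade-off a single-stream benchmark cannot show.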
Time to First Token and Stability
For interactive applications, time to first token, or TTFT, is the metric users feel most directly. In a streaming chat interface, TTFT is the delay between submitting a prompt and seeing the response begin. A model with only moderate throughput can still feel very responsive if the first token arrives quickly and predictably, especially when the user does not need to wait for the full answer before seeing generated output.
Predictability is the more difficult part. A first token that usually appears in 0.2 seconds but occasionally takes eight seconds creates a broken user experience, even if the median looks excellent. TTFT should therefore be measured as a range, comparing the median with the 95th percentile, because the gap between them reveals the experience hidden by the median.
The spread between the median and the worst case is most meaningful under a controlled setup: an interactive chat workload with fixed prompts, temperature set to 0, and at least 25 measured trials per model after three warmup requests are discarded. In a representative benchmark run from a cloud server in a New York region, one provider showed a very narrow median-to-worst-case range on gpt-oss-120b, from 0.29 to 0.35 seconds, and stayed below 0.7 seconds at worst on a Kimi reasoning model. The same pattern appeared across a broader mainstream model lineup: typical and worst-case first-token times stayed within a few hundred milliseconds of each other and remained under 0.4 seconds.
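Summarizing TTFT as a range rather than a single number might look like the sketch below. It assumes the TTFT samples are already collected in seconds and in send order. The nearest-rank percentile and the function names are illustrative choices, not the method used in the benchmark described here.

```python
import math
import statistics


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct percent at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def ttft_summary(ttfts_s: list[float], warmups: int = 3) -> dict[str, float]:
    """Discard the first `warmups` samples, then report median, p95, worst case, and spread."""
    measured = ttfts_s[warmups:]
    p50 = statistics.median(measured)
    p95 = percentile(measured, 95)
    return {
        "p50": p50,
        "p95": p95,
        "worst": max(measured),
        "p95_minus_p50": p95 - p50,
    }
```

With only 25 measured trials, the nearest-rank p95 is simply the second-slowest sample, so it is best read as indicative rather than precise.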
Tail Latency: p95 and p99
While TTFT measures whether a response starts quickly, tail latency measures whether the full request finishes within the required time budget. It represents the end-to-end duration of the slowest requests, usually measured at the 95th and 99th percentiles. These are the numbers used for service-level objectives, HTTP timeouts, and capacity planning. At production traffic levels, the tail is not an unusual edge case. It is a predictable portion of requests every minute, so a provider with an excellent median and a heavy tail can quietly exceed the latency budget once traffic increases.
A large difference between the median and tail latency can indicate a struggling serving path. Plan around p95 or p99, and treat a wide gap between median and tail as a reliability warning rather than a minor detail.
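The arithmetic behind "a predictable portion of requests every minute" is simple enough to write down. The helper names and traffic figures below are hypothetical and serve only to illustrate the calculation.

```python
def requests_beyond_percentile(requests_per_minute: float, pct: float) -> float:
    """Expected number of requests per minute that are slower than the given percentile."""
    return requests_per_minute * (100 - pct) / 100


def tail_ratio(p50_s: float, tail_s: float) -> float:
    """How many times slower the tail is than the median."""
    return tail_s / p50_s


# Hypothetical service handling 6,000 requests per minute.
print(requests_beyond_percentile(6000, 95))  # requests per minute slower than p95
print(requests_beyond_percentile(6000, 99))  # requests per minute slower than p99
print(tail_ratio(2.0, 8.0))                  # tail latency as a multiple of the median
```

At that volume, hundreds of requests every minute land beyond p95, which is why timeouts and service-level objectives are set against the tail rather than the median.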
Reliability and Availability
Speed does not matter if the request fails, returns no usable output, or the model cannot be called in the first place. Availability describes whether you can access the model you want through serverless inference without having to provision dedicated infrastructure.
Reliability is the next dimension. It measures whether requests succeed after the model is available. Always benchmark the specific model you plan to deploy, because established models may perform reliably on a mature platform, while newer or more specialized models can still have availability and reliability problems.
Cost per Useful Result
When comparing model types, the right cost metric is not simply dollars per million tokens from a pricing page. The more meaningful metric is the cost of one useful, completed response at the token volumes your workload actually produces. The main factors are model choice and routing capabilities, especially the difference between standard models and reasoning models. Reasoning models often generate a long internal thinking process before producing the visible answer, and those thinking tokens are billed as output. As a result, an answer that appears to contain only a few hundred tokens may be billed as thousands of tokens.
Two patterns appear when measuring the cost of a single completed chat answer. Within the same model, providers tend to price close to one another. A completed gpt-oss-120b answer may cost around $0.00017 to $0.00019, while a reasoning model answer may cost approximately 1.5 to 1.7 cents, a spread of roughly 10 to 15 percent in each case. Across different models, however, the difference can be around 230 times, from $0.00006 on a smaller model to about one and a half cents on a reasoning model. Provider choice may shift the cost of an answer by a modest percentage, but model choice can change it by orders of magnitude. The strongest cost lever is therefore matching the model to the task.
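A rough way to compute cost per completed answer, with thinking tokens counted as billed output, is sketched below. The token counts and per-million prices are invented for illustration and do not correspond to any provider's price list or to the benchmark figures above.

```python
def cost_per_answer(
    prompt_tokens: int,
    visible_tokens: int,
    reasoning_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Dollar cost of one completed answer.

    Reasoning (thinking) tokens are added to the visible answer tokens,
    because both are billed at the output price.
    """
    billed_output = visible_tokens + reasoning_tokens
    total = prompt_tokens * input_price_per_m + billed_output * output_price_per_m
    return total / 1_000_000


# Hypothetical prices and token counts, for illustration only.
instruct = cost_per_answer(200, 300, 0, input_price_per_m=0.10, output_price_per_m=0.40)
reasoning = cost_per_answer(200, 300, 5000, input_price_per_m=0.60, output_price_per_m=2.50)

print(f"instruct:  ${instruct:.5f}")
print(f"reasoning: ${reasoning:.5f}")
print(f"ratio:     {reasoning / instruct:.0f}x")
```

In this made-up example the output price differs by about six times, yet the cost per answer differs far more, because the reasoning model bills thousands of tokens the user never sees.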
The resulting architecture is task-based routing. Standard requests should go to a fast and inexpensive mainstream model. Only problems that genuinely require reasoning should be escalated to a reasoning model. Provider selection should be treated as the secondary decision. A routing layer can help manage this process automatically.
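A task-based router can be very small. The sketch below assumes two hypothetical model identifiers and takes the classifier and the model call as injected functions, since how a request is judged to need reasoning depends on the application.

```python
from typing import Callable

# Hypothetical model identifiers; substitute the models you actually deploy.
FAST_MODEL = "fast-instruct-model"
REASONING_MODEL = "reasoning-model"


def route(
    prompt: str,
    needs_reasoning: Callable[[str], bool],
    call_model: Callable[[str, str], str],
) -> tuple[str, str]:
    """Send the prompt to the cheap model unless the classifier asks for reasoning.

    Returns the chosen model name together with the model's answer.
    """
    model = REASONING_MODEL if needs_reasoning(prompt) else FAST_MODEL
    return model, call_model(model, prompt)
```

Keeping the classifier separate from the routing function makes it possible to start with a simple rule, such as a keyword or task-type check, and replace it later without touching the call path.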
Cold Starts and Burst Behavior
Serverless inference introduces a metric that dedicated deployments usually avoid: the cold start. When an endpoint has been idle or must scale to handle a burst, the first requests may pay a provisioning penalty. For spiky traffic, those first requests are often exactly the ones your users send.
This metric measures first-token time from a cold state and during bursts. If your traffic pattern is bursty, check whether the platform offers keep-warm behavior or provisioned capacity. Test the transition directly instead of assuming that warm-path benchmark numbers will apply.
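One way to separate cold-path from warm-path measurements is to label each request by how long the endpoint had been idle before it. The sketch below assumes samples of (send time, TTFT) in seconds, sorted by send time. The idle threshold is a hypothetical parameter, because the point at which a platform scales down is provider-specific.

```python
import statistics


def split_cold_warm(
    samples: list[tuple[float, float]],
    idle_threshold_s: float,
) -> tuple[list[float], list[float]]:
    """Split TTFT samples into cold and warm groups.

    A request counts as cold when it is the first one observed or when the
    gap since the previous request exceeds idle_threshold_s.
    """
    cold: list[float] = []
    warm: list[float] = []
    previous_send_s = None
    for send_s, ttft_s in samples:
        is_cold = previous_send_s is None or send_s - previous_send_s > idle_threshold_s
        (cold if is_cold else warm).append(ttft_s)
        previous_send_s = send_s
    return cold, warm


# Hypothetical samples: a burst, a long idle period, then another burst.
samples = [(0.0, 4.0), (1.0, 0.3), (2.0, 0.3), (700.0, 5.0), (701.0, 0.4)]
cold, warm = split_cold_warm(samples, idle_threshold_s=300.0)
print(statistics.median(cold), statistics.median(warm))
```

Reporting the two groups separately keeps a handful of cold requests from being averaged away by a large number of warm ones.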
Output Fidelity
A request can return HTTP 200 and still be unusable. Output fidelity measures whether the response is correct, complete, and at the expected quality level. This is not visible in latency or throughput charts. It can also include silent truncation. In some cases, a reasoning model with a normal answer-sized token budget may spend the entire budget on internal thinking and return an empty answer. Technically, the request succeeds, but the result is not useful.
Another possible issue is quantization. Some providers serve reduced-precision versions of a model, such as FP8 or FP4, which can affect output quality without changing the API. This is not always clearly disclosed.
This metric checks whether the output is valid for your task, not only whether the API returned a successful status code. For reasoning models, this means assigning enough tokens for the model to reach the actual answer. For any model, it means understanding the precision being served and regularly spot-checking quality.
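A basic fidelity check on a successful response could look like the following. It assumes the response has already been parsed into a dictionary in the OpenAI-compatible chat completions shape, with `choices[0].message.content` and `choices[0].finish_reason`. The function name and the problem labels are illustrative.

```python
def check_fidelity(response: dict) -> list[str]:
    """Return a list of problems found in an otherwise successful response."""
    problems: list[str] = []
    choice = response["choices"][0]
    content = (choice.get("message", {}).get("content") or "").strip()
    if not content:
        # The request succeeded, but there is no visible answer.
        problems.append("empty_answer")
    if choice.get("finish_reason") == "length":
        # Generation stopped because the token budget ran out.
        problems.append("truncated")
    return problems


# Hypothetical response: the budget was spent before any answer appeared.
exhausted = {"choices": [{"message": {"content": None}, "finish_reason": "length"}]}
print(check_fidelity(exhausted))
```

A check like this catches empty and truncated answers only. Judging correctness and quality still requires task-specific evaluation or periodic spot checks.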
Operational Fit
The final metric is less numerical but often determines integration effort: how well the platform fits the way you build. Most providers offer an OpenAI-compatible API, which can make switching as simple as changing a base URL. However, compatibility goes beyond endpoint format. It also matters whether the parameters you send are actually respected. A request to disable a model’s reasoning mode may be honored by one provider and silently ignored by another.
It is also important to verify accurate server-reported token usage for billing and observability, reliable streaming behavior, region and data-residency options, and terms of service that allow your use case. An endpoint that looks compatible is not necessarily a platform that behaves compatibly, so confirm the behaviors your application depends on.
Choosing the Right Metrics for Your Workload
The metrics above are not a list that should be optimized all at once. They are a menu to choose from based on what your application actually does. The workload determines which measurements matter and which ones are mostly noise.
| Workload | Primary Metrics | Secondary Metrics |
|---|---|---|
| Interactive chat / streaming UI | TTFT stability (p95), reliability, tail latency | Sustained throughput |
| Batch / offline generation | Sustained throughput under concurrency, cost per result | TTFT |
| RAG (retrieval-augmented generation) / summarization | TTFT (prefill cost), cost per result, reliability | Peak throughput |
| Production service at scale | Reliability and availability, tail latency, cost per result | Median values of any metric |
Median single-stream throughput is the number most comparisons highlight first. It is decisive for only one of these workloads and secondary for the others. It is still a useful metric, but it is not the only one. For most production deployments, it is not the most important one.
FAQ
What is the most important metric for serverless inference?
There is no single metric that matters most in every case. The right metric depends on the workload. Interactive chat applications prioritize stable time-to-first-token and reliability. Batch pipelines focus on sustained throughput under concurrency and cost per result, while RAG systems are especially affected by prefill latency on long prompts. Median tokens per second is a helpful baseline, but in most production environments, it is not the main deciding factor.
Why should I consider p95 latency instead of only the median?
The median describes a typical request, while p95 shows what the slower user experiences look like. At meaningful traffic levels, five percent of requests can represent thousands of interactions. A provider may look excellent at p50 and still be unsuitable at p95.
Do reasoning models cost more to run with serverless inference?
Yes, often far more than the list price suggests. Reasoning models generate thinking tokens before producing the visible response, and those tokens are billed as output. In our benchmark, a reasoning model cost roughly 230 times more per chat answer than a small instruct model, even though the per-token prices differed much less. Always compare cost per completed answer instead of cost per million tokens.
Why do benchmark results vary so much between providers for the same model?
Providers use different hardware, batching strategies, and quantization levels, and the exact precision is not always disclosed. Pool provisioning also has a major effect. A provider can be very fast on its headline models while a niche model on the same platform performs much more slowly. Rankings can change completely depending on the model being tested, so benchmark the exact model you plan to serve.
How many trials are needed for a reliable benchmark?
Run at least 25 measured trials for each model-and-scenario combination. Remove a few warmup requests first, set the temperature to 0, and use fixed prompts to keep results comparable. This number of trials is usually sufficient to report a stable p50 and a useful indicative p95. It is also helpful to collect measurements across multiple time windows.
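The trial procedure above can be sketched as a small harness. It assumes a caller-supplied `send_request` function (hypothetical) that issues one request with a fixed prompt and temperature 0, and that returns at the moment you want to time, for example when the first token arrives for TTFT or when the stream ends for end-to-end latency.

```python
import time
from typing import Callable


def run_trials(
    send_request: Callable[[], None],
    trials: int = 25,
    warmups: int = 3,
) -> list[float]:
    """Run warmup requests untimed, then return the duration of each measured trial in seconds."""
    for _ in range(warmups):
        send_request()  # warm the serving path; result is discarded
    durations: list[float] = []
    for _ in range(trials):
        started = time.perf_counter()
        send_request()
        durations.append(time.perf_counter() - started)
    return durations
```

Running the same harness in several time windows, and keeping each window's durations separate, makes it easier to see whether a slow result reflects the provider or just the moment of measurement.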
Conclusion
Serverless inference performance cannot be captured by a single metric. While median tokens per second is commonly reported, it primarily reflects batch throughput rather than real-world production performance. In practice, more meaningful metrics include model availability, the consistency of time to first token (TTFT) under load, tail latency at the 95th percentile, and the true cost of generating a response after accounting for realistic prompt sizes and reasoning tokens.
In our benchmarks, the most informative results consistently came from these production-oriented metrics. A catalog-wide first-token latency varying by only a few hundred milliseconds is difficult to achieve without well-provisioned, warm serving infrastructure. Conversely, higher tail latencies and gated model access often indicate more constrained provisioning. Before selecting a provider, benchmark it using your own workload by sending several hundred requests and analyzing latency percentiles. This type of testing provides far more insight into real-world infrastructure performance than pricing tables or headline benchmark numbers alone.


