Industry-standard LLM benchmarks in DataRobot

Every LLM deployment has a ceiling, a latency curve, and a unit cost. Most teams operate blindly, discovering their deployment limits only when over-provisioning exhausts their GPU budget or peak traffic causes a catastrophic failure.

Three numbers matter: maximum sustained concurrency before GPU saturation, end-to-end latency at that concurrency, and cost per million tokens at sustained load. These metrics emerge from how the model interacts with your hardware, runtime, tokenizer, and traffic mix.

DataRobot 11.8 changes that with LLM Profiling Jobs: a native integration of NVIDIA AIPerf, the industry-standard generative AI benchmarking tool. One authenticated POST benchmarks any DataRobot LLM deployment that serves an OpenAI-compatible API, sweeps the concurrency range and use cases you define, and returns the empirical inputs to Quota Reservations (available in DataRobot 11.9).

Why LLM capacity is hard to predict

LLM inference doesn’t scale linearly. Compute and memory demands per request depend dynamically on prompt length, response length, sampling parameters, and KV cache utilization. A deployment that serves 50 short chat turns per second can stall at 5 long-context RAG requests per second on the same hardware. Four distinct behaviors make static or speculative capacity estimates unreliable:

Latency is non-linear in concurrency. Time to first token and inter-token latency stay roughly flat across a wide concurrency range, then rise sharply once GPU memory bandwidth or compute saturates. TTFT rises when prefill compute saturates; inter-token latency rises when decode memory bandwidth saturates. Which one bites first depends on the workload mix and the deployment’s GPU configuration (single card or a cluster). The saturation knee is the operating point that matters, and it can’t be inferred from a single low-load measurement.
Throughput and latency trade off. You can squeeze more total tokens per second out of a deployment by running it at higher concurrency, at the cost of slower per-user response. The right trade-off depends on your SLO, not on a generic recommendation.
Use case mix matters. Two deployments running the same model on the same hardware can have very different capacity if one serves short Q&A and the other serves long-context summarization. The mix has to be in the test, or the test is wrong.
Caching and routing change the answer. Prefix caching (common in agentic coding with periodic compaction) and KV-aware routing can lift effective throughput dramatically. Profiles run against a cold deployment with random inputs represent the floor, not the ceiling.

LLM Profiling Jobs make those curves visible.

How LLM benchmarks help

Defend capacity and quota decisions with measured data. When finance questions a four-H100 footprint, or when cross-functional teams negotiate shared capacity, you can justify the architecture with empirical profiling data. Saturation knee, SLO target, and forecast traffic make GPU sizing an evidence-based line item. The same numbers feed Quota Reservations directly.
Account for cost per consumer. Total token throughput plus the GPU instance cost gives a cost-per-million-tokens figure that supports chargeback or showback. Attribute spend to consumers proportionally to their reservations, not by guesswork.
Compare models and hardware on equal terms. Hold the workload profile constant and vary one dimension at a time: the same model on different GPU configurations (a B200 node vs a B300 node, or 4×H100 vs 8×H100), or different models on the same configuration (Qwen3.6 35B-A3B MoE vs Qwen3.6 27B dense). Because AIPerf metrics match NVIDIA’s published NIM benchmarks, the numbers are also directly comparable to public benchmarks for the same model and hardware combinations. That makes a profile the right input for procurement and capacity-sizing decisions before a hardware order.
Prove a change is safe before you ship it. Before a model upgrade, vLLM bump, driver swap, or GPU migration, rerun the same profile and compare against the prior baseline. Regressions show up in the metrics, not in incident reports.

What LLM benchmark metrics mean

The four headline metrics AIPerf returns map directly to user experience and to GPU economics:

Time to first token (TTFT, ms). Measures how long a user waits between submitting a prompt and seeing the first character; this metric is dominated by prefill compute.
Inter-token latency (ITL, ms). Average time between successive output tokens once generation has started. Sets the perceived “typing speed” of the response.
Request throughput (requests/sec). Full request-and-response cycles per second at the tested concurrency. The basis for the Capacity (RPM) value on Quota Reservations.
Total token throughput (tokens/sec). Total tokens (input plus output) processed per second across all concurrent requests. The basis for cost-per-token economics.
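To connect TTFT and ITL to what a single user experiences, the two can be combined into a rough end-to-end latency figure. The sketch below assumes a streamed response where the first token arrives after TTFT and each remaining token arrives one average ITL later; that approximation and the function names are illustrative assumptions, not part of AIPerf or the DataRobot API. The input values are the averages from the example payload later in this article, with the example OSL of 1000.

```python
def estimate_request_latency_ms(ttft_ms: float, itl_ms: float, output_tokens: int) -> float:
    """Approximate end-to-end latency: first token, then one ITL gap per remaining token."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_ms + itl_ms * (output_tokens - 1)


def per_user_tokens_per_sec(itl_ms: float) -> float:
    """Perceived streaming speed for one user, derived from the average inter-token gap."""
    return 1000.0 / itl_ms


# Averages from the article's example payload; 1000 output tokens is the example OSL.
latency_ms = estimate_request_latency_ms(ttft_ms=833.06, itl_ms=23.79, output_tokens=1000)
speed = per_user_tokens_per_sec(23.79)
print(f"~{latency_ms / 1000:.1f} s per request, ~{speed:.0f} tokens/sec per user")
```

This is only an estimate: it uses averages, so it says nothing about p99 behavior, which the per-percentile values cover.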

For each metric, AIPerf reports averages and percentiles (p50, p90, p99). When GPU saturation is detected during the sweep, estimatedCapacity reports the iteration immediately before it. When saturation isn’t detected (the common case, since the profiler isn’t co-located with the deployment), estimatedCapacity reports the last iteration tested. Sweep wide enough that the curve clearly bends, or treat the result as a lower bound.

Submitting a job

A profiling request takes four parameters: a deploymentId (the ID of the DataRobot LLM deployment you want to profile), a list of concurrency levels to sweep, a request count scalar (how many requests each concurrent worker issues), and one or more use cases. Each use case defines an input sequence length (ISL), an output sequence length (OSL), standard deviations for both, and a weight (prob). Weights across all use cases must sum to 100.
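Because the weights must sum to 100, it can help to check the use-case list locally before submitting. The following is a minimal sketch of such a check; the field names come from the example in this article, while the helper name and the decision to treat all five fields as required are assumptions made for illustration.

```python
import json

# Field names as they appear in the article's USE_CASES example.
REQUIRED_KEYS = {"isl", "islStddev", "osl", "oslStddev", "prob"}


def validate_use_cases(use_cases: list) -> None:
    """Raise ValueError if a use case lacks a field or the weights do not sum to 100."""
    if not use_cases:
        raise ValueError("at least one use case is required")
    for i, uc in enumerate(use_cases):
        missing = REQUIRED_KEYS - set(uc)
        if missing:
            raise ValueError(f"use case {i} is missing {sorted(missing)}")
    total = sum(uc["prob"] for uc in use_cases)
    if total != 100:
        raise ValueError(f"prob values sum to {total}, expected 100")


use_cases = json.loads(
    '[{"isl":200,"islStddev":15,"osl":1000,"oslStddev":15,"prob":100}]'
)
validate_use_cases(use_cases)
print("use cases OK")
```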

export DATAROBOT_ENDPOINT=""
export DR_API_KEY=""
export HUGGINGFACE_DR_CRED_ID=""
export DEPLOYMENT_ID=""
export CONCURRENCIES="[1,10,50,100]"
export REQUEST_COUNT_SCALAR=2
export MODEL_TOKENIZER="openai/gpt-oss-20b"
export USE_CASES='[{"isl":200,"islStddev":15,"osl":1000,"oslStddev":15,"prob":100}]'
 
curl -X POST -H "Authorization: Bearer ${DR_API_KEY}" \
     -H "Content-Type: application/json" \
     "${DATAROBOT_ENDPOINT}/api/v2/llmProfilingJobs/" \
     -d @- <

A 202 Accepted response returns the job ID, an execution ID, and a status ID:

{
  "id": "69e09f9e25fdfdfab0d27925",
  "jobExecutionId": "69e09f9f25fdfdfab0d27926",
  "statusId": "5633f028-3f68-4f83-bddc-560d266d6bd2"
}

Monitoring and retrieving LLM benchmark results

Poll the Status API with the returned statusId. When the job finishes, the API returns 303 See Other and the Location header points to the results endpoint:

curl -s -L -i \
  -H "Authorization: Bearer ${DR_API_KEY}" \
  "${DATAROBOT_ENDPOINT}/api/v2/status/${STATUS_ID}/"

Fetch the full results with the profiling job ID (the id field from the submission response, shown here as LLM_PROFILING_JOB_ID):

curl -H "Authorization: Bearer ${DR_API_KEY}" \
     "${DATAROBOT_ENDPOINT}/api/v2/llmProfilingJobs/${LLM_PROFILING_JOB_ID}/profilingResults/"

Example payload (truncated):

{
  "estimatedCapacity": {
    "metrics": [
      { "name": "request_throughput",     "units": "requests/sec", "measurements": [{ "name": "avg", "value": 8.84    }] },
      { "name": "inter_token_latency",    "units": "ms",           "measurements": [{ "name": "avg", "value": 23.79   }] },
      { "name": "time_to_first_token",    "units": "ms",           "measurements": [{ "name": "avg", "value": 833.06  }] },
      { "name": "total_token_throughput", "units": "tokens/sec",   "measurements": [{ "name": "avg", "value": 4524.80 }] }
    ]
  },
  "results": [ "...per-iteration benchmark data..." ]
}

estimatedCapacity is the sustained operating point. results contains one entry per concurrency level tested, with the full metric set.
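For downstream calculations it is convenient to flatten estimatedCapacity into a simple name-to-value mapping. The sketch below does that for the "avg" measurement, using the truncated example payload shown above; the helper name is hypothetical, and a real response may carry additional measurements (such as percentiles) that this sketch ignores.

```python
import json


def capacity_averages(payload: dict) -> dict:
    """Map each estimatedCapacity metric name to its 'avg' measurement value."""
    averages = {}
    for metric in payload["estimatedCapacity"]["metrics"]:
        for measurement in metric["measurements"]:
            if measurement["name"] == "avg":
                averages[metric["name"]] = measurement["value"]
    return averages


# The truncated example payload from this article, with per-iteration results left empty.
sample = json.loads("""
{
  "estimatedCapacity": {
    "metrics": [
      {"name": "request_throughput", "units": "requests/sec", "measurements": [{"name": "avg", "value": 8.84}]},
      {"name": "inter_token_latency", "units": "ms", "measurements": [{"name": "avg", "value": 23.79}]},
      {"name": "time_to_first_token", "units": "ms", "measurements": [{"name": "avg", "value": 833.06}]},
      {"name": "total_token_throughput", "units": "tokens/sec", "measurements": [{"name": "avg", "value": 4524.80}]}
    ]
  },
  "results": []
}
""")

averages = capacity_averages(sample)
for name, value in averages.items():
    print(f"{name}: {value}")
```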

Reading the curve

The estimated-capacity numbers tell you the sustained ceiling. The per-iteration results show you how the deployment behaves as load climbs toward that ceiling. The table below is an illustrative example.

Concurrent requests	TTFT (ms)	Total throughput (tokens/sec)	Note
1	~150	~600	Low load, near-floor latency
10	~250	~2,500	Throughput scales nearly linearly
50	~800	~4,500	`estimatedCapacity` returned from this iteration
100	~1,500	~4,600	Saturated: TTFT roughly doubles, throughput plateaus

When AIPerf detects GPU saturation during the sweep, it identifies the iteration before it (concurrency 50 here) and returns those metrics as estimatedCapacity. When saturation isn’t detected, estimatedCapacity is simply the last iteration tested, which is why the sweep needs to extend past the knee. Anything past that point trades user-perceived latency for marginal throughput gains. If the product spec calls for TTFT under 1 second, the curve shows the deployment supports up to roughly 50 concurrent requests with margin: provision GPU so peak concurrent demand stays at or below that level.
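The same reading can be done programmatically. The sketch below takes the illustrative table values and finds the highest tested concurrency that still meets a TTFT target, along with the throughput gained per added concurrent request between iterations. This is a simple heuristic for reading a curve, not AIPerf's saturation detection; the type and function names are assumptions.

```python
from typing import List, NamedTuple, Optional


class Iteration(NamedTuple):
    concurrency: int
    ttft_ms: float
    total_tokens_per_sec: float


def highest_concurrency_within_slo(
    iterations: List[Iteration], ttft_slo_ms: float
) -> Optional[Iteration]:
    """Return the highest-concurrency iteration whose TTFT meets the SLO, if any."""
    within = [it for it in iterations if it.ttft_ms <= ttft_slo_ms]
    return max(within, key=lambda it: it.concurrency) if within else None


def marginal_gains(iterations: List[Iteration]) -> List[float]:
    """Tokens/sec gained per added concurrent request between consecutive iterations."""
    ordered = sorted(iterations, key=lambda it: it.concurrency)
    return [
        (b.total_tokens_per_sec - a.total_tokens_per_sec) / (b.concurrency - a.concurrency)
        for a, b in zip(ordered, ordered[1:])
    ]


# Approximate values from the illustrative table above.
curve = [
    Iteration(1, 150, 600),
    Iteration(10, 250, 2500),
    Iteration(50, 800, 4500),
    Iteration(100, 1500, 4600),
]

best = highest_concurrency_within_slo(curve, ttft_slo_ms=1000)
gains = marginal_gains(curve)
print(best)
print(gains)
```

A sharp drop in the marginal gain between two iterations is a sign that the knee lies somewhere between them, which is where a narrower follow-up sweep is most informative.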

From profiling result to Quota Reservations config

The bridge from a profiling run to a Quota Reservations configuration is direct:

Quota setting	Where it comes from	Example (from sample above)
Capacity (RPM)	`estimatedCapacity.request_throughput` × 60	8.84 req/sec × 60 ≈ 530 RPM
Utilization Threshold	Pick 70–80% of Capacity so enforcement engages before the saturation knee	80% → enforcement at ~424 RPM
Reserved % per consumer	Sized to the minimum each priority consumer needs during contention	30% Production Agent A, 20% Agent B, 30% Agent C, 20% unreserved pool
Refill rate	Capacity / 60 (requests per second)	530 / 60 ≈ 8.83 req/sec

For a primer on how Capacity, Utilization Threshold, and Reserved % interact under load, see Rate Limiting vs. Quota Reservations.
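The arithmetic in the table can be expressed as a small helper. The sketch below derives Capacity, the utilization threshold, the refill rate, and per-consumer guaranteed RPM from the measured request throughput. The dictionary keys and function names are hypothetical labels for this illustration and are not the Quota Reservations API field names.

```python
def quota_settings(request_throughput_rps: float, threshold_fraction: float = 0.8) -> dict:
    """Derive quota values from measured requests/sec, following the table above."""
    capacity_rpm = round(request_throughput_rps * 60)
    return {
        "capacity_rpm": capacity_rpm,
        "threshold_rpm": round(capacity_rpm * threshold_fraction),
        "refill_rps": round(capacity_rpm / 60, 2),
    }


def reserved_rpm(capacity_rpm: int, reserved_pct: dict) -> dict:
    """Convert reserved percentages into RPM per consumer; shares must total 100."""
    if sum(reserved_pct.values()) != 100:
        raise ValueError("reserved percentages must sum to 100")
    return {name: capacity_rpm * pct // 100 for name, pct in reserved_pct.items()}


settings = quota_settings(8.84, threshold_fraction=0.8)
shares = reserved_rpm(
    settings["capacity_rpm"],
    {"Production Agent A": 30, "Agent B": 20, "Agent C": 30, "unreserved pool": 20},
)
print(settings)
print(shares)
```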

A worked cost example

Take the sample result: 4,524 total tokens per second sustained (input plus output). That is roughly 16.3 million tokens per hour from one deployment.

If the underlying GPU instance costs $X per hour, the cost per million tokens is $X / 16.3. For an instance at $4 per hour, that is about $0.25 per million tokens. For $12 per hour, about $0.74. To calculate cost per million output tokens (the standard basis for public API comparisons), divide the cost per million total tokens by the workload’s output share. For example, given an ISL of 200 and an OSL of 1000, output accounts for roughly 83% of total tokens (1000 / 1200). At a $4 hourly instance price, this translates to approximately $0.30 per million output tokens.
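The cost arithmetic above can be reproduced with a few lines of code, which makes it easy to rerun after each profiling job. This is a minimal sketch: the function names are illustrative, and the hourly prices are the example figures from the text rather than real instance prices.

```python
def cost_per_million_tokens(hourly_cost_usd: float, total_tokens_per_sec: float) -> float:
    """Cost per million total tokens (input plus output) at sustained throughput."""
    millions_of_tokens_per_hour = total_tokens_per_sec * 3600 / 1_000_000
    return hourly_cost_usd / millions_of_tokens_per_hour


def cost_per_million_output_tokens(
    hourly_cost_usd: float, total_tokens_per_sec: float, isl: int, osl: int
) -> float:
    """Attribute the whole cost to output tokens using the workload's output share."""
    output_share = osl / (isl + osl)
    return cost_per_million_tokens(hourly_cost_usd, total_tokens_per_sec) / output_share


throughput = 4524.80  # total tokens/sec from the sample result
cost_at_4 = cost_per_million_tokens(4.0, throughput)
cost_at_12 = cost_per_million_tokens(12.0, throughput)
output_cost_at_4 = cost_per_million_output_tokens(4.0, throughput, isl=200, osl=1000)

print(f"$4/hr:  ${cost_at_4:.2f} per million tokens")
print(f"$12/hr: ${cost_at_12:.2f} per million tokens")
print(f"$4/hr:  ${output_cost_at_4:.3f} per million output tokens")
```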

Every benchmark run gives you a fresh, accurate cost-per-token figure for the exact model, hardware, and quantization combination you’re running. After a vLLM upgrade or a hardware swap, re-run the same profile and confirm your unit economics improved instead of trusting a vendor claim. This is the foundation for per-token and per-agent cost transparency in chargeback.

Choosing your inputs

A useful profile starts with two questions: what concurrency range do you expect in production, and what does your traffic actually look like?

Concurrencies to sweep. Start wide ([1, 10, 50, 100]) to locate the saturation knee, then narrow (such as [40, 50, 60, 70]) for an SLO-grade reading around that point.
Request count scalar. Set it high enough that each iteration runs long enough to smooth out noise. A scalar of 2 is a reasonable starting point. Raise it if variance looks high.
Use cases. Match your real traffic mix. If you serve 70% short chat turns (ISL 200, OSL 300) and 30% long-context RAG (ISL 4000, OSL 800), define two use cases with prob: 70 and prob: 30. Testing a blended traffic mix exposes tail-latency behavior (such as p99 spikes) that a single-use-case average obscures.
Tokenizer. Set it explicitly. The benchmark depends on accurate token counts, so the matching tokenizer is part of a correct measurement.
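To see what a blended mix implies before running it, the expected tokens per request can be computed from the use-case weights. The sketch below uses the 70/30 chat and RAG example from the list above; the helper name is hypothetical, and the standard-deviation fields are omitted because the example does not specify them.

```python
def blended_profile(use_cases: list) -> dict:
    """Weight-averaged input and output lengths for a use-case mix."""
    total_weight = sum(uc["prob"] for uc in use_cases)
    mean_isl = sum(uc["isl"] * uc["prob"] for uc in use_cases) / total_weight
    mean_osl = sum(uc["osl"] * uc["prob"] for uc in use_cases) / total_weight
    return {
        "mean_isl": mean_isl,
        "mean_osl": mean_osl,
        "output_share": mean_osl / (mean_isl + mean_osl),
    }


# 70% short chat turns and 30% long-context RAG, as in the example above.
mix = [
    {"isl": 200, "osl": 300, "prob": 70},
    {"isl": 4000, "osl": 800, "prob": 30},
]

profile = blended_profile(mix)
print(profile)
```

In this mix the RAG use case is a minority of requests but contributes most of the input tokens, which is one reason a chat-only profile would overstate capacity.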

Operational notes

Profiling generates synthetic load. Run jobs against a non-production LLM deployment or during a maintenance window.
Because the traffic is synthetic, prefix cache hits won’t appear in token metrics.
Profiling treats the deployment as a black box. Whether the deployment runs on one GPU or many, and whatever combination of tensor, pipeline, data, or expert parallelism it uses, the profile measures the externally observable result.
Jobs can be canceled with a DELETE to the profiling job ID. Cancellation is best-effort and may not stop a run that is nearly complete.
Before you submit, store your Hugging Face token in DataRobot Credential Management as an “API Token (API Key)” credential. AIPerf uses it to fetch the model tokenizer, and the stored credential prevents rate-limit errors.

Get access

LLM Profiling Jobs are in private preview in DataRobot 11.8. To enable them on your tenant, contact your DataRobot account team. They will turn on the Enable Dynamic Quota Capacity Profiling feature flag (the internal name for LLM Profiling Jobs) and configure the profiling job image in your cluster.
