
LLM Inference Benchmarking with vLLM and NVIDIA GenAI-Perf

While experimenting with LLM inference benchmarking, I ran into a familiar problem: relying solely on ChatGPT for guidance produced inconsistent, sometimes hallucinated answers. That pushed me to step back and research dedicated tooling, where I discovered that NVIDIA has built a surprisingly mature and flexible benchmarking solution. It works across multiple inference backends, including Triton and OpenAI-compatible APIs, making it practical for real-world platforms.
One of the biggest sources of confusion in this space is the difference between LLM performance testing and LLM benchmarking. Performance testing evaluates the end-to-end capacity of your infrastructure — covering network latency, CPU limits, and system-level throughput. Benchmarking, however, zooms in on the inference engine itself, focusing on model-specific behavior such as token throughput, latency, batching efficiency, and KV cache usage. These details become essential when teams start operating their own inference platforms, which is increasingly common across enterprises.
This post is part of an ongoing series where I document what I’m learning while experimenting with vLLM-based inference stacks, combined with lessons from operating LLM inference platforms in production enterprise environments.
Below, I’ll walk through some of the most common LLM inference workload patterns.
| Workload pattern | Typical examples | ISL vs OSL | GPU & KV cache impact | Batching & serving implications |
| --- | --- | --- | --- | --- |
| Generation-heavy | Code generation, emails, content creation | Short input (~50–200) → long output (~800–1.5K) | KV cache grows over time during decoding; sustained GPU utilization during generation | Benefits from continuous batching; latency sensitive at high concurrency |
| Context-heavy (RAG / chat) | Summarization, multi-turn chat, retrieval-augmented QA | Long input (~1K–2K) → short output (~50–300) | Large upfront KV allocation; high memory footprint per request | Limits max concurrency; batching constrained by KV cache size |
| Balanced (translation / conversion) | Language translation, code refactoring | Input ≈ output (~600–1.8K each) | Stable KV usage throughout request lifecycle | Predictable batching; easier capacity planning |
| Reasoning-intensive | Math, puzzles, complex coding with CoT | Very short input (~50–150) → very long output (2K–10K+) | Explosive KV growth; long-lived sequences | Poor batch efficiency; throughput drops sharply at scale |
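These patterns map directly onto GenAI-Perf's synthetic workload settings, which are introduced later in this post. As a rough sketch (the token counts are illustrative, and the model is simply the Qwen3-0.6B instance used later in this walkthrough), the same command shape can emulate different patterns just by changing the input and output token means:
# Generation-heavy: short prompts, long completions (e.g., code generation)
genai-perf profile -m Qwen/Qwen3-0.6B --endpoint-type chat --tokenizer Qwen/Qwen3-0.6B \
  --synthetic-input-tokens-mean 128 --output-tokens-mean 1024 \
  --request-count 200 --concurrency 32

# Context-heavy: long prompts, short completions (e.g., RAG summarization)
genai-perf profile -m Qwen/Qwen3-0.6B --endpoint-type chat --tokenizer Qwen/Qwen3-0.6B \
  --synthetic-input-tokens-mean 1536 --output-tokens-mean 128 \
  --request-count 200 --concurrency 32
Reasoning-intensive workloads are harder to emulate with fixed token counts, since output length varies so widely; the table above is mainly a guide for choosing sensible means and standard deviations.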
In this example, we will set up a single node that hosts both the inference server and the benchmarking tool, purely for experimentation. For production use, the benchmarking tool should run from a separate node.

For decent benchmarking, you need the following to get started:
NVIDIA GPU–powered compute platform. This can be your desktop, or you can use any of the Neo Cloud providers. My obvious preference is Denvr Cloud. Feel free to sign up — https://www.denvr.com/
Hugging Face login. Sign up for a free Hugging Face account. You’ll need it to download models and access gated models such as Meta Llama and others (a quick CLI login sketch follows this list).
LLM-labs repo. https://github.com/kchandan/llm-labs
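A one-time CLI login stores the Hugging Face token locally so gated models can be pulled. This is a minimal sketch using the standard huggingface_hub CLI; later in the walkthrough the same token is simply exported as HF_TOKEN for the containers.
# Install the Hugging Face CLI and log in once; it prompts for your access token
pip install -U huggingface_hub
huggingface-cli login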
Step-by-step guide
Prerequisites
A Denvr Cloud Console with access to GPU instances.

Select the Virtual Machine

Select the region (Houston) and an H100 GPU. Alternatively, you can choose an A100 (either the 80G or 40G option in our Calgary cluster).
Navigate to the GPU instances section.
Select an H100-based instance with at least 80 GB of GPU memory. Note that the dev instance (the client machine that executes the benchmark) doesn’t need a full H100 node; a smaller A100-40G VM is a great starting point.

Select the option with NVIDIA drivers and Docker pre-installed

Wait for the VM to launch (takes 5-7 minutes) and then log in to the instance

SSH into your instance:
ssh -i <your-ssh-key> ubuntu@<IP of your Denvr VM instance>
To install the necessary packages on the Linux VM (NVIDIA drivers, Docker, and so on), the easiest approach is to update the IP address in the Ansible inventory file and then let the playbook handle the full installation.
git clone https://github.com/kchandan/llm-labs.git
cat llm-labs/llmops/ansible/inventory/hosts.ini
; [vllm_server]
; server_name ansible_user=ubuntu
[llm_workers]
<IP Address> ansible_user=ubuntu ansible_ssh_private_key_file=~/.ssh/<your_key_file>
Once the IP address is updated, run the Ansible playbook to install the required packages:
ansible-playbook -i ansible/inventory/hosts.ini ansible/setup_worker.yml
As noted earlier, if you are setting up a client machine only to execute the benchmark, you don’t need a full H100 node; a smaller A100-40G VM is a great starting point. After the playbook finishes, make sure the driver installation looks good.
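A quick way to confirm the node is usable is to check the driver on the host and from inside a container. This is a minimal sketch; the CUDA image tag below is just an example, and it assumes the playbook installed the NVIDIA container toolkit.
# Driver loaded and GPU visible on the host
nvidia-smi

# Containers can see the GPU through the NVIDIA container toolkit
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi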

Create a common Docker bridge network (default bridge driver) so that all containers can talk to each other:
docker network create llmops-net
Export the Hugging Face token:
export HF_TOKEN=<your_hf_token>
Now launch the vLLM Docker Compose stack; it will take some time to load:
docker compose -f docker-compose-vllm-qwen3-0.6B.yml up -d
docker compose -f docker-compose.monitoring.yml up -d
Ignore the orphan-container warning. I have deliberately kept these two compose files separate so that more model-specific compose files can be added to the same repo later.
Once all containers are downloaded and running, it should look like this (no containers in a crash loop):

Now that the base vLLM inference setup is in place, the next step is to set up NVIDIA GenAI-Perf.
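Before installing the tool, a quick curl against the OpenAI-compatible API confirms that vLLM is actually serving requests. This sketch assumes the compose file publishes vLLM on port 8000 (the vLLM default); adjust the port to match your setup.
# List the models the server exposes
curl -s http://localhost:8000/v1/models

# Minimal chat completion against the OpenAI-compatible endpoint
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'
With the endpoint responding, install GenAI-Perf: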
pip install genai-perf
Do a quick test run to see if everything is working. Here I am running a loop over different concurrency levels; in this context, concurrency represents the number of parallel requests.
for c in 64 96 128 192 256; do
genai-perf profile \
-m Qwen/Qwen3-0.6B \
--endpoint-type chat \
--synthetic-input-tokens-mean 200 --synthetic-input-tokens-stddev 0 \
--output-tokens-mean 100 --output-tokens-stddev 0 \
--request-count 400 \
--warmup-request-count 10 \
--tokenizer Qwen/Qwen3-0.6B \
--concurrency $c
done
The output should look like this:

If you are able to see these metrics from GenAI-Perf, it means your setup is complete.
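Each run also writes its metrics to an artifacts directory (by default artifacts/ under the working directory, with one subdirectory per run; the exact layout and file names can vary between GenAI-Perf releases). A small sketch like this gathers the per-concurrency CSV exports in one place for comparison:
# Collect the exported metrics from all concurrency runs
mkdir -p results
find artifacts -name "*genai_perf.csv" | while read -r f; do
  # The run directory name encodes the model and concurrency level
  cp "$f" "results/$(basename "$(dirname "$f")").csv"
done
ls results/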
Now let’s move on to setting up the Grafana dashboard.
First, ensure that you have configured the Prometheus data source in Grafana. By default, it points to localhost, so we need to switch it to prometheus, matching the service name used in the Docker Compose file.

As part of the Docker Compose setup, Grafana should automatically pick up the dashboard (NVIDIA + vLLM).
You should now be able to see the metrics flowing into the Grafana dashboard.
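If the panels stay empty, it is usually the metrics pipeline rather than the dashboard. A quick check from the host (assuming the usual default ports, 8000 for vLLM and 9090 for Prometheus, adjusted to your compose file):
# vLLM exposes Prometheus metrics on its /metrics endpoint
curl -s http://localhost:8000/metrics | grep -m 5 "^vllm"

# Prometheus target health; every target should report "up"
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[a-z]*"'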

At this stage, we’ve put together a minimal “hello-world” foundation for LLM inference benchmarking. The real work begins next: running meaningful benchmarks, analyzing the results, and tuning vLLM and GenAI-Perf parameters to extract the best possible performance from the underlying hardware. In this example, the setup runs on a single A100-40GB GPU. While it may appear modest on paper, these GPUs are extremely capable and well suited for agentic workloads where smaller language models are invoked frequently and at scale.
In the upcoming posts, I’ll dive deeper into advanced benchmarking strategies, additional metrics and logging, and practical techniques to maximize GPU efficiency in production inference environments.
If you’re planning to run or scale AI inference workloads, sign up today to explore GPU-powered infrastructure designed for modern LLM platforms and start experimenting with your own inference stack.