
LLM Inference Benchmarking with vLLM and NVIDIA GenAI-Perf

Jan 13

5 min read


Benchmarking with GenAI-Perf

While experimenting with LLM inference benchmarking, I ran into a familiar problem: inconsistent results caused by model hallucinations when relying solely on ChatGPT. That pushed me to step back and research dedicated tooling, where I discovered that NVIDIA has built a surprisingly mature and flexible benchmarking solution. It works across multiple inference backends, including Triton and OpenAI-compatible APIs, making it practical for real-world platforms.


One of the biggest sources of confusion in this space is the difference between LLM performance testing and LLM benchmarking. Performance testing evaluates the end-to-end capacity of your infrastructure — covering network latency, CPU limits, and system-level throughput. Benchmarking, however, zooms in on the inference engine itself, focusing on model-specific behavior such as token throughput, latency, batching efficiency, and KV cache usage. These details become essential when teams start operating their own inference platforms, which is increasingly common across enterprises.


This post is part of an ongoing series where I document what I’m learning while experimenting with vLLM-based inference stacks, combined with lessons from operating LLM inference platforms in production enterprise environments.

Below, I’ll walk through some of the most common LLM inference workload patterns.


| Workload pattern | Typical examples | ISL vs OSL (tokens) | GPU & KV cache impact | Batching & serving implications |
| --- | --- | --- | --- | --- |
| Generation-heavy | Code generation, emails, content creation | Short input (~50–200) → long output (~800–1.5K) | KV cache grows over time during decoding; sustained GPU utilization during generation | Benefits from continuous batching; latency sensitive at high concurrency |
| Context-heavy (RAG / chat) | Summarization, multi-turn chat, retrieval-augmented QA | Long input (~1K–2K) → short output (~50–300) | Large upfront KV allocation; high memory footprint per request | Limits max concurrency; batching constrained by KV cache size |
| Balanced (translation / conversion) | Language translation, code refactoring | Input ≈ output (~600–1.8K each) | Stable KV usage throughout request lifecycle | Predictable batching; easier capacity planning |
| Reasoning-intensive | Math, puzzles, complex coding with CoT | Very short input (~50–150) → very long output (2K–10K+) | Explosive KV growth; long-lived sequences | Poor batch efficiency; throughput drops sharply at scale |

(ISL = input sequence length, OSL = output sequence length, measured in tokens.)
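As a rough illustration of how these patterns translate into benchmark runs, the sketch below maps two of them onto GenAI-Perf's synthetic token controls (the same flags used in the test run later in this post). The token counts are placeholder midpoints from the table, and the model name matches the Qwen3-0.6B deployment set up below; adjust both to your own workload.

# Sketch: emulate two contrasting workload patterns with synthetic inputs.
# Token counts are illustrative midpoints from the table above, not tuned values.

# Generation-heavy: short prompt, long completion
genai-perf profile \
  -m Qwen/Qwen3-0.6B \
  --endpoint-type chat \
  --synthetic-input-tokens-mean 120 --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 1000 --output-tokens-stddev 0 \
  --tokenizer Qwen/Qwen3-0.6B \
  --concurrency 32

# Context-heavy (RAG / chat): long prompt, short completion
genai-perf profile \
  -m Qwen/Qwen3-0.6B \
  --endpoint-type chat \
  --synthetic-input-tokens-mean 1500 --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 150 --output-tokens-stddev 0 \
  --tokenizer Qwen/Qwen3-0.6B \
  --concurrency 32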


In this example, we will set up a single node that serves as both the inference and benchmarking host for experimentation purposes. In a production use case, the benchmarking tool should run from a separate node.


vLLM Benchmarking Setup

For decent benchmarking, you need the following to get started:


  • NVIDIA GPU–powered compute platform. This can be your desktop, or you can use any of the Neo Cloud providers. My obvious preference is Denvr Cloud. Feel free to sign up — https://www.denvr.com/

  • Hugging Face login. Sign up for a free Hugging Face account. You’ll need it to download models and access gated models such as Meta Llama and others.

  • LLM-labs repo. https://github.com/kchandan/llm-labs


Step-by-step guide


Prerequisites

  • Select the Virtual Machine



Select the region (Houston) and an H100 GPU; alternatively, you can choose an A100 (either the 80G or 40G option in our Calgary cluster).


  • Navigate to the GPU instances section.

  • Select an H100-based instance with at least 80GB GPU memory. Note that the dev instance (the client machine that executes the benchmark) doesn't need a full H100 node; a smaller A100-40G VM is a great starting point.



  • Select the option with NVIDIA drivers + Docker pre-installed.


Denvr Cloud VM option with pre-installed CUDA + Docker
  • Wait for the VM to launch (takes 5-7 minutes), then log in to the instance.




  • SSH into your instance:


ssh -i <your-ssh-key> ubuntu@<IP of your Denvr VM instance>

To install the necessary packages on the Linux VM (e.g., NVIDIA drivers, Docker, etc.), the easiest approach is to update the IP address in the Ansible inventory file and then let the playbook handle the full installation.


git clone https://github.com/kchandan/llm-labs.git
cat llm-labs/llmops/ansible/inventory/hosts.ini
; [vllm_server]
; server_name ansible_user=ubuntu
[llm_workers]
<IP Address> ansible_user=ubuntu ansible_ssh_private_key_file=~/.ssh/<your_key_file>
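Before running the playbook, a quick connectivity check with Ansible's built-in ping module can save a failed run. The inventory path below assumes you run it from the directory where you cloned the repo; adjust it if your layout differs.

# Verify Ansible can reach the worker defined in the inventory
ansible -i llm-labs/llmops/ansible/inventory/hosts.ini llm_workers -m ping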

Once the IP address is updated, run the Ansible playbook to install the required packages:


# Run from the llmops directory so the relative paths resolve
cd llm-labs/llmops
ansible-playbook -i ansible/inventory/hosts.ini ansible/setup_worker.yml

After installation, verify that the driver installation looks good. (As noted above, the client machine that runs the benchmark doesn't need a full H100 node; a smaller A100-40G VM is a great starting point.)
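A minimal sanity check, assuming the playbook installed the NVIDIA driver, the container toolkit, and Docker (the CUDA image tag below is just an example; any recent CUDA base image works):

# Driver and GPU visible on the host
nvidia-smi
# Confirm containers can see the GPU as well
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi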



Create a common Docker bridge network (default bridge driver) so that all containers can talk to each other:

docker network create llmops-net

Export your Hugging Face token:

export HF_TOKEN=<your_hf_token>
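Optionally, you can verify the token and pre-download the model with the Hugging Face CLI before starting the containers. This is just a sketch using recent versions of the CLI; vLLM will also pull the model on its own, so this step only validates the token and warms the local cache.

pip install -U "huggingface_hub[cli]"
# Store the token and confirm it is valid
huggingface-cli login --token "$HF_TOKEN"
# Pre-fetch the model used by the compose file below
huggingface-cli download Qwen/Qwen3-0.6B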

Now launch the vLLM Docker Compose stack; it will take some time to pull the images and load the model:

docker compose -f docker-compose-vllm-qwen3-0.6B.yml up -d
docker compose -f docker-compose.monitoring.yml up -d

Ignore the orphan-container warning. I have deliberately kept these two compose files separate so that more model-specific compose files can be added to the same repo later.


Once all containers are downloaded and running, docker ps should look like this (with no containers in a crash loop):


docker ps
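Beyond docker ps, you can hit the OpenAI-compatible API directly to confirm the model is loaded and serving. This assumes the compose file publishes vLLM on port 8000 (the vLLM default); adjust the port to match your setup.

# List the models the server is serving
curl -s http://localhost:8000/v1/models
# Send a tiny chat completion as a smoke test
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}'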


Now that the base vLLM inference setup is in place, the next step is to set up NVIDIA GenAI-Perf:


pip install genai-perf

Do a quick test run to see that everything is working. Here I am running a loop with different concurrency levels; in this case, concurrency represents the number of parallel requests.

# Sweep several concurrency levels (parallel in-flight requests) against the Qwen3-0.6B endpoint
for c in 64 96 128 192 256; do
  genai-perf profile \
    -m Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --synthetic-input-tokens-mean 200 --synthetic-input-tokens-stddev 0 \
    --output-tokens-mean 100 --output-tokens-stddev 0 \
    --request-count 400 \
    --warmup-request-count 10 \
    --tokenizer Qwen/Qwen3-0.6B \
    --concurrency $c
done

The output will look like this:


GenAI-Perf Output

If you are able to see these metrics from GenAI-Perf, it means your setup is complete.
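If you want to compare runs programmatically rather than eyeballing the console tables, GenAI-Perf also exports per-run artifacts to disk, by default under an artifacts/ directory. The exact layout and file names vary by version, so treat the snippet below as a sketch to adapt.

# Print the exported CSV results from each run in the concurrency sweep
find artifacts -name "*.csv" | while read -r f; do
  echo "== $f =="
  head -n 15 "$f"
done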

Now let’s move on to setting up the Grafana dashboard.


First, ensure that you have configured the Prometheus backend in Grafana. By default, it points to localhost, so we need to switch it to prometheus, matching the service name used in the Docker Compose file.


Prometheus Grafana Setup for vLLM Benchmark
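If you prefer to configure the data source from a file rather than the UI, Grafana's standard datasource provisioning format looks like the sketch below. The provisioning directory has to be mounted into the Grafana container (for example ./grafana/provisioning:/etc/grafana/provisioning); check how the monitoring compose file wires this up before relying on it, and note the file path here is an assumption.

# Sketch: provision the Prometheus data source via a file instead of the Grafana UI
mkdir -p grafana/provisioning/datasources
cat > grafana/provisioning/datasources/prometheus.yml <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
EOF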

As part of the Docker Compose setup, Grafana should automatically pick up the dashboard (NVIDIA + vLLM).


You should now be able to see the metrics flowing into the Grafana dashboard.


Grafana Dashboard - vLLM + DCGM
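If the panels stay empty, it helps to confirm that the exporters themselves are serving metrics before debugging Grafana. The ports below assume the vLLM default of 8000 and the DCGM exporter default of 9400; adjust them to whatever the compose files actually expose.

# vLLM exposes Prometheus metrics on its API port under /metrics
curl -s http://localhost:8000/metrics | grep -m 5 "^vllm"
# DCGM exporter metrics (GPU utilization, memory, power, ...)
curl -s http://localhost:9400/metrics | grep -m 5 "^DCGM"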

At this stage, we’ve put together a minimal “hello-world” foundation for LLM inference benchmarking. The real work begins next: running meaningful benchmarks, analyzing the results, and tuning vLLM and GenAI-Perf parameters to extract the best possible performance from the underlying hardware. In this example, the setup runs on a single A100-40GB GPU. While it may appear modest on paper, these GPUs are extremely capable and well suited for agentic workloads where smaller language models are invoked frequently and at scale.


In the upcoming posts, I’ll dive deeper into advanced benchmarking strategies, additional metrics and logging, and practical techniques to maximize GPU efficiency in production inference environments.


If you’re planning to run or scale AI inference workloads, sign up today to explore GPU-powered infrastructure designed for modern LLM platforms and start experimenting with your own inference stack.




