
Model Catalog
Deploy from a catalog of leading open-weight models, pre-configured with GPU sizing, context length, and batch settings tuned for performance. Or bring your own custom model.
Custom Models
Deploy any open-source or fine-tuned model using vLLM Server or Ollama Server applications. Full control over serving parameters including quantization, context length, and batch size.
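As a sketch of what that control looks like, here is a hypothetical vLLM launch command for a fine-tuned model. The model path and every flag value are illustrative, not recommendations: --quantization selects a weight-quantization scheme, --max-model-len caps context length, --max-num-seqs bounds concurrent batch size, and --tensor-parallel-size shards the model across GPUs.

```shell
# Illustrative only; tune flags to your model and hardware.
vllm serve ./my-finetuned-llama \
  --quantization awq \
  --max-model-len 16384 \
  --max-num-seqs 64 \
  --tensor-parallel-size 2
```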
Secure by Default
Every endpoint runs on dedicated hardware with end-to-end encryption and zero-trust authentication. Deploy to public or private IPs based on your security requirements.
No Token Limits
Private endpoints are charged by GPU-hour, not by token. No metering on usage, no surprise bills. Run as many tokens as your hardware can serve.
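A back-of-the-envelope sketch of what GPU-hour billing implies for effective token cost. The hourly rate and throughput below are hypothetical illustrations, not quoted prices; the point is that cost per token falls as you push more throughput through the same dedicated GPU.

```python
# Effective $/1M tokens under GPU-hour billing (hypothetical numbers).
def cost_per_million_tokens(gpu_hour_rate: float, tokens_per_second: float) -> float:
    """Effective $ per 1M tokens for a dedicated GPU billed by the hour."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_rate / tokens_per_hour * 1_000_000

# e.g. a hypothetical $2.50/hr GPU sustaining 1,500 tok/s across batched requests:
print(round(cost_per_million_tokens(2.50, 1500), 3))  # -> 0.463
```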
Sovereign Infrastructure
Built on infrastructure owned and operated by Denvr in Canadian and US data centers. No foreign jurisdiction exposure. No third-party dependencies.
OpenAI Compatible API
Drop-in compatibility with the OpenAI API specification. Swap your base URL and start running inference with zero code changes. No vendor lock-in, no proprietary SDKs.
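A minimal sketch of that drop-in compatibility using only the Python standard library. The base URL, API key, and model name are placeholders for your deployment's values; the request shape is the standard OpenAI chat-completions format.

```python
# Build an OpenAI-compatible chat request against a private endpoint.
# BASE_URL and the bearer token are placeholders; substitute your own.
import json
import urllib.request

BASE_URL = "https://your-endpoint.example.com"  # hypothetical endpoint

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Assemble a POST request in the OpenAI chat-completions shape."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer YOUR_API_KEY"},
        method="POST",
    )

req = build_chat_request(BASE_URL, "meta-llama/Llama-3.3-70B-Instruct", "Hello")
print(req.full_url)
# Once the endpoint is live, sending it is one more line:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

If you already use the official openai SDK, the same swap applies: pass your endpoint as base_url when constructing the client and leave the rest of your code untouched.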
Secure Model Endpoints
Select a model, choose your hardware, and deploy a private endpoint in minutes. Every endpoint runs on single-tenant infrastructure with dedicated GPUs, encrypted connections, and no shared resources.
From Setup To Live In Minutes
01
Select a Model
Choose from our catalog of leading open-weight foundation models, or bring your own custom model.
02
Choose Your Hardware
Select the GPU that fits your workload. NVIDIA H200/H100/A100, Intel Gaudi 2, and more. Scale from a single GPU to multi-GPU configurations.
03
Deploy
Launch your private endpoint in minutes. Your model, your hardware, your API endpoint. Ready for production.
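For step 02, a rough VRAM rule of thumb helps narrow the GPU choice. This sketch covers model weights only; real usage also depends on KV cache, context length, and batch size, so treat it as a lower bound.

```python
# Rough lower bound on VRAM needed for model weights alone.
def weight_vram_gb(params_billions: float, bits_per_param: int = 16) -> float:
    """Approximate GB of VRAM for weights: params (billions) x bytes per param."""
    return params_billions * bits_per_param / 8

# A 70B model in FP16 needs ~140 GB for weights alone (multi-GPU territory),
# while FP8 halves that to ~70 GB, within a single 80 GB H100:
print(weight_vram_gb(70))     # -> 140.0
print(weight_vram_gb(70, 8))  # -> 70.0
```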

Model Catalog
Launch production endpoints with the most capable open-weight models available. New models are added regularly.

Meta Llama 3.3
Optimized for multilingual dialogue use cases; outperforms many open-source and closed chat models on industry benchmarks.

DeepSeek R1
Reasoning model that uses reinforcement learning to improve problem-solving capabilities across mathematics, coding, and complex reasoning tasks.

OpenAI GPT-OSS
OpenAI's open-weight models for general-purpose natural language understanding and generation tasks.

Qwen3-Coder-Next
State-of-the-art coding agent with ultra-efficient inference using only 3B active parameters.

Gemma 3
State-of-the-art open models from Google, built from the same research and technology used to create the Gemini models.
Hardware Options
Choose the right GPU for your workload and performance requirements. Use multi-GPU for the largest models available.
NVIDIA H200
Optimized For
Extended context, large batch inference
VRAM
141 GB
Notes
Higher memory bandwidth for context-heavy workloads.
NVIDIA H100
Optimized For
Large model inference, high throughput
VRAM
80 GB
Notes
Best for 70B+ parameter models. Native support for FP8 precision.
Intel Gaudi 2
Optimized For
Cost-effective inference for open source models
VRAM
96 GB
Notes
Near H100 performance with FP8 inference.
NVIDIA A100
Optimized For
Cost-effective inference, fine-tuned models
VRAM
40 GB
Notes
Best TCO for small batch and private models.
NVIDIA A100 MIG
Optimized For
Small models up to 7B parameters
VRAM
20 GB
Notes
Smallest available unit based on GPU hardware partitions.
Need help selecting hardware? Our solutions engineers can recommend the optimal configuration for your model and workload profile.
View full pricing →








