Dual RTX 5090 in-depth experience: AI inference framework selection and performance-limit testing

An in-depth, hands-on look at a dual RTX 5090 workstation, exploring AI inference frameworks and performance limits.
Core content:
1. Overview of the dual RTX 5090 workstation configuration
2. AI inference framework selection: Ollama, SGLang, and vLLM compared
3. Performance measurements and compatibility analysis
Recently, I was lucky enough to get hold of a workstation PC equipped with dual NVIDIA GeForce RTX 5090 graphics cards, undoubtedly dream hardware for many developers chasing extreme performance. New hardware, however, often arrives with challenges, especially around software adaptation and performance.
We learned through sales channels that there is currently no blower-style (turbo) RTX 5090 on the domestic market; the cards we received are the axial-fan version. This means cooling and physical clearance deserve special attention in a multi-card build.
In this article, we share the hands-on process of configuring this dual 5090 machine for AI inference under Ubuntu, focusing on the questions you are probably most curious about: Which inference framework should you use? How does it perform? Do the rumored performance limitations exist? Consider this both an in-depth experience report and a set of real measurements.
1. Test Platform Overview
First, let me briefly introduce our test platform configuration:
- CPU: Intel Core i9-14900K
- CPU cooler: EA5SE360 water cooler
- Motherboard: ASUS PRO WS W680-ACE workstation motherboard
- Memory: Corsair DDR5-5200 32GB × 2
- SSD: Kingston NV3 2TB PCIe 4.0 M.2
- Power supply: Great Wall 2200W Gold-certified power supply
- Chassis: Customized 10-slot Jinhetian 9125B
- Graphics cards: NVIDIA GeForce RTX 5090 × 2 (axial-fan version)
System environment:
- Operating system: Ubuntu 22.04
- NVIDIA driver: 570.133.07
- CUDA version: 12.8
nvidia-smi confirms that both RTX 5090s are correctly identified and the CUDA 12.8 environment is ready.
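For reference, a quick way to double-check this from the terminal (the query fields below are standard nvidia-smi options, so this should work on any recent driver):
# List each GPU with its driver version and total video memory
nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv
Both cards should appear with their full 32GB of video memory and driver 570.133.07.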
2. Choice of AI inference framework: Ollama, SGLang, or vLLM?
With a new card in hand, you naturally want to run an AI model right away. But on brand-new hardware, the choice of inference framework is crucial: which frameworks work out of the box, and which require some hands-on effort to resolve compatibility problems?
2.1 Ollama: Ready to use, easy to configure
The good news is that Ollama already supports inference on the RTX 5090. The setup process is very simple, making it the first choice for users who want to get started quickly.
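For context, here is a minimal setup path; the install script below is Ollama's official Linux one-liner, and the model tag matches what we test later:
# Install Ollama via the official script, then pull and run a model
curl -fsSL https://ollama.com/install.sh | sh
ollama run deepseek-r1:32b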
2.2 SGLang: Not supported yet
According to our tests, SGLang currently supports CUDA versions only up to 12.4, so for now it cannot be used directly on the RTX 5090. We hope upcoming releases add support soon.
2.3 vLLM: requires “hands-on skills”
vLLM is a very popular inference framework, but its latest release at the time of writing (0.8.2) does not directly support the sm_120 compute capability of the RTX 5090.
- Installing vLLM with pip and then starting it produces errors like the following:
# Error indicating the shipped kernels do not cover this GPU:
RuntimeError: CUDA error: no kernel image is available for execution on the device
# Error when using PyTorch 2.6.0:
NVIDIA GeForce RTX 5090 with CUDA capability sm_120 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90.
If you want to use the NVIDIA GeForce RTX 5090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
Solution: you need to compile vLLM yourself. The prerequisites are a nightly build of PyTorch and CUDA 12.8, plus some experience with environment configuration and compilation. Our own attempt failed on a machine with a relatively low-spec configuration.
- Manual compilation reference: https://github.com/vllm-project/vllm/issues/14452
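For orientation, a source build roughly takes the shape sketched below. This is only an outline based on vLLM's general build-from-source flow; exact steps and file names vary by version, so treat the issue linked above as authoritative:
# Install a PyTorch nightly built against CUDA 12.8 (PyTorch's nightly wheel index)
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
# Build vLLM against the PyTorch that is already installed
git clone https://github.com/vllm-project/vllm.git
cd vllm
python use_existing_torch.py
pip install -r requirements/build.txt   # build deps; path may differ by version (older releases use requirements-build.txt)
pip install -e . --no-build-isolation
The build compiles CUDA kernels locally, so expect it to take a long time and plenty of RAM, which is likely why it failed on our lower-spec machine.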
Summary: for RTX 5090 users, the most convenient inference framework right now is Ollama. If you are a heavy vLLM user and have the ability and patience to compile it yourself, a source build can get you the best performance.
3. Preliminary Study on Ollama Inference Performance
Since Ollama has good support, we used it to load the deepseek-r1:32b model and conducted preliminary performance tests.
Start command:
ollama run deepseek-r1:32b --verbose
- Video memory usage: about 21.7 GB (single-card operation)
- Performance indicators:
  - Prompt eval rate: 45.78 tokens/s
  - Eval rate: 60.49 tokens/s
Based on this preliminary test, a single RTX 5090 running a 32B model under Ollama delivers satisfactory performance, and nvtop shows the card in a healthy working state.
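If you prefer to probe the server programmatically rather than through the interactive CLI, Ollama also exposes an OpenAI-compatible endpoint (the same one evalscope targets in the stress test below); a minimal request looks like this, with the prompt being an arbitrary example:
# Send one chat-completion request to Ollama's OpenAI-compatible API
curl http://127.0.0.1:11434/v1/chat/completions -H 'Content-Type: application/json' -d '{"model": "deepseek-r1:32b", "messages": [{"role": "user", "content": "Hello"}]}'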
4. Key verification: Does the RTX 5090 have performance limitations similar to the 5090D?
This is the core part of this in-depth experience. Previously, according to reports from channels such as the Chiphell forum, the RTX 5090D variant sold in the Chinese market was alleged to have certain limitations, such as:
- A 3-second detection mechanism: reportedly, if an AI inference or crypto-mining load is detected, compute performance is locked down.
- A power limit: power adjustment may be restricted, so the card cannot be fully utilized.
- An impact on multi-GPU parallelism: these limitations would particularly affect multi-card configurations that need high-performance parallel computing.
So, does this standard RTX 5090 (not the D version) exhibit any of these worrying behaviors when used in China? We designed the following stress test for rigorous, hands-on verification:
4.1 Verification Scheme
1. Use Ollama to load a large model for cross-card inference: load the deepseek-r1:70b model (requiring about 45GB of video memory) and run it across both RTX 5090s simultaneously to simulate a realistic heavy AI inference scenario.
# Make sure Ollama can use both cards (GPU 0 and 1)
CUDA_VISIBLE_DEVICES=0,1 ollama run deepseek-r1:70b --verbose
2. Combine this with stress-testing tools: while the Ollama large model is running, use evalscope (simulating high-concurrency inference requests) and gpu-burn (an extreme GPU compute load test) to push both 5090s to their limits, and watch for any sudden performance drop or lockdown. A quick check that the model really is split across both cards is sketched right after this list.
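As a sanity check (the query fields are standard nvidia-smi options), you can confirm that video memory is allocated on both GPUs once the 70B model is loaded:
# Per-GPU memory usage; both cards should show large allocations
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv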
4.2 Preparation
- Install evalscope:
pip install 'evalscope[app,perf]' -U -i https://mirrors.aliyun.com/pypi/simple/
- Install gpu-burn:
git clone https://github.com/wilicc/gpu-burn
cd gpu-burn
make
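Optionally, before the full run you can verify that gpu-burn sees both cards; the -l flag is listed in the wilicc/gpu-burn README as of this writing, but check ./gpu_burn -h if your copy differs:
# List the GPUs gpu-burn has detected (per the project README; -h shows all options)
./gpu_burn -l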
4.3 Start stress testing
- Start evalscope (simulating high concurrent inference load):
# Fire parallel inference requests at Ollama (which is simultaneously serving the 70b model)
evalscope perf --parallel 8 --url http://127.0.0.1:11434/v1/chat/completions --model deepseek-r1:32b --log-every-n-query 10 --connect-timeout 600 --read-timeout 600 --api openai --prompt 'Write a science fiction novel, no less than 2000 words' -n 20
- Enable gpu-burn (extreme computing load):
# Pin gpu-burn to CPU cores 0-1 and run for 360 seconds to load the GPU compute units
taskset -c 0-1 ./gpu_burn 360
- Monitoring: Use nvtop to observe GPU frequency, power consumption, temperature, and utilization in real time.
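If nvtop is unavailable, nvidia-smi's device-monitoring mode gives similar real-time coverage; the -s letters below select the standard power/temperature (p), utilization (u), and clock (c) metric groups:
# Stream per-GPU power, temperature, utilization, and clocks once per second
nvidia-smi dmon -s puc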
4.4 Stress Test Results and Conclusions
Under a 3-minute sustained, high-intensity stress test (combining Ollama 70b cross-card inference, evalscope concurrent requests, and gpu-burn extreme compute):
- GPU frequency: the core clocks of both RTX 5090s stayed above 2.5GHz (GPU0 around 2.55GHz, GPU1 around 2.72GHz), with no abnormal drops.
- GPU power draw: both cards held steady at around 575W.
- GPU utilization: utilization stayed around 99%, indicating that GPU resources were fully used.
- Video memory usage: video memory was heavily occupied (about 31GB per card), as expected when running a 70B model.
- The nvtop monitoring screenshots clearly show both cards running at full power under high load, with frequency and power draw held at very high levels and no sign of performance throttling.
5. Summary
This in-depth experience with dual RTX 5090s gave us a lot of valuable information. For anyone planning to use dual RTX 5090s for AI work in a Linux environment, as of now:
- The hardware itself is extremely capable, with great potential for dual-card parallel operation, but pay close attention to the cooling solution and power-supply sizing to ensure stable operation.
- Among inference frameworks, Ollama is currently the most convenient, out-of-the-box option, well suited to getting started quickly and experimenting.
- Note that SGLang and vLLM, the frameworks widely used in high-concurrency, low-latency production environments, currently lag significantly in official RTX 5090 support: SGLang is unavailable due to its CUDA version ceiling, and vLLM requires manual compilation in a specific environment, with stability still to be verified. If your production stack depends heavily on SGLang or vLLM, putting the RTX 5090 straight into production may mean waiting for these frameworks to catch up officially.
- Most importantly, in our stress tests this standard RTX 5090 showed none of the limitations attributed to the 5090D (such as a "3-second compute lockdown" or a power cap), and it delivered its full performance stably under high-load, multi-card parallel scenarios.
- Of course, drivers and software libraries are updated constantly, and the situation may change, especially regarding vLLM and SGLang support. But at least in the current environment (driver 570.133.07, CUDA 12.8, Ollama), the RTX 5090 demonstrated its full hardware potential without any extra performance constraints.
We hope this in-depth experience and hands-on testing of dual RTX 5090s, especially the verification of framework choices and performance limitations, provides a useful reference for your own decisions.