Local personal deployment of DeepSeek: VRAM formula and GPU recommendations

A guide to deploying DeepSeek locally for personal use, covering VRAM estimation and GPU selection.
Core topics:
1. How VRAM requirements are calculated and how they scale with parameter count
2. Comparison table of model size and recommended GPUs
3. Optimization strategies, cost-effectiveness, and future deployment suggestions
1. How VRAM requirements are calculated
Relationship between parameter count and VRAM
A model's VRAM usage has three main components:
- Model parameters: each parameter occupies 2 bytes at FP16 precision and 1 byte at INT8 precision
- Inference cache: intermediate tensors such as activations and attention matrices
- System overhead: additional consumption such as CUDA context and framework memory management
Basic formula:

VRAM (GB) ≈ parameter count (billions) × precision coefficient × safety factor

where:
- Precision coefficient: 2 for FP16, 1 for INT8, 0.5 for 4-bit quantization
- Safety factor: 1.2-1.5 recommended (headroom for cache and system overhead)
Worked example, using the DeepSeek-7B model:
- FP16: 7B × 2 × 1.3 = 18.2 GB
- 8-bit quantization: 7B × 1 × 1.3 = 9.1 GB
- 4-bit quantization: 7B × 0.5 × 1.3 = 4.55 GB
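The formula above can be sketched as a small helper. A minimal sketch, using the precision coefficients and the 1.3 safety factor from the worked example:

```python
# Precision coefficient: bytes per parameter at each precision (from the list above).
PRECISION_BYTES = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def estimate_vram_gb(params_billion: float, precision: str, safety: float = 1.3) -> float:
    """Rough inference VRAM requirement in GB: params x bytes/param x safety factor."""
    return params_billion * PRECISION_BYTES[precision] * safety

# DeepSeek-7B at the three precisions from the example above:
for precision in PRECISION_BYTES:
    print(precision, round(estimate_vram_gb(7, precision), 2), "GB")
```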
2. Model size and GPU recommendations
1. Quantization optimization

| Quantization step | VRAM compression ratio | Performance loss |
| --- | --- | --- |
| FP32 → FP16 | 50% | <1% |
| FP16 → INT8 | 50% | 3-5% |
| INT8 → INT4 | 50% | 8-12% |
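The 50% halving at each step in the table compounds. A short sketch of the cumulative effect on weight memory, illustrated with a hypothetical 7B model:

```python
def quantization_ladder(params_billion: float) -> dict:
    """Approximate weight memory (GB) at each precision, starting from FP32 (4 bytes/param)."""
    sizes = {}
    bytes_per_param = 4.0  # FP32
    for precision in ("FP32", "FP16", "INT8", "INT4"):
        sizes[precision] = params_billion * bytes_per_param
        bytes_per_param /= 2  # each step in the table is a 50% compression
    return sizes

print(quantization_ladder(7))  # FP32: 28.0 GB down to INT4: 3.5 GB
```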
2. Framework-level optimization
- vLLM: uses PagedAttention to reduce KV cache fragmentation, cutting the memory usage of a 32B model by about 40%
- Ollama + IPEX-LLM: runs 7B models on Intel Arc graphics with CPU co-acceleration
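PagedAttention matters because the KV cache grows linearly with context length. A rough size estimate, assuming a hypothetical 7B-class configuration (32 layers, 32 KV heads, head dimension 128, FP16):

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> int:
    """KV cache size: 2 tensors (K and V) per layer, per head, per token, per batch item."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed 7B-class config at a 4096-token context, FP16 (2 bytes per element):
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(size / 2**30, "GiB")  # 2.0 GiB on top of the model weights
```

The configuration values are illustrative, not DeepSeek's actual architecture; the point is that at long contexts the KV cache alone can consume gigabytes, which is the fragmentation problem PagedAttention targets.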
3. Hardware purchasing suggestions
Prioritize cost-effectiveness:
- VRAM capacity > raw compute (compute cannot be fully utilized when VRAM is insufficient)
- Choose a card that supports Resizable BAR (claimed to improve multi-card communication efficiency by 30%)
- Prioritize energy efficiency (e.g. the RTX 4090's TOPS/watt is 58% higher than the 3090's)
- Model lightweighting: with an MoE architecture and dynamic routing, a 670B-class model can be compressed to run within the 24 GB VRAM of a single card
- Hardware democratization: Intel integrated graphics already run 7B models via IPEX-LLM, and XeSS technology may enable consumer-level 32B deployment in the future
- Short term: reserve headroom per the VRAM formula × 1.2 and choose a card that supports quantization (such as the RTX 4060 Ti 16GB)
- Long term: watch 4-bit quantization support on the Blackwell architecture (RTX 50 series), which is expected to enable single-card 70B deployment by the end of 2025
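The short-term advice (apply the VRAM formula, then pick a quantization level that fits) can be combined into a small selector. A sketch, reusing the precision coefficients and 1.3 safety factor from section 1:

```python
# Highest-quality precision first; coefficients from the VRAM formula in section 1.
PRECISIONS = [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]

def best_precision(params_billion: float, vram_gb: float, safety: float = 1.3):
    """Return the highest-quality precision whose estimated VRAM fits the budget,
    or None if even 4-bit quantization does not fit on a single card."""
    for name, bytes_per_param in PRECISIONS:
        if params_billion * bytes_per_param * safety <= vram_gb:
            return name
    return None

print(best_precision(7, 16))   # 7B on an RTX 4060 Ti 16GB -> INT8
print(best_precision(70, 24))  # 70B on a single 24GB card -> None (needs multi-GPU or offload)
```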