Local deployment and performance of 8 large models including DeepSeek

Written by
Iris Vance
Updated on: July 15th, 2025
Recommendation

Explore the practice and results of deploying large models locally, and gain an in-depth understanding of how Ollama's distilled models perform in single-cell annotation.

Core content:
1. Local deployment of large models when online access is restricted
2. How to download and load Ollama's distilled models and test them on single-cell annotation
3. Performance comparison and memory requirements of locally deployed models


01

Background

    
Due to heavy traffic, the full version of DeepSeek R1 (deep-thinking mode) is often unavailable.
The deepseek-reasoner mode of the paid API is also restricted.
OpenAI offers a chat interface that does not require registration, but regional restrictions on usage remain.
There are several ways to deploy large models locally, such as Ollama and vLLM. This article focuses on the deployment and performance of Ollama's distilled models.

02

Purpose


Testing the performance of locally deployed LLMs in single-cell annotation analysis

03

Method


1. Download Ollama. It can be downloaded directly from the official website, which is very convenient.
2. Load the model.
ollama run deepseek-r1:7b
To match users' hardware, Ollama provides distilled models in different sizes. The 7b distilled model suits most personal computers (16 GB of memory); here, 7b means 7 billion parameters. The size of a model file is determined mainly by the number of parameters and the numeric precision, and the more parameters and the higher the precision, the greater the demands on the hardware. For ease of comparison, the locally deployed Ollama models in this article all use 7b-9b parameters at 4-bit precision (example commands for pulling and inspecting these models are shown after this step list).
3. Test the performance of the localized distillation model on single-cell annotation.
# Run before calling the local model: ollama serve
git clone https://github.com/Zhihao-Huang/scPioneer
cd scPioneer
Rscript ./result/annotation_locally_test.R
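As a side note to step 2, the model tags compared in this article can be pulled ahead of time and inspected from the command line; ollama list reports each downloaded model's on-disk size, which reflects its parameter count and quantization. This is only a convenience sketch and assumes the tags are available in the Ollama library:

# Pull the distilled models discussed in this article
ollama pull deepseek-r1:7b
ollama pull llama3.1:8b
# List downloaded models together with their on-disk sizes
ollama list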
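For reference, and independent of how the scPioneer script itself calls the model, a locally served model can also be queried directly through Ollama's built-in HTTP API (on port 11434 by default). The prompt below is only a hypothetical illustration:

# Send a single, non-streaming request to the local Ollama server
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:7b",
  "prompt": "Which cell type do these marker genes suggest: CD3D, CD3E, CD2?",
  "stream": false
}'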

04

Results


Results of the full-scale large models accessed via the API:
Results of the locally deployed distilled models:

05

Summary


1. The accuracy of the locally deployed DeepSeek R1 is far lower than that of the full version of DeepSeek. The API-based DeepSeek V3 and DeepSeek R1 both perform quite well.
2. Among the locally deployed models, llama3.1:8b achieves the highest accuracy; the two distilled versions of deepseek-r1, 70b and 7b, perform poorly.
3. The 7b, 4-bit local model requires 5 GB of memory. On a Xeon(R) Gold 6238R CPU @ 2.20GHz with 50 logical cores, a run takes about 1 minute. For a personal computer, a model of around 7b parameters is recommended.
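As a rough sanity check on the 5 GB figure (an estimate, not a measurement from this test): 7 billion parameters stored at 4 bits each come to roughly 7e9 x 0.5 bytes, or about 3.5 GB for the weights alone, with the remainder going to the context (KV) cache, activations, and runtime overhead.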