Local deployment and performance of 8 large models including DeepSeek

Written by
Iris Vance
Updated on: July 15th, 2025
Recommendation

Explore the practice and results of deploying large models locally, and gain an in-depth understanding of how Ollama's distilled models perform in single-cell annotation.

Core content:
1. Local deployment of large models when online access is restricted
2. How to download and load Ollama's distilled models and test them on single-cell annotation
3. Performance comparison and memory requirements of locally deployed models


01

Background

    
Due to heavy traffic, the full version of DeepSeek R1 (deep-thinking mode) is often unavailable.
The deepseek-reasoner mode of the paid API is also restricted.
OpenAI offers a chat interface that does not require registration, but regional restrictions on usage remain.
There are several ways to deploy large models locally, such as Ollama and vLLM. This article focuses on the deployment and performance of Ollama's distilled models.

02

Purpose


Testing the performance of locally deployed LLMs in single-cell annotation analysis

03

Method


1. Download Ollama. It can be downloaded directly from the official website, which is very convenient.
2. Load the model.
ollama run deepseek-r1:7b
To match users' hardware, Ollama provides distilled models in different sizes. The 7b distilled model suits most personal computers (16 GB of memory); here, 7b means 7 billion parameters. The size of a model file is determined mainly by the number of parameters and the numeric precision, and the more parameters and the higher the precision, the greater the demands on the hardware. For ease of comparison, the locally deployed Ollama models in this article all use 7b-9b parameters at 4-bit precision (example commands for pulling and inspecting these models are shown after this step list).
3. Test the performance of the localized distillation model on single-cell annotation.
# Run before calling the local model: ollama serve
git clone https://github.com/Zhihao-Huang/scPioneer
cd scPioneer
Rscript ./result/annotation_locally_test.R
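As a side note to step 2, the model tags compared in this article can be pulled ahead of time and inspected from the command line; ollama list reports each downloaded model's on-disk size, which reflects its parameter count and quantization. This is only a convenience sketch and assumes the tags are available in the Ollama library:

# Pull the distilled models discussed in this article
ollama pull deepseek-r1:7b
ollama pull llama3.1:8b
# List downloaded models together with their on-disk sizes
ollama list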
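For reference, and independent of how the scPioneer script itself calls the model, a locally served model can also be queried directly through Ollama's built-in HTTP API (on port 11434 by default). The prompt below is only a hypothetical illustration:

# Send a single, non-streaming request to the local Ollama server
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:7b",
  "prompt": "Which cell type do these marker genes suggest: CD3D, CD3E, CD2?",
  "stream": false
}'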

04

Results


Results of the full-scale large models accessed via the API:
Results of the locally deployed distilled models:

05

Summary


1. The accuracy of the locally deployed DeepSeek R1 is far lower than that of the full version of DeepSeek. The API-based DeepSeek V3 and DeepSeek R1 both perform quite well.
2. Among the locally deployed models, llama3.1:8b achieves the highest accuracy; the two distilled versions of deepseek-r1, 70b and 7b, perform poorly.
3. The 7b, 4-bit local model requires 5 GB of memory. On a Xeon(R) Gold 6238R CPU @ 2.20GHz with 50 logical cores, a run takes about 1 minute. For a personal computer, a model of around 7b parameters is recommended.
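As a rough sanity check on the 5 GB figure (an estimate, not a measurement from this test): 7 billion parameters stored at 4 bits each come to roughly 7e9 x 0.5 bytes, or about 3.5 GB for the weights alone, with the remainder going to the context (KV) cache, activations, and runtime overhead.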