Guide to Locally Deploying Large Language Models

Written by Silas Grey
Updated on: July 1, 2025
Recommendation

Master the practical side of deploying large language models locally.

Core content:
1. Local deployment steps for the Ollama + QwQ-32B model
2. Balancing quantization against model performance
3. Security hardening and configuration optimization

Yang Fangxian
Founder of 53AI / Tencent Cloud Most Valuable Expert (TVP)

Here, Ollama + QwQ-32B is chosen as the basis for running the local large model. If you have no concurrency requirement, LM Studio is an alternative: it supports the MLX framework for Apple M-series chips and generates tokens roughly 50% faster than Ollama, but it does not support concurrent requests.

This article uses macOS deployment of Ollama + QwQ-32B as an example:


1. Install Ollama

1. Download and install from the official website

Visit the Ollama official website and download the macOS installer. During installation, drag the application into the "Applications" folder and enter the system password to complete the installation.

2. Verify the installation

Open the terminal and enter the following command. If the version number (such as 0.6.3) is displayed, the installation is successful:

ollama --version
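For reference, the output typically looks like the following line; the exact version number will vary with your installation:

ollama version is 0.6.3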




2. Locally run the QwQ-32B model

1. Download the model

Enter the following command in the terminal. The model file is about 19 GB, so wait for the download to complete:

ollama run qwq
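If you prefer to download the model first and start chatting later, a pull-then-check sequence with the same model tag looks like this:

ollama pull qwq
# Download the model without opening an interactive session
ollama list
# Confirm that qwq and its size appear in the local model list
ollama run qwq
# Start the interactive session once the download is complete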

Technical description:

The default download is the 4-bit quantized (Q4) version. Quantization is, simply put, a technique that converts high-precision model parameters to lower precision (for example, from 16-bit floating point to 4-bit integer) to reduce compute and memory consumption while largely preserving model quality; the performance loss is roughly 10%. In actual tests, switching to Q6 quantization increases memory usage significantly.
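As a rough back-of-the-envelope estimate (weights only, ignoring the KV cache and runtime overhead), the memory footprint scales with parameter count times bits per parameter:

# FP16: 32e9 params x 2 bytes    ≈ 64 GB
# Q6:   32e9 params x 0.75 bytes ≈ 24 GB
# Q4:   32e9 params x 0.5 bytes  ≈ 16 GB
# The actual ~19 GB download is slightly larger than the pure 4-bit figure because
# mixed Q4 quantization schemes average somewhat more than 4 bits per weight.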

2. Verification and interaction

After the download is complete, the terminal enters interactive mode (showing a >>> prompt), and you can type text directly to test the model's response:
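An illustrative exchange (the prompt is just an example; type /bye to leave the interactive mode):

>>> Briefly introduce yourself.
(the model streams its answer here)
>>> /bye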



3. Expanding the Ollama Context

1. Configuration procedure

echo 'export OLLAMA_CONTEXT_LENGTH=16384' >> ~/.zshrc
# Permanent configuration (written to the shell configuration file)
source ~/.zshrc
# Reload the configuration in the current terminal
ollama serve
# Restart the Ollama service so the new context length takes effect
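Note that an export added to ~/.zshrc only affects an Ollama server started from that terminal. If you run Ollama through the macOS menu-bar app instead, a commonly used alternative (assumed here to apply to OLLAMA_CONTEXT_LENGTH the same way it does to other Ollama environment variables) is to set the variable with launchctl and relaunch the app:

launchctl setenv OLLAMA_CONTEXT_LENGTH 16384
# Make the variable visible to GUI-launched apps, then quit and reopen Ollama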

2. Verifying the configuration

echo $OLLAMA_CONTEXT_LENGTH
# Check whether the environment variable is in effect (it must have been set beforehand)
# An empty result means it is unset and Ollama falls back to the default of 2048
# Example output after a successful setting: 16384

3. Notes

  • Environment variables take precedence over the model's default configuration

  • When both a Modelfile's num_ctx and the environment variable are set, the environment variable takes precedence (a per-request alternative via the API is sketched after this list)

  • Extending the context window significantly increases memory usage
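For a single request, the context window can also be set through Ollama's REST API with the num_ctx option, leaving the global configuration untouched. A minimal sketch (the prompt text is just a placeholder):

curl http://127.0.0.1:11434/api/generate -d '{
  "model": "qwq",
  "prompt": "Summarize the following document: ...",
  "options": { "num_ctx": 16384 }
}'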


4. Calling the Local Model from Dify

Configuration path:

1. Dify → Plugins → install Ollama

2. Plugin Settings → Model Provider → Add Model (typical values are sketched below)
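The exact form fields depend on your Dify version. As a hedged sketch, assuming Dify runs in Docker on the same Mac as Ollama, the typical values are listed below; if Dify runs directly on the host, http://127.0.0.1:11434 is the usual base URL. You can confirm the endpoint and the model name beforehand with curl:

curl http://127.0.0.1:11434/api/tags
# Lists the locally available models; "qwq" should appear in the output

# Typical values for the Ollama provider form in Dify (field names may vary by version):
#   Model Type: LLM
#   Model Name: qwq
#   Base URL:   http://host.docker.internal:11434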



5. Security Issues

Risk warning: Ollama exposes port 11434 by default with no authentication. If the service is reachable from the network, attackers can access it directly to steal data or perform malicious operations.

Protection suggestions:

  1. Restrict the port's access range in the configuration (for example, bind to 127.0.0.1; see the sketch after this list)

  2. Enable API-key or IP-whitelist authentication (typically via a reverse proxy in front of Ollama)

  3. Update to a patched version promptly (such as 0.1.47+)
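A minimal sketch of suggestion 1 on macOS (OLLAMA_HOST controls the bind address; 127.0.0.1 is also Ollama's default, so this mainly guards against it being overridden; <your-mac-ip> is a placeholder):

launchctl setenv OLLAMA_HOST 127.0.0.1:11434
# For the menu-bar app; quit and reopen Ollama afterwards
export OLLAMA_HOST=127.0.0.1:11434
ollama serve
# For a server started from the terminal

# From another machine the API should NOT be reachable:
curl --connect-timeout 3 http://<your-mac-ip>:11434/api/tags
# Expect a connection failure or timeout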