Guide to Locally Deploying Large Language Models

A hands-on, practical guide to deploying large language models locally.
Core content:
1. Local deployment steps for Ollama + QwQ-32B
2. Balancing quantization against model performance
3. Security hardening and configuration optimization
Here, Ollama + QwQ-32B is chosen as the basis for running a local large model. If you have no concurrency requirement, LM Studio is an alternative: it supports Apple's MLX framework on M-series chips and generates tokens roughly 50% faster than Ollama, but it does not support concurrent requests.
This article uses deploying Ollama + QwQ-32B on macOS as the example:
1. Install Ollama
1. Download and install from the official website
Visit the Ollama official website and download the macOS installation package. During installation, drag the application into the "Applications" folder and enter your system password to complete the setup.
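Alternatively, if you use Homebrew (an assumption about your setup), the command-line package can be installed directly:
# Alternative install via Homebrew (assumes Homebrew is already installed)
brew install ollama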
2. Verify the installation
Open the terminal and enter the following command. If a version number (such as 0.6.3) is displayed, the installation succeeded:
ollama --version
2. Locally run the QwQ-32B model
1. Download the model
Enter the following command in the terminal. The model file is about 19 GB, so wait for the download to complete:
ollama run qwq
Technical description:
The default download is the Q4 quantized version. Quantization converts high-precision model parameters to lower precision (for example, 16-bit floating point to 4-bit integers) to reduce compute and memory consumption while largely preserving model quality; the quality loss is roughly 10%. In actual tests, switching to Q6 quantization increases memory usage significantly.
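To confirm what was downloaded, or to inspect the quantization in use, the standard Ollama CLI commands below help; the quantization tag mentioned in the comment is only illustrative, so check the tags actually published on the Ollama model page before pulling one.
ollama list        # list locally downloaded models and their sizes
ollama show qwq    # inspect model details (parameters, context length, quantization)
# To pull a different quantization, use its tag from the Ollama library page,
# e.g. ollama pull qwq:<tag> (the exact tag name depends on what is published)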
2. Verification and interaction
After the download completes, the terminal enters interactive mode (a >>> prompt is shown), and you can type text directly to test the model's responses (type /bye to exit).
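Besides the interactive prompt, the running model can also be queried through Ollama's local REST API; the prompt text here is only an example.
curl http://localhost:11434/api/generate -d '{
  "model": "qwq",
  "prompt": "Briefly introduce yourself.",
  "stream": false
}'
# The JSON response contains the model's answer in the "response" field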
3. Expanding the Ollama Context Length
1. Configuration procedure
# Permanent configuration (write to the shell configuration file)
echo 'export OLLAMA_CONTEXT_LENGTH=16384' >> ~/.zshrc
# Reload the shell configuration, then start the Ollama service so it picks up the variable
source ~/.zshrc
ollama serve
2. Verifying the configuration
echo $OLLAMA_CONTEXT_LENGTH
# Check whether the environment variable is in effect
# An empty result means it is not set, and the default value 2048 is used
# Example output when the setting succeeded: 16384
3. Notes
The environment variable takes precedence over the model's default configuration: when both a Modelfile's num_ctx and OLLAMA_CONTEXT_LENGTH are set, the environment variable wins (a Modelfile example follows these notes).
Extending the context significantly increases memory usage.
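For reference, a larger context can also be baked into a model variant through a Modelfile; this is a minimal sketch using standard Modelfile directives, with the variant name qwq-16k chosen purely for illustration.
# Create a qwq variant with a larger context window (illustrative)
cat > Modelfile <<'EOF'
FROM qwq
PARAMETER num_ctx 16384
EOF
ollama create qwq-16k -f Modelfile
ollama run qwq-16k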
4. Calling the Local Model from Dify
Setup path:
1. In Dify, install the Ollama plugin
2. Plugin Settings - Model Provider - Add Model (fill in the model name, e.g. qwq, and the Ollama base URL; see the check below)
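If Dify runs in Docker on the same machine (an assumption about your deployment), the Ollama base URL is usually http://host.docker.internal:11434 rather than http://localhost:11434, because localhost inside the container refers to the container itself. A quick reachability check:
# Run from inside the Dify container (assumes a Docker deployment on the same host)
curl http://host.docker.internal:11434/api/tags
# A JSON list of local models means Dify can reach the Ollama service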
5. Security Issues
Risk warning: Ollama listens on port 11434 by default without any authentication. If the port is exposed to the network, an attacker can access the service directly to steal data or perform malicious operations.
Protection suggestions :
Restrict the port's access range in the configuration (for example, bind to 127.0.0.1), as sketched below
Add API-key or IP-whitelist authentication in front of the service (for example, via a reverse proxy, since Ollama has no built-in authentication)
Update to a patched version promptly (such as 0.1.47+)
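A minimal sketch of the first suggestion, using the standard OLLAMA_HOST variable to bind the service to the loopback interface (the port shown is the default):
# Bind Ollama to the loopback interface so it is unreachable from other hosts
export OLLAMA_HOST=127.0.0.1:11434
ollama serve
# If remote access is required, put an authenticating reverse proxy (e.g. nginx)
# in front of the port instead of exposing 11434 directly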