A complete guide to Ollama environment variable configuration: from basic settings to scenario-based tuning

Master Ollama's environment variable configuration, optimize model performance, and develop more efficiently.
Core content:
1. Cross-platform environment variable configuration guide: setup methods for Linux/macOS and Windows
2. Docker container deployment tips and dynamic runtime configuration
3. Strategies for efficient GPU utilization: tuning for both ample and limited VRAM
In local deployment and performance tuning of Ollama, environment variables act as the "nerve center". By configuring these parameters flexibly, developers can fine-tune the model's runtime behavior and adapt to scenarios ranging from single-machine development to distributed clusters. Drawing on hands-on experience, this article presents a systematic set of environment variable configurations to help you get the most out of Ollama.
1. Cross-platform environment variable configuration guide
(I) Linux/macOS configuration
1. Temporary setting (current session only)
```bash
# Quick start with a custom configuration
export OLLAMA_PORT=12345                  # Custom service port (avoids conflicts with the default port)
export OLLAMA_MODEL_DIR=./custom-models   # Dedicated model storage path
ollama serve --listen :$OLLAMA_PORT       # Start the service with the environment variables applied
```
2. Permanent setting (global configuration)
Edit the configuration file for your shell (zsh shown here):
```bash
echo 'export OLLAMA_NUM_GPUS=1' >> ~/.zshrc
echo 'export OLLAMA_CACHE_DIR="/data/ollama-cache"' >> ~/.zshrc
source ~/.zshrc   # Apply the changes immediately
```
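Before starting the server, it can help to confirm which OLLAMA_* variables the shell actually exports. A quick check using standard shell tooling, nothing Ollama-specific:

```bash
# List every OLLAMA_* variable visible to the current shell before launching the server
env | grep '^OLLAMA_' || echo "No OLLAMA_* variables set"
ollama serve   # the server inherits whatever is exported above
```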
(II) Windows graphical configuration steps
Open "Control Panel" → "System" → "Advanced system settings"
Add a new system variable in "Environment Variables":
Variable Name:
OLLAMA_MODEL_DIR
Variable value:
C:\ollama\models
(It is recommended to use English absolute path)Verify the configuration using the command line:
echo $env:OLLAMA_MODEL_DIR # Check whether the custom path is read correctly
(III) Docker container deployment tips
```dockerfile
# Dockerfile configuration example
FROM ollama/ollama:latest

# Lock memory to improve inference speed
ENV OLLAMA_PORT=11434 \
    OLLAMA_USE_MLOCK=1

# Persist model files
VOLUME /ollama/models
```
Inject configuration dynamically at runtime:
```bash
docker run -d \
  -p 11434:11434 \
  -v $(pwd)/models:/ollama/models \
  -e OLLAMA_GPU_LAYERS=32 \
  ollama/ollama:latest   # OLLAMA_GPU_LAYERS specifies the number of GPU-offloaded layers
```
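To confirm that the container actually picked up the injected variables, a quick check from the host; this sketch uses standard docker and curl commands and assumes only one container based on this image is running:

```bash
# Look up the container started above and inspect its environment
cid=$(docker ps -q --filter ancestor=ollama/ollama:latest)
docker exec "$cid" env | grep '^OLLAMA_'     # confirm the injected variables
curl -s http://localhost:11434/api/tags      # the API should answer once the server is up
```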
2. Efficient GPU resource utilization strategies
(I) Sufficient VRAM (≥16 GB)
```bash
# Full GPU computation + memory optimization
export OLLAMA_ENABLE_CUDA=1   # Force CUDA acceleration
export OLLAMA_GPU_LAYERS=40   # Load 40 layers of core parameters onto the GPU
export OLLAMA_USE_MLOCK=1     # Prevent model data from being swapped to disk
```
Monitoring: use nvidia-smi to check VRAM usage in real time and make sure GPU-Util stays above 80%.
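To watch utilization continuously instead of sampling it by hand, nvidia-smi can poll on a fixed interval; a minimal sketch:

```bash
# Refresh GPU utilization and memory figures every 5 seconds while the model serves requests
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 5
```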
(II) Limited VRAM (8 GB or less)
```bash
# Layered computation + VRAM quota management
export OLLAMA_GPU_LAYERS=20        # 20 layers run on the GPU, the rest on the CPU
export OLLAMA_MAX_GPU_MEMORY=6GB   # Cap VRAM usage at 6 GB
export OLLAMA_ENABLE_CUDA=1        # Keep basic CUDA acceleration
```
Best practice: pair this with nvtop to monitor VRAM fluctuations in real time and avoid triggering OOM (out-of-memory) errors.
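When usable VRAM varies from machine to machine, the layer count can be derived rather than hard-coded. A rough sizing sketch; the 250 MiB-per-layer figure is an assumed ballpark for a 7B-class model, not a measured value:

```bash
# Read free VRAM and derive a conservative OLLAMA_GPU_LAYERS value (heuristic, not a rule)
free_mib=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n1)
layers=$(( free_mib / 250 ))                                # assumed ~250 MiB per offloaded layer
export OLLAMA_GPU_LAYERS=$(( layers > 40 ? 40 : layers ))   # cap at the full-offload value used above
echo "Free VRAM: ${free_mib} MiB -> offloading ${OLLAMA_GPU_LAYERS} layers"
```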
3. Concurrency performance optimization
1. High-concurrency API service configuration
```bash
# Build a high-performance service cluster
export OLLAMA_MAX_WORKERS=8            # 8 concurrent worker processes handle requests
export OLLAMA_NUM_THREADS=16           # 16 threads per process for parallel computation
export OLLAMA_CACHE_SIZE=8GB           # Cache results for frequently accessed models
export OLLAMA_KEEP_ALIVE_TIMEOUT=60s   # Keep connections alive for 60 s to cut handshake overhead
```
Expected impact: QPS (queries per second) can rise by roughly 30%-50%, which suits high-traffic scenarios such as e-commerce customer service and intelligent Q&A.
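A quick way to sanity-check throughput after changing these values is to fire a batch of concurrent requests and time them. A minimal smoke-test sketch against the standard /api/generate endpoint; the model name llama3 is an assumption and should be replaced with a model you have already pulled:

```bash
# Fire N concurrent generate requests and report the wall-clock time
N=20
start=$(date +%s)
for i in $(seq 1 "$N"); do
  curl -s http://localhost:11434/api/generate \
    -d '{"model":"llama3","prompt":"ping","stream":false}' > /dev/null &
done
wait   # block until every background request has finished
end=$(date +%s)
echo "Handled $N requests in $((end - start))s"
```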
2. Lightweight deployment configuration (laptop/edge device)
```bash
# Optimization for resource-constrained environments
export OLLAMA_MAX_WORKERS=2    # Limit concurrent worker processes to avoid CPU overload
export OLLAMA_NUM_THREADS=4    # Match the core count of a low-power CPU
export OLLAMA_CACHE_SIZE=2GB   # Keep memory usage within a reasonable range
```
Applicable scenarios: lightweight workloads such as local knowledge-base queries and single-user coding assistance.
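On small machines the thread count can also be derived from the hardware instead of being fixed. A sketch using the standard nproc utility; halving the core count is an arbitrary safety margin, not an Ollama recommendation:

```bash
# Size OLLAMA_NUM_THREADS from the detected core count, leaving headroom for the OS
cores=$(nproc)
export OLLAMA_NUM_THREADS=$(( cores > 2 ? cores / 2 : 1 ))
echo "Detected ${cores} cores -> OLLAMA_NUM_THREADS=${OLLAMA_NUM_THREADS}"
```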
4. Hardening the production environment
1. API access control
```bash
# Basic authentication + HTTPS encryption
export OLLAMA_AUTH_TOKEN="$(openssl rand -hex 32)"         # Generate a random 32-byte (64 hex character) token
export OLLAMA_ALLOW_ORIGINS="https://api.yourdomain.com"   # Restrict allowed cross-origin request sources
export OLLAMA_ENABLE_TLS=1                                 # Enable TLS 1.3 encrypted communication
export OLLAMA_TLS_CERT_FILE="/ssl/cert.pem"                # Certificate file path
```
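Assuming the token above is actually enforced by the deployment (for example by a reverse proxy sitting in front of Ollama), a client would attach it as a standard Bearer header. A sketch, with the domain and model name as placeholders:

```bash
# Hypothetical authenticated request; the Authorization header follows common Bearer-token
# practice and is not something Ollama itself mandates
curl -s https://api.yourdomain.com/api/generate \
  -H "Authorization: Bearer $OLLAMA_AUTH_TOKEN" \
  -d '{"model":"llama3","prompt":"hello","stream":false}'
```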
2. Model protection
```bash
# Prevent model tampering and malicious pulls
export OLLAMA_DISABLE_REMOTE_PULL=1   # Disable remote model downloads
export OLLAMA_READ_ONLY=1             # Read-only mode to protect local models
export OLLAMA_ENABLE_SANDBOX=1        # Enable containerized sandbox isolation
```
3. Security monitoring configuration
```bash
# Log auditing and request throttling
export OLLAMA_LOG_LEVEL=INFO                          # Record key operations
export OLLAMA_LOG_FILE="/var/log/ollama/access.log"   # Persist logs to a file
export OLLAMA_MAX_REQUEST_SIZE=10MB                   # Limit single-request size to mitigate DoS attacks
```
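For routine auditing, standard text tools are enough. A sketch assuming logs land in the file configured above; the exact log format depends on the Ollama version, so treat the grep patterns as placeholders:

```bash
# Rough count of generate requests recorded so far (pattern is a placeholder)
sudo grep -c "api/generate" /var/log/ollama/access.log
# Follow the log live and surface anything that looks like an error
sudo tail -f /var/log/ollama/access.log | grep --line-buffered -i "error"
```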
5. Advanced configuration and source-level tuning
Studying the Ollama source code (envconfig/config.go) unlocks the following advanced options:
```bash
# Practical settings surfaced from the source code
export OLLAMA_FLASH_ATTENTION=1       # Enable FlashAttention to speed up long-context inference
export OLLAMA_LLM_LIBRARY=llama.cpp   # Force a specific inference library (e.g. llama.cpp)
export OLLAMA_MAX_LOADED_MODELS=3     # Keep 3 models loaded simultaneously (requires sufficient VRAM)
```
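Because the recognized variables change between releases, the most reliable reference is the source of the version you actually run. A quick way to list them, using the official repository and the file path cited above:

```bash
# Grep the environment variable table straight out of the Ollama source tree
git clone --depth 1 https://github.com/ollama/ollama.git
grep -n '"OLLAMA_' ollama/envconfig/config.go
```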
6. Troubleshooting common problems
| Symptom | Possible cause | Solution |
|---|---|---|
| Port already in use | Multiple instances competing for the same port | Change OLLAMA_PORT=11435 and restart the service |
| Model fails to load | Insufficient path permissions | Make sure the OLLAMA_MODEL_DIR directory is readable and writable |
| GPU utilization below 50% | CUDA not enabled or too few GPU layers | Check OLLAMA_ENABLE_CUDA=1 and raise OLLAMA_GPU_LAYERS |
| Logs contain no useful information | Log level set too high | Set OLLAMA_LOG_LEVEL=DEBUG |
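Several of these checks can be scripted. A quick diagnostic sketch using standard Linux tools (ss, ls, nvidia-smi); adjust the port and directory to your own configuration:

```bash
ss -ltnp | grep 11434                                          # is something already listening on the port?
ls -ld "$OLLAMA_MODEL_DIR"                                     # are the model directory permissions correct?
nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader   # is the GPU actually being used?
```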
7. Appendix
Common environment variables for Ollama GPU tuning (values follow the examples earlier in this article):

| Environment variable | Purpose | Example value |
|---|---|---|
| OLLAMA_ENABLE_CUDA | Force CUDA acceleration | 1 |
| OLLAMA_GPU_LAYERS | Number of layers offloaded to the GPU | 40 |
| OLLAMA_MAX_GPU_MEMORY | Upper limit on VRAM usage | 6GB |
| OLLAMA_USE_MLOCK | Lock model data in memory | 1 |
Ollama concurrency tuning environment variables:

| Environment variable | Purpose | Example value |
|---|---|---|
| OLLAMA_MAX_WORKERS | Number of concurrent worker processes | 8 |
| OLLAMA_NUM_THREADS | Threads per worker process | 16 |
| OLLAMA_CACHE_SIZE | Cache size for frequently used results | 8GB |
| OLLAMA_KEEP_ALIVE_TIMEOUT | How long idle connections are kept alive | 60s |
Common security-related environment variables:

| Environment variable | Purpose | Example value |
|---|---|---|
| OLLAMA_ALLOW_ORIGINS | Restrict allowed cross-origin request sources | https://example.com |
| OLLAMA_DISABLE_REMOTE_PULL | Prevent unauthorized model pulls | 1 |
| OLLAMA_TLS_CERT_FILE | Provide the TLS certificate path | /ssl/cert.pem |
| OLLAMA_AUTH_TOKEN | Token used to authenticate API requests | (random hex string from openssl rand) |
Ollama environment variable default values
```go
func AsMap() map[string]EnvVar {
	return map[string]EnvVar{
		"OLLAMA_DEBUG":             {"OLLAMA_DEBUG", Debug, "Show additional debug information (eg OLLAMA_DEBUG=1)"},
		"OLLAMA_FLASH_ATTENTION":   {"OLLAMA_FLASH_ATTENTION", FlashAttention, "Enabled flash attention"},
		"OLLAMA_HOST":              {"OLLAMA_HOST", "", "IP Address for the ollama server (default 127.0.0.1:11434)"},
		"OLLAMA_KEEP_ALIVE":        {"OLLAMA_KEEP_ALIVE", KeepAlive, "The duration that models stay loaded in memory (default \"5m\")"},
		"OLLAMA_LLM_LIBRARY":       {"OLLAMA_LLM_LIBRARY", LLMLibrary, "Set LLM library to bypass autodetection"},
		"OLLAMA_MAX_LOADED_MODELS": {"OLLAMA_MAX_LOADED_MODELS", MaxRunners, "Maximum number of loaded models (default 1)"},
		"OLLAMA_MAX_QUEUE":         {"OLLAMA_MAX_QUEUE", MaxQueuedRequests, "Maximum number of queued requests"},
		"OLLAMA_MAX_VRAM":          {"OLLAMA_MAX_VRAM", MaxVRAM, "Maximum VRAM"},
		"OLLAMA_MODELS":            {"OLLAMA_MODELS", "", "The path to the models directory"},
		"OLLAMA_NOHISTORY":         {"OLLAMA_NOHISTORY", NoHistory, "Do not preserve readline history"},
		"OLLAMA_NOPRUNE":           {"OLLAMA_NOPRUNE", NoPrune, "Do not prune model blobs on startup"},
		"OLLAMA_NUM_PARALLEL":      {"OLLAMA_NUM_PARALLEL", NumParallel, "Maximum number of parallel requests (default 1)"},
		"OLLAMA_ORIGINS":           {"OLLAMA_ORIGINS", AllowOrigins, "A comma separated list of allowed origins"},
		"OLLAMA_RUNNERS_DIR":       {"OLLAMA_RUNNERS_DIR", RunnersDir, "Location for runners"},
		"OLLAMA_TMPDIR":            {"OLLAMA_TMPDIR", TmpDir, "Location for temporary files"},
	}
}
```
Commonly used Ollama environment variables (purposes follow the source-code descriptions above; example values are illustrative):

| Environment variable | Purpose | Example value |
|---|---|---|
| OLLAMA_HOST | IP address and port the server listens on | 0.0.0.0:11434 |
| OLLAMA_MODELS | Path to the models directory | /data/ollama/models |
| OLLAMA_KEEP_ALIVE | How long models stay loaded in memory | 5m |
| OLLAMA_NUM_PARALLEL | Maximum number of parallel requests | 4 |
| OLLAMA_MAX_LOADED_MODELS | Maximum number of models loaded at once | 2 |
| OLLAMA_MAX_QUEUE | Maximum number of queued requests | 512 |
| OLLAMA_ORIGINS | Comma-separated list of allowed origins | https://app.example.com |
| OLLAMA_DEBUG | Show additional debug information | 1 |
| OLLAMA_NOPRUNE | Do not prune model blobs on startup | 1 |
| OLLAMA_TMPDIR | Location for temporary files | /tmp/ollama |
Finally, monitor model loading status and resource usage through the HTTP API, for example:

```bash
curl http://localhost:11434/api/ps   # list the models currently loaded in memory
```

to confirm that the configuration behaves as expected. By mastering these core parameters, you can take full advantage of Ollama's local inference capabilities and build a high-performance, well-secured AI application.