A complete guide to Ollama environment variable configuration: from basic settings to scenario-based tuning

Written by Silas Grey
Updated on: June 29, 2025

Master the Ollama environment variable configuration, optimize model performance, and achieve efficient development.

Core content:
1. Cross-platform configuration guide: how to set variables on Linux/macOS and Windows
2. Docker container deployment tips and dynamic configuration at runtime
3. Efficient GPU utilization: tuning for scenarios with ample and with limited VRAM


In local deployment and performance tuning of Ollama, environment variables act as the "nerve center". By configuring these parameters flexibly, developers can fine-tune the model's runtime behavior and adapt to scenarios ranging from single-machine development to distributed clusters. This article draws on practical experience to share a systematic set of environment variable configurations that help you unlock Ollama's full potential.

1. Cross-platform environment variable configuration guide  

(I) Linux/macOS configuration

1. Temporary effect (single session)  

# Quick-start custom configuration
export OLLAMA_PORT=12345                  # Custom service port (avoids conflicts with the default port)
export OLLAMA_MODEL_DIR=./custom-models   # Dedicated model storage path
ollama serve --listen :$OLLAMA_PORT       # Start the service with the environment variables applied

2. Permanent effect (global configuration)  

Edit the configuration file for your shell (Zsh shown as the example):

echo 'export OLLAMA_NUM_GPUS=1' >> ~/.zshrc
echo 'export OLLAMA_CACHE_DIR="/data/ollama-cache"' >> ~/.zshrc
source ~/.zshrc   # Apply the changes immediately
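
Note that if Ollama was installed with the official Linux install script it typically runs as a systemd service, which does not read shell profiles. For that case, a drop-in override is the usual way to make variables permanent — a minimal sketch, assuming the service is named ollama:

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_NUM_GPUS=1"
Environment="OLLAMA_CACHE_DIR=/data/ollama-cache"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama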

(II) Windows graphical configuration steps  

  1. Open "Control Panel" → "System" → "Advanced system settings"  

  2. Add a new system variable in "Environment Variables":  

  • Variable Name:OLLAMA_MODEL_DIR

  • Variable value:C:\ollama\models(It is recommended to use English absolute path)  

  • Verify the configuration using the command line:  


    echo $env:OLLAMA_MODEL_DIR # Check whether the custom path is read correctly
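
The same variable can also be set persistently from PowerShell instead of the Control Panel dialog — a sketch that writes it at machine scope (run PowerShell as Administrator); only new terminal sessions will see it:

# Persist the variable at machine scope (equivalent to the GUI steps above)
[System.Environment]::SetEnvironmentVariable("OLLAMA_MODEL_DIR", "C:\ollama\models", "Machine")
# Open a new PowerShell window, then confirm it is visible:
echo $env:OLLAMA_MODEL_DIR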

(III) Docker container deployment tips

# Dockerfile configuration example
FROM ollama/ollama:latest

# OLLAMA_USE_MLOCK=1 locks memory to improve inference speed
ENV OLLAMA_PORT=11434 \
    OLLAMA_USE_MLOCK=1

# Persist model files outside the container
VOLUME /ollama/models

Inject configuration dynamically at runtime:

docker run -d \
  -p 11434:11434 \
  -v $(pwd)/models:/ollama/models \
  -e OLLAMA_GPU_LAYERS=32 \
  ollama/ollama:latest
# OLLAMA_GPU_LAYERS sets how many model layers are offloaded to the GPU
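
The same runtime configuration can also live in a Compose file so it is versioned alongside the project. A docker-compose.yml sketch equivalent to the command above (the service name and paths are illustrative; GPU passthrough, e.g. --gpus or device reservations, is configured separately depending on your Docker setup):

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ./models:/ollama/models
    environment:
      - OLLAMA_GPU_LAYERS=32   # number of model layers to offload to the GPU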

2. Efficient GPU Resource Utilization

(I) Sufficient VRAM (≥16 GB)

# Full GPU compute + memory optimization
export OLLAMA_ENABLE_CUDA=1    # Force CUDA acceleration
export OLLAMA_GPU_LAYERS=40    # Load 40 model layers onto the GPU
export OLLAMA_USE_MLOCK=1      # Prevent model data from being swapped to disk

Monitoring: use nvidia-smi to watch VRAM usage in real time and make sure GPU-Util stays above 80%.
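
For continuous monitoring rather than a one-off check, nvidia-smi can print utilization and memory figures on an interval — for example:

# Sample GPU utilization and VRAM usage once per second (Ctrl+C to stop)
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 1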

(II) Limited VRAM (8 GB or less)

# Layered compute + memory quota management
export OLLAMA_GPU_LAYERS=20        # 20 layers run on the GPU; the rest are handled by the CPU
export OLLAMA_MAX_GPU_MEMORY=6GB   # Cap GPU memory usage at 6 GB
export OLLAMA_ENABLE_CUDA=1        # Keep basic CUDA acceleration

Best practice: pair this with nvtop to monitor VRAM fluctuations in real time and avoid triggering OOM (out-of-memory) errors.

3. Concurrency Performance Optimization

1. High-concurrency API service configuration

# Build a high-performance service cluster
export OLLAMA_MAX_WORKERS=8            # 8 concurrent worker processes handle requests
export OLLAMA_NUM_THREADS=16           # 16 threads per worker for parallel computation
export OLLAMA_CACHE_SIZE=8GB           # Cache results for frequently accessed models
export OLLAMA_KEEP_ALIVE_TIMEOUT=60s   # Keep connections alive for 60 seconds to reduce handshake overhead

Performance impact: QPS (queries per second) can increase by 30%-50%, which suits high-traffic scenarios such as e-commerce customer service and intelligent Q&A.
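
To sanity-check a concurrency configuration, you can fire a batch of parallel requests at the generate endpoint and record per-request latency — a minimal sketch; the model name llama3 is only an example and should be replaced with one you have already pulled:

# 16 requests total, 8 in flight at a time
seq 16 | xargs -P 8 -I{} curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Reply with one word.", "stream": false}' \
  -o /dev/null -w "request {}: %{time_total}s\n"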

2. Lightweight deployment configuration (laptop/edge device)

# Optimize for resource-constrained environments
export OLLAMA_MAX_WORKERS=2    # Limit concurrent workers to avoid CPU overload
export OLLAMA_NUM_THREADS=4    # Match the core count of a low-power CPU
export OLLAMA_CACHE_SIZE=2GB   # Keep memory usage within a reasonable range

Applicable scenarios: lightweight applications such as local knowledge-base queries and single-user code assistance.
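
When sizing OLLAMA_NUM_THREADS for a laptop or edge device, it helps to read the actual core count rather than guessing — a small sketch for Linux/macOS shells:

# nproc works on Linux; the sysctl fallback covers macOS
CORES=$(nproc 2>/dev/null || sysctl -n hw.ncpu)
export OLLAMA_NUM_THREADS="$CORES"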

4. Hardening Security in Production Environments

1. API access control

# Basic authentication + HTTPS encryption
export OLLAMA_AUTH_TOKEN="$(openssl rand -hex 32)"          # Generate a random 32-byte (64-hex-character) authentication token
export OLLAMA_ALLOW_ORIGINS="https://api.yourdomain.com"    # Restrict allowed cross-origin request sources
export OLLAMA_ENABLE_TLS=1                                  # Enable TLS-encrypted communication
export OLLAMA_TLS_CERT_FILE="/ssl/cert.pem"                 # Certificate file path
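
On the client side, a request would then carry the token, for example as a bearer header — a sketch that assumes your deployment (or a reverse proxy placed in front of Ollama) actually enforces OLLAMA_AUTH_TOKEN; not every Ollama build ships built-in token checking, so verify this against your own setup:

# Hypothetical authenticated call; the hostname and bearer scheme depend on your gateway
curl https://api.yourdomain.com/api/generate \
  -H "Authorization: Bearer $OLLAMA_AUTH_TOKEN" \
  -d '{"model": "llama3", "prompt": "ping", "stream": false}'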
2. Data security strategy
# Prevent model tampering and malicious pulls
export OLLAMA_DISABLE_REMOTE_PULL=1   # Disable remote model downloads
export OLLAMA_READ_ONLY=1             # Enable read-only mode to protect local models
export OLLAMA_ENABLE_SANDBOX=1        # Enable containerized sandbox isolation

3. Security monitoring configuration

# Log auditing and request limiting
export OLLAMA_LOG_LEVEL=INFO                          # Record key operational events
export OLLAMA_LOG_FILE="/var/log/ollama/access.log"   # Persist logs to a file
export OLLAMA_MAX_REQUEST_SIZE=10MB                   # Limit the size of a single request to mitigate DoS attacks
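
With the log file configured above, a simple follow-and-filter loop is often enough for day-to-day monitoring — for example:

# Follow the access log and surface warnings and errors as they are written
tail -F /var/log/ollama/access.log | grep -Ei "warn|error"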

5. Advanced Configuration and Source-Level Tuning

Studying the Ollama source code (envconfig/config.go) unlocks the following advanced configurations:

# Practical configuration hidden in the source code
export OLLAMA_FLASH_ATTENTION=1       # Enable flash attention to optimize long-context inference
export OLLAMA_LLM_LIBRARY=llama.cpp   # Force a specific inference library (e.g. llama.cpp)
export OLLAMA_MAX_LOADED_MODELS=3     # Keep up to 3 models loaded at once (requires sufficient VRAM)

6. Troubleshooting Common Problems

| Symptom | Likely cause | Solution |
|---|---|---|
| Port already in use | Multiple instances conflicting on the same port | Change OLLAMA_PORT=11435 and restart the service |
| Model fails to load | Insufficient path permissions | Make sure the OLLAMA_MODEL_DIR directory is readable and writable |
| GPU utilization below 50% | CUDA not enabled or too few GPU layers | Check OLLAMA_ENABLE_CUDA=1 and raise OLLAMA_GPU_LAYERS |
| Logs contain no useful information | Log level set too high | Set OLLAMA_LOG_LEVEL=DEBUG |
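
For the port-conflict case in particular, it helps to identify the process holding the default port before changing OLLAMA_PORT — for example:

# Find what is listening on 11434
ss -ltnp | grep 11434            # Linux
lsof -iTCP:11434 -sTCP:LISTEN    # macOS, or Linux with lsof installed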

7. Appendix


Common environment variables for Ollama GPU tuning

| Environment variable | Purpose | Example value | Notes |
|---|---|---|---|
| OLLAMA_NUM_GPUS | Number of GPUs to use | 1, 2 | Ollama currently targets a single GPU; multi-GPU support may come later |
| OLLAMA_GPU_LAYERS | Number of model layers to run on the GPU | 32, 40 | The larger the value, the higher the GPU load and the lower the CPU usage |
| OLLAMA_ENABLE_CUDA | Force CUDA-based GPU inference | 1 or true | Enable this whenever CUDA is available |
| OLLAMA_USE_MLOCK | Lock the model in memory so data is not swapped to disk | 1 or true | Improves inference speed by preventing memory swapping |
| OLLAMA_USE_GPU_OFFLOAD | Enable GPU offload to move some work from the CPU to the GPU | 1 or true | Suited to GPUs with larger VRAM |
| OLLAMA_MAX_GPU_MEMORY | Limit the amount of GPU memory Ollama may use | 8GB, 16GB | Helps avoid VRAM overflow in multi-task scenarios |


Ollama concurrency tuning environment variables

| Environment variable | Purpose | Example value | Notes |
|---|---|---|---|
| OLLAMA_MAX_WORKERS | Maximum number of concurrent workers, which sets the parallelism of inference tasks | 2, 4, 8 | Set higher to serve more concurrent requests |
| OLLAMA_NUM_THREADS | Number of threads used by each worker | 4, 8, 16 | Improves CPU utilization; multithreading speeds up inference |
| OLLAMA_CACHE_SIZE | Size of the model cache, reducing repeated loading | 4GB, 8GB | Cuts compute overhead for repeated model/input combinations |
| OLLAMA_KEEP_ALIVE_TIMEOUT | How long HTTP connections are kept alive | 30s, 60s | Avoids frequent connection setup and improves API response times |
| OLLAMA_ENABLE_PARALLEL_DECODE | Enable parallel decoding across simultaneous requests | 1 or true | Improves multi-request throughput when a GPU is available |

Common security-related environment variables

| Environment variable | Purpose | Example value | Notes |
|---|---|---|---|
| OLLAMA_AUTH_TOKEN | Authentication token required for API requests | your-secret-token | Enables identity authentication to block unauthorized access |
| OLLAMA_ALLOW_ORIGINS | Allowed cross-origin request sources | https://example.com | Restricts API access to specific origins to prevent CSRF attacks |
| OLLAMA_DISABLE_REMOTE_PULL | Disable downloading models from remote sources | 1 or true | Prevents unauthorized model pulls |
| OLLAMA_READ_ONLY | Put Ollama into read-only mode | 1 or true | Prohibits changes to models and configuration |
| OLLAMA_API_PORT | Custom API port | 11434 | Avoid the default port to reduce the attack surface |
| OLLAMA_MAX_REQUEST_SIZE | Maximum size of a single API request | 10MB | Mitigates DoS (denial-of-service) attacks |
| OLLAMA_LOG_LEVEL | Logging level | INFO, WARN, ERROR | Records important events and helps monitor abnormal behavior |
| OLLAMA_ENABLE_TLS | Enable TLS encryption | 1 or true | Protects API traffic against man-in-the-middle attacks |
| OLLAMA_TLS_CERT_FILE | Path to the TLS certificate | /path/to/cert.pem | Used together with TLS |
| OLLAMA_TLS_KEY_FILE | Path to the TLS private key | /path/to/key.pem | Used together with TLS |
| OLLAMA_ENABLE_SANDBOX | Enable a sandboxed model environment | 1 or true | Isolates the model runtime to contain malicious model behavior |


Ollama environment variable default values

The Ollama source file envconfig/config.go defines the default configuration:

func AsMap() map[string]EnvVar {
    return map[string]EnvVar{
        "OLLAMA_DEBUG":             {"OLLAMA_DEBUG", Debug, "Show additional debug information (eg OLLAMA_DEBUG=1)"},
        "OLLAMA_FLASH_ATTENTION":   {"OLLAMA_FLASH_ATTENTION", FlashAttention, "Enabled flash attention"},
        "OLLAMA_HOST":              {"OLLAMA_HOST", "", "IP Address for the ollama server (default 127.0.0.1:11434)"},
        "OLLAMA_KEEP_ALIVE":        {"OLLAMA_KEEP_ALIVE", KeepAlive, "The duration that models stay loaded in memory (default \"5m\")"},
        "OLLAMA_LLM_LIBRARY":       {"OLLAMA_LLM_LIBRARY", LLMLibrary, "Set LLM library to bypass autodetection"},
        "OLLAMA_MAX_LOADED_MODELS": {"OLLAMA_MAX_LOADED_MODELS", MaxRunners, "Maximum number of loaded models (default 1)"},
        "OLLAMA_MAX_QUEUE":         {"OLLAMA_MAX_QUEUE", MaxQueuedRequests, "Maximum number of queued requests"},
        "OLLAMA_MAX_VRAM":          {"OLLAMA_MAX_VRAM", MaxVRAM, "Maximum VRAM"},
        "OLLAMA_MODELS":            {"OLLAMA_MODELS", "", "The path to the models directory"},
        "OLLAMA_NOHISTORY":         {"OLLAMA_NOHISTORY", NoHistory, "Do not preserve readline history"},
        "OLLAMA_NOPRUNE":           {"OLLAMA_NOPRUNE", NoPrune, "Do not prune model blobs on startup"},
        "OLLAMA_NUM_PARALLEL":      {"OLLAMA_NUM_PARALLEL", NumParallel, "Maximum number of parallel requests (default 1)"},
        "OLLAMA_ORIGINS":           {"OLLAMA_ORIGINS", AllowOrigins, "A comma separated list of allowed origins"},
        "OLLAMA_RUNNERS_DIR":       {"OLLAMA_RUNNERS_DIR", RunnersDir, "Location for runners"},
        "OLLAMA_TMPDIR":            {"OLLAMA_TMPDIR", TmpDir, "Location for temporary files"},
    }
}

Commonly used Ollama environment variables

Basic Configuration

| Environment variable | Purpose | Example value | Notes |
|---|---|---|---|
| OLLAMA_HOST | Address the Ollama API listens on | 0.0.0.0 or 127.0.0.1 | Controls whether the API is reachable locally or remotely |
| OLLAMA_PORT | Port the Ollama API listens on | 11434 | Defaults to 11434; change it to avoid port conflicts |


Model Management Configuration

| Environment variable | Purpose | Example value | Notes |
|---|---|---|---|
| OLLAMA_PULL_PROXY | Proxy address used when downloading models | http://proxy.example.com | Speeds up model pulls, especially in mainland China |
| OLLAMA_CACHE_DIR | Model cache directory | /path/to/cache | Avoids downloading models repeatedly |
| OLLAMA_ALLOW_REMOTE_MODELS | Whether remote model fetching is allowed | 1 or true | Can be used to restrict external model downloads |
| OLLAMA_FORCE_REDOWNLOAD | Force a re-download of a model | 1 or true | Ensures the latest version is pulled after a model update |


Performance optimization configuration

| Environment variable | Purpose | Example value | Notes |
|---|---|---|---|
| OLLAMA_NUM_GPUS | Number of GPUs to use | 1 or 2 | Intended for multi-GPU inference, though Ollama currently targets a single GPU |
| OLLAMA_NUM_THREADS | Number of CPU threads used during inference | 8 | Useful for CPU inference optimization |
| OLLAMA_GPU_LAYERS | Number of layers to run on the GPU | 32 | GPU acceleration for quantized models |
| OLLAMA_ENABLE_CUDA | Enable CUDA for GPU inference | 1 or true | Enable this whenever CUDA is available |
| OLLAMA_USE_MLOCK | Lock memory to prevent data from being swapped to disk | 1 or true | Improves inference performance, especially for large models |

Security Configuration

| Environment variable | Purpose | Example value | Notes |
|---|---|---|---|
| OLLAMA_AUTH_TOKEN | Authentication for API calls | your_token_here | Protects the API from unauthorized access |
| OLLAMA_DISABLE_REMOTE_MODELS | Disable loading models from remote sources | 1 or true | Ensures only local models are used |
| OLLAMA_LOG_LEVEL | Logging level | info, debug, error | Facilitates security monitoring and logging |

Debug and development configuration

| Environment variable | Purpose | Example value | Notes |
|---|---|---|---|
| OLLAMA_LOG_FILE | Log output file | /path/to/logfile.log | Saves logs to a file for later analysis |
| OLLAMA_DEV_MODE | Enable development mode | 1 or true | Provides additional debugging information |
| OLLAMA_PROFILE | Enable performance profiling | 1 or true | Outputs performance data for analyzing inference speed |
| OLLAMA_DEBUG | Enable debug mode | 1 or true | Shows more log output to aid troubleshooting |


By properly configuring Ollama's environment variables, developers can precisely adapt the runtime to every stage from development and testing to production deployment. The curl http://localhost:11434/api/status interface can be used to monitor model loading status and resource usage and confirm that the configuration behaves as expected. Mastering these core parameters lets you take full advantage of Ollama's local inference strengths and build high-performance, highly secure AI applications.
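
As a quick verification sketch (endpoint paths can differ between Ollama versions, so adjust to what your build exposes), the standard REST endpoints can confirm that the service is up and which models are available or loaded:

# Confirm the server responds and list locally available models
curl -s http://localhost:11434/api/tags
# Show models currently loaded in memory (supported by recent builds)
curl -s http://localhost:11434/api/ps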