A complete guide to Ollama environment variable configuration: from basic settings to scenario-based tuning

Written by Silas Grey
Updated on: June 29, 2025

Master the Ollama environment variable configuration, optimize model performance, and achieve efficient development.

Core content:
1. Cross-platform configuration guide: how to set variables on Linux/macOS and Windows
2. Docker container deployment tips and dynamic configuration at runtime
3. Efficient GPU utilization: tuning for scenarios with ample and with limited VRAM


In local deployment and performance tuning of Ollama, environment variables act as the "nerve center". By configuring these parameters flexibly, developers can fine-tune the model's runtime behavior and adapt to scenarios ranging from single-machine development to distributed clusters. This article draws on practical experience to share a systematic set of environment variable configurations that help you unlock Ollama's full potential.

1. Cross-platform environment variable configuration guide  

(I) Linux/macOS configuration

1. Temporary effect (single session)  

# Quick-start custom configuration
export OLLAMA_PORT=12345                  # Custom service port (avoids conflicts with the default port)
export OLLAMA_MODEL_DIR=./custom-models   # Dedicated model storage path
ollama serve --listen :$OLLAMA_PORT       # Start the service with the environment variables applied

2. Permanent effect (global configuration)  

Edit the configuration file for your shell (Zsh shown as the example):

echo 'export OLLAMA_NUM_GPUS=1' >> ~/.zshrc
echo 'export OLLAMA_CACHE_DIR="/data/ollama-cache"' >> ~/.zshrc
source ~/.zshrc   # Apply the changes immediately
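
Note that if Ollama was installed with the official Linux install script it typically runs as a systemd service, which does not read shell profiles. For that case, a drop-in override is the usual way to make variables permanent — a minimal sketch, assuming the service is named ollama:

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_NUM_GPUS=1"
Environment="OLLAMA_CACHE_DIR=/data/ollama-cache"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama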

(II) Windows graphical configuration steps  

  1. Open "Control Panel" → "System" → "Advanced system settings"  

  2. Add a new system variable in "Environment Variables":  

  • Variable Name:OLLAMA_MODEL_DIR

  • Variable value:C:\ollama\models(It is recommended to use English absolute path)  

  • Verify the configuration using the command line:  


    echo $env:OLLAMA_MODEL_DIR # Check whether the custom path is read correctly
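
The same variable can also be set persistently from PowerShell instead of the Control Panel dialog — a sketch that writes it at machine scope (run PowerShell as Administrator); only new terminal sessions will see it:

# Persist the variable at machine scope (equivalent to the GUI steps above)
[System.Environment]::SetEnvironmentVariable("OLLAMA_MODEL_DIR", "C:\ollama\models", "Machine")
# Open a new PowerShell window, then confirm it is visible:
echo $env:OLLAMA_MODEL_DIR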

(III) Docker container deployment tips

# Dockerfile configuration example
FROM ollama/ollama:latest

# OLLAMA_USE_MLOCK=1 locks memory to improve inference speed
ENV OLLAMA_PORT=11434 \
    OLLAMA_USE_MLOCK=1

# Persist model files outside the container
VOLUME /ollama/models

Inject configuration dynamically at runtime:

docker run -d \
  -p 11434:11434 \
  -v $(pwd)/models:/ollama/models \
  -e OLLAMA_GPU_LAYERS=32 \
  ollama/ollama:latest
# OLLAMA_GPU_LAYERS sets how many model layers are offloaded to the GPU
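
The same runtime configuration can also live in a Compose file so it is versioned alongside the project. A docker-compose.yml sketch equivalent to the command above (the service name and paths are illustrative; GPU passthrough, e.g. --gpus or device reservations, is configured separately depending on your Docker setup):

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ./models:/ollama/models
    environment:
      - OLLAMA_GPU_LAYERS=32   # number of model layers to offload to the GPU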

2. Efficient GPU Resource Utilization

(I) Sufficient VRAM (≥16 GB)

# Full GPU compute + memory optimization
export OLLAMA_ENABLE_CUDA=1    # Force CUDA acceleration
export OLLAMA_GPU_LAYERS=40    # Load 40 model layers onto the GPU
export OLLAMA_USE_MLOCK=1      # Prevent model data from being swapped to disk

Monitoring: use nvidia-smi to watch VRAM usage in real time and make sure GPU-Util stays above 80%.
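
For continuous monitoring rather than a one-off check, nvidia-smi can print utilization and memory figures on an interval — for example:

# Sample GPU utilization and VRAM usage once per second (Ctrl+C to stop)
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 1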

(II) Limited VRAM (8 GB or less)

# Layered compute + memory quota management
export OLLAMA_GPU_LAYERS=20        # 20 layers run on the GPU; the rest are handled by the CPU
export OLLAMA_MAX_GPU_MEMORY=6GB   # Cap GPU memory usage at 6 GB
export OLLAMA_ENABLE_CUDA=1        # Keep basic CUDA acceleration

Best practice: pair this with nvtop to monitor VRAM fluctuations in real time and avoid triggering OOM (out-of-memory) errors.

3. Concurrency Performance Optimization

1. High-concurrency API service configuration

# Build a high-performance service cluster
export OLLAMA_MAX_WORKERS=8            # 8 concurrent worker processes handle requests
export OLLAMA_NUM_THREADS=16           # 16 threads per worker for parallel computation
export OLLAMA_CACHE_SIZE=8GB           # Cache results for frequently accessed models
export OLLAMA_KEEP_ALIVE_TIMEOUT=60s   # Keep connections alive for 60 seconds to reduce handshake overhead

Performance impact: QPS (queries per second) can increase by 30%-50%, which suits high-traffic scenarios such as e-commerce customer service and intelligent Q&A.
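
To sanity-check a concurrency configuration, you can fire a batch of parallel requests at the generate endpoint and record per-request latency — a minimal sketch; the model name llama3 is only an example and should be replaced with one you have already pulled:

# 16 requests total, 8 in flight at a time
seq 16 | xargs -P 8 -I{} curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Reply with one word.", "stream": false}' \
  -o /dev/null -w "request {}: %{time_total}s\n"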

2. Lightweight deployment configuration (laptop/edge device)

# Optimize for resource-constrained environments
export OLLAMA_MAX_WORKERS=2    # Limit concurrent workers to avoid CPU overload
export OLLAMA_NUM_THREADS=4    # Match the core count of a low-power CPU
export OLLAMA_CACHE_SIZE=2GB   # Keep memory usage within a reasonable range

Applicable scenarios: lightweight applications such as local knowledge-base queries and single-user code assistance.
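
When sizing OLLAMA_NUM_THREADS for a laptop or edge device, it helps to read the actual core count rather than guessing — a small sketch for Linux/macOS shells:

# nproc works on Linux; the sysctl fallback covers macOS
CORES=$(nproc 2>/dev/null || sysctl -n hw.ncpu)
export OLLAMA_NUM_THREADS="$CORES"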

4. Hardening Security in Production Environments

1. API access control

# Basic authentication + HTTPS encryption
export OLLAMA_AUTH_TOKEN="$(openssl rand -hex 32)"          # Generate a random 32-byte (64-hex-character) authentication token
export OLLAMA_ALLOW_ORIGINS="https://api.yourdomain.com"    # Restrict allowed cross-origin request sources
export OLLAMA_ENABLE_TLS=1                                  # Enable TLS-encrypted communication
export OLLAMA_TLS_CERT_FILE="/ssl/cert.pem"                 # Certificate file path
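
On the client side, a request would then carry the token, for example as a bearer header — a sketch that assumes your deployment (or a reverse proxy placed in front of Ollama) actually enforces OLLAMA_AUTH_TOKEN; not every Ollama build ships built-in token checking, so verify this against your own setup:

# Hypothetical authenticated call; the hostname and bearer scheme depend on your gateway
curl https://api.yourdomain.com/api/generate \
  -H "Authorization: Bearer $OLLAMA_AUTH_TOKEN" \
  -d '{"model": "llama3", "prompt": "ping", "stream": false}'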
2. Data security strategy
# Prevent model tampering and malicious pulls
export OLLAMA_DISABLE_REMOTE_PULL=1   # Disable remote model downloads
export OLLAMA_READ_ONLY=1             # Enable read-only mode to protect local models
export OLLAMA_ENABLE_SANDBOX=1        # Enable containerized sandbox isolation

3. Security monitoring configuration

# Log auditing and request limiting
export OLLAMA_LOG_LEVEL=INFO                          # Record key operational events
export OLLAMA_LOG_FILE="/var/log/ollama/access.log"   # Persist logs to a file
export OLLAMA_MAX_REQUEST_SIZE=10MB                   # Limit the size of a single request to mitigate DoS attacks
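
With the log file configured above, a simple follow-and-filter loop is often enough for day-to-day monitoring — for example:

# Follow the access log and surface warnings and errors as they are written
tail -F /var/log/ollama/access.log | grep -Ei "warn|error"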

5. Advanced Configuration and Source-Level Tuning

Studying the Ollama source code (envconfig/config.go) unlocks the following advanced configurations:

# Practical configuration hidden in the source code
export OLLAMA_FLASH_ATTENTION=1       # Enable flash attention to optimize long-context inference
export OLLAMA_LLM_LIBRARY=llama.cpp   # Force a specific inference library (e.g. llama.cpp)
export OLLAMA_MAX_LOADED_MODELS=3     # Keep up to 3 models loaded at once (requires sufficient VRAM)

6. Troubleshooting Common Problems

| Symptom | Likely cause | Solution |
|---|---|---|
| Port already in use | Multiple instances conflicting on the same port | Change OLLAMA_PORT=11435 and restart the service |
| Model fails to load | Insufficient path permissions | Make sure the OLLAMA_MODEL_DIR directory is readable and writable |
| GPU utilization below 50% | CUDA not enabled or too few GPU layers | Check OLLAMA_ENABLE_CUDA=1 and raise OLLAMA_GPU_LAYERS |
| Logs contain no useful information | Log level set too high | Set OLLAMA_LOG_LEVEL=DEBUG |
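
For the port-conflict case in particular, it helps to identify the process holding the default port before changing OLLAMA_PORT — for example:

# Find what is listening on 11434
ss -ltnp | grep 11434            # Linux
lsof -iTCP:11434 -sTCP:LISTEN    # macOS, or Linux with lsof installed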

7. Appendix


Common environment variables for Ollama GPU tuning

| Environment variable | Purpose | Example value | Notes |
|---|---|---|---|
| OLLAMA_NUM_GPUS | Number of GPUs to use | 1, 2 | Ollama currently targets a single GPU; multi-GPU support may come later |
| OLLAMA_GPU_LAYERS | Number of model layers to run on the GPU | 32, 40 | The larger the value, the higher the GPU load and the lower the CPU usage |
| OLLAMA_ENABLE_CUDA | Force CUDA-based GPU inference | 1 or true | Enable this whenever CUDA is available |
| OLLAMA_USE_MLOCK | Lock the model in memory so data is not swapped to disk | 1 or true | Improves inference speed by preventing memory swapping |
| OLLAMA_USE_GPU_OFFLOAD | Enable GPU offload to move some work from the CPU to the GPU | 1 or true | Suited to GPUs with larger VRAM |
| OLLAMA_MAX_GPU_MEMORY | Limit the amount of GPU memory Ollama may use | 8GB, 16GB | Helps avoid VRAM overflow in multi-task scenarios |


Ollama concurrency tuning environment variables

| Environment variable | Purpose | Example value | Notes |
|---|---|---|---|
| OLLAMA_MAX_WORKERS | Maximum number of concurrent workers, which sets the parallelism of inference tasks | 2, 4, 8 | Set higher to serve more concurrent requests |
| OLLAMA_NUM_THREADS | Number of threads used by each worker | 4, 8, 16 | Improves CPU utilization; multithreading speeds up inference |
| OLLAMA_CACHE_SIZE | Size of the model cache, reducing repeated loading | 4GB, 8GB | Cuts compute overhead for repeated model/input combinations |
| OLLAMA_KEEP_ALIVE_TIMEOUT | How long HTTP connections are kept alive | 30s, 60s | Avoids frequent connection setup and improves API response times |
| OLLAMA_ENABLE_PARALLEL_DECODE | Enable parallel decoding across simultaneous requests | 1 or true | Improves multi-request throughput when a GPU is available |

Common security-related environment variables

| Environment variable | Purpose | Example value | Notes |
|---|---|---|---|
| OLLAMA_AUTH_TOKEN | Authentication token required for API requests | your-secret-token | Enables identity authentication to block unauthorized access |
| OLLAMA_ALLOW_ORIGINS | Allowed cross-origin request sources | https://example.com | Restricts API access to specific origins to prevent CSRF attacks |
| OLLAMA_DISABLE_REMOTE_PULL | Disable downloading models from remote sources | 1 or true | Prevents unauthorized model pulls |
| OLLAMA_READ_ONLY | Put Ollama into read-only mode | 1 or true | Prohibits changes to models and configuration |
| OLLAMA_API_PORT | Custom API port | 11434 | Avoid the default port to reduce the attack surface |
| OLLAMA_MAX_REQUEST_SIZE | Maximum size of a single API request | 10MB | Mitigates DoS (denial-of-service) attacks |
| OLLAMA_LOG_LEVEL | Logging level | INFO, WARN, ERROR | Records important events and helps monitor abnormal behavior |
| OLLAMA_ENABLE_TLS | Enable TLS encryption | 1 or true | Protects API traffic against man-in-the-middle attacks |
| OLLAMA_TLS_CERT_FILE | Path to the TLS certificate | /path/to/cert.pem | Used together with TLS |
| OLLAMA_TLS_KEY_FILE | Path to the TLS private key | /path/to/key.pem | Used together with TLS |
| OLLAMA_ENABLE_SANDBOX | Enable a sandboxed model environment | 1 or true | Isolates the model runtime to contain malicious model behavior |


Ollama environment variable default values

The Ollama source file envconfig/config.go defines the default configuration:

func AsMap() map[string]EnvVar {
    return map[string]EnvVar{
        "OLLAMA_DEBUG":             {"OLLAMA_DEBUG", Debug, "Show additional debug information (eg OLLAMA_DEBUG=1)"},
        "OLLAMA_FLASH_ATTENTION":   {"OLLAMA_FLASH_ATTENTION", FlashAttention, "Enabled flash attention"},
        "OLLAMA_HOST":              {"OLLAMA_HOST", "", "IP Address for the ollama server (default 127.0.0.1:11434)"},
        "OLLAMA_KEEP_ALIVE":        {"OLLAMA_KEEP_ALIVE", KeepAlive, "The duration that models stay loaded in memory (default \"5m\")"},
        "OLLAMA_LLM_LIBRARY":       {"OLLAMA_LLM_LIBRARY", LLMLibrary, "Set LLM library to bypass autodetection"},
        "OLLAMA_MAX_LOADED_MODELS": {"OLLAMA_MAX_LOADED_MODELS", MaxRunners, "Maximum number of loaded models (default 1)"},
        "OLLAMA_MAX_QUEUE":         {"OLLAMA_MAX_QUEUE", MaxQueuedRequests, "Maximum number of queued requests"},
        "OLLAMA_MAX_VRAM":          {"OLLAMA_MAX_VRAM", MaxVRAM, "Maximum VRAM"},
        "OLLAMA_MODELS":            {"OLLAMA_MODELS", "", "The path to the models directory"},
        "OLLAMA_NOHISTORY":         {"OLLAMA_NOHISTORY", NoHistory, "Do not preserve readline history"},
        "OLLAMA_NOPRUNE":           {"OLLAMA_NOPRUNE", NoPrune, "Do not prune model blobs on startup"},
        "OLLAMA_NUM_PARALLEL":      {"OLLAMA_NUM_PARALLEL", NumParallel, "Maximum number of parallel requests (default 1)"},
        "OLLAMA_ORIGINS":           {"OLLAMA_ORIGINS", AllowOrigins, "A comma separated list of allowed origins"},
        "OLLAMA_RUNNERS_DIR":       {"OLLAMA_RUNNERS_DIR", RunnersDir, "Location for runners"},
        "OLLAMA_TMPDIR":            {"OLLAMA_TMPDIR", TmpDir, "Location for temporary files"},
    }
}

Commonly used Ollama environment variables

Basic Configuration

| Environment variable | Purpose | Example value | Notes |
|---|---|---|---|
| OLLAMA_HOST | Address the Ollama API listens on | 0.0.0.0 or 127.0.0.1 | Controls whether the API is reachable locally or remotely |
| OLLAMA_PORT | Port the Ollama API listens on | 11434 | Defaults to 11434; change it to avoid port conflicts |


Model Management Configuration

| Environment variable | Purpose | Example value | Notes |
|---|---|---|---|
| OLLAMA_PULL_PROXY | Proxy address used when downloading models | http://proxy.example.com | Speeds up model pulls, especially in mainland China |
| OLLAMA_CACHE_DIR | Model cache directory | /path/to/cache | Avoids downloading models repeatedly |
| OLLAMA_ALLOW_REMOTE_MODELS | Whether remote model fetching is allowed | 1 or true | Can be used to restrict external model downloads |
| OLLAMA_FORCE_REDOWNLOAD | Force a re-download of a model | 1 or true | Ensures the latest version is pulled after a model update |


Performance optimization configuration

| Environment variable | Purpose | Example value | Notes |
|---|---|---|---|
| OLLAMA_NUM_GPUS | Number of GPUs to use | 1 or 2 | Intended for multi-GPU inference, though Ollama currently targets a single GPU |
| OLLAMA_NUM_THREADS | Number of CPU threads used during inference | 8 | Useful for CPU inference optimization |
| OLLAMA_GPU_LAYERS | Number of layers to run on the GPU | 32 | GPU acceleration for quantized models |
| OLLAMA_ENABLE_CUDA | Enable CUDA for GPU inference | 1 or true | Enable this whenever CUDA is available |
| OLLAMA_USE_MLOCK | Lock memory to prevent data from being swapped to disk | 1 or true | Improves inference performance, especially for large models |

Security Configuration

| Environment variable | Purpose | Example value | Notes |
|---|---|---|---|
| OLLAMA_AUTH_TOKEN | Authentication for API calls | your_token_here | Protects the API from unauthorized access |
| OLLAMA_DISABLE_REMOTE_MODELS | Disable loading models from remote sources | 1 or true | Ensures only local models are used |
| OLLAMA_LOG_LEVEL | Logging level | info, debug, error | Facilitates security monitoring and logging |

Debug and development configuration

| Environment variable | Purpose | Example value | Notes |
|---|---|---|---|
| OLLAMA_LOG_FILE | Log output file | /path/to/logfile.log | Saves logs to a file for later analysis |
| OLLAMA_DEV_MODE | Enable development mode | 1 or true | Provides additional debugging information |
| OLLAMA_PROFILE | Enable performance profiling | 1 or true | Outputs performance data for analyzing inference speed |
| OLLAMA_DEBUG | Enable debug mode | 1 or true | Shows more log output to aid troubleshooting |


By properly configuring Ollama's environment variables, developers can precisely adapt the runtime to every stage from development and testing to production deployment. The curl http://localhost:11434/api/status interface can be used to monitor model loading status and resource usage and confirm that the configuration behaves as expected. Mastering these core parameters lets you take full advantage of Ollama's local inference strengths and build high-performance, highly secure AI applications.
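
As a quick verification sketch (endpoint paths can differ between Ollama versions, so adjust to what your build exposes), the standard REST endpoints can confirm that the service is up and which models are available or loaded:

# Confirm the server responds and list locally available models
curl -s http://localhost:11434/api/tags
# Show models currently loaded in memory (supported by recent builds)
curl -s http://localhost:11434/api/ps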