8-card H100 large model training environment deployment document

Written by
Clara Bennett
Updated on: June 27, 2025

A comprehensive guide to deploying a deep learning environment.

Core content:
1. System preparation and basic tool installation
2. Detailed steps for NVIDIA driver and CUDA configuration
3. PyTorch and LLaMA-Factory configuration and operation

Objective: Deploy a deep learning environment on an Ubuntu 22.04 system with 8 NVIDIA H100 GPUs that supports model training and inference, including LLaMA-Factory and DeepSpeed distributed training.

Hardware environment:

  • CPU: 36 cores
  • Memory: 1 TB
  • GPU: 8x NVIDIA H100 (80 GB HBM3)
  • Storage: NVMe SSD
  • Operating system: Ubuntu 22.04 LTS

Date: April 17, 2025

Table of contents

  1. System Preparation
     1.1 Update the system and install basic tools
     1.2 Configure storage
     1.3 Configure the network
  2. NVIDIA Driver and CUDA Configuration
     2.1 Install the NVIDIA driver
     2.2 Install NVIDIA Fabric Manager
     2.3 Install CUDA (optional for specific versions)
     2.4 Install cuDNN (optional for specific versions)
  3. Anaconda Environment Configuration
     3.1 Install Anaconda
     3.2 Create a Conda environment
  4. Install PyTorch and Its Dependencies
     4.1 Install PyTorch [Required]
     4.2 Install FlashAttention
  5. LLaMA-Factory Configuration
     5.1 Install LLaMA-Factory
     5.2 Configure datasets and models
     5.3 Model format conversion (optional)
  6. Run Training and Inference Tasks
     6.1 Training tasks
     6.2 Inference tasks
  7. Installation of Auxiliary Tools
     7.1 Install llama.cpp
     7.2 Install nvitop
  8. Version Summary
  9. AI Analysis of the Installation Process
     9.1 Valid commands and key installations
     9.2 Invalid or repeated commands
  10. Troubleshooting
     10.1 CUDA Error 802 (system not yet initialized)

1. System Preparation

1.1 Update the system and install basic tools

Ensure system packages are up to date and install necessary tools:

sudo apt-get update
sudo apt install -y net-tools iputils-ping iptables parted lrzsz vim axel unzip cmake gcc make build-essential ninja-build

1.2 Configuring Storage

Configure NVMe SSD and LVM for large datasets and model checkpoints:

# Check disks
sudo fdisk -l
lsblk

# Partition the NVMe disk
sudo parted /dev/nvme0n1
# Inside parted: mklabel gpt, mkpart primary ext4 0% 100%, set 1 lvm on, quit

# Format the partition
sudo mkfs.ext4 /dev/nvme0n1p1

# Create an LVM logical volume (in the existing ubuntu-vg volume group)
sudo lvcreate -n backup-lv -l 100%FREE ubuntu-vg
sudo mkfs.ext4 /dev/ubuntu-vg/backup-lv

# Create mount points
sudo mkdir -p /data /backup

# Configure /etc/fstab
sudo vi /etc/fstab
# Add the following lines (use blkid to get the UUID):
# UUID=<nvme0n1p1-uuid> /data ext4 defaults 0 0
# /dev/ubuntu-vg/backup-lv /backup ext4 defaults 0 0

# Mount everything
sudo mount -a

Verify:

  • Check the mounts: df -Th
  • Check UUIDs: blkid
  • Check the directories: ls -larth /data/ /backup/
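
To avoid hand-editing /etc/fstab, the UUID entry can also be generated in place. A minimal sketch, assuming /dev/nvme0n1p1 is the data partition created above:

# Look up the partition UUID and append a matching fstab entry
UUID=$(sudo blkid -s UUID -o value /dev/nvme0n1p1)
echo "UUID=${UUID} /data ext4 defaults 0 0" | sudo tee -a /etc/fstab

# Re-read fstab and mount; a clean exit means the entry parses
sudo mount -a && df -Th /data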

1.3 Configuring the network

Make sure the network interface supports high-bandwidth communication:

cd /etc/netplan/
sudo vi 00-installer-config.yaml
# Example configuration (adjust to the actual NIC name):
network:
  version: 2
  ethernets:
    enp25s0f0:
      dhcp4: no
      addresses: [10.1.1.10/24]
      gateway4: 10.1.1.1
      nameservers:
        addresses: [8.8.8.8, 8.8.4.4]

sudo netplan apply

Verify:

  • Check the IP: ip addr
  • Check the network card status: ethtool enp25s0f0
  • Test connectivity: ping 8.8.8.8

2. NVIDIA driver and CUDA configuration

2.1 Install NVIDIA Driver

Install the NVIDIA data center driver for H100:

cd /data/install_deb/
sudo chmod +x NVIDIA-Linux-x86_64-570.124.06.run
sudo ./NVIDIA-Linux-x86_64-570.124.06.run --no-x-check --no-nouveau-check --no-opengl-files

Verify:

  • Check the driver: nvidia-smi
  • Check the kernel modules: lsmod | grep nvidia

Version:

  • NVIDIA driver: 570.124.06

2.2 Install NVIDIA Fabric Manager

Install Fabric Manager for multi-GPU systems with NVLink support and efficient communication:

cd /data/install_deb/
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager-570_570.124.06-1_amd64.deb
sudo apt-get install ./nvidia-fabricmanager-570_570.124.06-1_amd64.deb
sudo systemctl enable nvidia-fabricmanager
sudo systemctl restart nvidia-fabricmanager

Verify:

  • Check the status: systemctl status nvidia-fabricmanager
  • Check Fabric Manager state: nvidia-smi -q | grep -i -A 2 Fabric

Version:

  • Fabric Manager: 570.124.06
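
Beyond the service status, it is worth confirming that the NVLink fabric itself came up. Both checks below use standard nvidia-smi subcommands:

# Show the NVLink status of every link on every GPU
nvidia-smi nvlink --status

# Show the GPU-to-GPU topology matrix (NV-prefixed entries indicate NVLink paths)
nvidia-smi topo -m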

2.3 Install CUDA (optional)

Install CUDA 12.4 to support PyTorch and deep learning tasks:

cd /data/install_deb/
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
# Install the toolkit only; the 570.124.06 driver installed in 2.1 is kept
sudo sh cuda_12.4.0_550.54.14_linux.run --toolkit --silent

# Configure environment variables
echo 'export CUDA_HOME=/usr/local/cuda-12.4' >> ~/.bashrc
echo 'export PATH=$CUDA_HOME/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

Verify:

  • Check the CUDA version: nvcc --version
  • Check GPU status: nvidia-smi

Version:

  • CUDA: 12.4.0
  • Bundled driver in the runfile: 550.54.14 (not installed; the 570.124.06 driver is used)
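
For an end-to-end check that the toolchain can target the H100 (compute capability 9.0, i.e. sm_90), a trivial kernel can be compiled and run; the file paths here are illustrative:

# Compile and run a trivial CUDA program for sm_90 (H100)
cat > /tmp/hello.cu <<'EOF'
#include <cstdio>
__global__ void hello() { printf("Hello from GPU block %d\n", blockIdx.x); }
int main() {
    hello<<<2, 1>>>();          // launch two blocks of one thread each
    cudaDeviceSynchronize();    // wait for the device printf to flush
    return 0;
}
EOF
nvcc -arch=sm_90 /tmp/hello.cu -o /tmp/hello && /tmp/hello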

2.4 Install cuDNN (optional)

Install cuDNN to improve deep learning performance:

cd /data/install_deb/
wget https://developer.download.nvidia.com/compute/cudnn/9.0.0/local_installers/cudnn-local-repo-ubuntu2204-9.0.0_1.0-1_amd64.deb
sudo dpkg -i cudnn-local-repo-ubuntu2204-9.0.0_1.0-1_amd64.deb
# If apt reports a missing GPG key, copy the keyring as instructed by the dpkg output
sudo apt-get update
sudo apt-get install -y libcudnn9-cuda-12 libcudnn9-dev-cuda-12

Verify:

  • Check the cuDNN version: cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2

Version:

  • cuDNN: 9.0.0
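
The header check only proves the development package is present. Once PyTorch is installed (section 4.1), the runtime side can be confirmed as well; note that pip-installed PyTorch ships its own bundled cuDNN, so the reported version may differ from the system copy:

# Confirm that the deep learning stack sees cuDNN at runtime
python -c "import torch; print(torch.backends.cudnn.is_available(), torch.backends.cudnn.version())"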

3. Anaconda environment configuration

3.1 Install Anaconda

Install Anaconda for environment isolation:

cd /data/install_deb/
wget https://repo.anaconda.com/archive/Anaconda3-2024.10-1-Linux-x86_64.sh
bash Anaconda3-2024.10-1-Linux-x86_64.sh
echo 'export PATH="/root/anaconda3/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
conda init

Verify:

  • Check the Conda version: conda --version

Version:

  • Anaconda: 2024.10-1
  • Python: 3.12

3.2 Create a Conda environment

Create an isolated environment named llama_factory:

conda create -n llama_factory python=3.12
conda activate llama_factory

Verify:

  • Check the environments: conda info --envs
  • Check the Python version: python --version

4. PyTorch and Dependency Installation

4.1 Install PyTorch [Required]

Install PyTorch 2.5.1 with CUDA 12.4 support:

conda activate llama_factory
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124

Verify:

  • Check PyTorch:
    python -c "import torch; print(torch.__version__); print(torch.cuda.is_available()); print(torch.version.cuda); print(torch.cuda.device_count()); print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])"

Version:

  • PyTorch: 2.5.1
  • torchvision: 0.20.1
  • torchaudio: 2.5.1
  • CUDA (PyTorch build): 12.4
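
Before moving on, it helps to exercise every GPU with an actual kernel launch rather than only enumerating them. A minimal sketch using nothing beyond the PyTorch API:

# Run a small matmul on every visible GPU to confirm each one executes kernels
python - <<'EOF'
import torch

assert torch.cuda.is_available(), "CUDA not available"
for i in range(torch.cuda.device_count()):
    x = torch.randn(1024, 1024, device=f"cuda:{i}")
    y = x @ x                      # launch a kernel on this GPU
    torch.cuda.synchronize(i)      # wait for it to finish
    print(f"cuda:{i} {torch.cuda.get_device_name(i)} OK, result norm={y.norm().item():.2f}")
EOF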

4.2 Install FlashAttention

Install FlashAttention to improve Transformer model performance:

# Install from a prebuilt wheel (downloadable from the Dao-AILab/flash-attention GitHub releases page)
pip install flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl

Verify:

  • Check the installation: pip show flash-attn
  • Test the import:
    python -c "import flash_attn; print('FlashAttention installed successfully!')"

Version:

  • flash-attn: 2.7.4.post1
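
To exercise the fused kernel itself rather than just the import, a small forward pass can be run. This sketch assumes the flash_attn_func interface of flash-attn 2.x (q, k, v in (batch, seqlen, heads, headdim) layout, fp16/bf16, on GPU):

# Run one fused attention forward pass
python - <<'EOF'
import torch
from flash_attn import flash_attn_func

# flash-attn 2.x expects (batch, seqlen, nheads, headdim) in fp16/bf16 on GPU
q = torch.randn(2, 128, 8, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)
out = flash_attn_func(q, k, v, causal=True)
print("flash_attn_func OK, output shape:", tuple(out.shape))
EOF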

5. LLaMA-Factory Configuration

5.1 Installing LLaMA-Factory

Clone and install LLaMA-Factory:

cd /data
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install llamafactory==0.9.0 -i https://repo.huaweicloud.com/repository/pypi/simple

Verify:

  • Check the version: llamafactory-cli --version

Version:

  • LLaMA-Factory: 0.9.0

5.2 Configuring Datasets and Models

Prepare dataset and pre-trained model:

cd /data
tar xvf checkpoint-214971.tar
tar xvf Qwen2___5-7B-Instruct.tar
mv Qwen2___5-7B-Instruct qwen25_7BI

# Move the dataset files
cd /data/SKData
mv data/*.jsonl ./

# Configure dataset_info.json
cd /data/LLaMA-Factory/data
vim dataset_info.json
# Example configuration:
{
  "v5": {
    "file_name": "/data/SKData/new_step_data_20250317_train_ocv.jsonl",
    "columns": {
      "prompt": "prompt",
      "response": "response"
    }
  },
  "ddz_dataset": {
    "file_name": "/data/ddz_dataset/ai_data_training_v1.0.json",
    "columns": {
      "prompt": "prompt",
      "response": "response"
    }
  }
}
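
Because a malformed dataset_info.json only surfaces at training time, validating it up front is cheap. A minimal sketch using Python's built-in json module:

# Fail fast if dataset_info.json is not valid JSON
python - <<'EOF'
import json
with open("/data/LLaMA-Factory/data/dataset_info.json") as f:
    info = json.load(f)
print("dataset_info.json OK, datasets:", ", ".join(info))
EOF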

5.3 Model format conversion (optional)

Convert the Hugging Face model to GGUF format for inference with llama.cpp:

cd /data/llama.cpp
python convert_hf_to_gguf.py /data/checkpoint-214971 --outfile /data/qwen2-model.gguf

Verify:

  • Check the GGUF file: ls -larth /data/qwen2-model.gguf

6. Run training and inference tasks

6.1 Training Tasks

Fine-tune the model with DeepSpeed across multiple GPUs:

conda activate llama_factory
FORCE_TORCHRUN=1 DISABLE_VERSION_CHECK=1 CUDA_VISIBLE_DEVICES=0,1 llamafactory-cli train examples/qwen2_7b_freeze_sft_ddz_v1.yaml

Configuration file example (qwen2_7b_freeze_sft_ddz_v1.yaml):

model_name_or_path: /data/qwen25_7BI
dataset: v5,ddz_dataset
template: qwen
finetuning_type: freeze
deepspeed: ds_configs/stage3.json
per_device_train_batch_size: 4
gradient_accumulation_steps: 8
learning_rate: 5e-5
num_train_epochs: 3
output_dir: /data/checkpoint

Verify:

  • Monitor GPU usage: nvitop or nvidia-smi
  • Check the logs: tail -f /data/checkpoint/train.log
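
The command above trains on two GPUs; to use all eight H100s, the same recipe simply widens CUDA_VISIBLE_DEVICES. A sketch reusing the YAML from above:

# Launch the same recipe on all 8 H100s; torchrun spawns one process per visible GPU
conda activate llama_factory
FORCE_TORCHRUN=1 DISABLE_VERSION_CHECK=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  llamafactory-cli train examples/qwen2_7b_freeze_sft_ddz_v1.yaml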

6.2 Inference Tasks

Start the API service for inference:

conda activate llama_factory
API_PORT=6000 CUDA_VISIBLE_DEVICES=4,5 llamafactory-cli api examples/test_7b_dcot.yaml

Configuration file example (test_7b_dcot.yaml):

model_name_or_path: /data/checkpoint-214971
template: qwen
infer_backend: vllm
vllm_args:
  gpu_memory_utilization: 0.9
  max_model_len: 4096
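
llamafactory-cli api serves an OpenAI-compatible endpoint, so once the service is up it can be smoke-tested with curl. A sketch; the port follows the example above, and the "model" field value is illustrative:

# Send one chat completion request to the local API started on port 6000
curl -s http://127.0.0.1:6000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "test",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'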

7. Installation of auxiliary tools

7.1 Install llama.cpp

For model conversion and lightweight inference:

cd /data
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=90
cmake --build . --config Release

Verify:

  • Test inference:
    ./build/bin/llama-cli -m /data/qwen2-model.gguf --prompt "What is the capital of France?" -n 256 -t 8 --gpu-layers 28 -c 4096

7.2 Installing nvitop

For GPU monitoring:

pip install nvitop

Verify:

  • Run: nvitop

Version:

  • nvitop: latest

8. Version Summary

Component               Version
---------------------   -------------
Operating system        Ubuntu 22.04
NVIDIA driver           570.124.06
NVIDIA Fabric Manager   570.124.06
CUDA                    12.4.0
cuDNN                   9.0.0
Anaconda                2024.10-1
Python                  3.12
PyTorch                 2.5.1
torchvision             0.20.1
torchaudio              2.5.1
flash-attn              2.7.4.post1
transformers            4.46.1
accelerate              0.34.2
deepspeed               0.15.4
vllm                    0.8.2
LLaMA-Factory           0.9.0
nvitop                  latest
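
To cross-check the Python-side rows of this table against the active environment, pip can list them in one pass (package name casing varies, hence the loose pattern):

# List the installed versions of the key Python packages from the table above
pip list 2>/dev/null | grep -Ei 'torch|flash|transformers|accelerate|deepspeed|vllm|llamafactory|nvitop'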

9. AI analysis of the installation process

9.1 Valid commands and key installations

The following key installation steps were screened and organized from .bash_history:

  1. NVIDIA driver (NVIDIA-Linux-x86_64-570.124.06.run):
     • Command: sudo ./NVIDIA-Linux-x86_64-570.124.06.run --no-x-check --no-nouveau-check --no-opengl-files
     • Analysis: 570.124.06 is the latest data center driver for the H100 GPU and installed successfully; nvidia-smi shows all 8 H100 GPUs.
     • Note: after several attempts with other driver versions (such as 550 and 535), 570.124.06 proved the most stable.
  2. NVIDIA Fabric Manager (nvidia-fabricmanager-570_570.124.06-1_amd64.deb):
     • Command: sudo apt-get install ./nvidia-fabricmanager-570_570.124.06-1_amd64.deb
     • Analysis: Fabric Manager ensures efficient NVLink communication between the GPUs; systemctl status nvidia-fabricmanager confirms the service is running. Skipping this install can cause errors; see 10.1 for details.
  3. CUDA 12.4 (cuda_12.4.0_550.54.14_linux.run):
     • Command: sudo sh cuda_12.4.0_550.54.14_linux.run --toolkit --silent
     • Analysis: CUDA 12.4 is compatible with PyTorch 2.5.1 and the H100; the environment variables are configured correctly, and nvcc --version reports 12.4.0.
  4. cuDNN (cudnn-local-repo-ubuntu2204-9.0.0_1.0-1_amd64.deb):
     • Command: sudo dpkg -i cudnn-local-repo-ubuntu2204-9.0.0_1.0-1_amd64.deb && sudo apt-get install -y libcudnn9-cuda-12 libcudnn9-dev-cuda-12
     • Analysis: cuDNN 9.0.0 accelerates deep learning workloads; the install succeeded, and cat /usr/include/cudnn_version.h confirms the version.
  5. FlashAttention (flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl):
     • Command: pip install flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
     • Analysis: using the precompiled wheel avoids compilation problems; ninja-build ensures the build dependencies are complete, and pip show flash-attn confirms the installation.
  6. PyTorch and dependencies:
     • Command: pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
     • Analysis: PyTorch 2.5.1 is compatible with CUDA 12.4 and the H100. It was uninstalled and reinstalled several times to reach a consistent environment; python -c "import torch; print(torch.cuda.is_available())" verifies that the GPUs are usable.
  7. LLaMA-Factory and related libraries:
     • Command: pip install llamafactory==0.9.0 transformers==4.46.1 accelerate==0.34.2 deepspeed==0.15.4 vllm==0.8.2 -i https://repo.huaweicloud.com/repository/pypi/simple
     • Analysis: LLaMA-Factory 0.9.0 runs stably, DISABLE_VERSION_CHECK=1 resolves version conflicts, and vllm provides efficient inference.

9.2 Invalid or repeated commands

  • Repeated driver installs: nvidia-driver-550, nvidia-driver-535, and others were tried several times; 570.124.06 was used in the end, and the earlier attempts were discarded.
  • Conda channel configuration: the Conda channels (e.g. conda-forge, pytorch) were adjusted multiple times, but PyTorch was ultimately installed via pip, so the channel configuration had limited impact.
  • FlashAttention source build failure: compiling flash-attn from source (git clone followed by python setup.py install) failed due to complex dependencies; the precompiled wheel was used instead.
  • vllm installation problems: several versions were tried (e.g. 0.6.2, 0.7.2) before 0.8.2 turned out to be compatible with LLaMA-Factory.
  • Redundant commands: e.g. conda init llamafactory (invalid; the correct form is conda init bash), plus repeated ls -larth and ps auxf calls used for debugging. These have been condensed out of this document.


10. Troubleshooting

  • NVIDIA driver issues:
    • Check the kernel modules: lsmod | grep nvidia
    • Clean up old drivers: sudo apt purge nvidia*
  • CUDA does not detect the GPU (CUDA initialization: Unexpected error from cudaGetDeviceCount):
    • Verify the environment variables: echo $CUDA_HOME $LD_LIBRARY_PATH
    • Check PyTorch: python -c "import torch; print(torch.cuda.is_available())"
    • Install and start Fabric Manager (see 10.1).
  • FlashAttention installation failures:
    • Make sure ninja is installed: ninja --version
    • Use the precompiled wheel, or downgrade Python to 3.11.
  • LLaMA-Factory version conflicts:
    • Set DISABLE_VERSION_CHECK=1 to bypass the check.
    • Make sure transformers==4.46.1.
  • vllm inference errors:
    • Check the use_beam_search parameter; if necessary, modify /root/anaconda3/envs/llama_factory/lib/python3.12/site-packages/llamafactory/chat/vllm_engine.py.

10.1 CUDA Error 802 (system not yet initialized)

  • Error log: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (observed on 2025-04-15 at 17:45:57 and 13:10:42).
  • Causes:
    • The NVIDIA Fabric Manager service is not running, or its version does not match the driver.
    • CUDA environment variables are misconfigured.
    • PyTorch and CUDA versions are incompatible.
  • Solution (reinstall and restart Fabric Manager):

    systemctl unmask nvidia-fabricmanager.service
    sudo rm -f /lib/systemd/system/nvidia-fabricmanager.service
    sudo rm -f /etc/systemd/system/nvidia-fabricmanager.service
    sudo apt-get remove nvidia-fabricmanager*
    sudo apt-get install ./nvidia-fabricmanager-570_570.124.06-1_amd64.deb
    sudo systemctl enable nvidia-fabricmanager
    sudo systemctl restart nvidia-fabricmanager

  • Verify:

    python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.device_count())"
    nvidia-smi