8-card H100 large model training environment deployment document

A comprehensive guide to deploying a deep learning environment.
Core content:
1. System preparation and basic tool installation
2. Detailed steps for NVIDIA driver and CUDA configuration
3. PyTorch and LLaMA-Factory configuration and operation
Objective: Deploy a deep learning environment on Ubuntu 22.04 that supports model training and inference on 8 NVIDIA H100 GPUs, including LLaMA-Factory and DeepSpeed distributed training.
Hardware environment:
• CPU: 36 cores • Memory: 1 TB • GPU: 8x NVIDIA H100 (80 GB HBM3) • Storage: NVMe SSD • Operating system: Ubuntu 22.04 LTS
Date: April 17, 2025
Table of contents
1. System Preparation
   1.1 Update the system and install basic tools
   1.2 Configuring Storage
   1.3 Configuring the network
2. NVIDIA driver and CUDA configuration
   2.1 Install NVIDIA Driver
   2.2 Install NVIDIA Fabric Manager
   2.3 Install CUDA (optional)
   2.4 Install cuDNN (optional)
3. Anaconda environment configuration
   3.1 Install Anaconda
   3.2 Create a Conda environment
4. PyTorch and Dependency Installation
   4.1 Install PyTorch [Required]
   4.2 Install FlashAttention
5. LLaMA-Factory Configuration
   5.1 Installing LLaMA-Factory
   5.2 Configuring Datasets and Models
   5.3 Model format conversion (optional)
6. Run training and inference tasks
   6.1 Training Tasks
   6.2 Inference Tasks
7. Installation of auxiliary tools
   7.1 Install llama.cpp
   7.2 Installing nvitop
8. Version Summary
9. AI analysis of the installation process
   9.1 Valid commands and key installations
   9.2 Invalid or repeated commands
10. Troubleshooting
   10.1 CUDA Error 802 (system not yet initialized)
1. System Preparation
1.1 Update the system and install basic tools
Ensure system packages are up to date and install necessary tools:
sudo apt-get update
sudo apt install -y net-tools iputils-ping iptables parted lrzsz vim axel unzip cmake gcc make build-essential ninja-build
1.2 Configuring Storage
Configure NVMe SSD and LVM for large datasets and model checkpoints:
# Check disks
sudo fdisk -l
lsblk
# Partition the NVMe disk (interactive session)
sudo parted /dev/nvme0n1
# Commands inside parted: mklabel gpt, mkpart primary ext4 0% 100%, set 1 lvm on, quit
# Format the partition
sudo mkfs.ext4 /dev/nvme0n1p1
# Create an LVM logical volume in the existing ubuntu-vg volume group
sudo lvcreate -n backup-lv -l 100%FREE ubuntu-vg
sudo mkfs.ext4 /dev/ubuntu-vg/backup-lv
# Create mount points
sudo mkdir /data /backup
# Configure /etc/fstab
sudo vi /etc/fstab
# Add the following content (use blkid to get the UUID):
# UUID=<nvme0n1p1-uuid> /data ext4 defaults 0 0
# /dev/ubuntu-vg/backup-lv /backup ext4 defaults 0 0
# Mount
sudo mount -a
Verify:
• Check the mount: df -Th
• Check UUID: blkid
• Check the directory: ls -larth /data/ /backup/
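For unattended provisioning, the interactive parted session above can be scripted as a single invocation; a minimal sketch, assuming the same device and layout as in the steps above:
# Non-interactive equivalent of the parted commands in 1.2
sudo parted -s /dev/nvme0n1 -- mklabel gpt mkpart primary ext4 0% 100% set 1 lvm on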
1.3 Configuring the network
Make sure the network interface supports high-bandwidth communication:
cd /etc/netplan/
sudo vi 00-installer-config.yaml
# Example configuration (adjust for the actual network card):
network:
  ethernets:
    enp25s0f0:
      dhcp4: no
      addresses: [10.1.1.10/24]
      gateway4: 10.1.1.1
      nameservers:
        addresses: [8.8.8.8, 8.8.4.4]
  version: 2
sudo netplan apply
Verify:
• Check IP: ip addr
• Check the network card status: ethtool enp25s0f0
• Test connectivity: ping 8.8.8.8
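On a headless server it is safer to test a netplan change with automatic rollback before applying it permanently; netplan ships this as a built-in subcommand:
# Applies the configuration and reverts after the timeout unless confirmed
sudo netplan try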
2. NVIDIA driver and CUDA configuration
2.1 Install NVIDIA Driver
Install the NVIDIA data center driver for H100:
cd /data/install_deb/
sudo chmod +x NVIDIA-Linux-x86_64-570.124.06.run
sudo ./NVIDIA-Linux-x86_64-570.124.06.run --no-x-check --no-nouveau-check --no-opengl-files
Verify:
• Check the driver: nvidia-smi
• Check the kernel modules: lsmod | grep nvidia
Version:
• NVIDIA driver: 570.124.06
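Optionally, persistence mode keeps the driver initialized between jobs, which is common practice on data-center GPUs; this step is an addition, not part of the recorded history:
# Optional: enable persistence mode on all GPUs
sudo nvidia-smi -pm 1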
2.2 Install NVIDIA Fabric Manager
Install Fabric Manager for multi-GPU systems with NVLink support and efficient communication:
cd /data/install_deb/
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager-570_570.124.06-1_amd64.deb
sudo apt-get install ./nvidia-fabricmanager-570_570.124.06-1_amd64.deb
sudo systemctl enable nvidia-fabricmanager
sudo systemctl restart nvidia-fabricmanager
Verify:
• Check status: systemctl status nvidia-fabricmanager
• Check Fabric Manager: nvidia-smi -q | grep -i -A 2 Fabric
Version:
• Fabric Manager: 570.124.06
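With Fabric Manager running, NVLink connectivity between the 8 GPUs can be inspected directly; both subcommands below are standard nvidia-smi features:
# GPU-to-GPU interconnect matrix (NVLink links appear as NV#)
nvidia-smi topo -m
# Per-link NVLink status for every GPU
nvidia-smi nvlink --status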
2.3 Install CUDA (optional)
Install CUDA 12.4 to support PyTorch and deep learning tasks:
cd /data/install_deb/
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
# Install the toolkit only; the 570 driver from section 2.1 is already in place
sudo sh cuda_12.4.0_550.54.14_linux.run --silent --toolkit
# Configure environment variables
echo 'export CUDA_HOME=/usr/local/cuda-12.4' >> ~/.bashrc
echo 'export PATH=$CUDA_HOME/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
Verify:
• Check the CUDA version: nvcc --version
• Check GPU status: nvidia-smi
Version:
• CUDA: 12.4.0 • Bundled driver: 550.54.14 (not installed; the 570.124.06 driver from 2.1 is used)
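As an end-to-end sanity check, a tiny CUDA program confirms that nvcc works and the runtime sees all 8 GPUs; a minimal sketch using a temporary file:
cat > /tmp/count.cu <<'EOF'
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);  // reports error 802 if Fabric Manager is not running (see 10.1)
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("visible CUDA devices: %d\n", n);   // expect 8
    return 0;
}
EOF
nvcc -o /tmp/count /tmp/count.cu && /tmp/count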
2.4 Install cuDNN (optional)
Install cuDNN to improve deep learning performance:
cd /data/install_deb/
wget https://developer.download.nvidia.com/compute/cudnn/9.0.0/local_installers/cudnn-local-repo-ubuntu2204-9.0.0_1.0-1_amd64.deb
sudo dpkg -i cudnn-local-repo-ubuntu2204-9.0.0_1.0-1_amd64.deb
# The local repo package prints a keyring-copy command on install; run it before apt-get update
sudo cp /var/cudnn-local-repo-ubuntu2204-9.0.0/cudnn-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get install -y libcudnn9-cuda-12 libcudnn9-dev-cuda-12
Verify:
• Check the cuDNN version: cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
Version:
• cuDNN: 9.0.0
3. Anaconda environment configuration
3.1 Install Anaconda
Install Anaconda for environment isolation:
cd /data/install_deb/
wget https://repo.anaconda.com/archive/Anaconda3-2024.10-1-Linux-x86_64.sh
bash Anaconda3-2024.10-1-Linux-x86_64.sh
echo 'export PATH="/root/anaconda3/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
conda init
Verify:
• Check the Conda version: conda --version
Version:
• Anaconda: 2024.10-1 • Python: 3.12
3.2 Create a Conda environment
Create an isolated environment named llama_factory:
conda create -n llama_factory python=3.12
conda activate llama_factory
Verify:
• Check the environment: conda info --envs
• Check Python version: python --version
4. PyTorch and Dependency Installation
4.1 Install PyTorch [Required]
Install PyTorch 2.5.1 with CUDA 12.4 support:
conda activate llama_factory
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
Verify:
• Check PyTorch: python -c "import torch; print(torch.__version__); print(torch.cuda.is_available()); print(torch.version.cuda); print(torch.cuda.device_count()); print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])"
Version:
• PyTorch: 2.5.1 • torchvision: 0.20.1 • torchaudio: 2.5.1 • CUDA (PyTorch): 12.4
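Before moving on to distributed training, it is worth confirming that NCCL communication works across all 8 GPUs. A minimal all-reduce smoke test, assuming nothing beyond the PyTorch install above (torchrun ships with PyTorch; the script path is arbitrary):
cat > /tmp/allreduce_test.py <<'EOF'
import os
import torch
import torch.distributed as dist

dist.init_process_group("nccl")             # torchrun supplies RANK/WORLD_SIZE/MASTER_ADDR
local_rank = int(os.environ["LOCAL_RANK"])  # one process per GPU on this node
torch.cuda.set_device(local_rank)
t = torch.ones(1, device="cuda")
dist.all_reduce(t)                          # sums across ranks; expect 8.0 on every rank
print(f"rank {dist.get_rank()}: all_reduce -> {t.item()}")
dist.destroy_process_group()
EOF
torchrun --nproc_per_node=8 /tmp/allreduce_test.py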
4.2 Install FlashAttention
Install FlashAttention to improve Transformer model performance:
pip install flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
Verify:
• Check the installation: pip show flash-attn
• Test: python -c "import flash_attn; print('FlashAttention installed successfully!')"
Version:
• flash-attn: 2.7.4.post1
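The wheel referenced above must exist locally before the pip install. Its filename follows the naming scheme of the prebuilt wheels on the flash-attention GitHub releases page; the URL below is an assumption based on that scheme and should be checked against the v2.7.4.post1 release assets:
# Assumed download URL; verify against the release page before use
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl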
5. LLaMA-Factory Configuration
5.1 Installing LLaMA-Factory
Clone the repository (for its example configs and dataset definitions) and install the pinned release:
cd /data
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install llamafactory==0.9.0 -i https://repo.huaweicloud.com/repository/pypi/simple
Verify:
• Check the version: llamafactory-cli version
Version:
• LLaMA-Factory: 0.9.0
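Section 9.1 records the full pinned dependency set that proved mutually compatible on this machine; installing everything in one command avoids resolver conflicts:
pip install llamafactory==0.9.0 transformers==4.46.1 accelerate==0.34.2 deepspeed==0.15.4 vllm==0.8.2 -i https://repo.huaweicloud.com/repository/pypi/simple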
5.2 Configuring Datasets and Models
Prepare dataset and pre-trained model:
cd /data
tar xvf checkpoint-214971.tar
tar xvf Qwen2___5-7B-Instruct.tar
mv Qwen2___5-7B-Instruct qwen25_7BI
# Move the dataset
cd /data/SKData
mv data/*.jsonl ./
# Configure dataset_info.json
cd /data/LLaMA-Factory/data
vim dataset_info.json
# Example configuration:
{
  "v5": {
    "file_name": "/data/SKData/new_step_data_20250317_train_ocv.jsonl",
    "columns": {
      "prompt": "prompt",
      "response": "response"
    }
  },
  "ddz_dataset": {
    "file_name": "/data/ddz_dataset/ai_data_training_v1.0.json",
    "columns": {
      "prompt": "prompt",
      "response": "response"
    }
  }
}
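Before training, a quick check that each dataset file is well-formed can save a failed run; the path is taken from dataset_info.json above:
# Validate that the first record of the training file parses as JSON
head -n 1 /data/SKData/new_step_data_20250317_train_ocv.jsonl | python -m json.tool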
5.3 Model format conversion (optional)
Convert the Hugging Face model to GGUF format for llama.cpp inference (llama.cpp itself is built in section 7.1):
cd /data/llama.cpp
python convert_hf_to_gguf.py /data/checkpoint-214971 --outfile /data/qwen2-model.gguf
Verify:
• Check the GGUF file: ls -larth /data/qwen2-model.gguf
6. Run training and inference tasks
6.1 Training Tasks
Model fine-tuning using DeepSpeed and multiple GPUs:
conda activate llama_factory
FORCE_TORCHRUN=1 DISABLE_VERSION_CHECK=1 CUDA_VISIBLE_DEVICES=0,1 llamafactory-cli train examples/qwen2_7b_freeze_sft_ddz_v1.yaml
Configuration file example (qwen2_7b_freeze_sft_ddz_v1.yaml):
model_name_or_path: /data/qwen25_7BI
dataset: v5,ddz_dataset
template: qwen
finetuning_type: freeze
deepspeed: ds_configs/stage3.json
per_device_train_batch_size: 4
gradient_accumulation_steps: 8
learning_rate: 5e-5
num_train_epochs: 3
output_dir: /data/checkpoint
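The config references ds_configs/stage3.json, which this document does not list. A minimal ZeRO-3 sketch using the "auto" placeholders that the Hugging Face/DeepSpeed integration resolves from the training arguments (the field selection is an assumption, not the file actually used):
mkdir -p ds_configs
cat > ds_configs/stage3.json <<'EOF'
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": { "enabled": "auto" }
}
EOF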
Verify:
• Monitor GPU usage: nvitop or nvidia-smi
• Check the logs: tail -f /data/checkpoint/train.log
6.2 Inference Tasks
Start the API service for inference:
conda activate llama_factory
API_PORT=6000 CUDA_VISIBLE_DEVICES=4,5 llamafactory-cli api examples/test_7b_dcot.yaml
Configuration file example (test_7b_dcot.yaml):
model_name_or_path: /data/checkpoint-214971
template: qwen
infer_backend: vllm
vllm_args:
  gpu_memory_utilization: 0.9
  max_model_len: 4096
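Once the service is up, it exposes an OpenAI-compatible API; a quick smoke test with curl (the model field value is arbitrary here, since the service serves the single loaded model):
curl http://localhost:6000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hello"}]}'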
7. Installation of auxiliary tools
7.1 Install llama.cpp
For model conversion and lightweight inference:
cd /data
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=90
cmake --build . --config Release
Verify:
• Test inference (run from the llama.cpp repo root): ./build/bin/llama-cli -m /data/qwen2-model.gguf --prompt "What is the capital of France?" -n 256 -t 8 --gpu-layers 28 -c 4096
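For lighter-weight deployment, the GGUF file can optionally be quantized with the llama-quantize binary built alongside llama-cli; Q4_K_M is shown as one common preset (output path assumed):
# Quantize the converted model to 4-bit
./build/bin/llama-quantize /data/qwen2-model.gguf /data/qwen2-model-q4_k_m.gguf Q4_K_M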
7.2 Installing nvitop
For GPU monitoring:
pip install nvitop
Verify:
• Run: nvitop
Version:
• nvitop: latest
8. Version Summary
• NVIDIA driver: 570.124.06 • Fabric Manager: 570.124.06 • CUDA: 12.4.0 • cuDNN: 9.0.0
• Anaconda: 2024.10-1 • Python: 3.12
• PyTorch: 2.5.1 • torchvision: 0.20.1 • torchaudio: 2.5.1 • flash-attn: 2.7.4.post1
• LLaMA-Factory: 0.9.0 • transformers: 4.46.1 • accelerate: 0.34.2 • deepspeed: 0.15.4 • vllm: 0.8.2 • nvitop: latest
9. AI analysis of the installation process
9.1 Valid commands and key installations
The following key installation steps were screened from .bash_history and confirmed effective:
1. NVIDIA driver (NVIDIA-Linux-x86_64-570.124.06.run):
• Command: sudo ./NVIDIA-Linux-x86_64-570.124.06.run --no-x-check --no-nouveau-check --no-opengl-files
• Analysis: 570.124.06 is the latest data center driver for the H100; installation succeeded and nvidia-smi shows all 8 H100 GPUs.
• Note: after several attempts with other driver versions (such as 550 and 535), 570.124.06 was confirmed to be the most stable.
2. NVIDIA Fabric Manager (nvidia-fabricmanager-570_570.124.06-1_amd64.deb):
• Command: sudo apt-get install ./nvidia-fabricmanager-570_570.124.06-1_amd64.deb
• Analysis: Fabric Manager ensures efficient NVLink communication between the GPUs; systemctl status nvidia-fabricmanager confirms the service is running. Failing to install it can cause CUDA Error 802; see 10.1 for details.
3. CUDA 12.4 (cuda_12.4.0_550.54.14_linux.run):
• Command: sudo sh cuda_12.4.0_550.54.14_linux.run --silent --toolkit
• Analysis: CUDA 12.4 is compatible with PyTorch 2.5.1 and the H100; the environment variables are configured correctly and nvcc --version reports 12.4.0.
4. cuDNN (cudnn-local-repo-ubuntu2204-9.0.0_1.0-1_amd64.deb):
• Command: sudo dpkg -i cudnn-local-repo-ubuntu2204-9.0.0_1.0-1_amd64.deb && sudo apt-get install -y libcudnn9-cuda-12 libcudnn9-dev-cuda-12
• Analysis: cuDNN 9.0.0 improves deep learning performance; the installation succeeded and cat /usr/include/cudnn_version.h confirms the version.
5. FlashAttention (flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl):
• Command: pip install flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
• Analysis: using the precompiled wheel avoids compilation problems, and ninja-build keeps the build dependencies complete; pip show flash-attn confirms the installation.
6. PyTorch and dependencies:
• Command: pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
• Analysis: PyTorch 2.5.1 is compatible with CUDA 12.4 and the H100. It was uninstalled and reinstalled several times to keep the environment consistent; python -c "import torch; print(torch.cuda.is_available())" verifies that the GPUs are available.
7. LLaMA-Factory and related libraries:
• Command: pip install llamafactory==0.9.0 transformers==4.46.1 accelerate==0.34.2 deepspeed==0.15.4 vllm==0.8.2 -i https://repo.huaweicloud.com/repository/pypi/simple
• Analysis: LLaMA-Factory 0.9.0 runs stably, DISABLE_VERSION_CHECK=1 resolves version conflicts, and vllm provides efficient inference.
9.2 Invalid or repeated commands
• Repeated driver installation: nvidia-driver-550, nvidia-driver-535, etc. were tried several times before 570.124.06 was finally used; the earlier versions were ineffective.
• Conda channel configuration: the Conda channels (e.g. conda-forge, pytorch) were adjusted several times, but PyTorch was ultimately installed via pip, so the channel configuration had limited impact.
• FlashAttention compilation failure: compiling flash-attn from source (git clone followed by python setup.py install) failed due to complex dependencies; the precompiled wheel was used instead.
• vllm installation problems: several versions were tried (e.g. 0.6.2, 0.7.2); 0.8.2 finally proved compatible with LLaMA-Factory.
• Redundant commands: e.g. conda init llamafactory (invalid; the correct form is conda init bash), plus repeated ls -larth and ps auxf runs used for debugging; these are condensed in this document.
10. Troubleshooting
General checks:
• NVIDIA driver issues:
  • Check the modules: lsmod | grep nvidia
  • Clean up old drivers: sudo apt purge nvidia*
• CUDA does not detect the GPU ("CUDA initialization: Unexpected error from cudaGetDeviceCount"):
  • Verify the environment variables: echo $CUDA_HOME $LD_LIBRARY_PATH
  • Check PyTorch: python -c "import torch; print(torch.cuda.is_available())"
• FlashAttention installation failed:
  • Make sure ninja is installed: ninja --version
  • Use the precompiled wheel, or downgrade Python to 3.11.
• LLaMA-Factory version conflicts:
  • Set DISABLE_VERSION_CHECK=1 to bypass the check.
  • Make sure transformers==4.46.1.
• vllm inference errors:
  • Examine the use_beam_search parameter; if necessary, modify /root/anaconda3/envs/llama_factory/lib/python3.12/site-packages/llamafactory/chat/vllm_engine.py.
10.1 CUDA Error 802 (system not yet initialized)
• Error log: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (2025-04-15 17:45:57, 13:10:42).
• Causes:
  • The NVIDIA Fabric Manager service is not started, or its version does not match the driver.
  • The CUDA environment variables are misconfigured.
  • PyTorch is incompatible with the installed CUDA version.
• Solution: reinstall and restart Fabric Manager:
sudo systemctl unmask nvidia-fabricmanager.service
sudo rm -f /lib/systemd/system/nvidia-fabricmanager.service
sudo rm -f /etc/systemd/system/nvidia-fabricmanager.service
sudo apt-get remove nvidia-fabricmanager*
sudo apt-get install ./nvidia-fabricmanager-570_570.124.06-1_amd64.deb
sudo systemctl enable nvidia-fabricmanager
sudo systemctl restart nvidia-fabricmanager
• Verify:
python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.device_count())"
nvidia-smi
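The checks in this chapter can be consolidated into one quick triage pass after a reboot; a minimal sketch using only commands that appear in this document:
# Post-reboot health check for the 8x H100 stack
systemctl is-active nvidia-fabricmanager  # should print "active"; anything else usually precedes Error 802
nvidia-smi                                # driver 570.124.06 and all 8 H100 GPUs should be listed
nvcc --version                            # CUDA 12.4
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"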