Large AI models are booming. Why is converting Hugging Face models to GGUF so popular?

Written by
Silas Grey
Updated on: June 28, 2025

A new development in the era of large AI models: the GGUF format is setting a new trend in model storage and conversion.

Core content:
1. What the GGUF format is and why it matters for large AI models
2. The advantages of GGUF in storage efficiency, loading speed, and compatibility
3. How to convert a Hugging Face model to GGUF, with a guide to the tools involved


What exactly is GGUF?

GGUF is a binary file format designed for storing large language models; its full name is GPT-Generated Unified Format. It is commonly used to represent and store neural network models and their associated data, and it is designed as a unified, universal format that simplifies model exchange and conversion between different deep learning frameworks and hardware platforms.

The main goal of GGUF is to provide a standardized format so that a neural network's graph structure, weights, parameters, and other related information can be transferred smoothly between platforms and tools. It reduces interoperability problems between frameworks (such as TensorFlow and PyTorch) and helps make better use of hardware resources, especially in multi-GPU and distributed scenarios.
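To see this self-describing structure in practice, the gguf Python package maintained in the llama.cpp repository can dump a file's metadata and tensor list. A minimal sketch, assuming the package exposes a gguf-dump command (the entry-point name may differ across versions):

# Install the gguf helper package (maintained alongside llama.cpp)
pip install gguf

# Print the key-value metadata and tensor list of a GGUF file
# (command name assumed; some versions ship it as a script under gguf-py/)
gguf-dump /path/to/model.gguf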

Applications

  • Framework support: mainstream toolchains such as Hugging Face Transformers and llama.cpp support loading GGUF-format models.

  • Model ecosystem: officially released models such as Google's Gemma and Alibaba's Qwen are widely available in GGUF versions.

  • Tool compatibility: local inference tools such as LM Studio and Ollama support the GGUF format.


Large models in GGUF format have the following advantages:

  • Efficient storage: optimized data structures and encodings significantly reduce the storage space a model occupies. For models with very large parameter counts, this noticeably lowers storage costs.

  • Fast loading: support for memory mapping lets data be mapped from disk directly into the process's address space without loading the entire file up front, which speeds up loading and suits applications that need instant responses, such as online chatbots or real-time translation systems.

  • Strong compatibility: as a unified format it is designed to be cross-platform and cross-framework, so a model can run in different hardware and software environments and be used easily across a variety of devices and frameworks, promoting the wider adoption of large models.

  • Good scalability: the key-value metadata structure can be extended flexibly; new metadata, features, or information can be added without breaking compatibility with existing models, meeting the needs of larger models and more complex data structures in the future.

  • Quantization support: multiple quantization types are supported, such as Q8_0 and Q6_K, which shrink the file size by reducing numerical precision. This suits different hardware budgets and saves compute resources while keeping the impact on model quality small (a rough size illustration follows this list).

  • Easy to use: a GGUF file contains all the model's information, such as metadata and tensor data, without relying on external files or complex configuration. A single file is easy to distribute and load, and the amount of code needed to load a model is small, simplifying deployment and sharing.
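As a back-of-the-envelope illustration of the storage and quantization points above (approximate bits-per-weight figures, not measured results):

# Rough size estimates for a 3B-parameter model (illustrative arithmetic only):
#   f16    ~16  bits/weight -> 3e9 * 2.00 bytes ≈ 6.0 GB
#   q8_0   ~8.5 bits/weight -> 3e9 * 1.06 bytes ≈ 3.2 GB
#   q4_k_m ~4.8 bits/weight -> 3e9 * 0.60 bytes ≈ 1.8 GB
# Check the actual sizes of your own converted files:
ls -lh /root/autodl-tmp/llm/Qwen/*.gguf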

Converting a Hugging Face (HF) model to the GGUF format usually requires the llama.cpp toolchain.


1. Install llama.cpp

1. Download the llama.cpp source code to your local computer

First, clone the llama.cpp repository, which contains the tools needed to convert the model. Run the following command in a terminal:

git clone https://github.com/ggerganov/llama.cpp.git

2. Install llama.cpp's Python dependencies

conda create -n llamacpp python=3.10 -y
conda activate llamacpp
pip install -r llama.cpp/requirements.txt
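As a quick sanity check that the environment is ready, you can ask the conversion script (part of the cloned repository) for its options:

# Verify the converter script and its dependencies load correctly
python llama.cpp/convert_hf_to_gguf.py --help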

2. Conversion

The HF model can be converted to GGUF directly without quantization, or it can be quantized during conversion.

# Without quantization, preserving the model's original quality:
# python llama.cpp/convert_hf_to_gguf.py <absolute path of the model> --outtype f16 --verbose --outfile <output file path>
python llama.cpp/convert_hf_to_gguf.py /root/autodl-tmp/llm/Qwen/Qwen2.5-3B-Instruct-merge --outtype f16 --verbose --outfile /root/autodl-tmp/llm/Qwen/Qwen2.5-3B-Instruct-merge-gguf.gguf

# With quantization (faster and smaller, but lossy):
python llama.cpp/convert_hf_to_gguf.py /root/autodl-tmp/llm/Qwen/Qwen2.5-3B-Instruct-merge --outtype q8_0 --verbose --outfile /root/autodl-tmp/llm/Qwen/Qwen2.5-3B-Instruct-merge-gguf_q8_0.gguf


Here --outtype sets the output precision. Note that in recent llama.cpp versions convert_hf_to_gguf.py itself only accepts a few values (typically f32, f16, bf16, and q8_0); the finer-grained K-quant levels below are produced afterwards with the separate llama-quantize tool (see the sketch after this list). The common type names mean:

q2_k: Specific tensors use higher precision settings, while others remain at the base level.

q3_k_l, q3_k_m, q3_k_s: These variants use different levels of precision on different tensors to achieve a balance between performance and efficiency.

q4_0: This is the original quantization scheme, using 4 bits of precision.

q4_1 and q4_k_m, q4_k_s: These provide different levels of accuracy and inference speed, suitable for scenarios that need to balance resource usage.

q5_0, q5_1, q5_k_m, q5_k_s: These versions use more resources and have slower inference speed while ensuring higher accuracy.

q6_k and q8_0: These provide the highest accuracy, but may not be suitable for all users due to high resource consumption and slow speed.

f16 (fp16) and f32: no quantization; the original precision is retained.
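If you need one of the K-quant levels, a minimal sketch using llama.cpp's separate quantization tool (the binary name and build layout are assumptions and vary by version; older releases call the binary quantize, and the output file name here is hypothetical):

# Build the llama.cpp binaries with CMake
cmake -S llama.cpp -B build
cmake --build build --config Release -j

# Re-quantize the f16 GGUF produced above down to Q4_K_M
./build/bin/llama-quantize /root/autodl-tmp/llm/Qwen/Qwen2.5-3B-Instruct-merge-gguf.gguf /root/autodl-tmp/llm/Qwen/Qwen2.5-3B-Instruct-merge-gguf_q4_k_m.gguf Q4_K_M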

The converted GGUF file is written to the path given by --outfile.

3. Run the GGUF model with Ollama

1. Install Ollama

Ollama download page and Linux installation guide:

https://ollama.com/download/linux

https://github.com/ollama/ollama/blob/main/docs/linux.md

We deploy in an Ubuntu environment.

# autodl compute cloud academic acceleration (optional)
# source /etc/network_turbo

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

This process can take a long time, so be patient. You can also download the archive manually:

curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz
sudo tar -C /usr -xzf ollama-linux-amd64.tgz

You can also download the archive on a local machine and then upload it to the server.
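Either way, a quick check that the binary is installed and on the PATH (the version output format may differ across releases):

# Confirm the installation
ollama --version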


2. Start the Ollama service

ollama serve

Note that this terminal window must stay open; otherwise the service stops. You can also run it in the background (see the sketch below).
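A minimal background-run sketch (a generic shell pattern, not an Ollama-specific feature; Ollama listens on port 11434 by default):

# Run the Ollama server in the background and keep its logs
nohup ollama serve > ollama.log 2>&1 &

# Confirm the server is reachable
curl http://localhost:11434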


3. Create ModelFile

Copy the model path and create a file named "ModelFile" with the following content:

# GGUF file path
FROM /root/autodl-tmp/llm/Qwen/Qwen2.5-3B-Instruct-merge-gguf.gguf
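If you prefer to create the file from the shell, a minimal sketch (the SYSTEM and PARAMETER lines are optional Modelfile directives and can be omitted):

cat > ModelFile <<'EOF'
# GGUF file path
FROM /root/autodl-tmp/llm/Qwen/Qwen2.5-3B-Instruct-merge-gguf.gguf

# Optional: a system prompt and a sampling parameter
SYSTEM You are a helpful assistant.
PARAMETER temperature 0.7
EOF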

4. Create a custom model

Use the ollama create command to create a custom model. The model name can be anything you like, for example "qwen2.5-3B-Instruct" below, but it must be unique.

ollama create qwen2.5-3B-Instruct --file ModelFile

If the command reports success, the model was created. You can verify it with ollama list, as shown below.
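For example (ollama show is available in recent Ollama releases):

# List local models; the new entry should appear here
ollama list

# Inspect the model's details
ollama show qwen2.5-3B-Instruct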

5. Run the model

# The :latest tag is optional
# ollama run qwen2.5-3B-Instruct:latest
ollama run qwen2.5-3B-Instruct
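You can also pass a one-shot prompt, or call the local REST API that ollama serve exposes on port 11434 (the prompt text here is just an example):

# One-shot prompt from the command line
ollama run qwen2.5-3B-Instruct "Hello, please introduce yourself."

# Call the local REST API
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-3B-Instruct",
  "prompt": "Hello, please introduce yourself.",
  "stream": false
}'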