AI Newbie Village: Hugging Face

Written by

Iris Vance

Updated on: June 27, 2025

Hugging Face

Hugging Face was established in 2016 and became known as a community hub for NLP models. As LLMs have grown popular, the platform now hosts the pre-trained models and related tools for the mainstream LLMs. It also provides a large collection of computer vision and audio models.

Hugging Face is often called the GitHub of AI models. It has three core libraries: Transformers (wraps Transformer models to make them easier to use), Tokenizers (splits text into tokens, the smallest units the model can understand), and Datasets (loads and processes data, including external files).
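To make the idea of splitting text into the smallest units concrete, here is a toy sketch of greedy longest-match subword splitting in the style of WordPiece. It does not use the Tokenizers library, and the vocabulary and function names are made up for illustration; real tokenizers ship a learned vocabulary and handle punctuation and normalization as well.

```python
# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
VOCAB = {"happy", "un", "##happi", "##ness"}


def split_word(word, vocab, unk="[UNK]"):
    """Greedily take the longest vocabulary piece at each position."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


def tokenize(text, vocab=VOCAB):
    """Lowercase, split on whitespace, then split each word into pieces."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(split_word(word, vocab))
    return tokens


print(tokenize("Happy unhappiness"))
# ['happy', 'un', '##happi', '##ness']
```

A word with no matching pieces, such as "sad" with this vocabulary, comes back as the single token [UNK].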

The picture below is the homepage of Hugging Face. The most commonly used entry points are the Models and Datasets pages marked in the picture.

Loading of data

Hugging Face's Datasets page hosts a rich collection of datasets covering text, audio, and images, and also provides an intuitive page for viewing them.

Using a dataset is also very simple: call load_dataset with the name of the dataset you need. If you want to use your own custom dataset, the same load_dataset function handles that too.

from datasets import load_dataset

# Load a dataset from the Hub by its repository name
ds = load_dataset("clapAI/MultiLingualSentiment")

from datasets import load_dataset

# Read in training data and test data
data_files = {"train": "./day014/datas/train_data.json", "test": "./day014/datas/test_data.json"}
dataset = load_dataset("json", data_files=data_files)
print(dataset)
# View the first training example
print(dataset["train"][0])
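The article does not show what the local JSON files contain. As a rough sketch, the snippet below writes a small file in the JSON Lines layout (one JSON object per line), which the "json" loader can read. The "text" and "label" field names, the sample sentences, and the label numbers are assumptions for illustration, not the author's actual data.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records; field names and label ids are assumptions.
records = [
    {"text": "I am very happy", "label": 2},
    {"text": "This is terrible", "label": 0},
]


def write_jsonl(path, rows):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back as a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Write to a temporary directory so the example does not touch real data.
tmp_dir = Path(tempfile.mkdtemp())
train_path = tmp_dir / "train_data.json"
write_jsonl(train_path, records)

print(read_jsonl(train_path)[0])
```

A file written this way can be passed to load_dataset("json", data_files=...) in the same way as the paths above.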

Use of the model

Taking text classification (sentiment analysis) as an example, we can use the pipeline function: you only need to specify the task name to call a model. The default model used is distilbert/distilbert-base-uncased-finetuned-sst-2-english. You can also specify a particular model through the model parameter.

from transformers import pipeline

# Using the default model
# pipe = pipeline("text-classification")

# Specify a particular model, which can be found on the Models page.
# The default model was trained on English data, so I switched to one that supports multiple languages.
pipe = pipeline("text-classification", model="lxyuan/distilbert-base-multilingual-cased-sentiments-student")

string_arr = [
    "Once upon a time, there was someone who loved you for a long time, but the wind gradually blew us away. I finally got to love you for one more day, but at the end of the story, you still seemed to say goodbye.",
    "I headed north, leaving the season with you. You said you were tired and could no longer fall in love with anyone. The wind blew on the mountain road, and the past scenes were all wrong. I am ashamed to count the times I hurt you.",
    "I am very happy",
]

results = pipe(string_arr)

print(results)
# Output results
# [{'label': 'positive', 'score': 0.5694631338119507}, {'label': 'negative', 'score': 0.9576570987701416}, {'label': 'positive', 'score': 0.9572104811668396}]

When you run the above program for the first time, the model is downloaded automatically. The default cache path is ~/.cache/huggingface/hub.
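If you want to check where downloads end up, the sketch below resolves the cache directory in a simplified way. It assumes the HF_HUB_CACHE and HF_HOME environment variables take precedence over the default location. The helper name is hypothetical, and the real library handles more cases than this.

```python
import os
from pathlib import Path


def default_hub_cache(env=None):
    """Simplified guess at the model cache directory (illustrative only)."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        # An explicit cache directory wins.
        return Path(env["HF_HUB_CACHE"]).expanduser()
    if env.get("HF_HOME"):
        # Otherwise the hub cache lives under the Hugging Face home directory.
        return Path(env["HF_HOME"]).expanduser() / "hub"
    return Path.home() / ".cache" / "huggingface" / "hub"


print(default_hub_cache())
```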

Besides the pipeline function, you can also call the model over HTTP through the hosted inference interface, but you need to prepare an access token, requested on the website, in advance. When you call the model this way, the model itself is not downloaded locally, which can be more convenient than using pipeline.
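One common way to keep the token out of source code is to read it from an environment variable and build the request header from it. The sketch below assumes a variable named HF_TOKEN; the helper name is hypothetical, and any secure configuration source would work equally well.

```python
import os


def build_auth_headers(env=None, var_name="HF_TOKEN"):
    """Build the Authorization header from a token stored in the environment."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable to your access token")
    return {"Authorization": f"Bearer {token}"}


# Demonstration with a placeholder value instead of a real token.
print(build_auth_headers({"HF_TOKEN": "hf_example"}))
```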

# config is a local project module; config.hg_token holds the access token
from utils.common_config import config
import requests

def classify_sentiment(texts: list[str]) -> list[list[dict]]:
    # Returns, for each input text, a list of {'label', 'score'} entries
    api_url = "https://api-inference.huggingface.co/models/lxyuan/distilbert-base-multilingual-cased-sentiments-student"
    response = requests.post(
        api_url,
        headers={"Authorization": f"Bearer {config.hg_token}"},
        json={"inputs": texts})

    if response.status_code != 200:
        raise ValueError(f"Request failed with status code {response.status_code}: {response.text}")

    return response.json()

string_arr = [
    "Once upon a time, there was someone who loved you for a long time, but the wind gradually blew us away. I finally got to love you for one more day, but at the end of the story, you still seemed to say goodbye.",
    "I headed north, leaving the season with you. You said you were tired and could no longer fall in love with anyone. The wind blew on the mountain road, and the past scenes were all wrong. I am ashamed to count the times I hurt you.",
    "I am very happy",
]
results = classify_sentiment(string_arr)
print(results)

# Output results
# [[{'label': 'positive', 'score': 0.5694631934165955}, {'label': 'neutral', 'score': 0.2743554711341858}, {'label': 'negative', 'score': 0.15618135035037994}], [{'label': 'negative', 'score': 0.9576572179794312}, {'label': 'neutral', 'score': 0.0352189838886261}, {'label': 'positive', 'score': 0.007123854476958513}], [{'label': 'positive', 'score': 0.9572104811668396}, {'label': 'neutral', 'score': 0.03854822367429733}, {'label': 'negative', 'score': 0.004241317044943571}]]
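Unlike the pipeline output, which shows only the best label per sentence, the response above lists a score for every label. A small helper can reduce it to the same shape; the sketch below uses the scores printed above, and the function name is made up for illustration.

```python
# Scores copied from the output above, one inner list per input sentence.
api_output = [
    [{"label": "positive", "score": 0.5694631934165955},
     {"label": "neutral", "score": 0.2743554711341858},
     {"label": "negative", "score": 0.15618135035037994}],
    [{"label": "negative", "score": 0.9576572179794312},
     {"label": "neutral", "score": 0.0352189838886261},
     {"label": "positive", "score": 0.007123854476958513}],
    [{"label": "positive", "score": 0.9572104811668396},
     {"label": "neutral", "score": 0.03854822367429733},
     {"label": "negative", "score": 0.004241317044943571}],
]


def top_labels(all_scores):
    """Keep only the highest-scoring entry for each input sentence."""
    return [max(scores, key=lambda item: item["score"]) for scores in all_scores]


for best in top_labels(api_output):
    print(best["label"], round(best["score"], 4))
```

Picking the maximum does not depend on the order of the entries, so it still works if the service returns them unsorted.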

Fine-tuning the model

If you are not satisfied with the model's results, you can fine-tune it: continue training on custom data to adjust the model parameters.

import os
from datasets import load_dataset

# Read in training data and test data (paths are relative to this script)
data_files = {
    "train": os.path.join(os.path.dirname(__file__), "datas/train_data.json"),
    "test": os.path.join(os.path.dirname(__file__), "datas/test_data.json"),
}
dataset = load_dataset("json", data_files=data_files)
print(dataset)
# View the first training example
print(dataset["train"][0])

from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = DistilBertTokenizer.from_pretrained("lxyuan/distilbert-base-multilingual-cased-sentiments-student")
model = (
    DistilBertForSequenceClassification.from_pretrained(
        "lxyuan/distilbert-base-multilingual-cased-sentiments-student",
        num_labels=3,
        id2label={0: "negative", 1: "neutral", 2: "positive"},
        label2id={"negative": 0, "neutral": 1, "positive": 2},
        # ignore_mismatched_sizes=True
    ).to(device)
)
model_name = "sentiment_model"


from transformers import DataCollatorWithPadding
from sklearn.metrics import accuracy_score

def preprocess_function(example):
    # With batched=True, example["text"] is a list of strings
    return tokenizer(example["text"], truncation=True, padding=True)

train_dataset = dataset["train"].map(preprocess_function, batched=True)
test_dataset = dataset["test"].map(preprocess_function, batched=True)

# Pads each training batch to the length of its longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

def compute_metrics(pred):
    labels = pred.label_ids
    # Take the class with the highest logit as the prediction
    predictions = pred.predictions.argmax(-1)
    accuracy = accuracy_score(labels, predictions)
    return {"accuracy": accuracy}
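To see what compute_metrics calculates, here is a small worked example in plain Python with no dependencies: each row of scores is reduced to the index of its largest value, and accuracy is the fraction of rows where that index matches the label. The logits and labels are invented for illustration.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical scores for 4 sentences over 3 classes
# (0 = negative, 1 = neutral, 2 = positive).
logits = [
    [2.1, 0.3, -1.0],  # predicted 0
    [0.2, 0.1, 1.5],   # predicted 2
    [0.4, 0.9, 0.1],   # predicted 1
    [1.2, 0.8, 0.3],   # predicted 0
]
labels = [0, 2, 2, 0]

print(accuracy(logits, labels))  # 0.75
```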

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir=model_name,
    eval_strategy="epoch",  # evaluate on the test set after every epoch
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=60,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,  # newer transformers releases name this argument processing_class
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

# Compare accuracy on the training set and on the held-out test set
train_results = trainer.evaluate(eval_dataset=train_dataset)
train_accuracy = train_results.get("eval_accuracy")
print(f"Training Accuracy: {train_accuracy}")

test_results = trainer.evaluate(eval_dataset=test_dataset)
test_accuracy = test_results.get("eval_accuracy")
print(f"Testing Accuracy: {test_accuracy}")

After training is complete, we can run the new model to check how well it performs. Because the local training set is small, the results of the new model may not be ideal.

from transformers import pipeline

# Path to a saved checkpoint on the author's machine; replace it with your own checkpoint directory
classifier = pipeline(task="sentiment-analysis", model="/Users/shaoyang/.cache/huggingface/hub/sentiment_model/checkpoint-120")
a = classifier([
    "Once upon a time, there was someone who loved you for a long time, but the wind gradually blew us away. It was so hard to love you for one more day, but at the end of the story, you still said goodbye.",
    "I am very happy",
])

print(a)
# [{'label': 'negative', 'score': 0.532397449016571}, {'label': 'neutral', 'score': 0.9187697768211365}]