AI Newbie Village: Hugging Face

Explore the AI Newbie Village, a comprehensive guide to the Hugging Face platform.
Core content:
1. The development history and core value of the Hugging Face community center
2. Introduction and application of the three core libraries of Hugging Face
3. Practical examples of dataset loading and model use
Hugging Face
Hugging Face was founded in 2016 and became known as a community hub for NLP models. With the rise of large language models (LLMs), the pre-trained weights and supporting tools for most mainstream LLMs can now be found on the platform. The platform also hosts a large collection of computer vision and audio models.
Hugging Face is often called the GitHub of AI models. It has three core libraries: Transformers (wraps Transformer models to make them easier to use), Tokenizers (splits text into the smallest units that the model can understand), and Datasets (loads and processes data, including external files).
The picture below shows the Hugging Face homepage. The most commonly used entry points are Models and Datasets, both marked in the picture.
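To make "splitting text into the smallest units" more concrete, here is a minimal sketch of greedy longest-match subword splitting in the WordPiece style. It is a toy written for illustration only, not the Tokenizers library itself; the vocabulary and the function names are hypothetical.

```python
# Toy greedy longest-match subword tokenizer (WordPiece-style).
# TOY_VOCAB is a hypothetical vocabulary that exists only for this illustration.
TOY_VOCAB = {"[UNK]", "i", "am", "very", "happy", "un", "##happy", "happi", "##ness"}


def tokenize_word(word: str, vocab: set[str]) -> list[str]:
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Lowercase, split on whitespace, then split each word into subwords."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(tokenize_word(word, vocab))
    return tokens


tokens = tokenize("I am very unhappy", TOY_VOCAB)
print(tokens)  # ['i', 'am', 'very', 'un', '##happy']
```

A real tokenizer also maps each piece to an integer id and adds special tokens, which this sketch leaves out.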
Loading of data
Hugging Face's Datasets page has a rich collection of data sets, including text, audio, and images, and also provides an intuitive visualization page.
Using a dataset is simple: call load_dataset with the name of the dataset you need. To load your own custom dataset, load_dataset works as well; pass it the file format and the paths of your files.
from datasets import load_dataset

# Load a dataset from the Hub by its repository name
ds = load_dataset("clapAI/MultiLingualSentiment")

# Load a custom dataset: read in training data and test data from local JSON files
data_files = {"train": "./day014/datas/train_data.json", "test": "./day014/datas/test_data.json"}
dataset = load_dataset("json", data_files=data_files)
print(dataset)
# View the first training data
print(dataset['train'][0])
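The article does not show what train_data.json contains. As a sketch under stated assumptions, the snippet below writes a file in JSON Lines layout (one JSON object per line), which is one layout the "json" loader accepts. The "text" field matches the field used later during fine-tuning; the "label" field name, the integer label ids, and the sample sentences are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records: the "label" field name and the ids 0/1/2 are assumptions.
records = [
    {"text": "I am very happy", "label": 2},
    {"text": "The service was acceptable", "label": 1},
    {"text": "I am ashamed of what I did", "label": 0},
]


def write_jsonl(path: Path, rows: list[dict]) -> None:
    """Write one JSON object per line (JSON Lines)."""
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Read the file back, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    train_path = Path(tmp) / "train_data.json"
    write_jsonl(train_path, records)
    loaded = read_jsonl(train_path)

print(loaded[0])
```

A file written this way could then be passed to load_dataset("json", data_files=...) in the same way as the paths above.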
Use of the model
Taking text classification (sentiment analysis) as an example, we can use the pipeline function: specifying the task name is enough to call a model. The default model for this task is distilbert/distilbert-base-uncased-finetuned-sst-2-english. You can also choose a specific model through the model parameter.
from transformers import pipeline

# Using the default model
# pipe = pipeline("text-classification")
# Specify a specific model. The model can be found on the Models page (because the default model uses English data as training data, I changed it to a model that supports multiple languages)
pipe = pipeline("text-classification", model="lxyuan/distilbert-base-multilingual-cased-sentiments-student")
string_arr = [
    "Once upon a time, there was someone who loved you for a long time, but the wind gradually blew us away. I finally got to love you for one more day, but at the end of the story, you still seemed to say goodbye.",
    "I headed north, leaving the season with you. You said you were tired and could no longer fall in love with anyone. The wind blew on the mountain road, and the past scenes were all wrong. I am ashamed to count the times I hurt you.",
    "I am very happy",
]
results = pipe(string_arr)
print(results)
# Output results
# [{'label': 'positive', 'score': 0.5694631338119507}, {'label': 'negative', 'score': 0.9576570987701416}, {'label': 'positive', 'score': 0.9572104811668396}]
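The first prediction in that output has a score of only about 0.57, so it is much less certain than the other two. The sketch below shows one possible way to flag such low-confidence predictions. The result values are copied from the output above; the 0.6 threshold and the function name are arbitrary choices made for illustration.

```python
# Results copied from the pipeline output shown above.
results = [
    {"label": "positive", "score": 0.5694631338119507},
    {"label": "negative", "score": 0.9576570987701416},
    {"label": "positive", "score": 0.9572104811668396},
]


def flag_uncertain(predictions: list[dict], threshold: float = 0.6) -> list[int]:
    """Return the indices of predictions whose score is below the threshold."""
    return [i for i, p in enumerate(predictions) if p["score"] < threshold]


uncertain = flag_uncertain(results)  # the 0.6 threshold is a hypothetical cut-off
for i, p in enumerate(results):
    status = "uncertain" if i in uncertain else "confident"
    print(f"input {i}: {p['label']} ({p['score']:.3f}) -> {status}")
```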
When you run the pipeline example for the first time, the model is downloaded automatically. The default cache path is ~/.cache/huggingface/hub (under your home directory).
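The cache location can be changed through environment variables. The sketch below resolves the hub cache directory from HF_HUB_CACHE, HF_HOME, and XDG_CACHE_HOME in that order of precedence. It is a simplified approximation of the documented behaviour, not the library's own code, and the function name is hypothetical.

```python
import os
from pathlib import Path


def default_hub_cache(env=None) -> Path:
    """Approximate where downloaded models are cached, given environment variables."""
    env = os.environ if env is None else env
    if "HF_HUB_CACHE" in env:
        # An explicit hub cache directory takes precedence.
        return Path(env["HF_HUB_CACHE"]).expanduser()
    if "HF_HOME" in env:
        hf_home = Path(env["HF_HOME"]).expanduser()
    else:
        # Fall back to <cache root>/huggingface, where the cache root is ~/.cache by default.
        cache_root = Path(env.get("XDG_CACHE_HOME", "~/.cache")).expanduser()
        hf_home = cache_root / "huggingface"
    return hf_home / "hub"


print(default_hub_cache())
```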
Besides the pipeline function, you can also call the model through the hosted HTTP interface, but you need to prepare an access token on the website in advance. When the model is called through the interface, it is not downloaded locally, which makes this approach more convenient than using pipeline.
import requests

# config is the author's own helper module; config.hg_token holds the access token created on the website
from utils.common_config import config

API_URL = "https://api-inference.huggingface.co/models/lxyuan/distilbert-base-multilingual-cased-sentiments-student"


def classify_text(texts: list[str]) -> list:
    """Send the texts to the hosted model and return the parsed JSON response."""
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {config.hg_token}"},
        json={"inputs": texts},
    )
    if response.status_code != 200:
        raise ValueError(f"Request failed with status code {response.status_code}: {response.text}")
    return response.json()


string_arr = [
    "Once upon a time, there was someone who loved you for a long time, but the wind gradually blew us away. I finally got to love you for one more day, but at the end of the story, you still seemed to say goodbye.",
    "I headed north, leaving the season with you. You said you were tired and could no longer fall in love with anyone. The wind blew on the mountain road, and the past scenes were all wrong. I am ashamed to count the times I hurt you.",
    "I am very happy",
]
a = classify_text(string_arr)
print(a)
# Output results
# [[{'label': 'positive', 'score': 0.5694631934165955}, {'label': 'neutral', 'score': 0.2743554711341858}, {'label': 'negative', 'score': 0.15618135035037994}], [{'label': 'negative', 'score': 0.9576572179794312}, {'label': 'neutral', 'score': 0.0352189838886261}, {'label': 'positive', 'score': 0.007123854476958513}], [{'label': 'positive', 'score': 0.9572104811668396}, {'label': 'neutral', 'score': 0.03854822367429733}, {'label': 'negative', 'score': 0.004241317044943571}]]
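Unlike the pipeline output, the interface returns the scores of all three labels for each input. A small sketch of reducing that response to the top label per input follows; the data is copied from the output above, and the function name is hypothetical.

```python
# Response copied from the output shown above: one list of label scores per input.
api_response = [
    [{"label": "positive", "score": 0.5694631934165955},
     {"label": "neutral", "score": 0.2743554711341858},
     {"label": "negative", "score": 0.15618135035037994}],
    [{"label": "negative", "score": 0.9576572179794312},
     {"label": "neutral", "score": 0.0352189838886261},
     {"label": "positive", "score": 0.007123854476958513}],
    [{"label": "positive", "score": 0.9572104811668396},
     {"label": "neutral", "score": 0.03854822367429733},
     {"label": "negative", "score": 0.004241317044943571}],
]


def top_predictions(response: list[list[dict]]) -> list[dict]:
    """Keep only the highest-scoring label for each input."""
    return [max(candidates, key=lambda c: c["score"]) for candidates in response]


best = top_predictions(api_response)
print([p["label"] for p in best])  # ['positive', 'negative', 'positive']
```

The top labels agree with the pipeline output, and the scores match up to small floating-point differences.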
Fine-tuning the model
If you are not satisfied with the model results, you can also use fine-tuning to train with custom data and adjust the model parameters.
import os

from datasets import load_dataset

# Read in training data and test data (paths are relative to this script)
data_files = {
    "train": os.path.join(os.path.dirname(__file__), "datas/train_data.json"),
    "test": os.path.join(os.path.dirname(__file__), "datas/test_data.json"),
}
dataset = load_dataset("json", data_files=data_files)
print(dataset)
# View the first training data
print(dataset['train'][0])
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
import torch

# Use the GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = DistilBertTokenizer.from_pretrained('lxyuan/distilbert-base-multilingual-cased-sentiments-student')
model = (
    DistilBertForSequenceClassification.from_pretrained(
        'lxyuan/distilbert-base-multilingual-cased-sentiments-student',
        num_labels=3,
        id2label={0: "negative", 1: "neutral", 2: "positive"},
        label2id={"negative": 0, "neutral": 1, "positive": 2},
        # ignore_mismatched_sizes=True
    ).to(device)
)
# Also used as the output directory for checkpoints
model_name = "sentiment_model"
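The id2label mapping is what turns the three raw output scores (logits) of the classification head into a readable label. As a rough sketch of that step, the snippet below applies a softmax to a made-up logit vector and looks up the winning index. The logit values are hypothetical, and the helper names are not part of any library.

```python
import math

id2label = {0: "negative", 1: "neutral", 2: "positive"}


def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits: list[float]) -> dict:
    """Pick the most probable class and map its index to a label."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


example_logits = [-1.2, 0.3, 2.5]  # hypothetical model output for one sentence
prediction = decode(example_logits)
print(prediction["label"], round(prediction["score"], 3))
```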
from transformers import DataCollatorWithPadding
from sklearn.metrics import accuracy_score


def preprocess_function(example):
    # Tokenize the "text" field; long inputs are truncated to the model's maximum length
    return tokenizer(example['text'], truncation=True, padding=True)


train_dataset = dataset["train"].map(preprocess_function, batched=True)
test_dataset = dataset["test"].map(preprocess_function, batched=True)
# Pads each batch to the length of its longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


def compute_metrics(pred):
    labels = pred.label_ids
    # The predicted class is the index of the largest logit
    predictions = pred.predictions.argmax(-1)
    accuracy = accuracy_score(labels, predictions)
    return {"accuracy": accuracy}
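To show what compute_metrics calculates, here is a dependency-free sketch of the same two steps: take the index of the largest logit in each row, then compare with the true labels. The logits and labels are made-up values, and the helper functions are stand-ins for argmax(-1) and accuracy_score, not the library implementations.

```python
def argmax_last_axis(rows: list[list[float]]) -> list[int]:
    """For each row of logits, return the index of the largest value."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


def accuracy(labels: list[int], predictions: list[int]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Hypothetical logits for four examples and three classes.
logits = [
    [2.0, 0.1, -1.0],
    [0.2, 0.3, 1.5],
    [0.9, 1.1, 0.4],
    [1.2, 0.1, 0.3],
]
labels = [0, 2, 2, 0]

predictions = argmax_last_axis(logits)
metrics = {"accuracy": accuracy(labels, predictions)}
print(predictions, metrics)  # [0, 2, 1, 0] {'accuracy': 0.75}
```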
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir=model_name,
    eval_strategy="epoch",  # evaluate on the test set after every epoch
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=60,
    weight_decay=0.01,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
# Accuracy on the training set
train_results = trainer.evaluate(eval_dataset=train_dataset)
train_accuracy = train_results.get('eval_accuracy')
print(f"Training Accuracy: {train_accuracy}")
# Accuracy on the test set
test_results = trainer.evaluate(eval_dataset=test_dataset)
test_accuracy = test_results.get('eval_accuracy')
print(f"Testing Accuracy: {test_accuracy}")
After training completes, we can load the fine-tuned checkpoint and check how it performs. Because the local training set is small, the results of the new model may not be ideal.
from transformers import pipeline

# The model path is the author's local checkpoint directory; replace it with your own
classifier = pipeline(task='sentiment-analysis', model="/Users/shaoyang/.cache/huggingface/hub/sentiment_model/checkpoint-120")
a = classifier([
    "Once upon a time, there was someone who loved you for a long time, but the wind gradually blew us away. It was so hard to love you for one more day, but at the end of the story, you still said goodbye.",
    "I am very happy",
])
print(a)
# [{'label': 'negative', 'score': 0.532397449016571}, {'label': 'neutral', 'score': 0.9187697768211365}]
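Checkpoint directories are named after the global training step, so checkpoint-120 corresponds to step 120. The arithmetic below sketches how the step count follows from the dataset size, the batch size, and the number of epochs, assuming a single device and no gradient accumulation. The training set size of 8 is hypothetical, since the article does not state it.

```python
import math


def total_training_steps(num_examples: int, batch_size: int, num_epochs: int) -> int:
    """Optimizer steps for a run, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# batch_size=4 and num_epochs=60 come from the training arguments above;
# num_examples=8 is an assumed value chosen for illustration.
steps = total_training_steps(num_examples=8, batch_size=4, num_epochs=60)
print(steps)  # 120

# Training set sizes that would give 120 steps under the same assumptions.
compatible_sizes = [n for n in range(1, 20) if total_training_steps(n, 4, 60) == 120]
print(compatible_sizes)  # [5, 6, 7, 8]
```

Under these assumptions, only a handful of training examples would produce 120 steps, which is in line with the remark that the local training data is small.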