The fine-tuning series' "Dataset Construction" chapter is here: a hand-holding, step-by-step tutorial!

Gain an in-depth understanding of the entire dataset construction process, from acquiring open-source data to data annotation, cleaning, and augmentation
Core content:
1. A roundup of open-source data websites and how to use them
2. Detailed steps and techniques for dataset construction
3. Data augmentation techniques in practice, with sample code
Datawhale Tips
1. Downloading from Open-Source Data Websites
Kaggle: https://www.kaggle.com/
ModelScope: https://modelscope.cn/datasets
Hugging Face: https://huggingface.co
Baidu PaddlePaddle: https://aistudio.baidu.com/datasetoverview
2. Constructing a Dataset (General Steps)
1. Clarify your goals
- Define the problem: identify the problem or task you want to solve. (For example, to build a dataset in the medical field, you should search for relevant medical information. Sometimes the problem is not clearly defined at first, and you need to explore what you actually need.)
- Determine the data type: be clear about the type of data you need (text, images, audio, etc.).
(Note: this point deserves emphasis! Make sure the dataset you construct matches the data format you will later use to train the model.)
2. Data collection (gather all the relevant data you can)
- Internal data: get data from existing databases, logs, etc. [if available]
- External data: obtain data from public datasets, relevant websites, and AI-assisted extraction.
- Data generation: if necessary, simulated or synthetic data can be used. [optional]
3. Data Annotation
- Manual annotation: annotate the data by hand.
- Automatic annotation: use tools or pre-trained models to label the data.
4. Data cleaning
- Handling missing values: fill in or remove missing data.
- Deduplication: remove duplicate data.
- Format unification: ensure data formats are consistent.
- Outlier handling: identify and handle outliers.
Steps for processing data in bulk:
Step 1: Use traditional big-data platforms (such as Hive, HBase, Flink, MySQL, etc.) for preliminary cleaning, removing data with obvious errors or anomalies.
Step 2: Use AI techniques to intelligently repair typos, grammatical errors, and logical problems in the data, calibrating against standard datasets to improve data quality and accuracy.
Step 3: Conduct a final manual review: spot-check the data processed in the first two stages to ensure its integrity and reliability.
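For small and medium-sized datasets, the first cleaning pass can also be done directly in Python. Below is a minimal pandas sketch; the file name and column names (question, answer) are illustrative assumptions, not part of the original pipeline:
import pandas as pd

# Load the raw data; "raw_data.csv" and its column names are hypothetical
df = pd.read_csv("raw_data.csv")

# Handle missing values: drop rows whose key fields are empty
df = df.dropna(subset=["question", "answer"])

# Deduplicate: remove exact duplicate question-answer pairs
df = df.drop_duplicates(subset=["question", "answer"])

# Unify the format: cast to string and strip surrounding whitespace
for col in ["question", "answer"]:
    df[col] = df[col].astype(str).str.strip()

# Handle outliers: drop answers that are implausibly short or long
lengths = df["answer"].str.len()
df = df[(lengths >= 5) & (lengths <= 2000)]

df.to_csv("cleaned_data.csv", index=False)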
5. Data augmentation [optional, depending on the situation]
- Images: rotation, cropping, etc.
(1) Rotation
Details: the rotation angle is usually chosen randomly within a range such as ±30° or ±45°, to simulate images taken from different viewing angles.
Steps: use an image-processing library (such as OpenCV or Albumentations) to rotate the image. If the image has bounding boxes (as in object detection), the boxes must be rotated in sync.
import albumentations as A
import cv2

image = cv2.imread("example.jpg")  # placeholder path; image is an H x W x C NumPy array
transform = A.RandomRotate90(p=0.5)  # with probability 0.5, rotate by a random multiple of 90°
# For arbitrary angles within a range (e.g. ±30°), use A.Rotate(limit=30, p=0.5) instead
augmented_image = transform(image=image)['image']
(2) Cropping
Details: randomly crop out a part of the image. The cropped region can be of fixed or random size; take care to keep the key information.
Steps: use a random-cropping transform, such as Albumentations' RandomCrop.
transform = A.RandomCrop(width=400, height=400, p=0.3)  # crop a random 400x400 region
augmented_image = transform(image=image)['image']
(3) Other augmentations
Brightness adjustment: simulate different lighting conditions by adjusting the brightness of the image.
Noise addition: add random noise to the image to improve the model's robustness.
transform = A.Compose([
    A.RandomBrightnessContrast(p=0.3),   # simulate different lighting conditions
    A.GaussNoise(p=0.2),                 # add random pixel noise for robustness
    A.GaussianBlur(blur_limit=3, p=0.2)  # slight blur, e.g. to simulate defocus
])
augmented_image = transform(image=image)['image']
- Text: synonym replacement, back translation, etc. (i.e., introducing noised variants of the data)
(1) Synonym replacement
Details: randomly select some words in the sentence and replace them with synonyms, keeping the semantics of the sentence unchanged.
Steps: use a thesaurus or a word-embedding model (such as Word2Vec) to find synonyms and substitute them, as in the sketch below.
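A minimal sketch of synonym replacement, assuming a tiny hand-built synonym dictionary (SYNONYMS and the example sentence are illustrative; in practice you would look candidates up in a thesaurus or a Word2Vec-style embedding model):
import random

# Toy synonym dictionary for illustration; use a thesaurus or word embeddings in practice
SYNONYMS = {
    "good": ["fine", "great"],
    "doctor": ["physician"],
    "illness": ["disease", "condition"],
}

def synonym_replace(sentence, p=0.3):
    """Replace each word with a random synonym with probability p."""
    out = []
    for word in sentence.split():
        candidates = SYNONYMS.get(word.lower())
        if candidates and random.random() < p:
            out.append(random.choice(candidates))
        else:
            out.append(word)
    return " ".join(out)

print(synonym_replace("the doctor said the illness is not serious"))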
(2) Back translation (e.g., English to Chinese and back, Chinese to English and back, etc.)
Details: translating a text into another language and then back into the original language introduces small semantic variations.
Steps: use a machine translation API, such as Google Translate, as sketched below.
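A sketch of the back-translation loop. The translate function here is a deliberate placeholder, not a real API: plug in whatever machine-translation service you use (Google Translate, a local translation model, etc.):
def translate(text, source, target):
    """Placeholder: call your machine-translation API of choice here."""
    raise NotImplementedError("plug in Google Translate, a local model, etc.")

def back_translate(text, source="zh", pivot="en"):
    """Translate source -> pivot -> source to obtain a paraphrased variant."""
    pivot_text = translate(text, source=source, target=pivot)
    return translate(pivot_text, source=pivot, target=source)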
- Audio: speed change, noise addition, etc.
(1) Speed change
Details: adjust the playback speed of the audio while keeping the pitch unchanged.
Steps: use an audio-processing library (such as librosa) to change the speed, as in the sketch below.
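A minimal librosa sketch of pitch-preserving speed change (the file names are placeholders):
import librosa
import soundfile as sf

# Load the audio at its native sampling rate; "speech.wav" is a placeholder path
y, sr = librosa.load("speech.wav", sr=None)

# Speed up by 20% while keeping the pitch unchanged; rate < 1.0 slows it down
y_fast = librosa.effects.time_stretch(y, rate=1.2)

sf.write("speech_fast.wav", y_fast, sr)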
(2) Adding noise
Details: add background noise to the audio to make the model more robust to noise.
Steps: pick noise clips from a noise library and overlay them on the audio, as in the sketch below.
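A minimal sketch of noise overlay. For simplicity it adds synthetic Gaussian noise rather than clips from a real noise library, but overlaying a recorded noise waveform works the same way (add the two signals):
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("speech.wav", sr=None)  # placeholder path

# Additive Gaussian noise; 0.005 is an illustrative noise level
y_noisy = y + 0.005 * np.random.randn(len(y))

sf.write("speech_noisy.wav", y_noisy, sr)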
Why add noise? (Supplementary content)
The main purpose of adding noise to a dataset is to enhance the robustness of the model. Specific reasons include:
1. Simulating real scenes: images in the real world usually contain noise (such as sensor noise, compression noise, etc.). By adding noise to the training data, the model can better adapt to noisy environments in actual applications.
2. Preventing overfitting: noise can act as a regularization method, preventing the model from over-relying on specific features in the training data and thereby improving generalization ability.
3. Data augmentation: noise addition is itself a form of data augmentation that increases the diversity of the data and helps the model learn a wider range of features.
Deciding whether to add noise
When there is no need to add noisy data
- High data quality and a clear task: if the original dataset is rich, diverse, and of high enough quality to cover the patterns and features the model needs to learn, there is usually no need to add extra noisy data.
- Low risk of overfitting: when the dataset is large, the data is evenly distributed, and the model architecture is relatively simple, the risk of overfitting is low, and there is no need to add noisy data to improve generalization.
When you should add noisy data
- Serious overfitting: when the model performs well on the training set but clearly worse on the validation or test set, it has likely overfitted to noise and specific patterns in the training data. Adding noisy data can then make the model more robust.
- Task-specific requirements: in some tasks, such as image generation or speech recognition, noisy data helps the model learn more complex patterns and features, improving its performance in practical applications.
Considerations in dataset construction
1. Balance noise and raw data:
- Noisy samples should keep a reasonable ratio to the original samples, so that excessive noisy data does not cause the model to over-rely on noise features.
2. Diversity:
- When adding noise, vary the noise type and intensity to cover more realistic scenarios.
3. Validation and test sets:
- Include an appropriate amount of noisy data in the validation and test sets as well, to evaluate the model's performance in noisy environments.
4. Combination with other augmentations:
- Noise addition can be combined with other augmentation techniques (such as rotation, scaling, flipping, etc.) to further improve the robustness of the model.
6. Data Partitioning
- Training set: used for model training.
- Validation set: used for hyperparameter tuning and model selection.
- Test set: used for the final evaluation.
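A minimal sketch of an 80/10/10 split for a list of JSON entries like those built later in this chapter (file names are placeholders):
import json
import random

with open("dataset.json", "r", encoding="utf-8") as f:  # placeholder path
    data = json.load(f)

random.seed(42)  # fix the seed so the split is reproducible
random.shuffle(data)

n = len(data)
train = data[: int(0.8 * n)]
val = data[int(0.8 * n) : int(0.9 * n)]
test = data[int(0.9 * n) :]

for name, split in [("train", train), ("val", val), ("test", test)]:
    with open(f"{name}.json", "w", encoding="utf-8") as f:
        json.dump(split, f, ensure_ascii=False, indent=2)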
3. Specific example (taking the DeepSeek-R1 distilled model as the fine-tuning target and constructing a medical dataset)
1. Clarify the goal: build a doctor-style text dataset
1. I need the fine-tuned model to be better at giving diagnosis and treatment advice; to make it more convincing, its tone should sound like a doctor's.
2. The data type to collect is text, so I should gather more medical-related texts; ideally I can find texts that directly simulate doctors, and expand and strengthen the dataset on that basis.
3. The data format required by the DeepSeek-R1 distilled model is Question, Complex-CoT, Response, so when building the dataset I must take care to construct the Complex-CoT part.
2. Data Collection
1. The medical field probably already has many previously constructed datasets, so I first searched the open-source website ModelScope.
2. Use high-quality open-source datasets to confirm the specification, and aim to build similarly high-quality data in the subsequent construction process.
3. Collect medical information from major public websites. Here, the author collected papers and guidelines published on sites such as the Chinese Medical Journal Network [save the links or download the files].
3. Data Annotation
1. (1) Use data-annotation platforms to help build the dataset: feed the extracted files into the platform to construct basic QA pairs.
(2) Feed the constructed question-answer pairs into a large model and let it generate the Complex-CoT part. [The QA pairs can also be simple questions and answers you write yourself; use prompts to steer the model toward the desired effect.]
[
  {
    "Question": "Doctor, I have ankylosing spondylitis (SpA) and other diseases. If I use biological agents for treatment, what should I pay attention to?",
    "Complex-CoT": "For SpA patients with comorbidities, the selection of biologics needs to take into account the patient's overall health status, the type and severity of comorbidities, and the potential side effects of biologics. Different types of biologics have different mechanisms of action and may have different effects on comorbidities. Therefore, it is necessary to carefully select biologics suitable for patients under the guidance of professional doctors and closely monitor the response during treatment.",
    "Response": "For SpA patients with comorbidities, the selection of different types of biologics should be cautious. We need to comprehensively evaluate your overall health status, including the type and severity of comorbidities, and consider the potential side effects of biologics. Different biologics have different mechanisms of action and may have different effects on comorbidities. Therefore, when choosing a biologic, we will weigh the pros and cons based on your specific situation, choose the treatment plan that best suits you, and closely monitor your response during treatment."
  },
  {
    "Question": "Doctor, what is the current research progress on ankylosing spondylitis (SpA)?",
    "Complex-CoT": "Although some achievements have been made in the study of ankylosing spondylitis (SpA), such as a preliminary understanding of the disease mechanism and the exploration of new diagnostic methods and treatment methods, there are still many unsolved mysteries. For example, the pathogenesis of SpA has not been fully clarified, some patients do not respond well to existing treatments, and the long-term prognosis of the disease still needs further study. Therefore, research on SpA is still deepening.",
    "Response": "Despite the many research results, there are still many unsolved mysteries in SpA. Although we have made some progress in disease mechanisms, diagnostic methods and treatment methods, there are still many questions that need further study. For example, the pathogenesis of SpA has not been fully understood, some patients do not respond well to existing treatments, and the long-term prognosis of the disease still needs to be further explored. Therefore, research on SpA is still ongoing in order to provide better help for patients."
  },
  {
    "Question": "Doctor, how should future research be conducted to better improve the quality of life of patients with ankylosing spondylitis (SpA)?",
    "Complex-CoT": "Future research should start from multiple aspects to improve the quality of life of SpA patients. First, exploring new biomarkers can help us diagnose the disease earlier and more accurately and achieve personalized treatment. Second, the development of new drugs and treatments can provide patients with more options and improve treatment effects. In addition, the application of new technologies can optimize treatment strategies and reduce the side effects of treatment. Through these comprehensive measures, we hope to further improve the quality of life of SpA patients.",
    "Response": "Future research should continue to explore the application of new biomarkers, new drugs, and new technologies to further improve the accuracy of diagnosis, optimize treatment strategies, and ultimately improve the quality of life of SpA patients. We will start from multiple aspects, including finding more accurate biomarkers to achieve early diagnosis and personalized treatment, developing more effective drugs and treatments, and using new technologies to optimize treatment options and reduce side effects. Through these efforts, we hope to bring a better quality of life to SpA patients."
  }
]
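As a sketch of step (2) above: the snippet below assumes an OpenAI-compatible chat API (DeepSeek's API follows this convention; the base URL, model name, API key, and prompt are all illustrative, so adapt them to the service you actually use):
from openai import OpenAI

# Assumes an OpenAI-compatible endpoint; base_url, model, and api_key are illustrative
client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

def add_complex_cot(question, response):
    """Ask the model to write the Complex-CoT (reasoning) linking an existing QA pair."""
    prompt = (
        "You are an experienced doctor. Given the question and answer below, "
        "write the clinical reasoning (Complex-CoT) that connects them.\n"
        f"Question: {question}\nResponse: {response}"
    )
    completion = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content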
2. You can also feed the collected links into a large model and have it help build the dataset.
[
  {
    "Question": "Doctor, I am a woman and I have been feeling uncomfortable in my heart and a little chest tightness recently. Is this a sign of a heart attack?",
    "Complex-CoT": "The symptoms of a heart attack in women may be different from those in men and are often more atypical. In addition to chest tightness, they may also be accompanied by other symptoms such as shortness of breath, nausea, and sweating. In clinical practice, for female patients with heart-related symptoms, it is necessary to consider a variety of factors, including medical history, physical signs, and auxiliary examination results, in order to accurately determine whether it is a heart attack and take appropriate treatment measures in a timely manner.",
    "Response": "The symptoms of a heart attack in women may not be typical. In addition to chest tightness, they may also experience shortness of breath, nausea, sweating, etc. We need to consider your medical history, physical signs, and auxiliary examination results to determine whether it is a heart attack. If you have these symptoms, it is recommended that you seek medical attention as soon as possible. We will take appropriate treatment measures based on the specific situation."
  },
  {
    "Question": "Doctor, I have asthma and have been coughing badly recently. Is my condition getting worse?",
    "Complex-CoT": "Cough is one of the common symptoms of asthma, but the severity of cough does not necessarily directly reflect the overall control of asthma. In clinical practice, the assessment of asthma needs to take into account multiple aspects, including symptom frequency, number of acute attacks, and lung function test results. For patients with obvious cough symptoms, further evaluation is needed to determine whether there are other triggers or comorbidities, and the treatment plan should be adjusted according to the specific situation.",
    "Response": "An increase in coughing in asthma patients does not necessarily mean a worsening of the condition. We need to comprehensively evaluate your symptom frequency, number of acute attacks, and lung function test results. If the cough is severe, further examination is recommended to see if there are other triggers or comorbidities. We will adjust the treatment plan based on the specific situation."
  },
  {
    "Question": "Doctor, I have been diagnosed with asthma, but I feel that my condition is not well controlled. What tests do I need to do to fully evaluate my condition?",
    "Complex-CoT": "A comprehensive assessment of asthma is essential for developing an effective treatment plan. Pulmonary function tests, including bronchial provocation tests and bronchial dilation tests, are usually required to assess airway responsiveness and reversibility. In addition, the patient's symptom control, acute exacerbation frequency, quality of life, and the presence of comorbidities need to be assessed. Through these comprehensive assessments, the level of asthma control can be more accurately determined and the treatment plan can be adjusted.",
    "Response": "In order to fully evaluate your asthma condition, we need to conduct some tests, such as pulmonary function tests, including bronchial provocation tests and bronchial dilation tests, to evaluate airway responsiveness and reversibility. At the same time, we will also evaluate your symptom control, acute attack frequency, quality of life, and whether there are comorbidities. These comprehensive assessments help us to more accurately diagnose the condition and adjust the treatment plan."
  }
]
4. Data cleaning
In this case, data cleaning mainly means confirming the data format and ensuring it is consistent; the overall quality of the data constructed in this process is already high. The script below validates the format.
import json

def validate_json_format(json_file_path):
    """
    Verify that a JSON file conforms to the expected format.

    Parameters:
        json_file_path (str): The path to the JSON file.

    Returns:
        bool: True if the format is correct, otherwise False.
    """
    try:
        # Open and load the JSON file
        with open(json_file_path, 'r', encoding='utf-8') as file:
            data = json.load(file)
        # Verify that the top-level data is a list
        if not isinstance(data, list):
            print("JSON data must be a list.")
            return False
        # Validate each entry
        for item in data:
            # Check that all required fields are present
            required_fields = ["Question", "Complex-CoT", "Response"]
            if not all(field in item for field in required_fields):
                print(f"Missing fields: {required_fields}")
                return False
            # Check that every field value is a string
            for field in required_fields:
                if not isinstance(item[field], str):
                    print(f"The value of field '{field}' must be a string.")
                    return False
        print("JSON format verification passed!")
        return True
    except json.JSONDecodeError:
        print("JSON file format error.")
        return False
    except FileNotFoundError:
        print(f"File not found: {json_file_path}")
        return False
    except Exception as e:
        print(f"Error: {e}")
        return False

if __name__ == "__main__":
    # Replace with your JSON file path
    json_file_path = "test.json"
    validate_json_format(json_file_path)
The script above checks whether a JSON file consists of entries with the fields "Question", "Complex-CoT", and "Response".
Supplement: testing the dataset
When you find a high-quality dataset, it is recommended to extract about 1,000 samples for a preliminary fine-tuning test to evaluate whether the effect meets your requirements. If the fine-tuned results are satisfactory, consider using that dataset as a reference when building your own standard dataset.
When you later build additional datasets of your own, follow the principle of gradual progress: first build a small amount of data and run a fine-tuning test to observe the effect, and only after confirming that the effect meets your expectations should you continue to expand the dataset.
Finally, integrate all the collected datasets. Before the full mixed fine-tuning, run a test fine-tune on a portion of the data. If the effect is good, continue; if problems appear, narrow the scope of the dataset and carefully screen for reliable data to avoid dirty data harming the fine-tuning process.
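A minimal sketch of drawing the ~1,000-sample pilot subset described above (file names are placeholders):
import json
import random

with open("full_dataset.json", "r", encoding="utf-8") as f:  # placeholder path
    data = json.load(f)

# Draw roughly 1,000 entries (or everything, if the dataset is smaller)
pilot = random.sample(data, k=min(1000, len(data)))

with open("pilot_dataset.json", "w", encoding="utf-8") as f:
    json.dump(pilot, f, ensure_ascii=False, indent=2)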
4. Closing remarks
Many thanks to the DeepSeek official website and Kimi for their valuable help with code revision, data collection, and the polishing of this chapter!