FastGPT Principle Analysis: The First Step of Dataset Creation

In-depth exploration of the principles and implementation details of FastGPT dataset creation.
Core content:
1. The overall process of FastGPT file upload and preprocessing
2. Detailed implementation logic analysis of dataset creation
3. Key steps of data preprocessing, insertion into MongoDB, and transaction handling
FastGPT's file upload process is divided into two stages: the first stage uploads the file, and the second stage performs vectorization and QA processing on it. This article introduces the overall file upload process and analyzes the detailed implementation logic of the first step of dataset creation.
Overall process of dataset creation
Dataset creation is divided into two steps:
Step 1: Upload and preprocess the files, then insert the records into the dataset_trainings collection that serves as the MongoDB training queue.
Step 2: Watch for insert operations in MongoDB and start data processing: (1) computing embedding vectors or (2) QA text splitting. A sketch of the watch mechanism follows the flow below.
// Step 1: Create the dataset collection and insert the data into MongoDB
createCollectionAndInsertData
  -> pushDataListToTrainingQueue          // Push the dataset data to the training queue

// Step 2: Register the processing functions and process the data automatically
startMongoWatch
  -> createDatasetTrainingMongoWatch      // Watch MongoDB insert operations and trigger the text processing tasks
    -> generateQA();                      // QA question-and-answer pair processing
    -> generateVector();                  // Embedding vector processing
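The watch mechanism in step 2 is built on MongoDB change streams. The snippet below is a minimal sketch of how such a watcher can be wired up with Mongoose's Model.watch(); the import paths and the way generateQA/generateVector are invoked are assumptions for illustration, not FastGPT's exact code.

// A minimal sketch, assuming MongoDatasetTraining is the Mongoose model for the
// dataset_trainings collection and that generateQA / generateVector are exported
// from the training modules (the import paths are illustrative).
import { MongoDatasetTraining } from './datasetTraining/schema';
import { generateQA } from './generateQA';
import { generateVector } from './generateVector';

export function createDatasetTrainingMongoWatch() {
  // Only react to insert operations on the training queue collection
  const changeStream = MongoDatasetTraining.watch([{ $match: { operationType: 'insert' } }]);

  changeStream.on('change', () => {
    // An insert means new training records were queued; kick off both consumers,
    // which pull pending records from the queue and process them.
    generateQA();      // QA question-and-answer pair processing
    generateVector();  // Embedding vector processing
  });
}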
Implementation of the first step of dataset creation
The first step of creating a dataset is to upload the file and preprocess the data, obtain the relevant data-processing parameters, and save those parameters to the MongoDB training queue. Subsequent processing tasks then call the corresponding model or service according to the dataset's processing configuration. The main logic of this first step is described below.
Implementation logic description
Dataset creation is handled by the createCollectionAndInsertData function, which mainly calls pushDataListToTrainingQueue. The implementation logic of this function is as follows:
Model verification and configuration:
Verify that the inference model and the embedding vector model in the parameters are valid.
Set the maximum number of tokens and the weight according to the training mode (chunk/qa/auto).
If the model or the training mode is invalid, an error is thrown.
Data preprocessing:
Use the simpleText function to clean whitespace characters in the questions and answers; the index data is cleaned in the same way.
Filter out empty questions, duplicate content, and data exceeding the maximum token limit.
Classify the data into four buckets: success, overToken (too long), repeat (duplicate), and error (a simplified sketch follows).
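To make the classification step concrete, here is a simplified sketch of the preprocessing loop. The simpleText and countTokens helpers below are minimal stand-ins (countTokens in particular is purely illustrative); FastGPT's real cleaning and token counting are more involved.

// Minimal stand-ins for FastGPT's helpers, for illustration only.
const simpleText = (s: string) => s.replace(/\s+/g, ' ').trim();
const countTokens = (s: string) => Math.ceil(s.length / 4); // rough token estimate

type Chunk = { q: string; a: string };

function classifyChunks(data: Chunk[], maxToken: number) {
  const seen = new Set<string>();
  const result = {
    success: [] as Chunk[],
    overToken: [] as Chunk[],
    repeat: [] as Chunk[],
    error: [] as Chunk[]
  };

  for (const raw of data) {
    const q = simpleText(raw.q);
    const a = simpleText(raw.a);

    if (!q) {
      result.error.push(raw); // empty question
      continue;
    }
    if (countTokens(q + a) > maxToken) {
      result.overToken.push(raw); // exceeds the maximum token limit
      continue;
    }
    const key = `${q}\n${a}`;
    if (seen.has(key)) {
      result.repeat.push(raw); // duplicate content
      continue;
    }
    seen.add(key);
    result.success.push({ q, a });
  }
  return result;
}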
Data insertion:
Use batch inserts to write the data into MongoDB, 200 records per batch.
If a batch insert fails, the failed documents are logged and then retried one by one.
The function runs inside the session passed by the caller, and opens a new transaction if no session is passed in (see the sketch below).
In short, this function inserts the preprocessed dataset data into the training queue, ensuring the validity and integrity of the data, and supports transactions to guarantee data consistency.
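The session handling described above can be sketched with standard Mongoose APIs as follows. This is an illustration of the pattern, not FastGPT's exact code: FastGPT wraps this in its own helpers, but the idea of reusing a caller-supplied session or opening a new transaction is the same.

import mongoose, { ClientSession } from 'mongoose';

// Run a unit of work inside the caller's session if one was provided,
// otherwise open a new session and wrap the work in a transaction.
async function runInSession<T>(
  fn: (session: ClientSession) => Promise<T>,
  session?: ClientSession
): Promise<T> {
  if (session) {
    // Reuse the transaction the caller already started.
    return fn(session);
  }
  const newSession = await mongoose.startSession();
  try {
    let result!: T;
    await newSession.withTransaction(async () => {
      result = await fn(newSession);
    });
    return result;
  } finally {
    await newSession.endSession();
  }
}

With such a helper, the batch insert loop shown in the next section would be invoked roughly as runInSession((s) => insertData(0, s), session).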
Analysis of the main implementation source code
export async function pushDataListToTrainingQueue({
// Verify model configuration: check the validity of agentModel and vectorModel,
// and set the maximum number of tokens and the weight.
const { model, maxToken, weight } = await (async () => {
  const agentModelData = getLLMModel(agentModel);
  if (!agentModelData) {
    return Promise.reject(`File model ${agentModel} is inValid`);
  }
  const vectorModelData = getVectorModel(vectorModel);
  if (!vectorModelData) {
    return Promise.reject(`Vector model ${vectorModel} is inValid`);
  }
Set the maximum number of tokens and the weight according to the training mode:
  if (trainingMode === TrainingModeEnum.chunk) {
    return {
      maxToken: vectorModelData.maxToken * 1.5,
      model: vectorModelData.model,
      weight: vectorModelData.weight
    };
  }

  if (trainingMode === TrainingModeEnum.qa || trainingMode === TrainingModeEnum.auto) {
    return {
      maxToken: agentModelData.maxContext * 0.8,
      model: agentModelData.model,
      weight: 0
    };
  }
Filter out over-long or duplicate content:
// Filter duplicate and over-long content
// filter repeat or equal content
const set = new Set();
const filterResult: Record<string, PushDatasetDataChunkProps[]> = {
  success: [],
  overToken: [],
  repeat: [],
  error: []
};
Format the QA questions and answers and remove empty characters:
// format q and a, remove empty char
data.forEach((item) => {
  item.q = simpleText(item.q);
  item.a = simpleText(item.a);
Batch insert into the dataset_trainings collection in MongoDB:
// Use insertMany for batch insertion
// Data insertion: insert data in batches (200 records per batch)
const batchSize = 200;
const insertData = async (startIndex: number, session: ClientSession) => {
  const list = filterResult.success.slice(startIndex, startIndex + batchSize);

  if (list.length === 0) return;

  try {
    await MongoDatasetTraining.insertMany(
      list.map((item) => ({
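The excerpt ends before the error handling. Based on the behavior described earlier (log the failed batch, then retry its documents one by one), the fallback path can be sketched roughly as follows; the function name and logging calls are illustrative, not FastGPT's actual code.

import { ClientSession } from 'mongoose';

// Illustrative sketch: `list` is one 200-record batch and MongoDatasetTraining
// is the training-queue model used in the excerpt above.
async function insertBatchWithFallback(list: any[], session: ClientSession) {
  try {
    await MongoDatasetTraining.insertMany(list, { session });
  } catch (error) {
    // Log the failed batch, then retry each document individually so that
    // one bad record does not discard the whole batch.
    console.error('Batch insert failed, retrying documents one by one', error);
    for (const item of list) {
      try {
        await MongoDatasetTraining.insertMany([item], { session });
      } catch (singleError) {
        console.error('Failed to insert training record', singleError);
      }
    }
  }
}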
Summary
This article introduced the overall process of FastGPT dataset creation and analyzed the implementation steps and principles of the first step in detail. As you can see, this step only puts the data into the MongoDB training queue collection. So once the data has been inserted into MongoDB, how is it processed, and how is the processing task triggered? The principles and logic behind that implementation will be analyzed in the next article.