Prompt-Tuning

Written by
Jasper Cole
Updated on: June 27th, 2025
Recommendation

Prompt-Tuning, a parameter-efficient method for fine-tuning large models, achieves strong results while training only a tiny fraction of the parameters and using far fewer resources.

Core content:
1. The principle and core idea of Prompt-Tuning
2. The advantages of Prompt-Tuning over traditional fine-tuning
3. Prompt-Tuning code implementation and working principle analysis


Theoretical Introduction

Prompt-Tuning is a parameter-efficient fine-tuning method. The core idea can be compared to this: instead of modifying a knowledgeable textbook (the pre-trained large model), we attach a few very smart, learnable sticky notes (soft prompts, also called virtual tokens) at the beginning of the book (the input layer). The content of the sticky notes is not fixed text, but parameters (vectors) that the model can learn and adjust on its own.

During training, we freeze most of the parameters of the original model and train only the newly added sticky-note parameters, so that when the model sees a specific sticky note it performs the task the way we expect.

Core principle diagram

The PLM (pre-trained model) and its weights W remain unchanged; only X (the model input) changes, because the soft prompt is prepended to it.

Task-specific prompt embeddings are designed and fine-tuned to guide the pre-trained model toward a particular task; only this small set of prompt embeddings is trained, rather than the full model parameters.
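Conceptually, the soft prompt is just a small learnable embedding matrix concatenated in front of the token embeddings. The following is a minimal sketch of the idea only (it is not the actual PEFT implementation; the class and parameter names are made up for illustration):

import torch
import torch.nn as nn

class SoftPromptPrepender(nn.Module):
    """Illustrative only: prepend learnable "virtual token" embeddings to the input."""
    def __init__(self, num_virtual_tokens: int, hidden_size: int):
        super().__init__()
        # The only trainable parameters: one embedding vector per virtual token
        self.soft_prompt = nn.Parameter(torch.randn(num_virtual_tokens, hidden_size))

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch_size, seq_len, hidden_size), produced by the frozen model's embedding layer
        batch_size = token_embeddings.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch_size, -1, -1)
        # The frozen model then attends over [soft prompt ; original input]
        return torch.cat([prompt, token_embeddings], dim=1)

During training, only soft_prompt would receive gradient updates; the frozen model's weights W stay fixed, which is exactly the "W unchanged, X changes" picture above.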

Advantages over traditional fine-tuning:

  • Fast training and resource-saving: only a very small number of parameters are trained.
  • Small storage: each new task only needs to save the little sticky-note parameters, not the entire model.
  • Good results: on many tasks, Prompt-Tuning achieves results comparable to full fine-tuning.
  • The original model is untouched: the base model stays unchanged, so different sticky notes can be loaded for different tasks (see the sketch after this list).
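As a rough illustration of the storage and multi-task points above: assuming a trained PEFT model named model and hypothetical adapter directories, saving and swapping adapters looks roughly like this (save_pretrained and PeftModel.from_pretrained are the standard PEFT calls also used later in this article):

# Save only the soft-prompt adapter for the current task (a tiny file), not the full model
model.save_pretrained("./adapters/task_a")   # hypothetical path

# Reuse the same frozen base model with a different "sticky note" per task
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("D:\\git\\model-download\\bloom-389m-zh")
task_a_model = PeftModel.from_pretrained(base, "./adapters/task_a")   # hypothetical adapter directory
# task_b_model would load a different adapter onto a fresh copy of the base model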

Interpreting principles through code

The code below walks through a complete Prompt-Tuning implementation and explains how it works. (The environment requires the PEFT package, installed via pip; the PEFT version used in this code is 0.14.0.)

The Prompt-Tuning method itself shows up mainly in Step 4 and Step 8, which deserve a careful read. The rest of the code is essentially the same as in previous posts.
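For reference, the environment can be set up and checked roughly like this (the exact install command may vary with your setup):

# pip install peft==0.14.0 transformers datasets torch
import peft, transformers
print("peft:", peft.__version__, "| transformers:", transformers.__version__)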

Step 1: Import related packages

import torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForSeq2Seq, TrainingArguments, Trainer
from peft import PromptTuningConfig, get_peft_model, TaskType, PromptTuningInit, PeftModel

Step 2: Load the dataset

# Each sample contains 'instruction' (the instruction), 'input' (optional additional input), and 'output' (the expected answer)
ds = Dataset.load_from_disk("../data/alpaca_data_zh/")

Step 3: Dataset preprocessing

Process each sample into a dictionary containing input_ids, attention_mask, and labels.

tokenizer = AutoTokenizer.from_pretrained("D:\\git\\model-download\\bloom-389m-zh")

def process_func(example):

    MAX_LENGTH = 256

    # Build the input text: combine the instruction and the (optional) input, and add explicit "Human:" and "Assistant:" markers. "\n\nAssistant: " is the key delimiter that prompts the model to start generating the answer.
    prompt = "\n".join(["Human: " + example["instruction"], example["input"]]).strip() + "\n\nAssistant: "
    # Tokenize the prompt. No special tokens (<s>, </s>) are added here; the pieces are concatenated later.
    instruction_tokenized = tokenizer(prompt, add_special_tokens=False)
    # Tokenize the expected output (answer) and append `tokenizer.eos_token` (end-of-sequence). This tells the model where generation should stop.
    response_tokenized = tokenizer(example["output"] + tokenizer.eos_token, add_special_tokens=False)
    # Concatenate the token IDs of the prompt and the answer to form the complete input sequence input_ids
    input_ids = instruction_tokenized["input_ids"] + response_tokenized["input_ids"]
    # attention_mask tells the model which tokens are real and should be attended to, and which are padding.
    attention_mask = instruction_tokenized["attention_mask"] + response_tokenized["attention_mask"]
    # Create labels: the targets the model learns to predict. Since we only want the model to learn to predict the answer after "Assistant:", the prompt positions are set to -100. The loss function automatically ignores tokens labeled -100.
    labels = [-100] * len(instruction_tokenized["input_ids"]) + response_tokenized["input_ids"]

    # Truncate
    if len(input_ids) > MAX_LENGTH:
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]

    # Return the processed data
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels
    }

# The .map() method applies the processing function to every sample in the dataset.
tokenized_ds = ds.map(process_func, remove_columns=ds.column_names)    # `remove_columns` drops the original columns and keeps only the new columns returned by process_func.
print("\nCheck the processing result of the second sample:")
print("Input sequence (input_ids decoded):", tokenizer.decode(tokenized_ds[1]["input_ids"]))
target_labels = list(filter(lambda x: x != -100, tokenized_ds[1]["labels"]))  # Filter out -100 to see what the model really needs to predict
print("Label sequence (labels decoded, -100 filtered out):", tokenizer.decode(target_labels))

Step 4: Create model and PEFT configuration

This step is the core of Prompt-Tuning. A piece of text is used to initialize the "virtual prompt tokens": the text is tokenized, and the corresponding word embeddings serve as the initial values of the virtual prompt embeddings.

`num_virtual_tokens` is the number of virtual prompt embedding vectors; these embeddings are the only parameters that are trained.

model = AutoModelForCausalLM.from_pretrained("D:\\git\\model-download\\bloom-389m-zh")
# Configure Prompt Tuning
config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,    # Causal language model
    prompt_tuning_init=PromptTuningInit.TEXT,    # PromptTuningInit.TEXT initializes the "virtual prompt tokens" from a text's embeddings, which works better than random initialization.
    prompt_tuning_init_text="The following is a conversation between a person and a robot.",    # This text is tokenized and its word embeddings are used as the initial values of the virtual prompt tokens.
    num_virtual_tokens=len(tokenizer("The following is a conversation between a person and a robot.")["input_ids"]),    # The number of virtual prompt tokens equals the length of the initialization text after tokenization. The embeddings of these `num_virtual_tokens` virtual tokens are the only trainable parameters!
    tokenizer_name_or_path="D:\\git\\model-download\\bloom-389m-zh"
)

# Apply the Prompt Tuning configuration to the base model via `get_peft_model`. The function adds a learnable prompt encoder to the model and automatically freezes all other parameters of the base model.
model = get_peft_model(model, config)
# Inspect the change in model structure; there will be an additional prompt_encoder component
print("PEFT model structure: ", model)
# Check trainable parameters: print and compare the number of trainable parameters against the total number of parameters.
model.print_trainable_parameters()    # 'trainable params' is much smaller than 'all params'.
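As a rough sanity check, assuming the base model exposes `hidden_size` in its config (BLOOM does) and that the Prompt-Tuning prompt encoder is a plain embedding table of shape num_virtual_tokens × hidden_size, the trainable-parameter count printed above should match:

# The soft prompt is an embedding table: one vector of size hidden_size per virtual token
expected = config.num_virtual_tokens * model.config.hidden_size
actual = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Expected trainable parameters: {expected}, actual: {actual}")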

Step 5: Configure training parameters

args = TrainingArguments(
    output_dir="./chatbot_prompt_tuning_explained_zh",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,    # Gradient accumulation: an effective batch size of 1 * 8 = 8, useful when GPU memory is limited
    logging_steps=10,                 # Print log information (such as the loss) every 10 training steps
    num_train_epochs=1,               # Number of training epochs
    save_steps=100,                   # Save a model checkpoint every 100 training steps
    # learning_rate=1e-3,             # Prompt-Tuning can usually use a slightly larger learning rate than full fine-tuning
    # gradient_checkpointing=True,    # Saves GPU memory at a small speed cost; enable it if memory is insufficient
)

Step 6: Create a trainer

trainer = Trainer(
    model=model,
    args=args,
    tokenizer=tokenizer,
    train_dataset=tokenized_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),  # Data collator: groups dataset samples into a batch and applies the necessary padding
)

Step 7: Model training

trainer.train()  # Only the virtual-token parameters are optimized; the base model weights stay frozen.
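In addition to the periodic checkpoints written by save_steps, the trained soft prompt can also be saved explicitly as a standalone adapter; only the virtual-token embeddings and a small config file are written, not the frozen base model. The directory name below is illustrative:

# Save only the Prompt-Tuning adapter (a tiny file), not the 389M-parameter base model
model.save_pretrained("./chatbot_prompt_tuning_explained_zh/final_adapter")   # illustrative output directory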

Step 8: Model inference and results

Prompt-Tuning inference:

  1. During inference, the trained virtual tokens are automatically prepended to the input sequence;
  2. The virtual tokens act as an implicit prompt, guiding the model to generate answers of a specific style or content (a small inspection sketch appears after the inference code below);
  3. Users never see these virtual tokens; they only work inside the model.

Standard process:

  1. Load the original base model.
  2. Load the trained PEFT adapter weights.
  3. Use `PeftModel.from_pretrained` to combine the two.
# 1. Load the base model
base_model = AutoModelForCausalLM.from_pretrained("D:\\git\\model-download\\bloom-389m-zh")

# 2. Specify the directory containing the PEFT adapter weights
peft_model_path = "./chatbot_prompt_tuning_explained_zh/checkpoint-3357/"

# 3. Load the PEFT adapter and apply it to the base model
peft_model = PeftModel.from_pretrained(model=base_model, model_id=peft_model_path)

if torch.cuda.is_available():
    peft_model = peft_model.cuda()
    print("Model has been moved to GPU.")
else:
    print("CUDA not detected, running inference on CPU.")

# Prepare the input text
instruction = "What are the tips for taking the test?"
input_text = ""
prompt = f"Human: {instruction}\n{input_text}".strip() + "\n\nAssistant: "
print(f"\nInput prompt for inference:\n{prompt}")

# Tokenize the input text, convert it to tensors, and move it to the device the model is on (CPU or GPU)
ipt = tokenizer(prompt, return_tensors="pt").to(peft_model.device)

# Use the `.generate()` method to generate an answer. The PEFT model automatically handles the injection of the soft prompt.
print("Generating answer...")
response_ids = peft_model.generate(**ipt, max_length=128, do_sample=True, top_k=50, top_p=0.95, temperature=0.7)

# Decode the generated token IDs back into text
full_response = tokenizer.decode(response_ids[0], skip_special_tokens=True)    # `skip_special_tokens=True` removes special tokens such as <|endoftext|>

# Keep only the content after "Assistant: "
assistant_response = full_response.split("Assistant: ")[-1]
print(f"\nResponse generated by the model:\n{assistant_response}")
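To make point 2 of the inference notes above concrete, one can peek at the learned virtual-token embeddings that get prepended to every input. Note that get_prompt is an internal PEFT helper rather than a stable public API, so treat this as a debugging sketch that may need adjusting across PEFT versions:

# Inspect the learned soft prompt that is injected in front of the input
with torch.no_grad():
    virtual_prompt = peft_model.get_prompt(batch_size=1)
# Expected shape: (batch_size, num_virtual_tokens, hidden_size)
print("Soft prompt shape:", virtual_prompt.shape)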

-END-