Building an Enterprise-Level Prompt Engineering Project

Building an enterprise-level prompt engineering project is key to using large language models (LLMs) efficiently.
Core content:
1. The definition and importance of prompt engineering
2. The overlap problem in prompt library construction and how to address it
3. Prompt templates in the LangChain framework and their applications
Introduction
Today, prompts are the primary way to interact with Large Language Models (LLMs). Prompts need to be tailored to the user’s needs, providing the right context and guidance to maximize the likelihood of getting the “right” response.
This led to the rise of prompt engineering[1] as a professional discipline, in which prompt engineers systematically experiment and record their findings to derive the “right” prompts that elicit the “best” responses. These lists of successful prompts are then organized into a form that allows for efficient reuse—called a prompt library.
Unfortunately, curating and maintaining a high-quality prompt library remains extremely challenging. The fundamental goal of a prompt library is to retrieve the best prompt for a given task without repeating the entire experimentation process. However, because prompts overlap, such retrieval is not easy to achieve.
Problem Statement
Let’s try to understand the problem of overlapping prompts with a few examples from content writing, one of the fields with the highest AI adoption today:
Prompt 1: Generate an attractive lead for a blog post announcing the opening of an eighties-themed cafe. Highlight the ambiance and menu. Use a friendly tone to attract older customers.
Prompt 2: Write a news report summary of no more than 200 words announcing the opening of a modern-themed bar. Emphasize the interior decoration and menu content. Use friendly language to impress young customers.
Since both prompts generate “good” responses approved by (human) reviewers (even setting aside LLM-as-judge techniques), the question arises: which prompt should be added to the prompt library? There are at least four options:
1. Add both. However, they are so similar that they are hard to distinguish during retrieval.
2. Choose only one of the two. This loses the contextual information unique to each prompt, such as the difference between a newspaper page and a blog post, or between older and younger customers; it also drops constraints such as the response-length limit in the second prompt.
3. Add neither, because a prompt already in the library covers a similar scope. (The same reasoning applies to LLM caching.)
4. Add a template that captures what the two prompts have in common, with placeholders and value lists for the variable parts. For example, a general template covering both example prompts could read: Generate an attractive introduction to an article about the following event. Emphasize the theme and limit the length of the answer, using the specified tone.
• Publication type: [newspaper, blog post, report, ...]
• Event: [opening of a cafe or restaurant, ...]
• Theme: [ambience, decor, menu, ...]
• Tone: [friendly, formal, informative, ...]
This is the recommended approach; in the next section we show how to organize prompts into prompt templates with the help of a framework like LangChain, and which types of templates are supported.
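Before turning to LangChain, here is a rough plain-Python illustration of option 4; the variable names below are purely illustrative and not part of any framework:

template = (
    "Generate an engaging {post_type} introduction for an article about the "
    "following {event}. Emphasize the {theme} and limit the answer to "
    "{max_words} words, using a {tone} tone."
)
# Allowed values for each placeholder (the "entities" of the template)
entity_values = {
    "post_type": ["newspaper", "blog post", "report"],
    "event": ["opening of a cafe", "opening of a restaurant"],
    "theme": ["ambience", "decor", "menu"],
    "tone": ["friendly", "formal", "informative"],
}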
Prompt Template
LangChain provides the following predefined prompt template classes [2]:
• PromptTemplate is the default template.
• ChatPromptTemplate is used for modeling chat messages.
• FewShotPromptTemplate is used for applying few-shot learning techniques.
Templates can also be combined; for example, a ChatPromptTemplate can be merged with a FewShotPromptTemplate to fit a given usage scenario.
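A rough sketch of such a combination, assuming a recent langchain_core version that exposes FewShotChatMessagePromptTemplate (the example content below is made up):

from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate

examples = [
    {"input": "2 + 2", "output": "4"},
    {"input": "2 + 3", "output": "5"},
]
# How each example is rendered as a pair of chat messages
example_prompt = ChatPromptTemplate.from_messages(
    [("human", "{input}"), ("ai", "{output}")]
)
few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_prompt=example_prompt,
    examples=examples,
)
# The few-shot block is embedded inside the final chat prompt
final_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful assistant."),
        few_shot_prompt,
        ("human", "{input}"),
    ]
)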
Let's start with the most basic PromptTemplate:
from langchain_core.prompts import PromptTemplate
prompt_template = PromptTemplate.from_template(
    "Generate an engaging abstract for a {post} on the following {event}."
)
prompt_template.format(post="blog", event="opening of cafe")
It is used for modeling prompts as plain strings (sentences) with variables (placeholders), like the example prompts we considered in the previous section.
If you have worked with pre-ChatGPT chatbot tools such as IBM Watson Assistant [1], AWS Lex [2], or Google Dialogflow [3], these concepts may be familiar to you: intents, utterances, and entities. Such guided chatbots are trained by providing a set of prompt sentences, their variants, and the corresponding responses. Prompts are grouped by intent; in chatbot terminology, the different ways a user can phrase the same question are called utterances. Finally, entities refer to domain-specific vocabulary, that is, the list of allowed values for a variable.
Next, consider the ChatPromptTemplate, which models a multi-turn conversation between a user and an AI system. One can specify roles such as user, assistant, and system; the allowed roles depend on what the underlying LLM supports. For example, the OpenAI Chat Completions API [4] accepts the roles system, user (a person), and assistant (the AI). These roles provide additional context and help the LLM interpret the conversation.
from langchain_core.prompts import ChatPromptTemplate
chat_template = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a knowledgeable AI bot. You are called {name}."),
        ("human", "Hi, how are you today?"),
        ("ai", "I'm doing great, thanks! How can I help you?"),
        ("human", "{user_input}"),
    ]
)
messages = chat_template.format_messages(
    name="Emily", user_input="How should I call you?"
)
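The formatted messages can then be passed to any chat model. A minimal usage sketch, assuming the langchain-openai package is installed and an API key is configured (the model name is illustrative):

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")  # illustrative model choice
response = llm.invoke(messages)
print(response.content)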
Finally, let’s consider the FewShotPromptTemplate class, which supports few-shot prompting: the prompt first shows the LLM a small set of example question-answer pairs and then poses the real question.
from langchain_core.prompts.few_shot import FewShotPromptTemplate
from langchain_core.prompts.prompt import PromptTemplate
examples = [
    {"question": "What is the second largest ocean on Earth?", "answer": "Atlantic Ocean"},
    {"question": "What is the tallest mountain in Asia?", "answer": "Mount Everest"},
]
example_prompt = PromptTemplate(
    input_variables=["question", "answer"], template="Question: {question}\n{answer}"
)
prompt = FewShotPromptTemplate(
    examples=examples,
    example_prompt=example_prompt,
    suffix="Question: {input}",
    input_variables=["input"],
)
print(prompt.format(input="What is the tallest mountain in Africa?"))
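For reference, the format call joins the rendered examples and the suffix with blank lines, so the printed prompt should look roughly like this:

Question: What is the second largest ocean on Earth?
Atlantic Ocean

Question: What is the tallest mountain in Asia?
Mount Everest

Question: What is the tallest mountain in Africa?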
Prompt Store Management Based on Reinforcement Learning
LangChain's prompt templates are a good solution for templating and organizing prompts. However, with an enterprise-level prompt library of 100+ prompts (templates), it quickly becomes difficult to manage them manually. For each new (approved and recommended) prompt, we need to:
1. Extract the prompt template and entity values.
2. Determine the overlap with existing prompt templates (a minimal sketch of such an overlap check follows the example below).
3. If no covering prompt template exists, add the template to the prompt store.
4. If a covering template exists but its coverage is below a certain threshold (e.g. 70%), the template may need to be adjusted to include the missing entities and entity values.
The prompt store selection process is shown in the figure below:
For example, given the following (existing) prompt template
Generate an engaging abstract for a {post} on the following {event}. Highlight the {theme}. Use the specified {tone} of voice.
• Post (type): [newspaper, blog, article, ...]
• Event: [opening of cafe, restaurant, diner, ...]
• Theme: [ambience, menu]
• Tone: [informative, formal]
and the following new prompt
Generate an engaging abstract of no more than 200 words for a newspaper post announcing the opening of a modern-themed cafe. Highlight the decor and menu. Use a friendly tone to appeal to a young customer base.
The existing prompt template needs to be adapted to add the 200-word limit and the following entity values: Theme: [ambience, decor, menu]; Tone: [friendly, informative, formal].
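How the overlap in step 2 is computed is an implementation choice. A minimal sketch, assuming the tensorflow_hub package and the Universal Sentence Encoder; the 0.7 threshold and the function names are illustrative:

import numpy as np
import tensorflow_hub as hub

# Embed the new prompt and the existing templates, then flag the closest
# template if its cosine similarity exceeds a coverage threshold.
encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_covering_template(new_prompt, templates, threshold=0.7):
    vectors = encoder([new_prompt] + templates).numpy()
    new_vec, template_vecs = vectors[0], vectors[1:]
    scores = [cosine(new_vec, t) for t in template_vecs]
    best = int(np.argmax(scores))
    if scores[best] >= threshold:
        return templates[best], scores[best]  # covering template found
    return None, None                         # no sufficiently close template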
In the following, we outline a reinforcement learning approach based on user feedback to automate prompt library maintenance.
This RL model does not aim to build a prompt library from scratch, but rather utilizes prompts with user feedback to automatically generate and optimize the prompt library content.
At the core of our approach is a scoring model trained to rate prompt-response tuples based on user feedback. The scores predicted by this model serve as rewards for a reinforcement learning agent. Policy learning is performed offline with the help of a user simulator that generates prompt instances from templates in the prompt store. Policy learning uses a Deep Q-Network (DQN) agent with ε-greedy exploration, and it is able to efficiently accommodate alternate responses for out-of-scope prompts.
Reinforcement Learning Model
The architecture of the RL model is shown in the figure below.
RL-based prompt repository management architecture (Image provided by the author)
The key components of the RL model are: the NLU unit, used to initially train the RL agent during the warm-up phase; the user simulator, which randomly draws candidate prompts from the prompts database (Prompts DB) covering new use cases, scenarios, etc.; the scoring model, trained on user feedback to rate prompts; and the Deep Q-Network (DQN) based RL agent.
Natural Language Understanding
The Natural Language Understanding (NLU) component performs intent recognition and is trained on existing (and approved) prompt templates from the prompt library. To keep things simple, we only consider LangChain's basic PromptTemplate in this work and use the Rasa [5] open-source NLU with a TensorFlow [6] pipeline. However, the reinforcement learning approach is independent of the chosen NLU and can easily be extended to other NLU engines such as Google Dialogflow, IBM Watson, or Amazon Lex.
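Conceptually, and independent of the NLU engine, each approved prompt template becomes an intent and its concrete prompt instances become the training utterances. A purely illustrative, engine-agnostic representation:

# one intent per approved prompt template, with prompt instances as utterances
nlu_training_data = {
    "generate_event_abstract": [
        "Generate an engaging abstract for a blog on the opening of a cafe.",
        "Write a newspaper summary announcing the opening of a restaurant.",
    ],
}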
User feedback
We based our work on user feedback obtained during the development of an actual internal chatbot, which was scoped to answer employee queries regarding office building facilities, HR policies, and benefits.
All ten users who participated in the feedback process were informed that their feedback would be used to improve prompt quality. Users gave a binary rating for each prompt-response pair. The historical data therefore consists of quadruples of the form (prompt, response, NLU confidence, user feedback).
Reinforcement Learning Reward Function
Natural Language Understanding (NLU) performance evaluation is a long-standing problem in computational linguistics. Evaluation metrics borrowed from machine translation perform poorly on short sentences, such as the responses (templates) in our case. On the other hand, user (human) review of prompts and responses is now considered the de facto standard for evaluating quality, accuracy, relevance, etc. - although these ratings are often difficult and costly to collect.
To apply user feedback in an offline reinforcement learning manner, we use a scoring model that models binary feedback for new (unseen) prompt-response tuples. The vector representation of the sentence is computed using the Universal Sentence Encoder provided by TensorFlow Hub.
Given the above conditions, the scoring model learns to project the vector representations of prompts and responses into a linearly transformed space such that similar vector representations give higher scores.
To train the reinforcement learning reward model, optimization minimizes the squared-error loss between the model predictions and the human feedback, with L2 regularization. When evaluating the model, the predicted scores are converted to binary outcomes and compared with the target (the user feedback). For prompt-response pairs whose recognized template received positive feedback and whose NLU confidence is close to 1, we perform data augmentation by assigning low scores to combinations of the prompt with alternative intents.
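A minimal sketch of such a scoring model, assuming TensorFlow and the Universal Sentence Encoder from TensorFlow Hub; the layer size, regularization strength, and the cosine-based scoring head are illustrative choices:

import tensorflow as tf
import tensorflow_hub as hub

encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Project prompt and response embeddings into a shared space and score their match;
# trained with squared-error loss against binary user feedback, with L2 regularization.
class ScoringModel(tf.keras.Model):
    def __init__(self, dim=128, l2_reg=1e-4):
        super().__init__()
        reg = tf.keras.regularizers.l2(l2_reg)
        self.proj_prompt = tf.keras.layers.Dense(dim, kernel_regularizer=reg)
        self.proj_response = tf.keras.layers.Dense(dim, kernel_regularizer=reg)

    def call(self, inputs):
        p = tf.nn.l2_normalize(self.proj_prompt(inputs[0]), axis=-1)
        r = tf.nn.l2_normalize(self.proj_response(inputs[1]), axis=-1)
        # Cosine similarity rescaled to [0, 1] as the predicted score
        return (tf.reduce_sum(p * r, axis=-1) + 1.0) / 2.0

model = ScoringModel()
model.compile(optimizer="adam", loss="mse")
# prompt_vecs = encoder(list_of_prompts); response_vecs = encoder(list_of_responses)
# model.fit([prompt_vecs, response_vecs], binary_feedback, epochs=...)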
DQN-Based Policy Learning for the RL Agent
The RL agent learns its policy using a Q-learning algorithm with a DQN architecture. We adopt the approach proposed in [6] and use a fully connected network fed from an experience replay buffer containing one-hot representations of prompts and responses together with their corresponding rewards.
During the warm-up phase, the DQN is trained using the NLU confidence levels. Whenever the confidence value of a state-action tuple exceeds a threshold, the DQN training set is augmented by setting the weights of the given state and all other available actions to zero. As a result, at the beginning of RL training the RL agent behaves similarly to the NLU unit.
During reinforcement learning training, we used an epsilon (ε)-greedy exploration strategy, randomly selecting actions with probability ε. We used a time-varying ε, starting at 0.2 to encourage exploration early in training and decaying to 0.05 by the end of the training epochs.
In each epoch, we simulate a batch of dialogues of size n_episodes (ranging from 10 to 30 in our experiments) and fill the experience replay buffer with triples (s_t, a_t, r_t). The buffer has a fixed size and is flushed the first time the RL agent's performance exceeds a specified threshold. For state-action tuples in an epoch that receive a reward greater than 50%, we perform data augmentation by assigning zero reward to every other action for the current state.
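A compact sketch of the agent's action selection and ε schedule described above; the network size and hyperparameters are illustrative, and the replay buffer handling and warm-up logic are omitted:

import random
import numpy as np
import tensorflow as tf

class DQNAgent:
    def __init__(self, state_dim, n_actions, eps_start=0.2, eps_end=0.05):
        self.n_actions = n_actions
        self.eps_start, self.eps_end = eps_start, eps_end
        self.replay_buffer = []  # filled with (s_t, a_t, r_t) triples during simulation
        self.q_net = tf.keras.Sequential([
            tf.keras.layers.Input(shape=(state_dim,)),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(n_actions),
        ])
        self.q_net.compile(optimizer="adam", loss="mse")

    def epsilon(self, epoch, n_epochs):
        # Anneal epsilon linearly from 0.2 at the start to 0.05 at the end of training
        frac = epoch / max(n_epochs - 1, 1)
        return self.eps_start + frac * (self.eps_end - self.eps_start)

    def act(self, state, epoch, n_epochs):
        if random.random() < self.epsilon(epoch, n_epochs):
            return random.randrange(self.n_actions)            # explore
        q_values = self.q_net.predict(state[None, :], verbose=0)
        return int(np.argmax(q_values))                        # exploit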
Conclusion
In this paper, we focused on the challenges of building an enterprise prompt library. Unfortunately, due to the overlapping nature of prompts, curating and maintaining a prompt library remains a challenging task. For each new candidate prompt (to be added to the prompt library), we need to answer the following questions:
• Should we add it directly to the prompt library? If we do, prompt retrieval becomes more complicated.
• How do we resolve overlaps/conflicts with existing prompts in the library? In such cases, how do we adapt the existing prompts to cover the scope of the new candidate prompt?
We then discussed in detail a structured approach to organizing prompts in the form of prompt templates, with concrete examples of the template types supported by LangChain. Finally, we showed how the prompt library curation process can be semi-automated using reinforcement learning with user feedback.
In the future, better tools and strategies are needed to resolve the problem of conflicting prompt scopes, especially in the case of multi-domain prompt libraries where respective user groups (business units) hold different views on the usefulness of the same prompt.