Let the LLM judge | Design your own assessment prompt

Master LLM evaluation skills to improve model performance.
Core content:
1. General prompt design principles and applications
2. Detailed evaluation criteria and reasoning steps
3. Advantages of pairwise comparison and output scoring
General prompt design suggestions
I have summarized the general design principles of evaluation prompts commonly found online as follows:
- Clear task description: "Your task is to do X. You will be provided with Y."
- Detailed evaluation criteria and scoring rules (if necessary): "You should evaluate property Z on a scale of 1-5, where 1 means ..." or "You should evaluate if property Z is present in the sample Y. Property Z is present if ..."
- Add some "reasoning" evaluation steps: "To judge this task, you must first make sure to read sample Y carefully to identify ..., then ..."
- A clear output format (adding specific fields can improve consistency): "Your answer should be provided in JSON, with the following format: {"Score": Your score, "Reasoning": The reasoning which led you to this score}"
For prompt writing inspiration, you can also refer to existing prompt templates, or see the example below.
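As a concrete illustration, here is a minimal sketch of a judge prompt assembled from the principles above. The evaluated property ("fluency"), the 1-5 scale, and the JSON fields are placeholder choices you would adapt to your own task:

```python
# A minimal judge-prompt template assembled from the principles above.
# "fluency" and the 1-5 scale are placeholders for illustration only.
JUDGE_PROMPT_TEMPLATE = """\
Your task is to evaluate the fluency of a model answer.
You will be provided with a question and the answer to judge.

Evaluate fluency on a scale of 1 to 5, where 1 means the answer is
unreadable and 5 means it reads naturally with no grammatical errors.

Before giving your score, read the answer carefully, identify any
grammatical errors or awkward phrasing, and then decide on the score.

Your answer should be provided in JSON, with the following format:
{{"Score": your score, "Reasoning": the reasoning which led you to this score}}

Question:
{question}

Answer to judge:
{answer}
"""


def build_judge_prompt(question: str, answer: str) -> str:
    """Fill the template with one sample to evaluate."""
    return JUDGE_PROMPT_TEMPLATE.format(question=question, answer=answer)
```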
Other points:
Pairwise comparison (asking the judge which of two outputs is better) is generally more robust than scoring each output on an absolute scale. If the task really requires a specific score, it is recommended to use integers and explain each grade in detail, or to use an additive scoring prompt, for example: "provide 1 point if the answer has this characteristic, 1 additional point if ..." (see the sketch after this list).
Try to use a dedicated scoring prompt for each capability you assess; this will give you better and more robust results.
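For the pairwise setup, a minimal sketch might look as follows. The `openai` client call and model name are assumptions; swap in whichever LLM API you actually use:

```python
import json
from openai import OpenAI  # assumed client; any chat-completion API works the same way

client = OpenAI()

PAIRWISE_PROMPT = """\
Your task is to decide which of two answers to the same question is better.
Judge helpfulness and factual correctness; ignore answer length and order.

Question:
{question}

Answer A:
{answer_a}

Answer B:
{answer_b}

Reply in JSON: {{"Winner": "A" or "B", "Reasoning": a short justification}}
"""


def pairwise_judge(question: str, answer_a: str, answer_b: str,
                   model: str = "gpt-4o-mini") -> dict:
    """Ask the judge model which answer wins and return the parsed JSON verdict."""
    prompt = PAIRWISE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # The judge may wrap the JSON in extra text; production code should parse more defensively.
    return json.loads(response.choices[0].message.content)
```

In practice you would also run the comparison a second time with A and B swapped and check that the verdicts agree, since judge models are known to exhibit position bias.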
Improve assessment accuracy
The following methods can be used to improve assessment accuracy (possibly at increased cost):
- Few-shot examples: providing a small number of examples helps the model understand the task and reason about it, but also increases the context length.
- Citing references: providing reference material for the judge to compare against can improve the accuracy of its output.
- Chain of Thought (CoT): asking the model to give its reasoning process before the score improves accuracy.
- Multiple rounds of analysis: analyzing the output over several turns can further improve accuracy.
- Jury mechanism: aggregate results from multiple evaluation models. Using several small models instead of one large model can significantly reduce costs; you can also run a single model several times with different temperature settings and aggregate the results (see the sketch after this list).
- Reward prompts: the community discovered, somewhat by accident, that adding a reward to the prompt (for example: "If you answer correctly, you will get a kitten.") can improve answer accuracy. The effect varies by scenario, so adjust it to your needs.
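As an illustration of the jury idea, here is a minimal sketch that queries several judge configurations and aggregates their scores by majority vote. The `single_judge_score` helper and the model names are placeholders, not a prescribed setup:

```python
import json
from collections import Counter

from openai import OpenAI  # assumed client; any chat-completion API works

client = OpenAI()


def single_judge_score(prompt: str, model: str, temperature: float) -> int:
    """Run one judge configuration and return its integer score."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    # Assumes the judge follows the JSON output format requested in the prompt.
    return int(json.loads(response.choices[0].message.content)["Score"])


def jury_score(prompt: str) -> int:
    """Aggregate several cheap judge configurations by majority vote."""
    # Placeholder jury: two small models, plus one of them at a higher temperature.
    configurations = [
        ("gpt-4o-mini", 0.0),
        ("gpt-4o-mini", 0.7),
        ("gpt-4.1-mini", 0.0),
    ]
    votes = [single_judge_score(prompt, model, temp) for model, temp in configurations]
    # Majority vote; ties are broken by whichever score was seen first.
    return Counter(votes).most_common(1)[0][0]
```

A mean or median of the votes also works, especially for Likert-style scales; majority vote is shown here only because it is the simplest to implement.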
Note: if you want to reduce model bias, you can borrow from questionnaire design in sociology and then write prompts tailored to your usage scenario. If you want the model to replace manual evaluation, you can design similar evaluation metrics: for example, measuring agreement with (or between) annotators and using sound questionnaire methodology to reduce bias, as sketched below.
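For the agreement check, a minimal sketch is shown below, assuming you have a small set of samples labeled both by a human annotator and by the LLM judge. It uses Cohen's kappa from scikit-learn as the agreement metric; the labels are made-up illustration data:

```python
from sklearn.metrics import cohen_kappa_score  # pip install scikit-learn

# Hypothetical labels on the same 8 samples: 1 = property present, 0 = absent.
human_labels = [1, 0, 1, 1, 0, 0, 1, 0]
judge_labels = [1, 0, 1, 0, 0, 0, 1, 1]

# Cohen's kappa corrects raw agreement for agreement expected by chance;
# values close to 1 suggest the LLM judge can plausibly stand in for the annotator.
kappa = cohen_kappa_score(human_labels, judge_labels)
print(f"Cohen's kappa between human and LLM judge: {kappa:.2f}")
```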
However, in real applications most people do not need a fully reproducible, high-quality, unbiased evaluation; a quick and slightly crude prompt will meet their needs. (As long as you understand the consequences of using it this way, this is acceptable.)