Is it really reliable to use an LLM to evaluate an LLM? And how is it done technically?

Written by Audrey Miles
Updated on: June 26, 2025

Recommendation: Explore new perspectives on LLM evaluation and gain insight into the advantages and limitations of LLM judges.

Core content:
1. The rise of LLM judges and their advantages over human evaluation
2. The different types of LLM judges and how they operate
3. Applications and challenges of LLM judges in LLM evaluation metrics


Recently, the phrase “LLM as a Judge” has become increasingly popular.

I work on LLM evaluation, so I pay close attention to this topic, and the concept is popular for a reason: compared with human evaluation, using an LLM as a judge to evaluate another LLM has obvious advantages, since human evaluation is slow, costly, and labor-intensive. That said, LLM judges have shortcomings of their own, and using them blindly will only get you burned. In this article I will share everything I have learned about using LLM judges to evaluate LLMs (and LLM systems), including:

  • What "LLM as a judge" means and why it is so popular

  • What the alternatives to LLM judges are and why they fall short

  • What the limitations of LLM judges are and how they can be addressed

  • How to use LLM judges in LLM evaluation metrics with DeepEval

Without further ado, let’s get started.

(PS: You can now also use LLM judges to compute deterministic LLM metrics in DeepEval!)

What exactly does it mean to "use an LLM as a judge"?

"Using LLM as a judge" simply means using LLM to evaluate the responses given by other LLMs according to the specific criteria you set, that is, using LLM to conduct LLM (system) evaluation. In the paper "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena", this method is proposed as an alternative to manual evaluation, after all, manual evaluation is both expensive and time-consuming. There are three main types of using LLM as a judge:

Paper address: https://arxiv.org/abs/2306.05685

  • Single-output scoring (without reference): give the judge LLM a set of scoring criteria and let it score an LLM response based on factors such as the input to the LLM system, the retrieval context in a retrieval-augmented generation (RAG) pipeline, and so on.

  • Single-output scoring (with reference): similar to the above, but because LLM judges can be inconsistent, providing a reference, ideal, or expected output helps them give more consistent scores.

  • Pairwise comparison: give the judge LLM two LLM-generated outputs and have it decide which one is better for the given input. This also requires custom criteria that define what "better" means; a sketch of such a prompt follows this list.
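To make pairwise comparison concrete, here is a minimal, illustrative judge prompt (the wording is my own, not taken from the paper):

pairwise_prompt = """
You are an impartial judge. Given the user input and two candidate responses,
decide which response answers the input better according to these criteria:
helpfulness, factual correctness, and conciseness.

Input:
{input}

Response A:
{output_a}

Response B:
{output_b}

Reply with exactly one of: "A", "B", or "Tie".
"""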

The concept is not hard to grasp: give an LLM an evaluation standard and let it score for you. But how exactly do you do it, and in which scenarios should it be used?

"Use LLM as Referee" can be used to enhance LLM evaluation, such as using it as a scorer in LLM evaluation indicators. When using it, first give your selected LLM a clear and concise evaluation standard or scoring rule, and then let it calculate a metric score between 0 and 1 based on various parameters, such as the LLM's input and generated output. The following is an example prompt for an LLM referee to evaluate the coherence of the summary:

prompt = """
You will receive a summary of a news article (the LLM output).
Your task is to evaluate the coherence of this summary with respect to the original text (the input).

Original text:
{input}

Summary:
{llm_output}

Score:
"""

By collecting these metric scores, you can build a comprehensive set of LLM evaluation results to benchmark, evaluate, and even regression-test the LLM (system).

Using LLMs as the scorers of LLM evaluation metrics is becoming increasingly popular simply because the alternatives are not very good. LLM evaluation is critical for quantifying performance and identifying areas for improvement in an LLM system, but human evaluation is too slow, and traditional scorers such as BERTScore and ROUGE ignore the deeper semantics of LLM-generated text and struggle to deliver useful results. Think about it: how can traditional, much smaller natural language processing (NLP) models effectively judge long stretches of open-ended generated text, let alone content in formats such as Markdown or JSON?

Is this method really feasible?

In short, yes. The research mentioned above shows that agreement between LLM judges and humans is actually higher than the agreement among human judges themselves. Moreover, you do not need to worry about the judge model being weaker than the model used in your application.

At first glance, it seems counterintuitive to use an LLM to evaluate text generated by another LLM. How can a model, which generates output, be better at judging that output or spotting errors in it?

The key is the separation of tasks. Instead of asking the LLM to regenerate the content, we use different prompts, or even a different model, specifically for evaluation. This draws on different capabilities of the model and often reduces the evaluation task to a simple classification problem, such as assessing quality, coherence, or correctness. Detecting problems is easier than avoiding them in the first place: evaluation is simpler than generation, and an LLM judge only needs to assess the generated content, for example checking its relevance, without having to improve the answer.

Beyond using fundamentally different evaluation prompts, there are many ways to improve the accuracy of LLM judges, such as chain-of-thought (CoT) prompting and few-shot learning, which we will discuss in detail later. We also found that constraining the outputs of LLM judges to a specific range makes metric scores more deterministic. In DeepEval, users can build decision trees, modeled as directed acyclic graphs (DAGs) with nodes representing LLM judges and edges representing decisions, to create deterministic evaluation metrics that align tightly with their criteria. More on this in the "Directed Acyclic Graph (DAG)" section.

(Off topic: LLMs really do make better judges. At first I relied on traditional non-LLM metrics like ROUGE and BLEU, which compare texts based on word overlap, but users quickly reported that these metrics were inaccurate even for simple sentences, let alone able to explain the reasoning behind a score.)

Alternatives to using an LLM as a judge

I could almost skip this section, but there are two popular alternatives for evaluating LLMs, and it is worth spelling out why I think choosing them is usually a mistake:

  • Human evaluation: often considered the gold standard because humans understand context and nuance. However, it is time-consuming, expensive, and can be inconsistent due to subjective interpretation. In practice, an LLM application may generate about 100,000 responses per month. It takes me an average of 45 seconds to read a few paragraphs and form a judgment, which works out to about 4.5 million seconds per month, or roughly 52 consecutive days (not counting lunch breaks) to evaluate every response.

  • Traditional NLP evaluation methods: traditional scorers like BERTScore and ROUGE have real advantages, such as speed, low cost, and reliability. However, as I discussed in a previous article comparing LLM evaluation metric scorers, they have two fatal flaws: they require a reference text to compare against the LLM-generated output, and their accuracy is low because they ignore the semantics of that output, which is often subjective and has complex formats (such as JSON). Given that LLM outputs in real applications are open-ended and usually have no reference text, traditional methods struggle to meet the need.

(In addition, both human evaluation and traditional NLP methods lack explainability: there is no way to explain how an evaluation score was arrived at.)

So, using an LLM as a judge is currently the best choice. It is highly scalable; bias can be reduced through fine-tuning or prompt engineering; it is relatively fast and cheap (depending, of course, on what you compare it with); and, most importantly, it can understand very complex generated text, regardless of content or format. With this in mind, let's explore the effectiveness, strengths, and weaknesses of LLM judges in LLM evaluation.

LLMs are more judgmental than you think

The question is, how accurate is an LLM judge? After all, LLMs are probabilistic models, so they are still prone to hallucination, right?

Research shows that when used properly, cutting-edge LLMs like GPT-4 (yes, that one again) can reach up to 85% agreement with human judges in both pairwise comparison and single-output scoring. For the skeptics: that is actually higher than the 81% agreement among the human judges themselves.

GPT-4 also behaves consistently across pairwise comparison and single-output scoring, which suggests it has a relatively stable set of internal scoring rules, and this stability can be further improved with chain-of-thought (CoT) prompting.

G-Eval

G-Eval is a framework that uses chain-of-thought (CoT) prompting to make LLM judges more stable, reliable, and accurate when computing metric scores (more on CoT below).

G-Eval first generates a series of evaluation steps from your original evaluation criteria, and then uses those steps to determine the final score via a form-filling paradigm (this may sound complicated, but in short, G-Eval just needs some information to work with). For example, to evaluate the coherence of an LLM output with G-Eval, you construct a prompt containing the evaluation criteria and the text to be evaluated, generate the evaluation steps, and then have the LLM give a score from 1 to 5 based on those steps.
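Conceptually, the flow looks something like the sketch below. The prompt wording and the call_llm helper are placeholders of my own (not G-Eval's or DeepEval's actual prompts); the sketch only shows the two-stage structure.

# Illustrative sketch of the G-Eval flow; `call_llm` is a placeholder for your model client.
def g_eval_coherence(original_text: str, summary: str, call_llm) -> int:
    # Stage 1: turn the high-level criteria into concrete evaluation steps (auto-CoT).
    steps = call_llm(
        "Criteria: Coherence - the collective quality of all sentences in the summary.\n"
        "Write 3-5 concrete evaluation steps for judging a summary against this criteria."
    )
    # Stage 2: use the generated steps in a form-filling prompt that ends in a score.
    score_text = call_llm(
        f"Evaluation steps:\n{steps}\n\n"
        f"Original text:\n{original_text}\n\n"
        f"Summary:\n{summary}\n\n"
        "Following the steps above, give a coherence score from 1 to 5. Reply with the number only."
    )
    # A real implementation would parse more defensively than this.
    return int(score_text.strip())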

You will find that the techniques used in G-Eval overlap heavily with the techniques that improve LLM judging in general. You can use G-Eval with just a few lines of code through DeepEval ⭐, an open-source LLM evaluation framework.

pip install deepeval

from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

test_case = LLMTestCase(input="input to your LLM", actual_output="your LLM output")
coherence_metric = GEval(
    name="Coherence",
    criteria="Coherence - the collective quality of all sentences in the actual output",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)

coherence_metric.measure(test_case)
print(coherence_metric.score, coherence_metric.reason)

Directed Acyclic Graph (DAG)

However, G-Eval has one problem: it is not deterministic, which means you cannot fully trust the results of a benchmark that uses it as the judge metric. That does not make G-Eval useless; it performs very well on tasks that require subjective judgment, such as evaluating coherence, similarity, or answer relevancy. But when the criteria are clear-cut, such as format correctness in a text-summarization use case, the results need to be deterministic.

This can also be achieved with an LLM by structuring the evaluation as a directed acyclic graph (DAG). In this approach, each node is an LLM judge that handles one specific decision, and the edges define the logical flow between decisions. Breaking LLM interactions into smaller atomic units reduces ambiguity and keeps the results predictable; the finer the breakdown, the fewer inconsistencies creep in.

As an example, here is a DAG for evaluating a meeting-summary use case, written in DeepEval:

from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics.dag import (
    DeepAcyclicGraph,
    TaskNode,
    BinaryJudgementNode,
    NonBinaryJudgementNode,
    VerdictNode,
)
from deepeval.metrics import DAGMetric

correct_order_node = NonBinaryJudgementNode(
    criteria="Are the summary headings in the correct order: 'intro' => 'body' => 'conclusion'?",
    children=[
        VerdictNode(verdict="Yes", score=10),
        VerdictNode(verdict="Two are out of order", score=4),
        VerdictNode(verdict="All out of order", score=2),
    ],
)

correct_headings_node = BinaryJudgementNode(
    criteria="Do the summary headings contain all three: 'intro', 'body', and 'conclusion'?",
    children=[
        VerdictNode(verdict=False, score=0),
        VerdictNode(verdict=True, child=correct_order_node),
    ],
)

extract_headings_node = TaskNode(
    instructions="Extract all headings in `actual_output`",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    output_label="Summary headings",
    children=[correct_headings_node, correct_order_node],
)

# create the DAG
dag = DeepAcyclicGraph(root_nodes=[extract_headings_node])

# create the metric
format_correctness = DAGMetric(name="Format Correctness", dag=dag)

# create a test case
test_case = LLMTestCase(input="your-original-text", actual_output="your-summary")

# evaluate
format_correctness.measure(test_case)
print(format_correctness.score, format_correctness.reason)

That said, I don't recommend jumping straight to DAGs, because they are harder to use, whereas G-Eval takes minutes to set up. Try G-Eval first and transition to more sophisticated techniques like DAGs later. It is also perfectly fine to use a DAG to filter on hard requirements such as format correctness first, and then run G-Eval; there is a full example near the end of this post where G-Eval is used as a leaf node instead of returning a hard-coded score.

LLMs are not perfect

As you might expect, using an LLM as a judge is not perfect. It has several drawbacks:

  • Unstable results: the scores are not deterministic; the same LLM output can receive different scores at different times. If you want to rely fully on the results, you need a technique such as a directed acyclic graph (DAG) to stabilize the scores.

  • Narcissistic bias: studies show that LLMs may favor the answers they themselves generated. "May" is the right word here, because the research found that while GPT-4 and Claude-v1 favor their own outputs with a 10% and 25% higher win rate respectively, they also sometimes favor other models, and GPT-3.5 shows no such bias.

  • Verbosity bias (more is better): we all know that "less is more", but LLM judges tend to prefer lengthy texts over concise ones. This is a problem for LLM evaluation, because the score may not accurately reflect the quality of the generated text.

  • Scores are not granular enough: LLMs are fairly reliable for coarse judgments, such as deciding whether a fact is correct or scoring text on a simple 1-5 scale. But as the scoring criteria get more detailed and the score bands get finer, LLMs tend to give more arbitrary scores, making the results less reliable and more random.

  • Position bias: in pairwise comparisons, LLM judges such as GPT-4 tend to favor the output presented first.

On top of these, issues like LLM hallucination also need to be considered. None of these problems is unsolvable, though; the next section looks at how to overcome them.

Methods to improve LLM judging ability

CoT Prompting

Chain-of-thought (CoT) prompting asks the model to explain its reasoning process. When using CoT to help an LLM judge, include detailed evaluation steps in the prompt instead of vague, high-level criteria. This helps the judge produce more accurate and reliable evaluations and brings its results closer to what people expect.
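For example, rather than asking only for a "coherence score", the prompt can spell out the steps the judge should follow. The wording below is illustrative, not DeepEval's actual prompt:

cot_judge_prompt = """
Evaluate the coherence of the summary using these steps:
1. Read the original text and identify its main points.
2. Check whether the summary presents those points in a logical order.
3. Check whether consecutive sentences in the summary connect to each other.
4. Based on steps 1-3, give a score from 1 (incoherent) to 5 (fully coherent),
   and briefly explain your reasoning before stating the final score.

Original text:
{input}

Summary:
{llm_output}
"""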

In fact, this is the technique G-Eval uses, which its authors call auto chain-of-thought (auto-CoT). It is also implemented in DeepEval; here is how to use it:


from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

test_case = LLMTestCase(input="input to your LLM", actual_output="your LLM output")
coherence_metric = GEval(
    name="Coherence",
    criteria="Coherence - the collective quality of all sentences in the actual output",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)

coherence_metric.measure(test_case)
print(coherence_metric.score, coherence_metric.reason)

Few-Shot Prompting

The idea of few-shot prompting is simple: add a few examples to the prompt to better guide the LLM's judgment. Because the number of input tokens grows, it costs a bit more compute, but research shows that few-shot prompting can raise GPT-4's judging consistency from 65.0% to 77.5%.

There is not much more to say about few-shot prompting. If you have experimented with different prompt templates, you will know that adding a few examples is one of the most effective ways to steer an LLM toward the output you want; a hypothetical example follows.
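Here is a hypothetical few-shot judge prompt for answer relevancy; the examples and wording are my own:

few_shot_judge_prompt = """
Score the relevance of the answer to the question on a scale of 1 to 5.

Example 1
Question: What is the capital of France?
Answer: Paris is the capital of France.
Score: 5

Example 2
Question: What is the capital of France?
Answer: France is famous for its cheese and wine.
Score: 2

Now score the following:
Question: {input}
Answer: {llm_output}
Score:
"""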

Using output token probabilities

To make the computed evaluation scores more continuous, instead of asking the judge LLM to output scores on an ever finer scale (which leads to arbitrary metric scores), we can sample the score 20 times and use the probabilities of the output tokens to normalize the result by computing their weighted sum. This minimizes bias in LLM scoring and makes the final metric score smoother and more continuous without sacrificing accuracy.

It is worth mentioning that DeepEval's G-Eval implementation also uses this method.
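As a rough sketch of the idea, assuming your model API exposes the log-probabilities of the score tokens (for example via an OpenAI-style logprobs option), the final score becomes the probability-weighted average over the candidate scores rather than a single sampled token:

import math

def weighted_score(score_logprobs: dict[str, float]) -> float:
    # score_logprobs maps candidate score tokens ("1".."5") to their log-probabilities,
    # taken at the position where the score token is generated.
    probs = {int(token): math.exp(lp) for token, lp in score_logprobs.items()}
    total = sum(probs.values())
    # Normalize and take the expectation: a smoother, more continuous score
    # than picking the single most likely token.
    return sum(score * p for score, p in probs.items()) / total

# Example: the judge puts most of its mass on 4, with some on 3 and 5.
print(weighted_score({"3": math.log(0.2), "4": math.log(0.6), "5": math.log(0.2)}))  # ~4.0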

Reference-Guided Evaluation

Compared with single-output judging without a reference, giving the judge LLM an expected output as the ideal answer helps it match human expectations better. In the prompt, you can simply include the expected output as one of the few-shot examples.

Limiting the scope of LLM assessment

Instead of having the LLM evaluate the entire generated output in one go, consider breaking the evaluation into finer-grained pieces. For example, question-answer generation (QAG) is a powerful technique for computing non-arbitrary scores: it derives evaluation metric scores from "yes/no" answers to closed-ended questions. To compute the answer relevancy of an LLM output with respect to a given input, you first extract all the sentences in the output, then determine what proportion of them are relevant to the input; that proportion is the final answer relevancy score. To some extent, the directed acyclic graph (DAG) discussed earlier also uses QAG, especially at nodes that require binary judgments.

QAG is powerful because it makes LLM scoring less arbitrary and lets every score map back to a mathematical formula. Feeding the judge individual sentences instead of the entire LLM output also reduces the amount of text it has to analyze, which helps with hallucination. A minimal sketch follows.
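This is not DeepEval's implementation; judge_is_relevant stands in for a single closed-ended ("yes/no") LLM call:

def answer_relevancy(input_text: str, llm_output: str, judge_is_relevant) -> float:
    # Naive sentence splitting for illustration; a real implementation would be more careful.
    sentences = [s.strip() for s in llm_output.split(".") if s.strip()]
    if not sentences:
        return 0.0
    # Each judge call answers a closed question: is this sentence relevant to the input?
    relevant = sum(1 for s in sentences if judge_is_relevant(input_text, s))
    # The score maps to a simple formula instead of an arbitrary LLM-chosen number.
    return relevant / len(sentences)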

Swap positions

This one is not rocket science: simply swap the positions of the two outputs in the pairwise judgment and only declare an answer the winner if it is preferred in both orders. This addresses the position bias problem.
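A small sketch of the swap, where judge_pairwise(input, first, second) is assumed to return "first", "second", or "tie":

def consistent_winner(input_text: str, output_a: str, output_b: str, judge_pairwise) -> str:
    # Ask the judge twice, swapping which output is shown first.
    verdict_1 = judge_pairwise(input_text, output_a, output_b)  # A shown first
    verdict_2 = judge_pairwise(input_text, output_b, output_a)  # B shown first
    if verdict_1 == "first" and verdict_2 == "second":
        return "A"  # A preferred in both orders
    if verdict_1 == "second" and verdict_2 == "first":
        return "B"  # B preferred in both orders
    return "tie"    # inconsistent or tied verdicts: no winner declared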

Fine-tuning

If you need a more domain-specific LLM judge, consider fine-tuning an open-source model such as Llama-3.1. Fine-tuning is also a good option if you want LLM evaluation to be faster and cheaper.

Using LLM Judges in LLM Evaluation Metrics

Finally, the most widespread application of LLM judges today is as the scorer inside LLM evaluation metrics used to evaluate LLM systems.

A good LLM evaluation metric implementation uses all of the techniques above to optimize its LLM judge scorer. Taking DeepEval as an example: in RAG metrics such as contextual precision, we use question-answer generation (QAG) to limit the scope of LLM evaluation; in custom metrics such as G-Eval, we use auto chain-of-thought (auto-CoT) and normalize output token probabilities; and, most importantly, we use few-shot prompting in all metrics to cover edge cases.

To wrap up, I will show how to call DeepEval's metrics with just a few lines of code. All of the implementation code can be found on DeepEval's GitHub, which is completely free and open source.

Coherence Assessment

As we have seen several times already, G-Eval lets you implement a custom coherence evaluation metric:

from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

test_case = LLMTestCase(input="input to your LLM", actual_output="your LLM output")
coherence_metric = GEval(
    name="Coherence",
    criteria="Coherence - the collective quality of all sentences in the actual output",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    verbose_mode=True,
)

coherence_metric.measure(test_case)
print(coherence_metric.score, coherence_metric.reason)

Note that we have enabled G-Eval's verbose_mode. When verbose mode is turned on in DeepEval, it prints the inner workings of the LLM judge, so you can see all of the intermediate judging steps.

Text Summarization Evaluation

Next up is summarization evaluation. I like this example because in this scenario users usually have very clear evaluation criteria; format requirements, for instance, matter a lot. Here we will use DeepEval's DAG metric, but with a twist: unlike the DAG code shown earlier, we first use the DAG to automatically assign a score of 0 to summaries that do not meet the format requirements, and then use G-Eval as a leaf node in the DAG to produce the final score. This way the final score is not hard-coded, while the summary is still guaranteed to meet the format requirements.

First, create the DAG structure:

from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics.dag import (
    DeepAcyclicGraph,
    TaskNode,
    BinaryJudgementNode,
    NonBinaryJudgementNode,
    VerdictNode,
)
from deepeval.metrics import DAGMetric, GEval

g_eval_summarization = GEval(
    name="Summarization",
    criteria="Determine how good a summary the 'actual output' is to the 'input'",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

correct_order_node = NonBinaryJudgementNode(
    criteria="Are the summary headings in the correct order: 'intro' => 'body' => 'conclusion'?",
    children=[
        VerdictNode(verdict="Yes", g_eval=g_eval_summarization),
        VerdictNode(verdict="Two are out of order", score=0),
        VerdictNode(verdict="All out of order", score=0),
    ],
)

correct_headings_node = BinaryJudgementNode(
    criteria="Do the summary headings contain all three: 'intro', 'body', and 'conclusion'?",
    children=[
        VerdictNode(verdict=False, score=0),
        VerdictNode(verdict=True, child=correct_order_node),
    ],
)

extract_headings_node = TaskNode(
    instructions="Extract all headings in `actual_output`",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    output_label="Summary headings",
    children=[correct_headings_node, correct_order_node],
)

# create the DAG
dag = DeepAcyclicGraph(root_nodes=[extract_headings_node])

Then, create the DAG metric from this DAG and run the evaluation:

from deepeval.test_case import LLMTestCase
...

# create the metric
summarization = DAGMetric(name="Summarization", dag=dag)

# create a test case for summarization
test_case = LLMTestCase(input="your-original-text", actual_output="your-summary")

# evaluate
summarization.measure(test_case)
print(summarization.score, summarization.reason)

As the DAG structure shows, any summary with an incorrect format immediately receives a score of 0, and only summaries that pass the format checks are scored by G-Eval.

Contextual Precision Evaluation

Contextual precision is a RAG metric that checks whether the nodes retrieved in your RAG pipeline are ranked in the right order. This matters because LLMs tend to pay more attention to nodes near the end of the prompt (recency bias). Contextual precision is computed via question-answer generation (QAG): the LLM judge decides, for each node, whether it is relevant to the input, and the final score is the weighted cumulative precision.
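As a sketch of the weighted cumulative precision calculation (the relevance verdicts are assumed to come from the LLM judge; this is an illustration, not DeepEval's source code):

def contextual_precision(relevance: list[bool]) -> float:
    # relevance[k] is the judge's verdict on whether the node at rank k is relevant to the input.
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    relevant_so_far = 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_so_far += 1
            # precision@k, counted only at positions holding a relevant node
            score += relevant_so_far / k
    return score / total_relevant

print(contextual_precision([True, False, True]))   # ~0.83: relevant nodes ranked high
print(contextual_precision([False, True, True]))   # lower: relevant nodes pushed down the ranking

In DeepEval, this is available out of the box as ContextualPrecisionMetric: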

from deepeval.metrics import ContextualPrecisionMetric
from deepeval.test_case import LLMTestCase

metric = ContextualPrecisionMetric()
test_case = LLMTestCase(
    input="...",
    actual_output="...",
    expected_output="...",
    retrieval_context=["...", "..."],
)

metric.measure(test_case)
print(metric.score, metric.reason)

Summary

That's it! There is a lot to cover when using LLMs as judges, but we now at least have a clear picture of the different types of LLM judges, their role in LLM evaluation, their strengths and weaknesses, and how to improve their performance.

The main goal of an LLM evaluation metric is to quantify the performance of an LLM (application), and that requires scorers. Right now, the LLM judge is the best choice available. Of course, LLM judges have shortcomings of their own, such as potential bias in their evaluations, but these can be addressed with prompt engineering methods such as chain-of-thought (CoT) prompting and few-shot prompting.

- END -