How to Test AI Applications | An Easy Start to LLM Evaluation

Written by Iris Vance
Updated on: June 30, 2025

This article explains how to evaluate LLM output in AI applications, from first principles through to practice, and serves as a practical guide for developers and engineers.

Core content:
1. How LLMs work and why they are hard to evaluate
2. Different ways to evaluate LLM output with Python
3. Using DeepEval to evaluate LLM applications in practice


Most developers build LLM applications without setting up an automated evaluation process, even though skipping it can let breaking changes slip in unnoticed, precisely because evaluation itself is hard. In this article, you'll learn how to properly evaluate LLM output.

Table of contents

  • What is an LLM and why is it so difficult to evaluate?

  • Different ways to evaluate LLM output using Python

  • How to use DeepEval to evaluate LLMs

What is an LLM and why is it so difficult to evaluate?

To understand why LLMs are difficult to evaluate and are often referred to as "black boxes," we first need to break down what they are and how they work.

Take GPT-4 as an example. This large language model (LLM) was trained on massive amounts of data (roughly 300 billion words) drawn from articles, tweets, Reddit's r/tifu, Stack Overflow, how-to guides, and other scraped internet content.

The "Generative Pre-trained Transformer" in "GPT" refers to a neural network architecture that is good at predicting the next token (for GPT-4, one token is roughly 4 characters; the exact length depends on the encoding strategy).
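If you want to see tokenization in action, here is a minimal sketch using OpenAI's tiktoken library (assuming it is installed with pip install tiktoken); the example sentence is arbitrary:

import tiktoken

# Load the tokenizer OpenAI uses for GPT-4
enc = tiktoken.encoding_for_model("gpt-4")

text = "Evaluating LLM output is harder than it looks."
token_ids = enc.encode(text)

print(len(text), "characters ->", len(token_ids), "tokens")
print([enc.decode([t]) for t in token_ids])  # the text split into its tokens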

In reality, LLMs don't really "know" anything; they are trained to recognize patterns in language and thereby become adept at generating plausible responses.

Key takeaway: an LLM generates the "best" next token through probabilistic prediction. This non-deterministic nature makes its output diverse, and that is what makes evaluation complicated: there is often more than one reasonable answer.
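To make "probabilistic next-token prediction" concrete, here is a minimal sketch using the Hugging Face transformers library. GPT-4's weights are not public, so GPT-2 stands in purely to illustrate the idea:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 stands in for GPT-4: same mechanism, publicly available weights
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

# The model assigns a probability to every possible next token
probs = torch.softmax(next_token_logits, dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx)!r}: {p.item():.3f}")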

Why do you need to evaluate LLM applications?

Common scenarios for LLM applications include:

  • Chatbots: customer service, virtual assistants, conversational agents

  • Code assistants: code completion, error correction, debugging

  • Legal document analysis: quickly analyze contracts and legal texts

  • Personalized email drafting: generate emails based on context, recipient, and tone

LLM applications often have one thing in common: they become far more useful when combined with proprietary data. For example, you might want to build an internal chatbot that improves employee productivity, and OpenAI obviously will not (and should not) have access to your company's internal data.
The key point is that ensuring an LLM application produces the desired output is no longer only OpenAI's responsibility (such as guaranteeing GPT-4's baseline performance); it is also the developer's. You achieve this through prompt template optimization, the data retrieval pipeline, model architecture selection, and so on.

The purpose of evaluation is to quantify how well the application handles its task. Without an evaluation mechanism, breaking changes can slip in during iteration, and manually checking every output is a recipe for disaster.

Evaluation methods that do not rely on LLMs

An effective way to evaluate LLM output is to use other machine learning models from the NLP field. Despite the non-deterministic nature of LLM output, a purpose-built model can still score that output on a number of metrics, such as factual correctness, relevance, bias, and helpfulness.

For example, a natural language inference (NLI) model, which outputs an entailment score, can be used to assess the factual correctness of an answer against a given context. The higher the entailment score, the more factually grounded the output is, which is particularly useful for evaluating long texts whose factuality is unclear.

You might be wondering: how do these models "know" whether a text is factually correct? You provide them with context (called "ground truth" or a "reference answer") to check against. The collection of these contexts is called a test dataset.
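To make this concrete, here is a minimal sketch using the Hugging Face transformers library with an off-the-shelf NLI cross-encoder (the model choice is an assumption; any NLI model works similarly). The context serves as the premise and the LLM output as the hypothesis:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# An off-the-shelf NLI model; any entailment model can be swapped in
model_name = "cross-encoder/nli-deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

context = "All customers are eligible for a 30 day full refund at no extra costs."
llm_output = "We offer a 30-day full refund at no extra costs."

inputs = tokenizer(context, llm_output, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

# Read label names from the model config instead of hard-coding their order
scores = {model.config.id2label[i].lower(): p.item() for i, p in enumerate(probs)}
print(f"entailment score: {scores['entailment']:.3f}")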

But not all metrics require references. Relevance, for example, can be computed with a cross-encoder model (another kind of ML model) that only needs the input and the output to judge how related the two are.
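For instance, here is a minimal reference-free relevance check using the sentence-transformers library (again, the model choice is an assumption; any query-passage cross-encoder works similarly):

from sentence_transformers import CrossEncoder

# A cross-encoder trained for query-passage relevance (MS MARCO)
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

question = "What does your company do?"
llm_output = "Our company specializes in cloud computing."

# No reference answer needed: the model scores the (input, output) pair directly
score = model.predict([(question, llm_output)])[0]
print(f"relevance score: {score:.3f}")  # higher means more relevant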

The following is a list of reference-free metrics:

  • Relevancy

  • Summarization

  • Bias

  • Toxicity

  • Helpfulness

  • Harmlessness

  • Coherence

The following is a list of reference-based metrics:

  • Hallucination

  • Semantic Similarity

Note: reference-based metrics do not require the original input, since they score the output only against the provided context.

Using LLMs for evaluation

An emerging trend is to use state-of-the-art LLMs (such as GPT-4) to evaluate themselves or other models.

G-Eval: An LLM-based evaluation framework
The G-Eval research paper illustrates the approach with a diagram, but in short, the process has two parts: first generate the evaluation steps, then produce a final score based on those steps.

How it works in practice
Step 1: Generate evaluation steps

  1. Tell GPT-4 the task (e.g., "rate the summary on a scale of 1-5 based on relevance").

  2. Spell out the scoring criteria (e.g., "relevance is based on the combined quality of all sentences").

Step 2: Generate ratings

  1. Combine the input, the evaluation steps, the context, and the actual output.

  2. Ask the model to produce a score from 1 to 5 (5 being the best).

  3. (Optional) Extract the probabilities of the output tokens and normalize the score by a weighted sum.

A practical complication with step 3: to obtain the probabilities of the output tokens, you usually need access to the raw model output (not just the final generated text). The paper includes this step because it yields a more fine-grained score that reflects output quality more accurately.
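Here is a minimal sketch of the two-step flow using the openai Python client. The prompts, model name, and placeholder texts are illustrative assumptions rather than the exact prompts from the paper, and the optional token-probability weighting is omitted:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

task = "Rate the summary on a scale of 1-5 based on relevance."
criteria = "Relevance is based on the combined quality of all sentences."

# Step 1: have the model generate its own evaluation steps
steps = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"{task}\nCriteria: {criteria}\n"
                   "List the evaluation steps you would follow, one per line.",
    }],
).choices[0].message.content

# Step 2: combine the steps with the input and actual output, then ask for a score
document = "..."  # the source text being summarized (placeholder)
summary = "..."   # the LLM-generated summary under evaluation (placeholder)
score = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"Evaluation steps:\n{steps}\n\n"
                   f"Source document:\n{document}\n\nSummary:\n{summary}\n\n"
                   "Follow the steps above and return only a single score from 1 to 5.",
    }],
).choices[0].message.content

print(score)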


When paired with GPT-4, G-Eval outperforms traditional evaluation methods on coherence, consistency, fluency, and relevance. Keep in mind, however, that LLM-based evaluation is relatively expensive. A sensible approach is to use G-Eval to establish a performance baseline, then switch to more cost-effective metrics where they suffice.

Evaluating LLM output with Python

At this point, you’re probably overwhelmed by the terminology and definitely don’t want to implement everything from scratch. Imagine having to research the best way to calculate each metric, train a dedicated model, and build an evaluation framework…

Fortunately, open source tools such as ragas and DeepEval provide ready-made evaluation frameworks, so you don't have to build all of this yourself.

As the co-founder of Confident (the company behind DeepEval), I will boldly show you how to use DeepEval to unit test your LLM application in your CI/CD pipeline (we provide a Pytest-like developer experience, easy configuration, and a free visualization platform).

Let's wrap up by working through the code.

Setting up the test environment

To set up the evaluation workflow, follow these steps.

Create a project folder and initialize a Python virtual environment (run in terminal):

mkdir evals-example
cd evals-example
python3 -m venv venv
source venv/bin/activate

Your terminal should now display the following prompt:

(venv)

Install the dependencies:

pip install deepeval

Setting up an OpenAI API key

Finally, set your OpenAI API key as an environment variable. G-Eval will use it to call the OpenAI API (it is, after all, LLM-based evaluation). Paste the following command into the terminal, replacing the placeholder with your own key:

export OPENAI_API_KEY="your-api-key-here"

Writing the first test file

Create a file called test_evals.py (note that test file names must start with "test"):

touch test_evals.py

Paste in the following code:

from deepeval.metrics import GEval, HallucinationMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval import assert_test

def test_hallucination():
    # Checks the output against the provided context for factual consistency
    hallucination_metric = HallucinationMetric(minimum_score=0.5)
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra costs.",
        context=["All customers are eligible for a 30 day full refund at no extra costs."]
    )
    assert_test(test_case, [hallucination_metric])

def test_relevancy():
    # Checks how relevant the output is to the input; no reference needed
    answer_relevancy_metric = AnswerRelevancyMetric(minimum_score=0.5)
    test_case = LLMTestCase(
        input="What does your company do?",
        actual_output="Our company specializes in cloud computing"
    )
    assert_test(test_case, [answer_relevancy_metric])

def test_humor():
    # A custom G-Eval metric: an LLM scores the output against the criteria
    funny_metric = GEval(
        name="Humor",
        criteria="How funny it is",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]
    )
    test_case = LLMTestCase(
        input="Write me something funny related to programming",
        actual_output="Why did the programmer quit his job? Because he didn't get arrays!"
    )
    assert_test(test_case, [funny_metric])

Run the tests:

deepeval test run test_evals.py

Each test case is scored by one of DeepEval's predefined metrics, each of which outputs a score between 0 and 1. For example, HallucinationMetric(minimum_score=0.5) evaluates the factual correctness of the output, and minimum_score=0.5 means the test only passes if the score exceeds 0.5.

Going through the test cases one by one:

  • test_hallucination: evaluates the factual correctness of the LLM output relative to the provided context.

  • test_relevancy: evaluates how relevant the output is to the input.

  • test_humor: evaluates how funny the LLM output is (a G-Eval metric, i.e., an LLM does the evaluating).

Parameter description: a single test case can contain up to four dynamic parameters:

  • Input

  • Expected output

  • Actual output (produced by your application)

  • Context (background information used to generate the actual output)

Different metrics require different parameters; some are mandatory and some are optional.

Finally, here is how you can aggregate multiple metrics in one test case:

def test_everything():
    # Re-use the metrics from above (instantiated here so the test is self-contained)
    hallucination_metric = HallucinationMetric(minimum_score=0.5)
    relevancy_metric = AnswerRelevancyMetric(minimum_score=0.5)
    humor_metric = GEval(name="Humor", criteria="How funny it is",
                         evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT])
    test_case = LLMTestCase(
        input="What did the cat do?",
        actual_output="The cat climbed up the tree",
        context=["The cat ran up the tree."],
        expected_output="The cat ran up the tree."
    )
    assert_test(test_case, [hallucination_metric, relevancy_metric, humor_metric])

Not that hard, right? By writing enough tests (10-20), you can significantly improve your control over your application.

Additional features: DeepEval supports unit testing of LLM applications in CI/CD pipelines.

Alternatively, you can use the DeepEval free platform with the following command:

deepeval login

Follow the instructions (log in, get your API key, paste into the CLI), then rerun the test:

deepeval test run test_evals.py

Summary

In this article you have learned:

  • How LLMs work

  • LLM application examples

  • Difficulties in evaluating LLM output

  • Unit testing with DeepEval

Through evaluation, you can:
✅ Avoid introducing breaking changes to your LLM application
✅ Iterate quickly and optimize the metrics that matter
✅ Have confidence in the LLM applications you build