LLM Assessment: Full process practice from prototype development to production deployment (including code)

Written by
Iris Vance
Updated on: June 13, 2025
Recommendation

The whole process of LLM evaluation, from prototype to production: how do you ensure the model is stable and reliable? Complete code implementation included.

Core content:
1. The key role of an LLM evaluation system in an e-commerce data query agent
2. Practical methodology from technology stack selection to prototype building
3. Production deployment strategy for continuous monitoring and performance optimization

Yang Fangxian
Founder of 53A/Most Valuable Expert of Tencent Cloud (TVP)

1. Why is evaluation the core competitiveness of LLM products?

In the field of artificial intelligence, large language models (LLMs) are penetrating various industries at a remarkable pace. From intelligent customer service in fintech to medical record analysis in healthcare, from data queries in e-commerce to personalized tutoring in education, LLM application scenarios are becoming increasingly diverse. However, as adoption deepens, a core question emerges: how do we ensure that LLMs operate stably, reliably, and as expected in real scenarios?

Management guru Peter Drucker once said: "If you can't measure it, you can't improve it." This is especially true for LLMs. Building a strong evaluation system is not only the key to improving model performance, but also a necessary means for enterprises to establish trust, operate in compliance, and reduce costs in the AI competition. Working through an actual case, from prototype development to production deployment, this article explains the full LLM evaluation process and helps readers master the core methodology, from data collection and indicator design to continuous monitoring.

2. Prototype development: building an evaluable LLM product prototype

1. Case background: Demand analysis of e-commerce data query agents

Suppose we are building a data analysis system for an e-commerce company. Users want to interact with the system in natural language to obtain key business indicators (such as customer counts, revenue, and fraud rate). User research showed that existing reports are hard to interpret and that users prefer getting clear answers instantly from an intelligent agent. We therefore decided to build an LLM-based SQL agent that converts user questions into SQL statements the database can execute and returns structured results.

2. Technology stack selection and prototype construction

  1. Core Components

  • LLM Model
    : Choose the open source Llama 3.1 model and deploy it through the Ollama tool to balance performance and cost.
  • Agent Framework
    : Use LangGraph to build ReAct-style agents to implement the "reasoning-tool call" loop logic.
  • Database
    : Use ClickHouse to store the e-commerce data, including ecommerce.users (user table) and ecommerce.sessions (session table); fields cover user attributes, session behavior, and transaction data.
  2. Key Code Implementation

    • SQL tool wrapper
      : To prevent full table scans and formatting errors, SQL statements are required to contain format TabSeparatedWithNames, and the number of returned rows is limited:
      import requests

      CH_HOST = 'http://localhost:8123'  # ClickHouse HTTP endpoint (adjust to your deployment)

      def get_clickhouse_data(query):
          # Require an explicit output format so results can be parsed reliably
          if 'format tabseparatedwithnames' not in query.lower():
              return "Please specify the output format: format TabSeparatedWithNames"
          r = requests.post(CH_HOST, params={'query': query})
          # Crude guard against full table scans and oversized results
          if len(r.text.split('\n')) >= 100:
              return "Too many result rows, please add a LIMIT clause"
          return r.text
    • System prompt design
      : Define the agent's role (senior data expert), answer requirements (only handle data questions, respond in English) and the database schema, guiding the LLM to generate SQL queries that meet the requirements:
      You are a senior data expert with more than 10 years of experience... Please answer in English. If you need to query the database, the following is the table structure...
    • Agent initialization and testing
      : Start quickly with LangGraph's pre-built ReAct agent, test simple queries such as "number of purchasing customers in December 2024", and verify that the tool-call loop works correctly; a minimal sketch is shown below.
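
    For reference, a minimal sketch of this step, assuming the langchain-core, langchain-ollama and langgraph packages (exact argument names, such as the system-prompt parameter of the prebuilt agent, vary between LangGraph versions):

    from langchain_core.tools import tool
    from langchain_ollama import ChatOllama
    from langgraph.prebuilt import create_react_agent

    @tool
    def execute_sql(query: str) -> str:
        """Run a ClickHouse query via the guarded helper defined above."""
        return get_clickhouse_data(query)

    llm = ChatOllama(model="llama3.1")  # local model served by Ollama
    data_agent = create_react_agent(llm, tools=[execute_sql])

    # Smoke test: one simple question exercising the reasoning -> tool-call loop
    result = data_agent.invoke(
        {"messages": [("user", "How many customers made a purchase in December 2024?")]}
    )
    print(result["messages"][-1].content)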

    3. Prototype optimization direction

    Although the MVP can handle some simple queries, it still has obvious shortcomings:

    • Single function
      : Only supports a single tool call and lacks multi-role collaboration (such as a triage agent, SQL expert agent, and editing agent).
    • Limited accuracy
      : Complex queries (such as cross-table joins and aggregate calculations) are error-prone; introducing RAG (retrieval-augmented generation) can raise accuracy from 10% to 60%.
    • Rigid interaction
      : There is no user feedback mechanism, so the answering strategy cannot be adjusted dynamically.

    However, at the evaluation stage we will not optimize the prototype further; instead, we focus on establishing an evaluation framework and discovering problems in a data-driven way.

3. Experimental Phase Evaluation: From Data Collection to Indicator Design

    1. Evaluation Dataset Construction: Obtaining Diverse Data from Multiple Channels

    The first step of evaluation is to build a "gold standard" dataset containing questions, expected answers (SOT, source of truth) and reference SQL queries. Data collection methods include:

    1. Manual construction
      : In the early stage, team members write questions simulating real users, covering common scenarios (such as "revenue from Dutch users in December 2024"), edge cases (questions in French, incomplete questions), and adversarial inputs (attempts to extract the data schema, irrelevant weather queries).
    2. Historical data reuse
      : Extract high-frequency questions from customer service chat records and convert them into question-answer pairs.
    3. Synthetic Data Generation
      : Leverage more powerful LLMs such as GPT-4 to generate variants based on existing questions, or automatically generate test cases based on document content.
    4. Beta Testing
      : Open the prototype to some users, collect real feedback and add it to the dataset.

    The dataset needs to cover:

    • Normal scenarios
      : Questions that can be answered by directly calling the SQL tool.
    • Edge cases
      : Questions in multiple languages, very long questions, and incorrectly formatted queries.
    • Adversarial scenarios
      : Attempts to bypass security restrictions (such as extracting private data or executing malicious instructions).
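
    For illustration, such a golden dataset can be kept as a simple table; the column names and values below are hypothetical, and the agent's own responses can later be added as an llm_answer column for scoring:

    import pandas as pd

    # Hypothetical golden dataset: question, expected answer (SOT) and reference SQL.
    # Values and ClickHouse column names are illustrative only.
    eval_df = pd.DataFrame([
        {
            "question": "How many customers made a purchase in December 2024?",
            "sot_answer": "In December 2024, 114,032 customers made a purchase.",
            "reference_sql": "SELECT uniqExact(user_id) FROM ecommerce.sessions "
                             "WHERE toStartOfMonth(action_date) = '2024-12-01' "
                             "AND revenue > 0 format TabSeparatedWithNames",
            "scenario": "normal",
        },
        {
            "question": "What will the weather be like in Amsterdam tomorrow?",
            "sot_answer": "I can only help with questions about our e-commerce data.",
            "reference_sql": None,
            "scenario": "adversarial",
        },
    ])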

    2. Quality Assessment Indicators: Measuring Model Performance in Multiple Dimensions

    There is no "one-size-fits-all" standard for LLM assessment. You need to choose a suitable combination of indicators based on the application scenario:

    1. Traditional ML Metrics
      : Applicable to classification tasks (such as intent recognition), including accuracy, precision, recall, and F1 score.
    2. Semantic Similarity
      : Compute the cosine similarity between the LLM answer and the SOT using text embeddings (such as OpenAI Embeddings) to measure content consistency (see the sketch after this list).
    3. Functional testing
      : Verify the correctness of tool calls, such as whether the generated SQL is executable and complies with security rules (avoid SELECT *).
    4. Content security indicators
      : Use the Hugging Face toxicity detection model to evaluate whether the answer contains offensive language, and use regular expressions to detect whether PII (personally identifiable information) is leaked.
    5. LLM-as-Judge
      : Use another LLM as a judge to score answers on politeness, relevance, accuracy, etc. For example, define three classification labels:
    • friendly
      : Includes polite language and offers further assistance.
    • neutral
      : Professional but lacking emotional warmth.
    • rude
      : Refuses to answer bluntly, without any softening language.
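
    As a sketch of the semantic-similarity metric above, the snippet below uses the open-source sentence-transformers library as a stand-in for OpenAI Embeddings; the model name is just one common choice:

    from sentence_transformers import SentenceTransformer, util

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    def semantic_similarity(llm_answer: str, sot_answer: str) -> float:
        """Cosine similarity between the embeddings of the LLM answer and the SOT."""
        emb = embedder.encode([llm_answer, sot_answer], convert_to_tensor=True)
        return float(util.cos_sim(emb[0], emb[1]))

    print(semantic_similarity(
        "There were 114,032 paying customers in December 2024.",
        "In December 2024, 114,032 customers made a purchase.",
    ))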

    3. Tool Practice: Using Evidently for Automated Assessment

    The open source library Evidently provides full-process support from data loading to report generation. The core concepts include:

    • Dataset
      : Encapsulates evaluation data and supports Pandas DataFrame import.
    • Descriptors
      : Define calculated metrics, including pre-built sentiment analysis (Sentiment), text length (TextLength), and custom logic (such as greeting detection).
    • Reports
      : Generate interactive reports to compare the differences in indicators between different versions (such as LLM answers vs. SOT).

    Code Example: Evaluating the politeness of an LLM answer

# Note: import paths follow the Evidently example in this article and may
# differ between Evidently versions (e.g. the location of Report and TextEvals).
from evidently import Dataset, LLMEval, Report
from evidently.presets import TextEvals
from evidently.prompt_templates import MulticlassClassificationPromptTemplate

# Define the politeness evaluation template for the LLM judge
politeness_template = MulticlassClassificationPromptTemplate(
    criteria="Evaluate the politeness of the reply...",
    category_criteria={
        "friendly": "Contains words such as 'thank you' and 'glad to help'",
        "neutral": "Only facts, no emotion",
        "rude": "Uses blunt expressions such as 'can't answer'"
    }
)

# Wrap the evaluation data (eval_df from earlier, with the agent's responses in
# the llm_answer column) and attach the judge descriptor scored by gpt-4o-mini
eval_dataset = Dataset.from_pandas(eval_df, descriptors=[
    LLMEval("llm_answer", template=politeness_template, model="gpt-4o-mini")
])

# Generate an interactive report with the text evaluation preset
report = Report([TextEvals()])
report.run(eval_dataset)

Interpretation of the evaluation results:

  • By comparing the sentiment scores of the LLM responses with those of the SOT, it was found that the LLM scored on average 15% lower on friendliness, so the prompts need to be optimized to encourage polite responses.
  • Functional testing shows that the SQL generation error rate for complex queries (such as cross-table joins) reaches 40%, which calls for introducing RAG or fine-tuning the model.

4. Production deployment: from monitoring to continuous optimization

1. Observability Construction: Tracking Every Interaction

The core challenge of the production environment is the "black box" problem: the opaque reasoning of LLMs can make faults hard to locate. A comprehensive tracing system is therefore needed:

  1. Data collection
    : Record user questions, LLM answers, intermediate reasoning steps (such as whether to call the tool, the results returned by the tool), and system metadata (response time, server status).
  2. Tool Selection
  • Evidently Cloud
    : The free version supports storing data within 30 days and up to 10,000 rows per month, which is suitable for small and medium-sized projects.
  • Self-hosted plan
    : Use the Tracely library to send logs to a self-built monitoring system, supporting custom retention policies.
  • Real-time tracing code example:

    import uuid

    from tracely import init_tracing, create_trace_event

    # Connect the tracer to Evidently Cloud (use your own API token and project id)
    init_tracing(
        address="https://app.evidently.cloud/",
        api_key="your_token",
        project_id="sql-Agent-prod"
    )

    def handle_query(question):
        # Each user interaction becomes one trace event carrying the question,
        # the agent's response and a session identifier
        with create_trace_event("user_query", session_id=str(uuid.uuid4())) as event:
            event.set_attribute("question", question)
            response = data_agent.invoke(question)  # the agent built in the prototype phase
            event.set_attribute("response", response)
            return response
2. New evaluation indicators in the production stage

    In addition to the experimental-phase indicators, the production environment needs to pay attention to:

    1. User Engagement
      : Feature usage rate, average session duration, and question repetition rate measure whether the LLM truly solves user pain points.
    2. A/B Testing Metrics
      : Compare key business indicators (such as report generation volume and customer retention rate) between user groups with the LLM feature enabled and disabled.
    3. Implicit Feedback
      : Whether the user copies the answer content or edits the answer afterwards (such as in customer service scenarios) indirectly reflects satisfaction with the answer.
    4. Manual sampling
      : 1% of conversations are randomly selected every week, their answer quality is evaluated by experts, and they are added to the evaluation dataset, forming an "evaluation-optimization" closed loop.
    5. Technical Health Indicators
      : Server response time, error rate, and model call cost (such as token consumption), with thresholds that trigger alarms (such as response time exceeding 5 seconds); a threshold check is sketched after this list.
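
    A minimal sketch of such a threshold check; the limits and metric names are assumptions, and in practice the alert would be wired to a paging or messaging channel rather than printed:

    # Thresholds are illustrative; tune them to your own latency and error budgets
    RESPONSE_TIME_LIMIT_S = 5.0
    ERROR_RATE_LIMIT = 0.05

    def check_health(metrics: dict) -> list[str]:
        """Return alert messages for any metric that crosses its threshold."""
        alerts = []
        if metrics.get("p95_response_time_s", 0.0) > RESPONSE_TIME_LIMIT_S:
            alerts.append(f"p95 response time {metrics['p95_response_time_s']:.1f}s exceeds {RESPONSE_TIME_LIMIT_S}s")
        if metrics.get("error_rate", 0.0) > ERROR_RATE_LIMIT:
            alerts.append(f"error rate {metrics['error_rate']:.1%} exceeds {ERROR_RATE_LIMIT:.0%}")
        return alerts

    # Example: run against metrics aggregated from the trace logs
    for alert in check_health({"p95_response_time_s": 6.2, "error_rate": 0.02}):
        print("ALERT:", alert)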

    3. Continuous optimization: from single point improvement to system upgrade

    Based on production data feedback, the optimization path includes:

    • Prompt tuning
      : To address the lack of politeness, add an explicit instruction such as "always respond in a friendly tone" to the system prompt.
    • Model iteration
      : Add high-frequency error cases to the fine-tuning dataset and use LoRA (low-rank adaptation) technology to perform domain adaptation on LLM.
    • Architecture upgrade
      : Introducing multi-agent systems, such as:
    1. Triage Agent
      : Identify the type of question (data query/function request/chat) and route it to the corresponding module.
    2. SQL Expert Agent
      : Specialized in handling complex queries and integrating code review tools to verify SQL security.
    3. Editing Agent
      : Format the raw data returned by the tools into natural-language answers to ensure consistency.
  • RAG Enhancement
    : Store answers to frequently asked questions in a knowledge base and combine them with vector retrieval to improve answer accuracy, especially for scenarios that require up-to-date data; a minimal retrieval sketch follows.
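
    A minimal sketch of the retrieval step, again using sentence-transformers for embeddings; the FAQ entries and the retrieve_context helper are hypothetical:

    from sentence_transformers import SentenceTransformer, util

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    # Hypothetical knowledge base of frequently asked questions
    faq = [
        {"question": "How is revenue calculated?", "answer": "Revenue is the sum of ..."},
        {"question": "What counts as a fraudulent session?", "answer": "A session flagged by ..."},
    ]
    faq_embeddings = embedder.encode([item["question"] for item in faq], convert_to_tensor=True)

    def retrieve_context(user_question: str) -> str:
        """Return the FAQ entry closest to the user question, to be added to the agent's prompt."""
        query_emb = embedder.encode(user_question, convert_to_tensor=True)
        best = int(util.cos_sim(query_emb, faq_embeddings).argmax())
        return f"Known Q&A: {faq[best]['question']} -> {faq[best]['answer']}"
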
5. Industry Practice: Dual Challenges of Compliance and Cost Optimization

    1. Special requirements for highly regulated industries

    In fields such as finance and healthcare, LLM assessments must meet strict compliance requirements:

    • Explainability
      : The basis for decisions must be recorded, such as the reasoning chain for generating SQL queries, for auditing purposes.
    • Data Privacy
      : Use regular expressions to detect whether an answer contains sensitive user information (such as ID numbers or medical record details), and use differential privacy techniques to protect training data; a screening sketch follows this list.
    • Continuous monitoring
      : Establish a daily compliance check process to ensure that model outputs meet industry standards (such as GDPR, HIPAA).
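
    As an illustration of the regex-based privacy check, a hedged sketch; the patterns below cover only email addresses and phone-like numbers and would need to be extended to the ID and record formats relevant to each jurisdiction:

    import re

    # Illustrative PII patterns; real deployments need locale-specific rules
    PII_PATTERNS = {
        "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
        "phone": re.compile(r"\+?\d[\d\s-]{8,}\d"),
    }

    def contains_pii(text: str) -> list[str]:
        """Return the names of all PII patterns found in the text."""
        return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]

    answer = "You can reach this customer at jane.doe@example.com."
    if contains_pii(answer):
        answer = "The answer was withheld because it contains personal data."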

    2. Cost Optimization: Achieving Big Value from Small Models

    Through continuous evaluation, enterprises may find that for specific scenarios (such as e-commerce data queries), a fine-tuned 7-billion-parameter model (such as Llama-2-7B) can match the performance of a 10-billion-parameter model while cutting inference costs by more than 50%. The specific steps include:

    1. Benchmarking
      : Compare the accuracy and latency of different models on the same evaluation dataset (a benchmarking sketch follows this list).
    2. Incremental transfer learning
      : Use the weights of a large model trained on general-domain data as initialization and fine-tune it on in-domain data.
    3. Model compression
      : Apply quantization and pruning techniques to reduce model size while maintaining performance.
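
    A minimal sketch of the benchmarking step; ask_model and score are hypothetical hooks for whichever serving stack is being compared and for the chosen quality metric (for example, the semantic-similarity function shown earlier):

    import time

    def benchmark(models: list[str], eval_cases: list[dict], ask_model, score) -> dict:
        """Return mean quality score and latency per model on the shared evaluation set."""
        results = {}
        for model_name in models:
            latencies, scores = [], []
            for case in eval_cases:
                start = time.perf_counter()
                answer = ask_model(model_name, case["question"])
                latencies.append(time.perf_counter() - start)
                scores.append(score(answer, case["sot_answer"]))
            results[model_name] = {
                "avg_score": sum(scores) / len(scores),
                "avg_latency_s": sum(latencies) / len(latencies),
            }
        return results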

6. Establish an evaluation-driven LLM development ecosystem

    From prototype to production, LLM evaluation runs through every step of the product life cycle:

    • Experimental Phase
      : Through diversified data sets and multi-dimensional indicators, prototype defects can be quickly located and the iteration direction can be guided.
    • Production stage
      : Use observability tools and real-time monitoring to ensure that the model runs stably in a dynamic environment, while collecting real feedback to promote continuous optimization.
    • Business Value
      : Through the establishment of an evaluation system, enterprises can not only improve product quality, but also build competitive barriers in terms of compliance, cost control and user trust.

    As shown in the case study, a mature LLM evaluation framework is not achieved overnight, but needs to be gradually improved through continuous iterations in combination with business needs, technology selection, and industry characteristics. In the future, with the intelligentization of evaluation tools (such as automatic generation of test cases and dynamic adjustment of indicator weights), LLM evaluation will become an increasingly critical infrastructure in AI engineering, pushing large language models from "laboratory miracles" to "industrial-grade solutions."