Challenges of content evaluation on large models
When we develop a piece of code, we can test it accurately with test cases: a fixed input corresponds to a fixed output, and testing all inputs can cover all code paths. The inputs and outputs of large models, however, are natural language, so they naturally receive all kinds of unusual inputs. For example:
During product development, the large model's inputs and outputs must be debugged to determine whether its task-decomposition capability meets expectations.
After the business goes live, the large model's output quality in real scenarios must be evaluated: does it actually solve users' problems?
In daily operations, all inputs and outputs need to be audited to maintain compliance.
Given the above requirements, we need to be able to judge the inputs and outputs of large model applications; retrieve, analyze, and evaluate their content; and ensure that the application's overall behavior meets expectations.
Semantic analysis: understanding large model input and output from multiple perspectives
In order to better process the input and output logs of large models, better understand user needs, and evaluate the performance of large models, log management tools are required to have the ability to search, process, and analyze natural language, including:
Semantic enrichment: extract structured information from multiple angles, including user intent, topic, emotion, etc.
Vector search: one-stop vector search with integrated embedding and vector-index capabilities, ready to use out of the box. Beyond keyword and substring matching, it can search content by intent.
Hybrid search: Use keyword exact matching and vector approximate matching simultaneously to integrate multi-field search requirements.
Clustering: Perform cluster analysis on natural language from a higher perspective to identify hot spots and outliers.
(1) Semantic enrichment
In the RAG field, files must first be converted into structured Markdown and then split into chunks to build a vector index; this traditional document processing flow loses information. To compensate, multimodal features are extracted to build a multi-dimensional semantic feature space, observing LLM input and output from different perspectives, including:
User intent: the intent the user shows in the session, such as translation, technical consultation, legal consultation, or query and retrieval.
Topic: the topic of the conversation, such as education, cloud computing, or law.
Summary: original conversations often have complex context; the summary describes the conversation in one sentence, so users can search summaries to quickly find records.
Emotion: the emotion the user displays in the session: positive, negative, or neutral.
Keywords: keywords extracted from the conversation.
Question: several questions generated from the original conversation, to which that conversation is the answer; when a user later asks a similar question, the historical conversation can be recalled directly through this field.
Entity extraction: entities that appear in the conversation, such as countries, place names, and personal names.
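To make the feature space concrete, here is a minimal Python sketch of validating one enrichment result returned by an evaluation model. The JSON field names (intent, topic, and so on) are illustrative assumptions, not the exact keys any particular model or SLS template returns:

```python
import json

# Illustrative schema for an enrichment result; these field names are
# assumptions for the sketch, not an official SLS contract.
ENRICH_FIELDS = ["intent", "topic", "summary", "emotion", "keywords", "questions", "entities"]

def parse_enrichment(llm_output: str) -> dict:
    """Parse the model's JSON answer and keep only the expected fields."""
    data = json.loads(llm_output)
    return {k: data[k] for k in ENRICH_FIELDS if k in data}

# A hand-written sample response, standing in for a real model answer.
sample = '''{"intent": "technical consultation", "topic": "cloud computing",
"summary": "User asks how to index logs.", "emotion": "neutral",
"keywords": ["index", "logs"], "questions": ["How do I index logs?"],
"entities": []}'''

result = parse_enrichment(sample)
```

In a real pipeline the `sample` string would come from the model call, and each extracted field would be written back into the log record as a structured column.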
With the help of LLM evaluation and vector indexing, structured information is extracted from each Prompt/Response and the analysis results are displayed visually. From these evaluation results you can understand users' intents, emotions, concerns, and common problems, as well as the quality of LLM responses, which informs the next round of training and tuning. Inputs and outputs can also be audited for compliance to avoid legal risk.
Leveraging the semantic processing capabilities of Log Service (SLS), an open interface is provided in the data-processing pipeline that connects to the Bailian managed model API or a self-developed LLM API, enabling LLM-based semantic enrichment.
The LLM assessment framework consists of the following key components:
General HTTP function: the data-processing SPL syntax provides a general HTTP call function that takes a URL, body, and headers, calls an external service to process the data, and returns the result.
Calling the Qwen model: a Qwen AIGC function is encapsulated on top of the general HTTP function; pass in the Qwen endpoint, the Bailian access key, a system prompt, and a user prompt to call Bailian's Qwen model.
System/custom prompt library: SLS provides an Evaluation System Prompt template library. Select the template corresponding to the desired evaluation function and pass the prompt/response from the log as plain text in the User Prompt; you can also pass in your own prompt and customize the processing logic.
Unique business needs can be met through customizable prompts, custom endpoints, and custom models. The following is the semantic-enrichment SPL statement built into SLS:
*
| extend "__tag__:__sls_qwen_user_tpl__" = replace(replace(replace(replace(replace(replace(replace(replace("__tag__:__sls_qwen_user_tpl__", '<INPUT_TEMPLATE>', "output.value"), '\', '\\'), '"', '\"'), chr(8), '\b'), chr(12), '\f'), chr(10), '\n'), chr(13), '\r'), chr(9), '\t')
| extend "__tag__:__sls_qwen_sys_tpl__" = replace(replace(replace(replace(replace(replace(replace("__tag__:__sls_qwen_sys_tpl__", '\', '\\'), '"', '\"'), chr(8), '\b'), chr(12), '\f'), chr(10), '\n'), chr(13), '\r'), chr(9), '\t')
| extend request_body = replace(replace("__tag__:__sls_qwen_body_tpl__", '<SYSTEM_PROMPT>', "__tag__:__sls_qwen_sys_tpl__"), '<USER_PROMPT>', "__tag__:__sls_qwen_user_tpl__")
| http-call -method='post' -headers='{"Authorization": "Bearer xxxxxx", "Content-Type": "application/json", "Host": "dashscope.aliyuncs.com", "User-Agent":"sls-etl-test"}' -timeout_millis=60000 -body='request_body' 'http://dashscope.aliyuncs.com/api/v1/services/aigc/text-generation/generation' as status, response_body
| extend tmp_content = json_extract_scalar(response_body, '$.output.choices.0.message.content')
| extend output_enrich = regexp_replace(regexp_replace(tmp_content, '^([^{]|\s)+{', '{'), '}([^}]|\s)+$', '}')
| project-away "__tag__:__sls_qwen_sys_tpl__", "__tag__:__sls_qwen_user_tpl__", "__tag__:__sls_qwen_body_tpl__", trimed_input, tmp_content, request_body, response_body
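The hardest-to-read parts of the statement are the chained replace() calls, which JSON-escape the prompt text before embedding it in the request body, and the two regexp_replace() calls, which strip any chatter the model emits around its JSON answer. A minimal Python sketch of these same two transformations, for illustration only:

```python
import re

def json_escape(s: str) -> str:
    """Mirror the chained replace() calls: escape the backslash first,
    then quotes and control characters, so the text can be safely
    embedded inside a JSON request body."""
    for raw, esc in [("\\", "\\\\"), ('"', '\\"'), ("\b", "\\b"),
                     ("\f", "\\f"), ("\n", "\\n"), ("\r", "\\r"), ("\t", "\\t")]:
        s = s.replace(raw, esc)
    return s

def extract_json(content: str) -> str:
    """Mirror the two regexp_replace() calls: drop text before the
    first '{' and after the last '}' of the model's answer."""
    content = re.sub(r'^([^{]|\s)+{', '{', content)
    content = re.sub(r'}([^}]|\s)+$', '}', content)
    return content

escaped = json_escape('He said "hi"\nthen left')
cleaned = extract_json('Sure! {"intent": "translation"} hope this helps')
```

The escaping order matters: if the backslash were escaped last, it would double the backslashes introduced by the earlier replacements.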
Semantic evaluation results:
(2) Vector search
In practice, vector retrieval faces many engineering challenges, including but not limited to:
To perform vector retrieval, text must first be converted into vectors via embedding, and then an index must be built over the vectors. The engineering is relatively complex: data import, the embedding module, the vector index module, and the query module all need to be maintained.
Recall is affected by the embedding model and the vector index type, so R&D cost is high.
Cost is high: both embedding conversion and vector index construction typically require GPUs, vectors consume large amounts of storage, and queries need a lot of memory. Together these factors make vector retrieval expensive.
As shown in the architecture diagram above, SLS provides one-stop vector retrieval. After natural language such as prompts and responses is written into SLS, embedding and vector indexing are completed automatically. At query time, the query statement is automatically converted into a vector, approximate neighbors are found in the vector index, and the original data is read via the matched docIDs. Users never touch the intermediate vector representation: they write text and query text.
SLS provides vector query syntax. When using the query syntax, you need to pay attention to the following key points:
Similarity semantics are expressed through the similarity syntax: specify the vector-indexed key to search, specify the query statement, and specify the query distance, where 0 represents most similar and 1 represents least similar.
similarity(Key,query) < distance
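For intuition, here is a toy Python sketch of the distance semantics, assuming cosine distance as the metric (the actual metric SLS uses is not specified here); 0 means most similar and values near 1 mean least similar:

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two vectors: 0 = identical direction
    (most similar), 1 = orthogonal. Stands in for the distance in
    similarity(Key, query) < distance."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def similarity_filter(docs, query_vec, distance):
    """Toy analogue of the query similarity(key, query) < distance:
    keep documents whose embedding is within the distance threshold."""
    return [d for d in docs if cosine_distance(d["vec"], query_vec) < distance]
```

In the real system the embedding of both documents and the query is produced automatically; here the vectors are supplied directly.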
(3) Hybrid search
In some scenarios you need both approximate search over long text and exact matching on certain fields. For example, to query the prompts of a certain uid, you must exactly match the uid while approximately matching the prompt column. This calls for hybrid search: the keyword inverted index and the vector index are queried separately, connected by an and condition, and the two result sets are merged:
uid:123 and similarity(key,query) < distance
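A toy Python sketch of this AND-combination over hypothetical documents: exact matching plays the role of the keyword inverted index, and cosine distance stands in for the vector index:

```python
import math

def cos_dist(a, b):
    """Cosine distance: 0 = most similar, 1 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def hybrid_search(docs, uid, query_vec, max_dist):
    """AND-combine an exact uid match (inverted-index side) with an
    approximate vector match (vector-index side), then merge."""
    exact_hits = [d for d in docs if d["uid"] == uid]
    return [d for d in exact_hits if cos_dist(d["vec"], query_vec) < max_dist]

# Hypothetical documents with pre-computed embedding vectors.
docs = [
    {"uid": "123", "prompt": "how to index logs", "vec": [1.0, 0.1]},
    {"uid": "123", "prompt": "unrelated topic",   "vec": [0.0, 1.0]},
    {"uid": "456", "prompt": "how to index logs", "vec": [1.0, 0.1]},
]
hits = hybrid_search(docs, "123", [1.0, 0.0], 0.4)
```

Only the document that satisfies both conditions survives; a close vector under the wrong uid, or the right uid with a distant vector, is filtered out.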
(4) Vector clustering
How do you find hot issues, and outliers, among complex user inputs and large-model outputs? With raw text alone this is hard to analyze, because every text differs from every other. Once texts are converted into vectors, however, they can be clustered by spatial distance. Clustering relies on SQL functions: the clustering_centroids function takes a two-dimensional array of samples and the number of clusters, and produces the clustering result:
clustering_centroids(array(array(double)) samples, integer num_of_clusters)
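A minimal pure-Python k-means sketch of what such a centroid function computes, for illustration only (the real function's initialization strategy, iteration count, and convergence handling are not documented here):

```python
def clustering_centroids_sketch(samples, k, iters=20):
    """Toy k-means: takes a 2-D array of samples and a cluster count,
    returns k centroids. Initializes from the first k samples, so
    input order matters in this sketch."""
    centroids = [list(p) for p in samples[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in samples:
            # Assign each sample to its nearest centroid (squared distance).
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            groups[j].append(p)
        for i, g in enumerate(groups):
            if g:  # Recompute each centroid as the mean of its group.
                centroids[i] = [sum(col) / len(g) for col in zip(*g)]
    return centroids

# Two obvious clusters around (0, 0.5) and (10, 10.5).
samples = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
cents = clustering_centroids_sketch(samples, 2)
```

In practice the samples would be the embedding vectors of prompts or responses, and large clusters indicate hot topics while isolated points indicate outliers.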
At the same time, high-dimensional vectors cannot be visualized directly. SLS provides a dimensionality reduction function that converts high-dimensional vectors into two-dimensional vectors for visualization; the figure below shows the effect of semantic clustering after dimensionality reduction.
t_sne(array(array(double)))
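To illustrate the idea without implementing t-SNE itself, here is a sketch using PCA (via SVD) as a simpler stand-in; like t_sne, it maps high-dimensional vectors to 2-D points suitable for plotting:

```python
import numpy as np

def reduce_to_2d(vectors):
    """Project high-dimensional vectors to 2-D via PCA -- a simpler
    stand-in for a t-SNE style reduction, used here only to show
    the shape of the transformation."""
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)          # center the data
    # SVD yields the principal directions; keep the top two components.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T

points = reduce_to_2d([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0],
                       [9.0, 9.0, 9.0], [1.0, 2.0, 2.5]])
```

Unlike PCA, t-SNE preserves local neighborhood structure rather than global variance, which is why it is preferred for visualizing clusters; the output format (one 2-D point per input vector) is the same.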
Engineering Practice of LLM Prompt/Response Semantic Insight
After extracting semantic information from the original prompt and response, the following business goals are achieved through keyword retrieval, vector retrieval, and semantic clustering:
(1) Compliance and audit based on search
Search for specific keywords to find non-compliant behavior: for example, set banned words as keywords and search for semantically similar content via the similarity syntax. Sensitivity can be tuned by adjusting the distance threshold.
similarity("input_semantic.summary","malicious keywords") < 0.4
(2) Filtering based on search topics and emotions
In the semantic processing stage, the evaluation engine classifies the natural language and extracts the classification content such as topics and emotions. In the Chatbot application, you can view the conversation history of a specific topic.
input_semantic.topic: database
(3) Content clustering
Based on clustering, similar content is grouped into one category, and the correlation and distance between topics can be viewed. In the clustering effect diagram (the rightmost figure below), each color is one category, and some topics are clearly far away from the others.
Summary
User portrait construction: Identify the distribution characteristics of long-tail demand.
Model iterative optimization: Based on Bad Case analysis, improve the LLM response accuracy.
Compliance risk management: Improve detection efficiency and reduce false positive rates.