Large Model Evaluation Troubleshooting Guide | About LaTeX Formula Parsing

Written by Silas Grey
Updated on: June 28, 2025

When evaluating the performance of large models, the challenge of parsing LaTeX formulas cannot be ignored.

Core content:
1. The importance of LaTeX formula parsing in large model evaluation
2. Limitations of the sympy library when parsing LaTeX formulas, and possible workarounds
3. A comparison of model performance before and after fixing the LM evaluation tool


This is the second in a series of articles on troubleshooting large model evaluations. Stay tuned for more articles in the series:

  • About Reasoning
  • About Formula Parsing
  • About Reproducibility

Parsing LaTeX is hard. The problem shows up when scoring model outputs, and it is often encountered when working with benchmarks hosted on Hugging Face.

These benchmarks use LaTeX to represent mathematical expressions and symbols. The difficulty of evaluation lies in parsing the model output and comparing it with the reference answer, and there is no standard approach for doing so.
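To see why a plain string comparison is not enough, here is a minimal sketch (assuming sympy and its antlr4-python3-runtime dependency are installed) of two answers that are mathematically equivalent but differ as strings:

    from sympy import simplify
    from sympy.parsing.latex import parse_latex  # needs antlr4-python3-runtime

    gold_str, pred_str = r"\frac{x^2 - 1}{x - 1}", r"x + 1"

    # As strings the two answers never match...
    print(gold_str == pred_str)                                          # False

    # ...but parsed symbolically, they simplify to the same expression
    # (wherever both are defined).
    print(simplify(parse_latex(gold_str) - parse_latex(pred_str)) == 0)  # True

Symbolic comparison only works, of course, if the parser can handle the answer in the first place, which is exactly where sympy falls short below.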


Excerpt from the document

The lm-evaluation framework uses sympy, a Python library for symbolic mathematics, to parse and compare LaTeX. When I parsed the ground-truth answers with sympy (comparing each ground truth against itself), I could only get an accuracy of about 0.94. How could this be? Later, I found that sympy is unable to parse some (perfectly standard) expressions.

For example:

couldn't parse one of [0,1) or [0,1), I expected one of these: ']'
[0,1)
~~^
couldn't parse one of (-\infty,-5]\cup[5,\infty) or (-\infty,-5]\cup[5,\infty), I expected something else here
(-\infty,-5]\cup[5,\infty)
~~~~~~~^
couldn't parse one of -\frac{1}{{}2x} or -\frac{1}{{}2x}, I don't understand this
-\frac{1}{{}2x}
~~~~~~~~~~~~^
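These failures are easy to reproduce. Here is a sketch that feeds the same expressions to sympy's LaTeX parser (exact error messages vary between sympy versions):

    from sympy.parsing.latex import parse_latex
    from sympy.parsing.latex.errors import LaTeXParsingError

    failing = [
        r"[0,1)",                       # half-open interval
        r"(-\infty,-5]\cup[5,\infty)",  # union of intervals
        r"-\frac{1}{{}2x}",             # empty group inside \frac
    ]

    for expr in failing:
        try:
            parse_latex(expr)
        except LaTeXParsingError as err:
            print(f"failed on {expr}: {err}")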

How can we mitigate this problem?

There are two options: rewrite the parser and add the required functionality, or add manual checks to the code so that models receive the scores they deserve. Having nearly fallen into this trap ourselves, we think that adding a string-comparison check to the code alleviates the problem almost entirely.
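A minimal sketch of that mitigation follows. The helper names normalize and answers_match are ours for illustration, not identifiers from the lm-evaluation codebase:

    from sympy import simplify
    from sympy.parsing.latex import parse_latex

    def normalize(ans: str) -> str:
        # Hypothetical helper: strip whitespace so trivially different
        # renderings of the same answer still match.
        return "".join(ans.split())

    def answers_match(pred: str, gold: str) -> bool:
        # Cheap string comparison first: this catches answers sympy
        # cannot parse at all, such as the interval [0,1).
        if normalize(pred) == normalize(gold):
            return True
        # Fall back to symbolic comparison for expressions that are
        # equivalent but written differently.
        try:
            return simplify(parse_latex(pred) - parse_latex(gold)) == 0
        except Exception:
            # Parse failures and other sympy errors count as a mismatch.
            return False

With this fallback in place, a ground truth compared against itself always scores 1.0, regardless of whether sympy can parse it.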


LM Evaluation Tool Fixes

Results

The comparison of the top 25 models before and after the fix is as follows: