Large Model Evaluation Troubleshooting Guide | About LaTeX Formula Analysis

When evaluating the performance of large models, the challenge of parsing LaTeX formulas cannot be ignored.
Core content:
1. The importance of LaTeX formula parsing in large model evaluation
2. Limitations and solutions of using the sympy library to parse LaTeX formulas
3. Comparison of model performance before and after fixing the LM evaluation tool
This is the second article in a series on troubleshooting large model evaluation. Stay tuned for more articles in the series:
About Reasoning, About Formula Parsing, About Reproducibility
Parsing LaTeX is hard. The problem shows up when evaluating a model whose output is expected to be LaTeX, something often encountered when working with models on platforms such as Hugging Face.
This benchmark uses LaTeX to represent mathematical calculations and symbols. The difficulty of the evaluation lies in parsing the model output and comparing it with the ground-truth answer, and there is no standard way to do this.
Excerpt from the document
The lm-evaluation framework uses sympy, a Python library for symbolic mathematics, to parse and compare LaTeX expressions.
Parsing the ground truth with sympy (that is, comparing each ground-truth answer against itself) yields an accuracy of only about 0.94. How could this be? It turns out that sympy is unable to parse some (standard LaTeX) expressions.
For example:
couldn't parse one of [0,1) or [0,1), I expected one of these: ']'
[0,1)
~~^
couldn't parse one of (-\iny,-5]\cup[5,\iny) or (-\iny,-5]\cup[5,\iny), I expected something else here
(-\iny,-5]\cup[5,\iny)
~~~~~~~^
couldn't parse one of -\frac{1}{{}2x} or -\frac{1}{{}2x}, I don't understand this
-\frac{1}{{}2x}
~~~~~~~~~~~~^
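The ground-truth self-check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the framework's actual code: `self_match_accuracy` and `sympy_equiv` are hypothetical names, and the sympy-based checker assumes that sympy and its ANTLR runtime are installed. Any answer the checker cannot parse is counted as a miss, which is how a perfect answer set can end up scoring below 1.0.

```python
from typing import Callable, Iterable


def self_match_accuracy(answers: Iterable[str],
                        checker: Callable[[str, str], bool]) -> float:
    """Fraction of reference answers the checker judges equal to themselves."""
    answers = list(answers)
    if not answers:
        return 0.0
    hits = 0
    for answer in answers:
        try:
            ok = checker(answer, answer)
        except Exception:
            ok = False  # a parser failure counts as a miss
        hits += bool(ok)
    return hits / len(answers)


def sympy_equiv(a: str, b: str) -> bool:
    """Hypothetical sympy-based checker (assumes sympy + antlr4 runtime)."""
    # Imported lazily so the rest of the sketch runs without sympy installed.
    from sympy import simplify
    from sympy.parsing.latex import parse_latex
    return simplify(parse_latex(a) - parse_latex(b)) == 0
```

Calling `self_match_accuracy(ground_truths, sympy_equiv)` on a list of reference answers would then report how many of them survive the parser at all.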
How can this problem be mitigated?
One option is to rewrite the LaTeX parsing and add the required features to the code; another is to add manual checks to the code to improve model scores. After almost falling down a rabbit hole on this problem, we concluded that adding string comparison checks to the code is enough to largely mitigate it.
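A string comparison check of this kind might look like the sketch below. It is an assumption about the general shape of such a fix, not the actual patch: `normalize`, `is_equiv`, and the `symbolic_check` parameter are hypothetical names, and the normalization rules (dropping `$` delimiters and whitespace) are chosen only for illustration. The idea is to accept an exact textual match first and consult the symbolic parser only when the strings differ.

```python
import re
from typing import Callable, Optional


def normalize(answer: str) -> str:
    """Illustrative normalization: drop $ delimiters and all whitespace."""
    answer = answer.strip().strip("$")
    return re.sub(r"\s+", "", answer)


def is_equiv(pred: str, gold: str,
             symbolic_check: Optional[Callable[[str, str], bool]] = None) -> bool:
    """String comparison first, optional symbolic comparison second."""
    # Identical strings are equivalent even if the parser would reject them.
    if normalize(pred) == normalize(gold):
        return True
    if symbolic_check is None:
        return False
    try:
        return bool(symbolic_check(pred, gold))
    except Exception:
        return False  # unparseable and not textually equal: treat as wrong
```

With this ordering, an answer such as `[0,1)` matches its own ground truth without ever reaching the parser.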
LM Evaluation Tool Fixes
Results
The results for the top 25 models before and after the fix compare as follows: