Large Model Evaluation Troubleshooting Guide | About LaTeX Formula Parsing

Written by Silas Grey
Updated on: June 28, 2025

When evaluating the performance of large models, the challenge of parsing LaTeX formulas cannot be ignored.

Core content:
1. The importance of LaTeX formula parsing in large model evaluation
2. Limitations of the sympy library when parsing LaTeX formulas, and possible workarounds
3. A comparison of model performance before and after fixing the LM evaluation tool


This is the second in a series of articles on troubleshooting large model evaluations. Stay tuned for more articles in the series:

  • About Reasoning
  • About Formula Parsing
  • About Reproducibility

Parsing LaTeX is hard. The problem shows up when scoring model outputs, and it is often encountered when working with benchmarks hosted on Hugging Face.

These benchmarks use LaTeX to represent mathematical expressions and symbols. The difficulty of evaluation lies in parsing the model output and comparing it with the reference answer, and there is no standard approach for doing so.
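To see why a plain string comparison is not enough, here is a minimal sketch (assuming sympy and its antlr4-python3-runtime dependency are installed) of two answers that are mathematically equivalent but differ as strings:

    from sympy import simplify
    from sympy.parsing.latex import parse_latex  # needs antlr4-python3-runtime

    gold_str, pred_str = r"\frac{x^2 - 1}{x - 1}", r"x + 1"

    # As strings the two answers never match...
    print(gold_str == pred_str)                                          # False

    # ...but parsed symbolically, they simplify to the same expression
    # (wherever both are defined).
    print(simplify(parse_latex(gold_str) - parse_latex(pred_str)) == 0)  # True

Symbolic comparison only works, of course, if the parser can handle the answer in the first place, which is exactly where sympy falls short below.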


Excerpt from the document

The lm-evaluation framework uses sympy, a Python library for symbolic mathematics, to parse and compare LaTeX. When I parsed the ground-truth answers with sympy (comparing each ground truth against itself), I could only get an accuracy of about 0.94. How could this be? Later, I found that sympy is unable to parse some (perfectly standard) expressions.

For example:

couldn't parse one of [0,1) or [0,1), I expected one of these: ']'
[0,1)
~~^
couldn't parse one of (-\infty,-5]\cup[5,\infty) or (-\infty,-5]\cup[5,\infty), I expected something else here
(-\infty,-5]\cup[5,\infty)
~~~~~~~^
couldn't parse one of -\frac{1}{{}2x} or -\frac{1}{{}2x}, I don't understand this
-\frac{1}{{}2x}
~~~~~~~~~~~~^
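These failures are easy to reproduce. Here is a sketch that feeds the same expressions to sympy's LaTeX parser (exact error messages vary between sympy versions):

    from sympy.parsing.latex import parse_latex
    from sympy.parsing.latex.errors import LaTeXParsingError

    failing = [
        r"[0,1)",                       # half-open interval
        r"(-\infty,-5]\cup[5,\infty)",  # union of intervals
        r"-\frac{1}{{}2x}",             # empty group inside \frac
    ]

    for expr in failing:
        try:
            parse_latex(expr)
        except LaTeXParsingError as err:
            print(f"failed on {expr}: {err}")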

How can we mitigate this problem?

There are two options: rewrite the parser and add the required functionality, or add manual checks to the code so that models receive the scores they deserve. Having nearly fallen into this trap ourselves, we think that adding a string-comparison check to the code alleviates the problem almost entirely.
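A minimal sketch of that mitigation follows. The helper names normalize and answers_match are ours for illustration, not identifiers from the lm-evaluation codebase:

    from sympy import simplify
    from sympy.parsing.latex import parse_latex

    def normalize(ans: str) -> str:
        # Hypothetical helper: strip whitespace so trivially different
        # renderings of the same answer still match.
        return "".join(ans.split())

    def answers_match(pred: str, gold: str) -> bool:
        # Cheap string comparison first: this catches answers sympy
        # cannot parse at all, such as the interval [0,1).
        if normalize(pred) == normalize(gold):
            return True
        # Fall back to symbolic comparison for expressions that are
        # equivalent but written differently.
        try:
            return simplify(parse_latex(pred) - parse_latex(gold)) == 0
        except Exception:
            # Parse failures and other sympy errors count as a mismatch.
            return False

With this fallback in place, a ground truth compared against itself always scores 1.0, regardless of whether sympy can parse it.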


LM Evaluation Tool Fixes

Results

The comparison of the top 25 models before and after the fix is as follows: