When AI meets mathematics: How do large language models bring about a revolution in formal mathematics? | Deep Talk

Written by

Jasper Cole

Updated on:June-22nd-2025

As large language models have set off a huge wave of changes in areas such as content creation, code generation, and scientific question-answering, the field of mathematics, which is known for its rigorous logic and precise structure, has also ushered in a profound opportunity for transformation.

At present, the complexity of mathematical theories is constantly increasing, and the scale of proofs of many important theorems has far exceeded the capabilities of traditional manual review. Proofs of hundreds of pages not only challenge the limits of peer review, but also expose the slowness and fragility of the manual verification process. In response to this dilemma, formal mathematical methods have begun to become an important solution. This method effectively improves the accuracy and reliability of theorem proofs by strictly expressing mathematical propositions in formal logic language and using computers for automated verification.

Against the backdrop of the growing trend of formalized mathematics, Xin Huajian, a doctoral student from the University of Edinburgh, has been committed to combining large language model technology with formal mathematical methods since 2022. He has conducted related research at DeepSeek and ByteDance Seed teams.

On May 9, 2025, at a theme sharing session jointly organized by the Cambridge China AI Association, Jinqiu Foundation, Tsinghua University Student General Artificial Intelligence Association, and Tsinghua University Student Entrepreneurship Association, Xin Huajian gave a speech entitled "Formal Mathematical Revolution in the Era of Large Language Models", in which he elaborated in detail on the historical evolution, current challenges, and future development directions of formal mathematics.

He believes:

Mathematical research faces the bottleneck of traditional manual verification, and formalized mathematics has become an effective solution.
The combination of LLM and formal mathematics significantly reduces verification costs and promotes the development of mathematics.
The transformation from traditional Theorem Prover to autonomous Proof Engineering Agent is an inevitable trend in the future of formal mathematics.
The Automated Proof Engineering Benchmark (APE-Bench) provides a practical evaluation tool for the long-term evolution of formal mathematics.

The following content is a compilation and deepening of this report, reviewed and supplemented by the speaker himself.

introduction

As large language models (LLMs) sweep across fields such as content creation and scientific research on an unprecedented scale, the field of mathematical research is also undergoing an unprecedented paradigm shift. As modern mathematical theories become increasingly complex, the proofs of many key theorems are far beyond the capabilities of traditional peer review and manual verification. Formal mathematics, a method based on rigorous logic and computer-assisted proofs, has begun to show great potential and provides a practical path to solving verification problems in the field of mathematics.

Faced with large-scale and logically rigorous mathematical proofs, it is no longer possible to rely solely on human experts for review. More and more mathematical proofs are hundreds of pages long, with long verification cycles and high risk of errors. The limitations of human intuition and empirical methods are becoming increasingly prominent. As a result, the formalized mathematical paradigm has gradually emerged, expressing mathematical content with a strict axiomatic system and logical language, and using computer systems for reasoning and verification. This method can not only significantly improve the reliability of verification, but has also gradually become an important engine for promoting the sustainable development of mathematical research.

1. Challenges of Generative Intelligence and Opportunities of Formal Mathematics

In recent years, large language models have rapidly reached or even surpassed the performance of human experts in intellectually intensive fields such as mathematical proof generation and code development, thanks to their excellent reasoning and generation capabilities. However, the widespread "hallucination" problem of LLMs, that is, the model may generate content that seems reasonable but has hidden logical errors or factual flaws, which seriously limits people's trust in the generated content. This problem is particularly evident in the field of mathematical research, which requires extremely high logical rigor: generated content that has not been strictly verified is not only difficult to use directly, but also significantly reduces the reliability of research results.

The review and verification of mathematical proofs has always been an important bottleneck restricting the efficiency of mathematical results dissemination. Some classic cases, such as the proof of Kepler's conjecture proposed by Thomas Hales, were 300 pages long when they were first published. Even after years of repeated review by experts, doubts about their correctness could not be completely eliminated. Such problems are common in modern mathematics, indicating that the mathematical review process relying on traditional manual verification has gradually reached its capacity limit, and a more rigorous and reliable alternative method is urgently needed.

It is in this context that the formalized theorem proving method demonstrates its unique advantages. By converting mathematical proofs into strict formal languages and using computer tools to perform automatic or semi-automatic reasoning verification, it is possible to effectively solve the logical blind spots that are difficult to overcome with traditional methods and improve the reliability of proofs and the efficiency of verification. Practice has shown that this rigorous machine verification method not only solves the problems that cannot be overcome by human review, but also provides a solid logical foundation for the long-term evolution of mathematical research.

2. Formal theorem proving and its core applications

Formal theorem proving is a verification method based on axiom systems and logical reasoning rules. It expresses mathematical statements completely in a formal language that can be verified by computers, and uses theorem proving tools (such as Lean, Coq, Isabelle/HOL) for reasoning verification. This method does not need to rely on empirical judgment or exhaustive experiments, and can provide highly deterministic verification results for mathematical theories.

In the field of mathematics, there are many typical cases of successful application of formal methods, which demonstrate its unique value:

Flyspeck Project : This project used Isabelle/HOL and HOL Light proof assistant to thoroughly formally verify the proof of Kepler's conjecture proposed by Hales. The proof, which could not be completely confirmed by traditional manual methods, was completely eliminated through rigorous machine verification, becoming a landmark achievement in the history of formal mathematics.

Liquid Tensor Experiment : The academic community used the Lean proof tool to quickly complete the formal verification of the key lemma of condensed matter mathematics proposed by Schulze and Clausen. This not only eliminated doubts about the correctness of the theory, but also greatly promoted the subsequent development and promotion of the theory.

PFR conjecture crowdsourcing verification : The polynomial Freeman-Russa conjecture verification project led by Terence Tao efficiently completed the subtasks of formal proof in a very short time through Lean 4 tools and community crowdsourcing. This model fully demonstrates the potential of formal theorem proving to lower the threshold for large-scale mathematical collaboration.

The power of formal theorem proving lies not only in verifying a single complex proof, but also in its universality and scalability. Whether it is theoretical proof in the field of mathematics or code verification in software engineering, formal methods can ensure a high degree of consistency between implementation and specifications at the logical level, breaking through the limitations of traditional methods that rely on experience and intuition for verification, and is gradually becoming a solid foundation for cross-domain verification.

3. Latest progress in LLM-driven formal mathematics

With the rapid advancement of large language model technology, the field of formal theorem proving has also shown significant development potential. Current advanced LLMs, especially AlphaProof and DeepSeek-Prover V2, have achieved outstanding performance in solving competition-level mathematical problems. For example, the AlphaProof model, with the help of reinforcement learning technology, performed well in the formal proof task of the International Mathematical Olympiad questions, reaching the silver medal level; and DeepSeek-Prover V2 achieved a success rate of nearly 90% in the standardized miniF2F mathematical benchmark, far exceeding the performance of previous similar systems.

More importantly, current research is gradually shifting from simple proof generation to long-term knowledge accumulation and theoretical system construction. For example, the LEGO-Prover project uses LLM to automatically identify and generate reusable mathematical knowledge units, and gradually builds an easy-to-maintain mathematical knowledge base. This indicates that formal mathematics is moving from single problem solving to long-term library-level theoretical construction and management.

At the same time, some cutting-edge research has begun to explore the potential of LLM to proactively propose mathematical conjectures and discover abstract structures. This exploration has gradually enabled formal mathematics to have a higher level of creative ability. Although this direction is still in its early stages, it clearly points to the huge space for the future combination of formal mathematics and generative intelligence.

4. From Theorem Prover to Proof Engineering Agent

Although there have been many successful cases of formal theorem proving methods in the fields of mathematics and software, their large-scale promotion still faces practical challenges such as high labor costs, low collaboration efficiency, and long-term maintenance difficulties. For example, the Flyspeck project in the mathematics field took dozens of man-years of work, and the seL4 project in the software field also took many years to complete the full verification. This model is difficult to adapt to the rapidly changing actual needs and urgently needs a paradigm shift.

The next generation of formal tools needs to transform from the current static "Theorem Prover" to the "Proof Engineering Agent" with self-planning, self-repair and self-knowledge accumulation capabilities. This autonomous agent model can significantly reduce labor costs, improve long-term maintenance efficiency, and more effectively support large-scale cross-module collaboration and dynamic knowledge management.

To promote this transformation, it is necessary to establish a new automated evaluation benchmark. APE-Bench was born in this context. It is based on real Proof Engineering needs to evaluate and promote the performance of language models in long-term dynamic maintenance scenarios.

5. APE-Bench: Automated Proof Engineering Benchmark

The original intention of designing APE-Bench is to fill the gap in the current evaluation tools in the field of formal mathematics, and to expand the evaluation from single isolated theorem proofs to actual engineering tasks such as long-term maintenance and cross-file collaboration. Specifically, APE-Bench uses historical modification records in real mathematical libraries (such as Mathlib4) to truly reproduce actual Proof Engineering scenarios through clear natural language instructions and Lean code modification tasks.

The implementation of the benchmark was divided into three phases:

The first stage (APE-Bench I) focuses on local modification tasks within a single file.
The second phase (APE-Bench II) extends to consistency maintenance across files and multiple modules.
The third stage (APE-Bench III) realizes a fully autonomous agent mode, including autonomous planning, problem repair and knowledge accumulation.

Through the dual mechanisms of precise compilation verification and semantic evaluation, APE-Bench not only ensures the accuracy of the evaluation, but also promotes the transformation of the language model into a truly intelligent Proof Engineering Agent, thereby supporting the long-term dynamic evolution and large-scale application of formal mathematics.

6. Impact and Future Outlook: Towards Certified AI

The combination of LLM and formal methods will bring efficient, reliable and sustainable changes to mathematical research and related industrial applications. First, in the field of mathematics, the efficiency of proof verification will be significantly improved through automated formal tools, allowing researchers to focus more on theoretical innovation and concept exploration, and promote the rapid evolution of mathematical knowledge.

Secondly, in the industrial field, automated Proof Engineering methods represented by APE-Bench are expected to promote the widespread popularization of formal verification. For example, in software systems with high security requirements (such as operating system kernels, compilers, smart contracts, etc.), LLM-assisted formal verification can greatly improve the security and reliability of the system.

In the long run, this combination of formal verification and dynamic learning mechanisms will give rise to a new type of generative intelligence - Certified AI. This rigorously verified intelligent system will have higher credibility and explainability, and become a reliable partner for human knowledge production and decision-making.

Through the integration of formal mathematics and LLM, we are standing at the starting point of a new paradigm of knowledge production. The transformation from Theorem Prover to Proof Engineering Agent will not only improve the verification efficiency in the fields of mathematics and software, but also completely reshape the way of knowledge production. Looking to the future, an efficient, trustworthy and inclusive intelligent era is coming, and the pace of human knowledge exploration will be accelerated.