Logical

Intelligence

Logical

Intelligence

Logical

Intelligence

Aleph prover

Aleph
prover

Aleph prover

State-of-the-Art Formal Theorem Prover

Aleph Prover is a tool that automatically produces machine-checked proofs in Lean 4. On PutnamBench, a collection of problems from the Putnam Competition*, Aleph solved 637/672 problems, placing first among all provers on the benchmark.

*The Putnam Competition is one of the world's most challenging mathematics competitions.

Aleph Prover: 637 / 672 problems on PutnamBench

First place on leaderboard as of December 22, 2025
Average proof time: ≈ 5h 7m
Average lemmas per proof: ≈ 17
Average cost: ≈ $54 per solved problem
Maximum cost: $377
Stats for new 137 problems below, for first 500 in previous report

Previous best on leaderboard: 462 / 660 (as of December 22, 2025)

Aleph Prover:

637 / 672

problems on PutnamBench

First place on leaderboard as of December 22, 2025
Average proof time: ≈ 5h 7m
Average lemmas per proof: ≈ 17
Average cost: ≈ $54 per solved problem
Maximum cost: $377
Stats for new 137 problems below, for first 500 in previous report

Previous best on leaderboard:

462 / 660 (as of December 22, 2025)

Notes on This Run

We previously published a report on an earlier run of Aleph Prover with a $100 per problem limit. You can explore those results here. That run already achieved first place on the leaderboard. Since then, the Putnam Competition 2025 has taken place, expanding the benchmark from 660 to 672 problems. For this new run, we launched Aleph on the remaining 172 problems that were not solved in the first run, this time with a $300 per problem limit. We also adjusted our setup to prioritize more expensive models, since the higher budget allows us to make better use of them. This resulted in 137 additional problems solved. In the PutnamBench leaderboard, we noted that there were two runs: the original one with a $100 limit, and a new one on the remaining problems with a $300 limit. The results of the new run were submitted to the PutnamBench authors for verification on December 15 and published on December 22.

Fixing Formalization Errors

During this run, Aleph fixed formalization errors in 15 problems while working on their proofs. Aleph had access to both the Lean 4 formalization and the original English problem statement. While proving, Aleph identified that the formal statement didn't match the English problem and proposed corrections. We manually verified each fix, and all 15 corrections were accepted into the official PutnamBench dataset. In the first run, we filtered out all problems with statement changes for simplicity.

This explains the cost patterns in this batch. The 15 problems with fixed formalizations cost less than $100 since we didn't count previous attempts. Another 12 problems from Putnam 2025 were added between runs, also starting at zero cost. All other problems are re-runs from the first batch, so their costs include $100 from the previous attempt plus the cost of the current run.

Upcoming Runs

We have already launched a $1,000 per problem run on all 35 unsolved problems. This will take some time to complete, but it has already solved an additional 17 problems, so our next report will show at least 654 problems solved. Stay tuned for our updated results.

About Lean 4

Lean 4 is one of the most popular languages used for machine-checked theorem proving and is the language most widely used by AI tools for automated proving. Unlike proofs written in human languages, Lean 4 allows you to provide the compiler with a formal statement of a theorem and a mathematical proof, and it will automatically verify that the proof is correct or point out errors.

We created Aleph Prover for formal code verification, to provide machine-checked proofs of correctness for code written in any programming language. It has been fine-tuned specifically for proving properties of code. However, it turned out that it also performs very well in solving mathematical competition problems formulated in Lean 4.

While Aleph Prover isn't publicly available yet, we've evaluated it on PutnamBench, a large collection of Lean 4 translations of Putnam competition problems: Aleph solved 637 of 672 problems on a single attempt per problem, surpassing the best prior result on the public leaderboard as of December 22, 2025.

Putnam Competition

The Putnam Competition (officially the William Lowell Putnam Mathematical Competition) is widely regarded as the most important undergraduate mathematics competition in the United States and Canada. Teams from Harvard and MIT have been among the most successful participants. The authors of PutnamBench emphasize the exceptional difficulty of these problems: "The competition is scored out of 120 points, with the median score typically being zero. The top five individual scorers are named Putnam Fellows, and in recent years regularly include former International Mathematical Olympiad (IMO) medalists."

PutnamBench

A team of researchers from the University of Texas at Austin recently published a paper introducing PutnamBench. The paper describes the manual translation of Putnam competition problems from English into formal languages (including Lean 4). Over many years, statements of 672 problems were translated in this way.

The details about this study can be found at: trishullab.github.io/PutnamBench

Aleph Prover solved 637/672 problems on PutnamBench, placing first among all provers.

Hilbert, created by 4 authors from Apple and 2 from UC San Diego, previously held the best result with 462 solved, which was surpassed on the same benchmark. See the leaderboard for details.

Closing the Gap Between Formal and Informal Proofs

In the previous version of our report, we discussed the gap between solving problems in natural language versus Lean 4. Our new results now reach 94.8% on Lean 4, and with the ongoing third run already showing progress, we are at 97%. On this benchmark, this essentially closes the question of the difference between natural language and Lean 4 solutions.

Run Statistics

Across the 137 additional successful proofs from the second run, we summarize the distributions for time to proof, lines of Lean code, Lean-check calls, and cost per problem. For transparency, each problem was launched once but with many internal Lean checks as needed. For statistics on the 500 problems from the first run, see our previous report.

Time Statistics

→ Minimum proof time: 13 minutes

→ Maximum proof time: 35 hours 7 minutes

→ Average proof time: 13 hours 34 minutes

→ Minimum proof time: 13 minutes

→ Maximum proof time: 35 hours 7 minutes

→ Average proof time: 13 hours 34 minutes

Lines of Code

→ Average: 1,285 lines per successful proof

→ Minimum: 105 lines

→ Maximum: 7,183 lines

→ Average: 1,285 lines per successful proof

→ Minimum: 105 lines

→ Maximum: 7,183 lines

Lean Check Calls

Each problem was launched exactly once but with many internal Lean-check calls as needed.

→ Minimum: 23 calls

→ Maximum: 27,161 calls

→ Average: 4,241 calls

→ Minimum: 23 calls

→ Maximum: 27,161 calls

→ Average: 4,241 calls

Lemmas & Theorems

Each successful proof required on average 26 proved theorems or lemmas, with a minimum of 4 and a maximum of 100. In total 3,510 theorems or lemmas were proved.