State-of-the-Art Formal Theorem Prover
Aleph Prover is a tool that automatically produces machine-checked proofs in Lean 4. On PutnamBench, a collection of problems from the Putnam Competition*, Aleph solved 668/672 problems, placing first among all provers on the benchmark.
*The Putnam Competition is one of the world's most challenging mathematics competitions.
Notes on This Run
We previously published reports on earlier runs of Aleph Prover: the first run with a $100 per problem limit solved 500 problems, and the second run with a $300 per problem limit solved an additional 137 problems. Those runs already achieved first place on the leaderboard. For this new run, we launched Aleph on the remaining 35 problems that were not solved in previous runs, this time with a $1,000 per problem limit. This resulted in 31 additional problems solved. In the PutnamBench leaderboard, we noted that there were three runs: the original one with a $100 limit, a second one on the remaining problems with a $300 limit, and now a third one with a $1,000 limit. The results of the new run were submitted to the PutnamBench authors for verification on January 4 and published on January 6.
Fixing Formalization Errors
During this run, Aleph fixed formalization errors in 12 problems while working on their proofs. Aleph had access to both the Lean 4 formalization and the original English problem statement. While proving, Aleph identified that the formal statement didn't match the English problem and proposed corrections. We manually verified each fix, and all 12 corrections were accepted into the official PutnamBench dataset. These problems were attempted in earlier runs, but we filtered them out for simplicity. In this run, we thoroughly investigated each problem with statement changes.
This explains the cost patterns in this batch. The 12 problems with fixed formalizations don't include costs from previous attempts. All other problems are re-runs from previous batches, so their costs include $100 from the first run plus $300 from the second run, plus the cost of the current run.
About Lean 4
Lean 4 is one of the most popular languages used for machine-checked theorem proving and is the language most widely used by AI tools for automated proving. Unlike proofs written in human languages, Lean 4 allows you to provide the compiler with a formal statement of a theorem and a mathematical proof, and it will automatically verify that the proof is correct or point out errors.
We created Aleph Prover for formal code verification, to provide machine-checked proofs of correctness for code written in any programming language. It has been fine-tuned specifically for proving properties of code. However, it turned out that it also performs very well in solving mathematical competition problems formulated in Lean 4.
While Aleph Prover isn't publicly available yet, we've evaluated it on PutnamBench, a large collection of Lean 4 translations of Putnam competition problems: Aleph solved 668 of 672 problems on a single attempt per problem, surpassing the best prior result on the public leaderboard as of January 6, 2026.
Putnam Competition
The Putnam Competition (officially the William Lowell Putnam Mathematical Competition) is widely regarded as the most important undergraduate mathematics competition in the United States and Canada. Teams from Harvard and MIT have been among the most successful participants. The authors of PutnamBench emphasize the exceptional difficulty of these problems: "The competition is scored out of 120 points, with the median score typically being zero. The top five individual scorers are named Putnam Fellows, and in recent years regularly include former International Mathematical Olympiad (IMO) medalists."
PutnamBench
A team of researchers from the University of Texas at Austin recently published a paper introducing PutnamBench. The paper describes the manual translation of Putnam competition problems from English into formal languages (including Lean 4). Over many years, statements of 672 problems were translated in this way.
The details about this study can be found at: trishullab.github.io/PutnamBench
Aleph Prover solved 668/672 problems (99.4%) on PutnamBench, placing first among all provers.
In previous versions of our reports (first run, second run), we discussed the gap between solving problems in natural language versus Lean 4. Our new results now reach 99.4% on Lean 4. On this benchmark, this essentially closes the question of the difference between natural language and Lean 4 solutions.
Run Statistics
Across the 31 additional successful proofs from the third run, we summarize the distributions for time to proof, lines of Lean code, Lean-check calls, and cost per problem. For transparency, each problem was launched once but with many internal Lean checks as needed. For statistics on previous runs, see our reports on first 500 problems and 137 additional problems.
Time Statistics
Lines of Code
Lean Check Calls
Each problem was launched exactly once but with many internal Lean-check calls as needed.
Lemmas & Theorems
Each successful proof required on average 33 proved theorems or lemmas, with a minimum of 3 and a maximum of 95. In total 1,021 theorems or lemmas were proved.
Cost Analysis
We enforced a soft limit of $1,000 per problem. Costs include previous unsuccessful attempts where applicable.
Cost Correlations
Below are charts showing the relationship between cost and time, as well as cost and final code size.
Cost vs Time Correlation
Cost vs Lines of Code






