Logical Intelligence’s Aleph Solves PutnamBench
Many of the world’s critical systems involve significant automation. In areas like semiconductor design, nuclear reactor controls, industrial manufacturing, and financial markets, even subtle algorithmic errors can compound into devastating financial and human losses. At the same time, the underlying code can be so complex that it is virtually impossible for a human to check whether it works as intended.
Logical Intelligence builds AI tools that automatically generate machine-checkable proofs of safety and correctness in critical systems. We believe these tools will be indispensable components for the AGI ecosystem and will make AI more efficient and trustworthy.
To make this vision a reality, we are developing two kinds of AI systems:
A non-autoregressive energy-based reasoning model (EBRM) called Kona, designed specifically to reason over complex systems.
An agentic orchestration layer, called Aleph, that plans, executes, and synthesizes separate calls to a variety of reasoning models and other tools.
We’ve designed Aleph to be paired with multiple reasoning models. That’s allowed us to decouple developing our proprietary reasoning model from de-risking our orchestration layer.
Conquering PutnamBench
To demonstrate the capabilities of Aleph, we’ve paired it with GPT-5.2 on a well-known formal verification benchmark called PutnamBench, which consists of 672 hard math problems from the William Lowell Putnam Exam, the oldest collegiate mathematics competition in North America.
Top math undergrads in the United States and Canada take the Putnam every year. The exam is scored out of 120 points, with 10 points per problem. The median score is typically two or lower, meaning at least half of test takers fail to completely solve even a single problem.
In contrast, Aleph automatically generated proofs for 668 out of 672 (99.4%) of the problems from the past 50+ years of the exam. Aleph’s proofs are written in a formal language called Lean, and their correctness is automatically certified by an external deterministic Lean compiler. Our results put Logical Intelligence’s Aleph comfortably at the top of the current PutnamBench Leaderboard, ahead of the previous leaders (ByteDance and Apple).
To achieve such a high score, Aleph provided counterexamples to 15 of the formal problem statements (~2%), automatically suggested how to correct their formalizations, and provided correct proofs. This was thoroughly verified by the PutnamBench team. In other words, we not only solved the problems; we fixed some of them before we even solved them.
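To make the counterexample workflow concrete, here is a minimal, hypothetical Lean sketch (not an actual PutnamBench problem): a mis-formalized statement is refuted by instantiating it at a counterexample, and a corrected statement is then proved, with both steps certified by the Lean kernel.

```lean
-- Hypothetical toy example, not an actual PutnamBench problem.

-- Mis-formalized claim: the strict inequality fails at n = 0.
theorem misstated : ¬ ∀ n : Nat, 2 * n > n :=
  fun h => absurd (h 0) (by decide)  -- n = 0 is the counterexample

-- Corrected formalization, which the Lean kernel certifies.
theorem corrected : ∀ n : Nat, 2 * n ≥ n := by
  intro n
  omega
```

The real benchmark fixes are far subtler, but the principle is the same: a refutation is itself a kernel-checked proof, so a bad formalization can be detected with the same rigor as a good one can be proved.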
Math and the real world
We care about PutnamBench not because we seek to solve math problems for their own sake. Rather, in high-stakes settings where uncertainty is unacceptable, formal verification of correctness for real-world systems follows a “translate and verify” pipeline:
Translate: Rewrite requirements from real-world codebases into precise statements in a computer language called Lean.
Verify: Use Aleph to propose potential proofs of these Lean statements, which are vetted by a Lean-native algorithm.
Once the Lean verification algorithm says our models have produced a correct proof, one can be certain that the Lean statement used to encode correctness or safety is true. (An important caveat is that one has to additionally ensure that the translation to Lean correctly encodes the intended constraints.)
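As a minimal illustration of the pipeline, consider the following hypothetical Lean sketch. The clamp function and its requirement are invented for illustration; the point is that a behavioral property is first translated into a Lean theorem and then verified by the Lean kernel, so it holds for every possible input rather than only the inputs a test suite happens to exercise.

```lean
-- Translate: a toy requirement, "clamp never exceeds its upper bound,"
-- restated as a theorem about the code itself. (Hypothetical example.)
def clamp (lo hi x : Int) : Int :=
  if x < lo then lo else if x > hi then hi else x

-- Verify: once the Lean kernel accepts this proof, the property holds
-- for all inputs.
theorem clamp_le_hi (lo hi x : Int) (h : lo ≤ hi) : clamp lo hi x ≤ hi := by
  unfold clamp
  split <;> (try split) <;> omega
```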
Aleph’s extraordinary success in producing formally verified proofs of Putnam problems shows that once real-world requirements are translated with high fidelity into Lean, Aleph can weave together the sophisticated arguments needed to demonstrate correctness. This is a key capability for ensuring the correctness and safety of critical systems at large scale.
Aleph at a glance
Aleph uses reasoning models (currently standard LLMs) for three tasks:
Planning: Breaking down the original problem statement into subproblems.
Proving: Producing potential proofs of these subparts in Lean, with proof correctness checked by the Lean kernel.
Refining: Modifying the high-level proof plan, depending on which subproblems are successfully proved.
The result is a recursive and interactive state management system with highly parallel Lean verification calls that has allowed us to achieve unmatched performance on PutnamBench. Ultimately, while Aleph’s back end is complex, the front end is dead simple: one just provides a theorem (in natural language, in Lean, or via a link to a GitHub repository), a time budget, and a cost budget.
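As a rough illustration of such a loop (all function names and interfaces below are hypothetical, not Aleph’s actual code), a plan-prove-refine orchestrator might be organized as in the following Python sketch, with the reasoning-model calls and the Lean check left as stubs:

```python
# Hypothetical sketch of a plan-prove-refine loop; all names and
# interfaces are invented for illustration, not Aleph's actual code.
from concurrent.futures import ThreadPoolExecutor


def plan(statement: str) -> list[str]:
    """Ask a reasoning model to break the goal into Lean subgoals."""
    raise NotImplementedError  # placeholder: model call goes here


def prove(subgoal: str) -> str | None:
    """Ask a reasoning model for a candidate Lean proof of one subgoal."""
    raise NotImplementedError  # placeholder: model call goes here


def lean_check(subgoal: str, proof: str) -> bool:
    """Check a candidate proof with the Lean kernel (deterministic)."""
    raise NotImplementedError  # placeholder: Lean invocation goes here


def refine(statement: str, unproved: list[str]) -> list[str]:
    """Ask a reasoning model to revise the plan around unproved subgoals."""
    raise NotImplementedError  # placeholder: model call goes here


def solve(statement: str, max_rounds: int = 5) -> dict[str, str] | None:
    """Return kernel-certified proofs for all subgoals, or None on failure."""
    subgoals = plan(statement)
    proofs: dict[str, str] = {}
    for _ in range(max_rounds):
        pending = [g for g in subgoals if g not in proofs]
        if not pending:
            return proofs  # every subgoal is kernel-certified
        # Subgoals are independent, so candidate proofs can be produced
        # and Lean-checked in parallel.
        with ThreadPoolExecutor() as pool:
            candidates = list(pool.map(prove, pending))
        unproved = []
        for goal, cand in zip(pending, candidates):
            if cand is not None and lean_check(goal, cand):
                proofs[goal] = cand  # accepted only if the kernel says so
            else:
                unproved.append(goal)
        if unproved:
            subgoals = refine(statement, unproved)  # revise the plan
    return None  # budget exhausted
```

The essential design property is that the fallible steps (planning, proving, refining) are delegated to reasoning models, while acceptance of any proof is decided solely by the deterministic Lean check.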
Aleph milestones
Beyond automatically producing formal proofs for the vast majority of PutnamBench problems, Aleph has also generated several other notable formally verified proofs in Lean:
Erdős problems 124 and 481. Erdős problems are mathematical challenges attributed to the legendary Hungarian mathematician Paul Erdős.
A variant of the golden gates conjecture from quantum cryptography, due to Greene and Damelin, as formalized by DeepMind.
A binomial tail bounds conjecture framed by Telgarsky, as formalized by DeepMind.
Just as with the PutnamBench results, we find that Aleph is a highly capable orchestration layer for formal verification across a range of different formal proof tasks.
What’s next
Aleph is becoming the orchestration layer for verified reasoning. Integrated with Kona, Logical Intelligence’s EBRM-based reasoning model, it combines a purpose-built reasoning engine with a production-grade agent to make large-scale formal verification practical for semiconductors, finance, aerospace, automotive, and healthcare.
Vlad Isenbaev is Chief of AI for Logical Intelligence. Boris Hanin, a business and technical advisor to Logical Intelligence, is Associate Professor of Operations Research and Financial Engineering at Princeton University, working on deep learning, probability, and spectral asymptotics.