Aleph Reaches State‑of‑the‑Art Across the Leading Formal Reasoning Benchmarks as Verified Code Generation Nears Reality
Every few weeks a new benchmark result is posted online, and people seem to jump immediately to one of two conclusions: either the system is suddenly “approaching AGI,” or the benchmark itself is dismissed as meaningless because it only measures mathematics or theorem proving.
Both miss the point entirely.
We track formal reasoning benchmarks like PutnamBench, VeriSoftBench, LeanEval, and Verina very closely. We believe these benchmarks are important because they force AI systems to operate under conditions where correctness actually matters. As these tests progress, the problems do not simply become incrementally harder in a linear way. The reasoning chains become deeper, the search spaces become more fragile, and therefore tiny mistakes compound into complete failure.
That matters if your long-term goal is more than image generation or autocomplete: building systems that will eventually operate inside semiconductor design flows, energy infrastructure, transportation systems, industrial software stacks, robotics, or financial environments, where being “mostly right” is effectively the same as being wrong.
Over time, our agentic system Aleph has reached the top position across several of the industry’s leading public formal reasoning benchmarks, including PutnamBench, VeriSoftBench, and LeanEval. Aleph also achieved a verified 100% score on Verina, the benchmark focused on formal software verification. While Verina does not currently maintain a public leaderboard, the benchmark authors independently confirmed the results and Aleph’s first-place ranking. More technical detail will follow in a separate post from our Chief Technology Officer, Alex Fetisov.
Just last summer, other state-of-the-art AI provers were solving less than 2% of PutnamBench. Today, Aleph is approaching full benchmark coverage, having solved 668 of the 672 formally stated mathematical problems. Aleph is not solving a single year’s competition (12 problems) but problem sets derived from decades of the William Lowell Putnam Mathematical Competition, widely regarded as the most prestigious undergraduate mathematics competition in North America.
PutnamBench · Advanced mathematical theorem proving · 672 problems
Aleph Prover: 99.4%
Seed-Prover 1.5 (ByteDance): 86%
Hilbert: 69%

Lean AI · Lean-eval AI benchmark for mathematical formalization
Aleph Prover: 31/60
Aristotle (Harmonic): 27/60
Claude Opus 4.7 (Human-guided*): 19/60

VeriSoftBench · Real-world software verification · 100 tasks
Aleph Prover: 94%
Aristotle (Harmonic): 69%
Gemini-3-Pro (Google): 65%

Verina · Formal verification benchmark
Aleph Prover: 100%
Aristotle (Harmonic): 96.8%
That rate of progress is extraordinary. But simply solving mathematics problems is not the final destination.
One of the biggest misconceptions in AI today is the assumption that if a model becomes very good at mathematics, it will naturally become good at everything else in the way a highly trained human expert might. That is not how these systems work. Training an agent to solve theorem-proving tasks does not magically grant it generalized reasoning capabilities across every domain.
What these environments provide instead is something far more valuable: a stress test for correctness.
Formal mathematics and verification systems are unforgiving: there is no partial credit. The proof either works or it doesn’t. The verification either succeeds or fails completely. Systems operating in these environments are forced to maintain consistency across long chains of reasoning without drifting into hallucination, contradiction, or invalid state transitions. That distinction matters.
Traditional AI systems generate outputs in natural language or source code. Even when those outputs appear convincing, they often contain subtle logical failures, unverifiable assumptions, or hidden correctness issues that only reveal themselves downstream. In low-risk consumer applications, those errors may be tolerable. Inside critical infrastructure or production engineering systems, they are not.
Formal systems solve a fundamentally different problem.
Proof assistants such as Lean allow researchers and engineers to define mathematical theorems, software properties, and hardware constraints inside machine-readable logical frameworks. If a proof is invalid, the verifier rejects it. There is no ambiguity.
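As a minimal illustration (our toy example, not drawn from Aleph's internals), here is what that looks like in Lean 4: a theorem is stated, a proof term is supplied, and the kernel checks it mechanically.

```lean
-- A toy theorem: commutativity of natural-number addition,
-- discharged with the library lemma Nat.add_comm.
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Replace the proof term with anything invalid (e.g. `rfl` here)
-- and the kernel rejects it outright: there is no partial credit.
```

The same mechanism scales from textbook lemmas to software properties: the checker either accepts a complete proof or rejects it.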
That creates the foundation for AI systems that do not merely generate plausible outputs but can mathematically prove correctness.
The implications extend far beyond academic mathematics:
provable correctness in safety-critical software
hardware verification for chips and embedded systems
high-assurance cryptography and infrastructure
automated theorem proving for scientific research
AI-generated code with provable guarantees
As AI-generated software becomes increasingly widespread, we believe verification will move from being an optional layer to a core architectural requirement.
Aleph was never intended to be a science project built to chase benchmark headlines. It was built as an intermediate layer toward something larger: scalable formally verified code generation.
The reason that matters will be obvious to anyone who has deployed software at scale. Large language models are impressive systems, but they remain fundamentally probabilistic architectures. They are exceptionally good at generating plausible outputs. They are significantly less reliable when operating inside environments where correctness must be provable and where errors carry real-world consequences.
The current workaround across much of the industry is brute force. Generate more outputs. Increase sampling. Add more verification layers. Add another model to critique the first model. Then add another verifier on top of that verifier.
That approach can work in narrow domains, but it does not scale efficiently forever.
Our view is that formally verified code generation ultimately requires a different reasoning architecture underneath the interface layer itself. That is where we see EBRMs, like our current Kona model in alpha release, and other potential alternative architectures entering the picture.
These architectures will finally allow verified code generation to scale more sustainably, from both a compute and a reliability standpoint, than today’s increasingly brute-force probabilistic approaches.
This is why these benchmark results matter to us. Not because leaderboard screenshots are the future of AI. But because they send a very clear signal to the market about where this industry is heading.
The field is evolving incredibly quickly. Benchmark standings will continue to move as new systems emerge and new benchmarks appear. That competition is healthy. Frankly, it is one of the most exciting aspects of this moment in AI.
But we are also proud that Aleph is already outperforming systems from the largest frontier labs as well as highly specialized startups focused exclusively on AI for mathematics and theorem proving. More importantly, we believe this category is finally moving from research novelty toward real-world deployment.
The next phase of AI adoption will not be won by whichever model sounds the most intelligent in a demo. It will be won by systems that enterprises, engineers, governments, and infrastructure operators are actually willing to trust.
That means systems capable of reasoning under constraints. Systems capable of verification. Systems capable of proving correctness before deployment rather than explaining failures afterward.
We are confident Aleph is an early glimpse of what that future looks like in a world where verified code generation is the industry standard.