Automatic

Formal Verification

for Code Generation

Automatic

Formal Verification

for Code Generation

Most of the software that runs the modern economy is deployed on a quiet act of faith. Code is typically written by software engineers, though now increasingly generated by an LLM, is run through a series of tests to surface potential bugs, reviewed by another human, and eventually shipped. At that point, the code is live, interacting with millions of real-world edge cases never previously contemplated.


Sometimes the code breaks. So we patch it. When a hole is exploited we issue an update and move on. This model (mostly) worked for decades. But that’s not good enough. Software underlies most critical infrastructure. When it fails, we feel it immediately. It shows up as bank apps freezing mid-transaction, flight systems going down, hospitals forced to delay care, or businesses losing millions from unplanned downtime.


Companies tolerated this model for decades because it was the only economically viable option. Proving software was correct before deployment was historically slow and expensive.


The top companies employ hundreds of mathematicians and Ph.D.s to verify code by hand. This process, called “Formal Verification” refers to mathematical proof that a piece of code does exactly what it should do and nothing else.


Few companies can afford that luxury. Hiring those mathematicians to verify code is extremely expensive. Budget constrained CTOs and CISOs had to settle for “good enough,” building risk models around failure. With AI now pouring billions of lines of unchecked code into the world, the “ship it and fix it later” model is no longer viable.


Now that AI generated code is everywhere, Formal Verification is a requirement for any critical system that cannot be allowed to fail.

Most of the software that runs the modern economy is deployed on a quiet act of faith. Code is typically written by software engineers, though now increasingly generated by an LLM, is run through a series of tests to surface potential bugs, reviewed by another human, and eventually shipped. At that point, the code is live, interacting with millions of real-world edge cases never previously contemplated.


Sometimes the code breaks. So we patch it. When a hole is exploited we issue an update and move on. This model (mostly) worked for decades. But that’s not good enough. Software underlies most critical infrastructure. When it fails, we feel it immediately. It shows up as bank apps freezing mid-transaction, flight systems going down, hospitals forced to delay care, or businesses losing millions from unplanned downtime.


Companies tolerated this model for decades because it was the only economically viable option. Proving software was correct before deployment was historically slow and expensive.


The top companies employ hundreds of mathematicians and Ph.D.s to verify code by hand. This process, called “Formal Verification” refers to mathematical proof that a piece of code does exactly what it should do and nothing else.


Few companies can afford that luxury. Hiring those mathematicians to verify code is extremely expensive. Budget constrained CTOs and CISOs had to settle for “good enough,” building risk models around failure. With AI now pouring billions of lines of unchecked code into the world, the “ship it and fix it later” model is no longer viable.


Now that AI generated code is everywhere, Formal Verification is a requirement for any critical system that cannot be allowed to fail.

It is our best defense in a world where AI powered hackers are moving faster than any human security team ever could.

It is our best defense in a world where AI powered hackers are moving faster than any human security team ever could.

Consider the importance of core financial and banking systems, data centers, internet infrastructure, factory automation, and industrial robots. Only formal verification can give us defensible certainty that they will always work as expected.


AI offers a promising way to accelerate and automate the otherwise tedious and expensive process. But using LLMs for verification has significant drawbacks. LLM-based verification remains both costly and hallucination-prone. LLMs are producing incredible volumes of code of highly suspect quality and quantity that needs to be checked and rechecked for accuracy and safety. Soon companies will spend more time fixing LLM generated code than on coding itself.


Energy-Based Reasoning Models (EBRMs) are better suited for the task of FV, with the ability to generate code that is formally verified upon creation at a fraction of the cost.

Consider the importance of core financial and banking systems, data centers, internet infrastructure, factory automation, and industrial robots. Only formal verification can give us defensible certainty that they will always work as expected.


AI offers a promising way to accelerate and automate the otherwise tedious and expensive process. But using LLMs for verification has significant drawbacks. LLM-based verification remains both costly and hallucination-prone. LLMs are producing incredible volumes of code of highly suspect quality and quantity that needs to be checked and rechecked for accuracy and safety. Soon companies will spend more time fixing LLM generated code than on coding itself.


Energy-Based Reasoning Models (EBRMs) are better suited for the task of FV, with the ability to generate code that is formally verified upon creation at a fraction of the cost.

Understanding FV


From SMT-based verifiers to Lean-style interactive proof assistants, all FV systems have three components:


  • Spec: A mathematical formalization of human intent that dictates what a program is allowed to do, and what it must never do;

  • Code: A set of executable instructions;

  • Proof: A mathematical argument for why the Code satisfies the Spec.


A program is formally verified when an FV system’s checker can determine that the Proof correctly demonstrates that the Code satisfies the Spec. This guarantee is unusually strong, but also narrower than people sometimes assume: formal verification proves that code adheres to a formal specification but not necessarily to original human intent.


FV has historically required writing the Spec and much of the Proof by hand, a process so expensive that it only made sense in select domains like core cloud infrastructure. But modern generative AI holds the promise to change this by automating some or all of the FV pipeline. Common approaches to “AI for formal verification” include:


  • Provers: Humans write the Spec and Code, and AI invents the Proof.

  • “Vericoding”: Humans provide a careful formal Spec, carrying a significant portion of the semantic burden, and AI synthesizes associated Code and Proof.

  • End-to-End: Humans describe a system in language, and AI produces the Spec, Code, and Proof from scratch.


For AI models to automate FV, they must learn to identify, respect, and reason about complex, often implicit, global constraints across the Spec, Code, and Proof. Important aspects of this task are particularly challenging for LLM-based reasoning models. Energy-Based Reasoning Models such as those developed by Logical Intelligence may play a key role in scaling FV for code generation.

Understanding FV


From SMT-based verifiers to Lean-style interactive proof assistants, all FV systems have three components:


  • Spec: A mathematical formalization of human intent that dictates what a program is allowed to do, and what it must never do;

  • Code: A set of executable instructions;

  • Proof: A mathematical argument for why the Code satisfies the Spec.


A program is formally verified when an FV system’s checker can determine that the Proof correctly demonstrates that the Code satisfies the Spec. This guarantee is unusually strong, but also narrower than people sometimes assume: formal verification proves that code adheres to a formal specification but not necessarily to original human intent.


FV has historically required writing the Spec and much of the Proof by hand, a process so expensive that it only made sense in select domains like core cloud infrastructure. But modern generative AI holds the promise to change this by automating some or all of the FV pipeline. Common approaches to “AI for formal verification” include:


  • Provers: Humans write the Spec and Code, and AI invents the Proof.

  • “Vericoding”: Humans provide a careful formal Spec, carrying a significant portion of the semantic burden, and AI synthesizes associated Code and Proof.

  • End-to-End: Humans describe a system in language, and AI produces the Spec, Code, and Proof from scratch.


For AI models to automate FV, they must learn to identify, respect, and reason about complex, often implicit, global constraints across the Spec, Code, and Proof. Important aspects of this task are particularly challenging for LLM-based reasoning models. Energy-Based Reasoning Models such as those developed by Logical Intelligence may play a key role in scaling FV for code generation.

EBRMs for FV Code Generation


This is where EBRMs excel. An EBRM abandons autoregressive generation and frames logical deduction as a continuous optimization problem. Their architecture relies on three components:


  • a neural network encoder, mapping inputs to a latent space

  • a neural network decoder, mapping this latent space to an output

  • an input-dependent energy function defined on the latent space


The encoder, decoder, and energy can all be efficiently learned from data using the same GPU-based training paradigms as LLMs. The learned encoder maps inputs to a latent space that often offers a particularly rich representation of problem inputs and domain constraints. The energy enables “reasoning” directly over the latent space by associating good outputs to low energy points in latent space.


Instead of scoring only the next word, an EBRM can score entire candidate states at once. In this way, an EBRM’s learned energy function provides a cheap and learnable but imperfect verifier.

EBRMs for FV Code Generation


This is where EBRMs excel. An EBRM abandons autoregressive generation and frames logical deduction as a continuous optimization problem. Their architecture relies on three components:


  • a neural network encoder, mapping inputs to a latent space

  • a neural network decoder, mapping this latent space to an output

  • an input-dependent energy function defined on the latent space


The encoder, decoder, and energy can all be efficiently learned from data using the same GPU-based training paradigms as LLMs. The learned encoder maps inputs to a latent space that often offers a particularly rich representation of problem inputs and domain constraints. The energy enables “reasoning” directly over the latent space by associating good outputs to low energy points in latent space.


Instead of scoring only the next word, an EBRM can score entire candidate states at once. In this way, an EBRM’s learned energy function provides a cheap and learnable but imperfect verifier.

Commoditizing Trust


Scaling automated formal verification will require an AI stack built not only on LLMs but also on other reasoning models, such as EBRMs.


LLMs will remain valuable as the interface. They are good at translating ambiguous human requirements into rough formal structure, decomposing tasks, navigating toolchains, and explaining failures. But the hardest part of verified code generation is not the interface. It is the constrained search required to keep Spec, Code, and Proof in global alignment.


That is where we believe EBRMs will shine. By reasoning in latent space, EBRMs can score entire candidate states instead of extending a token sequence one step at a time. EBRMs will make the search for verifiable artifacts more effective by using their learned energy as a verifier surrogate.


As search over Spec, Code, and Proof becomes reliable, the economics of formal verification will change. FV will stop being an exotic technique reserved for a handful of hyper-valuable systems and will become a new assurance layer in the software stack.

Commoditizing Trust


Scaling automated formal verification will require an AI stack built not only on LLMs but also on other reasoning models, such as EBRMs.


LLMs will remain valuable as the interface. They are good at translating ambiguous human requirements into rough formal structure, decomposing tasks, navigating toolchains, and explaining failures. But the hardest part of verified code generation is not the interface. It is the constrained search required to keep Spec, Code, and Proof in global alignment.


That is where we believe EBRMs will shine. By reasoning in latent space, EBRMs can score entire candidate states instead of extending a token sequence one step at a time. EBRMs will make the search for verifiable artifacts more effective by using their learned energy as a verifier surrogate.


As search over Spec, Code, and Proof becomes reliable, the economics of formal verification will change. FV will stop being an exotic technique reserved for a handful of hyper-valuable systems and will become a new assurance layer in the software stack.