Energy-Based Models for Reasoning,

LLMs for the Interface:

Scaling Reasoning with Agentic AI

Jan 21, 2026

For high-stakes tasks where correctness is essential, using AI to “just generate an answer” is not enough. In such settings, frontier AI models are increasingly built as agentic compound systems that generate intermediate structure, check it, revise it, and only then commit. Such a pipeline is what is typically meant by AI reasoning.


The best reasoning AI systems are not monolithic models. They have multiple components, each with a clear role, connected by a shared objective. LLMs are central to this stack because they are strong at generating candidates—explanations, code, plans, and next steps—and they are an excellent interface between humans and machines.


At Logical Intelligence, we have three core technical theses:


  • LLMs are fundamentally limited as reasoning models due to their reliance on discrete tokens. This is a serious impediment for scaling up AI reasoning.


  • Energy-based reasoning models (EBRMs) overcome the main difficulties inherent in using LLM-based reasoning models.


  • Scaling AI reasoning requires using EBRMs for reasoning and LLMs for coordination, especially when translating to and from natural language instruction.


This blog post explains our thinking in detail.



Reasoning is adaptive planning


To make decisions, reasoning models iteratively produce so-called reasoning traces: additional, task-relevant context such as definitions, subgoals, intermediate calculations, proof skeletons, or tool outputs. The hope is that, conditional on this expanded context, producing the correct final output is easy and reliable.


A key framing is: reasoning is adaptive planning, and reasoning traces are current plans.


Across domains such as proofs, chip design, robotics, and scheduling, the structure is the same:


  • there is a space of possible states,

  • there are constraints and objectives,

  • and you seek a trajectory from “here” to “there” that stays valid.


Planning only works if you can evaluate progress while you’re still in the middle. If you only get feedback at the end (the plan “works” or “doesn’t”) you’re forced into guess-and-check. Many next steps look reasonable, but you can’t tell which ones keep the whole solution valid until it’s too late. When a plan fails, you backtrack and try again because you don’t know what broke.


So what you want is simple: a score you can apply to intermediate states (partially completed plans) that tells you, even if imperfectly, whether you’re staying consistent with the global constraints and helps you pinpoint what is broken so you can repair it. This is what EBRMs provide, and what LLM-only approaches typically lack.
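
To make this concrete, here is a minimal, hypothetical sketch of the difference (the toy problem, function names, and scoring rule are invented for illustration): with only end-of-plan feedback you are stuck sampling whole plans, while even an imperfect score on intermediate states lets you extend and repair a plan step by step.

```python
import random

# Toy planning problem: build a sequence of steps (integers) that must
# stay non-decreasing (the constraint) and end at a target value (the objective).
TARGET = 9
HORIZON = 5

def final_check(plan):
    """End-only feedback: did the finished plan work or not?"""
    return all(a <= b for a, b in zip(plan, plan[1:])) and plan[-1] == TARGET

def partial_score(plan):
    """Score on an intermediate state: lower is better.

    Counts constraint violations so far plus remaining distance to the target.
    Imperfect, but available mid-plan.
    """
    violations = sum(1 for a, b in zip(plan, plan[1:]) if a > b)
    return violations + abs(TARGET - plan[-1]) / max(1, HORIZON - len(plan) + 1)

def guess_and_check(trials=1000):
    """Without intermediate feedback: sample whole plans, test only at the end."""
    for _ in range(trials):
        plan = [random.randint(0, TARGET) for _ in range(HORIZON)]
        if final_check(plan):
            return plan
    return None  # may fail even after many samples

def guided_search():
    """With intermediate feedback: extend the plan by the step that scores best."""
    plan = [0]
    while len(plan) < HORIZON:
        candidates = [plan + [step] for step in range(TARGET + 1)]
        plan = min(candidates, key=partial_score)
    return plan if final_check(plan) else None

print("guess-and-check:", guess_and_check())
print("guided search:  ", guided_search())
```

The toy problem is trivial, but the contrast scales: the first loop pays for entire plans per attempt, while the second pays for cheap evaluations of partial states and never throws away valid work.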



LLM reasoning issues


Today, most reasoning models use LLMs to produce reasoning traces. While this has seen significant success, it also proves difficult to scale for several reasons:


  • LLMs are autoregressive: LLMs generate traces token-by-token. Revising an earlier step usually requires regenerating everything that follows it. “Backwards” conditioning (optimizing traces given both context and a target answer/spec) is awkward, which leads to sample-inefficient methods that struggle with credit assignment.


  • LLM training is locally-scored: Standard pre-training optimizes next-token prediction rather than global correctness and constraint satisfaction of long reasoning chains. Trace quality typically degrades with length without costly post-training and search.


  • LLMs are discrete: Reasoning traces generated by LLMs are sequences of discrete tokens. Because the trace is discrete, making small, targeted edits via gradient-based refinement is awkward; improvement typically relies on discrete search, reranking, or noisy gradients through surrogate objectives.
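
As a toy illustration of the first point (the function names below are hypothetical, not any model's real API): once step k of an autoregressive trace changes, every later token was sampled under the old prefix and generally has to be regenerated, whereas a trace that is held and scored as a whole can be patched in place and re-scored.

```python
def autoregressive_revise(trace, k, new_step, generate_suffix):
    """Revising step k of an autoregressive trace: everything after k was
    generated conditioned on the old prefix, so the suffix must be regenerated."""
    prefix = trace[:k] + [new_step]
    return prefix + generate_suffix(prefix, length=len(trace) - len(prefix))

def whole_trace_revise(trace, k, new_step, score):
    """Revising a globally-scored trace: patch the step in place, keep the rest,
    and let the global score decide whether the edit helped."""
    edited = trace[:k] + [new_step] + trace[k + 1:]
    return edited if score(edited) <= score(trace) else trace

# Tiny demo with placeholder callables standing in for a generator and a scorer.
toy = [1, 2, 3, 4]
print(autoregressive_revise(toy, 1, 9, lambda p, length: [0] * length))
print(whole_trace_revise(toy, 1, 9, score=lambda t: sum(abs(x - 2) for x in t)))
```

The cost asymmetry is the point: the first path pays for a fresh generation on every revision and gets only end-of-trace feedback for credit assignment, while the second pays for one scoring call per local edit.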



EBRM vs. LLM reasoning


The hallmark of energy-based models, and of EBRMs in particular, is that they learn to assign a scalar score—an energy—to each candidate state (e.g., a reasoning trace). Low energy means “more consistent with constraints/objectives.” High energy means “something is broken.”


The crucial advantage of our approach to EBRMs is that energies can be evaluated on partial traces, not just final answers. That means the system can localize failure. It can predict what is broken and where: which constraint is being violated, which part of the plan is inconsistent, which step introduced the contradiction. This turns “it failed” into actionable guidance.
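
One common way for an energy to support this kind of localization is to decompose it into per-constraint (or per-step) terms. The sketch below is purely illustrative; the constraints and interfaces are invented for this example and are not a description of Kona's architecture.

```python
from typing import Callable, Sequence

# Hypothetical decomposed energy: the total is a sum of non-negative terms,
# one per constraint, so a high total can be traced back to the specific
# terms that contribute it.
Constraint = Callable[[Sequence[float]], float]  # returns a penalty >= 0

def energy(trace: Sequence[float], constraints: Sequence[Constraint]) -> float:
    return sum(c(trace) for c in constraints)

def localize(trace: Sequence[float], constraints: Sequence[Constraint]):
    """Report which constraints carry the energy, worst offender first."""
    per_term = [(c.__name__, c(trace)) for c in constraints]
    return sorted(per_term, key=lambda item: -item[1])

# Toy constraints on a partial trace (here just a list of numbers).
def monotone(trace):  # steps should not decrease
    return sum(max(0.0, a - b) for a, b in zip(trace, trace[1:]))

def bounded(trace):   # steps should stay in [0, 10]
    return sum(max(0.0, x - 10.0) + max(0.0, -x) for x in trace)

partial_trace = [1.0, 3.0, 2.0, 12.0]  # violates both constraints
print(energy(partial_trace, [monotone, bounded]))    # total energy: 3.0
print(localize(partial_trace, [monotone, bounded]))  # [('bounded', 2.0), ('monotone', 1.0)]
```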


Logical Intelligence is building a new breed of energy-based, non-autoregressive reasoning models (EBRMs) to address head-on the issues inherent in LLM-based reasoning:


  • Non-autoregressive at the trace level: Our flagship model Kona generates complete reasoning traces all at once, rather than token by token, and can condition directly on the problem and constraints. A single global score over a continuous, editable trace turns reasoning into an optimization problem with dense feedback, instead of a sampling problem with sparse feedback. Kona can revise any part of a trace and can natively condition on targets (specs/answers/proof goals).


  • Globally-scored: Kona learns an energy that evaluates end-to-end trace quality, so long-horizon coherence is trained and optimized directly.


  • Reasoning in continuous space: Kona reasons in a continuous latent space, outputting dense vector tokens rather than discrete ones. This allows Kona to use the learned energy to make controlled, local edits that improve the coherence and constraint satisfaction of reasoning traces via approximate gradient information.
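
To illustrate the general mechanism, and only the mechanism, here is a minimal sketch of energy-guided refinement in a continuous trace space. The quadratic energy and its hand-written gradient are stand-ins for a learned model; the point is that the trace itself is improved by gradient steps rather than resampled from scratch.

```python
import numpy as np

rng = np.random.default_rng(0)
D, T = 8, 6                        # latent dimension, trace length
anchor = rng.normal(size=(T, D))   # stand-in for a "consistent" trace the energy prefers

def energy(trace):
    # Penalize distance from the anchor plus abrupt jumps between consecutive steps.
    smoothness = np.sum((trace[1:] - trace[:-1]) ** 2)
    return np.sum((trace - anchor) ** 2) + 0.1 * smoothness

def energy_grad(trace):
    # Analytic gradient of the quadratic energy above.
    g = 2.0 * (trace - anchor)
    diff = trace[1:] - trace[:-1]
    g[1:] += 0.2 * diff
    g[:-1] -= 0.2 * diff
    return g

trace = rng.normal(size=(T, D))    # initial noisy continuous trace
for _ in range(200):
    trace -= 0.05 * energy_grad(trace)   # controlled, local edits to the trace itself

print(f"anchor energy:        {energy(anchor):.4f}")
print(f"refined trace energy: {energy(trace):.4f}")
```

With a learned energy, the gradient comes from backpropagation through the model rather than by hand, but the update target is the same: the trace, not the weights.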



What Logical Intelligence is doing


Logical Intelligence is building fundamentally new tools for reasoning and orchestration that we believe will be essential parts of the AGI ecosystem. So far, these have been agentic compound-system components, including:


  • An EBRM-based reasoning model named Kona that learns a consistent, high-quality score over both partial and complete reasoning traces;


  • A sophisticated orchestration layer named Aleph that coordinates calls to Kona, LLMs, and other tools.


Kona has already demonstrated a remarkable ability to reason efficiently under highly nontrivial constraints, exemplified by its performance on tasks that require primarily spatial, rather than language-based, reasoning. Aleph has already proven itself an outstanding orchestrator of reasoning models, recently achieving a near-perfect score on PutnamBench, a well-known formal reasoning benchmark centered on finding formally verifiable solutions to hard math problems.

Eve Bodnia is Founder & CEO of Logical Intelligence. Boris Hanin, a business and technical advisor to Logical Intelligence, is Associate Professor of Operations Research and Financial Engineering at Princeton University, working on deep learning, probability, and spectral asymptotics.
