Our Sudoku demo is a simple way to see the difference yourself.
We have Kona solve hard sudoku puzzles in real time alongside a group of popular LLMs, including GPT-5.2, Claude Opus 4.5 and Sonnet 4.5, Gemini 3 Pro, and DeepSeek V3.2. After about a week of public access, Kona solved 96.2% of puzzles in an average of 313 milliseconds. The frontier LLMs together had a solve rate of only 2%, taking up to 90 seconds before returning an incorrect answer or timing out.
The reason LLMs struggle with sudoku is not that the rules are especially complicated: a sudoku puzzle is 81 cells arranged in a 9x9 grid, where each row, column, and 3x3 box must contain the digits 1 through 9 exactly once. But LLM architecture isn’t well-suited for spatial tasks; solving these puzzles requires holding the whole grid in mind and reasoning about how a change in one cell affects every other cell in the same row, column, and box.
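To make that constraint structure concrete, here is a minimal sketch (illustrative only, not code from the demo) of the 27 groups a solver has to satisfy simultaneously:

```python
# Illustrative only: the 27 constraint groups of a sudoku grid -- 9 rows,
# 9 columns, and 9 boxes -- each of which must contain 1 through 9 exactly once.
def units(grid):
    rows = [list(row) for row in grid]
    cols = [[grid[r][c] for r in range(9)] for c in range(9)]
    boxes = [
        [grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)]
        for br in range(0, 9, 3) for bc in range(0, 9, 3)
    ]
    return rows + cols + boxes

def is_valid_solution(grid):
    # A completed grid is valid when every one of the 27 units is exactly {1, ..., 9}.
    return all(set(unit) == set(range(1, 10)) for unit in units(grid))
```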
Because LLMs generate tokens sequentially, committing to each one as they go, they cannot easily revise earlier decisions when they discover a conflict later. This is not a challenge that can be solved by adding more compute or seeing more sudoku puzzles - it is an inherent structural limitation of autoregressive language models.
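As a loose analogy - not a literal account of how an LLM computes - imagine filling the grid one cell at a time, always committing to the first digit that doesn't conflict with what is already on the board, and never going back. The hypothetical `greedy_fill` below fails on most hard puzzles for exactly that reason: some later cell ends up with no legal digit, and the only fix would be revising a choice made long before.

```python
# Loose analogy only: a "commit-as-you-go" filler in the spirit of sequential
# generation. Empty cells are marked 0. It places the first non-conflicting
# digit and never revisits an earlier choice.
def greedy_fill(grid):
    for r in range(9):
        for c in range(9):
            if grid[r][c] != 0:
                continue
            legal = [d for d in range(1, 10) if fits(grid, r, c, d)]
            if not legal:
                return None  # stuck: an earlier commitment was wrong
            grid[r][c] = legal[0]  # commit and move on, never revise
    return grid

def fits(grid, r, c, d):
    # d is legal at (r, c) if it doesn't already appear in that row, column, or box.
    if d in grid[r] or any(grid[i][c] == d for i in range(9)):
        return False
    br, bc = 3 * (r // 3), 3 * (c // 3)
    return all(grid[i][j] != d for i in range(br, br + 3) for j in range(bc, bc + 3))
```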
See the live benchmark at sudoku.logicalintelligence.com.
We disabled code execution for all models in the demo to prevent the LLMs from writing a brute-force backtracking solver in Python and running it, which would tell us something about their coding ability but nothing about their reasoning ability. With code execution off, the models have to actually work through the puzzle using whatever reasoning capabilities they have, and those capabilities turn out to be poorly suited to constraint satisfaction.
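For reference, the solver we were ruling out is a few lines of textbook backtracking, sketched below (reusing the `fits` check from the previous snippet). It solves the puzzle precisely by doing the thing sequential generation cannot: trying a digit, hitting a dead end, and undoing the earlier choice.

```python
# Sketch of a brute-force backtracking solver (fits() as defined above).
# It fills the first empty cell, recurses, and unwinds on failure.
def solve(grid):
    for r in range(9):
        for c in range(9):
            if grid[r][c] != 0:
                continue
            for d in range(1, 10):
                if fits(grid, r, c, d):
                    grid[r][c] = d
                    if solve(grid):
                        return True
                    grid[r][c] = 0  # undo and try the next digit
            return False  # no digit fits here; backtrack to an earlier cell
    return True  # no empty cells remain: solved
```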
We also did not fine-tune these LLMs on sudoku problems. Fine-tuning would test pattern recognition over memorized solutions, not reasoning. Nor was Kona trained on solved puzzles; it learns the constraint structure and solves novel configurations from that information alone. The comparison is between reasoning architectures, not between levels of domain-specific exposure.
The LLMs receive a puzzle and start producing chains of reasoning: "Let me analyze by rows, columns and boxes," "for cell (4,1), the row needs 1, 2, 3, 5, 7, 8, 9," and so on, sometimes generating hundreds of tokens of step-by-step analysis. Then they output a grid with duplicate digits in a row, a missing digit in a column, or some other constraint violation. DeepSeek V3.2 tends to finish quickly but with many duplicates across rows, columns, and boxes - failing to solve the puzzle. The Claude models often think for a long time and then produce something close but still wrong. GPT-5.2 does best among the LLMs with a 6.9% solve rate, which is still a failure rate above 93% on a puzzle that any patient human can solve.
Kona approaches the problem differently because it's a different kind of model. Instead of generating a solution token-by-token, Kona produces a complete candidate grid and evaluates it against all the constraints simultaneously, assigning a scalar energy score where low energy means the constraints are satisfied and high energy means something is violated. If the energy is high, Kona can identify which constraints are broken and revise the relevant parts of the grid without starting over, using gradient information in continuous latent space to move toward valid configurations. This process takes about 313 milliseconds on average, compared to the 30+ seconds the LLMs spend generating reasoning traces that lead nowhere.
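Kona's actual energy function is learned and differentiable, so the snippet below is only a toy stand-in: it scores a complete candidate grid by counting duplicated digits across the 27 units from the first sketch. It captures the shape of the idea - zero energy means every constraint is satisfied, and the broken units point at exactly the cells worth revising - but none of the continuous, gradient-based machinery.

```python
# Toy stand-in, not Kona's learned energy: score a complete candidate grid by
# counting duplicated digits in every row, column, and box (units() from the
# first sketch). A valid solution scores 0; every violation raises the score.
def energy(grid):
    return sum(len(unit) - len(set(unit)) for unit in units(grid))

def violated_units(grid):
    # Indices of the broken constraint groups (0-8 rows, 9-17 columns,
    # 18-26 boxes): where a revision should focus instead of starting over.
    return [i for i, unit in enumerate(units(grid)) if len(unit) != len(set(unit))]
```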
The cost difference is also notable. Running Kona on the nearly 13,000 puzzles it has solved so far has cost about $4 in GPU time and used roughly 1.1 hours of compute. The LLM API calls for the same demo - at a 98% failure rate - have cost around $11,000, mostly from Claude's tokens.
If a model architecture can’t handle sudoku, where the constraints are simple and the state space is tiny, it is not going to handle supervising industrial control systems or billions of dollars in global transactions. The gap between 96% and 2% is not about sudoku - it’s a measure of how differently these two architectures are designed.

