[2602.12665] Evaluating Robustness of Reasoning Models on Parameterized Logical Problems

arXiv - AI 3 min read Article

Summary

This paper introduces a diagnostic benchmark for evaluating the robustness of reasoning models on parameterized logical problems, specifically focusing on 2-SAT instances and their structural characteristics.

Why It Matters

Understanding the robustness of reasoning models is crucial for AI systems that rely on logical reasoning. By controlling structural properties of 2-SAT instances, such as UNSAT core size, solution multiplicity, and symmetry, this work shows how targeted structural interventions affect model performance and exposes weaknesses that aggregate accuracy metrics can hide.

Key Takeaways

  • Introduces a benchmark for 2-SAT problems that isolates specific reasoning competencies.
  • Evaluates LLM-based reasoners on decision accuracy and robustness under structural changes.
  • Identifies brittleness in models that is not apparent from aggregate SAT accuracy.
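The takeaways above rest on 2-SAT admitting an efficient exact decision procedure, which is what gives the benchmark reliable ground truth for decision accuracy. A minimal sketch of the textbook implication-graph algorithm (Aspvall–Plass–Tarjan); this is my own illustration, not code from the paper:

```python
# A 2-CNF is satisfiable iff no variable shares a strongly connected
# component (SCC) of the implication graph with its negation.

def solve_2sat(n, clauses):
    """Decide a 2-CNF over variables 1..n.

    clauses: list of (a, b) literal pairs; literal +v means x_v,
    -v means not-x_v. Returns a satisfying {var: bool} dict, or None.
    """
    def idx(lit):  # map a literal to a node index in [0, 2n)
        return 2 * (abs(lit) - 1) + (0 if lit > 0 else 1)

    N = 2 * n
    adj = [[] for _ in range(N)]   # implication graph
    radj = [[] for _ in range(N)]  # reverse graph for Kosaraju
    for a, b in clauses:
        # (a or b) yields the implications (not a -> b) and (not b -> a)
        for u, v in ((idx(-a), idx(b)), (idx(-b), idx(a))):
            adj[u].append(v)
            radj[v].append(u)

    # Pass 1: order nodes by DFS finish time (iterative, no recursion).
    order, seen = [], [False] * N
    for s in range(N):
        if seen[s]:
            continue
        seen[s] = True
        stack = [(s, iter(adj[s]))]
        while stack:
            node, it = stack[-1]
            pushed = False
            for w in it:
                if not seen[w]:
                    seen[w] = True
                    stack.append((w, iter(adj[w])))
                    pushed = True
                    break
            if not pushed:
                order.append(node)
                stack.pop()

    # Pass 2: collect SCCs on the reverse graph; component indices come
    # out in topological order of the original graph's condensation.
    comp, c = [-1] * N, 0
    for u in reversed(order):
        if comp[u] != -1:
            continue
        comp[u] = c
        stack = [u]
        while stack:
            x = stack.pop()
            for w in radj[x]:
                if comp[w] == -1:
                    comp[w] = c
                    stack.append(w)
        c += 1

    assignment = {}
    for v in range(1, n + 1):
        if comp[idx(v)] == comp[idx(-v)]:
            return None  # x_v and not-x_v on one cycle: UNSAT
        # set x_v true iff its node comes later in topological order
        assignment[v] = comp[idx(v)] > comp[idx(-v)]
    return assignment
```

Because the whole procedure is two linear-time graph passes, ground-truth labels and witness assignments are cheap to produce at any instance size.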

Computer Science > Artificial Intelligence · arXiv:2602.12665 (cs) · Submitted on 13 Feb 2026

Title: Evaluating Robustness of Reasoning Models on Parameterized Logical Problems
Authors: Naïm Es-sebbani, Esteban Marquer, Yakoub Salhi, Zied Bouraoui

Abstract: Logic provides a controlled testbed for evaluating LLM-based reasoners, yet standard SAT-style benchmarks often conflate surface difficulty (length, wording, clause order) with the structural phenomena that actually determine satisfiability. We introduce a diagnostic benchmark for 2-SAT built from parameterized families of structured 2-CNF formulas, where satisfiability is characterized by the implication graph and can be tuned along interpretable axes. Our generators isolate distinct competencies and failure modes:

  • (i) contradiction-cycle UNSAT cores with controllable size and imbalance,
  • (ii) SAT instances with a prescribed fraction of free variables to control solution multiplicity,
  • (iii) planted backbones that modulate propagation,
  • (iv) late bridge clauses that couple otherwise monotone regions to probe sensitivity to ordering and revision, and
  • (v) symmetry/duplication variants that test abstraction under renaming and redundant structure.

We evaluate LLM-based reasoners on decision accuracy and assignment validity, and quantify robustness under s...

Related Articles

I can't help rooting for tiny open source AI model maker Arcee | TechCrunch

Arcee is a tiny 26-person U.S. startup that built a high-performing, massive, open source LLM. And it's gaining popularity with OpenClaw ...

TechCrunch - AI · 4 min ·
Anthropic Teams Up With Its Rivals to Keep AI From Hacking Everything | WIRED

The AI lab's Project Glasswing will bring together Apple, Google, and more than 45 other organizations. They'll use the new Claude Mythos...

Wired - AI · 7 min ·

The public needs to control AI-run infrastructure, labor, education, and governance— NOT private actors

A lot of discussion around AI is becoming siloed, and I think that is dangerous. People in AI-focused spaces often talk as if the only qu...

Reddit - Artificial Intelligence · 1 min ·

Agents that write their own code at runtime and vote on capabilities, no human in the loop

hollowOS just hit v4.4 and I added something that I haven’t seen anyone else do. Previous versions gave you an OS for agents: structured ...

Reddit - Artificial Intelligence · 1 min ·

