[2602.12665] Evaluating Robustness of Reasoning Models on Parameterized Logical Problems
Summary
This paper introduces a diagnostic benchmark for evaluating the robustness of reasoning models on parameterized logical problems, specifically focusing on 2-SAT instances and their structural characteristics.
Why It Matters
Understanding the robustness of reasoning models is crucial for improving AI systems that rely on logical reasoning. This research provides insights into how different structural interventions affect model performance, highlighting areas for enhancement in AI reasoning capabilities.
Key Takeaways
- Introduces a benchmark for 2-SAT problems that isolates specific reasoning competencies.
- Evaluates LLM-based reasoners on decision accuracy and assignment validity, and quantifies robustness under structural changes.
- Identifies brittleness in models that is not apparent from aggregate SAT accuracy.
Computer Science > Artificial Intelligence
arXiv:2602.12665 (cs)
[Submitted on 13 Feb 2026]

Title: Evaluating Robustness of Reasoning Models on Parameterized Logical Problems
Authors: Naïm Es-sebbani, Esteban Marquer, Yakoub Salhi, Zied Bouraoui

Abstract: Logic provides a controlled testbed for evaluating LLM-based reasoners, yet standard SAT-style benchmarks often conflate surface difficulty (length, wording, clause order) with the structural phenomena that actually determine satisfiability. We introduce a diagnostic benchmark for 2-SAT built from parameterized families of structured 2-CNF formulas, where satisfiability is characterized by the implication graph and can be tuned along interpretable axes. Our generators isolate distinct competencies and failure modes: (i) contradiction-cycle UNSAT cores with controllable size and imbalance, (ii) SAT instances with a prescribed fraction of free variables to control solution multiplicity, (iii) planted backbones that modulate propagation, (iv) late bridge clauses that couple otherwise monotone regions to probe sensitivity to ordering and revision, and (v) symmetry/duplication variants that test abstraction under renaming and redundant structure. We evaluate LLM-based reasoners on decision accuracy and assignment validity, and quantify robustness under s...
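The abstract notes that 2-SAT satisfiability "is characterized by the implication graph." As background, here is a minimal sketch of the standard polynomial-time decision procedure that property enables (this is not the paper's code; the clause encoding and function names are our own): each clause (a ∨ b) contributes implications ¬a → b and ¬b → a, and the formula is unsatisfiable exactly when some variable and its negation land in the same strongly connected component of that graph.

```python
def solve_2sat(num_vars, clauses):
    """Decide a 2-CNF formula via its implication graph.

    Literals are nonzero ints: +i means x_i, -i means NOT x_i.
    Returns a satisfying assignment (list of bools) or None if UNSAT.
    Illustrative sketch only; not the benchmark's generator or solver.
    """
    # Map literal -> node index: x_i -> 2(i-1), NOT x_i -> 2(i-1)+1
    def node(lit):
        return 2 * (abs(lit) - 1) + (0 if lit > 0 else 1)

    n = 2 * num_vars
    adj = [[] for _ in range(n)]   # implication graph
    radj = [[] for _ in range(n)]  # reversed graph (for Kosaraju)
    for a, b in clauses:
        # (a OR b) gives NOT a -> b and NOT b -> a
        adj[node(-a)].append(node(b))
        radj[node(b)].append(node(-a))
        adj[node(-b)].append(node(a))
        radj[node(a)].append(node(-b))

    # Pass 1: iterative DFS recording nodes in order of finish time.
    visited = [False] * n
    order = []
    for start in range(n):
        if visited[start]:
            continue
        visited[start] = True
        stack = [(start, iter(adj[start]))]
        while stack:
            x, it = stack[-1]
            advanced = False
            for y in it:
                if not visited[y]:
                    visited[y] = True
                    stack.append((y, iter(adj[y])))
                    advanced = True
                    break
            if not advanced:
                order.append(x)
                stack.pop()

    # Pass 2: label SCCs on the reversed graph; component ids come out
    # in topological order of the condensation.
    comp = [-1] * n
    c = 0
    for u in reversed(order):
        if comp[u] != -1:
            continue
        comp[u] = c
        stack = [u]
        while stack:
            x = stack.pop()
            for y in radj[x]:
                if comp[y] == -1:
                    comp[y] = c
                    stack.append(y)
        c += 1

    # UNSAT iff x_v and NOT x_v share an SCC; otherwise assign each
    # variable the truth value of whichever literal is later in
    # topological order.
    assignment = []
    for v in range(num_vars):
        pos, neg = comp[2 * v], comp[2 * v + 1]
        if pos == neg:
            return None  # contradiction cycle through x_v
        assignment.append(pos > neg)
    return assignment


# A contradiction-cycle core in miniature: (x1)(NOT x1) as 2-clauses.
# solve_2sat(1, [(1, 1), (-1, -1)]) returns None.
```

This is the classical Aspvall–Plass–Tarjan construction; the paper's "contradiction-cycle UNSAT cores" are, in this view, planted cycles of the implication graph that contain both a literal and its negation.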