[2602.14189] Knowing When Not to Answer: Abstention-Aware Scientific Reasoning
Summary
The paper presents an abstention-aware framework for scientific claim verification, arguing that a model should decline to answer when evidence is insufficient rather than risk a potentially harmful incorrect conclusion.
Why It Matters
This research addresses a critical gap in the evaluation of large language models in scientific contexts, where providing an incorrect answer can have significant consequences. By focusing on abstention, the study promotes safer and more reliable scientific reasoning, which is essential for advancing AI applications in research and decision-making.
Key Takeaways
- Abstention can prevent harmful conclusions in scientific reasoning.
- The proposed framework evaluates claims based on available evidence.
- Confidence-based abstention significantly reduces error risk.
- The study highlights the need for model-agnostic evaluation methods.
- Future work should focus on selective reasoning in scientific domains.
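The confidence-based abstention mentioned in the takeaways can be illustrated with a minimal sketch. The threshold value, toy probabilities, and the `decide` helper below are illustrative assumptions, not the paper's actual implementation:

```python
def decide(p_support: float, p_refute: float, tau: float = 0.75) -> str:
    """Return a verdict, abstaining when the top class is not confident enough.
    tau is an assumed confidence threshold, chosen here for illustration."""
    top, label = max((p_support, "support"), (p_refute, "refute"))
    return label if top >= tau else "abstain"

# Toy predictions: (p_support, p_refute, gold label)
preds = [
    (0.95, 0.05, "support"),
    (0.55, 0.45, "refute"),   # uncertain and wrong: abstaining avoids an error
    (0.10, 0.90, "refute"),
]

verdicts = [(decide(ps, pr), gold) for ps, pr, gold in preds]
errors = sum(1 for v, g in verdicts if v != "abstain" and v != g)
print(errors)  # the uncertain case becomes an abstention instead of an error
```

The design choice is the standard risk-coverage trade-off: raising `tau` lowers the error rate on answered claims at the cost of answering fewer of them.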
Computer Science > Computation and Language
arXiv:2602.14189 (cs) [Submitted on 15 Feb 2026]
Title: Knowing When Not to Answer: Abstention-Aware Scientific Reasoning
Authors: Samir Abdaljalil, Erchin Serpedin, Hasan Kurban
Abstract: Large language models are increasingly used to answer and verify scientific claims, yet existing evaluations typically assume that a model must always produce a definitive answer. In scientific settings, however, unsupported or uncertain conclusions can be more harmful than abstaining. We study this problem through an abstention-aware verification framework that decomposes scientific claims into minimal conditions, audits each condition against available evidence using natural language inference (NLI), and selectively decides whether to support, refute, or abstain. We evaluate this framework across two complementary scientific benchmarks: SciFact and PubMedQA, covering both closed-book and open-domain evidence settings. Experiments are conducted with six diverse language models, including encoder-decoder, open-weight chat models, and proprietary APIs. Across all benchmarks and models, we observe that raw accuracy varies only modestly across architectures, while abstention plays a critical role in controlling error. In particular, confidence-based abstention substantially reduces risk...
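The verification loop described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions: `nli_stub` is a hypothetical stand-in for a real NLI model (its string-matching logic is not how NLI works), and the aggregation rule (support only if every condition is entailed, refute if any is contradicted, abstain otherwise) is one plausible reading of the framework, not the paper's exact algorithm:

```python
from typing import List, Tuple

def nli_stub(premise: str, hypothesis: str) -> Tuple[float, float, float]:
    """Hypothetical NLI scorer returning (entail, neutral, contradict) scores.
    A real system would query an NLI model instead of matching substrings."""
    if hypothesis in premise:
        return (0.9, 0.08, 0.02)
    return (0.2, 0.6, 0.2)

def verify(conditions: List[str], evidence: str, tau: float = 0.7) -> str:
    """Audit each minimal condition against the evidence; decide selectively."""
    entailed = []
    for cond in conditions:
        entail, _, contradict = nli_stub(evidence, cond)
        if contradict >= tau:
            return "refute"  # one contradicted condition refutes the claim
        entailed.append(entail >= tau)
    # support only when every condition is backed by evidence; else abstain
    return "support" if all(entailed) else "abstain"

evidence = "drug X lowers blood pressure in adults"
print(verify(["drug X lowers blood pressure"], evidence))
print(verify(["drug X lowers blood pressure",
              "drug X is safe in children"], evidence))
```

In the second call the evidence says nothing about children, so the framework abstains rather than asserting support: the behavior the paper argues is safer than a forced definitive answer.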