[2602.23199] SC-Arena: A Natural Language Benchmark for Single-Cell Reasoning with Knowledge-Augmented Evaluation
Summary
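SC-Arena introduces a natural language benchmark for evaluating single-cell reasoning in large language models. It targets three gaps in existing benchmarks: fragmentation across tasks, multiple-choice formats that diverge from real-world usage, and metrics that lack interpretability and biological grounding.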
Why It Matters
The framework evaluates LLMs on single-cell biology in a way that is meant to be biologically relevant and interpretable. It aims to unify fragmented evaluation practices and make assessments of model performance on complex biological tasks more reliable.
Key Takeaways
- SC-Arena provides a unified evaluation framework for single-cell biology.
- It introduces five natural language tasks that assess core reasoning in cellular biology.
- The framework incorporates knowledge-augmented evaluation for biologically grounded assessments.
- Current LLMs show uneven performance in complex biological tasks.
- SC-Arena aims to guide the development of biology-aligned, generalizable foundation models.
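As a rough illustration of how a single cell representation could feed all five natural language tasks, here is a minimal Python sketch. The task names come from the abstract; everything else (the `VirtualCell` class, its field names, the `build_prompt` helper, the prompt wording, and the placeholder gene and tissue values) is a hypothetical assumption for illustration and is not the paper's actual schema or code.

```python
from dataclasses import dataclass, field
from enum import Enum


class Task(Enum):
    # The five task names are taken from the abstract.
    CELL_TYPE_ANNOTATION = "cell type annotation"
    CAPTIONING = "captioning"
    GENERATION = "generation"
    PERTURBATION_PREDICTION = "perturbation prediction"
    SCIENTIFIC_QA = "scientific QA"


@dataclass
class VirtualCell:
    # Hypothetical container: field names are assumptions, not the paper's schema.
    # Intrinsic attributes of the cell, stored as name -> value.
    attributes: dict[str, str] = field(default_factory=dict)
    # Gene-level interactions, stored as (source gene, relation, target gene).
    interactions: list[tuple[str, str, str]] = field(default_factory=list)


def build_prompt(task: Task, cell: VirtualCell) -> str:
    """Render a virtual cell as a free-text prompt for one task (illustrative wording)."""
    attr_lines = [f"- {name}: {value}" for name, value in sorted(cell.attributes.items())]
    edge_lines = [f"- {src} {rel} {dst}" for src, rel, dst in cell.interactions]
    return "\n".join(
        [
            f"Task: {task.value}",
            "Cell attributes:",
            *attr_lines,
            "Gene-level interactions:",
            *edge_lines,
        ]
    )


# Placeholder values only; they do not describe a real cell.
cell = VirtualCell(
    attributes={"tissue": "PLACEHOLDER_TISSUE"},
    interactions=[("GENE_A", "activates", "GENE_B")],
)
prompt = build_prompt(Task.PERTURBATION_PREDICTION, cell)
print(prompt)
```

The point of the sketch is only that one shared representation can be rendered into a prompt for any of the five tasks, which is one way to read the "unified evaluation target" idea; the paper may structure this quite differently.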
Computer Science > Artificial Intelligence
arXiv:2602.23199 (cs)
[Submitted on 26 Feb 2026]
Title: SC-Arena: A Natural Language Benchmark for Single-Cell Reasoning with Knowledge-Augmented Evaluation
Authors: Jiahao Zhao, Feng Jiang, Shaowei Qin, Zhonghui Zhang, Junhao Liu, Guibing Guo, Hamid Alinejad-Rokny, Min Yang
Abstract: Large language models (LLMs) are increasingly applied in scientific research, offering new capabilities for knowledge discovery and reasoning. In single-cell biology, however, evaluation practices for both general and specialized LLMs remain inadequate: existing benchmarks are fragmented across tasks, adopt formats such as multiple-choice classification that diverge from real-world usage, and rely on metrics lacking interpretability and biological grounding. We present SC-ARENA, a natural language evaluation framework tailored to single-cell foundation models. SC-ARENA formalizes a virtual cell abstraction that unifies evaluation targets by representing both intrinsic attributes and gene-level interactions. Within this paradigm, we define five natural language tasks (cell type annotation, captioning, generation, perturbation prediction, and scientific QA) that probe core reasoning capabilities in cellular biology. To overcome the limitations of brittle string-...