[2602.19517] Classroom Final Exam: An Instructor-Tested Reasoning Benchmark
Summary
The paper presents CFE (Classroom Final Exam), a multimodal benchmark for evaluating large language models' reasoning capabilities across STEM domains, and shows that even frontier models fall well short of instructor-level performance.
Why It Matters
This work addresses the limitations of large language models on multi-step reasoning tasks by providing a structured, instructor-vetted benchmark that can guide future improvements across a wide range of STEM fields. It underscores the need for greater reasoning reliability and accuracy in AI systems.
Key Takeaways
- Introduction of CFE, a benchmark for reasoning in STEM.
- Current models struggle with maintaining correct intermediate states in multi-step solutions.
- The benchmark reveals significant room for improvement in model accuracy.
- Model-generated solutions often involve more reasoning steps, increasing error risk.
- Data and code for the benchmark are publicly available for further research.
Computer Science > Artificial Intelligence
arXiv:2602.19517 (cs)
[Submitted on 23 Feb 2026]
Title: Classroom Final Exam: An Instructor-Tested Reasoning Benchmark
Authors: Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song, Shuo Li, Kezhen Chen
Abstract: We introduce CFE (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains. CFE is curated from repeatedly used, authentic university homework and exam problems, together with reference solutions provided by course instructors. CFE presents a significant challenge even for frontier models: the newly released Gemini-3.1-pro-preview achieves an overall accuracy of 59.69%, while the second-best model, Gemini-3-flash-preview, reaches 55.46%, leaving considerable room for improvement. Beyond leaderboard results, we perform a diagnostic analysis by decomposing reference solutions into reasoning flows. We find that although frontier models can often answer intermediate sub-questions correctly, they struggle to reliably derive and maintain correct intermediate states throughout multi-step solutions. We further observe that model-generated solutions typically have more re...