[2602.16703] Measuring Mid-2025 LLM-Assistance on Novice Performance in Biology

arXiv - AI 4 min read Article

Summary

This study evaluates whether large language models (LLMs) improve novice performance in biology laboratory tasks. It found no significant improvement in overall workflow completion, although post-hoc analysis points to a possible modest benefit.

Why It Matters

Understanding the effectiveness of LLMs in real-world applications, particularly in sensitive fields like biology, is crucial for assessing their role in education and biosecurity. This research highlights the gap between theoretical capabilities and practical outcomes, informing future AI integration in laboratory settings.

Key Takeaways

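  • LLM assistance produced no significant improvement in overall workflow completion for novices (5.2% LLM vs. 6.6% Internet; P = 0.759).
  • The LLM arm had numerically higher success rates in four of the five tasks, most notably cell culture (68.8% LLM vs. 55.3% Internet; P = 0.059).
  • Post-hoc Bayesian modeling of pooled data estimated an approximate 1.4-fold increase in success with LLM assistance, but the 95% credible interval (0.74-2.62) includes no effect.
  • The study emphasizes the need to validate AI tools in real-world settings rather than on benchmarks alone.
  • The results suggest a gap between LLM performance on biological benchmarks and novice performance in the physical laboratory.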

Computer Science > Computers and Society
arXiv:2602.16703 (cs)
[Submitted on 18 Feb 2026]

Title: Measuring Mid-2025 LLM-Assistance on Novice Performance in Biology

Authors: Shen Zhou Hong, Alex Kleinman, Alyssa Mathiowetz, Adam Howes, Julian Cohen, Suveer Ganta, Alex Letizia, Dora Liao, Deepika Pahari, Xavier Roberts-Gaal, Luca Righetti, Joe Torres

Abstract: Large language models (LLMs) perform strongly on biological benchmarks, raising concerns that they may help novice actors acquire dual-use laboratory skills. Yet, whether this translates to improved human performance in the physical laboratory remains unclear. To address this, we conducted a pre-registered, investigator-blinded, randomized controlled trial (June-August 2025; n = 153) evaluating whether LLMs improve novice performance in tasks that collectively model a viral reverse genetics workflow. We observed no significant difference in the primary endpoint of workflow completion (5.2% LLM vs. 6.6% Internet; P = 0.759), nor in the success rate of individual tasks. However, the LLM arm had numerically higher success rates in four of the five tasks, most notably for the cell culture task (68.8% LLM vs. 55.3% Internet; P = 0.059). Post-hoc Bayesian modeling of pooled data estimates an approximate 1.4-fold increase (95% CrI 0.74-2.62) in success for a "typical" reverse genetics...
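To make the reported percentages concrete, the following is a minimal sketch of how a two-arm comparison of success rates can be computed. It rests on assumptions that are not stated in the excerpt: the arm sizes (77 LLM, 76 Internet) and the success counts are back-calculated from the rounded percentages and n = 153, and Fisher's exact test is used only as a common choice for small counts. The excerpt does not say which test the authors used, so the P-values printed here need not match the reported ones.

```python
from math import comb

# ASSUMPTION: arm sizes and success counts are back-calculated from the
# rounded percentages in the abstract and n = 153; they are not reported
# in the excerpt and are used here for illustration only.
ARMS = {"LLM": 77, "Internet": 76}
SUCCESSES = {
    "workflow completion": {"LLM": 4, "Internet": 5},
    "cell culture": {"LLM": 53, "Internet": 42},
}


def rate(task, arm):
    """Success rate for one task in one arm."""
    return SUCCESSES[task][arm] / ARMS[arm]


def risk_ratio(task):
    """Ratio of the LLM success rate to the Internet success rate."""
    return rate(task, "LLM") / rate(task, "Internet")


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


for task, counts in SUCCESSES.items():
    s_llm, s_net = counts["LLM"], counts["Internet"]
    p = fisher_exact_two_sided(s_llm, ARMS["LLM"] - s_llm,
                               s_net, ARMS["Internet"] - s_net)
    print(f"{task}: LLM {rate(task, 'LLM'):.1%}, "
          f"Internet {rate(task, 'Internet'):.1%}, "
          f"ratio {risk_ratio(task):.2f}, Fisher P = {p:.3f}")
```

The simple per-task ratio computed here is not the same quantity as the pooled 1.4-fold estimate in the abstract, which comes from the authors' post-hoc Bayesian model across tasks.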
