[2412.17596] Evaluating LLMs' Divergent Thinking Capabilities for Scientific Idea Generation with Minimal Context
Summary
This article evaluates the divergent thinking capabilities of Large Language Models (LLMs) for scientific idea generation using minimal context, introducing the LiveIdeaBench benchmark.
Why It Matters
Understanding LLMs' divergent thinking is crucial for enhancing their utility in scientific research. The findings suggest that traditional metrics may not accurately predict creative performance, highlighting the need for specialized evaluation benchmarks and training strategies tailored to scientific contexts.
Key Takeaways
- LiveIdeaBench benchmark assesses LLMs' scientific idea generation capabilities.
- Divergent thinking is evaluated across originality, feasibility, fluency, flexibility, and clarity (see the scoring sketch after this list).
- Standard metrics of general intelligence poorly predict scientific idea generation performance.
- Models like QwQ-32B-preview show comparable creativity to top-tier models despite lower general intelligence scores.
- Specialized training strategies may be needed to enhance LLMs' idea generation capabilities.
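
To make the evaluation protocol concrete, here is a minimal sketch of a LiveIdeaBench-style scoring loop. Only the single-keyword prompting and the five dimensions come from the paper; the `generate` and `judge` helpers, the prompt wording, the score scale, and the aggregation are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a LiveIdeaBench-style scoring loop. The mock bodies of
# generate() and judge() are hypothetical stand-ins for API calls to the
# candidate model and the judge panel.
from statistics import mean

DIMENSIONS = ["originality", "feasibility", "fluency", "flexibility", "clarity"]

def generate(model: str, keyword: str) -> list[str]:
    # Hypothetical helper: a real implementation would call `model`'s API
    # with a minimal prompt built from the single keyword.
    return [f"[{model}] idea {i} about {keyword}" for i in range(3)]

def judge(judge_model: str, idea: str, dimension: str) -> float:
    # Hypothetical helper: a real implementation would ask `judge_model`
    # to rate the idea on one dimension (constant placeholder score here).
    return 5.0

def score_model(model: str, keywords: list[str], judges: list[str]) -> dict[str, float]:
    """Average each dimension over all ideas, taking each idea's dimension
    score as the mean over the judge panel."""
    per_dim: dict[str, list[float]] = {d: [] for d in DIMENSIONS}
    for kw in keywords:
        for idea in generate(model, kw):  # ideas from a single-keyword prompt
            for d in DIMENSIONS:
                per_dim[d].append(mean(judge(j, idea, d) for j in judges))
    return {d: round(mean(v), 2) for d, v in per_dim.items()}

print(score_model("candidate-llm", ["superconductivity"], ["judge-a", "judge-b"]))
```

Note that in Guilford's framework, fluency and flexibility are usually count-based (the number and categorical diversity of ideas), so a faithful implementation would likely compute those over a model's full idea set rather than per idea as sketched here.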
Computer Science > Computation and Language
arXiv:2412.17596 (cs)
[Submitted on 23 Dec 2024 (v1), last revised 23 Feb 2026 (this version, v4)]
Title: Evaluating LLMs' Divergent Thinking Capabilities for Scientific Idea Generation with Minimal Context
Authors: Kai Ruan, Xuan Wang, Jixiang Hong, Peng Wang, Yang Liu, Hao Sun
Abstract: While Large Language Models (LLMs) demonstrate remarkable capabilities in scientific tasks such as literature analysis and experimental design (e.g., accurately extracting key findings from papers or generating coherent experimental procedures), existing evaluation benchmarks primarily assess performance using rich contextual inputs. We introduce LiveIdeaBench, a comprehensive benchmark evaluating LLMs' scientific idea generation by assessing divergent thinking capabilities using single-keyword prompts. Drawing from Guilford's creativity theory, our benchmark employs a dynamic panel of state-of-the-art LLMs to assess generated ideas across five key dimensions: originality, feasibility, fluency, flexibility, and clarity. Through extensive experimentation with over 40 leading models across 1,180 keywords spanning 22 scientific domains, we reveal that the scientific idea generation capabilities measured by our benchmark are poorly predicted by standard metrics...
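
The headline finding, that standard general-intelligence metrics poorly predict idea-generation scores, is the kind of claim one would test with a rank correlation. A small illustration follows; the model names and scores below are invented for demonstration and are not the paper's data.

```python
# Illustrative rank-correlation check (all scores below are invented;
# the paper runs this kind of analysis over 40+ real models).
from scipy.stats import spearmanr

general_intelligence = {"model_a": 88.1, "model_b": 75.4, "model_c": 91.3, "model_d": 69.8}
idea_generation      = {"model_a": 6.2,  "model_b": 7.9,  "model_c": 6.5,  "model_d": 7.1}

models = sorted(general_intelligence)
rho, p_value = spearmanr(
    [general_intelligence[m] for m in models],
    [idea_generation[m] for m in models],
)
# A rho near zero (or negative) would indicate that general-intelligence
# benchmarks are a poor predictor of creative idea generation.
print(f"Spearman rho = {rho:.2f}, p = {p_value:.2f}")
```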