[2602.14216] Reasoning Language Models for complex assessments tasks: Evaluating parental cooperation from child protection case reports
Summary
This article examines how well reasoning language models (RLMs) assess parental cooperation during child protection interventions from case reports. The largest model reached 89% accuracy against human-validated data, compared with 80% for the initial approach.
Why It Matters
The study highlights the potential of RLMs to enhance decision-making in child protection services, addressing complex assessments that often involve ambiguous information. By improving accuracy in evaluating parental cooperation, the findings could lead to better outcomes in child welfare interventions.
Key Takeaways
- The largest RLM achieved the highest accuracy (89%) in assessing parental cooperation, measured against human-validated data.
- It outperformed the initial approach, which had an accuracy of 80%.
- Classification accuracy was higher for mothers (93%) than for fathers (85%), and the expert human reviewers exhibited similar differences.
- The study underscores the need for balanced focus on both parents in CPS interventions.
- RLMs can effectively handle complex case factors in child protection scenarios.
Computer Science > Computers and Society
arXiv:2602.14216 (cs)
[Submitted on 15 Feb 2026]
Title: Reasoning Language Models for complex assessments tasks: Evaluating parental cooperation from child protection case reports
Authors: Dragan Stoll, Brian E. Perron, Zia Qi, Selina Steinmann, Nicole F. Eicher, Andreas Jud
Abstract:
Purpose: Reasoning language models (RLMs) have demonstrated significant advances in solving complex reasoning tasks. We examined their potential to assess parental cooperation during CPS interventions using case reports, a case factor characterized by ambiguous and conflicting information.
Methods: A four stage workflow comprising (1) case reports collection, (2) reasoning-based assessment of parental cooperation, (3) automated category extraction, and (4) case labeling was developed. The performance of RLMs with different parameter sizes (255B, 32B, 4B) was compared against human validated data. Two expert human reviewers (EHRs) independently classified a weighted random sample of reports.
Results: The largest RLM achieved the highest accuracy (89%), outperforming the initial approach (80%). Classification accuracy was higher for mothers (93%) than for fathers (85%), and EHRs exhibited similar differences.
Conclusions: RLMs' reasoning can effectively assess...
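The abstract reports accuracy overall and separately for mothers and fathers, measured against human-validated labels. As a minimal sketch of how such per-group accuracy could be computed, the Python below compares model labels with reference labels for each parent role. The function name `accuracy_by_group`, the record layout, and the label strings are assumptions made for illustration. The toy records are invented and are not data from the study, and this is not the authors' evaluation code.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return accuracy per group plus an "overall" entry.

    Each record is a (group, predicted_label, reference_label) tuple.
    Hypothetical helper; the key "overall" is reserved for the total.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, reference in records:
        # Count every record once for its group and once for the total.
        for key in (group, "overall"):
            total[key] += 1
            if predicted == reference:
                correct[key] += 1
    return {key: correct[key] / total[key] for key in total}


# Toy records, invented for illustration only (not from the study).
records = [
    ("mother", "cooperative", "cooperative"),
    ("mother", "non-cooperative", "non-cooperative"),
    ("father", "cooperative", "non-cooperative"),
    ("father", "cooperative", "cooperative"),
]

scores = accuracy_by_group(records)
for group, value in scores.items():
    print(f"{group}: {value:.2f}")
```

Reporting the per-group figures next to the overall figure makes a gap such as the one between mothers and fathers visible, where a single aggregate number would hide it.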