[2602.17106] Toward Trustworthy Evaluation of Sustainability Rating Methodologies: A Human-AI Collaborative Framework for Benchmark Dataset Construction
Summary
The paper proposes a human-AI collaborative framework for creating benchmark datasets to evaluate sustainability rating methodologies, addressing inconsistencies in ESG ratings across agencies.
Why It Matters
Sustainability ratings are crucial for informed decision-making by investors and stakeholders. This framework aims to enhance the credibility and comparability of these ratings, which is essential for advancing sustainability agendas and ensuring accountability in corporate practices.
Key Takeaways
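- ESG ratings for the same company vary widely across agencies, which limits their comparability and credibility.
- STRIDE (Sustainability Trust Rating & Integrity Data Equation) provides criteria and a scoring system that guide the construction of firm-level benchmark datasets using large language models (LLMs).
- SR-Delta is a discrepancy-analysis procedure that surfaces insights for potential adjustments.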
- The study emphasizes the need for AI-driven solutions in sustainability assessments.
- Collaboration between AI and human expertise is vital for trustworthy evaluations.
Computer Science > Artificial Intelligence
arXiv:2602.17106 (cs)
[Submitted on 19 Feb 2026]
Title: Toward Trustworthy Evaluation of Sustainability Rating Methodologies: A Human-AI Collaborative Framework for Benchmark Dataset Construction
Authors: Xiaoran Cai, Wang Yang, Xiyu Ren, Chekun Law, Rohit Sharma, Peng Qi
Abstract: Sustainability or ESG rating agencies use company disclosures and external data to produce scores or ratings that assess the environmental, social, and governance performance of a company. However, sustainability ratings across agencies for a single company vary widely, limiting their comparability, credibility, and relevance to decision-making. To harmonize the rating results, we propose adopting a universal human-AI collaboration framework to generate trustworthy benchmark datasets for evaluating sustainability rating methodologies. The framework comprises two complementary parts: STRIDE (Sustainability Trust Rating & Integrity Data Equation) provides principled criteria and a scoring system that guide the construction of firm-level benchmark datasets using large language models (LLMs), and SR-Delta, a discrepancy-analysis procedural framework that surfaces insights for potential adjustments. The framework enables scalable...