[2512.20352] Multi-LLM Thematic Analysis with Dual Reliability Metrics: Combining Cohen's Kappa and Semantic Similarity for Qualitative Research Validation
Summary
This paper presents a validation framework for LLM-based thematic analysis in qualitative research. It combines ensemble validation across multiple LLMs with two reliability metrics: Cohen's Kappa for inter-rater agreement and cosine similarity for semantic consistency.
Why It Matters
Traditional inter-rater agreement in qualitative research requires multiple human coders, is time-intensive, and often yields only moderate consistency. By pairing an established agreement statistic with a semantic similarity measure across repeated LLM runs, the framework offers researchers a scalable way to check the reliability of AI-assisted thematic analysis.
Key Takeaways
- Introduces a dual reliability metrics framework for qualitative research validation.
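- Demonstrates high reliability in a proof-of-concept thematic analysis with three leading LLMs (Gemini 2.5 Pro, GPT-4o, Claude 3.5 Sonnet), using six independent runs per model on one interview transcript.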
- Provides an open-source implementation for researchers to utilize.
- Enhances traditional qualitative methods with AI capabilities.
- Establishes a methodological foundation for AI-assisted qualitative research.
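The two reliability metrics named above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the binary "theme present / absent" coding of transcript segments, and the toy vectors standing in for text embeddings are all assumptions made for illustration. Cohen's Kappa compares observed agreement between two runs with the agreement expected by chance, and cosine similarity compares the direction of two vectors.

```python
from collections import Counter
from math import sqrt


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters (here: two LLM runs) labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and this sketch treats the case as perfect agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical data: two runs mark whether a theme appears in six segments.
run_a = [1, 1, 0, 1, 0, 0]
run_b = [1, 1, 0, 0, 0, 0]
kappa = cohens_kappa(run_a, run_b)

# Hypothetical stand-ins for the embeddings of two theme descriptions.
theme_vec_a = [1.0, 0.0, 1.0]
theme_vec_b = [1.0, 1.0, 0.0]
similarity = cosine_similarity(theme_vec_a, theme_vec_b)

print(f"kappa={kappa:.3f} cosine={similarity:.3f}")
```

In practice the vectors would come from an embedding model applied to the theme text, and the labels from however the framework aligns themes across runs; neither step is described in the text here, so both are left out of the sketch.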
Computer Science > Computation and Language
arXiv:2512.20352 (cs)
[Submitted on 23 Dec 2025 (v1), last revised 14 Feb 2026 (this version, v2)]
Authors: Nilesh Jain, Hyungil Suh, Seyi Adeyinka, Leor Roseman, Aza Allsop
Abstract: Qualitative research faces a critical reliability challenge: traditional inter-rater agreement methods require multiple human coders, are time-intensive, and often yield moderate consistency. We present a multi-perspective validation framework for LLM-based thematic analysis that combines ensemble validation with dual reliability metrics: Cohen's Kappa ($\kappa$) for inter-rater agreement and cosine similarity for semantic consistency. Our framework enables configurable analysis parameters (1-6 seeds, temperature 0.0-2.0), supports custom prompt structures with variable substitution, and provides consensus theme extraction across any JSON format. As proof-of-concept, we evaluate three leading LLMs (Gemini 2.5 Pro, GPT-4o, Claude 3.5 Sonnet) on a psychedelic art therapy interview transcript, conducting six independent runs per model. Results demonstrate Gemini achieves hi...