[2602.16241] Are LLMs Ready to Replace Bangla Annotators?
Summary
This article evaluates Large Language Models (LLMs) as annotators for Bangla hate speech, revealing significant annotator bias and instability in their judgments; smaller, task-aligned models frequently behave more consistently than their larger counterparts.
Why It Matters
Understanding the limitations of LLMs in low-resource languages like Bangla is crucial for building reliable AI systems. This research underscores the need to evaluate such tools carefully before deploying them in sensitive contexts, where annotation errors and bias can lead to unfair outcomes.
Key Takeaways
- LLMs exhibit significant annotator bias and instability in judgments.
- Larger models do not necessarily provide better annotation quality than smaller, task-specific models.
- The study emphasizes the importance of evaluating AI tools before deployment in sensitive tasks.
- Current LLMs may not be suitable for low-resource languages without further refinement.
- Understanding model behavior is critical for ethical AI applications.
Computer Science > Computation and Language
arXiv:2602.16241 (cs) [Submitted on 18 Feb 2026]
Title: Are LLMs Ready to Replace Bangla Annotators?
Authors: Md. Najib Hasan, Touseef Hasan, Souvika Sarkar
Abstract: Large Language Models (LLMs) are increasingly used as automated annotators to scale dataset creation, yet their reliability as unbiased annotators, especially for low-resource and identity-sensitive settings, remains poorly understood. In this work, we study the behavior of LLMs as zero-shot annotators for Bangla hate speech, a task where even human agreement is challenging, and annotator bias can have serious downstream consequences. We conduct a systematic benchmark of 17 LLMs using a unified evaluation framework. Our analysis uncovers annotator bias and substantial instability in model judgments. Surprisingly, increased model scale does not guarantee improved annotation quality: smaller, more task-aligned models frequently exhibit more consistent behavior than their larger counterparts. These results highlight important limitations of current LLMs for sensitive annotation tasks in low-resource languages and underscore the need for careful evaluation before deployment.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2602.16241 [cs.CL] (or arXiv:2602.16241v1 [cs.CL] for this version) https://doi.org...
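The abstract describes repeating zero-shot annotation and checking how stable a model's judgments are. As a loose illustration of how such instability might be quantified, here is a minimal Python sketch; the `annotate` stub, the label set, and the flip-rate simulation are hypothetical stand-ins, not the authors' actual evaluation framework.

```python
import random
from collections import Counter

LABELS = ["hate", "not_hate"]  # hypothetical binary label set

def annotate(text: str, flip_rate: float = 0.2) -> str:
    """Stand-in for one zero-shot LLM call returning a label.

    Here we simulate an unstable annotator that answers at random
    with probability `flip_rate`; in a real benchmark this would
    prompt an LLM with the Bangla text and parse its reply.
    """
    base = "not_hate"  # toy default judgment
    return random.choice(LABELS) if random.random() < flip_rate else base

def instability(text: str, runs: int = 10) -> float:
    """Fraction of repeated annotations disagreeing with the majority label.

    0.0 means perfectly stable judgments on this text; higher values
    reflect the kind of inconsistency the paper reports for several models.
    """
    votes = Counter(annotate(text) for _ in range(runs))
    majority_count = votes.most_common(1)[0][1]
    return 1.0 - majority_count / runs

if __name__ == "__main__":
    print(instability("example Bangla text", runs=50))
```

A real benchmark along the paper's lines would replace the stub with actual model calls, compare labels against human gold annotations, and break results down by identity group to surface annotator bias.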