[2602.14488] BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR

arXiv - AI 4 min read Article

Summary

This article presents the BETA-labeling framework for constructing a Bangla IR dataset, addressing challenges in low-resource languages and the reliability of LLMs for dataset annotation.

Why It Matters

The study highlights the critical need for high-quality annotated datasets in low-resource languages, which are often overlooked. By exploring the potential and limitations of LLMs in this context, it provides valuable insights for researchers and practitioners aiming to improve multilingual information retrieval systems.

Key Takeaways

  • BETA-labeling framework enhances dataset quality through multiple LLM annotators.
  • Human evaluation is crucial for ensuring label reliability in low-resource settings.
  • Cross-lingual dataset reuse poses risks due to language-dependent biases.

Computer Science > Computation and Language
arXiv:2602.14488 (cs) [Submitted on 16 Feb 2026]

Title: BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR
Authors: Md. Najib Hasan, Mst. Jannatun Ferdous Rain, Fyad Mohammed, Nazmul Siddique

Abstract: IR in low-resource languages remains limited by the scarcity of high-quality, task-specific annotated datasets. Manual annotation is expensive and difficult to scale, while using large language models (LLMs) as automated annotators introduces concerns about label reliability, bias, and evaluation validity. This work presents a Bangla IR dataset constructed using a BETA-labeling framework involving multiple LLM annotators from diverse model families. The framework incorporates contextual alignment, consistency checks, and majority agreement, followed by human evaluation to verify label quality. Beyond dataset creation, we examine whether IR datasets from other low-resource languages can be effectively reused through one-hop machine translation. Using LLM-based translation across multiple language pairs, we evaluated meaning preservation and task validity between source and translated datasets. Our experiments reveal substantial variation across languages, reflecting language-dependent biases and inconsistent semantic preservation that directly...
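The paper does not publish code, but the majority-agreement step of the framework can be sketched in a few lines. The snippet below is a minimal illustration, assuming each LLM annotator emits one categorical relevance label per query-document pair and that low-agreement items are flagged for the human evaluation stage; the function name and threshold are hypothetical, not from the paper.

```python
from collections import Counter

def aggregate_labels(annotations, agreement_threshold=0.5):
    """Majority vote over labels from multiple LLM annotators.

    annotations: dict mapping annotator name -> label for one
    query-document pair.
    Returns (label, agreement_ratio, needs_review), where needs_review
    marks items whose agreement falls at or below the threshold,
    i.e. candidates for human evaluation.
    """
    counts = Counter(annotations.values())
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(annotations)
    return label, agreement, agreement <= agreement_threshold

# Example: three LLM annotators from different model families
# judging a single query-document pair.
votes = {"model_a": "relevant", "model_b": "relevant", "model_c": "not_relevant"}
label, agreement, needs_review = aggregate_labels(votes)
# label is "relevant" with 2/3 agreement; not flagged for review.
```

In the full framework this vote would run after the contextual-alignment and consistency checks, so only items that pass those filters reach aggregation.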
