[2602.23286] SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
Summary
The paper presents SPARTA, a novel framework for generating scalable benchmarks for tree-structured multi-hop question answering (QA) over text and tables, addressing limitations in existing benchmarks.
Why It Matters
SPARTA addresses the critical need for robust benchmarks in multi-hop QA, which are essential for advancing AI's ability to reason across complex, mixed text-and-table data. By automating benchmark creation, it reduces human annotation effort while improving benchmark quality, supporting more rigorous evaluation of QA systems and their practical application in real-world scenarios.
Key Takeaways
- SPARTA automates the creation of large-scale QA benchmarks, cutting annotation time to about one quarter of HybridQA's.
- The framework ensures high-quality question generation through novel techniques like provenance-based refinement.
- State-of-the-art models show significant performance drops on SPARTA, highlighting weaknesses in current cross-modal reasoning capabilities.
Computer Science > Computation and Language
arXiv:2602.23286 (cs)
[Submitted on 26 Feb 2026]
Title: SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
Authors: Sungho Park, Jueun Kim, Wook-Shin Han
Abstract: Real-world Table-Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and executing complex operations such as aggregation. Yet existing benchmarks are small, manually curated (and therefore error-prone), and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in natural-language queries. We present SPARTA, an end-to-end construction framework that automatically generates large-scale Table-Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA. The framework first constructs a reference fact database by enriching each source table with grounding tables whose tuples are atomic facts automatically extracted from the accompanying unstructured passages, then synthesizes nested queries whose number of nested predicates matches the desired hop count. To ensure that every SQL statement is executable and that its verbal...
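The construction idea described in the abstract (a source table paired with a grounding table of text-derived atomic facts, queried by a nested SQL statement whose nesting mirrors the hop count, kept only if it executes) can be sketched in miniature. This is a hypothetical illustration, not the paper's actual schema or pipeline: the table names, columns, and data below are invented for demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Source table: the original structured data.
cur.execute("CREATE TABLE players(name TEXT, team TEXT, goals INTEGER)")
cur.executemany("INSERT INTO players VALUES (?, ?, ?)",
                [("Ann", "Lions", 12), ("Bo", "Tigers", 9), ("Cy", "Lions", 7)])

# Grounding table: atomic facts extracted from accompanying text passages,
# here stored as (subject, relation, object) triples.
cur.execute("CREATE TABLE facts(subject TEXT, relation TEXT, object TEXT)")
cur.executemany("INSERT INTO facts VALUES (?, ?, ?)",
                [("Lions", "founded_in", "1952"),
                 ("Tigers", "founded_in", "1967")])

# A 2-hop nested query: the inner predicate resolves a fact from text
# (which team was founded in 1952), the outer predicate aggregates over
# the source table for that team. One extra level of nesting per hop.
two_hop = """
SELECT SUM(goals) FROM players
WHERE team = (SELECT subject FROM facts
              WHERE relation = 'founded_in' AND object = '1952')
"""

# Executability check: the framework keeps only queries that run without
# error against the fact database.
total = cur.execute(two_hop).fetchone()[0]
print(total)  # 19: total goals by players on the team founded in 1952
```

Under this toy schema, the scalar subquery resolves the cross-modal hop (text fact to table key) and the outer aggregation supplies the analytical operation the abstract highlights; deeper hop counts would add further nested predicates.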