[2602.15189] ScrapeGraphAI-100k: A Large-Scale Dataset for LLM-Based Web Information Extraction

Summary

ScrapeGraphAI-100k introduces a large-scale dataset for LLM-based web information extraction, addressing limitations of existing datasets by providing real-world examples across diverse domains and languages.

Why It Matters

Existing datasets for web information extraction tend to be small, synthetic, or text-only. ScrapeGraphAI-100k instead provides structured records of real-world extraction events, which supports fine-tuning smaller models and benchmarking structured extraction.

Key Takeaways

  • ScrapeGraphAI-100k consists of 93,695 examples from real-world LLM extraction events.
  • Each instance includes Markdown content, a prompt, a JSON schema, the LLM response, and complexity/validation metadata.
  • In the paper's fine-tuning experiment, a small model (1.7B) trained on a subset of the dataset narrows the gap to larger baselines (30B).
  • The dataset is publicly available on HuggingFace, promoting accessibility for researchers and developers.
  • It provides insights into schema complexity and failure modes, aiding in the study of web information retrieval.
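
To make the schema complexity takeaway concrete, here is a minimal sketch of one way to measure how complex a JSON schema is. The function name `schema_complexity`, the two metrics (nesting depth and total field count), and the example schema are illustrative assumptions, not the metrics or data the paper defines.

```python
from typing import Any


def schema_complexity(schema: dict[str, Any]) -> dict[str, int]:
    """Return nesting depth and total property count for a JSON schema.

    Hypothetical metric for illustration: only `properties` and `items`
    are walked, and array items do not add a nesting level by themselves.
    """

    def walk(node: Any, depth: int) -> tuple[int, int]:
        if not isinstance(node, dict):
            return depth, 0
        max_depth, fields = depth, 0
        properties = node.get("properties")
        if isinstance(properties, dict):
            for child in properties.values():
                child_depth, child_fields = walk(child, depth + 1)
                max_depth = max(max_depth, child_depth)
                fields += 1 + child_fields  # the property itself plus its children
        items = node.get("items")
        if isinstance(items, dict):
            item_depth, item_fields = walk(items, depth)
            max_depth = max(max_depth, item_depth)
            fields += item_fields
        return max_depth, fields

    depth, fields = walk(schema, 0)
    return {"depth": depth, "fields": fields}


# Made-up schema: a product page with a list of offers.
example_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "offers": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "price": {"type": "number"},
                    "currency": {"type": "string"},
                },
            },
        },
    },
}

print(schema_complexity(example_schema))  # {'depth': 2, 'fields': 4}
```

Grouping examples by a score like this is one plausible way to study how failure rates change as schemas grow deeper or wider.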

Computer Science > Information Retrieval
arXiv:2602.15189 (cs)
[Submitted on 16 Feb 2026]

Title: ScrapeGraphAI-100k: A Large-Scale Dataset for LLM-Based Web Information Extraction

Authors: William Brach, Francesco Zuppichini, Marco Vinciguerra, Lorenzo Padoan

Abstract: The use of large language models for web information extraction is becoming increasingly fundamental to modern web information retrieval pipelines. However, existing datasets tend to be small, synthetic, or text-only, failing to capture the structural context of the web. We introduce ScrapeGraphAI-100k, a large-scale dataset comprising real-world LLM extraction events, collected via opt-in ScrapeGraphAI telemetry during Q2 and Q3 of 2025. Starting from 9M events, we deduplicate and balance by schema to produce 93,695 examples spanning diverse domains and languages. Each instance includes Markdown content, a prompt, a JSON schema, the LLM response, and complexity/validation metadata. We characterize the dataset's structural diversity and its failure modes as schema complexity increases. We also provide a fine-tuning experiment showing that a small language model (1.7B) trained on a subset narrows the gap to larger baselines (30B), underscoring the dataset's utility for efficient extraction. ScrapeGraphAI-100k enables fine-tuning small mo...
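
The "deduplicate and balance by schema" step in the abstract could look roughly like the sketch below. It assumes each event is a dict with `content`, `prompt`, and `schema` keys (hypothetical field names), treats two events as duplicates when that triple matches exactly, and balances by capping the number of examples kept per schema. The paper's actual procedure may differ on every one of these points.

```python
import hashlib
import json
from collections import defaultdict
from typing import Any


def fingerprint(value: Any) -> str:
    """Stable hash of a JSON-serializable value, independent of key order."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe_and_balance(
    events: list[dict[str, Any]], max_per_schema: int
) -> list[dict[str, Any]]:
    """Drop exact duplicates, then keep at most `max_per_schema` events per schema.

    Illustrative assumption: "balance" means a simple per-schema cap.
    """
    seen: set[str] = set()
    per_schema: dict[str, int] = defaultdict(int)
    kept: list[dict[str, Any]] = []
    for event in events:
        schema_id = fingerprint(event["schema"])
        event_id = fingerprint([event["content"], event["prompt"], schema_id])
        if event_id in seen:
            continue  # exact duplicate of an earlier event
        seen.add(event_id)
        if per_schema[schema_id] >= max_per_schema:
            continue  # this schema already has enough examples
        per_schema[schema_id] += 1
        kept.append(event)
    return kept


# Made-up events; the second repeats the first with schema keys reordered.
sample_events = [
    {"content": "a", "prompt": "p", "schema": {"x": 1, "y": 2}},
    {"content": "a", "prompt": "p", "schema": {"y": 2, "x": 1}},
    {"content": "b", "prompt": "p", "schema": {"x": 1, "y": 2}},
    {"content": "c", "prompt": "p", "schema": {"x": 1, "y": 2}},
    {"content": "d", "prompt": "p", "schema": {"z": 3}},
]

kept_events = dedupe_and_balance(sample_events, max_per_schema=2)
print([event["content"] for event in kept_events])  # ['a', 'b', 'd']
```

Hashing a canonical JSON serialization means two schemas that differ only in key order count as the same schema, which keeps the per-schema cap from being bypassed by trivial reordering.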