[2602.15189] ScrapeGraphAI-100k: A Large-Scale Dataset for LLM-Based Web Information Extraction
Summary
ScrapeGraphAI-100k introduces a large-scale dataset for LLM-based web information extraction, addressing limitations of existing datasets by providing real-world examples across diverse domains and languages.
Why It Matters
This dataset is significant because it gives large language models a rich, structured training resource drawn from real-world extraction events rather than synthetic examples. It supports fine-tuning smaller models and benchmarking structured extraction, both of which are crucial for improving AI applications across domains.
Key Takeaways
- ScrapeGraphAI-100k consists of 93,695 examples from real-world LLM extraction events.
- The dataset includes diverse content types, prompts, and metadata, enhancing its utility for training models.
- Fine-tuning smaller models on this dataset can significantly improve their performance, narrowing the gap with larger models.
- The dataset is publicly available on HuggingFace, promoting accessibility for researchers and developers.
- It provides insights into schema complexity and failure modes, aiding in the study of web information retrieval.
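Each instance in the dataset pairs page content with a prompt, a JSON schema, and the model's response. The sketch below illustrates what such a record and a minimal response-validity check might look like; the field names and the example record are illustrative assumptions, not the dataset's actual column names.

```python
import json

# Hypothetical record mirroring the fields the paper describes:
# Markdown content, a prompt, a JSON schema, the LLM response,
# and complexity/validation metadata. Field names are assumed.
record = {
    "markdown": "# Acme Corp\nContact: info@acme.example\n",
    "prompt": "Extract the company name and contact email.",
    "schema": {
        "type": "object",
        "properties": {
            "company": {"type": "string"},
            "email": {"type": "string"},
        },
        "required": ["company", "email"],
    },
    "response": '{"company": "Acme Corp", "email": "info@acme.example"}',
    "metadata": {"schema_depth": 1, "valid": True},
}

def response_matches_schema(rec):
    """Minimal validity check: the response parses as JSON and
    contains every required key with the declared primitive type."""
    type_map = {"string": str, "number": (int, float), "object": dict}
    try:
        parsed = json.loads(rec["response"])
    except json.JSONDecodeError:
        return False
    schema = rec["schema"]
    for key in schema.get("required", []):
        expected = type_map.get(schema["properties"][key]["type"], object)
        if key not in parsed or not isinstance(parsed[key], expected):
            return False
    return True

print(response_matches_schema(record))  # → True
```

A check like this only covers required keys and flat types; the paper's validation metadata presumably reflects full JSON Schema validation, for which a dedicated validator library would be used in practice.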
Computer Science > Information Retrieval
arXiv:2602.15189 (cs) [Submitted on 16 Feb 2026]
Title: ScrapeGraphAI-100k: A Large-Scale Dataset for LLM-Based Web Information Extraction
Authors: William Brach, Francesco Zuppichini, Marco Vinciguerra, Lorenzo Padoan
Abstract: The use of large language models for web information extraction is becoming increasingly fundamental to modern web information retrieval pipelines. However, existing datasets tend to be small, synthetic, or text-only, failing to capture the structural context of the web. We introduce ScrapeGraphAI-100k, a large-scale dataset comprising real-world LLM extraction events, collected via opt-in ScrapeGraphAI telemetry during Q2 and Q3 of 2025. Starting from 9M events, we deduplicate and balance by schema to produce 93,695 examples spanning diverse domains and languages. Each instance includes Markdown content, a prompt, a JSON schema, the LLM response, and complexity/validation metadata. We characterize the dataset's structural diversity and its failure modes as schema complexity increases. We also provide a fine-tuning experiment showing that a small language model (1.7B) trained on a subset narrows the gap to larger baselines (30B), underscoring the dataset's utility for efficient extraction. ScrapeGraphAI-100k enables fine-tuning small mo...