[2602.16942] SourceBench: Can AI Answers Reference Quality Web Sources?
Summary
The paper introduces SourceBench, a benchmark that evaluates the quality of web sources cited by AI models across 100 real-world queries spanning five intent types, surfacing four key insights for future research on generative AI and web search.
Why It Matters
As AI systems increasingly rely on web sources for information, understanding the quality of these sources is crucial for improving the reliability of AI-generated answers. SourceBench provides a structured approach to assess source quality, which can enhance the trustworthiness of AI outputs and guide future developments in generative AI and search technologies.
Key Takeaways
- SourceBench evaluates the quality of web sources cited by AI models using an eight-metric framework (sketched in code after this list).
- The benchmark covers five query intents: informational, factual, argumentative, social, and shopping.
- Findings reveal significant differences in source quality across the eight LLMs, Google Search, and three AI search tools evaluated over 3996 cited sources.
- The study includes a human-labeled dataset and a calibrated LLM-based evaluator that closely matches expert judgments.
- Insights from SourceBench can inform future research directions in AI and web search.
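
To make the eight-metric framework concrete, here is a minimal sketch of how per-source scores might be represented and aggregated. The six metric names below appear in the paper's abstract; the abstract lists page-level signals only as examples ("e.g."), so the two remaining metrics, the 0-to-1 scale, and the equal weighting are illustrative assumptions, not the paper's actual scheme.

```python
from dataclasses import dataclass, field

# Six metric names come from the abstract; the last two are hypothetical
# placeholders, since the paper lists page-level signals only as examples.
CONTENT_METRICS = ["content_relevance", "factual_accuracy", "objectivity"]
PAGE_METRICS = ["freshness", "authority_accountability", "clarity",
                "page_metric_7", "page_metric_8"]  # hypothetical placeholders
ALL_METRICS = CONTENT_METRICS + PAGE_METRICS

@dataclass
class SourceScores:
    """Per-metric scores for one cited web source (assumed 0.0-1.0 scale)."""
    url: str
    scores: dict = field(default_factory=dict)

    def overall(self) -> float:
        """Aggregate available metrics with equal weights (an assumption;
        the paper may weight metrics differently or report them separately)."""
        vals = [self.scores[m] for m in ALL_METRICS if m in self.scores]
        return sum(vals) / len(vals) if vals else 0.0

# Usage: score one cited source on the metrics we have values for.
src = SourceScores(url="https://example.com/article",
                   scores={"content_relevance": 0.9, "factual_accuracy": 0.8,
                           "objectivity": 0.7, "freshness": 0.6})
print(f"{src.url}: {src.overall():.2f}")
```

Separating content-quality metrics from page-level signals, as the framework does, lets an evaluation report where a source falls short: a relevant, accurate page may still score poorly on freshness or authority.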
Computer Science > Artificial Intelligence
arXiv:2602.16942 (cs)
[Submitted on 18 Feb 2026]

Title: SourceBench: Can AI Answers Reference Quality Web Sources?
Authors: Hexi Jin, Stephen Liu, Yuheng Li, Simran Malik, Yiying Zhang

Abstract: Large language models (LLMs) increasingly answer queries by citing web sources, but existing evaluations emphasize answer correctness rather than evidence quality. We introduce SourceBench, a benchmark for measuring the quality of cited web sources across 100 real-world queries spanning informational, factual, argumentative, social, and shopping intents. SourceBench uses an eight-metric framework covering content quality (content relevance, factual accuracy, objectivity) and page-level signals (e.g., freshness, authority/accountability, clarity), and includes a human-labeled dataset with a calibrated LLM-based evaluator that matches expert judgments closely. We evaluate eight LLMs, Google Search, and three AI search tools over 3996 cited sources using SourceBench and conduct further experiments to understand the evaluation results. Overall, our work reveals four key new insights that can guide future research in the direction of GenAI and web search.

Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2602.16942 [cs.AI] (or arXiv:2602.16942v1 [cs.AI] for this version)
https://doi.org/...
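
The abstract above reports a calibrated LLM-based evaluator that matches expert judgments closely, but does not describe the calibration procedure. Below is a hedged sketch of one common approach: compare LLM-assigned metric scores against human labels on a held-out set and report an agreement statistic. The score pairs and the Pearson-correlation check are illustrative assumptions, not the paper's method.

```python
import statistics

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical calibration set: (llm_score, expert_score) pairs for one
# metric, e.g. factual_accuracy, on an assumed 0-1 scale.
pairs = [(0.9, 0.85), (0.4, 0.5), (0.7, 0.75), (0.2, 0.3), (0.95, 0.9)]
llm, expert = zip(*pairs)
r = pearson(list(llm), list(expert))
print(f"LLM-expert agreement (Pearson r): {r:.2f}")
# High agreement on held-out human labels is one way to validate that an
# LLM judge "matches expert judgments closely", as the paper reports.
```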