[2512.18080] From Prompt to Product: A Human-Centered Benchmark of Agentic App Generation Systems
Summary
This paper introduces a human-centered benchmark for evaluating agentic app generation ("prompt-to-app") systems and uses it to compare three widely used platforms, Replit, Bolt, and Firebase Studio, on task-based ease of use, visual appeal, and perceived trust.
Why It Matters
As AI-driven app generation tools become more prevalent, human-centered benchmarks are needed to assess how well they actually serve users, not just how polished their output looks. This study surfaces performance discrepancies among leading platforms, helping developers and businesses choose the most reliable tools for app development.
Key Takeaways
- A human-centered benchmark is essential for evaluating app generation systems.
- Firebase Studio outperforms Replit and Bolt in user trust and ease of use.
- Visual appeal does not always correlate with functional reliability in app generation.
Computer Science > Human-Computer Interaction
arXiv:2512.18080 (cs)
[Submitted on 19 Dec 2025 (v1), last revised 13 Feb 2026 (this version, v2)]
Title: From Prompt to Product: A Human-Centered Benchmark of Agentic App Generation Systems
Authors: Marcos Ortiz, Justin Hill, Collin Overbay, Ingrida Semenec, Frederic Sauve-Hoover, Jim Schwoebel, Joel Shor
Abstract: Agentic AI systems capable of generating full-stack web applications from natural language prompts ("prompt-to-app") represent a significant shift in software development. However, evaluating these systems remains challenging, as visual polish, functional correctness, and user trust are often misaligned. As a result, it is unclear how existing prompt-to-app tools compare under realistic, human-centered evaluation criteria. In this paper, we introduce a human-centered benchmark for evaluating prompt-to-app systems and conduct a large-scale comparative study of three widely used platforms: Replit, Bolt, and Firebase Studio. Using a diverse set of 96 prompts spanning common web application tasks, we generate 288 unique application artifacts. We evaluate these systems through a large-scale human-rater study involving 205 participants and 1,071 quality-filtered pairwise comparisons, assessing task-based ease of use, visual appeal, perceived com...