[2512.18080] From Prompt to Product: A Human-Centered Benchmark of Agentic App Generation Systems
Summary
This paper introduces a human-centered benchmark for evaluating agentic app generation ("prompt-to-app") systems and uses it to compare three widely used platforms, Replit, Bolt, and Firebase Studio, on task-based ease of use, visual appeal, and perceived trust.
Why It Matters
As AI-driven app generation tools become more prevalent, human-centered benchmarks are needed to assess how well they actually serve users, not just how polished their output looks. This study surfaces performance discrepancies among leading platforms, helping developers and businesses choose the most reliable tools for app development.
Key Takeaways
- A human-centered benchmark is essential for evaluating app generation systems.
- Firebase Studio outperforms Replit and Bolt in user trust and ease of use.
- Visual appeal does not always correlate with functional reliability in app generation.
Computer Science > Human-Computer Interaction
arXiv:2512.18080 (cs)
[Submitted on 19 Dec 2025 (v1), last revised 13 Feb 2026 (this version, v2)]
Title: From Prompt to Product: A Human-Centered Benchmark of Agentic App Generation Systems
Authors: Marcos Ortiz, Justin Hill, Collin Overbay, Ingrida Semenec, Frederic Sauve-Hoover, Jim Schwoebel, Joel Shor
Abstract: Agentic AI systems capable of generating full-stack web applications from natural language prompts ("prompt-to-app") represent a significant shift in software development. However, evaluating these systems remains challenging, as visual polish, functional correctness, and user trust are often misaligned. As a result, it is unclear how existing prompt-to-app tools compare under realistic, human-centered evaluation criteria. In this paper, we introduce a human-centered benchmark for evaluating prompt-to-app systems and conduct a large-scale comparative study of three widely used platforms: Replit, Bolt, and Firebase Studio. Using a diverse set of 96 prompts spanning common web application tasks, we generate 288 unique application artifacts. We evaluate these systems through a large-scale human-rater study involving 205 participants and 1,071 quality-filtered pairwise comparisons, assessing task-based ease of use, visual appeal, perceived com...