[2512.18080] From Prompt to Product: A Human-Centered Benchmark of Agentic App Generation Systems


Summary

This paper introduces a human-centered benchmark for evaluating agentic app generation systems, comparing platforms like Replit, Bolt, and Firebase Studio based on user experience and functionality.

Why It Matters

As AI-driven app generation tools become more prevalent, understanding their effectiveness through human-centered benchmarks is crucial for developers and businesses. This study highlights performance discrepancies among leading platforms, guiding users in selecting the most reliable tools for app development.

Key Takeaways

  • A human-centered benchmark is essential for evaluating app generation systems.
  • Firebase Studio outperforms Replit and Bolt in user trust and ease of use.
  • Visual appeal does not always correlate with functional reliability in app generation.

Abstract

Computer Science > Human-Computer Interaction · arXiv:2512.18080 (cs)
Submitted on 19 Dec 2025 (v1), last revised 13 Feb 2026 (this version, v2)

Title: From Prompt to Product: A Human-Centered Benchmark of Agentic App Generation Systems
Authors: Marcos Ortiz, Justin Hill, Collin Overbay, Ingrida Semenec, Frederic Sauve-Hoover, Jim Schwoebel, Joel Shor

Agentic AI systems capable of generating full-stack web applications from natural language prompts ("prompt-to-app") represent a significant shift in software development. However, evaluating these systems remains challenging, as visual polish, functional correctness, and user trust are often misaligned. As a result, it is unclear how existing prompt-to-app tools compare under realistic, human-centered evaluation criteria. In this paper, we introduce a human-centered benchmark for evaluating prompt-to-app systems and conduct a large-scale comparative study of three widely used platforms: Replit, Bolt, and Firebase Studio. Using a diverse set of 96 prompts spanning common web application tasks, we generate 288 unique application artifacts. We evaluate these systems through a large-scale human-rater study involving 205 participants and 1,071 quality-filtered pairwise comparisons, assessing task-based ease of use, visual appeal, perceived com...
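The abstract reports that rankings come from 1,071 quality-filtered pairwise comparisons, but the excerpt does not say how those judgments are aggregated. A common way to turn pairwise human preferences into per-system scores is a Bradley-Terry model; the sketch below is only an illustration of that general technique, not the paper's method, and the comparison records and system labels in it are entirely hypothetical.

```python
from collections import defaultdict

# Hypothetical pairwise judgments: (winner, loser) per human comparison.
# These records are illustrative only, not data from the paper.
comparisons = [
    ("firebase_studio", "replit"),
    ("firebase_studio", "bolt"),
    ("replit", "bolt"),
    ("bolt", "replit"),
    ("firebase_studio", "bolt"),
]

systems = sorted({s for pair in comparisons for s in pair})

# Tally total wins per system and head-to-head counts per unordered pair.
wins = defaultdict(int)
matchups = defaultdict(int)
for winner, loser in comparisons:
    wins[winner] += 1
    matchups[frozenset((winner, loser))] += 1

# Fit Bradley-Terry strengths with the standard MM (minorization-maximization) updates.
strength = {s: 1.0 for s in systems}
for _ in range(100):
    new_strength = {}
    for i in systems:
        denom = sum(
            matchups[frozenset((i, j))] / (strength[i] + strength[j])
            for j in systems if j != i
        )
        new_strength[i] = wins[i] / denom if denom > 0 else strength[i]
    total = sum(new_strength.values())
    strength = {s: v / total for s, v in new_strength.items()}

# Higher normalized strength means the system is preferred more often head-to-head.
for s, v in sorted(strength.items(), key=lambda kv: -kv[1]):
    print(f"{s}: {v:.3f}")
```

With enough comparisons, the same procedure could be run separately per criterion (ease of use, visual appeal, and so on) to see whether the resulting rankings diverge, which is the kind of misalignment the paper highlights.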

