[2506.06251] DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation
Summary
DesignBench is a comprehensive benchmark for evaluating MLLM-based front-end code generation; it addresses the limitations of existing benchmarks by covering multiple development frameworks and multiple tasks.
Why It Matters
As front-end development evolves, effective evaluation tools like DesignBench are crucial for assessing the capabilities of Multimodal Large Language Models (MLLMs). This benchmark not only enhances the understanding of MLLM performance across various frameworks but also guides future research in automated front-end engineering, making it relevant for developers and researchers alike.
Key Takeaways
- DesignBench evaluates MLLMs across multiple frameworks (React, Vue, Angular).
- It addresses limitations of existing benchmarks by adding code editing and repair tasks alongside generation.
- The benchmark consists of 900 webpage samples, enabling detailed performance analysis.
- Insights from DesignBench can guide improvements in automated front-end development.
- The framework-specific evaluations reveal critical performance bottlenecks.
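To make the multi-framework, multi-task setup concrete, here is a minimal sketch of what such an evaluation harness could look like. This is an illustrative assumption, not the actual DesignBench code or API: the `Sample` fields, the `evaluate` function, and the exact-match scoring are all hypothetical stand-ins (a real benchmark would use richer metrics such as visual or structural similarity).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Sample:
    """Hypothetical benchmark item: one webpage in one framework for one task."""
    framework: str   # e.g. "react", "vue", "angular"
    task: str        # e.g. "generation", "editing", "repair"
    prompt: str      # design screenshot reference / instructions / existing code
    reference: str   # ground-truth front-end code

def evaluate(model: Callable[[Sample], str],
             samples: List[Sample]) -> Dict[Tuple[str, str], float]:
    """Score a model per (framework, task) bucket with a toy exact-match metric."""
    hits: Dict[Tuple[str, str], int] = defaultdict(int)
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for s in samples:
        key = (s.framework, s.task)
        totals[key] += 1
        if model(s).strip() == s.reference.strip():
            hits[key] += 1
    return {k: hits[k] / totals[k] for k in totals}

# Usage sketch with a dummy "oracle" model that returns the reference code.
samples = [
    Sample("react", "generation", "p1", "<App/>"),
    Sample("vue", "repair", "p2", "<template></template>"),
]
scores = evaluate(lambda s: s.reference, samples)
```

Bucketing results by `(framework, task)` is what enables the kind of framework-specific bottleneck analysis the takeaways describe.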
Computer Science > Software Engineering
arXiv:2506.06251 (cs)
[Submitted on 6 Jun 2025 (v1), last revised 24 Feb 2026 (this version, v2)]
Title: DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation
Authors: Jingyu Xiao, Man Ho Lam, Ming Wang, Yuxuan Wan, Junliang Liu, Yintong Huo, Michael R. Lyu
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in automated front-end engineering, e.g., generating UI code from visual designs. However, existing front-end UI code generation benchmarks have the following limitations: (1) While framework-based development has become predominant in modern front-end programming, current benchmarks fail to incorporate mainstream development frameworks. (2) Existing evaluations focus solely on the UI code generation task, whereas practical UI development involves several iterations, including refining, editing, and repairing issues. (3) Current benchmarks employ unidimensional evaluation, lacking investigation into influencing factors like task difficulty, input context variations, and in-depth code-level analysis. To bridge these gaps, we introduce DesignBench, a multi-framework, multi-task evaluation benchmark for assessing MLLMs' capabi...