[2602.13306] Fine-Tuning a Large Vision-Language Model for Artwork's Scoring and Critique
Summary
This paper presents a framework for automating the scoring and critique of artwork using a fine-tuned vision-language model, achieving strong agreement with expert assessments.
Why It Matters
Automating the assessment of artistic creativity can significantly reduce the labor involved in traditional scoring methods, making it scalable for educational and research purposes. This study bridges the gap between computer vision and art evaluation, potentially transforming how creativity is assessed in various contexts.
Key Takeaways
- The proposed model fine-tunes Qwen2-VL-7B for artwork assessment.
- It utilizes a dataset of 1000 human-created paintings with expert evaluations.
- Achieves a Pearson correlation coefficient of over 0.97 with expert scores, indicating strong agreement between predicted and human ratings.
- Generates qualitative feedback that closely aligns with expert critiques.
- Offers a scalable solution for creativity assessment in educational settings.
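The Pearson coefficient cited in the takeaways measures linear agreement between predicted and expert scores. A minimal sketch of the computation, using made-up scores (not the paper's data):

```python
import numpy as np

def pearson_r(pred, target):
    """Pearson correlation between predicted and expert score lists."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    pc = pred - pred.mean()      # center predictions
    tc = target - target.mean()  # center targets
    return float((pc * tc).sum() / np.sqrt((pc ** 2).sum() * (tc ** 2).sum()))

# Toy illustration with invented scores on the paper's 1-100 scale.
expert = [72, 85, 60, 90, 78]
model = [70, 88, 62, 91, 75]
print(round(pearson_r(model, expert), 3))
```

A coefficient above 0.97, as reported, means the model's rank ordering and spread of scores almost exactly track the expert raters'.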
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.13306 (cs)
[Submitted on 9 Feb 2026]
Title: Fine-Tuning a Large Vision-Language Model for Artwork's Scoring and Critique
Authors: Zhehan Zhang, Meihua Qian, Li Luo, Siyu Huang, Chaoyi Zhou, Ripon Saha, Xinxin Song
Abstract: Assessing artistic creativity is foundational to creativity research and arts education, yet manual scoring (e.g., Torrance Tests of Creative Thinking) is labor-intensive at scale. Prior machine-learning approaches show promise for visual creativity scoring, but many rely mainly on image features and provide limited or no explanatory feedback. We propose a framework for automated creativity assessment of human paintings by fine-tuning the vision-language model Qwen2-VL-7B with multi-task learning. Our dataset contains 1000 human-created paintings scored on a 1-100 scale and paired with a short human-written description (content or artist explanation). Two expert raters evaluated each work using a five-dimension rubric (originality, color, texture, composition, content) and provided written critiques; we use an 80/20 train-test split. We add a lightweight regression head on the visual encoder output so the model can predict a numerical score and generate rubric-aligned feedback in a single forward pass. By embedding the structured rubric...
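The abstract describes a lightweight regression head attached to the visual encoder output, so one forward pass yields both a numerical score and feedback. A minimal NumPy sketch of that idea; the feature dimension, pooling, and sigmoid range mapping are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimension for the pooled visual-encoder output
# (Qwen2-VL-7B's real hidden size is much larger; 16 keeps the sketch small).
FEAT_DIM = 16

# Lightweight regression head: a single linear layer over the pooled
# visual features, squashed onto the paper's 1-100 scoring range.
W = rng.normal(scale=0.1, size=(FEAT_DIM,))
b = 0.0

def score_head(pooled_features):
    """Map pooled visual features to a creativity score in [1, 100]."""
    raw = pooled_features @ W + b                # scalar logit
    return 1.0 + 99.0 / (1.0 + np.exp(-raw))     # sigmoid scaled to (1, 100)

# Stand-in for an encoder's pooled output for one painting.
features = rng.normal(size=(FEAT_DIM,))
score = score_head(features)
assert 1.0 <= score <= 100.0
```

In the paper's multi-task setup, this scalar loss would be trained jointly with the language-modeling loss for the rubric-aligned critique; the sketch above shows only the scoring branch.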