[2411.06657] Renaissance: Investigating the Pretraining of Vision-Language Encoders
Summary
The paper 'Renaissance' investigates best practices for pretraining vision-language (VL) encoders through meta-analysis and introduces Renaissance, a flexible framework for building, training, and evaluating VL models.
Why It Matters
As vision-language models become increasingly prevalent, understanding their pretraining processes is crucial for optimizing performance and resource use. This research provides insights that can enhance model efficiency and effectiveness, benefiting both academic and practical applications in AI.
Key Takeaways
- Freezing parts of VL models during pretraining can save compute resources without sacrificing performance.
- The choice of base model (vision vs. text) significantly affects VL transformer performance.
- The Renaissance framework offers flexibility in creating and evaluating VL models.
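The first takeaway, freezing large parts of a VL model during pretraining, can be sketched in PyTorch by setting `requires_grad = False` on one encoder tower so the optimizer skips its parameters. This is a minimal illustrative sketch, not the Renaissance codebase: the `VisionTextModel` class, the `Linear` stand-ins for the vision and text encoders, and all layer sizes are invented for demonstration.

```python
# Sketch: freezing one tower of a two-tower VL encoder so its weights are
# excluded from pretraining updates. All module names/sizes are illustrative.
import torch
import torch.nn as nn

class VisionTextModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision = nn.Linear(16, 8)   # stand-in for a pretrained vision encoder
        self.text = nn.Linear(16, 8)     # stand-in for a pretrained text encoder
        self.fusion = nn.Linear(16, 4)   # small trainable fusion head

    def forward(self, img, txt):
        # Concatenate the two tower outputs and fuse them.
        return self.fusion(torch.cat([self.vision(img), self.text(txt)], dim=-1))

def freeze(module: nn.Module) -> None:
    """Mark every parameter in `module` as non-trainable."""
    for p in module.parameters():
        p.requires_grad = False

model = VisionTextModel()
freeze(model.vision)  # e.g. keep the vision tower fixed during VL pretraining

# Only unfrozen parameters are handed to the optimizer; frozen ones get no
# gradients and no optimizer state, which is where the compute savings come from.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```

Because frozen parameters never accumulate gradients, the backward pass skips them and the optimizer holds no momentum/variance state for them, reducing both compute and memory.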
Computer Science > Computer Vision and Pattern Recognition
arXiv:2411.06657 (cs)
[Submitted on 11 Nov 2024 (v1), last revised 25 Feb 2026 (this version, v2)]
Title: Renaissance: Investigating the Pretraining of Vision-Language Encoders
Authors: Clayton Fields, Casey Kennington
Abstract: In the past several years there has been an explosion of available models for vision-language (VL) tasks. Unfortunately, the literature still leaves open a number of questions related to best practices in designing and training such models. Additionally, the limited programming tools available for modeling make conducting VL research more difficult than necessary. In this paper, we seek to answer several questions related to the pretraining of VL encoders through meta-analysis. To conduct these experiments, we introduce a VL evaluation framework called Renaissance. In our first set of experiments, we show that we can save significant compute at little to no cost to downstream performance by freezing large parts of VL models during pretraining. In our second set of experiments, we examine the effect of basing a VL transformer on a vision model versus a text model. Renaissance offers a great deal of flexibility in creating, training and evaluating transformer encoders for VL modeling. Its source code will be made publicly available.