[2602.22405] MolFM-Lite: Multi-Modal Molecular Property Prediction with Conformer Ensemble Attention and Cross-Modal Fusion
Summary
MolFM-Lite introduces a multi-modal approach to molecular property prediction, integrating 1D sequence, 2D graph, and 3D conformer representations through attention-based fusion and improving prediction accuracy over single-modality baselines.
Why It Matters
This research addresses limitations of traditional molecular property prediction models that rely on a single representation. By leveraging multi-modal data and cross-modal attention, it improves predictive performance, which is crucial for drug discovery and materials science. The release of code and models also promotes reproducibility in the field.
Key Takeaways
- MolFM-Lite combines 1D, 2D, and 3D molecular representations for better predictions.
- The model's conformer ensemble attention mechanism captures diverse molecular shapes effectively.
- Cross-modal fusion allows for enhanced information sharing between different molecular data types.
- AUC improvements of 7-11% over single-modality models demonstrate the model's effectiveness.
- The study emphasizes the importance of reproducibility by providing code and data splits.
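The conformer ensemble attention in the second takeaway can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the additive combination of learned attention logits with a Boltzmann log-prior over conformer energies, the dot-product scoring against a `query` vector, and the `kT` constant are all assumptions made for the sketch.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def conformer_ensemble_pool(conf_embeds, conf_energies, query, kT=0.593):
    """Pool K conformer embeddings (K, d) into one vector (d,).

    Combines learned attention logits (query . embedding) with a
    Boltzmann log-prior (-E / kT) over relative conformer energies;
    the additive combination is an illustrative assumption.
    kT defaults to ~0.593 kcal/mol (room temperature).
    """
    attn_logits = conf_embeds @ query      # (K,) learned scores
    prior_logits = -conf_energies / kT     # (K,) Boltzmann log-prior
    weights = softmax(attn_logits + prior_logits)  # sums to 1
    return weights @ conf_embeds           # (d,) energy-aware pooling

rng = np.random.default_rng(0)
K, d = 5, 8
confs = rng.normal(size=(K, d))                # hypothetical conformer embeddings
energies = rng.uniform(0.0, 3.0, size=K)       # relative energies, kcal/mol
q = rng.normal(size=d)                         # hypothetical learned query
pooled = conformer_ensemble_pool(confs, energies, q)
print(pooled.shape)  # (8,)
```

Adding the log-prior before the softmax means low-energy (thermodynamically favored) conformers receive higher weight by default, while the learned scores can still re-rank them from data.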
Computer Science > Machine Learning
arXiv:2602.22405 (cs) [Submitted on 25 Feb 2026]
Title: MolFM-Lite: Multi-Modal Molecular Property Prediction with Conformer Ensemble Attention and Cross-Modal Fusion
Authors: Syed Omer Shah, Mohammed Maqsood Ahmed, Danish Mohiuddin Mohammed, Shahnawaz Alam, Mohd Vahaj ur Rahman
Abstract: Most machine learning models for molecular property prediction rely on a single molecular representation (either a sequence, a graph, or a 3D structure) and treat molecular geometry as static. We present MolFM-Lite, a multi-modal model that jointly encodes SELFIES sequences (1D), molecular graphs (2D), and conformer ensembles (3D) through cross-attention fusion, while conditioning predictions on experimental context via Feature-wise Linear Modulation (FiLM). Our main methodological contributions are: (1) a conformer ensemble attention mechanism that combines learnable attention with Boltzmann-weighted priors over multiple RDKit-generated conformers, capturing the thermodynamic distribution of molecular shapes; and (2) a cross-modal fusion layer where each modality can attend to others, enabling complementary information sharing. We evaluate on four MoleculeNet scaffold-split benchmarks using our model's own splits, and report all baselines re-evaluat...
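The cross-modal fusion and FiLM conditioning described in the abstract can be sketched as below. This is a simplified illustration, not the paper's architecture: the single-head, projection-free cross-attention, the concatenation of the other two modalities as context, and the randomly drawn FiLM parameters (which a learned network would produce from the experimental context in practice) are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query_tokens, context_tokens):
    """Single-head, projection-free scaled dot-product cross-attention
    (a simplification of a learned multi-head layer)."""
    d = query_tokens.shape[-1]
    scores = query_tokens @ context_tokens.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context_tokens

def film(h, gamma, beta):
    """Feature-wise Linear Modulation: per-feature scale and shift."""
    return gamma * h + beta

rng = np.random.default_rng(1)
d = 16
# One pooled embedding per modality: SELFIES (1D), graph (2D), conformers (3D).
h1d, h2d, h3d = (rng.normal(size=(1, d)) for _ in range(3))

# Each modality attends to the other two, sharing complementary information.
f1d = cross_attend(h1d, np.vstack([h2d, h3d]))
f2d = cross_attend(h2d, np.vstack([h1d, h3d]))
f3d = cross_attend(h3d, np.vstack([h1d, h2d]))
fused = np.concatenate([f1d, f2d, f3d], axis=-1)  # (1, 3*d)

# FiLM conditioning on experimental context; gamma and beta are random
# stand-ins for the outputs of a learned context network.
gamma = rng.normal(size=fused.shape)
beta = rng.normal(size=fused.shape)
out = film(fused, gamma, beta)
print(out.shape)  # (1, 48)
```

FiLM keeps the conditioning lightweight: the context only scales and shifts features rather than entering the attention computation itself.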