[2501.10471] VillageNet: Graph-based, Easily-interpretable, Unsupervised Clustering for Broad Biomedical Applications
Summary
VillageNet is an unsupervised clustering algorithm for high-dimensional biomedical datasets. It pairs K-Means with graph-based community detection and determines the number of clusters on its own.
Why It Matters
This research addresses the challenge of clustering large, complex biomedical datasets with a method that autonomously determines the optimal number of clusters. That could improve data analysis in healthcare and biological research, where understanding complex relationships is crucial.
Key Takeaways
- VillageNet effectively clusters high-dimensional data without prior knowledge of cluster numbers.
- The algorithm combines K-Means clustering with a community detection step, the Walk-likelihood Community Finder (WLCF).
- It demonstrates competitive performance against existing methods, particularly in normalized mutual information scores.
- VillageNet is computationally efficient, suitable for large-scale datasets.
- The method enhances interpretability in biomedical applications, facilitating better data-driven decisions.
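The two-stage idea in the takeaways above (K-Means first, community detection second) can be illustrated with a minimal Python sketch. This is not the authors' implementation, and several pieces are assumptions made for illustration: the function name `village_style_clustering` and its parameters are hypothetical, the edge weight (a count of nearest-neighbor links that cross two villages) is one plausible notion of proximity, and networkx's greedy modularity maximization stands in for WLCF. The demo data is synthetic.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score
from sklearn.neighbors import NearestNeighbors


def village_style_clustering(X, n_villages=8, n_neighbors=10, seed=0):
    """Two-phase sketch: K-Means 'villages', then community detection on a village graph."""
    # Phase 1: over-partition the data into small K-Means clusters ("villages").
    village = KMeans(n_clusters=n_villages, n_init=10, random_state=seed).fit_predict(X)

    # Phase 2: build a weighted graph with one node per village.
    # ASSUMPTION: edge weight = number of k-nearest-neighbor links crossing two villages.
    _, nbrs = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X).kneighbors(X)
    G = nx.Graph()
    G.add_nodes_from(range(n_villages))
    for i, row in enumerate(nbrs):
        for j in row:
            a, b = int(village[i]), int(village[j])
            if a != b:  # links inside a village (including a point to itself) are skipped
                w = G.get_edge_data(a, b, default={"weight": 0})["weight"]
                G.add_edge(a, b, weight=w + 1)

    # STAND-IN for WLCF: greedy modularity maximization from networkx.
    communities = greedy_modularity_communities(G, weight="weight")

    # Every point inherits the community of its village.
    village_to_cluster = {v: c for c, members in enumerate(communities) for v in members}
    return np.array([village_to_cluster[int(v)] for v in village])


# Synthetic demo: two well-separated Gaussian blobs in 5 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 5)), rng.normal(50, 1, (200, 5))])
y_true = np.repeat([0, 1], 200)

labels = village_style_clustering(X)
nmi = normalized_mutual_info_score(y_true, labels)
print(len(set(labels)), "clusters, NMI =", round(nmi, 3))
```

The number of final clusters is never passed in. It falls out of how the community detection step groups the villages, which is the property the takeaways highlight. `n_villages` only needs to be comfortably larger than the expected number of clusters.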
Computer Science > Machine Learning
arXiv:2501.10471 (cs)
[Submitted on 16 Jan 2025 (v1), last revised 23 Feb 2026 (this version, v2)]
Title: VillageNet: Graph-based, Easily-interpretable, Unsupervised Clustering for Broad Biomedical Applications
Authors: Aditya Ballal, Gregory A. DePaul, Esha Datta, Asuka Hatano, Erik Carlsson, Ye Chen-Izu, Javier E. López, Leighton T. Izu
Abstract: Clustering large high-dimensional datasets with diverse variables is essential for extracting high-level latent information from these datasets. Here, we developed an unsupervised clustering algorithm that we call "Village-Net". Village-Net is specifically designed to effectively cluster high-dimensional data without a priori knowledge of the number of existing clusters. The algorithm operates in two phases. First, using K-Means clustering, it divides the dataset into distinct subsets that we refer to as "villages". Next, a weighted network is created in which each node represents a village and the edges capture the proximity relationships between villages. To achieve optimal clustering, we process this network using the Walk-likelihood Community Finder (WLCF), a community detection algorithm developed by one of our team members. A salient feature of Village-Net Clustering is its ability to autonomo...