[2602.16626] A Systematic Evaluation of Sample-Level Tokenization Strategies for MEG Foundation Models
Summary
This article evaluates sample-level tokenization strategies for magnetoencephalography (MEG) foundation models, comparing learnable and non-learnable approaches in terms of signal reconstruction fidelity and downstream foundation-modeling performance.
Why It Matters
Understanding tokenization strategies is crucial for improving the performance of large-scale neuroimaging models. This research provides insights into how different tokenization methods affect data fidelity and modeling outcomes, which can inform future developments in neuroimaging and machine learning.
Key Takeaways
- Both learnable and non-learnable tokenization methods show high reconstruction accuracy.
- Simple fixed sample-level tokenization can be effective for developing neural foundation models.
- The study uses diverse MEG datasets to validate the findings across different conditions.
- A novel autoencoder-based approach for learnable tokenization is introduced.
- Results indicate comparable performance across various evaluation criteria for both tokenization strategies.
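To make the "simple fixed sample-level tokenization" idea above concrete, here is a minimal sketch, not the authors' implementation: each continuous sample of a signal is mapped to one of N uniform amplitude bins (its token ID), and decoding replaces each token with its bin center. The function names and bin count are illustrative assumptions.

```python
import numpy as np

def fit_uniform_tokenizer(signal, n_tokens=256):
    # Non-learnable sample-level tokenizer: uniform amplitude bins
    # spanning the observed signal range (illustrative choice).
    edges = np.linspace(signal.min(), signal.max(), n_tokens + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    return edges, centers

def encode(signal, edges):
    # Map each continuous sample to a discrete token ID in [0, n_tokens).
    return np.clip(np.digitize(signal, edges[1:-1]), 0, len(edges) - 2)

def decode(tokens, centers):
    # Reconstruct by replacing each token with its bin center.
    return centers[tokens]

# Toy stand-in for a single MEG channel: noisy sinusoid.
rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 8 * np.pi, 1000)) + 0.05 * rng.standard_normal(1000)

edges, centers = fit_uniform_tokenizer(x, n_tokens=256)
ids = encode(x, edges)
x_hat = decode(ids, centers)

# Reconstruction error is bounded by half a bin width.
max_err = np.max(np.abs(x - x_hat))
```

With 256 uniform bins the worst-case per-sample error is half a bin width, which is one way to see why even fixed tokenizers can achieve the high reconstruction accuracy the study reports.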
Computer Science > Machine Learning
arXiv:2602.16626 (cs) [Submitted on 18 Feb 2026]
Title: A Systematic Evaluation of Sample-Level Tokenization Strategies for MEG Foundation Models
Authors: SungJun Cho, Chetan Gohil, Rukuang Huang, Oiwi Parker Jones, Mark W. Woolrich
Abstract: Recent success in natural language processing has motivated growing interest in large-scale foundation models for neuroimaging data. Such models often require discretization of continuous neural time series data, a process referred to as 'tokenization'. However, the impact of different tokenization strategies for neural data is currently poorly understood. In this work, we present a systematic evaluation of sample-level tokenization strategies for transformer-based large neuroimaging models (LNMs) applied to magnetoencephalography (MEG) data. We compare learnable and non-learnable tokenizers by examining their signal reconstruction fidelity and their impact on subsequent foundation modeling performance (token prediction, biological plausibility of generated data, preservation of subject-specific information, and performance on downstream tasks). For the learnable tokenizer, we introduce a novel approach based on an autoencoder. Experiments were conducted on three publicly available MEG datasets spanning different acquis...
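The abstract contrasts non-learnable tokenizers with a learnable one; the paper's learnable tokenizer is autoencoder-based, which is not reproduced here. As a simpler stand-in that illustrates what "learnable" means in this context, the sketch below fits a data-driven scalar codebook with Lloyd-Max-style (1-D k-means) iterations, so token placement adapts to the amplitude distribution instead of being fixed in advance. All names and the codebook size are illustrative assumptions.

```python
import numpy as np

def fit_codebook(samples, k=64, iters=20, seed=0):
    # Learnable sample-level tokenizer (illustrative, not the paper's
    # autoencoder): Lloyd-Max / 1-D k-means over sample amplitudes.
    rng = np.random.default_rng(seed)
    codebook = rng.choice(samples, size=k, replace=False).astype(float)
    for _ in range(iters):
        # Assign each sample to its nearest code value.
        assign = np.argmin(np.abs(samples[:, None] - codebook[None, :]), axis=1)
        # Move each code value to the mean of its assigned samples.
        for j in range(k):
            mask = assign == j
            if mask.any():
                codebook[j] = samples[mask].mean()
    return np.sort(codebook)

def encode(signal, codebook):
    # Token ID = index of the nearest learned code value.
    return np.argmin(np.abs(signal[:, None] - codebook[None, :]), axis=1)

def decode(tokens, codebook):
    return codebook[tokens]

# Toy stand-in for a single MEG channel: noisy sinusoid.
rng = np.random.default_rng(1)
x = np.sin(np.linspace(0, 8 * np.pi, 1000)) + 0.05 * rng.standard_normal(1000)

codebook = fit_codebook(x, k=64)
ids = encode(x, codebook)
x_hat = decode(ids, codebook)
mse = np.mean((x - x_hat) ** 2)
```

Because the codebook concentrates code values where samples are dense, a learned quantizer can match or beat a uniform one at the same token budget, which echoes the study's finding that both strategies reach comparably high reconstruction accuracy.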