[2604.08556] EMA Is Not All You Need: Mapping the Boundary Between Structure and Content in Recurrent Context


Computer Science > Computation and Language
arXiv:2604.08556 (cs) [Submitted on 17 Mar 2026]

Title: EMA Is Not All You Need: Mapping the Boundary Between Structure and Content in Recurrent Context
Authors: Arth Singh

Abstract: What exactly do efficient sequence models gain over simple temporal averaging? We use exponential moving average (EMA) traces, the simplest recurrent context (no gating, no content-based retrieval), as a controlled probe to map the boundary between what fixed-coefficient accumulation can and cannot represent. EMA traces encode temporal structure: a Hebbian architecture with multi-timescale traces achieves 96% of a supervised BiGRU on grammatical role assignment with zero labels, surpassing the supervised model on structure-dependent roles. EMA traces destroy token identity: a 130M-parameter language model using only EMA context reaches C4 perplexity 260 (8x GPT-2), and a predictor ablation (replacing the linear predictor with full softmax attention) yields identical loss, localizing the entire gap to the traces. The traces apply lossy, data-independent compression; by the data processing inequality, no downstream predictor can recover the discarded information. Fixed-coefficient accumulation, whether across time or depth, suffers irreversible information dilution that only learned, in...
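As a rough illustration of the mechanism the abstract describes, the following is a minimal sketch of fixed-coefficient, multi-timescale EMA traces over a sequence of token embeddings. The function name, decay rates, and dimensions are hypothetical choices for illustration, not taken from the paper; the key property shown is that the coefficients are data-independent, so the traces blur token identity the same way regardless of content.

```python
import numpy as np

def ema_traces(tokens: np.ndarray, decays=(0.9, 0.99, 0.999)) -> np.ndarray:
    """Accumulate multi-timescale EMA traces over a (T, d) embedding sequence.

    For each decay rate alpha, the trace follows the fixed-coefficient rule
        h_t = alpha * h_{t-1} + (1 - alpha) * x_t,
    with no gating and no content-based retrieval. Because the coefficients
    never depend on the data, two different token orderings with the same
    weighted sum produce identical traces -- the lossy compression the
    abstract attributes to the perplexity gap.
    """
    T, d = tokens.shape
    traces = np.zeros((len(decays), T, d))
    for k, alpha in enumerate(decays):
        h = np.zeros(d)
        for t in range(T):
            h = alpha * h + (1.0 - alpha) * tokens[t]
            traces[k, t] = h
    return traces

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))          # 16 tokens, 8-dim embeddings (illustrative)
out = ema_traces(x)
print(out.shape)  # (3, 16, 8): one trace per timescale
```

Slow decays (alpha near 1) retain long-range structure but average away recent token identity; fast decays do the reverse. Stacking several timescales, as the paper's Hebbian architecture does, recovers temporal structure but not the discarded per-token content.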

Originally published on April 13, 2026. Curated by AI News.

