[R] Causal self-attention as a probabilistic model over embeddings
About this article
We’ve been working on a probabilistic interpretation of causal self-attention in which token embeddings are treated as latent variables. In that view, the attention map induces a change-of-variables term, which leads to a barrier / degeneracy boundary in embedding space. The resulting picture gives:

- a stability-margin interpretation of causal-attention “support tokens,” i.e. the positions closest to the degeneracy boundary
- a simple MAP-style training penalty: standard cross-entropy plus a smooth lo...
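The exact form of the penalty is cut off above, so the following is only a minimal sketch of the general shape described: standard cross-entropy plus a smooth barrier term that grows as an embedding's margin to the degeneracy boundary shrinks. The `smooth_barrier` form and the `margin` quantity are assumptions for illustration, not the post's actual formulation.

```python
import numpy as np

def cross_entropy(logits, target):
    # Standard cross-entropy for a single position (numerically stable).
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def smooth_barrier(margin, eps=1e-6):
    # Hypothetical smooth barrier: blows up as the margin to the
    # degeneracy boundary approaches zero. A -log barrier is one
    # common smooth choice; the post's actual term may differ.
    return -np.log(margin + eps)

def map_style_loss(logits, target, margin, lam=0.1):
    # MAP-style objective: data term (cross-entropy) plus a smooth
    # penalty discouraging embeddings near the degeneracy boundary.
    return cross_entropy(logits, target) + lam * smooth_barrier(margin)
```

With `lam = 0` this reduces to plain cross-entropy; larger `lam` trades likelihood for distance from the boundary.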