[R] Hybrid attention for small code models: 50x faster inference, but data scaling still dominates
TLDR: Forked PyTorch and Triton internals. Changed attention so the first layer is linear, the middle layers are quadratic, and the last layer is linear. Inference got much faster with a low perplexity hit in tests. I trained a 25.6M parameter Rust-focused language model from scratch using a byte-level GPT-style decoder. The main result is that increasing dataset size mattered more than any architectural change. Expanding the corpus from about 31MB of core Rust sources to roughly 173MB by adding a few hundred cra...
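To make the layer split concrete, here is a minimal PyTorch sketch of the pattern the TLDR describes: O(n) linear attention in the first and last decoder layers, standard O(n²) softmax attention in between. This is illustrative, not the forked PyTorch/Triton code from the post; the ELU-feature-map linear attention (Katharopoulos et al. style) is an assumed stand-in, the names `quadratic_attention`, `linear_attention`, and `HybridDecoder` are hypothetical, and residual norms, MLP blocks, and multi-head splitting are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def quadratic_attention(q, k, v):
    """Standard causal softmax attention: O(n^2) in sequence length."""
    n = q.size(-2)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=q.device), diagonal=1)
    return scores.masked_fill(mask, float("-inf")).softmax(dim=-1) @ v


def linear_attention(q, k, v):
    """Causal kernelized attention: O(n) in sequence length.

    Uses phi(x) = elu(x) + 1 as the feature map and cumulative sums as the
    running state. A fused kernel (e.g. in Triton) would recur over positions
    instead of materializing the (n, d, d) state tensor as done here.
    """
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.cumsum(k.unsqueeze(-1) * v.unsqueeze(-2), dim=-3)  # running sum of k_i v_i^T
    z = torch.cumsum(k, dim=-2)                                   # running normalizer
    num = (q.unsqueeze(-2) @ kv).squeeze(-2)
    den = (q * z).sum(dim=-1, keepdim=True).clamp(min=1e-6)
    return num / den


class HybridDecoder(nn.Module):
    """Toy single-head decoder: linear attention first and last, quadratic in between."""

    def __init__(self, n_layers: int = 8, d_model: int = 256):
        super().__init__()
        self.qkv = nn.ModuleList([nn.Linear(d_model, 3 * d_model) for _ in range(n_layers)])

    def forward(self, x):
        last = len(self.qkv) - 1
        for i, proj in enumerate(self.qkv):
            q, k, v = proj(x).chunk(3, dim=-1)
            attn = linear_attention if i in (0, last) else quadratic_attention
            x = x + attn(q, k, v)  # residual connection; norms/MLPs omitted
        return x


x = torch.randn(2, 128, 256)  # (batch, seq_len, d_model)
y = HybridDecoder()(x)        # same shape out: (2, 128, 256)
```

Under this reading, only the middle layers pay the quadratic cost, so the inference speedup in the title would come from the O(n) layers dominating at long contexts; how many layers stay quadratic is a tuning knob the TLDR does not pin down.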