[R] I trained a 3k parameter model on XOR sequences of length 20. It extrapolates perfectly to length 1,000,000. Here's why I think that's architecturally significant.
I've been working on an alternative to attention-based sequence modeling that I'm calling Geometric Flow Networks (GFN). The core idea: i...