[R] GPT-5.4-mini regressed 22pp on vanilla prompting vs GPT-5-mini. Nobody noticed because benchmarks don't test this. Recursive Language Models solved it.
GPT-5.4-mini produces shorter, terser output by default. Vanilla accuracy dropped from 69.5% to 47.2% across 12 tasks (1,800 evals). The official RLM implementation dropped too (69.7% to 50.2%). Our implementation, where the model writes Python to query the data instead of attending to all of it with task pattern matching and entropy, went from 72.7% to 69.5%. The architecture absorbed what the model couldn't. Also: on AIME 2025 it scores 80% vs 0% for vanilla prompting. Same pattern as with GPT-5.2. The model outputs a b...