Framesworks For Supporting Benchmarking [P]
About this article
I think the way we are approaching benchmarking is a bit problematic. From reading about how frontier labs benchmark their models, they essentially create a new model, configure a harness, and then run a massive benchmarking suite just to demonstrate marginal gains. I have several problems with this approach. I worry that we are wasting a significant amount of resources iterating on models and effectively trading carbon for confidence. Looking at the latest Gemini benchmarking, for instance, ...
You've been blocked by network security.To continue, log in to your Reddit account or use your developer tokenIf you think you've been blocked by mistake, file a ticket below and we'll look into it.Log in File a ticket