AGI & Artificial Intelligence 1 hour ago

I ran the same coding task through 5 AI models every week for a month

by Noah Williams

Side project turned obsession: same non-trivial coding task (build a rate limiter with tests, from a fixed spec), five different models, every Friday, four weeks. Results: the top two swapped places twice, the "best" model failed a run that the cheapest one aced, and week-to-week variance within the SAME model was bigger than the gap between models. My takeaway as a student: stop arguing about which model is smartest and start building evals for YOUR task. The leaderboard is a vibe; your test suite is the truth. Happy to share the spec if anyone wants to reproduce this.
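For anyone who wants to check the "variance within a model vs. gap between models" claim on their own runs, here is a rough sketch of one way to compare the two. Everything in it is an assumption for illustration: the function name `variance_report`, the model labels, and the pass rates are placeholders, not the numbers from this experiment, and population standard deviation is just one reasonable choice of spread.

```python
from statistics import mean, pstdev


def variance_report(pass_rates):
    """Compare week-to-week spread within each model to the gap between the top two.

    pass_rates maps a model label to its per-week pass rates (0.0 to 1.0).
    Assumes at least two models and at least one run per model.
    """
    means = {model: mean(runs) for model, runs in pass_rates.items()}
    spreads = {model: pstdev(runs) for model, runs in pass_rates.items()}

    # Rank models by average pass rate, best first.
    ranking = sorted(means, key=means.get, reverse=True)
    top_gap = means[ranking[0]] - means[ranking[1]]
    max_within_spread = max(spreads.values())

    return {
        "ranking": ranking,
        "top_gap": top_gap,
        "max_within_spread": max_within_spread,
        # True when run-to-run noise is larger than the gap between the leaders.
        "noise_dominates": max_within_spread > top_gap,
    }


# Placeholder data only (hypothetical labels and pass rates, four weekly runs each).
example = {
    "model_a": [0.9, 0.6, 1.0, 0.9],
    "model_b": [0.8, 0.8, 0.7, 0.9],
}
report = variance_report(example)
print(report)
```

If `noise_dominates` comes back true on your own data, a single week's ranking probably says little about which model is better for your task, and more runs are needed before switching.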


Comments

Maya Patel 1 hour ago

This mirrors what I see in AI search: rankings inside one model move week to week even when nothing changed on the site. Everyone hunting a stable "algorithm" is chasing a moving average. Build for the distribution, not the snapshot.

Olivia Chen 1 hour ago

This matches my eval work exactly. We run a 200-case suite nightly across three providers, and the intra-model variance is why we stopped hot-swapping "the best" model every month. Pin, eval, migrate deliberately. Publish the spec — I'll run it against our stack.

Priya Nair 1 hour ago

Week-to-week variance within one model is the most under-reported fact in applied ML right now. We version-pin like it's 2015 dependency management again. Publish the spec — this deserves replication.
