I ran the same coding task through 5 AI models every week for a month
by Noah Williams

Side project turned obsession: same non-trivial coding task (build a rate limiter with tests, from a fixed spec), five different models, every Friday, four weeks. Results: the top two swapped places twice, the "best" model failed a run the cheapest one aced, and week-to-week variance within the SAME model was bigger than the gap between models. My takeaway as a student: stop arguing about which model is smartest and start building evals for YOUR task. The leaderboard is a vibe; your test suite is the truth. Happy to share the spec if anyone wants to reproduce this.
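
For anyone who wants to run the same "variance vs. gap" check on their own runs, here is a minimal sketch in Python. It is not the script used for this post. It assumes you record one score per model per run (for example, the number of tests passed), and it makes two choices you may want to change: "variance" is measured as the max-minus-min range of a model's scores, and "gap" is the difference between the best and worst average across models. The name `summarize` and the input layout are made up for illustration.

```python
from statistics import mean


def summarize(scores):
    """Compare within-model spread to the between-model gap.

    scores: dict mapping a model name to a list of per-run scores,
            e.g. tests passed on each weekly run (hypothetical layout).
    """
    if not scores or any(len(runs) == 0 for runs in scores.values()):
        raise ValueError("need at least one model and one run per model")

    # Average score per model across all runs.
    means = {model: mean(runs) for model, runs in scores.items()}

    # Within-model spread: best run minus worst run for the same model.
    ranges = {model: max(runs) - min(runs) for model, runs in scores.items()}

    # Between-model gap: best average minus worst average.
    gap = max(means.values()) - min(means.values())

    widest = max(ranges.values())
    return {
        "means": means,
        "ranges": ranges,
        "gap": gap,
        "widest_range": widest,
        # True when one model's run-to-run swing exceeds the gap between models.
        "noise_dominates": widest > gap,
    }


if __name__ == "__main__":
    # Placeholder numbers for illustration only, NOT the results from this post.
    placeholder = {
        "model_a": [10, 20, 15, 15],
        "model_b": [14, 14, 14, 14],
    }
    print(summarize(placeholder))
```

If `noise_dominates` comes back true for your task, a single run cannot rank the models. You need more runs per model before the ordering means anything.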