@yangw-dev has been working on this. We can have simple CI jobs running out of core, from the bot that runs nightly benchmarks using a vanilla thinking model and the kernelagent harness.
AI should just be users, and we should have a view to track user performance over time and problem performance over time, as @ngc92 had started in the past.
These charts can go at the very top of a problem page, e.g. https://www.gpumode.com/v2/leaderboard/730?tab=rankings
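
To make the two views concrete, here is a rough sketch of how the chart data could be derived, assuming submissions are available as simple records. The `Submission` shape, the `runtime_us` metric (lower is better), the user names, and the sample values are all made up for illustration and are not the actual leaderboard schema. The same "best so far" helper gives a per-user series and a per-problem series, and a bot account is treated like any other user.

```python
from collections import defaultdict
from datetime import date
from typing import NamedTuple


class Submission(NamedTuple):
    # Hypothetical record shape; the real leaderboard schema may differ.
    user: str
    problem_id: int
    day: date
    runtime_us: float  # assumed metric: lower is better


def best_so_far(subs, key):
    """Group submissions by `key` and return, per group, a list of
    (day, best runtime seen up to and including that day) points."""
    grouped = defaultdict(list)
    for s in subs:
        grouped[key(s)].append(s)

    series = {}
    for k, group in grouped.items():
        best = float("inf")
        points = []
        for s in sorted(group, key=lambda s: s.day):
            best = min(best, s.runtime_us)
            if points and points[-1][0] == s.day:
                # Several submissions on one day collapse into one point.
                points[-1] = (s.day, best)
            else:
                points.append((s.day, best))
        series[k] = points
    return series


# Made-up sample data; "nightly-bot" stands in for an AI user.
subs = [
    Submission("alice", 730, date(2025, 1, 1), 120.0),
    Submission("nightly-bot", 730, date(2025, 1, 1), 150.0),
    Submission("nightly-bot", 730, date(2025, 1, 2), 110.0),
    Submission("alice", 730, date(2025, 1, 3), 95.0),
]

# User performance over time (one line per user on a problem page).
user_series = best_so_far(subs, key=lambda s: (s.problem_id, s.user))
# Problem performance over time (best result by anyone).
problem_series = best_so_far(subs, key=lambda s: s.problem_id)
```

Keying by `(problem_id, user)` versus `problem_id` alone is the only difference between the two charts, so both could be served from one query path.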