Have a simple AI CI script that runs nightly using a base model and a sophisticated agent loop to baseline AI progress.

@yangw-dev has been working on this. We can have simple CI jobs, running out of core from the bot, that run nightly benchmarks using a vanilla thinking model and the kernelagent harness.
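A rough sketch of what the nightly job could look like, in case it helps scope the work. Everything here is hypothetical: `run_nightly`, `NightlyResult`, the `solve` callable (standing in for "vanilla thinking model + kernelagent harness") and the `ai-baseline` user name are illustrative placeholders, not the actual bot or kernelagent API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class NightlyResult:
    run_date: date
    problem: str
    user: str               # the AI is recorded like any other user
    score: Optional[float]  # None when the agent produced no valid submission


def run_nightly(
    problems: Iterable[str],
    solve: Callable[[str], float],  # hypothetical: runs the agent loop on one problem
    user: str = "ai-baseline",      # hypothetical user name for the AI
    today: Optional[date] = None,
) -> List[NightlyResult]:
    """Run the agent once per problem and collect one result row per problem."""
    run_date = today or date.today()
    results = []
    for problem in problems:
        try:
            score = solve(problem)
        except Exception:
            # A failed run is still recorded so gaps show up in the charts.
            score = None
        results.append(NightlyResult(run_date, problem, user, score))
    return results
```

The CI side would then just be a scheduled job that calls something like `run_nightly(...)` and submits the rows through the same path a human user would use.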

AI should just be users, and we should have a view to track user performance over time and problem performance over time, as @ngc92 had started in the past.

These charts can be at the very top of a problem page, e.g. https://www.gpumode.com/v2/leaderboard/730?tab=rankings
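To make the two views concrete, here is a minimal sketch of the aggregation behind each chart. It assumes a hypothetical row shape of `(day, user, problem, runtime_seconds)` with lower runtime being better; the real submission schema and scoring may differ.

```python
from datetime import date
from typing import Dict, Iterable, List, Tuple

# Hypothetical row shape: (day, user, problem, runtime_seconds); lower is better.
Row = Tuple[date, str, str, float]


def user_series(rows: Iterable[Row], user: str, problem: str) -> List[Tuple[date, float]]:
    """Best runtime per day for one user on one problem, sorted by day."""
    best: Dict[date, float] = {}
    for day, row_user, row_problem, runtime in rows:
        if row_user == user and row_problem == problem:
            best[day] = min(runtime, best.get(day, runtime))
    return sorted(best.items())


def problem_series(rows: Iterable[Row], problem: str) -> List[Tuple[date, float]]:
    """Running best runtime across all users for one problem, sorted by day."""
    daily: Dict[date, float] = {}
    for day, _user, row_problem, runtime in rows:
        if row_problem == problem:
            daily[day] = min(runtime, daily.get(day, runtime))
    series = []
    running_best = float("inf")
    for day in sorted(daily):
        running_best = min(running_best, daily[day])
        series.append((day, running_best))
    return series
```

Since the AI baseline is just another user, its curve comes from `user_series` with the AI's user name, and it can be overlaid on the `problem_series` curve to show how far the baseline is from the best human result.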

Have a simple AI CI script that runs nightly using a base model and a sophisticated agent loop to baseline AI progress. #415
