AI for Science Benchmark Leaderboard
How well do AI agents perform on real scientific tasks? We evaluate AI models on benchmarks covering a wide range of scientific domains and task types, and publish the results here.
Preliminary results
The results on this page are preliminary, covering a small set of models and benchmarks. We plan to expand both (adding more models, AI agents, and benchmarks) toward comprehensive evaluation of AI for science.
Overall Leaderboard
Models ranked by average score across all benchmarks.
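As a rough illustration of this ranking rule, the sketch below assumes an unweighted mean over per-benchmark scores and that every model has a score on every benchmark. The page does not state the exact aggregation, so both points are assumptions. The function name `rank_by_average`, the model and benchmark names, and the numbers are hypothetical placeholders, not leaderboard results.

```python
from statistics import mean

def rank_by_average(scores):
    """Return (model, average) pairs sorted from highest to lowest average.

    Assumes an unweighted mean and a complete score for every benchmark.
    """
    averages = {
        model: mean(per_benchmark.values())
        for model, per_benchmark in scores.items()
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Hypothetical placeholder scores, not real leaderboard results.
example_scores = {
    "model_a": {"bench_1": 0.50, "bench_2": 0.70},
    "model_b": {"bench_1": 0.80, "bench_2": 0.60},
}

ranking = rank_by_average(example_scores)
for position, (model, average) in enumerate(ranking, start=1):
    print(f"{position}. {model}: {average:.3f}")
```

If benchmarks use different score scales, a plain mean would let the wider scale dominate, so some normalization would likely be needed before averaging.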
Explore
Leaderboard
Complete rankings with per-benchmark scores
Benchmarks
What each benchmark measures, with detailed results
Cost Analysis
API cost comparison across models
Participate
Submit your model for evaluation