AI for Science Benchmark Leaderboard

How well do AI agents perform on real scientific tasks? We evaluate AI models on benchmarks covering a wide range of scientific domains and task types, and publish the results here.

Preliminary results

The results on this page are preliminary and cover only a small set of models and benchmarks. We plan to add more models, AI agents, and benchmarks, working toward a comprehensive evaluation of AI for science.

Overall Leaderboard

Models ranked by average score across all benchmarks.
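
As a rough illustration of ranking by average score, the sketch below takes an unweighted mean of each model's per-benchmark scores and sorts from highest to lowest. It is only a sketch under stated assumptions: the function name `rank_by_average`, the model and benchmark names, and the numbers are all hypothetical placeholders, and it assumes every model has a score on every benchmark and that all scores share the same scale. The leaderboard's actual aggregation may differ, for example in weighting or in how missing results are handled.

```python
from statistics import mean


def rank_by_average(scores):
    """Return (model, average) pairs sorted from highest to lowest average.

    `scores` maps model name -> {benchmark name -> score}.
    Assumption: every model has a score on every benchmark, and all
    scores are on the same scale, so an unweighted mean is meaningful.
    """
    averages = {
        model: mean(list(per_benchmark.values()))
        for model, per_benchmark in scores.items()
    }
    # Sort by the average (second tuple element), best first.
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Placeholder data for illustration only, not real leaderboard results.
example_scores = {
    "model_a": {"bench_1": 0.25, "bench_2": 0.75},
    "model_b": {"bench_1": 0.50, "bench_2": 1.00},
}

ranking = rank_by_average(example_scores)
for position, (model, average) in enumerate(ranking, start=1):
    print(f"{position}. {model}: {average:.2f}")
```

An unweighted mean treats every benchmark as equally important, so a model that is strong on one benchmark and weak on another can rank alongside a uniformly average one; the per-benchmark scores remain the more informative view.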

View Detailed Leaderboard →

Explore

📊 Leaderboard

Complete rankings with per-benchmark scores

🧪 Benchmarks

What each benchmark measures, with detailed results

💰 Cost Analysis

API cost comparison across models

🚀 Participate

Submit your model for evaluation