Dec 17, 2025

AI4Science Evaluation Data: An Overview and Quick Test of FrontierScience (OpenAI)

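About three months after the September 3, 2025 post about OpenAI for Science, OpenAI released FrontierScience yesterday, on December 16, 2025. It is a dataset for evaluating how well AI can carry out scientific research tasks. This article first summarizes the dataset and then shares the results of a quick hands-on test.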

What is FrontierScience?

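FrontierScience is a set of problems written by experts in physics, chemistry, and biology to evaluate how well AI can solve scientific problems. According to OpenAI, the problems are difficult, original, and scientifically meaningful.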

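One reason OpenAI gives for creating this data is that existing benchmarks are becoming saturated. For example, on GPQA (a “Google-Proof” science benchmark of questions written by PhD experts), GPT-4 scored about 39%, whereas two years later GPT-5.2 is reported to solve 92%. This suggests that, as model performance improves, existing evaluation problems can no longer discriminate well between models. Keeping evaluation in step with AI progress therefore requires new, harder problem sets, and FrontierScience can be seen as one such attempt.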

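FrontierScience consists of the following two datasets.

FrontierScience-Olympiad: Short-answer problems in the style of the international science olympiads, written by actual olympiad medalists.

FrontierScience-Research: More practical, open-ended research problems of the kind encountered in real research work. The authors are researchers such as professors, postdoctoral researchers, and PhD students.

In total there are reportedly about 700 questions, of which only 160 (Olympiad: 100, Research: 60) are public. The remaining questions are held back to prevent contamination of future training data.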

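Trying it out

I ran a quick evaluation using only the first five questions from each of FrontierScience-Olympiad and FrontierScience-Research. With so few samples the numbers have limited value as a quantitative evaluation; the goal is only to check behavior and compare APIs.

One thing I was personally curious about was whether the Chat Completions API and the Responses API differ in behavior or performance when used with the same model, so I checked that.

First, get the dataset from Hugging Face.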

git clone https://huggingface.co/datasets/openai/frontierscience

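Following the setup described in the paper, the evaluation uses the following two-model configuration.

Solver model: the model that solves the problem
Judge model: the model that grades the Solver's output

The Judge prompt is the one given in Appendix B of the paper, used verbatim.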
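
The Solver/Judge split can be sketched independently of any API. The snippet below is a minimal illustration, not the paper's implementation: `evaluate`, `accuracy`, and the stub callables are hypothetical names introduced here, and the record keys `problem` and `answer` follow the fields used in the scripts in this article.

```python
from typing import Callable, Dict, Iterable, List


def evaluate(
    problems: Iterable[Dict[str, str]],
    solve: Callable[[str], str],
    judge: Callable[[str, str, str], str],
) -> List[Dict[str, str]]:
    """Run the Solver on each problem, then let the Judge grade the answer."""
    results = []
    for item in problems:
        attempted = solve(item["problem"])
        verdict = judge(item["problem"], item["answer"], attempted)
        results.append({"attempted_answer": attempted, "verdict": verdict})
    return results


def accuracy(results: List[Dict[str, str]]) -> float:
    """Fraction of results graded CORRECT (0.0 for an empty list)."""
    if not results:
        return 0.0
    return sum(r["verdict"] == "CORRECT" for r in results) / len(results)


# Stub Solver and Judge so the pipeline can be exercised without API calls.
def stub_solve(problem: str) -> str:
    return "4" if "2 + 2" in problem else "unknown"


def stub_judge(problem: str, reference: str, attempted: str) -> str:
    return "CORRECT" if attempted == reference else "INCORRECT"


demo_problems = [
    {"problem": "What is 2 + 2?", "answer": "4"},
    {"problem": "What is 3 + 3?", "answer": "6"},
]
demo_results = evaluate(demo_problems, stub_solve, stub_judge)
print(accuracy(demo_results))
```

Passing the Solver and Judge in as callables keeps the loop unchanged when either model, or the API used to call it, is swapped.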

FrontierScience-Olympiad

We start with the relatively simple Olympiad task. Below, the Solver is o4-mini and the Judge is gpt-5-mini.

"""
Solver: solve the problems with the Responses API + o4-mini
Judge: grade the answers with the Responses API + gpt-5-mini
"""
import json
import os
from pathlib import Path

from dotenv import load_dotenv
from openai import OpenAI

# Model settings
SOLVER_MODEL = "o4-mini-2025-04-16"
JUDGE_MODEL = "gpt-5-mini-2025-08-07"

# Prompt template for the Judge
JUDGE_PROMPT_TEMPLATE = """
You are grading an attempted answer to a science olympiad problem. You will be given the problem, attempted answer, and reference answer. Evaluate the solution against the provided reference solution, ensuring it is complete and matches the reference solution. Pay close attention to detail and grade it strictly, but fairly.
The reference answer is either a single number or expression in latex formatting, a chemical formula, a compound name, or a phrase referring to a specific name, entity, or method.
Mark the attempted answer as correct if it fully matches the reference answer or is otherwise equivalent (e.g., an equivalent algebraic expression, a numerical number within 1 decimal place rounding of the reference answer (e.g., 6.69 approx 6.7), an equivalent name for a compound/formula, equivalent when accounting for units, etc.). Mark it as incorrect if it is not equivalent to the reference answer.
***
The problem: {problem}
***
The reference answer: {reference_answer}
***
The attempted answer: {answer}
***
First, think step-by-step about whether the attempted answer matches the reference answer. If the attempted answer is correct, write "VERDICT: CORRECT" in the last line of your response, with no other text or formatting. If it is incorrect, write "VERDICT: INCORRECT".
"""

def parse_response_output(response):
    """Extract the answer text from the first message item in the response."""
    answer = ""
    for item in response.output if isinstance(response.output, list) else []:
        if not hasattr(item, "type"):
            continue
        if item.type == "message":
            content = item.content if hasattr(item, "content") else []
            if isinstance(content, list):
                for c in content:
                    if hasattr(c, "text"):
                        answer = c.text.strip()
                        break
            break
    return answer

def get_solver_answer(client, problem):
    """Have the Solver model solve the problem (web search and code interpreter enabled)."""
    response = client.responses.create(
        model=SOLVER_MODEL,
        input=problem,
        tools=[
            {"type": "web_search_preview"},
            {"type": "code_interpreter", "container": {"type": "auto"}},
        ],
    )
    return parse_response_output(response)

def judge_answer(client, problem, reference_answer, attempted_answer):
    """Grade the attempted answer with the Judge model."""
    prompt = JUDGE_PROMPT_TEMPLATE.format(
        problem=problem, reference_answer=reference_answer, answer=attempted_answer
    )
    response = client.responses.create(
        model=JUDGE_MODEL,
        input=prompt,
    )
    judgment = parse_response_output(response)
    # Extract the VERDICT from the last line
    verdict = "INCORRECT"
    if "VERDICT: CORRECT" in judgment.split("\n")[-1]:
        verdict = "CORRECT"
    return verdict, judgment

def main():
    load_dotenv()
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    # Load the test data
    test_file = Path("olympiad/test.jsonl")
    problems = []
    with open(test_file, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= 5:  # first 5 questions only
                break
            problems.append(json.loads(line))

    results = []
    for i, problem_data in enumerate(problems, 1):
        print(f"\n{'=' * 60}")
        print(f"Problem {i}/{len(problems)}")
        print(f"{'=' * 60}")
        print(f"Subject: {problem_data['subject']}")
        print(f"\n{problem_data['problem'][:200]}...")

        # Get the Solver's answer
        print("\n[Solver] Solving...")
        attempted_answer = get_solver_answer(client, problem_data["problem"])
        print(f"Attempted Answer: {attempted_answer[:200]}...")

        # Grade with the Judge
        print("\n[Judge] Grading...")
        verdict, judgment = judge_answer(
            client, problem_data["problem"], problem_data["answer"], attempted_answer
        )
        print(f"Verdict: {verdict}")

        results.append(
            {
                "problem_id": i,
                "subject": problem_data["subject"],
                "problem": problem_data["problem"],
                "reference_answer": problem_data["answer"],
                "attempted_answer": attempted_answer,
                "verdict": verdict,
                "judgment": judgment,
            }
        )

    # Save the results
    output_file = Path("evaluation_results.json")
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)

    # Print the summary
    correct_count = sum(1 for r in results if r["verdict"] == "CORRECT")
    print(f"\n{'=' * 60}")
    print("Evaluation complete")
    print(f"{'=' * 60}")
    print(f"Correct: {correct_count}/{len(results)}")
    print(f"Accuracy: {correct_count / len(results) * 100:.1f}%")
    print(f"\nResults saved to: {output_file}")

if __name__ == "__main__":
    main()
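
The script above reads the verdict from the last line of the Judge's output and treats anything else as INCORRECT. As a hedged alternative, the sketch below scans backwards for the last line containing a verdict and returns `None` when none is found, so an unparseable response can be told apart from a genuinely incorrect answer. `extract_verdict` is a hypothetical helper, and the possibility that the Judge adds trailing text or markdown is an assumption; it was not observed in this test.

```python
import re
from typing import Optional

# Matches "VERDICT: CORRECT" or "VERDICT: INCORRECT" anywhere in a line.
_VERDICT_RE = re.compile(r"VERDICT:\s*(CORRECT|INCORRECT)\b")


def extract_verdict(judgment: str) -> Optional[str]:
    """Return the last verdict found in the Judge output, or None if absent."""
    for line in reversed(judgment.splitlines()):
        match = _VERDICT_RE.search(line)
        if match:
            return match.group(1)
    return None


print(extract_verdict("Step-by-step reasoning...\nVERDICT: CORRECT\n"))
print(extract_verdict("Reasoning...\n**VERDICT: INCORRECT**"))
print(extract_verdict("The judge forgot to give a verdict."))
```

Counting the `None` cases separately makes it visible when a low score comes from formatting problems and not from wrong answers.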

For the script that tests the Chat Completions API, only the get_solver_answer function was changed, as shown below.

"""
Solver: solve the problems with the Chat Completions API + o4-mini
Judge: grade the answers with the Responses API + gpt-5-mini
"""

def get_solver_answer(client, problem):
    """Have the Solver model solve the problem via the Chat Completions API."""
    response = client.chat.completions.create(
        model=SOLVER_MODEL,
        messages=[
            {
                "role": "user",
                "content": problem,
            }
        ],
        temperature=1.0,
    )
    return response.choices[0].message.content.strip()
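
The Responses API solver earlier in this article passes `web_search_preview` and `code_interpreter` tools, while the Chat Completions version above passes none. For a like-for-like comparison of the two APIs, one option is a tool-free Responses call. The sketch below is an assumption about how that could look and was not part of the original test: `get_solver_answer_no_tools` is a hypothetical name, it relies on the `output_text` convenience property of the OpenAI Python SDK, and the stub client exists only so the function can be exercised offline.

```python
from types import SimpleNamespace


def get_solver_answer_no_tools(client, problem, model="o4-mini-2025-04-16"):
    """Call the Responses API without tools and return the concatenated text output."""
    response = client.responses.create(model=model, input=problem)
    return response.output_text.strip()


# Offline stand-in for an OpenAI client: echoes the input instead of calling the API.
def _stub_create(**kwargs):
    return SimpleNamespace(output_text=f"  echo: {kwargs['input']}  ")


stub_client = SimpleNamespace(responses=SimpleNamespace(create=_stub_create))
print(get_solver_answer_no_tools(stub_client, "What is 2 + 2?"))
```

With a real `OpenAI()` client in place of `stub_client`, the only remaining difference from the Chat Completions solver is the API surface itself.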

FrontierScience-Research

The Research task uses rubric-based grading. The answer field of the evaluation data contains a detailed grading rubric, and the Judge model scores the response step by step against it.

The prompt itself is again the one given in the paper, used verbatim, and only the judge_answer function was lightly modified to handle rubrics.

import re

def judge_answer(client, problem, rubric, attempted_answer):
    """Grade the attempted answer with the Judge model (rubric-based)."""
    # JUDGE_PROMPT_TEMPLATE here must be the Research prompt, which has
    # {problem}, {rubric}, and {answer} placeholders.
    prompt = JUDGE_PROMPT_TEMPLATE.format(
        problem=problem, rubric=rubric, answer=attempted_answer
    )
    response = client.responses.create(
        model=JUDGE_MODEL,
        input=prompt,
    )
    judgment = parse_response_output(response)

    # Extract "VERDICT: <points>" from the last line
    score = 0.0
    last_line = judgment.split("\n")[-1]
    match = re.search(r"VERDICT:\s*([\d.]+)", last_line)
    if match:
        score = float(match.group(1))

    return score, judgment
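
The Research scores in this article are reported as percentages, so the per-question points have to be normalized somewhere. The sketch below shows one way to do that. It assumes each rubric is worth a maximum of 10 points, which is not stated in this article and should be checked against the paper; `parse_points` and `mean_score_percent` are hypothetical helpers.

```python
import re
from typing import List, Optional

# Stricter than [\d.]+ so that strings such as ".." cannot reach float().
_POINTS_RE = re.compile(r"VERDICT:\s*(\d+(?:\.\d+)?)")


def parse_points(last_line: str) -> Optional[float]:
    """Return the points in a 'VERDICT: <points>' line, or None if absent."""
    match = _POINTS_RE.search(last_line)
    return float(match.group(1)) if match else None


def mean_score_percent(scores: List[float], max_points: float = 10.0) -> float:
    """Average rubric score as a percentage of max_points (assumed to be 10)."""
    if not scores:
        raise ValueError("scores must not be empty")
    return 100.0 * sum(scores) / (len(scores) * max_points)


points = [parse_points(line) for line in ["VERDICT: 6.5", "VERDICT: 3", "no verdict"]]
graded = [p for p in points if p is not None]
print(points, mean_score_percent(graded))
```

Whether an unparseable Judge response should count as zero or be excluded is a design choice; the example excludes it, whereas the function above in this article scores it as 0.0.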

Results

Because only five questions per dataset were evaluated, the results themselves carry no statistical significance, but for reference the scores are summarized below.

API Type	Model	Dataset	Score
Chat Completions API	o4-mini	Olympiad	80%
Responses API	o4-mini	Olympiad	100%
Chat Completions API	o4-mini	Research	45.0%
Responses API	o4-mini	Research	44.5%

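On the Olympiad task both APIs performed well. The Responses API came out ahead on this sample by answering all five questions correctly, one more than the Chat Completions API. Note that the Responses API solver in the script above was also given the web_search_preview and code_interpreter tools, while the Chat Completions solver had no tools, so this gap cannot be attributed to the API alone.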

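On the Research task, by contrast, both scored below 50%, which gives a sense of how hard the problems are. There was almost no difference between the APIs, with the Chat Completions API ahead by a very narrow margin.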
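
To make concrete how little five questions can say, the sketch below computes 95% Wilson score intervals for the two Olympiad results (4/5 and 5/5). This is an illustrative calculation added here, not part of the original evaluation; `wilson_interval` is a hypothetical helper and the choice of the Wilson interval is an assumption.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


chat_low, chat_high = wilson_interval(4, 5)   # Chat Completions API, Olympiad
resp_low, resp_high = wilson_interval(5, 5)   # Responses API, Olympiad
print(f"4/5: [{chat_low:.3f}, {chat_high:.3f}]")
print(f"5/5: [{resp_low:.3f}, {resp_high:.3f}]")
```

The two intervals overlap almost entirely, which is consistent with treating the 80% versus 100% difference as noise at this sample size.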

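Summary

This article gave an overview of FrontierScience, the AI4Science evaluation dataset released by OpenAI, along with the results of a very simple trial. Even though only 160 of the roughly 700 questions are available, I am very grateful that the evaluation data was made public.

FrontierScience-Research in particular looks very difficult even from this small sample, and it is designed to be scored in fine detail against a rubric instead of a simple correct/incorrect judgment. That design should make it a useful dataset for evaluating the research capabilities of AI agents in more detail and from more angles.

This time I only checked the behavior of the APIs and the evaluation pipeline, but I would like to look more closely at the problems themselves and at scoring trends per subtask.

Note: the verification scripts are quick implementations and may contain flaws or room for improvement. I would appreciate it if you could point out anything you notice.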
