Benchmark Generator

Pipeline Dashboard

Overview

LLM benchmark pipeline: generate cases, create rubrics, run models, and score responses.

Pipeline

1. Generate

Create clinical cases from stems using an LLM

2. Rubric

Generate scoring rubrics for each case

3. Run

Run subject models against each case

4. Score

Score each response against the rubric
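
The four steps above can be sketched as a small Python pipeline. This is only an illustration of how the stages might fit together, not the tool's actual implementation: the `Case` class, the function names, and the keyword-matching score are all hypothetical, and the LLM calls in steps 1 and 2 are replaced with simple placeholders so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical model interface: a callable that maps a prompt to a response.
Model = Callable[[str], str]


@dataclass
class Case:
    stem: str
    text: str = ""
    rubric: list[str] = field(default_factory=list)
    responses: dict[str, str] = field(default_factory=dict)
    scores: dict[str, float] = field(default_factory=dict)


def generate(stem: str) -> Case:
    # Step 1 placeholder: a real pipeline would ask an LLM to expand the stem.
    return Case(stem=stem, text=f"Patient presents with {stem}.")


def make_rubric(case: Case) -> None:
    # Step 2 placeholder: one rubric item per word of the stem.
    case.rubric = case.stem.lower().split()


def run(case: Case, models: dict[str, Model]) -> None:
    # Step 3: collect one response per subject model.
    for name, model in models.items():
        case.responses[name] = model(case.text)


def score(case: Case) -> None:
    # Step 4 placeholder: fraction of rubric items mentioned in the response.
    for name, response in case.responses.items():
        hits = sum(item in response.lower() for item in case.rubric)
        case.scores[name] = hits / len(case.rubric) if case.rubric else 0.0


def run_pipeline(stems: list[str], models: dict[str, Model]) -> list[Case]:
    cases = [generate(stem) for stem in stems]
    for case in cases:
        make_rubric(case)
        run(case, models)
        score(case)
    return cases
```

Calling `run_pipeline(["chest pain"], {"echo": lambda prompt: prompt})` returns a single `Case` whose `scores` dictionary has one entry per model. Keeping each step as a separate function mirrors the dashboard's layout, where each stage can be run and inspected independently.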

Analysis

Stats

Token usage, cost, and runtime per step

Analysis

Score distributions and model comparisons