LLM benchmark pipeline — generate cases, create rubrics, run models, score responses
Pipeline
Create clinical cases from stems using an LLM
Generate scoring rubrics for each case
Run subject models against each case
Score each response against the rubric
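The four steps above can be sketched as a chain of small functions. This is a minimal illustration, not the tool's actual implementation: the `LLM` callable type, the `Case` dataclass, the prompt wording, and the yes/no judging scheme are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical interface: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]


@dataclass
class Case:
    stem: str
    text: str
    rubric: list[str] = field(default_factory=list)


def create_case(stem: str, author: LLM) -> Case:
    """Step 1: expand a stem into a full clinical case."""
    return Case(stem=stem, text=author(f"Write a clinical case from this stem: {stem}"))


def create_rubric(case: Case, author: LLM) -> list[str]:
    """Step 2: ask for one scoring criterion per line and keep the non-empty lines."""
    raw = author(f"List scoring criteria, one per line, for: {case.text}")
    return [line.strip() for line in raw.splitlines() if line.strip()]


def run_subject(case: Case, subject: LLM) -> str:
    """Step 3: collect the subject model's response to the case."""
    return subject(case.text)


def score_response(response: str, rubric: list[str], judge: LLM) -> float:
    """Step 4: fraction of rubric criteria the judge marks as met (assumed scheme)."""
    if not rubric:
        return 0.0
    met = 0
    for criterion in rubric:
        verdict = judge(f"Criterion: {criterion}\nResponse: {response}\nAnswer yes or no.")
        if verdict.strip().lower().startswith("yes"):
            met += 1
    return met / len(rubric)


def run_pipeline(stems: list[str], author: LLM, subjects: dict[str, LLM], judge: LLM) -> list[dict]:
    """Run all four steps and return one result row per (stem, model) pair."""
    results = []
    for stem in stems:
        case = create_case(stem, author)
        case.rubric = create_rubric(case, author)
        for name, subject in subjects.items():
            response = run_subject(case, subject)
            score = score_response(response, case.rubric, judge)
            results.append({"stem": stem, "model": name, "score": score})
    return results
```

Passing the models in as plain callables keeps the sketch independent of any particular provider SDK and makes each step easy to test with stubs.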
Analysis
Token usage, cost, and runtime per step
Score distributions and model comparisons
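The two analysis views could be computed roughly as follows. This is a sketch under stated assumptions: the record fields (`step`, `input_tokens`, `output_tokens`, `seconds`, `model`, `score`) and the per-token prices are hypothetical placeholders, not values taken from the tool.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical prices in USD per 1,000 tokens; real rates depend on the provider.
PRICE_PER_1K = {"input": 0.001, "output": 0.002}


def step_cost(input_tokens: int, output_tokens: int, prices: dict = PRICE_PER_1K) -> float:
    """Cost of one call, given its token counts and per-1,000-token prices."""
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1000


def summarize_usage(records: list[dict]) -> dict:
    """Total tokens, cost, and runtime for each pipeline step."""
    totals = defaultdict(
        lambda: {"input_tokens": 0, "output_tokens": 0, "cost": 0.0, "seconds": 0.0}
    )
    for r in records:
        t = totals[r["step"]]
        t["input_tokens"] += r["input_tokens"]
        t["output_tokens"] += r["output_tokens"]
        t["cost"] += step_cost(r["input_tokens"], r["output_tokens"])
        t["seconds"] += r["seconds"]
    return dict(totals)


def summarize_scores(results: list[dict]) -> dict:
    """Per-model summary statistics for comparing score distributions."""
    by_model = defaultdict(list)
    for r in results:
        by_model[r["model"]].append(r["score"])
    return {
        model: {
            "n": len(scores),
            "mean": mean(scores),
            "median": median(scores),
            "min": min(scores),
            "max": max(scores),
        }
        for model, scores in by_model.items()
    }
```

Summary statistics are only a starting point for model comparison; with few cases per model, differences in the mean may not be meaningful.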