Score comparison across models
No score data exists for this case yet. Run the scoring step of the pipeline first:
uv run python scripts/04_score.py