Benchmarking¶

The benchmark suite runs every configured model through k-fold cross-validation, then reports each metric with bootstrap confidence intervals and pairwise statistical tests between models.

CLI Usage¶

# Run with default config
pitch-benchmark

# Run with custom config
pitch-benchmark --config configs/benchmark.yaml

Benchmark Configuration¶

# configs/benchmark.yaml
experiment_name: pitch_benchmark
models:
  - logistic_regression
  - random_forest
  - hmm
  - autogluon
  - lstm
  - cnn1d
  - transformer
n_folds: 5
metrics:
  - accuracy
  - balanced_accuracy
  - macro_f1
  - log_loss

Python API¶

from pitch_sequencing.config import DataConfig, BenchmarkConfig
from pitch_sequencing.evaluation.benchmark import BenchmarkRunner

data_cfg = DataConfig.from_yaml("configs/data.yaml")
bench_cfg = BenchmarkConfig(
    experiment_name="my_benchmark",
    models=["lstm", "random_forest", "transformer"],
    n_folds=5,
    metrics=["accuracy", "macro_f1"]
)

runner = BenchmarkRunner(bench_cfg, data_cfg, models_config_dir="configs/models")
results_df = runner.run()
print(results_df)

Output¶

The benchmark produces:

Per-fold metrics for each model
Bootstrap confidence intervals (95% CI by default, 1000 bootstrap samples)
Paired t-tests between model pairs with p-values
Cohen's d effect sizes for pairwise comparisons
MLflow experiment logs with parameters, metrics, and artifacts
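
As a rough illustration of the bootstrap confidence intervals listed above, the sketch below computes a percentile bootstrap interval for accuracy using only the standard library. The function name `bootstrap_ci`, its parameters, and the choice to resample per-prediction correctness flags are assumptions made for this example; the suite may resample a different quantity or use a different interval method. The data is made up.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_bootstrap=1000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for the mean of `values` (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_bootstrap))
    alpha = 1.0 - confidence
    # Pick the empirical alpha/2 and 1 - alpha/2 quantiles, clamped to valid indices.
    lo_idx = max(0, round((alpha / 2) * n_bootstrap))
    hi_idx = min(n_bootstrap - 1, round((1 - alpha / 2) * n_bootstrap) - 1)
    return means[lo_idx], means[hi_idx]


# 1 = correct prediction, 0 = incorrect (illustrative data only)
correct = [1] * 70 + [0] * 30
low, high = bootstrap_ci(correct)
print(f"accuracy = {mean(correct):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Fixing the seed makes the interval reproducible between runs; more bootstrap samples give a smoother estimate at higher cost.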

Metrics¶

Metric	Description
`accuracy`	Overall accuracy
`balanced_accuracy`	Average per-class recall
`macro_precision`	Macro-averaged precision
`macro_recall`	Macro-averaged recall
`macro_f1`	Macro-averaged F1 score
`log_loss`	Logarithmic loss (requires `predict_proba`)

Per-class precision, recall, and F1 are also computed for each pitch type (Fastball, Slider, Curveball, Changeup).
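
To make the definitions in the table concrete, here is a minimal standard-library sketch of per-class precision, recall, and F1, and of the macro averages built from them. The helpers `per_class_scores` and `macro_average` are hypothetical names for this example and are not part of `pitch_sequencing`; the labels and predictions are made up. In practice the same numbers are typically obtained from scikit-learn, for example `balanced_accuracy_score` and `f1_score(average="macro")`.

```python
def per_class_scores(y_true, y_pred, labels):
    """Precision, recall, and F1 for each label (hypothetical helper)."""
    scores = {}
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def macro_average(scores, key):
    """Unweighted mean of one per-class score across all labels."""
    return sum(s[key] for s in scores.values()) / len(scores)


labels = ["Fastball", "Slider", "Curveball", "Changeup"]
# Illustrative data only.
y_true = ["Fastball"] * 4 + ["Slider"] * 2 + ["Curveball"] * 2 + ["Changeup"] * 2
y_pred = ["Fastball", "Fastball", "Fastball", "Slider",
          "Slider", "Fastball",
          "Curveball", "Curveball",
          "Changeup", "Fastball"]

scores = per_class_scores(y_true, y_pred, labels)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Balanced accuracy is the unweighted mean of per-class recall.
balanced_accuracy = macro_average(scores, "recall")
macro_f1 = macro_average(scores, "f1")
print(accuracy, balanced_accuracy, macro_f1)
```

Because every class counts equally in the macro averages, a model that favors the majority class (here, Fastball) scores lower on `balanced_accuracy` than on `accuracy`.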

Statistical Comparisons¶

After k-fold CV, models are compared pairwise:

Paired t-test: Tests whether the mean difference between two models' per-fold scores differs significantly from zero
Cohen's d: Measures the size of that difference in standard-deviation units (conventional thresholds: small 0.2, medium 0.5, large 0.8)
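
The two comparisons above can be sketched with the standard library as follows. The function names, the fold scores, and the pooled-standard-deviation form of Cohen's d are assumptions for illustration; the suite may use a different variant (for example, dividing by the standard deviation of the differences). The p-value follows from the t statistic and a t distribution with k - 1 degrees of freedom, which `scipy.stats.ttest_rel` computes directly.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """t statistic and degrees of freedom for paired per-fold scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


def cohens_d(scores_a, scores_b):
    """Cohen's d using the pooled sample standard deviation (one common variant)."""
    pooled_sd = math.sqrt((stdev(scores_a) ** 2 + stdev(scores_b) ** 2) / 2)
    return (mean(scores_a) - mean(scores_b)) / pooled_sd


# Illustrative macro-F1 scores from 5 folds (made-up numbers).
lstm_scores = [0.62, 0.64, 0.61, 0.65, 0.63]
rf_scores = [0.58, 0.61, 0.59, 0.60, 0.60]

t_stat, dof = paired_t_statistic(lstm_scores, rf_scores)
d = cohens_d(lstm_scores, rf_scores)
print(f"t = {t_stat:.3f} (df = {dof}), Cohen's d = {d:.3f}")
```

Both models must be scored on the same folds for the pairing to be valid. Fold scores from cross-validation share training data, so the resulting p-values are best read as a rough guide.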

MLflow Integration¶

All benchmark runs are logged to MLflow under the configured `experiment_name`. See MLflow Tracking for details on viewing and comparing runs.