# Benchmarking
The benchmark suite runs all models through k-fold cross-validation and computes metrics with bootstrap confidence intervals and statistical tests.
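Conceptually, every model is fit and scored once per fold before any statistics are computed. A minimal sketch of that loop with scikit-learn on synthetic data (illustrative only; the `BenchmarkRunner` shown below does this internally for every configured model):

```python
# Illustrative k-fold scoring loop, not the package's internal code.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Synthetic 4-class data standing in for pitch sequences.
X, y = make_classification(n_samples=500, n_classes=4, n_informative=8, random_state=0)

fold_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    model = RandomForestClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    fold_scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print(f"per-fold accuracy: {np.round(fold_scores, 3)}")
```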
## CLI Usage

```bash
# Run with default config
pitch-benchmark

# Run with custom config
pitch-benchmark --config configs/benchmark.yaml
```
## Benchmark Configuration

```yaml
# configs/benchmark.yaml
experiment_name: pitch_benchmark

models:
  - logistic_regression
  - random_forest
  - hmm
  - autogluon
  - lstm
  - cnn1d
  - transformer

n_folds: 5

metrics:
  - accuracy
  - balanced_accuracy
  - macro_f1
  - log_loss
```
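The file is plain YAML, so it can be sanity-checked before launching a long run; a short sketch using PyYAML directly (this is not part of the package's API):

```python
# Quick sanity check of a benchmark config before a run.
import yaml

with open("configs/benchmark.yaml") as f:
    cfg = yaml.safe_load(f)

assert cfg["n_folds"] >= 2, "cross-validation needs at least two folds"
print(f"benchmarking {len(cfg['models'])} models on {len(cfg['metrics'])} metrics")
```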
## Python API

```python
from pitch_sequencing.config import DataConfig, BenchmarkConfig
from pitch_sequencing.evaluation.benchmark import BenchmarkRunner

data_cfg = DataConfig.from_yaml("configs/data.yaml")

bench_cfg = BenchmarkConfig(
    experiment_name="my_benchmark",
    models=["lstm", "random_forest", "transformer"],
    n_folds=5,
    metrics=["accuracy", "macro_f1"],
)

runner = BenchmarkRunner(bench_cfg, data_cfg, models_config_dir="configs/models")
results_df = runner.run()
print(results_df)
```
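Assuming `run()` returns a pandas DataFrame (as the `print` call suggests), the results can be post-processed like any other frame. The column names in this sketch are hypothetical; inspect `results_df.columns` for the real ones:

```python
# Post-processing sketch. "macro_f1_mean" is a hypothetical column name;
# check results_df.columns for what BenchmarkRunner actually emits.
summary = results_df.sort_values("macro_f1_mean", ascending=False)
summary.to_csv("benchmark_results.csv", index=False)
print(summary.head())
```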
## Output
The benchmark produces:
- Per-fold metrics for each model
- Bootstrap confidence intervals (95% CI by default, 1000 bootstrap samples; see the sketch after this list)
- Paired t-tests between model pairs with p-values
- Cohen's d effect sizes for pairwise comparisons
- MLflow experiment logs with parameters, metrics, and artifacts
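The percentile bootstrap behind the confidence intervals is straightforward to reproduce by hand; an illustrative sketch with NumPy on stand-in data (not the package's implementation):

```python
# Percentile bootstrap 95% CI for a model's accuracy, illustrative only.
import numpy as np

rng = np.random.default_rng(0)
correct = rng.random(2000) < 0.62          # stand-in per-pitch correctness flags

boot_means = np.array([
    rng.choice(correct, size=correct.size, replace=True).mean()
    for _ in range(1000)                   # 1000 resamples, matching the default
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"accuracy = {correct.mean():.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```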
## Metrics

| Metric | Description |
|---|---|
| `accuracy` | Overall accuracy |
| `balanced_accuracy` | Average per-class recall |
| `macro_precision` | Macro-averaged precision |
| `macro_recall` | Macro-averaged recall |
| `macro_f1` | Macro-averaged F1 score |
| `log_loss` | Logarithmic loss (requires `predict_proba`) |
Per-class precision, recall, and F1 are also computed for each pitch type (Fastball, Slider, Curveball, Changeup).
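These per-class numbers follow the standard definitions; a sketch with scikit-learn's `classification_report` (the labels here are made up):

```python
# Per-class precision/recall/F1 for the four pitch types, illustrative only.
from sklearn.metrics import classification_report

labels = ["Fastball", "Slider", "Curveball", "Changeup"]
y_true = ["Fastball", "Slider", "Fastball", "Curveball", "Changeup", "Fastball"]
y_pred = ["Fastball", "Fastball", "Fastball", "Curveball", "Slider", "Fastball"]

print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```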
## Statistical Comparisons
After k-fold CV, models are compared pairwise:
- **Paired t-test**: tests whether the difference in per-fold scores between two models is statistically significant
- **Cohen's d**: measures the effect size of that difference (small: 0.2, medium: 0.5, large: 0.8); both are sketched below
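Both statistics are easy to compute from the per-fold scores. A sketch with SciPy on made-up fold scores, using the paired-differences convention for Cohen's d (sometimes written d_z), which may differ from the package's exact formula:

```python
# Paired comparison of two models' per-fold scores, illustrative only.
import numpy as np
from scipy import stats

lstm_scores = np.array([0.61, 0.63, 0.60, 0.64, 0.62])   # made-up fold scores
rf_scores   = np.array([0.58, 0.60, 0.57, 0.61, 0.59])

t_stat, p_value = stats.ttest_rel(lstm_scores, rf_scores)  # paired t-test

diff = lstm_scores - rf_scores
cohens_d = diff.mean() / diff.std(ddof=1)  # d on the paired differences (d_z)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}, d = {cohens_d:.2f}")
```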
## MLflow Integration
All benchmark runs are logged to MLflow under the experiment name. See MLflow Tracking for details on viewing and comparing runs.
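Runs can also be queried programmatically; a sketch assuming a reasonably recent MLflow (one that supports the `experiment_names` argument) and the default experiment name from the config above:

```python
# Pull the benchmark's runs into a DataFrame for comparison.
# Assumes the default local tracking URI; adjust if you log elsewhere.
import mlflow

runs = mlflow.search_runs(experiment_names=["pitch_benchmark"])
print(runs[["run_id", "status", "start_time"]].head())
```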