Baseball Pitch Sequence Prediction¶
A professional-grade Python package for baseball pitch sequence prediction using 7 ML models, with benchmarking, ablation studies, and MLflow experiment tracking.
Overview¶
This project generates synthetic baseball pitch data with realistic pitcher archetypes, pitch sequence strategies, fatigue modeling, and game situation context — then trains and compares multiple models for predicting the next pitch type.
Models¶
| Model | Type | Description |
|---|---|---|
| Logistic Regression | Tabular | Baseline linear classifier |
| Random Forest | Tabular | Ensemble of decision trees |
| HMM | Sequence | Hidden Markov Model (hmmlearn) |
| AutoGluon | Tabular | AutoML with model ensembling |
| LSTM | Sequence | 2-layer LSTM neural network |
| 1D-CNN | Sequence | 3-layer convolutional network |
| Transformer | Sequence | Self-attention encoder |
All models share a unified interface (fit, predict, predict_proba) and are benchmarked via k-fold cross-validation with bootstrap confidence intervals and paired statistical tests.
Key Features¶
- Realistic synthetic data with pitcher archetypes, fatigue, and game context
- 7 prediction models spanning tabular and sequence architectures
- Comprehensive benchmarking with k-fold CV and bootstrap confidence intervals
- Ablation studies for feature importance, architecture, data scaling, and hyperparameters
- MLflow tracking for experiment management and comparison
- YAML-driven configuration for all settings
- CLI commands for data generation, training, benchmarking, and ablation
Quick Links¶
- Installation — Get up and running
- Quick Start — Generate data and train your first model
- Models Overview — Compare all 7 models
- CLI Reference — Command-line interface
- API Reference — Python API documentation