Quick Start
This guide walks you through generating data, training a model, and running the benchmark suite.
Setup
1. Generate Synthetic Data
The data generation step produces roughly 384K pitch rows with realistic pitcher archetypes, sequence strategies, fatigue modeling, and game context.
2. Train a Single Model
Available models: logistic_regression, random_forest, hmm, autogluon, lstm, cnn1d, transformer
3. Run the Full Benchmark
The full benchmark runs all 7 models through 5-fold cross-validation and reports accuracy, F1, and other metrics with bootstrap confidence intervals.
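As a rough illustration of what 5-fold cross-validation with bootstrap confidence intervals involves, the sketch below uses only the Python standard library. The names `kfold_indices` and `bootstrap_accuracy_ci` are hypothetical and are not part of `pitch_sequencing`; the package's own benchmark may split, resample, and aggregate differently.

```python
import random


def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k shuffled, non-overlapping folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Return (accuracy, lower, upper) using a percentile bootstrap."""
    rng = random.Random(seed)
    n = len(y_true)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    stats = []
    for _ in range(n_resamples):
        # Resample the per-example correctness flags with replacement.
        total = sum(correct[rng.randrange(n)] for _ in range(n))
        stats.append(total / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(correct) / n, lower, upper


# Toy usage with made-up labels (not benchmark results).
y_true = [1] * 100
y_pred = [1] * 80 + [0] * 20
acc, lower, upper = bootstrap_accuracy_ci(y_true, y_pred)
print(f"accuracy={acc:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

In a real run, each fold's held-out predictions would come from a model fitted on that fold's training indices, and the interval would be computed on those predictions.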
4. Run Ablation Studies
Ablation types: feature, architecture, data, hyperparam
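To make the idea of a feature ablation concrete, here is a minimal sketch: score the model with all features, then re-score with each feature removed and record the drop. Everything in it is an assumption for illustration, including the `feature_ablation` helper, the feature names, and the toy scoring function with made-up numbers; it does not reproduce how the package runs its ablation studies.

```python
def feature_ablation(evaluate, feature_names):
    """Return (baseline_score, {feature: score drop when that feature is removed})."""
    baseline = evaluate(feature_names)
    drops = {}
    for name in feature_names:
        reduced = [f for f in feature_names if f != name]
        drops[name] = baseline - evaluate(reduced)
    return baseline, drops


# Hypothetical stand-in for "train and score a model on these features".
# The weights are invented for the example, not measured results.
TOY_WEIGHTS = {"prev_pitch": 0.10, "count": 0.05, "fatigue": 0.02}


def toy_evaluate(features):
    return 0.30 + sum(TOY_WEIGHTS[f] for f in features)


baseline, drops = feature_ablation(toy_evaluate, list(TOY_WEIGHTS))
# Features with the largest drop matter most to the (toy) score.
for name, drop in sorted(drops.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: drop={drop:.3f}")
```

In practice `evaluate` would wrap a full train-and-validate cycle, so each ablated feature costs one additional training run.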
5. View Results in MLflow
Python API
You can also use the package programmatically:
from pitch_sequencing import load_pitch_data, get_model, MODEL_REGISTRY
# Load data
df = load_pitch_data("data/baseball_pitch_data.csv")
# List available models
print(list(MODEL_REGISTRY.keys()))
# Create and train a model
model = get_model("random_forest", {"n_estimators": 200, "max_depth": 15})
# ... prepare X_train, y_train, and X_test ...
model.fit(X_train, y_train)
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)
See the User Guide for a complete walkthrough of the data pipeline, training, and evaluation.