Data Pipeline

The data pipeline generates synthetic baseball pitch data, loads it, preprocesses features, and creates sequences for model training.

Data Generation

The simulator (pitch_sequencing.data.simulator) generates realistic pitch-by-pitch data.

Pitcher Archetypes

Each simulated pitcher is assigned one of four archetypes:

Archetype          Fastball %  Slider %  Curveball %  Changeup %  Fatigue Threshold
Power              55%         20%       10%          15%         95 pitches
Finesse            25%         15%       30%          30%         80 pitches
Slider Specialist  20%         40%       20%          20%         85 pitches
Balanced           30%         25%       25%          20%         90 pitches

Pitch selection blends the archetype mix with count-based probabilities, weighted 60% toward the archetype and 40% toward the count.
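
The 60/40 blend can be illustrated with a short sketch. It assumes the blend is a plain weighted average of two probability vectors, which the text does not confirm. The function name `blend_pitch_probs` and the count-based distribution are hypothetical, while the archetype mixes come from the table above.

```python
# Pitch order used for every probability vector in this sketch.
PITCHES = ["Fastball", "Slider", "Curveball", "Changeup"]

# Archetype pitch mixes, taken from the archetype table.
ARCHETYPE_MIX = {
    "power": [0.55, 0.20, 0.10, 0.15],
    "finesse": [0.25, 0.15, 0.30, 0.30],
    "slider_specialist": [0.20, 0.40, 0.20, 0.20],
    "balanced": [0.30, 0.25, 0.25, 0.20],
}


def blend_pitch_probs(archetype, count_probs, archetype_weight=0.6):
    """Weighted average of the archetype mix and count-based probabilities.

    Assumption: the simulator blends linearly; only the 60/40 weights
    are stated in the documentation.
    """
    mix = ARCHETYPE_MIX[archetype]
    blended = [
        archetype_weight * a + (1.0 - archetype_weight) * c
        for a, c in zip(mix, count_probs)
    ]
    total = sum(blended)  # renormalize to guard against rounding drift
    return [p / total for p in blended]


# Hypothetical count-based distribution (illustrative values only).
count_probs = [0.40, 0.30, 0.20, 0.10]
probs = blend_pitch_probs("power", count_probs)
```

With these inputs the fastball probability is 0.6 × 0.55 + 0.4 × 0.40 = 0.49.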

Sequence Strategies

Eight pitch patterns create learnable sequential dependencies by boosting the probability of the follow-up pitch by 15-25%. Examples include:

  • Fastball → Fastball → Changeup
  • Slider → Slider → Fastball
  • Curveball → Fastball
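
A minimal sketch of how such a boost might work, using only the three patterns listed above. It assumes the boost is added to the follow-up pitch's probability and the distribution is then renormalized. The simulator may instead apply a multiplicative boost, and `apply_sequence_boost` is a hypothetical name.

```python
PITCHES = ["Fastball", "Slider", "Curveball", "Changeup"]

# Three of the eight patterns: (preceding pitches) -> boosted follow-up pitch.
PATTERNS = {
    ("Fastball", "Fastball"): "Changeup",
    ("Slider", "Slider"): "Fastball",
    ("Curveball",): "Fastball",
}


def apply_sequence_boost(probs, history, boost=0.20):
    """Boost the follow-up pitch of the longest matching pattern.

    Assumption: the boost is additive and followed by renormalization.
    The default of 0.20 sits inside the documented 15-25% range.
    """
    boosted = dict(probs)
    # Check longer patterns first so the most specific match wins.
    for prefix in sorted(PATTERNS, key=len, reverse=True):
        n = len(prefix)
        if len(history) >= n and tuple(history[-n:]) == prefix:
            boosted[PATTERNS[prefix]] += boost
            break
    total = sum(boosted.values())
    return {pitch: p / total for pitch, p in boosted.items()}


base = {pitch: 0.25 for pitch in PITCHES}
after = apply_sequence_boost(base, ["Fastball", "Fastball"])
unchanged = apply_sequence_boost(base, ["Changeup"])
```

After two fastballs, the changeup ends up with the largest share of the distribution, which is the kind of dependency a sequence model can learn.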

Count-Dependent Outcomes

Hit rates vary from 5-6% in pitcher's counts (0-2, 1-2) to 19-23% in hitter's counts (3-0, 3-1).
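
As a rough illustration, the documented ranges can be stored in a lookup keyed by (balls, strikes). This sketch uses the midpoint of each range as the hit probability, which is an assumption. The helper names are hypothetical, and counts not mentioned above are deliberately left out rather than guessed.

```python
import random

# Hit-rate ranges stated in the text; other counts are not documented here.
HIT_RATE_RANGES = {
    (0, 2): (0.05, 0.06),  # pitcher's counts
    (1, 2): (0.05, 0.06),
    (3, 0): (0.19, 0.23),  # hitter's counts
    (3, 1): (0.19, 0.23),
}


def hit_probability(balls, strikes):
    """Midpoint of the documented range, or None for undocumented counts."""
    bounds = HIT_RATE_RANGES.get((balls, strikes))
    if bounds is None:
        return None
    low, high = bounds
    return (low + high) / 2.0


def simulate_is_hit(balls, strikes, rng):
    """Draw one Bernoulli outcome for a documented count."""
    p = hit_probability(balls, strikes)
    if p is None:
        raise ValueError(f"no documented hit rate for count {balls}-{strikes}")
    return rng.random() < p


rng = random.Random(42)
hits = sum(simulate_is_hit(3, 0, rng) for _ in range(20000))
observed_rate = hits / 20000
```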

Fatigue Modeling

Once a pitcher passes the archetype-specific fatigue threshold (80-95 pitches), they throw more fastballs and more balls.
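
The fastball shift could look something like the sketch below. The thresholds come from the archetype table; the size of the shift (`fastball_shift`), the strict "greater than" comparison, and the function name are assumptions. The increase in balls is not modeled here.

```python
# Fatigue thresholds (in pitches) from the archetype table.
FATIGUE_THRESHOLD = {
    "power": 95,
    "finesse": 80,
    "slider_specialist": 85,
    "balanced": 90,
}


def apply_fatigue(probs, archetype, pitch_count, fastball_shift=0.10):
    """Move probability mass toward fastballs once the threshold is passed.

    Assumption: fatigue mixes the current distribution with an
    all-fastball distribution; the real shift size is not documented.
    """
    if pitch_count <= FATIGUE_THRESHOLD[archetype]:
        return dict(probs)
    # Scale every pitch down by the shift, then give the freed mass to the fastball.
    shifted = {pitch: p * (1.0 - fastball_shift) for pitch, p in probs.items()}
    shifted["Fastball"] += fastball_shift
    return shifted


mix = {"Fastball": 0.25, "Slider": 0.15, "Curveball": 0.30, "Changeup": 0.30}
fresh = apply_fatigue(mix, "finesse", pitch_count=80)
tired = apply_fatigue(mix, "finesse", pitch_count=81)
```

Because the result is a mixture of two valid distributions, it still sums to 1 without a separate renormalization step.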

Game Situation

Runners on base and score differential affect pitch selection probabilities.

CLI Usage

pitch-generate --num-games 3000 --at-bats 35 --seed 42 --output-dir ./data

Python Usage

from pitch_sequencing.data.simulator import generate_dataset, generate_hmm_sequences

# Main dataset (~384K rows)
df = generate_dataset(num_games=3000, at_bats_per_game=35, seed=42)

# HMM sequences (2500 x 100)
hmm_df = generate_hmm_sequences(num_sequences=2500, sequence_length=100, seed=42)

Data Loading

from pitch_sequencing.data.loader import load_pitch_data, create_sequences

df = load_pitch_data("data/baseball_pitch_data.csv", filter_none_prev=True)

Dataset Columns

Column             Type  Description
Balls              int   Current ball count (0-3)
Strikes            int   Current strike count (0-2)
PitchType          str   Fastball, Slider, Curveball, Changeup
Outcome            str   ball, strike, hit
PitcherType        str   power, finesse, slider_specialist, balanced
PitchNumber        int   Cumulative per-game pitch count
AtBatNumber        int   At-bat number within game (1-35)
RunnersOn          int   Number of runners on base (0-3)
ScoreDiff          int   Score differential
PreviousPitchType  str   Previous pitch thrown

Note

PitchNumber has the same value for every pitch within an at-bat. It is a cumulative per-game pitch count, not a sequential per-pitch index.

Preprocessing

Encoding Categoricals

from pitch_sequencing.data.preprocessing import encode_categoricals

df, encoders = encode_categoricals(df, ["PitchType", "Outcome", "PitcherType", "PreviousPitchType"])
# Creates PitchType_enc, Outcome_enc, etc.
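
For intuition, the sketch below shows one common way to produce `_enc` columns: label encoding, where each distinct category maps to an integer. This is an assumption about what `encode_categoricals` does internally. The helpers here are hypothetical and work on a plain list of dicts rather than a DataFrame.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer, in sorted order for reproducibility."""
    return {category: index for index, category in enumerate(sorted(set(values)))}


def encode_column(rows, column, suffix="_enc"):
    """Add an integer-coded copy of `column` to every row and return the mapping."""
    encoder = fit_label_encoder(row[column] for row in rows)
    for row in rows:
        row[column + suffix] = encoder[row[column]]
    return encoder


rows = [
    {"PitchType": "Slider"},
    {"PitchType": "Fastball"},
    {"PitchType": "Slider"},
]
encoder = encode_column(rows, "PitchType")
```

Keeping the returned mapping makes it possible to decode model predictions back to pitch names later.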

Normalizing Numericals

from pitch_sequencing.data.preprocessing import normalize_numericals

df, stats = normalize_numericals(df, ["PitchNumber", "AtBatNumber", "RunnersOn", "ScoreDiff"])
# Saves PitchNumber_raw, AtBatNumber_raw for boundary detection
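
The sketch below illustrates the idea of normalizing a column while keeping a `_raw` copy. It assumes z-score scaling (subtract the mean, divide by the standard deviation), since the scaling method is not stated above. `normalize_column` is a hypothetical helper operating on a list of dicts.

```python
import statistics


def normalize_column(rows, column, keep_raw=True):
    """Z-score one numeric column in place and return its statistics.

    Assumption: z-score scaling; the package may use another method.
    """
    values = [row[column] for row in rows]
    mean = statistics.fmean(values)
    # Fall back to 1.0 so a constant column does not divide by zero.
    std = statistics.pstdev(values) or 1.0
    for row in rows:
        if keep_raw:
            # Preserve the original value, e.g. for game-boundary detection.
            row[column + "_raw"] = row[column]
        row[column] = (row[column] - mean) / std
    return {"mean": mean, "std": std}


rows = [{"AtBatNumber": 1}, {"AtBatNumber": 2}, {"AtBatNumber": 3}]
stats = normalize_column(rows, "AtBatNumber")
```

Returning the mean and standard deviation lets the same scaling be applied to new data at inference time.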

Creating Sequences

For sequence models (LSTM, CNN1D, Transformer):

from pitch_sequencing.data.loader import create_sequences

X_seq, y_seq, game_starts = create_sequences(
    df, window_size=8,
    feature_cols=["Balls", "Strikes", "PitchType_enc", ...],
    target_col="PitchType_enc"
)
# X_seq shape: (n_samples, 8, n_features)
# y_seq shape: (n_samples,)

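Game boundaries are detected from AtBatNumber resets, where the value drops from ~35 back to 1 at the start of a new game. This is why normalization keeps the unnormalized AtBatNumber_raw column.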
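
A simplified sketch of boundary-aware windowing follows. It assumes that each window predicts the pitch immediately after it and that windows never span two games. Neither detail is confirmed above, and both helper names are hypothetical.

```python
def find_game_starts(at_bat_numbers):
    """Row indices where a new game begins, i.e. where the at-bat number drops."""
    starts = [0] if at_bat_numbers else []
    for i in range(1, len(at_bat_numbers)):
        if at_bat_numbers[i] < at_bat_numbers[i - 1]:
            starts.append(i)
    return starts


def sliding_windows(features, targets, at_bat_numbers, window_size):
    """Build (window, next-target) pairs that stay inside a single game."""
    starts = find_game_starts(at_bat_numbers)
    ends = starts[1:] + [len(features)]
    X, y = [], []
    for start, end in zip(starts, ends):
        # The first usable target in a game is the row after one full window.
        for t in range(start + window_size, end):
            X.append(features[t - window_size:t])
            y.append(targets[t])
    return X, y, starts


# Two tiny games: raw at-bat numbers reset from 3 back to 1 at row 5.
at_bats = [1, 1, 2, 2, 3, 1, 1, 2, 2]
features = [[i] for i in range(9)]
targets = list(range(9))
X, y, game_starts = sliding_windows(features, targets, at_bats, window_size=2)
```

The first `window_size` rows of each game produce no sample, so short games contribute fewer training examples.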

Creating Train/Test Splits

from pitch_sequencing.data.preprocessing import create_splits

# X and y are the feature array and labels, e.g. X_seq and y_seq from create_sequences
folds = create_splits(X, y, n_folds=5, stratify=True, random_state=42)
for train_idx, test_idx in folds:
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
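
To show what stratification means here, the sketch below builds stratified folds by hand: indices of each class are shuffled and dealt round-robin across folds so every test set has a similar class mix. This is not the `create_splits` implementation, and `stratified_folds` is a hypothetical name.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, random_state=42):
    """Return (train_idx, test_idx) pairs with similar class balance per fold."""
    rng = random.Random(random_state)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    test_sets = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal this class's indices round-robin across the folds.
        for position, index in enumerate(indices):
            test_sets[position % n_folds].append(index)

    all_indices = set(range(len(labels)))
    folds = []
    for test in test_sets:
        test_idx = sorted(test)
        train_idx = sorted(all_indices - set(test))
        folds.append((train_idx, test_idx))
    return folds


labels = [0] * 10 + [1] * 5
folds = stratified_folds(labels, n_folds=5)
```

Each sample appears in exactly one test set, so every row is used for evaluation once across the five folds.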