Data Loader
Data loading and sequence creation utilities.
pitch_sequencing.data.loader
Data loading utilities for pitch sequence prediction.
create_sequences(df, window_size=8, feature_cols=None, target_col='PitchType_enc')
Create sliding-window sequences respecting game boundaries.
Game boundaries are detected via PitchNumber resets (the raw column must be present or reconstructable). The function expects that categorical columns have already been encoded (e.g. PitchType_enc, PitcherType_enc).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | DataFrame with encoded features. | required |
| `window_size` | `int` | Number of previous timesteps per sample. | `8` |
| `feature_cols` | `Optional[List[str]]` | Columns to include as features in each timestep. | `None` |
| `target_col` | `str` | Column to predict. | `'PitchType_enc'` |
Returns:
| Type | Description |
|---|---|
| `ndarray` | (X, y, game_starts) where X has shape (n_samples, window_size, n_features), |
| `ndarray` | y has shape (n_samples,), and game_starts lists the indices where new games start. |
Source code in src/pitch_sequencing/data/loader.py
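The windowing described above can be illustrated with a small NumPy-only sketch. It is not the package's implementation: the helper name `sliding_windows` is hypothetical, and it assumes that a game boundary is any row where PitchNumber fails to increase, that each sample predicts the row immediately after its window, and that `game_starts` holds row indices into the input.

```python
import numpy as np


def sliding_windows(features, targets, pitch_numbers, window_size=8):
    """Hypothetical sketch of game-aware windowing (not the library code)."""
    features = np.asarray(features, dtype=float)
    targets = np.asarray(targets)
    pitch_numbers = np.asarray(pitch_numbers)

    # Assumption: a new game starts at row 0 and wherever PitchNumber resets.
    resets = np.flatnonzero(np.diff(pitch_numbers) <= 0) + 1
    game_starts = np.concatenate(([0], resets))
    game_ends = np.concatenate((resets, [len(pitch_numbers)]))

    X, y = [], []
    for start, end in zip(game_starts, game_ends):
        # The window covers rows [i - window_size, i) and predicts row i,
        # so it never reaches back into the previous game.
        for i in range(start + window_size, end):
            X.append(features[i - window_size:i])
            y.append(targets[i])

    n_features = features.shape[1]
    X = np.asarray(X, dtype=float).reshape(-1, window_size, n_features)
    return X, np.asarray(y), game_starts


# Two toy games of 5 and 4 pitches, two features per pitch.
pitch_numbers = [1, 2, 3, 4, 5, 1, 2, 3, 4]
features = np.arange(18).reshape(9, 2)
targets = [0, 1, 2, 0, 1, 2, 0, 1, 2]

X, y, game_starts = sliding_windows(features, targets, pitch_numbers, window_size=3)
print(X.shape, y.tolist(), game_starts.tolist())
```

In this toy input the first game yields two samples and the second yields one, because a game with fewer than `window_size + 1` pitches cannot produce a full window plus a target.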
load_hmm_sequences(path)
Load the HMM synthetic pitch sequences dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to synthetic_pitch_sequences.csv. | required |
Returns:
| Type | Description |
|---|---|
| `ndarray` | (flat_sequences, encoder) where flat_sequences is shape (n_total, 1) |
| `LabelEncoder` | of encoded pitch types, and encoder can invert labels. |
Source code in src/pitch_sequencing/data/loader.py
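The shape and round-trip behaviour of the return value can be sketched as follows. This assumes that `LabelEncoder` refers to scikit-learn's `sklearn.preprocessing.LabelEncoder`; the pitch labels are made up for illustration and are not taken from the dataset, and the snippet does not read the CSV.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up pitch labels standing in for the pitch-type column of the CSV.
pitch_types = ["FF", "SL", "CH", "FF", "CH"]

encoder = LabelEncoder()
# fit_transform returns a 1-D integer array; reshape it into one column.
flat_sequences = encoder.fit_transform(pitch_types).reshape(-1, 1)

# inverse_transform expects a 1-D array, so flatten the column first.
decoded = encoder.inverse_transform(flat_sequences.ravel())

print(flat_sequences.shape)
print(list(decoded))
```

A single-column 2-D array is a layout that many scikit-learn-style estimators expect for one categorical feature, which may be why the loader returns shape (n_total, 1) rather than a flat vector.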
load_pitch_data(path, filter_none_prev=True)
Load the main pitch dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to baseball_pitch_data.csv. | required |
| `filter_none_prev` | `bool` | If True, drop rows where PreviousPitchType is 'None'. | `True` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | DataFrame with pitch data. |
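The effect of `filter_none_prev` can be sketched with pandas on an in-memory CSV. The helper `load_filtered` is hypothetical, as are the sample rows and the `PitchType` column name; only `PitchNumber` and `PreviousPitchType` are named in this page. Depending on the pandas version, the literal text `None` may be parsed as a missing value, so the sketch treats both cases as "no previous pitch". Resetting the index after filtering is also an assumption.

```python
import io

import pandas as pd

# Made-up rows; the first pitch of each game has no previous pitch.
csv_text = """PitchNumber,PitchType,PreviousPitchType
1,Fastball,None
2,Slider,Fastball
3,Curveball,Slider
1,Fastball,None
2,Changeup,Fastball
"""


def load_filtered(source, filter_none_prev=True):
    """Hypothetical stand-in for the loader, reading from any file-like source."""
    df = pd.read_csv(source)
    if filter_none_prev:
        prev = df["PreviousPitchType"]
        # Cover both parses: the string 'None' and a missing value.
        is_none = prev.isna() | (prev == "None")
        df = df[~is_none].reset_index(drop=True)
    return df


df = load_filtered(io.StringIO(csv_text))
print(df)
```

With the filter enabled, the two first-pitch rows are dropped and three rows remain; passing `filter_none_prev=False` keeps all five.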