Data Preprocessing
Encoding, normalization, and train/test splitting.
pitch_sequencing.data.preprocessing
Preprocessing utilities for encoding, normalization, and splitting.
create_splits(X, y, test_size=0.2, n_folds=5, stratify=True, random_state=42, temporal=False)
Create train/test index splits: a single split when n_folds is 1, otherwise k-fold cross-validation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `X` | `ndarray` | Feature array. | required |
| `y` | `ndarray` | Target array. | required |
| `test_size` | `float` | Fraction for test set (single split mode). | `0.2` |
| `n_folds` | `int` | Number of CV folds. If 1, performs a single split. | `5` |
| `stratify` | `bool` | Whether to stratify by y. | `True` |
| `random_state` | `int` | Random seed. | `42` |
| `temporal` | `bool` | If True, use temporal (ordered) splits instead of random. | `False` |
Returns:
| Type | Description |
|---|---|
| `List[Tuple[ndarray, ndarray]]` | List of (train_indices, test_indices) tuples. |
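The return shape described above can be illustrated with a small stand-in. This is a sketch only, not the library's implementation: `sketch_splits` is a hypothetical name, stratification is left out, and the forward-chaining behaviour for temporal k-fold is an assumption about what "temporal (ordered) splits" might mean.

```python
import numpy as np


def sketch_splits(n_samples, test_size=0.2, n_folds=5, random_state=42, temporal=False):
    """Hypothetical stand-in for create_splits (no stratification)."""
    order = np.arange(n_samples)
    if not temporal:
        # Random mode: shuffle the row order reproducibly.
        order = np.random.default_rng(random_state).permutation(n_samples)

    if n_folds == 1:
        # Single split: the last test_size fraction of `order` is the test set.
        cut = n_samples - int(round(n_samples * test_size))
        return [(order[:cut], order[cut:])]

    blocks = np.array_split(order, n_folds)
    if temporal:
        # Assumption: train on all earlier blocks, test on the next block.
        return [(np.concatenate(blocks[:k]), blocks[k]) for k in range(1, n_folds)]
    # Random k-fold: each block is the test set once.
    return [
        (np.concatenate(blocks[:k] + blocks[k + 1:]), blocks[k])
        for k in range(n_folds)
    ]


X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Single ordered split: the final 20% of rows are held out.
splits = sketch_splits(len(X), n_folds=1, temporal=True)
train_idx, test_idx = splits[0]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Random 5-fold CV: iterate over the (train, test) index pairs.
cv_splits = sketch_splits(len(X), n_folds=5)
```

Because the function returns index arrays rather than data, the same split list can be reused to slice features, targets, and any auxiliary arrays consistently.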
Source code in src/pitch_sequencing/data/preprocessing.py
encode_categoricals(df, columns, encoders=None)
Label-encode categorical columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input DataFrame. | required |
| `columns` | `List[str]` | Columns to encode. | required |
| `encoders` | `Optional[Dict[str, LabelEncoder]]` | Pre-fitted encoders to reuse (for test data). | `None` |
Returns:
| Type | Description |
|---|---|
| `Tuple[DataFrame, Dict[str, LabelEncoder]]` | (df_encoded, encoders_dict): DataFrame with new `*_enc` columns, and the fitted encoders. |
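The fit-on-train, reuse-on-test pattern implied by the `encoders` parameter can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: `sketch_encode` is a hypothetical stand-in, and the `pitch_type` column is invented for the example. It uses scikit-learn's `LabelEncoder`, which matches the documented encoder type.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def sketch_encode(df, columns, encoders=None):
    """Hypothetical stand-in for encode_categoricals."""
    df = df.copy()
    fit = encoders is None
    encoders = {} if fit else encoders
    for col in columns:
        if fit:
            # Training data: learn the label -> integer mapping.
            encoders[col] = LabelEncoder().fit(df[col])
        # Write the integer codes to a new <column>_enc column.
        df[f"{col}_enc"] = encoders[col].transform(df[col])
    return df, encoders


train = pd.DataFrame({"pitch_type": ["FF", "SL", "CH", "FF"]})  # hypothetical column
test = pd.DataFrame({"pitch_type": ["SL", "FF"]})

train_enc, encoders = sketch_encode(train, ["pitch_type"])
test_enc, _ = sketch_encode(test, ["pitch_type"], encoders=encoders)
```

Reusing the fitted encoders keeps integer codes consistent between train and test data. Note that `LabelEncoder.transform` raises a `ValueError` for labels it did not see during fitting, so categories that appear only in the test data need handling before this step.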
Source code in src/pitch_sequencing/data/preprocessing.py
normalize_numericals(df, columns, stats=None)
Z-score normalize numerical columns.
Also stores the raw PitchNumber (before normalization) as PitchNumber_raw so that game-boundary detection still works downstream.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input DataFrame. | required |
| `columns` | `List[str]` | Columns to normalize. | required |
| `stats` | `Optional[Dict[str, Tuple[float, float]]]` | Pre-computed (mean, std) per column (for test data). | `None` |
Returns:
| Type | Description |
|---|---|
| `Tuple[DataFrame, Dict[str, Tuple[float, float]]]` | (df_normalized, stats_dict) |
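Z-score normalization with reusable statistics might look roughly like the sketch below. It is not the library's implementation: `sketch_normalize` is a hypothetical stand-in, the `velo` column is invented for the example, and the population standard deviation (`ddof=0`), the zero-variance guard, and copying `PitchNumber` only when it is among the normalized columns are all assumptions.

```python
import pandas as pd


def sketch_normalize(df, columns, stats=None):
    """Hypothetical stand-in for normalize_numericals."""
    df = df.copy()
    if "PitchNumber" in columns:
        # Keep the unscaled counter so downstream code can still use it.
        df["PitchNumber_raw"] = df["PitchNumber"]

    fit = stats is None
    stats = {} if fit else stats
    for col in columns:
        if fit:
            # Training data: compute (mean, std) for this column.
            stats[col] = (float(df[col].mean()), float(df[col].std(ddof=0)))
        mean, std = stats[col]
        # Guard against constant columns (assumption, avoids division by zero).
        df[col] = (df[col] - mean) / (std if std > 0 else 1.0)
    return df, stats


train = pd.DataFrame({"PitchNumber": [1, 2, 3, 4], "velo": [90.0, 92.0, 94.0, 96.0]})
test = pd.DataFrame({"PitchNumber": [5], "velo": [93.0]})

train_norm, stats = sketch_normalize(train, ["PitchNumber", "velo"])
test_norm, _ = sketch_normalize(test, ["PitchNumber", "velo"], stats=stats)
```

Passing the training `stats` when transforming test data means the test set is scaled with the training mean and standard deviation, so no information from the test set leaks into the normalization.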