Metrics¶
Evaluation metrics, bootstrap confidence intervals, and statistical tests.
pitch_sequencing.evaluation.metrics¶
Metrics computation, bootstrap confidence intervals, and statistical tests.
bootstrap_confidence_interval(scores, confidence=0.95, n_bootstrap=1000, seed=42)¶
Compute bootstrap confidence interval for a list of scores.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `scores` | `List[float]` | List of metric values (e.g. per-fold accuracies). | *required* |
| `confidence` | `float` | Confidence level. | `0.95` |
| `n_bootstrap` | `int` | Number of bootstrap samples. | `1000` |
| `seed` | `int` | Random seed. | `42` |

Returns:

| Type | Description |
|---|---|
| `Tuple[float, float, float]` | `(mean, ci_low, ci_high)` |
Source code in src/pitch_sequencing/evaluation/metrics.py
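To illustrate the kind of computation this function describes, here is a minimal percentile-bootstrap sketch using only the standard library. The actual implementation may differ (e.g. a BCa interval or NumPy-based resampling); `bootstrap_ci_sketch` is a hypothetical stand-in, not the library's code.

```python
import random
import statistics

def bootstrap_ci_sketch(scores, confidence=0.95, n_bootstrap=1000, seed=42):
    # Illustrative percentile bootstrap: resample with replacement,
    # take the mean of each resample, and read off the percentiles.
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_bootstrap)
    )
    alpha = (1.0 - confidence) / 2.0
    ci_low = means[int(alpha * n_bootstrap)]
    ci_high = means[min(int((1.0 - alpha) * n_bootstrap), n_bootstrap - 1)]
    return statistics.fmean(scores), ci_low, ci_high

# Per-fold accuracies from a 5-fold cross-validation run (made-up values).
mean, lo, hi = bootstrap_ci_sketch([0.71, 0.74, 0.69, 0.73, 0.72])
```

Because the fold scores are tightly clustered, the interval here is narrow; with few folds, the bootstrap interval mainly reflects fold-to-fold spread.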
compute_effect_size(scores_a, scores_b)¶
Compute Cohen's d effect size between two score distributions.
Source code in src/pitch_sequencing/evaluation/metrics.py
compute_metrics(y_true, y_pred, y_proba=None, labels=None)¶
Compute a comprehensive set of classification metrics.
Returns dict with: accuracy, balanced_accuracy, macro_precision, macro_recall, macro_f1, per_class_precision, per_class_recall, per_class_f1, confusion_matrix, and optionally log_loss.
Source code in src/pitch_sequencing/evaluation/metrics.py
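A stripped-down sketch of a few of the listed metrics (accuracy and the macro-averaged precision/recall/F1), using only the standard library. The real function likely wraps scikit-learn and returns the full dict described above; `classification_metrics_sketch` is a hypothetical stand-in.

```python
from collections import Counter

def classification_metrics_sketch(y_true, y_pred, labels):
    # Accuracy plus macro-averaged precision/recall/F1, computed by hand.
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # true positives
    pred = Counter(y_pred)    # predicted count per label
    actual = Counter(y_true)  # actual count per label
    prec = [tp[l] / pred[l] if pred[l] else 0.0 for l in labels]
    rec = [tp[l] / actual[l] if actual[l] else 0.0 for l in labels]
    f1 = [2 * p * r / (p + r) if (p + r) else 0.0 for p, r in zip(prec, rec)]
    return {
        "accuracy": acc,
        "macro_precision": sum(prec) / len(labels),
        "macro_recall": sum(rec) / len(labels),
        "macro_f1": sum(f1) / len(labels),
    }

# Toy pitch-type labels (hypothetical): fastball vs curveball.
m = classification_metrics_sketch(
    y_true=["FB", "FB", "CB", "CB"],
    y_pred=["FB", "CB", "CB", "CB"],
    labels=["FB", "CB"],
)
```

Macro averaging weights every class equally, which matters for imbalanced pitch-type distributions where a majority-class model can score a high plain accuracy.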
paired_t_test(scores_a, scores_b)¶
Paired t-test between two sets of fold scores.
Returns:

| Type | Description |
|---|---|
| `Tuple[float, float]` | `(t_statistic, p_value)` |
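The paired t-test operates on the per-fold score differences. A standard-library sketch of the t statistic only (the p-value would come from the t distribution's CDF, e.g. `scipy.stats.ttest_rel`, which the library presumably uses):

```python
import math
import statistics

def paired_t_statistic_sketch(scores_a, scores_b):
    # t = mean(d) / (stdev(d) / sqrt(n)), where d are paired differences.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Fold scores for two models on the same folds (made-up values).
t = paired_t_statistic_sketch([0.72, 0.74, 0.71], [0.70, 0.71, 0.69])
```

Pairing by fold removes fold-to-fold variance from the comparison, which is why the paired test is the right choice for cross-validated scores of two models on identical splits.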