Metrics¶
Calibration and ranking metrics for evaluating model performance.
Calibration Error Metrics¶
ece¶
Expected Calibration Error metrics.
adaptive_ece(scores, labels, n_bins=10)¶
Compute adaptive Expected Calibration Error with equal-mass bins.
Unlike standard ECE which uses equal-width bins, adaptive ECE uses quantile-based binning so each bin has approximately the same number of samples. This is more robust when scores are not uniformly distributed.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| scores | Tensor | Predicted scores in (0, 1), shape (n_samples,) | required |
| labels | Tensor | Binary labels, shape (n_samples,) | required |
| n_bins | int | Number of bins for bucketing scores | 10 |

Returns:

| Type | Description |
|---|---|
| Tensor | Adaptive ECE value as a scalar tensor |
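To illustrate what equal-mass binning does, here is a minimal NumPy sketch of the same computation; the library itself operates on Tensors, and the function name here is hypothetical:

```python
import numpy as np

def adaptive_ece_sketch(scores, labels, n_bins=10):
    """Quantile-binned ECE: each bin holds roughly the same number of samples."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Equal-mass bin edges come from the empirical quantiles of the scores.
    edges = np.quantile(scores, np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, scores, side="right") - 1, 0, n_bins - 1)
    total, n = 0.0, len(scores)
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            conf = scores[mask].mean()   # mean predicted score in the bin
            acc = labels[mask].mean()    # empirical positive rate in the bin
            total += mask.sum() / n * abs(conf - acc)
    return total
```

With a perfectly calibrated score set (each score equals the positive rate of its group), the sketch returns zero regardless of how the scores are distributed.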
adaptive_ece_at_k(scores, labels, k, n_bins=10)¶
Compute adaptive Expected Calibration Error at top-k.
Only considers the top-k items by score when computing adaptive ECE. Uses quantile-based binning for equal-mass bins.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| scores | Tensor | Predicted scores in (0, 1), shape (n_samples,) | required |
| labels | Tensor | Binary labels, shape (n_samples,) | required |
| k | int | Number of top items to consider | required |
| n_bins | int | Number of bins for bucketing scores | 10 |

Returns:

| Type | Description |
|---|---|
| Tensor | Adaptive ECE@k value as a scalar tensor |
calibration_error_per_bin(scores, labels, n_bins=10)¶
Compute per-bin calibration statistics.
Useful for building reliability diagrams.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| scores | Tensor | Predicted scores in (0, 1), shape (n_samples,) | required |
| labels | Tensor | Binary labels, shape (n_samples,) | required |
| n_bins | int | Number of bins | 10 |

Returns:

| Type | Description |
|---|---|
| Tuple[Tensor, Tensor, Tensor, Tensor] | Tuple of (bin_centers, bin_accuracies, bin_confidences, bin_counts) |
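The per-bin statistics are exactly what a reliability diagram plots. A NumPy sketch of the computation, assuming equal-width bins over (0, 1) (the function name is hypothetical; the library returns Tensors):

```python
import numpy as np

def per_bin_stats_sketch(scores, labels, n_bins=10):
    """Per-bin (center, accuracy, confidence, count) with equal-width bins."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    # Scale each score to a bin index, keeping 1.0 in the last bin.
    bins = np.minimum((scores * n_bins).astype(int), n_bins - 1)
    accs = np.zeros(n_bins)
    confs = np.zeros(n_bins)
    counts = np.zeros(n_bins, dtype=int)
    for b in range(n_bins):
        mask = bins == b
        counts[b] = mask.sum()
        if counts[b] > 0:
            accs[b] = labels[mask].mean()   # empirical accuracy in the bin
            confs[b] = scores[mask].mean()  # mean predicted confidence in the bin
    return centers, accs, confs, counts
```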
ece(scores, labels, n_bins=10)¶
Compute Expected Calibration Error.
ECE measures the average absolute difference between predicted confidence and actual accuracy across bins.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| scores | Tensor | Predicted scores in (0, 1), shape (n_samples,) | required |
| labels | Tensor | Binary labels, shape (n_samples,) | required |
| n_bins | int | Number of bins for bucketing scores | 10 |

Returns:

| Type | Description |
|---|---|
| Tensor | ECE value as a scalar tensor |
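The "average absolute difference across bins" can be written out in a few lines. A NumPy sketch of the standard equal-width-bin computation (the function name is hypothetical; the library operates on Tensors):

```python
import numpy as np

def ece_sketch(scores, labels, n_bins=10):
    """Equal-width-bin ECE: count-weighted mean of |confidence - accuracy|."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    bins = np.minimum((scores * n_bins).astype(int), n_bins - 1)
    total, n = 0.0, len(scores)
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(scores[mask].mean() - labels[mask].mean())
            total += mask.sum() / n * gap   # weight the bin by its share of samples
    return total
```

For example, a model that always predicts 0.9 on items that are all positive has ECE 0.1.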
ece_at_k(scores, labels, k, n_bins=10)¶
Compute Expected Calibration Error at top-k.
Only considers the top-k items by score when computing ECE. This measures calibration where ranking decisions actually happen.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| scores | Tensor | Predicted scores in (0, 1), shape (n_samples,) | required |
| labels | Tensor | Binary labels, shape (n_samples,) | required |
| k | int | Number of top items to consider | required |
| n_bins | int | Number of bins for bucketing scores | 10 |

Returns:

| Type | Description |
|---|---|
| Tensor | ECE@k value as a scalar tensor |
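Conceptually this is just plain ECE after filtering to the top-k items. A NumPy sketch under that assumption (the function name is hypothetical):

```python
import numpy as np

def ece_at_k_sketch(scores, labels, k, n_bins=10):
    """ECE restricted to the k highest-scoring items."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    top = np.argsort(scores)[::-1][:k]      # indices of the top-k scores
    scores, labels = scores[top], labels[top]
    bins = np.minimum((scores * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            total += mask.sum() / k * abs(scores[mask].mean() - labels[mask].mean())
    return total
```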
mce(scores, labels, n_bins=10)¶
Compute Maximum Calibration Error.
MCE measures the worst-case calibration error across all bins, unlike ECE which computes a weighted average.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| scores | Tensor | Predicted scores in (0, 1), shape (n_samples,) | required |
| labels | Tensor | Binary labels, shape (n_samples,) | required |
| n_bins | int | Number of bins for bucketing scores | 10 |

Returns:

| Type | Description |
|---|---|
| Tensor | MCE value as a scalar tensor |
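The only change relative to ECE is replacing the count-weighted sum with a maximum over non-empty bins. A NumPy sketch (function name hypothetical):

```python
import numpy as np

def mce_sketch(scores, labels, n_bins=10):
    """Worst-case |confidence - accuracy| over non-empty equal-width bins."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    bins = np.minimum((scores * n_bins).astype(int), n_bins - 1)
    worst = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            worst = max(worst, abs(scores[mask].mean() - labels[mask].mean()))
    return worst
```

On scores [0.9]*5 + [0.3]*5 with labels [1]*5 + [0]*5, the two bins have gaps 0.1 and 0.3; MCE reports the worst (0.3), whereas ECE would average them to 0.2.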
mce_at_k(scores, labels, k, n_bins=10)¶
Compute Maximum Calibration Error at top-k.
Only considers the top-k items by score when computing MCE.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| scores | Tensor | Predicted scores in (0, 1), shape (n_samples,) | required |
| labels | Tensor | Binary labels, shape (n_samples,) | required |
| k | int | Number of top items to consider | required |
| n_bins | int | Number of bins for bucketing scores | 10 |

Returns:

| Type | Description |
|---|---|
| Tensor | MCE@k value as a scalar tensor |
Ranking Metrics¶
precision_at_k¶
Compute precision at k.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| scores | Tensor | Predicted scores, shape (n_samples,) | required |
| labels | Tensor | Binary relevance labels, shape (n_samples,) | required |
| k | int | Number of top items to consider | required |

Returns:

| Type | Description |
|---|---|
| Tensor | Precision@k as a scalar tensor |
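A NumPy sketch of the computation, for reference (the function name is hypothetical; the library returns a Tensor):

```python
import numpy as np

def precision_at_k_sketch(scores, labels, k):
    """Fraction of relevant items among the k highest-scoring items."""
    top = np.argsort(np.asarray(scores, dtype=float))[::-1][:k]
    return np.asarray(labels, dtype=float)[top].mean()
```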
expected_precision_at_k¶
Compute expected precision at k from calibrated scores.
If scores are well-calibrated, this should match actual precision@k.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| scores | Tensor | Calibrated scores (probabilities), shape (n_samples,) | required |
| k | int | Number of top items to consider | required |

Returns:

| Type | Description |
|---|---|
| Tensor | Expected precision@k as a scalar tensor |
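Since a calibrated score is the probability that the item is relevant, the expected number of positives in the top-k is the sum of the top-k scores, so expected precision is their mean. A NumPy sketch (function name hypothetical):

```python
import numpy as np

def expected_precision_at_k_sketch(scores, k):
    """Mean of the k highest calibrated scores = expected share of positives."""
    return np.sort(np.asarray(scores, dtype=float))[::-1][:k].mean()
```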
calibration_gap_at_k¶
Compute the calibration gap at top-k.
This is the difference between expected precision (from scores) and actual precision (from labels) at top-k.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| scores | Tensor | Calibrated scores, shape (n_samples,) | required |
| labels | Tensor | Binary relevance labels, shape (n_samples,) | required |
| k | int | Number of top items to consider | required |

Returns:

| Type | Description |
|---|---|
| Tensor | Calibration gap as a scalar tensor (positive = overconfident) |
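Putting the two previous quantities together, the gap is the mean top-k score minus the mean top-k label. A NumPy sketch (function name hypothetical):

```python
import numpy as np

def calibration_gap_at_k_sketch(scores, labels, k):
    """Expected minus actual precision at top-k; positive means overconfident."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    top = np.argsort(scores)[::-1][:k]
    return scores[top].mean() - labels[top].mean()
```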
mean_predicted_relevance¶
Compute mean predicted relevance.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| scores | Tensor | Predicted scores, shape (n_samples,) | required |
| k | Optional[int] | If provided, only consider top-k items | None |

Returns:

| Type | Description |
|---|---|
| Tensor | Mean predicted relevance |
mean_actual_relevance¶
Compute mean actual relevance, optionally restricted to the top-k items by score.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| scores | Tensor | Predicted scores, shape (n_samples,) | required |
| labels | Tensor | Binary relevance labels, shape (n_samples,) | required |
| k | Optional[int] | If provided, only consider top-k items by score | None |

Returns:

| Type | Description |
|---|---|
| Tensor | Mean actual relevance |
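These two means are the raw ingredients of the calibration gap. A NumPy sketch of both, with the optional top-k restriction (function names hypothetical):

```python
import numpy as np

def mean_predicted_relevance_sketch(scores, k=None):
    """Mean score, optionally over the top-k scores only."""
    scores = np.sort(np.asarray(scores, dtype=float))[::-1]
    return scores[:k].mean() if k is not None else scores.mean()

def mean_actual_relevance_sketch(scores, labels, k=None):
    """Mean label, optionally over the top-k items ranked by score."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    if k is not None:
        labels = labels[np.argsort(scores)[::-1][:k]]
    return labels.mean()
```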
Visualization¶
reliability_diagram¶
Create a reliability diagram.
Shows predicted confidence vs actual accuracy per bin, with a perfect calibration line for reference.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| scores | Tensor | Predicted scores in (0, 1), shape (n_samples,) | required |
| labels | Tensor | Binary labels, shape (n_samples,) | required |
| k | Optional[int] | If provided, only use top-k items by score | None |
| n_bins | int | Number of bins for bucketing | 10 |
| title | Optional[str] | Optional plot title | None |
| figsize | Tuple[float, float] | Figure size (width, height) | (8, 6) |

Returns:

| Type | Description |
|---|---|
| Figure | Matplotlib Figure object |
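As an illustration of what such a diagram contains, here is a minimal Matplotlib sketch that bins scores equal-width, then plots per-bin confidence against per-bin accuracy alongside the y = x perfect-calibration line. The function name and exact plot styling are assumptions, not the library's implementation:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

def reliability_diagram_sketch(scores, labels, n_bins=10, title=None, figsize=(8, 6)):
    """Plot per-bin confidence vs accuracy with a perfect-calibration line."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    bins = np.minimum((scores * n_bins).astype(int), n_bins - 1)
    confs, accs = [], []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            confs.append(scores[mask].mean())
            accs.append(labels[mask].mean())
    fig, ax = plt.subplots(figsize=figsize)
    ax.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
    ax.plot(confs, accs, marker="o", label="model")
    ax.set_xlabel("Mean predicted score")
    ax.set_ylabel("Empirical accuracy")
    if title:
        ax.set_title(title)
    ax.legend()
    return fig
```

Points below the dashed line indicate overconfidence (predicted scores exceed observed accuracy); points above it indicate underconfidence.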