Metrics

Calibration and ranking metrics for evaluating model performance.

Calibration Error Metrics

ece

Expected Calibration Error metrics.

adaptive_ece(scores, labels, n_bins=10)

Compute adaptive Expected Calibration Error with equal-mass bins.

Unlike standard ECE which uses equal-width bins, adaptive ECE uses quantile-based binning so each bin has approximately the same number of samples. This is more robust when scores are not uniformly distributed.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `scores` | `Tensor` | Predicted scores in (0, 1), shape `(n_samples,)` | required |
| `labels` | `Tensor` | Binary labels, shape `(n_samples,)` | required |
| `n_bins` | `int` | Number of bins for bucketing scores | `10` |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` | Adaptive ECE value as a scalar tensor |
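The quantile-based binning described above can be sketched in a few lines of plain Python (floats stand in for tensors; `adaptive_ece_sketch` is a hypothetical name, not this library's implementation):

```python
# Illustrative sketch only -- not the library's implementation.
def adaptive_ece_sketch(scores, labels, n_bins=10):
    """Adaptive ECE: equal-mass bins, each holding ~n/n_bins samples."""
    pairs = sorted(zip(scores, labels))  # sort ascending by score
    n = len(pairs)
    total = 0.0
    for b in range(n_bins):
        lo, hi = b * n // n_bins, (b + 1) * n // n_bins  # quantile cut points
        bucket = pairs[lo:hi]
        if not bucket:
            continue
        conf = sum(s for s, _ in bucket) / len(bucket)  # mean predicted score
        acc = sum(y for _, y in bucket) / len(bucket)   # empirical positive rate
        total += len(bucket) / n * abs(conf - acc)      # sample-weighted |gap|
    return total
```

Each bucket receives the same number of samples regardless of how the scores cluster, which is the key difference from the equal-width binning of standard `ece`.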

adaptive_ece_at_k(scores, labels, k, n_bins=10)

Compute adaptive Expected Calibration Error at top-k.

Only considers the top-k items by score when computing adaptive ECE. Uses quantile-based binning for equal-mass bins.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `scores` | `Tensor` | Predicted scores in (0, 1), shape `(n_samples,)` | required |
| `labels` | `Tensor` | Binary labels, shape `(n_samples,)` | required |
| `k` | `int` | Number of top items to consider | required |
| `n_bins` | `int` | Number of bins for bucketing scores | `10` |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` | Adaptive ECE@k value as a scalar tensor |

calibration_error_per_bin(scores, labels, n_bins=10)

Compute per-bin calibration statistics.

Useful for building reliability diagrams.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `scores` | `Tensor` | Predicted scores in (0, 1), shape `(n_samples,)` | required |
| `labels` | `Tensor` | Binary labels, shape `(n_samples,)` | required |
| `n_bins` | `int` | Number of bins | `10` |

Returns:

| Type | Description |
| --- | --- |
| `Tuple[Tensor, Tensor, Tensor, Tensor]` | Tuple of (bin_centers, bin_accuracies, bin_confidences, bin_counts) |
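As a rough illustration of the per-bin statistics (equal-width bins, plain Python lists rather than tensors; `calibration_per_bin_sketch` is a hypothetical name):

```python
# Illustrative sketch only -- the library returns tensors, not lists.
def calibration_per_bin_sketch(scores, labels, n_bins=10):
    """Per-bin (center, accuracy, confidence, count) for a reliability diagram."""
    centers = [(b + 0.5) / n_bins for b in range(n_bins)]
    accs = [0.0] * n_bins
    confs = [0.0] * n_bins
    counts = [0] * n_bins
    for s, y in zip(scores, labels):
        b = min(int(s * n_bins), n_bins - 1)  # equal-width bin index
        confs[b] += s
        accs[b] += y
        counts[b] += 1
    for b in range(n_bins):
        if counts[b]:  # convert sums to per-bin means
            confs[b] /= counts[b]
            accs[b] /= counts[b]
    return centers, accs, confs, counts
```

Plotting `accs` against `centers` (with `counts` as bar weights) yields a basic reliability diagram.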

ece(scores, labels, n_bins=10)

Compute Expected Calibration Error.

ECE measures the average absolute difference between predicted confidence and actual accuracy across bins.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `scores` | `Tensor` | Predicted scores in (0, 1), shape `(n_samples,)` | required |
| `labels` | `Tensor` | Binary labels, shape `(n_samples,)` | required |
| `n_bins` | `int` | Number of bins for bucketing scores | `10` |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` | ECE value as a scalar tensor |
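The computation described above can be sketched in plain Python (equal-width bins; `ece_sketch` is a hypothetical name, not this library's function):

```python
# Illustrative sketch only -- not the library's implementation.
def ece_sketch(scores, labels, n_bins=10):
    """ECE with equal-width bins over [0, 1]."""
    n = len(scores)
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        bins[min(int(s * n_bins), n_bins - 1)].append((s, y))
    total = 0.0
    for bucket in bins:
        if not bucket:
            continue
        conf = sum(s for s, _ in bucket) / len(bucket)  # mean predicted score
        acc = sum(y for _, y in bucket) / len(bucket)   # empirical accuracy
        total += len(bucket) / n * abs(conf - acc)      # sample-weighted |gap|
    return total
```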

ece_at_k(scores, labels, k, n_bins=10)

Compute Expected Calibration Error at top-k.

Only considers the top-k items by score when computing ECE. This measures calibration where ranking decisions actually happen.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `scores` | `Tensor` | Predicted scores in (0, 1), shape `(n_samples,)` | required |
| `labels` | `Tensor` | Binary labels, shape `(n_samples,)` | required |
| `k` | `int` | Number of top items to consider | required |
| `n_bins` | `int` | Number of bins for bucketing scores | `10` |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` | ECE@k value as a scalar tensor |
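A self-contained sketch of the top-k restriction (plain Python; `ece_at_k_sketch` is a hypothetical name):

```python
# Illustrative sketch only -- not the library's implementation.
def ece_at_k_sketch(scores, labels, k, n_bins=10):
    """ECE restricted to the k highest-scoring items."""
    top = sorted(zip(scores, labels), reverse=True)[:k]  # keep top-k by score
    bins = [[] for _ in range(n_bins)]
    for s, y in top:
        bins[min(int(s * n_bins), n_bins - 1)].append((s, y))
    total = 0.0
    for bucket in bins:
        if not bucket:
            continue
        conf = sum(s for s, _ in bucket) / len(bucket)
        acc = sum(y for _, y in bucket) / len(bucket)
        total += len(bucket) / k * abs(conf - acc)  # weights now sum over k items
    return total
```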

mce(scores, labels, n_bins=10)

Compute Maximum Calibration Error.

MCE measures the worst-case calibration error across all bins, unlike ECE which computes a weighted average.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `scores` | `Tensor` | Predicted scores in (0, 1), shape `(n_samples,)` | required |
| `labels` | `Tensor` | Binary labels, shape `(n_samples,)` | required |
| `n_bins` | `int` | Number of bins for bucketing scores | `10` |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` | MCE value as a scalar tensor |
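The worst-case variant differs from ECE only in the final reduction, as this plain-Python sketch shows (`mce_sketch` is a hypothetical name):

```python
# Illustrative sketch only -- not the library's implementation.
def mce_sketch(scores, labels, n_bins=10):
    """MCE: the largest per-bin |confidence - accuracy| gap."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        bins[min(int(s * n_bins), n_bins - 1)].append((s, y))
    worst = 0.0
    for bucket in bins:
        if not bucket:
            continue
        conf = sum(s for s, _ in bucket) / len(bucket)
        acc = sum(y for _, y in bucket) / len(bucket)
        worst = max(worst, abs(conf - acc))  # take the max, no sample weighting
    return worst
```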

mce_at_k(scores, labels, k, n_bins=10)

Compute Maximum Calibration Error at top-k.

Only considers the top-k items by score when computing MCE.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `scores` | `Tensor` | Predicted scores in (0, 1), shape `(n_samples,)` | required |
| `labels` | `Tensor` | Binary labels, shape `(n_samples,)` | required |
| `k` | `int` | Number of top items to consider | required |
| `n_bins` | `int` | Number of bins for bucketing scores | `10` |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` | MCE@k value as a scalar tensor |

Ranking Metrics

precision_at_k

Compute precision at k.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `scores` | `Tensor` | Predicted scores, shape `(n_samples,)` | required |
| `labels` | `Tensor` | Binary relevance labels, shape `(n_samples,)` | required |
| `k` | `int` | Number of top items to consider | required |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` | Precision@k as a scalar tensor |
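The metric reduces to a ranked slice and a mean, as in this plain-Python sketch (`precision_at_k_sketch` is a hypothetical name):

```python
# Illustrative sketch only -- not the library's implementation.
def precision_at_k_sketch(scores, labels, k):
    """Fraction of relevant items among the k highest-scoring ones."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in order[:k]) / k
```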

expected_precision_at_k

Compute expected precision at k from calibrated scores.

If scores are well-calibrated, this should match actual precision@k.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `scores` | `Tensor` | Calibrated scores (probabilities), shape `(n_samples,)` | required |
| `k` | `int` | Number of top items to consider | required |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` | Expected precision@k as a scalar tensor |
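Since calibrated scores are probabilities of relevance, the expectation is just the mean of the top-k scores. A plain-Python sketch (`expected_precision_at_k_sketch` is a hypothetical name):

```python
# Illustrative sketch only -- not the library's implementation.
def expected_precision_at_k_sketch(scores, k):
    """Mean calibrated score of the top-k items: the precision a
    well-calibrated model predicts for its own top-k."""
    return sum(sorted(scores, reverse=True)[:k]) / k
```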

calibration_gap_at_k

Compute the calibration gap at top-k.

This is the difference between expected precision (from scores) and actual precision (from labels) at top-k.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `scores` | `Tensor` | Calibrated scores, shape `(n_samples,)` | required |
| `labels` | `Tensor` | Binary relevance labels, shape `(n_samples,)` | required |
| `k` | `int` | Number of top items to consider | required |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` | Calibration gap as a scalar tensor (positive = overconfident) |
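The gap combines the two precision quantities over the same top-k slice, as in this plain-Python sketch (`calibration_gap_at_k_sketch` is a hypothetical name):

```python
# Illustrative sketch only -- not the library's implementation.
def calibration_gap_at_k_sketch(scores, labels, k):
    """Expected minus actual precision at k (positive = overconfident)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    expected = sum(scores[i] for i in order) / k  # what the scores promise
    actual = sum(labels[i] for i in order) / k    # what the labels deliver
    return expected - actual
```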

mean_predicted_relevance

Compute mean predicted relevance.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `scores` | `Tensor` | Predicted scores, shape `(n_samples,)` | required |
| `k` | `Optional[int]` | If provided, only consider top-k items | `None` |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` | Mean predicted relevance |

mean_actual_relevance

Compute mean actual relevance at top-k by score.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `scores` | `Tensor` | Predicted scores, shape `(n_samples,)` | required |
| `labels` | `Tensor` | Binary relevance labels, shape `(n_samples,)` | required |
| `k` | `Optional[int]` | If provided, only consider top-k items by score | `None` |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` | Mean actual relevance |
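Both means can be sketched together in plain Python (the `_sketch` names are hypothetical, not this library's functions):

```python
# Illustrative sketches only -- not the library's implementation.
def mean_predicted_relevance_sketch(scores, k=None):
    """Mean score, optionally restricted to the top-k items by score."""
    vals = sorted(scores, reverse=True)[:k] if k is not None else list(scores)
    return sum(vals) / len(vals)

def mean_actual_relevance_sketch(scores, labels, k=None):
    """Mean label, optionally over the top-k items as ranked by score."""
    if k is None:
        return sum(labels) / len(labels)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(labels[i] for i in order) / k
```

Comparing the two (predicted vs. actual) over the same `k` gives a quick sanity check on calibration before computing the full gap metric.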

Visualization

reliability_diagram

Create a reliability diagram.

Plots mean predicted confidence against empirical accuracy for each bin, with the perfect-calibration diagonal for reference.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `scores` | `Tensor` | Predicted scores in (0, 1), shape `(n_samples,)` | required |
| `labels` | `Tensor` | Binary labels, shape `(n_samples,)` | required |
| `k` | `Optional[int]` | If provided, only use top-k items by score | `None` |
| `n_bins` | `int` | Number of bins for bucketing | `10` |
| `title` | `Optional[str]` | Optional plot title | `None` |
| `figsize` | `Tuple[float, float]` | Figure size (width, height) | `(8, 6)` |

Returns:

| Type | Description |
| --- | --- |
| `Figure` | Matplotlib Figure object |