Evaluating Calibration¶
This guide explains how to measure and interpret calibration quality.
Core Metrics¶
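The snippets below assume scores (model scores in [0, 1]) and labels (binary relevance) as 1-D arrays. As a minimal, purely illustrative setup with synthetic NumPy data (the data itself is not part of rankcal):
import numpy as np
rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 1.0, size=1000)               # model scores in [0, 1]
labels = (rng.uniform(size=1000) < scores).astype(int)  # binary relevance labels
With scores and labels in hand, the core metrics are: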
from rankcal import ece, ece_at_k, adaptive_ece, mce
# Expected Calibration Error (lower is better)
print(f"ECE: {ece(scores, labels):.4f}")
# ECE focused on top-k (where decisions happen)
print(f"ECE@10: {ece_at_k(scores, labels, k=10):.4f}")
# Adaptive ECE (better for skewed score distributions)
print(f"Adaptive ECE: {adaptive_ece(scores, labels):.4f}")
# Maximum Calibration Error (worst-case bin)
print(f"MCE: {mce(scores, labels):.4f}")
Understanding Each Metric¶
ECE (Expected Calibration Error)¶
ECE is the most common calibration metric. It measures the average absolute difference between predicted probabilities and observed frequencies across bins.
from rankcal import ece
error = ece(scores, labels, n_bins=10)
- Lower is better (0 = perfectly calibrated)
- Uses equal-width bins by default
- Weighted by the number of samples in each bin
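To make this concrete, here is a rough NumPy sketch of an equal-width-bin ECE; rankcal's own implementation may differ in details such as edge placement and empty-bin handling:
import numpy as np
def manual_ece(scores, labels, n_bins=10):
    scores, labels = np.asarray(scores), np.asarray(labels)
    edges = np.linspace(0.0, 1.0, n_bins + 1)        # equal-width bins over [0, 1]
    bin_ids = np.digitize(scores, edges[1:-1])       # bin index per sample, 0..n_bins-1
    total = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue                                 # skip empty bins
        conf = scores[mask].mean()                   # mean predicted probability in bin
        acc = labels[mask].mean()                    # observed positive rate in bin
        total += mask.mean() * abs(acc - conf)       # weight by fraction of samples in bin
    return total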
ECE@k (ECE at Top-k)¶
ECE@k measures calibration only among the top-k ranked items. This is crucial for ranking systems, where decisions are made at the top.
from rankcal import ece_at_k
# Evaluate calibration for top 10 items
error = ece_at_k(scores, labels, k=10)
# Evaluate calibration for top 100 items
error = ece_at_k(scores, labels, k=100)
- Use this when your application shows only top-k results
- More relevant than overall ECE for ranking systems
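Conceptually, ECE@k restricts the evaluation to the k highest-scoring items before binning. A rough sketch of the idea, reusing the manual_ece helper sketched above (the library's exact treatment of the top-k subset may differ):
import numpy as np
def manual_ece_at_k(scores, labels, k=10, n_bins=10):
    scores, labels = np.asarray(scores), np.asarray(labels)
    top = np.argsort(scores)[::-1][:k]               # indices of the k highest scores
    return manual_ece(scores[top], labels[top], n_bins=n_bins)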
Adaptive ECE¶
Adaptive ECE uses equal-mass (quantile) bins instead of equal-width bins, which makes it better suited to skewed score distributions.
from rankcal import adaptive_ece
error = adaptive_ece(scores, labels, n_bins=10)
- Ensures each bin has roughly the same number of samples
- More robust when scores are not uniformly distributed
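The only difference from plain ECE is the binning: bin edges are taken from quantiles of the score distribution, so each bin holds roughly the same number of samples. A hedged sketch of equal-mass binning (rankcal may handle ties and empty bins differently):
import numpy as np
def manual_adaptive_ece(scores, labels, n_bins=10):
    scores, labels = np.asarray(scores), np.asarray(labels)
    edges = np.quantile(scores, np.linspace(0.0, 1.0, n_bins + 1))  # equal-mass edges
    bin_ids = np.searchsorted(edges[1:-1], scores, side="right")    # bin index per sample
    total = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        total += mask.mean() * abs(labels[mask].mean() - scores[mask].mean())
    return total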
MCE (Maximum Calibration Error)¶
MCE reports the worst calibration error across all bins, which is useful for understanding worst-case behavior.
from rankcal import mce
error = mce(scores, labels, n_bins=10)
- Shows the maximum miscalibration in any bin
- Important when worst-case errors matter
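MCE uses the same per-bin gaps as ECE but reports the largest one instead of the sample-weighted average. A minimal sketch under the same equal-width binning assumptions as the ECE sketch above:
import numpy as np
def manual_mce(scores, labels, n_bins=10):
    scores, labels = np.asarray(scores), np.asarray(labels)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(scores, edges[1:-1])
    gaps = [abs(labels[bin_ids == b].mean() - scores[bin_ids == b].mean())
            for b in range(n_bins) if (bin_ids == b).any()]
    return max(gaps) if gaps else 0.0                # worst-case bin gap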
Interpreting Results¶
| ECE Value | Interpretation |
|---|---|
| < 0.02 | Excellent calibration |
| 0.02 - 0.05 | Good calibration |
| 0.05 - 0.10 | Moderate miscalibration |
| > 0.10 | Poor calibration, consider recalibrating |
Note
ECE@k at small k is estimated from only a few samples and is therefore noisy, so it can look inflated. Focus on trends rather than absolute values for small k.
Visualizing Calibration¶
Reliability Diagram¶
The reliability diagram is the standard way to visualize calibration quality.
from rankcal import reliability_diagram
import matplotlib.pyplot as plt
# Full reliability diagram
fig = reliability_diagram(scores, labels, n_bins=10)
plt.show()
# Top-k focused view
fig = reliability_diagram(scores, labels, k=50, n_bins=10)
plt.show()
Reading the diagram:
- The diagonal line represents perfect calibration
- Points above the diagonal: underconfident (actual > predicted)
- Points below the diagonal: overconfident (predicted > actual)
- Bar heights show the number of samples in each bin
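For intuition, the same picture can be recreated by hand with matplotlib: bin the scores, then plot each bin's observed positive rate against its mean predicted probability. This is only an illustrative sketch, not a substitute for rankcal's reliability_diagram:
import numpy as np
import matplotlib.pyplot as plt
edges = np.linspace(0.0, 1.0, 11)
bin_ids = np.digitize(scores, edges[1:-1])
conf = [scores[bin_ids == b].mean() for b in range(10) if (bin_ids == b).any()]
acc = [labels[bin_ids == b].mean() for b in range(10) if (bin_ids == b).any()]
plt.plot([0, 1], [0, 1], "k--", label="perfect calibration")  # diagonal reference
plt.plot(conf, acc, "o-", label="model")                       # one point per non-empty bin
plt.xlabel("Mean predicted probability")
plt.ylabel("Observed frequency")
plt.legend()
plt.show()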
Comparing Before and After Calibration¶
from rankcal import IsotonicCalibrator, ece, ece_at_k, reliability_diagram
import matplotlib.pyplot as plt
# Fit calibrator
calibrator = IsotonicCalibrator()
calibrator.fit(scores, labels)
calibrated = calibrator(scores)
# Compare metrics
print("Before calibration:")
print(f" ECE: {ece(scores, labels):.4f}")
print(f" ECE@10: {ece_at_k(scores, labels, k=10):.4f}")
print("\nAfter calibration:")
print(f" ECE: {ece(calibrated, labels):.4f}")
print(f" ECE@10: {ece_at_k(calibrated, labels, k=10):.4f}")
# Compare visually
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
reliability_diagram(scores, labels, ax=axes[0])
axes[0].set_title("Before Calibration")
reliability_diagram(calibrated, labels, ax=axes[1])
axes[1].set_title("After Calibration")
plt.tight_layout()
plt.show()
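Keep in mind that computing metrics on the same data used to fit the calibrator gives an optimistic picture. A common pattern is to fit on one split and report metrics on a held-out split; the sketch below assumes scikit-learn's train_test_split is available, which is not part of rankcal:
from sklearn.model_selection import train_test_split
from rankcal import IsotonicCalibrator, ece
# keep evaluation data that the calibrator never sees
s_fit, s_test, y_fit, y_test = train_test_split(scores, labels, test_size=0.5, random_state=0)
calibrator = IsotonicCalibrator()
calibrator.fit(s_fit, y_fit)
print(f"Held-out ECE: {ece(calibrator(s_test), y_test):.4f}")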
Ranking-Specific Metrics¶
Precision at k¶
from rankcal import precision_at_k, expected_precision_at_k
# Actual precision in top-k
actual = precision_at_k(scores, labels, k=10)
# Expected precision based on calibrated scores
# (calibrated_scores comes from a fitted calibrator, e.g. calibrated_scores = calibrator(scores))
expected = expected_precision_at_k(calibrated_scores, k=10)
# The gap indicates calibration quality
print(f"Actual P@10: {actual:.4f}")
print(f"Expected P@10: {expected:.4f}")
print(f"Gap: {abs(actual - expected):.4f}")
Calibration Gap at k¶
Directly measures the gap between expected and actual precision at k.
from rankcal import calibration_gap_at_k
gap = calibration_gap_at_k(scores, labels, k=10)
print(f"Calibration gap @10: {gap:.4f}")