Decision-Making with Calibrated Scores¶
This guide explains when calibration matters and how to use calibrated scores for decision-making.
When Calibration Matters¶
Calibration is essential when you need to interpret scores as probabilities or make decisions based on score thresholds. For instance, among items with a calibrated score of 0.8, roughly 80% should actually be relevant.
1. Decision-Making with Thresholds¶
If you're selecting items above a threshold, miscalibration puts the threshold in the wrong place, so you systematically over- or under-select:
from rankcal import optimal_threshold
# Find threshold that maximizes utility
# benefit = value of showing a relevant item
# cost = cost of showing an irrelevant item
threshold, utility = optimal_threshold(
    calibrated_scores, labels, benefit=1.0, cost=0.5
)
# Select items above threshold
selected = calibrated_scores >= threshold
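The benefit and cost parameters define the utility of a selection. For intuition, here is the realized utility on a labeled set (an illustrative sketch, not a rankcal API; assumes selected and labels are boolean/0-1 tensors):
tp = ((labels == 1) & selected).sum()
fp = ((labels == 0) & selected).sum()
realized_utility = 1.0 * tp - 0.5 * fp  # benefit * TP - cost * FP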
2. Ranking with Budget Constraints¶
When you can only show k items, calibration helps predict expected outcomes:
from rankcal import expected_precision_at_k, precision_at_k
# With calibrated scores, expected precision ≈ actual precision
expected = expected_precision_at_k(calibrated_scores, k=10)
actual = precision_at_k(calibrated_scores, labels, k=10)
# The gap between these indicates calibration quality
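One practical use is a drift check (a sketch; the 0.05 tolerance is an arbitrary illustration):
gap = abs(expected - actual)
if gap > 0.05:  # arbitrary tolerance, for illustration only
    print(f"Possible miscalibration at k=10: expected {expected:.3f}, actual {actual:.3f}")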
3. Combining Scores from Multiple Models¶
When merging rankings from different models, scores must be calibrated to be comparable:
# Without calibration, a 0.7 from Model A may mean something
# different from a 0.7 from Model B
# Calibrate both to the same scale
scores_a_cal = calibrator_a(scores_a)
scores_b_cal = calibrator_b(scores_b)
# Now they can be meaningfully combined
combined = 0.5 * scores_a_cal + 0.5 * scores_b_cal
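Once both models' scores are on the same probability scale, downstream steps can treat the merged pool uniformly. A minimal sketch, assuming the calibrated scores are 1-D torch tensors over the same candidate set:
import torch

# Rank by the combined calibrated score and keep the top 10
top10 = torch.argsort(combined, descending=True)[:10]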
Decision Tools¶
Optimal Threshold¶
Find the threshold that maximizes expected utility:
from rankcal import optimal_threshold
# Define your utility function via benefit and cost
threshold, expected_utility = optimal_threshold(
    scores, labels,
    benefit=1.0,  # Value of a true positive
    cost=0.5,     # Cost of a false positive
)
print(f"Optimal threshold: {threshold:.4f}")
print(f"Expected utility: {expected_utility:.4f}")
Utility Curves¶
Visualize how utility changes with threshold:
from rankcal import utility_curve, plot_utility_curve
# Get curve data
thresholds, utilities = utility_curve(scores, labels, benefit=1.0, cost=0.5)
# Or plot directly
fig = plot_utility_curve(scores, labels, benefit=1.0, cost=0.5)
fig.savefig("utility_curve.png")
Risk-Coverage Curves¶
Understand the tradeoff between coverage (the fraction of items you select) and risk (the error rate among the selected items):
from rankcal import risk_coverage_curve, plot_risk_coverage
# Get curve data
coverages, risks = risk_coverage_curve(scores, labels)
# Or plot directly
fig = plot_risk_coverage(scores, labels)
fig.savefig("risk_coverage.png")
Budget-Constrained Selection¶
Select the best k items given a budget constraint:
from rankcal import budget_constrained_selection, expected_utility_at_budget
# Select top-k items
selected_indices = budget_constrained_selection(scores, k=10)
# Estimate utility at a given budget
utility = expected_utility_at_budget(
    scores, labels, k=10, benefit=1.0, cost=0.5
)
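With calibrated scores, the sum of the selected probabilities estimates how many relevant items the budget will yield (a sketch; assumes scores is a 1-D torch tensor and selected_indices is as returned above):
selected_scores = scores[selected_indices]
# For calibrated scores, this approximates the expected number of
# relevant items among the k selected
expected_relevant = selected_scores.sum()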
Threshold for Coverage¶
Find the threshold that achieves a target coverage:
from rankcal import threshold_for_coverage
# Find threshold to select approximately 20% of items
threshold = threshold_for_coverage(scores, coverage=0.2)
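Applying the threshold should then select roughly the requested fraction (a quick check, assuming a 1-D torch tensor of scores):
selected = scores >= threshold
print(f"Actual coverage: {selected.float().mean().item():.2%}")  # should be close to 20%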
Integration with Ranking Pipelines¶
Post-hoc Calibration (Recommended for Most Cases)¶
Use this when your ranking model is already trained:
import torch

from rankcal import IsotonicCalibrator, ece, ece_at_k
# 1. Get model predictions on calibration set
raw_scores = model.predict(X_cal)
# 2. Fit calibrator
calibrator = IsotonicCalibrator()
calibrator.fit(raw_scores, cal_labels)
# 3. Use in production
def predict_with_calibration(X):
    raw = model.predict(X)
    return calibrator(torch.tensor(raw))
# 4. Monitor calibration (here calibrated_scores and labels come from a held-out evaluation set)
print(f"Overall ECE: {ece(calibrated_scores, labels):.4f}")
print(f"ECE@10: {ece_at_k(calibrated_scores, labels, k=10):.4f}")
End-to-End Differentiable Training¶
Use MonotonicNNCalibrator when you want to train the calibrator jointly with your ranking model:
import torch
import torch.nn as nn

from rankcal import MonotonicNNCalibrator
class CalibratedRanker(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        self.calibrator = MonotonicNNCalibrator(hidden_dims=(16, 16))

    def forward(self, x):
        raw_scores = self.base_model(x)
        return self.calibrator(raw_scores)

# Train end to end; the calibrator's parameters are updated jointly
# with the base model
model = CalibratedRanker(base_model)
optimizer = torch.optim.Adam(model.parameters())

for batch in dataloader:
    x, labels = batch
    calibrated_scores = model(x)
    # Ranking loss on calibrated scores; add a calibration
    # regularizer here if desired
    loss = ranking_loss(calibrated_scores, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
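Because the calibrator is monotonic, it rescales scores without reordering them, so ranking metrics are unaffected. A quick sanity check (illustrative; assumes strictly increasing calibrator outputs and no tied scores):
with torch.no_grad():
    raw = model.base_model(x)
    cal = model.calibrator(raw)
    # A strictly monotonic map preserves sort order
    assert torch.equal(torch.argsort(raw.flatten()), torch.argsort(cal.flatten()))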