validmind.CalibrationCurve
@tags('sklearn', 'model_performance', 'classification')
@tasks('classification')
def CalibrationCurve(model: validmind.vm_models.VMModel, dataset: validmind.vm_models.VMDataset, n_bins: int = 10):
Evaluates the calibration of probability estimates by comparing predicted probabilities against observed frequencies.
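The test is typically run through ValidMind's test harness. Below is a minimal sketch, assuming the test is registered under the ID validmind.model_validation.sklearn.CalibrationCurve and that vm.init() has already been called; the exact test ID and setup calls may differ in your environment, and `model` and `df` are an assumed fitted binary classifier and pandas DataFrame.

```python
# Minimal sketch (assumed test ID and setup; adapt to your environment).
import validmind as vm

# Assumes vm.init(...) has been called, `model` is a fitted binary
# classifier, and `df` is a pandas DataFrame with a "target" column.
vm_dataset = vm.init_dataset(dataset=df, target_column="target")
vm_model = vm.init_model(model, input_id="my_classifier")
vm_dataset.assign_predictions(model=vm_model)

vm.tests.run_test(
    "validmind.model_validation.sklearn.CalibrationCurve",  # assumed test ID
    inputs={"model": vm_model, "dataset": vm_dataset},
    params={"n_bins": 10},
)
```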
Purpose
The Calibration Curve test assesses how well a model's predicted probabilities align with actual observed frequencies. Accurate probability estimates are crucial for risk assessment, decision support, and cost-sensitive applications, where calibration directly impacts business decisions.
Test Mechanism
The test uses sklearn's calibration_curve function to:
- Sort predictions into bins based on predicted probabilities
- Calculate the mean predicted probability in each bin
- Compare against the observed frequency of positive cases
- Plot the results against the perfect calibration line (y = x)
The resulting curve shows how well the predicted probabilities match empirical probabilities.
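The same computation can be reproduced directly with scikit-learn. Here is a self-contained sketch on synthetic data; the dataset and classifier are illustrative, not part of the test itself:

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative data and model; any binary classifier with predict_proba works.
X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_prob = clf.predict_proba(X_test)[:, 1]  # positive-class probabilities

# Bin the predictions, then compare the mean predicted probability per bin
# (x-axis) against the observed frequency of positives per bin (y-axis).
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)

plt.plot([0, 1], [0, 1], "k--", label="Perfect calibration (y = x)")
plt.plot(prob_pred, prob_true, marker="o", label="Model")
plt.xlabel("Mean predicted probability")
plt.ylabel("Observed frequency of positives")
plt.legend()
plt.show()
```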
Signs of High Risk
- Significant deviation from the perfect calibration line (one way to quantify this is sketched after this list)
- Systematic overconfidence (predictions too close to 0 or 1)
- Systematic underconfidence (predictions clustered around 0.5)
- Empty or sparse bins indicating poor probability coverage
- Sharp discontinuities in the calibration curve
- Calibration patterns that vary across probability ranges
- Consistent over- or under-estimation in critical probability regions
- Large confidence intervals in certain probability ranges
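A common way to turn "significant deviation" into a single number is the expected calibration error (ECE): a bin-size-weighted average of the gaps between observed frequency and mean predicted probability. The helper below is a hypothetical illustration, not part of this test's output:

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Hypothetical helper: bin-size-weighted mean |observed - predicted| gap."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (y_prob >= lo) & (y_prob < hi)
        if hi == edges[-1]:
            mask |= y_prob == hi  # include 1.0 in the last bin
        if not mask.any():
            continue  # empty bin: itself a sign of poor probability coverage
        gap = abs(y_true[mask].mean() - y_prob[mask].mean())
        ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# e.g. expected_calibration_error(y_test, y_prob), reusing the earlier sketch's names
```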
Strengths
- Visual and intuitive interpretation of probability quality
- Identifies systematic biases in probability estimates
- Supports probability threshold selection
- Helps understand model confidence patterns
- Applicable across different classification models
- Enables comparison between different models
- Indicates when recalibration may be needed (see the sketch after this list)
- Critical for risk-sensitive applications
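When the curve reveals systematic miscalibration, scikit-learn's CalibratedClassifierCV can fit a post-hoc calibration mapping on top of the classifier. A minimal sketch follows; the choice of method is problem-dependent, and this test itself performs no recalibration:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "sigmoid" fits Platt scaling (parametric); "isotonic" fits a monotone
# nonparametric mapping and needs more data to avoid overfitting.
calibrated = CalibratedClassifierCV(LogisticRegression(max_iter=1000),
                                    method="isotonic", cv=5)
calibrated.fit(X_train, y_train)
y_prob_calibrated = calibrated.predict_proba(X_test)[:, 1]
```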
Limitations
- Sensitive to the number of bins chosen (illustrated after this list)
- Requires sufficient samples in each bin for reliable estimates
- May mask local calibration issues within bins
- Does not account for feature-dependent calibration issues
- Limited to binary classification problems
- Cannot detect all forms of miscalibration
- Assumes bin boundaries are appropriate for the problem
- May be affected by class imbalance
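The bin-count sensitivity noted above is easy to demonstrate: the same predictions can look more or less calibrated depending on n_bins. Reusing y_test and y_prob from the Test Mechanism sketch:

```python
import numpy as np
from sklearn.calibration import calibration_curve

# The maximum per-bin gap tends to grow as bins get smaller and noisier.
for n_bins in (5, 10, 25, 50):
    prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=n_bins)
    print(f"n_bins={n_bins:>2}: max |observed - predicted| = "
          f"{np.abs(prob_true - prob_pred).max():.3f}")
```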