Demonstration

Conformal vs. recalibration

The fair case: where conformal genuinely wins, and where it quietly hands you the wrong object.

This is the fair turn. We have spent five demos prodding conformal’s soft spots; here we concede its one real advantage and draw the line cleanly. The question is not “is conformal valid?” (it is) but “which object do you actually need?”

Start with an overconfident base model. The truth is \(y_i \sim N(0,\sigma_{\text{true}}^2)\) with \(\sigma_{\text{true}}=1\), but the model claims a density that is too narrow: \(N(0,\sigma_{\text{model}}^2)\) with \(\sigma_{\text{model}} = \sigma_{\text{true}}/k\) for an overconfidence factor \(k>1\). We hold out half the data for calibration and evaluate every method on the same test set. Three treatments:

Raw. Report the base density as-is. Its PIT is U-shaped (the signature of overconfidence), its log-score is poor, and its central \(1-\alpha\) interval under-covers.
Conformal. Take the nonconformity score \(|y|\) (the mean is correct, so \(\hat\mu\equiv 0\)) and emit the band \([-q,+q]\) with \(q=\text{conformalQuantile}(\cdot,\alpha)\). Marginal coverage lands on \(1-\alpha\) essentially exactly. As a predictive system it returns a set, and its implied density is just the base residual shape, re-leveled, never fitted to the score you report.
Recalibration. Fit the spread to the proper score. For a Gaussian with correct mean, the log-score maximizer is \(\sigma^\star = \sqrt{\operatorname{mean}(y_{\text{cal}}^2)}\) (the practical analogue of Kuleshov et al., 2018). PIT goes flat, the log-score is near-optimal, and you get back a full density you can integrate, score, and decide with.
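
The three treatments can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the demo's own code: the values \(k=2\), \(\alpha=0.1\), the sample size, the seed, and the function names are assumptions, and `conformal_quantile` is a stand-in for the page's `conformalQuantile`, using the usual split-conformal rank \(\lceil (n+1)(1-\alpha) \rceil\).

```python
import math
import random
from statistics import NormalDist


def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:  # too few calibration points for this alpha
        return math.inf
    return sorted(scores)[rank - 1]


def fit_treatments(y_cal, sigma_model, alpha):
    """Build the three predictive objects from the calibration half."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # Gaussian central-interval multiplier
    q = conformal_quantile([abs(y) for y in y_cal], alpha)  # score is |y|, mean is 0
    sigma_star = math.sqrt(sum(y * y for y in y_cal) / len(y_cal))  # log-score maximizer
    return {
        "raw_interval": (-z * sigma_model, z * sigma_model),
        "conformal_band": (-q, q),
        "recal_sigma": sigma_star,
        "recal_interval": (-z * sigma_star, z * sigma_star),
    }


# Assumed settings for illustration only.
rng = random.Random(0)
k, alpha, sigma_true = 2.0, 0.1, 1.0
y_cal = [rng.gauss(0.0, sigma_true) for _ in range(2000)]

result = fit_treatments(y_cal, sigma_model=sigma_true / k, alpha=alpha)
for name, value in result.items():
    print(name, value)
```

Under these assumptions the raw interval is roughly half as wide as the other two, while the conformal band and the recalibrated interval nearly coincide. Only the recalibrated treatment also returns a spread parameter, and with it a full density.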

The PIT histograms are the money shot. The flat line at density \(=1\) is what a calibrated forecast should look like.

Left: the RAW forecast’s PIT, piled up at the edges, the classic U of an overconfident model. Right: after variance recalibration the PIT sits flat on the uniform reference. Recalibration fixed the distribution, not just a coverage number.
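
The two PIT histograms can be reproduced numerically along these lines. This is a minimal sketch, assuming \(k=2\), ten bins, 5000 points per half, and a fixed seed; none of these values come from the demo, and the helper names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def pit_values(y, sigma):
    """PIT of each outcome under the forecast N(0, sigma^2): u = F(y)."""
    forecast = NormalDist(0.0, sigma)
    return [forecast.cdf(v) for v in y]


def pit_histogram(pits, bins=10):
    """Bar heights scaled so that a calibrated forecast gives about 1 per bin."""
    counts = [0] * bins
    for u in pits:
        counts[min(int(u * bins), bins - 1)] += 1  # clamp u == 1.0 into the last bin
    return [c * bins / len(pits) for c in counts]


# Assumed settings for illustration only.
rng = random.Random(1)
k = 2.0
y_cal = [rng.gauss(0.0, 1.0) for _ in range(5000)]
y_test = [rng.gauss(0.0, 1.0) for _ in range(5000)]

sigma_model = 1.0 / k  # overconfident base model
sigma_star = math.sqrt(sum(v * v for v in y_cal) / len(y_cal))  # recalibrated spread

raw_hist = pit_histogram(pit_values(y_test, sigma_model))
recal_hist = pit_histogram(pit_values(y_test, sigma_star))
print("raw  :", [round(h, 2) for h in raw_hist])
print("recal:", [round(h, 2) for h in recal_hist])
```

With these settings the raw bars should be tall in the outermost bins and low in the middle, while the recalibrated bars stay close to 1 throughout.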

Now overlay the predictive objects on the outcomes. The raw density is too narrow; the recalibrated density matches; and the conformal band is shaded as an interval, a set, not a curve, to make the set-vs-measure distinction visible. Use the selector to highlight each.

Read the log-scores: raw is worst, conformal-implied sits in the middle (it inherits the base shape), and recalibrated is best, because it is the only treatment that optimized the score you actually report. Coverage tells the mirror story: raw under-covers, while both conformal and recalibration hit the target. The difference is in what backs the number: conformal’s coverage is a finite-sample, distribution-free guarantee, whereas recalibration’s rests on the Gaussian model being right.
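
The comparison can be checked with a short simulation. This is a sketch under assumed settings (\(k=2\), \(\alpha=0.1\), 5000 points per half, fixed seed) with hypothetical helper names. It scores only the raw and recalibrated densities, because the construction of the conformal-implied density is not spelled out here; coverage is computed for all three.

```python
import math
import random
from statistics import NormalDist


def mean_log_score(y, sigma):
    """Average log density of N(0, sigma^2) at the outcomes (higher is better)."""
    const = -0.5 * math.log(2 * math.pi * sigma * sigma)
    return sum(const - v * v / (2 * sigma * sigma) for v in y) / len(y)


def coverage(y, half_width):
    """Fraction of outcomes inside [-half_width, +half_width]."""
    return sum(abs(v) <= half_width for v in y) / len(y)


def conformal_quantile(scores, alpha):
    """Split-conformal threshold (assumed form of the page's conformalQuantile)."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    return math.inf if rank > n else sorted(scores)[rank - 1]


# Assumed settings for illustration only.
rng = random.Random(2)
k, alpha = 2.0, 0.1
y_cal = [rng.gauss(0.0, 1.0) for _ in range(5000)]
y_test = [rng.gauss(0.0, 1.0) for _ in range(5000)]

z = NormalDist().inv_cdf(1 - alpha / 2)
sigma_model = 1.0 / k
sigma_star = math.sqrt(sum(v * v for v in y_cal) / len(y_cal))
q = conformal_quantile([abs(v) for v in y_cal], alpha)

log_raw = mean_log_score(y_test, sigma_model)
log_recal = mean_log_score(y_test, sigma_star)
cov_raw = coverage(y_test, z * sigma_model)
cov_conformal = coverage(y_test, q)
cov_recal = coverage(y_test, z * sigma_star)

print(f"log-score  raw={log_raw:.3f}  recal={log_recal:.3f}")
print(f"coverage   raw={cov_raw:.3f}  conformal={cov_conformal:.3f}  recal={cov_recal:.3f}")
```

In this Gaussian setting the recalibrated and conformal coverages should both sit near \(1-\alpha\). To see the difference between the two, one would change the data-generating distribution so that the Gaussian assumption fails.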

Takeaway. If what you want is a calibrated forecast (a density to integrate, score, and decide with), you want recalibration. It improves the very number you report and hands back the whole object. Conformal calibrates coverage and emits a set; its implied “density” is the base model’s residual shape, re-leveled, and never optimized for the score. Conformal’s edge is real and worth naming: a distribution-free, finite-sample coverage guarantee. That is the right tool exactly when the guarantee is the product, and the wrong one when a forecast is. See the paper for the full argument, or head back to the overview.

