Conformal Prediction
A practical, interactive guide, built on one premise: to wield conformal prediction well, you have to understand what it isn’t.
Conformal prediction is one of the most useful ideas in modern uncertainty quantification. From any model, on almost any data, it builds a prediction set that contains the truth with a frequency you choose, in finite samples and with no distributional assumptions beyond exchangeability. But the tool is easy to misapply, and using it well takes more than knowing the recipe; it takes a clear sense of its boundaries. So this guide takes its time. We build the method, read its guarantee carefully, see exactly where it is indispensable, and then, in the part most often skipped, we are precise about what the guarantee does not give you, because that is what tells you when to reach for it.
The intent of this guide is to inform, not to oversell.
Given exchangeable data, a model, and a nonconformity score, split conformal prediction returns a set \(C(X)\) with $$\mathbb{P}\big(Y_{n+1}\in C(X_{n+1})\big)\;\ge\;1-\alpha.$$ A real and remarkable guarantee. The art is in reading it: the probability averages over the calibration data and the new input \(X_{n+1}\), so it describes the average case, not necessarily the one in front of you.
Everything below is something you can check by moving a slider. The companion paper states it precisely, with proofs and citations; a worked example stress-tests the popular libraries on a controlled problem.
1 · How it works
Hold out a calibration set of \(n\) points, score how “surprising” each point is, and take the \(\lceil (n+1)(1-\alpha)\rceil\)-th smallest of those scores (the empirical quantile with a finite-sample correction) to size a band around new predictions. That is the whole method: model-agnostic, a few lines of code, and it delivers exactly what it promises.
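The recipe above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the sine-plus-noise data, the cubic polynomial model, and the helper names `conformal_quantile` and `make_data` are all assumptions made for the example, and the score is the plain absolute residual.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    # Too few calibration points for this alpha: only an infinite band is valid.
    return np.inf if k > n else np.sort(scores)[k - 1]

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical toy problem: a sine curve with Gaussian noise.
    x = rng.uniform(-3, 3, size=n)
    return x, np.sin(x) + 0.3 * rng.standard_normal(n)

# 1. Fit any model on a training split (a cubic polynomial here).
x_train, y_train = make_data(500)
coef = np.polyfit(x_train, y_train, deg=3)
predict = lambda x: np.polyval(coef, x)

# 2. Score a held-out calibration split by absolute residual.
x_cal, y_cal = make_data(1000)
q = conformal_quantile(np.abs(y_cal - predict(x_cal)), alpha=0.1)

# 3. Band new predictions by +/- q and check coverage on fresh data.
x_test, y_test = make_data(5000)
lower, upper = predict(x_test) - q, predict(x_test) + q
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"half-width {q:.3f}, empirical coverage {coverage:.3f}")
```

The band has the same half-width `q` everywhere, which is a property of the absolute-residual score, not of conformal prediction in general; a different score would give a differently shaped set with the same marginal guarantee.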
2 · Reading the guarantee: the fine print
The guarantee is marginal: an average over inputs, and an average can flatter. Conformal’s 90% is a genuine figure that can still sit near 100% where the data are easy and well below it where they are hard. The gap to per-instance (conditional) coverage is not a tuning problem; it is a theorem, and knowing it is what separates using the tool well from being surprised by it.
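A small simulation makes the gap concrete. The sketch below assumes a deliberately simple setup that is not from the guide: the true mean is zero everywhere, the noise is five times larger for \(x \ge 0\) than for \(x < 0\), and the score is \(|y|\). The names `make_data`, `easy`, and `hard` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    # Hypothetical setup: noise is small for x < 0 and large for x >= 0.
    x = rng.uniform(-1, 1, size=n)
    sigma = np.where(x < 0, 0.2, 1.0)
    return x, sigma * rng.standard_normal(n)

# The true mean is zero everywhere, so the "model" predicts 0 and the
# nonconformity score is simply |y|.
x_cal, y_cal = make_data(2000)
scores = np.sort(np.abs(y_cal))
k = int(np.ceil((len(scores) + 1) * 0.9))  # finite-sample rank for 90%
q = scores[k - 1]

# Coverage on fresh data, overall and split by region.
x_test, y_test = make_data(20000)
covered = np.abs(y_test) <= q
marginal = covered.mean()
easy = covered[x_test < 0].mean()
hard = covered[x_test >= 0].mean()
print(f"marginal {marginal:.3f}, easy {easy:.3f}, hard {hard:.3f}")
```

In this setup the single threshold `q` is far wider than the easy region needs and narrower than the hard region needs, so the two local coverages sit on either side of the marginal figure while still averaging out to it.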
Marginal vs. conditional coverage
90% on average, while over-covering the easy inputs and under-covering the hard ones. Drag a window and read the local coverage.
The price of conditional coverage
Chase per-\(x\) coverage by localizing, and the intervals diverge to the whole real line. Lei & Wasserman (2014).
Subgroup coverage buys only a wider band
Protect every subgroup distribution-free and all you get is a flat, inflated band, never adaptivity. Barber et al. (2021).
3 · The one thing it won’t do: improve your forecast
The most common hope is that conformalizing a model makes its uncertainty better. It doesn’t, and seeing exactly why is worth a few minutes. The coverage certificate is independent of forecast quality: the same 90% attaches to a brilliant model and a useless one, and you can hold coverage fixed while the predictive log-likelihood does anything at all. Sharpness comes from the model; conformal supplies the certificate on top.
This is the kind of pushback you might get, or want to make yourself, and it is worth being able to answer:
If the model is miscalibrated, fix the model. If the probabilities are miscalibrated, calibrate the probabilities. If the quantiles are wrong, estimate the quantiles better. Why should I be impressed that someone looked at held-out residuals and widened intervals until empirical coverage matches a target?
The fair answer: you shouldn’t be impressed by the widening as a forecast, but it does buy one thing nothing else does: a finite-sample, distribution-free guarantee that the band covers, under almost no assumptions. That certificate is the product. The sharper forecast underneath is still the model’s job.
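To see that the certificate is indifferent to forecast quality, one can conformalize a good predictor and a useless one on the same problem. The sketch below is a toy under stated assumptions: a linear signal with Gaussian noise, an absolute-residual score, and a hypothetical helper `conformalize` that returns coverage and band width.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    # Hypothetical problem: strong linear signal plus modest noise.
    x = rng.uniform(-2, 2, size=n)
    return x, 3.0 * x + 0.5 * rng.standard_normal(n)

def conformalize(predict, alpha=0.1, n_cal=1000, n_test=5000):
    """Return (coverage, band width) for absolute-residual split conformal."""
    x_cal, y_cal = make_data(n_cal)
    scores = np.sort(np.abs(y_cal - predict(x_cal)))
    q = scores[int(np.ceil((n_cal + 1) * (1 - alpha))) - 1]
    x_test, y_test = make_data(n_test)
    coverage = np.mean(np.abs(y_test - predict(x_test)) <= q)
    return coverage, 2 * q

good_cov, good_width = conformalize(lambda x: 3.0 * x)         # knows the signal
bad_cov, bad_width = conformalize(lambda x: np.zeros_like(x))  # ignores x
print(f"good model: coverage {good_cov:.3f}, width {good_width:.2f}")
print(f"useless model: coverage {bad_cov:.3f}, width {bad_width:.2f}")
```

Both predictors land near the 90% target; the difference shows up only in the width, which is the part the model is responsible for.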
The fence is the horizon
A deliberately useless predictor still reaches the 90% coverage target. Validity is not evidence of a good model.
Coverage ⊥ log-score
Pin marginal coverage at 90% and move the predictive log-likelihood up and down at will. The two are independent axes.
Conformal vs. recalibration
Want a calibrated forecast? Recalibration improves the score and returns a density. Conformal re-levels coverage and returns a set. Both are useful, for different jobs.
The coverage–score plane
The synthesis: conformalizing slides a model sideways to honest coverage; only a better conditional model lifts it toward the oracle. The leftover height is the information gap.
4 · Time series & distribution shift
The base guarantee needs exchangeability, which time series break. Conformal coverage can hold beautifully, right up until the distribution shifts out from under it. The adaptive variants (ACI, conformal PID, EnbPI) genuinely help, but they recover a long-run average coverage, not coverage for the step in front of you. That is useful to know before you deploy.
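The long-run nature of the adaptive guarantee is easiest to see in the ACI update itself, which nudges the working miscoverage level after every step: \(\alpha_{t+1} = \alpha_t + \gamma(\alpha - \mathrm{err}_t)\). The sketch below is a simplified illustration, not the published algorithm in full: it works directly on a stream of scores, uses a sliding window of the last 200 scores as the calibration set, and the noise scale tripling halfway through is an assumed shift. The function name `aci_errors` and all parameter values are choices made for the example.

```python
import numpy as np

def aci_errors(scores, alpha=0.1, gamma=0.01, warmup=100, window=200):
    """Run the ACI update on a score stream; return the 0/1 miss sequence."""
    alpha_t, errs = alpha, []
    for t in range(warmup, len(scores)):
        past = scores[max(0, t - window):t]
        if alpha_t <= 0:
            q = np.inf       # asked for full coverage: infinite band
        elif alpha_t >= 1:
            q = -np.inf      # asked for no coverage: empty set
        else:
            q = np.quantile(past, 1 - alpha_t)
        err = float(scores[t] > q)
        errs.append(err)
        # Miss -> lower alpha_t (wider bands); cover -> raise it slightly.
        alpha_t += gamma * (alpha - err)
    return np.array(errs)

rng = np.random.default_rng(3)
T, warmup = 3000, 100
scale = np.where(np.arange(T) < T // 2, 1.0, 3.0)  # noise triples halfway
scores = np.abs(scale * rng.standard_normal(T))

# Static baseline: threshold frozen from the warm-up scores.
q_static = np.quantile(scores[:warmup], 0.9)
static_err = (scores[warmup:] > q_static).astype(float)

aci_err = aci_errors(scores, warmup=warmup)
shift = T // 2 - warmup  # index of the first post-shift step
print(f"static miss rate after shift: {static_err[shift:].mean():.3f}")
print(f"ACI long-run miss rate:       {aci_err.mean():.3f}")
print(f"ACI miss rate, 50 steps after shift: {aci_err[shift:shift + 50].mean():.3f}")
```

The long-run ACI miss rate is pinned near \(\alpha\) by the update rule itself, whatever the data do. The last printed figure looks only at the steps immediately after the shift, which is where a long-run average offers no protection.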
5 · Does it help in practice?
One small, reproducible worked example runs the actual libraries (MAPIE, crepes) and the strong time-series methods against probabilistic baselines and an oracle, scored by conditional coverage and proper scores (CRPS, interval/Winkler). It is not a benchmark, just a controlled experiment on synthetic data with a known answer. It isolates one clean point: the adaptive conformal methods do help, but the help comes from the conditional model they wrap. Raw quantile regression with no conformal step ties conformalized quantile regression; the conformal layer adds the marginal certificate on top. See the example →
6 · When coverage is the goal: the litmus test
Everything so far was about reading the guarantee carefully. The reward for that care is knowing precisely when conformal prediction is not merely valid but exactly the right tool. It comes down to a single question:
Is your loss a function of whether the truth lands in a set, or of where it lands? When the answer is “whether,” coverage is the objective, the set is the deliverable, and conformal prediction’s distribution-free guarantee is hard to beat.
That “whether” column is large and high-stakes, and it is where conformal prediction earns its keep. The next sections walk through it: four families you can play with right here, and several more covered in depth on the applications page. In all of them the marginal guarantee is not a consolation prize; it is the whole product.
7 · Prediction sets & selective triage
In classification, conformal prediction returns a set of labels guaranteed to contain the true one at your chosen rate, and the set’s size adapts to difficulty: a single confident label where one class dominates, a short ranked shortlist where the input is genuinely ambiguous. That is exactly the signal you want for human-in-the-loop triage: answer automatically when you can, defer the ambiguous cases to an expert, and bound how often the wrong shortlist is handed over. The adaptivity that was a caveat for coverage in Section 2 becomes the feature here, because now the set is what you ship.
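A minimal sketch of a conformal label set, assuming the simplest common score, one minus the probability assigned to the true class. The simulated classifier below is hypothetical: half its inputs get peaked probabilities, half get nearly flat ones, and labels are drawn from the stated probabilities so the toy model is perfectly calibrated (an assumption of the example, not a requirement of the method).

```python
import numpy as np

rng = np.random.default_rng(4)
K = 5  # number of classes

def make_data(n):
    # Hypothetical classifier output: peaked probabilities on easy
    # inputs, nearly flat probabilities on ambiguous ones.
    easy = rng.random(n) < 0.5
    logits = rng.standard_normal((n, K)) * np.where(easy, 5.0, 0.5)[:, None]
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Draw each label from its row of probabilities (inverse-CDF sampling).
    u = rng.random(n)
    labels = np.minimum((probs.cumsum(axis=1) < u[:, None]).sum(axis=1), K - 1)
    return probs, labels, easy

alpha = 0.1
probs_cal, y_cal, _ = make_data(2000)
# Score: one minus the probability given to the true label.
scores = np.sort(1.0 - probs_cal[np.arange(len(y_cal)), y_cal])
q = scores[int(np.ceil((len(scores) + 1) * (1 - alpha))) - 1]

probs_test, y_test, easy_test = make_data(5000)
in_set = (1.0 - probs_test) <= q  # boolean mask: row = input, column = label
coverage = in_set[np.arange(len(y_test)), y_test].mean()
size = in_set.sum(axis=1)
print(f"coverage {coverage:.3f}")
print(f"mean set size: easy {size[easy_test].mean():.2f}, "
      f"ambiguous {size[~easy_test].mean():.2f}")
```

A triage rule could then be as simple as answering automatically when `size == 1` and deferring otherwise; where exactly to draw that line is a product decision, not something the certificate decides.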
8 · Anomaly & out-of-distribution detection
This is conformal prediction at its cleanest, working as a test rather than a forecaster. A conformal \(p\)-value of a genuine inlier is (super-)uniform, so flagging whenever \(p\le\alpha\) keeps the false-alarm rate at no more than \(\alpha\): distribution-free, finite-sample, with no threshold to tune. Type-I error control is coverage, and coverage is what conformal certifies. Detection power is the score’s job; the false-alarm budget is guaranteed.
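The conformal \(p\)-value is a rank: one plus the number of calibration scores at least as large as the test score, divided by \(n+1\). The sketch below assumes a toy detector whose score is simply \(|x|\), with inliers drawn from a standard normal and outliers from a normal centred at 4; the function name `conformal_p_values` and the simulation parameters are illustrative.

```python
import numpy as np

def conformal_p_values(cal_scores, test_scores):
    """p = (1 + #{calibration scores >= test score}) / (n + 1)."""
    cal_sorted = np.sort(cal_scores)
    n = len(cal_sorted)
    # searchsorted(side="left") counts calibration scores strictly below.
    n_at_least = n - np.searchsorted(cal_sorted, test_scores, side="left")
    return (1 + n_at_least) / (n + 1)

rng = np.random.default_rng(5)
alpha = 0.05

# Hypothetical anomaly score: distance from the origin.
cal_scores = np.abs(rng.standard_normal(1000))           # known inliers
inlier_scores = np.abs(rng.standard_normal(5000))        # fresh inliers
outlier_scores = np.abs(4.0 + rng.standard_normal(5000)) # shifted points

false_alarm = np.mean(conformal_p_values(cal_scores, inlier_scores) <= alpha)
power = np.mean(conformal_p_values(cal_scores, outlier_scores) <= alpha)
print(f"false-alarm rate {false_alarm:.3f}, detection rate {power:.3f}")
```

The false-alarm rate stays close to the budget regardless of the score; the detection rate is high here only because this particular score separates the two populations well.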
9 · Retrieval & screening: guaranteed recall
Drug-candidate triage, document retrieval, fraud review: the deliverable is a shortlist, and the thing that must not happen is dropping the real hit. That is a recall guarantee, which is coverage in disguise. One conformal threshold on the model’s score, calibrated on known hits, guarantees a new hit survives the cut at least \(1-\alpha\) of the time, with no assumptions about how scores are distributed. A sharper model does not change the guarantee; it shrinks the list you have to read.
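A sketch of the recall threshold, assuming higher scores mean "more likely a hit". Taking the \(\lfloor \alpha(n+1) \rfloor\)-th smallest score among \(n\) calibration hits as the cut-off means a new, exchangeable hit falls below it with probability at most \(\alpha\). The Gaussian score distributions and the helper names `recall_threshold` and `shortlist_stats` are assumptions made for the example.

```python
import numpy as np

def recall_threshold(hit_scores, alpha):
    """Cut-off that a new hit clears with probability >= 1 - alpha."""
    n = len(hit_scores)
    k = int(np.floor(alpha * (n + 1)))
    # With too few calibration hits, only "keep everything" is valid.
    return -np.inf if k == 0 else np.sort(hit_scores)[k - 1]

rng = np.random.default_rng(6)

def shortlist_stats(separation, alpha=0.1):
    # Hypothetical model: non-hits score ~ N(0, 1), hits ~ N(separation, 1).
    cal_hits = separation + rng.standard_normal(500)
    cut = recall_threshold(cal_hits, alpha)
    new_hits = separation + rng.standard_normal(5000)
    non_hits = rng.standard_normal(5000)
    recall = np.mean(new_hits >= cut)  # hits that survive the cut
    kept = np.mean(non_hits >= cut)    # non-hits that clutter the shortlist
    return recall, kept

weak_recall, weak_kept = shortlist_stats(separation=2.0)
sharp_recall, sharp_kept = shortlist_stats(separation=4.0)
print(f"weak model:  recall {weak_recall:.3f}, non-hits kept {weak_kept:.3f}")
print(f"sharp model: recall {sharp_recall:.3f}, non-hits kept {sharp_kept:.3f}")
```

Both models meet the recall target; the sharper one lets far fewer non-hits through, which is the sense in which model quality buys a shorter list and not a stronger guarantee.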
10 · Safety in robotics & control
Robotics and control are a natural home for conformal prediction, because the question is almost always containment: will the true state lie inside a region I can plan around? Calibrate a state predictor’s error and you get a distribution-free safety tube around the nominal trajectory; if the tube clears the obstacle, you hold a certificate that the system stays out of the keep-out zone at the chosen confidence. A controller rarely needs a sharp forecast of the next state; it needs a region it can prove it stays within.
11 · Language models & structured outputs
The same idea tames generative models. Emit a set of candidate answers guaranteed to contain the correct one with high probability, or abstain when that set grows too large to be useful: the result is a distribution-free handle on hallucination and a principled deferral rule. Conformal factuality, calibrated selective generation, and coverage-guaranteed retrieval-augmented pipelines all rest on the same certificate. The applications page walks through the methods and the papers that apply them carefully.
12 · Risk control, deferral & audit
Sometimes the deliverable is the certificate itself. Conformal risk control and Learn-then-Test extend the machinery from coverage to any monotone risk (a bounded miss rate, false-negative rate, or expected loss), chosen in advance and honoured on fresh data. For compliance and audit, “contained 95% of the time, under almost no assumptions” is exactly the kind of guarantee little else can promise so cheaply. The applications page covers these in full depth, with citations, along with scientific discovery, causal inference, survival analysis, and medical imaging, where coverage is just as central.
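As a rough sketch of how risk control generalises the quantile step, the example below follows the conformal risk control selection rule as given by Angelopoulos et al. (2024): pick the smallest \(\lambda\) with \(\big(n\,\hat{R}(\lambda) + B\big)/(n+1) \le \alpha\), where the loss is non-increasing in \(\lambda\) and bounded by \(B\). The multi-label problem, the false-negative loss, the grid over \(\lambda\), and all function names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
K = 10  # number of candidate labels per example

def make_data(n):
    # Hypothetical multi-label problem: each label is present with
    # exactly the probability the model assigns it.
    probs = rng.beta(0.5, 0.5, size=(n, K))
    present = rng.random((n, K)) < probs
    return probs, present

def false_negative_rate(probs, present, lam):
    """Per-example fraction of true labels missed by {k : p_k >= 1 - lam}."""
    missed = present & (probs < 1.0 - lam)
    return missed.sum(axis=1) / np.maximum(present.sum(axis=1), 1)

def calibrate_lambda(probs, present, alpha, grid):
    n = len(probs)
    for lam in grid:  # ascending: larger lam means bigger sets, lower loss
        risk = false_negative_rate(probs, present, lam).mean()
        # The loss is bounded by B = 1, hence the "+ 1" inflation term.
        if (n * risk + 1.0) / (n + 1) <= alpha:
            return lam
    return grid[-1]  # lam = 1 keeps every label, so the loss is zero

alpha = 0.1
grid = np.linspace(0.0, 1.0, 101)
lam_hat = calibrate_lambda(*make_data(1000), alpha, grid)
test_risk = false_negative_rate(*make_data(5000), lam_hat).mean()
print(f"lambda {lam_hat:.2f}, false-negative rate on fresh data {test_risk:.3f}")
```

With a 0/1 miscoverage loss this rule reduces to the usual conformal quantile, which is one way to read "extend the machinery from coverage to any monotone risk".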
In every one of these the skill is the same: recognising which column your problem is in. When it is “whether,” reach for conformal prediction with confidence.
13 · Objections & FAQ
Is it useless? Does conformalizing improve my model? Can I get conditional coverage? Does it work for time series? The common objections, and straight answers, live on the FAQ page.
The one-sentence version
Conformal prediction certifies the coverage of a set; it does not estimate a distribution. Use it, happily, whenever you need a coverage guarantee. Just don’t expect the certificate to make the forecast underneath it any sharper; that work belongs to the model. Genuinely useful, then, but, in the exact sense of that average over inputs, only marginally so.
Key references
- Vovk, Gammerman & Shafer (2005). Algorithmic Learning in a Random World. Springer.
- Angelopoulos & Bates (2023). Conformal prediction: a gentle introduction. Foundations and Trends in Machine Learning 16(4). (The friendly modern survey.)
- Lei & Wasserman (2014). Distribution-free prediction bands for non-parametric regression. JRSS-B 76(1):71–96.
- Foygel Barber, Candès, Ramdas & Tibshirani (2021). The limits of distribution-free conditional predictive inference. Information and Inference 10(2).
- Romano, Patterson & Candès (2019). Conformalized quantile regression. NeurIPS.
- Angelopoulos, Bates, Fisch, Lei & Schuster (2024). Conformal risk control. ICLR.
- Gneiting, Balabdaoui & Raftery (2007). Probabilistic forecasts, calibration and sharpness. JRSS-B 69(2).
The full bibliography is in paper/references.bib.