Conformal Prediction Demonstrations

Demonstrations of some principles of conformal prediction and its connection to other areas of mathematics

About this site

This site began as a hunt for anything in conformal prediction that might improve the out-of-sample performance of autonomous time-series algorithms, the kind that run unattended at skaters. The ways in which work under the general “conformal” banner has helped are documented there. It grew into a dozen or so interactive demonstrations and a few notes. The topic is extremely interesting mathematically, with surprising connections to things like Steinitz balancing, herding, Riemann rearrangement, parimutuel rent, and the Kerns–Székely representation of finite exchangeable collections. That is just a shortlist of things the community might not have noticed yet; there is plenty it has. I strongly encourage the mathematically minded person to invest some time in conformal prediction.

Admittedly this site has a minor identity crisis. It mixes elementary demonstrations and common-sense, well-meaning advice for the time-series prediction community (which does not seem to be the exact sweet spot for conformal prediction, due to the “information gap”) with novel insights into conformal prediction that may find use in applications outside the author’s wheelhouse. You might say that, in the spirit of conformal prediction itself, the treatment here is highly uneven, both in rigor and in sophistication. Some arguments are tight. Others are loose. You have no way of knowing which of the two the next statement to arrive will be.

1 · How Split Conformal Prediction works

Hold out a calibration set, score how “surprising” each point is, and take the empirical quantile of those scores to size a band around new predictions. All of it fits in a few lines of model-agnostic code, and it delivers the following guarantee.

Given exchangeable data, a model, and a nonconformity score, split conformal prediction returns a set $C(X)$ with $$\mathbb{P}\big(Y_{n+1}\in C(X_{n+1})\big)\;\ge\;1-\alpha.$$ A real and remarkable guarantee. The art is in reading it: the probability averages over the input $X_{n+1}$, so it describes the average case, not necessarily the one in front of you.

The guarantee is due to Papadopoulos, Proedrou, Vovk & Gammerman (2002), in the tradition of Vovk, Gammerman & Shafer (2005).
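The recipe above really does fit in a few lines. Here is a minimal sketch, assuming absolute residuals as the nonconformity score; the data source `draw_point` and the stand-in model `predict` are hypothetical and exist only so the example runs.

```python
import math
import random

def conformal_radius(cal_scores, alpha):
    """k-th smallest calibration score, with k = ceil((1 - alpha) * (n + 1))."""
    n = len(cal_scores)
    k = math.ceil((1 - alpha) * (n + 1))
    if k > n:                      # too few calibration points for this alpha
        return math.inf            # the only valid band is the whole line
    return sorted(cal_scores)[k - 1]

def draw_point(rng):
    """Hypothetical data source: y = 2x + Gaussian noise."""
    x = rng.uniform(-1.0, 1.0)
    return x, 2.0 * x + rng.gauss(0.0, 1.0)

def predict(x):
    """Stand-in for any pre-fitted model; deliberately imperfect."""
    return 1.5 * x

def run_split_conformal(n_cal=1000, n_test=20000, alpha=0.1, seed=0):
    rng = random.Random(seed)
    # Nonconformity score: absolute residual on the held-out calibration set.
    cal = (draw_point(rng) for _ in range(n_cal))
    cal_scores = [abs(y - predict(x)) for x, y in cal]
    q = conformal_radius(cal_scores, alpha)
    # The band for a new x is [predict(x) - q, predict(x) + q]; count how often it covers.
    test = [draw_point(rng) for _ in range(n_test)]
    coverage = sum(abs(y - predict(x)) <= q for x, y in test) / n_test
    return q, coverage

q, coverage = run_split_conformal()
print(f"radius {q:.3f}, empirical coverage {coverage:.3f}")
```

The model never enters the coverage argument: swap `predict` for anything fitted on separate data and the printed coverage should stay near the target, while the radius changes.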

How split conformal works

Build a prediction band from nonconformity scores and watch it hit your target coverage.

2 · Reading the guarantee: the fine print

The guarantee is marginal: an average over inputs, and an average can flatter. Conformal’s 90% is a genuine figure that can still sit near 100% where the data are easy and well below it where they are hard. The gap to per-instance (conditional) coverage is not a tuning problem; it is a theorem, and knowing it is what separates using the tool well from being surprised by it.

Marginal vs. conditional coverage

90% on average, while over-covering the easy inputs and under-covering the hard ones. Drag a window and read the local coverage.
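A small simulation makes the averaging visible. This is a sketch under an assumed toy setup, not the demo's code: the noise is quiet on the left half of the input range and loud on the right (the function `noise_sd` is hypothetical), and one constant-width band is calibrated for both.

```python
import math
import random

def noise_sd(x):
    """Hypothetical heteroscedastic noise: quiet on the left half, loud on the right."""
    return 0.2 if x < 0.5 else 2.0

def sample(rng, n):
    xs = [rng.random() for _ in range(n)]
    return [(x, rng.gauss(0.0, noise_sd(x))) for x in xs]

def coverage_by_region(n_cal=4000, n_test=20000, alpha=0.1, seed=1):
    rng = random.Random(seed)
    # The point prediction is 0 everywhere, so the score is just |y|.
    cal_scores = sorted(abs(y) for _, y in sample(rng, n_cal))
    q = cal_scores[math.ceil((1 - alpha) * (n_cal + 1)) - 1]
    hits = {"easy": [], "hard": []}
    for x, y in sample(rng, n_test):
        hits["easy" if x < 0.5 else "hard"].append(abs(y) <= q)
    easy = sum(hits["easy"]) / len(hits["easy"])
    hard = sum(hits["hard"]) / len(hits["hard"])
    marginal = (sum(hits["easy"]) + sum(hits["hard"])) / n_test
    return marginal, easy, hard

marginal, easy, hard = coverage_by_region()
print(f"marginal {marginal:.3f} | easy half {easy:.3f} | hard half {hard:.3f}")
```

In this setup the marginal figure lands near 90% because the easy half is covered almost always and the hard half only about four times in five.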

no-go

The price of conditional coverage

Chase per-$x$ coverage by localizing, and the intervals diverge to the whole real line. Lei & Wasserman (2014).

no-go

Subgroup coverage buys only a wider band

Protect every subgroup distribution-free and all you get is a flat, inflated band, never adaptivity. Barber et al. (2021).

The coverage lottery

The guarantee also averages over calibration sets, and you calibrated once. Your realized coverage is a Beta draw: at $n=100$ the middle 95% of draws spans roughly 84–95%, and an informed counterparty collects the difference until $n$ reaches the tens of thousands.
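The lottery can be checked numerically. A minimal sketch, assuming continuous scores so that realized coverage follows Beta(k, n+1−k) with k = ⌈(1−α)(n+1)⌉; the helper name `realized_coverage_interval` is made up for this example.

```python
import math
import random

def realized_coverage_interval(n, alpha=0.1, draws=50_000, seed=0):
    """Middle 95% of realized coverage, sampled from Beta(k, n + 1 - k)."""
    k = math.ceil((1 - alpha) * (n + 1))
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(k, n + 1 - k) for _ in range(draws))
    return samples[int(0.025 * draws)], samples[int(0.975 * draws)]

for n in (100, 1000, 10_000):
    lo, hi = realized_coverage_interval(n)
    print(f"n = {n:>6}: middle 95% of realized coverage is roughly {lo:.3f} to {hi:.3f}")
```

The spread shrinks like one over the square root of n, which is why a calibration set of a hundred points leaves so much room between the nominal figure and the one you actually drew.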

3 · A tougher challenge

The most common hope is that conformalizing a model makes its uncertainty better. That can lead to disappointment, and seeing exactly why is worth a few minutes. The coverage certificate is independent of forecast quality: the same 90% attaches to a brilliant model and an uninformative one, and you can hold coverage fixed while the predictive log-likelihood does anything at all. Sharpness comes from the model; conformal supplies the certificate on top.

The fence is the horizon

A deliberately uninformative predictor still reaches exactly 90% coverage. Validity, by itself, is not evidence about the model.
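To see the certificate attach equally to a good model and a useless one, here is a sketch on assumed synthetic data (y = 3x plus noise, a hypothetical choice). Both models are conformalized on the same calibration set; only the width of the band tells them apart.

```python
import math
import random

def conformalize(predict, cal, alpha):
    """Split-conformal radius for absolute-residual scores."""
    scores = sorted(abs(y - predict(x)) for x, y in cal)
    return scores[math.ceil((1 - alpha) * (len(cal) + 1)) - 1]

def compare_models(n_cal=2000, n_test=20000, alpha=0.1, seed=2):
    rng = random.Random(seed)

    def draw():
        x = rng.uniform(-1.0, 1.0)
        return x, 3.0 * x + rng.gauss(0.0, 0.3)   # hypothetical data: y = 3x + noise

    models = {
        "informative": lambda x: 3.0 * x,          # knows the true mean
        "uninformative": lambda x: 0.0,            # ignores the input entirely
    }
    cal = [draw() for _ in range(n_cal)]
    test = [draw() for _ in range(n_test)]
    report = {}
    for name, predict in models.items():
        q = conformalize(predict, cal, alpha)
        cov = sum(abs(y - predict(x)) <= q for x, y in test) / n_test
        report[name] = (cov, 2 * q)                # (coverage, band width)
    return report

for name, (cov, width) in compare_models().items():
    print(f"{name:>13}: coverage {cov:.3f}, band width {width:.2f}")
```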

Coverage ⊥ log-score

Pin marginal coverage at 90% and move the predictive log-likelihood up and down at will. The two are independent axes.

Conformal vs. recalibration

Want a calibrated forecast? Recalibration improves the score and returns a density. Conformal re-levels coverage and returns a set. Both are useful, for different jobs.

Same coverage, different price

Two forecasts share an identical 90% interval; an out-of-the-money option priced under each differs by orders of magnitude. Coverage is blind to the tail shape that sets the price.
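A closed-form illustration of the same point, under stylized assumptions that are mine and not the demo's: one forecast is standard normal, the other is a Laplace law scaled to share the same central 90% interval, and the "price" is simply the undiscounted expected payoff E[(X − K)+] at a far strike K.

```python
import math
from statistics import NormalDist

Z90 = NormalDist().inv_cdf(0.95)     # half-width of the central 90% interval of N(0, 1)
LAPLACE_B = Z90 / math.log(10.0)     # Laplace scale chosen so that P(|X| > Z90) = 0.10

def normal_call(strike):
    """E[(X - K)+] for X ~ N(0, 1): pdf(K) - K * P(X > K)."""
    pdf = math.exp(-0.5 * strike ** 2) / math.sqrt(2.0 * math.pi)
    tail = 0.5 * math.erfc(strike / math.sqrt(2.0))
    return pdf - strike * tail

def laplace_call(strike, b=LAPLACE_B):
    """E[(X - K)+] for X ~ Laplace(0, b), valid for K >= 0."""
    return 0.5 * b * math.exp(-strike / b)

for strike in (2.0, 4.0, 6.0):
    ratio = laplace_call(strike) / normal_call(strike)
    print(f"strike {strike}: Laplace price / normal price = {ratio:,.0f}")
```

Both forecasts would pass the same 90% coverage check, yet the ratio of prices grows rapidly with the strike, because coverage never looks past the interval's endpoints.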

4 · The Information Gap

Coverage is orthogonal to quality, so a model can hit its coverage and still forecast poorly. How poorly? The shortfall has a name and a closed form, the residual-information gap. The demonstrations in this section take it apart: where it sits, and what it equals.

A note on the name

“Information gap” is not (yet) standard terminology. It is a rigorous observation, and the key one that helped the author understand the limitations of conformal prediction in a crisp manner, not to mention sort through the literature (see the map). The identity is proved in Marginally Useful, and the coverage–score plane shows it as a height no conformal step can climb.

The coverage–score plane

The synthesis: conformalizing slides a model sideways to nominal coverage; only a better conditional model lifts it toward the oracle. The leftover height is the information gap.

One number, five pictures

The information gap read five ways — false pooling, an average log Bayes factor, non-uniform conformal ranks, a projection onto independence, a Kelly betting rent. One slider, five synchronized panels, the same number.

Crossing the gap, in the limit

The strongest counterpoint in the literature, run live: Vovk’s universally consistent conformal predictive system adapts its shape as histogram cells shrink, and its distance to the true conditional law descends while the single-shape system plateaus. Valid at every $n$; sharp only in the limit.

The price of a dumb model

On the real motorcycle-crash data: a crude regressogram and a smooth kernel density estimate, both conformalized to the same 90%. Both cover; the regressogram is a quarter wider and a quarter worse on CRPS, at every cell width. The bill for reaching for the dumb model to get a guarantee you could have had anyway.

5 · Exchangeability: de Finetti and the signed corner

A different perspective, on the one assumption many results are based on. Exchangeability is what de Finetti’s theorem describes, and its finite form lets the mixing measure go negative. The sign of that measure is a separate axis from the information gap above: it decides whether conformal’s per-case coverage stays near nominal or quietly fans.

de Finetti, and where it goes negative

Exchangeable laws as mixtures of i.i.d. ones. Drag the correlation negative and watch the mixing measure turn signed — a Feynman–Wigner negative probability, exact but un-samplable.

A Thurstone contest at the −1/n floor

A field of competitors sits at the negative-association floor. Conformalize a relative score: marginal coverage holds, but per-case coverage fans — it covers the mid-pack and misses the extremes.

The conformal fan: dependence sets the variance

Drag the cross-sample correlation from the floor to comonotone and watch the realized-coverage distribution breathe: negative dependence narrows the fan (to zero at the floor), positive widens it. The inequalities of the fan note, live.

Conformal as Bayesian quadrature

The calibration scores induce a Beta(k, n+1−k) posterior over your realized coverage; the conformal guarantee is just its mean. Snell & Griffiths 2025, with the Monte Carlo check overlaid.

6 · The mechanism: balanced placements

Where the 1/n comes from. Conformal validity is a counting statement: rotate the “test” label through the slots and exactly k of n+1 placements are covered, for any distinct scores; exchangeability enters once, to make the realized placement an even draw. The exact dual construction under the same permutation action is Steinitz balancing, the ordering of a zero-sum population with bounded prefix sums; herding (Welling 2009) is its with-replacement relaxation, and the demo below runs it. The note has the exact statements.

How this section came about

Thanks to Peter Urbani for pointing me to the result behind the dependence tax. It was that 1/n rate that made me realize there had to be a contragredient relationship to herding, something I had briefly worked on with Max Welling many years ago. Steinitz balancing later turned out to be the more precise dual; the note has the exact statement.

Moving the balls, moving the slots

Herding steers points to balance a target at rate 1/n. Conformal rotates the “test” label over fixed scores, and exactly k of n+1 placements are covered, for any distinct scores.
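The counting statement needs no probability at all, so it can be checked by brute force. A minimal sketch: take any n+1 distinct scores, let each slot play the test point in turn, and use the k-th smallest of the remaining n as the threshold. The function name `covered_placements` is illustrative.

```python
import random

def covered_placements(scores, k):
    """Count the slots that are covered when each, in turn, plays the test point."""
    covered = 0
    for i, test_score in enumerate(scores):
        calibration = sorted(scores[:i] + scores[i + 1:])   # the other n scores
        covered += test_score <= calibration[k - 1]         # k-th smallest is the threshold
    return covered

rng = random.Random(3)
n, k = 49, 45                                  # n calibration slots, with 1 <= k <= n
for trial in range(3):
    scores = rng.sample(range(10 ** 6), n + 1)             # any distinct values will do
    print(f"trial {trial}: {covered_placements(scores, k)} of {n + 1} placements covered")
```

The count comes out at k whatever the scores are, since a slot is covered precisely when its score ranks among the k smallest of all n+1. Exchangeability is only needed afterwards, to say that the slot you actually hold is an even draw among the n+1.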

Steinitz balancing

Order a zero-sum population so every prefix sum stays inside a fixed ball. A random order wanders like √t; a balanced order does not wander at all.
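In one dimension the balancing can be done greedily, which gives a feel for the statement. This sketch is the scalar special case only, not the general Steinitz construction for vectors: always pick the next element whose sign opposes the running sum. The population of integers is made up for illustration.

```python
import random

def balanced_order(population):
    """Greedy 1-D balancing: step against the sign of the running sum."""
    pos = [v for v in population if v >= 0]
    neg = [v for v in population if v < 0]
    order, total = [], 0
    while pos or neg:
        take_negative = (total > 0 and neg) or not pos
        v = neg.pop() if take_negative else pos.pop()
        order.append(v)
        total += v
    return order

def max_prefix(seq):
    """Largest absolute prefix sum along the sequence."""
    total, worst = 0, 0
    for v in seq:
        total += v
        worst = max(worst, abs(total))
    return worst

rng = random.Random(4)
half = [rng.randint(-10, 10) for _ in range(2000)]
population = half + [-v for v in half]     # zero-sum by construction, every |v| <= 10
rng.shuffle(population)
print("random order  :", max_prefix(population))
print("balanced order:", max_prefix(balanced_order(population)))
```

For a zero-sum population bounded by 10 in absolute value, the greedy order keeps every prefix sum within 10, however long the sequence; the shuffled order drifts much further.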

Balanced placements

Conformal placements are such a population. The balanced word keeps the running placement-acceptance average within 1/(2t) of the orbit level k/(n+1) at every prefix, for any scores.

7 · Time series & distribution shift

Time series break exchangeability, but that actually isn’t such a big deal for coverage: plain split conformal degrades gracefully, with a coverage loss provably bounded by (dependence range) / (calibration size) for stationary series whose dependence vanishes beyond a finite lag, plus a mixing term when it merely fades (Barber & Pananjady 2026). The real question is which assumption replaces exchangeability, and what the resulting 90% averages over: stationarity plus mixing, block structure (permute blocks, not points), a consistent wrapped model (EnbPI, SPCI), or nothing at all for the online variants (ACI, conformal PID), which certify a long-run average along your one path. Each is an average, over paths or over time, not coverage for the step in front of you; the coverage games page sorts them, and a second page covers the attempts at conditional coverage.

The dependence tax

Barber & Pananjady’s Figure 1, live: split conformal on MA(t) noise loses coverage linearly in t/n — curves for n, 2n, 3n collapse onto one shallow line, an order of magnitude more gently than the worst case allows.

The time-series fan

Serial dependence moves both moments of realized coverage: the mean by a tax that vanishes with n, the dispersion by a long-run-variance factor that never does. Drag φ and watch the fan shift and widen.

Drift & time series

Turn on drift: fixed split conformal collapses, the adaptive methods recover the average, and a simple volatility model keeps pace on the proper score.
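A sketch of what "recover the average" means for the online variants. It assumes the commonly stated ACI update, in which the working level moves by a step γ after each hit or miss; the drifting series, the zero forecast, and the parameter values are hypothetical choices for illustration.

```python
import bisect
import math
import random

def aci_errors(y, preds, alpha=0.1, gamma=0.02, warmup=50):
    """Online ACI: returns the 0/1 miss indicator for every step after warm-up."""
    alpha_t, scores, errs = alpha, [], []
    for t, (y_t, p_t) in enumerate(zip(y, preds)):
        s = abs(y_t - p_t)
        if t >= warmup:
            if alpha_t <= 0:
                radius = math.inf                # level exhausted: cover everything
            elif alpha_t >= 1:
                radius = -math.inf               # empty set: a guaranteed miss
            else:
                k = min(len(scores), math.ceil((1 - alpha_t) * (len(scores) + 1)))
                radius = scores[k - 1]           # scores is kept sorted
            err = 0 if s <= radius else 1
            errs.append(err)
            alpha_t += gamma * (alpha - err)     # widen after a miss, tighten after a hit
        bisect.insort(scores, s)
    return errs

def fixed_errors(y, preds, alpha=0.1, warmup=50):
    """Split conformal calibrated once on the warm-up window and never updated."""
    cal = sorted(abs(a - b) for a, b in zip(y[:warmup], preds[:warmup]))
    q = cal[min(warmup, math.ceil((1 - alpha) * (warmup + 1))) - 1]
    return [0 if abs(a - b) <= q else 1 for a, b in zip(y[warmup:], preds[warmup:])]

rng = random.Random(5)
T = 3000
y = [rng.gauss(0.0, 1.0 + t / 300) for t in range(T)]   # hypothetical volatility drift
preds = [0.0] * T

for name, errs in (("fixed", fixed_errors(y, preds)), ("ACI", aci_errors(y, preds))):
    print(f"{name:>5}: long-run miss rate {sum(errs) / len(errs):.3f} (target 0.100)")
```

The ACI figure is pinned near the target by the update rule itself, with no assumption on the data. That is also its limit: it is a statement about the average along the path, not about any particular step.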

Laplace vs. conformal on a time series

Same coverage, same point forecast: a conditional model’s band breathes with the volatility while the conformal band stays rigid. They tie on CRPS; the conditional model wins on log-likelihood — it crosses the information gap.

8 · A minimalist example of using conformal wrappers

One small, reproducible worked example: not a benchmark, just a controlled experiment on synthetic data with a known answer. It runs the actual libraries (MAPIE, crepes) and the strong time-series methods against probabilistic baselines and an oracle, scored by conditional coverage and proper scores (CRPS, interval/Winkler). It isolates one clean point: the adaptive conformal methods do help, but the help comes from the conditional model they wrap. Raw quantile regression with no conformal step ties conformalized quantile regression, with the conformal layer adding the marginal certificate on top. See the example →

9 · When coverage is the goal: the litmus test?

Everything so far was about reading the guarantee carefully. One reward is a sense of when conformal prediction is a natural fit, not merely valid. A useful question:

Is your loss a function of whether the truth lands in a set, or of where it lands? When the answer is “whether,” coverage is the objective, the set is the deliverable, and a distribution-free guarantee of it is exactly what conformal prediction supplies.

A disclaimer

In keeping with the comments at the top of this site: applications where coverage is the goal are not the author’s specialty at all. The demonstrations that follow are self-education as much as anything, and should be read that way.

That “whether” column is large, and often high-stakes. The next sections walk through four families you can play with right here, with several more covered on the applications page. In each, the marginal guarantee is the product.

10 · Prediction sets & selective triage

In classification, conformal prediction returns a set of labels guaranteed to contain the true one at your chosen rate, and the set’s size adapts to difficulty: a single confident label where one class dominates, a short ranked shortlist where the input is genuinely ambiguous. That is exactly the signal you want for human-in-the-loop triage: answer automatically when you can, defer the ambiguous cases to an expert, and bound how often the wrong shortlist is handed over. The adaptivity that was a caveat for coverage in Section 2 becomes the feature here, because now the set is what you ship.

shines

Adaptive prediction sets

Singletons where the model is sure, two or three labels where it isn’t, a confidence signal with a coverage contract attached.
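A toy version of the set-size behaviour. This sketch uses the simplest score, one minus the probability assigned to the true label, which is not necessarily the score behind the demo. The two probability profiles `EASY` and `AMBIGUOUS` are hypothetical, and the model is assumed well calibrated.

```python
import math
import random

EASY = (0.98, 0.01, 0.01)        # hypothetical class probabilities: one dominant label
AMBIGUOUS = (0.5, 0.4, 0.1)      # two plausible labels

def draw_case(rng):
    kind = "easy" if rng.random() < 0.5 else "ambiguous"
    probs = list(EASY if kind == "easy" else AMBIGUOUS)
    rng.shuffle(probs)
    label = rng.choices(range(3), weights=probs)[0]   # truth drawn from the model's own probabilities
    return kind, probs, label

def prediction_set(probs, q):
    """Keep every label whose score 1 - p does not exceed the calibrated threshold."""
    return [c for c, p in enumerate(probs) if 1 - p <= q]

def run(n_cal=2000, n_test=10000, alpha=0.1, seed=6):
    rng = random.Random(seed)
    cal = (draw_case(rng) for _ in range(n_cal))
    cal_scores = sorted(1 - probs[label] for _, probs, label in cal)
    q = cal_scores[math.ceil((1 - alpha) * (n_cal + 1)) - 1]
    sizes, hits = {"easy": [], "ambiguous": []}, 0
    for _ in range(n_test):
        kind, probs, label = draw_case(rng)
        labels = prediction_set(probs, q)
        sizes[kind].append(len(labels))
        hits += label in labels
    return hits / n_test, {k: sum(v) / len(v) for k, v in sizes.items()}

coverage, mean_size = run()
print(f"coverage {coverage:.3f}, mean set size {mean_size}")
```

A triage rule then falls out directly: answer automatically on singletons, and route anything larger to a person.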

11 · Anomaly & out-of-distribution detection

This is conformal prediction at its strongest, it would seem, working as a test rather than a forecaster. A conformal $p$-value of a genuine inlier is (super-)uniform, so flagging whenever $p\le\alpha$ keeps the false-alarm rate at or below $\alpha$ (exactly $\lfloor\alpha(n+1)\rfloor/(n+1)$ for continuous scores): distribution-free, finite-sample, with no threshold to tune. Type-I error control is coverage, and coverage is what conformal certifies. Detection power is the score’s job; the false-alarm budget is guaranteed.

shines

Calibrated anomaly detection

Set a false-alarm budget and it is honoured regardless of the unknown normal distribution. The inlier $p$-values fill out a uniform histogram by construction.
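The p-value itself is one line. Below is a minimal sketch with a fresh calibration set on every trial, so the measured rate is the marginal one the guarantee speaks about; the Gaussian scores and the `shift` parameter for planted outliers are assumptions of the example.

```python
import random

def conformal_p_value(cal_scores, s):
    """Share of scores (calibration plus the test point itself) at least as large as s."""
    return (1 + sum(c >= s for c in cal_scores)) / (len(cal_scores) + 1)

def alarm_rate(shift, n_cal=54, alpha=0.1, trials=5000, seed=7):
    """Fraction of test points flagged; shift = 0 means the test point is a genuine inlier."""
    rng = random.Random(seed)
    alarms = 0
    for _ in range(trials):
        cal = [rng.gauss(0.0, 1.0) for _ in range(n_cal)]   # score = the raw value here
        test = rng.gauss(shift, 1.0)
        alarms += conformal_p_value(cal, test) <= alpha
    return alarms / trials

print(f"false-alarm rate on inliers: {alarm_rate(0.0):.3f} (theory: 5/55 = {5 / 55:.3f})")
print(f"detection rate at shift 3  : {alarm_rate(3.0):.3f}")
```

With n = 54 the theoretical rate is ⌊0.1 × 55⌋/55 = 5/55, a little under the 10% budget. A better score would raise the second number and leave the first where it is.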

12 · Retrieval & screening: guaranteed recall

Drug-candidate triage, document retrieval, fraud review: the deliverable is a shortlist, and the thing that must not happen is dropping the real hit. That is a recall guarantee, which is coverage under another name. One conformal threshold on the model’s score, calibrated on known hits, guarantees a new hit survives the cut at least $1-\alpha$ of the time, with no assumptions about how scores are distributed. A sharper model does not change the guarantee; it shrinks the list you have to read.

shines

Guaranteed recall

Coverage is recall when the product is a shortlist. Watch the recall guarantee hold flat while a better model buys a shorter list at the same recall.
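A sketch of the thresholding step, assuming higher scores mean "more likely a hit". The Gaussian score model and the `separation` parameter standing in for model quality are hypothetical.

```python
import math
import random

def recall_threshold(hit_scores, alpha):
    """Largest cut-off that a fresh hit still clears with probability >= 1 - alpha."""
    k = math.floor(alpha * (len(hit_scores) + 1))
    return sorted(hit_scores)[k - 1] if k >= 1 else -math.inf

def screen(separation, n_cal=1000, n_pool=5000, alpha=0.1, seed=8):
    rng = random.Random(seed)
    hit = lambda: rng.gauss(separation, 1.0)     # hypothetical score of a true hit
    dud = lambda: rng.gauss(0.0, 1.0)            # hypothetical score of a non-hit
    cut = recall_threshold([hit() for _ in range(n_cal)], alpha)
    recall = sum(hit() >= cut for _ in range(n_pool)) / n_pool
    kept_duds = sum(dud() >= cut for _ in range(n_pool)) / n_pool
    return recall, kept_duds

for name, separation in (("weak model", 1.0), ("sharp model", 3.0)):
    recall, kept = screen(separation)
    print(f"{name:>11}: recall {recall:.3f}, share of non-hits still on the list {kept:.3f}")
```

Recall sits near 90% for both models; what the sharper model buys is a much smaller share of non-hits surviving the cut.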

13 · Safety in robotics & control

Robotics and control are a natural home for conformal prediction, because the question is almost always containment: will the true state lie inside a region I can plan around? Calibrate a state predictor’s error and you get a distribution-free safety tube around the nominal trajectory; if the tube clears the obstacle, you hold a certificate that the system stays out of the keep-out zone at the chosen confidence. A controller rarely needs a sharp forecast of the next state; it needs a region it can prove it stays within.

shines

A certified safety envelope

A conformal tube around a trajectory, with a clearance certificate that flips to “safe” exactly when the guaranteed region clears the obstacle.
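A sketch of the tube and its certificate, under assumptions chosen for simplicity: the score is the largest tracking error over a whole trajectory (so the guarantee covers the full horizon at once), errors are independent Gaussian per step, and clearance is checked only at the waypoints. `HORIZON`, `SIGMA`, and the obstacle positions are made up.

```python
import math
import random

HORIZON, SIGMA = 10, 0.1       # hypothetical horizon and per-axis tracking noise

def max_tracking_error(rng):
    """Trajectory-level score: largest distance between true and predicted state."""
    return max(math.hypot(rng.gauss(0.0, SIGMA), rng.gauss(0.0, SIGMA))
               for _ in range(HORIZON))

def tube_radius(n_cal=500, alpha=0.1, seed=9):
    """Conformal quantile of the trajectory-level score over calibration runs."""
    rng = random.Random(seed)
    scores = sorted(max_tracking_error(rng) for _ in range(n_cal))
    return scores[math.ceil((1 - alpha) * (n_cal + 1)) - 1]

def certified_safe(nominal, centre, obstacle_radius, radius):
    """True when the tube around every waypoint clears the keep-out disc."""
    clearance = min(math.hypot(x - centre[0], y - centre[1]) for x, y in nominal)
    return clearance - obstacle_radius > radius

nominal = [(float(t), 0.0) for t in range(HORIZON)]   # straight-line nominal path
radius = tube_radius()
print(f"tube radius {radius:.3f}")
print("far obstacle :", certified_safe(nominal, (5.0, 1.5), 0.5, radius))
print("near obstacle:", certified_safe(nominal, (5.0, 0.7), 0.5, radius))
```

Scoring the whole trajectory rather than each step is what lets the certificate speak about the full path; a per-step score would only bound the miss rate at each step separately.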

14 · Language models & structured outputs

The same idea tames generative models. Emit a set of candidate answers guaranteed to contain the correct one with high probability, or abstain when that set grows too large to be useful: a distribution-free handle on hallucination and a principled deferral rule. Conformal factuality, calibrated selective generation, and coverage-guaranteed retrieval-augmented pipelines all rest on the same certificate. The applications page walks through the methods and the papers that apply them carefully.

15 · Risk control, deferral & audit

Sometimes the deliverable is the certificate itself. Conformal risk control and Learn-then-Test extend the machinery from coverage to any monotone risk (a bounded miss rate, false-negative rate, or expected loss), chosen in advance and honoured on fresh data. For compliance and audit, “contained 95% of the time, under almost no assumptions” is exactly the kind of guarantee little else can promise so cheaply. The applications page goes into full depth, with citations, and also covers scientific discovery, causal inference, survival analysis, and medical imaging, where coverage is just as central.

In every one of these the skill is the same: recognising which column your problem is in. When it is “whether,” reach for conformal prediction with confidence.

16 · Contributions

The common objections, and straight answers, live on the FAQ page. Beyond that, the site is open to contributions: pull requests and issues are welcome, whether more demonstrations, papers, insight, or pushback.

The two-sentence version

Conformal prediction certifies the coverage of a set; it does not typically estimate a sharp conditional distribution, nor can it, because you are not supplying the requisite information. Use it, happily, whenever you need a coverage guarantee.

Key references

Vovk, Gammerman & Shafer (2005). Algorithmic Learning in a Random World. Springer.
Angelopoulos & Bates (2023). Conformal prediction: a gentle introduction. Foundations and Trends in Machine Learning 16(4). (The friendly modern survey.)
Lei & Wasserman (2014). Distribution-free prediction bands for non-parametric regression. JRSS-B 76(1):71–96.
Foygel Barber, Candès, Ramdas & Tibshirani (2021). The limits of distribution-free conditional predictive inference. Information and Inference 10(2).
Romano, Patterson & Candès (2019). Conformalized quantile regression. NeurIPS.
Angelopoulos, Bates, Fisch, Lei & Schuster (2024). Conformal risk control. ICLR.
Gneiting, Balabdaoui & Raftery (2007). Probabilistic forecasts, calibration and sharpness. JRSS-B 69(2).

The full bibliography is in papers/marginally-useful/references.bib.

Using conformal prediction in your own project? Tell Claude: “Read https://conformalprediction.net/SKILL.md and create a project skill from it.” It adds a check for whether your coverage is conditionally trustworthy.