
Conformal prediction in code

Two things you actually need: a way to audit and fix a setup you already have, and recipes for the uses where coverage is genuinely the objective.

The method is always the same: score a held-out calibration fold, take an order statistic, act on it. Two cautions hold throughout. First, the guarantee is exact when the calibration data are exchangeable with the test point; stationary, fading dependence costs provably little coverage, but plain split conformal does not survive drift (see the time-series case). Second, the calibration fold must be disjoint from whatever the model trained on. Throughout, alpha is the miscoverage level, so alpha = 0.1 asks for 90% coverage.

Audit and fix an existing setup

If you already use MAPIE, crepes, puncc, TorchCP, or a hand-rolled wrapper, the code is almost certainly producing valid marginal coverage. The failure mode is not a bug but a misreading: the marginal 1 - alpha is an average over inputs, so coverage can sit near 100% on easy x and well below target on hard ones while the average looks fine. There is a cheap check for that, and it does not depend on which conformal library you use.

pip install conformalguide

The check tests whether the conformal rank of the residual is independent of the features. Under good conditional calibration it is; distance covariance flags when it is not. Feed it your calibration residuals and the matching features. Getting those out of each library:

Hand-rolled or sklearn. You already have the pieces.

from conformalguide import coverage_dependence_test
resid = y_cal - model.predict(X_cal)          # out-of-fold or held-out residuals
p, info = coverage_dependence_test(resid, X_cal)

MAPIE. Wrap the regressor in CoverageAudited: the wrapper warns after fit and delegates everything else, so it disappears into a pipeline.

from conformalguide import CoverageAudited
from mapie.regression import SplitConformalRegressor   # MAPIE >= 1.0
scr = SplitConformalRegressor(base, confidence_level=0.95, prefit=True)
model = CoverageAudited(scr, alpha=0.05).fit(X_cal, y_cal)   # base already fit on train
# UserWarning if the 95% level is uneven across x; model.predict(...) unchanged
# manual route: resid = y_cal - base.predict(X_cal)

crepes. The point model is kept on .learner.

from conformalguide import coverage_dependence_test
rf.calibrate(X_cal, y_cal)                     # rf is a crepes WrapRegressor
resid = y_cal - rf.learner.predict(X_cal)
p, info = coverage_dependence_test(resid, X_cal)

In a cross-validation loop. The check is also available as an sklearn scorer.

from sklearn.model_selection import cross_validate
from conformalguide import coverage_dependence_scorer
cross_validate(est, X, y, scoring={"cov_uniformity": coverage_dependence_scorer})
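For intuition about what a check of this kind computes, here is a minimal hand-rolled sketch: a permutation test on the sample distance covariance between the ranks of the absolute residuals and the features. It is an illustration under stated assumptions, not the conformalguide implementation. The names `centered_distances` and `dcov_permutation_pvalue`, the permutation count, and the synthetic data are all hypothetical.

```python
import numpy as np

def centered_distances(Z):
    """Pairwise Euclidean distance matrix of the rows of Z, double-centered."""
    Z = np.asarray(Z, dtype=float).reshape(len(Z), -1)
    D = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1))
    return D - D.mean(axis=0, keepdims=True) - D.mean(axis=1, keepdims=True) + D.mean()

def dcov_permutation_pvalue(resid, X, n_perm=200, seed=0):
    """Permutation p-value for independence of the |residual| ranks and X (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    abs_r = np.abs(np.asarray(resid, dtype=float))
    ranks = np.argsort(np.argsort(abs_r)) / (len(abs_r) - 1)   # ranks rescaled to [0, 1]
    A = centered_distances(ranks)
    B = centered_distances(X)
    observed = (A * B).mean()                                  # squared sample distance covariance
    null = np.empty(n_perm)
    for b in range(n_perm):
        perm = rng.permutation(len(ranks))                     # break any link between ranks and X
        null[b] = (A[np.ix_(perm, perm)] * B).mean()
    return (1 + np.count_nonzero(null >= observed)) / (n_perm + 1)

# Synthetic illustration: the noise scale grows with x, so a constant-width band is uneven.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(300, 1))
resid_uneven = rng.normal(size=300) * (0.1 + 2.0 * x[:, 0])
p_uneven = dcov_permutation_pvalue(resid_uneven, x)
print(f"p-value under heteroscedastic noise: {p_uneven:.3f}")
```

The pairwise distance matrices cost O(n^2) memory, so a sketch like this is only practical on a calibration fold of a few thousand points.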

Reading the result, and the fix:

p >= 0.05: nothing found at this power. Report it as “not detected”, not as conditional validity. No test can certify conditional coverage (the no-go results); this one only detects.
p < 0.05: the test finds evidence that coverage is conditionally uneven in X. If info["per_feature_dcov"] is present, its largest entries name the features driving it. The fix is upstream: model the conditional spread with conformalized quantile regression or Mondrian / binned conformal. Do not touch alpha or the quantile; the conformal step cannot reduce this gap.
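One of the fixes named above, Mondrian / binned conformal, can be sketched in a few lines: calibrate a separate conformal quantile inside each bin of a feature, so each bin gets its own band width. This is a minimal sketch on synthetic data. The helper names `conformal_quantile` and `binned_quantiles`, the quartile binning, and the data generator are assumptions made for illustration.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Order statistic at conformal rank ceil((n + 1) * (1 - alpha)); inf if the rank exceeds n."""
    s = np.sort(np.asarray(scores, dtype=float))
    n = len(s)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if k > n else s[k - 1]

def binned_quantiles(scores, bins, alpha):
    """One conformal quantile per bin label (group-conditional calibration)."""
    scores, bins = np.asarray(scores, dtype=float), np.asarray(bins)
    return {int(b): conformal_quantile(scores[bins == b], alpha) for b in np.unique(bins)}

# Synthetic folds (assumption): absolute residuals whose scale grows with a single feature x.
rng = np.random.default_rng(0)
alpha = 0.1
x_cal, x_test = rng.uniform(size=2000), rng.uniform(size=2000)
abs_res_cal = np.abs(rng.normal(size=2000) * (0.1 + 2.0 * x_cal))
abs_res_test = np.abs(rng.normal(size=2000) * (0.1 + 2.0 * x_test))

# Bin edges come from the calibration fold only; test points reuse the same edges.
edges = np.quantile(x_cal, [0.25, 0.5, 0.75])
bins_cal, bins_test = np.digitize(x_cal, edges), np.digitize(x_test, edges)

q_by_bin = binned_quantiles(abs_res_cal, bins_cal, alpha)
q_test = np.array([q_by_bin[int(b)] for b in bins_test])      # half-width for each test point
coverage_by_bin = {b: float(np.mean(abs_res_test[bins_test == b] <= q_by_bin[b]))
                   for b in q_by_bin}
print(q_by_bin)
print(coverage_by_bin)
```

The guarantee now holds within each bin rather than only on average, at the price of fewer calibration points per bin; a bin with too few points gets an infinite width.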

Get the guard rails as a Claude skill

The same checks, plus a review pass for the usual misuses (conformalizing to make a forecast sharper, coverage reported as a quality metric, in-sample residuals, expecting conditional coverage from a marginal guarantee), are packaged as a Claude skill. Drop it into a project and Claude runs them whenever it sees conformal code. To install, tell Claude:

Read https://conformalprediction.net/SKILL.md and create a project skill from it.

The SKILL.md is the whole thing: what it flags, the check it runs, and the caveat it states every time.

Uses where coverage is the point

These are the cases where a coverage or containment certificate is exactly what you want, and the model’s sharpness is a separate concern. This is conformal prediction at its most useful.

Prediction sets and selective triage

Return a set of labels guaranteed to contain the truth with probability at least 1 - alpha, small where the model is sure, large where it is not. This is the basis of adaptive prediction sets and of routing the uncertain cases to a human.

import numpy as np
# probs_cal: (n, K) softmax on a calibration fold; y_cal: integer labels
s = 1 - probs_cal[np.arange(len(y_cal)), y_cal]          # 1 - p(true class)
n = len(s); k = int(np.ceil((n + 1) * (1 - alpha)))
qhat = np.inf if k > n else np.sort(s)[k - 1]             # k > n: keep every label
sets = [np.where(p >= 1 - qhat)[0] for p in probs_test]   # labels confident enough to keep
# P(true label in set) >= 1 - alpha; route |set| != 1 to review
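To see the recipe run end to end, the sketch below applies the same steps to synthetic softmax outputs and then splits the test cases into an automatic lane and a review lane. The data generator (`draw`), the Dirichlet concentration, and the fold sizes are assumptions chosen for illustration; only the scoring and thresholding follow the snippet above.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, K, n_cal, n_test = 0.1, 5, 1000, 2000

def draw(n):
    """Hypothetical classifier output: Dirichlet rows, with labels sampled from those rows."""
    probs = rng.dirichlet(np.full(K, 0.5), size=n)
    labels = np.array([rng.choice(K, p=p) for p in probs])
    return probs, labels

probs_cal, y_cal = draw(n_cal)
probs_test, y_test = draw(n_test)

# Calibrate: score = 1 - p(true class), threshold = conformal order statistic.
s = 1 - probs_cal[np.arange(n_cal), y_cal]
k = int(np.ceil((n_cal + 1) * (1 - alpha)))
qhat = np.inf if k > n_cal else np.sort(s)[k - 1]

# Predict: keep every label whose probability clears 1 - qhat.
in_set = probs_test >= 1 - qhat                     # (n_test, K) boolean membership
coverage = in_set[np.arange(n_test), y_test].mean() # fraction of sets containing the truth
set_size = in_set.sum(axis=1)

# Triage: singleton sets go through automatically, everything else goes to review.
auto = set_size == 1
review = ~auto
print(f"coverage={coverage:.3f}, auto={auto.mean():.2f}, review={review.mean():.2f}")
```

Empty sets land in the review lane as well, which matches the `|set| != 1` rule above.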

Calibrated anomaly detection (conformal p-values)

Turn any anomaly score into a test with an exact false-alarm rate. This is calibrated anomaly detection, and it is one of the cleanest homes for the method.

import numpy as np
cal = np.sort(inlier_scores)            # calibration scores; higher = more anomalous
n = len(cal)
def conformal_pvalue(s):
    return (1 + np.count_nonzero(cal >= s)) / (n + 1)
flag = conformal_pvalue(new_score) <= alpha
# among true inliers, P(flag) <= alpha, distribution-free
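When many new points arrive at once, the same p-value can be computed for the whole batch with a binary search on the sorted calibration scores. The sketch below is one way to do that; the function name `conformal_pvalues` and the synthetic Gaussian scores are assumptions for illustration.

```python
import numpy as np

def conformal_pvalues(cal_scores, new_scores):
    """Vectorised (1 + #{cal >= s}) / (n + 1) for every s in new_scores."""
    cal = np.sort(np.asarray(cal_scores, dtype=float))
    n = len(cal)
    # side="left" returns the number of calibration scores strictly below s
    n_below = np.searchsorted(cal, np.asarray(new_scores, dtype=float), side="left")
    return (1 + n - n_below) / (n + 1)

# Synthetic check (assumption): Gaussian inlier scores, a few shifted outliers.
rng = np.random.default_rng(0)
alpha = 0.05
inlier_scores = rng.normal(size=999)                 # calibration fold
fresh_inliers = rng.normal(size=5000)                # new points that are truly inliers
outliers = rng.normal(loc=4.0, size=5)               # new points that are truly anomalous

false_alarm_rate = np.mean(conformal_pvalues(inlier_scores, fresh_inliers) <= alpha)
outlier_flags = conformal_pvalues(inlier_scores, outliers) <= alpha
print(f"false-alarm rate on fresh inliers: {false_alarm_rate:.3f}")
print("outlier flags:", outlier_flags)
```

The realised false-alarm rate fluctuates around alpha from one calibration fold to the next; the bound holds on average over the draw of the calibration data.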

Guaranteed-recall shortlist (screening, retrieval)

Keep a shortlist that retains a genuine hit with probability at least 1 - alpha. This is guaranteed recall for retrieval and screening.

import numpy as np
pos = np.sort(relevant_scores)          # scores of known relevant items (ascending)
n = len(pos)
k = int(np.floor(alpha * (n + 1)))      # tolerate up to an alpha miss-rate
t = -np.inf if k < 1 else pos[k - 1]     # keep-threshold
shortlist = np.where(test_scores >= t)[0]
# a new relevant item clears t with probability >= 1 - alpha

Risk control (beyond coverage)

When the cost is not miss-or-cover but a bounded loss (false-negative rate, edit distance, fraction of a set you must read), conformal risk control picks a threshold whose expected loss on a fresh point stays under target. Coverage is the 0–1 special case (Angelopoulos et al., 2024).

import numpy as np
# loss(lam, i): loss of calibration point i at threshold lam,
# monotone non-increasing in lam and valued in [0, B]; n = number of calibration points
B, target = 1.0, 0.1
lam_hat = None                                            # stays None if no threshold qualifies
for lam in np.sort(lambdas):                              # strict -> permissive
    Rhat = np.mean([loss(lam, i) for i in range(n)])      # mean loss on calibration
    if (n * Rhat + B) / (n + 1) <= target:                # finite-sample risk bound
        lam_hat = lam
        break
# if lam_hat is not None, using it controls E[loss on a fresh point] <= target
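As a concrete instance, the sketch below controls the false-negative rate of a multi-label prediction set: a label is kept when its score clears 1 - lam, and the loss is the fraction of true labels that were dropped. The synthetic scores, the label generator, the grid of thresholds, and the helper names `mean_fnr` and `bound` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 500, 6
probs_cal = rng.uniform(size=(n, K))                      # hypothetical per-label scores
truth_cal = rng.uniform(size=(n, K)) < probs_cal          # label present with that probability
truth_cal[np.arange(n), rng.integers(K, size=n)] = True   # every example has at least one label

def mean_fnr(lam):
    """Mean fraction of true labels missed when keeping labels with score >= 1 - lam."""
    kept = probs_cal >= 1 - lam
    missed = 1.0 - (kept & truth_cal).sum(axis=1) / truth_cal.sum(axis=1)
    return missed.mean()                                  # non-increasing in lam, each loss in [0, 1]

B, target = 1.0, 0.1

def bound(lam):
    """Finite-sample risk bound evaluated on the calibration fold."""
    return (n * mean_fnr(lam) + B) / (n + 1)

lam_hat = None
for lam in np.linspace(0.0, 1.0, 101):                    # strict -> permissive
    if bound(lam) <= target:
        lam_hat = lam                                     # smallest threshold meeting the target
        break
print("lam_hat:", lam_hat)
```

Because lam = 1 keeps every label and so has zero loss, the search always ends with a valid threshold in this setup whenever B / (n + 1) is below the target.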

Interval recipes (conformal only re-levels)

You can also wrap a point predictor in a band. Useful, but be clear about the division of labour: the band’s width comes from the model, conformal only fixes the coverage level. The worked example shows a conformal step leaving a proper score untouched.

Regression intervals (split conformal)

import numpy as np
# model already fit on a training fold; calibrate on a disjoint fold
scores = np.sort(np.abs(y_cal - model.predict(X_cal)))   # nonconformity scores
n = len(scores)
k = int(np.ceil((n + 1) * (1 - alpha)))                  # conformal rank (1-indexed)
q = np.inf if k > n else scores[k - 1]                   # the conformal quantile
lo = model.predict(X_test) - q
hi = model.predict(X_test) + q
# P(y in [lo, hi]) >= 1 - alpha, finite-sample, distribution-free

Adaptive width (conformalized quantile regression)

Want the band to widen where the data are noisy? Fit conditional quantiles, then conformalize them. The adaptivity is the quantile model’s; this is also the fix when the audit above fires.

from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
lo_m = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2).fit(X_tr, y_tr)
hi_m = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2).fit(X_tr, y_tr)
# nonconformity = how far the truth falls outside the predicted band, on calibration
s = np.maximum(lo_m.predict(X_cal) - y_cal, y_cal - hi_m.predict(X_cal))
n = len(s); k = int(np.ceil((n + 1) * (1 - alpha)))
q = np.inf if k > n else np.sort(s)[k - 1]               # k > n: too few points, band is unbounded
lo = lo_m.predict(X_test) - q
hi = hi_m.predict(X_test) + q                            # adaptive width, marginal coverage >= 1 - alpha

Time series (adaptive conformal inference)

Exchangeability fails under drift, so the fixed quantile is replaced by an online one. ACI recovers long-run coverage, not per-step conditional coverage (exchangeability & time series).

import numpy as np
from collections import deque
alpha_t, gamma, W = alpha, 0.01, deque(maxlen=200)       # trailing |residual| window
for x_t, y_t, pred_t in stream:
    if alpha_t <= 0:   q = np.inf                        # the theorem needs these escapes
    elif alpha_t >= 1: q = 0.0
    else:              q = np.quantile(W, 1 - alpha_t) if W else np.inf
    covered = pred_t - q <= y_t <= pred_t + q
    alpha_t += gamma * (alpha - (0.0 if covered else 1.0))   # online level update
    W.append(abs(y_t - pred_t))
# time-average coverage -> 1 - alpha (windowed quantile makes this approximate)
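To watch the long-run behaviour, the loop above can be wrapped in a function and run on a synthetic stream whose noise scale drifts. This is a sketch under assumptions: the function name `aci_coverage`, the zero forecaster, and the linear drift in scale are hypothetical and stand in for a real model and stream.

```python
import numpy as np
from collections import deque

def aci_coverage(y, preds, alpha=0.1, gamma=0.01, window=200):
    """Run the online level update over a stream and return the realised coverage."""
    alpha_t, W, hits = alpha, deque(maxlen=window), 0
    for y_t, pred_t in zip(y, preds):
        if alpha_t <= 0:
            q = np.inf                                   # level driven below 0: cover everything
        elif alpha_t >= 1:
            q = 0.0                                      # level driven above 1: cover nothing
        else:
            q = np.quantile(list(W), 1 - alpha_t) if W else np.inf
        covered = abs(y_t - pred_t) <= q
        hits += int(covered)
        alpha_t += gamma * (alpha - (0.0 if covered else 1.0))
        W.append(abs(y_t - pred_t))                      # trailing |residual| window
    return hits / len(y)

# Synthetic drifting stream (assumption): noise scale grows fivefold, forecaster always says 0.
rng = np.random.default_rng(0)
T = 5000
scale = np.linspace(1.0, 5.0, T)
y = rng.normal(0.0, scale)
preds = np.zeros(T)

coverage = aci_coverage(y, preds)
print(f"long-run coverage: {coverage:.3f}")
```

The figure reported is an average over the whole stream; coverage at any single step, or within any stretch of the drift, can still be off target.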

If the actual problem is univariate distributional time-series forecasting, consider modelling the conditional distribution directly rather than wrapping a point forecaster: that is what crosses the information gap a conformal wrapper cannot. Any conditional-scale model does the job. A t-GARCH fit is the classical choice and would likely do as well or better; the snippet below uses laplace from the skaters package only because it is online and keeps the example to a few lines (a demo shows it side by side with a conformal wrap):

from skaters import laplace
f = laplace(k=3)
state = None
for y in stream:
    dists, state = f(y, state)
    lo, hi = dists[0].quantile(0.025), dists[0].quantile(0.975)   # 95% band, online
    # dists[0].mean, .std, .logpdf(y) also available

