Demonstration

Exchangeability & time series

Fixed split conformal is a straw man here. The adaptive methods are not, so judge everyone on a proper score.

Every coverage statement in conformal prediction is paid for by one assumption: the calibration data and the future point are exchangeable, meaning their joint distribution is invariant to reordering. That is plausible for an i.i.d. sample. It is rarely true for a series unfolding in time, where the level trends and occasionally jumps to a new regime. A fixed split-conformal band that freezes a single width does not degrade gracefully under drift; it simply stops covering.
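To make the frozen width concrete, here is a minimal sketch of the fixed split-conformal half-width and of what happens to its coverage when the test residuals stop being exchangeable with the calibration ones. The function name `split_conformal_halfwidth`, the toy Gaussian residuals and the size of the shift are assumptions for illustration, not the setup used in the demo.

```python
import numpy as np

def split_conformal_halfwidth(cal_residuals, alpha=0.1):
    """Fixed half-width q from absolute calibration residuals (hypothetical helper)."""
    scores = np.sort(np.abs(np.asarray(cal_residuals, dtype=float)))
    n = scores.size
    # Finite-sample rank ceil((n + 1) * (1 - alpha)); infinite width if it exceeds n.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if k > n else scores[k - 1]

rng = np.random.default_rng(0)
cal = rng.normal(size=500)                       # calibration residuals (assumed toy data)
q = split_conformal_halfwidth(cal, alpha=0.1)

test_iid = rng.normal(size=5000)                 # exchangeable with the calibration set
test_shift = 1.5 + 2.0 * rng.normal(size=5000)   # shifted level, doubled volatility

cov_iid = np.mean(np.abs(test_iid) <= q)
cov_shift = np.mean(np.abs(test_shift) <= q)
print(f"q={q:.2f}  coverage iid={cov_iid:.3f}  coverage after shift={cov_shift:.3f}")
```

The same `q` is applied to both test sets; only the data changed, which is the sense in which the band "stops covering" rather than adapting.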

But it would be cheap to stop there. The literature has strong online repairs, and a fair test pits them not only against fixed split but against a plain probabilistic model, and scores all of them on a proper criterion, not marginal coverage alone (which several methods hit by construction). So below we forecast a nonstationary series $y_t = \text{level}_t + \sigma_t\,\varepsilon_t$ with known, time-varying $\sigma_t$, a trend, a regime jump at the orange line, and a high-volatility block afterwards. The point forecast $\hat y_t$ is a trivial rolling mean. We calibrate on $[0,t_0]$ (the dashed line) and then run five methods online over $t>t_0$, all consuming the same $\hat y_t$:

fixed split (CP), constant half-width $q$ from the calibration residuals. The straw man, kept for contrast.
ACI (Gibbs & Candès, 2021), online $\alpha_t$, width from the $(1-\alpha_t)$ quantile of a trailing window of $|y-\hat y|$, updated by $\alpha_{t+1}=\operatorname{clip}\!\big(\alpha_t+\gamma(\alpha-\mathrm{err}_t)\big)$.
conformal PID (after Angelopoulos, Candès & Tibshirani, 2024), trailing-quantile base radius plus an integral correction on the running coverage error.
NexCP (weighted) (Barber et al., 2023), a recency-weighted quantile of trailing residuals, weights $\rho^{\,\text{age}}$.
EWMA-vol Gaussian, the yardstick: estimate the conditional sd $\hat\sigma_t$ by RiskMetrics EWMA of squared residuals ($\lambda=0.94$) and emit $\hat y_t \pm z\,\hat\sigma_t$, with $z$ the $1-\alpha/2$ standard normal quantile. This models the spread directly, so it also returns a full predictive distribution.
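Two of the width rules above, ACI and the EWMA-vol Gaussian, can be sketched in a few lines. This is a minimal illustration, not the demo's code: the function names, window length, warm-up, step size `gamma`, clip bounds and the toy residual series are all assumptions; only the ACI update rule and $\lambda=0.94$ are taken from the descriptions above.

```python
import numpy as np
from statistics import NormalDist

def aci_halfwidths(abs_res, alpha=0.1, gamma=0.02, window=100, warmup=20):
    """ACI-style half-widths from a trailing window of |y - y_hat| (sketch)."""
    abs_res = np.asarray(abs_res, dtype=float)
    widths = np.full(abs_res.size, np.nan)
    alpha_t = alpha
    for t in range(warmup, abs_res.size):
        past = abs_res[max(0, t - window):t]          # strictly past residuals only
        widths[t] = np.quantile(past, 1.0 - alpha_t)
        err = float(abs_res[t] > widths[t])           # 1 if the interval missed y_t
        # alpha_{t+1} = clip(alpha_t + gamma * (alpha - err_t)); bounds are assumed
        alpha_t = float(np.clip(alpha_t + gamma * (alpha - err), 0.001, 0.999))
    return widths

def ewma_halfwidths(res, alpha=0.1, lam=0.94, warmup=20):
    """Gaussian half-widths z * sigma_hat_t with an EWMA variance estimate (sketch)."""
    res = np.asarray(res, dtype=float)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    widths = np.full(res.size, np.nan)
    var = np.mean(res[:warmup] ** 2)                  # seed the variance (assumed choice)
    for t in range(warmup, res.size):
        widths[t] = z * np.sqrt(var)                  # one step ahead, uses residuals < t
        var = lam * var + (1.0 - lam) * res[t] ** 2
    return widths

# Toy residuals y - y_hat (assumed): volatility triples halfway through.
rng = np.random.default_rng(1)
res = np.concatenate([rng.normal(0, 1, 500), rng.normal(0, 3, 500)])

w_aci = aci_halfwidths(np.abs(res))
w_ewma = ewma_halfwidths(res)
ok = ~np.isnan(w_aci)
cov_aci = np.mean(np.abs(res[ok]) <= w_aci[ok])
cov_ewma = np.mean(np.abs(res[ok]) <= w_ewma[ok])
print(f"ACI coverage {cov_aci:.3f}, EWMA coverage {cov_ewma:.3f}")
```

Both rules read only residuals observed before step `t`, so each width is a genuine one-step-ahead quantity. Conformal PID and NexCP follow the same pattern with a different correction or weighting of the trailing residuals.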

Pick the method shown in the main panel, then turn up the drift and watch the green points go red for fixed split. The readouts report, for the selected method, marginal coverage, the worst-window coverage (the local collapse the marginal hides), and the interval (Winkler) score at level $1-\alpha$ for the band $[\hat\ell,\hat h]$, $$\mathrm{IS} = \overline{\;(\hat h-\hat\ell) + \tfrac{2}{\alpha}(\hat\ell-y)\mathbf 1\{y<\hat\ell\} + \tfrac{2}{\alpha}(y-\hat h)\mathbf 1\{y>\hat h\}\;},$$ where the bar denotes the average over the online steps. It is a proper score that rewards narrow bands and penalises misses; lower is better.
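The interval score is short enough to write out directly. The sketch below follows the formula above term by term; the function name `interval_score` and the two-point example are illustrative assumptions.

```python
import numpy as np

def interval_score(y, lo, hi, alpha=0.1):
    """Mean interval (Winkler) score of the bands [lo, hi] at level 1 - alpha."""
    y, lo, hi = (np.asarray(a, dtype=float) for a in (y, lo, hi))
    width = hi - lo
    below = (2.0 / alpha) * (lo - y) * (y < lo)   # penalty when y falls under the band
    above = (2.0 / alpha) * (y - hi) * (y > hi)   # penalty when y falls over the band
    return float(np.mean(width + below + above))

# One covered point and one miss by 2 units, both with band [-1, 1]:
# scores are 2 and 2 + (2 / 0.1) * 2 = 42, so the mean is 22.
score = interval_score(y=[0.0, 3.0], lo=[-1.0, -1.0], hi=[1.0, 1.0], alpha=0.1)
print(score)
```

A single miss by 2 units costs as much as widening every band by 20 units at $\alpha=0.1$, which is why a method cannot buy a good score with narrow bands alone.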

The second panel overlays the rolling coverage (trailing 80 steps) of whichever methods you tick, against the dashed target. This is where you see fixed split (blue) collapse after the regime jump while ACI, conformal PID and the EWMA model track $1-\alpha$. They are doing real work; they are not straw men.

Fixed split sits near target while the data are roughly stationary, then collapses once the trend and the volatility block bite. The adaptive conformal methods and the EWMA-vol model oscillate around target because they re-estimate the width from recent realized residuals. Note what stays true for all of them: the trailing window dips hard right after the jump. The long-run average recovers; the window you are standing in does not.
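The gap between the long-run average and the window you are standing in is easy to reproduce. The sketch below computes trailing-window coverage and its minimum from a 0/1 hit sequence; the function name `rolling_coverage`, the reuse of the 80-step window for the worst-window figure, and the toy hit sequence are assumptions.

```python
import numpy as np

def rolling_coverage(covered, window=80):
    """Coverage over every full trailing window of the 0/1 hit sequence."""
    covered = np.asarray(covered, dtype=float)
    if covered.size < window:
        return np.array([])
    csum = np.concatenate(([0.0], np.cumsum(covered)))
    return (csum[window:] - csum[:-window]) / window

# Toy hit sequence (assumed): 200 hits, a run of 40 misses after a jump, 200 hits.
covered = np.concatenate([np.ones(200), np.zeros(40), np.ones(200)])

roll = rolling_coverage(covered, window=80)
marginal = covered.mean()   # 400 / 440, about 0.91
worst = roll.min()          # 0.5: the worst 80-step window contains all 40 misses
print(f"marginal={marginal:.3f}  worst-window={worst:.3f}")
```

The marginal figure sits at the 0.90 target while the worst window is at 0.5, so a method can meet its long-run target and still fail badly in the stretch right after a jump.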

Takeaway. Two readings, both from the worked example. First, fixed split conformal really does collapse under drift (≈0.56 marginal coverage, ≈0.17 worst-window); it is the straw man. ACI, conformal PID and NexCP, by contrast, genuinely recover long-run and marginal coverage; they are competitive, not straw men. Second, the repair is conditional modeling, not the conformal step itself: a plain EWMA-volatility Gaussian matches the adaptive conformal methods on the interval score (≈6.9 vs ≈7.0, against an oracle at ≈5.5) and returns a full predictive distribution. Where conformal helps under drift, it helps by adapting the width to recent residuals, which is local conditional modeling, with the conformal step supplying the coverage leveling on top. And none of these delivers per-step conditional coverage: even the oracle’s worst-window coverage is ≈0.82. The repairs buy long-run, marginal coverage, never coverage now. The paper works through why. Conformal vs. recalibration asks whether plain probabilistic recalibration would have served you better all along.
