KPSS vs ADF vs Phillips-Perron: Stationarity Testing in Practice

Category: Foundations & Core Concepts • Updated: June 2026 • Reading time: 15 minutes

TL;DR

ADF and Phillips-Perron test the null hypothesis that a series has a unit root (is non-stationary); KPSS tests the opposite null, that the series is stationary. Because the nulls are opposite, "KPSS vs ADF" is the wrong framing — run ADF and KPSS together and read the result off the four-cell matrix below. Use Phillips-Perron as a robustness check when heteroskedasticity is severe, and reach for DFGLS (in the arch package) when the series is highly persistent and your sample is short — that's exactly where ADF's power collapses. For volatility series, spreads around regime breaks, and anything that might be fractionally integrated, no single test gives you the answer; this article shows what each one can and cannot tell you, with runnable statsmodels code.

In this guide

The three tests at a glance
What each test actually tests
The ADF × KPSS decision matrix
Running the tests in statsmodels
Phillips-Perron and DFGLS with the arch package
The power problem near a unit root
Level vs trend stationarity: the regression argument
Structural breaks fool all three tests
Finance-specific gotchas
Which test when

The Three Tests at a Glance

All three tests answer a version of the same question — does this series wander permanently (unit root, I(1)) or does it revert to a mean or trend (stationary, I(0))? — but they differ in which hypothesis sits in the null, how they handle serial correlation in the errors, and where you'll find a maintained implementation.

	ADF	KPSS	Phillips-Perron
Null hypothesis	Unit root (non-stationary)	Stationary (level or trend)	Unit root (non-stationary)
Rejecting the null means	Evidence of stationarity	Evidence of non-stationarity	Evidence of stationarity
Serial correlation handled by	Adding lagged differences to the regression (parametric)	Newey-West long-run variance estimate (non-parametric)	Newey-West correction to the test statistic (non-parametric)
Key tuning choice	Number of lags (`autolag="AIC"`)	Bandwidth (`nlags="auto"`) and `regression="c"` vs `"ct"`	Bandwidth (Newey-West lags)
Python implementation	`statsmodels.tsa.stattools.adfuller`	`statsmodels.tsa.stattools.kpss`	`arch.unitroot.PhillipsPerron` (not in statsmodels)
Known weakness	Low power near a unit root; sensitive to lag choice	p-value only reported on (0.01, 0.1); over-rejects with persistent errors	Worse small-sample size distortions than ADF with negative MA errors

The most important row is the first one. ADF starting from "guilty until proven stationary" and KPSS starting from "stationary until proven guilty" means a low-information sample produces opposite conclusions from the two tests by default — not because the tests disagree about the data, but because failing to reject a null is weak evidence. Everything else in this article builds on that asymmetry.

What Each Test Actually Tests

Augmented Dickey-Fuller: null of a unit root

The Dickey-Fuller test (1979) estimates the regression Δy_t = α + βt + γy_{t−1} + ε_t and tests H₀: γ = 0 (unit root) against H₁: γ < 0 (stationary). The "augmented" version adds lagged differences Δy_{t−1}, …, Δy_{t−k} to soak up serial correlation in the errors, which is why lag selection matters so much: too few lags and the size of the test is wrong; too many and you burn power. Under the null the t-statistic on γ does not follow a t-distribution, so statsmodels' adfuller uses MacKinnon's response-surface approximations (1994, updated 2010) for p-values. By default it selects the lag count by AIC, up to a maximum of 12·(n/100)^(1/4).

One subtlety that trips people up: the test is one-sided, and a more negative statistic is stronger evidence of stationarity. An ADF statistic of −3.8 against a 5% critical value of −2.86 rejects the unit root; a statistic of −1.4 does not.
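Written out in full, the augmented regression described above looks roughly like this. The coefficient symbol δ_i for the lagged differences is our notation, not something the text fixes; k is the lag count chosen by the information criterion.

```latex
\Delta y_t = \alpha + \beta t + \gamma\, y_{t-1}
           + \sum_{i=1}^{k} \delta_i\, \Delta y_{t-i} + \varepsilon_t,
\qquad H_0:\ \gamma = 0 \quad \text{vs.} \quad H_1:\ \gamma < 0
```

Dropping βt gives the constant-only specification, and dropping both α and βt gives the no-deterministics case.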

KPSS: null of stationarity

The Kwiatkowski-Phillips-Schmidt-Shin test (1992) decomposes the series into a deterministic part (constant, or constant plus trend), a random walk component, and a stationary error, then tests H₀: the random walk component has zero variance — i.e., the series is stationary around its deterministic part. Rejecting KPSS is evidence of a unit root. The statistic is built from cumulative sums of demeaned (or detrended) residuals, scaled by a Newey-West estimate of the long-run variance.

Two implementation details of statsmodels' kpss matter in practice. First, the p-value is interpolated from Table 1 of the original paper and is only available on the interval (0.01, 0.1); outside it you get the boundary value plus an InterpolationWarning. A reported p-value of exactly 0.01 means "≤ 0.01" and exactly 0.1 means "≥ 0.1", so never report these as exact. Second, the default bandwidth nlags="auto" uses the data-dependent method of Hobijn et al. (1998); the "legacy" option is the older fixed rule int(12·(n/100)^(1/4)). Results can change meaningfully between the two on persistent series.
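To show where the statistic comes from, here is a minimal NumPy sketch of the level-stationarity KPSS statistic: partial sums of demeaned residuals, scaled by a Bartlett-weighted long-run variance. The function name `kpss_level_stat` is hypothetical, the bandwidth uses the fixed "legacy" rule rather than the Hobijn method, and no p-value is computed. We expect it to track statsmodels' regression="c" statistic for the same lag count, but treat that as an assumption to check rather than a guarantee.

```python
import numpy as np

def kpss_level_stat(y, nlags):
    """KPSS statistic for level stationarity (hypothetical helper, sketch only)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    e = y - y.mean()                      # residuals from a constant-only fit
    s = np.cumsum(e)                      # partial sums of the residuals
    lrv = e @ e / n                       # lag-0 term of the long-run variance
    for k in range(1, nlags + 1):
        w = 1.0 - k / (nlags + 1.0)       # Bartlett (Newey-West) weight
        lrv += 2.0 * w * (e[k:] @ e[:-k]) / n
    return (s @ s) / (n**2 * lrv)

rng = np.random.default_rng(1)            # illustrative seed
noise = rng.standard_normal(1000)         # stationary by construction
walk = np.cumsum(noise)                   # unit root by construction

legacy_lags = int(12 * (len(noise) / 100) ** 0.25)   # fixed "legacy" bandwidth rule
noise_stat = kpss_level_stat(noise, legacy_lags)
walk_stat = kpss_level_stat(walk, legacy_lags)
print(f"lags={legacy_lags}  noise={noise_stat:.3f}  walk={walk_stat:.3f}")
```

Compare each statistic with the regression="c" critical values (0.347, 0.463, 0.574, 0.739) rather than with a p-value.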

Phillips-Perron: ADF's null, handled non-parametrically

The Phillips-Perron test (1988) tests the same unit-root null as ADF, but instead of adding lagged differences to the regression, it runs the plain Dickey-Fuller regression and then corrects the test statistic using a Newey-West long-run variance estimator. This makes PP robust to heteroskedasticity and serial correlation of unspecified form without choosing an explicit lag order for the regression — attractive for financial data, where volatility clustering guarantees heteroskedastic errors. The cost, documented in Schwert's Monte Carlo work and acknowledged in the arch documentation, is that PP suffers worse small-sample size distortions than ADF when the errors have a large negative moving-average component. In practice PP usually agrees with ADF; when it disagrees, suspect heteroskedasticity or an MA error structure rather than treating PP as a tiebreaker.

Note that statsmodels does not ship a Phillips-Perron implementation — you need the arch package (pip install arch), which also provides DFGLS and Zivot-Andrews, both of which we'll need below.

The ADF × KPSS Decision Matrix

Because the nulls are opposite, running ADF and KPSS together yields four possible outcomes, and each one is informative. This joint procedure — sometimes called confirmatory analysis — is recommended in the statsmodels stationarity notebook, which spells out the same four cases.

	KPSS fails to reject (looks stationary)	KPSS rejects (looks non-stationary)
ADF rejects (looks stationary)	Stationary. Both tests agree; model the series in levels.	Conflict. Often a near-unit-root or long-memory (fractionally integrated) series, or heteroskedasticity distorting one test. Common with volatility series. Investigate; don't difference reflexively.
ADF fails to reject (looks non-stationary)	Conflict. Usually trend stationarity tested with the wrong specification, or simply too little data for either test to reject. Re-run ADF with `regression="ct"` and check the sample size before concluding anything.	Unit root. Both tests agree; difference the series (use returns, not prices) and re-test.

The textbook reading of the two conflict cells — "KPSS stationary + ADF non-stationary ⇒ trend stationary, detrend it; ADF stationary + KPSS non-stationary ⇒ difference stationary, difference it" — is a reasonable first interpretation, but in financial data the conflicts more often signal one of the deeper problems covered below: a near-unit root the ADF can't resolve, fractional integration that fits neither I(0) nor I(1), or a structural break masquerading as a unit root. Treat a conflict cell as a prompt to investigate, not as a lookup table answer.
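The matrix is easy to encode as a first-pass screen. The sketch below is one possible encoding; the function name and the label strings are hypothetical, and because KPSS p-values are bounded to (0.01, 0.1), `alpha` should stay inside that range.

```python
def classify_stationarity(adf_p, kpss_p, alpha=0.05):
    """Map ADF and KPSS p-values to one of the four matrix cells (sketch)."""
    adf_rejects = adf_p < alpha      # ADF rejects the unit-root null
    kpss_rejects = kpss_p < alpha    # KPSS rejects the stationarity null
    if adf_rejects and not kpss_rejects:
        return "stationary"
    if not adf_rejects and kpss_rejects:
        return "unit root"
    if adf_rejects and kpss_rejects:
        return "conflict: both reject (near-unit-root or long memory?)"
    return "conflict: neither rejects (wrong trend spec or too little data?)"

# p-values of the kind reported for a random walk and for its differences
print(classify_stationarity(adf_p=0.590, kpss_p=0.01))
print(classify_stationarity(adf_p=0.000, kpss_p=0.10))
```

The two conflict labels are deliberately phrased as questions: the function flags a series for investigation, it does not decide the transformation.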

Running the Tests in statsmodels

The cleanest way to see the machinery is on data where we know the truth. The snippet below simulates 1,500 days of log prices as a random walk with drift (so the truth is: prices are I(1), returns are I(0)) and runs both tests on both series. Everything here is reproducible — fixed seed, statsmodels 0.14.6.

import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

rng = np.random.default_rng(4)
returns = 0.0003 + 0.01 * rng.standard_normal(1500)   # ~7.5% ann. drift, ~16% ann. vol
log_price = np.cumsum(returns)                         # random walk with drift: I(1) by construction

for name, series in [("prices", log_price), ("returns", returns)]:
    adf_stat, adf_p, adf_lags, nobs, crit, _ = adfuller(series, regression="c", autolag="AIC")
    kpss_stat, kpss_p, kpss_lags, kpss_crit = kpss(series, regression="c", nlags="auto")
    print(f"{name}: ADF={adf_stat:.2f} (p={adf_p:.3f}, lags={adf_lags})  "
          f"KPSS={kpss_stat:.3f} (p={kpss_p}, lags={kpss_lags})")

Output (with the same seed and library versions you should see the same numbers):

prices:  ADF=-1.38 (p=0.590, lags=0)   KPSS=2.123 (p=0.01, lags=25)
returns: ADF=-39.49 (p=0.000, lags=0)  KPSS=0.109 (p=0.1, lags=6)

InterpolationWarning: The test statistic is outside of the range of p-values
available in the look-up table. The actual p-value is smaller than the p-value
returned.

Reading this through the matrix: for prices, ADF fails to reject (−1.38 is well above the 5% critical value of −2.86) and KPSS rejects hard (2.123 against a 1% critical value of 0.739) — bottom-right cell, unit root, difference it. For returns, ADF rejects emphatically and the KPSS statistic of 0.109 is below even the 10% critical value of 0.347 — top-left cell, stationary. Both conclusions match the known truth.

Three things to notice in the output, because they generate most of the confusion you'll find in forum threads:

The KPSS p-values of 0.01 and 0.1 are bounds, not values. The InterpolationWarning is statsmodels telling you the statistic fell off the edge of the published table. The warning fires for strong results in either direction — it is not an error, and suppressing it without reading it first is how people end up reporting "p = 0.01" as if it were exact. Compare the statistic against the returned critical-value dictionary instead: {'10%': 0.347, '5%': 0.463, '2.5%': 0.574, '1%': 0.739} for regression="c".
An ADF p-value of 0.000 is an approximation artifact. MacKinnon's response surface isn't meant to resolve p-values that small; report it as p < 0.001.
KPSS chose very different bandwidths for the two series (25 lags on prices, 6 on returns) under nlags="auto". That's the Hobijn method adapting to persistence, and a reminder that on real data you should check whether your conclusion survives a different lag choice before trusting it.
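The reporting conventions from the first two points can be wrapped in a pair of small formatters so that boundary values never leak into a report as exact numbers. Both function names are hypothetical, and the thresholds are simply the ones stated above.

```python
def format_kpss_p(p):
    """Render a statsmodels KPSS p-value, treating table edges as bounds."""
    if p <= 0.01:
        return "p <= 0.01"
    if p >= 0.1:
        return "p >= 0.1"
    return f"p = {p:.3f}"

def format_adf_p(p):
    """Render an ADF p-value, avoiding a misleading 0.000."""
    return "p < 0.001" if p < 0.001 else f"p = {p:.3f}"

print(format_kpss_p(0.01), "|", format_kpss_p(0.1), "|", format_kpss_p(0.042))
print(format_adf_p(0.0), "|", format_adf_p(0.590))
```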

Phillips-Perron and DFGLS with the arch Package

The arch package wraps each test in a class with a readable summary. On the same simulated data:

from arch.unitroot import PhillipsPerron, DFGLS

pp = PhillipsPerron(log_price)        # trend="c" and Newey-West bandwidth by default
print(pp.stat, pp.pvalue, pp.lags)    # -1.445  0.561  24

dfgls = DFGLS(log_price)              # Elliott-Rothenberg-Stock GLS-detrended ADF
print(dfgls.stat, dfgls.pvalue)       # 0.650  0.869

PhillipsPerron(returns).pvalue        # ~0.0 — agrees with ADF, as it usually does

Phillips-Perron tells the same story as ADF here (−1.445, p = 0.561 on prices), which is the typical case. Its constructor takes trend in {"n", "c", "ct"} and test_type in {"tau", "rho"} — tau is the t-statistic-based version you almost always want; rho is the coefficient-based variant.

DFGLS (Elliott, Rothenberg & Stock, 1996) is the test most practitioners should use more and most articles never mention: it's the ADF regression applied to a GLS-detrended series, which substantially improves power exactly where ADF is weakest — highly persistent series in modest samples. The next section quantifies that.
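For intuition about what "GLS-detrended" means, here is a minimal sketch of the demeaning step for the constant-only case, following the Elliott-Rothenberg-Stock recipe with c̄ = −7. The function name `gls_demean` is hypothetical and this is not arch's internal code; the arch implementation also handles the trend case, lag selection, and p-values.

```python
import numpy as np

def gls_demean(y, c_bar=-7.0):
    """GLS-demean a series (constant-only ERS case, sketch only)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    a = 1.0 + c_bar / n                                      # local-to-unity parameter
    yq = np.concatenate([[y[0]], y[1:] - a * y[:-1]])        # quasi-differenced data
    zq = np.concatenate([[1.0], np.full(n - 1, 1.0 - a)])    # quasi-differenced constant
    level = (zq @ yq) / (zq @ zq)                            # GLS estimate of the level
    return y - level                                         # series passed to the DF regression

rng = np.random.default_rng(0)                               # illustrative seed
walk = 10.0 + np.cumsum(rng.standard_normal(500))
detrended = gls_demean(walk)
print(walk[:3], detrended[:3])
```

The Dickey-Fuller regression is then run on the demeaned series without a constant, which is where the power gain over ordinary demeaning comes from.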

The Power Problem Near a Unit Root

"ADF has low power" is repeated everywhere; here is what it actually looks like. We simulated stationary AR(1) series y_t = φy_t−1 + ε_t — genuinely mean-reverting, no unit root — and counted how often ADF (constant, AIC lags) correctly rejected the unit-root null at the 5% level across 500 simulations per cell:

Sample size	φ = 0.97 (half-life ≈ 23 days)	φ = 0.99 (half-life ≈ 69 days)	φ = 0.99, DFGLS
n = 250 (1 year daily)	20%	9%	18%
n = 500 (2 years)	58%	10%	28%
n = 1,000 (4 years)	98%	24%	75%
n = 2,500 (10 years)	100%	97%	—

Read the φ = 0.99 column carefully: with a year of daily data, ADF detects mean reversion in a genuinely mean-reverting series 9% of the time, barely above the 5% false-positive rate it would produce on a true random walk. The test is nearly uninformative. KPSS doesn't rescue you in this regime either: a stationary series with φ = 0.99 generates long excursions that look exactly like a random walk's, so KPSS tends to reject stationarity. With ADF failing to reject and KPSS rejecting, you usually land in the bottom-right "unit root" cell of the matrix, and the agreement is spurious. The honest answer is that at this persistence and sample size, the data cannot distinguish I(0) from I(1).
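A stripped-down version of this kind of experiment fits in a few lines. The sketch below is not the simulation behind the table: it uses the plain Dickey-Fuller regression with no lagged differences, the asymptotic 5% critical value of −2.86, only 200 replications, and series started at a single shock rather than at the stationary distribution. The function names and the seed are illustrative.

```python
import numpy as np

def df_tstat(y):
    """Dickey-Fuller t-ratio on the lagged level (constant only, no lags)."""
    dy, lag = np.diff(y), y[:-1]
    X = np.column_stack([np.ones(lag.size), lag])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (dy.size - 2)
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return beta[1] / se

def rejection_rate(phi, n, n_sims=200, crit=-2.86, seed=0):
    """Share of simulated AR(1) paths on which the unit-root null is rejected."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        eps = rng.standard_normal(n)
        y = np.empty(n)
        y[0] = eps[0]
        for t in range(1, n):
            y[t] = phi * y[t - 1] + eps[t]
        hits += int(df_tstat(y) < crit)
    return hits / n_sims

power_fast = rejection_rate(0.50, 250)    # quick mean reversion
power_slow = rejection_rate(0.99, 250)    # near unit root
print(f"phi=0.50: {power_fast:.0%}   phi=0.99: {power_slow:.0%}")
```

The gap between the two rejection rates is the point; exact percentages depend on the seed, the replication count, and the lag treatment.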

Two practical consequences. First, DFGLS roughly doubles or triples ADF's power in this region (75% vs 24% at n = 1,000) at zero cost — if a unit-root decision actually matters to your pipeline, use DFGLS rather than ADF. Second, half-lives matter more than test verdicts for trading: a spread with φ = 0.99 mean-reverts with a 69-day half-life, which may be tradeable regardless of whether a test certifies it I(0) — and may be I(1) regardless of a lucky rejection. Statistical significance and economic usefulness are different questions.
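The half-lives quoted in the table header follow directly from the AR(1) coefficient: a shock decays by a factor of φ per period, so the half-life h solves φ^h = 0.5. A minimal sketch, with a hypothetical function name:

```python
import math

def half_life(phi):
    """Periods for an AR(1) shock to decay by half; requires 0 < phi < 1."""
    return math.log(0.5) / math.log(phi)

for phi in (0.97, 0.99):
    print(f"phi={phi}: half-life = {half_life(phi):.1f} periods")
```

In practice φ would be an estimate from a regression on the spread, so the half-life inherits that estimate's uncertainty, which is large precisely when φ is close to one.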

Level vs Trend Stationarity: The regression Argument

Both adfuller and kpss take a regression argument, and getting it wrong produces confidently wrong answers. adfuller accepts "c" (constant), "ct" (constant + trend), "ctt" (adds quadratic trend), and "n" (neither); kpss accepts only "c" (level stationary) and "ct" (trend stationary). Here is a series that is stationary around a deterministic trend — y_t = 0.002t + 0.05u_t with u_t an AR(1), φ = 0.5, n = 1,500, seed 0 — tested under both specifications:

ADF  regression="c" : stat=-0.285  p=0.928   # "unit root" — wrong
ADF  regression="ct": stat=-22.18  p<0.001   # stationary around trend — correct
KPSS regression="c" : stat=5.867   p≤0.01    # "non-stationary" — wrong question
KPSS regression="ct": stat=0.108   p≥0.1    # trend stationary — correct

Under the "c" specification both tests scream non-stationary, because relative to a constant mean the series is drifting away. Allow for a trend and both tests flip completely. The procedure that avoids this trap: plot the series first; if it visibly trends, test with "ct" (and interpret a "stationary" verdict as trend stationary — you still need to detrend before modeling); if it oscillates around a level, use "c".

One honest caveat from our simulations: KPSS's size is not exact when the stationary errors are persistent. On trend-stationary series with AR errors at n = 1,500 and automatic bandwidth, KPSS ("ct", 5% level) falsely rejected stationarity 7.8% of the time at φ = 0.5, 9.4% at φ = 0.7, and 19.2% at φ = 0.9. A nominal 5% test that rejects 19% of the time on persistent-but-stationary data is a real bias toward "non-stationary" verdicts — one more reason not to treat a lone KPSS rejection as decisive on slow-moving financial series.

Structural Breaks Fool All Three Tests

Perron (1989) showed that a stationary series with a one-time break in its mean or trend is systematically misclassified as a unit root by Dickey-Fuller-type tests. This is not an edge case in finance: regime changes from policy shifts, index reconstitutions, mergers affecting one leg of a spread, or a currency peg breaking all produce exactly this pattern. A demonstration — an AR(1) series (φ = 0.5) whose mean jumps from 0 to 3 halfway through a 1,000-observation sample, stationary within each regime:

ADF : stat=-1.719  p=0.421    # fails to reject unit root — fooled
KPSS: stat=4.159   p≤0.01    # rejects stationarity — also fooled

from arch.unitroot import ZivotAndrews
ZivotAndrews(x, trend="c")    # stat=-11.21, p<0.001 — rejects unit root

Both standard tests agree the series is non-stationary — the bottom-right "difference it" cell — and both are wrong: differencing this series destroys a perfectly stationary structure and replaces a one-time level shift with a one-time spike. The Zivot-Andrews test (1992), which tests the unit-root null against the alternative of stationarity with a single estimated break, correctly rejects at any conventional level. It's available both as statsmodels.tsa.stattools.zivot_andrews and as arch.unitroot.ZivotAndrews. The practical rule: when ADF and KPSS jointly say "unit root" on a series you have economic reasons to believe mean-reverts, plot it and look for a break before differencing.
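One way to see why a break fools the standard tests: a level shift inflates the whole-sample autocorrelation, so a series that is only mildly persistent within each regime looks highly persistent overall. The sketch below rebuilds a series like the one in the demonstration (φ = 0.5, mean jumping from 0 to 3 at the midpoint of 1,000 observations); the seed and the helper name `acf1` are illustrative, so it is not the exact series tested above.

```python
import numpy as np

def acf1(x):
    """Lag-1 sample autocorrelation."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return (x[1:] @ x[:-1]) / (x @ x)

rng = np.random.default_rng(0)            # illustrative seed
n = 1000
eps = rng.standard_normal(n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.5 * u[t - 1] + eps[t]        # AR(1) noise, stationary
x = u + np.where(np.arange(n) >= n // 2, 3.0, 0.0)   # mean jumps 0 -> 3 at midpoint

full, first, second = acf1(x), acf1(x[: n // 2]), acf1(x[n // 2 :])
print(f"whole sample: {full:.2f}   regime 1: {first:.2f}   regime 2: {second:.2f}")
```

A whole-sample figure well above both within-regime figures is a cheap hint to plot the series and look for a break before trusting a unit-root verdict.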

Finance-Specific Gotchas

Prices, returns, and what not to test

Log prices of liquid assets are I(1) for any practical purpose; running stationarity tests to "confirm" this adds nothing, and the roughly 5% of samples where ADF spuriously rejects (we measured a 4.8–5.6% false rejection rate on simulated random walks, in line with the nominal 5% size) will occasionally tell you a major index is mean-reverting. It isn't. Test the things whose order of integration is genuinely uncertain: spreads, ratios, factor returns, volatility measures, and model features. And difference once, not twice: returns of returns are essentially never needed for financial series, and over-differencing injects a non-invertible MA(1) component into the series. You've traded a unit root in the AR part for one in the MA part, and forecasting models fit on it behave badly. If first differences still look non-stationary, the usual culprits are breaks or variance changes, not a second unit root.
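The over-differencing signature is easy to check numerically: differencing an already-stationary white-noise series produces an MA(1) with coefficient −1, whose lag-1 autocorrelation is −0.5. A minimal sketch, using simulated white noise as a stand-in for returns (the seed and the helper name are illustrative):

```python
import numpy as np

def acf1(x):
    """Lag-1 sample autocorrelation."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return (x[1:] @ x[:-1]) / (x @ x)

rng = np.random.default_rng(0)                     # illustrative seed
returns = 0.01 * rng.standard_normal(5000)         # stand-in for stationary returns
over_differenced = np.diff(returns)                # "returns of returns"

rho_returns = acf1(returns)
rho_over = acf1(over_differenced)
print(f"returns: {rho_returns:.3f}   over-differenced: {rho_over:.3f}")
```

A lag-1 autocorrelation near −0.5 after differencing is a common warning sign that the series was differenced once too often.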

Cointegration pretesting: don't use adfuller on regression residuals

The Engle-Granger recipe for pairs trading — regress one leg on the other, test the residual for stationarity — has a trap in step two: the residual comes from a regression that already chose coefficients to make it look as stationary as possible, so standard ADF critical values are too lenient and you'll find "cointegrated" pairs that aren't. Use statsmodels.tsa.stattools.coint, which runs the augmented Engle-Granger test with the correct MacKinnon critical values for residual-based tests. (Stationarity tests still matter as the pre-test: cointegration is only defined between I(1) series, so confirm each leg is I(1) — ADF fails to reject, KPSS rejects — before testing for cointegration. Two stationary series are trivially "cointegrated" by any linear combination, and a stationary-plus-I(1) pair can't be.) For the strategy side of this, see our piece on combining statistical arbitrage with ML.

Volatility series: near-unit-root and fractionally integrated

Realized volatility, VIX-style indices, and GARCH conditional variances occupy the awkward middle ground the I(0)/I(1) dichotomy can't represent. Their autocorrelations decay slowly — hyperbolically rather than geometrically — and the long-memory literature models them as fractionally integrated, I(d) with d typically estimated around 0.3–0.45 for realized volatility (Andersen, Bollerslev, Diebold & Labys, 2003): stationary, but with shocks that persist far longer than any low-order ARMA implies. On such series ADF often rejects (there really is no unit root) while KPSS also rejects (the long memory inflates its statistic) — the top-right conflict cell. Differencing a d ≈ 0.4 series over-differences it; leaving it in levels under-states persistence. The fractional-differentiation approach popularized in quant ML — difference by a fractional order just large enough to pass stationarity tests while preserving memory — exists precisely for this regime, and the joint ADF+KPSS conflict is your diagnostic for when it applies. The same long-memory behavior shows up in volatility clustering and GARCH-X hybrid models as estimated persistence parameters summing to nearly one.
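Fractional differencing replaces the first difference (1 − L) with (1 − L)^d, whose binomial expansion gives a slowly decaying set of weights on past observations. The sketch below computes those weights with the standard recursion; the function name is hypothetical, and truncation rules and the choice of d are left out.

```python
def frac_diff_weights(d, n_weights):
    """First n_weights coefficients of the expansion of (1 - L)**d."""
    w = [1.0]
    for k in range(1, n_weights):
        w.append(-w[-1] * (d - k + 1) / k)   # w_k = -w_{k-1} * (d - k + 1) / k
    return w

print([round(v, 4) for v in frac_diff_weights(0.4, 5)])   # slow decay: long memory kept
print(frac_diff_weights(1.0, 4))                           # d = 1 recovers the plain first difference
```

Applying the weights is a dot product of the most recent observations with the weight vector; the smaller d is, the more slowly the weights die out and the more memory the transformed series retains.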

Regimes, and what stationarity means for ML pipelines

A series can pass every stationarity test over a decade and still have a mean and variance that differ wildly between the first and second halves of your training window — tests characterize the whole sample, not the stability your model actually experiences. This is why stationarity testing complements rather than replaces regime analysis (see regime-switching models with hidden Markov chains) and careful train/validation/test splits for non-stationary series. For feature engineering, the tests earn their keep as an automated screen: compute candidate features, run ADF+KPSS on each, and flag the conflicted and non-stationary ones for transformation — the workflow we walk through in feature engineering for price series.

Which Test When

Situation	Recommendation
Default workflow, any series	ADF + KPSS together; read the four-cell matrix. Match `regression` to what the plot shows (level vs trend).
Persistent series, < ~1,000 observations	DFGLS instead of ADF — materially more power near a unit root. Accept that φ ≈ 0.99 may be undecidable at your sample size.
Strong volatility clustering / heteroskedasticity	Add Phillips-Perron as a robustness check on the ADF verdict; distrust it specifically when errors have large negative MA components.
Both tests say unit root, but economics says mean reversion	Plot the series; suspect a structural break; run Zivot-Andrews before differencing.
ADF and KPSS both reject (conflict)	Suspect long memory / fractional integration — common for volatility. Consider fractional differencing rather than first differences.
Pairs / cointegration residuals	Never plain `adfuller` on fitted residuals; use `statsmodels.tsa.stattools.coint` (correct critical values). Pre-test each leg for I(1) first.
Reporting results	Report the statistic and critical values, not just p-values; KPSS p-values are bounded to (0.01, 0.1) and ADF p = 0.000 means p < 0.001.

Bottom Line

The question behind "KPSS vs ADF vs Phillips-Perron" has a clean answer once you see that it's not a competition: ADF and PP put the unit root in the null, KPSS puts stationarity there, and the pair of opposite nulls is the feature — run ADF (or better, DFGLS) together with KPSS and let agreement, not a single p-value, drive the decision. The harder truth this article's simulations make concrete is that for the series quants actually care about — spreads with multi-week half-lives, volatility with long memory, anything spanning a regime break — the honest output of stationarity testing is often "the data cannot tell," and the right response is a power-aware test choice, a break test, or fractional differencing rather than a reflexive diff(). Let the tests inform the economics; don't let a lookup table overrule it.