Handling Survivorship Bias in Alternative-Data Backtests

Category: Data Sourcing & Alternative Data • Article #12 • Reading time: 5 minutes

Introduction

Backtesting alternative-data strategies is fraught with a subtle yet devastating bias: survivorship bias. When you backtest only on companies that still exist, you ignore those that failed, went bankrupt, or delisted, which produces artificially optimistic results. Alternative data amplifies the bias because it often becomes available only for successful firms. This article explores survivorship bias in alternative-data backtesting and techniques to mitigate it.

How Survivorship Bias Contaminates Alternative-Data Backtests

Imagine backtesting a strategy using foot-traffic satellite imagery of retail locations. Your alternative data provider has imagery from 2015-2023. You backtest by predicting future revenue from observed foot traffic changes. Your strategy shows 15% annual returns.

Where does survivorship bias enter? Your dataset likely includes imagery only for retail locations that are still operating today. Stores that closed between 2015 and 2023 are underrepresented or absent, so the backtest oversamples winners (stores that survived) and undersamples losers (stores that closed). Long positions look better than they would have in reality, and a strategy that would have correctly flagged stores about to close gets no credit for it, because those failing stores are missing from your data.
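
A toy calculation makes the direction of the bias concrete. The returns below are invented purely for illustration (they are not drawn from any dataset), and the store names are hypothetical; the point is only to show how dropping closed stores changes the average.

```python
from statistics import mean

# Hypothetical annual returns for a long-only basket of five retailers.
# Two of them closed during the sample; all numbers are illustrative.
returns = {
    "store_a": 0.12,
    "store_b": 0.18,
    "store_c": 0.15,
    "store_d": -0.60,  # closed during the sample
    "store_e": -0.45,  # closed during the sample
}
closed = {"store_d", "store_e"}

# What a survivor-only dataset would report.
survivors_only = mean(r for name, r in returns.items() if name not in closed)

# What an investor holding the whole basket would have experienced.
full_universe = mean(returns.values())

print(f"survivors only: {survivors_only:.1%}")
print(f"full universe:  {full_universe:.1%}")
```

With these made-up inputs the survivor-only average is 15% while the full basket averages -12%, so the same strategy flips from attractive to loss-making once the closed stores are restored.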

Alternative-Data-Specific Survivorship Issues

Survivorship bias is worse with alternative data than traditional price data because:

Coverage bias: alternative data providers prioritize well-known, successful companies. Satellite providers prioritize major retail chains, not small stores about to fail.
Historical data gaps: a provider's history might cover 2020-2023 but not 2015-2019, so a company that failed in 2018 never appears in the data at all.
Selection bias in collection: social media data might have better coverage for growing startups than for failing companies.
Access restrictions: private-company data often becomes available only after an IPO (IPO survivorship bias), so companies that failed before listing are invisible.

Quantifying Survivorship Bias

Historical Universe Expansion

Track the universe of entities covered by your alternative data source over time. If 1,000 companies have satellite imagery in 2023 but only 700 had imagery in 2015, this doesn't mean 300 new companies came into existence. It more likely means your data provider expanded coverage. This confounds analysis: you can't distinguish real economic events from changing coverage unless you record when each entity entered and left the dataset.
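
One way to track this is to separate each year's coverage count into entities added and entities dropped. The sketch below assumes you can obtain, for each entity, the first and last year the provider delivered data; the `coverage` records and the `coverage_changes` helper are hypothetical names used only for illustration.

```python
# Hypothetical coverage records: (entity_id, first_year, last_year).
# last_year is the final year the provider delivered data for the entity.
coverage = [
    ("A", 2015, 2023),
    ("B", 2015, 2018),  # dropped from coverage after 2018
    ("C", 2017, 2023),  # added to coverage in 2017
    ("D", 2020, 2023),  # added to coverage in 2020
]

def coverage_changes(records, start, end):
    """Per year: number covered, plus entities added/dropped vs. the prior year."""
    out = {}
    prev = set()
    for year in range(start, end + 1):
        current = {e for e, first, last in records if first <= year <= last}
        out[year] = {
            "covered": len(current),
            "added": sorted(current - prev),
            "dropped": sorted(prev - current),
        }
        prev = current
    return out

changes = coverage_changes(coverage, 2015, 2023)
for year, row in changes.items():
    print(year, row)
```

In the first year every entity shows up as "added", so only later years are informative. Each "dropped" entry then needs a manual label (business failure, acquisition, or a provider decision), since the code alone cannot tell these apart.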

Comparison to Benchmarks

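Compare survival rates in your data to known survival rates. If 95% of the companies in your equity dataset survive from 2015 to 2023 but market-wide survival over the same period was 85%, survivorship bias is present. Use the historical, point-in-time constituent lists of a broad index (like the Russell 3000) as ground truth for what should have survived; today's constituent list is itself a list of survivors.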
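
The comparison can be sketched as a set calculation. The identifiers below are hypothetical placeholders; in practice the start and end universes would come from your dataset and from point-in-time benchmark constituent lists.

```python
def survival_rate(start_universe, end_universe):
    """Share of entities present at the start that are still present at the end."""
    start = set(start_universe)
    if not start:
        raise ValueError("start universe is empty")
    return len(start & set(end_universe)) / len(start)

# Hypothetical identifiers, for illustration only.
dataset_2015 = {"A", "B", "C", "D"}
dataset_2023 = {"A", "B", "C", "D", "E"}  # E was added later; it is ignored
benchmark_2015 = {"A", "B", "C", "D", "X", "Y", "Z", "W"}
benchmark_2023 = {"A", "B", "C", "D", "X", "Y"}

data_rate = survival_rate(dataset_2015, dataset_2023)
bench_rate = survival_rate(benchmark_2015, benchmark_2023)
gap = data_rate - bench_rate  # positive gap suggests missing failures

print(f"dataset: {data_rate:.0%}, benchmark: {bench_rate:.0%}, gap: {gap:.0%}")
```

A positive gap is a warning sign rather than a precise estimate of the bias, because the dataset and the benchmark may differ in sector or size mix.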

Mitigation Techniques

Include Delisted Companies

Incorporate companies that delisted, failed, or went bankrupt into your backtest. This requires three inputs: delisting data (for example, the CRSP database or exchange records), historical alternative data for the delisted companies (challenging, since most providers retain data only for active companies), and pricing data for the delisted entities.

Many providers don't maintain historical data for companies that are no longer relevant to current investors. You may need to reconstruct this data or rely on historical archives.
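
Once delisting data is available, the final loss has to reach the return series. The sketch below is a minimal illustration: the `total_return_with_delisting` helper is hypothetical, and the -70% delisting return is an assumed figure that you would replace with the value from your own delisting source.

```python
def total_return_with_delisting(period_returns, delisting_return=None):
    """Compound periodic returns, then apply a final delisting return if given."""
    growth = 1.0
    for r in period_returns:
        growth *= 1.0 + r
    if delisting_return is not None:
        # The delisting return captures the loss between the last traded
        # price and what holders actually received.
        growth *= 1.0 + delisting_return
    return growth - 1.0

# Hypothetical monthly returns leading up to a delisting.
monthly = [0.02, -0.05, -0.10]

naive = total_return_with_delisting(monthly)
with_delisting = total_return_with_delisting(monthly, delisting_return=-0.70)

print(f"ignoring delisting: {naive:.2%}")
print(f"with delisting:     {with_delisting:.2%}")
```

Stopping the series at the last traded price, as the naive version does, silently treats the position as if it had been sold at that price.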

Conditional Universe Expansion

Instead of backtesting on "all companies with alternative data," limit your universe to companies that had data at each point in history. If you test a strategy in 2016, use only companies that had alternative data available in 2016, not companies that were added to coverage later.

This reduces sample size (fewer companies available at earlier dates) but eliminates look-ahead bias where you accidentally use future coverage as if it existed historically.
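
A point-in-time universe filter can be sketched as follows. The `coverage` mapping and the `universe_as_of` helper are hypothetical; the sketch assumes you know, for each entity, the first and last dates on which data was actually available to you.

```python
from datetime import date

# Hypothetical coverage windows: entity -> (first date data was available,
# last date data was available, or None if still covered).
coverage = {
    "A": (date(2015, 1, 1), None),
    "B": (date(2015, 1, 1), date(2018, 6, 30)),
    "C": (date(2019, 3, 1), None),
}

def universe_as_of(coverage, as_of):
    """Entities whose data was actually available on the as_of date."""
    return sorted(
        entity
        for entity, (start, end) in coverage.items()
        if start <= as_of and (end is None or as_of <= end)
    )

print(universe_as_of(coverage, date(2016, 1, 1)))
print(universe_as_of(coverage, date(2020, 1, 1)))
```

The start date should be the date the provider first delivered the data, not the earliest date the history refers to. A provider that adds a company in 2019 and backfills its history to 2015 would otherwise reintroduce the look-ahead problem.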

Explicit Delisting Modeling

Acknowledge that some companies will fail or delist. Rather than backtesting on a universe that doesn't fail, explicitly model failure rates and integrate them into backtest calculations. If historical alternative data predicts bankruptcy (negative foot traffic trend, declining social media mentions), model a forced liquidation with realistic loss.
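
As a rough sketch of the idea for a long position, the survivor return can be blended with an assumed forced-liquidation loss. The `failure_adjusted_return` helper, the 3% failure probability, and the 80% loss are all assumptions chosen for illustration, not estimates.

```python
def failure_adjusted_return(survivor_return, failure_prob, loss_given_failure):
    """Expected return of a long position once failures are modeled explicitly.

    failure_prob: assumed probability the holding fails during the period.
    loss_given_failure: assumed fraction of position value lost on failure.
    """
    if not 0.0 <= failure_prob <= 1.0:
        raise ValueError("failure_prob must be between 0 and 1")
    if not 0.0 <= loss_given_failure <= 1.0:
        raise ValueError("loss_given_failure must be between 0 and 1")
    return (1.0 - failure_prob) * survivor_return - failure_prob * loss_given_failure

# Illustrative inputs: 15% survivor-only return, 3% failure rate, 80% loss.
adjusted = failure_adjusted_return(0.15, failure_prob=0.03, loss_given_failure=0.80)
print(f"failure-adjusted return: {adjusted:.2%}")
```

A single failure probability is the crudest version. If the alternative-data signal itself predicts failure, the probability can be made a function of the signal so that positions with deteriorating signals carry a larger expected loss.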

Statistical Adjustments

Imputation Methods

For delisted companies with missing alternative data, use imputation: assume similar decline rates to other failing competitors, or assume cessation of signal (foot traffic goes to zero for closed stores). This is imperfect but better than ignoring failures entirely.
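
The "cessation of signal" option can be sketched as a fill rule that applies only from the closure date onward. The `impute_after_closure` helper and the foot-traffic numbers are hypothetical.

```python
def impute_after_closure(series, closure_period, fill_value=0.0):
    """Fill missing observations at or after closure_period with fill_value.

    series: dict mapping period -> observed value, with None for missing.
    Gaps before closure are left as None rather than guessed.
    """
    return {
        period: (fill_value if value is None and period >= closure_period else value)
        for period, value in series.items()
    }

# Hypothetical yearly foot-traffic counts for a store that closed in 2018.
foot_traffic = {2015: None, 2016: 1200, 2017: 900, 2018: None, 2019: None}

imputed = impute_after_closure(foot_traffic, closure_period=2018)
print(imputed)
```

Leaving the pre-closure gap untouched is deliberate: a missing value in 2015 most likely reflects coverage, not an empty store, and filling it with zero would invent a signal.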

Reweighting Schemes

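Weight observations by the inverse of their probability of appearing in your sample, so that underrepresented failures count as much as they did in reality. Suppose 90% of companies in a cohort survived, so 10% failed, but failures make up only 2% of your data: each observed failure should then carry roughly five times its raw weight (0.10 / 0.02), with survivors scaled down slightly (0.90 / 0.98). This compensates for sample skew, provided the failures you do observe are representative of the ones you don't.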
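
One simple way to compute such weights is post-stratification, which scales each group by the ratio of its population share to its sample share. This is a basic form of inverse-probability weighting rather than a full propensity model, and the counts and shares below are assumed for illustration.

```python
def poststratification_weights(sample_counts, population_shares):
    """Weight per group = population share / sample share."""
    total = sum(sample_counts.values())
    weights = {}
    for group, count in sample_counts.items():
        if count == 0:
            # A group with no observations cannot be reweighted into existence.
            raise ValueError(f"no observations for group {group!r}")
        weights[group] = population_shares[group] / (count / total)
    return weights

# Assumed inputs: the sample holds 2 failures in 100 firms, while the
# true cohort failure rate is taken to be 10%.
sample_counts = {"survived": 98, "failed": 2}
population_shares = {"survived": 0.90, "failed": 0.10}

weights = poststratification_weights(sample_counts, population_shares)
print(weights)
```

The zero-count check matters in practice: if a provider retained no failed firms at all, no weighting scheme can recover them, and the missing data has to be sourced or imputed instead.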

Backtesting Against a Dead-Firm Benchmark

Create a separate test portfolio of firms that actually failed. Use your alternative data strategy on these firms as a gut check. If your strategy would have correctly identified most failures (large negative signals before bankruptcy), you have evidence that survivorship bias isn't destroying returns. If your strategy missed failures, survivorship bias likely inflates results.
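
A minimal version of this check is a hit rate over known failures. The `failure_hit_rate` helper, the signal values, and the -1.0 threshold are hypothetical; the sketch assumes each failed firm has one standardized signal reading taken shortly before the failure.

```python
def failure_hit_rate(signals_before_failure, threshold):
    """Fraction of failed firms whose pre-failure signal was at or below threshold."""
    if not signals_before_failure:
        raise ValueError("no failed firms supplied")
    hits = sum(1 for s in signals_before_failure.values() if s <= threshold)
    return hits / len(signals_before_failure)

# Hypothetical standardized signal values observed before each failure.
signals = {"firm_1": -2.1, "firm_2": -0.3, "firm_3": -1.8, "firm_4": 0.4}

rate = failure_hit_rate(signals, threshold=-1.0)
print(f"failures flagged in advance: {rate:.0%}")
```

The hit rate only means something next to the rate at which the same threshold flags firms that survived. A signal that is negative for nearly every firm would score well here without telling failures and survivors apart.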

Forward-Testing and Out-of-Sample Reality Checks

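The ultimate test is live trading or forward-testing on real-time alternative data for all companies, survivors and potential failures alike. Because the universe is fixed before outcomes are known, forward-testing on failures that emerge in the current market is free of survivorship bias by construction: either your strategy correctly identifies emerging failures or it doesn't. The cost is time, since failures are relatively rare and a meaningful sample takes a while to accumulate.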

Conclusion

Alternative-data strategies are particularly susceptible to survivorship bias because coverage is often skewed toward successful firms and historical data gaps are common. Rigorous backtesting requires explicitly incorporating delisted entities, limiting universes to historically consistent coverage, and validating results against failure cases. Quant teams implementing alternative-data strategies should treat survivorship bias as a primary concern, not an afterthought, and build it into their backtesting framework from the start.