Introduction

In offline RL scenarios—common in finance where live experimentation is costly—agents must be evaluated on policies they never executed. Off-policy evaluation (OPE) estimates a new policy's performance using data collected by older policies. Accurate OPE is crucial: poor estimation leads to deploying suboptimal strategies; overly conservative estimation wastes improvement opportunities.

The Off-Policy Evaluation Problem

Why Off-Policy Matters in Finance

Building a live trading system from scratch is risky. Practitioners use historical market data (generated by past policies) to develop and evaluate new strategies. Accurate OPE answers: "If we deployed this new strategy yesterday, how would it have performed?" without actually deploying it.

Distribution Shift Challenges

The new policy may take actions rarely (or never) taken by the historical policy. At these out-of-distribution actions, we have no empirical data on outcomes. OPE must extrapolate—a dangerous endeavor. High variance and bias in OPE estimates are fundamental challenges.

Core OPE Techniques

Importance Sampling (IS)

IS reweights historical trajectory returns by the ratio of the new policy's probability to the historical policy's probability. For a trajectory τ, the IS estimate is:

V_IS = E_{τ ~ π_old}[ R(τ) × ∏_t π_new(a_t|s_t) / π_old(a_t|s_t) ]

Intuition: upweight trajectories that the new policy would also take; downweight those it would avoid.

Limitation: If the new policy deviates significantly, the product of probability ratios becomes extremely large (or zero), causing high variance. IS estimates become unreliable with large policy divergence.
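A minimal sketch of the ordinary IS estimator above, assuming trajectories stored as (state, action, reward) tuples and policies exposed as probability functions (`pi_new`, `pi_old` are illustrative names, not a standard API):

```python
import numpy as np

def is_estimate(trajectories, pi_new, pi_old):
    """Ordinary importance-sampling estimate of the new policy's value.

    Each trajectory is a list of (state, action, reward) tuples; pi_new and
    pi_old map (state, action) to the probability of taking that action.
    """
    values = []
    for traj in trajectories:
        ratio = 1.0
        ret = 0.0
        for s, a, r in traj:
            # Cumulative product of per-step probability ratios.
            ratio *= pi_new(s, a) / pi_old(s, a)
            ret += r
        values.append(ratio * ret)
    return float(np.mean(values))
```

Note how a single long trajectory with many divergent steps can dominate the average, which is exactly the variance problem described above.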

Doubly Robust (DR) Estimation

DR combines importance sampling with a learned value function, balancing bias and variance. The estimate is:

V_DR = V_learned(initial_state) + IS_error_correction

The value function acts as a baseline, reducing variance, while the IS term corrects for the model's bias. DR is "doubly robust": it remains accurate if either the value function or the importance weights (derived from the behavior policy) is accurate.
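A sketch of the DR idea, simplified to a single-decision (bandit-style) setting rather than the full sequential estimator; `q_hat` is an assumed learned reward model and all names are illustrative:

```python
import numpy as np

def dr_estimate(data, pi_new, pi_old, q_hat, actions):
    """Doubly robust value estimate for a one-step decision problem.

    data: list of (state, action, reward); q_hat(s, a) is a learned
    model of the expected reward for taking action a in state s.
    """
    vals = []
    for s, a, r in data:
        # Model-based baseline: expected q_hat under the new policy.
        baseline = sum(pi_new(s, b) * q_hat(s, b) for b in actions)
        # Importance-weighted correction of the model's residual error.
        rho = pi_new(s, a) / pi_old(s, a)
        vals.append(baseline + rho * (r - q_hat(s, a)))
    return float(np.mean(vals))
```

If `q_hat` is exact, the correction term vanishes and the estimate has zero IS variance; if `q_hat` is biased, the IS term repairs the bias on average.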

Model-Based OPE

Learn a forward model: given state and action, predict next state and reward. Evaluate the new policy by rolling out the learned model. The estimate depends on model quality—a biased model produces biased OPE estimates. However, model-based OPE enables extrapolation to out-of-distribution actions.
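The rollout step can be sketched as below, assuming an already-learned forward model exposed as a callable `model(s, a, rng) -> (next_state, reward)` (a hypothetical interface, not a specific library's):

```python
import numpy as np

def model_based_value(model, policy, start_states, horizon,
                      gamma=0.99, n_rollouts=100, seed=0):
    """Monte-Carlo estimate of a policy's value by rolling out a learned
    dynamics model. Any bias in `model` propagates directly into the estimate.
    """
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(n_rollouts):
        s = start_states[rng.integers(len(start_states))]
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s, rng)             # new policy picks the action
            s, r = model(s, a, rng)        # learned model predicts outcome
            total += discount * r
            discount *= gamma
        returns.append(total)
    return float(np.mean(returns))
```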

Application to Financial Policies

OPE for Portfolio Rebalancing Strategies

A fund tests a new dynamic rebalancing policy on 5 years of historical daily data. The historical policy was static (hold equal-weight). The new policy uses RL to adjust allocations based on momentum and volatility.

IS estimate: Reweight each day's return by the probability ratio. Days where the new policy matches the old policy (hold) have ratio ≈ 1. Days where they diverge have ratios far from 1, amplifying or suppressing their impact. If the new policy diverges frequently, the IS variance explodes.

DR estimate: Train a value function to predict portfolio returns given current allocation, momentum, and volatility. Use DR to combine the learned baseline with IS corrections. This stabilizes estimates even when policies diverge.

Handling Action Space Differences

Historical data reflects discrete rebalancing actions (e.g., "shift 5% to tech"). A new policy might propose a continuous action (e.g., "shift 5.3% to tech"). Since the exact action was never taken, there is zero historical data for it. OPE must estimate the outcome probabilistically or with a learned model.

Variance Reduction Techniques

Weighted Importance Sampling (WIS)

Standard IS weights can explode; WIS normalizes the estimate by the sum of weights. This trades a small bias for reduced variance, often yielding better estimates. It is particularly useful when many trajectories have very low probability ratios.
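A minimal sketch, assuming per-trajectory returns and cumulative importance weights have already been computed:

```python
import numpy as np

def wis_estimate(returns, weights):
    """Weighted importance sampling: divide the weighted sum of returns
    by the sum of weights instead of the number of trajectories."""
    w = np.asarray(weights, dtype=float)
    r = np.asarray(returns, dtype=float)
    return float(np.sum(w * r) / np.sum(w))
```

Because the weights are normalized, the WIS estimate always lies within the range of observed returns, unlike ordinary IS, which can be blown far outside it by a single huge weight.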

Clipped Importance Weights

Clip probability ratios to a maximum value (e.g., ratio > 10 is clipped to 10). This explicitly caps variance at the cost of slight bias. For financial applications with tight risk bounds, the tradeoff is worthwhile.
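The clipping step is a one-liner; this sketch assumes cumulative ratios and returns are already available:

```python
import numpy as np

def clipped_is_estimate(returns, ratios, clip=10.0):
    """IS estimate with importance ratios capped at `clip`.
    Caps the variance contribution of any single trajectory,
    at the cost of a small downward bias on high-ratio trajectories."""
    r = np.asarray(returns, dtype=float)
    w = np.minimum(np.asarray(ratios, dtype=float), clip)
    return float(np.mean(w * r))
```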

Multi-Step Returns and Bootstrapping

Instead of using full-episode returns, use multi-step estimates: V_k = (discounted sum of the first k rewards) + (bootstrapped value estimate at step k). Shorter horizons have lower variance but higher bias; ensembling multiple horizons improves robustness.
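A sketch of the k-step return, where `v_boot` stands in for a learned value estimate at the truncation point (an assumed input, not computed here):

```python
def k_step_return(rewards, v_boot, k, gamma=0.99):
    """Discounted sum of the first k rewards plus a discounted
    bootstrapped value estimate v_boot for everything after step k."""
    ret = sum(gamma ** t * r for t, r in enumerate(rewards[:k]))
    return ret + gamma ** k * v_boot
```

Sweeping k from 1 to the full horizon and averaging (or weighting) the resulting estimates is one simple way to ensemble horizons.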

Practical Implementation Considerations

Confidence Intervals for OPE Estimates

A single point estimate is insufficient; report confidence intervals. Use bootstrap resampling: resample trajectories with replacement and compute OPE on each subsample. The empirical distribution of bootstrap estimates approximates the true sampling distribution of the OPE estimator.
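The bootstrap procedure above can be sketched as follows, assuming per-trajectory OPE estimates have already been computed:

```python
import numpy as np

def bootstrap_ci(per_traj_estimates, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean OPE estimate:
    resample trajectories with replacement, recompute the mean each time,
    and take empirical quantiles of the resampled means."""
    rng = np.random.default_rng(seed)
    data = np.asarray(per_traj_estimates, dtype=float)
    boots = [np.mean(rng.choice(data, size=len(data), replace=True))
             for _ in range(n_boot)]
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

For heavy-tailed IS weights the percentile bootstrap can be optimistic; wider intervals (or concentration-based bounds) are safer for deployment decisions.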

Policy Distance Metrics

Monitor how far the new policy diverges from the historical policy. Use KL divergence or Wasserstein distance on action distributions. High divergence signals that OPE estimates are unreliable; consider collecting more exploratory data before deployment.
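For discrete action spaces, the KL monitor is straightforward; this sketch assumes each policy exposes a full probability vector per state (illustrative interface):

```python
import numpy as np

def mean_kl(states, pi_new_probs, pi_old_probs):
    """Average KL(pi_new || pi_old) over a sample of states.
    pi_*_probs(s) returns a probability vector over the discrete
    action set; assumes pi_old assigns nonzero mass everywhere."""
    kls = []
    for s in states:
        p = np.asarray(pi_new_probs(s), dtype=float)
        q = np.asarray(pi_old_probs(s), dtype=float)
        kls.append(float(np.sum(p * np.log(p / q))))
    return float(np.mean(kls))
```

A simple operational rule: alert when the average KL exceeds a fixed threshold, since IS variance grows roughly exponentially with policy divergence.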

Cross-Validation for OPE Quality

In offline settings, split historical data: train models on fold 1, compute OPE on fold 2 using fold 1 data as historical. In fold 2, observe actual outcomes, compare to OPE estimates. Discrepancies reveal OPE bias and guide estimator selection.
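One way to sketch this validation loop, assuming folds of (state, action, reward) tuples and user-supplied `fit_estimator`/`evaluate` callables (both hypothetical placeholders for whatever OPE pipeline is in use):

```python
import numpy as np

def ope_validation_gap(folds, fit_estimator, evaluate):
    """Fit an OPE estimator on each fold, predict the next fold's mean
    outcome, and compare against the outcomes actually observed there.
    Returns the mean absolute prediction gap across fold pairs."""
    gaps = []
    for i in range(len(folds) - 1):
        model = fit_estimator(folds[i])
        predicted = evaluate(model, folds[i + 1])
        actual = float(np.mean([r for (_, _, r) in folds[i + 1]]))
        gaps.append(predicted - actual)
    return float(np.mean(np.abs(gaps)))
```

Note that comparing against realized outcomes in the held-out fold is only a fair check when the held-out data was generated by a policy close to the one being evaluated; otherwise the "actual" values answer a different question.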

Challenges and Limitations

Exploration and Extrapolation

OPE is most reliable for policies close to the historical behavior. Radical new strategies (e.g., short selling when historical data only includes longs) cannot be reliably evaluated offline. Some extrapolation experiments (small live tests, simulation) are necessary.

Non-Stationary Markets

Financial markets are non-stationary. A policy that performed well in historical data may fail in new market regimes. OPE cannot extrapolate to future regime shifts. Augment OPE with stress testing and regime analysis.

Conclusion

Off-policy evaluation is the linchpin of offline RL in finance. By combining importance sampling, learned baselines, and variance reduction techniques, practitioners can estimate new policy performance from historical data with quantified uncertainty. Doubly robust methods balance bias and variance effectively. However, OPE is not a substitute for careful domain knowledge: understand your data distribution, validate estimates rigorously, and augment offline evaluation with live experimentation for high-impact strategies.