Execution Tactic Selection with Contextual Bandits

Execution algorithms have multiple tactics available: VWAP, TWAP, aggressive market orders, passive limit orders, or splitting the order across venues. Choosing which tactic to use given current market conditions is a dynamic decision problem. Contextual bandits, a reinforcement learning framework for sequential decision-making with side information, provide an elegant solution that balances exploration (trying tactics to learn which work best) against exploitation (using the best-known tactic).

The Exploration-Exploitation Tradeoff

If the algorithm always uses the best-known tactic, it risks being suboptimal when market conditions change. If it spends too much time exploring new tactics, it sacrifices immediate execution quality. The optimal balance depends on confidence in current knowledge and potential upside from learning.

Contextual Bandit Framework

Contextual bandits formalize this tradeoff:

  • Context: current market state (volatility, liquidity, recent imbalance)
  • Action: choose execution tactic
  • Reward: execution price achieved, adjusted for information leakage costs
  • Learning: which tactics perform best in which contexts?

Unlike standard bandits, which ignore side information, contextual bandits condition each choice on the observed market state, letting them make smarter tactic selections.
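This decision loop can be sketched as follows; the class and function names are illustrative, not a standard API:

```python
import random

TACTICS = ["vwap", "twap", "aggressive", "passive"]

class RandomPolicy:
    """Placeholder policy: chooses uniformly, records history for learning."""
    def __init__(self):
        self.history = []

    def select(self, context):
        return random.choice(TACTICS)

    def update(self, context, tactic, reward):
        self.history.append((context, tactic, reward))

def run_round(policy, context, execute):
    """One bandit round: observe context, act, realize reward, learn."""
    tactic = policy.select(context)  # exploration vs. exploitation decided here
    reward = execute(tactic)         # e.g. negative slippage vs. arrival price
    policy.update(context, tactic, reward)
    return tactic, reward
```

Any of the exploration strategies below can be dropped in for `RandomPolicy`, since they share the same `select`/`update` interface.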

Exploration Strategies

Common approaches:

Epsilon-greedy: with probability ε, choose a tactic uniformly at random; with probability 1-ε, choose the best-known tactic. Simple to implement, but it keeps exploring clearly suboptimal tactics at a constant rate.
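A minimal epsilon-greedy sketch over discretized (context, tactic) pairs; the context bucketing and tactic names are illustrative assumptions:

```python
import random
from collections import defaultdict

class EpsilonGreedy:
    def __init__(self, tactics, epsilon=0.1):
        self.tactics = tactics
        self.epsilon = epsilon
        self.counts = defaultdict(int)    # (context, tactic) -> pulls
        self.values = defaultdict(float)  # (context, tactic) -> mean reward

    def select(self, context):
        if random.random() < self.epsilon:  # explore uniformly at random
            return random.choice(self.tactics)
        return max(self.tactics,            # exploit the best-known tactic
                   key=lambda t: self.values[(context, t)])

    def update(self, context, tactic, reward):
        # Incremental running mean of observed rewards.
        key = (context, tactic)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]
```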

Upper Confidence Bound (UCB): for each tactic, maintain an estimate of its average reward together with the uncertainty of that estimate. Choose the tactic with the highest upper confidence bound, which blends exploitation with optimistic exploration of under-sampled tactics.
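A UCB1-style sketch maintained per context bucket; the sqrt(2 ln t / n) bonus is the classic UCB1 exploration term, and the discrete context keying is an illustrative simplification:

```python
import math
from collections import defaultdict

class UCB:
    def __init__(self, tactics):
        self.tactics = tactics
        self.counts = defaultdict(int)    # (context, tactic) -> pulls
        self.values = defaultdict(float)  # (context, tactic) -> mean reward
        self.rounds = defaultdict(int)    # rounds seen per context

    def select(self, context):
        self.rounds[context] += 1
        for tactic in self.tactics:       # try every tactic once first
            if self.counts[(context, tactic)] == 0:
                return tactic
        def ucb(tactic):
            n = self.counts[(context, tactic)]
            bonus = math.sqrt(2 * math.log(self.rounds[context]) / n)
            return self.values[(context, tactic)] + bonus
        return max(self.tactics, key=ucb)

    def update(self, context, tactic, reward):
        key = (context, tactic)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]
```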

Thompson Sampling: maintain a posterior distribution over the reward of each (context, tactic) pair. Sample from each posterior and choose the tactic with the highest sampled reward. This balances exploration and exploitation naturally, since tactics with wide posteriors are occasionally sampled high and thus explored.
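A Thompson sampling sketch using a Beta posterior over a binary outcome per (context, tactic) pair, e.g. "did this tactic beat the arrival-price benchmark?"; the Bernoulli reward model is an illustrative simplification of continuous execution rewards:

```python
import random
from collections import defaultdict

class ThompsonSampling:
    def __init__(self, tactics):
        self.tactics = tactics
        # Beta(1, 1) prior (uniform) on each tactic's success probability.
        self.alpha = defaultdict(lambda: 1.0)  # prior + observed successes
        self.beta = defaultdict(lambda: 1.0)   # prior + observed failures

    def select(self, context):
        # Sample one plausible success rate per tactic; act greedily on samples.
        samples = {t: random.betavariate(self.alpha[(context, t)],
                                         self.beta[(context, t)])
                   for t in self.tactics}
        return max(samples, key=samples.get)

    def update(self, context, tactic, success):
        if success:
            self.alpha[(context, tactic)] += 1.0
        else:
            self.beta[(context, tactic)] += 1.0
```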

Regret Bounds and Theoretical Performance

Contextual bandit algorithms come with theoretical regret bounds, which quantify how much worse the algorithm's cumulative reward is than that of an oracle that knows the optimal tactic for each context in advance.

Good algorithms achieve sublinear regret: after T decisions, cumulative regret grows more slowly than T (roughly as sqrt(T) or log(T), depending on the algorithm and assumptions). This means per-decision regret shrinks toward zero, so the algorithm learns quickly and asymptotically approaches optimal behavior.

Implementation with Neural Networks

In high dimensions (many possible contexts), tabular approaches, which maintain a table of reward estimates for each context-action pair, become infeasible. A neural network can instead approximate the reward function:

Reward ≈ neural_network(context, tactic)

Deep contextual bandits use deep networks to learn context representations, enabling efficient learning in high dimensions.
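As a sketch of the idea, here is a tiny one-hidden-layer reward model trained by SGD on logged (context, tactic, reward) tuples; the dimensions, learning rate, and architecture are arbitrary choices, and production deep-bandit systems layer an exploration rule on top of such an estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
N_TACTICS, D_CTX, H = 4, 3, 16           # illustrative dimensions
W1 = rng.normal(0.0, 0.5, (D_CTX + N_TACTICS, H))
b1 = np.zeros(H)
W2 = rng.normal(0.0, 0.5, H)
b2 = 0.0

def features(context, tactic):
    """Concatenate context features with a one-hot tactic encoding."""
    onehot = np.zeros(N_TACTICS)
    onehot[tactic] = 1.0
    return np.concatenate([context, onehot])

def predict(context, tactic):
    """Reward estimate: one tanh hidden layer, linear output."""
    h = np.tanh(features(context, tactic) @ W1 + b1)
    return h @ W2 + b2

def sgd_step(context, tactic, reward, lr=0.05):
    """One squared-error SGD step on a logged observation."""
    global W1, b1, W2, b2
    x = features(context, tactic)
    h = np.tanh(x @ W1 + b1)
    err = (h @ W2 + b2) - reward         # d(loss)/d(pred) up to a factor of 2
    W2 -= lr * err * h
    b2 -= lr * err
    dh = err * W2 * (1 - h ** 2)         # backprop through tanh
    W1 -= lr * np.outer(x, dh)
    b1 -= lr * dh
```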

Practical Execution Application

Contexts for execution tactics:

  • Time remaining: spread execution over time vs execute now
  • Current volatility: passive execution in calm, aggressive in stress
  • Order queue depth: trade when queues are short
  • Market direction: aggressive buying into weakness, aggressive selling into strength

The bandit algorithm learns these mappings through experience, adapting its tactic selection over time.

Offline to Online Learning

Initially, train the algorithm offline on historical data (batch reinforcement learning). Then deploy online, where it continues to learn and adapt as new data arrives.

Caution: distribution shift between historical data and live markets can cause online learning to diverge. Careful monitoring and safeguards are needed to keep the system from locking onto suboptimal tactics.
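One way to sketch this offline-to-online handoff with a safeguard: seed value estimates from logged fills, keep updating live, and fall back to a default tactic when recent realized rewards degrade past a threshold. All names and thresholds here are illustrative:

```python
from collections import defaultdict, deque

class WarmStartedSelector:
    def __init__(self, tactics, logged, default="twap", floor=-1.0, window=50):
        self.tactics = tactics
        self.default = default           # conservative fallback tactic
        self.floor = floor               # min acceptable recent mean reward
        self.recent = deque(maxlen=window)
        self.counts = defaultdict(int)
        self.values = defaultdict(float)
        for context, tactic, reward in logged:  # offline (batch) phase
            self._update(context, tactic, reward)

    def _update(self, context, tactic, reward):
        key = (context, tactic)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]

    def select(self, context):
        # Safeguard: if live performance has degraded, stop experimenting.
        if self.recent and sum(self.recent) / len(self.recent) < self.floor:
            return self.default
        return max(self.tactics, key=lambda t: self.values[(context, t)])

    def observe(self, context, tactic, reward):  # online phase
        self.recent.append(reward)
        self._update(context, tactic, reward)
```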

Conclusion

Contextual bandits provide a principled approach to execution tactic selection, balancing the need to exploit known-good tactics against exploration to discover better tactics for different market conditions. This exemplifies how online learning enables adaptive trading systems.