Pegging Orders to Arrival Price with Reinforcement Learning
An institutional investor's execution algorithm often has no visibility into the price at which the original trading decision was made. In many cases, execution is triggered by algorithmic processes rather than explicit decision-making, so the natural benchmark is the arrival price—the price at the moment the order first became available for execution. Pegging to arrival price—dynamically adjusting order prices as the market moves relative to the arrival price—is a clever execution tactic.
The Arrival Price Benchmark
When a portfolio manager decides to buy 100,000 shares of a stock, the "arrival price" is typically the mid-market price at the time of decision. A natural performance metric is to minimize the difference between actual execution price and this arrival price. Orders that execute better than arrival price outperform; worse execution underperforms.
This benchmark has the advantage of being execution-algorithm agnostic. Whether the algorithm uses VWAP, TWAP, or some other approach, all can be compared on the same scale: how much better or worse than arrival price?
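The comparison is usually expressed as signed slippage versus arrival, in basis points, with the sign flipped by order side so that positive always means outperformance. A minimal sketch (function name and convention are illustrative):

```python
def arrival_slippage_bps(exec_price: float, arrival_price: float, side: str) -> float:
    """Signed slippage vs. the arrival price, in basis points.

    Positive means the fill beat the benchmark: a buy below arrival,
    or a sell above arrival. Negative means underperformance.
    """
    signed = arrival_price - exec_price if side == "buy" else exec_price - arrival_price
    return 1e4 * signed / arrival_price
```

For example, buying at 99.00 against an arrival price of 100.00 scores +100 bps, as does selling at 101.00 against the same arrival.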
Why Peg Orders?
A pegged order is one whose limit price is automatically adjusted as the market moves. For example, a pegged bid might always be set exactly one tick below the current national best bid (the bid side of the NBBO). As the market rallies, the pegged limit price automatically rises, continuing to follow the market.
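The repricing logic itself is simple: each time the reference price changes, the order's limit price is recomputed at the fixed offset. A minimal sketch, working in integer ticks to sidestep floating-point rounding (names are illustrative):

```python
def reprice_pegged_bid(nbb_history, offset_ticks=1):
    """Track the limit price of a bid pegged offset_ticks below the national
    best bid, repricing only when the reference actually moves.

    Prices are expressed in integer ticks.
    """
    prices = []
    last = None
    for nbb in nbb_history:
        peg = nbb - offset_ticks
        if peg != last:          # reprice only on a change in the reference
            prices.append(peg)
            last = peg
    return prices
```

As the best bid ticks up from 10000 to 10002, the pegged price follows one tick behind, and repeated quotes at the same level generate no repricing traffic.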
Pegging has several advantages. It reduces the risk of resting at stale prices and executing at worse levels than necessary. It allows passive participation in favorable price moves. It keeps orders competitive in dynamic markets.
The Pegging Decision Problem
The execution algorithm must decide: what price should the pegged order target relative to the current market? Options include:
- Peg to the NBBO bid (for buy orders) to be maximally passive
- Peg inside the spread to be more aggressive
- Peg to mid-market for maximum predictability
- Peg dynamically based on market conditions
The optimal choice depends on market state. During illiquid periods, passive pegging may wait indefinitely without executing. During highly liquid periods, passive pegging executes quickly. The algorithm must adapt its aggressiveness to achieve a target execution rate.
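A simple hand-coded baseline makes the adaptation concrete: compare fill progress against a linear schedule and move the peg inward when behind. The thresholds below are illustrative assumptions, not tuned values; this is the kind of heuristic the RL approach aims to improve on.

```python
def peg_offset_ticks(frac_time: float, frac_filled: float) -> int:
    """Hand-coded baseline (not a learned policy): map schedule deviation to a
    peg offset in ticks relative to the best bid.

    Positive = passive (rest behind the bid), 0 = join the bid,
    negative = step inside the spread.
    """
    behind = frac_time - frac_filled     # > 0 means fills lag the linear schedule
    if behind > 0.2:
        return -1    # badly behind: price inside the spread
    if behind > 0.05:
        return 0     # slightly behind: join the best bid
    return 1         # on or ahead of schedule: rest one tick behind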
Reinforcement Learning Approach
RL provides a principled way to optimize pegging decisions. The agent observes state (current market prices, arrival price, time remaining, remaining quantity to execute) and chooses a peg offset (distance from reference price to actual order price). It receives an immediate reward: if the order executes, the reward reflects the execution price relative to arrival; if not, zero or a small negative reward. Over many episodes, it learns which peg offsets maximize expected reward.
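A toy tabular Q-learning version illustrates the loop. The "market" here is a stand-in: each peg offset carries an assumed fill probability and an assumed price reward when filled, the state is just time remaining, and an unfilled order at the horizon is penalized. All of these numbers are illustrative assumptions, not calibrated values.

```python
import random
from collections import defaultdict

ACTIONS = [-1, 0, 1]                          # peg offset: inside spread / at bid / behind bid
FILL_PROB = {-1: 0.9, 0: 0.5, 1: 0.25}        # assumed fill odds per offset
PRICE_COST = {-1: -1.0, 0: 0.0, 1: 1.0}       # assumed reward vs. arrival when filled

def train(episodes=10000, alpha=0.1, gamma=0.95, eps=0.3, seed=1):
    rng = random.Random(seed)
    Q = defaultdict(float)                    # key: (time_left, offset)
    for _ in range(episodes):
        t = rng.choice([1, 2, 3])             # exploring starts over time-left
        while t > 0:
            s = t
            if rng.random() < eps:
                a = rng.choice(ACTIONS)       # explore
            else:
                a = max(ACTIONS, key=lambda x: Q[(s, x)])  # exploit
            filled = rng.random() < FILL_PROB[a]
            if filled:
                r, t = PRICE_COST[a], 0
            else:
                t -= 1
                r = -3.0 if t == 0 else 0.0   # terminal penalty for never filling
            best_next = max(Q[(t, x)] for x in ACTIONS) if t > 0 else 0.0
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q

Q = train()
```

Even in this stripped-down setting the learned values recover the intuition described below: with one period left the aggressive offset dominates, while with ample time the passive offset does.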
The learned policy typically exhibits intuitive behavior: when time pressure is high (few minutes left, large quantity remaining), the agent shifts orders more aggressively inward. When ample time exists and liquidity is available, it remains passive. When volatility spikes, it may hold orders to avoid execution at unfavorable prices.
State Space Design
The state must capture enough information for good decisions:
- Relative time (what fraction of execution window has elapsed)
- Relative quantity (what fraction of order size remains)
- Current market spread and mid-price
- Recent price movements and volatility
- Order-flow intensity
- Actual vs arrival price (current underwater/profitable status)
Rich state representation enables sophisticated policies but requires more training. Sparse state enables faster training but may miss important distinctions.
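The features listed above can be packaged into a small state object that flattens into the vector fed to the policy. Field names here are hypothetical, chosen only to mirror the list:

```python
from dataclasses import dataclass

@dataclass
class PegState:
    """Illustrative state encoding for the pegging agent."""
    frac_time: float          # fraction of execution window elapsed
    frac_remaining: float     # fraction of order size still unfilled
    spread_ticks: float       # current NBBO spread, in ticks
    ret_since_arrival: float  # mid-price move since arrival, signed for the side
    realized_vol: float       # short-horizon volatility estimate
    flow_intensity: float     # normalized recent order-flow arrival rate

    def to_vector(self) -> list:
        """Flatten into the feature vector consumed by the policy."""
        return [self.frac_time, self.frac_remaining, self.spread_ticks,
                self.ret_since_arrival, self.realized_vol, self.flow_intensity]
```

Adding or dropping fields here is exactly the rich-versus-sparse trade-off: each extra feature enlarges the space the agent must explore during training.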
Reward Function Design
The reward should incentivize two goals: (1) execute the order, and (2) execute at good prices. A simple reward of signed price improvement versus arrival (arrival price minus execution price for a buy; execution price minus arrival price for a sell) works, but must be normalized appropriately. If execution takes longer, the agent should not be penalized for mere passage of time if prices have moved.
A more sophisticated reward accounts for:
- Execution vs arrival price advantage
- Time cost (patience has a cost; very slow execution is suboptimal)
- Slippage relative to the order's urgency (if quantity remains and time is nearly exhausted, execute immediately even at a poor price)
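The three components above can be combined in a per-step reward. The weights below (patience cost, terminal shortfall penalty) are illustrative assumptions that would be tuned in practice:

```python
def step_reward(filled_qty: float, exec_price: float, arrival_price: float,
                side: str, terminal: bool = False, remaining: float = 0.0,
                time_penalty: float = 0.0001, shortfall_penalty: float = 0.01) -> float:
    """Sketch of a per-step reward: price improvement vs. arrival on any fill,
    minus a small patience cost each step, minus a terminal penalty on
    quantity left unexecuted at the horizon."""
    sign = 1.0 if side == "buy" else -1.0
    r = filled_qty * sign * (arrival_price - exec_price) / arrival_price
    r -= time_penalty                        # mild cost of waiting each step
    if terminal:
        r -= shortfall_penalty * remaining   # unfilled quantity at the horizon
    return r
```

A buy filled below arrival earns a positive reward; reaching the horizon unfilled earns only penalties, which is what pushes the agent toward aggression as time runs out.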
Online Learning and Adaptation
Rather than learning offline on historical data then deploying a fixed policy, some systems use continual learning. The RL agent observes actual executions and updates its policy parameters in real-time. This enables rapid adaptation to changing market conditions (e.g., if volatility spikes, the agent quickly becomes more conservative).
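One lightweight form of such adaptation is to maintain exponentially weighted estimates of the fill rate at each peg offset, updated after every repricing interval, and let the policy choose the most passive offset that still meets the target execution rate. This is a sketch, not a full online RL update; the decay rate and neutral prior are assumed tuning choices:

```python
class OnlineFillEstimator:
    """Continual-learning sketch: EWMA fill-probability estimate per offset."""

    def __init__(self, offsets, decay=0.05):
        self.decay = decay
        self.p_fill = {o: 0.5 for o in offsets}   # neutral prior per offset

    def update(self, offset, filled):
        # Move the estimate toward the latest outcome (1 if filled, else 0).
        self.p_fill[offset] += self.decay * (float(filled) - self.p_fill[offset])

    def most_passive_viable(self, required_rate):
        """Most passive offset whose estimated fill rate meets the target;
        fall back to the most aggressive offset if none qualifies."""
        viable = [o for o, p in self.p_fill.items() if p >= required_rate]
        return max(viable) if viable else min(self.p_fill)
```

If the market turns illiquid and fills at the passive offset dry up, its estimate decays within a few dozen intervals and the selector shifts inward, which is exactly the rapid-adaptation behavior described above.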
Practical Implementation
Pegging orders requires compliance with exchange rules. Many exchanges limit how aggressively orders can peg into the spread (to prevent market disruption). Some exchanges do not allow pegging at all. Proper implementation requires understanding regulatory rules for each venue.
Conclusion
Reinforcement learning enables sophisticated pegging strategies that dynamically adjust order prices based on market conditions and execution progress. By learning from actual execution outcomes, RL agents discover policies superior to hand-coded heuristics. This exemplifies how machine learning enhances execution quality in complex, dynamic markets.