Introduction

Portfolio managers make decisions at multiple timescales: strategic asset allocation (yearly), tactical tilts (monthly), and rebalancing (daily). A single RL agent must balance these horizons, optimizing long-term diversification while executing daily trades. Hierarchical RL decomposes this complex problem into a hierarchy of agents, each with its own timescale, enabling more stable and interpretable learning.

Why Hierarchical RL Matters

The Credit Assignment Problem

In a flat RL agent learning all decisions jointly, the credit assignment problem becomes severe: which decision (strategic, tactical, or daily) caused a profit or loss? Temporal credit assignment over months or years is extremely noisy. Hierarchical decomposition localizes credit assignment: high-level decisions are evaluated over long horizons; low-level decisions over short horizons.

Interpretability and Control

A hierarchical policy is interpretable: the high-level agent's recommendations are transparent (e.g., "increase equity allocation to 70%"). A low-level agent then determines how to execute this allocation across holdings. Practitioners can audit high-level decisions separately from execution details.

Hierarchical RL Architecture

Options Framework

An option is a temporally extended action: the high-level policy outputs an option (e.g., "shift to 70% equities") that persists for τ steps (e.g., τ = 21 trading days for a monthly option). A low-level policy executes the option daily (e.g., "buy 100 shares of SPY"). The hierarchical structure is: the high level learns allocation goals; the low level learns execution.
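As a minimal sketch of these semantics (all names and numbers are illustrative, not taken from any particular library): a high-level policy emits a target weight that persists for τ steps, while the low-level policy closes a bounded fraction of the gap each day.

```python
import numpy as np

TAU = 21  # option persistence: roughly one month of trading days (assumption)

def high_level_policy(step):
    """Illustrative high-level policy: choose a target equity weight per option window."""
    return 0.7 if (step // TAU) % 2 == 0 else 0.6

def low_level_step(current_weight, target_weight, max_daily_move=0.02):
    """Move toward the target, capped per day to limit market impact."""
    gap = target_weight - current_weight
    return current_weight + np.clip(gap, -max_daily_move, max_daily_move)

weight = 0.5
for t in range(TAU):
    target = high_level_policy(t)      # the option (target) persists for TAU steps
    weight = low_level_step(weight, target)
```

With a 0.02 daily cap, the low level needs ten days to close the 0.2 gap, so the target is reached well inside the option's one-month window.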

Multi-Level Hierarchy

Three-level architecture:

  • Level 1 (Strategic, τ=252 days): Allocates capital to asset classes (equities, bonds, alternatives). Decisions made yearly based on macro fundamentals.
  • Level 2 (Tactical, τ=21 days): Tilts allocations based on momentum and valuation. Adjusts the Level 1 base allocations by ±5% per asset class.
  • Level 3 (Execution, τ=1 day): Executes trades to reach the Level 2 target allocation. Minimizes market impact and transaction costs.
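The three timescales can be encoded as a simple schedule; a sketch, assuming trading-day indices and the τ values above (the function name is illustrative):

```python
# Decision periods in trading days, highest level first
TIMESCALES = {"strategic": 252, "tactical": 21, "execution": 1}

def levels_acting_at(t):
    """Return the levels whose decision period starts at trading day t."""
    return [name for name, tau in TIMESCALES.items() if t % tau == 0]
```

At t = 0 all three levels act; at t = 21 the tactical and execution levels act; on an ordinary day only the execution level does.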

Communication Between Levels

Level 1 outputs a target allocation. Level 2 receives this and outputs an adjusted allocation. Level 3 receives the Level 2 allocation and executes. Rewards flow upward: Level 3's execution cost is part of Level 2's reward; Level 2's performance is part of Level 1's reward. Each level optimizes its local objective while serving the higher level's goal.
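The upward reward flow might be sketched as each level's reward folding in the rewards of the level below (all quantities and function names are illustrative):

```python
def execution_reward(cost, impact):
    """Level 3: minimize execution cost and market impact."""
    return -(cost + impact)

def tactical_reward(monthly_outperformance, daily_execution_rewards):
    """Level 2: outperformance net of the month's execution costs (Level 3 rewards)."""
    return monthly_outperformance + sum(daily_execution_rewards)

def strategic_reward(annual_return, risk_penalty, monthly_tactical_rewards):
    """Level 1: risk-adjusted annual return plus the year's tactical performance."""
    return annual_return - risk_penalty + sum(monthly_tactical_rewards)
```

Because each level's reward sums in the level below, a poorly executing Level 3 depresses Level 2's signal, which in turn depresses Level 1's, so local objectives stay aligned with the global one.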

Training Procedure

Bottom-Up Training

Train Level 3 first: given a target allocation, learn to execute it while minimizing slippage and cost. Once Level 3 is stable, train Level 2: given a target from Level 1, optimize tactical tilts on the assumption that Level 3 will execute targets reliably. Finally, train Level 1 with the Level 2 and Level 3 policies fixed.

Advantage: each level inherits a stable lower level, making its learning problem simpler. Disadvantage: changes to lower levels must trigger retraining of higher levels.
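The bottom-up schedule can be sketched generically; `train_fn` stands in for whatever RL algorithm each level uses (an assumption, not a prescribed API):

```python
def train_bottom_up(levels, train_fn):
    """Train levels lowest-first, each against the already-trained levels below it.

    levels: ordered high to low, e.g. ["strategic", "tactical", "execution"].
    train_fn(level, frozen): trains `level` while the `frozen` lower levels stay fixed.
    """
    frozen = []
    for level in reversed(levels):           # execution first, strategic last
        train_fn(level, frozen=list(frozen))
        frozen.append(level)                 # freeze this level once trained
    return frozen                            # training order, lowest to highest
```

The returned order makes the retraining caveat concrete: if any frozen level changes, every level trained after it in this list must be retrained.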

Joint Training with Fixed Timescales

Train all levels jointly but exploit the timescale structure. Level 1 updates every 252 steps, Level 2 every 21 steps, Level 3 every step. This is naturally compatible with hierarchical temporal difference learning. Faster convergence than bottom-up but requires careful credit assignment.
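Under this schedule, the number of updates each level receives follows directly from its period; a sketch with illustrative names:

```python
def joint_update_schedule(total_steps, taus=None):
    """Count gradient updates per level when each level updates on its own period."""
    taus = taus or {"strategic": 252, "tactical": 21, "execution": 1}
    counts = {name: 0 for name in taus}
    for t in range(1, total_steps + 1):
        for name, tau in taus.items():
            if t % tau == 0:       # this level's accumulation window just closed
                counts[name] += 1  # one update per completed window
    return counts
```

Over one trading year the execution level updates 252 times, the tactical level 12 times, and the strategic level once, which is why credit assignment at the upper levels remains the delicate part.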

Practical Case: Equities Fund

Level 1 (Strategic): State = GDP growth forecast, yield curve, VIX, portfolio current allocation. Action = target allocation to 10 sectors. Reward = annual portfolio return minus risk penalty.

Level 2 (Tactical): State = sector momentum, valuation ratios, fund's current allocation, Level 1 target. Action = adjust allocations ±5% within Level 1 targets. Reward = monthly outperformance versus fixed Level 1 allocation.

Level 3 (Execution): State = current vs. target allocation, bid-ask spreads, market impact estimates. Action = daily order size and limit price. Reward = negative of (execution cost + market impact). Level 3 learns to "buy illiquid positions slowly; sell liquid positions quickly."
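One way such liquidity-aware pacing could look, using the bid-ask spread as an illustrative liquidity proxy (all names and thresholds are assumptions, not the trained policy itself):

```python
def daily_order_fraction(spread_bps, base_fraction=0.2, ref_spread_bps=5.0):
    """Trade a smaller fraction of the remaining gap when spreads are wide.

    spread_bps: current bid-ask spread in basis points (liquidity proxy).
    base_fraction: fraction of the gap traded per day in a liquid name.
    """
    return base_fraction * min(1.0, ref_spread_bps / max(spread_bps, 1e-9))
```

A liquid name at a 2 bps spread trades the full 20% of the gap per day; an illiquid name at 20 bps trades only 5%, spreading the order over more days.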

Results: The hierarchical agent achieved a 0.95 Sharpe ratio (vs. 0.85 for a flat agent). Level 1 learned to increase equity exposure in low-volatility environments and reduce it in crises. Level 2 tilted toward momentum in bull markets and value in bear markets. Level 3 adapted execution speed to volatility. The hierarchical decomposition made each decision interpretable and trainable.

Options Learning and Discovery

Learning Options from Data

Rather than predefining options, learn them. Use an unsupervised option discovery algorithm: identify natural subgoals by clustering successful behaviors in the data. For instance, the agent might discover "aggressive rebalancing" (move allocations quickly) and "passive hold" (maintain allocations). High-level policies learn when to trigger which option.

Bottleneck States

Options naturally form around bottleneck states—states through which many successful trajectories pass. In portfolio management, year-end and market crash periods are bottlenecks. Agents learn distinct strategies for these states. Options facilitate learning by re-using strategies across different contexts.

Advanced Considerations

Inter-Level Information Flow

Should higher levels have direct access to low-level observations? Yes, but carefully. Too much direct access breaks abstraction; too little limits optimization. Best practice: high-level sees low-level state summaries (current vs. target allocation error, current market impact), not granular order-book depth.

Curriculum for Hierarchical Learning

Start with simple two-level hierarchy (strategic + execution); add tactical level later. Or start with a fixed high-level policy; learn low-level execution first. Curricula significantly reduce training time.

Conclusion

Hierarchical RL is a natural framework for multi-horizon portfolio management. By decomposing the problem into strategic, tactical, and execution levels, each with its own timescale, practitioners achieve faster convergence, better interpretability, and more robust final policies. The options formalism provides the mathematical backbone. Real-world deployments of hierarchical RL have demonstrated consistent improvements in Sharpe ratio and reduced operational complexity.