Adaptive Learning Rates in Streaming Gradient Descent
Introduction
Traditional machine learning trains on fixed datasets. Streaming gradient descent trains models continuously as new data arrives—essential for non-stationary financial data. Adaptive learning rates adjust the step size based on recent performance, balancing stability and responsiveness.
Fixed Learning Rate Problems
A learning rate that is too high causes instability: the model oscillates and may diverge. One that is too low causes slow convergence: the model takes many steps to reach the minimum. A fixed rate cannot adapt to non-stationarity: during calm periods a fast rate works well, but during volatile periods the same rate causes instability.
Adaptive Methods: AdaGrad
AdaGrad (Adaptive Gradient) maintains per-parameter learning rates. Parameters that receive frequent, large gradients get smaller effective learning rates (damping over-adjusted directions), while parameters with sparse gradients keep larger ones (ensuring rarely updated features still learn). After t steps with gradient g_t, the AdaGrad update is θ ← θ - (α / sqrt(G_t + ε)) ⊙ g_t, where G_t is the elementwise sum of squared gradients up to step t.
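A minimal sketch of one AdaGrad step, using NumPy arrays for the parameters; the variable names are illustrative:

```python
import numpy as np

def adagrad_step(theta, grad, G, alpha=0.01, eps=1e-8):
    """One AdaGrad update: accumulate squared gradients, scale step per parameter."""
    G = G + grad ** 2                              # cumulative sum of squared gradients
    theta = theta - (alpha / np.sqrt(G + eps)) * grad
    return theta, G

theta, G = np.zeros(3), np.zeros(3)
grad = np.array([0.5, 0.0, -0.2])                  # sparse gradient: middle param untouched
theta, G = adagrad_step(theta, grad, G)
```

Note that on the first step every nonzero coordinate moves by roughly α, since the accumulated G is just that coordinate's squared gradient.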
AdaGrad works well for sparse data but suffers in non-stationary environments: old gradients accumulate indefinitely, so the learning rate shrinks monotonically. Eventually learning effectively stops even though the data keeps shifting.
RMSprop for Non-Stationary Data
RMSprop (Root Mean Square Propagation) uses exponential moving average of squared gradients instead of cumulative sum. This forgets old gradients, keeping learning rates adaptive. Update: G_t = βG_{t-1} + (1-β)g_t^2, then θ = θ - (α / sqrt(G_t + ε)) ⊙ g_t.
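The update above can be sketched directly, with a small demonstration that the moving average forgets an old volatile regime (the gradient values are made up for illustration):

```python
import numpy as np

def rmsprop_step(theta, grad, G, alpha=0.001, beta=0.9, eps=1e-8):
    """One RMSprop update: EMA of squared gradients, so old regimes decay."""
    G = beta * G + (1 - beta) * grad ** 2          # exponential moving average
    theta = theta - (alpha / np.sqrt(G + eps)) * grad
    return theta, G

# A burst of large gradients inflates G, but once gradients calm down G decays,
# so the effective step size recovers (unlike AdaGrad's cumulative sum).
theta, G = np.zeros(1), np.zeros(1)
for _ in range(50):
    theta, G = rmsprop_step(theta, np.array([5.0]), G)   # volatile regime
G_volatile = G.copy()
for _ in range(50):
    theta, G = rmsprop_step(theta, np.array([0.1]), G)   # calm regime
```

With β = 0.9, fifty calm steps shrink the volatile-regime accumulator by a factor of roughly 0.9^50 ≈ 0.005, restoring a large effective learning rate.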
RMSprop is well suited to financial data: recent performance dominates, and old regime history has minimal impact. Learning rates remain large enough to respond to distribution shifts.
Adam: Combining Momentum and Adaptive Rates
Adam combines momentum (exponential moving average of gradients) with adaptive rates (exponential moving average of squared gradients). Empirically, Adam often converges faster and more stably than RMSprop alone, especially on non-stationary data.
Adam is the de facto standard for deep learning in finance. Parameters: α (learning rate, typically 0.001), β_1 (gradient momentum, typically 0.9), β_2 (squared gradient momentum, typically 0.999).
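A minimal sketch of one Adam step in the standard bias-corrected form, with the default parameters above:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias correction (t counts steps from 1)."""
    m = beta1 * m + (1 - beta1) * grad             # EMA of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2        # EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)                   # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
theta, m, v = adam_step(theta, np.array([0.4, -3.0]), m, v, t=1)
```

A useful property visible here: after bias correction, the first step has magnitude of roughly α in each coordinate regardless of gradient scale, which keeps early updates well behaved.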
Empirical Comparison on Streaming Data
Training an LSTM on a stream of stock returns with 1000 data points arriving daily: SGD (fixed LR=0.01) diverges after 50 days. AdaGrad learning rate shrinks to near-zero after 200 days. RMSprop maintains stable learning throughout. Adam and RMSprop achieve similar performance.
Practical Implementation
Use Adam for streaming models with β_1=0.9 and β_2=0.999. Adjust α based on the learning curve: if the loss stops decreasing, halve α. Implement a warm start: initialize the model on the first N observations, then stream updates. Retrain from scratch periodically (e.g., monthly) if distribution shift is detected.
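The recipe above can be sketched end to end for a simple online linear model on a synthetic return stream. Everything here is illustrative: the data generator, the 500-observation warm start, and the 200-step learning-curve check are assumptions, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_W = np.array([0.5, -0.3, 0.1, 0.0])

def observe():
    """One synthetic (features, return) pair; stand-in for a live feed."""
    x = rng.normal(size=4)
    return x, x @ TRUE_W + 0.01 * rng.normal()

def adam_update(theta, grad, m, v, t, alpha, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    theta = theta - alpha * (m / (1 - beta1 ** t)) / (np.sqrt(v / (1 - beta2 ** t)) + eps)
    return theta, m, v

theta, m, v, t = np.zeros(4), np.zeros(4), np.zeros(4), 0

# Warm start: fit the first N observations with full-batch Adam before streaming.
warm = [observe() for _ in range(500)]
X = np.array([x for x, _ in warm]); y = np.array([r for _, r in warm])
for _ in range(300):
    t += 1
    grad = 2 * X.T @ (X @ theta - y) / len(y)      # mean squared-error gradient
    theta, m, v = adam_update(theta, grad, m, v, t, alpha=0.01)

# Stream: one Adam step per new observation, halving alpha if the loss plateaus.
alpha, loss_ema, prev_ema = 1e-3, None, np.inf
for step in range(1, 1001):
    t += 1
    x, r = observe()
    err = x @ theta - r
    theta, m, v = adam_update(theta, 2 * err * x, m, v, t, alpha)
    loss_ema = err ** 2 if loss_ema is None else 0.99 * loss_ema + 0.01 * err ** 2
    if step % 200 == 0:                            # crude learning-curve check
        if loss_ema >= prev_ema:
            alpha *= 0.5                           # halve alpha when loss stalls
        prev_ema = loss_ema
```

In a production system the plateau check would typically compare validation loss over longer windows, and a periodic from-scratch retrain would replace the warm-started state entirely.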
Scaling to Multiple Assets
When training models on streams from multiple assets simultaneously, use per-asset learning rates. Large-cap stocks stream more frequently; small-caps less frequently. Adaptive rates naturally balance this without manual intervention.
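One way to realize per-asset rates is to keep independent Adam state (m, v, t) per asset, so an infrequently traded name is not penalized by another asset's gradient history. This is an illustrative sketch; the asset keys and parameter dimension are hypothetical:

```python
import numpy as np

class PerAssetAdam:
    """Independent Adam state per asset: each stream's cadence sets its own pace."""

    def __init__(self, dim, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.dim, self.alpha = dim, alpha
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.state = {}                            # asset -> (theta, m, v, t)

    def step(self, asset, grad):
        theta, m, v, t = self.state.get(
            asset, (np.zeros(self.dim), np.zeros(self.dim), np.zeros(self.dim), 0))
        t += 1
        m = self.beta1 * m + (1 - self.beta1) * grad
        v = self.beta2 * v + (1 - self.beta2) * grad ** 2
        theta = theta - self.alpha * (m / (1 - self.beta1 ** t)) \
            / (np.sqrt(v / (1 - self.beta2 ** t)) + self.eps)
        self.state[asset] = (theta, m, v, t)
        return theta

opt = PerAssetAdam(dim=2)
opt.step("AAPL", np.array([0.4, -0.1]))            # large-cap: frequent ticks
opt.step("AAPL", np.array([0.4, -0.1]))
theta_small = opt.step("XYZ", np.array([0.4, -0.1]))   # small-cap: first tick
```

Because each asset carries its own step counter and moment estimates, a sparse stream still gets properly bias-corrected first steps rather than inheriting a shrunken rate from busier assets.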