Introduction

Reddit's r/WallStreetBets community influences retail trading activity and market sentiment. Topic modeling reveals what traders discuss, such as meme stocks, options strategies, and market psychology. Tracking how those topics evolve can provide signals about retail investor positioning and momentum.

Latent Dirichlet Allocation (LDA) for Topic Discovery

LDA is a probabilistic generative model that discovers latent topics in a collection of documents. Here each Reddit post is a document and its words are the observations. LDA assumes that each post is a mixture of topics and that each word is drawn from one of those topics. Inference estimates the topic-word distributions and the per-post topic mixtures.
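
The generative assumptions above can be written compactly. The notation here is an assumption introduced for illustration and does not appear in the original text: K topics, D posts, topic-word distributions \phi_k, per-post topic mixtures \theta_d, the topic assignment z_{d,n} of the n-th word in post d, the observed word w_{d,n}, and Dirichlet hyperparameters \alpha and \eta.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{align*}
```

Inference reverses this process: given only the words w, it estimates the \phi_k and \theta_d that most plausibly generated them.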

The number of topics is a modeling choice rather than something LDA discovers. On r/WallStreetBets data, 15-20 topics yield interpretable themes: technical analysis, fundamentals, options trading, meme stocks, sector rotation, fear/greed psychology. Changes in each topic's prevalence over time reveal shifts in community sentiment.

Dynamic Topic Models (DTM)

Standard LDA assumes fixed topics. Dynamic topic models track how topics evolve over time. For example, a "YOLO trading" topic peaks during market rallies and declines during crashes, while a "Fear" topic spikes during volatility events.

DTM models each topic's word distribution as a smooth time series, which makes it possible to track how a topic's vocabulary evolves. This reveals how conversation themes change and can provide early warning of sentiment shifts.
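
One common way to formalize this smoothness is the state-space formulation of Blei and Lafferty, sketched below. The text does not say which DTM variant is meant, so treat this as one possible reading. The symbols are assumptions introduced here: b_{t,k} is an unconstrained parameter vector for topic k in time slice t, \sigma^2 controls how quickly a topic may drift, and \phi_{t,k,v} is the resulting probability of word v.

```latex
\begin{align*}
b_{t,k} \mid b_{t-1,k} &\sim \mathcal{N}\!\left(b_{t-1,k},\ \sigma^2 I\right) \\
\phi_{t,k,v} &= \frac{\exp(b_{t,k,v})}{\sum_{v'} \exp(b_{t,k,v'})}
\end{align*}
```

A small \sigma^2 keeps a topic's vocabulary nearly fixed between slices, while a larger value lets it change faster.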

Sentiment Within Topics

Beyond topic prevalence, analyze sentiment within each topic. For example, sentiment in the "options trading" topic correlates with implied volatility, and sentiment in the "meme stocks" topic correlates with retail positioning. Combining topic and sentiment reveals nuanced community attitudes.
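
A minimal sketch of one way to compute per-topic sentiment: weight each post's sentiment score by how much of the post belongs to each topic. The function name, the input layout, and the toy numbers are all hypothetical. The sentiment scores are assumed to come from a separate classifier, which the text does not specify.

```python
from collections import defaultdict


def topic_sentiment(posts):
    """Topic-weighted average sentiment.

    posts: iterable of (topic_weights, sentiment) pairs, where topic_weights
    maps topic name -> proportion of the post and sentiment is a score
    such as one in [-1, 1] from any sentiment model.
    """
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for topic_weights, sentiment in posts:
        for topic, weight in topic_weights.items():
            weighted_sum[topic] += weight * sentiment
            weight_total[topic] += weight
    # Skip topics that never received any weight to avoid dividing by zero.
    return {
        topic: weighted_sum[topic] / weight_total[topic]
        for topic in weight_total
        if weight_total[topic] > 0
    }


# Toy input (made up for illustration, not real WSB data).
posts = [
    ({"options trading": 0.8, "meme stocks": 0.2}, -0.5),
    ({"options trading": 0.1, "meme stocks": 0.9}, 0.7),
    ({"options trading": 0.5, "meme stocks": 0.5}, 0.1),
]
scores = topic_sentiment(posts)
```

Weighting by topic proportion, rather than assigning each post to its single dominant topic, keeps posts that mix several themes from being counted in only one of them.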

Empirical Results on WSB Data

Analyzing 50,000 WSB posts from 2020-2024:

  • Identified 18 stable topics
  • The "meme stocks" topic peaked in January 2021 (GME/AMC squeeze) at 35% of posts
  • The "crash hedging" topic spiked in March 2020 (COVID) and reached 28% of posts
  • Correlation between "Fear" topic prevalence and next-week market volatility: 0.68
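
A lead-lag correlation like the last figure can be computed by pairing each week's topic prevalence with the following week's volatility. The sketch below is an assumption about how such a number might be obtained, not the method behind the 0.68 result. The function names and the toy series are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def lead_lag_correlation(prevalence, volatility, lag=1):
    """Correlate prevalence in period t with volatility in period t + lag."""
    if lag < 1 or lag >= len(prevalence):
        raise ValueError("lag must be between 1 and len(prevalence) - 1")
    return pearson(prevalence[:-lag], volatility[lag:])


# Toy weekly series (made up): volatility one week later is exactly
# twice the earlier prevalence, so the lagged correlation is 1.
weekly_fear_share = [0.1, 0.2, 0.3, 0.4, 0.5]
weekly_volatility = [9.0, 0.2, 0.4, 0.6, 0.8]
r = lead_lag_correlation(weekly_fear_share, weekly_volatility, lag=1)
```

Shifting the volatility series forward, rather than correlating same-week values, is what makes the figure a statement about prediction instead of co-movement.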

Applications for Trading Signals

Monitor topic evolution: a sudden increase in "crash hedging" posts predicts volatility, and increased "sector rotation" discussion precedes actual rotation between sectors. These signals lead traditional sentiment indices by 2-3 days.
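
One simple way to operationalize a "sudden increase" is a trailing z-score: flag a day when a topic's share of posts sits well above its recent average. The window length, the threshold, the function name, and the toy series below are illustrative assumptions and are not taken from the analysis described here.

```python
from statistics import mean, stdev


def spike_days(daily_share, window=7, threshold=3.0):
    """Return indices of days whose topic share exceeds the trailing mean
    by more than `threshold` trailing standard deviations.

    Only the `window` days before each day are used, so the flag for a
    day never depends on that day's own value or on future data.
    """
    flagged = []
    for i in range(window, len(daily_share)):
        history = daily_share[i - window:i]
        mu = mean(history)
        sd = stdev(history)
        if sd > 0 and (daily_share[i] - mu) / sd > threshold:
            flagged.append(i)
    return flagged


# Toy daily share of "crash hedging" posts (made up), with a jump on day 8.
crash_hedging_share = [0.05, 0.06, 0.05, 0.04, 0.06, 0.05, 0.05, 0.06, 0.20, 0.07]
spikes = spike_days(crash_hedging_share)  # -> [8]
```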

Backtest: a strategy that buys when the "meme stocks" topic's popularity increases achieved a Sharpe ratio of 0.8 on retail-heavy names, versus 0.4 for a baseline with random position sizing.
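
To make the backtest idea concrete, here is a minimal sketch of a topic-momentum rule and a Sharpe ratio calculation. The trading rule, function names, and toy numbers are hypothetical. This does not reproduce the backtest or the 0.8 and 0.4 figures above, and it ignores transaction costs.

```python
import math
from statistics import mean, stdev


def topic_momentum_returns(topic_share, asset_returns):
    """Hold the asset on day t only if the topic's share rose from
    day t-2 to day t-1, so each decision uses information that was
    already available before the return is earned."""
    strategy = []
    for t in range(2, len(asset_returns)):
        rose = topic_share[t - 1] > topic_share[t - 2]
        strategy.append(asset_returns[t] if rose else 0.0)
    return strategy


def annualized_sharpe(returns, risk_free_per_period=0.0, periods_per_year=252):
    """Mean excess return divided by its standard deviation, annualized."""
    excess = [r - risk_free_per_period for r in returns]
    if len(excess) < 2:
        raise ValueError("need at least two returns")
    sd = stdev(excess)
    if sd == 0:
        raise ValueError("Sharpe ratio is undefined for constant returns")
    return mean(excess) / sd * math.sqrt(periods_per_year)


# Toy inputs (made up for illustration).
meme_share = [0.10, 0.20, 0.15, 0.30, 0.30]
daily_returns = [0.00, 0.01, 0.02, -0.01, 0.03]
strategy_returns = topic_momentum_returns(meme_share, daily_returns)
sharpe = annualized_sharpe(strategy_returns)
```

The one-day delay between observing the topic share and taking the position is the important design choice: without it, the rule would trade on information from the same day as the return.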

Implementation with Python

Use Gensim or scikit-learn for LDA. Preprocess Reddit posts by tokenizing, removing stop words, and lemmatizing. Train LDA with 15-20 topics on at least 10,000 posts. Infer topics for new posts daily, and track topic proportions and sentiment over time.
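
The following is a minimal sketch of that pipeline using scikit-learn. The six posts are invented toy data, and two topics are used only because the corpus is tiny. Lemmatization is omitted because it needs an extra library such as NLTK or spaCy.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up); a real run would use 10,000+ scraped posts.
posts = [
    "GME to the moon, diamond hands, holding my shares",
    "AMC short squeeze incoming, apes holding shares strong",
    "Bought SPY puts to hedge against a market crash",
    "Hedging my portfolio with VIX calls before the crash",
    "GME and AMC squeeze is not over, moon soon",
    "Protective puts on SPY, volatility is spiking, crash fears",
]

# Tokenize, lowercase, and drop English stop words in one step.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
counts = vectorizer.fit_transform(posts)

# n_components is the number of topics; 15-20 would be used on real data.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one row of topic proportions per post

# Infer the topic mixture of a new post with the already fitted model.
new_counts = vectorizer.transform(["Loading up on GME shares, squeeze soon"])
new_topics = lda.transform(new_counts)
```

Averaging the rows of `doc_topics` by posting date gives the daily topic proportions to track over time. Note that new posts must go through `vectorizer.transform`, not `fit_transform`, so that they share the training vocabulary.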

Limitations

Reddit users may coordinate pump-and-dump schemes, so interpret signals with caution. Topic coherence varies, and some topics capture noise rather than meaningful discussion. Human review of the top words per topic is essential to validate interpretability.
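
To support that review, the highest-weight words of each topic can be listed for a person to inspect. The helper below is a hypothetical sketch with made-up weights. With scikit-learn, the weights could come from a fitted model's `components_` attribute and the vocabulary from the vectorizer's `get_feature_names_out()`.

```python
def top_words(topic_word_weights, vocabulary, n=10):
    """Return the n highest-weight words for each topic.

    topic_word_weights: one sequence of weights per topic, aligned with
    `vocabulary` (same length, same word order).
    """
    result = []
    for weights in topic_word_weights:
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        result.append([vocabulary[i] for i in ranked[:n]])
    return result


# Toy weights (made up) for two topics over a four-word vocabulary.
vocabulary = ["gme", "puts", "moon", "hedge"]
weights = [
    [0.50, 0.05, 0.40, 0.05],
    [0.05, 0.50, 0.05, 0.40],
]
for topic_id, words in enumerate(top_words(weights, vocabulary, n=2)):
    print(topic_id, ", ".join(words))
```

A topic whose top words do not suggest a single recognizable theme is a candidate for being treated as noise.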