Audio Sentiment Analysis of Earnings Calls: From Raw WAV to Alpha
Introduction
Company earnings calls provide rich information about management's views, market outlook, and operational challenges. Sell-side analysts traditionally listen to calls, take notes, and publish research; by the time that research circulates, the information is available to all investors. But the audio itself contains additional signals: tone of voice, hesitation patterns, speech rate changes, and emotional undertones convey information beyond words. Audio sentiment analysis extracts these paralinguistic signals from earnings call recordings, providing sentiment measures minutes after calls end—potentially before published research captures the same information.
Audio Data Sources
Earnings call recordings are available from multiple sources. Companies often publish audio on investor relations websites. Services like FactSet, S&P Global, and other providers curate earnings call audio. Broadcast archives from financial networks (CNBC, Bloomberg) capture call commentary. Public sources such as SEC filings sometimes include transcripts but rarely audio.
Audio quality varies: some calls are studio-recorded high-quality, others are phone conference recordings with background noise. Quality affects analysis accuracy; noisy audio produces less reliable sentiment estimates than clean studio recordings.
Audio Processing Pipeline
Raw audio files (WAV, MP3) must be processed before sentiment analysis. First, convert to consistent format: standardize sample rate (16kHz is standard for speech), bit depth, and mono/stereo. Resample if necessary.
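The standardization step above can be sketched in pure Python. This is a minimal linear-interpolation resampler for illustration only; a production pipeline should use a proper polyphase or sinc resampler (e.g. in scipy or librosa) to avoid aliasing, but the interface is the same.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a mono signal via linear interpolation (illustration only)."""
    if src_rate == dst_rate:
        return list(samples)
    ratio = src_rate / dst_rate
    n_out = int(len(samples) / ratio)
    out = []
    for i in range(n_out):
        pos = i * ratio                       # fractional index in the source
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out

def stereo_to_mono(left, right):
    """Downmix stereo to mono by averaging the two channels."""
    return [(l + r) / 2 for l, r in zip(left, right)]
```

A 44.1 kHz recording run through `resample_linear(sig, 44100, 16000)` comes out at the 16 kHz rate most speech models expect.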
Second, segment the audio. Full earnings calls are 45-90 minutes. Different sections have different information density: opening remarks (forward-looking, emotional), prepared management presentation (structured, careful), Q&A (revealing of actual concerns, less scripted). Analyze each section separately if possible.
Third, handle technical issues: remove silence, address background noise via spectral subtraction or other denoising techniques, handle speaker changes (who is speaking matters: CEO vs CFO vs analyst question). These preprocessing steps are crucial because audio quality dramatically affects downstream analysis.
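Silence removal, the simplest of these steps, can be done by dropping low-energy frames. A sketch with a fixed RMS threshold; real pipelines typically use a voice-activity detector (such as WebRTC VAD) that adapts to the noise floor instead of a hard-coded cutoff.

```python
import math

def remove_silence(samples, frame_len=160, threshold=0.01):
    """Drop frames whose RMS energy falls below a threshold.

    frame_len=160 is 10 ms at 16 kHz; threshold is an assumed value
    for normalized [-1, 1] samples.
    """
    voiced = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= threshold:
            voiced.extend(frame)
    return voiced
```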
Speech Recognition: Converting Audio to Text
Sentiment analysis requires converting audio to text. Automatic Speech Recognition (ASR) using deep learning models (Wav2Vec, Whisper) enables transcription with high accuracy. Modern ASR systems achieve 95%+ word accuracy on clean audio, though accuracy degrades with noise and accents.
Transcription errors affect sentiment analysis downstream. Misheard words can change meaning ("profit" misheard as "loss" dramatically changes sentiment). For earnings calls, the stakes are high, so consider human transcription or hybrid approaches (automatic transcription for efficiency, human review for accuracy on uncertain sections).
Timestamps matter: preserve word-level timing information from ASR. This enables mapping sentiment back to specific sections of the call and specific speakers.
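Combining word-level ASR timestamps with diarization output reduces to an interval lookup. The tuple shapes below are hypothetical; adapt them to whatever your ASR and diarization tools actually emit.

```python
from bisect import bisect_right

def label_words(words, turns):
    """Attach a speaker label to each timestamped word.

    words: list of (start_sec, word) pairs from ASR output.
    turns: list of (start_sec, speaker) diarization turns, sorted by time.
    """
    starts = [t[0] for t in turns]
    labeled = []
    for start, word in words:
        # Find the last turn that began at or before this word.
        idx = bisect_right(starts, start) - 1
        speaker = turns[idx][1] if idx >= 0 else "unknown"
        labeled.append((start, word, speaker))
    return labeled
```

With speakers attached, sentiment can be aggregated per speaker (CEO vs CFO vs analysts) and per section rather than over the whole call.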
Sentiment Analysis Approaches
Lexicon-based approach: use dictionaries of positive and negative words. Count occurrences of each, calculate net sentiment. Simple, interpretable, but misses context. "Not bad" is positive but lexicon-based analysis might count "bad" as negative.
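A lexicon counter with the simplest possible negation fix looks like this. The word lists here are toy placeholders; a real implementation would use a finance-specific lexicon such as Loughran-McDonald.

```python
POSITIVE = {"growth", "strong", "record", "improved", "good"}   # toy lists
NEGATIVE = {"loss", "decline", "weak", "bad", "headwinds"}
NEGATORS = {"not", "no", "never", "without"}

def lexicon_sentiment(tokens):
    """Net sentiment in [-1, 1]: (pos - neg) / matched, flipping polarity
    when the preceding token is a negator, so "not bad" counts positive."""
    pos = neg = 0
    for i, tok in enumerate(tokens):
        negated = i > 0 and tokens[i - 1] in NEGATORS
        if tok in POSITIVE:
            pos, neg = (pos, neg + 1) if negated else (pos + 1, neg)
        elif tok in NEGATIVE:
            pos, neg = (pos + 1, neg) if negated else (pos, neg + 1)
    matched = pos + neg
    return (pos - neg) / matched if matched else 0.0
```

One-token lookback only catches adjacent negation ("was not particularly bad" still misfires), which is exactly the context blindness the paragraph above describes.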
Machine learning approach: train classifiers (SVM, neural networks) on labeled training data (earnings calls with known sentiment outcomes). Features include word frequencies, TF-IDF, or embeddings (word2vec, BERT). More flexible, captures context, but requires labeled training data.
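The TF-IDF feature step can be shown compactly. This hand-rolled version (smoothed IDF) is for exposition; in practice sklearn's TfidfVectorizer handles tokenization and normalization options.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF vectors (as dicts) for a list of token lists, smoothed IDF."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                      # document frequency per term
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: (cnt / len(doc)) * idf[t] for t, cnt in tf.items()})
    return vecs
```

Terms appearing in every call (boilerplate like "thank you, operator") get low weight, while call-specific vocabulary dominates the feature vector fed to the classifier.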
Transformer-based approach: use pre-trained language models (BERT, FinBERT, RoBERTa) fine-tuned for financial sentiment. These models understand context deeply and capture nuanced sentiment. Most accurate for finance domain but requires access to good fine-tuning datasets.
Paralinguistic Signals Beyond Words
Audio sentiment goes beyond words. Voice quality signals emotion: higher pitch often indicates excitement or stress, while a lower, steadier pitch can signal calm confidence or deliberate restraint. Speech rate: fast talking might indicate nervousness, slow might indicate careful language. Pauses: longer pauses before difficult questions suggest discomfort.
Vocal effort: loud, energetic speech often indicates positive mood; quiet, restrained speech might indicate concern or bad news being delivered carefully. These signals require acoustic analysis beyond standard speech recognition.
Implementation: extract acoustic features from audio (prosody features like pitch, intensity, and duration, plus spectral features). Use these as additional features in sentiment models alongside textual features. Published studies report that acoustic features improve sentiment classification accuracy by roughly 5-15% relative to text-only approaches, though gains depend heavily on audio quality.
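Pitch, the most commonly used prosody feature, can be estimated per frame by finding the autocorrelation peak. This is a bare-bones sketch; production systems use robust trackers (pYIN in librosa, or Praat) and combine pitch with intensity and duration features.

```python
import math

def estimate_pitch(samples, sample_rate, fmin=60, fmax=400):
    """Fundamental frequency estimate via autocorrelation peak search.

    Searches lags corresponding to fmin..fmax Hz (a typical speech range)
    and returns the frequency of the strongest self-similarity.
    """
    lo = int(sample_rate / fmax)          # shortest candidate period
    hi = int(sample_rate / fmin)          # longest candidate period
    best_lag, best_corr = 0, 0.0
    for lag in range(lo, min(hi, len(samples) - 1)):
        corr = sum(samples[i] * samples[i + lag]
                   for i in range(len(samples) - lag))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return sample_rate / best_lag if best_lag else 0.0
```

Tracking this value per 10-50 ms frame over the call yields the pitch contour whose mean and variance feed the sentiment model.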
Executive Speech Analysis
Research shows executive speech patterns during earnings calls predict stock returns. Executives use more positive words during bad earnings calls (overcompensating); aggressive language correlates with overconfidence and subsequent underperformance. Hesitation patterns, filler words ("um," "uh"), and verbal stumbles correlate with earnings misses.
Implementation: quantify CEO speech patterns. Count filler words, measure speaking pace, track word choice formality (complex words vs simple). Train models predicting stock returns from these speech features. Published results suggest modest predictive power: elevated hesitation and certain word-choice shifts are associated with predictable returns over subsequent weeks.
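The first two features are straightforward to compute from a timestamped transcript. The filler list below is a toy placeholder and the feature set is a hypothetical starting point, not a validated model.

```python
FILLERS = {"um", "uh", "er", "ah"}  # toy list of single-token fillers

def speech_metrics(tokens, duration_minutes):
    """Speaking pace and filler-word rate from a tokenized transcript."""
    fillers = sum(1 for t in tokens if t in FILLERS)
    return {
        "words_per_minute": len(tokens) / duration_minutes,
        "filler_rate": fillers / len(tokens) if tokens else 0.0,
    }
```

Computed per speaker (using the diarization labels from earlier in the pipeline), these metrics isolate the CEO's delivery from analysts' questions.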
Temporal Dynamics
Sentiment isn't constant throughout the call. Track sentiment evolution: does management's tone become more defensive as Q&A progresses? More confident? Sentiment drift (sentiment getting more negative) often signals concern.
Compare across sections: prepared remarks should be carefully scripted (more positive). Q&A is less controlled (potentially more honest). If prepared remarks are positive but Q&A is negative, that divergence is informative: management's carefully scripted message breaks down under questioning.
Year-over-year comparison: compare current call's sentiment to prior year calls. Significant changes (much more negative, much more positive) signal changing circumstances or management outlook shifts.
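The temporal features described above reduce to two simple statistics, sketched here over per-window sentiment scores (assumed to come from any of the sentiment methods in the previous section):

```python
def sentiment_drift(window_scores):
    """Least-squares slope of per-window sentiment across the call.
    A negative slope means tone deteriorates as the call progresses."""
    n = len(window_scores)
    mean_x = (n - 1) / 2
    mean_y = sum(window_scores) / n
    num = sum((x - mean_x) * (y - mean_y)
              for x, y in enumerate(window_scores))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den if den else 0.0

def section_divergence(prepared_score, qa_score):
    """Positive when scripted remarks are rosier than the Q&A."""
    return prepared_score - qa_score
```

The same `section_divergence` form applies year over year: current-call score minus the prior-year score flags outlook shifts.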
Converting Sentiment to Trading Signals
Basic approach: if earnings call sentiment is positive, it's bullish. If negative, bearish. More sophisticated: compare sentiment to market expectations. If sentiment is much more positive than analyst consensus expected, that's a positive surprise (bullish). If more negative, it's a negative surprise (bearish).
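The surprise-versus-consensus rule can be written as a thresholded mapping. The threshold value and the idea of scoring consensus from pre-call analyst notes are assumptions for illustration.

```python
def sentiment_signal(call_score, consensus_score, threshold=0.1):
    """Map sentiment surprise to a trading stance.

    call_score: sentiment measured from the call (e.g. in [-1, 1]).
    consensus_score: assumed pre-call expectation on the same scale.
    threshold keeps small, likely-noise differences out of the book.
    """
    surprise = call_score - consensus_score
    if surprise > threshold:
        return "long"
    if surprise < -threshold:
        return "short"
    return "flat"
```

The dead zone between the thresholds matters in practice: most calls match expectations closely, and trading every tiny surprise would mostly generate transaction costs.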
Timing consideration: sentiment becomes available minutes after call ends. Major market movements might already be priced in. Advantage exists only if sentiment reveals something different from what's obvious in the transcript (which is published immediately). The paralinguistic signals (tone, speaking patterns) matter because they're not captured in transcripts.
Validation: construct trading rules based on sentiment signals. Back-test on past earnings calls. Calculate cumulative alpha from sentiment-guided trading. Walk-forward validate on recent quarters not used for model training. Only deploy if walk-forward validation shows positive returns after transaction costs.
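The out-of-sample scoring step can be sketched as below. This toy version just trades the sign of the signal after a burn-in window; a real backtest would refit the model each step, size positions, and subtract transaction costs per trade.

```python
def walk_forward_pnl(events, burn_in=4):
    """Cumulative out-of-sample return from trading sign(signal).

    events: chronological list of (signal, realized_return) pairs, one per
    earnings call. The first burn_in events are reserved for model fitting
    and never scored. Deploy only if the result stays positive after costs.
    """
    pnl = 0.0
    for i, (signal, ret) in enumerate(events):
        if i < burn_in:
            continue                      # training window, not scored
        side = 1 if signal > 0 else (-1 if signal < 0 else 0)
        pnl += side * ret
    return pnl
```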
Data Quality and Pitfalls
Pitfall 1: Transcription Errors. ASR errors propagate into sentiment analysis. Validate transcriptions against published transcripts (companies often publish corrected transcripts 1-2 days after calls).
Pitfall 2: Context Ignorance. Sentiment analysis misses context. "We're concerned about..." might be risk disclosure or actual concern. Negation handling is important: "not bad" should be positive, but lexicon approaches often misclassify.
Pitfall 3: Information Already Priced. If trading occurs after call is published (available to all), any signal is likely already reflected in prices. Edge exists only in timing (very early in call) or in paralinguistic signals not captured in transcripts.
Conclusion
Audio sentiment analysis of earnings calls converts spoken language into quantitative signals. The pipeline includes audio processing, speech recognition, sentiment analysis (text-based and acoustic-based), and trading signal generation. Most value comes from paralinguistic signals (tone, speech patterns) that aren't captured in published transcripts. Successful applications require careful validation against both transcript-based sentiment (to measure incremental value of audio) and subsequent stock returns (to verify predictive power). Combined with other signals, earnings call sentiment analysis can provide meaningful edge in securities selection, particularly for quarterly event-driven trading strategies.