Building a Domain-Specific Financial Vocabulary from Scratch
Introduction
General NLP models trained on web text often lack financial domain knowledge. "Guidance" means something different in finance (forward-looking statements from management) than in general English (instructions or advice). Building a specialized financial vocabulary improves a model's understanding of domain-specific language.
Domain-Specific Vocabulary Requirements
Financial vocabulary includes:
- Technical terms: P/E ratio, EBITDA, basis points, beta, volatility
- Events: earnings beat, M&A announcement, bankruptcy, IPO
- Sentiment: "guidance raised" (positive) vs "capital allocation" (neutral)
- Acronyms: SEC, FOMC, VIX, RSI, MACD
- Company-specific: ticker symbols, company names, executive names
Vocabulary Construction Process
1. Collect large financial corpus: earnings transcripts, SEC filings, financial news, research reports. Target 1-10 billion tokens.
2. Identify domain-specific terms: run a word frequency analysis on the financial corpus and compare relative frequencies against a general-English corpus. Terms that are frequent in finance but rare in general English are domain-specific.
3. Build a specialized tokenizer: add domain terms as single tokens (don't split "basis points" into "basis" + "points").
4. Evaluate: test whether models trained with specialized vocabulary outperform general vocabulary.
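One way to keep multiword terms from step 3 together is to merge them before tokenization. The following is a minimal sketch, not a definitive implementation: the term list `MULTIWORD_TERMS` and the underscore-joining convention are assumptions made for illustration.

```python
import re

# Hypothetical list of multiword financial terms; in practice it would come
# from the frequency analysis in step 2.
MULTIWORD_TERMS = ["basis points", "earnings per share", "free cash flow"]

def build_phrase_pattern(terms):
    """Compile one regex matching any term, longest first, so that longer
    phrases win over shorter ones that share a prefix."""
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", flags=re.IGNORECASE)

def merge_phrases(text, pattern):
    """Replace the spaces inside each matched phrase with underscores so that
    a whitespace-based tokenizer keeps the phrase as one unit."""
    return pattern.sub(lambda m: m.group(0).replace(" ", "_"), text)

pattern = build_phrase_pattern(MULTIWORD_TERMS)
merged = merge_phrases("The Fed raised rates by 25 basis points.", pattern)
print(merged)  # The Fed raised rates by 25 basis_points.
```

The merged form ("basis_points") can then be added to the tokenizer vocabulary as a single entry.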
Frequency Analysis: Domain-Specific Terms
In financial text, high-frequency terms include "guidance" (0.5% of financial text vs 0.01% of general text) and "beat" (0.3% vs 0.05%), along with outlook, margin, and EBITDA. Terms with this kind of frequency gap belong in the vocabulary.
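The comparison can be sketched as a frequency ratio between two corpora. This is an illustrative sketch using only the standard library. The two tiny corpora, the `min_ratio` threshold, and the `floor` value for unseen words are assumptions, not values from a real study.

```python
from collections import Counter
import re

def relative_frequencies(texts):
    """Return each lowercase word's share of all word tokens in `texts`."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z&/']+", text.lower()))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def domain_terms(domain_freq, general_freq, min_ratio=5.0, floor=1e-6):
    """Rank words by (domain frequency / general frequency).
    `floor` stands in for words never seen in the general corpus so the
    ratio stays finite."""
    scored = []
    for word, f_dom in domain_freq.items():
        ratio = f_dom / max(general_freq.get(word, 0.0), floor)
        if ratio >= min_ratio:
            scored.append((word, ratio))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy corpora (assumptions for illustration only).
financial = [
    "management raised guidance and ebitda margin improved",
    "the company beat guidance for the quarter",
]
general = [
    "the guidance counselor gave advice for the students",
    "the dog ran across the park and the field",
]

terms = dict(domain_terms(relative_frequencies(financial),
                          relative_frequencies(general),
                          min_ratio=2.0))
print(sorted(terms))
```

On a real corpus it is usually worth adding a minimum count as well, since a word seen once in the financial corpus and never in the general one gets a very large ratio.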
Handling Financial Acronyms
Financial text is acronym-heavy: SEC, FOMC, VIX, MACD, RSI, EPS, ROIC. A general-purpose subword tokenizer may split "VIX" into pieces such as "V" + "IX", which carry no meaning on their own. Add financial acronyms as vocabulary tokens so that each one gets its own representation.
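The effect of adding acronyms can be illustrated with a greedy longest-match segmenter, a simplified stand-in for how WordPiece-style tokenizers pick pieces. Both vocabularies below are hypothetical toy sets, and the function omits details of real tokenizers such as continuation markers.

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match-first segmentation of a single word.
    Falls back to single characters when no longer piece is in `vocab`."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the vocabulary
        # or only one character is left.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical toy vocabularies.
general_vocab = {"V", "IX", "E", "PS", "F", "OM", "C"}
financial_vocab = general_vocab | {"VIX", "EPS", "FOMC"}

for acronym in ("VIX", "FOMC"):
    print(acronym,
          greedy_tokenize(acronym, general_vocab),
          greedy_tokenize(acronym, financial_vocab))
```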
Named Entity Vocabulary
Company names, ticker symbols, and executive names are important for financial NLP. Add as special tokens: "
Sentiment Vocabulary for Finance
Words have context-dependent sentiment. In finance:
- "Aggressive" = aggressive growth (positive), aggressive accounting (negative)
- "Weak" = weak demand (negative), weak dollar (positive for exporters)
- "Sharply higher" = higher costs (negative), higher earnings (positive)
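One simple way to capture this context dependence is to score two-word phrases rather than single words. The sketch below is illustrative only: the lexicon `PHRASE_POLARITY` is a hypothetical table built from the examples above, not a published resource, and "weak dollar" is scored from an exporter's point of view.

```python
# Hypothetical phrase-level polarity lexicon (+1 positive, -1 negative).
PHRASE_POLARITY = {
    ("aggressive", "growth"): 1,
    ("aggressive", "accounting"): -1,
    ("weak", "demand"): -1,
    ("weak", "dollar"): 1,      # positive from an exporter's point of view
    ("higher", "costs"): -1,
    ("higher", "earnings"): 1,
}

def phrase_sentiment(text):
    """Sum the polarity of every adjacent word pair found in the lexicon."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return sum(PHRASE_POLARITY.get(pair, 0) for pair in zip(words, words[1:]))

print(phrase_sentiment("Sharply higher earnings"))
print(phrase_sentiment("Aggressive accounting and weak demand."))
```

A word-level lexicon would have to give "aggressive" or "weak" one fixed score. Keying on the pair lets the same word contribute in either direction.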
Vocabulary Size Optimization
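If the vocabulary is too small, unfamiliar words are split into many subwords, which lengthens sequences and slows training and inference. If it is too large, the embedding matrix grows, the model becomes bloated, and training slows. A reasonable size for a financial vocabulary is 32k-50k tokens (versus roughly 30k for general English).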
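Both sides of this trade-off can be quantified. The sketch below counts embedding weights for a given vocabulary size and measures tokens per word for any tokenizer function. The hidden size of 768 is an assumption (a BERT-base-sized model), and the two "tokenizers" are trivial stand-ins used only to show the measurement.

```python
def embedding_parameters(vocab_size, hidden_size):
    """Number of weights in a vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size

def tokens_per_word(tokenize, texts):
    """Average number of tokens produced per whitespace-delimited word."""
    n_tokens = sum(len(tokenize(text)) for text in texts)
    n_words = sum(len(text.split()) for text in texts)
    return n_tokens / n_words

HIDDEN_SIZE = 768  # assumption: a BERT-base-sized model

for vocab_size in (30_000, 32_000, 50_000):
    print(vocab_size, embedding_parameters(vocab_size, HIDDEN_SIZE))

# Stand-in tokenizers: whitespace splitting gives exactly 1.0 tokens per word,
# character-level splitting gives many more.
sample = ["EBITDA margin expanded"]
print(tokens_per_word(str.split, sample))
print(tokens_per_word(lambda text: list(text.replace(" ", "")), sample))
```

Measuring tokens per word on held-out financial text for several candidate vocabulary sizes shows where further growth stops shortening sequences.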
Training Tokenizer on Financial Corpus
Use the Hugging Face `tokenizers` library to train a BPE (Byte Pair Encoding) tokenizer on the financial corpus. The result is a specialized tokenizer that represents financial language efficiently, with fewer tokens per document.
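A minimal training sketch with the `tokenizers` library might look like the following. The three-sentence corpus is a toy stand-in for a real financial corpus, and the vocabulary size of 500 is chosen only so the example runs quickly. Both are assumptions for illustration.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Toy stand-in for a real financial corpus (assumption for illustration).
corpus = [
    "Management raised full-year guidance after an earnings beat.",
    "EBITDA margin expanded by 50 basis points.",
    "The FOMC held rates steady and the VIX fell.",
] * 50

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # split on whitespace and punctuation

trainer = BpeTrainer(
    vocab_size=500,        # tiny, for the toy corpus only
    min_frequency=2,
    special_tokens=["[UNK]", "[PAD]"],
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Acronyms can also be forced in as whole tokens after training.
tokenizer.add_tokens(["FOMC", "VIX", "EBITDA"])

encoding = tokenizer.encode("The VIX fell after the FOMC meeting.")
print(encoding.tokens)
```

For a real corpus, the iterator would stream documents from disk, and the trained tokenizer can be written out with `tokenizer.save(path)`.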
Evaluation: Impact on Performance
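Train the same sentiment model twice, once with the general vocabulary and once with the financial vocabulary, and compare both on the same held-out test set. In this comparison, the financial vocabulary model achieves 95% accuracy and the general vocabulary model 91%. The improvement comes from: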
- Better representation of domain terms (no subword splitting)
- Better embeddings for those terms, learned from seeing them in their financial context
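When comparing two models on the same test set, it helps to check whether the accuracy gap is larger than sampling noise. The sketch below uses a paired bootstrap over test examples. The predictions are synthetic, constructed only to mirror the 95% and 91% accuracies mentioned above, and are not real model output.

```python
import random

def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def paired_bootstrap_diff(preds_a, preds_b, labels, n_resamples=2000, seed=0):
    """Return (observed_diff, lower, upper): accuracy of A minus accuracy of B,
    with a 95% percentile bootstrap interval from resampling test examples."""
    rng = random.Random(seed)
    n = len(labels)
    observed = accuracy(preds_a, labels) - accuracy(preds_b, labels)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        acc_a = sum(preds_a[i] == labels[i] for i in idx) / n
        acc_b = sum(preds_b[i] == labels[i] for i in idx) / n
        diffs.append(acc_a - acc_b)
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return observed, lower, upper

# Synthetic data (assumption): 100 examples, 5 errors vs 9 errors.
labels = [1, 0] * 50
preds_financial = [1 - y if i < 5 else y for i, y in enumerate(labels)]
preds_general = [1 - y if i < 9 else y for i, y in enumerate(labels)]

observed, lower, upper = paired_bootstrap_diff(preds_financial, preds_general, labels)
print(observed, lower, upper)
```

If the interval includes zero, the test set is probably too small to support a claim that one vocabulary is better.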
Maintaining and Updating Vocabulary
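Financial language evolves: new terms emerge ("blockchain", "ESG", "SPACs") and old terms fall out of use. Periodically (quarterly or annually), analyze recent financial text, identify newly frequent terms, and update the vocabulary so that it stays current. Tokens added to an already trained model start with untrained embeddings, so further training is needed after an update.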
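The periodic check can be sketched as a scan of recent text for frequent terms that the current vocabulary lacks. The vocabulary, the recent texts, and the `min_count` threshold below are toy assumptions. The sketch also assumes vocabulary entries are stored in lowercase.

```python
from collections import Counter
import re

def candidate_new_terms(recent_texts, vocabulary, min_count=3):
    """Return (term, count) pairs for terms that are frequent in recent text
    but missing from `vocabulary` (compared in lowercase)."""
    counts = Counter()
    for text in recent_texts:
        counts.update(re.findall(r"[A-Za-z][A-Za-z&\-]*", text))
    return [
        (term, n)
        for term, n in counts.most_common()
        if n >= min_count and term.lower() not in vocabulary
    ]

# Toy inputs (assumptions for illustration).
vocabulary = {"guidance", "margin", "raised", "on", "and"}
recent_texts = [
    "SPACs raised guidance on ESG",
    "ESG and SPACs",
    "ESG margin SPACs",
]

new_terms = candidate_new_terms(recent_texts, vocabulary)
print(new_terms)
```

Candidates still need manual review before being added, since frequent new strings can also be typos or boilerplate.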
Open-Source Financial Vocabularies
The community has published financial vocabularies (FinBERT vocab, FinVocab). Consider using one of these before building from scratch, and evaluate its quality on your specific financial NLP task.