Building a Domain-Specific Financial Vocabulary from Scratch

Category: Natural Language Processing • Article #19 • Reading time: 5 minutes

Introduction

General NLP models trained on web text often lack financial domain knowledge. "Guidance" means something different in finance (forward-looking statements) than in general English (instructions). Building a specialized financial vocabulary helps a model understand domain-specific language.

Domain-Specific Vocabulary Requirements

Financial vocabulary includes:

Technical terms: P/E ratio, EBITDA, basis points, beta, volatility
Events: earnings beat, M&A announcement, bankruptcy, IPO
Sentiment: "guidance raised" (positive) vs "capital allocation" (neutral)
Acronyms: SEC, FOMC, VIX, RSI, MACD
Company-specific: ticker symbols, company names, executive names

Vocabulary Construction Process

1. Collect large financial corpus: earnings transcripts, SEC filings, financial news, research reports. Target 1-10 billion tokens.
2. Identify domain-specific terms: run word frequency analysis, compare to general English. Terms with high frequency in finance but low in general English are domain-specific.
3. Build specialized tokenizer: add domain terms as single tokens (don't split "basis points" into "basis" + "points").
4. Evaluate: test whether models trained with the specialized vocabulary outperform models trained with a general vocabulary.
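
Step 3 can be sketched with a small pre-processing pass that joins known multiword terms before whitespace tokenization. This is a minimal illustration, not a full tokenizer: the term list, the underscore joiner, and the function names are assumptions made for the example.

```python
import re

# Hypothetical list of multiword domain terms that should stay together.
MULTIWORD_TERMS = ["basis points", "earnings beat", "capital allocation"]

def merge_multiword_terms(text, terms=MULTIWORD_TERMS, joiner="_"):
    """Rewrite each known multiword term as one joined token.

    Matching is case-insensitive; the output uses the casing from `terms`.
    """
    # Longest terms first, so a longer phrase wins over a shorter one it contains.
    for term in sorted(terms, key=len, reverse=True):
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in term.split()) + r"\b"
        joined = term.replace(" ", joiner)
        text = re.sub(pattern, lambda _m, j=joined: j, text, flags=re.IGNORECASE)
    return text

def tokenize(text):
    """Whitespace tokenization after multiword terms have been joined."""
    return merge_multiword_terms(text).split()

print(tokenize("Rates rose 25 basis points today"))
```

With a trained subword tokenizer, the same effect is usually achieved by registering the terms as added tokens rather than by rewriting the text.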

Frequency Analysis: Domain-Specific Terms

In financial text, high-frequency terms include guidance (0.5% of financial text vs 0.01% of general text), beat (0.3% vs 0.05%), outlook, margin, and EBITDA. Terms with this kind of frequency gap belong in the vocabulary.
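
The comparison can be sketched as a frequency ratio between the two corpora. The example reuses the figures quoted above for "guidance" and "beat"; the entry for "the", the threshold of 5, and the small floor value that avoids division by zero are assumptions chosen for illustration.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Map each token to its share of all token occurrences."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

def domain_specific_terms(domain_freq, general_freq, min_ratio=5.0, floor=1e-6):
    """Return (term, ratio) pairs whose domain/general frequency ratio
    is at least `min_ratio`, sorted from most to least domain-specific."""
    scored = []
    for term, freq in domain_freq.items():
        # `floor` stands in for terms that never occur in the general corpus.
        ratio = freq / max(general_freq.get(term, 0.0), floor)
        if ratio >= min_ratio:
            scored.append((term, ratio))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Frequencies as fractions of text; "the" is an illustrative common word.
finance = {"guidance": 0.005, "beat": 0.003, "the": 0.05}
general = {"guidance": 0.0001, "beat": 0.0005, "the": 0.05}

for term, ratio in domain_specific_terms(finance, general):
    print(f"{term}: about {ratio:.0f}x more frequent in financial text")
```

In practice `relative_frequencies` would be run over tokenized samples of each corpus to produce the two dictionaries.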

Handling Financial Acronyms

Financial text is acronym-heavy: SEC, FOMC, VIX, MACD, RSI, EPS, ROIC. A general-purpose subword tokenizer may split "VIX" into pieces such as "V" + "IX", which carry no meaning on their own. Add financial acronyms as vocabulary tokens so that each one gets its own representation.
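
The effect can be shown with a simplified greedy longest-match splitter, similar in spirit to WordPiece but without continuation prefixes. Both vocabularies below are toy assumptions, not the vocabulary of any real model.

```python
def greedy_subword_split(word, vocab, unk="[UNK]"):
    """Split `word` into the longest pieces found in `vocab`, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # nothing in the vocabulary matches at this position
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces

# Toy vocabularies: the second one adds acronyms as whole tokens.
general_vocab = {"V", "IX", "S", "E", "C"}
financial_vocab = general_vocab | {"VIX", "SEC", "FOMC"}

print(greedy_subword_split("VIX", general_vocab))    # ['V', 'IX']
print(greedy_subword_split("VIX", financial_vocab))  # ['VIX']
```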

Named Entity Vocabulary

Company names, ticker symbols, and executive names matter for financial NLP. Represent each entity type with its own special placeholder token, so that a model can handle unfamiliar companies without breaking tokenization.
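
One way to sketch the idea is to swap known entity strings for placeholder tokens before tokenization. The placeholder names `[COMPANY]` and `[TICKER]`, the function name, and the made-up company "Acme Corp" are all hypothetical and chosen only for illustration. The sketch also assumes the entity lists are known in advance; in practice a named entity recognizer would likely supply them.

```python
import re

# Hypothetical placeholder names, not taken from any published vocabulary.
COMPANY_TOKEN = "[COMPANY]"
TICKER_TOKEN = "[TICKER]"

def _substitute(text, names, token):
    """Replace every whole-word occurrence of each name with `token`."""
    # Longest names first, so "Acme Corp" is replaced before "Acme".
    for name in sorted(names, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        text = re.sub(pattern, lambda _m: token, text)
    return text

def replace_entities(text, company_names, tickers):
    """Replace company names first, then ticker symbols (case-sensitive)."""
    text = _substitute(text, company_names, COMPANY_TOKEN)
    return _substitute(text, tickers, TICKER_TOKEN)

print(replace_entities("Acme Corp (ACME) raised guidance.", ["Acme Corp"], ["ACME"]))
```

The placeholders would also need to be registered as special tokens in the tokenizer so they are never split.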

Sentiment Vocabulary for Finance

Words have context-dependent sentiment. In finance:

"Aggressive" = aggressive growth (positive), aggressive accounting (negative)
"Weak" = weak demand (negative), weak dollar (positive for exporters)
"Sharply higher" = higher costs (negative), higher earnings (positive)

Build a finance sentiment lexicon that captures these nuances.
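
A minimal sketch of such a lexicon keys sentiment on word pairs rather than single words, using the examples above. The +1/-1 scores, the exporter's point of view for "weak dollar", and the simple regex tokenization are assumptions; a real lexicon would be far larger and would need handling for negation and longer phrases.

```python
import re

# Phrase-level entries: the same adjective scores differently by context.
PHRASE_SENTIMENT = {
    ("aggressive", "growth"): 1,
    ("aggressive", "accounting"): -1,
    ("weak", "demand"): -1,
    ("weak", "dollar"): 1,      # assumption: scored from an exporter's view
    ("higher", "costs"): -1,
    ("higher", "earnings"): 1,
}

def score_sentiment(text):
    """Sum the scores of all adjacent word pairs found in the lexicon."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(PHRASE_SENTIMENT.get(pair, 0) for pair in zip(tokens, tokens[1:]))

print(score_sentiment("Weak demand and aggressive accounting"))  # -2
print(score_sentiment("Sharply higher earnings"))                # 1
```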

Vocabulary Size Optimization

If the vocabulary is too small, unfamiliar words are split into many subwords, which lengthens sequences and slows processing. If it is too large, the embedding table bloats the model and training slows. A reasonable target for a financial vocabulary is 32k-50k tokens (versus about 30k for a general English vocabulary).
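
The cost of a larger vocabulary can be estimated from the size of the input embedding table, which grows linearly with vocabulary size. The hidden size of 768 below is an assumption chosen for illustration, not a figure from this article.

```python
def embedding_parameters(vocab_size, hidden_size):
    """Number of weights in a vocab_size x hidden_size embedding table."""
    return vocab_size * hidden_size

HIDDEN_SIZE = 768  # assumed embedding width, for illustration only

for vocab_size in (30_000, 32_000, 50_000):
    params = embedding_parameters(vocab_size, HIDDEN_SIZE)
    print(f"{vocab_size:>6} tokens -> {params:,} embedding parameters")
```

Under that assumption, moving from 30k to 50k tokens adds about 15.4 million embedding parameters.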

Training Tokenizer on Financial Corpus

Use the Hugging Face `tokenizers` library to train a BPE (Byte Pair Encoding) tokenizer on the financial corpus. The result is a specialized tokenizer that represents financial language with fewer tokens per document.
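
To show what BPE training does under the hood, here is a toy pure-Python version of the merge-learning loop. It is not the `tokenizers` API, and it skips pre-tokenization details and end-of-word markers; the function names and the tiny corpus are assumptions for the example.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Merge every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merges, each joining the most frequent pair."""
    vocab = Counter(tuple(word) for word in words)  # start from characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:  # every word is already a single symbol
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged_vocab = Counter()
        for symbols, count in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += count
        vocab = merged_vocab
    return merges

def apply_merges(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

corpus = ["ebitda"] * 5 + ["margin"] * 3  # toy corpus of frequent domain words
merges = learn_bpe_merges(corpus, num_merges=10)
print(apply_merges("ebitda", merges))  # ['ebitda']
print(apply_merges("xyz", merges))     # ['x', 'y', 'z']
```

Words that are frequent in the training corpus end up as single tokens, which is why a tokenizer trained on financial text needs fewer tokens per financial document.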

Evaluation: Impact on Performance

Train one sentiment model with a general vocabulary and one with the financial vocabulary, then compare them on the same held-out data. In this example, the financial-vocabulary model achieves 95% accuracy and the general-vocabulary model 91%. The improvement comes from:

Better representation of domain terms (no subword splitting)
Improved context from seeing terms in domain context
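
A comparison like this can be sketched by scoring both models on the same labeled examples and counting where only one of them is right. The labels and predictions below are made-up toy values that show the mechanics; they are not results from any real model.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def compare_models(preds_general, preds_financial, labels):
    """Accuracy of each model plus counts of examples only one gets right."""
    rows = list(zip(preds_general, preds_financial, labels))
    return {
        "general_accuracy": accuracy(preds_general, labels),
        "financial_accuracy": accuracy(preds_financial, labels),
        "only_general_correct": sum(g == y and f != y for g, f, y in rows),
        "only_financial_correct": sum(f == y and g != y for g, f, y in rows),
    }

# Toy data: 1 = positive sentiment, 0 = negative sentiment.
labels = [1, 0, 1, 1, 0]
general_preds = [1, 0, 0, 1, 1]
financial_preds = [1, 0, 1, 1, 1]

print(compare_models(general_preds, financial_preds, labels))
```

The two "only correct" counts show whether an accuracy gap comes from many examples or just a handful, which matters on small test sets.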

Maintaining and Updating Vocabulary

Financial language evolves: new terms emerge ("blockchain", "ESG", "SPACs") and old terms fall out of use. Periodically (quarterly or annually), analyze recent financial text, identify new frequent terms, and update the vocabulary. This keeps the vocabulary current.
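
The periodic check can be sketched as a count of frequent words in recent text that the current vocabulary lacks. The function name, the `min_count` threshold, the lowercase letter-only tokenization, and the sample sentences are assumptions for illustration.

```python
import re
from collections import Counter

def find_new_terms(recent_texts, vocab, min_count=3):
    """Return (term, count) pairs that occur at least `min_count` times in
    recent text and are missing from the existing vocabulary."""
    counts = Counter()
    for text in recent_texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return [(term, n) for term, n in counts.most_common()
            if n >= min_count and term not in vocab]

# Toy example with made-up sentences and a small existing vocabulary.
recent = [
    "SPACs surged as ESG funds grew",
    "ESG and SPACs dominate",
    "ESG inflows and SPACs listings",
]
existing_vocab = {"surged", "as", "funds", "grew", "and",
                  "dominate", "inflows", "listings"}

print(find_new_terms(recent, existing_vocab))
```

Candidates found this way would still need a manual review before being added, since frequent typos and boilerplate can pass a simple count threshold.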

Open-Source Financial Vocabularies

The community has published financial vocabularies (FinBERT vocab, FinVocab). Consider using an existing vocabulary before building one from scratch, and evaluate its quality on your specific financial NLP task.
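
A quick first check on a candidate vocabulary is its coverage of your own corpus: the share of token occurrences it contains. The sketch below uses whole-word lookup as a rough proxy, and both candidate vocabularies are toy assumptions. A subword vocabulary should be evaluated with its own tokenizer, and coverage does not replace measuring task performance.

```python
def vocabulary_coverage(tokens, vocab):
    """Fraction of token occurrences that appear in `vocab`."""
    if not tokens:
        return 0.0
    return sum(tok in vocab for tok in tokens) / len(tokens)

# Toy corpus sample and two hypothetical candidate vocabularies.
sample_tokens = "ebitda margin rose 50 basis points".split()
candidate_a = {"margin", "rose", "points"}
candidate_b = candidate_a | {"ebitda", "basis"}

for name, vocab in (("candidate_a", candidate_a), ("candidate_b", candidate_b)):
    print(f"{name}: {vocabulary_coverage(sample_tokens, vocab):.0%} coverage")
```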