Introduction

Alternative data sources use different identifiers and formats. One dataset identifies companies by ticker; another uses LEI (Legal Entity Identifier); a third uses company names with slight spelling variations. Combining these heterogeneous sources requires entity resolution—the process of determining which records refer to the same real-world entity. This article explores techniques for reliable entity resolution at scale.

The Entity Resolution Challenge in Finance

Financial entities are complex. A single company might be referred to as "Apple Inc.", "Apple Inc (AAPL)", "Apple Computer Inc", "APPLE COMPUTER CORPORATION", or simply "AAPL". Subsidiaries might operate under different names. After acquisitions, entity names change. Matching entities across data sources requires handling these variations programmatically.

Common Entity Types in Finance

  • Companies (different names, subsidiaries, legal entity variants)
  • Individuals (executives, insiders, different name formats)
  • Accounts (bank accounts, credit lines, with multiple identifiers)
  • Transactions (matches across exchanges, settlement systems)
  • Locations (addresses with spelling variations, abbreviations)

Exact Matching (The Easy Case)

If datasets use consistent identifiers (all use ticker symbols, for example), matching is trivial: join on ticker. In practice, this rarely happens; identifiers are typically inconsistent across sources or missing entirely.

String Similarity Approaches

Exact and Fuzzy Matching

When identifiers aren't consistent, compare strings directly. Simple approaches: exact string matching (only matches if identical), case-insensitive matching (handle capitalization), whitespace normalization (handle extra spaces).

For more flexibility, use edit distance metrics. Levenshtein distance measures the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another. "Apple Inc." and "Apple Inc" have distance 1 (one extra period). "Appl Inc." and "Apple Inc." also have distance 1 (one missing letter). Set a threshold: declare a match if the distance falls below it.
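The metric can be sketched in a few lines of standard dynamic programming, with no external dependencies:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as b (smaller DP row)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Apple Inc.", "Apple Inc"))   # 1
print(levenshtein("Appl Inc.", "Apple Inc."))   # 1
```

A thresholded match is then simply `levenshtein(x, y) <= max_edits`; libraries such as RapidFuzz provide optimized versions of the same computation.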

Token-Based Matching

Split names into tokens (words) and compare token sets. "Apple Inc." becomes ["Apple", "Inc"]; "Inc Apple" becomes ["Inc", "Apple"]. These have identical token sets, so the comparison is order-independent. Jaccard similarity measures overlap: matching_tokens / total_unique_tokens. Handle common tokens (Inc, Corp, Ltd) separately, since they appear in nearly every company name and inflate similarity scores.
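A minimal sketch of token-set Jaccard similarity, with a small illustrative stop-token list (a production list would be larger and curated):

```python
STOP_TOKENS = {"inc", "corp", "ltd", "llc", "co"}  # near-universal suffixes

def jaccard(name_a: str, name_b: str, drop_stop: bool = True) -> float:
    """Jaccard similarity of the two names' token sets:
    |intersection| / |union|, after optional stop-token removal."""
    ta = {t.strip(".,").lower() for t in name_a.split()}
    tb = {t.strip(".,").lower() for t in name_b.split()}
    if drop_stop:
        ta -= STOP_TOKENS
        tb -= STOP_TOKENS
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

print(jaccard("Apple Inc.", "Inc Apple"))  # 1.0 — order-independent
```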

Phonetic and Normalization Approaches

Names that sound similar but are spelled differently can be matched using phonetic encoding. Soundex and Metaphone convert names to phonetic codes, allowing "Jon" and "John" to be recognized as similar.
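A self-contained sketch of the classic Soundex encoding (Metaphone follows the same idea with richer rules):

```python
def soundex(name: str) -> str:
    """Classic Soundex: first letter plus up to three digits encoding
    consonant sounds; vowels are dropped, adjacent duplicates collapsed."""
    if not name:
        return ""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":
            continue  # h/w do not separate identical codes
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]

print(soundex("Jon"), soundex("John"))  # J500 J500 — recognized as similar
```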

Normalization handles known variations: expand abbreviations (Inc → Incorporated, Corp → Corporation), remove punctuation, standardize formatting. A well-designed normalizer converts "APPLE INC." and "apple inc" and "Apple Incorporated" to a canonical form enabling exact matching.
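A minimal normalizer along these lines might look as follows; the expansion table here is illustrative, and real systems maintain much larger curated mappings:

```python
import re

# Illustrative abbreviation table — production mappings are far larger.
EXPANSIONS = {"inc": "incorporated", "corp": "corporation",
              "ltd": "limited", "co": "company"}

def normalize(name: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace, and expand
    common abbreviations so variants share one canonical form."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    tokens = [EXPANSIONS.get(t, t) for t in name.split()]
    return " ".join(tokens)

print(normalize("APPLE INC."))          # apple incorporated
print(normalize("Apple Incorporated"))  # apple incorporated
```

After normalization, the three variants from the paragraph above collapse to the same string, so a plain equality join works.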

Machine Learning Approaches

Supervised Matching

Train classifiers to predict whether two entity records refer to the same entity. Features: string similarity metrics (edit distance, Jaccard), domain-specific features (matching ticker, matching LEI), contextual features (found in same geographic region, same industry).

Label a training set of record pairs as "same entity" or "different entity" (typically 1,000-10,000 examples). Train a gradient boosting model (e.g., XGBoost) or a neural network to predict the probability of a match. Deploy at scale: for each record in dataset A, find the best matches in dataset B using model predictions.
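The feature-engineering step can be sketched as a function that turns a candidate record pair into a numeric vector; the field names (name, ticker, country) are hypothetical, and the resulting vectors plus labels would be fed to any standard classifier:

```python
def pair_features(rec_a: dict, rec_b: dict) -> list:
    """Illustrative feature vector for one candidate record pair.
    NOTE: missing fields compare as equal via .get(); production code
    should treat nulls explicitly."""
    na, nb = rec_a["name"].lower(), rec_b["name"].lower()
    ta, tb = set(na.split()), set(nb.split())
    return [
        float(na == nb),                                    # exact name match
        len(ta & tb) / len(ta | tb) if ta | tb else 0.0,    # token Jaccard
        float(rec_a.get("ticker") == rec_b.get("ticker")),  # identifier agreement
        float(rec_a.get("country") == rec_b.get("country")),  # contextual feature
    ]

a = {"name": "Apple Inc", "ticker": "AAPL", "country": "US"}
b = {"name": "Apple Inc", "ticker": "AAPL", "country": "US"}
print(pair_features(a, b))  # [1.0, 1.0, 1.0, 1.0]
```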

Embedding-Based Matching

Convert entity names to continuous vector embeddings using transformer models (BERT fine-tuned on financial entity names). Compute similarity in embedding space using cosine distance. This handles semantic similarity: "Johnson & Johnson" and "J&J" have different strings but similar embeddings.

Advantage: embeddings can be precomputed once per entity and compared with a fast nearest-neighbor lookup, rather than running a classifier on every candidate pair. Disadvantage: requires training on representative entity data.
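The comparison step reduces to cosine similarity between vectors. A dependency-free sketch, using small toy vectors standing in for real transformer output:

```python
import math

def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical
    direction, 0.0 means orthogonal (unrelated)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-dimensional vectors — real embeddings from a fine-tuned model
# would have hundreds of dimensions.
emb_jnj   = [0.82, 0.31, -0.12, 0.45]   # "Johnson & Johnson"
emb_jandj = [0.79, 0.35, -0.10, 0.41]   # "J&J"
print(cosine_similarity(emb_jnj, emb_jandj))  # close to 1.0 → likely same entity
```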

Blocking and Scalability

Comparing every record in dataset A (e.g., 10 million) against every record in dataset B (e.g., 20 million) creates 200 trillion comparisons—computationally infeasible. Blocking strategies reduce this by pre-filtering candidate matches.

Common Blocking Approaches

  • Sort-merge blocking: sort both datasets by key (first letter of name), only compare within blocks
  • Hash blocking: hash entities and only compare within same hash buckets
  • Inverted index: build full-text index and retrieve top-K candidates for each entity
  • Locality-sensitive hashing (LSH): hash high-dimensional embeddings to find similar vectors efficiently
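The first approach can be sketched simply: bucket both datasets by the first letter of the name and enumerate candidate pairs only within matching buckets (a coarse key chosen purely for illustration; real systems use more discriminative keys, often several in parallel):

```python
from collections import defaultdict
from itertools import product

def block_by_first_letter(records):
    """Group records by the first letter of the lowercased name."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["name"].strip().lower()[:1]].append(rec)
    return blocks

def candidate_pairs(dataset_a, dataset_b):
    """Yield only pairs whose blocking keys agree."""
    ba, bb = block_by_first_letter(dataset_a), block_by_first_letter(dataset_b)
    for key in ba.keys() & bb.keys():
        yield from product(ba[key], bb[key])

a = [{"name": "Apple"}, {"name": "Amazon"}, {"name": "Boeing"}]
b = [{"name": "Alphabet"}, {"name": "BP"}]
print(len(list(candidate_pairs(a, b))))  # 3, versus 6 for the full cross product
```

The trade-off is recall: any true match whose records land in different blocks is lost, which is why multiple complementary blocking keys are common.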

Handling Ambiguity and One-to-Many Relationships

Some matches are ambiguous: "XYZ Corp" might match multiple entities in the database. Some entities have legitimate duplicates (subsidiary names). Effective systems model these as probabilities: entity A matches entity B with 85% confidence and entity C with 40% confidence.

Use confidence scores rather than binary matches. Set policies: only use matches above 95% confidence for trading decisions, flag 70-95% confidence matches for manual review, discard below 70%.
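Such a policy is easy to encode as a routing function; the thresholds below mirror the ones above and should be tuned per use case:

```python
def route_match(confidence: float) -> str:
    """Route a candidate match by confidence score.
    Thresholds are illustrative and use-case dependent."""
    if confidence >= 0.95:
        return "accept"         # safe for automated/trading use
    if confidence >= 0.70:
        return "manual_review"  # plausible but needs a human look
    return "discard"

print(route_match(0.97))  # accept
print(route_match(0.80))  # manual_review
print(route_match(0.50))  # discard
```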

Validation and Quality Assurance

Entity resolution errors propagate through analysis. Incorrectly matching two companies causes spurious correlations. Validation approaches:

  • Manual spot-checking: review sample of matches for correctness
  • Consistency checking: if A matches B and B matches C, does A match C? Transitive inconsistencies flag problems
  • Validation against known relationships: use published company relationships (subsidiaries, merged entities) to validate matches
  • Cross-source validation: if multiple independent matching approaches agree, confidence is high

Tools and Platforms

Building entity resolution from scratch is non-trivial. Open-source libraries include fuzzywuzzy (simple string matching), dedupe (probabilistic matching), and spaCy (NLP-based). Commercial platforms such as Tamr, Trifacta, and Reltio offer enterprise entity resolution with pre-built financial domain knowledge.

Conclusion

Merging heterogeneous alternative data sources is impossible without robust entity resolution. Simple exact matching fails; fuzzy matching handles basic variations but requires tuning. Machine learning approaches generalize better but require training data and ongoing maintenance. Successful quant shops implement entity resolution as infrastructure: investing in high-quality matching to enable seamless data fusion across alternative sources.