OCR on Scanned Balance Sheets: Automating Legacy PDF Extraction

Category: Computer Vision in Finance • Article #7 • Reading time: 5 minutes

Introduction

Many historical financial documents exist only as scanned PDFs: annual reports from the 1990s, bankruptcy filings, private company financial statements. Optical Character Recognition (OCR) converts scanned document images to machine-readable text, enabling automated extraction of financial data. For quants analyzing historical financial information or working with private company data, OCR unlocks datasets previously locked in images. This guide covers OCR techniques, challenges, and how to extract structured financial data from unstructured document images.

OCR Technology Overview

Modern OCR engines use deep learning: neural networks such as CNNs identify characters in the image, and language models correct obvious errors. Both open-source engines (Tesseract, EasyOCR) and commercial services (Google Cloud Vision, AWS Textract) are available. Commercial services typically outperform open-source engines on low-quality scans.

Accuracy depends on image quality: clean scans at 300 DPI or higher typically achieve 95%+ accuracy, while low-quality faxes may reach only 70-80%. Financial documents are often good quality (official documents, properly scanned), so OCR usually works well on them.

Challenges with Financial Documents

Tables and Formatting: Balance sheets are heavily formatted with tables, columns, alignment. Standard OCR treats tables as text sequences, losing structure. Table-aware OCR or post-processing to reconstruct table structure is necessary.

Specialized Characters: Accounting documents use special notation: negative numbers in parentheses, currency symbols, and decimals. OCR misreads some symbols ($ becomes S, € becomes E), so post-processing must handle these cases.
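
A minimal sketch of this kind of post-processing, assuming each amount arrives as a separate text token. The function name parse_amount and the exact set of characters it strips are illustrative choices, not part of any OCR library.

```python
import re
from decimal import Decimal
from typing import Optional


def parse_amount(token: str) -> Optional[Decimal]:
    """Convert an OCR'd accounting amount to a signed Decimal, or None."""
    text = token.strip()
    # Parentheses denote a negative number in accounting notation.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1].strip()
    # Drop a leading currency symbol, including the misreads "S" (for $)
    # and "E" (for €), but only when a digit follows.
    text = re.sub(r"^[$S€E]\s*(?=\d)", "", text)
    # Remove thousands separators.
    text = text.replace(",", "")
    if not re.fullmatch(r"\d+(\.\d+)?", text):
        return None  # not a recognizable amount
    value = Decimal(text)
    return -value if negative else value
```

Returning None for anything that does not look like a number keeps labels such as "Total" from being silently converted, so the caller can decide what to do with unparsed tokens.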

Multiple Columns: 10-K filings have multiple columns (current year, prior year). OCR must reconstruct column ordering; simple left-to-right text reading produces nonsense ("2024: $100M, 2023: $80M" becomes "$100M 2024 $80M 2023" if column order is misread).
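
One way to recover reading order is to use word positions rather than the raw text stream. The sketch below assumes the OCR engine reports a position for every word. The Word type, its fields, and the y_tolerance default are hypothetical and would need tuning for real scans.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge of the word's bounding box
    y: float  # vertical center of the word's bounding box


def words_to_rows(words: List[Word], y_tolerance: float = 5.0) -> List[List[str]]:
    """Group words into rows by vertical position, then order each row left to right."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w.y):
        # A word joins the current row if it is vertically close to the
        # first word of that row; otherwise it starts a new row.
        if rows and abs(word.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w.text for w in sorted(row, key=lambda w: w.x)] for row in rows]
```

With the header row ("2024", "2023") and the value row kept separate and each sorted by x position, the year for each amount can be read off by column index instead of by text order.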

Building an OCR Pipeline for Financial Data

Step 1: Image preprocessing. Correct skew (tilted scans), increase contrast, enhance text visibility. Libraries like OpenCV help.
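
To illustrate the contrast part of Step 1, here is a dependency-free sketch of linear contrast stretching on a grayscale image stored as nested lists. In practice an image library such as OpenCV would do this far more efficiently. The Image alias and the function name are assumptions made for this example, and skew correction is not covered.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels, 0 (black) to 255 (white)


def stretch_contrast(image: Image) -> Image:
    """Linearly rescale pixels so the darkest maps to 0 and the brightest to 255."""
    flat = [pixel for row in image for pixel in row]
    if not flat:
        return []
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A uniform image has no contrast to stretch; return a copy.
        return [row[:] for row in image]
    # Integer arithmetic keeps results exact and within 0..255.
    return [[(pixel - lo) * 255 // (hi - lo) for pixel in row] for row in image]
```

A faded scan whose pixels span only 100 to 200 is spread across the full 0 to 255 range, which makes dark text stand out more clearly from the page background.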

Step 2: Run OCR. Use a cloud service such as Google Cloud Vision or a local engine such as Tesseract. The output is raw text.

Step 3: Post-processing. Correct common OCR errors (0 vs O, 1 vs l). Reconstruct table structure using heuristics (columns detected by large horizontal gaps in text).
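
The two heuristics in Step 3 can be sketched as follows, assuming the OCR output preserves horizontal spacing as runs of spaces. The function names and the min_gap default are illustrative, and the digit repair is deliberately limited to cells that are mostly numeric so that labels are left alone.

```python
import re
from typing import List

# Letters that OCR commonly confuses with digits (assumed mapping).
_DIGIT_FIXES = str.maketrans("OoIl", "0011")


def split_columns(line: str, min_gap: int = 2) -> List[str]:
    """Split one OCR text line into cells wherever at least min_gap spaces occur."""
    pattern = r"\s{%d,}" % min_gap
    return [cell for cell in re.split(pattern, line.strip()) if cell]


def fix_numeric_cell(cell: str) -> str:
    """Repair letter/digit confusions, but only in cells that are mostly digits."""
    digits = sum(ch.isdigit() for ch in cell)
    letters = sum(ch.isalpha() for ch in cell)
    if digits > letters:
        return cell.translate(_DIGIT_FIXES)
    return cell
```

For example, splitting "Total current assets     1,2O4,567      98l,234" yields three cells, and the two numeric cells are repaired to "1,204,567" and "981,234" while the label is unchanged.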

Step 4: Extract financial data. Identify balance sheet line items (Assets, Liabilities, Equity), extract numerical values. Regular expressions and pattern matching help identify account names and amounts.
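
A rough sketch of Step 4 using regular expressions: every amount on a line is collected, and whatever precedes the first amount is treated as the account name. The function name and pattern are assumptions for illustration. Labels that themselves contain digits (for example, a note number) would need extra handling.

```python
import re
from typing import List, Optional, Tuple

# An amount: digits with optional commas, optionally wrapped in parentheses.
AMOUNT = re.compile(r"\(?\d[\d,]*\)?")


def extract_line_item(line: str) -> Optional[Tuple[str, List[int]]]:
    """Return (label, amounts) for a balance sheet line, or None if no amount is found."""
    matches = list(AMOUNT.finditer(line))
    if not matches:
        return None
    # The label is everything before the first amount, minus leader dots,
    # colons, currency symbols, and spaces.
    label = line[: matches[0].start()].strip(" .:$")
    if not label:
        return None
    amounts = []
    for match in matches:
        token = match.group()
        value = int(token.strip("()").replace(",", ""))
        # Parentheses denote a negative amount.
        negative = token.startswith("(") and token.endswith(")")
        amounts.append(-value if negative else value)
    return label, amounts
```

Lines with two amounts, such as current-year and prior-year columns, return both values in left-to-right order, so the caller can pair them with the column headers.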

Step 5: Validate. Compare extracted data to known values (if you have the correct statements), or manually review sample documents for accuracy.

Handling Low-Quality Scans

Some historical documents are low-quality (faxes, photocopies). Options: 1) Use commercial OCR with higher accuracy, 2) Invest in better document scanning/restoration, 3) Accept lower accuracy and manually verify important values, 4) Exclude lowest-quality documents.

Example: Extracting 10-K Balance Sheet

Workflow: scan a 1995 10-K filing, run OCR to extract the text, locate the balance sheet (usually titled "Consolidated Balance Sheet"), identify line items (total assets, current liabilities, shareholders' equity), and extract the amounts. Parse dates from the filing to identify the fiscal year end. Link the figures back to the company name and CIK from the filing cover page. The result is structured financial data (company, date, metric, amount) extracted from an unstructured scanned document.
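
As one piece of this workflow, the fiscal year end can be located with a pattern match. The sketch assumes the filing text contains a phrase of the form "year ended December 31, 1995". The function name is hypothetical, and OCR errors inside the date itself are not handled.

```python
import re
from datetime import date
from typing import Optional

MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]

# Matches e.g. "year ended December 31, 1995", ignoring case.
_YEAR_END = re.compile(
    r"year\s+ended\s+(" + "|".join(MONTHS) + r")\s+(\d{1,2}),?\s+(\d{4})",
    re.IGNORECASE,
)


def find_fiscal_year_end(text: str) -> Optional[date]:
    """Return the first 'year ended <date>' found in the text, or None."""
    match = _YEAR_END.search(text)
    if match is None:
        return None
    month_name, day, year = match.groups()
    month = MONTHS.index(month_name.capitalize()) + 1
    try:
        return date(int(year), month, int(day))
    except ValueError:
        # Impossible dates (e.g. February 31) suggest an OCR error.
        return None
```

Returning None rather than guessing lets a document without a clean date be routed to manual review.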

Structured Data Extraction: From Text to Numbers

Raw OCR output is text; trading systems need structured data (numbers keyed by metric name). Custom parsers identify line items and extract amounts. For example, "Total Current Assets: 1,234,567" becomes the structured record {metric: "TCA", amount: 1234567, unit: "USD"}.
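
A small sketch of such a parser, assuming the label and amount have already been separated. The METRIC_CODES table and its codes other than "TCA" are hypothetical. Fuzzy matching with the standard library's difflib is used here as one simple way to tolerate single-character OCR errors in labels, and the 0.9 cutoff is an assumption.

```python
from dataclasses import dataclass
from difflib import get_close_matches
from typing import Optional

# Hypothetical mapping from canonical labels to short metric codes.
METRIC_CODES = {
    "total current assets": "TCA",
    "total assets": "TA",
    "total current liabilities": "TCL",
    "total liabilities": "TL",
    "total shareholders' equity": "TSE",
}


@dataclass(frozen=True)
class Record:
    metric: str
    amount: int
    unit: str = "USD"


def to_record(label: str, amount_text: str) -> Optional[Record]:
    """Map a (possibly noisy) label and amount text to a Record, or None if unknown."""
    key = " ".join(label.lower().split())  # normalize case and whitespace
    match = get_close_matches(key, list(METRIC_CODES), n=1, cutoff=0.9)
    if not match:
        return None
    return Record(METRIC_CODES[match[0]], int(amount_text.replace(",", "")))
```

The high cutoff is deliberate: "total assets" and "total current assets" are similar strings, and a loose threshold could map one onto the other.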

Machine learning approaches: train a model to predict which financial metric a line represents from the words surrounding it. Models trained on correctly formatted 10-Ks can then help parse OCR'd historical documents.

Validation and Quality Assurance

Validating extracted financial data is critical. Check whether extracted assets equal extracted liabilities plus equity (the accounting identity should hold), whether the figures are reasonable compared with industry peers, and whether trends are sensible (an apparent sudden bankruptcy or wildly swinging revenue may indicate an extraction error rather than a real event). Manual spot-checks of sample documents help detect systematic extraction errors.
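
The accounting identity check is simple to automate. The sketch below assumes the three totals have already been extracted as numbers. The function name and the 0.1% relative tolerance (to allow for rounding in the source statements) are illustrative choices.

```python
import math
from typing import Tuple


def check_accounting_identity(
    assets: float, liabilities: float, equity: float, rel_tol: float = 0.001
) -> Tuple[bool, float]:
    """Check assets == liabilities + equity within a relative tolerance.

    Returns (passes, difference) where difference = assets - (liabilities + equity).
    """
    expected = liabilities + equity
    passes = math.isclose(assets, expected, rel_tol=rel_tol)
    return passes, assets - expected
```

A failed check does not say which of the three numbers is wrong, only that the document needs review. The returned difference can hint at the cause, for instance a single misread digit.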

Legal and Compliance Considerations

Using historical documents for analysis is generally legal. Ensure that extracted data is used for legitimate purposes (research, historical analysis) and not in ways that violate confidentiality obligations or private company agreements. If the data is proprietary, confirm usage rights before relying on it.

Conclusion

OCR enables extracting financial data from historical scanned documents, unlocking datasets previously inaccessible. Modern OCR achieves high accuracy on good-quality documents. Challenges include table structure preservation and specialized financial notation. Post-processing and validation are essential to ensure extraction accuracy. For quantitative researchers working with historical financial data, OCR can significantly expand dataset availability, enabling analysis of long-term trends or private company financials not available in structured form.