Creating Synthetic ESG Disclosures for Model Training

Category: Generative AI & Synthetic Content • Article #8 • Reading time: 5 minutes

Introduction

Environmental, Social, and Governance (ESG) data is increasingly important for investing. However, historical ESG disclosures are limited: before 2015, few companies reported standardized ESG metrics. Machine learning models trained on such sparse history struggle to generalize. Generative models can create synthetic historical ESG disclosures that augment training datasets, provided the synthetic records are never presented as real company data.

The ESG Data Challenge

Historical Sparsity

Before 2015, ESG disclosures were voluntary, non-standard, and rare. Before 2020, many companies still provided minimal data. Today, ESG disclosures are abundant, but the historical record remains incomplete. Models trained on roughly 10 years of ESG data generalize poorly.

Use Cases for Synthetic ESG

Train ESG scoring models: given limited historical ESG and financial data, train models to predict future ESG issues and their financial impact. Synthetic historical ESG augments training data, improving model robustness.

Generating Synthetic ESG Disclosures

Conditioned Generation

Train a generative model of the conditional distribution P(ESG_metrics | company_characteristics, industry, year). Given company size, sector, location, and year, the model generates plausible ESG metrics. Build in constraints: carbon emissions scale with revenue; female board representation varies by country and era; governance maturity increases over time.
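As a rough illustration of conditioned generation, the sketch below replaces a trained generative model with a hand-written sampler that encodes the three constraints directly. Every name and coefficient in it (SECTOR_INTENSITY, COUNTRY_BOARD_BASE, the 2% intensity decline, the logistic governance curve) is an assumption made for the example, not a value estimated from real disclosures.

```python
import math
import random
from dataclasses import dataclass

# All coefficients below are illustrative assumptions, not estimates from real data.
SECTOR_INTENSITY = {"energy": 500.0, "industrials": 150.0, "tech": 20.0}  # tCO2e per $M revenue in BASE_YEAR
COUNTRY_BOARD_BASE = {"NO": 0.35, "US": 0.18, "JP": 0.05}  # female board share in BASE_YEAR
BASE_YEAR = 2015
INTENSITY_DECLINE = 0.02  # assumed annual decline in emissions intensity


@dataclass
class Company:
    revenue_musd: float
    sector: str
    country: str


def sample_esg(company, year, rng):
    """Draw one synthetic ESG record conditioned on company traits and year."""
    years_from_base = year - BASE_YEAR
    # Constraint 1: emissions scale with revenue; intensity is higher in earlier years.
    intensity = SECTOR_INTENSITY[company.sector] * (1 - INTENSITY_DECLINE) ** years_from_base
    emissions = company.revenue_musd * intensity * rng.lognormvariate(0.0, 0.3)
    # Constraint 2: female board share varies by country and rises over time.
    board = COUNTRY_BOARD_BASE[company.country] + 0.01 * years_from_base + rng.gauss(0.0, 0.03)
    board = min(1.0, max(0.0, board))
    # Constraint 3: governance maturity increases over time (logistic curve, 0-100 scale).
    governance = 100.0 / (1.0 + math.exp(-0.1 * (year - 2005))) + rng.gauss(0.0, 5.0)
    governance = min(100.0, max(0.0, governance))
    return {
        "year": year,
        "emissions_tco2e": emissions,
        "female_board_share": board,
        "governance_score": governance,
    }


record = sample_esg(Company(revenue_musd=1000.0, sector="energy", country="US"), 2000, random.Random(0))
```

A learned model would estimate these relationships from data instead of hard-coding them, but the interface is the same: company traits and a year go in, one plausible ESG record comes out.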

Realism Through Historical Patterns

Analyze historical ESG data. Identify patterns: how emissions per dollar of revenue evolved by industry; how governance scores improved over time. Embed these patterns into the generative model. Synthetic ESG should follow historical trends, not defy them.
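One simple way to extract such a pattern is a log-linear trend fit per industry, which can then be extrapolated backward into years with no data. The sketch below assumes a made-up emissions-intensity series; the function names fit_log_linear_trend and extrapolate are hypothetical helpers, and a constant percentage change per year is itself a modeling assumption that may not hold over long horizons.

```python
import math


def fit_log_linear_trend(years, values):
    """Least-squares fit of log(value) = a + b * year; returns (a, b)."""
    logs = [math.log(v) for v in values]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def extrapolate(a, b, year):
    """Evaluate the fitted trend at any year, including years before the data starts."""
    return math.exp(a + b * year)


# Toy, made-up intensity series (tCO2e per $M revenue) for one industry.
years = [2016, 2017, 2018, 2019, 2020]
intensity = [100.0 * 0.97 ** (y - 2016) for y in years]

a, b = fit_log_linear_trend(years, intensity)
annual_change = math.exp(b) - 1          # fitted percentage change per year
backcast_2005 = extrapolate(a, b, 2005)  # trend-implied intensity for a year with no data
```

The fitted slope can serve as the drift term of the generator, so that synthetic values for early years sit on the same trajectory as the observed ones.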

Quality Control

Validation Against Benchmarks

Generate synthetic ESG for historical years where real data exists (e.g., 2020). Compare synthetic and real data on distributions, correlations with financials, and trends. A close match suggests the generator captures real patterns; a poor match means it should not be trusted for years without data.
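The comparison can be made concrete with two checks: a two-sample Kolmogorov-Smirnov statistic for the distributions, and the gap between real and synthetic correlations with a financial variable. The sketch below is one possible version using only the standard library; validation_report and its argument names are assumptions for illustration, and acceptable thresholds would have to be chosen per project. (SciPy's scipy.stats.ks_2samp computes the same statistic along with a p-value.)

```python
import bisect
import math


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def validation_report(real_esg, real_fin, synth_esg, synth_fin):
    """Compare one ESG metric and its link to one financial variable, real vs. synthetic."""
    return {
        "ks_statistic": ks_statistic(real_esg, synth_esg),
        "correlation_gap": abs(pearson(real_esg, real_fin) - pearson(synth_esg, synth_fin)),
    }


# Made-up numbers purely to show the call shape.
report = validation_report(
    real_esg=[52.0, 61.0, 47.0, 70.0, 58.0],
    real_fin=[1.2, 1.9, 0.8, 2.4, 1.5],
    synth_esg=[50.0, 63.0, 45.0, 68.0, 60.0],
    synth_fin=[1.1, 2.0, 0.9, 2.2, 1.6],
)
```

Both numbers are smaller when the synthetic data tracks the real data more closely; trend checks would repeat the same comparison year by year.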

Sector-Specific Logic

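Different sectors have distinct ESG characteristics. For example, a generator might encode oil companies as high-carbon with lower governance scores, and tech companies as high-governance with moderate labor-practice scores. Whatever the specific profiles, synthetic data must reflect the sector patterns observed in real disclosures.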

Training ESG Models

Data Augmentation

Original ESG dataset: 2000 company-years. Synthetic augmentation: add 3000 synthetic company-years from periods with sparse data. Train the ESG scoring model on the augmented dataset. The aim is to reduce overfitting and improve generalization to new companies.
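A minimal sketch of the augmentation step, using the 2000 and 3000 row counts above, might look like the following. The is_synthetic flag and the choice to hold out only real rows for testing are assumptions of this example: tagging keeps provenance auditable, and testing on real rows avoids scoring the model on data the generator invented.

```python
import random


def augment(real_rows, synthetic_rows):
    """Combine real and synthetic rows, tagging each with its provenance."""
    tagged_real = [dict(row, is_synthetic=False) for row in real_rows]
    tagged_synth = [dict(row, is_synthetic=True) for row in synthetic_rows]
    return tagged_real + tagged_synth


def split_with_real_test_set(rows, test_fraction, rng):
    """Hold out a share of the real rows for testing; synthetic rows are train-only."""
    real = [row for row in rows if not row["is_synthetic"]]
    synthetic = [row for row in rows if row["is_synthetic"]]
    rng.shuffle(real)
    n_test = round(len(real) * test_fraction)
    return real[n_test:] + synthetic, real[:n_test]


# Placeholder rows; a real pipeline would carry ESG and financial features here.
real_rows = [{"company_year": i} for i in range(2000)]
synthetic_rows = [{"company_year": 2000 + i} for i in range(3000)]

dataset = augment(real_rows, synthetic_rows)
train, test = split_with_real_test_set(dataset, test_fraction=0.2, rng=random.Random(42))
```

With these counts the training set holds 1600 real and 3000 synthetic company-years, and the test set holds 400 real company-years.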

Example: ESG Scores Predicting Future Returns

Goal: build a model that predicts whether high ESG scores are associated with superior future returns. Historical data: 15 years, 2000 companies = 30K observations (sparse). Synthetic data: augment to 60K observations. Train a neural network to predict future returns from current ESG metrics.

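Results: in this illustrative scenario, the model trained on augmented data has lower test error and better out-of-sample performance than one trained on the original data alone. Both should be measured on held-out real observations, since error on synthetic rows says little about real-world accuracy.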

Ethical Considerations

Transparency and Disclosure

Teams that train models on synthetic data must disclose this to investors and stakeholders, for example: "This model was trained on synthetic ESG data for periods 1990-2015." Transparency prevents misrepresentation.

Avoiding Fabrication

Synthetic ESG is an augmentation tool, not a replacement for real disclosures. Never publish synthetic ESG as if it were real company data. Use it only for model training and internal research.

Advanced Techniques

Multi-Stage Generation

Generate ESG metrics in stages: first environmental (influenced by industry), then social (influenced by size and location), finally governance (influenced by country and era). Sequential generation maintains logical consistency.
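The staged approach can be sketched as a chain of sampling functions, each of which sees the outputs of the stages before it. In the example below, all baselines, coefficients, and function names are placeholders, and passing earlier scores into later stages is one assumed way to obtain the consistency described above.

```python
import math
import random


def clip(value, low=0.0, high=100.0):
    """Keep a score inside its valid range."""
    return max(low, min(high, value))


# Every baseline and coefficient here is a made-up placeholder.
ENV_BASE = {"energy": 30.0, "industrials": 50.0, "tech": 65.0}
LOCATION_SOCIAL_SHIFT = {"EU": 5.0, "US": 0.0, "APAC": -2.0}
COUNTRY_GOV_BASE = {"UK": 60.0, "US": 55.0, "BR": 45.0}


def stage_environmental(industry, rng):
    """Stage 1: environmental score driven by industry."""
    return clip(ENV_BASE[industry] + rng.gauss(0.0, 8.0))


def stage_social(employees, location, env_score, rng):
    """Stage 2: social score driven by size and location, nudged by stage 1."""
    size_effect = 3.0 * math.log10(max(employees, 1))
    carry = 0.1 * (env_score - 50.0)
    return clip(40.0 + size_effect + LOCATION_SOCIAL_SHIFT[location] + carry + rng.gauss(0.0, 6.0))


def stage_governance(country, year, env_score, social_score, rng):
    """Stage 3: governance score driven by country and era, nudged by stages 1 and 2."""
    era_effect = 0.5 * (year - 2000)
    carry = 0.1 * ((env_score + social_score) / 2.0 - 50.0)
    return clip(COUNTRY_GOV_BASE[country] + era_effect + carry + rng.gauss(0.0, 5.0))


def generate_record(industry, employees, location, country, year, rng):
    """Run the three stages in order and return one synthetic ESG record."""
    env = stage_environmental(industry, rng)
    soc = stage_social(employees, location, env, rng)
    gov = stage_governance(country, year, env, soc, rng)
    return {"environmental": env, "social": soc, "governance": gov}


record = generate_record("tech", 5000, "US", "US", 2010, random.Random(7))
```

Because each stage conditions on what came before, a record with a very low environmental score is unlikely to pair with implausibly high social and governance scores.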

Temporal Coherence

Synthetic ESG sequences should show plausible evolution over time. A company with poor governance in 1990 might improve by 2010. Generate time-series with smooth transitions reflecting reasonable corporate evolution.
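One way to get smooth, plausible evolution is a mean-reverting walk toward a target score with a cap on year-over-year change. The sketch below follows the 1990 to 2010 example; the function name generate_governance_path and the parameters pull, noise_sd, and max_step are assumptions chosen for illustration.

```python
import random


def generate_governance_path(start_score, end_target, start_year, end_year, rng,
                             pull=0.15, noise_sd=2.0, max_step=5.0):
    """Generate yearly scores that drift toward end_target without abrupt jumps."""
    scores = [start_score]
    for _ in range(start_year + 1, end_year + 1):
        prev = scores[-1]
        # Move a fraction of the way toward the target, plus a small random shock.
        step = pull * (end_target - prev) + rng.gauss(0.0, noise_sd)
        # Cap the yearly change so the series stays smooth.
        step = max(-max_step, min(max_step, step))
        scores.append(max(0.0, min(100.0, prev + step)))
    return list(zip(range(start_year, end_year + 1), scores))


# A company with poor governance in 1990 that improves toward a score of 70 by 2010.
path = generate_governance_path(25.0, 70.0, 1990, 2010, random.Random(3))
```

The max_step cap is what enforces smoothness: no single year can move the score by more than that amount, regardless of the random shock.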

Regulatory and Compliance

U.S. securities law prohibits fraudulent statements in company disclosures, and ESG disclosures are no exception. Synthetic data used for internal analysis is acceptable; synthetic data falsely attributed to real companies can constitute fraud. Maintain strict boundaries between the two.

Conclusion

Synthetic ESG data addresses the historical data scarcity problem, enabling robust model training. With proper validation, transparency, and ethical guardrails, generative models augment ESG datasets responsibly. The key is using synthetic data as a training aid, not a substitute for real disclosures. For institutional investors developing ESG-driven strategies, synthetic augmentation of historical data is a practical path to better models.