Introduction

Regulators increasingly demand detailed audit trails of trading models and data. Data lakes, centralized repositories of raw and processed data, support compliance by making it possible to trace data through every pipeline stage. Versioning both data and code makes decisions auditable: regulators can verify exactly which data and model versions were used for each decision.

Data Lake Architecture

Organize the data lake in layers: (1) Raw (source data exactly as received, immutable); (2) Bronze (copies of the raw data, preserved as ingested); (3) Silver (cleaned, standardized); (4) Gold (analysis-ready). Version each layer and timestamp every transformation. Maintain data lineage: record which gold-layer tables depend on which silver-layer tables, and so on back to the raw sources.
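The lineage idea can be illustrated with a minimal Python sketch. It assumes each dataset version records its upstream parents; the class `DatasetVersion`, the function `trace_to_raw`, and the table names are hypothetical and chosen only for illustration, not taken from any particular data lake product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetVersion:
    """One immutable, versioned table in a given layer (hypothetical model)."""
    layer: str            # "raw", "bronze", "silver", or "gold"
    name: str
    version: str
    created_at: str       # ISO 8601 timestamp of the transformation
    parents: tuple = ()   # upstream DatasetVersion objects this one was built from


def trace_to_raw(node):
    """Return the set of raw-layer versions that `node` ultimately depends on."""
    if node.layer == "raw":
        return {node}
    sources = set()
    for parent in node.parents:
        sources |= trace_to_raw(parent)
    return sources


# Illustrative lineage: two raw feeds flow up to one gold table.
raw_ticks = DatasetVersion("raw", "exchange_ticks", "v1", "2024-01-02T00:00:00Z")
raw_refs = DatasetVersion("raw", "instrument_refs", "v3", "2024-01-02T00:05:00Z")
bronze_ticks = DatasetVersion("bronze", "ticks", "v1", "2024-01-02T01:00:00Z", (raw_ticks,))
bronze_refs = DatasetVersion("bronze", "refs", "v3", "2024-01-02T01:05:00Z", (raw_refs,))
silver_ticks = DatasetVersion(
    "silver", "ticks_clean", "v1", "2024-01-02T02:00:00Z", (bronze_ticks, bronze_refs)
)
gold_features = DatasetVersion(
    "gold", "model_features", "v1", "2024-01-02T03:00:00Z", (silver_ticks,)
)

for source in sorted(trace_to_raw(gold_features), key=lambda d: d.name):
    print(source.layer, source.name, source.version)
```

In a real deployment the lineage records would more likely live in a metadata catalog than in memory, but the traversal from a gold table back to its raw sources follows the same shape.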

Auditability

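For any model output or trading decision, it should be possible to trace back to the exact data versions used, so that auditors can reproduce the model's inputs and outputs. Change logs document every data modification. An immutable, append-only design guards against accidental or intentional tampering, because existing records are never overwritten. Encryption protects confidentiality; integrity is verified separately, for example with cryptographic hashes or digital signatures.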
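One common way to make an append-only change log tamper-evident is to chain its entries with cryptographic hashes, so that altering any earlier entry invalidates every later one. The following is a minimal sketch under that assumption; the `ChangeLog` class and the payload fields are hypothetical, and a production system would also need durable storage and access controls.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash, payload):
    """SHA-256 over the previous hash and a canonical JSON form of the payload."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class ChangeLog:
    """Append-only log in which each entry commits to the one before it."""

    def __init__(self):
        self._entries = []

    def append(self, payload):
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        entry = {"prev": prev, "payload": payload, "hash": _entry_hash(prev, payload)}
        self._entries.append(entry)
        return entry["hash"]

    def verify(self):
        """Recompute the chain; return False if any entry was altered or reordered."""
        prev = GENESIS
        for entry in self._entries:
            if entry["prev"] != prev:
                return False
            if entry["hash"] != _entry_hash(prev, entry["payload"]):
                return False
            prev = entry["hash"]
        return True


log = ChangeLog()
log.append({"table": "silver.ticks_clean", "version": "v1", "action": "create"})
log.append({"table": "gold.model_features", "version": "v1", "action": "create"})
print(log.verify())
```

Storing the latest hash somewhere outside the lake (for example, with the audit function) lets an auditor confirm later that the log has not been rewritten wholesale.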

Conclusion

Version-controlled data lakes enable comprehensive regulatory audit trails, improving compliance and reducing audit friction.