DataSynth
High-Performance Synthetic Enterprise Financial Data Generator
Developed by Michael Ivertowski, Zurich, Switzerland
What is DataSynth?
DataSynth is a high-performance, configurable synthetic data generator that produces realistic, interconnected enterprise financial data at scale. It generates coherent General Ledger journal entries, Chart of Accounts, SAP HANA-compatible ACDOCA event logs, document flows, subledger records, banking/KYC/AML transactions, OCEL 2.0 process mining data, audit workpapers, ML-ready graph exports, and complete enterprise process chains covering 20+ process families.
All generated data respects accounting identities (debits = credits, Assets = Liabilities + Equity), follows empirical distributions (Benford’s Law, log-normal mixtures), and maintains referential integrity across 100+ output tables.
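
As a rough illustration of two of these invariants, the sketch below checks that a journal entry balances and computes the first-digit frequencies expected under Benford's Law. It is not taken from the DataSynth code base; the function names and the sample entry are hypothetical.

```python
import math
from decimal import Decimal

def is_balanced(lines):
    """Return True when total debits equal total credits for one journal entry."""
    debits = sum(amount for side, amount in lines if side == "D")
    credits = sum(amount for side, amount in lines if side == "C")
    return debits == credits

def benford_first_digit(d):
    """Expected share of amounts whose leading digit is d (1..9) under Benford's Law."""
    return math.log10(1 + 1 / d)

# Hypothetical entry: a 1,190.00 vendor invoice split into expense and input tax.
entry = [
    ("D", Decimal("1000.00")),
    ("D", Decimal("190.00")),
    ("C", Decimal("1190.00")),
]

print(is_balanced(entry))
print({d: round(benford_first_digit(d), 3) for d in range(1, 10)})
```

Comparing the leading digits of a generated amount column against these expected shares is one simple way to spot-check a dataset.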
What’s New in v1.3.0
Enterprise Group Audit Simulation: DataSynth now generates complete audit simulation datasets covering the ISA lifecycle from engagement to opinion. This includes ISA 600 group audits, 10+ accounting standards (IFRS 3/8/9, IAS 12/19/21/37, ASC 326/718/740/805), and a full audit methodology framework (CRA, materiality, sampling, SCOTS, analytical procedures).
Use --preset audit-group to generate 113+ interconnected files ready for ML training, AI agent interaction, and auditor training.
Quick Links
| Section | Description |
|---|---|
| Getting Started | Installation, quick start guide, and demo mode |
| User Guide | CLI reference, server API, desktop UI, Python SDK |
| Configuration | Complete YAML schema and industry presets |
| Architecture | System design, data flow, resource management |
| Crate Reference | Detailed documentation for all 16 crates |
| Advanced Topics | Anomaly injection, graph export, fingerprinting, standards |
| Deployment | Docker, Kubernetes, bare metal, security hardening |
| Use Cases | Fraud detection, audit, AML/KYC, compliance, ESG |
| Changelog | Release history and version details |
Key Features
Core Data Generation
| Feature | Description |
|---|---|
| Statistical Distributions | Log-normal mixtures, Gaussian mixtures, Pareto, Weibull, Beta, and zero-inflated distributions with configurable components |
| Copula Correlations | Cross-field dependencies via Gaussian, Clayton, Gumbel, Frank, and Student-t copulas |
| Benford’s Law | First- and second-digit compliance with configurable deviation for anomaly injection |
| Temporal Patterns | Month-end/quarter-end/year-end volume spikes, intraday segments, business day calendars (15 regions), processing lags |
| Regime Changes | Economic cycles, acquisition effects, and structural breaks in time series |
| Industry Presets | Manufacturing, Retail, Financial Services, Healthcare, Technology |
| Chart of Accounts | Small (~100), Medium (~400), Large (~2500) account structures |
| Country Packs | Pluggable JSON packs (US, DE, GB + 7 more) with holidays, names, tax, addresses, payroll |
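
To make the idea of a log-normal mixture concrete, here is a minimal sketch that draws amounts from two weighted log-normal components. The weights, medians, and spreads are invented for illustration and do not reflect DataSynth's defaults. Python's `random` module stands in for the generator's own sampling code.

```python
import math
import random

def sample_lognormal_mixture(rng, components):
    """Draw one amount from a mixture of log-normal components.

    components: list of (weight, mu, sigma); weights should sum to 1.
    """
    weights = [w for w, _, _ in components]
    _, mu, sigma = rng.choices(components, weights=weights, k=1)[0]
    return rng.lognormvariate(mu, sigma)

# Hypothetical two-component mix: many small routine postings, few large ones.
COMPONENTS = [
    (0.85, math.log(250.0), 0.8),     # component median around 250
    (0.15, math.log(25_000.0), 1.1),  # component median around 25,000
]

rng = random.Random(42)  # fixed seed so the draw is repeatable
amounts = sorted(
    round(sample_lognormal_mixture(rng, COMPONENTS), 2) for _ in range(10_000)
)
median = amounts[len(amounts) // 2]
print(f"min={amounts[0]:.2f} median={median:.2f} max={amounts[-1]:.2f}")
```

The long right tail produced by the second component is what makes mixtures a better fit for posting amounts than a single log-normal.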
Enterprise Process Simulation
DataSynth covers the full enterprise process landscape:
| Process Family | Scope |
|---|---|
| General Ledger | Journal entries, chart of accounts, ACDOCA event logs |
| Procure-to-Pay | Purchase requisitions, POs, goods receipts, vendor invoices, payments, three-way match |
| Order-to-Cash | Sales orders, deliveries, customer invoices, receipts, dunning |
| Source-to-Contract | Spend analysis, sourcing projects, supplier qualification, RFx, bids, contracts, scorecards |
| Hire-to-Retire | Payroll, tax/deduction calculations, time & attendance, expense reports, benefit enrollment |
| Manufacturing | Production orders, BOM explosion, routing, WIP costing, quality inspections, cycle counts |
| Financial Reporting | Balance sheet, income statement, cash flow, changes in equity, KPIs, budget variance |
| Tax Accounting | Multi-jurisdiction, VAT/GST returns, ASC 740/IAS 12 provisions, FIN 48, withholding |
| Treasury | Cash positioning, forecasts, cash pooling, hedging (ASC 815/IFRS 9), debt covenants, netting |
| Project Accounting | WBS hierarchies, cost lines, PoC revenue, earned value (SPI/CPI/EAC), change orders |
| ESG / Sustainability | GHG Scope 1/2/3, energy/water/waste, diversity, safety, GRI/SASB/TCFD disclosures |
| Intercompany | IC matching, transfer pricing, consolidation eliminations, currency translation |
| Subledgers | AR, AP, Fixed Assets, Inventory with GL reconciliation |
| Period Close | Monthly close engine, depreciation, accruals, year-end closing entries |
| Banking / KYC / AML | Customer personas, KYC profiles, AML typologies (structuring, layering, mule, funnel) |
| Audit | Complete ISA lifecycle: engagements, workpapers, evidence, risk assessments, findings, opinions (ISA 700), KAMs (ISA 701), SOX 302/404 |
| Group Audit (ISA 600) | Component auditors, materiality allocation, scope assignment, component instructions/reports, consolidation |
| Sales | Quote-to-order pipeline with win rate modeling |
| Bank Reconciliation | Statement matching, outstanding checks, deposits in transit |
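
The three-way match in the Procure-to-Pay row compares a purchase order, a goods receipt, and a vendor invoice before payment is released. The sketch below shows one common form of that check. The record types, the 2% price tolerance, and the messages are assumptions for illustration, not DataSynth's data model.

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass(frozen=True)
class PoLine:
    quantity: Decimal
    unit_price: Decimal

@dataclass(frozen=True)
class GoodsReceipt:
    quantity: Decimal

@dataclass(frozen=True)
class VendorInvoice:
    quantity: Decimal
    unit_price: Decimal

def three_way_match(po, gr, inv, price_tolerance=Decimal("0.02")):
    """Return a list of mismatch reasons; an empty list means the invoice can be paid."""
    issues = []
    if inv.quantity > gr.quantity:
        issues.append("invoiced quantity exceeds received quantity")
    if gr.quantity > po.quantity:
        issues.append("received quantity exceeds ordered quantity")
    allowed = po.unit_price * price_tolerance
    if abs(inv.unit_price - po.unit_price) > allowed:
        issues.append("invoice price outside tolerance")
    return issues

# Hypothetical documents: 10 units ordered at 100.00 each.
po = PoLine(Decimal("10"), Decimal("100.00"))
clean = three_way_match(po, GoodsReceipt(Decimal("10")), VendorInvoice(Decimal("10"), Decimal("101.00")))
short = three_way_match(po, GoodsReceipt(Decimal("8")), VendorInvoice(Decimal("10"), Decimal("100.00")))
print(clean, short)
```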
Accounting & Audit Standards
- Accounting frameworks: US GAAP, IFRS, French GAAP (PCG), German GAAP (HGB/SKR04), dual reporting
- Revenue recognition: ASC 606 / IFRS 15 with performance obligations and SSP allocation
- Leases: ASC 842 / IFRS 16 with ROU assets and lease liabilities
- Fair value: ASC 820 / IFRS 13 Level 1/2/3 hierarchy
- Impairment: ASC 360 / IAS 36 testing with fair value estimation
- Audit standards: ISA (34 standards), PCAOB (19+ standards), SOX 302/404 compliance
- COSO 2013: 5 components, 17 principles, maturity levels
- Localized exports: FEC (French) and GoBD (German) audit file formats
- Enterprise Group Audit (ISA 600): Component auditor assignment, group materiality allocation, scope assignment (full/specific/analytical), component instructions and reports
- Audit Opinion (ISA 700/705/706/701): Opinion derived from findings severity and going concern, Key Audit Matters, PCAOB ICFR opinion
- Audit Methodology: Combined Risk Assessment (ISA 315), materiality calculations (ISA 320), sampling methodology (ISA 530), SCOTS classification, unusual item detection, analytical relationships (ISA 520)
- Deferred Tax (IAS 12 / ASC 740): Temporary differences, ETR reconciliation, rollforward schedules, valuation allowances
- Business Combinations (IFRS 3 / ASC 805): Purchase price allocation, fair value step-ups, goodwill, contingent consideration
- Segment Reporting (IFRS 8 / ASC 280): Operating segments with reconciliation to consolidated totals
- Expected Credit Loss (IFRS 9 / ASC 326): Provision matrix by aging bucket, forward-looking scenarios, ECL movements
- Pensions (IAS 19 / ASC 715): DBO rollforward, plan assets, pension expense, OCI remeasurements
- Consolidated Financial Statements: Standalone + consolidated with elimination schedules and going concern assessment
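
As a worked example of the provision matrix mentioned under Expected Credit Loss, the sketch below multiplies receivable balances per aging bucket by a loss rate and an optional forward-looking uplift. The bucket names, loss rates, and uplift factor are made up for illustration; real rates come from historical loss experience.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical loss rates per aging bucket.
LOSS_RATES = {
    "current": Decimal("0.005"),
    "1-30": Decimal("0.02"),
    "31-60": Decimal("0.05"),
    "61-90": Decimal("0.15"),
    "90+": Decimal("0.40"),
}

def ecl_provision(balances, loss_rates, scenario_uplift=Decimal("1.0")):
    """Return (provision per bucket, total provision), rounded to cents."""
    cents = Decimal("0.01")
    per_bucket = {
        bucket: (amount * loss_rates[bucket] * scenario_uplift).quantize(
            cents, rounding=ROUND_HALF_UP
        )
        for bucket, amount in balances.items()
    }
    return per_bucket, sum(per_bucket.values(), Decimal("0"))

# Hypothetical receivables by days past due.
balances = {
    "current": Decimal("100000"),
    "1-30": Decimal("40000"),
    "31-60": Decimal("20000"),
    "61-90": Decimal("10000"),
    "90+": Decimal("5000"),
}

base_buckets, base_total = ecl_provision(balances, LOSS_RATES)
_, stressed_total = ecl_provision(balances, LOSS_RATES, scenario_uplift=Decimal("1.10"))
print(base_total, stressed_total)
```

The difference between the two totals is the kind of movement a forward-looking scenario adds to the provision.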
Fraud, Anomalies & Data Quality
- ACFE-aligned fraud taxonomy: Asset misappropriation, corruption, financial statement fraud
- 60+ anomaly types with full labeling for supervised ML
- Collusion modeling: 9 ring types with role-based conspirators and escalation dynamics
- Management override: Fraud triangle modeling (pressure, opportunity, rationalization)
- Red flag generation: 40+ probabilistic fraud indicators with Bayesian calibration
- Industry-specific patterns: Manufacturing yield manipulation, retail sweethearting, healthcare upcoding
- Data quality variations: Missing values (MCAR/MAR/MNAR), format variations, typos, duplicates
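
For the data quality variations, the simplest case is MCAR (missing completely at random), where each value is blanked with a fixed probability regardless of its content. The sketch below shows that mechanism together with a label vector for supervised training. The function name, field name, and rate are hypothetical.

```python
import random

def inject_mcar(records, field, rate, seed):
    """Blank out `field` in a random share of records, independent of any values.

    Returns (new_records, labels) where labels[i] is True if record i was altered.
    The input records are left unchanged.
    """
    rng = random.Random(seed)
    out, labels = [], []
    for rec in records:
        hit = rng.random() < rate
        out.append({**rec, field: None} if hit else dict(rec))
        labels.append(hit)
    return out, labels

# Hypothetical records with a cost center that should sometimes go missing.
records = [{"id": i, "cost_center": f"CC{i % 7}"} for i in range(1000)]
dirty, labels = inject_mcar(records, "cost_center", rate=0.1, seed=7)
print(sum(labels), "of", len(records), "records altered")
```

MAR and MNAR differ only in how `hit` is decided: it would depend on other fields (MAR) or on the blanked value itself (MNAR).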
Machine Learning & Graph Export
- Graph formats: PyTorch Geometric, Neo4j, DGL, RustGraph JSON
- Multi-layer hypergraph: 3-layer (Governance, Process Events, Accounting Network)
- Train/val/test splits with configurable partitioning
- Anomaly, fraud, quality, and drift labels in standardized format
- Evaluation framework: Auto-tuning with quality gate enforcement
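
One way to build train/val/test splits that stay stable across runs is to hash an entity identifier into a partition, so every record of the same vendor or customer lands in the same split. This is a sketch of that general technique, not a description of how DataSynth partitions data. The ratios, salt, and ID format are assumptions.

```python
import hashlib
from collections import Counter

def assign_split(entity_id, ratios=(0.7, 0.15, 0.15), salt="split-v1"):
    """Map an entity ID to 'train', 'val', or 'test' deterministically."""
    digest = hashlib.sha256(f"{salt}:{entity_id}".encode()).digest()
    # Interpret the first 8 bytes as a uniform number in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    train, val, _ = ratios
    if u < train:
        return "train"
    if u < train + val:
        return "val"
    return "test"

# Hypothetical vendor IDs.
ids = [f"VENDOR-{i:05d}" for i in range(10_000)]
counts = Counter(assign_split(i) for i in ids)
print(dict(counts))
```

Changing the salt produces a different but equally stable partitioning, which is handy for cross-validation style experiments.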
Advanced Generation
| Capability | Description |
|---|---|
| LLM enrichment | Pluggable providers (mock/OpenAI-compatible) for vendor names, descriptions, anomaly explanations |
| Diffusion models | Statistical diffusion with Langevin reverse process and hybrid blending |
| Causal models | Structural causal models with do-calculus interventions and counterfactual generation |
| Natural language config | Generate YAML configurations from plain English |
| Scenario engine | Built-in fraud packs: revenue_fraud, payroll_ghost, vendor_kickback, management_override |
| Process mining | OCEL 2.0 + XES 2.0 with 101+ activity types across 12 process families |
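
To show what a do-calculus intervention and a counterfactual mean in practice, here is a toy structural causal model with the chain season, discount, volume, revenue. It is only a sketch of the concept: the variables, coefficients, and noise levels are invented, and it is not DataSynth's causal engine.

```python
import random

def simulate(n, seed, do_discount=None):
    """Sample n sales rows from a toy SCM: season -> discount -> volume -> revenue.

    Passing do_discount replaces the discount mechanism with a constant,
    which is the do(discount = x) intervention. Noise terms are always drawn,
    so the same seed yields the matching counterfactual for every row.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        season = rng.gauss(0.0, 1.0)       # exogenous demand driver
        u_discount = rng.gauss(0.0, 0.01)  # noise on the discount mechanism
        u_volume = rng.gauss(0.0, 5.0)     # noise on the volume mechanism
        if do_discount is None:
            discount = 0.05 + 0.02 * season + u_discount
        else:
            discount = do_discount
        volume = 100.0 + 20.0 * season + 300.0 * discount + u_volume
        revenue = volume * 50.0 * (1.0 - discount)
        rows.append({"discount": discount, "volume": volume, "revenue": revenue})
    return rows

observed = simulate(1_000, seed=11)
counterfactual = simulate(1_000, seed=11, do_discount=0.10)
lift = sum(c["volume"] - o["volume"] for o, c in zip(observed, counterfactual)) / len(observed)
print(f"average volume change under do(discount=0.10): {lift:.2f}")
```

Because both runs share the same noise, each counterfactual row answers "what would this exact sale have looked like with a 10% discount".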
Production Features
- REST / gRPC / WebSocket APIs with streaming and backpressure handling
- Authentication: API key (Argon2id), JWT/OIDC (RS256), RBAC (Admin/Operator/Viewer)
- Resource guards: Memory, disk, CPU monitoring with graceful degradation
- Deterministic generation: Seeded ChaCha8 RNG for reproducible output
- Desktop UI: Cross-platform Tauri/SvelteKit with 40+ configuration pages
- Python SDK: Programmatic access with blueprints and DataFrame loading
- Docker & Kubernetes: Distroless containers, Helm chart with HPA/PDB
- Observability: OpenTelemetry traces, Prometheus metrics, structured JSON logging
- Data lineage: Per-file checksums, lineage graph, W3C PROV-JSON export
- Privacy-preserving fingerprinting: Differential privacy, k-anonymity, federated extraction
- Ecosystem integrations: Apache Airflow, dbt, MLflow, Apache Spark
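
Per-file checksums, as listed under data lineage, can be reproduced on the consumer side with a few lines of standard-library Python. The sketch below builds a path-to-SHA-256 manifest for a directory. The manifest shape and the file name `example.csv` are assumptions and may not match what DataSynth writes.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large outputs do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(output_dir):
    """Return {relative path: sha256 hex digest} for every file under output_dir."""
    root = Path(output_dir)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

# Demo on a throwaway directory with one hypothetical output file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example.csv").write_bytes(b"abc")
    manifest = build_manifest(tmp)
print(manifest)
```

Comparing two manifests produced from the same seed and configuration is a quick reproducibility check.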
Quick Start
# Install from source
git clone https://github.com/mivertowski/SyntheticData.git
cd SyntheticData
cargo build --release
# Demo mode
./target/release/datasynth-data generate --demo --output ./output
# Custom configuration
./target/release/datasynth-data init --industry manufacturing --complexity medium -o config.yaml
./target/release/datasynth-data generate --config config.yaml --output ./output
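
If you prefer to drive these commands from a script, a thin wrapper around `subprocess` is enough. The sketch below only uses the subcommands and flags shown above. The helper names are hypothetical, and this is not the Python SDK.

```python
import subprocess

BINARY = "./target/release/datasynth-data"

def init_command(industry, complexity, config_path):
    """Build the argument list for the `init` step shown above."""
    return [BINARY, "init", "--industry", industry,
            "--complexity", complexity, "-o", str(config_path)]

def generate_command(output_dir, config_path=None):
    """Build the `generate` call; without a config it falls back to demo mode."""
    source = ["--config", str(config_path)] if config_path else ["--demo"]
    return [BINARY, "generate", *source, "--output", str(output_dir)]

def run_pipeline(industry, complexity, config_path, output_dir):
    """Run init, then generate; raises CalledProcessError if either step fails."""
    subprocess.run(init_command(industry, complexity, config_path), check=True)
    subprocess.run(generate_command(output_dir, config_path), check=True)

print(" ".join(generate_command("./output")))
print(" ".join(generate_command("./output", "config.yaml")))
```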
Performance
| Metric | Value |
|---|---|
| Single-threaded throughput | 200,000+ journal entries/second |
| Parallel scaling | Linear with available CPU cores |
| Memory model | Streaming generation with configurable backpressure |
| Determinism | Fully reproducible via seeded ChaCha8 RNG |
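
Reproducibility and parallel execution can coexist when every partition gets its own seed derived from a single master seed, so results do not depend on scheduling order. The sketch below illustrates that general pattern. It is an assumption about a workable design rather than a description of DataSynth internals, and Python's Mersenne Twister stands in for ChaCha8.

```python
import hashlib
import random

def partition_seed(master_seed, partition):
    """Derive an independent 64-bit seed for one partition from the master seed."""
    digest = hashlib.sha256(f"{master_seed}:{partition}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def generate_partition(master_seed, partition, n):
    """Generate n amounts for one partition using only its derived seed."""
    rng = random.Random(partition_seed(master_seed, partition))
    return [round(rng.lognormvariate(6.0, 1.0), 2) for _ in range(n)]

# Processing partitions in a different order must not change any partition's data.
forward = {p: generate_partition(2024, p, 5) for p in range(4)}
backward = {p: generate_partition(2024, p, 5) for p in reversed(range(4))}
print(forward == backward)
```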
Architecture
DataSynth is organized as a Rust workspace with 16 modular crates:
datasynth-cli           CLI binary (generate, validate, init, info, fingerprint, scenario)
datasynth-server        REST / gRPC / WebSocket server with auth and rate limiting
datasynth-ui            Tauri + SvelteKit desktop application
        │
datasynth-runtime       Generation orchestrator (parallel execution, resource guards, streaming)
        │
datasynth-generators    50+ data generators across all process families
datasynth-banking       KYC / AML banking transaction generator
datasynth-ocpm          OCEL 2.0 / XES 2.0 process mining
datasynth-fingerprint   Privacy-preserving fingerprint extraction and synthesis
datasynth-standards     Accounting and audit standards (IFRS, US GAAP, ISA, SOX, PCAOB)
        │
datasynth-graph         Graph export (PyTorch Geometric, Neo4j, DGL, RustGraph, Hypergraph)
datasynth-graph-export  Unified graph export pipeline with 78 entity types
datasynth-eval          Statistical evaluation, quality gates, auto-tuning
        │
datasynth-config        Configuration schema, validation, industry presets
        │
datasynth-core          Domain models, traits, distributions, resource guards
        │
datasynth-output        Output sinks (CSV, JSON, NDJSON, Parquet + Zstd) with streaming
datasynth-test-utils    Test utilities, fixtures, mocks
License
Copyright 2024-2026 Michael Ivertowski
Licensed under the Apache License, Version 2.0. See LICENSE for details.
Support
Commercial support, custom development, and enterprise licensing are available upon request. To inquire, open an issue on GitHub.
DataSynth is provided “as is” without warranty of any kind. It is intended for testing, development, and research purposes. Generated data should not be used as a substitute for real financial records.