DataSynth

High-Performance Synthetic Enterprise Financial Data Generator

Developed by Michael Ivertowski, Zurich, Switzerland

What is DataSynth?

DataSynth is a high-performance, configurable synthetic data generator that produces realistic, interconnected enterprise financial data at scale. It generates coherent General Ledger journal entries, Chart of Accounts, SAP HANA-compatible ACDOCA event logs, document flows, subledger records, banking/KYC/AML transactions, OCEL 2.0 process mining data, audit workpapers, ML-ready graph exports, and complete enterprise process chains covering 20+ process families.

All generated data respects accounting identities (debits = credits, Assets = Liabilities + Equity), follows empirical distributions (Benford’s Law, log-normal mixtures), and maintains referential integrity across 100+ output tables.
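As a minimal sketch of the first identity, the check below verifies that debits equal credits within each journal entry. The tuple layout and the document IDs are illustrative assumptions and do not reflect DataSynth's actual output schema.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical journal lines: (document_id, debit_amount, credit_amount).
lines = [
    ("JE-1", Decimal("1200.00"), Decimal("0.00")),
    ("JE-1", Decimal("0.00"), Decimal("1200.00")),
    ("JE-2", Decimal("450.50"), Decimal("0.00")),
    ("JE-2", Decimal("0.00"), Decimal("400.00")),
    ("JE-2", Decimal("0.00"), Decimal("50.50")),
]

def unbalanced_entries(journal_lines):
    """Return the sorted IDs of entries whose debits and credits differ."""
    net = defaultdict(lambda: Decimal("0"))
    for doc_id, debit, credit in journal_lines:
        net[doc_id] += debit - credit
    return sorted(doc_id for doc_id, balance in net.items() if balance != 0)

# An empty list means every entry is balanced.
balanced_check = unbalanced_entries(lines)
```

Using `Decimal` rather than floats keeps the comparison exact, which matters when a check must hold to the cent.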

What’s New in v1.3.0

Enterprise Group Audit Simulation — DataSynth now generates complete audit simulation datasets covering the ISA lifecycle from engagement to opinion, with ISA 600 group audits, 10+ accounting standards (IFRS 3/8/9, IAS 12/19/21/37, ASC 326/718/740/805), and a full audit methodology framework (CRA, materiality, sampling, SCOTS, analytical procedures).

Use --preset audit-group to generate 113+ interconnected files ready for ML training, AI agent interaction, and auditor training.

| Section | Description |
|---|---|
| Getting Started | Installation, quick start guide, and demo mode |
| User Guide | CLI reference, server API, desktop UI, Python SDK |
| Configuration | Complete YAML schema and industry presets |
| Architecture | System design, data flow, resource management |
| Crate Reference | Detailed documentation for all 16 crates |
| Advanced Topics | Anomaly injection, graph export, fingerprinting, standards |
| Deployment | Docker, Kubernetes, bare metal, security hardening |
| Use Cases | Fraud detection, audit, AML/KYC, compliance, ESG |
| Changelog | Release history and version details |

Key Features

Core Data Generation

| Feature | Description |
|---|---|
| Statistical Distributions | Log-normal mixtures, Gaussian mixtures, Pareto, Weibull, Beta, zero-inflated with configurable components |
| Copula Correlations | Cross-field dependencies via Gaussian, Clayton, Gumbel, Frank, and Student-t copulas |
| Benford’s Law | First and second-digit compliance with configurable deviation for anomaly injection |
| Temporal Patterns | Month-end/quarter-end/year-end volume spikes, intraday segments, business day calendars (15 regions), processing lags |
| Regime Changes | Economic cycles, acquisition effects, and structural breaks in time series |
| Industry Presets | Manufacturing, Retail, Financial Services, Healthcare, Technology |
| Chart of Accounts | Small (~100), Medium (~400), Large (~2500) account structures |
| Country Packs | Pluggable JSON packs (US, DE, GB + 7 more) with holidays, names, tax, addresses, payroll |
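To make two of these features concrete, the sketch below samples amounts from a two-component log-normal mixture and compares their first-digit frequencies with the Benford expectation P(d) = log10(1 + 1/d). The mixture weights and parameters are made-up illustration values, and the functions are hypothetical helpers, not part of DataSynth.

```python
import math
import random
from collections import Counter

def benford_first_digit_probs():
    """Expected first-digit probabilities: P(d) = log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x):
    """Leading significant digit of a non-zero number."""
    # 17 fractional digits exceed double precision, so the formatted
    # mantissa is not rounded up to the next leading digit.
    return int(f"{abs(x):.17e}"[0])

def sample_lognormal_mixture(rng, components, n):
    """Draw n amounts from a mixture given as (weight, mu, sigma) triples."""
    weights = [w for w, _, _ in components]
    samples = []
    for _ in range(n):
        _, mu, sigma = rng.choices(components, weights=weights, k=1)[0]
        samples.append(rng.lognormvariate(mu, sigma))
    return samples

def first_digit_frequencies(values):
    """Observed share of each leading digit 1..9."""
    counts = Counter(first_digit(v) for v in values)
    return {d: counts.get(d, 0) / len(values) for d in range(1, 10)}

# Assumed mixture: many small amounts plus a tail of large ones.
components = [(0.7, 5.0, 1.2), (0.3, 9.0, 1.5)]
amounts = sample_lognormal_mixture(random.Random(42), components, 10_000)
observed = first_digit_frequencies(amounts)
expected = benford_first_digit_probs()
```

Comparing `observed[d]` with `expected[d]` for each digit gives a simple conformity check. A deliberately injected deviation would show up as a gap between the two.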

Enterprise Process Simulation

DataSynth covers the full enterprise process landscape:

| Process Family | Scope |
|---|---|
| General Ledger | Journal entries, chart of accounts, ACDOCA event logs |
| Procure-to-Pay | Purchase requisitions, POs, goods receipts, vendor invoices, payments, three-way match |
| Order-to-Cash | Sales orders, deliveries, customer invoices, receipts, dunning |
| Source-to-Contract | Spend analysis, sourcing projects, supplier qualification, RFx, bids, contracts, scorecards |
| Hire-to-Retire | Payroll, tax/deduction calculations, time & attendance, expense reports, benefit enrollment |
| Manufacturing | Production orders, BOM explosion, routing, WIP costing, quality inspections, cycle counts |
| Financial Reporting | Balance sheet, income statement, cash flow, changes in equity, KPIs, budget variance |
| Tax Accounting | Multi-jurisdiction, VAT/GST returns, ASC 740/IAS 12 provisions, FIN 48, withholding |
| Treasury | Cash positioning, forecasts, cash pooling, hedging (ASC 815/IFRS 9), debt covenants, netting |
| Project Accounting | WBS hierarchies, cost lines, PoC revenue, earned value (SPI/CPI/EAC), change orders |
| ESG / Sustainability | GHG Scope 1/2/3, energy/water/waste, diversity, safety, GRI/SASB/TCFD disclosures |
| Intercompany | IC matching, transfer pricing, consolidation eliminations, currency translation |
| Subledgers | AR, AP, Fixed Assets, Inventory with GL reconciliation |
| Period Close | Monthly close engine, depreciation, accruals, year-end closing entries |
| Banking / KYC / AML | Customer personas, KYC profiles, AML typologies (structuring, layering, mule, funnel) |
| Audit | Complete ISA lifecycle: engagements, workpapers, evidence, risk assessments, findings, opinions (ISA 700), KAMs (ISA 701), SOX 302/404 |
| Group Audit (ISA 600) | Component auditors, materiality allocation, scope assignment, component instructions/reports, consolidation |
| Sales | Quote-to-order pipeline with win rate modeling |
| Bank Reconciliation | Statement matching, outstanding checks, deposits in transit |
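The three-way match in the Procure-to-Pay row can be illustrated with a small sketch that compares a purchase order, a goods receipt, and a vendor invoice. The `Line` type, the 2% price tolerance, and the issue labels are assumptions chosen for illustration, not DataSynth's matching rules.

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass(frozen=True)
class Line:
    """Hypothetical document line with a quantity and a unit price."""
    quantity: Decimal
    unit_price: Decimal

def three_way_match(po, received_qty, invoice, price_tolerance=Decimal("0.02")):
    """Return a list of mismatches; an empty list means the documents agree."""
    issues = []
    if received_qty > po.quantity:
        issues.append("received quantity exceeds ordered quantity")
    if invoice.quantity > received_qty:
        issues.append("invoiced quantity exceeds received quantity")
    if abs(invoice.unit_price - po.unit_price) > po.unit_price * price_tolerance:
        issues.append("invoice price outside tolerance")
    return issues

po = Line(Decimal("100"), Decimal("12.50"))
clean = three_way_match(po, Decimal("100"), Line(Decimal("100"), Decimal("12.60")))
flagged = three_way_match(po, Decimal("90"), Line(Decimal("100"), Decimal("14.00")))
```

Here `clean` is empty because the price differs by less than 2%, while `flagged` reports both a quantity and a price mismatch.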

Accounting & Audit Standards

  • Accounting frameworks: US GAAP, IFRS, French GAAP (PCG), German GAAP (HGB/SKR04), dual reporting
  • Revenue recognition: ASC 606 / IFRS 15 with performance obligations and SSP allocation
  • Leases: ASC 842 / IFRS 16 with ROU assets and lease liabilities
  • Fair value: ASC 820 / IFRS 13 Level 1/2/3 hierarchy
  • Impairment: ASC 360 / IAS 36 testing with fair value estimation
  • Audit standards: ISA (34 standards), PCAOB (19+ standards), SOX 302/404 compliance
  • COSO 2013: 5 components, 17 principles, maturity levels
  • Localized exports: FEC (French) and GoBD (German) audit file formats
  • Enterprise Group Audit (ISA 600): Component auditor assignment, group materiality allocation, scope assignment (full/specific/analytical), component instructions and reports
  • Audit Opinion (ISA 700/705/706/701): Opinion derived from findings severity and going concern, Key Audit Matters, PCAOB ICFR opinion
  • Audit Methodology: Combined Risk Assessment (ISA 315), materiality calculations (ISA 320), sampling methodology (ISA 530), SCOTS classification, unusual item detection, analytical relationships (ISA 520)
  • Deferred Tax (IAS 12 / ASC 740): Temporary differences, ETR reconciliation, rollforward schedules, valuation allowances
  • Business Combinations (IFRS 3 / ASC 805): Purchase price allocation, fair value step-ups, goodwill, contingent consideration
  • Segment Reporting (IFRS 8 / ASC 280): Operating segments with reconciliation to consolidated totals
  • Expected Credit Loss (IFRS 9 / ASC 326): Provision matrix by aging bucket, forward-looking scenarios, ECL movements
  • Pensions (IAS 19 / ASC 715): DBO rollforward, plan assets, pension expense, OCI remeasurements
  • Consolidated Financial Statements: Standalone + consolidated with elimination schedules and going concern assessment
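As an illustration of the provision matrix mentioned under Expected Credit Loss, the sketch below multiplies receivables in each aging bucket by a loss rate and sums the result. The bucket labels, exposures, and rates are invented example values. The forward-looking scenario weighting named in the list is deliberately left out of this simplified version.

```python
from decimal import Decimal

# Hypothetical aging buckets: (label, gross receivables, expected loss rate).
provision_matrix = [
    ("current", Decimal("10000"), Decimal("0.01")),
    ("1-30 days", Decimal("5000"), Decimal("0.05")),
    ("31-90 days", Decimal("2000"), Decimal("0.20")),
    ("over 90 days", Decimal("1000"), Decimal("0.50")),
]

def expected_credit_loss(matrix):
    """Return the per-bucket provisions and their total."""
    per_bucket = {label: exposure * rate for label, exposure, rate in matrix}
    total = sum(per_bucket.values(), Decimal("0"))
    return per_bucket, total

per_bucket, total_ecl = expected_credit_loss(provision_matrix)
```

With these example inputs the bucket provisions are 100, 250, 400, and 500, for a total of 1,250.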

Fraud, Anomalies & Data Quality

  • ACFE-aligned fraud taxonomy: Asset misappropriation, corruption, financial statement fraud
  • 60+ anomaly types with full labeling for supervised ML
  • Collusion modeling: 9 ring types with role-based conspirators and escalation dynamics
  • Management override: Fraud triangle modeling (pressure, opportunity, rationalization)
  • Red flag generation: 40+ probabilistic fraud indicators with Bayesian calibration
  • Industry-specific patterns: Manufacturing yield manipulation, retail sweethearting, healthcare upcoding
  • Data quality variations: Missing values (MCAR/MAR/MNAR), format variations, typos, duplicates
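Of the missing-value mechanisms listed, MCAR (missing completely at random) is the simplest to sketch: each value is blanked with the same probability, independent of the data. The function name, record layout, and field name below are hypothetical.

```python
import random

def inject_mcar(records, field, rate, seed):
    """Return copies of the records with `field` set to None at probability `rate`."""
    rng = random.Random(seed)
    result = []
    for record in records:
        copy = dict(record)  # leave the input records untouched
        if rng.random() < rate:
            copy[field] = None
        result.append(copy)
    return result

records = [{"id": i, "cost_center": f"CC{i:03d}"} for i in range(1000)]
degraded = inject_mcar(records, "cost_center", rate=0.1, seed=7)
missing_share = sum(r["cost_center"] is None for r in degraded) / len(degraded)
```

MAR and MNAR differ in that the blanking probability depends on other observed fields or on the missing value itself, so they would need a rate function in place of the constant `rate`.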

Machine Learning & Graph Export

  • Graph formats: PyTorch Geometric, Neo4j, DGL, RustGraph JSON
  • Multi-layer hypergraph: 3-layer (Governance, Process Events, Accounting Network)
  • Train/val/test splits with configurable partitioning
  • Anomaly, fraud, quality, and drift labels in standardized format
  • Evaluation framework: Auto-tuning with quality gate enforcement
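One common way to build reproducible train/val/test splits on graph data is to hash an entity ID into a bucket, so all records of one entity land in the same partition. The sketch below shows that idea. It is an assumed approach with hypothetical names and default ratios, not a description of how DataSynth partitions its exports.

```python
import hashlib

def assign_split(entity_id, seed=0, train=0.8, val=0.1):
    """Map an entity ID to 'train', 'val', or 'test' deterministically."""
    digest = hashlib.sha256(f"{seed}:{entity_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < train:
        return "train"
    if bucket < train + val:
        return "val"
    return "test"

vendor_ids = [f"V{i:05d}" for i in range(1000)]
splits = {vendor_id: assign_split(vendor_id, seed=42) for vendor_id in vendor_ids}
```

Because the assignment depends only on the ID and the seed, regenerating the dataset does not move an entity between partitions.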

Advanced Generation

| Capability | Description |
|---|---|
| LLM enrichment | Pluggable providers (mock/OpenAI-compatible) for vendor names, descriptions, anomaly explanations |
| Diffusion models | Statistical diffusion with Langevin reverse process and hybrid blending |
| Causal models | Structural causal models with do-calculus interventions and counterfactual generation |
| Natural language config | Generate YAML configurations from plain English |
| Scenario engine | Built-in fraud packs: revenue_fraud, payroll_ghost, vendor_kickback, management_override |
| Process mining | OCEL 2.0 + XES 2.0 with 101+ activity types across 12 process families |

Production Features

  • REST / gRPC / WebSocket APIs with streaming and backpressure handling
  • Authentication: API key (Argon2id), JWT/OIDC (RS256), RBAC (Admin/Operator/Viewer)
  • Resource guards: Memory, disk, CPU monitoring with graceful degradation
  • Deterministic generation: Seeded ChaCha8 RNG for reproducible output
  • Desktop UI: Cross-platform Tauri/SvelteKit with 40+ configuration pages
  • Python SDK: Programmatic access with blueprints and DataFrame loading
  • Docker & Kubernetes: Distroless containers, Helm chart with HPA/PDB
  • Observability: OpenTelemetry traces, Prometheus metrics, structured JSON logging
  • Data lineage: Per-file checksums, lineage graph, W3C PROV-JSON export
  • Privacy-preserving fingerprinting: Differential privacy, k-anonymity, federated extraction
  • Ecosystem integrations: Apache Airflow, dbt, MLflow, Apache Spark

Quick Start

# Install from source
git clone https://github.com/mivertowski/SyntheticData.git
cd SyntheticData
cargo build --release

# Demo mode
./target/release/datasynth-data generate --demo --output ./output

# Custom configuration
./target/release/datasynth-data init --industry manufacturing --complexity medium -o config.yaml
./target/release/datasynth-data generate --config config.yaml --output ./output

Performance

| Metric | Value |
|---|---|
| Single-threaded throughput | 200,000+ journal entries/second |
| Parallel scaling | Linear with available CPU cores |
| Memory model | Streaming generation with configurable backpressure |
| Determinism | Fully reproducible via seeded ChaCha8 RNG |
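The determinism property can be illustrated with any seeded generator: the same seed must yield byte-identical output, which a checksum makes easy to verify. The sketch below uses Python's built-in Mersenne Twister as a stand-in, not ChaCha8, and the function names and distribution parameters are illustrative assumptions.

```python
import hashlib
import random

def generate_amounts(seed, n):
    """Produce n pseudo-random amounts from a seeded generator."""
    rng = random.Random(seed)
    return [round(rng.lognormvariate(6.0, 1.0), 2) for _ in range(n)]

def checksum(values):
    """SHA-256 over a canonical text rendering of the values."""
    payload = "\n".join(f"{v:.2f}" for v in values).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

run_a = checksum(generate_amounts(seed=7, n=100))
run_b = checksum(generate_amounts(seed=7, n=100))
run_c = checksum(generate_amounts(seed=8, n=100))
```

Here `run_a` equals `run_b`, while `run_c` differs because the seed changed. Comparing checksums this way is a cheap regression test for reproducibility.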

Architecture

DataSynth is organized as a Rust workspace with 16 modular crates:

datasynth-cli            CLI binary (generate, validate, init, info, fingerprint, scenario)
datasynth-server         REST / gRPC / WebSocket server with auth and rate limiting
datasynth-ui             Tauri + SvelteKit desktop application
                │
datasynth-runtime        Generation orchestrator (parallel execution, resource guards, streaming)
                │
datasynth-generators     50+ data generators across all process families
datasynth-banking        KYC / AML banking transaction generator
datasynth-ocpm           OCEL 2.0 / XES 2.0 process mining
datasynth-fingerprint    Privacy-preserving fingerprint extraction and synthesis
datasynth-standards      Accounting and audit standards (IFRS, US GAAP, ISA, SOX, PCAOB)
                │
datasynth-graph          Graph export (PyTorch Geometric, Neo4j, DGL, RustGraph, Hypergraph)
datasynth-graph-export   Unified graph export pipeline with 78 entity types
datasynth-eval           Statistical evaluation, quality gates, auto-tuning
                │
datasynth-config         Configuration schema, validation, industry presets
                │
datasynth-core           Domain models, traits, distributions, resource guards
                │
datasynth-output         Output sinks (CSV, JSON, NDJSON, Parquet + Zstd) with streaming
datasynth-test-utils     Test utilities, fixtures, mocks

License

Copyright 2024-2026 Michael Ivertowski

Licensed under the Apache License, Version 2.0. See LICENSE for details.

Support

Commercial support, custom development, and enterprise licensing are available upon request. To inquire, open an issue on GitHub.


DataSynth is provided “as is” without warranty of any kind. It is intended for testing, development, and research purposes. Generated data should not be used as a substitute for real financial records.