DataSynth

High-Performance Synthetic Enterprise Financial Data Generator

Developed by Michael Ivertowski, Zurich, Switzerland

What is DataSynth?

DataSynth is a high-performance, configurable synthetic data generator that produces realistic, interconnected enterprise financial data at scale. It generates coherent General Ledger journal entries, Chart of Accounts, SAP HANA-compatible ACDOCA event logs, document flows, subledger records, banking/KYC/AML transactions, OCEL 2.0 process mining data, audit workpapers, ML-ready graph exports, and complete enterprise process chains covering 20+ process families.

All generated data respects accounting identities (debits = credits, Assets = Liabilities + Equity), follows empirical distributions (Benford’s Law, log-normal mixtures), and maintains referential integrity across 100+ output tables.
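
The accounting identities named above can be made concrete with a small check. The following is a minimal sketch in Python, not DataSynth's own implementation: the record layout, the account numbers, and the helper names `is_balanced` and `balance_sheet_holds` are assumptions made for illustration.

```python
from decimal import Decimal

ZERO = Decimal("0")

def is_balanced(lines):
    """True when total debits equal total credits for one journal entry."""
    debits = sum((line["debit"] for line in lines), ZERO)
    credits = sum((line["credit"] for line in lines), ZERO)
    return debits == credits

def balance_sheet_holds(assets, liabilities, equity):
    """True when Assets = Liabilities + Equity."""
    return assets == liabilities + equity

# Hypothetical entry: a 1,000.00 purchase on credit plus 190.00 input tax.
# Account numbers are illustrative only.
entry = [
    {"account": "5000", "debit": Decimal("1000.00"), "credit": ZERO},
    {"account": "1576", "debit": Decimal("190.00"), "credit": ZERO},
    {"account": "3300", "debit": ZERO, "credit": Decimal("1190.00")},
]

print(is_balanced(entry))
print(balance_sheet_holds(Decimal("500"), Decimal("300"), Decimal("200")))
```

`Decimal` is used instead of floats so that equality comparisons on monetary amounts are exact.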

| Section | Description |
| --- | --- |
| Getting Started | Installation, quick start guide, and demo mode |
| User Guide | CLI reference, server API, desktop UI, Python SDK |
| Configuration | Complete YAML schema and industry presets |
| Architecture | System design, data flow, resource management |
| Crate Reference | Detailed documentation for all 15 crates |
| Advanced Topics | Anomaly injection, graph export, fingerprinting, standards |
| Deployment | Docker, Kubernetes, bare metal, security hardening |
| Use Cases | Fraud detection, audit, AML/KYC, compliance, ESG |
| Changelog | Release history and version details |

Key Features

Core Data Generation

| Feature | Description |
| --- | --- |
| Statistical Distributions | Log-normal mixtures, Gaussian mixtures, Pareto, Weibull, Beta, zero-inflated with configurable components |
| Copula Correlations | Cross-field dependencies via Gaussian, Clayton, Gumbel, Frank, and Student-t copulas |
| Benford’s Law | First and second-digit compliance with configurable deviation for anomaly injection |
| Temporal Patterns | Month-end/quarter-end/year-end volume spikes, intraday segments, business day calendars (15 regions), processing lags |
| Regime Changes | Economic cycles, acquisition effects, and structural breaks in time series |
| Industry Presets | Manufacturing, Retail, Financial Services, Healthcare, Technology |
| Chart of Accounts | Small (~100), Medium (~400), Large (~2500) account structures |
| Country Packs | Pluggable JSON packs (US, DE, GB + 7 more) with holidays, names, tax, addresses, payroll |
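
To illustrate how a log-normal mixture relates to Benford's Law, the sketch below samples amounts from a three-component mixture and compares the observed first-digit frequencies with the Benford expectation log10(1 + 1/d). It is a hedged illustration using only the Python standard library. The component weights and parameters, the seed, and all function names are assumptions and are not taken from DataSynth's configuration.

```python
import math
import random
from collections import Counter

def benford_expected(digit):
    """Benford's Law probability that the first significant digit equals `digit`."""
    return math.log10(1 + 1 / digit)

def sample_lognormal_mixture(rng, components, n):
    """Draw n amounts from a mixture of (weight, mu, sigma) log-normal components."""
    weights = [w for w, _, _ in components]
    amounts = []
    for _ in range(n):
        _, mu, sigma = rng.choices(components, weights=weights, k=1)[0]
        amounts.append(rng.lognormvariate(mu, sigma))
    return amounts

def first_digit(x):
    """Leading significant digit of a positive number (via scientific notation)."""
    return int(f"{x:.15e}"[0])

def mean_abs_deviation(amounts):
    """Mean absolute gap between observed and Benford first-digit frequencies."""
    counts = Counter(first_digit(a) for a in amounts)
    n = len(amounts)
    return sum(abs(counts[d] / n - benford_expected(d)) for d in range(1, 10)) / 9

# Hypothetical mixture: many small, some medium, few large amounts.
rng = random.Random(42)
components = [(0.6, 4.0, 1.5), (0.3, 7.0, 1.2), (0.1, 10.0, 1.0)]
amounts = sample_lognormal_mixture(rng, components, 20_000)
mad = mean_abs_deviation(amounts)
print(f"mean absolute deviation from Benford: {mad:.4f}")
```

Amounts that span several orders of magnitude, as wide log-normal components do, tend to track the Benford distribution closely, which is why a small deviation is expected here.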

Enterprise Process Simulation

DataSynth simulates the following enterprise process families:

| Process Family | Scope |
| --- | --- |
| General Ledger | Journal entries, chart of accounts, ACDOCA event logs |
| Procure-to-Pay | Purchase requisitions, POs, goods receipts, vendor invoices, payments, three-way match |
| Order-to-Cash | Sales orders, deliveries, customer invoices, receipts, dunning |
| Source-to-Contract | Spend analysis, sourcing projects, supplier qualification, RFx, bids, contracts, scorecards |
| Hire-to-Retire | Payroll, tax/deduction calculations, time & attendance, expense reports, benefit enrollment |
| Manufacturing | Production orders, BOM explosion, routing, WIP costing, quality inspections, cycle counts |
| Financial Reporting | Balance sheet, income statement, cash flow, changes in equity, KPIs, budget variance |
| Tax Accounting | Multi-jurisdiction, VAT/GST returns, ASC 740/IAS 12 provisions, FIN 48, withholding |
| Treasury | Cash positioning, forecasts, cash pooling, hedging (ASC 815/IFRS 9), debt covenants, netting |
| Project Accounting | WBS hierarchies, cost lines, PoC revenue, earned value (SPI/CPI/EAC), change orders |
| ESG / Sustainability | GHG Scope 1/2/3, energy/water/waste, diversity, safety, GRI/SASB/TCFD disclosures |
| Intercompany | IC matching, transfer pricing, consolidation eliminations, currency translation |
| Subledgers | AR, AP, Fixed Assets, Inventory with GL reconciliation |
| Period Close | Monthly close engine, depreciation, accruals, year-end closing entries |
| Banking / KYC / AML | Customer personas, KYC profiles, AML typologies (structuring, layering, mule, funnel) |
| Audit | ISA-compliant engagements, workpapers, evidence, risk assessments, findings |
| Sales | Quote-to-order pipeline with win rate modeling |
| Bank Reconciliation | Statement matching, outstanding checks, deposits in transit |
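
The three-way match listed under Procure-to-Pay compares a purchase order, a goods receipt, and a vendor invoice before payment is released. The sketch below shows one common form of that rule. The `Line` type, the `three_way_match` function, and the 2% price tolerance are hypothetical and do not describe DataSynth's internal data model.

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass(frozen=True)
class Line:
    quantity: Decimal
    unit_price: Decimal

def three_way_match(po, received_qty, invoice, price_tolerance=Decimal("0.02")):
    """Return True when the invoice agrees with the PO and the goods receipt.

    Quantity rule: invoiced <= received <= ordered.
    Price rule: invoice unit price within `price_tolerance` (relative) of the PO price.
    """
    qty_ok = invoice.quantity <= received_qty <= po.quantity
    price_delta = abs(invoice.unit_price - po.unit_price)
    price_ok = price_delta <= po.unit_price * price_tolerance
    return qty_ok and price_ok

po = Line(quantity=Decimal("100"), unit_price=Decimal("12.50"))
good_invoice = Line(quantity=Decimal("100"), unit_price=Decimal("12.60"))
over_billed = Line(quantity=Decimal("120"), unit_price=Decimal("12.50"))

print(three_way_match(po, Decimal("100"), good_invoice))
print(three_way_match(po, Decimal("100"), over_billed))
```

A failed match is the kind of condition that a synthetic dataset can label as an anomaly for downstream testing.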

Accounting & Audit Standards

  • Accounting frameworks: US GAAP, IFRS, French GAAP (PCG), German GAAP (HGB/SKR04), dual reporting
  • Revenue recognition: ASC 606 / IFRS 15 with performance obligations and SSP allocation
  • Leases: ASC 842 / IFRS 16 with ROU assets and lease liabilities
  • Fair value: ASC 820 / IFRS 13 Level 1/2/3 hierarchy
  • Impairment: ASC 360 / IAS 36 testing with fair value estimation
  • Audit standards: ISA (34 standards), PCAOB (19+ standards), SOX 302/404 compliance
  • COSO 2013: 5 components, 17 principles, maturity levels
  • Localized exports: FEC (French) and GoBD (German) audit file formats

Fraud, Anomalies & Data Quality

  • ACFE-aligned fraud taxonomy: Asset misappropriation, corruption, financial statement fraud
  • 60+ anomaly types with full labeling for supervised ML
  • Collusion modeling: 9 ring types with role-based conspirators and escalation dynamics
  • Management override: Fraud triangle modeling (pressure, opportunity, rationalization)
  • Red flag generation: 40+ probabilistic fraud indicators with Bayesian calibration
  • Industry-specific patterns: Manufacturing yield manipulation, retail sweethearting, healthcare upcoding
  • Data quality variations: Missing values (MCAR/MAR/MNAR), format variations, typos, duplicates
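
As an illustration of labeled data quality variation, the sketch below blanks a field completely at random (MCAR) and records a label for every change so the result can serve as supervised ground truth. The function name `inject_mcar`, the label fields, and the sample records are assumptions made for this example, not DataSynth's label schema.

```python
import random

def inject_mcar(records, field, rate, seed):
    """Set `field` to None with probability `rate`, independently of the data (MCAR).

    Returns (modified_copies, labels); the input records are left untouched.
    """
    rng = random.Random(seed)
    modified, labels = [], []
    for row_index, record in enumerate(records):
        copy = dict(record)
        if rng.random() < rate:
            copy[field] = None
            labels.append({"row": row_index, "field": field, "issue": "missing_mcar"})
        modified.append(copy)
    return modified, labels

# Hypothetical vendor records.
records = [{"vendor_id": f"V{i:04d}", "tax_id": f"T{i:06d}"} for i in range(1000)]
dirty, labels = inject_mcar(records, field="tax_id", rate=0.05, seed=7)
print(len(labels), "values blanked")
```

MAR and MNAR variants would make the blanking probability depend on other fields or on the blanked value itself, rather than on a constant rate.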

Machine Learning & Graph Export

  • Graph formats: PyTorch Geometric, Neo4j, DGL, RustGraph JSON
  • Multi-layer hypergraph: 3-layer (Governance, Process Events, Accounting Network)
  • Train/val/test splits with configurable partitioning
  • Anomaly, fraud, quality, and drift labels in standardized format
  • Evaluation framework: Auto-tuning with quality gate enforcement
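
A configurable train/val/test partition can be sketched as a seeded shuffle followed by slicing. This is an illustrative stand-in, not the export code itself: the function name `split_ids`, the default 70/15/15 ratios, and the node identifiers are assumptions.

```python
import random

def split_ids(ids, ratios=(0.70, 0.15, 0.15), seed=42):
    """Partition ids into train/val/test lists using a seeded shuffle."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * ratios[0])
    n_val = round(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, so nothing is dropped
    return train, val, test

# Hypothetical node identifiers.
node_ids = [f"JE-{i:05d}" for i in range(100)]
train, val, test = split_ids(node_ids)
print(len(train), len(val), len(test))
```

Assigning the remainder to the test set guarantees that every identifier lands in exactly one partition even when the ratios do not divide the input evenly.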

Advanced Generation

| Capability | Description |
| --- | --- |
| LLM enrichment | Pluggable providers (mock/OpenAI-compatible) for vendor names, descriptions, anomaly explanations |
| Diffusion models | Statistical diffusion with Langevin reverse process and hybrid blending |
| Causal models | Structural causal models with do-calculus interventions and counterfactual generation |
| Natural language config | Generate YAML configurations from plain English |
| Scenario engine | Built-in fraud packs: revenue_fraud, payroll_ghost, vendor_kickback, management_override |
| Process mining | OCEL 2.0 + XES 2.0 with 101+ activity types across 12 process families |
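
The idea behind a structural causal model with interventions can be shown with a toy example. In the sketch below, a discount normally depends on order volume; the intervention do(discount = x) replaces that structural equation with a constant while the exogenous noise stays the same, so each simulated unit can be compared with its own counterfactual. The variables, coefficients, and the `simulate` function are invented for illustration and do not reflect DataSynth's causal module.

```python
import random

def simulate(n, seed, do_discount=None):
    """Simulate revenue from a toy SCM: volume -> discount -> revenue.

    When `do_discount` is given, the discount equation is overridden, which
    removes the volume -> discount edge (a do-intervention).
    """
    rng = random.Random(seed)
    revenues = []
    for _ in range(n):
        volume = rng.gauss(100.0, 15.0)  # exogenous demand
        noise = rng.gauss(0.0, 20.0)     # exogenous revenue noise
        discount = 0.10 if volume > 110.0 else 0.0  # structural equation
        if do_discount is not None:
            discount = do_discount       # intervention replaces the equation
        revenues.append(volume * 50.0 * (1.0 - discount) + noise)
    return revenues

observed = simulate(1000, seed=7)
no_discount = simulate(1000, seed=7, do_discount=0.0)
half_price = simulate(1000, seed=7, do_discount=0.5)
print(sum(observed) / 1000, sum(no_discount) / 1000, sum(half_price) / 1000)
```

Reusing the seed keeps the exogenous draws identical across runs, so the difference between `observed` and `no_discount` is attributable to the intervention alone.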

Production Features

  • REST / gRPC / WebSocket APIs with streaming and backpressure handling
  • Authentication: API key (Argon2id), JWT/OIDC (RS256), RBAC (Admin/Operator/Viewer)
  • Resource guards: Memory, disk, CPU monitoring with graceful degradation
  • Deterministic generation: Seeded ChaCha8 RNG for reproducible output
  • Desktop UI: Cross-platform Tauri/SvelteKit with 40+ configuration pages
  • Python SDK: Programmatic access with blueprints and DataFrame loading
  • Docker & Kubernetes: Distroless containers, Helm chart with HPA/PDB
  • Observability: OpenTelemetry traces, Prometheus metrics, structured JSON logging
  • Data lineage: Per-file checksums, lineage graph, W3C PROV-JSON export
  • Privacy-preserving fingerprinting: Differential privacy, k-anonymity, federated extraction
  • Ecosystem integrations: Apache Airflow, dbt, MLflow, Apache Spark
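
Deterministic generation and per-file checksums combine naturally: if the same seed always yields the same rows, the checksum of the output becomes a reproducibility test. The sketch below demonstrates that property with Python's standard library. Note that `random.Random` is a Mersenne Twister and only stands in for the seeded ChaCha8 RNG named above; the function names and the log-normal parameters are likewise assumptions.

```python
import hashlib
import random

def generate_amounts(seed, n):
    """Generate n amounts from a seeded RNG (stand-in for a seeded ChaCha8 stream)."""
    rng = random.Random(seed)
    return [round(rng.lognormvariate(6.0, 1.2), 2) for _ in range(n)]

def checksum(amounts):
    """SHA-256 over a canonical text rendering of the rows."""
    payload = "\n".join(f"{a:.2f}" for a in amounts).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

first_run = checksum(generate_amounts(seed=42, n=1000))
second_run = checksum(generate_amounts(seed=42, n=1000))
other_seed = checksum(generate_amounts(seed=43, n=1000))
print(first_run == second_run, first_run == other_seed)
```

Hashing a canonical rendering (fixed decimal places, fixed separator) rather than in-memory objects keeps the checksum stable across platforms.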

Quick Start

# Install from source
git clone https://github.com/mivertowski/SyntheticData.git
cd SyntheticData
cargo build --release

# Demo mode
./target/release/datasynth-data generate --demo --output ./output

# Custom configuration
./target/release/datasynth-data init --industry manufacturing --complexity medium -o config.yaml
./target/release/datasynth-data generate --config config.yaml --output ./output

Performance

| Metric | Value |
| --- | --- |
| Single-threaded throughput | 200,000+ journal entries/second |
| Parallel scaling | Linear with available CPU cores |
| Memory model | Streaming generation with configurable backpressure |
| Determinism | Fully reproducible via seeded ChaCha8 RNG |
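
Streaming with backpressure means the generator is slowed down whenever the output sink cannot keep up, so memory use stays bounded. The sketch below models this with a bounded queue between a producer thread and a consumer. It is a conceptual illustration only: the function `run_pipeline` and the buffer size are hypothetical, and DataSynth's Rust runtime is not implemented this way as far as this page states.

```python
import queue
import threading

def run_pipeline(n_items, max_buffered):
    """Stream n_items through a bounded buffer; the producer blocks when it is full."""
    buffer = queue.Queue(maxsize=max_buffered)
    written = []

    def producer():
        for i in range(n_items):
            buffer.put(i)    # blocks while the buffer is full (backpressure)
        buffer.put(None)     # sentinel: no more items

    thread = threading.Thread(target=producer)
    thread.start()
    while True:
        item = buffer.get()
        if item is None:
            break
        written.append(item)  # stand-in for writing a row to an output sink
    thread.join()
    return written

rows = run_pipeline(n_items=50, max_buffered=4)
print(len(rows), "rows written")
```

At most `max_buffered` items are held in memory at once, regardless of how many items are generated in total.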

Architecture

DataSynth is organized as a Rust workspace with 15 modular crates:

datasynth-cli            CLI binary (generate, validate, init, info, fingerprint, scenario)
datasynth-server         REST / gRPC / WebSocket server with auth and rate limiting
datasynth-ui             Tauri + SvelteKit desktop application
                │
datasynth-runtime        Generation orchestrator (parallel execution, resource guards, streaming)
                │
datasynth-generators     50+ data generators across all process families
datasynth-banking        KYC / AML banking transaction generator
datasynth-ocpm           OCEL 2.0 / XES 2.0 process mining
datasynth-fingerprint    Privacy-preserving fingerprint extraction and synthesis
datasynth-standards      Accounting and audit standards (IFRS, US GAAP, ISA, SOX, PCAOB)
                │
datasynth-graph          Graph export (PyTorch Geometric, Neo4j, DGL, RustGraph, Hypergraph)
datasynth-eval           Statistical evaluation, quality gates, auto-tuning
                │
datasynth-config         Configuration schema, validation, industry presets
                │
datasynth-core           Domain models, traits, distributions, resource guards
                │
datasynth-output         Output sinks (CSV, JSON, NDJSON, Parquet + Zstd) with streaming
datasynth-test-utils     Test utilities, fixtures, mocks

License

Copyright 2024-2026 Michael Ivertowski

Licensed under the Apache License, Version 2.0. See LICENSE for details.

Support

Commercial support, custom development, and enterprise licensing are available upon request. To inquire, open an issue on GitHub.


DataSynth is provided “as is” without warranty of any kind. It is intended for testing, development, and research purposes. Generated data should not be used as a substitute for real financial records.