# Architecture Overview

Status: ✅ Production Ready | Version: v0.4.1-rc2 | Last Updated: November 2025
DotCompute follows a layered architecture designed for extensibility, performance, and maintainability. This document provides a high-level overview of the system's design and key architectural decisions.
## System Layers

```mermaid
graph TB
    A[Application Layer<br/>Kernel attributes & IComputeOrchestrator]
    B[Source Generators & Analyzers<br/>Compile-time code generation & validation]
    C[Core Runtime & Orchestration<br/>Execution, debugging, optimization, telemetry]
    D[Backend Implementations<br/>CPU, CUDA, Metal, OpenCL]
    E[Memory Management<br/>Unified buffers, pooling, transfers, P2P]

    A --> B
    B --> C
    C --> D
    D --> E

    style A fill:#e1f5fe
    style B fill:#fff9c4
    style C fill:#c8e6c9
    style D fill:#ffccbc
    style E fill:#f8bbd0
```
## Core Architectural Principles

### 1. Separation of Concerns
Each layer has distinct responsibilities:
- Application: Business logic and kernel definitions
- Generators: Compile-time code generation and validation
- Core Runtime: Execution orchestration and cross-cutting concerns
- Backends: Device-specific implementations
- Memory: Unified memory abstraction
### 2. Backend Independence
The application layer is isolated from backend specifics through:
- `IAccelerator` interface for all backends
- `IComputeOrchestrator` for unified kernel execution
- `IUnifiedMemoryManager` for memory operations
- Automatic backend selection based on workload characteristics
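A minimal sketch of what this isolation looks like from the application's side. The `[Kernel]` attribute and `IComputeOrchestrator` come from the descriptions above; the exact `ExecuteKernelAsync` overload, the string-based kernel name, and the kernel-body conventions are assumptions for illustration:

```csharp
using System;
using System.Threading.Tasks;

public static class VectorKernels
{
    // The source generators emit backend-specific implementations
    // (CPU SIMD, CUDA, Metal, OpenCL) for this method.
    [Kernel]
    public static void Add(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> result)
    {
        for (int i = 0; i < result.Length; i++)
            result[i] = a[i] + b[i];
    }
}

public sealed class AddService
{
    private readonly IComputeOrchestrator _orchestrator;

    public AddService(IComputeOrchestrator orchestrator) => _orchestrator = orchestrator;

    // The caller never names a backend; selection happens inside the orchestrator.
    public Task RunAsync(float[] a, float[] b, float[] result) =>
        _orchestrator.ExecuteKernelAsync("VectorKernels.Add", a, b, result);
}
```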
### 3. Performance by Design
Performance is baked into the architecture:
- Compile-time code generation: Zero-overhead abstractions
- Memory pooling: 90% reduction in allocations
- Native AOT support: Sub-10ms startup times
- Async-first: Non-blocking operations throughout
### 4. Extensibility
The system is designed for extension:
- Plugin architecture: Hot-reload capable backend plugins
- Source generators: Custom code generation pipelines
- Analyzers: Custom validation rules
- Optimization strategies: Pluggable optimization algorithms
### 5. Observability
Built-in observability from the ground up:
- OpenTelemetry integration: Distributed tracing and metrics
- Debug services: Cross-backend validation
- Telemetry providers: Performance profiling
- Health monitoring: Plugin and service health checks
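Because tracing and metrics flow through standard OpenTelemetry, hooking up an exporter is ordinary SDK wiring. A minimal sketch, assuming `"DotCompute"` as the activity-source name (the real name may differ):

```csharp
using OpenTelemetry;
using OpenTelemetry.Trace;

// Export DotCompute traces to the console; swap in an OTLP
// exporter for production. The source name is an assumption.
using var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddSource("DotCompute")
    .AddConsoleExporter()
    .Build();
```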
## Key Components

### Application Layer
Purpose: Define compute kernels and orchestrate execution
Key Types:
- `[Kernel]` attribute: Marks methods for GPU acceleration
- `IComputeOrchestrator`: High-level execution interface
- `KernelDefinition`: Metadata for kernel methods
Responsibilities:
- Kernel declaration with attributes
- Service configuration and DI setup
- Result materialization and processing
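Service configuration follows the usual `Microsoft.Extensions.DependencyInjection` shape. A sketch; the `AddDotCompute()` extension-method name below is hypothetical, shown only to indicate where registration happens:

```csharp
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();
services.AddDotCompute(); // hypothetical registration extension

await using var provider = services.BuildServiceProvider();
var orchestrator = provider.GetRequiredService<IComputeOrchestrator>();
```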
### Source Generator Layer
Purpose: Compile-time code generation and validation
Key Components:
- `KernelSourceGenerator`: Generates backend-specific implementations
- `KernelMethodAnalyzer`: Validates kernel code (DC001-DC012)
- `KernelCodeFixProvider`: Automated IDE fixes
Responsibilities:
- Backend code generation (CPU SIMD, CUDA, Metal, OpenCL)
- Kernel registry generation for runtime discovery
- Compile-time validation and diagnostics
- Performance hint injection
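To make the registry step concrete: conceptually, each `[Kernel]` method gets a compile-time entry the runtime discovers at startup, avoiding reflection-based scanning. The shape below is an illustration only, not the generator's actual output:

```csharp
using System.Collections.Generic;

// Illustrative only: the generated code's real names and shapes differ.
public static class GeneratedKernelRegistry
{
    public static readonly IReadOnlyDictionary<string, string[]> Kernels =
        new Dictionary<string, string[]>
        {
            // kernel name -> backends with generated implementations
            ["VectorKernels.Add"] = new[] { "CPU", "CUDA", "Metal", "OpenCL" },
        };
}
```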
### Core Runtime Layer
Purpose: Orchestration, debugging, optimization, and telemetry
Key Components:
- `KernelExecutionService`: Coordinates kernel execution
- `KernelDebugService`: Cross-backend validation
- `AdaptiveBackendSelector`: ML-powered backend selection
- `TelemetryProvider`: OpenTelemetry integration
- `AcceleratorManager`: Backend lifecycle management
Responsibilities:
- Kernel discovery and registration
- Execution orchestration across backends
- Cross-backend debugging and validation
- Performance profiling and metrics collection
- Fault tolerance and error recovery
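As a sketch of how cross-backend validation might be invoked from user code (the `KernelDebugService` method name and report shape below are assumptions, not the actual API):

```csharp
using System.Threading.Tasks;

public static class ValidationExample
{
    // Hypothetical API shape: run the same kernel on two backends
    // and compare the outputs.
    public static async Task<bool> ValidateAsync(KernelDebugService debug)
    {
        var report = await debug.CompareBackendsAsync(
            kernelName: "VectorKernels.Add",
            backends: new[] { "CPU", "CUDA" });
        return report.ResultsMatch;
    }
}
```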
### Backend Layer
Purpose: Device-specific compute implementations
Implementations:
- CPU Backend: SIMD vectorization (AVX512/AVX2/NEON)
- CUDA Backend: NVIDIA GPU support (CC 5.0+)
- Metal Backend: Apple Silicon optimization
- OpenCL Backend: Cross-platform GPU support
Common Interface (`IAccelerator`):
- `CompileKernelAsync()` - Compile kernel for a specific backend
- `AllocateAsync()` - Allocate device memory
- `SynchronizeAsync()` - Wait for completion
- `DisposeAsync()` - Clean up resources
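Put together, the contract implied by these members looks roughly like the following sketch. The exact parameter lists and the `ICompiledKernel`/`IUnifiedMemoryBuffer<T>` return types are assumptions:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Sketch of the backend contract; details are illustrative.
public interface IAccelerator : IAsyncDisposable
{
    ValueTask<ICompiledKernel> CompileKernelAsync(
        KernelDefinition definition,
        CancellationToken cancellationToken = default);

    ValueTask<IUnifiedMemoryBuffer<T>> AllocateAsync<T>(
        int length,
        CancellationToken cancellationToken = default) where T : unmanaged;

    ValueTask SynchronizeAsync(CancellationToken cancellationToken = default);
}
```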
### Memory Management Layer
Purpose: Unified memory abstraction with performance optimization
Key Components:
- `UnifiedMemoryManager`: Central memory authority
- `OptimizedUnifiedBuffer<T>`: Performance-optimized buffers
- `MemoryPool`: Buffer pooling with 21 size classes
- `AdvancedMemoryTransferEngine`: Concurrent transfer orchestration
Responsibilities:
- Cross-device memory allocation
- Host-device data transfers
- Memory pooling and reuse
- Zero-copy operations via `Span<T>`
- P2P transfers between GPUs
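A sketch of the intended usage, assuming hypothetical `AllocateAsync`, `AsSpan`, and `CopyToDeviceAsync` members on the manager and buffer:

```csharp
using System;
using System.Threading.Tasks;

public static class TransferExample
{
    public static async Task UploadAsync(IUnifiedMemoryManager memory, float[] input)
    {
        // Pooled allocation: returns a reused buffer when one is available.
        await using var buffer = await memory.AllocateAsync<float>(input.Length);

        input.AsSpan().CopyTo(buffer.AsSpan()); // zero-copy host-side write
        await buffer.CopyToDeviceAsync();       // asynchronous, pipelined upload
    }
}
```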
## Data Flow

### Kernel Execution Flow
```mermaid
sequenceDiagram
    participant App as Application
    participant Gen as Source Generator
    participant Orch as Orchestrator
    participant Sel as Backend Selector
    participant Mem as Memory Manager
    participant Back as Backend (GPU/CPU)
    participant Debug as Debug Service
    participant Tel as Telemetry

    App->>Gen: 1. Define [Kernel] method
    Gen->>Orch: 2. Generate implementations
    Orch->>Orch: 3. Discover & register kernels
    App->>Orch: 4. ExecuteKernelAsync()
    Orch->>Sel: 5. Select optimal backend
    Sel-->>Orch: CUDA/CPU/Metal/OpenCL
    Orch->>Mem: 6. Allocate/transfer buffers
    Mem-->>Orch: UnifiedBuffer ready
    Orch->>Back: 7. Compile & execute kernel
    Back-->>Orch: Execution complete
    Orch->>Debug: 8. Validate results (optional)
    Debug-->>Orch: Validation passed
    Orch->>Tel: 9. Record metrics
    Orch->>Mem: 10. Transfer results back
    Orch-->>App: Return results
```
### Memory Transfer Flow

```text
1. Application requests buffer allocation
        ↓
2. Memory manager checks the pool for an available buffer
        ↓
3. If available: return pooled buffer (fast path)
   If not: allocate a new buffer
        ↓
4. Application writes data to the buffer
        ↓
5. Buffer transfers to the device (async, pipelined)
        ↓
6. Kernel executes on the device buffer
        ↓
7. Results transfer back to the host
        ↓
8. Application reads results
        ↓
9. Buffer is returned to the pool for reuse
```
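The fast path in step 3 is what the 21 size classes buy: a request is rounded up to its size class and served from a per-class free list when possible. A minimal, self-contained sketch of the idea (a toy, not DotCompute's actual `MemoryPool`):

```csharp
using System;
using System.Collections.Concurrent;
using System.Numerics;

// Toy size-classed pool: class k holds buffers of length 2^k.
public sealed class SimpleBufferPool
{
    private readonly ConcurrentBag<float[]>[] _classes;

    public SimpleBufferPool(int sizeClasses = 21)
    {
        _classes = new ConcurrentBag<float[]>[sizeClasses];
        for (int i = 0; i < _classes.Length; i++)
            _classes[i] = new ConcurrentBag<float[]>();
    }

    public float[] Rent(int length)
    {
        int cls = SizeClass(length);
        if (cls >= _classes.Length)
            return new float[length]; // too large to pool: allocate directly

        // Fast path: reuse a pooled buffer instead of allocating.
        return _classes[cls].TryTake(out var buffer) ? buffer : new float[1 << cls];
    }

    public void Return(float[] buffer)
    {
        int cls = SizeClass(buffer.Length);
        if (cls < _classes.Length)
            _classes[cls].Add(buffer);
    }

    // Smallest k with 2^k >= length.
    private static int SizeClass(int length) =>
        32 - BitOperations.LeadingZeroCount((uint)Math.Max(1, length - 1));
}
```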
## Design Patterns

### Factory Pattern
Used for backend creation:
- `IAcceleratorFactory` - Create accelerators based on configuration
- `IUnifiedAcceleratorFactory` - Workload-aware accelerator selection
### Strategy Pattern
Used for optimization and backend selection:
- `IOptimizationStrategy` - Different optimization approaches
- `IBackendSelectionStrategy` - Backend selection algorithms
### Observer Pattern
Used for telemetry and monitoring:
- `ITelemetryProvider` - Telemetry event publishing
- `IKernelExecutionMonitor` - Execution monitoring
### Plugin Pattern
Used for extensibility:
- `IBackendPlugin` - Backend plugin interface
- `PluginLoader` - Dynamic plugin loading with isolation
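Isolation and hot reload follow the standard `AssemblyLoadContext` pattern. A minimal sketch of that pattern; the library's `PluginLoader` layers the permissioning and health checks described elsewhere in this document on top of this idea:

```csharp
using System.Reflection;
using System.Runtime.Loader;

// Collectible contexts can be unloaded, which is what enables hot reload.
public sealed class PluginLoadContext : AssemblyLoadContext
{
    private readonly AssemblyDependencyResolver _resolver;

    public PluginLoadContext(string pluginPath)
        : base(name: pluginPath, isCollectible: true)
    {
        _resolver = new AssemblyDependencyResolver(pluginPath);
    }

    protected override Assembly? Load(AssemblyName assemblyName)
    {
        // Resolve plugin dependencies from the plugin's own directory;
        // returning null falls back to the default load context.
        string? path = _resolver.ResolveAssemblyToPath(assemblyName);
        return path is null ? null : LoadFromAssemblyPath(path);
    }
}
```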
### Dependency Injection
Used throughout for loose coupling:
- Microsoft.Extensions.DependencyInjection integration
- Service lifetimes (Singleton, Scoped, Transient)
- Configuration via `IOptions<T>`
## Cross-Cutting Concerns

### Error Handling
Comprehensive error handling strategy:
- Compile-time: Analyzer diagnostics (DC001-DC012)
- Runtime: Typed exceptions (ComputeException, CompilationException)
- Recovery: Automatic retry with exponential backoff
- Fallback: CPU fallback for GPU failures
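The recovery and fallback steps compose in the obvious way. A sketch of the policy, using the `ComputeException` type named above; the retry counts, delays, and helper shape are illustrative:

```csharp
using System;
using System.Threading.Tasks;

public static class ResilientExecution
{
    // Retry transient GPU failures with exponential backoff,
    // then fall back to a CPU implementation.
    public static async Task<T> ExecuteWithFallbackAsync<T>(
        Func<Task<T>> gpuExecution,
        Func<Task<T>> cpuFallback,
        int maxRetries = 3)
    {
        for (int attempt = 0; attempt < maxRetries; attempt++)
        {
            try
            {
                return await gpuExecution();
            }
            catch (ComputeException)
            {
                if (attempt == maxRetries - 1)
                    break; // retries exhausted: fall back below

                // Exponential backoff: 100 ms, 200 ms, 400 ms, ...
                await Task.Delay(TimeSpan.FromMilliseconds(100 << attempt));
            }
        }

        return await cpuFallback();
    }
}
```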
### Logging
Structured logging throughout:
- Microsoft.Extensions.Logging integration
- Contextual logging with scope
- Minimum overhead in production
- Trace-level for development debugging
### Configuration
Flexible configuration system:
- `IConfiguration` integration
- `IOptions<T>` pattern
- Validation via `IValidateOptions<T>`
- Environment-specific overrides
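For instance, an options type can be bound from configuration and guarded with `IValidateOptions<T>`. The `ComputeOptions` type and its properties below are hypothetical; the options APIs are the standard Microsoft.Extensions.Options ones:

```csharp
using Microsoft.Extensions.Options;

// Hypothetical options type; DotCompute's real option names may differ.
public sealed class ComputeOptions
{
    public string PreferredBackend { get; set; } = "CPU";
    public int MaxConcurrentKernels { get; set; } = 4;
}

public sealed class ComputeOptionsValidator : IValidateOptions<ComputeOptions>
{
    public ValidateOptionsResult Validate(string? name, ComputeOptions options) =>
        options.MaxConcurrentKernels > 0
            ? ValidateOptionsResult.Success
            : ValidateOptionsResult.Fail("MaxConcurrentKernels must be positive.");
}

// Registration (inside ConfigureServices):
//   services.Configure<ComputeOptions>(config.GetSection("DotCompute"));
//   services.AddSingleton<IValidateOptions<ComputeOptions>, ComputeOptionsValidator>();
```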
### Thread Safety
Explicit thread safety guarantees:
- Thread-safe: Memory management, pooling, caches
- Single-threaded: Individual kernel executions
- Documentation: Thread safety documented per type
## Performance Characteristics

### Overhead Analysis
| Component | Overhead | Optimization |
|---|---|---|
| Orchestration | < 50 μs | Direct method calls, no reflection |
| Memory pooling | < 1 μs | Lock-free concurrent structures |
| Telemetry | < 1% | Sampling and async collection |
| Debugging (Dev) | 2-5x | Extensive validation |
| Debugging (Prod) | < 5% | Targeted checks only |
| Backend selection | < 10 μs | Cached decisions, ML inference |
### Scalability

- Concurrent kernels: no runtime-imposed limit (bounded only by backend capacity)
- Memory buffers: Millions with pooling
- Plugin count: Hundreds
- Pipeline depth: No practical limit
## Technology Stack

### Build-Time
- Roslyn: Source generators and analyzers
- C# 13: Latest language features
- .NET 9.0: Target framework
- MSBuild: Build integration
### Runtime

- Microsoft.Extensions.*: Integration with the .NET ecosystem
- OpenTelemetry: Observability infrastructure
- System.Numerics.Vectors: SIMD support
- System.Memory: `Span<T>` and `Memory<T>`
### Testing
- xUnit: Unit testing framework
- BenchmarkDotNet: Performance benchmarking
- Moq: Mocking framework
- FluentAssertions: Assertion library
## Native AOT Compatibility
DotCompute is fully compatible with Native AOT:
What Works:
- All core runtime functionality
- All backend implementations
- Memory management
- Source generators
- Telemetry and monitoring
Requirements:
- No runtime code generation
- No reflection in hot paths
- Trimming-safe attribute usage
- AOT-analyzer verified
Benefits:
- Sub-10ms startup times
- Smaller deployment size
- Better performance
- Reduced memory usage
## Security Considerations

### Plugin Sandboxing
- Isolated AssemblyLoadContexts
- Permission management
- Resource limits (CPU, memory)
- Signature validation
### Input Validation
- Compile-time parameter validation
- Runtime bounds checking
- Type safety enforcement
- Resource limit enforcement
### Vulnerability Management
- NuGet package scanning
- CVE database integration
- GitHub advisory monitoring
- Automated security updates
## Future Enhancements

### Short-term (v0.3.0)
- Complete Metal MSL translation
- Enhanced LINQ provider (expression compilation)
- Additional algorithm implementations
- Improved OpenCL integration
### Medium-term (v0.4.0)
- ROCm backend for AMD GPUs
- DirectX Compute backend
- Multi-GPU load balancing
- Advanced kernel fusion
### Long-term (v1.0.0)
- Distributed computing support
- Cloud provider integration (Azure, AWS, GCP)
- Auto-tuning with persistent learning
- Visual debugging tools
## Related Documentation
- Core Orchestration - Detailed orchestration design
- Backend Integration - Backend plugin architecture
- Memory Management - Memory system design
- Debugging System - Cross-backend validation
- Optimization Engine - Adaptive backend selection
- Source Generators - Code generation pipeline