Core Orchestration Architecture
Status: Production Ready | Test Coverage: 91.9% | Last Updated: November 2025
The Core Orchestration system is the heart of DotCompute, responsible for coordinating kernel execution across multiple backends with debugging, optimization, and telemetry capabilities.
System Components
graph TD
A[Application] --> B[IComputeOrchestrator<br/>High-level API]
B --> C[KernelExecutionService<br/>Orchestration]
C --> D[GeneratedKernelDiscoveryService<br/>Kernel registration]
C --> E[IAcceleratorManager<br/>Backend selection]
C --> F[KernelDebugService<br/>Validation - optional]
C --> G[AdaptiveBackendSelector<br/>Optimization - optional]
C --> H[TelemetryProvider<br/>Metrics - optional]
C --> I[RecoveryService<br/>Fault tolerance - optional]
C --> J[IAccelerator<br/>Backend execution]
style A fill:#e1f5fe
style B fill:#c8e6c9
style C fill:#fff9c4
style D fill:#f8bbd0
style E fill:#d1c4e9
style F fill:#ffccbc
style G fill:#c5e1a5
style H fill:#b3e5fc
style I fill:#ffab91
style J fill:#ce93d8
IComputeOrchestrator Interface
The primary interface for kernel execution:
public interface IComputeOrchestrator
{
    /// <summary>
    /// Executes a kernel with automatic backend selection and type inference
    /// </summary>
    Task<TResult> ExecuteKernelAsync<TResult>(
        string kernelName,
        object parameters,
        CancellationToken cancellationToken = default);

    /// <summary>
    /// Executes a kernel with explicit input/output types
    /// </summary>
    Task<TOutput> ExecuteKernelAsync<TInput, TOutput>(
        string kernelName,
        TInput parameters,
        CancellationToken cancellationToken = default);
}
Design Rationale:
- Simple API: Hides complexity of backend selection and execution
- Type-safe: Generic type parameters ensure compile-time safety
- Async-first: Non-blocking operations for scalability
- Cancellation: Proper cancellation token support
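A minimal call site for the object-parameter overload, using the "VectorAdd" kernel name from the discovery example later in this document (the anonymous-object parameter shape is illustrative and must match the kernel's declared parameters):

```csharp
// Resolve the orchestrator from DI (registration shown in Configuration Options).
var orchestrator = serviceProvider.GetRequiredService<IComputeOrchestrator>();

var a = new float[] { 1f, 2f, 3f };
var b = new float[] { 4f, 5f, 6f };

// Backend selection, memory transfer, and compilation all happen inside the call.
float[] result = await orchestrator.ExecuteKernelAsync<float[]>(
    "VectorAdd",
    new { a, b });
```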
Kernel Execution Service
The main orchestration implementation:
Responsibilities
Kernel Discovery
- Discovers kernels generated by source generators
- Maintains kernel registry with metadata
- Supports runtime kernel registration
Backend Selection
- Automatic selection based on workload characteristics
- Manual override via configuration
- Fallback to CPU for GPU failures
Execution Coordination
- Parameter binding and validation
- Memory allocation and transfer
- Kernel compilation (if needed)
- Asynchronous execution
- Result materialization
Optional Services
- Debug validation (cross-backend comparison)
- Performance profiling (telemetry collection)
- Error recovery (retry and fallback)
Execution Flow
flowchart TD
Start([ExecuteKernelAsync called]) --> A[1. Discover kernel metadata]
A --> B[2. Select optimal backend<br/>CPU/CUDA/Metal/OpenCL]
B --> C[3. Allocate/get buffers<br/>from memory pool]
C --> D[4. Bind parameters]
D --> E{GPU backend?}
E -->|Yes| F[5. Transfer data<br/>to device]
E -->|No| G[5. Skip transfer<br/>zero-copy CPU]
F --> H{Kernel<br/>cached?}
G --> H
H -->|No| I[6. Compile kernel]
H -->|Yes| J[6. Use cached kernel]
I --> K[7. Execute kernel<br/>on backend]
J --> K
K --> L{Debug<br/>enabled?}
L -->|Yes| M[8. Cross-backend<br/>validation]
L -->|No| N[8. Skip validation]
M --> O{Telemetry<br/>enabled?}
N --> O
O -->|Yes| P[9. Collect metrics]
O -->|No| Q[9. Skip metrics]
P --> R{GPU backend?}
Q --> R
R -->|Yes| S[10. Transfer results<br/>back to host]
R -->|No| T[10. Skip transfer<br/>already on host]
S --> U[11. Materialize results]
T --> U
U --> End([Return results])
style Start fill:#c8e6c9
style End fill:#c8e6c9
style E fill:#fff9c4
style H fill:#fff9c4
style L fill:#fff9c4
style O fill:#fff9c4
style R fill:#fff9c4
Performance Optimization
The orchestration layer is designed for minimal overhead:
Fast Path (typical execution):
- Kernel registry lookup: O(1) hash table lookup
- Backend selection: < 10μs (cached decision)
- Memory allocation: < 1μs (pool hit)
- Orchestration overhead: < 50μs total
Slow Path (first execution):
- Kernel discovery: One-time cost
- Compilation: Cached for subsequent calls
- ML model loading: One-time cost
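The slow-path compilation cost is paid once per kernel/backend pair; the cache can be pictured as a concurrent dictionary keyed on both. The `ICompiledKernel` and `CompileKernelAsync` names below are assumptions for illustration, not DotCompute's actual internals:

```csharp
// Illustrative compilation cache - the real cache type is internal to DotCompute.
private readonly ConcurrentDictionary<(string Name, AcceleratorType Backend), ICompiledKernel>
    _cache = new();

public async ValueTask<ICompiledKernel> GetOrCompileAsync(
    KernelMetadata kernel, IAccelerator accelerator)
{
    var key = (kernel.Name, accelerator.Type);
    if (_cache.TryGetValue(key, out var compiled))
        return compiled; // Fast path: reused on every subsequent call.

    compiled = await accelerator.CompileKernelAsync(kernel); // Slow path: one-time cost.
    _cache[key] = compiled;
    return compiled;
}
```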
Kernel Discovery
Generated Kernel Discovery
Source generators create a GeneratedKernels class:
// Generated by KernelSourceGenerator
public static class GeneratedKernels
{
    public static void Register(IKernelRegistry registry)
    {
        registry.RegisterKernel(new KernelMetadata
        {
            Name = "VectorAdd",
            Namespace = "MyNamespace",
            DeclaringType = "MyClass",
            Parameters = new[]
            {
                new ParameterMetadata { Name = "a", Type = typeof(ReadOnlySpan<float>) },
                new ParameterMetadata { Name = "b", Type = typeof(ReadOnlySpan<float>) },
                new ParameterMetadata { Name = "result", Type = typeof(Span<float>) }
            },
            Backends = KernelBackends.CPU | KernelBackends.CUDA,
            IsParallel = true
        });
    }
}
Discovery Process:
- GeneratedKernelDiscoveryService scans for GeneratedKernels classes
- Calls the Register() method on each
- Builds the kernel registry with O(1) lookup
- Validates metadata consistency
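The scan step can be pictured as a reflection pass over loaded assemblies. This is an illustrative sketch only; the actual service may wire registration through source-generated code rather than reflection:

```csharp
// Illustrative discovery sketch - not the actual implementation.
foreach (var type in assembly.GetTypes())
{
    // A C# static class is abstract and sealed at the IL level.
    if (type.Name == "GeneratedKernels" && type.IsAbstract && type.IsSealed)
    {
        var register = type.GetMethod("Register",
            BindingFlags.Public | BindingFlags.Static);
        register?.Invoke(null, new object[] { registry });
    }
}
```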
Runtime Registration
Kernels can also be registered at runtime:
public class RuntimeKernelRegistration
{
    public void RegisterCustomKernel(IKernelRegistry registry)
    {
        registry.RegisterKernel(new KernelDefinition
        {
            Name = "CustomKernel",
            Source = "/* kernel source */",
            EntryPoint = "custom_kernel",
            Backend = AcceleratorType.CUDA
        });
    }
}
Backend Selection Strategy
Automatic Selection
The orchestrator uses workload characteristics to select the optimal backend:
var characteristics = new WorkloadCharacteristics
{
    DataSize = inputSize,
    ComputeIntensity = ComputeIntensity.High,
    MemoryIntensive = true,
    ParallelismPotential = ParallelismLevel.High
};

var backend = await selector.SelectBackendAsync(characteristics);
Selection Criteria:
- Data Size: Small data may be faster on CPU (no transfer overhead)
- Compute Intensity: Complex math benefits from GPU
- Memory Bandwidth: Memory-bound operations may favor CPU with larger cache
- Parallelism: Highly parallel workloads benefit from GPU
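The criteria above can be sketched as a simple heuristic. The thresholds and decision order here are illustrative only, not the library's actual policy (which may also be ML-driven, as described later):

```csharp
// Illustrative heuristic - thresholds are assumptions, not DotCompute's policy.
static AcceleratorType ChooseBackend(WorkloadCharacteristics w)
{
    // Small payloads rarely amortize the host-to-device transfer cost.
    if (w.DataSize < 100_000)
        return AcceleratorType.CPU;

    // High compute intensity plus high parallelism is the classic GPU case.
    if (w.ComputeIntensity == ComputeIntensity.High &&
        w.ParallelismPotential == ParallelismLevel.High)
        return AcceleratorType.CUDA;

    // Memory-bound, low-parallelism work often favors the CPU cache hierarchy.
    return AcceleratorType.CPU;
}
```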
Manual Override
Users can force specific backends:
services.AddDotComputeRuntime(options =>
{
    options.DefaultAccelerator = AcceleratorType.CUDA;
    options.EnableAutoOptimization = false; // Disable automatic selection
});
Fallback Strategy
If the selected backend fails:
- Retry: Retry with exponential backoff (transient failures)
- Fallback to CPU: Use CPU backend if GPU fails
- Exception: Throw if CPU also fails
Parameter Binding
Type-Safe Binding
The orchestrator binds parameters to kernel arguments:
// Application code
var result = await orchestrator.ExecuteKernelAsync<float[]>(
    "VectorAdd",
    new { a = dataA, b = dataB, length = 1_000_000 }
);

// Orchestrator binds:
// - float[] a → ReadOnlySpan<float> (kernel parameter)
// - float[] b → ReadOnlySpan<float> (kernel parameter)
// - int length → int (scalar parameter)
// - Allocates output buffer for result
Binding Rules:
- Arrays: Convert to Span<T> or ReadOnlySpan<T>
- Scalars: Pass by value
- Buffers: Use existing UnifiedBuffer<T> if provided
- Output: Allocate buffer based on return type
Validation
Parameter validation occurs at multiple levels:
- Compile-time: Source generators validate parameter types
- Orchestration: Runtime validation of sizes and types
- Backend: Device-specific validation (e.g., memory limits)
Memory Coordination
The orchestrator coordinates with the memory manager:
Buffer Lifecycle
// 1. Get or allocate buffers
var bufferA = await memory.AllocateAsync<float>(size); // May come from pool
var bufferB = await memory.AllocateAsync<float>(size);
var bufferResult = await memory.AllocateAsync<float>(size);
// 2. Transfer data to device
await bufferA.CopyFromAsync(dataA);
await bufferB.CopyFromAsync(dataB);
// 3. Execute kernel
await kernel.ExecuteAsync(bufferA, bufferB, bufferResult);
// 4. Transfer results back
await bufferResult.CopyToAsync(results);
// 5. Return buffers to pool
await bufferA.DisposeAsync();
await bufferB.DisposeAsync();
await bufferResult.DisposeAsync();
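The same lifecycle is safer with await-using declarations, which return buffers to the pool even when the kernel throws (this assumes the buffer type implements IAsyncDisposable, as the explicit DisposeAsync calls above suggest):

```csharp
// Buffers return to the pool automatically, even on exceptions.
await using var bufferA = await memory.AllocateAsync<float>(size);
await using var bufferB = await memory.AllocateAsync<float>(size);
await using var bufferResult = await memory.AllocateAsync<float>(size);

await bufferA.CopyFromAsync(dataA);
await bufferB.CopyFromAsync(dataB);

await kernel.ExecuteAsync(bufferA, bufferB, bufferResult);
await bufferResult.CopyToAsync(results);
```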
Memory Optimization
The orchestrator optimizes memory usage:
- Pooling: Reuses buffers from pool (90% allocation reduction)
- Pipelining: Overlaps compute and transfer
- Pinned Memory: Uses pinned memory for faster transfers
- Zero-Copy: Uses Span<T> for CPU execution
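The pipelining point can be illustrated with double buffering: while chunk N executes on the device, chunk N+1 is transferred. The buffer and kernel API names follow the lifecycle example above; the loop itself is an illustrative sketch (per-chunk result copy-back omitted for brevity):

```csharp
// Illustrative double-buffering loop - overlaps transfer of chunk i+1
// with execution of chunk i.
var buffers = new[] { bufferA, bufferB };
Task transfer = buffers[0].CopyFromAsync(chunks[0]);

for (int i = 0; i < chunks.Length; i++)
{
    await transfer; // Wait for chunk i to land on the device.

    // Start transferring the next chunk into the other buffer.
    if (i + 1 < chunks.Length)
        transfer = buffers[(i + 1) % 2].CopyFromAsync(chunks[i + 1]);

    await kernel.ExecuteAsync(buffers[i % 2], bufferResult);
}
```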
Integration with Optional Services
Debug Service Integration
When debug validation is enabled:
services.AddProductionDebugging(options =>
{
    options.EnableCrossBackendValidation = true;
    options.ValidateAllExecutions = false; // Only validate suspicious results
    options.ToleranceThreshold = 1e-5;
});
The orchestrator automatically:
- Executes kernel on selected backend (e.g., GPU)
- Executes same kernel on CPU for comparison
- Compares results within tolerance
- Logs discrepancies
- Throws if validation fails in Development profile
Overhead: 2-5x in Development, < 5% in Production (selective validation)
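The comparison step amounts to an element-wise tolerance check against the CPU reference. A minimal sketch, using the 1e-5 tolerance from the configuration above (the real service presumably also handles NaN/Inf and configurable error modes):

```csharp
// Illustrative cross-backend comparison - not the actual KernelDebugService logic.
static bool ResultsMatch(float[] reference, float[] actual, double tolerance = 1e-5)
{
    if (reference.Length != actual.Length)
        return false;

    for (int i = 0; i < reference.Length; i++)
    {
        // Mixed absolute/relative tolerance, a common cross-backend comparison.
        double diff = Math.Abs(reference[i] - actual[i]);
        double scale = Math.Max(Math.Abs(reference[i]), Math.Abs(actual[i]));
        if (diff > tolerance * Math.Max(1.0, scale))
            return false;
    }
    return true;
}
```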
Optimization Service Integration
When ML-powered optimization is enabled:
services.AddProductionOptimization(options =>
{
    options.OptimizationStrategy = OptimizationStrategy.Aggressive;
    options.EnableMachineLearning = true;
});
The orchestrator:
- Collects execution metrics (time, data size, backend used)
- Feeds metrics to ML model
- ML model learns optimal backend selection
- Future executions use learned policy
Benefit: 10-30% performance improvement after learning period
Telemetry Integration
OpenTelemetry integration for observability:
services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics.AddDotComputeInstrumentation())
    .WithTracing(tracing => tracing.AddDotComputeInstrumentation());
Collected metrics:
- dotcompute.kernel.executions - Execution count
- dotcompute.kernel.duration - Execution duration histogram
- dotcompute.memory.allocated - Memory allocation count
- dotcompute.memory.transferred - Data transfer bandwidth
Overhead: < 1% with sampling
Error Handling and Recovery
Exception Hierarchy
ComputeException (base)
├── CompilationException (kernel compilation failed)
├── DeviceException (device/backend error)
│   ├── OutOfMemoryException
│   └── DeviceNotAvailableException
├── MemoryException (memory operation failed)
└── ExecutionException (kernel execution failed)
Recovery Strategies
try
{
    return await orchestrator.ExecuteKernelAsync(kernel, parameters);
}
catch (DeviceException ex) when (ex.IsTransient)
{
    // Automatic retry with exponential backoff
    await Task.Delay(100);
    return await orchestrator.ExecuteKernelAsync(kernel, parameters);
}
catch (DeviceException)
{
    // Fall back to CPU
    return await orchestrator.ExecuteKernelAsync(
        kernel,
        parameters,
        forceBackend: AcceleratorType.CPU
    );
}
Automatic Recovery:
- Transient failures: Retry 3 times with exponential backoff
- Device failures: Fallback to CPU
- Out-of-memory: Reduce batch size and retry
- Compilation errors: No automatic recovery (user fix required)
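The transient-failure policy above can be sketched as a generic retry helper. The three-attempt count and doubling delays mirror the strategy described; the helper itself is illustrative, not a DotCompute API:

```csharp
// Illustrative retry helper - attempt count and delays follow the strategy above.
static async Task<T> ExecuteWithRetryAsync<T>(
    Func<Task<T>> action, int maxAttempts = 3)
{
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return await action();
        }
        catch (DeviceException ex) when (ex.IsTransient && attempt < maxAttempts)
        {
            // Exponential backoff: 100ms, 200ms, 400ms, ...
            await Task.Delay(100 * (1 << (attempt - 1)));
        }
    }
}
```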
Configuration Options
Runtime Configuration
public class DotComputeRuntimeOptions
{
    /// <summary>Default backend for execution (Auto, CPU, CUDA, Metal, OpenCL)</summary>
    public AcceleratorType DefaultAccelerator { get; set; } = AcceleratorType.Auto;

    /// <summary>Enable telemetry collection</summary>
    public bool EnableTelemetry { get; set; } = true;

    /// <summary>Enable cross-backend debug validation</summary>
    public bool EnableDebugValidation { get; set; } = false;

    /// <summary>Enable ML-powered backend optimization</summary>
    public bool EnableAutoOptimization { get; set; } = true;

    /// <summary>Enable automatic error recovery</summary>
    public bool EnableRecovery { get; set; } = true;

    /// <summary>Minimum log level</summary>
    public LogLevel MinimumLogLevel { get; set; } = LogLevel.Information;
}
Service Registration
// Minimal setup
services.AddDotComputeRuntime();

// With configuration
services.AddDotComputeRuntime(options =>
{
    options.DefaultAccelerator = AcceleratorType.CUDA;
    options.EnableDebugValidation = true;
});

// Complete setup with all features
services.AddDotComputeComplete(configuration);
Performance Characteristics
Orchestration Overhead
| Operation | Time | Notes |
|---|---|---|
| Kernel lookup | < 1μs | Hash table O(1) |
| Backend selection | < 10μs | Cached decisions |
| Parameter binding | < 20μs | Type conversion |
| Memory allocation | < 1μs | Pool hit |
| Total overhead | < 50μs | Typical execution |
Scalability
- Concurrent executions: Unlimited (backend-limited)
- Registered kernels: Millions (O(1) lookup)
- Active buffers: Millions (with pooling)
- Throughput: 20K+ kernel executions/second
Testing Strategy
Unit Testing
[Fact]
public async Task ExecuteKernelAsync_ValidKernel_ReturnsCorrectResult()
{
    // Arrange
    var orchestrator = CreateOrchestrator();
    var input = Enumerable.Range(0, 1000).Select(i => (float)i).ToArray();

    // Act
    var result = await orchestrator.ExecuteKernelAsync<float[]>(
        "VectorDouble",
        new { input });

    // Assert
    result.Should().BeEquivalentTo(input.Select(x => x * 2));
}
Integration Testing
[Fact]
public async Task ExecuteKernelAsync_WithDebugValidation_ValidatesResults()
{
    // Arrange
    var services = CreateServicesWithDebugging();
    var orchestrator = services.GetRequiredService<IComputeOrchestrator>();
    var parameters = new { input = new float[1024] };

    // Act & Assert - should not throw
    await orchestrator.ExecuteKernelAsync("SimpleKernel", parameters);
}