# Backend Integration Architecture

The Backend Integration system provides a plugin-based architecture for compute acceleration, allowing DotCompute to support multiple hardware platforms (CPU, CUDA, Metal, OpenCL) through a unified interface.

## Architecture Overview
```text
Application Code
        ↓
IComputeOrchestrator
        ↓
IAcceleratorFactory
        ↓
┌─────────────────────────────────────────────────┐
│              Backend Plugin System              │
├─────────────────────────────────────────────────┤
│  PluginManager → PluginLoader → BackendPlugin   │
│        ↓              ↓              ↓          │
│    Discovery      Isolation     IAccelerator    │
└─────────────────────────────────────────────────┘
        ↓              ↓              ↓
   CPU Backend    CUDA Backend   Metal Backend
  (AVX512/AVX2)   (NVIDIA GPU)  (Apple Silicon)
```
## IAccelerator Interface

The core abstraction for all compute backends:
```csharp
public interface IAccelerator : IAsyncDisposable
{
    /// <summary>
    /// Unique identifier for this accelerator instance
    /// </summary>
    string Id { get; }

    /// <summary>
    /// Type of accelerator (CPU, CUDA, Metal, OpenCL)
    /// </summary>
    AcceleratorType Type { get; }

    /// <summary>
    /// Device name (e.g., "NVIDIA RTX 2000", "Apple M1 Pro")
    /// </summary>
    string DeviceName { get; }

    /// <summary>
    /// Backend capabilities (e.g., compute capability, feature levels)
    /// </summary>
    AcceleratorCapabilities Capabilities { get; }

    /// <summary>
    /// Compiles kernel source code for this backend
    /// </summary>
    Task<ICompiledKernel> CompileKernelAsync(
        KernelDefinition definition,
        CompilationOptions options,
        CancellationToken cancellationToken = default);

    /// <summary>
    /// Allocates device memory
    /// </summary>
    Task<IUnifiedBuffer<T>> AllocateAsync<T>(
        long elementCount,
        CancellationToken cancellationToken = default) where T : unmanaged;

    /// <summary>
    /// Synchronizes with the device (waits for all pending operations to complete)
    /// </summary>
    Task SynchronizeAsync(CancellationToken cancellationToken = default);

    /// <summary>
    /// Gets current memory usage statistics
    /// </summary>
    MemoryStatistics GetMemoryStatistics();
}
```
**Design Rationale:**
- **Async-first**: Non-blocking operations for all I/O and device interactions
- **Unified memory**: `IUnifiedBuffer<T>` abstracts device-specific memory
- **Capability detection**: Runtime feature detection for optimal code paths
- **Explicit synchronization**: Clear control over device/host synchronization
- **Resource tracking**: Memory statistics for monitoring and optimization
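
Taken together, a typical call sequence against the interface looks like the following minimal sketch. `CreateAcceleratorAsync` and `definition` stand in for host-side wiring not shown here, and the kernel launch step is elided; the `IAccelerator` members match the interface above.

```csharp
// Minimal usage sketch: allocate, compile, run, synchronize.
// CreateAcceleratorAsync and definition are placeholders for host wiring.
IAccelerator accelerator = await CreateAcceleratorAsync(AcceleratorType.CUDA);

// Allocate device memory for one million floats
var buffer = await accelerator.AllocateAsync<float>(1_000_000);

// Compile a kernel for this backend
ICompiledKernel kernel = await accelerator.CompileKernelAsync(
    definition, CompilationOptions.Default);

// ... enqueue kernel launches against the buffer (execution API not shown) ...

// Wait for all queued device work to finish before reading results back
await accelerator.SynchronizeAsync();

// Inspect allocation pressure on the device
MemoryStatistics stats = accelerator.GetMemoryStatistics();
```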
## Backend Plugin Architecture

### Plugin Lifecycle
```text
1. Discovery
   - Scan plugin directories
   - Load plugin manifests
   - Validate signatures (optional)
        ↓
2. Loading
   - Create isolated AssemblyLoadContext
   - Load plugin assembly
   - Resolve dependencies
        ↓
3. Initialization
   - Create plugin instance
   - Call Initialize()
   - Register with AcceleratorManager
        ↓
4. Runtime
   - Handle CreateAccelerator requests
   - Execute kernel compilation
   - Manage device resources
        ↓
5. Unloading (Hot-Reload)
   - Call Dispose()
   - Unload AssemblyLoadContext
   - Clean up resources
```
### IBackendPlugin Interface
```csharp
public interface IBackendPlugin : IAsyncDisposable
{
    /// <summary>
    /// Plugin metadata (name, version, author)
    /// </summary>
    PluginMetadata Metadata { get; }

    /// <summary>
    /// Supported accelerator type
    /// </summary>
    AcceleratorType SupportedType { get; }

    /// <summary>
    /// Initializes the plugin
    /// </summary>
    Task InitializeAsync(IServiceProvider services, CancellationToken cancellationToken);

    /// <summary>
    /// Creates an accelerator instance
    /// </summary>
    Task<IAccelerator> CreateAcceleratorAsync(
        AcceleratorConfiguration configuration,
        CancellationToken cancellationToken);

    /// <summary>
    /// Enumerates available devices for this backend
    /// </summary>
    Task<IReadOnlyList<DeviceInfo>> EnumerateDevicesAsync(
        CancellationToken cancellationToken);

    /// <summary>
    /// Checks whether this backend is available on the current system
    /// </summary>
    Task<bool> IsAvailableAsync(CancellationToken cancellationToken);

    /// <summary>
    /// Gets plugin health status
    /// </summary>
    PluginHealthStatus GetHealthStatus();
}
```
## Backend Implementations

### CPU Backend

**Implementation:** `CpuAccelerator` in `DotCompute.Backends.CPU`

**Key Features:**
- SIMD vectorization (AVX-512, AVX2, SSE4.2, NEON)
- Multi-threaded execution via `Parallel.For`
- Zero-copy operations with `Span<T>`
- Hardware capability detection at runtime (see the sketch below)
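
The capability probe can be done entirely with the standard .NET intrinsics API. A minimal sketch of how a CPU backend might rank instruction sets; the `SimdLevel` enum is illustrative, not part of DotCompute's public surface:

```csharp
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

// Illustrative ISA ranking using the standard .NET intrinsics API.
public enum SimdLevel { Scalar, Sse42, Avx2, Avx512, Neon }

public static class CpuFeatureProbe
{
    public static SimdLevel DetectSimdLevel()
    {
        if (Avx512F.IsSupported) return SimdLevel.Avx512;  // 512-bit vectors
        if (Avx2.IsSupported)    return SimdLevel.Avx2;    // 256-bit vectors
        if (Sse42.IsSupported)   return SimdLevel.Sse42;   // 128-bit vectors
        if (AdvSimd.IsSupported) return SimdLevel.Neon;    // ARM 128-bit vectors
        return SimdLevel.Scalar;
    }
}
```

Code written against `Vector<T>` adapts automatically: `Vector<float>.Count` is 16 under AVX-512, 8 under AVX2, and 4 under SSE or NEON.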
**Compilation Strategy:**
```csharp
// CPU kernels use the .NET JIT with SIMD intrinsics
public async Task<ICompiledKernel> CompileKernelAsync(KernelDefinition definition)
{
    // 1. Analyze the kernel for SIMD opportunities
    var vectorizable = AnalyzeVectorizability(definition);

    // 2. Generate SIMD-optimized code
    if (vectorizable && Vector.IsHardwareAccelerated)
    {
        return GenerateSimdKernel(definition);
    }

    // 3. Fall back to a scalar implementation
    return GenerateScalarKernel(definition);
}
```
**Performance:** 8-23x speedup on vectorizable operations
### CUDA Backend

**Implementation:** `CudaAccelerator` in `DotCompute.Backends.CUDA`

**Key Features:**
- NVIDIA GPU support (Compute Capability 5.0+)
- NVRTC runtime compilation
- PTX and CUBIN generation
- Unified memory and pinned-memory optimization

**Compilation Strategy:**
```csharp
// CUDA kernels are compiled at runtime via NVRTC
public async Task<ICompiledKernel> CompileKernelAsync(KernelDefinition definition)
{
    // 1. Detect the compute capability
    var computeCapability = CudaCapabilityManager.GetTargetComputeCapability();

    // 2. Choose the compilation target
    if (computeCapability >= new Version(7, 0))
    {
        // Modern GPUs: compile to CUBIN for best performance
        return await CompileToCubinAsync(definition, computeCapability);
    }
    else
    {
        // Legacy GPUs: compile to PTX for compatibility
        return await CompileToPtxAsync(definition, computeCapability);
    }
}
```
**Compute Capability Detection:**
```csharp
public static class CudaCapabilityManager
{
    public static Version GetTargetComputeCapability(int deviceId = 0)
    {
        // Query device properties
        cuDeviceGetAttribute(out int major, DeviceAttribute.ComputeCapabilityMajor, deviceId);
        cuDeviceGetAttribute(out int minor, DeviceAttribute.ComputeCapabilityMinor, deviceId);
        return new Version(major, minor);
    }

    public static bool SupportsFeature(CudaFeature feature, int deviceId = 0)
    {
        var capability = GetTargetComputeCapability(deviceId);
        return feature switch
        {
            CudaFeature.CooperativeGroups => capability >= new Version(6, 0),
            CudaFeature.TensorCores => capability >= new Version(7, 0),
            CudaFeature.UnifiedMemory => capability >= new Version(3, 0),
            _ => false
        };
    }
}
```
**Performance:** 21-92x speedup vs. CPU (measured on an RTX 2000 Ada)
### Metal Backend

**Implementation:** `MetalAccelerator` in `DotCompute.Backends.Metal`

**Key Features:**
- Apple Silicon optimization (M1/M2/M3)
- Unified memory architecture
- GPU family auto-detection
- Metal Shading Language (MSL) compilation

**Compilation Strategy:**
```csharp
// Metal kernels are compiled via the Metal compiler
public async Task<ICompiledKernel> CompileKernelAsync(KernelDefinition definition)
{
    // 1. Translate to Metal Shading Language (MSL)
    string mslSource = TranslateToMsl(definition);

    // 2. Compile via the Metal framework
    using var library = await device.NewLibraryAsync(mslSource);
    var function = library.NewFunction(definition.EntryPoint);

    // 3. Create the compute pipeline
    var pipeline = await device.NewComputePipelineStateAsync(function);
    return new MetalCompiledKernel(pipeline, definition);
}
```
**Unified Memory Optimization:**
```csharp
// Metal leverages unified memory for a 2-3x speedup
public async Task<IUnifiedBuffer<T>> AllocateAsync<T>(long elementCount)
    where T : unmanaged
{
    // Allocate in shared memory (visible to both CPU and GPU);
    // Unsafe.SizeOf<T>() avoids the unsafe context that sizeof(T) would require
    var buffer = device.NewBuffer(
        elementCount * Unsafe.SizeOf<T>(),
        MTLResourceOptions.StorageModeShared
    );

    // Zero-copy: the CPU can access the buffer directly, without a transfer
    return new MetalUnifiedBuffer<T>(buffer, isSharedMemory: true);
}
```
**Performance:** 37-141x speedup vs. CPU, with a 2-3x advantage from unified memory
### OpenCL Backend

**Implementation:** `OpenCLAccelerator` in `DotCompute.Backends.OpenCL`

**Status:** Foundation complete; full integration in progress

**Key Features:**
- Cross-platform GPU support (NVIDIA, AMD, Intel)
- Runtime compilation of OpenCL C
- Platform and device enumeration (see the sketch below)
- Work-group size optimization
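
For reference, OpenCL device discovery is a two-step platform-then-device enumeration. A minimal P/Invoke sketch against the raw OpenCL C API; the managed wrapper in `DotCompute.Backends.OpenCL` will differ, and error handling is elided:

```csharp
using System;
using System.Runtime.InteropServices;

// Sketch of OpenCL platform/device enumeration via the raw C API.
internal static class OpenClEnumeration
{
    private const ulong CL_DEVICE_TYPE_GPU = 1 << 2;

    [DllImport("OpenCL")]
    private static extern int clGetPlatformIDs(
        uint numEntries, IntPtr[]? platforms, out uint numPlatforms);

    [DllImport("OpenCL")]
    private static extern int clGetDeviceIDs(
        IntPtr platform, ulong deviceType,
        uint numEntries, IntPtr[]? devices, out uint numDevices);

    public static int CountGpuDevices()
    {
        // First call sizes the array, second call fills it
        clGetPlatformIDs(0, null, out uint platformCount);
        var platforms = new IntPtr[platformCount];
        clGetPlatformIDs(platformCount, platforms, out _);

        int total = 0;
        foreach (var platform in platforms)
        {
            // Some platforms expose no GPUs; a non-zero status is expected there
            if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, null, out uint n) == 0)
                total += (int)n;
        }
        return total;
    }
}
```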
## Backend Selection Strategy

### Automatic Selection

The orchestrator uses workload characteristics to select the optimal backend:
```csharp
public class WorkloadCharacteristics
{
    /// <summary>
    /// Total data size in bytes
    /// </summary>
    public long DataSize { get; set; }

    /// <summary>
    /// Compute intensity (operations per byte)
    /// </summary>
    public ComputeIntensity ComputeIntensity { get; set; }

    /// <summary>
    /// Memory access pattern (sequential vs. random)
    /// </summary>
    public MemoryAccessPattern AccessPattern { get; set; }

    /// <summary>
    /// Potential for parallelism
    /// </summary>
    public ParallelismLevel ParallelismPotential { get; set; }
}

public interface IBackendSelector
{
    Task<IAccelerator> SelectBackendAsync(
        WorkloadCharacteristics characteristics,
        CancellationToken cancellationToken);
}
```
**Selection Algorithm:**
```csharp
public async Task<IAccelerator> SelectBackendAsync(WorkloadCharacteristics characteristics)
{
    // 1. Rule-based selection for clear-cut cases
    if (characteristics.DataSize < 10_000) // Small data
    {
        return GetAccelerator(AcceleratorType.CPU); // No GPU transfer overhead
    }
    if (characteristics.ComputeIntensity == ComputeIntensity.Low)
    {
        return GetAccelerator(AcceleratorType.CPU); // Memory-bound; the CPU is faster
    }

    // 2. ML-based selection for complex cases
    if (mlModel != null)
    {
        var prediction = await mlModel.PredictOptimalBackendAsync(characteristics);
        return GetAccelerator(prediction.BackendType);
    }

    // 3. Default: prefer the GPU for highly parallel workloads
    if (characteristics.ParallelismPotential == ParallelismLevel.High)
    {
        return GetAccelerator(AcceleratorType.CUDA)   // Try CUDA first
            ?? GetAccelerator(AcceleratorType.Metal)  // Fall back to Metal
            ?? GetAccelerator(AcceleratorType.CPU);   // Fall back to CPU
    }

    // 4. CPU as the final fallback
    return GetAccelerator(AcceleratorType.CPU);
}
```
### Manual Override

Users can force a specific backend:
```csharp
// Force the CUDA backend
services.AddDotComputeRuntime(options =>
{
    options.DefaultAccelerator = AcceleratorType.CUDA;
    options.EnableAutoOptimization = false;
});

// Or specify per execution
await orchestrator.ExecuteKernelAsync(
    "MyKernel",
    parameters,
    forceBackend: AcceleratorType.Metal
);
```
## Multi-Backend Coordination

### Backend Registry
```csharp
public class AcceleratorManager : IAcceleratorManager
{
    private readonly Dictionary<AcceleratorType, IAccelerator> _accelerators = new();

    public async Task<IAccelerator> GetOrCreateAcceleratorAsync(AcceleratorType type)
    {
        if (_accelerators.TryGetValue(type, out var accelerator))
        {
            return accelerator; // Return the cached instance
        }

        // Create a new accelerator via the factory and cache it
        accelerator = await _factory.CreateAcceleratorAsync(type);
        _accelerators[type] = accelerator;
        return accelerator;
    }

    public async Task<IReadOnlyList<AcceleratorType>> GetAvailableBackendsAsync(
        CancellationToken cancellationToken = default)
    {
        var available = new List<AcceleratorType>();
        foreach (var plugin in _plugins)
        {
            // Probe each plugin asynchronously rather than blocking on .Result
            if (await plugin.IsAvailableAsync(cancellationToken))
            {
                available.Add(plugin.SupportedType);
            }
        }
        return available;
    }
}
```
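
Resolving a backend through the registry is then a single call. A minimal sketch, assuming the manager is registered with the host's service provider:

```csharp
// Get-or-create: the first call creates and caches the accelerator;
// later calls return the same instance.
var manager = serviceProvider.GetRequiredService<IAcceleratorManager>();
IAccelerator gpu = await manager.GetOrCreateAcceleratorAsync(AcceleratorType.CUDA);
```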
### Cross-Backend Validation

The debugging system can validate results across backends:
```csharp
public async Task<ValidationResult> ValidateAcrossBackendsAsync(
    string kernelName,
    object parameters)
{
    // 1. Execute on the primary backend (e.g., GPU)
    var primaryResult = await ExecuteOnBackendAsync(
        kernelName, parameters, backend: AcceleratorType.CUDA);

    // 2. Execute on the reference backend (always CPU)
    var referenceResult = await ExecuteOnBackendAsync(
        kernelName, parameters, backend: AcceleratorType.CPU);

    // 3. Compare results within tolerance
    var differences = CompareResults(primaryResult, referenceResult, tolerance: 1e-5);

    // 4. Log discrepancies
    if (differences.Any())
    {
        logger.LogWarning(
            "Cross-backend validation found {Count} differences for kernel {Kernel}",
            differences.Count,
            kernelName);
    }

    return new ValidationResult
    {
        IsValid = !differences.Any(),
        Differences = differences,
        PrimaryBackend = AcceleratorType.CUDA,
        ReferenceBackend = AcceleratorType.CPU
    };
}
```
## Plugin Loading and Isolation

### Plugin Discovery
```csharp
public class PluginManager : IPluginManager
{
    private const string PluginDirectory = "plugins";

    public async Task<IReadOnlyList<IBackendPlugin>> DiscoverPluginsAsync()
    {
        var plugins = new List<IBackendPlugin>();

        // 1. Scan the plugin directory for manifests
        var manifestFiles = Directory.GetFiles(
            PluginDirectory, "plugin.json", SearchOption.AllDirectories);

        foreach (var manifestFile in manifestFiles)
        {
            // 2. Load and validate the manifest
            var manifest = await LoadManifestAsync(manifestFile);
            if (!ValidateManifest(manifest))
            {
                logger.LogWarning("Invalid plugin manifest: {Path}", manifestFile);
                continue;
            }

            // 3. Load the plugin assembly
            var plugin = await LoadPluginAsync(manifest);
            plugins.Add(plugin);
        }

        return plugins;
    }
}
```
### Assembly Isolation

Each plugin runs in an isolated `AssemblyLoadContext`:
```csharp
public class PluginLoader
{
    public async Task<IBackendPlugin> LoadPluginAsync(PluginManifest manifest)
    {
        // 1. Create an isolated load context
        var loadContext = new PluginLoadContext(manifest.AssemblyPath);

        // 2. Load the plugin assembly
        var assembly = loadContext.LoadFromAssemblyPath(manifest.AssemblyPath);

        // 3. Find a concrete plugin type (skip interfaces and abstract bases)
        var pluginType = assembly.GetTypes()
            .FirstOrDefault(t => typeof(IBackendPlugin).IsAssignableFrom(t)
                                 && t.IsClass && !t.IsAbstract);
        if (pluginType == null)
        {
            throw new InvalidOperationException(
                $"Plugin assembly {manifest.AssemblyPath} does not contain an IBackendPlugin implementation");
        }

        // 4. Create the plugin instance
        var plugin = (IBackendPlugin)Activator.CreateInstance(pluginType)!;

        // 5. Initialize the plugin
        await plugin.InitializeAsync(_serviceProvider, CancellationToken.None);
        return plugin;
    }
}

internal class PluginLoadContext : AssemblyLoadContext
{
    private readonly AssemblyDependencyResolver _resolver;

    public PluginLoadContext(string pluginPath) : base(isCollectible: true)
    {
        _resolver = new AssemblyDependencyResolver(pluginPath);
    }

    protected override Assembly? Load(AssemblyName assemblyName)
    {
        // Resolve dependencies relative to the plugin directory
        string? assemblyPath = _resolver.ResolveAssemblyToPath(assemblyName);
        return assemblyPath != null ? LoadFromAssemblyPath(assemblyPath) : null;
    }
}
```
**Benefits of Isolation:**
- **Hot-reload**: Plugins can be unloaded and reloaded without restarting the host (see the sketch below)
- **Versioning**: Plugins can carry their own versions of shared dependencies
- **Security**: Plugin access to host resources can be restricted
- **Stability**: A plugin crash does not bring down the host
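
A sketch of what the hot-reload path looks like with a collectible context. The `UnloadPluginAsync` helper is illustrative; the `WeakReference` polling loop is the standard .NET pattern for confirming a collectible `AssemblyLoadContext` has actually been reclaimed:

```csharp
// Illustrative unload sequence for a collectible AssemblyLoadContext.
public static async Task UnloadPluginAsync(IBackendPlugin? plugin, PluginLoadContext? context)
{
    // Release device handles and other plugin resources first
    await plugin!.DisposeAsync();

    var contextRef = new WeakReference(context);
    context!.Unload(); // begins unloading; completion requires a GC

    // Clear both strong references (the async state machine would otherwise
    // keep them alive and prevent collection)
    plugin = null;
    context = null;

    // Poll until the runtime has collected the load context
    for (int i = 0; contextRef.IsAlive && i < 10; i++)
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
    }
}
```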
## Backend Capability Detection

### Runtime Feature Detection
```csharp
public class AcceleratorCapabilities
{
    /// <summary>
    /// Maximum work-group size for compute kernels
    /// </summary>
    public int MaxWorkGroupSize { get; init; }

    /// <summary>
    /// Maximum number of threads per block/work-group
    /// </summary>
    public int MaxThreadsPerBlock { get; init; }

    /// <summary>
    /// Maximum shared memory per work-group (bytes)
    /// </summary>
    public long MaxSharedMemoryPerWorkGroup { get; init; }

    /// <summary>
    /// Maximum global memory (bytes)
    /// </summary>
    public long MaxGlobalMemory { get; init; }

    /// <summary>
    /// Supports double-precision floating point
    /// </summary>
    public bool SupportsDouble { get; init; }

    /// <summary>
    /// Supports half-precision floating point
    /// </summary>
    public bool SupportsHalf { get; init; }

    /// <summary>
    /// Supports atomic operations
    /// </summary>
    public bool SupportsAtomics { get; init; }

    /// <summary>
    /// Supports unified memory (CPU/GPU shared)
    /// </summary>
    public bool SupportsUnifiedMemory { get; init; }

    /// <summary>
    /// Backend-specific extended capabilities
    /// </summary>
    public Dictionary<string, object> ExtendedCapabilities { get; init; }
}
```
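
Callers branch on these flags at runtime to pick a kernel variant. A small illustrative sketch; the `KernelVariant` enum and `PickPrecision` helper are hypothetical, shown only to demonstrate capability-driven dispatch:

```csharp
// Illustrative only: capability flags drive variant selection.
public enum KernelVariant { Fp16, Fp32, Fp64 }

public static KernelVariant PickPrecision(AcceleratorCapabilities caps, bool preferSpeed)
{
    if (preferSpeed && caps.SupportsHalf)
        return KernelVariant.Fp16;  // half precision where the device supports it
    if (!preferSpeed && caps.SupportsDouble)
        return KernelVariant.Fp64;  // full precision where available
    return KernelVariant.Fp32;      // universally supported fallback
}
```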
### CUDA-Specific Capabilities
```csharp
public static class CudaCapabilities
{
    public static AcceleratorCapabilities GetCapabilities(int deviceId = 0)
    {
        // Query device properties
        cuDeviceGetAttribute(out int maxThreadsPerBlock,
            DeviceAttribute.MaxThreadsPerBlock, deviceId);
        cuDeviceGetAttribute(out int sharedMemPerBlock,
            DeviceAttribute.MaxSharedMemoryPerBlock, deviceId);
        cuDeviceGetAttribute(out int computeCapabilityMajor,
            DeviceAttribute.ComputeCapabilityMajor, deviceId);
        cuDeviceGetAttribute(out int computeCapabilityMinor,
            DeviceAttribute.ComputeCapabilityMinor, deviceId);

        var computeCapability = new Version(computeCapabilityMajor, computeCapabilityMinor);

        return new AcceleratorCapabilities
        {
            MaxThreadsPerBlock = maxThreadsPerBlock,
            MaxSharedMemoryPerWorkGroup = sharedMemPerBlock,
            SupportsDouble = computeCapability >= new Version(1, 3),
            SupportsHalf = computeCapability >= new Version(5, 3),
            SupportsAtomics = computeCapability >= new Version(2, 0),
            SupportsUnifiedMemory = computeCapability >= new Version(3, 0),
            ExtendedCapabilities = new Dictionary<string, object>
            {
                ["ComputeCapability"] = computeCapability,
                ["TensorCores"] = computeCapability >= new Version(7, 0),
                ["CooperativeGroups"] = computeCapability >= new Version(6, 0)
            }
        };
    }
}
```
### Metal-Specific Capabilities
```csharp
public static class MetalCapabilities
{
    public static AcceleratorCapabilities GetCapabilities(MTLDevice device)
    {
        return new AcceleratorCapabilities
        {
            MaxThreadsPerBlock = (int)device.MaxThreadsPerThreadgroup.Width,
            MaxSharedMemoryPerWorkGroup = (long)device.MaxThreadgroupMemoryLength,
            MaxGlobalMemory = (long)device.RecommendedMaxWorkingSetSize,
            SupportsDouble = false, // MSL has no double-precision support
            SupportsHalf = true,
            SupportsAtomics = true,
            SupportsUnifiedMemory = true, // Always true on Apple Silicon
            ExtendedCapabilities = new Dictionary<string, object>
            {
                ["GPUFamily"] = GetGPUFamily(device),
                ["Architecture"] = device.Architecture,
                ["UnifiedMemoryArchitecture"] = true
            }
        };
    }

    private static string GetGPUFamily(MTLDevice device)
    {
        // Detect the Apple Silicon generation
        if (device.SupportsFamily(MTLGPUFamily.Apple9))
            return "Apple9"; // M3
        if (device.SupportsFamily(MTLGPUFamily.Apple8))
            return "Apple8"; // M2
        if (device.SupportsFamily(MTLGPUFamily.Apple7))
            return "Apple7"; // M1
        return "Unknown";
    }
}
```
## Source Generator Integration

### Backend-Specific Code Generation

The source generator creates optimized code for each backend:
```csharp
// User writes this:
[Kernel]
public static void VectorAdd(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> result)
{
    int idx = Kernel.ThreadId.X;
    if (idx < result.Length)
    {
        result[idx] = a[idx] + b[idx];
    }
}

// The generator creates a CPU SIMD version:
private static void VectorAdd_CPU_SIMD(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> result)
{
    int vectorSize = Vector<float>.Count;
    int i = 0;

    // Vectorized loop
    for (; i <= result.Length - vectorSize; i += vectorSize)
    {
        var va = new Vector<float>(a.Slice(i));
        var vb = new Vector<float>(b.Slice(i));
        (va + vb).CopyTo(result.Slice(i));
    }

    // Scalar remainder
    for (; i < result.Length; i++)
    {
        result[i] = a[i] + b[i];
    }
}

// The generator creates a CUDA version:
private const string VectorAdd_CUDA_Source = @"
extern ""C"" __global__ void VectorAdd(
    const float* a,
    const float* b,
    float* result,
    int length)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < length)
    {
        result[idx] = a[idx] + b[idx];
    }
}
";

// The generator creates a Metal version:
private const string VectorAdd_Metal_Source = @"
#include <metal_stdlib>
using namespace metal;

kernel void VectorAdd(
    device const float* a [[buffer(0)]],
    device const float* b [[buffer(1)]],
    device float* result [[buffer(2)]],
    constant int& length [[buffer(3)]],
    uint idx [[thread_position_in_grid]])
{
    if (idx < length)
    {
        result[idx] = a[idx] + b[idx];
    }
}
";
```
### Registration Generation

The source generator also creates registration code:
```csharp
// Generated registration class
public static class GeneratedKernels
{
    public static void Register(IKernelRegistry registry)
    {
        registry.RegisterKernel(new KernelMetadata
        {
            Name = "VectorAdd",
            Namespace = "MyNamespace",
            DeclaringType = "MyClass",
            Parameters = new[]
            {
                new ParameterMetadata { Name = "a", Type = typeof(ReadOnlySpan<float>) },
                new ParameterMetadata { Name = "b", Type = typeof(ReadOnlySpan<float>) },
                new ParameterMetadata { Name = "result", Type = typeof(Span<float>) }
            },
            Backends = KernelBackends.CPU | KernelBackends.CUDA | KernelBackends.Metal,
            IsParallel = true,

            // CPU implementation
            CpuImplementation = (args) => VectorAdd_CPU_SIMD(
                args.Get<ReadOnlySpan<float>>("a"),
                args.Get<ReadOnlySpan<float>>("b"),
                args.Get<Span<float>>("result")
            ),

            // GPU source code
            CudaSource = VectorAdd_CUDA_Source,
            MetalSource = VectorAdd_Metal_Source,

            // Compiler options
            CompilerOptions = new CompilationOptions
            {
                OptimizationLevel = OptimizationLevel.Aggressive,
                GenerateDebugInfo = false,
                EnableFastMath = true
            }
        });
    }
}
```
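
The host then invokes the generated method once at startup. A minimal sketch, assuming `IKernelRegistry` is resolvable from the service provider:

```csharp
// One-time startup wiring: hand the generated kernel metadata to the registry
var registry = serviceProvider.GetRequiredService<IKernelRegistry>();
GeneratedKernels.Register(registry);
```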
## Performance Characteristics

### Backend Overhead
| Backend | Initialization | Compilation | Execution Overhead | Notes |
|---|---|---|---|---|
| CPU | < 1ms | N/A (JIT) | < 10μs | Zero-copy, no transfers |
| CUDA | ~100ms | 50-200ms | ~50μs | Device init, NVRTC compile |
| Metal | ~50ms | 20-100ms | ~30μs | Device init, MSL compile |
| OpenCL | ~150ms | 100-300ms | ~80μs | Platform detection, compile |
**Notes:**
- Initialization is a one-time cost per application lifetime
- Compilation results are cached for subsequent executions (see the cache sketch below)
- Execution overhead is incurred per kernel launch
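
The compilation cache can be as simple as a `ConcurrentDictionary` keyed by kernel identity. A sketch under the assumption that accelerator ID, kernel name, and an options hash uniquely identify a compilation; DotCompute's actual keying may differ:

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Illustrative per-process compilation cache. Caching the Task<ICompiledKernel>
// (rather than the kernel itself) lets concurrent first callers share a single
// in-flight compilation instead of compiling twice.
public sealed class KernelCompilationCache
{
    private readonly ConcurrentDictionary<(string AcceleratorId, string Kernel, int OptionsHash),
        Task<ICompiledKernel>> _cache = new();

    public Task<ICompiledKernel> GetOrCompileAsync(
        IAccelerator accelerator,
        KernelDefinition definition,
        CompilationOptions options)
    {
        // Compiled kernels are backend-specific, so the accelerator is part of the key
        var key = (accelerator.Id, definition.Name, options.GetHashCode());
        return _cache.GetOrAdd(key,
            _ => accelerator.CompileKernelAsync(definition, options));
    }
}
```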
### Backend Selection Overhead
- Rule-based selection: < 10μs
- ML-based selection: ~100μs (inference time)
- Device enumeration: ~1ms (cached)
### Plugin System Overhead
- Plugin loading: ~10-50ms per plugin (one-time)
- Hot-reload: ~20-100ms (unload + reload)
- Isolated execution: < 5% performance impact vs direct calls
## Testing Strategy

### Backend Tests

Each backend has a comprehensive test suite:
```csharp
[Fact]
public async Task CompileKernel_ValidDefinition_CompilesToExecutable()
{
    // Arrange
    var accelerator = await CreateAcceleratorAsync();
    var definition = new KernelDefinition
    {
        Name = "SimpleAdd",
        Source = "/* kernel source */",
        EntryPoint = "SimpleAdd"
    };

    // Act
    var compiled = await accelerator.CompileKernelAsync(definition, CompilationOptions.Default);

    // Assert
    Assert.NotNull(compiled);
    Assert.Equal("SimpleAdd", compiled.EntryPoint);
}

[SkippableFact] // Skipped when the hardware is not available
public async Task Execute_OnGpu_ProducesCorrectResults()
{
    Skip.IfNot(await IsGpuAvailableAsync(), "GPU not available");

    // Arrange
    var accelerator = await CreateAcceleratorAsync();
    var input = Enumerable.Range(0, 1000).Select(i => (float)i).ToArray();

    // Act
    var result = await ExecuteKernelAsync(accelerator, input);

    // Assert
    Assert.Equal(input.Select(x => x * 2), result);
}
```
### Cross-Backend Validation Tests
```csharp
[Theory]
[InlineData(AcceleratorType.CPU, AcceleratorType.CUDA)]
[InlineData(AcceleratorType.CPU, AcceleratorType.Metal)]
public async Task Execute_AcrossBackends_ProducesSameResults(
    AcceleratorType reference,
    AcceleratorType target)
{
    // Arrange
    var input = GenerateTestData(10_000);

    // Act
    var referenceResult = await ExecuteOnBackendAsync(reference, input);
    var targetResult = await ExecuteOnBackendAsync(target, input);

    // Assert
    AssertResultsEqual(referenceResult, targetResult, tolerance: 1e-5f);
}
```