Backend Integration Architecture

The Backend Integration system provides a plugin-based architecture for compute acceleration, allowing DotCompute to support multiple hardware platforms (CPU, CUDA, Metal, OpenCL) through a unified interface.

Architecture Overview

Application Code
    ↓
IComputeOrchestrator
    ↓
IAcceleratorFactory
    ↓
┌─────────────────────────────────────────────────┐
│         Backend Plugin System                   │
├─────────────────────────────────────────────────┤
│  PluginManager → PluginLoader → BackendPlugin   │
│      ↓              ↓               ↓           │
│  Discovery     Isolation       IAccelerator     │
└─────────────────────────────────────────────────┘
    ↓                ↓                ↓
CPU Backend    CUDA Backend    Metal Backend
(AVX512/AVX2)  (NVIDIA GPU)   (Apple Silicon)

IAccelerator Interface

The core abstraction for all compute backends:

public interface IAccelerator : IAsyncDisposable
{
    /// <summary>
    /// Unique identifier for this accelerator instance
    /// </summary>
    string Id { get; }

    /// <summary>
    /// Type of accelerator (CPU, CUDA, Metal, OpenCL)
    /// </summary>
    AcceleratorType Type { get; }

    /// <summary>
    /// Device name (e.g., "NVIDIA RTX 2000", "Apple M1 Pro")
    /// </summary>
    string DeviceName { get; }

    /// <summary>
    /// Backend capabilities (e.g., compute capability, feature levels)
    /// </summary>
    AcceleratorCapabilities Capabilities { get; }

    /// <summary>
    /// Compiles kernel source code for this backend
    /// </summary>
    Task<ICompiledKernel> CompileKernelAsync(
        KernelDefinition definition,
        CompilationOptions options,
        CancellationToken cancellationToken = default);

    /// <summary>
    /// Allocates device memory
    /// </summary>
    Task<IUnifiedBuffer<T>> AllocateAsync<T>(
        long elementCount,
        CancellationToken cancellationToken = default) where T : unmanaged;

    /// <summary>
    /// Synchronizes with device (waits for all operations to complete)
    /// </summary>
    Task SynchronizeAsync(CancellationToken cancellationToken = default);

    /// <summary>
    /// Gets current memory usage statistics
    /// </summary>
    MemoryStatistics GetMemoryStatistics();
}

Design Rationale:

  • Async-first: Non-blocking operations for all I/O and device interactions
  • Unified memory: IUnifiedBuffer<T> abstracts device-specific memory
  • Capability detection: Runtime feature detection for optimal code paths
  • Explicit synchronization: Clear control over device/host synchronization
  • Resource tracking: Memory statistics for monitoring and optimization
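
A minimal usage sketch tying these members together; the kernel definition values, ExecuteAsync, and KernelArguments are illustrative assumptions rather than confirmed API:

// Hypothetical end-to-end flow against IAccelerator
IAccelerator accelerator = await factory.CreateAcceleratorAsync(AcceleratorType.CUDA);

// Compile once; implementations typically cache compiled kernels
ICompiledKernel kernel = await accelerator.CompileKernelAsync(
    new KernelDefinition { Name = "Scale", Source = "/* backend-specific source */", EntryPoint = "Scale" },
    CompilationOptions.Default);

// Allocate device memory
var buffer = await accelerator.AllocateAsync<float>(1024);

// Launch the kernel (ExecuteAsync and KernelArguments are assumed shapes)
await kernel.ExecuteAsync(new KernelArguments { ["data"] = buffer });

// Synchronize before reading results on the host
await accelerator.SynchronizeAsync();
var stats = accelerator.GetMemoryStatistics();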

Backend Plugin Architecture

Plugin Lifecycle

1. Discovery
   - Scan plugin directories
   - Load plugin manifests
   - Validate signatures (optional)
   ↓
2. Loading
   - Create isolated AssemblyLoadContext
   - Load plugin assembly
   - Resolve dependencies
   ↓
3. Initialization
   - Create plugin instance
   - Call Initialize()
   - Register with AcceleratorManager
   ↓
4. Runtime
   - Handle CreateAccelerator requests
   - Execute kernel compilation
   - Manage device resources
   ↓
5. Unloading (Hot-Reload)
   - Call Dispose()
   - Unload AssemblyLoadContext
   - Clean up resources

IBackendPlugin Interface

public interface IBackendPlugin : IAsyncDisposable
{
    /// <summary>
    /// Plugin metadata (name, version, author)
    /// </summary>
    PluginMetadata Metadata { get; }

    /// <summary>
    /// Supported accelerator type
    /// </summary>
    AcceleratorType SupportedType { get; }

    /// <summary>
    /// Initializes the plugin
    /// </summary>
    Task InitializeAsync(IServiceProvider services, CancellationToken cancellationToken);

    /// <summary>
    /// Creates an accelerator instance
    /// </summary>
    Task<IAccelerator> CreateAcceleratorAsync(
        AcceleratorConfiguration configuration,
        CancellationToken cancellationToken);

    /// <summary>
    /// Enumerates available devices for this backend
    /// </summary>
    Task<IReadOnlyList<DeviceInfo>> EnumerateDevicesAsync(
        CancellationToken cancellationToken);

    /// <summary>
    /// Checks if this backend is available on the current system
    /// </summary>
    Task<bool> IsAvailableAsync(CancellationToken cancellationToken);

    /// <summary>
    /// Gets plugin health status
    /// </summary>
    PluginHealthStatus GetHealthStatus();
}
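
For example, a host can probe a plugin before creating an accelerator; a sketch in which the AcceleratorConfiguration member shown is an assumption:

// Probe availability, enumerate devices, then create an accelerator
if (await plugin.IsAvailableAsync(cancellationToken))
{
    var devices = await plugin.EnumerateDevicesAsync(cancellationToken);
    logger.LogInformation("{Backend}: {Count} device(s)", plugin.SupportedType, devices.Count);

    // DeviceIndex is a hypothetical configuration member
    var accelerator = await plugin.CreateAcceleratorAsync(
        new AcceleratorConfiguration { DeviceIndex = 0 },
        cancellationToken);
}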

Backend Implementations

CPU Backend

Implementation: CpuAccelerator in DotCompute.Backends.CPU

Key Features:

  • SIMD vectorization (AVX512, AVX2, SSE4.2, NEON)
  • Multi-threaded execution via Parallel.For
  • Zero-copy operations with Span<T>
  • Hardware capability detection at runtime (see the sketch below)
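
Detection maps directly onto .NET's hardware intrinsics; a minimal sketch of choosing the widest available instruction set:

using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

// Pick the widest SIMD level the current CPU supports
static string DetectSimdLevel()
{
    if (Avx512F.IsSupported) return "AVX512";
    if (Avx2.IsSupported)    return "AVX2";
    if (Sse42.IsSupported)   return "SSE4.2";
    if (AdvSimd.IsSupported) return "NEON"; // ARM64
    return "Scalar";
}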

Compilation Strategy:

// CPU kernels use .NET JIT with SIMD intrinsics
public async Task<ICompiledKernel> CompileKernelAsync(KernelDefinition definition)
{
    // 1. Analyze kernel for SIMD opportunities
    var vectorizable = AnalyzeVectorizability(definition);

    // 2. Generate SIMD-optimized code
    if (vectorizable && Vector.IsHardwareAccelerated)
    {
        return GenerateSimdKernel(definition);
    }

    // 3. Fall back to scalar implementation
    return GenerateScalarKernel(definition);
}

Performance: 8-23x speedup over scalar code on vectorizable operations

CUDA Backend

Implementation: CudaAccelerator in DotCompute.Backends.CUDA

Key Features:

  • NVIDIA GPU support (Compute Capability 5.0+)
  • NVRTC runtime compilation
  • PTX and CUBIN generation
  • Unified memory and pinned memory optimization

Compilation Strategy:

// CUDA kernels compiled at runtime via NVRTC
public async Task<ICompiledKernel> CompileKernelAsync(KernelDefinition definition)
{
    // 1. Detect compute capability
    var computeCapability = CudaCapabilityManager.GetTargetComputeCapability();

    // 2. Choose compilation target
    if (computeCapability >= new Version(7, 0))
    {
        // Modern GPUs: Compile to CUBIN for best performance
        return await CompileToCubinAsync(definition, computeCapability);
    }
    else
    {
        // Legacy GPUs: Compile to PTX for compatibility
        return await CompileToPtxAsync(definition, computeCapability);
    }
}

Compute Capability Detection:

public static class CudaCapabilityManager
{
    public static Version GetTargetComputeCapability(int deviceId = 0)
    {
        // Query device properties
        cuDeviceGetAttribute(out int major, DeviceAttribute.ComputeCapabilityMajor, deviceId);
        cuDeviceGetAttribute(out int minor, DeviceAttribute.ComputeCapabilityMinor, deviceId);

        return new Version(major, minor);
    }

    public static bool SupportsFeature(CudaFeature feature, int deviceId = 0)
    {
        var capability = GetTargetComputeCapability(deviceId);

        return feature switch
        {
            CudaFeature.CooperativeGroups => capability >= new Version(6, 0),
            CudaFeature.TensorCores => capability >= new Version(7, 0),
            CudaFeature.UnifiedMemory => capability >= new Version(3, 0),
            _ => false
        };
    }
}
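
SupportsFeature lets compilation adapt to the device at runtime; for example (UseTensorCores is a hypothetical option, not part of the CompilationOptions shown later in this section):

// Gate optional features on the device's compute capability
var options = new CompilationOptions
{
    OptimizationLevel = OptimizationLevel.Aggressive,
    EnableFastMath = true,
    UseTensorCores = CudaCapabilityManager.SupportsFeature(CudaFeature.TensorCores) // hypothetical flag
};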

Performance: 21-92x speedup vs CPU (measured on RTX 2000 Ada)

Metal Backend

Implementation: MetalAccelerator in DotCompute.Backends.Metal

Key Features:

  • Apple Silicon optimization (M1/M2/M3)
  • Unified memory architecture
  • GPU family auto-detection
  • Metal Shading Language (MSL) compilation

Compilation Strategy:

// Metal kernels compiled via Metal compiler
public async Task<ICompiledKernel> CompileKernelAsync(KernelDefinition definition)
{
    // 1. Translate to Metal Shading Language (MSL)
    string mslSource = TranslateToMsl(definition);

    // 2. Compile via Metal framework
    using var library = await device.NewLibraryAsync(mslSource);
    var function = library.NewFunction(definition.EntryPoint);

    // 3. Create compute pipeline
    var pipeline = await device.NewComputePipelineStateAsync(function);

    return new MetalCompiledKernel(pipeline, definition);
}

Unified Memory Optimization:

// Metal leverages unified memory for 2-3x speedup
public async Task<IUnifiedBuffer<T>> AllocateAsync<T>(long elementCount)
    where T : unmanaged
{
    // Allocate in shared memory (visible to both CPU and GPU)
    var buffer = device.NewBuffer(
        elementCount * Unsafe.SizeOf<T>(), // sizeof(T) requires an unsafe context; Unsafe.SizeOf<T>() is the safe equivalent
        MTLResourceOptions.StorageModeShared
    );

    // Zero-copy: CPU can access directly without transfer
    return new MetalUnifiedBuffer<T>(buffer, isSharedMemory: true);
}

Performance: 37-141x speedup vs CPU, plus a 2-3x gain from zero-copy unified memory

OpenCL Backend

Implementation: OpenCLAccelerator in DotCompute.Backends.OpenCL

Status: Foundation complete, full integration in progress

Key Features:

  • Cross-platform GPU support (NVIDIA, AMD, Intel)
  • Runtime compilation via OpenCL C
  • Platform and device enumeration
  • Workgroup size optimization
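
Availability checks for OpenCL typically reduce to probing for the runtime library; a plausible sketch of IsAvailableAsync (library names vary by platform; macOS uses the OpenCL framework path instead):

// Availability probe: OpenCL is present if its runtime library loads
public Task<bool> IsAvailableAsync(CancellationToken cancellationToken)
{
    string libraryName = OperatingSystem.IsWindows() ? "OpenCL.dll" : "libOpenCL.so.1";
    bool available = NativeLibrary.TryLoad(libraryName, out IntPtr handle);
    if (available)
    {
        NativeLibrary.Free(handle);
    }
    return Task.FromResult(available);
}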

Backend Selection Strategy

Automatic Selection

The orchestrator uses workload characteristics to select the optimal backend:

public class WorkloadCharacteristics
{
    /// <summary>
    /// Total data size in bytes
    /// </summary>
    public long DataSize { get; set; }

    /// <summary>
    /// Compute intensity (operations per byte)
    /// </summary>
    public ComputeIntensity ComputeIntensity { get; set; }

    /// <summary>
    /// Memory access pattern (sequential vs random)
    /// </summary>
    public MemoryAccessPattern AccessPattern { get; set; }

    /// <summary>
    /// Potential for parallelism
    /// </summary>
    public ParallelismLevel ParallelismPotential { get; set; }
}

public interface IBackendSelector
{
    Task<IAccelerator> SelectBackendAsync(
        WorkloadCharacteristics characteristics,
        CancellationToken cancellationToken);
}

Selection Algorithm:

public async Task<IAccelerator> SelectBackendAsync(WorkloadCharacteristics characteristics)
{
    // 1. Rule-based selection for clear cases
    if (characteristics.DataSize < 10_000) // Small data
    {
        return GetAccelerator(AcceleratorType.CPU); // No GPU transfer overhead
    }

    if (characteristics.ComputeIntensity == ComputeIntensity.Low)
    {
        return GetAccelerator(AcceleratorType.CPU); // Memory-bound, CPU is faster
    }

    // 2. ML-based selection for complex cases
    if (mlModel != null)
    {
        var prediction = await mlModel.PredictOptimalBackendAsync(characteristics);
        return GetAccelerator(prediction.BackendType);
    }

    // 3. Default: Prefer GPU for high parallelism
    if (characteristics.ParallelismPotential == ParallelismLevel.High)
    {
        return GetAccelerator(AcceleratorType.CUDA) // Try CUDA first
            ?? GetAccelerator(AcceleratorType.Metal) // Fall back to Metal
            ?? GetAccelerator(AcceleratorType.CPU);  // Fall back to CPU
    }

    // 4. CPU as final fallback
    return GetAccelerator(AcceleratorType.CPU);
}
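
A caller describes the workload and lets the selector choose; the sizes and enum members below are illustrative:

// Describe the workload, then delegate backend choice to the selector
var characteristics = new WorkloadCharacteristics
{
    DataSize = 64L * 1024 * 1024, // 64 MB
    ComputeIntensity = ComputeIntensity.High,
    AccessPattern = MemoryAccessPattern.Sequential,
    ParallelismPotential = ParallelismLevel.High
};

IAccelerator accelerator = await selector.SelectBackendAsync(characteristics, cancellationToken);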

Manual Override

Users can force specific backends:

// Force CUDA backend
services.AddDotComputeRuntime(options =>
{
    options.DefaultAccelerator = AcceleratorType.CUDA;
    options.EnableAutoOptimization = false;
});

// Or specify per-execution
await orchestrator.ExecuteKernelAsync(
    "MyKernel",
    parameters,
    forceBackend: AcceleratorType.Metal
);

Multi-Backend Coordination

Backend Registry

public class AcceleratorManager : IAcceleratorManager
{
    private readonly Dictionary<AcceleratorType, IAccelerator> _accelerators = new();

    public async Task<IAccelerator> GetOrCreateAcceleratorAsync(AcceleratorType type)
    {
        if (_accelerators.TryGetValue(type, out var accelerator))
        {
            return accelerator; // Return cached instance
        }

        // Create new accelerator via factory
        accelerator = await _factory.CreateAcceleratorAsync(type);
        _accelerators[type] = accelerator;

        return accelerator;
    }

    public async Task<IReadOnlyList<AcceleratorType>> GetAvailableBackendsAsync()
    {
        // Avoid sync-over-async (.Result can deadlock); probe plugins asynchronously
        var available = new List<AcceleratorType>();

        foreach (var plugin in _plugins)
        {
            if (await plugin.IsAvailableAsync(CancellationToken.None))
            {
                available.Add(plugin.SupportedType);
            }
        }

        return available;
    }
}

Cross-Backend Validation

The debugging system can validate results across backends:

public async Task<ValidationResult> ValidateAcrossBackendsAsync(
    string kernelName,
    object parameters)
{
    // 1. Execute on primary backend (e.g., GPU)
    var primaryResult = await ExecuteOnBackendAsync(
        kernelName,
        parameters,
        backend: AcceleratorType.CUDA
    );

    // 2. Execute on reference backend (always CPU)
    var referenceResult = await ExecuteOnBackendAsync(
        kernelName,
        parameters,
        backend: AcceleratorType.CPU
    );

    // 3. Compare results within tolerance
    var differences = CompareResults(primaryResult, referenceResult, tolerance: 1e-5);

    // 4. Log discrepancies
    if (differences.Any())
    {
        logger.LogWarning(
            "Cross-backend validation found {Count} differences for kernel {Kernel}",
            differences.Count,
            kernelName
        );
    }

    return new ValidationResult
    {
        IsValid = !differences.Any(),
        Differences = differences,
        PrimaryBackend = AcceleratorType.CUDA,
        ReferenceBackend = AcceleratorType.CPU
    };
}

Plugin Loading and Isolation

Plugin Discovery

public class PluginManager : IPluginManager
{
    private const string PluginDirectory = "plugins";

    public async Task<IReadOnlyList<IBackendPlugin>> DiscoverPluginsAsync()
    {
        var plugins = new List<IBackendPlugin>();

        // 1. Scan plugin directory for manifests
        var manifestFiles = Directory.GetFiles(PluginDirectory, "plugin.json", SearchOption.AllDirectories);

        foreach (var manifestFile in manifestFiles)
        {
            // 2. Load and validate manifest
            var manifest = await LoadManifestAsync(manifestFile);

            if (!ValidateManifest(manifest))
            {
                logger.LogWarning("Invalid plugin manifest: {Path}", manifestFile);
                continue;
            }

            // 3. Load plugin assembly
            var plugin = await LoadPluginAsync(manifest);
            plugins.Add(plugin);
        }

        return plugins;
    }
}
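
The manifest file format itself is not defined in this section; a plausible shape, shown as the record the loader might bind plugin.json to (field names mirror PluginMetadata and the AssemblyPath used by the loader):

// Hypothetical manifest shape, purely illustrative
public sealed record PluginManifest(
    string Name,          // e.g. "DotCompute.Backends.CUDA"
    string Version,       // e.g. "1.0.0"
    string Author,
    string AssemblyPath,  // e.g. "DotCompute.Backends.CUDA.dll"
    AcceleratorType SupportedType);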

Assembly Isolation

Each plugin runs in an isolated AssemblyLoadContext:

public class PluginLoader
{
    public async Task<IBackendPlugin> LoadPluginAsync(PluginManifest manifest)
    {
        // 1. Create isolated load context
        var loadContext = new PluginLoadContext(manifest.AssemblyPath);

        // 2. Load plugin assembly
        var assembly = loadContext.LoadFromAssemblyPath(manifest.AssemblyPath);

        // 3. Find the concrete plugin type (skip abstract bases)
        var pluginType = assembly.GetTypes()
            .FirstOrDefault(t => typeof(IBackendPlugin).IsAssignableFrom(t) && !t.IsAbstract);

        if (pluginType == null)
        {
            throw new InvalidOperationException(
                $"Plugin assembly {manifest.AssemblyPath} does not contain IBackendPlugin implementation"
            );
        }

        // 4. Create plugin instance
        var plugin = (IBackendPlugin)Activator.CreateInstance(pluginType)!;

        // 5. Initialize plugin
        await plugin.InitializeAsync(_serviceProvider, CancellationToken.None);

        return plugin;
    }
}

internal class PluginLoadContext : AssemblyLoadContext
{
    private readonly AssemblyDependencyResolver _resolver;

    public PluginLoadContext(string pluginPath) : base(isCollectible: true)
    {
        _resolver = new AssemblyDependencyResolver(pluginPath);
    }

    protected override Assembly? Load(AssemblyName assemblyName)
    {
        // Resolve dependencies relative to plugin directory
        string? assemblyPath = _resolver.ResolveAssemblyToPath(assemblyName);
        return assemblyPath != null ? LoadFromAssemblyPath(assemblyPath) : null;
    }
}

Benefits of Isolation:

  • Hot-reload: Unload and reload plugins without restarting the host (see the sketch below)
  • Versioning: Each plugin resolves its own dependency versions without conflicting with the host
  • Security: Plugin access to host resources can be restricted
  • Stability: A misbehaving plugin can be unloaded without tearing down the host process
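
Hot-reload relies on the collectible load context; a minimal unload sketch, using the standard WeakReference pattern to confirm the context was actually collected:

// Unload a plugin and verify its AssemblyLoadContext can be collected.
// NoInlining ensures no stray references survive in the caller's frame.
[MethodImpl(MethodImplOptions.NoInlining)]
static WeakReference UnloadPlugin(PluginLoadContext context, IBackendPlugin plugin)
{
    plugin.DisposeAsync().AsTask().GetAwaiter().GetResult();
    var weakRef = new WeakReference(context);
    context.Unload();
    return weakRef;
}

// Caller: the context is only reclaimed once no references remain
var weakRef = UnloadPlugin(loadContext, plugin);
for (int i = 0; weakRef.IsAlive && i < 10; i++)
{
    GC.Collect();
    GC.WaitForPendingFinalizers();
}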

Backend Capability Detection

Runtime Feature Detection

public class AcceleratorCapabilities
{
    /// <summary>
    /// Maximum work-group size for compute kernels
    /// </summary>
    public int MaxWorkGroupSize { get; init; }

    /// <summary>
    /// Maximum number of threads per block/work-group
    /// </summary>
    public int MaxThreadsPerBlock { get; init; }

    /// <summary>
    /// Maximum shared memory per work-group (bytes)
    /// </summary>
    public long MaxSharedMemoryPerWorkGroup { get; init; }

    /// <summary>
    /// Maximum global memory (bytes)
    /// </summary>
    public long MaxGlobalMemory { get; init; }

    /// <summary>
    /// Supports double-precision floating point
    /// </summary>
    public bool SupportsDouble { get; init; }

    /// <summary>
    /// Supports half-precision floating point
    /// </summary>
    public bool SupportsHalf { get; init; }

    /// <summary>
    /// Supports atomic operations
    /// </summary>
    public bool SupportsAtomics { get; init; }

    /// <summary>
    /// Supports unified memory (CPU/GPU shared)
    /// </summary>
    public bool SupportsUnifiedMemory { get; init; }

    /// <summary>
    /// Backend-specific extended capabilities
    /// </summary>
    public Dictionary<string, object> ExtendedCapabilities { get; init; } = new();
}
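
Capabilities drive code-path selection at runtime, e.g. preferring half precision only where supported (the kernel variant names and the kernelRequiresDouble flag are hypothetical):

// Select a kernel variant from reported capabilities
var caps = accelerator.Capabilities;

string kernelName = caps.SupportsHalf
    ? "VectorAdd_Half"   // hypothetical half-precision variant
    : "VectorAdd_Float";

if (!caps.SupportsDouble && kernelRequiresDouble)
{
    throw new NotSupportedException($"{accelerator.DeviceName} lacks double-precision support");
}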

CUDA-Specific Capabilities

public static class CudaCapabilities
{
    public static AcceleratorCapabilities GetCapabilities(int deviceId = 0)
    {
        // Query device properties
        cuDeviceGetAttribute(out int maxThreadsPerBlock,
            DeviceAttribute.MaxThreadsPerBlock, deviceId);
        cuDeviceGetAttribute(out int sharedMemPerBlock,
            DeviceAttribute.MaxSharedMemoryPerBlock, deviceId);
        cuDeviceGetAttribute(out int computeCapabilityMajor,
            DeviceAttribute.ComputeCapabilityMajor, deviceId);
        cuDeviceGetAttribute(out int computeCapabilityMinor,
            DeviceAttribute.ComputeCapabilityMinor, deviceId);

        var computeCapability = new Version(computeCapabilityMajor, computeCapabilityMinor);

        return new AcceleratorCapabilities
        {
            MaxThreadsPerBlock = maxThreadsPerBlock,
            MaxSharedMemoryPerWorkGroup = sharedMemPerBlock,
            SupportsDouble = computeCapability >= new Version(1, 3),
            SupportsHalf = computeCapability >= new Version(5, 3),
            SupportsAtomics = computeCapability >= new Version(2, 0),
            SupportsUnifiedMemory = computeCapability >= new Version(3, 0),
            ExtendedCapabilities = new Dictionary<string, object>
            {
                ["ComputeCapability"] = computeCapability,
                ["TensorCores"] = computeCapability >= new Version(7, 0),
                ["CooperativeGroups"] = computeCapability >= new Version(6, 0)
            }
        };
    }
}

Metal-Specific Capabilities

public static class MetalCapabilities
{
    public static AcceleratorCapabilities GetCapabilities(MTLDevice device)
    {
        return new AcceleratorCapabilities
        {
            MaxThreadsPerBlock = (int)device.MaxThreadsPerThreadgroup.Width,
            MaxSharedMemoryPerWorkGroup = (long)device.MaxThreadgroupMemoryLength,
            MaxGlobalMemory = (long)device.RecommendedMaxWorkingSetSize,
            SupportsDouble = false, // Metal doesn't support double precision
            SupportsHalf = true,
            SupportsAtomics = true,
            SupportsUnifiedMemory = true, // Always true on Apple Silicon
            ExtendedCapabilities = new Dictionary<string, object>
            {
                ["GPUFamily"] = GetGPUFamily(device),
                ["Architecture"] = device.Architecture,
                ["UnifiedMemoryArchitecture"] = true
            }
        };
    }

    private static string GetGPUFamily(MTLDevice device)
    {
        // Detect Apple Silicon generation
        if (device.SupportsFamily(MTLGPUFamily.Apple9))
            return "Apple9"; // M3
        if (device.SupportsFamily(MTLGPUFamily.Apple8))
            return "Apple8"; // M2
        if (device.SupportsFamily(MTLGPUFamily.Apple7))
            return "Apple7"; // M1

        return "Unknown";
    }
}

Source Generator Integration

Backend-Specific Code Generation

The source generator creates optimized code for each backend:

// User writes this:
[Kernel]
public static void VectorAdd(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> result)
{
    int idx = Kernel.ThreadId.X;
    if (idx < result.Length)
    {
        result[idx] = a[idx] + b[idx];
    }
}

// Generator creates CPU SIMD version:
private static void VectorAdd_CPU_SIMD(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> result)
{
    int vectorSize = Vector<float>.Count;
    int i = 0;

    // Vectorized loop
    for (; i <= result.Length - vectorSize; i += vectorSize)
    {
        var va = new Vector<float>(a.Slice(i));
        var vb = new Vector<float>(b.Slice(i));
        (va + vb).CopyTo(result.Slice(i));
    }

    // Scalar remainder
    for (; i < result.Length; i++)
    {
        result[i] = a[i] + b[i];
    }
}

// Generator creates CUDA version:
private const string VectorAdd_CUDA_Source = @"
extern ""C"" __global__ void VectorAdd(
    const float* a,
    const float* b,
    float* result,
    int length)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < length)
    {
        result[idx] = a[idx] + b[idx];
    }
}
";

// Generator creates Metal version:
private const string VectorAdd_Metal_Source = @"
#include <metal_stdlib>
using namespace metal;

kernel void VectorAdd(
    device const float* a [[buffer(0)]],
    device const float* b [[buffer(1)]],
    device float* result [[buffer(2)]],
    constant int& length [[buffer(3)]],
    uint idx [[thread_position_in_grid]])
{
    if (idx < length)
    {
        result[idx] = a[idx] + b[idx];
    }
}
";

Registration Generation

The source generator also creates registration code:

// Generated registration class
public static class GeneratedKernels
{
    public static void Register(IKernelRegistry registry)
    {
        registry.RegisterKernel(new KernelMetadata
        {
            Name = "VectorAdd",
            Namespace = "MyNamespace",
            DeclaringType = "MyClass",
            Parameters = new[]
            {
                new ParameterMetadata { Name = "a", Type = typeof(ReadOnlySpan<float>) },
                new ParameterMetadata { Name = "b", Type = typeof(ReadOnlySpan<float>) },
                new ParameterMetadata { Name = "result", Type = typeof(Span<float>) }
            },
            Backends = KernelBackends.CPU | KernelBackends.CUDA | KernelBackends.Metal,
            IsParallel = true,

            // CPU implementation
            CpuImplementation = (args) => VectorAdd_CPU_SIMD(
                args.Get<ReadOnlySpan<float>>("a"),
                args.Get<ReadOnlySpan<float>>("b"),
                args.Get<Span<float>>("result")
            ),

            // GPU source code
            CudaSource = VectorAdd_CUDA_Source,
            MetalSource = VectorAdd_Metal_Source,

            // Compiler options
            CompilerOptions = new CompilationOptions
            {
                OptimizationLevel = OptimizationLevel.Aggressive,
                GenerateDebugInfo = false,
                EnableFastMath = true
            }
        });
    }
}

Performance Characteristics

Backend Overhead

Backend   Initialization   Compilation   Execution Overhead   Notes
CPU       < 1ms            N/A (JIT)     < 10μs               Zero-copy, no transfers
CUDA      ~100ms           50-200ms      ~50μs                Device init, NVRTC compile
Metal     ~50ms            20-100ms      ~30μs                Device init, MSL compile
OpenCL    ~150ms           100-300ms     ~80μs                Platform detection, compile

Notes:

  • Initialization is one-time per application lifetime
  • Compilation is cached for subsequent executions (see the cache sketch below)
  • Execution overhead is per-kernel launch
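
A minimal sketch of such a compilation cache, assuming CompilationOptions implements value equality so it can serve as part of the key:

// Cache compiled kernels; caching the Task de-duplicates concurrent compilations
private readonly ConcurrentDictionary<(string Name, CompilationOptions Options), Task<ICompiledKernel>> _cache = new();

public Task<ICompiledKernel> GetOrCompileAsync(KernelDefinition definition, CompilationOptions options)
{
    return _cache.GetOrAdd(
        (definition.Name, options),
        _ => accelerator.CompileKernelAsync(definition, options));
}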

Backend Selection Overhead

  • Rule-based selection: < 10μs
  • ML-based selection: ~100μs (inference time)
  • Device enumeration: ~1ms (cached)

Plugin System Overhead

  • Plugin loading: ~10-50ms per plugin (one-time)
  • Hot-reload: ~20-100ms (unload + reload)
  • Isolated execution: < 5% performance impact vs direct calls

Testing Strategy

Backend Tests

Each backend has a comprehensive test suite:

[Fact]
public async Task CompileKernel_ValidDefinition_CompilesToExecutable()
{
    // Arrange
    var accelerator = await CreateAcceleratorAsync();
    var definition = new KernelDefinition
    {
        Name = "SimpleAdd",
        Source = "/* kernel source */",
        EntryPoint = "SimpleAdd"
    };

    // Act
    var compiled = await accelerator.CompileKernelAsync(definition, CompilationOptions.Default);

    // Assert
    Assert.NotNull(compiled);
    Assert.Equal("SimpleAdd", compiled.EntryPoint);
}

[SkippableFact] // Skip if hardware not available
public async Task Execute_OnGpu_ProducesCorrectResults()
{
    Skip.IfNot(await IsGpuAvailableAsync(), "GPU not available");

    // Arrange
    var accelerator = await CreateAcceleratorAsync();
    var input = Enumerable.Range(0, 1000).Select(i => (float)i).ToArray();

    // Act
    var result = await ExecuteKernelAsync(accelerator, input);

    // Assert
    Assert.Equal(input.Select(x => x * 2), result);
}

Cross-Backend Validation Tests

[Theory]
[InlineData(AcceleratorType.CPU, AcceleratorType.CUDA)]
[InlineData(AcceleratorType.CPU, AcceleratorType.Metal)]
public async Task Execute_AcrossBackends_ProducesSameResults(
    AcceleratorType reference,
    AcceleratorType target)
{
    // Arrange
    var input = GenerateTestData(10_000);

    // Act
    var referenceResult = await ExecuteOnBackendAsync(reference, input);
    var targetResult = await ExecuteOnBackendAsync(target, input);

    // Assert
    AssertResultsEqual(referenceResult, targetResult, tolerance: 1e-5f);
}