Backend Selection Guide
This guide helps you understand DotCompute's backend selection system and how to choose the optimal backend for your workload.
Available Backends
CPU Backend (Always Available)
Status: ✅ Production Ready
Capabilities:
- SIMD vectorization (AVX-512, AVX2, SSE4.2, NEON)
- Multi-threaded execution via Parallel.For
- Zero-copy operations with Span<T>
- Hardware capability auto-detection
Performance: 8-23x speedup on vectorizable operations vs scalar code
When to Use:
- Small data (< 10,000 elements)
- Memory-bound operations (low compute intensity)
- Sequential memory access patterns
- No GPU available
- High data locality (fits in CPU cache)
Example:
services.AddDotComputeRuntime(options =>
{
options.DefaultAccelerator = AcceleratorType.CPU;
});
CUDA Backend (NVIDIA GPUs)
Status: ✅ Production Ready
Requirements:
- NVIDIA GPU with Compute Capability 5.0+
- CUDA Toolkit 12.0+
- Windows, Linux, or WSL2
Capabilities:
- NVRTC runtime compilation
- PTX and CUBIN generation
- Unified memory support
- P2P GPU-to-GPU transfers (NVLink)
- Compute Capability 5.0-8.9 support
Performance: 21-92x speedup vs CPU (measured on RTX 2000 Ada)
When to Use:
- Large data (> 1M elements)
- Highly parallel workloads
- Compute-intensive operations
- Matrix operations
- Deep learning inference
Example:
services.AddDotComputeRuntime(options =>
{
options.DefaultAccelerator = AcceleratorType.CUDA;
});
Metal Backend (Apple Silicon)
Status: ✅ Production Ready
Requirements:
- Apple Silicon Mac (M1/M2/M3)
- macOS 11.0+ (Big Sur)
Capabilities:
- Metal Performance Shaders (MPS): Batch normalization, max pooling 2D
- Advanced memory pooling: 90% allocation reduction (power-of-2 buckets)
- MTLBinaryArchive: Kernel binary caching (macOS 11.0+)
- Unified memory architecture
- GPU family auto-detection (Apple7/8/9)
- Command buffer management
- Zero-copy CPU-GPU access
Performance: 37-141x speedup vs CPU, 2-3x unified memory advantage
When to Use:
- Running on Apple Silicon Macs
- Applications requiring unified memory
- Video processing, image operations
- Metal-optimized workloads
Example:
services.AddDotComputeRuntime(options =>
{
options.DefaultAccelerator = AcceleratorType.Metal;
});
OpenCL Backend (Cross-Platform)
Status: 🚧 Foundation Complete
Requirements:
- OpenCL 1.2+ compatible GPU
- OpenCL runtime installed
Capabilities:
- Platform and device enumeration
- Kernel compilation
- Memory management
When to Use:
- AMD GPUs
- Intel integrated graphics
- Cross-platform GPU code
- Legacy GPU support
Example:
services.AddDotComputeRuntime(options =>
{
options.DefaultAccelerator = AcceleratorType.OpenCL;
});
Automatic Backend Selection
Default Behavior
By default, DotCompute automatically selects the best backend based on workload characteristics:
// Automatic selection (default)
services.AddDotComputeRuntime(); // No options needed
await orchestrator.ExecuteKernelAsync("MyKernel", parameters);
// Backend chosen automatically based on data size, parallelism, etc.
Selection Criteria
The automatic selector considers:
Data Size
- < 10,000 elements → CPU (no transfer overhead)
- 10,000-1M elements → GPU or CPU (depends on other factors)
- > 1M elements → GPU (if available)
Compute Intensity
- Low (< 10 ops/byte) → CPU (memory-bound)
- Medium (10-100 ops/byte) → GPU or CPU
- High (> 100 ops/byte) → GPU (compute-bound)
Parallelism Potential
- Low (< 100 tasks) → CPU
- Medium (100-10K tasks) → GPU or CPU
- High (> 10K tasks) → GPU
Memory Access Pattern
- Sequential → CPU (cache advantage)
- Random → GPU (higher bandwidth)
- Strided → Depends on stride size
Device Availability
- No GPU → CPU
- NVIDIA GPU → CUDA
- Apple Silicon → Metal
- AMD GPU → OpenCL
Selection Flow
Kernel Execution Request
↓
Is data < 10,000 elements?
Yes → CPU
No → Continue
↓
Is compute intensity low?
Yes → CPU
No → Continue
↓
Is parallelism high?
Yes → GPU (if available)
No → Continue
↓
Is GPU available?
Yes → GPU
No → CPU
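This flow can be approximated in a few lines. The sketch below is illustrative only — it mirrors the thresholds listed in this guide rather than DotCompute's actual selector, and SelectBackend is a hypothetical helper (the real selector picks whichever GPU backend is present, not necessarily CUDA):
static AcceleratorType SelectBackend(long elementCount, double opsPerByte, long parallelTasks, bool gpuAvailable)
{
    if (elementCount < 10_000)
        return AcceleratorType.CPU; // Small data: transfer overhead dominates
    if (opsPerByte < 10)
        return AcceleratorType.CPU; // Memory-bound: low compute intensity
    if (gpuAvailable && parallelTasks > 10_000)
        return AcceleratorType.CUDA; // High parallelism: GPU wins when available
    return gpuAvailable ? AcceleratorType.CUDA : AcceleratorType.CPU;
}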
Manual Backend Selection
Force Specific Backend
// Force CUDA
await orchestrator.ExecuteKernelAsync(
"MyKernel",
parameters,
forceBackend: AcceleratorType.CUDA
);
// Force CPU
await orchestrator.ExecuteKernelAsync(
"MyKernel",
parameters,
forceBackend: AcceleratorType.CPU
);
// Force Metal
await orchestrator.ExecuteKernelAsync(
"MyKernel",
parameters,
forceBackend: AcceleratorType.Metal
);
Set Default Backend
services.AddDotComputeRuntime(options =>
{
options.DefaultAccelerator = AcceleratorType.CUDA;
options.EnableAutoOptimization = false; // Disable automatic selection
});
Check Backend Availability
var acceleratorManager = services.GetRequiredService<IAcceleratorManager>();
if (acceleratorManager.IsAvailable(AcceleratorType.CUDA))
{
Console.WriteLine("CUDA is available");
}
var availableBackends = acceleratorManager.GetAvailableBackends();
foreach (var backend in availableBackends)
{
Console.WriteLine($"Available: {backend}");
}
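A common pattern combines the availability check with a manual override — prefer CUDA when present, fall back to CPU otherwise:
var preferredBackend = acceleratorManager.IsAvailable(AcceleratorType.CUDA)
    ? AcceleratorType.CUDA
    : AcceleratorType.CPU; // Safe fallback when no NVIDIA GPU is detected
await orchestrator.ExecuteKernelAsync("MyKernel", parameters, forceBackend: preferredBackend);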
Optimization Profiles
Conservative Profile
Goal: Safety over performance
services.AddProductionOptimization(options =>
{
options.Profile = OptimizationProfile.Conservative;
});
Behavior:
- Prefers CPU for ambiguous cases
- Only uses GPU for clear performance wins
- Minimal risk of sub-optimal selection
Use When:
- Production systems with strict SLAs
- Unknown workload patterns
- Prioritizing reliability
Balanced Profile (Default)
Goal: Balance performance and reliability
services.AddProductionOptimization(options =>
{
options.Profile = OptimizationProfile.Balanced; // Default
});
Behavior:
- Uses heuristics with historical data
- Falls back to safe defaults when uncertain
- 70-80% optimal backend selection
Use When:
- General-purpose applications
- Mixed workload patterns
- Good default for most users
Aggressive Profile
Goal: Maximum performance
services.AddProductionOptimization(options =>
{
options.Profile = OptimizationProfile.Aggressive;
});
Behavior:
- Prefers GPU for most workloads
- Uses ML model when available
- Accepts occasional sub-optimal selections for higher peak performance
Use When:
- Performance-critical applications
- Known GPU-friendly workloads
- Can tolerate occasional slower execution
ML-Optimized Profile
Goal: Learn optimal selection from execution patterns
services.AddProductionOptimization(options =>
{
options.Profile = OptimizationProfile.MLOptimized;
options.EnableMachineLearning = true;
options.EnablePerformanceLearning = true;
});
Behavior:
- Learns from execution history
- Improves selection over time
- 85-95% optimal after learning period
Use When:
- Long-running applications
- Repetitive workload patterns
- Can tolerate initial learning phase
Performance Improvement: 10-30% average speedup after 1,000+ executions
Workload Analysis
Understanding Your Workload
Use the profiler to understand workload characteristics:
var debugService = services.GetRequiredService<IKernelDebugService>();
// Profile on CPU
var cpuProfile = await debugService.ProfileKernelAsync(
"MyKernel",
parameters,
AcceleratorType.CPU,
iterations: 100
);
// Profile on GPU
var gpuProfile = await debugService.ProfileKernelAsync(
"MyKernel",
parameters,
AcceleratorType.CUDA,
iterations: 100
);
// Compare
Console.WriteLine($"CPU: {cpuProfile.AverageTime.TotalMilliseconds:F2}ms");
Console.WriteLine($"GPU: {gpuProfile.AverageTime.TotalMilliseconds:F2}ms");
Console.WriteLine($"Speedup: {cpuProfile.AverageTime.TotalMilliseconds / gpuProfile.AverageTime.TotalMilliseconds:F2}x");
Compute Intensity Estimation
// Low intensity (memory-bound): Use CPU
// Example: Vector addition (2 reads, 1 write, 1 add = 1 op / 12 bytes ≈ 0.08 ops/byte)
[Kernel]
public static void VectorAdd(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> result)
{
int idx = Kernel.ThreadId.X;
if (idx < result.Length)
{
result[idx] = a[idx] + b[idx]; // Low compute intensity
}
}
// High intensity (compute-bound): Use GPU
// Example: Matrix multiplication (K multiply-adds per output element)
[Kernel]
public static void MatrixMultiply(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> c, int N, int K)
{
    int idx = Kernel.ThreadId.X;
    if (idx < c.Length)
    {
        int row = idx / N; // a is rows×K, b is K×N, c is rows×N
        int col = idx % N;
        float sum = 0;
        for (int k = 0; k < K; k++) // K multiply-adds per output element
        {
            sum += a[row * K + k] * b[k * N + col];
        }
        c[idx] = sum; // High compute intensity (K ops / 8 bytes ≈ K/8 ops/byte)
    }
}
Memory Access Pattern Analysis
// Sequential access: CPU-friendly
[Kernel]
public static void Sequential(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
output[idx] = input[idx] * 2; // Sequential: CPU cache-friendly
}
}
// Random access: GPU-friendly (higher bandwidth)
[Kernel]
public static void Gather(ReadOnlySpan<float> input, ReadOnlySpan<int> indices, Span<float> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
output[idx] = input[indices[idx]]; // Random: GPU bandwidth helps
}
}
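The selection criteria above also list strided access, where the better backend depends on the stride size. A sketch in the same style as the kernels above:
// Strided access: benchmark both backends — large strides defeat CPU prefetching,
// while the GPU can partially hide the latency with massive thread parallelism
[Kernel]
public static void Strided(ReadOnlySpan<float> input, Span<float> output, int stride)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)
    {
        output[idx] = input[idx * stride] * 2; // Stride size determines cache behavior
    }
}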
Backend-Specific Optimizations
CUDA-Specific
// Check compute capability
var cudaAccelerator = await acceleratorManager.GetOrCreateAcceleratorAsync(AcceleratorType.CUDA);
var capabilities = cudaAccelerator.Capabilities;
Console.WriteLine($"Max threads per block: {capabilities.MaxThreadsPerBlock}");
Console.WriteLine($"Max shared memory: {capabilities.MaxSharedMemoryPerWorkGroup} bytes");
Console.WriteLine($"Supports double precision: {capabilities.SupportsDouble}");
// Use P2P transfers between GPUs
if (capabilities.ExtendedCapabilities.TryGetValue("P2PSupport", out var p2pSupport) && (bool)p2pSupport)
{
// Enable P2P transfers between GPUs
var p2pManager = services.GetRequiredService<P2PManager>();
await p2pManager.EnablePeerAccessAsync(deviceId1: 0, deviceId2: 1);
}
Metal-Specific
// Check GPU family
var metalAccelerator = await acceleratorManager.GetOrCreateAcceleratorAsync(AcceleratorType.Metal);
var capabilities = metalAccelerator.Capabilities;
if (capabilities.ExtendedCapabilities.TryGetValue("GPUFamily", out var family))
{
Console.WriteLine($"GPU Family: {family}"); // Apple7, Apple8, or Apple9
}
// Leverage unified memory
if (capabilities.SupportsUnifiedMemory)
{
// Use AllocationMode.Unified for zero-copy access
var buffer = await memoryManager.AllocateAsync<float>(
1_000_000,
AllocationMode.Unified
);
}
CPU-Specific
// Check SIMD support (requires using System.Numerics; and using System.Runtime.Intrinsics.X86;)
var cpuAccelerator = await acceleratorManager.GetOrCreateAcceleratorAsync(AcceleratorType.CPU);
var capabilities = cpuAccelerator.Capabilities;
// Determine available instruction sets
if (Vector.IsHardwareAccelerated)
{
Console.WriteLine($"SIMD vector size: {Vector<float>.Count}");
}
if (Avx512F.IsSupported)
{
Console.WriteLine("AVX-512 available (512-bit vectors)");
}
else if (Avx2.IsSupported)
{
Console.WriteLine("AVX2 available (256-bit vectors)");
}
else if (Sse42.IsSupported)
{
Console.WriteLine("SSE4.2 available (128-bit vectors)");
}
Decision Tree
Should I Use CPU or GPU?
START
↓
Is data size < 10,000 elements?
Yes → USE CPU
No → Continue
↓
Is compute intensity < 10 ops/byte?
Yes → USE CPU
No → Continue
↓
Is memory access sequential?
Yes → BENCHMARK BOTH
No → Continue
↓
Is GPU available?
Yes → USE GPU
No → USE CPU
Real-World Examples
Use Case 1: Image Blur (3x3 kernel)
- Data Size: 1920×1080 = 2M pixels
- Compute Intensity: 9 operations per pixel = ~2 ops/byte
- Memory Access: Spatial locality (mostly sequential)
- Parallelism: Very high (2M independent operations)
- Recommendation: GPU (37-141x speedup measured)
Use Case 2: Vector Addition
- Data Size: 1M elements
- Compute Intensity: 1 operation per element = 0.08 ops/byte
- Memory Access: Sequential
- Parallelism: High
- Recommendation: CPU for small sizes, GPU for large sizes
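A back-of-envelope check (assuming ~12 GB/s effective PCIe bandwidth) shows why: adding 1M floats moves 8 MB of inputs to the GPU and 4 MB of results back — roughly 1 ms of transfer time — while a SIMD CPU loop finishes the same work in well under a millisecond. The transfers alone can cost more than the entire CPU computation.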
Use Case 3: Matrix Multiplication (512×512)
- Data Size: 512×512 = 262K elements
- Compute Intensity: 512 operations per element = 64 ops/byte
- Memory Access: Complex (both sequential and random)
- Parallelism: Very high
- Recommendation: GPU (21-92x speedup measured)
Benchmarking Guide
Comparing Backends
public static async Task BenchmarkBackends()
{
var orchestrator = GetService<IComputeOrchestrator>();
var parameters = GenerateTestData();
// Benchmark CPU
var cpuTime = await BenchmarkBackend(orchestrator, AcceleratorType.CPU, parameters);
// Benchmark CUDA
var cudaTime = await BenchmarkBackend(orchestrator, AcceleratorType.CUDA, parameters);
// Benchmark Metal
var metalTime = await BenchmarkBackend(orchestrator, AcceleratorType.Metal, parameters);
// Print results
Console.WriteLine($"CPU: {cpuTime:F2}ms");
Console.WriteLine($"CUDA: {cudaTime:F2}ms (speedup: {cpuTime / cudaTime:F2}x)");
Console.WriteLine($"Metal: {metalTime:F2}ms (speedup: {cpuTime / metalTime:F2}x)");
}
private static async Task<double> BenchmarkBackend(
IComputeOrchestrator orchestrator,
AcceleratorType backend,
object parameters)
{
// Warm-up
await orchestrator.ExecuteKernelAsync("MyKernel", parameters, forceBackend: backend);
// Benchmark
var stopwatch = Stopwatch.StartNew();
for (int i = 0; i < 100; i++)
{
await orchestrator.ExecuteKernelAsync("MyKernel", parameters, forceBackend: backend);
}
stopwatch.Stop();
return stopwatch.Elapsed.TotalMilliseconds / 100;
}
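Note that this harness assumes all three backends exist on the machine; on most systems at least one will be missing, so guard each call with IsAvailable (see Check Backend Availability above) before benchmarking that backend.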
Troubleshooting
GPU Not Being Used
Symptom: Kernels always run on CPU even though GPU is available
Causes:
- Data size too small (< 10,000 elements)
- EnableAutoOptimization = false with DefaultAccelerator = CPU
- GPU not detected
Solution:
// Check GPU availability
var manager = services.GetRequiredService<IAcceleratorManager>();
if (!manager.IsAvailable(AcceleratorType.CUDA))
{
Console.WriteLine("CUDA not available. Reasons:");
// - No NVIDIA GPU
// - CUDA Toolkit not installed
// - Driver version mismatch
}
// Force GPU usage
await orchestrator.ExecuteKernelAsync(
"MyKernel",
parameters,
forceBackend: AcceleratorType.CUDA
);
Slower on GPU Than CPU
Symptom: GPU execution is slower than CPU
Causes:
- Data transfer overhead dominates (small data)
- Low parallelism (few threads)
- Memory-bound operation
- First execution (compilation overhead)
Solution:
// Profile both backends
var debugService = GetService<IKernelDebugService>();
var cpuProfile = await debugService.ProfileKernelAsync(
"MyKernel", parameters, AcceleratorType.CPU, iterations: 100
);
var gpuProfile = await debugService.ProfileKernelAsync(
"MyKernel", parameters, AcceleratorType.CUDA, iterations: 100
);
Console.WriteLine($"CPU avg: {cpuProfile.AverageTime.TotalMilliseconds:F2}ms");
Console.WriteLine($"GPU avg: {gpuProfile.AverageTime.TotalMilliseconds:F2}ms");
Console.WriteLine($"GPU transfer overhead: {EstimateTransferTime(parameters)}ms");
Best Practices
✅ Do
- Trust automatic selection for most workloads
- Profile before forcing backends - measure, don't guess
- Use ML-optimized profile for long-running apps
- Check GPU availability before assuming GPU execution
- Benchmark with realistic data sizes - small test data may favor CPU
❌ Don't
- Don't force GPU for all workloads - CPU is faster for small data
- Don't ignore transfer overhead - factor in CPU→GPU→CPU time
- Don't benchmark first execution - compilation skews results
- Don't assume GPU is always faster - profile your specific workload
Further Reading
- Performance Tuning Guide - Optimize execution
- Architecture: Backend Integration - Technical details
- Architecture: Optimization Engine - Selection algorithms
- Multi-GPU Programming - Using multiple GPUs
Choose Wisely • Profile First • Trust the Optimizer