Backend Selection

This module covers DotCompute's backend system and how to target specific hardware for optimal performance.

Available Backends

DotCompute supports multiple compute backends:

| Backend | Vendor       | Status          | Use Case                                  |
|---------|--------------|-----------------|-------------------------------------------|
| CUDA    | NVIDIA       | ✅ Production    | Production GPU computing                  |
| CPU     | All          | ✅ Production    | Development, fallback, SIMD optimization  |
| Metal   | Apple        | ⚠️ Experimental | macOS/iOS applications                    |
| OpenCL  | Cross-vendor | ⚠️ Experimental | AMD, Intel, cross-platform                |

Automatic Backend Selection

By default, DotCompute selects the best available backend automatically:

using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using DotCompute.Runtime;
using DotCompute.Abstractions.Interfaces;
using DotCompute.Abstractions.Factories;

// Build host with DotCompute services
var host = Host.CreateApplicationBuilder(args);
host.Services.AddDotComputeRuntime();  // Registers all services
var app = host.Build();

// Get orchestrator - handles automatic backend selection
var orchestrator = app.Services.GetRequiredService<IComputeOrchestrator>();

// Execute kernel (orchestrator selects best backend)
await orchestrator.ExecuteKernelAsync(
    kernelName: "VectorAdd",
    args: new object[] { a, b, result }
);

Selection priority (see the sketch after this list):

  1. CUDA (if NVIDIA GPU available)
  2. Metal (if Apple Silicon available)
  3. OpenCL (if supported GPU available)
  4. CPU (always available)
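
To see which device this priority would pick on your machine, you can mirror it with the device enumeration covered below. A minimal sketch, assuming the "Metal" and "CPU" DeviceType strings follow the same naming as the "CUDA" and "OpenCL" strings used elsewhere in this module:

var factory = app.Services.GetRequiredService<IUnifiedAcceleratorFactory>();
var devices = await factory.GetAvailableDevicesAsync();

// Mirror the automatic priority: CUDA > Metal > OpenCL > CPU.
// The "Metal" and "CPU" type strings are assumptions; only "CUDA" and
// "OpenCL" appear verbatim in the examples in this module.
var preferred = devices.FirstOrDefault(d => d.DeviceType == "CUDA")
             ?? devices.FirstOrDefault(d => d.DeviceType == "Metal")
             ?? devices.FirstOrDefault(d => d.DeviceType == "OpenCL")
             ?? devices.First();  // CPU device is always present as the fallback

Console.WriteLine($"Automatic selection would favor: {preferred.Name} ({preferred.DeviceType})");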

Device Discovery

Use IUnifiedAcceleratorFactory to enumerate available devices:

var factory = app.Services.GetRequiredService<IUnifiedAcceleratorFactory>();

// Get all available devices
var devices = await factory.GetAvailableDevicesAsync();

Console.WriteLine($"Found {devices.Count} device(s):");
foreach (var device in devices)
{
    Console.WriteLine($"  - {device.Name} ({device.DeviceType})");
    Console.WriteLine($"    Memory: {device.TotalMemory / (1024.0 * 1024 * 1024):F2} GB");
    Console.WriteLine($"    Compute Units: {device.MaxComputeUnits}");
    Console.WriteLine();
}

Manual Backend Selection

Specify Preferred Backend

You can request a specific backend when executing kernels:

var orchestrator = app.Services.GetRequiredService<IComputeOrchestrator>();

// Request CUDA backend explicitly
await orchestrator.ExecuteAsync<object>(
    "VectorAdd",
    "CUDA",  // Preferred backend
    a, b, result
);

// Request CPU backend (useful for debugging)
await orchestrator.ExecuteAsync<object>(
    "VectorAdd",
    "CPU",
    a, b, result
);

Create Accelerator for Specific Device

For more control, create an accelerator for a specific device:

var factory = app.Services.GetRequiredService<IUnifiedAcceleratorFactory>();
var orchestrator = app.Services.GetRequiredService<IComputeOrchestrator>();
var devices = await factory.GetAvailableDevicesAsync();

// Find CUDA device
var cudaDevice = devices.FirstOrDefault(d => d.DeviceType == "CUDA");

if (cudaDevice != null)
{
    // Create accelerator for this specific device
    using var accelerator = await factory.CreateAsync(cudaDevice);
    Console.WriteLine($"Using: {cudaDevice.Name}");

    // Execute with specific accelerator
    await orchestrator.ExecuteAsync<object>(
        "VectorAdd",
        accelerator,
        a, b, result
    );
}

Backend Fallback Chain

Implement your own fallback logic:

var factory = app.Services.GetRequiredService<IUnifiedAcceleratorFactory>();
var devices = await factory.GetAvailableDevicesAsync();

// Priority order: CUDA > OpenCL > CPU
var device = devices.FirstOrDefault(d => d.DeviceType == "CUDA")
          ?? devices.FirstOrDefault(d => d.DeviceType == "OpenCL")
          ?? devices.First();  // CPU is always available

Console.WriteLine($"Selected device: {device.Name} ({device.DeviceType})");

using var accelerator = await factory.CreateAsync(device);

Querying Device Capabilities

var factory = app.Services.GetRequiredService<IUnifiedAcceleratorFactory>();
var devices = await factory.GetAvailableDevicesAsync();

foreach (var device in devices)
{
    Console.WriteLine($"Device: {device.Name}");
    Console.WriteLine($"  Type: {device.DeviceType}");
    Console.WriteLine($"  Memory: {device.TotalMemory / (1024*1024*1024.0):F1} GB");
    Console.WriteLine($"  Compute Units: {device.MaxComputeUnits}");
    Console.WriteLine();
}

Troubleshooting

CUDA Not Detected

# Check NVIDIA driver
nvidia-smi

# Check CUDA toolkit
nvcc --version

# WSL2: Set library path (CRITICAL for Windows developers)
export LD_LIBRARY_PATH="/usr/lib/wsl/lib:$LD_LIBRARY_PATH"
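
If the commands above look healthy but kernels still fall back to the CPU, you can confirm from code whether DotCompute actually discovered a CUDA device. A minimal sketch using the device enumeration shown earlier (swap the "CUDA" string for "OpenCL" to diagnose that backend the same way):

var factory = app.Services.GetRequiredService<IUnifiedAcceleratorFactory>();
var devices = await factory.GetAvailableDevicesAsync();

// If no device reports DeviceType == "CUDA", the runtime will fall back to another backend
var cuda = devices.FirstOrDefault(d => d.DeviceType == "CUDA");
Console.WriteLine(cuda != null
    ? $"CUDA detected: {cuda.Name}"
    : "CUDA not detected - check the driver, toolkit, and LD_LIBRARY_PATH");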

OpenCL Not Detected

# List OpenCL platforms (Linux)
clinfo

# Install OpenCL runtime (Intel)
sudo apt install intel-opencl-icd

Metal Not Available

Metal requires macOS 10.14+ or iOS 12+. Verify in code:

// Metal requires macOS 10.14 or newer
if (OperatingSystem.IsMacOSVersionAtLeast(10, 14))
{
    // Metal should be available
}

Performance Comparison

Typical speedups relative to a scalar (non-SIMD) CPU baseline:

| Operation     | CPU (SIMD) | CUDA | Metal | OpenCL |
|---------------|------------|------|-------|--------|
| VectorAdd 1M  | 3.7x       | 92x  | 75x   | 60x    |
| MatMul 1K×1K  | 4x         | 85x  | 70x   | 55x    |
| Reduction 10M | 3x         | 45x  | 40x   | 35x    |

Note: Actual performance depends on hardware and workload.

Exercises

Exercise 1: Backend Enumeration

List all available backends and their properties on your system.

using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.DependencyInjection;
using DotCompute.Runtime;
using DotCompute.Abstractions.Factories;

var host = Host.CreateApplicationBuilder(args);
host.Services.AddDotComputeRuntime();
var app = host.Build();

var factory = app.Services.GetRequiredService<IUnifiedAcceleratorFactory>();
var devices = await factory.GetAvailableDevicesAsync();

foreach (var device in devices)
{
    Console.WriteLine($"{device.DeviceType}: {device.Name}");
    Console.WriteLine($"  Memory: {device.TotalMemory / 1e9:F2} GB");
}

Exercise 2: Performance Comparison

Run the same kernel on each available backend and compare execution times.
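
A possible starting point, assuming the VectorAdd kernel, the a/b/result buffers, and the orchestrator set up earlier in this module, and using the string-based ExecuteAsync overload from Manual Backend Selection. The warm-up run keeps one-time kernel compilation out of the measurement; only list backends that device discovery reports on your machine:

using System.Diagnostics;

var backends = new[] { "CPU", "CUDA" };  // extend with "OpenCL"/"Metal" if discovery reports them

foreach (var backend in backends)
{
    // Warm-up run so one-time compilation/initialization is not measured
    await orchestrator.ExecuteAsync<object>("VectorAdd", backend, a, b, result);

    var sw = Stopwatch.StartNew();
    await orchestrator.ExecuteAsync<object>("VectorAdd", backend, a, b, result);
    sw.Stop();

    Console.WriteLine($"{backend}: {sw.Elapsed.TotalMilliseconds:F2} ms");
}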

Exercise 3: Fallback Chain

Implement a solution that uses CUDA if available, falls back to CPU otherwise.

Key Takeaways

  1. DotCompute auto-selects the best available backend
  2. Use IComputeOrchestrator for kernel execution with automatic selection
  3. Use IUnifiedAcceleratorFactory for device discovery
  4. CUDA provides highest performance on NVIDIA hardware (Production)
  5. CPU backend is always available for development and fallback (Production)
  6. Metal and OpenCL are experimental - use for testing only

Path Complete

Congratulations! You've completed the Beginner Learning Path.

What you learned:

  • GPU computing fundamentals
  • Writing kernels with the [Kernel] attribute
  • Memory management and transfers
  • Backend selection and configuration

Next steps: