Class CudaAccelerator

Namespace
DotCompute.Backends.CUDA
Assembly
DotCompute.Backends.CUDA.dll

CUDA accelerator implementation providing high-performance GPU compute operations.

public sealed class CudaAccelerator : BaseAccelerator, IAccelerator, IAsyncDisposable
Inheritance
object → BaseAccelerator → CudaAccelerator

Implements
IAccelerator, IAsyncDisposable

Examples

// Create accelerator for default GPU device
using var accelerator = new CudaAccelerator();

// Get detailed device information
var deviceInfo = accelerator.GetDeviceInfo();
Console.WriteLine($"GPU: {deviceInfo.Name}, CC: {deviceInfo.ComputeCapability}");

// Compile and execute kernel
var kernel = await accelerator.CompileKernelAsync(kernelDef, options);
await accelerator.ExecuteKernelAsync(kernel, args);
await accelerator.SynchronizeAsync();

Remarks

The CUDA accelerator is the primary interface for executing compute kernels on NVIDIA GPUs. It provides automatic kernel compilation, memory management, and execution optimization.

Supported GPU Architectures:

  • Compute Capability 5.0+ (Maxwell through Ada Lovelace/Hopper)
  • RTX 2000 Ada Generation (Compute Capability 8.9) with optimizations
  • Automatic detection and optimization for specific GPU models

Key Features:

  • Zero-copy unified memory support
  • CUDA graph execution for optimized kernel launches (CC 10.0+)
  • Automatic kernel compilation with NVRTC
  • Device memory pooling for reduced allocation overhead

Constructors

CudaAccelerator(int, ILogger<CudaAccelerator>?)

Initializes a new CUDA accelerator instance for GPU compute operations.

public CudaAccelerator(int deviceId = 0, ILogger<CudaAccelerator>? logger = null)

Parameters

deviceId int

Zero-based index of the CUDA device to use (default: 0 for first GPU). Use cudaGetDeviceCount(out int) to enumerate available devices.

logger ILogger<CudaAccelerator>

Optional logger for diagnostics and performance monitoring. If null, a null logger is used (no logging output).

Examples

// Use default GPU (device 0)
using var gpu0 = new CudaAccelerator();

// Use second GPU with logging
var logger = loggerFactory.CreateLogger<CudaAccelerator>();
using var gpu1 = new CudaAccelerator(deviceId: 1, logger: logger);

Remarks

The accelerator automatically:

  • Detects GPU compute capability and optimizes accordingly
  • Initializes memory pools for efficient allocation
  • Creates CUDA graph manager if CC 10.0+ is detected
  • Sets up NVRTC compiler for runtime kernel compilation

Multi-GPU Systems:
Create separate accelerator instances for each GPU device to enable parallel execution.
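As a sketch of that multi-GPU pattern (here deviceCount stands in for whatever device enumeration you use, e.g. the cudaGetDeviceCount(out int) helper noted above, and the kernels/args arrays are assumed to be prepared per device):

```csharp
// One accelerator instance per GPU; each can execute kernels independently.
var accelerators = new List<CudaAccelerator>();
for (int id = 0; id < deviceCount; id++)
{
    accelerators.Add(new CudaAccelerator(deviceId: id));
}

// Run a kernel on every GPU in parallel, then wait for each device.
await Task.WhenAll(accelerators.Select(async (acc, i) =>
{
    await acc.ExecuteKernelAsync(kernels[i], args[i]);
    await acc.SynchronizeAsync();
}));

// Dispose each accelerator when done.
foreach (var acc in accelerators)
{
    await acc.DisposeAsync();
}
```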

Exceptions

InvalidOperationException

Thrown when CUDA initialization fails, device doesn't exist, or driver is incompatible.

Properties

Device

Gets the underlying CUDA device providing hardware information and capabilities.

public CudaDevice Device { get; }

Property Value

CudaDevice

The CUDA device instance for this accelerator.

DeviceId

Gets the CUDA device identifier used for multi-GPU scenarios.

public int DeviceId { get; }

Property Value

int

Zero-based device index (0 = first GPU, 1 = second GPU, etc.).

GraphManager

Gets the CUDA graph manager for optimized repeated kernel execution patterns.

public CudaGraphManager? GraphManager { get; }

Property Value

CudaGraphManager

Graph manager instance if supported (Compute Capability 10.0+), otherwise null. CUDA graphs capture kernel launch sequences for faster replay with reduced CPU overhead.

Remarks

Available only on GPUs with compute capability 10.0 or higher. Use graphs for repetitive execution patterns to reduce launch overhead by up to 50%.
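Because the property is null on unsupported devices, null-check before taking the graph path. A minimal sketch (the capture/launch calls inside the branch depend on the CudaGraphManager API and are left as comments):

```csharp
// GraphManager is null on devices below the required compute capability.
if (accelerator.GraphManager is { } graphs)
{
    // Use graph-based execution for a hot loop of repeated launches.
    // (Capture and replay calls depend on the CudaGraphManager API.)
}
else
{
    // Fall back to individual kernel launches.
    await accelerator.ExecuteKernelAsync(kernel, args);
    await accelerator.SynchronizeAsync();
}
```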

Methods

CompileKernelCoreAsync(KernelDefinition, CompilationOptions, CancellationToken)

Core kernel compilation logic for the CUDA backend; overrides the base-class compilation hook to compile the kernel definition at runtime with NVRTC.

protected override ValueTask<ICompiledKernel> CompileKernelCoreAsync(KernelDefinition definition, CompilationOptions options, CancellationToken cancellationToken)

Parameters

definition KernelDefinition
options CompilationOptions
cancellationToken CancellationToken

Returns

ValueTask<ICompiledKernel>

DisposeCoreAsync()

Core disposal logic for the CUDA backend; overrides the base-class hook to release device resources.

protected override ValueTask DisposeCoreAsync()

Returns

ValueTask

GetDeviceInfo()

Retrieves comprehensive device information including hardware specifications and capabilities.

public DeviceInfo GetDeviceInfo()

Returns

DeviceInfo

Detailed device information including compute capability, memory configuration, multiprocessor count, and architecture-specific features like RTX Ada detection.

Examples

var info = accelerator.GetDeviceInfo();
Console.WriteLine($"GPU: {info.Name}");
Console.WriteLine($"Compute Capability: {info.ComputeCapability}");
Console.WriteLine($"Memory: {info.GlobalMemoryBytes / (1024 * 1024 * 1024)} GB");
Console.WriteLine($"CUDA Cores: ~{info.EstimatedCudaCores}");
Console.WriteLine($"Memory Bandwidth: {info.MemoryBandwidthGBps:F1} GB/s");

if (info.IsRTX2000Ada)
{
    Console.WriteLine("Detected RTX 2000 Ada - enabling Ada optimizations");
}

Remarks

The returned DeviceInfo includes:

  • Hardware: GPU name, compute capability, SM count, CUDA core estimate
  • Memory: Total/available memory, bandwidth, L2 cache size
  • Capabilities: Unified memory, concurrent kernels, ECC support
  • Limits: Max threads per block, shared memory, warp size
  • Architecture: Generation detection (Ampere, Ada, Hopper, etc.)

RTX 2000 Ada Detection:
The IsRTX2000Ada property specifically identifies the RTX 2000 Ada Generation (Compute Capability 8.9) for architecture-specific optimizations.

Exceptions

ObjectDisposedException

Thrown if the accelerator has been disposed.

GetHealthSnapshotAsync(CancellationToken)

Gets a comprehensive health snapshot of the CUDA device.

public override ValueTask<DeviceHealthSnapshot> GetHealthSnapshotAsync(CancellationToken cancellationToken = default)

Parameters

cancellationToken CancellationToken

Cancellation token.

Returns

ValueTask<DeviceHealthSnapshot>

A task containing the device health snapshot.

Remarks

This method queries the NVIDIA Management Library (NVML) for real-time GPU metrics, including:

  • Temperature, power consumption, and fan speed
  • GPU and memory utilization percentages
  • Clock frequencies (graphics and memory)
  • Memory usage statistics
  • PCIe throughput measurements
  • Throttling status and reasons

Performance: Typically takes 2-5ms to collect all metrics via NVML. Results are collected synchronously but wrapped in ValueTask for consistency.

Requirements: NVIDIA driver with NVML support. Falls back to unavailable snapshot if NVML is not available or initialization fails.
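A hedged usage sketch (the DeviceHealthSnapshot member names used below — TemperatureCelsius, PowerWatts, GpuUtilizationPercent — are illustrative assumptions; consult the DeviceHealthSnapshot reference for the actual members):

```csharp
// Poll real-time device health via NVML.
var health = await accelerator.GetHealthSnapshotAsync();
Console.WriteLine($"Temperature: {health.TemperatureCelsius} °C");
Console.WriteLine($"Power draw: {health.PowerWatts} W");
Console.WriteLine($"GPU load: {health.GpuUtilizationPercent}%");
```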

GetProfilingMetricsAsync(CancellationToken)

Gets current profiling metrics from the CUDA device.

public override ValueTask<IReadOnlyList<ProfilingMetric>> GetProfilingMetricsAsync(CancellationToken cancellationToken = default)

Parameters

cancellationToken CancellationToken

Cancellation token.

Returns

ValueTask<IReadOnlyList<ProfilingMetric>>

A task containing the collection of profiling metrics.
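A minimal sketch of dumping the collected metrics (the ProfilingMetric member names Name, Value, and Unit are assumptions for illustration):

```csharp
// Enumerate all currently collected profiling metrics.
var metrics = await accelerator.GetProfilingMetricsAsync();
foreach (var metric in metrics)
{
    Console.WriteLine($"{metric.Name}: {metric.Value} {metric.Unit}");
}
```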

GetProfilingSnapshotAsync(CancellationToken)

Gets a comprehensive profiling snapshot of the CUDA device.

public override ValueTask<ProfilingSnapshot> GetProfilingSnapshotAsync(CancellationToken cancellationToken = default)

Parameters

cancellationToken CancellationToken

Cancellation token.

Returns

ValueTask<ProfilingSnapshot>

A task containing the profiling snapshot.

Remarks

CUDA profiling provides detailed performance metrics using CUDA Events for precise timing. This implementation tracks kernel execution statistics, memory operations, and device utilization.

Available Metrics:

  • Kernel execution statistics (average, min, max, median, P95, P99)
  • Memory transfer statistics (bandwidth, transfer counts)
  • Device utilization (estimated from execution patterns)
  • Performance trends and bottleneck identification

Profiling Overhead: Minimal (<0.5%) as timing data is collected passively. CUDA Events provide hardware-accurate timing without CPU synchronization overhead.

Performance: Typically less than 1ms to collect and aggregate metrics.
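A hedged sketch of taking a snapshot after a workload run (the statistic names AverageKernelTimeMs and P95KernelTimeMs are illustrative; see the ProfilingSnapshot reference for the actual members):

```csharp
// Drain pending work, then capture aggregated timing statistics.
await accelerator.SynchronizeAsync();
var profile = await accelerator.GetProfilingSnapshotAsync();
Console.WriteLine($"Avg kernel time: {profile.AverageKernelTimeMs:F3} ms");
Console.WriteLine($"P95 kernel time: {profile.P95KernelTimeMs:F3} ms");
```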

GetSensorReadingsAsync(CancellationToken)

Gets current sensor readings from the CUDA device.

public override ValueTask<IReadOnlyList<SensorReading>> GetSensorReadingsAsync(CancellationToken cancellationToken = default)

Parameters

cancellationToken CancellationToken

Cancellation token.

Returns

ValueTask<IReadOnlyList<SensorReading>>

A task containing the collection of sensor readings.
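A minimal sketch of enumerating the readings (the SensorReading member names Sensor, Value, and Unit are assumptions for illustration):

```csharp
// List current hardware sensor readings (temperature, fan, power, etc.).
var readings = await accelerator.GetSensorReadingsAsync();
foreach (var reading in readings)
{
    Console.WriteLine($"{reading.Sensor}: {reading.Value} {reading.Unit}");
}
```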

InitializeCore()

Core initialization logic for the CUDA backend; overrides the base-class hook to set up the device context.

protected override object? InitializeCore()

Returns

object

Initialization result (typically null or a status object).

Reset()

Resets the CUDA device to a clean state, clearing all memory allocations and reinitializing the context.

public void Reset()

Examples

// Reset device after encountering errors
try
{
    await accelerator.ExecuteKernelAsync(kernel, args);
}
catch (CudaException ex) when (ex.Error == CudaError.IllegalAddress)
{
    accelerator.Reset(); // Clean slate
    // Retry operation...
}

Remarks

This operation:

  • Frees all device memory allocations managed by this accelerator
  • Destroys all CUDA contexts on this device
  • Resets the device to its initial state
  • Reinitializes the accelerator context

Warning: This is a heavyweight operation that affects all CUDA contexts on the device, not just this accelerator instance. Use sparingly, primarily for:

  • Recovering from device errors
  • Cleaning up after memory leaks during development
  • Benchmarking scenarios requiring pristine device state

Exceptions

ObjectDisposedException

Thrown if the accelerator has been disposed.

InvalidOperationException

Thrown if device reset fails due to driver error or pending operations.

ResetAsync(ResetOptions?, CancellationToken)

Resets the CUDA device to a clean state.

public override ValueTask<ResetResult> ResetAsync(ResetOptions? options = null, CancellationToken cancellationToken = default)

Parameters

options ResetOptions
cancellationToken CancellationToken

Returns

ValueTask<ResetResult>
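A hedged usage sketch (the ResetResult member names Success and Message are assumptions for illustration; check the ResetResult reference for the actual shape):

```csharp
// Async reset with default options; inspect the result rather than
// relying on an exception.
var result = await accelerator.ResetAsync();
if (!result.Success)
{
    Console.Error.WriteLine($"Device reset failed: {result.Message}");
}
```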

SynchronizeCoreAsync(CancellationToken)

Core synchronization logic for the CUDA backend; overrides the base-class hook to wait for all pending device operations to complete.

protected override ValueTask SynchronizeCoreAsync(CancellationToken cancellationToken)

Parameters

cancellationToken CancellationToken

Returns

ValueTask