Device Reset and Error Recovery
Version: 0.4.0 Status: Production Ready Last Updated: November 2025
Overview
DotCompute provides a comprehensive device reset and error recovery system that allows applications to gracefully handle GPU errors, memory exhaustion, and other hardware-related failures. The reset API supports four different reset levels, from lightweight synchronization to complete device reinitialization.
Table of Contents
- Reset Types
- Quick Start
- Reset Options
- Reset Results
- Backend-Specific Behavior
- Error Recovery Patterns
- Best Practices
- Advanced Scenarios
- Troubleshooting
Reset Types
DotCompute defines four reset levels, each providing different cleanup guarantees:
Soft Reset (ResetType.Soft)
Purpose: Lightweight synchronization without resource cleanup.
What it does:
- Waits for pending GPU operations to complete
- Synchronizes command queues
- No memory or cache cleanup
When to use:
- Between batch processing operations
- Before reading GPU results
- Periodic synchronization points
- Minimal disruption scenarios
Performance: < 10ms typical
// Soft reset - just synchronize
await accelerator.ResetAsync(ResetOptions.Soft, cancellationToken);
Context Reset (ResetType.Context)
Purpose: Clear compilation caches and reset context state.
What it does:
- Everything from Soft Reset
- Clears kernel compilation cache
- Resets command buffer pools (Metal)
- Clears OpenCL program cache
When to use:
- After dynamic kernel compilation
- When switching workload types
- Cache invalidation scenarios
- GPU context refresh
Performance: 100-200ms typical
// Context reset - clear caches
await accelerator.ResetAsync(ResetOptions.Context, cancellationToken);
Hard Reset (ResetType.Hard)
Purpose: Full resource cleanup without device reinitialization.
What it does:
- Everything from Context Reset
- Clears memory pools
- Frees all cached allocations
- Resets performance counters
- Clears pending operations
When to use:
- After memory exhaustion
- Memory leak recovery
- Before major workload changes
- Resource cleanup scenarios
Performance: 200-500ms typical
// Hard reset - full cleanup
await accelerator.ResetAsync(ResetOptions.Hard, cancellationToken);
Full Reset (ResetType.Full)
Purpose: Complete device reinitialization.
What it does:
- Everything from Hard Reset
- Destroys and recreates device context
- Reinitializes GPU driver state
- Resets all hardware counters
- Full device recovery
When to use:
- GPU driver errors
- Hardware malfunction recovery
- Critical error scenarios
- Complete state reset required
Performance: 500-1000ms typical
// Full reset - reinitialize device
await accelerator.ResetAsync(ResetOptions.Full, cancellationToken);
Quick Start
Basic Reset
using DotCompute.Abstractions;
using DotCompute.Abstractions.Recovery;
// Get accelerator
var accelerator = await AcceleratorFactory.CreateAsync(
AcceleratorType.CUDA,
cancellationToken);
try
{
// Your GPU work here
await accelerator.ExecuteKernelAsync(kernel, cancellationToken);
}
catch (OutOfMemoryException)
{
// Recover from memory exhaustion
var result = await accelerator.ResetAsync(
ResetOptions.Hard,
cancellationToken);
if (result.Success)
{
Console.WriteLine($"Freed {result.MemoryFreedBytes / 1024 / 1024} MB");
// Retry operation
}
}
Orleans Grain Reset
public class ComputeGrain : Grain, IComputeGrain
{
private IAccelerator? _accelerator;
public override async Task OnDeactivateAsync(DeactivationReason reason, CancellationToken cancellationToken)
{
if (_accelerator != null)
{
// Use Orleans-optimized reset options
await _accelerator.ResetAsync(
ResetOptions.GrainDeactivation,
cancellationToken);
await _accelerator.DisposeAsync();
}
await base.OnDeactivateAsync(reason, cancellationToken);
}
}
Reset Options
The ResetOptions class provides fine-grained control over reset behavior:
Properties
public sealed class ResetOptions
{
/// <summary>
/// Type of reset to perform (required)
/// </summary>
public ResetType ResetType { get; set; }
/// <summary>
/// Wait for pending operations before reset (default: true)
/// </summary>
public bool WaitForCompletion { get; set; } = true;
/// <summary>
/// Maximum time to wait for reset (default: 30s)
/// </summary>
public TimeSpan Timeout { get; set; } = TimeSpan.FromSeconds(30);
/// <summary>
/// Clear memory pool during reset (default: false)
/// </summary>
public bool ClearMemoryPool { get; set; }
/// <summary>
/// Clear kernel compilation cache (default: false)
/// </summary>
public bool ClearKernelCache { get; set; }
/// <summary>
/// Reinitialize device context (default: false)
/// </summary>
public bool Reinitialize { get; set; }
/// <summary>
/// Force reset even if operations are pending (default: false)
/// WARNING: Can cause data loss
/// </summary>
public bool Force { get; set; }
}
Predefined Options
DotCompute provides several predefined option sets:
// Minimal disruption
var options = ResetOptions.Soft;
// Cache invalidation
var options = ResetOptions.Context;
// Resource cleanup
var options = ResetOptions.Hard;
// Complete reinitialization
var options = ResetOptions.Full;
// Error recovery (aggressive)
var options = ResetOptions.ErrorRecovery;
// Orleans grain deactivation (optimized)
var options = ResetOptions.GrainDeactivation;
// Complete cleanup (for shutdown)
var options = ResetOptions.CompleteCleanup;
Custom Configuration
var options = new ResetOptions
{
ResetType = ResetType.Hard,
WaitForCompletion = true,
Timeout = TimeSpan.FromSeconds(10),
ClearMemoryPool = true,
ClearKernelCache = true,
Reinitialize = false,
Force = false
};
var result = await accelerator.ResetAsync(options, cancellationToken);
Reset Results
The ResetResult class provides detailed information about the reset operation:
Success Result
var result = await accelerator.ResetAsync(ResetOptions.Hard, cancellationToken);
if (result.Success)
{
Console.WriteLine($"Device: {result.DeviceName} ({result.DeviceId})");
Console.WriteLine($"Backend: {result.BackendType}");
Console.WriteLine($"Reset Type: {result.ResetType}");
Console.WriteLine($"Duration: {result.Duration.TotalMilliseconds}ms");
Console.WriteLine($"Memory Freed: {result.MemoryFreedBytes / 1024 / 1024} MB");
Console.WriteLine($"Kernels Cleared: {result.KernelsCacheCleared}");
Console.WriteLine($"Pending Ops: {result.PendingOperationsCleared}");
Console.WriteLine($"Reinitialized: {result.WasReinitialized}");
// Backend-specific diagnostics
foreach (var (key, value) in result.DiagnosticInfo)
{
Console.WriteLine($"{key}: {value}");
}
}
Failure Result
var result = await accelerator.ResetAsync(ResetOptions.Full, cancellationToken);
if (!result.Success)
{
Console.WriteLine($"Reset failed: {result.ErrorMessage}");
Console.WriteLine($"Error code: {result.ErrorCode}");
// Detailed diagnostic information
foreach (var (key, value) in result.DiagnosticInfo)
{
Console.WriteLine($"{key}: {value}");
}
}
Properties Reference
public sealed class ResetResult
{
public bool Success { get; }
public string DeviceId { get; }
public string DeviceName { get; }
public string BackendType { get; }
public ResetType ResetType { get; }
public DateTimeOffset Timestamp { get; }
public TimeSpan Duration { get; }
public bool WasReinitialized { get; }
// Resource cleanup metrics
public long PendingOperationsCleared { get; }
public long MemoryFreedBytes { get; }
public int KernelsCacheCleared { get; }
public bool MemoryPoolCleared { get; }
public bool KernelCacheCleared { get; }
// Error information (when Success = false)
public string? ErrorMessage { get; }
public string? ErrorCode { get; }
// Backend-specific diagnostics
public IReadOnlyDictionary<string, string> DiagnosticInfo { get; }
}
Backend-Specific Behavior
CUDA Backend
Soft Reset:
cudaDeviceSynchronize()- Minimal overhead
Context Reset:
- Clears NVRTC compilation cache
- Resets CUDA module cache
- ~150ms typical
Hard Reset:
- Clears memory pool
- Frees cached allocations
- Resets cuBLAS/cuDNN handles
- ~300ms typical
Full Reset:
cudaDeviceReset()- Complete reinitialization
- ~800ms typical
Diagnostics:
result.DiagnosticInfo["ComputeCapability"] // "8.9"
result.DiagnosticInfo["DriverVersion"] // "545.29.06"
result.DiagnosticInfo["CudaVersion"] // "12.3"
result.DiagnosticInfo["MemoryFreed"] // "512 MB"
OpenCL Backend
Soft Reset:
clFinish()on all queues- Fast synchronization
Context Reset:
- Clears program cache
- Releases command pools
- ~120ms typical
Hard Reset:
- Clears memory pool
- Releases all buffers
- ~250ms typical
Full Reset:
- Recreates OpenCL context
- Reinitializes device
- ~600ms typical
Diagnostics:
result.DiagnosticInfo["OpenCLVersion"] // "3.0"
result.DiagnosticInfo["Device"] // "Intel UHD Graphics"
result.DiagnosticInfo["Vendor"] // "Intel"
result.DiagnosticInfo["Platform"] // "Intel OpenCL"
Metal Backend
Soft Reset:
- Waits for command buffer completion
- Fast on Apple Silicon
Context Reset:
- Clears command buffer pool
- Resets pipeline state cache
- ~100ms typical
Hard Reset:
- Clears unified memory allocations
- Resets resource heaps
- ~200ms typical
Full Reset:
- Recreates MTLDevice
- Full resource reinitialization
- ~500ms typical
Diagnostics:
result.DiagnosticInfo["DeviceName"] // "Apple M1 Pro"
result.DiagnosticInfo["Architecture"] // "Apple Silicon"
result.DiagnosticInfo["UnifiedMemory"] // "True"
result.DiagnosticInfo["MaxThreadgroupSize"] // "1024"
CPU Backend
All Reset Types:
- Synchronous operations (no pending work)
- Memory pool cleanup only
- < 50ms typical
Diagnostics:
result.DiagnosticInfo["ProcessorCount"] // "16"
result.DiagnosticInfo["OSVersion"] // "Linux 6.6.87.2"
result.DiagnosticInfo["PerformanceMode"] // "HighPerformance"
result.DiagnosticInfo["Vectorization"] // "True"
Error Recovery Patterns
Automatic Retry with Exponential Backoff
public async Task<T> ExecuteWithRetryAsync<T>(
Func<IAccelerator, CancellationToken, Task<T>> operation,
IAccelerator accelerator,
CancellationToken cancellationToken)
{
var maxAttempts = 3;
var resetType = ResetType.Soft;
for (int attempt = 1; attempt <= maxAttempts; attempt++)
{
try
{
return await operation(accelerator, cancellationToken);
}
catch (Exception ex) when (attempt < maxAttempts)
{
// Escalate reset severity with each retry
resetType = attempt switch
{
1 => ResetType.Context,
2 => ResetType.Hard,
_ => ResetType.Full
};
var result = await accelerator.ResetAsync(
new ResetOptions { ResetType = resetType },
cancellationToken);
if (!result.Success)
{
throw new InvalidOperationException(
$"Reset failed: {result.ErrorMessage}", ex);
}
// Exponential backoff
await Task.Delay(TimeSpan.FromMilliseconds(100 * Math.Pow(2, attempt)));
}
}
throw new InvalidOperationException($"Operation failed after {maxAttempts} attempts");
}
Memory Exhaustion Recovery
public async Task HandleOutOfMemoryAsync(
IAccelerator accelerator,
CancellationToken cancellationToken)
{
// Try progressive cleanup
var resetTypes = new[]
{
ResetType.Context, // First try cache cleanup
ResetType.Hard, // Then full memory cleanup
ResetType.Full // Last resort: reinitialize
};
foreach (var resetType in resetTypes)
{
var result = await accelerator.ResetAsync(
new ResetOptions { ResetType = resetType },
cancellationToken);
if (result.Success && result.MemoryFreedBytes > 0)
{
Console.WriteLine($"Freed {result.MemoryFreedBytes / 1024 / 1024} MB " +
$"using {resetType} reset");
return;
}
}
throw new OutOfMemoryException("Unable to free GPU memory");
}
Graceful Degradation
public async Task<ComputeResult> ExecuteWithFallbackAsync(
IKernel kernel,
IAccelerator gpuAccelerator,
IAccelerator cpuAccelerator,
CancellationToken cancellationToken)
{
try
{
// Try GPU first
return await gpuAccelerator.ExecuteKernelAsync(kernel, cancellationToken);
}
catch (Exception ex)
{
// Log GPU failure
Console.WriteLine($"GPU execution failed: {ex.Message}");
// Attempt GPU recovery
var result = await gpuAccelerator.ResetAsync(
ResetOptions.Hard,
cancellationToken);
if (result.Success)
{
// Retry GPU
try
{
return await gpuAccelerator.ExecuteKernelAsync(kernel, cancellationToken);
}
catch
{
// Fall through to CPU
}
}
// Fallback to CPU
Console.WriteLine("Falling back to CPU execution");
return await cpuAccelerator.ExecuteKernelAsync(kernel, cancellationToken);
}
}
Best Practices
1. Choose Appropriate Reset Level
// ✅ Good: Match reset level to scenario
await accelerator.ResetAsync(ResetOptions.Soft, cancellationToken); // Between batches
await accelerator.ResetAsync(ResetOptions.Context, cancellationToken); // After compilation
await accelerator.ResetAsync(ResetOptions.Hard, cancellationToken); // Memory recovery
await accelerator.ResetAsync(ResetOptions.Full, cancellationToken); // Driver errors
// ❌ Bad: Over-resetting
await accelerator.ResetAsync(ResetOptions.Full, cancellationToken); // For every batch!
2. Always Check Reset Results
// ✅ Good: Check success and handle failures
var result = await accelerator.ResetAsync(ResetOptions.Hard, cancellationToken);
if (!result.Success)
{
throw new InvalidOperationException($"Reset failed: {result.ErrorMessage}");
}
// ❌ Bad: Ignore failures
await accelerator.ResetAsync(ResetOptions.Hard, cancellationToken);
// Continue without checking...
3. Use Cancellation Tokens
// ✅ Good: Support cancellation
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(10));
var result = await accelerator.ResetAsync(ResetOptions.Hard, cts.Token);
// ❌ Bad: No timeout
var result = await accelerator.ResetAsync(ResetOptions.Hard, CancellationToken.None);
4. Log Reset Operations
// ✅ Good: Comprehensive logging
var result = await accelerator.ResetAsync(ResetOptions.Hard, cancellationToken);
_logger.LogInformation(
"Device reset completed: Type={ResetType}, Duration={Duration}ms, MemoryFreed={MemoryMB}MB",
result.ResetType,
result.Duration.TotalMilliseconds,
result.MemoryFreedBytes / 1024 / 1024);
// ❌ Bad: No logging
await accelerator.ResetAsync(ResetOptions.Hard, cancellationToken);
5. Reset Before Disposal
// ✅ Good: Clean shutdown
await accelerator.ResetAsync(ResetOptions.CompleteCleanup, cancellationToken);
await accelerator.DisposeAsync();
// ❌ Bad: Direct disposal
await accelerator.DisposeAsync(); // May leak resources
Advanced Scenarios
Custom Timeout Handling
var options = new ResetOptions
{
ResetType = ResetType.Hard,
Timeout = TimeSpan.FromSeconds(5) // Custom timeout
};
try
{
var result = await accelerator.ResetAsync(options, cancellationToken);
}
catch (TimeoutException)
{
// Force reset if timeout
options.Force = true;
options.Timeout = TimeSpan.FromSeconds(30);
var result = await accelerator.ResetAsync(options, cancellationToken);
}
Multi-GPU Reset Coordination
public async Task ResetAllAcceleratorsAsync(
IEnumerable<IAccelerator> accelerators,
CancellationToken cancellationToken)
{
var tasks = accelerators.Select(async accelerator =>
{
var result = await accelerator.ResetAsync(
ResetOptions.Hard,
cancellationToken);
return (Accelerator: accelerator, Result: result);
});
var results = await Task.WhenAll(tasks);
foreach (var (accelerator, result) in results)
{
if (!result.Success)
{
Console.WriteLine($"Reset failed for {accelerator.DeviceName}: " +
$"{result.ErrorMessage}");
}
}
}
Reset with State Preservation
public async Task ResetWithStatePreservationAsync(
IAccelerator accelerator,
CancellationToken cancellationToken)
{
// Save critical state
var criticalBuffers = await SaveCriticalStateAsync(accelerator);
try
{
// Perform reset
var result = await accelerator.ResetAsync(
ResetOptions.Hard,
cancellationToken);
if (!result.Success)
{
throw new InvalidOperationException($"Reset failed: {result.ErrorMessage}");
}
// Restore state
await RestoreCriticalStateAsync(accelerator, criticalBuffers);
}
finally
{
// Cleanup saved state
foreach (var buffer in criticalBuffers)
{
await buffer.DisposeAsync();
}
}
}
Troubleshooting
Reset Hangs or Times Out
Symptoms: Reset operation doesn't complete within timeout period.
Causes:
- Pending GPU operations stuck
- Driver-level deadlock
- Hardware malfunction
Solutions:
// 1. Try force reset
var options = new ResetOptions
{
ResetType = ResetType.Hard,
Force = true, // Skip waiting for pending operations
Timeout = TimeSpan.FromSeconds(60)
};
var result = await accelerator.ResetAsync(options, cancellationToken);
// 2. If that fails, escalate to Full reset
if (!result.Success)
{
options.ResetType = ResetType.Full;
result = await accelerator.ResetAsync(options, cancellationToken);
}
// 3. Last resort: recreate accelerator
if (!result.Success)
{
await accelerator.DisposeAsync();
accelerator = await AcceleratorFactory.CreateAsync(
AcceleratorType.CUDA,
cancellationToken);
}
Reset Doesn't Free Memory
Symptoms: MemoryFreedBytes is 0 or less than expected.
Causes:
- Memory still referenced
- Memory pool not cleared
- Driver caching
Solutions:
// Ensure all buffers are disposed
await DisposeAllBuffersAsync();
// Use Hard or Full reset
var result = await accelerator.ResetAsync(ResetOptions.Hard, cancellationToken);
// Check for memory leaks
if (result.MemoryFreedBytes < expectedBytes)
{
// Force GC collection
GC.Collect();
GC.WaitForPendingFinalizers();
// Retry reset
result = await accelerator.ResetAsync(ResetOptions.Full, cancellationToken);
}
Reset Fails with Driver Error
Symptoms: Reset returns Success = false with driver error code.
Causes:
- GPU driver crash
- Hardware malfunction
- Unsupported operation
Solutions:
var result = await accelerator.ResetAsync(ResetOptions.Full, cancellationToken);
if (!result.Success && result.ErrorCode != null)
{
// Check diagnostic information
Console.WriteLine($"Driver error: {result.ErrorCode}");
foreach (var (key, value) in result.DiagnosticInfo)
{
Console.WriteLine($"{key}: {value}");
}
// May require application restart or driver reinstall
throw new InvalidOperationException(
$"Unrecoverable GPU error: {result.ErrorMessage}");
}