Class RingKernelWatchdog
- Namespace
DotCompute.Backends.CUDA.RingKernels.Resilience
- Assembly
DotCompute.Backends.CUDA.dll
Monitors ring kernel health and triggers recovery actions when issues are detected.
public sealed class RingKernelWatchdog : IDisposable, IAsyncDisposable
- Inheritance
object → RingKernelWatchdog
- Implements
IDisposable, IAsyncDisposable
Remarks
The watchdog periodically checks kernel state and can detect:
- Kernel crashes (unexpected termination)
- Kernel stalls (no message processing activity)
- Heartbeat failures (kernel not responding)
When issues are detected, the watchdog triggers the configured recovery action.
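The stall check described above can be pictured as a timeout on the last reported activity. The following is a minimal sketch of that idea only, not the watchdog's actual implementation; `StallDetector`, its members, and the explicit `now` parameter are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of stall detection; not part of DotCompute.
public sealed class StallDetector
{
    private readonly TimeSpan _stallTimeout;
    private readonly Dictionary<string, DateTime> _lastActivity =
        new Dictionary<string, DateTime>();

    public StallDetector(TimeSpan stallTimeout)
    {
        _stallTimeout = stallTimeout;
    }

    // Record that the kernel made progress at the given instant.
    public void ReportActivity(string kernelId, DateTime now)
    {
        _lastActivity[kernelId] = now;
    }

    // A kernel counts as stalled when it has reported activity before
    // and has since been silent for longer than the timeout.
    public bool IsStalled(string kernelId, DateTime now)
    {
        DateTime last;
        return _lastActivity.TryGetValue(kernelId, out last)
            && now - last > _stallTimeout;
    }
}
```

Passing the clock value in explicitly keeps the check deterministic, which makes the timeout logic easy to exercise without waiting in real time.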
Constructors
RingKernelWatchdog(ILogger, RingKernelFaultRecoveryOptions?)
Initializes a new instance of the RingKernelWatchdog class.
public RingKernelWatchdog(ILogger logger, RingKernelFaultRecoveryOptions? options = null)
Parameters
- logger ILogger
Logger for diagnostics.
- options RingKernelFaultRecoveryOptions
Watchdog configuration options. Optional; defaults to null.
Methods
Dispose()
Performs application-defined tasks associated with freeing, releasing, or resetting unmanaged resources.
public void Dispose()
DisposeAsync()
Performs application-defined tasks associated with freeing, releasing, or resetting unmanaged resources asynchronously.
public ValueTask DisposeAsync()
Returns
- ValueTask
A task that represents the asynchronous dispose operation.
GetCircuitBreakerState(string)
Gets the circuit breaker state for a specific kernel.
public CircuitBreakerState? GetCircuitBreakerState(string kernelId)
Parameters
- kernelId string
The kernel identifier.
Returns
- CircuitBreakerState?
The circuit breaker state, or null if the kernel is not registered.
GetStatistics()
Gets the current watchdog statistics.
public WatchdogStatistics GetStatistics()
Returns
- WatchdogStatistics
The current watchdog statistics.
RegisterKernel(string, Func<CancellationToken, Task<bool>>, Func<KernelHealthStatus>)
Registers a kernel for watchdog monitoring.
public void RegisterKernel(string kernelId, Func<CancellationToken, Task<bool>> restartCallback, Func<KernelHealthStatus> getStatusCallback)
Parameters
- kernelId string
Unique identifier for the kernel.
- restartCallback Func<CancellationToken, Task<bool>>
Callback to invoke when a kernel restart is needed.
- getStatusCallback Func<KernelHealthStatus>
Callback to get the current kernel status.
ReportActivity(string, long)
Reports activity for a kernel, resetting the stall timeout.
public void ReportActivity(string kernelId, long messagesProcessed = 0)
Parameters
- kernelId string
The kernel identifier.
- messagesProcessed long
Number of messages processed since the last report.
ResetCircuitBreaker(string)
Resets the circuit breaker for a kernel, allowing it to accept new operations.
public void ResetCircuitBreaker(string kernelId)
Parameters
- kernelId string
The kernel identifier.
TripCircuitBreaker(string, string)
Forces the circuit breaker to trip for a kernel.
public void TripCircuitBreaker(string kernelId, string reason)
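Parameters
- kernelId string
The kernel identifier.
- reason string
The reason for tripping the circuit breaker.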
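The relationship between GetCircuitBreakerState, ResetCircuitBreaker, and TripCircuitBreaker can be illustrated with a small stand-alone model. This is a sketch under stated assumptions, not the library's code: `BreakerState`, `BreakerRegistry`, the Closed/Open state names, the initial Closed state, and the choice to ignore unknown kernels are all hypothetical, since the members of the real CircuitBreakerState type are not listed on this page.

```csharp
using System.Collections.Generic;

// Hypothetical model for illustration; not the DotCompute types.
public enum BreakerState { Closed, Open }

public sealed class BreakerRegistry
{
    private readonly Dictionary<string, BreakerState> _states =
        new Dictionary<string, BreakerState>();

    // Assumption: a newly registered kernel starts with a closed breaker.
    public void Register(string kernelId)
    {
        _states[kernelId] = BreakerState.Closed;
    }

    // Returns null for unknown kernels, mirroring the documented
    // behavior of GetCircuitBreakerState.
    public BreakerState? GetState(string kernelId)
    {
        BreakerState state;
        return _states.TryGetValue(kernelId, out state)
            ? state
            : (BreakerState?)null;
    }

    // Force the breaker open. Assumption: unknown kernels are ignored.
    public void Trip(string kernelId)
    {
        if (_states.ContainsKey(kernelId)) _states[kernelId] = BreakerState.Open;
    }

    // Close the breaker so the kernel can accept new operations again.
    public void Reset(string kernelId)
    {
        if (_states.ContainsKey(kernelId)) _states[kernelId] = BreakerState.Closed;
    }
}
```

In this model a tripped breaker stays open until Reset is called explicitly; whether the real watchdog also closes breakers automatically after a delay is not stated here.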
UnregisterKernel(string)
Unregisters a kernel from watchdog monitoring.
public bool UnregisterKernel(string kernelId)
Parameters
- kernelId string
The kernel identifier to unregister.
Returns
- bool
True if the kernel was unregistered; false if not found.
Events
KernelFaultDetected
Event raised when a kernel fault is detected.
public event EventHandler<KernelFaultEventArgs>? KernelFaultDetected
Event Type
- EventHandler<KernelFaultEventArgs>
KernelPermanentlyFailed
Event raised when a kernel cannot be recovered and is marked as failed.
public event EventHandler<KernelPermanentFailureEventArgs>? KernelPermanentlyFailed
Event Type
- EventHandler<KernelPermanentFailureEventArgs>
KernelRecovered
Event raised when a kernel is successfully recovered.
public event EventHandler<KernelRecoveryEventArgs>? KernelRecovered
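Subscribing to the three events might look like the sketch below. It uses only the event names and the class documented on this page and needs a reference to the DotCompute CUDA backend assembly to compile. `WatchdogEventsSketch` and `Subscribe` are hypothetical names, and the sketch assumes that the `ILogger` in the constructor signature is `Microsoft.Extensions.Logging.ILogger`. The event-args members are not listed here, so the handlers deliberately ignore them.

```csharp
using DotCompute.Backends.CUDA.RingKernels.Resilience;
using Microsoft.Extensions.Logging; // assumption: ILogger comes from this namespace

// Hypothetical helper for illustration; not part of DotCompute.
public static class WatchdogEventsSketch
{
    // Attaches a logging handler to each of the three documented events.
    public static void Subscribe(RingKernelWatchdog watchdog, ILogger logger)
    {
        // Handlers ignore sender and args because the event-args
        // members are not described in this reference.
        watchdog.KernelFaultDetected += (sender, args) =>
            logger.LogWarning("Ring kernel fault detected.");

        watchdog.KernelPermanentlyFailed += (sender, args) =>
            logger.LogError("Ring kernel permanently failed.");

        watchdog.KernelRecovered += (sender, args) =>
            logger.LogInformation("Ring kernel recovered.");
    }
}
```

Attaching handlers before calling RegisterKernel is a reasonable precaution so that no early notification is missed.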