Class RingKernelWatchdog

Namespace
DotCompute.Backends.CUDA.RingKernels.Resilience
Assembly
DotCompute.Backends.CUDA.dll

Monitors ring kernel health and triggers recovery actions when issues are detected.

public sealed class RingKernelWatchdog : IDisposable, IAsyncDisposable
Inheritance
object
RingKernelWatchdog
Implements
IDisposable
IAsyncDisposable

Remarks

The watchdog periodically checks kernel state and can detect:

  • Kernel crashes (unexpected termination)
  • Kernel stalls (no message processing activity)
  • Heartbeat failures (kernel not responding)

When issues are detected, the watchdog triggers the configured recovery action.

Constructors

RingKernelWatchdog(ILogger, RingKernelFaultRecoveryOptions?)

Initializes a new instance of the RingKernelWatchdog class.

public RingKernelWatchdog(ILogger logger, RingKernelFaultRecoveryOptions? options = null)

Parameters

logger ILogger

Logger for diagnostics.

options RingKernelFaultRecoveryOptions

Watchdog configuration options. Optional; defaults to null.
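
The constructor and the disposal methods can be combined as in the minimal sketch below. It relies only on the signatures shown on this page. WatchdogSetupExample, Create, and RunAsync are hypothetical names, NullLogger comes from Microsoft.Extensions.Logging.Abstractions, and what the watchdog does when options is null is not documented here, so the sketch simply omits the argument.

```csharp
using System.Threading.Tasks;
using DotCompute.Backends.CUDA.RingKernels.Resilience;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Logging.Abstractions;

// Hypothetical helper class, not part of DotCompute.
public static class WatchdogSetupExample
{
    // Builds a watchdog without explicit options (the parameter defaults to null).
    // Falls back to a no-op logger when the caller does not supply one.
    public static RingKernelWatchdog Create(ILogger? logger = null)
        => new RingKernelWatchdog(logger ?? NullLogger.Instance);

    public static async Task RunAsync()
    {
        // RingKernelWatchdog implements IAsyncDisposable, so 'await using'
        // calls DisposeAsync() when the enclosing scope ends.
        await using var watchdog = Create();

        // Register kernels and subscribe to events here.
    }
}
```

In synchronous code, a plain using statement calls Dispose() instead.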

Methods

Dispose()

Performs application-defined tasks associated with freeing, releasing, or resetting unmanaged resources.

public void Dispose()

DisposeAsync()

Performs application-defined tasks associated with freeing, releasing, or resetting unmanaged resources asynchronously.

public ValueTask DisposeAsync()

Returns

ValueTask

A task that represents the asynchronous dispose operation.

GetCircuitBreakerState(string)

Gets the circuit breaker state for a specific kernel.

public CircuitBreakerState? GetCircuitBreakerState(string kernelId)

Parameters

kernelId string

The kernel identifier.

Returns

CircuitBreakerState?

The circuit breaker state, or null if the kernel is not registered.

GetStatistics()

Gets the current watchdog statistics.

public WatchdogStatistics GetStatistics()

Returns

WatchdogStatistics

The current watchdog statistics.

RegisterKernel(string, Func<CancellationToken, Task<bool>>, Func<KernelHealthStatus>)

Registers a kernel for watchdog monitoring.

public void RegisterKernel(string kernelId, Func<CancellationToken, Task<bool>> restartCallback, Func<KernelHealthStatus> getStatusCallback)

Parameters

kernelId string

Unique identifier for the kernel.

restartCallback Func<CancellationToken, Task<bool>>

Callback to invoke when kernel restart is needed.

getStatusCallback Func<KernelHealthStatus>

Callback to get current kernel status.
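
The sketch below shows one way to build the two callbacks. It assumes, without confirmation from this page, that the bool returned by restartCallback reports whether the restart succeeded. KernelRegistrationExample, ToRestartCallback, Register, relaunchAsync, and readStatus are hypothetical names. The sketch also assumes KernelHealthStatus resolves from the imported namespace; add the appropriate using directive if it lives elsewhere.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using DotCompute.Backends.CUDA.RingKernels.Resilience;

// Hypothetical helper class, not part of DotCompute.
public static class KernelRegistrationExample
{
    // Adapts a relaunch delegate to the restartCallback shape.
    // Assumption: the bool result tells the watchdog whether the restart succeeded.
    public static Func<CancellationToken, Task<bool>> ToRestartCallback(
        Func<CancellationToken, Task> relaunchAsync)
    {
        return async cancellationToken =>
        {
            try
            {
                await relaunchAsync(cancellationToken).ConfigureAwait(false);
                return true;
            }
            catch (Exception ex) when (ex is not OperationCanceledException)
            {
                // Report the failure as a result instead of throwing.
                return false;
            }
        };
    }

    // Registers a kernel using the signature documented above.
    public static void Register(
        RingKernelWatchdog watchdog,
        string kernelId,
        Func<CancellationToken, Task> relaunchAsync,
        Func<KernelHealthStatus> readStatus)
    {
        watchdog.RegisterKernel(kernelId, ToRestartCallback(relaunchAsync), readStatus);
    }
}
```

Cancellation is deliberately left to propagate, so a cancelled restart is not reported as an ordinary failure.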

ReportActivity(string, long)

Reports activity for a kernel, resetting the stall timeout.

public void ReportActivity(string kernelId, long messagesProcessed = 0)

Parameters

kernelId string

The kernel identifier.

messagesProcessed long

Number of messages processed since the last report. Defaults to 0.
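
Because messagesProcessed is a count since the previous report rather than a running total, callers that track a cumulative counter need to convert it. The sketch below is one way to do that. ActivityReporter is a hypothetical helper that takes the reporting call as a delegate, so it can be exercised without a GPU; it is not thread-safe.

```csharp
using System;

// Hypothetical helper, not part of DotCompute.
public sealed class ActivityReporter
{
    private readonly Action<string, long> _report;
    private readonly string _kernelId;
    private long _lastReportedTotal;

    // 'report' is expected to be a delegate such as watchdog.ReportActivity.
    public ActivityReporter(Action<string, long> report, string kernelId)
    {
        _report = report ?? throw new ArgumentNullException(nameof(report));
        _kernelId = kernelId ?? throw new ArgumentNullException(nameof(kernelId));
    }

    // Converts a cumulative total into the count since the last report,
    // forwards that count, and returns it.
    public long Report(long totalProcessed)
    {
        long delta = totalProcessed - _lastReportedTotal;
        _lastReportedTotal = totalProcessed;
        _report(_kernelId, delta);
        return delta;
    }
}
```

With a watchdog instance in scope, the helper can be created as new ActivityReporter(watchdog.ReportActivity, "my-kernel"), where "my-kernel" is a placeholder identifier.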

ResetCircuitBreaker(string)

Resets the circuit breaker for a kernel, allowing the kernel to accept new operations again.

public void ResetCircuitBreaker(string kernelId)

Parameters

kernelId string

The kernel identifier.

TripCircuitBreaker(string, string)

Forces the circuit breaker to trip for a kernel.

public void TripCircuitBreaker(string kernelId, string reason)

Parameters

kernelId string

The kernel identifier.

reason string

Reason for tripping the circuit breaker.
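
GetCircuitBreakerState, TripCircuitBreaker, and ResetCircuitBreaker can be combined to take a kernel out of service temporarily. The sketch below is an illustration under assumptions: CircuitBreakerExample and RunWithBreakerTripped are hypothetical names, the reason string is a placeholder, and the members of CircuitBreakerState are not listed on this page, so the code only checks for null.

```csharp
using System;
using DotCompute.Backends.CUDA.RingKernels.Resilience;

// Hypothetical helper class, not part of DotCompute.
public static class CircuitBreakerExample
{
    // Trips the breaker while 'maintenance' runs, then resets it.
    // Returns false without running anything when the kernel is not registered.
    public static bool RunWithBreakerTripped(
        RingKernelWatchdog watchdog, string kernelId, Action maintenance)
    {
        // A null state means the kernel is not registered with this watchdog.
        if (watchdog.GetCircuitBreakerState(kernelId) is null)
        {
            return false;
        }

        watchdog.TripCircuitBreaker(kernelId, "Manual maintenance window");
        try
        {
            maintenance();
        }
        finally
        {
            // Reset even if maintenance throws, so the kernel is not left blocked.
            watchdog.ResetCircuitBreaker(kernelId);
        }

        return true;
    }
}
```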

UnregisterKernel(string)

Unregisters a kernel from watchdog monitoring.

public bool UnregisterKernel(string kernelId)

Parameters

kernelId string

The kernel identifier to unregister.

Returns

bool

true if the kernel was unregistered; false if it was not found.

Events

KernelFaultDetected

Event raised when a kernel fault is detected.

public event EventHandler<KernelFaultEventArgs>? KernelFaultDetected

Event Type

EventHandler<KernelFaultEventArgs>

KernelPermanentlyFailed

Event raised when a kernel cannot be recovered and is marked as failed.

public event EventHandler<KernelPermanentFailureEventArgs>? KernelPermanentlyFailed

Event Type

EventHandler<KernelPermanentFailureEventArgs>

KernelRecovered

Event raised when a kernel is successfully recovered.

public event EventHandler<KernelRecoveryEventArgs>? KernelRecovered

Event Type

EventHandler<KernelRecoveryEventArgs>
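
The three events can be wired to a logger as in the sketch below. The properties of the event argument types are not documented on this page, so the handlers log the argument objects as a whole. WatchdogEventsExample and Subscribe are hypothetical names, and the sketch assumes the event argument types resolve from the imported namespace. Handlers may run on a thread owned by the watchdog, which is an assumption, so keeping them short and thread-safe is a reasonable precaution.

```csharp
using System;
using DotCompute.Backends.CUDA.RingKernels.Resilience;
using Microsoft.Extensions.Logging;

// Hypothetical helper class, not part of DotCompute.
public static class WatchdogEventsExample
{
    // Attaches logging handlers and returns an action that detaches them.
    public static Action Subscribe(RingKernelWatchdog watchdog, ILogger logger)
    {
        EventHandler<KernelFaultEventArgs> onFault =
            (_, e) => logger.LogWarning("Kernel fault detected: {Event}", e);
        EventHandler<KernelRecoveryEventArgs> onRecovered =
            (_, e) => logger.LogInformation("Kernel recovered: {Event}", e);
        EventHandler<KernelPermanentFailureEventArgs> onFailed =
            (_, e) => logger.LogError("Kernel permanently failed: {Event}", e);

        watchdog.KernelFaultDetected += onFault;
        watchdog.KernelRecovered += onRecovered;
        watchdog.KernelPermanentlyFailed += onFailed;

        // Detach the same delegate instances that were attached above.
        return () =>
        {
            watchdog.KernelFaultDetected -= onFault;
            watchdog.KernelRecovered -= onRecovered;
            watchdog.KernelPermanentlyFailed -= onFailed;
        };
    }
}
```

Calling the returned action before disposing the watchdog avoids keeping handler targets alive longer than intended.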