Class RingKernelFaultRecoveryOptions

Namespace
DotCompute.Backends.CUDA.RingKernels.Resilience
Assembly
DotCompute.Backends.CUDA.dll

Configuration options for ring kernel fault recovery and watchdog behavior.

public sealed class RingKernelFaultRecoveryOptions
Inheritance
RingKernelFaultRecoveryOptions

Properties

CircuitBreakerFailureThreshold

Gets or sets the failure threshold that triggers the circuit breaker. When this many failures occur within the tracking window, the circuit opens.

public int CircuitBreakerFailureThreshold { get; set; }

Property Value

int

CircuitBreakerOpenDuration

Gets or sets the duration the circuit breaker remains open before attempting recovery.

public TimeSpan CircuitBreakerOpenDuration { get; set; }

Property Value

TimeSpan

Default

Gets the default options instance.

public static RingKernelFaultRecoveryOptions Default { get; }

Property Value

RingKernelFaultRecoveryOptions

EnableAutoRestart

Gets or sets whether automatic restart is enabled for crashed kernels.

public bool EnableAutoRestart { get; set; }

Property Value

bool

EnableWatchdog

Gets or sets whether the kernel watchdog is enabled. When enabled, monitors kernel health and triggers recovery on detected issues.

public bool EnableWatchdog { get; set; }

Property Value

bool

FailureTrackingWindow

Gets or sets the window for tracking kernel failures. Failures outside this window are not counted toward the circuit breaker.

public TimeSpan FailureTrackingWindow { get; set; }

Property Value

TimeSpan
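
The interaction between CircuitBreakerFailureThreshold and FailureTrackingWindow can be pictured as a sliding-window counter. The sketch below is only an illustration of that idea, not the library's implementation: the class FailureWindowSketch and its members are hypothetical names, and the exact boundary handling (whether a failure exactly at the window edge still counts) is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of DotCompute: counts failures inside a sliding window.
public sealed class FailureWindowSketch
{
    private readonly Queue<DateTime> _failures = new Queue<DateTime>();
    private readonly int _threshold;   // plays the role of CircuitBreakerFailureThreshold
    private readonly TimeSpan _window; // plays the role of FailureTrackingWindow

    public FailureWindowSketch(int threshold, TimeSpan window)
    {
        if (threshold < 1) throw new ArgumentOutOfRangeException(nameof(threshold));
        if (window <= TimeSpan.Zero) throw new ArgumentOutOfRangeException(nameof(window));
        _threshold = threshold;
        _window = window;
    }

    // Records a failure observed at 'now' and returns true when the number of
    // failures still inside the window has reached the threshold.
    public bool RecordFailure(DateTime now)
    {
        _failures.Enqueue(now);

        // Drop failures older than the window (assumed: strictly older).
        while (_failures.Count > 0 && now - _failures.Peek() > _window)
        {
            _failures.Dequeue();
        }

        return _failures.Count >= _threshold;
    }
}
```

With a threshold of 3 and a 60-second window, three failures 10 seconds apart would trip the sketch on the third failure, while a third failure arriving 100 seconds after the first would not, because the earlier two have aged out.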

HeartbeatTimeout

Gets or sets the maximum time to wait for a kernel to respond to a heartbeat.

public TimeSpan HeartbeatTimeout { get; set; }

Property Value

TimeSpan

KernelStallTimeout

Gets or sets the timeout after which a kernel is considered stalled. If a kernel does not process messages within this time, recovery is attempted.

public TimeSpan KernelStallTimeout { get; set; }

Property Value

TimeSpan

MaxRestartAttempts

Gets or sets the maximum number of automatic restart attempts before giving up.

public int MaxRestartAttempts { get; set; }

Property Value

int

MaxRestartDelay

Gets or sets the maximum delay between restart attempts when exponential backoff is used.

public TimeSpan MaxRestartDelay { get; set; }

Property Value

TimeSpan

NotifyHealthMonitor

Gets or sets whether to notify the health monitor of kernel failures.

public bool NotifyHealthMonitor { get; set; }

Property Value

bool

ResetFailuresOnSuccess

Gets or sets whether to reset the failure count after successful kernel execution.

public bool ResetFailuresOnSuccess { get; set; }

Property Value

bool

RestartDelay

Gets or sets the delay between restart attempts.

public TimeSpan RestartDelay { get; set; }

Property Value

TimeSpan

SuccessfulRunThreshold

Gets or sets the minimum time a kernel must run successfully before the restart count is reset. This guards against rapid failure-restart cycles.

public TimeSpan SuccessfulRunThreshold { get; set; }

Property Value

TimeSpan

UseExponentialBackoff

Gets or sets whether to use exponential backoff for restart delays.

public bool UseExponentialBackoff { get; set; }

Property Value

bool
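
One common way RestartDelay, MaxRestartDelay, and UseExponentialBackoff could combine is a doubling delay capped at the maximum. This page does not state the formula the library uses, so the sketch below is an assumption: the doubling factor, the 0-based attempt index, and the name RestartBackoffSketch.ComputeDelay are all hypothetical.

```csharp
using System;

// Hypothetical helper, not part of DotCompute: capped exponential backoff.
public static class RestartBackoffSketch
{
    // Returns the delay to wait before restart attempt 'attempt' (0-based).
    // Assumed formula: min(baseDelay * 2^attempt, maxDelay) when backoff is on,
    // otherwise the fixed baseDelay.
    public static TimeSpan ComputeDelay(
        int attempt, TimeSpan baseDelay, TimeSpan maxDelay, bool useExponentialBackoff)
    {
        if (attempt < 0) throw new ArgumentOutOfRangeException(nameof(attempt));
        if (!useExponentialBackoff) return baseDelay;

        // Cap the exponent so the multiplier stays within a sane range.
        double factor = Math.Pow(2, Math.Min(attempt, 30));
        double ticks = Math.Min(baseDelay.Ticks * factor, (double)maxDelay.Ticks);
        return TimeSpan.FromTicks((long)ticks);
    }
}
```

Under these assumptions, a 1-second base delay with a 30-second cap yields 1 s, 2 s, 4 s, 8 s, 16 s, and then 30 s for every later attempt.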

WatchdogInterval

Gets or sets the interval between watchdog health checks.

public TimeSpan WatchdogInterval { get; set; }

Property Value

TimeSpan

Methods

Validate()

Validates the options and throws if any values are invalid.

public void Validate()

Exceptions

ArgumentOutOfRangeException

Thrown when any option is outside its valid range.
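
Example

A possible way to configure and validate the options is sketched below. It uses only the members documented on this page, but it assumes the class exposes a public parameterless constructor (no constructor is listed here), and every value shown is illustrative rather than a recommended or default setting. The ranges that Validate() accepts are not documented on this page.

```csharp
using System;
using DotCompute.Backends.CUDA.RingKernels.Resilience;

// Assumption: a public parameterless constructor is available.
var options = new RingKernelFaultRecoveryOptions
{
    // Watchdog settings (illustrative values)
    EnableWatchdog = true,
    WatchdogInterval = TimeSpan.FromSeconds(1),
    HeartbeatTimeout = TimeSpan.FromSeconds(5),
    KernelStallTimeout = TimeSpan.FromSeconds(10),

    // Restart settings (illustrative values)
    EnableAutoRestart = true,
    MaxRestartAttempts = 3,
    RestartDelay = TimeSpan.FromMilliseconds(500),
    UseExponentialBackoff = true,
    MaxRestartDelay = TimeSpan.FromSeconds(30),

    // Circuit breaker settings (illustrative values)
    CircuitBreakerFailureThreshold = 5,
    FailureTrackingWindow = TimeSpan.FromMinutes(5),
    CircuitBreakerOpenDuration = TimeSpan.FromMinutes(1),
};

try
{
    // Throws ArgumentOutOfRangeException if any value is outside its valid range.
    options.Validate();
}
catch (ArgumentOutOfRangeException ex)
{
    Console.Error.WriteLine($"Invalid fault recovery option: {ex.ParamName}");
    throw;
}
```

Calling Validate() immediately after building the options surfaces a bad value at configuration time instead of later, when a kernel fails and recovery is attempted.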