Class RingKernelFaultRecoveryOptions
- Namespace: DotCompute.Backends.CUDA.RingKernels.Resilience
- Assembly: DotCompute.Backends.CUDA.dll
Configuration options for ring kernel fault recovery and watchdog behavior.
public sealed class RingKernelFaultRecoveryOptions
- Inheritance: object → RingKernelFaultRecoveryOptions
Properties
CircuitBreakerFailureThreshold
Gets or sets the number of failures that trips the circuit breaker. When this many failures occur within FailureTrackingWindow, the circuit opens.
public int CircuitBreakerFailureThreshold { get; set; }
Property Value: int
CircuitBreakerOpenDuration
Gets or sets the duration the circuit breaker remains open before attempting recovery.
public TimeSpan CircuitBreakerOpenDuration { get; set; }
Property Value: TimeSpan
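As a rough illustration of how CircuitBreakerFailureThreshold and CircuitBreakerOpenDuration typically interact, the following is a minimal sketch of the generic circuit-breaker pattern, not DotCompute's implementation. The CircuitBreakerSketch and CircuitState types, the half-open state, and the explicit `now` parameters are assumptions made for illustration. For brevity the sketch counts consecutive failures and ignores FailureTrackingWindow.

```csharp
using System;

// Hypothetical illustration of the circuit-breaker pattern; NOT the DotCompute implementation.
public enum CircuitState { Closed, Open, HalfOpen }

public sealed class CircuitBreakerSketch
{
    private readonly int _failureThreshold;   // plays the role of CircuitBreakerFailureThreshold
    private readonly TimeSpan _openDuration;  // plays the role of CircuitBreakerOpenDuration
    private int _consecutiveFailures;
    private DateTimeOffset _openedAt;

    public CircuitBreakerSketch(int failureThreshold, TimeSpan openDuration)
    {
        _failureThreshold = failureThreshold;
        _openDuration = openDuration;
    }

    public CircuitState State { get; private set; } = CircuitState.Closed;

    // Returns true if a kernel launch or restart may proceed at time 'now'.
    public bool AllowAttempt(DateTimeOffset now)
    {
        if (State == CircuitState.Open && now - _openedAt >= _openDuration)
        {
            // Open period elapsed: allow trial attempts until the next recorded outcome.
            State = CircuitState.HalfOpen;
        }
        return State != CircuitState.Open;
    }

    public void RecordFailure(DateTimeOffset now)
    {
        _consecutiveFailures++;
        // A failed trial while half-open, or reaching the threshold, (re)opens the circuit.
        if (State == CircuitState.HalfOpen || _consecutiveFailures >= _failureThreshold)
        {
            State = CircuitState.Open;
            _openedAt = now;
        }
    }

    public void RecordSuccess()
    {
        _consecutiveFailures = 0;
        State = CircuitState.Closed;
    }
}
```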
Default
Gets the default options instance.
public static RingKernelFaultRecoveryOptions Default { get; }
Property Value: RingKernelFaultRecoveryOptions
EnableAutoRestart
Gets or sets whether automatic restart is enabled for crashed kernels.
public bool EnableAutoRestart { get; set; }
Property Value: bool
EnableWatchdog
Gets or sets whether the kernel watchdog is enabled. When enabled, the watchdog monitors kernel health and triggers recovery when it detects a problem.
public bool EnableWatchdog { get; set; }
Property Value: bool
FailureTrackingWindow
Gets or sets the time window over which kernel failures are tracked. Failures older than this window do not count toward CircuitBreakerFailureThreshold.
public TimeSpan FailureTrackingWindow { get; set; }
Property Value: TimeSpan
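One way to model FailureTrackingWindow is a sliding window of failure timestamps. The hypothetical FailureWindowSketch below (not part of DotCompute) drops failures older than the window and reports when the remaining count reaches the threshold. It assumes failures are recorded in chronological order.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sliding-window failure counter; illustrates FailureTrackingWindow semantics only.
public sealed class FailureWindowSketch
{
    private readonly Queue<DateTimeOffset> _failures = new();
    private readonly TimeSpan _window;   // plays the role of FailureTrackingWindow
    private readonly int _threshold;     // plays the role of CircuitBreakerFailureThreshold

    public FailureWindowSketch(TimeSpan window, int threshold)
    {
        _window = window;
        _threshold = threshold;
    }

    // Records a failure at 'now'; returns true if the threshold is reached within the window.
    public bool RecordFailure(DateTimeOffset now)
    {
        _failures.Enqueue(now);
        Prune(now);
        return _failures.Count >= _threshold;
    }

    public int CountInWindow(DateTimeOffset now)
    {
        Prune(now);
        return _failures.Count;
    }

    // Drops failures older than the window so they no longer count.
    private void Prune(DateTimeOffset now)
    {
        while (_failures.Count > 0 && now - _failures.Peek() > _window)
        {
            _failures.Dequeue();
        }
    }
}
```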
HeartbeatTimeout
Gets or sets the maximum time to wait for a kernel to respond to a heartbeat.
public TimeSpan HeartbeatTimeout { get; set; }
Property Value: TimeSpan
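A heartbeat wait bounded by HeartbeatTimeout can be sketched with Task.WaitAsync(TimeSpan), available in .NET 6 and later. The HeartbeatSketch helper and the idea of a task that completes when the kernel acknowledges a heartbeat are assumptions; how DotCompute actually exchanges heartbeats with a ring kernel is not described on this page.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper: waits for a heartbeat acknowledgement, giving up after heartbeatTimeout.
// Requires .NET 6+ for Task.WaitAsync(TimeSpan).
public static class HeartbeatSketch
{
    public static async Task<bool> WaitForHeartbeatAsync(Task heartbeatAck, TimeSpan heartbeatTimeout)
    {
        try
        {
            await heartbeatAck.WaitAsync(heartbeatTimeout);
            return true;   // the kernel responded in time
        }
        catch (TimeoutException)
        {
            return false;  // no response within HeartbeatTimeout: treat the kernel as unresponsive
        }
    }
}
```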
KernelStallTimeout
Gets or sets the timeout after which a kernel is considered stalled. If a kernel doesn't process messages within this time, recovery is attempted.
public TimeSpan KernelStallTimeout { get; set; }
Property Value: TimeSpan
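The stall rule above can be sketched by sampling a progress counter and timing how long it stays unchanged. StallDetectorSketch and its processedCount input are hypothetical. This literal reading also flags an idle kernel with an empty queue, which a real detector may need to distinguish from a genuine stall.

```csharp
using System;

// Hypothetical stall detector: a kernel counts as stalled when its processed-message
// counter has not changed for longer than stallTimeout (KernelStallTimeout).
public sealed class StallDetectorSketch
{
    private readonly TimeSpan _stallTimeout;
    private long _lastCount = -1;
    private DateTimeOffset _lastChange;

    public StallDetectorSketch(TimeSpan stallTimeout) => _stallTimeout = stallTimeout;

    // processedCount is a hypothetical, monotonically increasing count of processed messages.
    public bool IsStalled(long processedCount, DateTimeOffset now)
    {
        if (processedCount != _lastCount)
        {
            _lastCount = processedCount;
            _lastChange = now;   // progress observed: restart the stall clock
            return false;
        }
        return now - _lastChange > _stallTimeout;
    }
}
```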
MaxRestartAttempts
Gets or sets the maximum number of automatic restart attempts before giving up.
public int MaxRestartAttempts { get; set; }
Property Value: int
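A minimal sketch of a restart loop bounded by MaxRestartAttempts follows. RestartLoopSketch and its tryLaunch callback are hypothetical, and this page does not say whether the initial launch counts as an attempt; the sketch assumes it does not.

```csharp
using System;

// Hypothetical restart loop: retries a failed kernel launch up to maxRestartAttempts times.
// Assumption: the initial launch is not counted as a restart attempt.
public static class RestartLoopSketch
{
    // Returns the number of restarts used, or -1 if every attempt failed.
    public static int LaunchWithRestarts(Func<bool> tryLaunch, int maxRestartAttempts)
    {
        if (tryLaunch())
        {
            return 0; // first launch succeeded; no restarts needed
        }
        for (int attempt = 1; attempt <= maxRestartAttempts; attempt++)
        {
            // A real implementation would wait RestartDelay (or a backoff delay) here.
            if (tryLaunch())
            {
                return attempt;
            }
        }
        return -1; // give up: MaxRestartAttempts exhausted
    }
}
```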
MaxRestartDelay
Gets or sets the upper bound on the restart delay when UseExponentialBackoff is enabled.
public TimeSpan MaxRestartDelay { get; set; }
Property Value: TimeSpan
NotifyHealthMonitor
Gets or sets whether kernel failures are reported to the health monitor.
public bool NotifyHealthMonitor { get; set; }
Property Value: bool
ResetFailuresOnSuccess
Gets or sets whether to reset the failure count after successful kernel execution.
public bool ResetFailuresOnSuccess { get; set; }
Property Value: bool
RestartDelay
Gets or sets the delay between restart attempts.
public TimeSpan RestartDelay { get; set; }
Property Value: TimeSpan
SuccessfulRunThreshold
Gets or sets the minimum time a kernel must run successfully before its restart count is reset. This prevents rapid failure-restart cycles.
public TimeSpan SuccessfulRunThreshold { get; set; }
Property Value: TimeSpan
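The SuccessfulRunThreshold rule can be sketched as bookkeeping that clears the restart counter only after the kernel has stayed up long enough since its last start or restart. RestartCounterSketch and its callback names are hypothetical and only illustrate the described behavior.

```csharp
using System;

// Hypothetical bookkeeping for SuccessfulRunThreshold: the restart counter is cleared
// only once the kernel has stayed up for at least the threshold.
public sealed class RestartCounterSketch
{
    private readonly TimeSpan _successfulRunThreshold;
    private DateTimeOffset _startedAt;

    public RestartCounterSketch(TimeSpan successfulRunThreshold) => _successfulRunThreshold = successfulRunThreshold;

    public int RestartCount { get; private set; }

    public void OnKernelStarted(DateTimeOffset now) => _startedAt = now;

    public void OnKernelRestarted(DateTimeOffset now)
    {
        RestartCount++;
        _startedAt = now;   // the run-time clock starts over on every restart
    }

    // Called periodically (for example by the watchdog) while the kernel is healthy.
    public void OnHealthy(DateTimeOffset now)
    {
        if (now - _startedAt >= _successfulRunThreshold)
        {
            RestartCount = 0; // ran long enough: forget earlier restarts
        }
    }
}
```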
UseExponentialBackoff
Gets or sets whether restart delays grow exponentially with each attempt, up to MaxRestartDelay.
public bool UseExponentialBackoff { get; set; }
Property Value: bool
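The following sketch shows one plausible way RestartDelay, UseExponentialBackoff, and MaxRestartDelay combine. The doubling factor and the rule that the first attempt waits exactly RestartDelay are assumptions, since the growth rate is not documented here, and BackoffSketch is a hypothetical helper.

```csharp
using System;

// Hypothetical delay calculation combining RestartDelay, UseExponentialBackoff and MaxRestartDelay.
// Assumption: the delay doubles per attempt and attempt 1 waits exactly restartDelay.
public static class BackoffSketch
{
    public static TimeSpan GetRestartDelay(int attempt, TimeSpan restartDelay, bool useExponentialBackoff, TimeSpan maxRestartDelay)
    {
        if (attempt < 1) throw new ArgumentOutOfRangeException(nameof(attempt));
        if (!useExponentialBackoff) return restartDelay; // fixed delay between attempts

        // Limit the exponent so the computation stays finite for very large attempt numbers.
        double factor = Math.Pow(2, Math.Min(attempt - 1, 30));
        double ticks = restartDelay.Ticks * factor;

        // Cap the grown delay at maxRestartDelay.
        return ticks >= maxRestartDelay.Ticks ? maxRestartDelay : TimeSpan.FromTicks((long)ticks);
    }
}
```

With a 100 ms base delay and a 1 s cap, attempts 1 to 5 would wait 100, 200, 400, 800, and 1000 ms under these assumptions.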
WatchdogInterval
Gets or sets the interval between watchdog health checks.
public TimeSpan WatchdogInterval { get; set; }
Property Value: TimeSpan
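A periodic watchdog driven by WatchdogInterval can be sketched with System.Threading.PeriodicTimer, available in .NET 6 and later. WatchdogSketch and its isHealthy callback are hypothetical; a real watchdog would trigger recovery rather than simply stopping.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical watchdog loop: runs a health check every watchdogInterval until the check
// reports a problem or cancellation is requested. Requires .NET 6+ for PeriodicTimer.
public static class WatchdogSketch
{
    // Returns the number of checks performed before the loop stopped.
    public static async Task<int> RunAsync(TimeSpan watchdogInterval, Func<bool> isHealthy, CancellationToken cancellationToken)
    {
        using var timer = new PeriodicTimer(watchdogInterval);
        int checks = 0;
        try
        {
            while (await timer.WaitForNextTickAsync(cancellationToken))
            {
                checks++;
                if (!isHealthy())
                {
                    // A real watchdog would start recovery here (restart, circuit breaker, notification).
                    break;
                }
            }
        }
        catch (OperationCanceledException)
        {
            // Cancellation is the normal way to stop the watchdog.
        }
        return checks;
    }
}
```

PeriodicTimer suits this because checks never overlap: ticks that elapse while a check is still running are coalesced into a single pending tick.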
Methods
Validate()
Validates the options and throws if any values are invalid.
public void Validate()
Exceptions
- ArgumentOutOfRangeException
Thrown when any option is outside its valid range.
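This page does not list the specific rules Validate() enforces, so the following sketch uses a hypothetical FaultRecoverySettingsSketch type with assumed constraints (positive threshold, non-negative restart settings, MaxRestartDelay at least RestartDelay, positive watchdog interval) to show the fail-fast pattern with ArgumentOutOfRangeException. The property values are placeholders, not the library's defaults.

```csharp
using System;

// Hypothetical mirror of a few options with ASSUMED validation rules;
// the real rules of RingKernelFaultRecoveryOptions.Validate() are not documented here.
public sealed class FaultRecoverySettingsSketch
{
    // Placeholder values for the sketch; NOT the library's defaults.
    public int CircuitBreakerFailureThreshold { get; set; } = 5;
    public int MaxRestartAttempts { get; set; } = 3;
    public TimeSpan RestartDelay { get; set; } = TimeSpan.FromMilliseconds(100);
    public TimeSpan MaxRestartDelay { get; set; } = TimeSpan.FromSeconds(30);
    public TimeSpan WatchdogInterval { get; set; } = TimeSpan.FromSeconds(1);

    public void Validate()
    {
        if (CircuitBreakerFailureThreshold <= 0)
            throw new ArgumentOutOfRangeException(nameof(CircuitBreakerFailureThreshold), CircuitBreakerFailureThreshold, "Must be positive.");
        if (MaxRestartAttempts < 0)
            throw new ArgumentOutOfRangeException(nameof(MaxRestartAttempts), MaxRestartAttempts, "Must not be negative.");
        if (RestartDelay < TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(RestartDelay), RestartDelay, "Must not be negative.");
        if (MaxRestartDelay < RestartDelay)
            throw new ArgumentOutOfRangeException(nameof(MaxRestartDelay), MaxRestartDelay, "Must be at least RestartDelay.");
        if (WatchdogInterval <= TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(WatchdogInterval), WatchdogInterval, "Must be positive.");
    }
}
```

Calling Validate() once, right after the options are configured, surfaces a misconfiguration immediately instead of when a kernel first fails.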