Struct KernelHealthStatus
- Namespace
- DotCompute.Backends.CUDA.RingKernels
- Assembly
- DotCompute.Backends.CUDA.dll
Health monitoring data for a Ring Kernel (GPU-resident).
public struct KernelHealthStatus : IEquatable<KernelHealthStatus>
- Implements
- IEquatable<KernelHealthStatus>
Remarks
Enables automatic failure detection and recovery for persistent Ring Kernels. Kernels periodically update heartbeat timestamps, and the host monitors them for stale data.
Memory Layout (32 bytes, 8-byte aligned):
- LastHeartbeatTicks: 8 bytes (atomic timestamp)
- FailedHeartbeats: 4 bytes (consecutive failures)
- ErrorCount: 4 bytes (total errors)
- State: 4 bytes (kernel health state)
- LastCheckpointId: 8 bytes (checkpoint identifier)
- Reserved: 4 bytes (padding/future use)
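The field sizes listed above add up to 32 bytes only when the fields sit back to back. Under default sequential layout, the 8-byte LastCheckpointId would be padded from offset 20 to offset 24. The following is a minimal sketch of a layout that matches the listed sizes. It assumes the documented field order and `Pack = 4`; `HealthStatusMirror` and `LayoutCheck` are hypothetical names, not the library's types, and the real struct's layout attributes may differ. Alignment of the GPU buffer itself is not modeled.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical mirror of the documented layout; NOT the DotCompute type.
// Pack = 4 is an assumption so the long after State is not padded to offset 24.
[StructLayout(LayoutKind.Sequential, Pack = 4)]
public struct HealthStatusMirror
{
    public long LastHeartbeatTicks; // offset 0,  8 bytes
    public int FailedHeartbeats;    // offset 8,  4 bytes
    public int ErrorCount;          // offset 12, 4 bytes
    public int State;               // offset 16, 4 bytes
    public long LastCheckpointId;   // offset 20, 8 bytes
    public int Reserved;            // offset 28, 4 bytes
}

public static class LayoutCheck
{
    // Marshaled size of the mirror struct in bytes.
    public static int SizeInBytes() => Marshal.SizeOf<HealthStatusMirror>();

    // Marshaled byte offset of a named field.
    public static int OffsetOf(string fieldName) =>
        (int)Marshal.OffsetOf<HealthStatusMirror>(fieldName);
}
```

A check such as `LayoutCheck.SizeInBytes() == 32` at startup is a cheap way to catch a host/device layout mismatch before any buffer is shared.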
Failure Detection:
1. Heartbeat Monitoring: Each kernel updates its timestamp periodically (~100 ms)
2. Timeout Detection: The host checks for stale timestamps (>5 seconds)
3. Error Threshold: The host monitors the error count (>10 errors triggers failure)
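The detection rules above can be sketched as a single host-side check. This is an illustration only: `HealthVerdict`, `FailureDetection`, and `Evaluate` are hypothetical names, and the order in which the two conditions are tested is an assumption. The thresholds (5 seconds, 10 errors) come from the list above.

```csharp
using System;

// Hypothetical result type for the sketch; not part of DotCompute.
public enum HealthVerdict { Ok, StaleHeartbeat, TooManyErrors }

public static class FailureDetection
{
    // Thresholds taken from the remarks above.
    public static readonly TimeSpan HeartbeatTimeout = TimeSpan.FromSeconds(5);
    public const int MaxErrors = 10;

    public static HealthVerdict Evaluate(long lastHeartbeatTicks, int errorCount, DateTime nowUtc)
    {
        // Age of the last heartbeat, computed from 100 ns ticks.
        TimeSpan age = TimeSpan.FromTicks(nowUtc.Ticks - lastHeartbeatTicks);

        // Timeout detection: heartbeat older than 5 seconds.
        if (age > HeartbeatTimeout) return HealthVerdict.StaleHeartbeat;

        // Error threshold: more than 10 errors.
        if (errorCount > MaxErrors) return HealthVerdict.TooManyErrors;

        return HealthVerdict.Ok;
    }
}
```

Passing `nowUtc` explicitly, rather than reading the clock inside the method, keeps the check deterministic and easy to test.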
Recovery Strategies:
- Checkpoint/Restore: Periodic state snapshots for recovery
- Message Replay: Re-send messages from last checkpoint
- Kernel Restart: Relaunch failed kernel with restored state
Fields
ErrorCount
Total number of errors encountered by kernel.
public int ErrorCount
Field Value
- int
Remarks
Incremented atomically by the kernel on each error. More than 10 errors triggers kernel failure. Includes message processing errors, memory errors, and similar faults.
FailedHeartbeats
Number of consecutive failed heartbeats.
public int FailedHeartbeats
Field Value
- int
Remarks
Incremented by the host when the heartbeat is stale. Reset to 0 when a valid heartbeat is received. Threshold: 3 failed heartbeats triggers the degraded state.
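A minimal sketch of the counter rule described above, as the host might apply it on each monitoring pass. `HeartbeatCounter` and its members are hypothetical names, and reading the threshold as "3 or more" is an assumption.

```csharp
// Hypothetical helper; illustrates the increment/reset rule only.
public static class HeartbeatCounter
{
    // From the remarks above; treating it as ">= 3" is an assumption.
    public const int DegradedThreshold = 3;

    // Returns the new consecutive-failure count:
    // one more on a stale heartbeat, zero on a valid one.
    public static int Next(int failedHeartbeats, bool heartbeatIsStale) =>
        heartbeatIsStale ? failedHeartbeats + 1 : 0;

    // True once enough consecutive failures have accumulated.
    public static bool ShouldDegrade(int failedHeartbeats) =>
        failedHeartbeats >= DegradedThreshold;
}
```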
LastCheckpointId
Last checkpoint identifier (for recovery).
public long LastCheckpointId
Field Value
- long
Remarks
Updated when the kernel creates a checkpoint snapshot. Used to identify which state to restore during recovery. 0 = no checkpoint exists.
LastHeartbeatTicks
Last heartbeat timestamp in ticks (100-nanosecond intervals since 0001-01-01T00:00:00, the DateTime epoch).
public long LastHeartbeatTicks
Field Value
- long
Remarks
Updated atomically by the kernel roughly every 100 ms. The host compares it with the current time to detect stale kernels. Uses DateTime.UtcNow.Ticks for consistency with the host.
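Because the value is in DateTime ticks, the arithmetic on the host side is plain integer subtraction. The sketch below shows the unit conversion and how many ~100 ms heartbeat periods fit into a gap. `HeartbeatTicks` and its members are hypothetical names; the 100 ms period comes from the remarks above.

```csharp
using System;

// Hypothetical helper for tick arithmetic; not part of DotCompute.
public static class HeartbeatTicks
{
    // ~100 ms heartbeat period expressed in 100 ns ticks (1,000,000).
    public const long IntervalTicks = 100 * TimeSpan.TicksPerMillisecond;

    // Elapsed time between the stored heartbeat and a "now" tick value,
    // for example DateTime.UtcNow.Ticks.
    public static TimeSpan Age(long lastHeartbeatTicks, long nowTicks) =>
        new TimeSpan(nowTicks - lastHeartbeatTicks);

    // Number of whole heartbeat periods that fit into the gap;
    // clamped to zero if the clocks disagree and the gap is negative.
    public static long ElapsedIntervals(long lastHeartbeatTicks, long nowTicks) =>
        Math.Max(0L, (nowTicks - lastHeartbeatTicks) / IntervalTicks);
}
```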
State
Current kernel health state.
public int State
Field Value
- int
Remarks
Modified by both the kernel (self-diagnosis) and the host (monitoring). State transitions: Healthy → Degraded → Failed → Recovering → Healthy
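The transition chain above can be sketched as a small lookup. This reads the chain strictly, so each state has exactly one successor, which is an assumption: the library may permit other moves, such as Degraded back to Healthy. `HealthState` and `HealthTransitions` are hypothetical names, and the numeric values of the real state constants are not given on this page.

```csharp
using System;

// Hypothetical enum; member values are NOT taken from the library.
public enum HealthState { Healthy, Degraded, Failed, Recovering }

public static class HealthTransitions
{
    // Next state in the documented cycle.
    public static HealthState Next(HealthState current) => current switch
    {
        HealthState.Healthy => HealthState.Degraded,
        HealthState.Degraded => HealthState.Failed,
        HealthState.Failed => HealthState.Recovering,
        HealthState.Recovering => HealthState.Healthy,
        _ => throw new ArgumentOutOfRangeException(nameof(current)),
    };

    // True only for a single step along the cycle (strict reading).
    public static bool IsAllowed(HealthState from, HealthState to) => Next(from) == to;
}
```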
Methods
CreateEmpty()
Creates an empty health status with all fields set to zero.
public static KernelHealthStatus CreateEmpty()
Returns
- KernelHealthStatus
Empty health status suitable for GPU allocation.
CreateInitialized()
Creates a health status initialized with current timestamp.
public static KernelHealthStatus CreateInitialized()
Returns
- KernelHealthStatus
Health status with current heartbeat timestamp.
Equals(KernelHealthStatus)
Indicates whether the current object is equal to another object of the same type.
public readonly bool Equals(KernelHealthStatus other)
Parameters
- other KernelHealthStatus
An object to compare with this object.
Returns
- bool
true if the current object is equal to the other parameter; otherwise, false.
Equals(object?)
Indicates whether this instance and a specified object are equal.
public override readonly bool Equals(object? obj)
Parameters
- obj object
The object to compare with the current instance.
Returns
- bool
true if obj and this instance are the same type and represent the same value; otherwise, false.
GetHashCode()
Returns the hash code for this instance.
public override readonly int GetHashCode()
Returns
- int
A 32-bit signed integer that is the hash code for this instance.
IsDegraded()
Checks if kernel is degraded.
public readonly bool IsDegraded()
Returns
- bool
True if state is Degraded, false otherwise.
IsFailed()
Checks if kernel has failed.
public readonly bool IsFailed()
Returns
- bool
True if state is Failed, false otherwise.
IsHealthy()
Checks if kernel is in healthy state.
public readonly bool IsHealthy()
Returns
- bool
True if state is Healthy, false otherwise.
IsHeartbeatStale(TimeSpan)
Checks if heartbeat is stale (older than specified timeout).
public readonly bool IsHeartbeatStale(TimeSpan timeout)
Parameters
- timeout TimeSpan
Timeout duration.
Returns
- bool
True if heartbeat is stale, false otherwise.
IsRecovering()
Checks if kernel is in recovery mode.
public readonly bool IsRecovering()
Returns
- bool
True if state is Recovering, false otherwise.
TimeSinceLastHeartbeat()
Gets the time since last heartbeat.
public readonly TimeSpan TimeSinceLastHeartbeat()
Returns
- TimeSpan
TimeSpan since last heartbeat.
Validate()
Validates the health status structure for correctness.
public readonly bool Validate()
Returns
- bool
True if valid, false if any invariant is violated.
Remarks
Checks:
- LastHeartbeatTicks is non-negative
- FailedHeartbeats is non-negative
- ErrorCount is non-negative
- State is valid KernelState value
- LastCheckpointId is non-negative
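The checks listed above can be sketched as one predicate over the raw field values. This is not the library's implementation: `AssumedKernelState` is a hypothetical stand-in for KernelState, whose members and numeric values are not listed on this page, and `HealthStatusChecks.IsValid` is an illustrative name.

```csharp
using System;

// Hypothetical stand-in for KernelState; values are assumptions.
public enum AssumedKernelState { Healthy = 0, Degraded = 1, Failed = 2, Recovering = 3 }

public static class HealthStatusChecks
{
    // Returns true only if every invariant from the list above holds.
    public static bool IsValid(long lastHeartbeatTicks, int failedHeartbeats,
                               int errorCount, int state, long lastCheckpointId)
    {
        return lastHeartbeatTicks >= 0
            && failedHeartbeats >= 0
            && errorCount >= 0
            // State must map to a defined member of the (assumed) enum.
            && Enum.IsDefined(typeof(AssumedKernelState), state)
            && lastCheckpointId >= 0;
    }
}
```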
Operators
operator ==(KernelHealthStatus, KernelHealthStatus)
Equality operator.
public static bool operator ==(KernelHealthStatus left, KernelHealthStatus right)
Parameters
- left KernelHealthStatus
- right KernelHealthStatus
Returns
- bool
operator !=(KernelHealthStatus, KernelHealthStatus)
Inequality operator.
public static bool operator !=(KernelHealthStatus left, KernelHealthStatus right)
Parameters
- left KernelHealthStatus
- right KernelHealthStatus