Struct KernelHealthStatus

Namespace
DotCompute.Backends.CUDA.RingKernels
Assembly
DotCompute.Backends.CUDA.dll

Health monitoring data for a Ring Kernel (GPU-resident).

public struct KernelHealthStatus : IEquatable<KernelHealthStatus>
Implements
IEquatable<KernelHealthStatus>

Remarks

Enables automatic failure detection and recovery for persistent Ring Kernels. Kernels periodically update heartbeat timestamps, and the host monitors them for stale data.

Memory Layout (32 bytes, 8-byte aligned):

  • LastHeartbeatTicks: 8 bytes (atomic timestamp)
  • FailedHeartbeats: 4 bytes (consecutive failures)
  • ErrorCount: 4 bytes (total errors)
  • State: 4 bytes (kernel health state)
  • LastCheckpointId: 8 bytes (checkpoint identifier)
  • Reserved: 4 bytes (padding/future use)
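The field sizes listed above add up to 8 + 4 + 4 + 4 + 8 + 4 = 32 bytes. The sketch below is a hypothetical mirror struct (HealthStatusLayoutSketch is not part of the library) that shows one way to reproduce that size in C#. It assumes the fields are stored in the listed order with Pack = 4. The page does not state the actual field offsets or packing, so treat the offsets in the comments as a consequence of that assumption only.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical mirror of the documented layout; NOT the library's declaration.
// Assumption: Pack = 4. With default packing on typical 64-bit runtimes the
// second long would be padded to offset 24 and the struct would grow to 40 bytes.
[StructLayout(LayoutKind.Sequential, Pack = 4)]
public struct HealthStatusLayoutSketch
{
    public long LastHeartbeatTicks; // offset 0,  8 bytes
    public int FailedHeartbeats;    // offset 8,  4 bytes
    public int ErrorCount;          // offset 12, 4 bytes
    public int State;               // offset 16, 4 bytes
    public long LastCheckpointId;   // offset 20, 8 bytes (under the Pack = 4 assumption)
    public int Reserved;            // offset 28, 4 bytes
}

public static class LayoutCheck
{
    // Marshaled size of the sketch struct in bytes.
    public static int SizeInBytes() => Marshal.SizeOf<HealthStatusLayoutSketch>();

    // Marshaled byte offset of a named field in the sketch struct.
    public static int OffsetOf(string fieldName) =>
        (int)Marshal.OffsetOf<HealthStatusLayoutSketch>(fieldName);
}
```

A check like this is useful as a startup assertion when a struct is shared with GPU code, because a silent size mismatch between host and device corrupts every field after the first misaligned one.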

Failure Detection:

  1. Heartbeat Monitoring: Each kernel updates its timestamp periodically (~100ms)
  2. Timeout Detection: Host checks for stale timestamps (>5 seconds)
  3. Error Threshold: Host monitors error count (>10 errors triggers failure)
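The host-side part of these checks can be sketched as a pure function over the two relevant fields. Everything named here (HostMonitorSketch, HealthVerdictSketch, Evaluate) is hypothetical and only illustrates the documented thresholds of 5 seconds and 10 errors. The order in which the two checks run is an assumption.

```csharp
using System;

// Hypothetical result type for the sketch; not part of the library.
public enum HealthVerdictSketch { Ok, HeartbeatTimeout, ErrorThresholdExceeded }

public static class HostMonitorSketch
{
    // Thresholds taken from the remarks above.
    public static readonly TimeSpan HeartbeatTimeout = TimeSpan.FromSeconds(5);
    public const int MaxErrors = 10;

    // Classifies a status snapshot. nowTicks is passed in so the function is
    // deterministic; a real monitor would likely pass DateTime.UtcNow.Ticks.
    public static HealthVerdictSketch Evaluate(long lastHeartbeatTicks, int errorCount, long nowTicks)
    {
        TimeSpan age = TimeSpan.FromTicks(nowTicks - lastHeartbeatTicks);

        // Assumption: the timeout check runs before the error check.
        if (age > HeartbeatTimeout)
            return HealthVerdictSketch.HeartbeatTimeout;

        if (errorCount > MaxErrors)
            return HealthVerdictSketch.ErrorThresholdExceeded;

        return HealthVerdictSketch.Ok;
    }
}
```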

Recovery Strategies:

  • Checkpoint/Restore: Periodic state snapshots for recovery
  • Message Replay: Re-send messages from last checkpoint
  • Kernel Restart: Relaunch failed kernel with restored state

Fields

ErrorCount

Total number of errors encountered by kernel.

public int ErrorCount

Field Value

int

Remarks

Incremented atomically by the kernel on errors. Threshold: more than 10 errors triggers kernel failure. Includes message processing errors, memory errors, etc.

FailedHeartbeats

Number of consecutive failed heartbeats.

public int FailedHeartbeats

Field Value

int

Remarks

Incremented by the host when the heartbeat is stale. Reset to 0 when a valid heartbeat is received. Threshold: 3 consecutive failed heartbeats trigger the degraded state.
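The counter behaviour described here can be sketched as follows. FailedHeartbeatSketch and its members are hypothetical names, and the sketch assumes the threshold is reached at exactly 3 consecutive stale polls (a >= comparison), which the remarks suggest but do not spell out.

```csharp
public static class FailedHeartbeatSketch
{
    // Threshold from the remarks above.
    public const int DegradedThreshold = 3;

    // New consecutive-failure count after one host poll:
    // a stale heartbeat increments the count, a valid one resets it to 0.
    public static int Next(int failedHeartbeats, bool heartbeatIsStale) =>
        heartbeatIsStale ? failedHeartbeats + 1 : 0;

    // Assumption: ">= 3" is the intended comparison.
    public static bool ShouldDegrade(int failedHeartbeats) =>
        failedHeartbeats >= DegradedThreshold;
}
```

Resetting on any valid heartbeat means the counter tracks consecutive misses only, so a kernel that misses an occasional poll is not pushed into the degraded state.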

LastCheckpointId

Last checkpoint identifier (for recovery).

public long LastCheckpointId

Field Value

long

Remarks

Updated when kernel creates checkpoint snapshot. Used to identify which state to restore during recovery. 0 = no checkpoint exists.

LastHeartbeatTicks

Last heartbeat timestamp in ticks (100-nanosecond intervals since 00:00:00 UTC on January 1, 0001, the same scale as DateTime.Ticks).

public long LastHeartbeatTicks

Field Value

long

Remarks

Updated atomically by the kernel roughly every 100 ms. The host compares it with the current time to detect stale kernels. Uses DateTime.UtcNow.Ticks for consistency with the host.

State

Current kernel health state.

public int State

Field Value

int

Remarks

Modified by both the kernel (self-diagnosis) and the host (monitoring). State transitions: Healthy → Degraded → Failed → Recovering → Healthy
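The transition cycle can be expressed as a small lookup. This is a sketch under two assumptions that the page does not confirm: that the states map to the integer values 0 through 3 in the order listed, and that only the four listed transitions are permitted. KernelStateSketch and StateTransitionSketch are hypothetical names.

```csharp
// Hypothetical stand-in for the state values; the numeric values are assumed.
public enum KernelStateSketch
{
    Healthy = 0,
    Degraded = 1,
    Failed = 2,
    Recovering = 3,
}

public static class StateTransitionSketch
{
    // True only for the four transitions listed in the remarks:
    // Healthy -> Degraded -> Failed -> Recovering -> Healthy.
    public static bool IsDocumentedTransition(KernelStateSketch from, KernelStateSketch to) =>
        (from, to) switch
        {
            (KernelStateSketch.Healthy, KernelStateSketch.Degraded) => true,
            (KernelStateSketch.Degraded, KernelStateSketch.Failed) => true,
            (KernelStateSketch.Failed, KernelStateSketch.Recovering) => true,
            (KernelStateSketch.Recovering, KernelStateSketch.Healthy) => true,
            _ => false,
        };
}
```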

Methods

CreateEmpty()

Creates an empty health status with all fields set to zero.

public static KernelHealthStatus CreateEmpty()

Returns

KernelHealthStatus

Empty health status suitable for GPU allocation.

CreateInitialized()

Creates a health status initialized with current timestamp.

public static KernelHealthStatus CreateInitialized()

Returns

KernelHealthStatus

Health status with current heartbeat timestamp.

Equals(KernelHealthStatus)

Indicates whether the current object is equal to another object of the same type.

public readonly bool Equals(KernelHealthStatus other)

Parameters

other KernelHealthStatus

An object to compare with this object.

Returns

bool

true if the current object is equal to the other parameter; otherwise, false.

Equals(object?)

Indicates whether this instance and a specified object are equal.

public override readonly bool Equals(object? obj)

Parameters

obj object

The object to compare with the current instance.

Returns

bool

true if obj and this instance are the same type and represent the same value; otherwise, false.

GetHashCode()

Returns the hash code for this instance.

public override readonly int GetHashCode()

Returns

int

A 32-bit signed integer that is the hash code for this instance.

IsDegraded()

Checks if kernel is degraded.

public readonly bool IsDegraded()

Returns

bool

True if state is Degraded, false otherwise.

IsFailed()

Checks if kernel has failed.

public readonly bool IsFailed()

Returns

bool

True if state is Failed, false otherwise.

IsHealthy()

Checks if kernel is in healthy state.

public readonly bool IsHealthy()

Returns

bool

True if state is Healthy, false otherwise.

IsHeartbeatStale(TimeSpan)

Checks if heartbeat is stale (older than specified timeout).

public readonly bool IsHeartbeatStale(TimeSpan timeout)

Parameters

timeout TimeSpan

Timeout duration.

Returns

bool

True if heartbeat is stale, false otherwise.

IsRecovering()

Checks if kernel is in recovery mode.

public readonly bool IsRecovering()

Returns

bool

True if state is Recovering, false otherwise.

TimeSinceLastHeartbeat()

Gets the time since last heartbeat.

public readonly TimeSpan TimeSinceLastHeartbeat()

Returns

TimeSpan

TimeSpan since last heartbeat.
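The two heartbeat helpers, IsHeartbeatStale(TimeSpan) and TimeSinceLastHeartbeat(), amount to tick arithmetic on LastHeartbeatTicks. The sketch below shows that arithmetic with hypothetical names (HeartbeatAgeSketch, TimeSince, IsStale). The current time is passed in as a parameter to keep the example deterministic, and whether the real method compares with > or >= is an assumption.

```csharp
using System;

public static class HeartbeatAgeSketch
{
    // Elapsed time between the last heartbeat and "now", both in 100 ns ticks.
    public static TimeSpan TimeSince(long lastHeartbeatTicks, long nowTicks) =>
        TimeSpan.FromTicks(nowTicks - lastHeartbeatTicks);

    // Assumption: a heartbeat is stale when its age strictly exceeds the timeout.
    public static bool IsStale(long lastHeartbeatTicks, long nowTicks, TimeSpan timeout) =>
        TimeSince(lastHeartbeatTicks, nowTicks) > timeout;
}
```

A host poll would typically call something like HeartbeatAgeSketch.IsStale(lastTicks, DateTime.UtcNow.Ticks, TimeSpan.FromSeconds(5)), using the 5-second timeout from the struct-level remarks.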

Validate()

Validates the health status structure for correctness.

public readonly bool Validate()

Returns

bool

True if valid, false if any invariant is violated.

Remarks

Checks:

  • LastHeartbeatTicks is non-negative
  • FailedHeartbeats is non-negative
  • ErrorCount is non-negative
  • State is valid KernelState value
  • LastCheckpointId is non-negative
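The listed invariants can be sketched as a single boolean expression. ValidateSketch is a hypothetical helper that takes the field values as parameters. The valid range for State (0 through 3) is an assumption, since the KernelState values are not given on this page.

```csharp
public static class ValidateSketch
{
    // Assumption: the four named states are numbered 0..3.
    public const int MinState = 0;
    public const int MaxState = 3;

    // True when every invariant from the list above holds.
    public static bool Validate(
        long lastHeartbeatTicks,
        int failedHeartbeats,
        int errorCount,
        int state,
        long lastCheckpointId) =>
        lastHeartbeatTicks >= 0
        && failedHeartbeats >= 0
        && errorCount >= 0
        && state >= MinState && state <= MaxState
        && lastCheckpointId >= 0;
}
```

Validation of this kind is most useful right after copying the struct back from device memory, where a negative counter or out-of-range state usually indicates a layout mismatch or a torn read.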

Operators

operator ==(KernelHealthStatus, KernelHealthStatus)

Equality operator.

public static bool operator ==(KernelHealthStatus left, KernelHealthStatus right)

Parameters

left KernelHealthStatus
right KernelHealthStatus

Returns

bool

operator !=(KernelHealthStatus, KernelHealthStatus)

Inequality operator.

public static bool operator !=(KernelHealthStatus left, KernelHealthStatus right)

Parameters

left KernelHealthStatus
right KernelHealthStatus

Returns

bool