Class DeviceHealthSnapshot

Namespace
DotCompute.Abstractions.Health
Assembly
DotCompute.Abstractions.dll

Represents a point-in-time snapshot of device health and performance metrics.

public sealed class DeviceHealthSnapshot
Inheritance
DeviceHealthSnapshot
Remarks

This class provides a comprehensive view of compute device health, including:

  • Sensor readings (temperature, power, utilization, etc.)
  • Health status and scoring
  • Error tracking and diagnostics
  • Availability information

Performance Characteristics:

  • Collection time: 1-10 ms, depending on backend and sensor count
  • Memory footprint: ~1-2 KB per snapshot
  • Recommended collection interval: 5-10 seconds for monitoring
  • Use cached snapshots for high-frequency queries
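The advice to use cached snapshots for high-frequency queries can be sketched as a small wrapper that re-collects only once the cached value is older than a chosen age. This is a minimal sketch under stated assumptions: CachedSnapshotSource is a hypothetical helper, not part of DotCompute, and the collect delegate stands in for whatever call produces a snapshot (for example, the GetHealthSnapshotAsync call shown in the usage example).

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not part of DotCompute): keeps the most recent
// snapshot and refreshes it only after maxAge has elapsed.
public sealed class CachedSnapshotSource<T> where T : class
{
    private readonly Func<Task<T>> _collect;
    private readonly TimeSpan _maxAge;
    private readonly Func<DateTimeOffset> _clock;
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
    private T? _cached;
    private DateTimeOffset _capturedAt;

    public CachedSnapshotSource(
        Func<Task<T>> collect,
        TimeSpan maxAge,
        Func<DateTimeOffset>? clock = null)
    {
        _collect = collect ?? throw new ArgumentNullException(nameof(collect));
        _maxAge = maxAge;
        // The clock is injectable so the cache can be tested deterministically.
        _clock = clock ?? (() => DateTimeOffset.UtcNow);
    }

    public async Task<T> GetAsync()
    {
        // Serialize callers so that at most one collection runs at a time.
        await _gate.WaitAsync().ConfigureAwait(false);
        try
        {
            var now = _clock();
            if (_cached is null || now - _capturedAt >= _maxAge)
            {
                _cached = await _collect().ConfigureAwait(false);
                _capturedAt = now;
            }
            return _cached;
        }
        finally
        {
            _gate.Release();
        }
    }
}
```

With a maxAge in the recommended 5-10 second range, callers that poll many times per second would share one collection per interval instead of paying the 1-10 ms collection cost on every query.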

Orleans Integration: This class is designed for Orleans.GpuBridge.Core integration, providing health metrics for grain placement decisions, fault tolerance, and observability.

Usage Example:

var snapshot = await accelerator.GetHealthSnapshotAsync();

if (snapshot.HealthScore < 0.7)
{
    logger.LogWarning("Device health degraded: {score:P0}", snapshot.HealthScore);

    // Check specific issues
    var tempReading = snapshot.GetSensorReading(SensorType.Temperature);
    if (tempReading?.IsAvailable == true && tempReading.Value > 85.0)
    {
        logger.LogWarning("High temperature detected: {temp}°C", tempReading.Value);
    }
}

Properties

BackendType

Gets the backend type for this device.

public required string BackendType { get; init; }

Property Value

string

Examples

"CUDA", "OpenCL", "Metal", "CPU"

ConsecutiveFailures

Gets the number of consecutive failed operations.

public int ConsecutiveFailures { get; init; }

Property Value

int

Remarks

Used for circuit breaker patterns and automatic device failover. Resets to 0 after a successful operation.
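One way a caller might apply this counter in a circuit breaker is sketched below. FailureCircuitBreaker and its threshold are hypothetical names chosen for illustration; DotCompute is not known to ship such a type, and a production breaker would usually add a cool-down or half-open state.

```csharp
using System;

// Hypothetical helper: decides whether to keep routing work to a device
// based on the ConsecutiveFailures value reported by its latest snapshot.
public sealed class FailureCircuitBreaker
{
    private readonly int _threshold;

    public FailureCircuitBreaker(int threshold)
    {
        if (threshold < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(threshold));
        }
        _threshold = threshold;
    }

    // True while the device should be skipped.
    public bool IsOpen { get; private set; }

    // Feed in snapshot.ConsecutiveFailures each time a snapshot is collected.
    public void Observe(int consecutiveFailures)
    {
        // The counter resets to 0 after a successful operation,
        // which closes the breaker again on the next observation.
        IsOpen = consecutiveFailures >= _threshold;
    }
}
```

Because the breaker state is derived entirely from the latest counter value, it needs no bookkeeping of its own beyond the threshold.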

CustomMetrics

Gets additional custom metrics specific to the backend or device.

public IReadOnlyDictionary<string, double>? CustomMetrics { get; init; }

Property Value

IReadOnlyDictionary<string, double>

Remarks

Used for vendor-specific metrics not covered by standard sensors. Examples:

  • "nvlink_throughput_gbps": NVLink bandwidth (NVIDIA)
  • "infinity_fabric_utilization": IF utilization (AMD)
  • "unified_memory_pressure": Memory pressure level (Apple Silicon)
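Because the property is nullable and any given key may be absent, reads benefit from a guard on both conditions. The helper below is a minimal sketch; CustomMetricReader and TryGetMetric are hypothetical names, and only standard BCL types are used.

```csharp
using System.Collections.Generic;

public static class CustomMetricReader
{
    // Returns the metric value, or null when the dictionary itself is
    // absent (CustomMetrics is nullable) or the key is not present.
    public static double? TryGetMetric(
        IReadOnlyDictionary<string, double>? metrics,
        string name)
    {
        if (metrics is not null && metrics.TryGetValue(name, out var value))
        {
            return value;
        }
        return null;
    }
}
```

A call such as CustomMetricReader.TryGetMetric(snapshot.CustomMetrics, "nvlink_throughput_gbps") would then yield null on backends that do not report that metric.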

DeviceId

Gets the unique identifier of the device.

public required string DeviceId { get; init; }

Property Value

string

DeviceName

Gets the human-readable name of the device.

public required string DeviceName { get; init; }

Property Value

string

Examples

"NVIDIA GeForce RTX 4090", "AMD Radeon RX 7900 XTX", "Apple M3 Max"

ErrorCount

Gets the number of errors detected since device initialization or last reset.

public long ErrorCount { get; init; }

Property Value

long

Remarks

Includes:

  • ECC (Error Correction Code) errors
  • Kernel execution failures
  • Device reset events
  • Communication timeouts

High error counts may indicate hardware issues or instability.

HealthScore

Gets the overall health score (0.0 to 1.0).

public required double HealthScore { get; init; }

Property Value

double

Remarks

The health score is a composite metric that considers:

  • Temperature (weight: 0.3)
  • Error rate (weight: 0.3)
  • Utilization (weight: 0.2)
  • Throttling status (weight: 0.1)
  • Memory pressure (weight: 0.1)

Score Interpretation:

  • 1.0: Perfect health
  • 0.9-1.0: Excellent (green)
  • 0.7-0.9: Good (yellow)
  • 0.5-0.7: Degraded (orange)
  • 0.0-0.5: Critical (red)
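The weights and bands above can be illustrated with a weighted sum. This is a sketch under assumptions, not the library's actual formula: it assumes each component has already been normalized to a 0.0-1.0 sub-score where 1.0 is healthiest, and it assigns the shared boundary values (0.9, 0.7, 0.5) to the better band. HealthScoreSketch is a hypothetical name.

```csharp
using System;

public static class HealthScoreSketch
{
    // Assumption: every argument is a sub-score in [0, 1], 1.0 = healthy.
    public static double Combine(
        double temperature,
        double errorRate,
        double utilization,
        double throttling,
        double memoryPressure)
    {
        // Weights taken from the remarks; they sum to 1.0.
        double score =
            0.3 * temperature +
            0.3 * errorRate +
            0.2 * utilization +
            0.1 * throttling +
            0.1 * memoryPressure;

        // Guard against floating-point drift outside the documented range.
        return Math.Clamp(score, 0.0, 1.0);
    }

    // Assumption: boundary values belong to the better band.
    public static string Classify(double healthScore) => healthScore switch
    {
        >= 0.9 => "Excellent",
        >= 0.7 => "Good",
        >= 0.5 => "Degraded",
        _ => "Critical",
    };
}
```

Under these assumptions, a device whose temperature sub-score drops to 0.5 while every other component stays at 1.0 scores about 0.85 and lands in the Good band.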

Orleans Usage: Use health scores for grain placement decisions. Prefer devices with scores > 0.8 for new grain activations.
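The placement guidance can be sketched as a filter over candidate devices. The code below deliberately avoids any Orleans API: DeviceCandidate is a hypothetical stand-in carrying only the three snapshot fields the decision needs, and PickDevice is an illustrative name.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the snapshot fields used by this sketch.
public sealed record DeviceCandidate(string DeviceId, bool IsAvailable, double HealthScore);

public static class PlacementSketch
{
    // Returns the id of the healthiest available device whose score
    // exceeds minScore, or null when no device qualifies.
    public static string? PickDevice(
        IEnumerable<DeviceCandidate> candidates,
        double minScore = 0.8)
    {
        return candidates
            .Where(c => c.IsAvailable && c.HealthScore > minScore)
            .OrderByDescending(c => c.HealthScore)
            .Select(c => c.DeviceId)
            .FirstOrDefault();
    }
}
```

A null result leaves the fallback policy to the caller, for example retrying later or relaxing the threshold.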

IsAvailable

Gets whether the device is currently available for computation.

public required bool IsAvailable { get; init; }

Property Value

bool

Remarks

A value of false indicates that the device is offline, in an error state, or otherwise unavailable. Check StatusMessage for details.

IsThrottling

Gets whether the device is currently throttling due to thermal or power limits.

public bool IsThrottling { get; init; }

Property Value

bool

LastError

Gets the most recent error message, if any.

public string? LastError { get; init; }

Property Value

string

LastErrorTimestamp

Gets the timestamp of the last error occurrence (UTC).

public DateTimeOffset? LastErrorTimestamp { get; init; }

Property Value

DateTimeOffset?

SensorReadings

Gets the collection of sensor readings captured in this snapshot.

public required IReadOnlyList<SensorReading> SensorReadings { get; init; }

Property Value

IReadOnlyList<SensorReading>

Remarks

May be empty if sensor data collection failed or is not supported. Use GetSensorReading(SensorType) for safe access.

Status

Gets the current health status of the device.

public required DeviceHealthStatus Status { get; init; }

Property Value

DeviceHealthStatus

StatusMessage

Gets an optional status message providing additional health context.

public string? StatusMessage { get; init; }

Property Value

string

Examples

"Operating normally", "High temperature warning", "Power limit exceeded", "Device offline - driver reset required"

Timestamp

Gets the timestamp when this snapshot was captured (UTC).

public required DateTimeOffset Timestamp { get; init; }

Property Value

DateTimeOffset

Methods

CreateUnavailable(string, string, string, string)

Creates a health snapshot indicating the device is unavailable.

public static DeviceHealthSnapshot CreateUnavailable(string deviceId, string deviceName, string backendType, string reason)

Parameters

deviceId string

The device identifier.

deviceName string

The device name.

backendType string

The backend type.

reason string

The reason for unavailability.

Returns

DeviceHealthSnapshot

An unavailable device health snapshot.

GetSensorReading(SensorType)

Retrieves a specific sensor reading from this snapshot.

public SensorReading? GetSensorReading(SensorType sensorType)

Parameters

sensorType SensorType

The type of sensor to retrieve.

Returns

SensorReading

The sensor reading, or null if not available.

GetSensorValue(SensorType)

Gets the value of a specific sensor, or null if unavailable.

public double? GetSensorValue(SensorType sensorType)

Parameters

sensorType SensorType

The type of sensor.

Returns

double?

The sensor value, or null if unavailable.

IsSensorAvailable(SensorType)

Checks if a specific sensor type is available and has a valid reading.

public bool IsSensorAvailable(SensorType sensorType)

Parameters

sensorType SensorType

The type of sensor to check.

Returns

bool

True if the sensor is available with a valid reading.
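The relationship between the three sensor accessors can be illustrated with stand-in types. This is only a sketch of the documented behavior, not the library's implementation: SensorKind, Reading, and SensorLookupSketch are hypothetical, and treating a reading with IsAvailable set to false as "unavailable" in the value accessor is an assumption.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for SensorType and SensorReading.
public enum SensorKind { Temperature, PowerDraw }
public sealed record Reading(SensorKind Kind, double Value, bool IsAvailable);

public static class SensorLookupSketch
{
    // Mirrors GetSensorReading: the first matching reading, or null.
    public static Reading? Find(IReadOnlyList<Reading> readings, SensorKind kind)
        => readings.FirstOrDefault(r => r.Kind == kind);

    // Mirrors GetSensorValue: the value only when a usable reading exists.
    public static double? ValueOf(IReadOnlyList<Reading> readings, SensorKind kind)
    {
        var reading = Find(readings, kind);
        return reading is { IsAvailable: true } ? (double?)reading.Value : null;
    }

    // Mirrors IsSensorAvailable: true only for a present, usable reading.
    public static bool IsAvailable(IReadOnlyList<Reading> readings, SensorKind kind)
        => Find(readings, kind)?.IsAvailable == true;
}
```

An empty readings list, which the SensorReadings remarks describe as possible, yields null, null, and false respectively without any special casing.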

ToString()

Returns a summary string of this health snapshot.

public override string ToString()

Returns

string