Class DeviceHealthSnapshot
- Namespace
- DotCompute.Abstractions.Health
- Assembly
- DotCompute.Abstractions.dll
Represents a point-in-time snapshot of device health and performance metrics.
public sealed class DeviceHealthSnapshot
- Inheritance
- object
- DeviceHealthSnapshot
Remarks
This class provides a comprehensive view of compute device health, including:
- Sensor readings (temperature, power, utilization, etc.)
- Health status and scoring
- Error tracking and diagnostics
- Availability information
Performance Characteristics:
- Collection time: 1-10 ms, depending on backend and sensor count
- Memory footprint: ~1-2 KB per snapshot
- Recommended collection interval: 5-10 seconds for monitoring
- Use cached snapshots for high-frequency queries
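The caching advice above can be sketched as a small time-to-live cache. This is a minimal illustration, not part of DotCompute: the `CachedSnapshot<T>` type, its constructor parameters, and the injectable clock are all assumptions made for the example. In practice the collector delegate would wrap a call such as `accelerator.GetHealthSnapshotAsync()`.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not a DotCompute type): returns a cached value until
// it is older than the configured time-to-live, then collects a fresh one.
public sealed class CachedSnapshot<T> where T : class
{
    private readonly Func<Task<T>> _collect;
    private readonly TimeSpan _ttl;
    private readonly Func<DateTimeOffset> _now;
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
    private T? _value;
    private DateTimeOffset _capturedAt;

    public CachedSnapshot(Func<Task<T>> collect, TimeSpan ttl, Func<DateTimeOffset>? now = null)
    {
        _collect = collect;
        _ttl = ttl;
        // The clock is injectable so the cache can be tested without waiting.
        _now = now ?? (() => DateTimeOffset.UtcNow);
    }

    public async Task<T> GetAsync()
    {
        // Serialize callers so only one collection runs at a time.
        await _gate.WaitAsync();
        try
        {
            if (_value is null || _now() - _capturedAt >= _ttl)
            {
                _value = await _collect();
                _capturedAt = _now();
            }
            return _value;
        }
        finally
        {
            _gate.Release();
        }
    }
}
```

With a TTL of 5 seconds (the lower end of the recommended interval), high-frequency callers would share one snapshot between collections instead of each paying the 1-10 ms collection cost.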
Orleans Integration: This class is designed for Orleans.GpuBridge.Core integration, providing health metrics for grain placement decisions, fault tolerance, and observability.
Usage Example:
var snapshot = await accelerator.GetHealthSnapshotAsync();
if (snapshot.HealthScore < 0.7)
{
    logger.LogWarning("Device health degraded: {score:P0}", snapshot.HealthScore);

    // Check specific issues
    var tempReading = snapshot.GetSensorReading(SensorType.Temperature);
    if (tempReading?.IsAvailable == true && tempReading.Value > 85.0)
    {
        logger.LogWarning("High temperature detected: {temp}°C", tempReading.Value);
    }
}
Properties
BackendType
Gets the backend type for this device.
public required string BackendType { get; init; }
Property Value
- string
Examples
"CUDA", "OpenCL", "Metal", "CPU"
ConsecutiveFailures
Gets the number of consecutive failed operations.
public int ConsecutiveFailures { get; init; }
Property Value
- int
Remarks
Used for circuit breaker patterns and automatic device failover. Resets to 0 after a successful operation.
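A circuit-breaker check built on this counter might look like the sketch below. It is illustrative only: the `FailoverCandidate` record is a stand-in for the snapshot properties involved, and the threshold of 3 is an arbitrary assumption, not a value defined by DotCompute.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Linq;

// Stand-in for the DeviceHealthSnapshot properties used here, so the
// sketch compiles without the DotCompute assembly.
public sealed record FailoverCandidate(string DeviceId, bool IsAvailable, int ConsecutiveFailures);

public static class FailoverSketch
{
    // Hypothetical threshold; tune it to the workload.
    public const int FailureThreshold = 3;

    // The circuit is "open" (device skipped) when the device is unavailable
    // or has failed too many operations in a row.
    public static bool IsCircuitOpen(FailoverCandidate device) =>
        !device.IsAvailable || device.ConsecutiveFailures >= FailureThreshold;

    // Returns the first device, in caller-defined preference order, whose
    // circuit is closed, or null when every device is currently excluded.
    public static string? PickDevice(IEnumerable<FailoverCandidate> devicesInPreferenceOrder) =>
        devicesInPreferenceOrder.FirstOrDefault(d => !IsCircuitOpen(d))?.DeviceId;
}
```

Because the counter resets to 0 after a successful operation, a device excluded this way becomes eligible again as soon as a later snapshot reports a success.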
CustomMetrics
Gets additional custom metrics specific to the backend or device.
public IReadOnlyDictionary<string, double>? CustomMetrics { get; init; }
Property Value
- IReadOnlyDictionary<string, double>
Remarks
Used for vendor-specific metrics not covered by standard sensors. Examples:
- "nvlink_throughput_gbps": NVLink bandwidth (NVIDIA)
- "infinity_fabric_utilization": IF utilization (AMD)
- "unified_memory_pressure": Memory pressure level (Apple Silicon)
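Because the dictionary itself may be null and any given key may be absent, lookups benefit from a guard. The helper below is a hypothetical convenience, not a DotCompute API; it only relies on the `IReadOnlyDictionary<string, double>?` shape shown in the declaration above.

```csharp
#nullable enable
using System.Collections.Generic;

public static class CustomMetricsSketch
{
    // Returns the metric value, or null when the dictionary is missing
    // or does not contain the requested key.
    public static double? TryGetMetric(IReadOnlyDictionary<string, double>? metrics, string key)
    {
        if (metrics is not null && metrics.TryGetValue(key, out var value))
        {
            return value;
        }
        return null;
    }
}
```

A caller could then write `CustomMetricsSketch.TryGetMetric(snapshot.CustomMetrics, "nvlink_throughput_gbps")` and handle a null result on backends that do not report that metric.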
DeviceId
Gets the unique identifier of the device.
public required string DeviceId { get; init; }
Property Value
- string
DeviceName
Gets the human-readable name of the device.
public required string DeviceName { get; init; }
Property Value
- string
Examples
"NVIDIA GeForce RTX 4090", "AMD Radeon RX 7900 XTX", "Apple M3 Max"
ErrorCount
Gets the number of errors detected since device initialization or last reset.
public long ErrorCount { get; init; }
Property Value
- long
Remarks
Includes:
- ECC (Error Correction Code) errors
- Kernel execution failures
- Device reset events
- Communication timeouts
High error counts may indicate hardware issues or instability.
HealthScore
Gets the overall health score (0.0 to 1.0).
public required double HealthScore { get; init; }
Property Value
- double
Remarks
The health score is a composite metric that considers:
- Temperature (weight: 0.3)
- Error rate (weight: 0.3)
- Utilization (weight: 0.2)
- Throttling status (weight: 0.1)
- Memory pressure (weight: 0.1)
Score Interpretation:
- 1.0: Perfect health
- 0.9-1.0: Excellent (green)
- 0.7-0.9: Good (yellow)
- 0.5-0.7: Degraded (orange)
- 0.0-0.5: Critical (red)
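The weights and bands above can be illustrated with a short sketch. It is not the library's implementation. It assumes each factor has already been normalized to a sub-score between 0.0 and 1.0, where 1.0 means healthy, which the documentation does not specify. It also assumes each band includes its lower bound, since the listed ranges share their endpoints.

```csharp
using System;

public static class HealthScoreSketch
{
    // Weighted sum using the documented weights (0.3 + 0.3 + 0.2 + 0.1 + 0.1 = 1.0).
    // Every argument is an assumed sub-score in [0, 1]; 1.0 means healthy.
    public static double Combine(
        double temperature,
        double errorRate,
        double utilization,
        double throttling,
        double memoryPressure)
    {
        double score =
            0.3 * temperature +
            0.3 * errorRate +
            0.2 * utilization +
            0.1 * throttling +
            0.1 * memoryPressure;

        // Guard against out-of-range inputs and rounding drift.
        return Math.Clamp(score, 0.0, 1.0);
    }

    // Maps a score to the documented bands; lower bounds are treated as inclusive.
    public static string Classify(double score) => score switch
    {
        >= 0.9 => "Excellent",
        >= 0.7 => "Good",
        >= 0.5 => "Degraded",
        _ => "Critical",
    };
}
```

Under these assumptions, a device that is healthy on every factor except a temperature sub-score of 0.5 would score 0.85 and fall in the "Good" band.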
Orleans Usage: Use health scores for grain placement decisions. Prefer devices with scores > 0.8 for new grain activations.
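One way to apply this guidance is to filter candidates by the 0.8 threshold and take the healthiest. The sketch below is an assumption-laden illustration: `PlacementCandidate` is a stand-in for the relevant snapshot properties, and the selection policy is not taken from Orleans or Orleans.GpuBridge.Core.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Linq;

// Stand-in for the DeviceHealthSnapshot properties used in this sketch.
public sealed record PlacementCandidate(string DeviceId, bool IsAvailable, double HealthScore);

public static class PlacementSketch
{
    // Threshold taken from the guidance above.
    public const double PreferredMinScore = 0.8;

    // Returns the available device with the highest score above the
    // threshold, or null when no device qualifies.
    public static string? ChooseDevice(IEnumerable<PlacementCandidate> candidates) =>
        candidates
            .Where(c => c.IsAvailable && c.HealthScore > PreferredMinScore)
            .OrderByDescending(c => c.HealthScore)
            .FirstOrDefault()?.DeviceId;
}
```

A null result leaves the fallback decision (for example, accepting a lower-scored device or deferring activation) to the caller.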
IsAvailable
Gets whether the device is currently available for computation.
public required bool IsAvailable { get; init; }
Property Value
- bool
Remarks
false indicates the device is offline, in an error state, or otherwise unavailable. Check StatusMessage for details.
IsThrottling
Gets whether the device is currently throttling due to thermal or power limits.
public bool IsThrottling { get; init; }
Property Value
- bool
LastError
Gets the most recent error message, if any.
public string? LastError { get; init; }
Property Value
- string
LastErrorTimestamp
Gets the timestamp of the last error occurrence (UTC).
public DateTimeOffset? LastErrorTimestamp { get; init; }
Property Value
- DateTimeOffset?
SensorReadings
Gets the collection of sensor readings captured in this snapshot.
public required IReadOnlyList<SensorReading> SensorReadings { get; init; }
Property Value
- IReadOnlyList<SensorReading>
Remarks
May be empty if sensor data collection failed or is not supported. Use GetSensorReading(SensorType) for safe access.
Status
Gets the current health status of the device.
public required DeviceHealthStatus Status { get; init; }
Property Value
- DeviceHealthStatus
StatusMessage
Gets an optional status message providing additional health context.
public string? StatusMessage { get; init; }
Property Value
- string
Examples
"Operating normally", "High temperature warning", "Power limit exceeded", "Device offline - driver reset required"
Timestamp
Gets the timestamp when this snapshot was captured (UTC).
public required DateTimeOffset Timestamp { get; init; }
Property Value
- DateTimeOffset
Methods
CreateUnavailable(string, string, string, string)
Creates a health snapshot indicating the device is unavailable.
public static DeviceHealthSnapshot CreateUnavailable(string deviceId, string deviceName, string backendType, string reason)
Parameters
- deviceId (string): The device identifier.
- deviceName (string): The device name.
- backendType (string): The backend type.
- reason (string): The reason for unavailability.
Returns
- DeviceHealthSnapshot
An unavailable device health snapshot.
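A typical place for this factory is a fallback path when snapshot collection itself fails. The wrapper below is a hypothetical sketch, kept generic so it compiles without DotCompute. The commented usage assumes `GetHealthSnapshotAsync()` returns `Task<DeviceHealthSnapshot>`, which this page does not confirm.

```csharp
using System;
using System.Threading.Tasks;

public static class SnapshotFallbackSketch
{
    // Runs the collector and, if it throws, builds a substitute value from
    // the exception instead of propagating it. Note that this also catches
    // OperationCanceledException; filter it out if cancellation should
    // propagate to the caller.
    public static async Task<T> CollectOrFallbackAsync<T>(
        Func<Task<T>> collect,
        Func<Exception, T> onFailure)
    {
        try
        {
            return await collect();
        }
        catch (Exception ex)
        {
            return onFailure(ex);
        }
    }
}

// Possible usage (deviceId, deviceName, and backendType are placeholders):
//
// var snapshot = await SnapshotFallbackSketch.CollectOrFallbackAsync(
//     () => accelerator.GetHealthSnapshotAsync(),
//     ex => DeviceHealthSnapshot.CreateUnavailable(
//         deviceId, deviceName, backendType, ex.Message));
```

Monitoring code can then treat a failed collection like any other unavailable device rather than handling exceptions at every call site.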
GetSensorReading(SensorType)
Retrieves a specific sensor reading from this snapshot.
public SensorReading? GetSensorReading(SensorType sensorType)
Parameters
- sensorType (SensorType): The type of sensor to retrieve.
Returns
- SensorReading
The sensor reading, or null if not available.
GetSensorValue(SensorType)
Gets the value of a specific sensor, or null if unavailable.
public double? GetSensorValue(SensorType sensorType)
Parameters
- sensorType (SensorType): The type of sensor.
Returns
- double?
The sensor value, or null if unavailable.
IsSensorAvailable(SensorType)
Checks if a specific sensor type is available and has a valid reading.
public bool IsSensorAvailable(SensorType sensorType)
Parameters
- sensorType (SensorType): The type of sensor to check.
Returns
- bool
true if the sensor is available and has a valid reading; otherwise, false.
ToString()
Returns a summary string of this health snapshot.
public override string ToString()