Class CudaTimingProvider
- Namespace
- DotCompute.Backends.CUDA.Timing
- Assembly
- DotCompute.Backends.CUDA.dll
CUDA-specific timing provider implementing high-precision GPU-native timestamps.
public sealed class CudaTimingProvider : ITimingProvider, IDisposable
- Inheritance
- object
- CudaTimingProvider
- Implements
- ITimingProvider
- IDisposable
Remarks
This implementation provides two timing strategies based on compute capability:
- CC 6.0+ (Pascal and newer): Uses the %globaltimer special register for 1ns resolution. This is a 64-bit counter that reports time in nanoseconds.
- CC < 6.0 (Maxwell and earlier): Uses CUDA events with ~1μs resolution. Events are created with the cudaEventDefault flag, and elapsed time is measured via cudaEventElapsedTime.
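The selection rule above can be sketched as follows. This is a minimal illustration, not the provider's actual code: the enum and class names are hypothetical, and only the CC 6.0 threshold is taken from the remarks.

```csharp
using System;

// Hypothetical names for illustration; not part of the DotCompute API.
public enum TimingStrategy
{
    GlobalTimer, // hardware nanosecond timer register, 1 ns resolution
    CudaEvents   // CUDA events, ~1 μs resolution
}

public static class TimingStrategySelector
{
    // Picks a timing strategy from the major compute capability version.
    public static TimingStrategy Select(int computeCapabilityMajor)
    {
        if (computeCapabilityMajor < 1)
            throw new ArgumentOutOfRangeException(nameof(computeCapabilityMajor));

        // CC 6.0 (Pascal) and newer use the hardware timer; older devices fall back to events.
        return computeCapabilityMajor >= 6
            ? TimingStrategy.GlobalTimer
            : TimingStrategy.CudaEvents;
    }
}
```

Under these assumptions, a CC 5.2 (Maxwell) device selects CudaEvents and a CC 8.6 device selects GlobalTimer.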
Thread Safety: All methods are thread-safe. Timestamp queries are serialized through the CUDA stream to ensure correct ordering.
Performance Characteristics:
- Single timestamp: ~10ns on CC 6.0+, ~100ns on older GPUs
- Batch timestamps (N=1000): ~1ns per timestamp amortized
- Clock calibration (100 samples): ~10ms total
Constructors
CudaTimingProvider(CudaDevice, CudaStream, ILogger?)
Initializes a new instance of the CudaTimingProvider class.
public CudaTimingProvider(CudaDevice device, CudaStream stream, ILogger? logger = null)
Parameters
- device CudaDevice
The CUDA device to provide timing for.
- stream CudaStream
The CUDA stream for kernel execution.
- logger ILogger
Optional logger for diagnostic output.
Methods
CalibrateAsync(int, CalibrationStrategy, CancellationToken)
Performs clock calibration using the specified strategy.
public Task<ClockCalibration> CalibrateAsync(int sampleCount, CalibrationStrategy strategy, CancellationToken ct = default)
Parameters
- sampleCount int
Number of timestamp pairs to collect (minimum 10, recommended 100+).
- strategy CalibrationStrategy
Calibration strategy (Basic, Robust, Weighted, or RANSAC).
- ct CancellationToken
Cancellation token.
Returns
- Task<ClockCalibration>
Clock calibration result with offset, drift, and error bounds.
CalibrateAsync(int, CancellationToken)
Calibrates the GPU clock against the CPU clock to enable accurate time conversions.
public Task<ClockCalibration> CalibrateAsync(int sampleCount = 100, CancellationToken ct = default)
Parameters
- sampleCount int
Number of CPU-GPU timestamp pairs to collect for calibration (default: 100). Higher values improve accuracy but increase calibration time.
- ct CancellationToken
Cancellation token to cancel the async operation.
Returns
- Task<ClockCalibration>
A task representing the async operation, containing calibration data including offset, drift rate, and error bounds for converting between CPU and GPU time domains.
Remarks
Clock calibration performs linear regression on sampleCount paired CPU-GPU timestamps to compute:
- Offset: GPU_time = CPU_time + offset
- Drift: Clock frequency difference (parts per million)
- Error Bounds: ±uncertainty range from regression residuals
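The regression described above can be sketched as an ordinary least-squares fit. This is an illustration under stated assumptions, not the library's implementation: the ClockFit class is hypothetical, the offset is reported at the first sample, and the error bound is taken as the largest absolute residual (the library may use a different statistic).

```csharp
using System;

// Hypothetical helper for illustration; not the library's ClockCalibration type.
public static class ClockFit
{
    // Least-squares fit of gpu = slope * cpu + intercept over paired timestamps (ns).
    // Returns the offset at the first sample, the drift in PPM, and an error bound.
    public static (double OffsetNanos, double DriftPpm, double ErrorBoundNanos) Fit(
        long[] cpuNanos, long[] gpuNanos)
    {
        if (cpuNanos.Length != gpuNanos.Length || cpuNanos.Length < 2)
            throw new ArgumentException("Need at least two paired samples.");

        int n = cpuNanos.Length;
        long cpu0 = cpuNanos[0], gpu0 = gpuNanos[0];

        // Work relative to the first pair so doubles keep nanosecond precision.
        var x = new double[n];
        var y = new double[n];
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++)
        {
            x[i] = cpuNanos[i] - cpu0;
            y[i] = gpuNanos[i] - gpu0;
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++)
        {
            sxx += (x[i] - meanX) * (x[i] - meanX);
            sxy += (x[i] - meanX) * (y[i] - meanY);
        }
        if (sxx == 0)
            throw new ArgumentException("CPU timestamps must not all be identical.");

        double slope = sxy / sxx;
        double intercept = meanY - slope * meanX; // relative to (cpu0, gpu0)

        // Assumption: use the largest absolute residual as the error bound.
        double errorBound = 0;
        for (int i = 0; i < n; i++)
            errorBound = Math.Max(errorBound, Math.Abs(y[i] - (intercept + slope * x[i])));

        double offset = (double)(gpu0 - cpu0) + intercept; // GPU minus CPU at the first sample
        double driftPpm = (slope - 1.0) * 1e6;             // GPU clock rate relative to CPU
        return (offset, driftPpm, errorBound);
    }
}
```

Subtracting the first sample before fitting matters because raw nanosecond timestamps can exceed the 53 bits of integer precision a double provides.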
Performance:
- 100 samples: ~10ms calibration time
- Typical drift: 50-200 PPM (180-720 ms/hour)
- Recommended recalibration interval: 5-10 minutes
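As a quick check on the drift figures, the sketch below converts a drift rate in PPM into accumulated clock error. The class and method names are hypothetical, not part of the DotCompute API; the only fact used is that 1 PPM corresponds to 1 μs of error per elapsed second.

```csharp
// Hypothetical helper for illustration; not part of the DotCompute API.
public static class DriftMath
{
    // Accumulated clock error in nanoseconds after elapsedSeconds at the given drift.
    public static double AccumulatedErrorNanos(double driftPpm, double elapsedSeconds)
    {
        // 1 PPM = 1 μs of error per second = 1,000 ns per second.
        return driftPpm * 1_000.0 * elapsedSeconds;
    }
}
```

For example, AccumulatedErrorNanos(50, 3600) is 180,000,000 ns (180 ms) after one hour, and AccumulatedErrorNanos(200, 600) is 120,000,000 ns (120 ms) over a 10-minute interval if the drift is left uncorrected.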
Usage:
var calibration = await timingProvider.CalibrateAsync(sampleCount: 100);
// Read a GPU timestamp, then map it into the CPU time domain.
long gpuTime = await timingProvider.GetGpuTimestampAsync();
long cpuTime = calibration.GpuToCpuTime(gpuTime);
var (min, max) = calibration.GetUncertaintyRange(gpuTime);
Exceptions
- ArgumentOutOfRangeException
Thrown when sampleCount is less than 10 (insufficient for calibration).
- OperationCanceledException
Thrown when the cancellation token is triggered.
Dispose()
Releases resources used by the timing provider.
public void Dispose()
EnableTimestampInjection(bool)
Enables automatic timestamp injection at kernel entry points.
public void EnableTimestampInjection(bool enable = true)
Parameters
- enable bool
True to enable injection, false to disable.
Remarks
When enabled, kernels automatically record a timestamp in parameter slot 0 before executing user code. This eliminates manual timestamp management in kernel code.
Kernel Signature Change:
// Before injection:
__global__ void MyKernel(float* input, float* output);
// After injection (parameter 0 auto-injected):
__global__ void MyKernel(long* timestamps, float* input, float* output);
Overhead: <20ns per kernel launch (timestamp write by thread 0).
Note: Timestamp injection requires kernel recompilation. Kernels that are already compiled are not affected until they are next compiled.
GetGpuClockFrequency()
Gets the GPU clock frequency in Hertz (cycles per second).
public long GetGpuClockFrequency()
Returns
- long
The GPU clock frequency in Hz. Typical values:
- CUDA: 1,000,000,000 Hz (1 GHz) for nanosecond timers
- CUDA Events: 1,000,000 Hz (1 MHz) for microsecond precision
- OpenCL: Platform-dependent
Remarks
The clock frequency determines timer resolution. A 1 GHz clock provides 1ns resolution.
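The relationship between frequency and resolution is a simple reciprocal, sketched below. The TimerMath class is a hypothetical helper for illustration, not part of the DotCompute API.

```csharp
using System;

// Hypothetical helper for illustration; not part of the DotCompute API.
public static class TimerMath
{
    // Resolution in nanoseconds of a counter that ticks at frequencyHz.
    public static double ResolutionNanos(long frequencyHz)
    {
        if (frequencyHz <= 0)
            throw new ArgumentOutOfRangeException(nameof(frequencyHz));

        // One tick lasts 1/frequency seconds, i.e. 1e9/frequency nanoseconds.
        return 1e9 / frequencyHz;
    }
}
```

With the typical values listed above, 1,000,000,000 Hz gives 1 ns per tick and 1,000,000 Hz gives 1,000 ns per tick.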
GetGpuTimestampAsync(CancellationToken)
Gets the current GPU timestamp in nanoseconds since device initialization.
public Task<long> GetGpuTimestampAsync(CancellationToken ct = default)
Parameters
- ct CancellationToken
Cancellation token to cancel the async operation.
Returns
- Task<long>
A task representing the async operation, containing the GPU timestamp in nanoseconds. The timestamp is monotonically increasing and has device-specific resolution.
Remarks
This method launches a minimal kernel to read the GPU hardware timer. The overhead is typically <10ns on CUDA (CC 6.0+) and <100ns on other platforms.
For batch queries, use GetGpuTimestampsBatchAsync(int, CancellationToken) which amortizes launch overhead across multiple timestamps.
Performance Targets:
- CUDA (CC 6.0+): <10ns per query
- CUDA Events: <100ns per query
- OpenCL/CPU: <1μs per query
Exceptions
- OperationCanceledException
Thrown when the cancellation token is triggered.
- InvalidOperationException
Thrown when the device is not in a valid state for timestamp queries.
GetGpuTimestampsBatchAsync(int, CancellationToken)
Gets multiple GPU timestamps in a single batch operation for improved efficiency.
public Task<long[]> GetGpuTimestampsBatchAsync(int count, CancellationToken ct = default)
Parameters
- count int
Number of timestamps to retrieve (must be positive).
- ct CancellationToken
Cancellation token to cancel the async operation.
Returns
- Task<long[]>
A task representing the async operation, containing an array of GPU timestamps in nanoseconds. All timestamps are captured within a single kernel launch for minimal skew.
Remarks
Batch queries amortize kernel launch overhead across multiple timestamps, achieving about 1ns per timestamp when count ≥ 1000.
All timestamps in the batch are captured during the same kernel execution, ensuring minimal temporal skew between samples (typically <100ns).
Performance: For count = 1000:
- Total time: ~1μs (1ns per timestamp amortized)
- Skew between timestamps: <100ns
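To check these figures against a real batch, the skew and amortized cost can be computed from the returned array. The sketch below assumes skew means the spread between the earliest and latest timestamp in the batch; the BatchStats class is hypothetical and not part of the DotCompute API.

```csharp
using System;

// Hypothetical helper for illustration; not part of the DotCompute API.
public static class BatchStats
{
    // Skew: difference between the latest and earliest timestamp in the batch (ns).
    public static long SkewNanos(long[] timestamps)
    {
        if (timestamps is null || timestamps.Length == 0)
            throw new ArgumentException("Batch must contain at least one timestamp.", nameof(timestamps));

        long min = timestamps[0], max = timestamps[0];
        foreach (long t in timestamps)
        {
            if (t < min) min = t;
            if (t > max) max = t;
        }
        return max - min;
    }

    // Amortized cost per timestamp, given the total time spent on the batch (ns).
    public static double AmortizedNanos(long totalNanos, int count)
    {
        if (count <= 0)
            throw new ArgumentOutOfRangeException(nameof(count));
        return (double)totalNanos / count;
    }
}
```

A batch of 1000 timestamps that takes 1,000 ns in total has an amortized cost of 1 ns per timestamp, which matches the figure quoted above.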
Exceptions
- ArgumentOutOfRangeException
Thrown when count is less than or equal to zero.
- OperationCanceledException
Thrown when the cancellation token is triggered.
GetTimerResolutionNanos()
Gets the timer resolution in nanoseconds (minimum measurable time interval).
public long GetTimerResolutionNanos()
Returns
- long
The timer resolution in nanoseconds. Typical values:
- CUDA (CC 6.0+): 1 ns (%globaltimer)
- CUDA (CC < 6.0): 1,000 ns (CUDA events)
- OpenCL: 1,000 ns (clock() built-in)
- CPU: ~100 ns (Stopwatch)
Remarks
Lower resolution values indicate higher precision. A 1ns resolution means the timer can distinguish events separated by as little as 1 nanosecond.