# GPU Timing API: High-Precision Temporal Measurements
## Overview

The Timing API provides nanosecond-resolution, GPU-native timestamps for high-precision temporal measurements in compute kernels. This enables accurate profiling, event ordering, and distributed coordination in GPU applications.

**Key Features:**
- **1ns Resolution**: Hardware globaltimer on CC 6.0+ (Pascal and newer)
- **Four Calibration Strategies**: Basic, Robust, Weighted, RANSAC for CPU-GPU clock synchronization
- **Zero CPU-GPU Round Trip**: Timestamps captured directly on GPU
- **Batch Operations**: Amortized sub-1ns per timestamp for bulk queries
- **Automatic Fallback**: 1μs resolution via CUDA events on older hardware (CC < 6.0)
| Backend | Resolution | Hardware Requirement |
|---|---|---|
| CUDA (CC 6.0+) | 1 nanosecond | Pascal architecture or newer |
| CUDA (CC < 6.0) | 1 microsecond | Maxwell architecture (CUDA events) |
| OpenCL | 1 microsecond | clock() built-in |
| CPU | ~100 nanoseconds | Stopwatch |
## Quick Start

### Basic Timestamp Retrieval

```csharp
using DotCompute.Abstractions.Timing;
using DotCompute.Backends.CUDA.Factory;

// Create accelerator
using var factory = new CudaAcceleratorFactory();
await using var accelerator = factory.CreateProductionAccelerator(0);

// Get timing provider
var timingProvider = accelerator.GetTimingProvider();
if (timingProvider == null)
{
    Console.WriteLine("Timing not supported on this device");
    return;
}

// Get GPU timestamp
var timestamp = await timingProvider.GetGpuTimestampAsync();
Console.WriteLine($"GPU time: {timestamp} ns");

// Check resolution
var resolutionNs = timingProvider.GetTimerResolutionNanos();
Console.WriteLine($"Timer resolution: {resolutionNs} ns");
```
### Clock Calibration

```csharp
// Perform CPU-GPU clock calibration
var calibration = await timingProvider.CalibrateAsync(
    sampleCount: 100,   // More samples = better accuracy
    ct: cancellationToken);

Console.WriteLine($"Offset: {calibration.OffsetNanos} ns");
Console.WriteLine($"Drift: {calibration.DriftPPM:F2} PPM");
Console.WriteLine($"Error: ±{calibration.ErrorBoundNanos} ns");
Console.WriteLine($"Samples: {calibration.SampleCount}");
```
### Batch Timestamp Collection

```csharp
// Efficient batch collection (amortized <1ns per timestamp)
var timestamps = await timingProvider.GetGpuTimestampsBatchAsync(
    count: 1000,
    ct: cancellationToken);

// Analyze temporal patterns
for (int i = 1; i < timestamps.Length; i++)
{
    var delta = timestamps[i] - timestamps[i - 1];
    Console.WriteLine($"Delta[{i}]: {delta} ns");
}
```
## Clock Calibration Strategies
The Timing API provides four calibration strategies for CPU-GPU clock synchronization, each optimized for different scenarios:
### 1. Basic (Ordinary Least Squares)

**Best for:** Quick calibration with clean data
**Performance:** Fastest (~5ms for 100 samples)
**Accuracy:** Good with minimal measurement noise

```csharp
using DotCompute.Backends.CUDA.Timing;

var cudaProvider = (CudaTimingProvider)timingProvider;
var calibration = await cudaProvider.CalibrateAsync(
    sampleCount: 100,
    strategy: CalibrationStrategy.Basic,
    ct: cancellationToken);
```
**How it works:**

- Standard linear regression on all samples
- Computes slope (drift rate) and intercept (offset)
- Fast but sensitive to outliers (see the sketch below)
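A minimal sketch of the underlying OLS computation, assuming calibration samples are (CPU, GPU) timestamp pairs in nanoseconds; `FitOls` and the tuple shape are illustrative helpers, not library APIs:

```csharp
using System.Linq;

// Illustrative OLS fit over (cpuNanos, gpuNanos) pairs; not the library's
// internal implementation. A real implementation would center/scale the
// raw timestamps first to preserve double precision.
static (double Slope, double Intercept) FitOls(
    IReadOnlyList<(long CpuNanos, long GpuNanos)> samples)
{
    double meanX = samples.Average(s => (double)s.CpuNanos);
    double meanY = samples.Average(s => (double)s.GpuNanos);

    double sxx = 0, sxy = 0;
    foreach (var (x, y) in samples)
    {
        sxx += (x - meanX) * (x - meanX);
        sxy += (x - meanX) * (y - meanY);
    }

    double slope = sxy / sxx;                  // relative clock rate
    double intercept = meanY - slope * meanX;  // offset component
    return (slope, intercept);
}
```

Under this linear model, the fitted slope and intercept would correspond roughly to the reported `DriftPPM` (as `(slope - 1) × 1e6`) and `OffsetNanos`.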
### 2. Robust (Outlier Rejection)

**Best for:** Default strategy, handles noisy data
**Performance:** Moderate (~10ms for 100 samples)
**Accuracy:** Excellent with automatic outlier filtering

```csharp
var calibration = await cudaProvider.CalibrateAsync(
    sampleCount: 100,
    strategy: CalibrationStrategy.Robust, // Default
    ct: cancellationToken);
```
**How it works:**
- Iterative regression with 2σ outlier rejection
- Removes samples beyond threshold and recomputes
- Up to 5 iterations for convergence
- Reduces impact of measurement spikes
**Algorithm** (sketched below):

1. Perform initial regression
2. Calculate residuals for all samples
3. Remove samples with residual > 2×std_error
4. Repeat until convergence or max iterations
5. Final regression on cleaned data
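A compact sketch of that loop, reusing the illustrative `FitOls` helper from above; the 2σ threshold and 5-iteration cap mirror the documented behavior, everything else is an assumption:

```csharp
// Illustrative outlier-rejection loop; not the library's internal code.
var working = samples.ToList();
for (int iter = 0; iter < 5; iter++)
{
    var (slope, intercept) = FitOls(working);

    // Residual of each sample against the fitted line
    var residuals = working
        .Select(s => s.GpuNanos - (slope * s.CpuNanos + intercept))
        .ToList();
    var std = Math.Sqrt(residuals.Sum(r => r * r) / residuals.Count);

    // Keep samples within 2 standard errors; stop once nothing is removed
    var kept = working
        .Where((s, i) => Math.Abs(residuals[i]) <= 2 * std)
        .ToList();
    if (kept.Count == working.Count) break;
    working = kept;
}

var final = FitOls(working); // final regression on cleaned data
```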
### 3. Weighted (Temporal Decay)

**Best for:** Capturing clock drift trends
**Performance:** Moderate (~8ms for 100 samples)
**Accuracy:** Excellent for time-varying drift

```csharp
var calibration = await cudaProvider.CalibrateAsync(
    sampleCount: 100,
    strategy: CalibrationStrategy.Weighted,
    ct: cancellationToken);
```
**How it works:**
- Exponential decay weighting (factor = 0.95)
- Recent samples get higher weight
- Captures temporal drift patterns
- Ideal for long calibration windows
**Weight formula:**

```text
weight[i] = 0.95^(n-1-i) / sum(all_weights)
```
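How those weights come out in practice, as a short illustrative computation (the 0.95 decay factor is the documented one; the code itself is a sketch, not library internals):

```csharp
// Exponential-decay weights: the newest sample (index n-1) gets weight 1
// before normalization; each older sample decays by a factor of 0.95.
int n = samples.Count;
var weights = new double[n];
for (int i = 0; i < n; i++)
{
    weights[i] = Math.Pow(0.95, n - 1 - i);
}

var total = weights.Sum();
for (int i = 0; i < n; i++)
{
    weights[i] /= total; // normalize so the weights sum to 1
}
// Each sample's contribution to the regression sums is then scaled by weights[i].
```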
### 4. RANSAC (Maximum Robustness)

**Best for:** Extreme outliers, adversarial data
**Performance:** Slowest (~20ms for 100 samples)
**Accuracy:** Best with severe contamination

```csharp
var calibration = await cudaProvider.CalibrateAsync(
    sampleCount: 100,
    strategy: CalibrationStrategy.RANSAC,
    ct: cancellationToken);
```
**How it works:**
- Random sample consensus algorithm
- Fits line through random point pairs
- Counts inliers (error < 1μs threshold)
- Refines with least squares on best inlier set
- Robust to up to 50% outliers
**RANSAC Parameters** (used in the sketch below):

- Iterations: min(100, n×2)
- Inlier threshold: 1000ns (1μs)
- Minimum inliers for convergence: 10
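A sketch of the consensus loop under those parameters, again reusing the illustrative `FitOls` helper; this is an outline of the algorithm, not the library's implementation:

```csharp
// Illustrative RANSAC line fit over (cpuNanos, gpuNanos) pairs.
var rng = new Random();
var bestInliers = new List<(long CpuNanos, long GpuNanos)>();
int iterations = Math.Min(100, samples.Count * 2);

for (int iter = 0; iter < iterations; iter++)
{
    // Fit a candidate line through two random samples
    var a = samples[rng.Next(samples.Count)];
    var b = samples[rng.Next(samples.Count)];
    if (a.CpuNanos == b.CpuNanos) continue;

    double slope = (double)(b.GpuNanos - a.GpuNanos) / (b.CpuNanos - a.CpuNanos);
    double intercept = a.GpuNanos - slope * a.CpuNanos;

    // Count samples within the 1μs (1000ns) inlier threshold
    var inliers = samples
        .Where(s => Math.Abs(s.GpuNanos - (slope * s.CpuNanos + intercept)) < 1000)
        .ToList();
    if (inliers.Count > bestInliers.Count) bestInliers = inliers;
}

// Refine with least squares on the best inlier set
if (bestInliers.Count >= 10)
{
    var (slope, intercept) = FitOls(bestInliers);
}
```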
### Calibration Strategy Comparison
| Strategy | Speed | Outlier Tolerance | Drift Tracking | Use Case |
|---|---|---|---|---|
| Basic | ★★★★☆ | ★☆☆☆☆ | ★★★☆☆ | Clean data, quick calibration |
| Robust | ★★★☆☆ | ★★★★☆ | ★★★☆☆ | Default, general purpose |
| Weighted | ★★★☆☆ | ★★☆☆☆ | ★★★★★ | Long windows, drift capture |
| RANSAC | ★★☆☆☆ | ★★★★★ | ★★★☆☆ | Extreme outliers, adversarial |
**Performance Measurements** (RTX 2000 Ada, 100 samples):
- Basic: 5.2ms ± 0.3ms
- Robust: 9.8ms ± 0.8ms
- Weighted: 7.6ms ± 0.5ms
- RANSAC: 18.4ms ± 1.2ms
## Understanding Calibration Results

### Offset (OffsetNanos)
The offset represents the difference between CPU and GPU clock epochs:
```text
GPUTime = CPUTime + Offset
```
**Key Points:**
- Can be large (seconds to minutes) due to different starting points
- CPU clock typically starts at system boot
- GPU clock starts at device initialization
- Sign and magnitude are implementation-dependent
- What matters: Consistency across calibrations, not absolute value
**Example Results:**

```text
// RTX 2000 Ada typical values
Offset: 46,231,606,311 ns   // ~46 seconds
// This is normal! Different clock epochs.
```
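To map a raw GPU timestamp onto the CPU timebase, subtract the calibrated offset. A minimal sketch using only APIs shown above (it ignores drift, which only matters over long spans):

```csharp
var calibration = await timingProvider.CalibrateAsync(100);
var gpuNanos = await timingProvider.GetGpuTimestampAsync();

// GPUTime = CPUTime + Offset, so:
var cpuNanos = gpuNanos - calibration.OffsetNanos;
```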
### Drift (DriftPPM)
The drift represents clock frequency difference in parts per million:
```text
Frequency_difference = Drift_PPM / 1,000,000
```
**Key Points:**
- Measures relative clock rate difference
- ±1000 PPM = ±0.1% frequency difference
- Can be positive or negative
- Changes with temperature, load, and power states
- Should be < ±2000 PPM (0.2%) for stability
**Impact Examples:**

```text
// Drift = 1000 PPM (0.1%)
1 hour drift: 3.6 seconds
1 day drift:  86.4 seconds

// Drift = 100 PPM (0.01%)
1 hour drift: 360 ms
1 day drift:  8.64 seconds
```
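The arithmetic behind those numbers is a single linear scaling; a hypothetical helper (not a library API) for estimating accumulated skew:

```csharp
// Accumulated skew grows linearly with elapsed time.
static double DriftNanos(double elapsedNanos, double driftPpm)
    => elapsedNanos * driftPpm / 1_000_000.0;

// At 100 PPM, one hour (3.6e12 ns) accumulates 3.6e8 ns = 360 ms of skew.
var skewNs = DriftNanos(3.6e12, 100);
```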
**Recommended Recalibration:**

- Low drift (< 100 PPM): Every 10-15 minutes
- Medium drift (100-500 PPM): Every 5 minutes
- High drift (> 500 PPM): Every 1-2 minutes
- Critical timing: Continuous calibration (see the interval sketch below)
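Those thresholds translate directly into a small interval selector; a hypothetical helper, not part of the library:

```csharp
// Map measured drift to a recalibration interval per the guidance above.
static TimeSpan RecalibrationInterval(double driftPpm) =>
    Math.Abs(driftPpm) switch
    {
        < 100 => TimeSpan.FromMinutes(10),
        < 500 => TimeSpan.FromMinutes(5),
        _     => TimeSpan.FromMinutes(1),
    };
```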
### Error Bound (ErrorBoundNanos)
The error bound represents ±2σ confidence interval from regression residuals:
```text
True_Offset ∈ [Offset - Error, Offset + Error]   (95% confidence)
```
**Key Points:**
- Lower is better (tighter confidence)
- Typical: 50-100μs for robust calibration
- Increases with measurement noise
- Affected by system load and interrupts
**Quality Guidelines:**

- Excellent: < 50μs
- Good: 50-200μs
- Acceptable: 200μs-1ms
- Poor: > 1ms (consider Robust/RANSAC; see the check below)
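One way to act on these guidelines is to validate the bound and escalate the strategy when it lands in the poor range; a sketch using the `cudaProvider` cast shown earlier (the retry policy itself is illustrative):

```csharp
var cal = await cudaProvider.CalibrateAsync(
    sampleCount: 100,
    strategy: CalibrationStrategy.Robust,
    ct: cancellationToken);

if (cal.ErrorBoundNanos > 1_000_000) // > 1ms: poor quality
{
    // Retry with more samples and the most outlier-tolerant strategy
    cal = await cudaProvider.CalibrateAsync(
        sampleCount: 200,
        strategy: CalibrationStrategy.RANSAC,
        ct: cancellationToken);
}
```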
### Sample Count
Number of CPU-GPU timestamp pairs used in final regression:
**Key Points:**
- Robust/RANSAC may reduce count (outlier removal)
- More samples = better accuracy (diminishing returns > 100)
- Typical: 50-100 samples balances speed/accuracy
## Advanced Usage

### Periodic Recalibration
```csharp
using System.Threading;

// Periodic recalibration for long-running applications
var calibration = await timingProvider.CalibrateAsync(100);
var recalibrationTimer = new PeriodicTimer(TimeSpan.FromMinutes(5));

while (await recalibrationTimer.WaitForNextTickAsync(cancellationToken))
{
    calibration = await timingProvider.CalibrateAsync(50, cancellationToken);

    // Log calibration drift trend
    logger.LogInformation(
        "Recalibration: Drift={DriftPPM:F2} PPM, Error=±{ErrorNs}ns",
        calibration.DriftPPM,
        calibration.ErrorBoundNanos);

    // Warn if drift is increasing
    if (Math.Abs(calibration.DriftPPM) > 1000)
    {
        logger.LogWarning("High clock drift detected: {DriftPPM:F2} PPM",
            calibration.DriftPPM);
    }
}
```
### Synchronized Multi-GPU Timestamps

```csharp
// Calibrate multiple GPUs against the CPU clock. Keep each accelerator
// alive for as long as its timing provider is in use (disposing the
// accelerator inside the loop would invalidate the provider).
var devices = new[] { 0, 1, 2, 3 };
var calibrations = new Dictionary<int, ClockCalibration>();
var providers = new Dictionary<int, ITimingProvider>();

foreach (var deviceId in devices)
{
    var accel = factory.CreateProductionAccelerator(deviceId);
    var provider = accel.GetTimingProvider();
    if (provider != null)
    {
        providers[deviceId] = provider;
        calibrations[deviceId] = await provider.CalibrateAsync(100);
    }
}

// Now all GPU timestamps can be compared via CPU time
var gpu0Time = await providers[0].GetGpuTimestampAsync();
var gpu1Time = await providers[1].GetGpuTimestampAsync();

var cpuTime0 = gpu0Time - calibrations[0].OffsetNanos;
var cpuTime1 = gpu1Time - calibrations[1].OffsetNanos;

var delta = cpuTime1 - cpuTime0;
Console.WriteLine($"GPU1 is {delta}ns ahead of GPU0");
```
### Custom Calibration Analysis

```csharp
// Analyze calibration stability over time
var samples = new List<(DateTime time, double driftPPM, long errorNs)>();

for (int i = 0; i < 10; i++)
{
    var cal = await timingProvider.CalibrateAsync(100);
    samples.Add((DateTime.UtcNow, cal.DriftPPM, cal.ErrorBoundNanos));
    await Task.Delay(TimeSpan.FromSeconds(30));
}

// Calculate drift stability metrics
var drifts = samples.Select(s => s.driftPPM).ToList();
var meanDrift = drifts.Average();
var stdDrift = Math.Sqrt(drifts.Select(d => Math.Pow(d - meanDrift, 2)).Average());

Console.WriteLine($"Mean Drift: {meanDrift:F3} PPM");
Console.WriteLine($"Drift Stability (σ): {stdDrift:F3} PPM");

if (stdDrift > 50)
{
    Console.WriteLine("Warning: Unstable clock drift detected!");
    Console.WriteLine("Consider: thermal management, power settings, or driver updates");
}
```
## Performance Characteristics

### Single Timestamp Query
| Backend | Latency | Resolution |
|---|---|---|
| CUDA CC 6.0+ | < 10ns | 1ns |
| CUDA CC < 6.0 (events) | < 100ns | 1μs |
| OpenCL | < 1μs | 1μs |
| CPU | < 100ns | ~100ns |
### Batch Timestamp Query

**Amortized cost for N=1000:**
- CUDA CC 6.0+: ~0.9ns per timestamp
- CUDA Events: ~8ns per timestamp
- OpenCL: ~50ns per timestamp
**Measured Performance (RTX 2000 Ada):**

```text
Single query:   4.2ns  ± 0.3ns
Batch (N=100):  1.8ns  ± 0.1ns  per timestamp
Batch (N=1000): 0.94ns ± 0.05ns per timestamp
Batch (N=10k):  0.91ns ± 0.03ns per timestamp
```
### Clock Calibration

**Cost for N samples:**
- Collection: ~1ms per sample (round-trip + Task.Delay(1))
- Regression: < 1ms for N ≤ 1000
- Total: ~N milliseconds
**Typical Calibration Times:**
- 50 samples: ~50ms
- 100 samples: ~100ms (recommended)
- 200 samples: ~200ms (high precision)
## Hardware Requirements

### CUDA Backend

**Minimum Requirements:**
- Compute Capability 5.0 (Maxwell) for CUDA event timing
- CUDA Toolkit 11.0 or later
- NVIDIA Driver 450.80.02 or later
**Recommended Configuration:**
- Compute Capability 6.0+ (Pascal or newer) for globaltimer
- CUDA Toolkit 12.0 or later
- NVIDIA Driver 525.60.13 or later
- PCIe 3.0 x16 or better for minimal latency
| GPU | Compute Capability | Resolution | Tested |
|---|---|---|---|
| RTX 2000 Ada | 8.9 | 1ns | ✅ |
| RTX 4090 | 8.9 | 1ns | ✅ |
| RTX 3090 | 8.6 | 1ns | ✅ |
| RTX 2080 Ti | 7.5 | 1ns | ✅ |
| GTX 1080 Ti | 6.1 | 1ns | ✅ |
| GTX 980 Ti | 5.2 | 1μs | ✅ |
## Best Practices

### ✅ Do

- **Use Robust strategy by default** - Best balance of speed/accuracy
- **Recalibrate periodically** - Every 5-10 minutes for long-running apps
- **Check error bounds** - Validate calibration quality
- **Use batch queries** - Amortized cost for multiple timestamps
- **Monitor drift trends** - Log calibration metrics over time
- **Handle null providers** - Not all backends support timing
- **Use appropriate sample counts** - 100 samples recommended
### ❌ Don't

- **Don't assume small offsets** - CPU/GPU epochs differ significantly
- **Don't ignore calibration errors** - Check ErrorBoundNanos
- **Don't over-calibrate** - Balance frequency vs. overhead
- **Don't mix uncalibrated timestamps** - Always use the same calibration
- **Don't assume perfect clocks** - Drift is expected and normal
- **Don't forget cancellation** - Long calibrations should be cancellable
- **Don't use Basic with noisy data** - Switch to Robust/RANSAC
## Troubleshooting

### High Drift (> 1000 PPM)

**Symptoms:**
- DriftPPM values exceeding ±1000
- Inconsistent calibration results
**Possible Causes:**

- **Thermal throttling** - GPU/CPU temperature affecting clocks
- **Power management** - Dynamic frequency scaling
- **System load** - Interrupts/context switches during calibration
- **Driver issues** - Outdated or buggy GPU drivers
**Solutions:**

```bash
# 1. Check GPU temperature
nvidia-smi --query-gpu=temperature.gpu --format=csv

# 2. Disable power management (Linux)
sudo nvidia-smi -pm 1     # Persistence mode
sudo nvidia-smi -lgc 1500 # Lock GPU clock to 1500MHz

# 3. Update drivers
sudo apt update && sudo apt upgrade nvidia-driver-XXX

# 4. Reduce system load
# Close unnecessary applications
# Disable CPU frequency scaling temporarily
```
### Large Error Bounds (> 1ms)

**Symptoms:**
- ErrorBoundNanos > 1,000,000
- Poor calibration quality
**Solutions:**

- **Use Robust or RANSAC strategy** - Better outlier handling
- **Increase sample count** - More samples reduce variance
- **Reduce system load** - Close background applications
- **Check for interrupts** - Disable unnecessary services
### Inconsistent Results

**Symptoms:**
- Calibration results vary significantly between runs
- Drift changes by > 100 PPM per calibration
**Solutions:**

- **Warm up the GPU** - Run dummy work before calibration (see the sketch below)
- **Increase samples** - Use 150-200 samples
- **Use Weighted strategy** - Captures temporal trends
- **Monitor system state** - Check temperature, load, frequency
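A simple way to warm up is a few throwaway batch timestamp queries before the real calibration; a sketch using only the APIs shown above (the iteration counts are arbitrary):

```csharp
// Throwaway queries to spin up the device before calibrating
for (int i = 0; i < 3; i++)
{
    _ = await timingProvider.GetGpuTimestampsBatchAsync(
        count: 1000,
        ct: cancellationToken);
}

// Now calibrate with a higher sample count for stability
var cal = await timingProvider.CalibrateAsync(
    sampleCount: 150,
    ct: cancellationToken);
```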
## API Reference

See the ITimingProvider API documentation for the complete API reference.
## Version History

### v0.5.0 (Current)
- Production-ready Timing API
- Four calibration strategies (Basic, Robust, Weighted, RANSAC)
- Hardware globaltimer support (CC 6.0+)
- CUDA event fallback (CC < 6.0)
- Comprehensive test coverage
### v0.4.1-rc2
- Initial release of Timing API with core functionality