Debugging Guide
This guide provides practical techniques for debugging compute kernels, validating correctness, and troubleshooting common issues.
Enabling Debug Mode
Development Setup
Enable comprehensive debugging during development using logging and performance monitoring:
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;
using DotCompute.Runtime;
var builder = Host.CreateApplicationBuilder(args);
// Configure detailed logging for debugging
builder.Services.AddLogging(logging =>
{
logging.AddConsole();
logging.SetMinimumLevel(LogLevel.Debug);
logging.AddFilter("DotCompute", LogLevel.Trace); // Verbose DotCompute logging
});
// Add DotCompute services with performance monitoring
builder.Services.AddDotComputeRuntime();
builder.Services.AddPerformanceMonitoring(); // Enable metrics collection
var app = builder.Build();
Behavior:
- Detailed logging of kernel execution
- Performance metrics collection
- Memory usage tracking
- Helps identify issues during development
Testing Configuration
For CI/CD environments, use a moderate logging level that captures important events without flooding build logs:
host.Services.AddLogging(logging =>
{
logging.AddConsole();
logging.SetMinimumLevel(LogLevel.Information); // Less verbose for CI
});
host.Services.AddDotComputeRuntime();
Behavior:
- Standard logging level for test runs
- Captures important events and errors
- Suitable for automated testing
Production Configuration
Use minimal logging in production:
host.Services.AddLogging(logging =>
{
logging.AddConsole();
logging.SetMinimumLevel(LogLevel.Warning); // Only warnings and errors
});
host.Services.AddDotComputeRuntime();
Behavior:
- Only logs warnings and errors
- Minimal overhead
- Safe for production use
Cross-Backend Validation
Validate GPU Against CPU
The most powerful debugging technique is to run the same kernel on both the GPU and a trusted CPU reference backend, then compare the results:
var debugService = services.GetRequiredService<IKernelDebugService>();
var validation = await debugService.ValidateCrossBackendAsync(
kernelName: "MyKernel",
parameters: new { input, output },
primaryBackend: AcceleratorType.CUDA, // GPU implementation
referenceBackend: AcceleratorType.CPU // Trusted reference
);
if (!validation.IsValid)
{
Console.WriteLine($"❌ Validation FAILED");
Console.WriteLine($"Found {validation.Differences.Count} differences");
Console.WriteLine($"Severity: {validation.Severity}");
Console.WriteLine($"Recommendation: {validation.Recommendation}");
// Print first 10 differences
foreach (var diff in validation.Differences.Take(10))
{
Console.WriteLine(
$" Index {diff.Index}: " +
$"GPU={diff.PrimaryValue:F6}, " +
$"CPU={diff.ReferenceValue:F6}, " +
$"Error={diff.RelativeError:E2}"
);
}
}
else
{
Console.WriteLine($"✅ Validation PASSED");
Console.WriteLine($"GPU speedup: {validation.Speedup:F2}x");
}
Understanding Validation Results
Valid Result (all differences within tolerance):
✅ Validation PASSED
GPU speedup: 47.32x
No differences found (tolerance: 1e-5)
Invalid Result (differences exceed tolerance):
❌ Validation FAILED
Found 127 differences
Severity: Medium
Recommendation: Check for race conditions in parallel sections
First 10 differences:
Index 42: GPU=3.141593, CPU=3.141592, Error=3.18e-07
Index 108: GPU=2.718282, CPU=2.718281, Error=3.68e-07
...
Tolerance Thresholds
// Strict (default for testing)
options.ToleranceThreshold = 1e-5; // 0.001% relative error
// Lenient (for accumulating operations)
options.ToleranceThreshold = 1e-3; // 0.1% relative error
// Very lenient (for known precision issues)
options.ToleranceThreshold = 1e-2; // 1% relative error
Rule of Thumb:
- Simple operations (add, multiply): 1e-5
- Accumulating operations (sum, dot product): 1e-3
- Transcendental functions (sin, exp, log): 1e-4
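For reference, the relative error reported in validation output is conventionally the absolute difference scaled by the magnitude of the reference value. A minimal sketch of such a comparison, with an epsilon guard added to avoid dividing by zero (the helper name and epsilon are illustrative, not part of the DotCompute API):
static bool WithinTolerance(float primaryValue, float referenceValue, float tolerance)
{
    // Scale by the reference magnitude; the epsilon avoids dividing by zero
    float denominator = MathF.Max(MathF.Abs(referenceValue), 1e-12f);
    float relativeError = MathF.Abs(primaryValue - referenceValue) / denominator;
    return relativeError <= tolerance;
}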
Determinism Testing
Check for Non-Deterministic Results
var determinism = await debugService.TestDeterminismAsync(
kernelName: "MyKernel",
parameters: new { input, output },
backend: AcceleratorType.CUDA,
runs: 100 // Run 100 times with same input
);
if (!determinism.IsDeterministic)
{
Console.WriteLine($"⚠️ Kernel is NON-DETERMINISTIC!");
Console.WriteLine($"Found {determinism.Violations.Count} violations");
Console.WriteLine($"Likely cause: {determinism.Cause}");
// Show some violations
foreach (var violation in determinism.Violations.Take(5))
{
Console.WriteLine(
$" Run {violation.RunIndex}, " +
$"Index {violation.ElementIndex}: " +
$"Expected {violation.ExpectedValue}, " +
$"Got {violation.ActualValue}"
);
}
}
else
{
Console.WriteLine("✅ Kernel is deterministic");
}
Common Non-Determinism Causes
1. Race Conditions:
// ❌ Race condition: Multiple threads writing same location
[Kernel]
public static void HasRaceCondition(Span<float> output)
{
int idx = Kernel.ThreadId.X;
output[0] += idx; // Race! All threads write to output[0]
}
// ✅ Fixed: Each thread writes unique location
[Kernel]
public static void NoRaceCondition(Span<float> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
output[idx] += idx; // Each thread has unique index
}
}
2. Unordered Reduction:
// ❌ Non-deterministic: Floating-point addition is not associative
[Kernel]
public static void UnorderedSum(ReadOnlySpan<float> input, Span<float> partialSums)
{
int idx = Kernel.ThreadId.X;
float sum = 0;
// Different thread scheduling = different accumulation order = different result
for (int i = idx; i < input.Length; i += Kernel.GridDim.X)
{
sum += input[i];
}
partialSums[Kernel.BlockId.X] = sum;
}
Solution: Use Kahan (compensated) summation and combine partial sums in a fixed order, or accept the small non-determinism.
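A minimal sketch of per-thread compensated (Kahan) summation, reusing the thread-indexing pattern from the example above; each thread writes its own partial sum, and the host combines them in a fixed order. Treat this as an illustration, not the library's built-in reduction:
[Kernel]
public static void KahanPartialSum(ReadOnlySpan<float> input, Span<float> partialSums)
{
    int idx = Kernel.ThreadId.X;
    float sum = 0f;
    float compensation = 0f; // carries the low-order bits lost by each addition

    for (int i = idx; i < input.Length; i += Kernel.GridDim.X)
    {
        float y = input[i] - compensation;
        float t = sum + y;
        compensation = (t - sum) - y;
        sum = t;
    }

    if (idx < partialSums.Length)
    {
        partialSums[idx] = sum; // one slot per thread avoids a write race
    }
}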
Common Issues and Solutions
Issue 1: Wrong Results on GPU
Symptoms:
- GPU produces different results than expected
- Cross-backend validation fails
- Results are NaN or Inf
Debug Steps:
Step 1: Validate against CPU
var validation = await debugService.ValidateCrossBackendAsync(
"MyKernel",
parameters,
AcceleratorType.CUDA,
AcceleratorType.CPU
);
Step 2: Check for common issues
// Check for NaN/Inf
if (result.Any(float.IsNaN))
{
Console.WriteLine("❌ Result contains NaN");
// Causes: Division by zero, sqrt of negative, log of negative
}
if (result.Any(float.IsInfinity))
{
Console.WriteLine("❌ Result contains Infinity");
// Causes: Overflow, division by zero
}
Step 3: Validate numerical stability
var stability = await debugService.ValidateNumericalStabilityAsync(
"MyKernel",
parameters,
AcceleratorType.CUDA
);
if (!stability.IsStable)
{
Console.WriteLine($"⚠️ Numerical instability detected");
Console.WriteLine($"NaN count: {stability.NaNCount}");
Console.WriteLine($"Inf count: {stability.InfCount}");
Console.WriteLine($"Overflow count: {stability.OverflowCount}");
}
Common Causes:
- Missing bounds check
- Race condition
- Uninitialized memory
- Integer overflow
- Division by zero
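Missing bounds checks and division by zero, two of the causes above, can be ruled out directly in the kernel. A short sketch using the same [Kernel] attribute and Kernel.ThreadId indexing as the earlier examples:
[Kernel]
public static void SafeDivide(ReadOnlySpan<float> numerator, ReadOnlySpan<float> denominator, Span<float> result)
{
    int idx = Kernel.ThreadId.X;
    if (idx < result.Length) // bounds check prevents out-of-range writes
    {
        float d = denominator[idx];
        // Guard against division by zero, which would produce Inf or NaN
        result[idx] = d != 0f ? numerator[idx] / d : 0f;
    }
}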
Issue 2: Slow Performance
Symptoms:
- Kernel is slower than expected
- GPU slower than CPU
- Performance varies widely
Debug Steps:
Step 1: Profile the kernel
var profile = await debugService.ProfileKernelAsync(
"MyKernel",
parameters,
AcceleratorType.CUDA,
iterations: 1000
);
Console.WriteLine($"Average: {profile.AverageTime.TotalMicroseconds:F2}μs");
Console.WriteLine($"Std dev: {profile.StandardDeviation.TotalMicroseconds:F2}μs");
Console.WriteLine($"Min/Max: {profile.MinTime.TotalMicroseconds:F2}μs / {profile.MaxTime.TotalMicroseconds:F2}μs");
// High std dev indicates variable performance
if (profile.StandardDeviation.TotalMilliseconds > profile.AverageTime.TotalMilliseconds * 0.1)
{
Console.WriteLine("⚠️ High variability in execution time");
}
Step 2: Analyze memory patterns
var memoryReport = await debugService.AnalyzeMemoryPatternsAsync(
"MyKernel",
parameters,
AcceleratorType.CUDA
);
Console.WriteLine($"Sequential access: {memoryReport.SequentialAccessRate:P1}");
Console.WriteLine($"Cache hit rate: {memoryReport.CacheHitRate:P1}");
Console.WriteLine($"Bandwidth utilization: {memoryReport.BandwidthUtilization:P1}");
foreach (var suggestion in memoryReport.Suggestions)
{
Console.WriteLine($"💡 {suggestion}");
}
Step 3: Compare backends
var cpuTime = await BenchmarkBackend(AcceleratorType.CPU);
var gpuTime = await BenchmarkBackend(AcceleratorType.CUDA);
Console.WriteLine($"CPU: {cpuTime:F2}ms");
Console.WriteLine($"GPU: {gpuTime:F2}ms");
if (gpuTime > cpuTime)
{
Console.WriteLine("⚠️ GPU is slower than CPU!");
Console.WriteLine("Possible causes:");
Console.WriteLine(" - Data too small (< 10,000 elements)");
Console.WriteLine(" - Memory-bound operation");
Console.WriteLine(" - Transfer overhead dominates");
}
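BenchmarkBackend above is not a DotCompute API; a minimal sketch using Stopwatch and the forceBackend parameter shown later in this guide (the warm-up run and iteration count are illustrative):
async Task<double> BenchmarkBackend(AcceleratorType backend, int iterations = 100)
{
    // Warm-up run so compilation and first-touch transfers are not measured
    await orchestrator.ExecuteKernelAsync("MyKernel", parameters, forceBackend: backend);

    var stopwatch = System.Diagnostics.Stopwatch.StartNew();
    for (int i = 0; i < iterations; i++)
    {
        await orchestrator.ExecuteKernelAsync("MyKernel", parameters, forceBackend: backend);
    }
    stopwatch.Stop();

    return stopwatch.Elapsed.TotalMilliseconds / iterations; // average milliseconds per run
}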
Common Causes:
- Poor memory access pattern
- Too many branches
- Low parallelism
- Small data size
- Transfer overhead
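To illustrate the first cause: a strided access pattern makes adjacent threads touch memory locations far apart, while a sequential pattern lets the hardware coalesce loads. A sketch in the style of the earlier kernels (the stride constant is illustrative):
// ❌ Strided access: adjacent threads read elements far apart
[Kernel]
public static void StridedCopy(ReadOnlySpan<float> input, Span<float> output)
{
    const int Stride = 32; // illustrative stride
    int idx = Kernel.ThreadId.X;
    int source = idx * Stride;
    if (idx < output.Length && source < input.Length)
    {
        output[idx] = input[source];
    }
}

// ✅ Sequential access: adjacent threads read adjacent elements (coalesced)
[Kernel]
public static void SequentialCopy(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length && idx < input.Length)
    {
        output[idx] = input[idx];
    }
}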
Issue 3: Intermittent Failures
Symptoms:
- Kernel passes sometimes, fails other times
- Non-deterministic results
- Hard to reproduce
Debug Steps:
Step 1: Test determinism
var determinism = await debugService.TestDeterminismAsync(
"MyKernel",
parameters,
AcceleratorType.CUDA,
runs: 100
);
if (!determinism.IsDeterministic)
{
Console.WriteLine($"❌ Non-deterministic (cause: {determinism.Cause})");
}
Step 2: Stress test
var stressTest = await debugService.StressTestKernelAsync(
"MyKernel",
inputGenerator: new RandomInputGenerator(),
backend: AcceleratorType.CUDA,
iterations: 10_000
);
Console.WriteLine($"Success rate: {stressTest.SuccessRate:P1}");
Console.WriteLine($"Failures: {stressTest.FailureCount}");
if (stressTest.FailureCount > 0)
{
Console.WriteLine("Sample failures:");
foreach (var failure in stressTest.Failures.Take(5))
{
Console.WriteLine($" Input: {failure.Input}");
Console.WriteLine($" Error: {failure.Error}");
}
}
Step 3: Detect race conditions
var raceReport = await debugService.DetectRaceConditionsAsync(
"MyKernel",
parameters,
AcceleratorType.CUDA,
concurrentExecutions: 100
);
if (raceReport.HasRaceConditions)
{
Console.WriteLine($"❌ Race conditions detected");
Console.WriteLine($"Conflicts: {raceReport.ConflictCount}");
foreach (var conflict in raceReport.Conflicts.Take(5))
{
Console.WriteLine($" Location: {conflict.MemoryLocation}");
Console.WriteLine($" Threads: {string.Join(", ", conflict.ConflictingThreads)}");
}
}
Common Causes:
- Race conditions
- Unordered reduction
- Thread-unsafe operations
- Shared memory conflicts
Issue 4: Out of Memory
Symptoms:
- OutOfMemoryException thrown
- Kernel fails to allocate buffers
- System becomes unresponsive
Debug Steps:
Step 1: Check memory usage
var memoryStats = memoryManager.GetStatistics();
Console.WriteLine($"Total allocated: {memoryStats.TotalAllocated / 1024 / 1024:F2} MB");
Console.WriteLine($"Total pooled: {memoryStats.TotalPooled / 1024 / 1024:F2} MB");
Console.WriteLine($"Active buffers: {memoryStats.ActiveBuffers}");
Console.WriteLine($"Peak usage: {memoryStats.PeakUsage / 1024 / 1024:F2} MB");
Console.WriteLine($"Pool hit rate: {memoryStats.HitRate:P1}");
Step 2: Check GPU memory
var accelerator = await acceleratorManager.GetOrCreateAcceleratorAsync(AcceleratorType.CUDA);
var deviceStats = accelerator.GetMemoryStatistics();
Console.WriteLine($"Total GPU memory: {deviceStats.TotalMemory / 1024 / 1024:F2} MB");
Console.WriteLine($"Used GPU memory: {deviceStats.UsedMemory / 1024 / 1024:F2} MB");
Console.WriteLine($"Free GPU memory: {deviceStats.FreeMemory / 1024 / 1024:F2} MB");
if (deviceStats.FreeMemory < 100 * 1024 * 1024) // < 100 MB
{
Console.WriteLine("⚠️ Low GPU memory!");
}
Step 3: Find memory leaks
// Track allocations
var initialActiveBuffers = memoryManager.GetStatistics().ActiveBuffers; // fresh snapshot before the run
// Run kernel
await orchestrator.ExecuteKernelAsync("MyKernel", parameters);
// Force GC
GC.Collect();
GC.WaitForPendingFinalizers();
var finalActiveBuffers = memoryManager.GetStatistics().ActiveBuffers;
if (finalActiveBuffers > initialActiveBuffers)
{
Console.WriteLine($"⚠️ Memory leak detected!");
Console.WriteLine($"Leaked buffers: {finalActiveBuffers - initialActiveBuffers}");
}
Solutions:
- Use using statements for buffers
- Return buffers to pool
- Reduce batch size
- Use streaming for large data
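A sketch of the first two solutions; the buffer type and AllocateAsync call here are assumptions for illustration, so substitute the allocation API exposed by your memory manager:
// Hypothetical allocation API, shown only to illustrate scoped disposal
await using var inputBuffer = await memoryManager.AllocateAsync<float>(input.Length);
await using var outputBuffer = await memoryManager.AllocateAsync<float>(input.Length);

await orchestrator.ExecuteKernelAsync("MyKernel", new { inputBuffer, outputBuffer });
// Both buffers are disposed and returned to the pool when they leave scope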
Debugging Tools
Print Debugging (CPU Only)
[Kernel]
public static void DebugPrint(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X;
// Only works on CPU backend
if (idx < 10) // Print first 10 threads
{
Console.WriteLine($"Thread {idx}: input={input[idx]}");
}
if (idx < output.Length)
{
output[idx] = input[idx] * 2;
}
}
// Force CPU execution for debugging
await orchestrator.ExecuteKernelAsync(
"DebugPrint",
parameters,
forceBackend: AcceleratorType.CPU
);
Note: Console.WriteLine only works on CPU backend
Golden Reference Testing
// Create known-good output
var goldenOutput = ComputeExpectedOutput(input);
// Test kernel against golden reference
var validation = await debugService.ValidateAgainstGoldenAsync(
"MyKernel",
parameters: new { input },
expectedOutput: goldenOutput,
backend: AcceleratorType.CUDA
);
if (!validation.IsValid)
{
Console.WriteLine($"❌ Failed to match golden reference");
Console.WriteLine($"Differences: {validation.Differences.Count}");
}
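ComputeExpectedOutput is simply a plain, sequential implementation of the same math with no DotCompute dependencies. For the doubling kernel used elsewhere in this guide it might look like:
static float[] ComputeExpectedOutput(float[] input)
{
    var expected = new float[input.Length];
    for (int i = 0; i < input.Length; i++)
    {
        expected[i] = input[i] * 2; // same computation as the kernel, run sequentially
    }
    return expected;
}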
Regression Testing
[Fact]
public async Task MyKernel_ProducesSameResultsAsPreviousVersion()
{
// Load results from previous version
var previousResults = LoadPreviousResults("v0.1.0");
// Execute current version
var currentResults = await orchestrator.ExecuteKernelAsync(
"MyKernel",
parameters
);
// Compare (for floating-point outputs, prefer a tolerance-based check; see the helper sketch below)
Assert.Equal(previousResults, currentResults);
}
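Exact equality can be brittle for floating-point outputs when the compiler, driver, or hardware changes between versions. A tolerance-based helper (a sketch; names and default tolerance are illustrative) is usually more robust:
static void AssertApproximatelyEqual(float[] expected, float[] actual, float tolerance = 1e-5f)
{
    Assert.Equal(expected.Length, actual.Length);
    for (int i = 0; i < expected.Length; i++)
    {
        float denominator = MathF.Max(MathF.Abs(expected[i]), 1e-12f);
        float relativeError = MathF.Abs(actual[i] - expected[i]) / denominator;
        Assert.True(relativeError <= tolerance,
            $"Mismatch at index {i}: expected {expected[i]}, got {actual[i]}");
    }
}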
IDE Integration
Visual Studio
Diagnostic Warnings:
- DC001-DC012 diagnostics show as error squiggles
- Hover for quick explanation
- Click lightbulb for automated fixes
Debugging:
- Set breakpoints in kernel code (CPU only)
- Step through execution
- Watch variables
- Call stack shows kernel invocation
VS Code
C# Dev Kit Extension:
code --install-extension ms-dotnettools.csdevkit
Features:
- Same diagnostics as Visual Studio
- Quick fixes via lightbulb
- IntelliSense for generated code
Logging and Diagnostics
Enable Detailed Logging
services.AddLogging(logging =>
{
logging.AddConsole();
logging.SetMinimumLevel(LogLevel.Debug);
// Filter to DotCompute only
logging.AddFilter("DotCompute", LogLevel.Trace);
});
Log Output Example
[Trace] DotCompute.Core.KernelExecutionService: Discovering kernel 'VectorAdd'
[Debug] DotCompute.Core.KernelExecutionService: Backend selection: DataSize=4000000, Intensity=Low
[Debug] DotCompute.Core.KernelExecutionService: Selected backend: CPU (rule: small data)
[Trace] DotCompute.Backends.CPU.CpuAccelerator: Compiling kernel 'VectorAdd' (SIMD=AVX2)
[Debug] DotCompute.Memory.UnifiedMemoryManager: Allocated 4.00 MB from pool (hit rate: 92.3%)
[Info] DotCompute.Core.KernelExecutionService: Executed 'VectorAdd' in 2.34ms
Custom Diagnostics
public class CustomDiagnostics
{
private readonly ILogger<CustomDiagnostics> _logger;
public CustomDiagnostics(ILogger<CustomDiagnostics> logger) => _logger = logger;
public async Task DiagnoseKernel(string kernelName, object parameters)
{
_logger.LogInformation("=== Diagnostics for {Kernel} ===", kernelName);
// 1. Check kernel exists
var registry = GetService<IKernelRegistry>(); // GetService: placeholder for resolving from your DI container
var metadata = registry.GetKernel(kernelName);
if (metadata == null)
{
_logger.LogError("❌ Kernel not found: {Kernel}", kernelName);
return;
}
_logger.LogInformation("✅ Kernel found: {Namespace}.{Type}.{Method}",
metadata.Namespace, metadata.DeclaringType, metadata.Name);
// 2. Check backend availability
var manager = GetService<IAcceleratorManager>();
var availableBackends = manager.GetAvailableBackends();
_logger.LogInformation("Available backends: {Backends}",
string.Join(", ", availableBackends));
// 3. Profile execution
var profile = await ProfileKernel(kernelName, parameters);
_logger.LogInformation("Average time: {Time:F2}μs", profile.AverageTime.TotalMicroseconds);
// 4. Validate correctness
var validation = await ValidateKernel(kernelName, parameters);
if (validation.IsValid)
{
_logger.LogInformation("✅ Validation passed");
}
else
{
_logger.LogWarning("⚠️ Validation failed: {Count} differences",
validation.Differences.Count);
}
_logger.LogInformation("=== Diagnostics complete ===");
}
}
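Usage is then a matter of resolving the class from dependency injection (registration, e.g. via AddSingleton, is assumed to happen during host setup):
// Assumes CustomDiagnostics was registered, e.g. services.AddSingleton<CustomDiagnostics>()
var diagnostics = services.GetRequiredService<CustomDiagnostics>();
await diagnostics.DiagnoseKernel("VectorAdd", new { input, output });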
Best Practices
✅ Do
- Enable debug validation in development - Catches issues early
- Use cross-backend validation - Most reliable correctness check
- Test determinism for critical kernels - Avoid subtle bugs
- Profile before and after optimization - Verify improvements
- Use golden reference tests - Prevent regressions
- Log diagnostic information - Helps troubleshoot production issues
❌ Don't
- Don't disable validation in tests - May miss correctness issues
- Don't ignore analyzer warnings - DC001-DC012 catch real problems
- Don't assume GPU is correct - Validate against CPU
- Don't skip stress testing - Catches intermittent issues
- Don't forget to dispose buffers - Causes memory leaks
Troubleshooting Checklist
When a kernel misbehaves:
- [ ] Enable debug validation
- [ ] Run cross-backend validation
- [ ] Check for NaN/Inf in results
- [ ] Test determinism (run 100 times)
- [ ] Profile performance (check for anomalies)
- [ ] Analyze memory access patterns
- [ ] Check for race conditions
- [ ] Verify bounds checking
- [ ] Test with small, known inputs
- [ ] Review analyzer warnings (DC001-DC012)
- [ ] Check memory usage (no leaks)
- [ ] Compare CPU vs GPU results
Further Reading
- Kernel Development Guide - Writing correct kernels
- Performance Tuning Guide - Optimization techniques
- Architecture: Debugging System - Technical details
- Diagnostic Rules Reference - DC001-DC012 reference
- Troubleshooting Guide - Common issues and solutions
Debug Early • Validate Often • Trust But Verify