Debugging Guide

This guide provides practical techniques for debugging compute kernels, validating correctness, and troubleshooting common issues.

Enabling Debug Mode

Development Setup

Enable comprehensive debugging during development using logging and performance monitoring:

using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;
using DotCompute.Runtime;

var builder = Host.CreateApplicationBuilder(args);

// Configure detailed logging for debugging
builder.Services.AddLogging(logging =>
{
    logging.AddConsole();
    logging.SetMinimumLevel(LogLevel.Debug);
    logging.AddFilter("DotCompute", LogLevel.Trace);  // Verbose DotCompute logging
});

// Add DotCompute services with performance monitoring
builder.Services.AddDotComputeRuntime();
builder.Services.AddPerformanceMonitoring();  // Enable metrics collection

var app = builder.Build();

Behavior:

  • Detailed logging of kernel execution
  • Performance metrics collection
  • Memory usage tracking
  • Helps identify issues during development

Testing Configuration

For CI/CD environments, use appropriate logging levels:

builder.Services.AddLogging(logging =>
{
    logging.AddConsole();
    logging.SetMinimumLevel(LogLevel.Information);  // Less verbose for CI
});

builder.Services.AddDotComputeRuntime();

Behavior:

  • Standard logging level for test runs
  • Captures important events and errors
  • Suitable for automated testing

Production Configuration

Use minimal logging in production:

builder.Services.AddLogging(logging =>
{
    logging.AddConsole();
    logging.SetMinimumLevel(LogLevel.Warning);  // Only warnings and errors
});

builder.Services.AddDotComputeRuntime();

Behavior:

  • Only logs warnings and errors
  • Minimal overhead
  • Safe for production use

Cross-Backend Validation

Validate GPU Against CPU

The most powerful debugging technique is comparing the GPU implementation against a trusted CPU reference:

var debugService = services.GetRequiredService<IKernelDebugService>();

var validation = await debugService.ValidateCrossBackendAsync(
    kernelName: "MyKernel",
    parameters: new { input, output },
    primaryBackend: AcceleratorType.CUDA,    // GPU implementation
    referenceBackend: AcceleratorType.CPU     // Trusted reference
);

if (!validation.IsValid)
{
    Console.WriteLine($"❌ Validation FAILED");
    Console.WriteLine($"Found {validation.Differences.Count} differences");
    Console.WriteLine($"Severity: {validation.Severity}");
    Console.WriteLine($"Recommendation: {validation.Recommendation}");

    // Print first 10 differences
    foreach (var diff in validation.Differences.Take(10))
    {
        Console.WriteLine(
            $"  Index {diff.Index}: " +
            $"GPU={diff.PrimaryValue:F6}, " +
            $"CPU={diff.ReferenceValue:F6}, " +
            $"Error={diff.RelativeError:E2}"
        );
    }
}
else
{
    Console.WriteLine($"✅ Validation PASSED");
    Console.WriteLine($"GPU speedup: {validation.Speedup:F2}x");
}

Understanding Validation Results

Valid Result (all differences within tolerance):

✅ Validation PASSED
GPU speedup: 47.32x
No differences found (tolerance: 1e-5)

Invalid Result (differences exceed tolerance):

❌ Validation FAILED
Found 127 differences
Severity: Medium
Recommendation: Check for race conditions in parallel sections

First 10 differences:
  Index 42: GPU=3.141593, CPU=3.141592, Error=3.18e-07
  Index 108: GPU=2.718282, CPU=2.718281, Error=3.68e-07
  ...

Tolerance Thresholds

// Strict (default for testing)
options.ToleranceThreshold = 1e-5;  // 0.001% relative error

// Lenient (for accumulating operations)
options.ToleranceThreshold = 1e-3;  // 0.1% relative error

// Very lenient (for known precision issues)
options.ToleranceThreshold = 1e-2;  // 1% relative error

Rule of Thumb (a relative-error check is sketched after this list):

  • Simple operations (add, multiply): 1e-5
  • Accumulating operations (sum, dot product): 1e-3
  • Transcendental functions (sin, exp, log): 1e-4
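
To make these thresholds concrete, a relative-error check is typically computed as in the sketch below; this is standalone code for intuition, not the library's internal comparison routine:

static bool WithinTolerance(float primary, float reference, double tolerance = 1e-5)
{
    double absoluteError = Math.Abs(primary - reference);
    double magnitude = Math.Max(Math.Abs(reference), 1e-30);  // avoid division by zero
    return absoluteError / magnitude <= tolerance;            // relative error vs threshold
}

// Example: 3.141593f vs 3.141592f → relative error ≈ 3.2e-7 (matching the output above), well within 1e-5
bool closeEnough = WithinTolerance(3.141593f, 3.141592f);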

Determinism Testing

Check for Non-Deterministic Results

var determinism = await debugService.TestDeterminismAsync(
    kernelName: "MyKernel",
    parameters: new { input, output },
    backend: AcceleratorType.CUDA,
    runs: 100  // Run 100 times with same input
);

if (!determinism.IsDeterministic)
{
    Console.WriteLine($"⚠️ Kernel is NON-DETERMINISTIC!");
    Console.WriteLine($"Found {determinism.Violations.Count} violations");
    Console.WriteLine($"Likely cause: {determinism.Cause}");

    // Show some violations
    foreach (var violation in determinism.Violations.Take(5))
    {
        Console.WriteLine(
            $"  Run {violation.RunIndex}, " +
            $"Index {violation.ElementIndex}: " +
            $"Expected {violation.ExpectedValue}, " +
            $"Got {violation.ActualValue}"
        );
    }
}
else
{
    Console.WriteLine("✅ Kernel is deterministic");
}

Common Non-Determinism Causes

1. Race Conditions:

// ❌ Race condition: Multiple threads writing same location
[Kernel]
public static void HasRaceCondition(Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    output[0] += idx;  // Race! All threads write to output[0]
}

// ✅ Fixed: Each thread writes unique location
[Kernel]
public static void NoRaceCondition(Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)
    {
        output[idx] += idx;  // Each thread has unique index
    }
}

2. Unordered Reduction:

// ❌ Non-deterministic: Floating-point addition is not associative
[Kernel]
public static void UnorderedSum(ReadOnlySpan<float> input, Span<float> partialSums)
{
    int idx = Kernel.ThreadId.X;
    float sum = 0;

    // Floating-point addition is not associative, so combining partial sums in a
    // scheduling-dependent order produces slightly different results each run
    for (int i = idx; i < input.Length; i += Kernel.GridDim.X)
    {
        sum += input[i];
    }

    // Also a race: every thread in the block overwrites the same partial-sum slot
    partialSums[Kernel.BlockId.X] = sum;
}

Solution: Use a fixed reduction order (for example, a deterministic tree reduction), use Kahan summation to shrink the rounding error (sketched below), or accept the small non-determinism
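
For reference, a compensated (Kahan) accumulation for one thread's partial sum might look like the sketch below; it reduces the rounding error of each partial sum, though results can still differ slightly when partial sums are combined in different orders:

static float KahanSum(ReadOnlySpan<float> values)
{
    float sum = 0f;
    float compensation = 0f;  // running compensation for lost low-order bits

    for (int i = 0; i < values.Length; i++)
    {
        float y = values[i] - compensation;
        float t = sum + y;               // low-order bits of y are lost here...
        compensation = (t - sum) - y;    // ...and captured in the compensation term
        sum = t;
    }

    return sum;
}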

Common Issues and Solutions

Issue 1: Wrong Results on GPU

Symptoms:

  • GPU produces different results than expected
  • Cross-backend validation fails
  • Results are NaN or Inf

Debug Steps:

Step 1: Validate against CPU

var validation = await debugService.ValidateCrossBackendAsync(
    "MyKernel",
    parameters,
    AcceleratorType.CUDA,
    AcceleratorType.CPU
);

Step 2: Check for common issues

// Check for NaN/Inf
if (result.Any(float.IsNaN))
{
    Console.WriteLine("❌ Result contains NaN");
    // Causes: Division by zero, sqrt of negative, log of negative
}

if (result.Any(float.IsInfinity))
{
    Console.WriteLine("❌ Result contains Infinity");
    // Causes: Overflow, division by zero
}

Step 3: Validate numerical stability

var stability = await debugService.ValidateNumericalStabilityAsync(
    "MyKernel",
    parameters,
    AcceleratorType.CUDA
);

if (!stability.IsStable)
{
    Console.WriteLine($"⚠️ Numerical instability detected");
    Console.WriteLine($"NaN count: {stability.NaNCount}");
    Console.WriteLine($"Inf count: {stability.InfCount}");
    Console.WriteLine($"Overflow count: {stability.OverflowCount}");
}

Common Causes (causes 1 and 5 are illustrated in the sketch after this list):

  1. Missing bounds check
  2. Race condition
  3. Uninitialized memory
  4. Integer overflow
  5. Division by zero
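
As an illustration of causes 1 and 5, the hypothetical kernel below guards both the thread index and the divisor before writing:

[Kernel]
public static void SafeDivide(
    ReadOnlySpan<float> numerator, ReadOnlySpan<float> denominator, Span<float> output)
{
    int idx = Kernel.ThreadId.X;

    // Cause 1: always bounds-check the thread index before touching memory
    if (idx < output.Length)
    {
        // Cause 5: guard the divisor, otherwise the result is Inf (x/0) or NaN (0/0)
        float d = denominator[idx];
        output[idx] = d != 0f ? numerator[idx] / d : 0f;
    }
}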

Issue 2: Slow Performance

Symptoms:

  • Kernel is slower than expected
  • GPU slower than CPU
  • Performance varies widely

Debug Steps:

Step 1: Profile the kernel

var profile = await debugService.ProfileKernelAsync(
    "MyKernel",
    parameters,
    AcceleratorType.CUDA,
    iterations: 1000
);

Console.WriteLine($"Average: {profile.AverageTime.TotalMicroseconds:F2}μs");
Console.WriteLine($"Std dev: {profile.StandardDeviation.TotalMicroseconds:F2}μs");
Console.WriteLine($"Min/Max: {profile.MinTime.TotalMicroseconds:F2}μs / {profile.MaxTime.TotalMicroseconds:F2}μs");

// High std dev indicates variable performance
if (profile.StandardDeviation.TotalMilliseconds > profile.AverageTime.TotalMilliseconds * 0.1)
{
    Console.WriteLine("⚠️ High variability in execution time");
}

Step 2: Analyze memory patterns

var memoryReport = await debugService.AnalyzeMemoryPatternsAsync(
    "MyKernel",
    parameters,
    AcceleratorType.CUDA
);

Console.WriteLine($"Sequential access: {memoryReport.SequentialAccessRate:P1}");
Console.WriteLine($"Cache hit rate: {memoryReport.CacheHitRate:P1}");
Console.WriteLine($"Bandwidth utilization: {memoryReport.BandwidthUtilization:P1}");

foreach (var suggestion in memoryReport.Suggestions)
{
    Console.WriteLine($"💡 {suggestion}");
}

Step 3: Compare backends

var cpuTime = await BenchmarkBackend(AcceleratorType.CPU);
var gpuTime = await BenchmarkBackend(AcceleratorType.CUDA);

Console.WriteLine($"CPU: {cpuTime:F2}ms");
Console.WriteLine($"GPU: {gpuTime:F2}ms");

if (gpuTime > cpuTime)
{
    Console.WriteLine("⚠️ GPU is slower than CPU!");
    Console.WriteLine("Possible causes:");
    Console.WriteLine("  - Data too small (< 10,000 elements)");
    Console.WriteLine("  - Memory-bound operation");
    Console.WriteLine("  - Transfer overhead dominates");
}
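
BenchmarkBackend above is a placeholder helper. A minimal sketch, assuming the orchestrator and parameters from earlier examples and the forceBackend option shown later in this guide:

async Task<double> BenchmarkBackend(AcceleratorType backend, int iterations = 100)
{
    // Warm-up run so kernel compilation does not skew the measurement
    await orchestrator.ExecuteKernelAsync("MyKernel", parameters, forceBackend: backend);

    var stopwatch = System.Diagnostics.Stopwatch.StartNew();
    for (int i = 0; i < iterations; i++)
    {
        await orchestrator.ExecuteKernelAsync("MyKernel", parameters, forceBackend: backend);
    }
    stopwatch.Stop();

    return stopwatch.Elapsed.TotalMilliseconds / iterations;  // average milliseconds per run
}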

Common Causes (cause 1 is illustrated in the sketch after this list):

  1. Poor memory access pattern
  2. Too many branches
  3. Low parallelism
  4. Small data size
  5. Transfer overhead
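
For cause 1, the pattern that usually hurts on GPU is strided access, where neighbouring threads read elements far apart in memory. A simplified, hypothetical pair of kernels for comparison:

// ❌ Strided: neighbouring threads read elements `stride` apart, so most of each
// memory transaction is wasted
[Kernel]
public static void StridedGather(ReadOnlySpan<float> input, Span<float> output, int stride)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length && idx * stride < input.Length)
    {
        output[idx] = input[idx * stride];
    }
}

// ✅ Sequential: neighbouring threads read neighbouring elements, so reads coalesce
[Kernel]
public static void SequentialCopy(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length && idx < input.Length)
    {
        output[idx] = input[idx];
    }
}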

Issue 3: Intermittent Failures

Symptoms:

  • Kernel passes sometimes, fails other times
  • Non-deterministic results
  • Hard to reproduce

Debug Steps:

Step 1: Test determinism

var determinism = await debugService.TestDeterminismAsync(
    "MyKernel",
    parameters,
    AcceleratorType.CUDA,
    runs: 100
);

if (!determinism.IsDeterministic)
{
    Console.WriteLine($"❌ Non-deterministic (cause: {determinism.Cause})");
}

Step 2: Stress test

var stressTest = await debugService.StressTestKernelAsync(
    "MyKernel",
    inputGenerator: new RandomInputGenerator(),
    backend: AcceleratorType.CUDA,
    iterations: 10_000
);

Console.WriteLine($"Success rate: {stressTest.SuccessRate:P1}");
Console.WriteLine($"Failures: {stressTest.FailureCount}");

if (stressTest.FailureCount > 0)
{
    Console.WriteLine("Sample failures:");
    foreach (var failure in stressTest.Failures.Take(5))
    {
        Console.WriteLine($"  Input: {failure.Input}");
        Console.WriteLine($"  Error: {failure.Error}");
    }
}

Step 3: Detect race conditions

var raceReport = await debugService.DetectRaceConditionsAsync(
    "MyKernel",
    parameters,
    AcceleratorType.CUDA,
    concurrentExecutions: 100
);

if (raceReport.HasRaceConditions)
{
    Console.WriteLine($"❌ Race conditions detected");
    Console.WriteLine($"Conflicts: {raceReport.ConflictCount}");

    foreach (var conflict in raceReport.Conflicts.Take(5))
    {
        Console.WriteLine($"  Location: {conflict.MemoryLocation}");
        Console.WriteLine($"  Threads: {string.Join(", ", conflict.ConflictingThreads)}");
    }
}

Common Causes:

  1. Race conditions
  2. Unordered reduction
  3. Thread-unsafe operations
  4. Shared memory conflicts

Issue 4: Out of Memory

Symptoms:

  • OutOfMemoryException thrown
  • Kernel fails to allocate buffers
  • System becomes unresponsive

Debug Steps:

Step 1: Check memory usage

var memoryStats = memoryManager.GetStatistics();

Console.WriteLine($"Total allocated: {memoryStats.TotalAllocated / 1024 / 1024:F2} MB");
Console.WriteLine($"Total pooled: {memoryStats.TotalPooled / 1024 / 1024:F2} MB");
Console.WriteLine($"Active buffers: {memoryStats.ActiveBuffers}");
Console.WriteLine($"Peak usage: {memoryStats.PeakUsage / 1024 / 1024:F2} MB");
Console.WriteLine($"Pool hit rate: {memoryStats.HitRate:P1}");

Step 2: Check GPU memory

var accelerator = await acceleratorManager.GetOrCreateAcceleratorAsync(AcceleratorType.CUDA);
var deviceStats = accelerator.GetMemoryStatistics();

Console.WriteLine($"Total GPU memory: {deviceStats.TotalMemory / 1024 / 1024:F2} MB");
Console.WriteLine($"Used GPU memory: {deviceStats.UsedMemory / 1024 / 1024:F2} MB");
Console.WriteLine($"Free GPU memory: {deviceStats.FreeMemory / 1024 / 1024:F2} MB");

if (deviceStats.FreeMemory < 100 * 1024 * 1024)  // < 100 MB
{
    Console.WriteLine("⚠️ Low GPU memory!");
}

Step 3: Find memory leaks

// Snapshot the active buffer count before running
var initialActiveBuffers = memoryManager.GetStatistics().ActiveBuffers;

// Run kernel
await orchestrator.ExecuteKernelAsync("MyKernel", parameters);

// Force GC
GC.Collect();
GC.WaitForPendingFinalizers();

var finalActiveBuffers = memoryManager.GetStatistics().ActiveBuffers;

if (finalActiveBuffers > initialActiveBuffers)
{
    Console.WriteLine($"⚠️ Memory leak detected!");
    Console.WriteLine($"Leaked buffers: {finalActiveBuffers - initialActiveBuffers}");
}

Solutions (the first and last are sketched after this list):

  1. Use using statements for buffers
  2. Return buffers to pool
  3. Reduce batch size
  4. Use streaming for large data
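
A sketch combining the first and last solutions. The AllocateAsync call, chunk size, and ProcessChunk kernel name are illustrative assumptions rather than confirmed DotCompute APIs; the `await using` scope and fixed-size chunking are the point:

const int chunkSize = 1_000_000;  // process large inputs in fixed-size pieces

for (int offset = 0; offset < totalLength; offset += chunkSize)
{
    int count = Math.Min(chunkSize, totalLength - offset);

    // `await using` guarantees the buffer goes back to the pool even if execution throws
    await using var chunk = await memoryManager.AllocateAsync<float>(count);  // hypothetical API

    await orchestrator.ExecuteKernelAsync("ProcessChunk", new { chunk, offset });
}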

Debugging Tools

Printf-Style Debugging (CPU Only)

The simplest tool is printing values from inside the kernel while forcing execution onto the CPU backend:

[Kernel]
public static void DebugPrint(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X;

    // Only works on CPU backend
    if (idx < 10)  // Print first 10 threads
    {
        Console.WriteLine($"Thread {idx}: input={input[idx]}");
    }

    if (idx < output.Length)
    {
        output[idx] = input[idx] * 2;
    }
}

// Force CPU execution for debugging
await orchestrator.ExecuteKernelAsync(
    "DebugPrint",
    parameters,
    forceBackend: AcceleratorType.CPU
);

Note: Console.WriteLine only works on the CPU backend.
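
When the kernel has to run on the GPU, a common alternative is to write intermediate values into a dedicated debug buffer and read it back on the host afterwards; the kernel below is a hypothetical sketch of that pattern:

[Kernel]
public static void ScaleWithDebug(ReadOnlySpan<float> input, Span<float> output, Span<float> debug)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)
    {
        float intermediate = input[idx] * 2f;

        // Capture intermediates for the first few threads; inspect `debug` on the host afterwards
        if (idx < debug.Length)
        {
            debug[idx] = intermediate;
        }

        output[idx] = intermediate;
    }
}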

Golden Reference Testing

// Create known-good output
var goldenOutput = ComputeExpectedOutput(input);

// Test kernel against golden reference
var validation = await debugService.ValidateAgainstGoldenAsync(
    "MyKernel",
    parameters: new { input },
    expectedOutput: goldenOutput,
    backend: AcceleratorType.CUDA
);

if (!validation.IsValid)
{
    Console.WriteLine($"❌ Failed to match golden reference");
    Console.WriteLine($"Differences: {validation.Differences.Count}");
}
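
ComputeExpectedOutput is whatever trusted, sequential implementation you have. For a simple element-wise kernel it can be as small as the sketch below (the doubling operation is just a placeholder):

static float[] ComputeExpectedOutput(ReadOnlySpan<float> input)
{
    var expected = new float[input.Length];
    for (int i = 0; i < input.Length; i++)
    {
        expected[i] = input[i] * 2f;  // same math as the kernel, evaluated sequentially on the CPU
    }
    return expected;
}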

Regression Testing

[Fact]
public async Task MyKernel_ProducesSameResultsAsPreviousVersion()
{
    // Load results from previous version
    var previousResults = LoadPreviousResults("v0.1.0");

    // Execute current version
    var currentResults = await orchestrator.ExecuteKernelAsync(
        "MyKernel",
        parameters
    );

    // Compare (for floating-point outputs, prefer the tolerance-aware helper sketched below)
    Assert.Equal(previousResults, currentResults);
}
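
Exact equality works for integer results, but floating-point kernels usually need a tolerance-aware comparison. A minimal xUnit-style helper, reusing the 1e-5 default from earlier:

private static void AssertEqualWithTolerance(
    IReadOnlyList<float> expected, IReadOnlyList<float> actual, double tolerance = 1e-5)
{
    Assert.Equal(expected.Count, actual.Count);

    for (int i = 0; i < expected.Count; i++)
    {
        double magnitude = Math.Max(Math.Abs(expected[i]), 1e-30);
        double relativeError = Math.Abs(expected[i] - actual[i]) / magnitude;
        Assert.True(relativeError <= tolerance,
            $"Index {i}: expected {expected[i]}, got {actual[i]} (relative error {relativeError:E2})");
    }
}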

IDE Integration

Visual Studio

Diagnostic Warnings:

  • DC001-DC012 diagnostics show as error squiggles
  • Hover for quick explanation
  • Click lightbulb for automated fixes

Debugging:

  • Set breakpoints in kernel code (CPU only)
  • Step through execution
  • Watch variables
  • Call stack shows kernel invocation

VS Code

C# Dev Kit Extension:

code --install-extension ms-dotnettools.csdevkit

Features:

  • Same diagnostics as Visual Studio
  • Quick fixes via lightbulb
  • IntelliSense for generated code

Logging and Diagnostics

Enable Detailed Logging

services.AddLogging(logging =>
{
    logging.AddConsole();
    logging.SetMinimumLevel(LogLevel.Debug);

    // Filter to DotCompute only
    logging.AddFilter("DotCompute", LogLevel.Trace);
});

Log Output Example

[Trace] DotCompute.Core.KernelExecutionService: Discovering kernel 'VectorAdd'
[Debug] DotCompute.Core.KernelExecutionService: Backend selection: DataSize=4000000, Intensity=Low
[Debug] DotCompute.Core.KernelExecutionService: Selected backend: CPU (rule: small data)
[Trace] DotCompute.Backends.CPU.CpuAccelerator: Compiling kernel 'VectorAdd' (SIMD=AVX2)
[Debug] DotCompute.Memory.UnifiedMemoryManager: Allocated 4.00 MB from pool (hit rate: 92.3%)
[Info] DotCompute.Core.KernelExecutionService: Executed 'VectorAdd' in 2.34ms

Custom Diagnostics

public class CustomDiagnostics
{
    private readonly ILogger<CustomDiagnostics> _logger;
    private readonly IServiceProvider _services;

    public CustomDiagnostics(ILogger<CustomDiagnostics> logger, IServiceProvider services)
    {
        _logger = logger;
        _services = services;
    }

    // GetService<T>() resolves from the injected provider; ProfileKernel and ValidateKernel
    // are helper methods (not shown) that wrap the debug service calls from earlier sections.
    private T GetService<T>() where T : notnull => _services.GetRequiredService<T>();

    public async Task DiagnoseKernel(string kernelName, object parameters)
    {
        _logger.LogInformation("=== Diagnostics for {Kernel} ===", kernelName);

        // 1. Check kernel exists
        var registry = GetService<IKernelRegistry>();
        var metadata = registry.GetKernel(kernelName);
        if (metadata == null)
        {
            _logger.LogError("❌ Kernel not found: {Kernel}", kernelName);
            return;
        }

        _logger.LogInformation("✅ Kernel found: {Namespace}.{Type}.{Method}",
            metadata.Namespace, metadata.DeclaringType, metadata.Name);

        // 2. Check backend availability
        var manager = GetService<IAcceleratorManager>();
        var availableBackends = manager.GetAvailableBackends();
        _logger.LogInformation("Available backends: {Backends}",
            string.Join(", ", availableBackends));

        // 3. Profile execution
        var profile = await ProfileKernel(kernelName, parameters);
        _logger.LogInformation("Average time: {Time:F2}μs", profile.AverageTime.TotalMicroseconds);

        // 4. Validate correctness
        var validation = await ValidateKernel(kernelName, parameters);
        if (validation.IsValid)
        {
            _logger.LogInformation("✅ Validation passed");
        }
        else
        {
            _logger.LogWarning("⚠️ Validation failed: {Count} differences",
                validation.Differences.Count);
        }

        _logger.LogInformation("=== Diagnostics complete ===");
    }
}

Best Practices

✅ Do

  1. Enable debug validation in development - Catches issues early
  2. Use cross-backend validation - Most reliable correctness check
  3. Test determinism for critical kernels - Avoid subtle bugs
  4. Profile before and after optimization - Verify improvements
  5. Use golden reference tests - Prevent regressions
  6. Log diagnostic information - Helps troubleshoot production issues

❌ Don't

  1. Don't disable validation in tests - May miss correctness issues
  2. Don't ignore analyzer warnings - DC001-DC012 catch real problems
  3. Don't assume GPU is correct - Validate against CPU
  4. Don't skip stress testing - Catches intermittent issues
  5. Don't forget to dispose buffers - Causes memory leaks

Troubleshooting Checklist

When a kernel misbehaves, work through this checklist (a combined sketch using the debug service follows the list):

  • [ ] Enable debug validation
  • [ ] Run cross-backend validation
  • [ ] Check for NaN/Inf in results
  • [ ] Test determinism (run 100 times)
  • [ ] Profile performance (check for anomalies)
  • [ ] Analyze memory access patterns
  • [ ] Check for race conditions
  • [ ] Verify bounds checking
  • [ ] Test with small, known inputs
  • [ ] Review analyzer warnings (DC001-DC012)
  • [ ] Check memory usage (no leaks)
  • [ ] Compare CPU vs GPU results
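
Much of the checklist can be driven from code. The sketch below chains the debug-service calls shown earlier in this guide; treat the exact signatures as illustrative:

async Task RunChecklistAsync(IKernelDebugService debugService, string kernelName, object parameters)
{
    // Cross-backend validation (GPU vs CPU reference)
    var validation = await debugService.ValidateCrossBackendAsync(
        kernelName, parameters, AcceleratorType.CUDA, AcceleratorType.CPU);
    Console.WriteLine($"Validation: {(validation.IsValid ? "PASSED" : "FAILED")}");

    // Determinism over repeated runs
    var determinism = await debugService.TestDeterminismAsync(
        kernelName, parameters, AcceleratorType.CUDA, runs: 100);
    Console.WriteLine($"Deterministic: {determinism.IsDeterministic}");

    // Timing profile to spot anomalies
    var profile = await debugService.ProfileKernelAsync(
        kernelName, parameters, AcceleratorType.CUDA, iterations: 1000);
    Console.WriteLine($"Average time: {profile.AverageTime.TotalMicroseconds:F2}μs");
}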

Debug Early • Validate Often • Trust But Verify