Debugging Guide

This guide provides practical techniques for debugging compute kernels, validating correctness, and troubleshooting common issues.

Enabling Debug Mode

Development Setup

Enable comprehensive debugging during development using logging and performance monitoring:

using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;
using DotCompute.Runtime;

var host = Host.CreateApplicationBuilder(args);

// Configure detailed logging for debugging
host.Services.AddLogging(logging =>
{
    logging.AddConsole();
    logging.SetMinimumLevel(LogLevel.Debug);
    logging.AddFilter("DotCompute", LogLevel.Trace);  // Verbose DotCompute logging
});

// Add DotCompute services with performance monitoring
host.Services.AddDotComputeRuntime();
host.Services.AddPerformanceMonitoring();  // Enable metrics collection

var app = host.Build();

Behavior:

Detailed logging of kernel execution
Performance metrics collection
Memory usage tracking
Helps identify issues during development

Testing Configuration

For CI/CD environments, use appropriate logging levels:

host.Services.AddLogging(logging =>
{
    logging.AddConsole();
    logging.SetMinimumLevel(LogLevel.Information);  // Less verbose for CI
});

host.Services.AddDotComputeRuntime();

Behavior:

Standard logging level for test runs
Captures important events and errors
Suitable for automated testing

Production Configuration

Use minimal logging in production:

host.Services.AddLogging(logging =>
{
    logging.AddConsole();
    logging.SetMinimumLevel(LogLevel.Warning);  // Only warnings and errors
});

host.Services.AddDotComputeRuntime();

Behavior:

Only logs warnings and errors
Minimal overhead
Safe for production use

Cross-Backend Validation

Validate GPU Against CPU

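Cross-backend validation is the most powerful debugging technique in this guide. It runs the same kernel on two backends and compares the outputs element by element, treating the CPU result as the trusted reference: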

var debugService = services.GetRequiredService<IKernelDebugService>();

var validation = await debugService.ValidateCrossBackendAsync(
    kernelName: "MyKernel",
    parameters: new { input, output },
    primaryBackend: AcceleratorType.CUDA,    // GPU implementation
    referenceBackend: AcceleratorType.CPU     // Trusted reference
);

if (!validation.IsValid)
{
    Console.WriteLine($"❌ Validation FAILED");
    Console.WriteLine($"Found {validation.Differences.Count} differences");
    Console.WriteLine($"Severity: {validation.Severity}");
    Console.WriteLine($"Recommendation: {validation.Recommendation}");

    // Print first 10 differences
    foreach (var diff in validation.Differences.Take(10))
    {
        Console.WriteLine(
            $"  Index {diff.Index}: " +
            $"GPU={diff.PrimaryValue:F6}, " +
            $"CPU={diff.ReferenceValue:F6}, " +
            $"Error={diff.RelativeError:E2}"
        );
    }
}
else
{
    Console.WriteLine($"✅ Validation PASSED");
    Console.WriteLine($"GPU speedup: {validation.Speedup:F2}x");
}

Understanding Validation Results

Valid Result (all differences within tolerance):

✅ Validation PASSED
GPU speedup: 47.32x
No differences found (tolerance: 1e-5)

Invalid Result (differences exceed tolerance):

❌ Validation FAILED
Found 127 differences
Severity: Medium
Recommendation: Check for race conditions in parallel sections

First 10 differences:
  Index 42: GPU=3.141593, CPU=3.141592, Error=3.18e-07
  Index 108: GPU=2.718282, CPU=2.718281, Error=3.68e-07
  ...

Tolerance Thresholds

// Strict (default for testing)
options.ToleranceThreshold = 1e-5;  // 0.001% relative error

// Lenient (for accumulating operations)
options.ToleranceThreshold = 1e-3;  // 0.1% relative error

// Very lenient (for known precision issues)
options.ToleranceThreshold = 1e-2;  // 1% relative error

Rule of Thumb:

Simple operations (add, multiply): 1e-5
Accumulating operations (sum, dot product): 1e-3
Transcendental functions (sin, exp, log): 1e-4
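
To make the thresholds above concrete, here is a minimal sketch of a relative-error comparison. It is an illustration only: ToleranceCheck, RelativeError, and FindDifferences are hypothetical names, and the validation service may compute its error metric differently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of DotCompute.
public static class ToleranceCheck
{
    // Relative error of `actual` against `reference`. Falls back to the
    // absolute error when the reference is zero to avoid dividing by zero.
    public static double RelativeError(double actual, double reference)
    {
        double diff = Math.Abs(actual - reference);
        return reference == 0.0 ? diff : diff / Math.Abs(reference);
    }

    // Returns the indices whose relative error exceeds `tolerance`.
    // The negated comparison means NaN elements are always reported.
    public static List<int> FindDifferences(
        IReadOnlyList<float> actual,
        IReadOnlyList<float> reference,
        double tolerance)
    {
        if (actual.Count != reference.Count)
            throw new ArgumentException("Inputs must have the same length.");

        var differences = new List<int>();
        for (int i = 0; i < actual.Count; i++)
        {
            if (!(RelativeError(actual[i], reference[i]) <= tolerance))
                differences.Add(i);
        }
        return differences;
    }
}
```

With this definition, a GPU value of 3.141593 against a CPU value of 3.141592 has a relative error of about 3.18e-07, which passes even the strict 1e-5 threshold.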

Determinism Testing

Check for Non-Deterministic Results

var determinism = await debugService.TestDeterminismAsync(
    kernelName: "MyKernel",
    parameters: new { input, output },
    backend: AcceleratorType.CUDA,
    runs: 100  // Run 100 times with same input
);

if (!determinism.IsDeterministic)
{
    Console.WriteLine($"⚠️ Kernel is NON-DETERMINISTIC!");
    Console.WriteLine($"Found {determinism.Violations.Count} violations");
    Console.WriteLine($"Likely cause: {determinism.Cause}");

    // Show some violations
    foreach (var violation in determinism.Violations.Take(5))
    {
        Console.WriteLine(
            $"  Run {violation.RunIndex}, " +
            $"Index {violation.ElementIndex}: " +
            $"Expected {violation.ExpectedValue}, " +
            $"Got {violation.ActualValue}"
        );
    }
}
else
{
    Console.WriteLine("✅ Kernel is deterministic");
}
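
Conceptually, a determinism test runs the same computation repeatedly and compares every output with the first run. The sketch below shows that idea on plain arrays. DeterminismCheck and IsDeterministic are hypothetical names, and the real service may compare results differently (for example, with a tolerance rather than bit-for-bit).

```csharp
using System;

// Hypothetical helper, not part of DotCompute.
public static class DeterminismCheck
{
    // Runs `kernel` `runs` times and compares every output bit-for-bit
    // with the output of the first run.
    public static bool IsDeterministic(Func<float[]> kernel, int runs)
    {
        if (runs < 2) throw new ArgumentOutOfRangeException(nameof(runs));

        float[] expected = kernel();
        for (int run = 1; run < runs; run++)
        {
            float[] actual = kernel();
            if (actual.Length != expected.Length) return false;

            for (int i = 0; i < actual.Length; i++)
            {
                // Comparing raw bits treats identical NaN payloads as equal
                // and distinguishes +0 from -0.
                if (BitConverter.SingleToInt32Bits(actual[i]) !=
                    BitConverter.SingleToInt32Bits(expected[i]))
                    return false;
            }
        }
        return true;
    }
}
```

The delegate would wrap whatever executes the kernel and copies the result back to the host; the inputs must be reset to the same values before each run.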

Common Non-Determinism Causes

1. Race Conditions:

// ❌ Race condition: Multiple threads writing same location
[Kernel]
public static void HasRaceCondition(Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    output[0] += idx;  // Race! All threads write to output[0]
}

// ✅ Fixed: Each thread writes unique location
[Kernel]
public static void NoRaceCondition(Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)
    {
        output[idx] += idx;  // Each thread has unique index
    }
}
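
The same hazard can be reproduced on the CPU without any GPU tooling, which is often the quickest way to build intuition. The sketch below is an analogy using Parallel.For, not DotCompute code. Note that it removes the race with an atomic update (Interlocked.Add) rather than the unique-index fix shown above; RaceDemo, RacySum, and AtomicSum are hypothetical names.

```csharp
using System.Threading;
using System.Threading.Tasks;

// CPU analogy of the kernel race above; not DotCompute code.
public static class RaceDemo
{
    // Racy: `total += i` is a read-modify-write that threads can interleave,
    // so updates may be lost and the result can differ between runs.
    public static long RacySum(int n)
    {
        long total = 0;
        Parallel.For(0, n, i => { total += i; });
        return total;
    }

    // Safe: Interlocked.Add performs the read-modify-write atomically.
    public static long AtomicSum(int n)
    {
        long total = 0;
        Parallel.For(0, n, i => { Interlocked.Add(ref total, i); });
        return total;
    }
}
```

AtomicSum always returns n(n-1)/2, while RacySum may return a smaller value on some runs and the correct value on others, which is exactly the "passes sometimes" behavior a race produces.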

2. Unordered Reduction:

// ❌ Non-deterministic: Floating-point addition is not associative
[Kernel]
public static void UnorderedSum(ReadOnlySpan<float> input, Span<float> partialSums)
{
    int idx = Kernel.ThreadId.X;
    float sum = 0;

    // Each thread sums a strided slice; the final total depends on the order in which
    // the partial sums are combined, and that order varies with thread scheduling
    for (int i = idx; i < input.Length; i += Kernel.GridDim.X)
    {
        sum += input[i];
    }

    partialSums[Kernel.BlockId.X] = sum;
}

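Solution: Combine partial results in a fixed order (for example, a tree reduction with fixed pairing) when bit-exact results are required. Otherwise, use Kahan summation to shrink the rounding error, or accept small non-determinism and compare against a tolerance.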
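
For reference, here is a minimal host-side sketch of Kahan (compensated) summation next to a naive sum. It is plain C# rather than a kernel, and the Summation class name is hypothetical. Kahan summation reduces accumulated rounding error; it does not by itself fix the order in which parallel partial sums are combined.

```csharp
using System;

// Host-side illustration; not a DotCompute kernel.
public static class Summation
{
    // Naive accumulation: small addends can be lost entirely once
    // the running sum becomes large.
    public static float NaiveSum(ReadOnlySpan<float> values)
    {
        float sum = 0f;
        foreach (float v in values) sum += v;
        return sum;
    }

    // Kahan summation: `compensation` carries the low-order bits that
    // were rounded away in the previous addition.
    public static float KahanSum(ReadOnlySpan<float> values)
    {
        float sum = 0f;
        float compensation = 0f;
        foreach (float v in values)
        {
            float y = v - compensation;      // apply the correction to the next addend
            float t = sum + y;               // low-order bits of y may be lost here
            compensation = (t - sum) - y;    // recover what was lost
            sum = t;
        }
        return sum;
    }
}
```

For an input of 1e8 followed by one hundred 1s, the naive sum stays at 1e8, because each added 1 is below the spacing between adjacent floats at that magnitude (8). The compensated sum stays within one such step of the true total, 100000100.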

Common Issues and Solutions

Issue 1: Wrong Results on GPU

Symptoms:

GPU produces different results than expected
Cross-backend validation fails
Results are NaN or Inf

Debug Steps:

Step 1: Validate against CPU

var validation = await debugService.ValidateCrossBackendAsync(
    "MyKernel",
    parameters,
    AcceleratorType.CUDA,
    AcceleratorType.CPU
);

Step 2: Check for common issues

// Check for NaN/Inf
if (result.Any(float.IsNaN))
{
    Console.WriteLine("❌ Result contains NaN");
    // Causes: 0/0, sqrt of a negative number, log of a negative number
}

if (result.Any(float.IsInfinity))
{
    Console.WriteLine("❌ Result contains Infinity");
    // Causes: overflow, division of a nonzero value by zero
}

Step 3: Validate numerical stability

var stability = await debugService.ValidateNumericalStabilityAsync(
    "MyKernel",
    parameters,
    AcceleratorType.CUDA
);

if (!stability.IsStable)
{
    Console.WriteLine($"⚠️ Numerical instability detected");
    Console.WriteLine($"NaN count: {stability.NaNCount}");
    Console.WriteLine($"Inf count: {stability.InfCount}");
    Console.WriteLine($"Overflow count: {stability.OverflowCount}");
}

Common Causes:

Missing bounds check
Race condition
Uninitialized memory
Integer overflow
Division by zero

Issue 2: Slow Performance

Symptoms:

Kernel is slower than expected
GPU slower than CPU
Performance varies widely

Debug Steps:

Step 1: Profile the kernel

var profile = await debugService.ProfileKernelAsync(
    "MyKernel",
    parameters,
    AcceleratorType.CUDA,
    iterations: 1000
);

Console.WriteLine($"Average: {profile.AverageTime.TotalMicroseconds:F2}μs");
Console.WriteLine($"Std dev: {profile.StandardDeviation.TotalMicroseconds:F2}μs");
Console.WriteLine($"Min/Max: {profile.MinTime.TotalMicroseconds:F2}μs / {profile.MaxTime.TotalMicroseconds:F2}μs");

// High std dev indicates variable performance
if (profile.StandardDeviation.TotalMilliseconds > profile.AverageTime.TotalMilliseconds * 0.1)
{
    Console.WriteLine("⚠️ High variability in execution time");
}
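
The 10% check above is a coefficient-of-variation test: standard deviation divided by mean. If only raw timings are available, the same statistic can be computed by hand. This is a sketch with hypothetical names (TimingStats, Summarize, IsHighlyVariable), and it assumes the population standard deviation; the profiler may use the sample form.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of DotCompute.
public static class TimingStats
{
    // Mean and population standard deviation of the samples.
    public static (double Mean, double StdDev) Summarize(IReadOnlyList<double> samples)
    {
        if (samples.Count == 0)
            throw new ArgumentException("At least one sample is required.");

        double mean = 0;
        foreach (double s in samples) mean += s;
        mean /= samples.Count;

        double sumSquares = 0;
        foreach (double s in samples) sumSquares += (s - mean) * (s - mean);

        return (mean, Math.Sqrt(sumSquares / samples.Count));
    }

    // True when the standard deviation exceeds `threshold` times the mean
    // (10% by default, matching the check above).
    public static bool IsHighlyVariable(IReadOnlyList<double> samples, double threshold = 0.1)
    {
        var (mean, stdDev) = Summarize(samples);
        return stdDev > mean * threshold;
    }
}
```

Timings of 5, 15, 5, 15 have a mean of 10 and a standard deviation of 5, so they are flagged; four identical timings are not.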

Step 2: Analyze memory patterns

var memoryReport = await debugService.AnalyzeMemoryPatternsAsync(
    "MyKernel",
    parameters,
    AcceleratorType.CUDA
);

Console.WriteLine($"Sequential access: {memoryReport.SequentialAccessRate:P1}");
Console.WriteLine($"Cache hit rate: {memoryReport.CacheHitRate:P1}");
Console.WriteLine($"Bandwidth utilization: {memoryReport.BandwidthUtilization:P1}");

foreach (var suggestion in memoryReport.Suggestions)
{
    Console.WriteLine($"💡 {suggestion}");
}

Step 3: Compare backends

var cpuTime = await BenchmarkBackend(AcceleratorType.CPU);
var gpuTime = await BenchmarkBackend(AcceleratorType.CUDA);

Console.WriteLine($"CPU: {cpuTime:F2}ms");
Console.WriteLine($"GPU: {gpuTime:F2}ms");

if (gpuTime > cpuTime)
{
    Console.WriteLine("⚠️ GPU is slower than CPU!");
    Console.WriteLine("Possible causes:");
    Console.WriteLine("  - Data too small (< 10,000 elements)");
    Console.WriteLine("  - Memory-bound operation");
    Console.WriteLine("  - Transfer overhead dominates");
}
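
BenchmarkBackend is not defined in this guide. One plausible shape for it is a small timing helper that wraps the kernel launch in a delegate, as sketched below. Bench and AverageMillisecondsAsync are hypothetical names; the delegate should include host-to-device and device-to-host transfers if the goal is to see whether transfer overhead dominates.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical timing helper, not part of DotCompute.
public static class Bench
{
    // Returns the average wall-clock time per iteration in milliseconds.
    public static async Task<double> AverageMillisecondsAsync(
        Func<Task> run, int iterations = 10, int warmup = 1)
    {
        if (iterations <= 0) throw new ArgumentOutOfRangeException(nameof(iterations));

        // Warm-up runs keep one-time costs (such as kernel compilation)
        // out of the measurement.
        for (int i = 0; i < warmup; i++) await run();

        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) await run();
        stopwatch.Stop();

        return stopwatch.Elapsed.TotalMilliseconds / iterations;
    }
}
```

Calling it once per backend with the same input gives two comparable numbers. Averaging over several iterations after a warm-up matters most for the GPU, where the first launch is typically the slowest.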

Common Causes:

Poor memory access pattern
Too many branches
Low parallelism
Small data size
Transfer overhead

Issue 3: Intermittent Failures

Symptoms:

Kernel passes sometimes, fails other times
Non-deterministic results
Hard to reproduce

Debug Steps:

Step 1: Test determinism

var determinism = await debugService.TestDeterminismAsync(
    "MyKernel",
    parameters,
    AcceleratorType.CUDA,
    runs: 100
);

if (!determinism.IsDeterministic)
{
    Console.WriteLine($"❌ Non-deterministic (cause: {determinism.Cause})");
}

Step 2: Stress test

var stressTest = await debugService.StressTestKernelAsync(
    "MyKernel",
    inputGenerator: new RandomInputGenerator(),
    backend: AcceleratorType.CUDA,
    iterations: 10_000
);

Console.WriteLine($"Success rate: {stressTest.SuccessRate:P1}");
Console.WriteLine($"Failures: {stressTest.FailureCount}");

if (stressTest.FailureCount > 0)
{
    Console.WriteLine("Sample failures:");
    foreach (var failure in stressTest.Failures.Take(5))
    {
        Console.WriteLine($"  Input: {failure.Input}");
        Console.WriteLine($"  Error: {failure.Error}");
    }
}
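
Stress-test failures are only useful if the failing input can be regenerated. A seeded generator makes that possible: log the seed, then rebuild the same input later. The sketch below is a stand-alone illustration; SeededInputGenerator is a hypothetical class, and how it would plug into the inputGenerator parameter above is not specified here.

```csharp
using System;

// Hypothetical generator, not part of DotCompute.
public sealed class SeededInputGenerator
{
    private readonly Random _random;

    // Log this value alongside any failure so the input can be recreated.
    public int Seed { get; }

    public SeededInputGenerator(int seed)
    {
        Seed = seed;
        _random = new Random(seed);
    }

    // Returns a new array of pseudo-random values between `min` and `max`.
    public float[] Next(int length, float min = -1f, float max = 1f)
    {
        var data = new float[length];
        for (int i = 0; i < length; i++)
            data[i] = min + (float)_random.NextDouble() * (max - min);
        return data;
    }
}
```

Two generators created with the same seed produce the same sequence of arrays when run on the same .NET version, so a failure at iteration k can be replayed by regenerating k inputs from the logged seed.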

Step 3: Detect race conditions

var raceReport = await debugService.DetectRaceConditionsAsync(
    "MyKernel",
    parameters,
    AcceleratorType.CUDA,
    concurrentExecutions: 100
);

if (raceReport.HasRaceConditions)
{
    Console.WriteLine($"❌ Race conditions detected");
    Console.WriteLine($"Conflicts: {raceReport.ConflictCount}");

    foreach (var conflict in raceReport.Conflicts.Take(5))
    {
        Console.WriteLine($"  Location: {conflict.MemoryLocation}");
        Console.WriteLine($"  Threads: {string.Join(", ", conflict.ConflictingThreads)}");
    }
}

Common Causes:

Race conditions
Unordered reduction
Thread-unsafe operations
Shared memory conflicts

Issue 4: Out of Memory

Symptoms:

OutOfMemoryException thrown
Kernel fails to allocate buffers
System becomes unresponsive

Debug Steps:

Step 1: Check memory usage

var memoryStats = memoryManager.GetStatistics();

// Divide by a floating-point value so the MB figures keep their fractional part
Console.WriteLine($"Total allocated: {memoryStats.TotalAllocated / (1024.0 * 1024.0):F2} MB");
Console.WriteLine($"Total pooled: {memoryStats.TotalPooled / (1024.0 * 1024.0):F2} MB");
Console.WriteLine($"Active buffers: {memoryStats.ActiveBuffers}");
Console.WriteLine($"Peak usage: {memoryStats.PeakUsage / (1024.0 * 1024.0):F2} MB");
Console.WriteLine($"Pool hit rate: {memoryStats.HitRate:P1}");

Step 2: Check GPU memory

var accelerator = await acceleratorManager.GetOrCreateAcceleratorAsync(AcceleratorType.CUDA);
var deviceStats = accelerator.GetMemoryStatistics();

// Divide by a floating-point value so the MB figures keep their fractional part
Console.WriteLine($"Total GPU memory: {deviceStats.TotalMemory / (1024.0 * 1024.0):F2} MB");
Console.WriteLine($"Used GPU memory: {deviceStats.UsedMemory / (1024.0 * 1024.0):F2} MB");
Console.WriteLine($"Free GPU memory: {deviceStats.FreeMemory / (1024.0 * 1024.0):F2} MB");

if (deviceStats.FreeMemory < 100 * 1024 * 1024)  // < 100 MB
{
    Console.WriteLine("⚠️ Low GPU memory!");
}

Step 3: Find memory leaks

// Record the active buffer count before the run
var initialActiveBuffers = memoryManager.GetStatistics().ActiveBuffers;

// Run kernel
await orchestrator.ExecuteKernelAsync("MyKernel", parameters);

// Force GC
GC.Collect();
GC.WaitForPendingFinalizers();

var finalActiveBuffers = memoryManager.GetStatistics().ActiveBuffers;

if (finalActiveBuffers > initialActiveBuffers)
{
    Console.WriteLine($"⚠️ Memory leak detected!");
    Console.WriteLine($"Leaked buffers: {finalActiveBuffers - initialActiveBuffers}");
}

Solutions:

Use using statements for buffers
Return buffers to pool
Reduce batch size
Use streaming for large data

Debugging Tools

Print Debugging (CPU Only)

[Kernel]
public static void DebugPrint(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X;

    // Only works on CPU backend
    if (idx < 10)  // Print first 10 threads
    {
        Console.WriteLine($"Thread {idx}: input={input[idx]}");
    }

    if (idx < output.Length)
    {
        output[idx] = input[idx] * 2;
    }
}

// Force CPU execution for debugging
await orchestrator.ExecuteKernelAsync(
    "DebugPrint",
    parameters,
    forceBackend: AcceleratorType.CPU
);

Note: Console.WriteLine only works on the CPU backend, so force CPU execution whenever a kernel contains print statements.

Golden Reference Testing

// Create known-good output
var goldenOutput = ComputeExpectedOutput(input);

// Test kernel against golden reference
var validation = await debugService.ValidateAgainstGoldenAsync(
    "MyKernel",
    parameters: new { input },
    expectedOutput: goldenOutput,
    backend: AcceleratorType.CUDA
);

if (!validation.IsValid)
{
    Console.WriteLine($"❌ Failed to match golden reference");
    Console.WriteLine($"Differences: {validation.Differences.Count}");
}

Regression Testing

[Fact]
public async Task MyKernel_ProducesSameResultsAsPreviousVersion()
{
    // Load results from previous version
    var previousResults = LoadPreviousResults("v0.1.0");

    // Execute current version
    var currentResults = await orchestrator.ExecuteKernelAsync(
        "MyKernel",
        parameters
    );

    // Compare
    Assert.Equal(previousResults, currentResults);
}

IDE Integration

Visual Studio

Diagnostic Warnings:

DC001-DC012 diagnostics show as error squiggles
Hover for quick explanation
Click lightbulb for automated fixes

Debugging:

Set breakpoints in kernel code (CPU only)
Step through execution
Watch variables
Call stack shows kernel invocation

VS Code

C# Dev Kit Extension:

code --install-extension ms-dotnettools.csdevkit

Features:

Same diagnostics as Visual Studio
Quick fixes via lightbulb
IntelliSense for generated code

Logging and Diagnostics

Enable Detailed Logging

services.AddLogging(logging =>
{
    logging.AddConsole();
    logging.SetMinimumLevel(LogLevel.Debug);

    // Trace-level output for DotCompute categories; all other categories stay at Debug
    logging.AddFilter("DotCompute", LogLevel.Trace);
});

Log Output Example

[Trace] DotCompute.Core.KernelExecutionService: Discovering kernel 'VectorAdd'
[Debug] DotCompute.Core.KernelExecutionService: Backend selection: DataSize=4000000, Intensity=Low
[Debug] DotCompute.Core.KernelExecutionService: Selected backend: CPU (rule: small data)
[Trace] DotCompute.Backends.CPU.CpuAccelerator: Compiling kernel 'VectorAdd' (SIMD=AVX2)
[Debug] DotCompute.Memory.UnifiedMemoryManager: Allocated 4.00 MB from pool (hit rate: 92.3%)
[Info] DotCompute.Core.KernelExecutionService: Executed 'VectorAdd' in 2.34ms

Custom Diagnostics

public class CustomDiagnostics
{
    private readonly ILogger<CustomDiagnostics> _logger;
    private readonly IKernelRegistry _registry;
    private readonly IAcceleratorManager _manager;
    private readonly IKernelDebugService _debugService;

    // Dependencies are supplied by the DI container
    public CustomDiagnostics(
        ILogger<CustomDiagnostics> logger,
        IKernelRegistry registry,
        IAcceleratorManager manager,
        IKernelDebugService debugService)
    {
        _logger = logger;
        _registry = registry;
        _manager = manager;
        _debugService = debugService;
    }

    public async Task DiagnoseKernel(string kernelName, object parameters)
    {
        _logger.LogInformation("=== Diagnostics for {Kernel} ===", kernelName);

        // 1. Check kernel exists
        var metadata = _registry.GetKernel(kernelName);
        if (metadata == null)
        {
            _logger.LogError("❌ Kernel not found: {Kernel}", kernelName);
            return;
        }

        _logger.LogInformation("✅ Kernel found: {Namespace}.{Type}.{Method}",
            metadata.Namespace, metadata.DeclaringType, metadata.Name);

        // 2. Check backend availability
        var availableBackends = _manager.GetAvailableBackends();
        _logger.LogInformation("Available backends: {Backends}",
            string.Join(", ", availableBackends));

        // 3. Profile execution (same call as in "Issue 2: Slow Performance")
        var profile = await _debugService.ProfileKernelAsync(
            kernelName, parameters, AcceleratorType.CUDA, iterations: 1000);
        _logger.LogInformation("Average time: {Time:F2}μs", profile.AverageTime.TotalMicroseconds);

        // 4. Validate correctness (same call as in "Validate GPU Against CPU")
        var validation = await _debugService.ValidateCrossBackendAsync(
            kernelName, parameters, AcceleratorType.CUDA, AcceleratorType.CPU);
        if (validation.IsValid)
        {
            _logger.LogInformation("✅ Validation passed");
        }
        else
        {
            _logger.LogWarning("⚠️ Validation failed: {Count} differences",
                validation.Differences.Count);
        }

        _logger.LogInformation("=== Diagnostics complete ===");
    }
}

Best Practices

✅ Do

Enable debug validation in development - Catches issues early
Use cross-backend validation - Most reliable correctness check
Test determinism for critical kernels - Avoid subtle bugs
Profile before and after optimization - Verify improvements
Use golden reference tests - Prevent regressions
Log diagnostic information - Helps troubleshoot production issues

❌ Don't

Don't disable validation in tests - May miss correctness issues
Don't ignore analyzer warnings - DC001-DC012 catch real problems
Don't assume GPU is correct - Validate against CPU
Don't skip stress testing - Catches intermittent issues
Don't forget to dispose buffers - Causes memory leaks

Troubleshooting Checklist

When a kernel misbehaves:

[ ] Enable debug validation
[ ] Run cross-backend validation
[ ] Check for NaN/Inf in results
[ ] Test determinism (run 100 times)
[ ] Profile performance (check for anomalies)
[ ] Analyze memory access patterns
[ ] Check for race conditions
[ ] Verify bounds checking
[ ] Test with small, known inputs
[ ] Review analyzer warnings (DC001-DC012)
[ ] Check memory usage (no leaks)
[ ] Compare CPU vs GPU results
