Table of Contents

Kernel Development Guide

Status: โœ… Production Ready | Test Coverage: 100% | Last Updated: November 2025

This guide covers best practices for writing efficient, correct, and maintainable compute kernels with DotCompute.

๐Ÿ“š Kernel Development Workflow

flowchart LR
    A[1๏ธโƒฃ Define Kernel<br/>with [Kernel] attribute] --> B[2๏ธโƒฃ Write Logic<br/>with bounds checking]
    B --> C[3๏ธโƒฃ Build Project<br/>source generator runs]
    C --> D{Analyzer<br/>Warnings?}
    D -->|Yes| E[๐Ÿ”ง Fix Issues<br/>DC001-DC012]
    D -->|No| F[4๏ธโƒฃ Write Tests<br/>unit & integration]
    E --> B
    F --> G[5๏ธโƒฃ Run Tests<br/>all backends]
    G --> H{Tests<br/>Pass?}
    H -->|No| I[๐Ÿ› Debug<br/>cross-backend validation]
    H -->|Yes| J[6๏ธโƒฃ Profile<br/>performance]
    I --> B
    J --> K{Performance<br/>Acceptable?}
    K -->|No| L[โšก Optimize<br/>memory access & branching]
    K -->|Yes| M[โœ… Deploy<br/>production ready!]
    L --> B

    style A fill:#c8e6c9
    style M fill:#c8e6c9
    style D fill:#fff9c4
    style H fill:#fff9c4
    style K fill:#fff9c4
    style E fill:#ffccbc
    style I fill:#ffccbc
    style L fill:#ffccbc

๐Ÿ“– Kernel Basics

๐Ÿ” Anatomy of a Kernel

[Kernel]                                    // 1. Kernel attribute
public static void MyKernel(                // 2. Must be static, return void
    ReadOnlySpan<float> input,              // 3. Input parameters (ReadOnlySpan)
    Span<float> output,                     // 4. Output parameters (Span)
    int length)                             // 5. Scalar parameters
{
    int idx = Kernel.ThreadId.X;            // 6. Thread indexing

    if (idx < length)                       // 7. Bounds checking
    {
        output[idx] = input[idx] * 2.0f;    // 8. Kernel logic
    }
}

Kernel Requirements

Must Have:

  • [Kernel] attribute
  • static modifier
  • void return type
  • At least one parameter
  • Bounds checking
  • Kernel.ThreadId for indexing

Cannot Have:

  • async modifier
  • Instance methods
  • ref/out parameters (except Span<T>)
  • LINQ expressions
  • Reflection
  • Non-inlineable method calls

๐Ÿงต Thread Indexing

1D Indexing (Most Common)

[Kernel]
public static void Process1D(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X;

    if (idx < output.Length)
    {
        output[idx] = input[idx] * 2;
    }
}

Use Cases: Vector operations, element-wise operations, 1D arrays

2D Indexing

[Kernel]
public static void Process2D(
    ReadOnlySpan<float> input,
    Span<float> output,
    int width,
    int height)
{
    int col = Kernel.ThreadId.X;
    int row = Kernel.ThreadId.Y;

    if (col < width && row < height)
    {
        int idx = row * width + col;
        output[idx] = input[idx] * 2;
    }
}

Use Cases: Image processing, matrix operations, 2D grids

3D Indexing

[Kernel]
public static void Process3D(
    ReadOnlySpan<float> input,
    Span<float> output,
    int width,
    int height,
    int depth)
{
    int x = Kernel.ThreadId.X;
    int y = Kernel.ThreadId.Y;
    int z = Kernel.ThreadId.Z;

    if (x < width && y < height && z < depth)
    {
        int idx = z * (width * height) + y * width + x;
        output[idx] = input[idx] * 2;
    }
}

Use Cases: Volumetric data, 3D simulations, video processing

๐Ÿ“ Parameter Types

Input Parameters: ReadOnlySpan

[Kernel]
public static void Example(
    ReadOnlySpan<float> input,    // โœ… Recommended
    ReadOnlySpan<int> indices,    // โœ… Multiple inputs OK
    Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)
    {
        output[idx] = input[indices[idx]];
    }
}

Benefits:

  • Compiler enforces read-only semantics
  • Zero-copy on CPU
  • Clear intent in API

Output Parameters: Span

[Kernel]
public static void Example(
    ReadOnlySpan<float> input,
    Span<float> output1,          // โœ… Multiple outputs OK
    Span<float> output2)
{
    int idx = Kernel.ThreadId.X;
    if (idx < input.Length)
    {
        output1[idx] = input[idx] * 2;
        output2[idx] = input[idx] + 1;
    }
}

Benefits:

  • Clear output semantics
  • Allows multiple outputs
  • Efficient memory access

Scalar Parameters

[Kernel]
public static void Scale(
    ReadOnlySpan<float> input,
    Span<float> output,
    float scaleFactor,           // โœ… Scalar parameter
    int offset)                  // โœ… Multiple scalars OK
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)
    {
        output[idx] = input[idx] * scaleFactor + offset;
    }
}

Supported Scalar Types:

  • int, long, short, byte
  • float, double
  • bool
  • uint, ulong, ushort, sbyte

๐Ÿ›ก๏ธ Bounds Checking

Why Bounds Checking Matters

// โŒ BAD: No bounds check (crashes on out-of-bounds)
[Kernel]
public static void Unsafe(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    output[idx] = input[idx] * 2; // May access beyond array bounds!
}

// โœ… GOOD: Always check bounds
[Kernel]
public static void Safe(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)  // Prevents out-of-bounds access
    {
        output[idx] = input[idx] * 2;
    }
}

Analyzer Warning: DC009 will warn about missing bounds checks

Bounds Checking Patterns

Pattern 1: Single Array

if (idx < array.Length)
{
    array[idx] = value;
}

Pattern 2: Multiple Arrays (Same Size)

if (idx < Math.Min(input.Length, output.Length))
{
    output[idx] = input[idx] * 2;
}

Pattern 3: 2D Bounds

if (col < width && row < height)
{
    int idx = row * width + col;
    array[idx] = value;
}

Pattern 4: Early Return

int idx = Kernel.ThreadId.X;
if (idx >= array.Length)
    return;  // Thread does nothing if out of bounds

array[idx] = value;

๐ŸŽฏ Common Kernel Patterns

1. Vector Addition

[Kernel]
public static void VectorAdd(
    ReadOnlySpan<float> a,
    ReadOnlySpan<float> b,
    Span<float> result)
{
    int idx = Kernel.ThreadId.X;
    if (idx < result.Length)
    {
        result[idx] = a[idx] + b[idx];
    }
}

Performance: Memory-bound, ~12 GB/s on PCIe 4.0

2. Scalar Multiply

[Kernel]
public static void ScalarMultiply(
    ReadOnlySpan<float> input,
    Span<float> output,
    float scalar)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)
    {
        output[idx] = input[idx] * scalar;
    }
}

Performance: Memory-bound, same as vector add

3. Dot Product (Reduction)

[Kernel]
public static void DotProduct(
    ReadOnlySpan<float> a,
    ReadOnlySpan<float> b,
    Span<float> partialSums,  // One per thread block
    int n)
{
    int idx = Kernel.ThreadId.X;

    // Each thread computes partial sum
    float sum = 0;
    for (int i = idx; i < n; i += Kernel.GridDim.X * Kernel.BlockDim.X)
    {
        sum += a[i] * b[i];
    }

    // Store partial sum (final reduction happens on CPU)
    int blockIdx = Kernel.BlockId.X;
    partialSums[blockIdx] = sum;
}

Note: Full parallel reduction requires additional synchronization not shown here

4. Matrix Multiplication

[Kernel]
public static void MatrixMultiply(
    ReadOnlySpan<float> a,
    ReadOnlySpan<float> b,
    Span<float> c,
    int M, int N, int K)  // A: Mร—K, B: Kร—N, C: Mร—N
{
    int row = Kernel.ThreadId.Y;
    int col = Kernel.ThreadId.X;

    if (row < M && col < N)
    {
        float sum = 0;
        for (int k = 0; k < K; k++)
        {
            sum += a[row * K + k] * b[k * N + col];
        }
        c[row * N + col] = sum;
    }
}

Performance: Compute-bound, benefits greatly from GPU

5. Image Blur (Convolution)

[Kernel]
public static void GaussianBlur(
    ReadOnlySpan<float> input,
    Span<float> output,
    int width,
    int height)
{
    int x = Kernel.ThreadId.X;
    int y = Kernel.ThreadId.Y;

    if (x >= 1 && x < width - 1 && y >= 1 && y < height - 1)
    {
        // 3x3 Gaussian kernel
        float sum = 0;
        sum += input[(y-1) * width + (x-1)] * 0.0625f;
        sum += input[(y-1) * width +  x   ] * 0.125f;
        sum += input[(y-1) * width + (x+1)] * 0.0625f;
        sum += input[ y    * width + (x-1)] * 0.125f;
        sum += input[ y    * width +  x   ] * 0.25f;
        sum += input[ y    * width + (x+1)] * 0.125f;
        sum += input[(y+1) * width + (x-1)] * 0.0625f;
        sum += input[(y+1) * width +  x   ] * 0.125f;
        sum += input[(y+1) * width + (x+1)] * 0.0625f;

        output[y * width + x] = sum;
    }
}

Performance: Memory-bound with spatial locality

โšก Performance Optimization

graph TD
    A[Slow Kernel Performance?] --> B{Profile<br/>Performance}
    B --> C[Identify Bottleneck]
    C --> D{Memory<br/>Bound?}
    D -->|Yes| E[Optimize Memory Access<br/>- Sequential patterns<br/>- Data reuse<br/>- Reduce transfers]
    D -->|No| F{Compute<br/>Bound?}
    F -->|Yes| G[Optimize Computation<br/>- Use float vs double<br/>- Loop unrolling<br/>- Vectorization]
    F -->|No| H{Branch<br/>Heavy?}
    H -->|Yes| I[Reduce Branching<br/>- Branch-free math<br/>- Predication<br/>- Minimize divergence]
    H -->|No| J{Low<br/>Occupancy?}
    J -->|Yes| K[Increase Parallelism<br/>- Larger work size<br/>- Multiple outputs<br/>- Kernel fusion]
    J -->|No| L[Backend-Specific<br/>Optimization Needed]

    E --> M[Measure Again]
    G --> M
    I --> M
    K --> M
    L --> M
    M --> N{Better?}
    N -->|No| B
    N -->|Yes| O[โœ… Done!]

    style A fill:#ffccbc
    style O fill:#c8e6c9
    style D fill:#fff9c4
    style F fill:#fff9c4
    style H fill:#fff9c4
    style J fill:#fff9c4
    style N fill:#fff9c4

1. Memory Access Patterns

Sequential Access (Fast):

// โœ… Good: Sequential, cache-friendly
for (int i = idx; i < n; i += stride)
{
    output[i] = input[i] * 2;
}

Random Access (Slow):

// โŒ Bad: Random access, cache-unfriendly
for (int i = idx; i < n; i += stride)
{
    output[i] = input[indices[i]] * 2;  // Random lookup
}

Strided Access (Medium):

// โš ๏ธ OK: Strided but predictable
for (int i = idx; i < n; i += stride)
{
    output[i] = input[i * 4] * 2;  // Every 4th element
}

2. Data Reuse

Without Reuse (reads input[i] three times):

output[i] = input[i] + input[i] * input[i];

With Reuse (reads input[i] once):

float value = input[i];
output[i] = value + value * value;

3. Avoid Branching

Branchy Code (bad for GPU):

if (input[idx] > threshold)
{
    output[idx] = input[idx] * 2;
}
else
{
    output[idx] = input[idx] / 2;
}

Branch-Free (better for GPU):

float value = input[idx];
float isGreater = value > threshold ? 1.0f : 0.0f;
output[idx] = value * (2.0f * isGreater + 0.5f * (1.0f - isGreater));

Or Use Math:

float value = input[idx];
float multiplier = (value > threshold) ? 2.0f : 0.5f;
output[idx] = value * multiplier;

4. Loop Unrolling

Regular Loop:

for (int i = 0; i < 4; i++)
{
    sum += input[idx * 4 + i];
}

Unrolled Loop (compiler may do this automatically):

sum += input[idx * 4 + 0];
sum += input[idx * 4 + 1];
sum += input[idx * 4 + 2];
sum += input[idx * 4 + 3];

5. Use Appropriate Precision

Single Precision (faster on most GPUs):

float result = input[idx] * 2.0f;  // โœ… Fast

Double Precision (slower, use only when needed):

double result = input[idx] * 2.0;  // โš ๏ธ 2-8x slower on GPU

๐Ÿงช Testing Kernels

Unit Testing

[Fact]
public async Task VectorAdd_ProducesCorrectResults()
{
    // Arrange
    var orchestrator = CreateOrchestrator();
    var a = new float[] { 1, 2, 3, 4, 5 };
    var b = new float[] { 10, 20, 30, 40, 50 };
    var result = new float[5];

    // Act
    await orchestrator.ExecuteKernelAsync(
        "VectorAdd",
        new { a, b, result }
    );

    // Assert
    Assert.Equal(new float[] { 11, 22, 33, 44, 55 }, result);
}

Cross-Backend Validation

[Fact]
public async Task VectorAdd_CPUandGPU_ProduceSameResults()
{
    var debugService = GetService<IKernelDebugService>();
    var input = GenerateRandomData(10_000);

    var validation = await debugService.ValidateCrossBackendAsync(
        "VectorAdd",
        new { input },
        primaryBackend: AcceleratorType.CUDA,
        referenceBackend: AcceleratorType.CPU
    );

    Assert.True(validation.IsValid);
}

Performance Testing

[Fact]
public async Task VectorAdd_PerformanceIsAcceptable()
{
    var orchestrator = CreateOrchestrator();
    var input = new float[1_000_000];

    var stopwatch = Stopwatch.StartNew();

    for (int i = 0; i < 100; i++)
    {
        await orchestrator.ExecuteKernelAsync("VectorAdd", new { input });
    }

    stopwatch.Stop();

    var avgTime = stopwatch.Elapsed.TotalMilliseconds / 100;
    Assert.True(avgTime < 5.0, $"Average execution time {avgTime}ms exceeds 5ms");
}

๐Ÿ› Debugging Kernels

Enable Debug Validation

#if DEBUG
services.AddProductionDebugging(options =>
{
    options.Profile = DebugProfile.Development;
    options.ValidateAllExecutions = true;
    options.ThrowOnValidationFailure = true;
});
#endif

Common Issues

Issue 1: Wrong Results on GPU

Symptom: GPU produces different results than CPU

Debug:

var validation = await debugService.ValidateCrossBackendAsync(
    "MyKernel",
    parameters,
    AcceleratorType.CUDA,
    AcceleratorType.CPU
);

if (!validation.IsValid)
{
    foreach (var diff in validation.Differences)
    {
        Console.WriteLine($"Index {diff.Index}: GPU={diff.PrimaryValue}, CPU={diff.ReferenceValue}");
    }
}

Common Causes:

  • Missing bounds check
  • Race condition (multiple threads writing same location)
  • Uninitialized memory
  • Floating-point precision differences

Issue 2: Non-Deterministic Results

Symptom: Same input produces different outputs on different runs

Debug:

var determinism = await debugService.TestDeterminismAsync(
    "MyKernel",
    parameters,
    AcceleratorType.CUDA,
    runs: 100
);

if (!determinism.IsDeterministic)
{
    Console.WriteLine($"Cause: {determinism.Cause}");
}

Common Causes:

  • Race conditions
  • Unordered reduction operations
  • Floating-point rounding (accumulation order matters)

Issue 3: Slow Performance

Debug:

var profile = await debugService.ProfileKernelAsync(
    "MyKernel",
    parameters,
    AcceleratorType.CUDA,
    iterations: 1000
);

Console.WriteLine($"Average: {profile.AverageTime.TotalMicroseconds}ฮผs");
Console.WriteLine($"Std dev: {profile.StandardDeviation.TotalMicroseconds}ฮผs");
Console.WriteLine($"GFLOPS: {profile.GFLOPS:F2}");

Common Causes:

  • Memory-bound (low compute intensity)
  • Poor memory access pattern
  • Too many branches
  • Insufficient parallelism

๐Ÿ“‹ Best Practices Summary

โœ… Do

  1. Always check bounds: Prevents crashes and undefined behavior
  2. Use ReadOnlySpan<T> for inputs: Clear semantics, zero-copy on CPU
  3. Reuse variables: Reduces memory traffic
  4. Keep kernels simple: Complex logic is hard to optimize
  5. Test cross-backend: Ensures correctness on all platforms
  6. Profile before optimizing: Know where the bottleneck is

โŒ Don't

  1. Don't use LINQ: Not supported in kernels
  2. Don't use reflection: Not supported, not AOT-compatible
  3. Don't call complex methods: May not inline properly
  4. Don't ignore analyzer warnings: DC001-DC012 catch real issues
  5. Don't optimize prematurely: Profile first
  6. Don't forget async/await: Kernel execution is asynchronous

๐Ÿ“š Further Reading


Write Efficient Kernels โ€ข Test Thoroughly โ€ข Profile Before Optimizing