[Kernel] Attribute Reference

Complete reference for the [Kernel] attribute and kernel programming model in DotCompute.

Overview

The [Kernel] attribute marks C# methods for GPU compilation. DotCompute's source generator automatically translates these methods to backend-specific code (CUDA C, Metal Shading Language, OpenCL C, or SIMD-optimized C#).

Basic Syntax

[Kernel]
public static void KernelName(
    ReadOnlySpan<T> input,
    Span<T> output)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)
    {
        output[idx] = Process(input[idx]); // Process stands in for your per-element logic
    }
}

Kernel Requirements

Method Signature

Valid:

[Kernel]
public static void VectorAdd(ReadOnlySpan<float> a, Span<float> b)
{
    // GPU-compatible code
}

Invalid:

[Kernel]
public void InstanceMethod() { } // Must be static

[Kernel]
public static int ReturnValue() { return 0; } // Must return void

[Kernel]
public static async Task AsyncMethod() { } // Cannot be async

Parameter Types

Supported:

  • Span<T> - Writable buffer
  • ReadOnlySpan<T> - Read-only buffer
  • Primitive types: int, float, double, long, bool
  • Value types (structs with primitive fields)
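
Value types can be passed by value as kernel parameters. A minimal sketch, assuming a caller-defined struct (TransformParams is illustrative, not part of DotCompute):

[Kernel]
public static void ScaleAndOffset(Span<float> data, TransformParams p)
{
    int idx = Kernel.ThreadId.X;
    if (idx < data.Length)
    {
        data[idx] = data[idx] * p.Scale + p.Offset;
    }
}

// A value type with primitive fields only
public struct TransformParams
{
    public float Scale;
    public float Offset;
}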

Not Supported:

  • Reference types (classes, strings)
  • Delegates, lambda expressions
  • out or ref parameters
  • Pointers (except in CUDA-specific code)

Thread Model

Thread IDs

DotCompute provides a unified threading model across all backends:

[Kernel]
public static void Process2DData(Span<float> data, int width, int height)
{
    int x = Kernel.ThreadId.X; // Column index
    int y = Kernel.ThreadId.Y; // Row index
    int z = Kernel.ThreadId.Z; // Depth index (for 3D)

    if (x < width && y < height)
    {
        int index = y * width + x;
        data[index] = x + y;
    }
}

Grid and Block Dimensions

// Grid and block dimensions
int gridSizeX = Kernel.GridDim.X;    // Number of thread blocks in X
int gridSizeY = Kernel.GridDim.Y;    // Number of thread blocks in Y
int blockSizeX = Kernel.BlockDim.X;  // Threads per block in X
int blockSizeY = Kernel.BlockDim.Y;  // Threads per block in Y

// Block indices
int blockX = Kernel.BlockId.X;
int blockY = Kernel.BlockId.Y;

// Global thread index
int globalX = Kernel.BlockId.X * Kernel.BlockDim.X + Kernel.ThreadId.X;
int globalY = Kernel.BlockId.Y * Kernel.BlockDim.Y + Kernel.ThreadId.Y;
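
When a launch schedules fewer threads than there are elements, these values compose into a grid-stride loop. A minimal sketch using only the indices documented above:

[Kernel]
public static void GridStrideScale(ReadOnlySpan<float> input, Span<float> output)
{
    int start = Kernel.BlockId.X * Kernel.BlockDim.X + Kernel.ThreadId.X;
    int stride = Kernel.GridDim.X * Kernel.BlockDim.X; // Total threads in the grid

    // Each thread handles every stride-th element
    for (int i = start; i < output.Length; i += stride)
    {
        output[i] = input[i] * 2.0f;
    }
}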

Memory Access Patterns

Coalesced Memory Access (Optimal)

[Kernel]
public static void CoalescedAccess(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    // Sequential access - optimal for GPU
    output[idx] = input[idx] * 2.0f;
}

Strided Access (Suboptimal)

[Kernel]
public static void StridedAccess(ReadOnlySpan<float> input, Span<float> output, int stride)
{
    int idx = Kernel.ThreadId.X;
    // Strided access - uncoalesced on GPUs, reducing memory throughput
    output[idx * stride] = input[idx * stride];
}

Shared Memory (CUDA/Metal)

[Kernel]
public static void SharedMemoryExample(
    ReadOnlySpan<float> input,
    Span<float> output)
{
    // Declare shared memory
    var shared = Kernel.AllocateShared<float>(256);

    int tid = Kernel.ThreadId.X; // Index within the current block
    int bid = Kernel.BlockId.X;
    int idx = bid * Kernel.BlockDim.X + tid;

    // Load to shared memory (assumes input.Length is a multiple of the block size)
    shared[tid] = input[idx];

    // Synchronize threads in block
    Kernel.Barrier();

    // Use shared data
    output[idx] = shared[tid] + shared[(tid + 1) % 256];
}

Synchronization

Thread Barrier

Kernel.Barrier(); // Synchronize all threads in a thread block

Use Cases:

  • After writing to shared memory
  • Before reading data written by other threads
  • Coordinating multi-phase algorithms

Important: Only synchronizes threads within the same thread block, not across blocks.
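
As a concrete illustration of the last use case, here is a sketch of a block-level sum reduction that alternates compute phases with barriers. It assumes a block size of 256; BlockSum and blockSums are illustrative names, with one output slot per block:

[Kernel]
public static void BlockSum(ReadOnlySpan<float> input, Span<float> blockSums)
{
    var shared = Kernel.AllocateShared<float>(256);

    int tid = Kernel.ThreadId.X; // Index within the block
    int idx = Kernel.BlockId.X * Kernel.BlockDim.X + tid;

    // Phase 1: each thread loads one element (0 if out of range)
    shared[tid] = idx < input.Length ? input[idx] : 0.0f;
    Kernel.Barrier();

    // Phase 2: halve the active range each step; barrier between steps
    for (int s = 128; s > 0; s /= 2)
    {
        if (tid < s)
        {
            shared[tid] += shared[tid + s];
        }
        Kernel.Barrier();
    }

    // Thread 0 publishes this block's partial sum
    if (tid == 0)
    {
        blockSums[Kernel.BlockId.X] = shared[0];
    }
}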

Atomic Operations

For thread-safe updates to shared data:

[Kernel]
public static void HistogramAtomic(ReadOnlySpan<int> data, Span<int> histogram)
{
    int idx = Kernel.ThreadId.X;
    if (idx < data.Length)
    {
        int bin = data[idx];
        Kernel.AtomicAdd(ref histogram[bin], 1);
    }
}

Supported Atomic Operations:

  • Kernel.AtomicAdd(ref target, value)
  • Kernel.AtomicSub(ref target, value)
  • Kernel.AtomicMin(ref target, value)
  • Kernel.AtomicMax(ref target, value)
  • Kernel.AtomicExchange(ref target, value)
  • Kernel.AtomicCompareExchange(ref target, compare, value)
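
For example, a global maximum reduces to a single call per thread. A sketch assuming result[0] is initialized to int.MinValue before launch:

[Kernel]
public static void GlobalMax(ReadOnlySpan<int> data, Span<int> result)
{
    int idx = Kernel.ThreadId.X;
    if (idx < data.Length)
    {
        // Thread-safe: many threads may update result[0] concurrently
        Kernel.AtomicMax(ref result[0], data[idx]);
    }
}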

Math Functions

Built-in Math

DotCompute translates C# math to GPU-optimized intrinsics:

[Kernel]
public static void MathOperations(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    float x = input[idx];

    // Trigonometric
    output[idx] = MathF.Sin(x);       // → sinf() on CUDA
    output[idx] = MathF.Cos(x);       // → cosf() on CUDA
    output[idx] = MathF.Tan(x);       // → tanf() on CUDA

    // Power and exponential
    output[idx] = MathF.Sqrt(x);      // → sqrtf() on CUDA
    output[idx] = MathF.Pow(x, 2);    // → powf() on CUDA
    output[idx] = MathF.Exp(x);       // → expf() on CUDA
    output[idx] = MathF.Log(x);       // → logf() on CUDA

    // Min/Max
    output[idx] = MathF.Min(x, 1.0f); // → fminf() on CUDA
    output[idx] = MathF.Max(x, 0.0f); // → fmaxf() on CUDA

    // Absolute value
    output[idx] = MathF.Abs(x);       // → fabsf() on CUDA
}

Control Flow

Supported Constructs

[Kernel]
public static void ControlFlow(Span<int> data)
{
    int idx = Kernel.ThreadId.X;

    // If statements
    if (idx % 2 == 0)
    {
        data[idx] = 0;
    }
    else
    {
        data[idx] = 1;
    }

    // For loops
    for (int i = 0; i < 10; i++)
    {
        data[idx] += i;
    }

    // While loops
    int count = 0;
    while (count < 5)
    {
        data[idx]++;
        count++;
    }

    // Switch statements
    switch (data[idx] % 3)
    {
        case 0: data[idx] = 10; break;
        case 1: data[idx] = 20; break;
        case 2: data[idx] = 30; break;
    }
}

Unsupported Constructs

Not Allowed:

  • Recursion
  • Dynamic memory allocation (new, stackalloc)
  • Exception handling (try/catch/finally)
  • LINQ queries
  • Async/await
  • Virtual method calls
  • Interface calls
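
Most of these constructs have kernel-friendly rewrites. Recursion, for instance, can usually be unrolled into a loop; a sketch computing Fibonacci numbers iteratively:

[Kernel]
public static void Fibonacci(Span<long> results)
{
    int idx = Kernel.ThreadId.X;
    if (idx < results.Length)
    {
        // Iterative form of the recursive definition fib(n) = fib(n-1) + fib(n-2)
        long a = 0;
        long b = 1;
        for (int i = 0; i < idx; i++)
        {
            long next = a + b;
            a = b;
            b = next;
        }
        results[idx] = a; // fib(idx)
    }
}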

Optimization Attributes

Memory Coalescing Hint

[Kernel(CoalescedAccess = true)]
public static void OptimizedKernel(ReadOnlySpan<float> data)
{
    // Compiler applies memory coalescing optimizations
}

Register Pressure Control

[Kernel(MaxRegisters = 32)]
public static void RegisterOptimized(Span<float> data)
{
    // Limits register usage to increase occupancy
}

Shared Memory Size

[Kernel(SharedMemoryBytes = 4096)]
public static void SharedMemoryKernel(Span<float> data)
{
    var shared = Kernel.AllocateShared<float>(1024); // 4KB
    // ...
}

Type Conversions

[Kernel]
public static void TypeConversions(Span<float> floats, Span<int> ints)
{
    int idx = Kernel.ThreadId.X;

    // Explicit conversions
    floats[idx] = (float)ints[idx];
    ints[idx] = (int)floats[idx];

    // Reinterpret casting (bitwise)
    int bitPattern = Kernel.ReinterpretCast<float, int>(floats[idx]);
}

Debugging Tips

Bounds Checking

Always include bounds checks:

[Kernel]
public static void SafeKernel(Span<float> data)
{
    int idx = Kernel.ThreadId.X;
    if (idx < data.Length) // Essential bounds check
    {
        data[idx] = ProcessValue(data[idx]);
    }
}

Cross-Backend Validation

// Enable debugging to validate GPU results
var debugService = new KernelDebugService();
var results = await debugService.CompareBackendsAsync(
    kernel,
    arguments,
    cpuAccelerator,
    cudaAccelerator
);

if (!results.OutputsMatch)
{
    Console.WriteLine($"GPU bug detected! Max error: {results.MaxAbsoluteDifference}");
}

Performance Best Practices

1. Minimize Divergence

Good:

if (Kernel.ThreadId.X < threshold)  // All threads in a warp take the same path

Bad:

if (data[Kernel.ThreadId.X] > threshold)  // Threads in the same warp may diverge

2. Coalesce Memory Access

Good:

int idx = Kernel.ThreadId.X;
output[idx] = input[idx]; // Sequential, coalesced

Bad:

int idx = Kernel.ThreadId.X;
output[idx] = input[idx * 7]; // Strided, uncoalesced

3. Use Shared Memory for Repeated Access

Good:

var shared = Kernel.AllocateShared<float>(256);
shared[Kernel.ThreadId.X] = input[globalIdx];
Kernel.Barrier();
// Access shared multiple times

Bad:

// Access global memory repeatedly
float sum = 0;
for (int i = 0; i < 10; i++)
{
    sum += input[globalIdx]; // Re-reads global memory each iteration
}

4. Minimize Atomic Operations

Better:

// Use local accumulation, then single atomic update
float local = 0;
for (int i = 0; i < 100; i++)
{
    local += data[i];
}
Kernel.AtomicAdd(ref total, local); // One atomic op

Worse:

for (int i = 0; i < 100; i++)
{
    Kernel.AtomicAdd(ref total, data[i]); // 100 atomic ops
}

Backend-Specific Attributes

CUDA-Specific

[Kernel]
[CudaMaxThreadsPerBlock(512)]
[CudaMinBlocksPerMultiprocessor(2)]
public static void CudaOptimized(Span<float> data)
{
    // CUDA-specific optimizations
}

Metal-Specific

[Kernel]
[MetalThreadExecutionWidth(32)]
[MetalMaxTotalThreadsPerThreadgroup(1024)]
public static void MetalOptimized(Span<float> data)
{
    // Metal-specific optimizations
}

Advanced: Inline MSL/CUDA

For maximum control, embed native code:

[Kernel]
[NativeCode(Backend.CUDA, @"
    __global__ void custom_kernel(float* data) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        data[idx] = __sinf(data[idx]);
    }
")]
public static void CustomKernel(Span<float> data)
{
    // Fallback C# implementation for CPU
    int idx = Kernel.ThreadId.X;
    data[idx] = MathF.Sin(data[idx]);
}

Next: Learn about performance optimization strategies to maximize GPU utilization.