[Kernel] Attribute Reference

Complete reference for the [Kernel] attribute and kernel programming model in DotCompute.

Overview

The [Kernel] attribute marks C# methods for GPU compilation. DotCompute's source generator automatically translates these methods to backend-specific code (CUDA C, Metal Shading Language, OpenCL C, or SIMD-optimized C#).

Basic Syntax

[Kernel]
public static void KernelName(
    ReadOnlySpan<T> input,
    Span<T> output)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)
    {
        output[idx] = Process(input[idx]);
    }
}

Kernel Requirements

Method Signature

✅ Valid:

[Kernel]
public static void VectorAdd(ReadOnlySpan<float> a, Span<float> b)
{
    // GPU-compatible code
}

❌ Invalid:

[Kernel]
public void InstanceMethod() { } // Must be static

[Kernel]
public static int ReturnValue() { } // Must return void

[Kernel]
public static async Task AsyncMethod() { } // Cannot be async

Parameter Types

Supported:

Span<T> - Writable buffer
ReadOnlySpan<T> - Read-only buffer
Primitive types: int, float, double, long, bool
Value types (structs with primitive fields)

Not Supported:

Reference types (classes, strings)
Delegates, lambda expressions
out or ref parameters
Pointers (except in CUDA-specific code)

Thread Model

Thread IDs

DotCompute provides a unified threading model across all backends:

[Kernel]
public static void Process2DData(Span<float> data, int width, int height)
{
    int x = Kernel.ThreadId.X; // Column index
    int y = Kernel.ThreadId.Y; // Row index
    int z = Kernel.ThreadId.Z; // Depth index (for 3D)

    if (x < width && y < height)
    {
        int index = y * width + x;
        data[index] = x + y;
    }
}

Grid and Block Dimensions

// Thread counts
int gridSizeX = Kernel.GridDim.X;    // Number of thread blocks in X
int gridSizeY = Kernel.GridDim.Y;    // Number of thread blocks in Y
int blockSizeX = Kernel.BlockDim.X;  // Threads per block in X
int blockSizeY = Kernel.BlockDim.Y;  // Threads per block in Y

// Block indices
int blockX = Kernel.BlockId.X;
int blockY = Kernel.BlockId.Y;

// Global thread index
int globalX = Kernel.BlockId.X * Kernel.BlockDim.X + Kernel.ThreadIdx.X;
int globalY = Kernel.BlockId.Y * Kernel.BlockDim.Y + Kernel.ThreadIdx.Y;

Memory Access Patterns

Coalesced Memory Access (Optimal)

[Kernel]
public static void CoalescedAccess(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    // Sequential access - optimal for GPU
    output[idx] = input[idx] * 2.0f;
}

Strided Access (Suboptimal)

[Kernel]
public static void StridedAccess(ReadOnlySpan<float> input, Span<float> output, int stride)
{
    int idx = Kernel.ThreadId.X;
    // Strided access - may cause bank conflicts
    output[idx * stride] = input[idx * stride];
}

Shared Memory (CUDA/Metal)

[Kernel]
public static void SharedMemoryExample(
    ReadOnlySpan<float> input,
    Span<float> output)
{
    // Declare shared memory
    var shared = Kernel.AllocateShared<float>(256);

    int tid = Kernel.ThreadIdx.X;
    int bid = Kernel.BlockId.X;
    int idx = bid * Kernel.BlockDim.X + tid;

    // Load to shared memory
    shared[tid] = input[idx];

    // Synchronize threads in block
    Kernel.Barrier();

    // Use shared data
    output[idx] = shared[tid] + shared[(tid + 1) % 256];
}

Synchronization

Thread Barrier

Kernel.Barrier(); // Synchronize all threads in a thread block

Use Cases:

After writing to shared memory
Before reading data written by other threads
Coordinating multi-phase algorithms

Important: Only synchronizes threads within the same thread block, not across blocks.

Atomic Operations

For thread-safe updates to shared data:

[Kernel]
public static void HistogramAtomic(ReadOnlySpan<int> data, Span<int> histogram)
{
    int idx = Kernel.ThreadId.X;
    if (idx < data.Length)
    {
        int bin = data[idx];
        Kernel.AtomicAdd(ref histogram[bin], 1);
    }
}

Supported Atomic Operations:

Kernel.AtomicAdd(ref target, value)
Kernel.AtomicSub(ref target, value)
Kernel.AtomicMin(ref target, value)
Kernel.AtomicMax(ref target, value)
Kernel.AtomicExchange(ref target, value)
Kernel.AtomicCompareExchange(ref target, compare, value)

Math Functions

Built-in Math

DotCompute translates C# math to GPU-optimized intrinsics:

[Kernel]
public static void MathOperations(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    float x = input[idx];

    // Trigonometric
    output[idx] = MathF.Sin(x);       // → sinf() on CUDA
    output[idx] = MathF.Cos(x);       // → cosf() on CUDA
    output[idx] = MathF.Tan(x);       // → tanf() on CUDA

    // Power and exponential
    output[idx] = MathF.Sqrt(x);      // → sqrtf() on CUDA
    output[idx] = MathF.Pow(x, 2);    // → powf() on CUDA
    output[idx] = MathF.Exp(x);       // → expf() on CUDA
    output[idx] = MathF.Log(x);       // → logf() on CUDA

    // Min/Max
    output[idx] = MathF.Min(x, 1.0f); // → fminf() on CUDA
    output[idx] = MathF.Max(x, 0.0f); // → fmaxf() on CUDA

    // Absolute value
    output[idx] = MathF.Abs(x);       // → fabsf() on CUDA
}

Control Flow

Supported Constructs

[Kernel]
public static void ControlFlow(Span<int> data)
{
    int idx = Kernel.ThreadId.X;

    // If statements
    if (idx % 2 == 0)
    {
        data[idx] = 0;
    }
    else
    {
        data[idx] = 1;
    }

    // For loops
    for (int i = 0; i < 10; i++)
    {
        data[idx] += i;
    }

    // While loops
    int count = 0;
    while (count < 5)
    {
        data[idx]++;
        count++;
    }

    // Switch statements
    switch (data[idx] % 3)
    {
        case 0: data[idx] = 10; break;
        case 1: data[idx] = 20; break;
        case 2: data[idx] = 30; break;
    }
}

Unsupported Constructs

❌ Not Allowed:

Recursion
Dynamic memory allocation (new, stackalloc)
Exception handling (try/catch/finally)
LINQ queries
Async/await
Virtual method calls
Interface calls

Optimization Attributes

Memory Coalescing Hint

[Kernel(CoalescedAccess = true)]
public static void OptimizedKernel(ReadOnlySpan<float> data)
{
    // Compiler applies memory coalescing optimizations
}

Register Pressure Control

[Kernel(MaxRegisters = 32)]
public static void RegisterOptimized(Span<float> data)
{
    // Limits register usage to increase occupancy
}

Shared Memory Size

[Kernel(SharedMemoryBytes = 4096)]
public static void SharedMemoryKernel(Span<float> data)
{
    var shared = Kernel.AllocateShared<float>(1024); // 4KB
    // ...
}

Type Conversions

[Kernel]
public static void TypeConversions(Span<float> floats, Span<int> ints)
{
    int idx = Kernel.ThreadId.X;

    // Explicit conversions
    floats[idx] = (float)ints[idx];
    ints[idx] = (int)floats[idx];

    // Reinterpret casting (bitwise)
    int bitPattern = Kernel.ReinterpretCast<float, int>(floats[idx]);
}

Debugging Tips

Bounds Checking

Always include bounds checks:

[Kernel]
public static void SafeKernel(Span<float> data)
{
    int idx = Kernel.ThreadId.X;
    if (idx < data.Length) // Essential bounds check
    {
        data[idx] = ProcessValue(data[idx]);
    }
}

Cross-Backend Validation

// Enable debugging to validate GPU results
var debugService = new KernelDebugService();
var results = await debugService.CompareBackendsAsync(
    kernel,
    arguments,
    cpuAccelerator,
    cudaAccelerator
);

if (!results.OutputsMatch)
{
    Console.WriteLine($"GPU bug detected! Max error: {results.MaxAbsoluteDifference}");
}

Performance Best Practices

1. Minimize Divergence

✅ Good:

if (Kernel.ThreadId.X < threshold)  // All threads in warp take same path

❌ Bad:

if (data[Kernel.ThreadId.X] > threshold)  // Threads diverge

2. Coalesce Memory Access

✅ Good:

int idx = Kernel.ThreadId.X;
output[idx] = input[idx]; // Sequential, coalesced

❌ Bad:

int idx = Kernel.ThreadId.X;
output[idx] = input[idx * 7]; // Strided, uncoalesced

3. Use Shared Memory for Repeated Access

✅ Good:

var shared = Kernel.AllocateShared<float>(256);
shared[Kernel.ThreadIdx.X] = input[globalIdx];
Kernel.Barrier();
// Access shared multiple times

❌ Bad:

// Access global memory repeatedly
for (int i = 0; i < 10; i++)
{
    sum += input[globalIdx];
}

4. Minimize Atomic Operations

✅ Better:

// Use local accumulation, then single atomic update
float local = 0;
for (int i = 0; i < 100; i++)
{
    local += data[i];
}
Kernel.AtomicAdd(ref total, local); // One atomic op

❌ Worse:

for (int i = 0; i < 100; i++)
{
    Kernel.AtomicAdd(ref total, data[i]); // 100 atomic ops
}

Backend-Specific Attributes

CUDA-Specific

[Kernel]
[CudaMaxThreadsPerBlock(512)]
[CudaMinBlocksPerMultiprocessor(2)]
public static void CudaOptimized(Span<float> data)
{
    // CUDA-specific optimizations
}

Metal-Specific

[Kernel]
[MetalThreadExecutionWidth(32)]
[MetalMaxTotalThreadsPerThreadgroup(1024)]
public static void MetalOptimized(Span<float> data)
{
    // Metal-specific optimizations
}

Advanced: Inline MSL/CUDA

For maximum control, embed native code:

[Kernel]
[NativeCode(Backend.CUDA, @"
    __global__ void custom_kernel(float* data) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        data[idx] = __sinf(data[idx]);
    }
")]
public static void CustomKernel(Span<float> data)
{
    // Fallback C# implementation for CPU
    int idx = Kernel.ThreadId.X;
    data[idx] = MathF.Sin(data[idx]);
}

Table of Contents

[Kernel] Attribute Reference

Overview

Basic Syntax

Kernel Requirements

Method Signature

Parameter Types

Thread Model

Thread IDs

Grid and Block Dimensions

Memory Access Patterns

Coalesced Memory Access (Optimal)

Strided Access (Suboptimal)

Shared Memory (CUDA/Metal)

Synchronization

Thread Barrier

Atomic Operations

Math Functions

Built-in Math

Control Flow

Supported Constructs

Unsupported Constructs

Optimization Attributes

Memory Coalescing Hint

Register Pressure Control

Shared Memory Size

Type Conversions

Debugging Tips

Bounds Checking

Cross-Backend Validation

Performance Best Practices

1. Minimize Divergence

2. Coalesce Memory Access

3. Use Shared Memory for Repeated Access

4. Minimize Atomic Operations

Backend-Specific Attributes

CUDA-Specific

Metal-Specific

Advanced: Inline MSL/CUDA

See Also