Your First Kernel

This module guides you through writing and executing your first GPU kernel using DotCompute's [Kernel] attribute.

What is a Kernel?

A kernel is a function that executes in parallel across many GPU threads. Each thread:

  • Runs the same code
  • Has a unique thread ID
  • Processes different data elements
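
If you have only written sequential code, the plain C# snippet below (no DotCompute involved) may help with the mental model: the kernel body is essentially the body of a loop, and the loop variable becomes the thread ID.

// Sequential version: a single loop walks every element.
float[] a = { 1, 2, 3 }, b = { 10, 20, 30 };
float[] result = new float[3];
for (int i = 0; i < result.Length; i++)
{
    // On the GPU this body becomes the kernel, and `i` becomes
    // Kernel.ThreadId.X: every iteration runs as its own thread.
    result[i] = a[i] + b[i];
}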

Writing a Vector Addition Kernel

Step 1: Define the Kernel

Create a static method inside a partial class and mark it with the [Kernel] attribute:

using DotCompute.Generators.Kernel.Attributes;

public static partial class MyKernels
{
    [Kernel]
    public static void VectorAdd(
        ReadOnlySpan<float> a,
        ReadOnlySpan<float> b,
        Span<float> result)
    {
        // Get this thread's index
        int idx = Kernel.ThreadId.X;

        // Bounds check (critical for correctness)
        if (idx < result.Length)
        {
            result[idx] = a[idx] + b[idx];
        }
    }
}

Key elements:

  • [Kernel]: marks the method for GPU compilation
  • partial class: enables source generation
  • ReadOnlySpan<T>: input buffer (read-only on the GPU)
  • Span<T>: output buffer (read-write on the GPU)
  • Kernel.ThreadId.X: the current thread's X-dimension index
  • Bounds check: prevents out-of-bounds access

Step 2: Set Up DotCompute Runtime

using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using DotCompute.Runtime;
using DotCompute.Abstractions.Interfaces;

// Build the host with DotCompute services
var host = Host.CreateApplicationBuilder(args);
host.Services.AddDotComputeRuntime();  // Registers all necessary services
var app = host.Build();

// Get the orchestrator from DI
var orchestrator = app.Services.GetRequiredService<IComputeOrchestrator>();

Step 3: Prepare Data

const int size = 10_000;

// Create input arrays
float[] a = Enumerable.Range(0, size).Select(i => (float)i).ToArray();
float[] b = Enumerable.Range(0, size).Select(i => (float)i * 2).ToArray();
float[] result = new float[size];

Step 4: Execute the Kernel

// Execute kernel with automatic backend selection
await orchestrator.ExecuteKernelAsync(
    kernelName: "VectorAdd",
    args: new object[] { a, b, result }
);

The orchestrator automatically:

  • Selects the best available backend (GPU or CPU)
  • Handles data transfers to/from the device
  • Manages thread configuration

Step 5: Verify Results

// Verify correctness
bool correct = true;
for (int i = 0; i < 5; i++)  // Check first 5 elements
{
    float expected = a[i] + b[i];
    Console.WriteLine($"result[{i}] = {result[i]} (expected: {expected})");
    if (Math.Abs(result[i] - expected) > 0.001f)
    {
        correct = false;
    }
}

Console.WriteLine(correct ? "Results verified!" : "Verification failed!");

Complete Example

using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using DotCompute.Runtime;
using DotCompute.Abstractions.Interfaces;
using DotCompute.Generators.Kernel.Attributes;

// Define kernel
public static partial class MyKernels
{
    [Kernel]
    public static void VectorAdd(
        ReadOnlySpan<float> a,
        ReadOnlySpan<float> b,
        Span<float> result)
    {
        int idx = Kernel.ThreadId.X;
        if (idx < result.Length)
        {
            result[idx] = a[idx] + b[idx];
        }
    }
}

// Main program
class Program
{
    static async Task Main(string[] args)
    {
        // Setup DotCompute
        var host = Host.CreateApplicationBuilder(args);
        host.Services.AddDotComputeRuntime();
        var app = host.Build();

        var orchestrator = app.Services.GetRequiredService<IComputeOrchestrator>();

        // Prepare data
        const int size = 10_000;
        float[] a = Enumerable.Range(0, size).Select(i => (float)i).ToArray();
        float[] b = Enumerable.Range(0, size).Select(i => (float)i * 2).ToArray();
        float[] result = new float[size];

        // Execute kernel
        await orchestrator.ExecuteKernelAsync(
            kernelName: "VectorAdd",
            args: new object[] { a, b, result }
        );

        // Display results
        Console.WriteLine($"Result[0] = {result[0]}");      // 0 + 0 = 0
        Console.WriteLine($"Result[100] = {result[100]}");  // 100 + 200 = 300
        Console.WriteLine($"Result[999] = {result[999]}");  // 999 + 1998 = 2997
    }
}

Understanding Thread Configuration

Automatic Thread Management

DotCompute automatically configures thread dimensions based on your data size. For advanced control, you can use the [Kernel] attribute options:

[Kernel(
    GridDimensions = new[] { 16 },      // Number of thread blocks
    BlockDimensions = new[] { 256 }     // Threads per block
)]
public static void CustomConfigKernel(...)
{
    // ...
}

Grid and Block Dimensions

Total threads = GridSize × BlockSize

For 10,000 elements with BlockSize=256:

  • GridSize = ceil(10000/256) = 40 blocks
  • Total threads = 40 × 256 = 10,240 threads

The extra 240 threads fall outside the array, so the bounds check ensures they simply do nothing.
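
If you want to reproduce that arithmetic yourself, the snippet below shows the standard rounding-up idiom. It is plain C#, not a DotCompute API; the runtime normally computes this for you.

// ceil(size / blockSize) using integer arithmetic
const int size = 10_000;
const int blockSize = 256;
int gridSize = (size + blockSize - 1) / blockSize;   // 40 blocks
int totalThreads = gridSize * blockSize;             // 10,240 threads
int idleThreads = totalThreads - size;               // 240 threads skipped by the bounds check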

Choosing Block Size

  • 32 threads: minimum useful size (one warp)
  • 128 threads: memory-bound kernels
  • 256 threads: general purpose (recommended)
  • 512 threads: compute-bound kernels
  • 1024 threads: maximum (device dependent)

Specifying a Backend

You can explicitly request a specific backend:

// Force CUDA execution
await orchestrator.ExecuteAsync<object>(
    "VectorAdd",
    "CUDA",  // Preferred backend
    a, b, result
);

// Force CPU execution (useful for debugging)
await orchestrator.ExecuteAsync<object>(
    "VectorAdd",
    "CPU",
    a, b, result
);

Common Mistakes

1. Missing Bounds Check

// WRONG: Will crash or corrupt memory
result[idx] = a[idx] + b[idx];

// CORRECT: Always check bounds
if (idx < result.Length)
{
    result[idx] = a[idx] + b[idx];
}

2. Forgetting AddDotComputeRuntime()

// WRONG: Missing service registration
var services = new ServiceCollection();
var provider = services.BuildServiceProvider();
var orchestrator = provider.GetRequiredService<IComputeOrchestrator>(); // Will throw!

// CORRECT: Register DotCompute services before building the provider
var services = new ServiceCollection();
services.AddDotComputeRuntime();
var provider = services.BuildServiceProvider();
var orchestrator = provider.GetRequiredService<IComputeOrchestrator>();

3. Using Wrong Interface

// WRONG: IComputeService doesn't exist
var computeService = provider.GetRequiredService<IComputeService>();

// CORRECT: Use IComputeOrchestrator
var orchestrator = provider.GetRequiredService<IComputeOrchestrator>();

Exercises

Exercise 1: Element-wise Multiplication

Modify the kernel to multiply instead of add:

[Kernel]
public static void VectorMultiply(
    ReadOnlySpan<float> a,
    ReadOnlySpan<float> b,
    Span<float> result)
{
    int idx = Kernel.ThreadId.X;
    if (idx < result.Length)
    {
        result[idx] = a[idx] * b[idx];
    }
}

Exercise 2: Scalar Addition

Add a scalar value to each element:

[Kernel]
public static void ScalarAdd(
    ReadOnlySpan<float> input,
    float scalar,
    Span<float> result)
{
    int idx = Kernel.ThreadId.X;
    if (idx < result.Length)
    {
        result[idx] = input[idx] + scalar;
    }
}

Exercise 3: Benchmark

Compare GPU vs CPU performance for different data sizes:

  • 1,000 elements
  • 100,000 elements
  • 10,000,000 elements

At what size does the GPU become faster?
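
A minimal timing harness for this exercise might look like the sketch below. It assumes the ExecuteAsync overload from "Specifying a Backend" and the orchestrator set up in Step 2; backend names ("CPU", "CUDA") and timings will vary with your hardware, and the warm-up call keeps one-time kernel compilation out of the measurement.

using System.Diagnostics;

// Rough timing harness for Exercise 3. Assumes `orchestrator` is already
// resolved from DI as shown in Step 2.
async Task<double> TimeKernelAsync(IComputeOrchestrator orchestrator, string backend, int size)
{
    float[] a = Enumerable.Range(0, size).Select(i => (float)i).ToArray();
    float[] b = Enumerable.Range(0, size).Select(i => (float)i * 2).ToArray();
    float[] result = new float[size];

    // Warm-up run so kernel compilation and first-use costs are not measured
    await orchestrator.ExecuteAsync<object>("VectorAdd", backend, a, b, result);

    var sw = Stopwatch.StartNew();
    await orchestrator.ExecuteAsync<object>("VectorAdd", backend, a, b, result);
    sw.Stop();
    return sw.Elapsed.TotalMilliseconds;
}

foreach (int size in new[] { 1_000, 100_000, 10_000_000 })
{
    double cpuMs = await TimeKernelAsync(orchestrator, "CPU", size);
    double gpuMs = await TimeKernelAsync(orchestrator, "CUDA", size);
    Console.WriteLine($"{size,12:N0} elements: CPU {cpuMs:F2} ms, GPU {gpuMs:F2} ms");
}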

Key Takeaways

  1. Kernels execute in parallel across thousands of threads
  2. Always include bounds checks to prevent memory errors
  3. Use AddDotComputeRuntime() to register all necessary services
  4. Use IComputeOrchestrator for kernel execution
  5. Thread configuration is automatic but can be customized

Next Module

Memory Fundamentals →

Learn about GPU memory hierarchy and efficient data transfers.