Your First Kernel
This module guides you through writing and executing your first GPU kernel using DotCompute's [Kernel] attribute.
What is a Kernel?
A kernel is a function that executes in parallel across many GPU threads. Each thread:
- Runs the same code
- Has a unique thread ID
- Processes different data elements
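Conceptually, a kernel replaces a sequential loop: the loop body becomes the kernel body, and the loop counter becomes the thread ID. The sketch below shows the CPU-loop equivalent of the vector-addition kernel you are about to write (an illustration only, not DotCompute API):
// Sequential CPU equivalent of the parallel kernel below (illustration only)
for (int idx = 0; idx < result.Length; idx++)
{
    result[idx] = a[idx] + b[idx]; // on the GPU, each idx runs on its own thread
}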
Writing a Vector Addition Kernel
Step 1: Define the Kernel
Create a static method with the [Kernel] attribute:
using DotCompute.Generators.Kernel.Attributes;
public static partial class MyKernels
{
[Kernel]
public static void VectorAdd(
ReadOnlySpan<float> a,
ReadOnlySpan<float> b,
Span<float> result)
{
// Get this thread's index
int idx = Kernel.ThreadId.X;
// Bounds check (critical for correctness)
if (idx < result.Length)
{
result[idx] = a[idx] + b[idx];
}
}
}
Key elements:
| Element | Purpose |
|---|---|
| `[Kernel]` | Marks the method for GPU compilation |
| `partial class` | Enables source generation |
| `ReadOnlySpan<T>` | Input buffer (read-only on GPU) |
| `Span<T>` | Output buffer (read-write on GPU) |
| `Kernel.ThreadId.X` | Current thread's X-dimension index |
| Bounds check | Prevents out-of-bounds access |
Step 2: Set Up DotCompute Runtime
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using DotCompute.Runtime;
using DotCompute.Abstractions.Interfaces;
// Build the host with DotCompute services
var builder = Host.CreateApplicationBuilder(args);
builder.Services.AddDotComputeRuntime(); // Registers all necessary services
var app = builder.Build();
// Get the orchestrator from DI
var orchestrator = app.Services.GetRequiredService<IComputeOrchestrator>();
Step 3: Prepare Data
const int size = 10_000;
// Create input arrays
float[] a = Enumerable.Range(0, size).Select(i => (float)i).ToArray();
float[] b = Enumerable.Range(0, size).Select(i => (float)i * 2).ToArray();
float[] result = new float[size];
Step 4: Execute the Kernel
// Execute kernel with automatic backend selection
await orchestrator.ExecuteKernelAsync(
kernelName: "VectorAdd",
args: new object[] { a, b, result }
);
The orchestrator automatically:
- Selects the best available backend (GPU or CPU)
- Handles data transfers to/from the device
- Manages thread configuration
Step 5: Verify Results
// Verify correctness
bool correct = true;
for (int i = 0; i < 5; i++) // Check first 5 elements
{
float expected = a[i] + b[i];
Console.WriteLine($"result[{i}] = {result[i]} (expected: {expected})");
if (Math.Abs(result[i] - expected) > 0.001f)
{
correct = false;
}
}
Console.WriteLine(correct ? "Results verified!" : "Verification failed!");
Complete Example
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using DotCompute.Runtime;
using DotCompute.Abstractions.Interfaces;
using DotCompute.Generators.Kernel.Attributes;
// Define kernel
public static partial class MyKernels
{
[Kernel]
public static void VectorAdd(
ReadOnlySpan<float> a,
ReadOnlySpan<float> b,
Span<float> result)
{
int idx = Kernel.ThreadId.X;
if (idx < result.Length)
{
result[idx] = a[idx] + b[idx];
}
}
}
// Main program
class Program
{
static async Task Main(string[] args)
{
// Setup DotCompute
var builder = Host.CreateApplicationBuilder(args);
builder.Services.AddDotComputeRuntime();
var app = builder.Build();
var orchestrator = app.Services.GetRequiredService<IComputeOrchestrator>();
// Prepare data
const int size = 10_000;
float[] a = Enumerable.Range(0, size).Select(i => (float)i).ToArray();
float[] b = Enumerable.Range(0, size).Select(i => (float)i * 2).ToArray();
float[] result = new float[size];
// Execute kernel
await orchestrator.ExecuteKernelAsync(
kernelName: "VectorAdd",
args: new object[] { a, b, result }
);
// Display results
Console.WriteLine($"Result[0] = {result[0]}"); // 0 + 0 = 0
Console.WriteLine($"Result[100] = {result[100]}"); // 100 + 200 = 300
Console.WriteLine($"Result[999] = {result[999]}"); // 999 + 1998 = 2997
}
}
Understanding Thread Configuration
Automatic Thread Management
DotCompute automatically configures thread dimensions based on your data size. For advanced control, you can use the [Kernel] attribute options:
[Kernel(
GridDimensions = new[] { 16 }, // Number of thread blocks
BlockDimensions = new[] { 256 } // Threads per block
)]
public static void CustomConfigKernel(...)
{
// ...
}
Grid and Block Dimensions
Total threads = GridSize × BlockSize
For 10,000 elements with BlockSize=256:
- GridSize = ceil(10000/256) = 40 blocks
- Total threads = 40 × 256 = 10,240 threads
The extra 240 threads fail the bounds check and simply do no work.
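If you configure the dimensions yourself, the standard way to size the grid is integer ceiling division. A minimal sketch (the variable names are illustrative, not part of the DotCompute API):
const int size = 10_000;    // number of elements
const int blockSize = 256;  // threads per block
// Ceiling division: enough blocks to cover every element
int gridSize = (size + blockSize - 1) / blockSize;  // 40
int totalThreads = gridSize * blockSize;            // 10,240
int idleThreads = totalThreads - size;              // 240 (fail the bounds check)
Console.WriteLine($"{gridSize} blocks x {blockSize} threads = {totalThreads} ({idleThreads} idle)");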
Choosing Block Size
| Block Size | Use Case |
|---|---|
| 32 | Minimum (one warp) |
| 128 | Memory-bound kernels |
| 256 | General purpose (recommended) |
| 512 | Compute-bound kernels |
| 1024 | Maximum (device dependent) |
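As a starting point, you can pin the vector-addition kernel to the recommended 256-thread blocks using the attribute options shown earlier. This is a sketch; confirm the value by benchmarking your own workload:
[Kernel(BlockDimensions = new[] { 256 })] // start at 256; try 128 or 512 and measure
public static void VectorAddTuned(
    ReadOnlySpan<float> a,
    ReadOnlySpan<float> b,
    Span<float> result)
{
    int idx = Kernel.ThreadId.X;
    if (idx < result.Length)
    {
        result[idx] = a[idx] + b[idx];
    }
}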
Specifying a Backend
You can explicitly request a specific backend:
// Force CUDA execution
await orchestrator.ExecuteAsync<object>(
"VectorAdd",
"CUDA", // Preferred backend
a, b, result
);
// Force CPU execution (useful for debugging)
await orchestrator.ExecuteAsync<object>(
"VectorAdd",
"CPU",
a, b, result
);
Common Mistakes
1. Missing Bounds Check
// WRONG: Will crash or corrupt memory
result[idx] = a[idx] + b[idx];
// CORRECT: Always check bounds
if (idx < result.Length)
{
result[idx] = a[idx] + b[idx];
}
2. Forgetting AddDotComputeRuntime()
// WRONG: Missing service registration
var services = new ServiceCollection();
var provider = services.BuildServiceProvider();
var orchestrator = provider.GetRequiredService<IComputeOrchestrator>(); // Will throw!
// CORRECT: Register DotCompute services before building the provider
var services = new ServiceCollection();
services.AddDotComputeRuntime();
var provider = services.BuildServiceProvider();
3. Using Wrong Interface
// WRONG: IComputeService doesn't exist
var computeService = provider.GetRequiredService<IComputeService>();
// CORRECT: Use IComputeOrchestrator
var orchestrator = provider.GetRequiredService<IComputeOrchestrator>();
Exercises
Exercise 1: Element-wise Multiplication
Modify the kernel to multiply instead of add:
[Kernel]
public static void VectorMultiply(
ReadOnlySpan<float> a,
ReadOnlySpan<float> b,
Span<float> result)
{
int idx = Kernel.ThreadId.X;
if (idx < result.Length)
{
result[idx] = a[idx] * b[idx];
}
}
Exercise 2: Scalar Addition
Add a scalar value to each element:
[Kernel]
public static void ScalarAdd(
ReadOnlySpan<float> input,
float scalar,
Span<float> result)
{
int idx = Kernel.ThreadId.X;
if (idx < result.Length)
{
result[idx] = input[idx] + scalar;
}
}
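To launch it, pass the scalar alongside the buffers. This sketch assumes a plain float can be included in the args array the same way as the spans (the value 5.0f and the array size are arbitrary):
float[] input = Enumerable.Range(0, 1_000).Select(i => (float)i).ToArray();
float[] output = new float[input.Length];
await orchestrator.ExecuteKernelAsync(
    kernelName: "ScalarAdd",
    args: new object[] { input, 5.0f, output } // assumption: scalars are passed like buffers
);
Console.WriteLine($"output[10] = {output[10]}"); // expected 15 if the assumption holds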
Exercise 3: Benchmark
Compare GPU vs CPU performance for different data sizes:
- 1,000 elements
- 100,000 elements
- 10,000,000 elements
At what size does the GPU become faster?
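A minimal timing sketch, assuming both the "CPU" and "CUDA" backends from the earlier section are available on your machine. Each backend gets one warm-up run so kernel compilation and first-time transfers are excluded from the measurement:
using System.Diagnostics;
foreach (int n in new[] { 1_000, 100_000, 10_000_000 })
{
    float[] x = new float[n];
    float[] y = new float[n];
    float[] r = new float[n];
    foreach (string backend in new[] { "CPU", "CUDA" })
    {
        // Warm-up run (compilation, first transfers)
        await orchestrator.ExecuteAsync<object>("VectorAdd", backend, x, y, r);
        var sw = Stopwatch.StartNew();
        await orchestrator.ExecuteAsync<object>("VectorAdd", backend, x, y, r);
        sw.Stop();
        Console.WriteLine($"{backend,-4} n={n,10:N0}  {sw.Elapsed.TotalMilliseconds:F2} ms");
    }
}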
Key Takeaways
- Kernels execute in parallel across thousands of threads
- Always include bounds checks to prevent memory errors
- Use `AddDotComputeRuntime()` to register all necessary services
- Use `IComputeOrchestrator` for kernel execution
- Thread configuration is automatic but can be customized
Next Module
Learn about GPU memory hierarchy and efficient data transfers.