Kernel Development Guide
Status: Production Ready | Test Coverage: 100% | Last Updated: November 2025
This guide covers best practices for writing efficient, correct, and maintainable compute kernels with DotCompute.
Kernel Development Workflow
flowchart LR
A["Step 1: Define Kernel<br/>with [Kernel] attribute"] --> B[Step 2: Write Logic<br/>with bounds checking]
B --> C[Step 3: Build Project<br/>source generator runs]
C --> D{Analyzer<br/>Warnings?}
D -->|Yes| E[Fix Issues<br/>DC001-DC012]
D -->|No| F[Step 4: Write Tests<br/>unit & integration]
E --> B
F --> G[Step 5: Run Tests<br/>all backends]
G --> H{Tests<br/>Pass?}
H -->|No| I[Debug<br/>cross-backend validation]
H -->|Yes| J[Step 6: Profile<br/>performance]
I --> B
J --> K{Performance<br/>Acceptable?}
K -->|No| L[Optimize<br/>memory access & branching]
K -->|Yes| M[Deploy<br/>production ready!]
L --> B
style A fill:#c8e6c9
style M fill:#c8e6c9
style D fill:#fff9c4
style H fill:#fff9c4
style K fill:#fff9c4
style E fill:#ffccbc
style I fill:#ffccbc
style L fill:#ffccbc
Kernel Basics
Anatomy of a Kernel
[Kernel] // 1. Kernel attribute
public static void MyKernel( // 2. Must be static, return void
ReadOnlySpan<float> input, // 3. Input parameters (ReadOnlySpan)
Span<float> output, // 4. Output parameters (Span)
int length) // 5. Scalar parameters
{
int idx = Kernel.ThreadId.X; // 6. Thread indexing
if (idx < length) // 7. Bounds checking
{
output[idx] = input[idx] * 2.0f; // 8. Kernel logic
}
}
Kernel Requirements
Must Have:
- [Kernel] attribute
- static modifier
- void return type
- At least one parameter
- Bounds checking
- Kernel.ThreadId for indexing
Cannot Have:
- async modifier
- Instance methods
- ref/out parameters (except Span<T>)
- LINQ expressions
- Reflection
- Non-inlineable method calls
Thread Indexing
1D Indexing (Most Common)
[Kernel]
public static void Process1D(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
output[idx] = input[idx] * 2;
}
}
Use Cases: Vector operations, element-wise operations, 1D arrays
2D Indexing
[Kernel]
public static void Process2D(
ReadOnlySpan<float> input,
Span<float> output,
int width,
int height)
{
int col = Kernel.ThreadId.X;
int row = Kernel.ThreadId.Y;
if (col < width && row < height)
{
int idx = row * width + col;
output[idx] = input[idx] * 2;
}
}
Use Cases: Image processing, matrix operations, 2D grids
3D Indexing
[Kernel]
public static void Process3D(
ReadOnlySpan<float> input,
Span<float> output,
int width,
int height,
int depth)
{
int x = Kernel.ThreadId.X;
int y = Kernel.ThreadId.Y;
int z = Kernel.ThreadId.Z;
if (x < width && y < height && z < depth)
{
int idx = z * (width * height) + y * width + x;
output[idx] = input[idx] * 2;
}
}
Use Cases: Volumetric data, 3D simulations, video processing
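The flattened index in the 2D and 3D kernels above is ordinary row-major linearization. The following standalone sketch uses plain C# with top-level statements and no DotCompute APIs; the helper names Flatten and Unflatten are illustrative, not part of the library. It shows the mapping and its inverse, which can be handy when interpreting a flat index reported during debugging.

```csharp
using System;

// Row-major linearization, the same formula as in the 3D kernel.
int Flatten(int x, int y, int z, int width, int height)
    => z * (width * height) + y * width + x;

// Inverse mapping: recover (x, y, z) from a flat index.
(int X, int Y, int Z) Unflatten(int idx, int width, int height)
{
    int slice = idx / (width * height);   // z coordinate
    int rem = idx % (width * height);     // offset inside the z-slice
    return (rem % width, rem / width, slice);
}

int w = 4, h = 3, d = 2;                  // illustrative volume dimensions
int flat = Flatten(1, 2, 1, w, h);        // 1 * 12 + 2 * 4 + 1
var (cx, cy, cz) = Unflatten(flat, w, h);

Console.WriteLine($"elements={w * h * d}, flat={flat}, x={cx}, y={cy}, z={cz}");
```

The largest valid flat index is width * height * depth - 1, which is why the kernel checks each coordinate against its own dimension before computing idx.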
Parameter Types
Input Parameters: ReadOnlySpan
[Kernel]
public static void Example(
ReadOnlySpan<float> input, // Recommended
ReadOnlySpan<int> indices, // Multiple inputs OK
Span<float> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
output[idx] = input[indices[idx]];
}
}
Benefits:
- Compiler enforces read-only semantics
- Zero-copy on CPU
- Clear intent in API
Output Parameters: Span
[Kernel]
public static void Example(
ReadOnlySpan<float> input,
Span<float> output1, // Multiple outputs OK
Span<float> output2)
{
int idx = Kernel.ThreadId.X;
if (idx < input.Length)
{
output1[idx] = input[idx] * 2;
output2[idx] = input[idx] + 1;
}
}
Benefits:
- Clear output semantics
- Allows multiple outputs
- Efficient memory access
Scalar Parameters
[Kernel]
public static void Scale(
ReadOnlySpan<float> input,
Span<float> output,
float scaleFactor, // Scalar parameter
int offset) // Multiple scalars OK
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
output[idx] = input[idx] * scaleFactor + offset;
}
}
Supported Scalar Types:
- int, long, short, byte
- float, double
- bool
- uint, ulong, ushort, sbyte
Bounds Checking
Why Bounds Checking Matters
// BAD: No bounds check (crashes on out-of-bounds)
[Kernel]
public static void Unsafe(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X;
output[idx] = input[idx] * 2; // May access beyond array bounds!
}
// GOOD: Always check bounds
[Kernel]
public static void Safe(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length) // Prevents out-of-bounds access
{
output[idx] = input[idx] * 2;
}
}
Analyzer Warning: DC009 will warn about missing bounds checks
Bounds Checking Patterns
Pattern 1: Single Array
if (idx < array.Length)
{
array[idx] = value;
}
Pattern 2: Multiple Arrays (Same Size)
if (idx < Math.Min(input.Length, output.Length))
{
output[idx] = input[idx] * 2;
}
Pattern 3: 2D Bounds
if (col < width && row < height)
{
int idx = row * width + col;
array[idx] = value;
}
Pattern 4: Early Return
int idx = Kernel.ThreadId.X;
if (idx >= array.Length)
return; // Thread does nothing if out of bounds
array[idx] = value;
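Out-of-range thread indices usually appear because GPU backends launch threads in whole blocks, so the launched thread count is rounded up to a multiple of the block size. The sketch below works through that arithmetic in plain C#. The block size of 256 and the helper name BlocksFor are assumptions for illustration, not DotCompute defaults or APIs.

```csharp
using System;

// Number of blocks needed to cover `count` elements (ceiling division).
int BlocksFor(int count, int size) => (count + size - 1) / size;

int length = 1000;
int blockSize = 256;                        // assumed block size, illustration only
int blocks = BlocksFor(length, blockSize);  // rounds up to whole blocks
int launched = blocks * blockSize;          // total threads actually started
int surplus = launched - length;            // threads whose index is >= length

Console.WriteLine($"{blocks} blocks, {launched} threads, {surplus} surplus threads");
```

Every surplus thread runs the kernel body with an index past the end of the data, and the bounds check is what turns those threads into no-ops.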
Common Kernel Patterns
1. Vector Addition
[Kernel]
public static void VectorAdd(
ReadOnlySpan<float> a,
ReadOnlySpan<float> b,
Span<float> result)
{
int idx = Kernel.ThreadId.X;
if (idx < result.Length)
{
result[idx] = a[idx] + b[idx];
}
}
Performance: Memory-bound, ~12 GB/s on PCIe 4.0
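"Memory-bound" can be made concrete with a little arithmetic: vector addition moves 12 bytes per element (two 4-byte reads and one 4-byte write) for a single floating-point add. The sketch below converts a timing into effective bandwidth. The element count and the timing are made-up illustration values, not measurements, and EffectiveBandwidthGBs is a hypothetical helper name.

```csharp
using System;

// Effective bandwidth in GB/s for a kernel that moves `bytesPerElement` per element.
double EffectiveBandwidthGBs(long elements, int bytesPerElement, double milliseconds)
    => elements * (double)bytesPerElement / (milliseconds / 1000.0) / 1e9;

long n = 1_000_000;
int perElement = 3 * sizeof(float);      // read a, read b, write result = 12 bytes
double hypotheticalMs = 2.0;             // made-up timing, not a measurement

double gbPerSec = EffectiveBandwidthGBs(n, perElement, hypotheticalMs);
double flopsPerByte = 1.0 / perElement;  // one add per 12 bytes moved

Console.WriteLine($"{gbPerSec:F1} GB/s, {flopsPerByte:F3} FLOP/byte");
```

With so little arithmetic per byte, throughput is set by how fast data can be moved, so comparing the computed figure with the bandwidth of the relevant link or memory gives a quick sanity check on a measured time.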
2. Scalar Multiply
[Kernel]
public static void ScalarMultiply(
ReadOnlySpan<float> input,
Span<float> output,
float scalar)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
output[idx] = input[idx] * scalar;
}
}
Performance: Memory-bound, same as vector add
3. Dot Product (Reduction)
[Kernel]
public static void DotProduct(
ReadOnlySpan<float> a,
ReadOnlySpan<float> b,
Span<float> partialSums, // One per thread; length must cover every launched thread
int n)
{
int idx = Kernel.ThreadId.X;
// Each thread computes partial sum
float sum = 0;
for (int i = idx; i < n; i += Kernel.GridDim.X * Kernel.BlockDim.X)
{
sum += a[i] * b[i];
}
// Store this thread's partial sum in its own slot (final reduction happens on CPU).
// Writing to partialSums[Kernel.BlockId.X] would make every thread in a block
// overwrite the same element.
if (idx < partialSums.Length)
{
partialSums[idx] = sum;
}
}
Note: Full parallel reduction requires additional synchronization not shown here
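To make the two-stage structure easier to follow, here is a CPU-only emulation in plain C#: a number of logical threads each accumulate a strided slice, and the host then adds up the partial sums. This is a sketch that assumes one partial-sum slot per logical thread; PartialSums is a hypothetical helper, and nothing here calls DotCompute.

```csharp
using System;

// CPU emulation of the grid-stride pattern: `threads` logical threads each
// accumulate a strided slice of the element-wise products.
float[] PartialSums(float[] x, float[] y, int threads)
{
    var sums = new float[threads];
    for (int t = 0; t < threads; t++)        // t plays the role of the thread index
    {
        float sum = 0f;
        for (int i = t; i < x.Length; i += threads)
            sum += x[i] * y[i];
        sums[t] = sum;                       // one slot per logical thread, no write conflicts
    }
    return sums;
}

float[] a = { 1, 2, 3, 4, 5, 6, 7, 8 };
float[] b = { 1, 1, 1, 1, 2, 2, 2, 2 };
float[] partial = PartialSums(a, b, threads: 3);

// Final reduction on the host.
float dot = 0f;
foreach (float p in partial) dot += p;

Console.WriteLine(dot);
```

Because each logical thread owns its slot, no synchronization is needed in the first stage; the cost is a small host-side loop over as many values as there are threads.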
4. Matrix Multiplication
[Kernel]
public static void MatrixMultiply(
ReadOnlySpan<float> a,
ReadOnlySpan<float> b,
Span<float> c,
int M, int N, int K) // A: M×K, B: K×N, C: M×N
{
int row = Kernel.ThreadId.Y;
int col = Kernel.ThreadId.X;
if (row < M && col < N)
{
float sum = 0;
for (int k = 0; k < K; k++)
{
sum += a[row * K + k] * b[k * N + col];
}
c[row * N + col] = sum;
}
}
Performance: Compute-bound, benefits greatly from GPU
5. Image Blur (Convolution)
[Kernel]
public static void GaussianBlur(
ReadOnlySpan<float> input,
Span<float> output,
int width,
int height)
{
int x = Kernel.ThreadId.X;
int y = Kernel.ThreadId.Y;
if (x >= 1 && x < width - 1 && y >= 1 && y < height - 1)
{
// 3x3 Gaussian kernel
float sum = 0;
sum += input[(y-1) * width + (x-1)] * 0.0625f;
sum += input[(y-1) * width + x ] * 0.125f;
sum += input[(y-1) * width + (x+1)] * 0.0625f;
sum += input[ y * width + (x-1)] * 0.125f;
sum += input[ y * width + x ] * 0.25f;
sum += input[ y * width + (x+1)] * 0.125f;
sum += input[(y+1) * width + (x-1)] * 0.0625f;
sum += input[(y+1) * width + x ] * 0.125f;
sum += input[(y+1) * width + (x+1)] * 0.0625f;
output[y * width + x] = sum;
}
}
Performance: Memory-bound with spatial locality
Performance Optimization
graph TD
A[Slow Kernel Performance?] --> B{Profile<br/>Performance}
B --> C[Identify Bottleneck]
C --> D{Memory<br/>Bound?}
D -->|Yes| E[Optimize Memory Access<br/>- Sequential patterns<br/>- Data reuse<br/>- Reduce transfers]
D -->|No| F{Compute<br/>Bound?}
F -->|Yes| G[Optimize Computation<br/>- Use float vs double<br/>- Loop unrolling<br/>- Vectorization]
F -->|No| H{Branch<br/>Heavy?}
H -->|Yes| I[Reduce Branching<br/>- Branch-free math<br/>- Predication<br/>- Minimize divergence]
H -->|No| J{Low<br/>Occupancy?}
J -->|Yes| K[Increase Parallelism<br/>- Larger work size<br/>- Multiple outputs<br/>- Kernel fusion]
J -->|No| L[Backend-Specific<br/>Optimization Needed]
E --> M[Measure Again]
G --> M
I --> M
K --> M
L --> M
M --> N{Better?}
N -->|No| B
N -->|Yes| O[Done!]
style A fill:#ffccbc
style O fill:#c8e6c9
style D fill:#fff9c4
style F fill:#fff9c4
style H fill:#fff9c4
style J fill:#fff9c4
style N fill:#fff9c4
1. Memory Access Patterns
Sequential Access (Fast):
// Good: Sequential, cache-friendly
for (int i = idx; i < n; i += stride)
{
output[i] = input[i] * 2;
}
Random Access (Slow):
// Bad: Random access, cache-unfriendly
for (int i = idx; i < n; i += stride)
{
output[i] = input[indices[i]] * 2; // Random lookup
}
Strided Access (Medium):
// OK: Strided but predictable
for (int i = idx; i < n; i += stride)
{
output[i] = input[i * 4] * 2; // Every 4th element
}
2. Data Reuse
Without Reuse (reads input[i] three times):
output[i] = input[i] + input[i] * input[i];
With Reuse (reads input[i] once):
float value = input[i];
output[i] = value + value * value;
3. Avoid Branching
Branchy Code (bad for GPU):
if (input[idx] > threshold)
{
output[idx] = input[idx] * 2;
}
else
{
output[idx] = input[idx] / 2;
}
Branch-Free (better for GPU):
float value = input[idx];
float isGreater = value > threshold ? 1.0f : 0.0f;
output[idx] = value * (2.0f * isGreater + 0.5f * (1.0f - isGreater));
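Or select the multiplier with a conditional expression (simple conditionals like this are typically compiled to a select rather than a branch):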
float value = input[idx];
float multiplier = (value > threshold) ? 2.0f : 0.5f;
output[idx] = value * multiplier;
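Before swapping a branch for arithmetic, it is worth confirming that both forms agree on representative inputs. The following plain C# sketch compares the branchy version with the arithmetic blend for a handful of sample values; the function names and the sample data are illustrative only.

```csharp
using System;

// Reference version with an explicit branch.
float Branchy(float value, float threshold)
{
    if (value > threshold) return value * 2;
    return value / 2;
}

// Arithmetic blend: isGreater is 1 or 0, so exactly one term survives.
float Blend(float value, float threshold)
{
    float isGreater = value > threshold ? 1.0f : 0.0f;
    return value * (2.0f * isGreater + 0.5f * (1.0f - isGreater));
}

float[] samples = { -3.5f, 0f, 0.25f, 1f, 1.5f, 7f };
float limit = 1f;
bool allEqual = true;
foreach (float v in samples)
    allEqual &= Branchy(v, limit) == Blend(v, limit);

Console.WriteLine(allEqual);
```

The two forms match here because multiplying by 0.5 and dividing by 2 round identically for finite values. Whether the blend is actually faster depends on the backend, so it is best treated as a candidate to profile rather than a guaranteed win.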
4. Loop Unrolling
Regular Loop:
for (int i = 0; i < 4; i++)
{
sum += input[idx * 4 + i];
}
Unrolled Loop (compiler may do this automatically):
sum += input[idx * 4 + 0];
sum += input[idx * 4 + 1];
sum += input[idx * 4 + 2];
sum += input[idx * 4 + 3];
5. Use Appropriate Precision
Single Precision (faster on most GPUs):
float result = input[idx] * 2.0f; // Fast
Double Precision (slower, use only when needed):
double result = input[idx] * 2.0; // 2-8x slower on GPU
Testing Kernels
Unit Testing
[Fact]
public async Task VectorAdd_ProducesCorrectResults()
{
// Arrange
var orchestrator = CreateOrchestrator();
var a = new float[] { 1, 2, 3, 4, 5 };
var b = new float[] { 10, 20, 30, 40, 50 };
var result = new float[5];
// Act
await orchestrator.ExecuteKernelAsync(
"VectorAdd",
new { a, b, result }
);
// Assert
Assert.Equal(new float[] { 11, 22, 33, 44, 55 }, result);
}
Cross-Backend Validation
[Fact]
public async Task VectorAdd_CPUandGPU_ProduceSameResults()
{
var debugService = GetService<IKernelDebugService>();
var a = GenerateRandomData(10_000);
var b = GenerateRandomData(10_000);
var result = new float[10_000];
var validation = await debugService.ValidateCrossBackendAsync(
"VectorAdd",
new { a, b, result },
primaryBackend: AcceleratorType.CUDA,
referenceBackend: AcceleratorType.CPU
);
Assert.True(validation.IsValid);
}
Performance Testing
[Fact]
public async Task VectorAdd_PerformanceIsAcceptable()
{
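var orchestrator = CreateOrchestrator();
var a = new float[1_000_000];
var b = new float[1_000_000];
var result = new float[1_000_000];
// Warm-up run so one-time compilation cost is not included in the timing
await orchestrator.ExecuteKernelAsync("VectorAdd", new { a, b, result });
var stopwatch = Stopwatch.StartNew();
for (int i = 0; i < 100; i++)
{
await orchestrator.ExecuteKernelAsync("VectorAdd", new { a, b, result });
}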
stopwatch.Stop();
var avgTime = stopwatch.Elapsed.TotalMilliseconds / 100;
Assert.True(avgTime < 5.0, $"Average execution time {avgTime}ms exceeds 5ms");
}
Debugging Kernels
Enable Debug Validation
#if DEBUG
services.AddProductionDebugging(options =>
{
options.Profile = DebugProfile.Development;
options.ValidateAllExecutions = true;
options.ThrowOnValidationFailure = true;
});
#endif
Common Issues
Issue 1: Wrong Results on GPU
Symptom: GPU produces different results than CPU
Debug:
var validation = await debugService.ValidateCrossBackendAsync(
"MyKernel",
parameters,
AcceleratorType.CUDA,
AcceleratorType.CPU
);
if (!validation.IsValid)
{
foreach (var diff in validation.Differences)
{
Console.WriteLine($"Index {diff.Index}: GPU={diff.PrimaryValue}, CPU={diff.ReferenceValue}");
}
}
Common Causes:
- Missing bounds check
- Race condition (multiple threads writing same location)
- Uninitialized memory
- Floating-point precision differences
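Small floating-point differences between backends are expected, so comparing results with exact equality can report failures that are not real bugs. One common approach is a combined absolute and relative tolerance, sketched below in plain C#. The helper names NearlyEqual and FirstMismatch and the tolerance values are assumptions for illustration, not DotCompute APIs, and the "gpu" array holds stand-in values rather than real device output.

```csharp
using System;

// True when a and b agree within an absolute or relative tolerance.
bool NearlyEqual(float a, float b, float relTol = 1e-5f, float absTol = 1e-6f)
{
    if (a == b) return true;                                  // also covers equal infinities
    if (float.IsNaN(a) || float.IsNaN(b)) return false;
    if (float.IsInfinity(a) || float.IsInfinity(b)) return false;
    float diff = Math.Abs(a - b);
    float scale = Math.Max(Math.Abs(a), Math.Abs(b));
    return diff <= Math.Max(absTol, relTol * scale);
}

// Index of the first element that differs beyond tolerance, or -1 if none does.
int FirstMismatch(float[] expected, float[] actual)
{
    if (expected.Length != actual.Length)
        throw new ArgumentException("Arrays must have the same length.");
    for (int i = 0; i < expected.Length; i++)
        if (!NearlyEqual(expected[i], actual[i])) return i;
    return -1;
}

float[] cpu = { 1.0f, 2.0f, 3.0f };
float[] gpu = { 1.0f, 2.000001f, 3.5f };   // stand-in values, not real GPU output

Console.WriteLine(FirstMismatch(cpu, gpu));
```

A difference that survives a reasonable tolerance, like the last element here, points to one of the other causes in the list rather than to rounding.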
Issue 2: Non-Deterministic Results
Symptom: Same input produces different outputs on different runs
Debug:
var determinism = await debugService.TestDeterminismAsync(
"MyKernel",
parameters,
AcceleratorType.CUDA,
runs: 100
);
if (!determinism.IsDeterministic)
{
Console.WriteLine($"Cause: {determinism.Cause}");
}
Common Causes:
- Race conditions
- Unordered reduction operations
- Floating-point rounding (accumulation order matters)
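The effect of accumulation order is easy to reproduce on the CPU. The plain C# sketch below sums the same three single-precision values in two different orders; SumInOrder is an illustrative helper. On current .NET runtimes, which evaluate float arithmetic in single precision, the two sums are expected to differ.

```csharp
using System;

// Sum the values in the given order using single-precision accumulation.
float SumInOrder(float[] values)
{
    float sum = 0f;
    foreach (float v in values) sum += v;
    return sum;
}

float[] order1 = { 1e8f, 1f, -1e8f };   // the 1 is absorbed by 1e8 before the cancellation
float[] order2 = { 1e8f, -1e8f, 1f };   // the cancellation happens first, so the 1 survives

Console.WriteLine(SumInOrder(order1));
Console.WriteLine(SumInOrder(order2));
```

A parallel reduction whose combination order varies between runs can show the same kind of drift, which is why a fixed reduction order or a tolerance-based comparison is usually needed for reproducible tests.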
Issue 3: Slow Performance
Debug:
var profile = await debugService.ProfileKernelAsync(
"MyKernel",
parameters,
AcceleratorType.CUDA,
iterations: 1000
);
Console.WriteLine($"Average: {profile.AverageTime.TotalMicroseconds}μs");
Console.WriteLine($"Std dev: {profile.StandardDeviation.TotalMicroseconds}μs");
Console.WriteLine($"GFLOPS: {profile.GFLOPS:F2}");
Common Causes:
- Memory-bound (low compute intensity)
- Poor memory access pattern
- Too many branches
- Insufficient parallelism
Best Practices Summary
Do
- Always check bounds: Prevents crashes and undefined behavior
- Use ReadOnlySpan<T> for inputs: Clear semantics, zero-copy on CPU
- Reuse variables: Reduces memory traffic
- Keep kernels simple: Complex logic is hard to optimize
- Test cross-backend: Ensures correctness on all platforms
- Profile before optimizing: Know where the bottleneck is
Don't
- Don't use LINQ: Not supported in kernels
- Don't use reflection: Not supported, not AOT-compatible
- Don't call complex methods: May not inline properly
- Don't ignore analyzer warnings: DC001-DC012 catch real issues
- Don't optimize prematurely: Profile first
- Don't forget async/await: Kernel execution is asynchronous
Further Reading
- Performance Tuning Guide - Advanced optimization
- Debugging Guide - Troubleshooting techniques
- Backend Selection - Choosing optimal backend
- Architecture: Source Generators - How code generation works
- Diagnostic Rules Reference - Complete DC001-DC012 reference
Write Efficient Kernels • Test Thoroughly • Profile Before Optimizing