Kernel Performance

This module covers kernel optimization techniques including thread configuration, occupancy analysis, and profiling.

Understanding GPU Execution

Execution Hierarchy

Grid
├── Block 0
│   ├── Warp 0 (threads 0-31)
│   ├── Warp 1 (threads 32-63)
│   └── ...
├── Block 1
│   └── ...
└── ...

Key concepts:

Warp: 32 threads executing in lockstep (NVIDIA) / 64 threads (AMD)
Block: Group of warps sharing resources
Grid: All blocks in a kernel launch

Resource Limits

Resource	Typical Limit	Impact
Threads per block	1024	Max parallelism per SM
Registers per thread	255	Spilling reduces performance
Shared memory	48-164 KB	Limits block size
Blocks per SM	16-32	Concurrent execution

Thread Configuration

Calculating Optimal Configuration

public static KernelConfig CalculateConfig(int dataSize, IComputeBackend backend)
{
    // Get device limits
    int maxThreads = backend.MaxThreadsPerBlock;     // e.g., 1024
    int warpSize = backend.WarpSize;                 // e.g., 32

    // Choose block size (multiple of warp size)
    int blockSize = Math.Min(256, maxThreads);       // 256 is often optimal

    // Calculate grid size to cover all data
    int gridSize = (dataSize + blockSize - 1) / blockSize;

    return new KernelConfig
    {
        BlockSize = blockSize,
        GridSize = gridSize
    };
}

Multi-dimensional Configuration

// 2D kernel for image processing
public static KernelConfig2D CalculateConfig2D(int width, int height)
{
    // 16x16 = 256 threads per block (good default for 2D)
    var blockDim = new Dim3(16, 16, 1);

    var gridDim = new Dim3(
        (width + blockDim.X - 1) / blockDim.X,
        (height + blockDim.Y - 1) / blockDim.Y,
        1);

    return new KernelConfig2D
    {
        BlockDim = blockDim,
        GridDim = gridDim
    };
}

[Kernel]
public static void ProcessImage(
    ReadOnlySpan2D<byte> input,
    Span2D<byte> output,
    int width, int height)
{
    int x = Kernel.BlockId.X * Kernel.BlockDim.X + Kernel.ThreadId.X;
    int y = Kernel.BlockId.Y * Kernel.BlockDim.Y + Kernel.ThreadId.Y;

    if (x < width && y < height)
    {
        output[y, x] = ProcessPixel(input[y, x]);
    }
}

Occupancy Optimization

What is Occupancy?

Occupancy = Active Warps / Maximum Warps per SM

Higher occupancy helps hide memory latency through warp switching.

Factors Affecting Occupancy

Factor	Effect
Block size	Fewer threads = more blocks can fit
Register usage	More registers = fewer blocks
Shared memory	More shared = fewer blocks

Querying Occupancy

var occupancyInfo = await orchestrator.GetOccupancyAsync(
    MyKernels.VectorAdd,
    blockSize: 256);

Console.WriteLine($"Theoretical occupancy: {occupancyInfo.TheoreticalOccupancy:P0}");
Console.WriteLine($"Active blocks per SM: {occupancyInfo.ActiveBlocksPerSM}");
Console.WriteLine($"Active warps per SM: {occupancyInfo.ActiveWarpsPerSM}");
Console.WriteLine($"Registers per thread: {occupancyInfo.RegistersPerThread}");
Console.WriteLine($"Shared memory per block: {occupancyInfo.SharedMemoryPerBlock}");

Auto-tuning Block Size

public static async Task<int> FindOptimalBlockSize(
    IComputeOrchestrator orchestrator,
    string kernelName)
{
    var blockSizes = new[] { 64, 128, 256, 512, 1024 };
    var best = (blockSize: 256, occupancy: 0.0);

    foreach (var blockSize in blockSizes)
    {
        var info = await service.GetOccupancyAsync(kernel, blockSize);
        if (info.TheoreticalOccupancy > best.occupancy)
        {
            best = (blockSize, info.TheoreticalOccupancy);
        }
    }

    return best.blockSize;
}

Reducing Register Pressure

Problem: Register Spilling

When a kernel uses too many registers, variables spill to slow local memory.

// BAD: Many intermediate variables
[Kernel]
public static void TooManyRegisters(Span<float> data)
{
    int idx = Kernel.ThreadId.X;
    float a = data[idx];
    float b = a * 2.0f;
    float c = b + 1.0f;
    float d = c * c;
    float e = MathF.Sin(d);
    float f = MathF.Cos(d);
    float g = e + f;
    float h = g * 0.5f;
    // ... more variables
    data[idx] = h;
}

// BETTER: Reuse variables
[Kernel]
public static void FewerRegisters(Span<float> data)
{
    int idx = Kernel.ThreadId.X;
    float val = data[idx];
    val = val * 2.0f + 1.0f;
    val = val * val;
    val = (MathF.Sin(val) + MathF.Cos(val)) * 0.5f;
    data[idx] = val;
}

Controlling Register Usage

[Kernel(MaxRegisters = 32)]  // Limit registers per thread
public static void LimitedRegisters(Span<float> data)
{
    // ...
}

Profiling Kernels

Basic Timing

var sw = Stopwatch.StartNew();
await orchestrator.ExecuteKernelAsync(kernel, config, buffers);
await orchestrator.SynchronizeAsync(); // Wait for GPU completion
sw.Stop();

Console.WriteLine($"Kernel execution: {sw.ElapsedMilliseconds} ms");

GPU-Native Timing

// Create timing context
using var timing = orchestrator.CreateTimingContext();

// Record start event
timing.RecordStart();

await orchestrator.ExecuteKernelAsync(kernel, config, buffers);

// Record end event
timing.RecordEnd();

// Get precise GPU time
float gpuTimeMs = await timing.GetElapsedMillisecondsAsync();
Console.WriteLine($"GPU kernel time: {gpuTimeMs:F3} ms");

Performance Metrics

var metrics = await orchestrator.ProfileKernelAsync(
    kernel, config, buffers,
    ProfileMetrics.All);

Console.WriteLine($"Execution time: {metrics.ExecutionTimeMs:F3} ms");
Console.WriteLine($"Memory throughput: {metrics.MemoryThroughputGBps:F1} GB/s");
Console.WriteLine($"Compute throughput: {metrics.ComputeThroughputGFlops:F1} GFLOPS");
Console.WriteLine($"Achieved occupancy: {metrics.AchievedOccupancy:P0}");
Console.WriteLine($"Memory efficiency: {metrics.MemoryEfficiency:P0}");

Common Performance Issues

Issue 1: Warp Divergence

When threads in a warp take different branches, both paths execute serially.

// BAD: High divergence
[Kernel]
public static void HighDivergence(Span<float> data)
{
    int idx = Kernel.ThreadId.X;
    if (data[idx] > 0)  // Half the warp goes here
        data[idx] = MathF.Sqrt(data[idx]);
    else                // Other half goes here
        data[idx] = 0;
}

// BETTER: Branchless
[Kernel]
public static void LowDivergence(Span<float> data)
{
    int idx = Kernel.ThreadId.X;
    float val = data[idx];
    float positive = MathF.Max(val, 0);
    data[idx] = MathF.Sqrt(positive); // All threads same path
}

Issue 2: Uncoalesced Memory Access

// BAD: Strided access
[Kernel]
public static void UncoalescedAccess(ReadOnlySpan<float> input, Span<float> output, int stride)
{
    int idx = Kernel.ThreadId.X;
    output[idx] = input[idx * stride];  // Non-contiguous reads
}

// GOOD: Sequential access
[Kernel]
public static void CoalescedAccess(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    output[idx] = input[idx];  // Contiguous reads
}

Issue 3: Bank Conflicts (Shared Memory)

// BAD: Same bank access
[Kernel]
public static void BankConflict(Span<float> shared, Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    // All threads in warp access bank 0
    output[idx] = shared[idx * 32];
}

// GOOD: Different banks
[Kernel]
public static void NoBankConflict(Span<float> shared, Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    // Each thread accesses different bank
    output[idx] = shared[idx];
}

Optimization Checklist

[ ] Block size: Multiple of warp size (32), typically 128-256
[ ] Occupancy: Target >50% theoretical occupancy
[ ] Memory access: Coalesced, aligned to 128 bytes
[ ] Branching: Minimize warp divergence
[ ] Registers: Keep under limit to avoid spilling
[ ] Shared memory: Avoid bank conflicts

Exercises

Exercise 1: Occupancy Analysis

Profile a kernel and experiment with different block sizes to maximize occupancy.

Exercise 2: Divergence Elimination

Rewrite a kernel with branches to use branchless operations.

Exercise 3: Memory Pattern Analysis

Identify and fix uncoalesced memory access in a matrix transpose kernel.

Key Takeaways

Block size affects occupancy - tune for your kernel's resource usage
Warp divergence serializes execution - minimize branching
Memory coalescing is critical - adjacent threads should access adjacent memory
Profile before optimizing - measure to find actual bottlenecks
Auto-tune for different hardware - optimal settings vary by device

Next Module

Multi-Kernel Pipelines →

Learn to chain kernels and manage complex data flows.

Table of Contents