Performance Tuning Guide
This guide provides practical techniques for optimizing kernel performance and maximizing throughput in DotCompute applications.
Performance Profiling
Measuring Performance
Before optimizing, measure baseline performance:
var debugService = services.GetRequiredService<IKernelDebugService>();
var profile = await debugService.ProfileKernelAsync(
kernelName: "MyKernel",
parameters: parameters,
backend: AcceleratorType.CUDA,
iterations: 1000
);
Console.WriteLine($"Average time: {profile.AverageTime.TotalMicroseconds:F2}μs");
Console.WriteLine($"Min/Max: {profile.MinTime.TotalMicroseconds:F2}μs / {profile.MaxTime.TotalMicroseconds:F2}μs");
Console.WriteLine($"Std dev: {profile.StandardDeviation.TotalMicroseconds:F2}μs");
Console.WriteLine($"Throughput: {profile.Throughput:F0} ops/sec");
Console.WriteLine($"GFLOPS: {profile.GFLOPS:F2}");
Identify Bottlenecks
Use the bottleneck analyzer:
var bottlenecks = await debugService.AnalyzeMemoryPatternsAsync(
"MyKernel",
parameters,
AcceleratorType.CUDA
);
Console.WriteLine($"Sequential access rate: {bottlenecks.SequentialAccessRate:P1}");
Console.WriteLine($"Cache hit rate: {bottlenecks.CacheHitRate:P1}");
Console.WriteLine($"Bandwidth utilization: {bottlenecks.BandwidthUtilization:P1}");
foreach (var suggestion in bottlenecks.Suggestions)
{
Console.WriteLine($"Suggestion: {suggestion}");
}
Memory Optimization
1. Memory Pooling
Problem: Frequent allocations cause overhead
Before (slow):
for (int i = 0; i < 1000; i++)
{
using var buffer = await memoryManager.AllocateAsync<float>(100_000);
await ProcessData(buffer);
// A new buffer is allocated and disposed on every iteration
}
After (fast):
// Allocate once, reuse 1000 times
using var buffer = await memoryManager.AllocateAsync<float>(100_000);
for (int i = 0; i < 1000; i++)
{
await ProcessData(buffer);
}
Performance Improvement: 11.2x faster (measured)
2. Zero-Copy Operations
Problem: Unnecessary data copies
Before (copies):
var inputArray = new float[1_000_000];
var buffer = await memoryManager.AllocateAsync<float>(1_000_000);
await buffer.CopyFromAsync(inputArray); // Copy!
await kernel.ExecuteAsync(buffer);
var outputArray = new float[1_000_000];
await buffer.CopyToAsync(outputArray); // Copy!
After (zero-copy on CPU):
var inputArray = new float[1_000_000];
var buffer = memoryManager.CreateBuffer(inputArray, BufferMode.ReadWrite);
await kernel.ExecuteAsync(buffer);
// No copies needed - direct Span<T> access
var result = buffer.AsSpan();
Performance Improvement: 10x faster for CPU execution
3. Pinned Memory
Problem: Slow CPU-GPU transfers
Before (non-pinned):
var buffer = await memoryManager.AllocateAsync<float>(
1_000_000,
AllocationMode.Default
);
// Transfer: ~6 GB/s
After (pinned):
var buffer = await memoryManager.AllocateAsync<float>(
1_000_000,
AllocationMode.Pinned
);
// Transfer: ~16 GB/s (2.7x faster)
Performance Improvement: 2-3x faster transfers
4. Unified Memory (Apple Silicon)
Problem: Explicit transfers on unified memory systems
Before (unnecessary transfers):
var buffer = await memoryManager.AllocateAsync<float>(1_000_000);
await buffer.CopyFromAsync(sourceData); // Unnecessary copy!
await kernel.ExecuteAsync(buffer);
await buffer.CopyToAsync(resultData); // Unnecessary copy!
After (unified memory):
var buffer = await memoryManager.AllocateAsync<float>(
1_000_000,
AllocationMode.Unified
);
// CPU and GPU access same physical memory - no copies!
var span = buffer.AsSpan();
sourceData.CopyTo(span);
await kernel.ExecuteAsync(buffer);
buffer.AsSpan().CopyTo(resultData);
Performance Improvement: 2-3x faster on Apple Silicon
Kernel Optimization
1. Memory Access Patterns
Sequential Access (fast):
[Kernel]
public static void Sequential(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
// Sequential: CPU cache-friendly, GPU coalesced
output[idx] = input[idx] * 2;
}
}
Strided Access (slower):
[Kernel]
public static void Strided(ReadOnlySpan<float> input, Span<float> output, int stride)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
// Strided: Less cache-friendly, GPU may not coalesce
output[idx] = input[idx * stride] * 2;
}
}
Random Access (slowest):
[Kernel]
public static void Random(
ReadOnlySpan<float> input,
ReadOnlySpan<int> indices,
Span<float> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
// Random: Cache-unfriendly, GPU scatter/gather
output[idx] = input[indices[idx]] * 2;
}
}
Performance Impact:
- Sequential: 100% bandwidth utilization
- Strided (stride=2): ~80% bandwidth
- Strided (stride=16): ~50% bandwidth
- Random: ~20% bandwidth
2. Data Reuse
Before (reads input[idx] three times):
[Kernel]
public static void NoReuse(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
output[idx] = input[idx] + input[idx] * input[idx]; // 3 reads!
}
}
After (reads once, reuses):
[Kernel]
public static void WithReuse(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
float value = input[idx]; // Read once
output[idx] = value + value * value; // Reuse
}
}
Performance Improvement: 1.5-2x faster (memory traffic reduced)
3. Reduce Branching
Branchy (divergent warps on GPU):
[Kernel]
public static void Branchy(ReadOnlySpan<float> input, Span<float> output, float threshold)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
if (input[idx] > threshold) // Branch divergence!
{
output[idx] = input[idx] * 2;
}
else
{
output[idx] = input[idx] / 2;
}
}
}
Branch-Free (better for GPU):
[Kernel]
public static void BranchFree(ReadOnlySpan<float> input, Span<float> output, float threshold)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
float value = input[idx];
float multiplier = (value > threshold) ? 2.0f : 0.5f;
output[idx] = value * multiplier; // Conditional move, not branch
}
}
Performance Improvement: 2-4x faster on GPU (depends on branch divergence)
4. Loop Unrolling
Regular Loop:
[Kernel]
public static void RegularLoop(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X * 4;
if (idx < output.Length)
{
for (int i = 0; i < 4 && idx + i < output.Length; i++)
{
output[idx + i] = input[idx + i] * 2;
}
}
}
Unrolled Loop:
[Kernel]
public static void UnrolledLoop(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X * 4;
if (idx + 3 < output.Length)
{
output[idx + 0] = input[idx + 0] * 2;
output[idx + 1] = input[idx + 1] * 2;
output[idx + 2] = input[idx + 2] * 2;
output[idx + 3] = input[idx + 3] * 2;
}
}
Performance Improvement: 1.2-1.5x faster (reduced loop overhead)
5. Precision Selection
Double Precision (slower):
[Kernel]
public static void DoublePrecision(ReadOnlySpan<double> input, Span<double> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
output[idx] = input[idx] * 2.0; // 2-8x slower on most GPUs
}
}
Single Precision (faster):
[Kernel]
public static void SinglePrecision(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
output[idx] = input[idx] * 2.0f; // Fast on all GPUs
}
}
Performance Impact: 2-8x faster with float vs double on most GPUs
Batching Operations
Problem: Many Small Kernel Calls
Before (slow):
for (int i = 0; i < 1000; i++)
{
await orchestrator.ExecuteKernelAsync("SmallKernel", data[i]);
// 1000 kernel launches = high overhead
}
After (fast):
// Combine into single large kernel
await orchestrator.ExecuteKernelAsync("BatchKernel", allData);
// Single kernel launch
Performance Improvement: 10-100x faster (kernel launch overhead eliminated)
Batch Size Tuning
// Too small: high kernel-launch overhead
// const int batchSize = 100;
// Too large: may exceed GPU memory
// const int batchSize = 100_000_000;
// Just right: maximize GPU utilization without exceeding memory
int batchSize = CalculateOptimalBatchSize(availableMemory, elementSize);
private static int CalculateOptimalBatchSize(long availableMemory, int elementSize)
{
// Use 80% of available memory
long usableMemory = (long)(availableMemory * 0.8);
int maxElements = (int)(usableMemory / elementSize);
// Round down to power of 2 for efficient indexing
return (int)Math.Pow(2, Math.Floor(Math.Log2(maxElements)));
}
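As a usage sketch (assuming `allData` is a `float[]` and `BatchKernel` is an illustrative kernel name that processes one chunk per launch), the helper plugs into a simple chunking loop:
// Illustrative batching loop: one kernel launch per optimally sized chunk
// instead of one launch per element
int batchSize = CalculateOptimalBatchSize(availableMemory, sizeof(float));
for (int offset = 0; offset < allData.Length; offset += batchSize)
{
    int count = Math.Min(batchSize, allData.Length - offset);
    var batch = allData.AsMemory(offset, count);
    await orchestrator.ExecuteKernelAsync("BatchKernel", new { batch });
}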
Parallelization Strategies
1. CPU Multi-Threading
DotCompute automatically uses Parallel.For on CPU:
[Kernel]
public static void ParallelCPU(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
output[idx] = input[idx] * 2;
}
}
// Automatically uses all CPU cores
Control Thread Count:
services.AddDotComputeRuntime(options =>
{
options.MaxCpuThreads = Environment.ProcessorCount; // Default
// or
options.MaxCpuThreads = Environment.ProcessorCount / 2; // Half cores
});
2. GPU Thread Configuration
1D Grid (typical):
// Automatically configured:
// - Threads per block: 256 (typical)
// - Blocks: Ceiling(dataSize / 256)
await orchestrator.ExecuteKernelAsync("MyKernel", parameters);
Manual Configuration (advanced):
var compilationOptions = new CompilationOptions
{
ThreadsPerBlock = 512, // Must be multiple of 32 (warp size)
BlocksPerGrid = (dataSize + 511) / 512
};
await orchestrator.ExecuteKernelAsync("MyKernel", parameters, compilationOptions);
Guidelines:
- Threads per block: 128-512 (256 is a good default)
- Occupancy: aim for at least 50%
- Block size: use a multiple of the warp/wavefront size (32 on NVIDIA, 64 on AMD)
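These guidelines can be folded into a small helper. The sketch below assumes the `CompilationOptions` properties shown earlier and a `maxThreadsPerBlock` value queried from the accelerator's capabilities:
// Sketch: derive a launch configuration from the guidelines above
private static CompilationOptions CreateLaunchConfig(int dataSize, int maxThreadsPerBlock)
{
    // Start from the 256-thread default, respect the device limit,
    // and keep the count warp-aligned (multiple of 32)
    int threadsPerBlock = Math.Min(256, maxThreadsPerBlock);
    threadsPerBlock -= threadsPerBlock % 32;

    return new CompilationOptions
    {
        ThreadsPerBlock = threadsPerBlock,
        BlocksPerGrid = (dataSize + threadsPerBlock - 1) / threadsPerBlock
    };
}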
3. Pipeline Parallelism
Before (sequential):
await kernel1.ExecuteAsync(data1);
await kernel2.ExecuteAsync(data2);
await kernel3.ExecuteAsync(data3);
// Total time: T1 + T2 + T3
After (pipelined):
var task1 = kernel1.ExecuteAsync(data1);
var task2 = kernel2.ExecuteAsync(data2);
var task3 = kernel3.ExecuteAsync(data3);
await Task.WhenAll(task1, task2, task3);
// Total time: ≈ Max(T1, T2, T3) when the kernels can overlap
Performance Improvement: 2-3x faster (overlapped execution)
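The example above overlaps independent kernels. When each item must pass through two dependent stages, the same idea applies per item; the sketch below (the kernel names and `items` collection are illustrative) runs stage 2 of one item while stage 1 of the next is already in flight:
// Two-stage pipeline sketch: stage 2 of item i overlaps stage 1 of item i + 1
Task? previousStage2 = null;
foreach (var item in items)
{
    await orchestrator.ExecuteKernelAsync("Stage1Kernel", new { item });

    if (previousStage2 != null)
    {
        await previousStage2; // ensure the previous item finished stage 2
    }

    // Start stage 2 for this item without waiting; the loop moves on
    // to the next item's stage 1 while this runs
    previousStage2 = orchestrator.ExecuteKernelAsync("Stage2Kernel", new { item });
}

if (previousStage2 != null)
{
    await previousStage2; // drain the pipeline
}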
Backend-Specific Optimizations
CUDA Optimization
1. Coalesced Memory Access:
// ✅ Coalesced: Adjacent threads access adjacent memory
[Kernel]
public static void Coalesced(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X + Kernel.BlockIdx.X * Kernel.BlockDim.X;
if (idx < output.Length)
{
output[idx] = input[idx]; // Thread 0: byte offset 0, thread 1: offset 4, etc.
}
}
// ❌ Uncoalesced: Adjacent threads access strided memory
[Kernel]
public static void Uncoalesced(ReadOnlySpan<float> input, Span<float> output, int stride)
{
int idx = Kernel.ThreadId.X + Kernel.BlockIdx.X * Kernel.BlockDim.X;
if (idx < output.Length)
{
output[idx] = input[idx * stride]; // Poor memory access pattern
}
}
2. Occupancy Optimization:
// Check occupancy
var capabilities = cudaAccelerator.Capabilities;
int maxThreadsPerBlock = capabilities.MaxThreadsPerBlock; // e.g., 1024
// Calculate optimal threads per block
int threadsPerBlock = Math.Min(256, maxThreadsPerBlock); // 256 is good default
// Ensure enough blocks for full occupancy
int blocksPerGrid = (dataSize + threadsPerBlock - 1) / threadsPerBlock;
int minBlocksForFullOccupancy = capabilities.MultiProcessorCount * 2;
if (blocksPerGrid < minBlocksForFullOccupancy)
{
// Reduce threads per block to increase block count
threadsPerBlock = 128;
blocksPerGrid = (dataSize + threadsPerBlock - 1) / threadsPerBlock;
}
Metal Optimization
1. Threadgroup Memory (shared memory):
[Kernel]
public static void WithThreadgroupMemory(
ReadOnlySpan<float> input,
Span<float> output,
int n)
{
// Metal automatically allocates threadgroup memory for local arrays
// Not yet exposed in C# kernel syntax - use direct MSL for now
}
2. Unified Memory:
// Leverage zero-copy unified memory on Apple Silicon
var buffer = await memoryManager.AllocateAsync<float>(
1_000_000,
AllocationMode.Unified
);
// CPU can write directly
buffer.AsSpan()[0] = 42;
// GPU reads directly (no explicit transfer)
await orchestrator.ExecuteKernelAsync("MyKernel", new { buffer });
// CPU reads directly (no explicit transfer)
var result = buffer.AsSpan()[0];
CPU Optimization
SIMD Vectorization {#simd-vectorization}
SIMD (Single Instruction Multiple Data) vectorization processes multiple data elements simultaneously using CPU vector instructions.
1. Automatic SIMD by Source Generator:
DotCompute's source generator automatically vectorizes kernels for CPU execution:
// Automatically vectorized by source generator
[Kernel]
public static void AutoVectorized(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
output[idx] = input[idx] * 2; // Becomes Vector<float> operation
}
}
// On AVX512: Processes 16 floats per instruction
// On AVX2: Processes 8 floats per instruction
// On SSE4.2: Processes 4 floats per instruction
Performance Impact:
Scalar: 100ms (baseline, 1 element per instruction)
SSE4.2: 27ms (3.7x, 4 elements per instruction)
AVX2: 14ms (7.1x, 8 elements per instruction)
AVX512: 8ms (12.5x, 16 elements per instruction)
Vectorization Requirements:
- Sequential memory access (stride-1)
- No data dependencies between iterations (see the counterexample below)
- Same operation applied to all elements
- Supported data types: float, double, int, long, short
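As a counterexample to the dependency requirement, a running sum carries a value from one iteration to the next, so its elements cannot be computed in independent SIMD lanes. This illustrative method (plain C#, not generated kernel code) falls back to scalar execution:
// Illustrative only: a loop-carried dependency defeats auto-vectorization
public static void RunningSum(ReadOnlySpan<float> input, Span<float> output)
{
    float sum = 0f;
    for (int i = 0; i < input.Length; i++)
    {
        sum += input[i];   // each iteration depends on the previous sum
        output[i] = sum;   // so elements cannot be processed in parallel lanes
    }
}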
Example: Manual SIMD (advanced):
// For operations not auto-vectorized, use System.Numerics.Vector<T>
[Kernel]
public static void ManualVectorized(ReadOnlySpan<float> input, Span<float> output)
{
int vectorSize = Vector<float>.Count; // 4, 8, or 16
int i = 0;
// Vectorized loop
for (; i <= output.Length - vectorSize; i += vectorSize)
{
var vec = new Vector<float>(input.Slice(i, vectorSize));
vec *= 2.0f;
vec.CopyTo(output.Slice(i, vectorSize));
}
// Scalar remainder
for (; i < output.Length; i++)
{
output[i] = input[i] * 2.0f;
}
}
Memory Coalescing {#memory-coalescing}
Memory coalescing combines multiple memory accesses into a single transaction for maximum bandwidth utilization on GPUs.
✅ Coalesced Access (GPU optimal):
[Kernel]
public static void Coalesced(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X + Kernel.BlockIdx.X * Kernel.BlockDim.X;
if (idx < output.Length)
{
// Adjacent threads access adjacent memory addresses
output[idx] = input[idx] * 2;
// Thread 0: byte offset 0, thread 1: offset 4, thread 2: offset 8, ...
// GPU combines into single 128-byte transaction
}
}
❌ Uncoalesced Access (GPU inefficient):
[Kernel]
public static void Uncoalesced(ReadOnlySpan<float> input, Span<float> output, int stride)
{
int idx = Kernel.ThreadId.X + Kernel.BlockIdx.X * Kernel.BlockDim.X;
if (idx < output.Length)
{
// Adjacent threads access strided memory
output[idx] = input[idx * stride] * 2;
// Each thread requires separate transaction
}
}
Performance Impact:
Pattern                 | Bandwidth | Relative throughput
------------------------|-----------|--------------------
Coalesced (stride=1)    | 900 GB/s  | 100%
Strided (stride=2)      | 450 GB/s  | 50%
Strided (stride=32)     | 112 GB/s  | 12%
Random access           | 50 GB/s   | 6%
Optimization Strategy:
- Ensure adjacent threads access adjacent memory
- Align data structures to cache line boundaries (64 bytes)
- Use struct-of-arrays (SoA) instead of array-of-structs (AoS) for better coalescing (see the sketch below)
- Pad arrays to multiples of warp size (32 for NVIDIA)
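To make the SoA recommendation concrete, the sketch below uses illustrative type names (not DotCompute APIs): in the AoS layout, adjacent threads reading only X touch addresses 16 bytes apart; in the SoA layout the same reads are stride-1 and coalesce:
// ❌ Array-of-structs (AoS): reading only X from adjacent elements
// strides by sizeof(ParticleAoS) = 16 bytes
public struct ParticleAoS
{
    public float X, Y, Z, Mass;
}

// ✅ Struct-of-arrays (SoA): reading X across adjacent indices is contiguous
public sealed class ParticlesSoA
{
    public float[] X;
    public float[] Y;
    public float[] Z;
    public float[] Mass;

    public ParticlesSoA(int count)
    {
        X = new float[count];
        Y = new float[count];
        Z = new float[count];
        Mass = new float[count];
    }
}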
Shared Memory Usage {#shared-memory}
Shared memory is fast on-chip memory shared by threads within a GPU block, enabling efficient inter-thread communication.
Use Cases:
- Data reuse across threads (halo regions in stencil operations)
- Reduction operations (sum, max, min)
- Matrix transpose and tiling
- Collaborative loading of data
Example: Shared Memory Reduction (CUDA-style, conceptual):
// TODO: Full shared memory API coming in future release
// Current workaround: Use multiple kernel launches with intermediate buffers
// Conceptual example (not yet supported in C# kernel syntax):
[Kernel]
public static void SharedMemoryReduce(
ReadOnlySpan<float> input,
Span<float> output,
int n)
{
// Would allocate shared memory for block (future feature)
// __shared__ float sharedData[256];
int tid = Kernel.ThreadId.X;
int idx = Kernel.BlockIdx.X * Kernel.BlockDim.X + tid;
// Load to shared memory (future)
// sharedData[tid] = (idx < n) ? input[idx] : 0;
// Synchronize threads (future)
// __syncthreads();
// Tree reduction in shared memory (future)
// for (int s = blockDim.x / 2; s > 0; s >>= 1)
// {
// if (tid < s) sharedData[tid] += sharedData[tid + s];
// __syncthreads();
// }
// Write result
// if (tid == 0) output[blockIdx.x] = sharedData[0];
}
Current Alternative: Use multiple kernel passes with global memory buffers.
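A sketch of that alternative is shown below. It assumes a `ReduceBlock` kernel (an illustrative name) that sums `chunkSize` consecutive elements of its input into one partial result, and that `n` is the element count of `inputBuffer`:
// Multi-pass reduction with global memory buffers instead of shared memory.
// Each pass shrinks the data by a factor of chunkSize until one value remains.
const int chunkSize = 256;
int remaining = n;
var current = inputBuffer;

while (remaining > 1)
{
    int partialCount = (remaining + chunkSize - 1) / chunkSize;
    var partials = await memoryManager.AllocateAsync<float>(partialCount);

    // "ReduceBlock" sums chunkSize consecutive elements of `current`
    // into one element of `partials`
    await orchestrator.ExecuteKernelAsync("ReduceBlock",
        new { current, partials, remaining, chunkSize });

    current = partials;
    remaining = partialCount;
}
// `current` now holds a single element: the total sum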
Performance Benefits (when available):
- 100-200x faster than global memory access
- Enables efficient inter-thread collaboration
- Reduces global memory bandwidth pressure
2. Cache Optimization (CPU):
// ✅ Cache-friendly: Sequential access
[Kernel]
public static void CacheFriendly(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
output[idx] = input[idx] * 2;
}
}
// ❌ Cache-unfriendly: Large stride
[Kernel]
public static void CacheUnfriendly(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
output[idx] = input[idx * 1000] * 2; // Large stride defeats the cache (assumes input is 1000x larger than output)
}
}
Common Performance Pitfalls
Pitfall 1: Not Warming Up
Problem: First execution includes compilation time
// ❌ Bad: Benchmark includes compilation
var stopwatch = Stopwatch.StartNew();
await orchestrator.ExecuteKernelAsync("MyKernel", parameters);
stopwatch.Stop();
Console.WriteLine($"Time: {stopwatch.Elapsed.TotalMilliseconds}ms");
// May show 200ms instead of 2ms!
// ✅ Good: Warm up first
await orchestrator.ExecuteKernelAsync("MyKernel", parameters); // Warm-up
var stopwatch = Stopwatch.StartNew();
for (int i = 0; i < 100; i++)
{
await orchestrator.ExecuteKernelAsync("MyKernel", parameters);
}
stopwatch.Stop();
Console.WriteLine($"Avg time: {stopwatch.Elapsed.TotalMilliseconds / 100:F2}ms");
Pitfall 2: Synchronous Waits
Problem: Not overlapping CPU and GPU work
// ❌ Bad: CPU waits for each GPU operation
for (int i = 0; i < 10; i++)
{
await orchestrator.ExecuteKernelAsync("Kernel1", data[i]);
ProcessOnCPU(data[i]); // CPU work while GPU idle
}
// ✅ Good: Overlap CPU and GPU work
var gpuTasks = new List<Task>();
for (int i = 0; i < 10; i++)
{
gpuTasks.Add(orchestrator.ExecuteKernelAsync("Kernel1", data[i]));
if (i > 0) ProcessOnCPU(data[i - 1]); // CPU works on the previous item while the GPU runs
}
await Task.WhenAll(gpuTasks);
Pitfall 3: Unnecessary Synchronization
Problem: Forcing synchronization between independent operations
// ❌ Bad: Forced synchronization
await orchestrator.ExecuteKernelAsync("Kernel1", data1);
await accelerator.SynchronizeAsync(); // Unnecessary!
await orchestrator.ExecuteKernelAsync("Kernel2", data2);
// ✅ Good: Let runtime manage synchronization
await orchestrator.ExecuteKernelAsync("Kernel1", data1);
await orchestrator.ExecuteKernelAsync("Kernel2", data2);
// Runtime automatically handles dependencies
Pitfall 4: Small Data on GPU
Problem: Transfer overhead exceeds compute time
// ❌ Bad: GPU for tiny data
var tinyData = new float[100];
await orchestrator.ExecuteKernelAsync(
"MyKernel",
new { tinyData },
forceBackend: AcceleratorType.CUDA
);
// Transfer time (50μs) >> Compute time (1μs)
// ✅ Good: CPU for small data
await orchestrator.ExecuteKernelAsync(
"MyKernel",
new { tinyData },
forceBackend: AcceleratorType.CPU
);
// No transfer, compute time ~1μs
Performance Monitoring
OpenTelemetry Integration
services.AddOpenTelemetry()
.WithMetrics(metrics => metrics
.AddDotComputeInstrumentation())
.WithTracing(tracing => tracing
.AddDotComputeInstrumentation());
// Metrics collected:
// - dotcompute.kernel.executions (count)
// - dotcompute.kernel.duration (histogram)
// - dotcompute.memory.allocated (count)
// - dotcompute.memory.transferred (histogram)
// - dotcompute.backend.selection_time (histogram)
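To inspect these metrics locally during development, attach an exporter to the same builder. The console exporter below comes from the standard OpenTelemetry.Exporter.Console package; treat this as a sketch:
services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddDotComputeInstrumentation()
        .AddConsoleExporter()); // prints metric snapshots to stdout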
Real-Time Performance Dashboard
services.AddDotComputeRuntime(options =>
{
options.EnableTelemetry = true;
options.TelemetrySamplingRate = 0.1; // Sample 10% of executions
});
// Query performance data
var telemetryProvider = services.GetRequiredService<ITelemetryProvider>();
var metrics = await telemetryProvider.GetMetricsAsync(TimeSpan.FromHours(1));
Console.WriteLine($"Total executions: {metrics.TotalExecutions}");
Console.WriteLine($"Average duration: {metrics.AverageDuration.TotalMilliseconds:F2}ms");
Console.WriteLine($"P50: {metrics.P50.TotalMilliseconds:F2}ms");
Console.WriteLine($"P95: {metrics.P95.TotalMilliseconds:F2}ms");
Console.WriteLine($"P99: {metrics.P99.TotalMilliseconds:F2}ms");
Performance Checklist
Before Optimization
- [ ] Profile baseline performance
- [ ] Identify bottleneck (CPU, GPU, memory, transfer)
- [ ] Measure memory bandwidth utilization
- [ ] Check cache hit rates
- [ ] Analyze memory access patterns
Memory Optimization
- [ ] Use memory pooling for frequent allocations
- [ ] Reuse buffers across iterations
- [ ] Use pinned memory for CPU-GPU transfers
- [ ] Use unified memory on Apple Silicon
- [ ] Batch small operations
Kernel Optimization
- [ ] Ensure sequential/coalesced memory access
- [ ] Reduce branching in GPU kernels
- [ ] Reuse loaded values
- [ ] Use appropriate precision (float vs double)
- [ ] Check bounds only once per thread
Backend Selection
- [ ] Profile on both CPU and GPU
- [ ] Use automatic selection for production
- [ ] Enable ML-optimized profile for long-running apps
- [ ] Verify GPU is being used for large data
Monitoring
- [ ] Enable telemetry in production
- [ ] Monitor P95 and P99 latencies
- [ ] Track memory usage trends
- [ ] Set up alerts for performance regressions
Further Reading
- Kernel Development Guide - Writing efficient kernels
- Backend Selection Guide - Choosing optimal backend
- Memory Management Guide - Memory best practices
- Architecture: Optimization Engine - ML-powered selection
- Architecture: Memory Management - Memory system design
Measure • Optimize • Validate • Repeat