Performance Tuning Guide
This guide provides practical techniques for optimizing kernel performance and maximizing throughput in DotCompute applications.
Performance Profiling
Measuring Performance
Before optimizing, measure baseline performance:
var debugService = services.GetRequiredService<IKernelDebugService>();
var profile = await debugService.ProfileKernelAsync(
kernelName: "MyKernel",
parameters: parameters,
backend: AcceleratorType.CUDA,
iterations: 1000
);
Console.WriteLine($"Average time: {profile.AverageTime.TotalMicroseconds:F2}μs");
Console.WriteLine($"Min/Max: {profile.MinTime.TotalMicroseconds:F2}μs / {profile.MaxTime.TotalMicroseconds:F2}μs");
Console.WriteLine($"Std dev: {profile.StandardDeviation.TotalMicroseconds:F2}μs");
Console.WriteLine($"Throughput: {profile.Throughput:F0} ops/sec");
Console.WriteLine($"GFLOPS: {profile.GFLOPS:F2}");
Identify Bottlenecks
Use the bottleneck analyzer:
var bottlenecks = await debugService.AnalyzeMemoryPatternsAsync(
"MyKernel",
parameters,
AcceleratorType.CUDA
);
Console.WriteLine($"Sequential access rate: {bottlenecks.SequentialAccessRate:P1}");
Console.WriteLine($"Cache hit rate: {bottlenecks.CacheHitRate:P1}");
Console.WriteLine($"Bandwidth utilization: {bottlenecks.BandwidthUtilization:P1}");
foreach (var suggestion in bottlenecks.Suggestions)
{
Console.WriteLine($"Suggestion: {suggestion}");
}
Memory Optimization
1. Memory Pooling
Problem: Frequent allocations cause overhead
Before (slow):
for (int i = 0; i < 1000; i++)
{
using var buffer = await memoryManager.AllocateAsync<float>(100_000);
await ProcessData(buffer);
// A new buffer is allocated and disposed on every iteration
}
After (fast):
// Allocate once, reuse 1000 times
using var buffer = await memoryManager.AllocateAsync<float>(100_000);
for (int i = 0; i < 1000; i++)
{
await ProcessData(buffer);
}
Performance Improvement: 11.2x faster (measured)
2. Zero-Copy Operations
Problem: Unnecessary data copies
Before (copies):
var inputArray = new float[1_000_000];
var buffer = await memoryManager.AllocateAsync<float>(1_000_000);
await buffer.CopyFromAsync(inputArray); // Copy!
await kernel.ExecuteAsync(buffer);
var outputArray = new float[1_000_000];
await buffer.CopyToAsync(outputArray); // Copy!
After (zero-copy on CPU):
var inputArray = new float[1_000_000];
var buffer = memoryManager.CreateBuffer(inputArray, BufferMode.ReadWrite);
await kernel.ExecuteAsync(buffer);
// No copies needed - direct Span<T> access
var result = buffer.AsSpan();
Performance Improvement: 10x faster for CPU execution
3. Pinned Memory
Problem: Slow CPU-GPU transfers
Before (non-pinned):
var buffer = await memoryManager.AllocateAsync<float>(
1_000_000,
AllocationMode.Default
);
// Transfer: ~6 GB/s
After (pinned):
var buffer = await memoryManager.AllocateAsync<float>(
1_000_000,
AllocationMode.Pinned
);
// Transfer: ~16 GB/s (2.7x faster)
Performance Improvement: 2-3x faster transfers
4. Unified Memory (Apple Silicon)
Problem: Explicit transfers on unified memory systems
Before (unnecessary transfers):
var buffer = await memoryManager.AllocateAsync<float>(1_000_000);
await buffer.CopyFromAsync(sourceData); // Unnecessary copy!
await kernel.ExecuteAsync(buffer);
await buffer.CopyToAsync(resultData); // Unnecessary copy!
After (unified memory):
var buffer = await memoryManager.AllocateAsync<float>(
1_000_000,
AllocationMode.Unified
);
// CPU and GPU access same physical memory - no copies!
var span = buffer.AsSpan();
sourceData.CopyTo(span);
await kernel.ExecuteAsync(buffer);
buffer.AsSpan().CopyTo(resultData);
Performance Improvement: 2-3x faster on Apple Silicon
Kernel Optimization
1. Memory Access Patterns
Sequential Access (fast):
[Kernel]
public static void Sequential(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
// Sequential: CPU cache-friendly, GPU coalesced
output[idx] = input[idx] * 2;
}
}
Strided Access (slower):
[Kernel]
public static void Strided(ReadOnlySpan<float> input, Span<float> output, int stride)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
// Strided: Less cache-friendly, GPU may not coalesce
output[idx] = input[idx * stride] * 2;
}
}
Random Access (slowest):
[Kernel]
public static void Random(
ReadOnlySpan<float> input,
ReadOnlySpan<int> indices,
Span<float> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
// Random: Cache-unfriendly, GPU scatter/gather
output[idx] = input[indices[idx]] * 2;
}
}
Performance Impact:
- Sequential: 100% bandwidth utilization
- Strided (stride=2): ~80% bandwidth
- Strided (stride=16): ~50% bandwidth
- Random: ~20% bandwidth
2. Data Reuse
Before (reads input[idx] three times):
[Kernel]
public static void NoReuse(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
output[idx] = input[idx] + input[idx] * input[idx]; // 3 reads!
}
}
After (reads once, reuses):
[Kernel]
public static void WithReuse(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
float value = input[idx]; // Read once
output[idx] = value + value * value; // Reuse
}
}
Performance Improvement: 1.5-2x faster (memory traffic reduced)
3. Reduce Branching
Branchy (divergent warps on GPU):
[Kernel]
public static void Branchy(ReadOnlySpan<float> input, Span<float> output, float threshold)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
if (input[idx] > threshold) // Branch divergence!
{
output[idx] = input[idx] * 2;
}
else
{
output[idx] = input[idx] / 2;
}
}
}
Branch-Free (better for GPU):
[Kernel]
public static void BranchFree(ReadOnlySpan<float> input, Span<float> output, float threshold)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
float value = input[idx];
float multiplier = (value > threshold) ? 2.0f : 0.5f;
output[idx] = value * multiplier; // Conditional move, not branch
}
}
Performance Improvement: 2-4x faster on GPU (depends on branch divergence)
4. Loop Unrolling
Regular Loop:
[Kernel]
public static void RegularLoop(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X * 4;
if (idx < output.Length)
{
for (int i = 0; i < 4 && idx + i < output.Length; i++)
{
output[idx + i] = input[idx + i] * 2;
}
}
}
Unrolled Loop:
[Kernel]
public static void UnrolledLoop(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X * 4;
if (idx + 3 < output.Length)
{
output[idx + 0] = input[idx + 0] * 2;
output[idx + 1] = input[idx + 1] * 2;
output[idx + 2] = input[idx + 2] * 2;
output[idx + 3] = input[idx + 3] * 2;
}
}
Performance Improvement: 1.2-1.5x faster (reduced loop overhead)
5. Precision Selection
Double Precision (slower):
[Kernel]
public static void DoublePrecision(ReadOnlySpan<double> input, Span<double> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
output[idx] = input[idx] * 2.0; // 2-8x slower on most GPUs
}
}
Single Precision (faster):
[Kernel]
public static void SinglePrecision(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
output[idx] = input[idx] * 2.0f; // Fast on all GPUs
}
}
Performance Impact: 2-8x faster with float vs double on most GPUs
Batching Operations
Problem: Many Small Kernel Calls
Before (slow):
for (int i = 0; i < 1000; i++)
{
await orchestrator.ExecuteKernelAsync("SmallKernel", data[i]);
// 1000 kernel launches = high overhead
}
After (fast):
// Combine into single large kernel
await orchestrator.ExecuteKernelAsync("BatchKernel", allData);
// Single kernel launch
Performance Improvement: 10-100x faster (kernel launch overhead eliminated)
Batch Size Tuning
// Too small: high kernel-launch overhead
// const int batchSize = 100;
// Too large: may exceed GPU memory
// const int batchSize = 100_000_000;
// Just right: maximize GPU utilization without exceeding memory
int batchSize = CalculateOptimalBatchSize(availableMemory, elementSize);
private static int CalculateOptimalBatchSize(long availableMemory, int elementSize)
{
// Use 80% of available memory
long usableMemory = (long)(availableMemory * 0.8);
int maxElements = (int)(usableMemory / elementSize);
// Round down to power of 2 for efficient indexing
return (int)Math.Pow(2, Math.Floor(Math.Log2(maxElements)));
}
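As a usage sketch (assuming `allData` is a `float[]` and `BatchKernel` is an illustrative kernel name that processes one chunk per launch), the helper plugs into a simple chunking loop:
// Illustrative batching loop: one kernel launch per optimally sized chunk
// instead of one launch per element
int batchSize = CalculateOptimalBatchSize(availableMemory, sizeof(float));
for (int offset = 0; offset < allData.Length; offset += batchSize)
{
    int count = Math.Min(batchSize, allData.Length - offset);
    var batch = allData.AsMemory(offset, count);
    await orchestrator.ExecuteKernelAsync("BatchKernel", new { batch });
}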
Parallelization Strategies
1. CPU Multi-Threading
DotCompute automatically uses Parallel.For on CPU:
[Kernel]
public static void ParallelCPU(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
output[idx] = input[idx] * 2;
}
}
// Automatically uses all CPU cores
Control Thread Count:
services.AddDotComputeRuntime(options =>
{
options.MaxCpuThreads = Environment.ProcessorCount; // Default
// or
options.MaxCpuThreads = Environment.ProcessorCount / 2; // Half cores
});
2. GPU Thread Configuration
1D Grid (typical):
// Automatically configured:
// - Threads per block: 256 (typical)
// - Blocks: Ceiling(dataSize / 256)
await orchestrator.ExecuteKernelAsync("MyKernel", parameters);
Manual Configuration (advanced):
var compilationOptions = new CompilationOptions
{
ThreadsPerBlock = 512, // Must be multiple of 32 (warp size)
BlocksPerGrid = (dataSize + 511) / 512
};
await orchestrator.ExecuteKernelAsync("MyKernel", parameters, compilationOptions);
Guidelines:
- Threads per block: 128-512 (256 is a good default)
- Occupancy: aim for at least 50%
- Block size: use a multiple of the warp/wavefront size (32 on NVIDIA, 64 on AMD)
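These guidelines can be folded into a small helper. The sketch below assumes the `CompilationOptions` properties shown earlier and a `maxThreadsPerBlock` value queried from the accelerator's capabilities:
// Sketch: derive a launch configuration from the guidelines above
private static CompilationOptions CreateLaunchConfig(int dataSize, int maxThreadsPerBlock)
{
    // Start from the 256-thread default, respect the device limit,
    // and keep the count warp-aligned (multiple of 32)
    int threadsPerBlock = Math.Min(256, maxThreadsPerBlock);
    threadsPerBlock -= threadsPerBlock % 32;

    return new CompilationOptions
    {
        ThreadsPerBlock = threadsPerBlock,
        BlocksPerGrid = (dataSize + threadsPerBlock - 1) / threadsPerBlock
    };
}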
3. Pipeline Parallelism
Before (sequential):
await kernel1.ExecuteAsync(data1);
await kernel2.ExecuteAsync(data2);
await kernel3.ExecuteAsync(data3);
// Total time: T1 + T2 + T3
After (pipelined):
var task1 = kernel1.ExecuteAsync(data1);
var task2 = kernel2.ExecuteAsync(data2);
var task3 = kernel3.ExecuteAsync(data3);
await Task.WhenAll(task1, task2, task3);
// Total time: ≈ Max(T1, T2, T3) when the kernels can overlap
Performance Improvement: 2-3x faster (overlapped execution)
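The example above overlaps independent kernels. When each item must pass through two dependent stages, the same idea applies per item; the sketch below (the kernel names and `items` collection are illustrative) runs stage 2 of one item while stage 1 of the next is already in flight:
// Two-stage pipeline sketch: stage 2 of item i overlaps stage 1 of item i + 1
Task? previousStage2 = null;
foreach (var item in items)
{
    await orchestrator.ExecuteKernelAsync("Stage1Kernel", new { item });

    if (previousStage2 != null)
    {
        await previousStage2; // ensure the previous item finished stage 2
    }

    // Start stage 2 for this item without waiting; the loop moves on
    // to the next item's stage 1 while this runs
    previousStage2 = orchestrator.ExecuteKernelAsync("Stage2Kernel", new { item });
}

if (previousStage2 != null)
{
    await previousStage2; // drain the pipeline
}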
Backend-Specific Optimizations
CUDA Optimization
1. Coalesced Memory Access:
// ✅ Coalesced: Adjacent threads access adjacent memory
[Kernel]
public static void Coalesced(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X + Kernel.BlockIdx.X * Kernel.BlockDim.X;
if (idx < output.Length)
{
output[idx] = input[idx]; // Thread 0: byte offset 0, thread 1: offset 4, etc.
}
}
// ❌ Uncoalesced: Adjacent threads access strided memory
[Kernel]
public static void Uncoalesced(ReadOnlySpan<float> input, Span<float> output, int stride)
{
int idx = Kernel.ThreadId.X + Kernel.BlockIdx.X * Kernel.BlockDim.X;
if (idx < output.Length)
{
output[idx] = input[idx * stride]; // Poor memory access pattern
}
}
2. Occupancy Optimization:
// Check occupancy
var capabilities = cudaAccelerator.Capabilities;
int maxThreadsPerBlock = capabilities.MaxThreadsPerBlock; // e.g., 1024
// Calculate optimal threads per block
int threadsPerBlock = Math.Min(256, maxThreadsPerBlock); // 256 is good default
// Ensure enough blocks for full occupancy
int blocksPerGrid = (dataSize + threadsPerBlock - 1) / threadsPerBlock;
int minBlocksForFullOccupancy = capabilities.MultiProcessorCount * 2;
if (blocksPerGrid < minBlocksForFullOccupancy)
{
// Reduce threads per block to increase block count
threadsPerBlock = 128;
blocksPerGrid = (dataSize + threadsPerBlock - 1) / threadsPerBlock;
}
Metal Optimization
1. Threadgroup Memory (shared memory):
[Kernel]
public static void WithThreadgroupMemory(
ReadOnlySpan<float> input,
Span<float> output,
int n)
{
// Metal automatically allocates threadgroup memory for local arrays
// Not yet exposed in C# kernel syntax - use direct MSL for now
}
2. Unified Memory:
// Leverage zero-copy unified memory on Apple Silicon
var buffer = await memoryManager.AllocateAsync<float>(
1_000_000,
AllocationMode.Unified
);
// CPU can write directly
buffer.AsSpan()[0] = 42;
// GPU reads directly (no explicit transfer)
await orchestrator.ExecuteKernelAsync("MyKernel", new { buffer });
// CPU reads directly (no explicit transfer)
var result = buffer.AsSpan()[0];
CPU Optimization
SIMD Vectorization {#simd-vectorization}
SIMD (Single Instruction Multiple Data) vectorization processes multiple data elements simultaneously using CPU vector instructions.
1. Automatic SIMD by Source Generator:
DotCompute's source generator automatically vectorizes kernels for CPU execution:
// Automatically vectorized by source generator
[Kernel]
public static void AutoVectorized(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
output[idx] = input[idx] * 2; // Becomes Vector<float> operation
}
}
// On AVX512: Processes 16 floats per instruction
// On AVX2: Processes 8 floats per instruction
// On SSE4.2: Processes 4 floats per instruction
Performance Impact:
Scalar: 100ms (baseline, 1 element per instruction)
SSE4.2: 27ms (3.7x, 4 elements per instruction)
AVX2: 14ms (7.1x, 8 elements per instruction)
AVX512: 8ms (12.5x, 16 elements per instruction)
Vectorization Requirements:
- Sequential memory access (stride-1)
- No data dependencies between iterations (see the counterexample below)
- Same operation applied to all elements
- Supported data types: float, double, int, long, short
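As a counterexample to the dependency requirement, a running sum carries a value from one iteration to the next, so its elements cannot be computed in independent SIMD lanes. This illustrative method (plain C#, not generated kernel code) falls back to scalar execution:
// Illustrative only: a loop-carried dependency defeats auto-vectorization
public static void RunningSum(ReadOnlySpan<float> input, Span<float> output)
{
    float sum = 0f;
    for (int i = 0; i < input.Length; i++)
    {
        sum += input[i];   // each iteration depends on the previous sum
        output[i] = sum;   // so elements cannot be processed in parallel lanes
    }
}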
Example: Manual SIMD (advanced):
// For operations not auto-vectorized, use System.Numerics.Vector<T>
[Kernel]
public static void ManualVectorized(ReadOnlySpan<float> input, Span<float> output)
{
int vectorSize = Vector<float>.Count; // 4, 8, or 16
int i = 0;
// Vectorized loop
for (; i <= output.Length - vectorSize; i += vectorSize)
{
var vec = new Vector<float>(input.Slice(i, vectorSize));
vec *= 2.0f;
vec.CopyTo(output.Slice(i, vectorSize));
}
// Scalar remainder
for (; i < output.Length; i++)
{
output[i] = input[i] * 2.0f;
}
}
Memory Coalescing {#memory-coalescing}
Memory coalescing combines multiple memory accesses into a single transaction for maximum bandwidth utilization on GPUs.
✅ Coalesced Access (GPU optimal):
[Kernel]
public static void Coalesced(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X + Kernel.BlockIdx.X * Kernel.BlockDim.X;
if (idx < output.Length)
{
// Adjacent threads access adjacent memory addresses
output[idx] = input[idx] * 2;
// Thread 0: byte offset 0, thread 1: offset 4, thread 2: offset 8, ...
// GPU combines into single 128-byte transaction
}
}
❌ Uncoalesced Access (GPU inefficient):
[Kernel]
public static void Uncoalesced(ReadOnlySpan<float> input, Span<float> output, int stride)
{
int idx = Kernel.ThreadId.X + Kernel.BlockIdx.X * Kernel.BlockDim.X;
if (idx < output.Length)
{
// Adjacent threads access strided memory
output[idx] = input[idx * stride] * 2;
// Each thread requires separate transaction
}
}
Performance Impact:
Pattern                 | Bandwidth | Relative throughput
------------------------|-----------|--------------------
Coalesced (stride=1)    | 900 GB/s  | 100%
Strided (stride=2)      | 450 GB/s  | 50%
Strided (stride=32)     | 112 GB/s  | 12%
Random access           | 50 GB/s   | 6%
Optimization Strategy:
- Ensure adjacent threads access adjacent memory
- Align data structures to cache line boundaries (64 bytes)
- Use struct-of-arrays (SoA) instead of array-of-structs (AoS) for better coalescing (see the sketch below)
- Pad arrays to multiples of warp size (32 for NVIDIA)
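To make the SoA recommendation concrete, the sketch below uses illustrative type names (not DotCompute APIs): in the AoS layout, adjacent threads reading only X touch addresses 16 bytes apart; in the SoA layout the same reads are stride-1 and coalesce:
// ❌ Array-of-structs (AoS): reading only X from adjacent elements
// strides by sizeof(ParticleAoS) = 16 bytes
public struct ParticleAoS
{
    public float X, Y, Z, Mass;
}

// ✅ Struct-of-arrays (SoA): reading X across adjacent indices is contiguous
public sealed class ParticlesSoA
{
    public float[] X;
    public float[] Y;
    public float[] Z;
    public float[] Mass;

    public ParticlesSoA(int count)
    {
        X = new float[count];
        Y = new float[count];
        Z = new float[count];
        Mass = new float[count];
    }
}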
Shared Memory Usage {#shared-memory}
Shared memory is fast on-chip memory shared by threads within a GPU block, enabling efficient inter-thread communication.
Use Cases:
- Data reuse across threads (halo regions in stencil operations)
- Reduction operations (sum, max, min)
- Matrix transpose and tiling
- Collaborative loading of data
Example: Shared Memory Reduction (CUDA-style, conceptual):
// TODO: Full shared memory API coming in future release
// Current workaround: Use multiple kernel launches with intermediate buffers
// Conceptual example (not yet supported in C# kernel syntax):
[Kernel]
public static void SharedMemoryReduce(
ReadOnlySpan<float> input,
Span<float> output,
int n)
{
// Would allocate shared memory for block (future feature)
// __shared__ float sharedData[256];
int tid = Kernel.ThreadId.X;
int idx = Kernel.BlockIdx.X * Kernel.BlockDim.X + tid;
// Load to shared memory (future)
// sharedData[tid] = (idx < n) ? input[idx] : 0;
// Synchronize threads (future)
// __syncthreads();
// Tree reduction in shared memory (future)
// for (int s = blockDim.x / 2; s > 0; s >>= 1)
// {
// if (tid < s) sharedData[tid] += sharedData[tid + s];
// __syncthreads();
// }
// Write result
// if (tid == 0) output[blockIdx.x] = sharedData[0];
}
Current Alternative: Use multiple kernel passes with global memory buffers.
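A sketch of that alternative is shown below. It assumes a `ReduceBlock` kernel (an illustrative name) that sums `chunkSize` consecutive elements of its input into one partial result, and that `n` is the element count of `inputBuffer`:
// Multi-pass reduction with global memory buffers instead of shared memory.
// Each pass shrinks the data by a factor of chunkSize until one value remains.
const int chunkSize = 256;
int remaining = n;
var current = inputBuffer;

while (remaining > 1)
{
    int partialCount = (remaining + chunkSize - 1) / chunkSize;
    var partials = await memoryManager.AllocateAsync<float>(partialCount);

    // "ReduceBlock" sums chunkSize consecutive elements of `current`
    // into one element of `partials`
    await orchestrator.ExecuteKernelAsync("ReduceBlock",
        new { current, partials, remaining, chunkSize });

    current = partials;
    remaining = partialCount;
}
// `current` now holds a single element: the total sum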
Performance Benefits (when available):
- 100-200x faster than global memory access
- Enables efficient inter-thread collaboration
- Reduces global memory bandwidth pressure
2. Cache Optimization (CPU):
// ✅ Cache-friendly: Sequential access
[Kernel]
public static void CacheFriendly(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
output[idx] = input[idx] * 2;
}
}
// ❌ Cache-unfriendly: Large stride
[Kernel]
public static void CacheUnfriendly(ReadOnlySpan<float> input, Span<float> output)
{
int idx = Kernel.ThreadId.X;
if (idx < output.Length)
{
output[idx] = input[idx * 1000] * 2; // Large stride defeats the cache (assumes input is 1000x larger than output)
}
}
Common Performance Pitfalls
Pitfall 1: Not Warming Up
Problem: First execution includes compilation time
// ❌ Bad: Benchmark includes compilation
var stopwatch = Stopwatch.StartNew();
await orchestrator.ExecuteKernelAsync("MyKernel", parameters);
stopwatch.Stop();
Console.WriteLine($"Time: {stopwatch.Elapsed.TotalMilliseconds}ms");
// May show 200ms instead of 2ms!
// ✅ Good: Warm up first
await orchestrator.ExecuteKernelAsync("MyKernel", parameters); // Warm-up
var stopwatch = Stopwatch.StartNew();
for (int i = 0; i < 100; i++)
{
await orchestrator.ExecuteKernelAsync("MyKernel", parameters);
}
stopwatch.Stop();
Console.WriteLine($"Avg time: {stopwatch.Elapsed.TotalMilliseconds / 100:F2}ms");
Pitfall 2: Synchronous Waits
Problem: Not overlapping CPU and GPU work
// ❌ Bad: CPU waits for each GPU operation
for (int i = 0; i < 10; i++)
{
await orchestrator.ExecuteKernelAsync("Kernel1", data[i]);
ProcessOnCPU(data[i]); // CPU work while GPU idle
}
// ✅ Good: Overlap CPU and GPU work
var gpuTasks = new List<Task>();
for (int i = 0; i < 10; i++)
{
gpuTasks.Add(orchestrator.ExecuteKernelAsync("Kernel1", data[i]));
if (i > 0) ProcessOnCPU(data[i - 1]); // CPU works on the previous item while the GPU runs
}
await Task.WhenAll(gpuTasks);
Pitfall 3: Unnecessary Synchronization
Problem: Forcing synchronization between independent operations
// ❌ Bad: Forced synchronization
await orchestrator.ExecuteKernelAsync("Kernel1", data1);
await accelerator.SynchronizeAsync(); // Unnecessary!
await orchestrator.ExecuteKernelAsync("Kernel2", data2);
// ✅ Good: Let runtime manage synchronization
await orchestrator.ExecuteKernelAsync("Kernel1", data1);
await orchestrator.ExecuteKernelAsync("Kernel2", data2);
// Runtime automatically handles dependencies
Pitfall 4: Small Data on GPU
Problem: Transfer overhead exceeds compute time
// ❌ Bad: GPU for tiny data
var tinyData = new float[100];
await orchestrator.ExecuteKernelAsync(
"MyKernel",
new { tinyData },
forceBackend: AcceleratorType.CUDA
);
// Transfer time (50μs) >> Compute time (1μs)
// ✅ Good: CPU for small data
await orchestrator.ExecuteKernelAsync(
"MyKernel",
new { tinyData },
forceBackend: AcceleratorType.CPU
);
// No transfer, compute time ~1μs
Performance Monitoring
OpenTelemetry Integration
services.AddOpenTelemetry()
.WithMetrics(metrics => metrics
.AddDotComputeInstrumentation())
.WithTracing(tracing => tracing
.AddDotComputeInstrumentation());
// Metrics collected:
// - dotcompute.kernel.executions (count)
// - dotcompute.kernel.duration (histogram)
// - dotcompute.memory.allocated (count)
// - dotcompute.memory.transferred (histogram)
// - dotcompute.backend.selection_time (histogram)
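To inspect these metrics locally during development, attach an exporter to the same builder. The console exporter below comes from the standard OpenTelemetry.Exporter.Console package; treat this as a sketch:
services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddDotComputeInstrumentation()
        .AddConsoleExporter()); // prints metric snapshots to stdout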
Real-Time Performance Dashboard
services.AddDotComputeRuntime(options =>
{
options.EnableTelemetry = true;
options.TelemetrySamplingRate = 0.1; // Sample 10% of executions
});
// Query performance data
var telemetryProvider = services.GetRequiredService<ITelemetryProvider>();
var metrics = await telemetryProvider.GetMetricsAsync(TimeSpan.FromHours(1));
Console.WriteLine($"Total executions: {metrics.TotalExecutions}");
Console.WriteLine($"Average duration: {metrics.AverageDuration.TotalMilliseconds:F2}ms");
Console.WriteLine($"P50: {metrics.P50.TotalMilliseconds:F2}ms");
Console.WriteLine($"P95: {metrics.P95.TotalMilliseconds:F2}ms");
Console.WriteLine($"P99: {metrics.P99.TotalMilliseconds:F2}ms");
Performance Checklist
Before Optimization
- [ ] Profile baseline performance
- [ ] Identify bottleneck (CPU, GPU, memory, transfer)
- [ ] Measure memory bandwidth utilization
- [ ] Check cache hit rates
- [ ] Analyze memory access patterns
Memory Optimization
- [ ] Use memory pooling for frequent allocations
- [ ] Reuse buffers across iterations
- [ ] Use pinned memory for CPU-GPU transfers
- [ ] Use unified memory on Apple Silicon
- [ ] Batch small operations
Kernel Optimization
- [ ] Ensure sequential/coalesced memory access
- [ ] Reduce branching in GPU kernels
- [ ] Reuse loaded values
- [ ] Use appropriate precision (float vs double)
- [ ] Check bounds only once per thread
Backend Selection
- [ ] Profile on both CPU and GPU
- [ ] Use automatic selection for production
- [ ] Enable ML-optimized profile for long-running apps
- [ ] Verify GPU is being used for large data
Monitoring
- [ ] Enable telemetry in production
- [ ] Monitor P95 and P99 latencies
- [ ] Track memory usage trends
- [ ] Set up alerts for performance regressions
Further Reading
- Kernel Development Guide - Writing efficient kernels
- Backend Selection Guide - Choosing optimal backend
- Memory Management Guide - Memory best practices
- Architecture: Optimization Engine - ML-powered selection
- Architecture: Memory Management - Memory system design
Measure • Optimize • Validate • Repeat