Performance Tuning Guide

This guide provides practical techniques for optimizing kernel performance and maximizing throughput in DotCompute applications.

Performance Profiling

Measuring Performance

Before optimizing, measure baseline performance:

var debugService = services.GetRequiredService<IKernelDebugService>();

var profile = await debugService.ProfileKernelAsync(
    kernelName: "MyKernel",
    parameters: parameters,
    backend: AcceleratorType.CUDA,
    iterations: 1000
);

Console.WriteLine($"Average time: {profile.AverageTime.TotalMicroseconds:F2}μs");
Console.WriteLine($"Min/Max: {profile.MinTime.TotalMicroseconds:F2}μs / {profile.MaxTime.TotalMicroseconds:F2}μs");
Console.WriteLine($"Std dev: {profile.StandardDeviation.TotalMicroseconds:F2}μs");
Console.WriteLine($"Throughput: {profile.Throughput:F0} ops/sec");
Console.WriteLine($"GFLOPS: {profile.GFLOPS:F2}");

Identifying Bottlenecks

Use the memory-pattern analyzer to find bottlenecks:

var bottlenecks = await debugService.AnalyzeMemoryPatternsAsync(
    "MyKernel",
    parameters,
    AcceleratorType.CUDA
);

Console.WriteLine($"Sequential access rate: {bottlenecks.SequentialAccessRate:P1}");
Console.WriteLine($"Cache hit rate: {bottlenecks.CacheHitRate:P1}");
Console.WriteLine($"Bandwidth utilization: {bottlenecks.BandwidthUtilization:P1}");

foreach (var suggestion in bottlenecks.Suggestions)
{
    Console.WriteLine($"Suggestion: {suggestion}");
}

Memory Optimization

1. Memory Pooling

Problem: Frequent allocations cause overhead

Before (slow):

for (int i = 0; i < 1000; i++)
{
    using var buffer = await memoryManager.AllocateAsync<float>(100_000);
    await ProcessData(buffer);
    // Every iteration allocates a fresh buffer and disposes it, paying the overhead 1000 times
}

After (fast):

// Allocate once, reuse 1000 times
using var buffer = await memoryManager.AllocateAsync<float>(100_000);

for (int i = 0; i < 1000; i++)
{
    await ProcessData(buffer);
}

Performance Improvement: 11.2x faster (measured)
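
When buffer lifetimes are less regular than a single loop, the same effect comes from a small pool that hands out idle buffers instead of allocating new ones. The sketch below is a generic, hand-rolled pool rather than a DotCompute API; TBuffer stands in for whatever buffer type AllocateAsync<T> returns in your setup:

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Minimal rent/return pool for same-sized buffers (illustrative, not a DotCompute type)
public sealed class SimpleBufferPool<TBuffer>
{
    private readonly ConcurrentBag<TBuffer> _available = new();
    private readonly Func<ValueTask<TBuffer>> _allocate;

    public SimpleBufferPool(Func<ValueTask<TBuffer>> allocate) => _allocate = allocate;

    // Reuse an idle buffer when one exists; otherwise allocate a new one
    public async ValueTask<TBuffer> RentAsync()
        => _available.TryTake(out var buffer) ? buffer : await _allocate();

    // Return instead of disposing so later iterations can reuse the allocation
    public void Return(TBuffer buffer) => _available.Add(buffer);
}

Rent at the start of each iteration and Return at the end; the pool grows to the peak number of buffers in flight and then stops allocating.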

2. Zero-Copy Operations

Problem: Unnecessary data copies

Before (copies):

var inputArray = new float[1_000_000];
var buffer = await memoryManager.AllocateAsync<float>(1_000_000);
await buffer.CopyFromAsync(inputArray);  // Copy!
await kernel.ExecuteAsync(buffer);
var outputArray = new float[1_000_000];
await buffer.CopyToAsync(outputArray);   // Copy!

After (zero-copy on CPU):

var inputArray = new float[1_000_000];
var buffer = memoryManager.CreateBuffer(inputArray, BufferMode.ReadWrite);
await kernel.ExecuteAsync(buffer);
// No copies needed - direct Span<T> access
var result = buffer.AsSpan();

Performance Improvement: 10x faster for CPU execution

3. Pinned Memory

Problem: Slow CPU-GPU transfers

Before (non-pinned):

var buffer = await memoryManager.AllocateAsync<float>(
    1_000_000,
    AllocationMode.Default
);
// Transfer: ~6 GB/s

After (pinned):

var buffer = await memoryManager.AllocateAsync<float>(
    1_000_000,
    AllocationMode.Pinned
);
// Transfer: ~16 GB/s (2.7x faster)

Performance Improvement: 2-3x faster transfers

4. Unified Memory (Apple Silicon)

Problem: Explicit transfers on unified memory systems

Before (unnecessary transfers):

var buffer = await memoryManager.AllocateAsync<float>(1_000_000);
await buffer.CopyFromAsync(sourceData);  // Unnecessary copy!
await kernel.ExecuteAsync(buffer);
await buffer.CopyToAsync(resultData);    // Unnecessary copy!

After (unified memory):

var buffer = await memoryManager.AllocateAsync<float>(
    1_000_000,
    AllocationMode.Unified
);
// CPU and GPU access same physical memory - no copies!
var span = buffer.AsSpan();
sourceData.CopyTo(span);
await kernel.ExecuteAsync(buffer);
buffer.AsSpan().CopyTo(resultData);

Performance Improvement: 2-3x faster on Apple Silicon

Kernel Optimization

1. Memory Access Patterns

Sequential Access (fast):

[Kernel]
public static void Sequential(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)
    {
        // Sequential: CPU cache-friendly, GPU coalesced
        output[idx] = input[idx] * 2;
    }
}

Strided Access (slower):

[Kernel]
public static void Strided(ReadOnlySpan<float> input, Span<float> output, int stride)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)
    {
        // Strided: Less cache-friendly, GPU may not coalesce
        output[idx] = input[idx * stride] * 2;
    }
}

Random Access (slowest):

[Kernel]
public static void Random(
    ReadOnlySpan<float> input,
    ReadOnlySpan<int> indices,
    Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)
    {
        // Random: Cache-unfriendly, GPU scatter/gather
        output[idx] = input[indices[idx]] * 2;
    }
}

Performance Impact:

  • Sequential: 100% bandwidth utilization
  • Strided (stride=2): ~80% bandwidth
  • Strided (stride=16): ~50% bandwidth
  • Random: ~20% bandwidth

2. Data Reuse

Before (reads input[idx] three times):

[Kernel]
public static void NoReuse(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)
    {
        output[idx] = input[idx] + input[idx] * input[idx];  // 3 reads!
    }
}

After (reads once, reuses):

[Kernel]
public static void WithReuse(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)
    {
        float value = input[idx];  // Read once
        output[idx] = value + value * value;  // Reuse
    }
}

Performance Improvement: 1.5-2x faster (memory traffic reduced)

3. Reduce Branching

Branchy (divergent warps on GPU):

[Kernel]
public static void Branchy(ReadOnlySpan<float> input, Span<float> output, float threshold)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)
    {
        if (input[idx] > threshold)  // Branch divergence!
        {
            output[idx] = input[idx] * 2;
        }
        else
        {
            output[idx] = input[idx] / 2;
        }
    }
}

Branch-Free (better for GPU):

[Kernel]
public static void BranchFree(ReadOnlySpan<float> input, Span<float> output, float threshold)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)
    {
        float value = input[idx];
        float multiplier = (value > threshold) ? 2.0f : 0.5f;
        output[idx] = value * multiplier;  // Conditional move, not branch
    }
}

Performance Improvement: 2-4x faster on GPU (depends on branch divergence)

4. Loop Unrolling

Regular Loop:

[Kernel]
public static void RegularLoop(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X * 4;
    if (idx + 3 < output.Length)
    {
        for (int i = 0; i < 4; i++)
        {
            output[idx + i] = input[idx + i] * 2;
        }
    }
}

Unrolled Loop:

[Kernel]
public static void UnrolledLoop(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X * 4;
    if (idx + 3 < output.Length)
    {
        output[idx + 0] = input[idx + 0] * 2;
        output[idx + 1] = input[idx + 1] * 2;
        output[idx + 2] = input[idx + 2] * 2;
        output[idx + 3] = input[idx + 3] * 2;
    }
}

Performance Improvement: 1.2-1.5x faster (reduced loop overhead)

5. Precision Selection

Double Precision (slower):

[Kernel]
public static void DoublePrecision(ReadOnlySpan<double> input, Span<double> output)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)
    {
        output[idx] = input[idx] * 2.0;  // 2-8x slower on most GPUs
    }
}

Single Precision (faster):

[Kernel]
public static void SinglePrecision(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)
    {
        output[idx] = input[idx] * 2.0f;  // Fast on all GPUs
    }
}

Performance Impact: 2-8x faster with float vs double on most GPUs

Batching Operations

Problem: Many Small Kernel Calls

Before (slow):

for (int i = 0; i < 1000; i++)
{
    await orchestrator.ExecuteKernelAsync("SmallKernel", data[i]);
    // 1000 kernel launches = high overhead
}

After (fast):

// Combine into single large kernel
await orchestrator.ExecuteKernelAsync("BatchKernel", allData);
// Single kernel launch

Performance Improvement: 10-100x faster (kernel launch overhead eliminated)
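
For reference, a batched kernel is usually just the per-element body indexed across the concatenated inputs. A minimal sketch, assuming allData is the 1000 inputs flattened into one buffer and the per-element work is a simple scale:

[Kernel]
public static void BatchKernel(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)
    {
        // Same per-element work as the small kernel, launched once over every item
        output[idx] = input[idx] * 2;
    }
}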

Batch Size Tuning

// Too small: high launch overhead
// const int batchSize = 100;

// Too large: may exceed GPU memory
// const int batchSize = 100_000_000;

// Just right: maximize GPU utilization without exceeding memory
// (element size in bytes, e.g. sizeof(float) for a float kernel)
int batchSize = CalculateOptimalBatchSize(availableMemory, elementSize: sizeof(float));

private static int CalculateOptimalBatchSize(long availableMemory, int elementSize)
{
    // Use 80% of available memory
    long usableMemory = (long)(availableMemory * 0.8);
    int maxElements = (int)(usableMemory / elementSize);

    // Round down to a power of 2 for efficient indexing
    return (int)Math.Pow(2, Math.Floor(Math.Log2(maxElements)));
}
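
With batchSize computed as above, large datasets are processed in fixed-size chunks. A sketch, assuming allData is a float[] and ProcessBatchAsync is your own method that launches the kernel on one slice:

for (int offset = 0; offset < allData.Length; offset += batchSize)
{
    int count = Math.Min(batchSize, allData.Length - offset);

    // One kernel launch per batch instead of one per element
    await ProcessBatchAsync(allData.AsMemory(offset, count));
}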

Parallelization Strategies

1. CPU Multi-Threading

DotCompute automatically uses Parallel.For on CPU:

[Kernel]
public static void ParallelCPU(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)
    {
        output[idx] = input[idx] * 2;
    }
}
// Automatically uses all CPU cores

Control Thread Count:

services.AddDotComputeRuntime(options =>
{
    options.MaxCpuThreads = Environment.ProcessorCount; // Default
    // or
    options.MaxCpuThreads = Environment.ProcessorCount / 2; // Half cores
});

2. GPU Thread Configuration

1D Grid (typical):

// Automatically configured:
// - Threads per block: 256 (typical)
// - Blocks: Ceiling(dataSize / 256)
await orchestrator.ExecuteKernelAsync("MyKernel", parameters);

Manual Configuration (advanced):

var compilationOptions = new CompilationOptions
{
    ThreadsPerBlock = 512,  // Must be multiple of 32 (warp size)
    BlocksPerGrid = (dataSize + 511) / 512
};

await orchestrator.ExecuteKernelAsync("MyKernel", parameters, compilationOptions);

Guidelines:

  • Threads per block: 128-512 (256 is a good default)
  • Occupancy: aim for at least 50%
  • Warp alignment: use a multiple of the warp size (32 on NVIDIA) or wavefront size (64 on AMD); see the helper sketch after this list
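
A small helper for the last guideline; this is a sketch only, with the warp (or wavefront) size passed in rather than queried from the device:

// Round a requested thread count up to a multiple of the warp/wavefront size,
// clamped to the common 1024-threads-per-block hardware limit
static int RoundToWarpMultiple(int requested, int warpSize = 32)
{
    int rounded = (requested + warpSize - 1) / warpSize * warpSize;
    return Math.Clamp(rounded, warpSize, 1024);
}

// Example: RoundToWarpMultiple(300) == 320; RoundToWarpMultiple(300, warpSize: 64) == 320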

3. Pipeline Parallelism

Before (sequential):

await kernel1.ExecuteAsync(data1);
await kernel2.ExecuteAsync(data2);
await kernel3.ExecuteAsync(data3);
// Total time: T1 + T2 + T3

After (pipelined):

var task1 = kernel1.ExecuteAsync(data1);
var task2 = kernel2.ExecuteAsync(data2);
var task3 = kernel3.ExecuteAsync(data3);
await Task.WhenAll(task1, task2, task3);
// Total time: Max(T1, T2, T3)

Performance Improvement: 2-3x faster (overlapped execution)

Backend-Specific Optimizations

CUDA Optimization

1. Coalesced Memory Access:

// ✅ Coalesced: Adjacent threads access adjacent memory
[Kernel]
public static void Coalesced(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X + Kernel.BlockId.X * Kernel.BlockDim.X;
    if (idx < output.Length)
    {
        output[idx] = input[idx];  // Thread 0: addr 0, Thread 1: addr 4, etc.
    }
}

// ❌ Uncoalesced: Adjacent threads access strided memory
[Kernel]
public static void Uncoalesced(ReadOnlySpan<float> input, Span<float> output, int stride)
{
    int idx = Kernel.ThreadId.X + Kernel.BlockId.X * Kernel.BlockDim.X;
    if (idx < output.Length)
    {
        output[idx] = input[idx * stride];  // Poor memory access pattern
    }
}

2. Occupancy Optimization:

// Check occupancy
var capabilities = cudaAccelerator.Capabilities;
int maxThreadsPerBlock = capabilities.MaxThreadsPerBlock;  // e.g., 1024

// Calculate optimal threads per block
int threadsPerBlock = Math.Min(256, maxThreadsPerBlock);  // 256 is good default

// Ensure enough blocks for full occupancy
int blocksPerGrid = (dataSize + threadsPerBlock - 1) / threadsPerBlock;
int minBlocksForFullOccupancy = capabilities.MultiProcessorCount * 2;

if (blocksPerGrid < minBlocksForFullOccupancy)
{
    // Reduce threads per block to increase block count
    threadsPerBlock = 128;
    blocksPerGrid = (dataSize + threadsPerBlock - 1) / threadsPerBlock;
}

Metal Optimization

1. Threadgroup Memory (shared memory):

[Kernel]
public static void WithThreadgroupMemory(
    ReadOnlySpan<float> input,
    Span<float> output,
    int n)
{
    // Metal automatically allocates threadgroup memory for local arrays
    // Not yet exposed in C# kernel syntax - use direct MSL for now
}

2. Unified Memory:

// Leverage zero-copy unified memory on Apple Silicon
var buffer = await memoryManager.AllocateAsync<float>(
    1_000_000,
    AllocationMode.Unified
);

// CPU can write directly
buffer.AsSpan()[0] = 42;

// GPU reads directly (no explicit transfer)
await orchestrator.ExecuteKernelAsync("MyKernel", new { buffer });

// CPU reads directly (no explicit transfer)
var result = buffer.AsSpan()[0];

CPU Optimization

SIMD Vectorization {#simd-vectorization}

SIMD (Single Instruction Multiple Data) vectorization processes multiple data elements simultaneously using CPU vector instructions.

1. Automatic SIMD by Source Generator:

DotCompute's source generator automatically vectorizes kernels for CPU execution:

// Automatically vectorized by source generator
[Kernel]
public static void AutoVectorized(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)
    {
        output[idx] = input[idx] * 2;  // Becomes Vector<float> operation
    }
}
// On AVX512: Processes 16 floats per instruction
// On AVX2: Processes 8 floats per instruction
// On SSE4.2: Processes 4 floats per instruction

Performance Impact:

Scalar:  100ms (1 element/instruction)
SSE4.2:  27ms (3.7x, 4 elements/instruction)
AVX2:    14ms (7.1x, 8 elements/instruction)
AVX512:  8ms (12.5x, 16 elements/instruction)

Vectorization Requirements:

  • Sequential memory access (stride-1)
  • No data dependencies between iterations (see the contrast sketch after this list)
  • Same operation applied to all elements
  • Supported data types: float, double, int, long, short
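
For example, a loop-carried dependency breaks the second requirement: each element needs the previous result, so iterations cannot run in parallel lanes. A contrast sketch in plain CPU code (not kernels):

// Vectorizable: every element is independent
static void Scale(ReadOnlySpan<float> input, Span<float> output)
{
    for (int i = 0; i < output.Length; i++)
    {
        output[i] = input[i] * 2.0f;
    }
}

// Not vectorizable as written: output[i] depends on output[i - 1] (a running sum)
static void PrefixSum(ReadOnlySpan<float> input, Span<float> output)
{
    if (output.IsEmpty) return;
    output[0] = input[0];
    for (int i = 1; i < output.Length; i++)
    {
        output[i] = output[i - 1] + input[i];
    }
}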

Example: Manual SIMD (advanced):

// For operations not auto-vectorized, use System.Numerics.Vector<T>
[Kernel]
public static void ManualVectorized(ReadOnlySpan<float> input, Span<float> output)
{
    int vectorSize = Vector<float>.Count;  // 4, 8, or 16
    int i = 0;

    // Vectorized loop
    for (; i <= output.Length - vectorSize; i += vectorSize)
    {
        var vec = new Vector<float>(input.Slice(i, vectorSize));
        vec *= 2.0f;
        vec.CopyTo(output.Slice(i, vectorSize));
    }

    // Scalar remainder
    for (; i < output.Length; i++)
    {
        output[i] = input[i] * 2.0f;
    }
}

Memory Coalescing {#memory-coalescing}

Memory coalescing combines multiple memory accesses into a single transaction for maximum bandwidth utilization on GPUs.

✅ Coalesced Access (GPU optimal):

[Kernel]
public static void Coalesced(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X + Kernel.BlockIdx.X * Kernel.BlockDim.X;
    if (idx < output.Length)
    {
        // Adjacent threads access adjacent memory addresses
        output[idx] = input[idx] * 2;
        // Thread 0: byte offset 0, Thread 1: byte offset 4, Thread 2: byte offset 8, ...
        // GPU combines these accesses into a single 128-byte transaction
    }
}

❌ Uncoalesced Access (GPU inefficient):

[Kernel]
public static void Uncoalesced(ReadOnlySpan<float> input, Span<float> output, int stride)
{
    int idx = Kernel.ThreadId.X + Kernel.BlockIdx.X * Kernel.BlockDim.X;
    if (idx < output.Length)
    {
        // Adjacent threads access strided memory
        output[idx] = input[idx * stride] * 2;
        // Each thread requires separate transaction
    }
}

Performance Impact:

Pattern                    | Bandwidth | Performance
---------------------------|-----------|------------
Coalesced (stride=1)       | 900 GB/s  | 100%
Coalesced (stride=2)       | 450 GB/s  | 50%
Uncoalesced (stride=32)    | 112 GB/s  | 12%
Random access              | 50 GB/s   | 6%

Optimization Strategy:

  1. Ensure adjacent threads access adjacent memory
  2. Align data structures to cache line boundaries (64 bytes)
  3. Use struct-of-arrays (SoA) instead of array-of-structs (AoS) for better coalescing (see the sketch after this list)
  4. Pad arrays to multiples of warp size (32 for NVIDIA)
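
To illustrate point 3: in an array-of-structs layout, the X fields of adjacent elements sit 12 bytes apart, so adjacent threads reading X generate strided, uncoalesced accesses; in a struct-of-arrays layout they are contiguous. A sketch with a hypothetical Particle type:

// Array-of-structs (AoS): fields of one element are adjacent,
// but the same field across elements is strided (12-byte stride here)
public struct Particle
{
    public float X, Y, Z;
}
// Particle[] particles; thread i reading particles[i].X hits every 12th byte

// Struct-of-arrays (SoA): each field is its own contiguous array
public sealed class ParticleBuffers
{
    public float[] X = Array.Empty<float>();
    public float[] Y = Array.Empty<float>();
    public float[] Z = Array.Empty<float>();
}
// Thread i reading X[i] touches adjacent floats: coalesced on GPU, cache-friendly on CPU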

Shared Memory Usage {#shared-memory}

Shared memory is fast on-chip memory shared by threads within a GPU block, enabling efficient inter-thread communication.

Use Cases:

  • Data reuse across threads (halo regions in stencil operations)
  • Reduction operations (sum, max, min)
  • Matrix transpose and tiling
  • Collaborative loading of data

Example: Shared Memory Reduction (CUDA-style, conceptual):

// TODO: Full shared memory API coming in future release
// Current workaround: Use multiple kernel launches with intermediate buffers

// Conceptual example (not yet supported in C# kernel syntax):
[Kernel]
public static void SharedMemoryReduce(
    ReadOnlySpan<float> input,
    Span<float> output,
    int n)
{
    // Would allocate shared memory for block (future feature)
    // __shared__ float sharedData[256];

    int tid = Kernel.ThreadId.X;
    int idx = Kernel.BlockIdx.X * Kernel.BlockDim.X + tid;

    // Load to shared memory (future)
    // sharedData[tid] = (idx < n) ? input[idx] : 0;

    // Synchronize threads (future)
    // __syncthreads();

    // Tree reduction in shared memory (future)
    // for (int s = blockDim.x / 2; s > 0; s >>= 1)
    // {
    //     if (tid < s) sharedData[tid] += sharedData[tid + s];
    //     __syncthreads();
    // }

    // Write result
    // if (tid == 0) output[blockIdx.x] = sharedData[0];
}

Current Alternative: Use multiple kernel passes with global memory buffers.
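
A sketch of that alternative: a first pass reduces fixed-size chunks of the input into a small buffer of partial sums, and a second pass reduces the partials to one value. The kernel name and the chunkSize, partials, result, and numChunks parameters below are illustrative choices, not DotCompute API:

[Kernel]
public static void PartialSum(ReadOnlySpan<float> input, Span<float> partials, int chunkSize)
{
    int chunk = Kernel.ThreadId.X;
    if (chunk < partials.Length)
    {
        int start = chunk * chunkSize;
        int end = start + chunkSize;
        if (end > input.Length) end = input.Length;

        float sum = 0;
        for (int i = start; i < end; i++)
        {
            sum += input[i];
        }
        partials[chunk] = sum;  // One partial result per chunk
    }
}

// Host side: two launches through global memory instead of one shared-memory kernel.
// partials has numChunks elements, result has one; the second launch reuses the same
// kernel with a single chunk that spans all the partials.
await orchestrator.ExecuteKernelAsync("PartialSum", new { input, partials, chunkSize });
await orchestrator.ExecuteKernelAsync("PartialSum", new { input = partials, partials = result, chunkSize = numChunks });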

Performance Benefits (when available):

  • 100-200x faster than global memory access
  • Enables efficient inter-thread collaboration
  • Reduces global memory bandwidth pressure

2. Cache Optimization:

// ✅ Cache-friendly: Sequential access
[Kernel]
public static void CacheFriendly(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)
    {
        output[idx] = input[idx] * 2;
    }
}

// ❌ Cache-unfriendly: Large stride
[Kernel]
public static void CacheUnfriendly(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)
    {
        output[idx] = input[idx * 1000] * 2;  // Large stride causes cache misses (assumes input is correspondingly larger)
    }
}

Common Performance Pitfalls

Pitfall 1: Not Warming Up

Problem: First execution includes compilation time

// ❌ Bad: Benchmark includes compilation
var stopwatch = Stopwatch.StartNew();
await orchestrator.ExecuteKernelAsync("MyKernel", parameters);
stopwatch.Stop();
Console.WriteLine($"Time: {stopwatch.Elapsed.TotalMilliseconds}ms");
// May show 200ms instead of 2ms!

// ✅ Good: Warm up first
await orchestrator.ExecuteKernelAsync("MyKernel", parameters);  // Warm-up

var stopwatch = Stopwatch.StartNew();
for (int i = 0; i < 100; i++)
{
    await orchestrator.ExecuteKernelAsync("MyKernel", parameters);
}
stopwatch.Stop();
Console.WriteLine($"Avg time: {stopwatch.Elapsed.TotalMilliseconds / 100:F2}ms");

Pitfall 2: Synchronous Waits

Problem: Not overlapping CPU and GPU work

// ❌ Bad: CPU waits for each GPU operation
for (int i = 0; i < 10; i++)
{
    await orchestrator.ExecuteKernelAsync("Kernel1", data[i]);
    ProcessOnCPU(data[i]);  // CPU work while GPU idle
}

// ✅ Good: Overlap CPU and GPU work (process item i-1 on the CPU while item i runs on the GPU)
Task? previousKernel = null;
for (int i = 0; i < 10; i++)
{
    Task currentKernel = orchestrator.ExecuteKernelAsync("Kernel1", data[i]);
    if (previousKernel != null)
    {
        await previousKernel;       // Wait until the previous kernel has finished...
        ProcessOnCPU(data[i - 1]);  // ...then process its result while the current kernel runs
    }
    previousKernel = currentKernel;
}
await previousKernel!;
ProcessOnCPU(data[9]);  // Process the final result

Pitfall 3: Unnecessary Synchronization

Problem: Forcing synchronization between independent operations

// ❌ Bad: Forced synchronization
await orchestrator.ExecuteKernelAsync("Kernel1", data1);
await accelerator.SynchronizeAsync();  // Unnecessary!
await orchestrator.ExecuteKernelAsync("Kernel2", data2);

// ✅ Good: Let runtime manage synchronization
await orchestrator.ExecuteKernelAsync("Kernel1", data1);
await orchestrator.ExecuteKernelAsync("Kernel2", data2);
// Runtime automatically handles dependencies

Pitfall 4: Small Data on GPU

Problem: Transfer overhead exceeds compute time

// ❌ Bad: GPU for tiny data
var tinyData = new float[100];
await orchestrator.ExecuteKernelAsync(
    "MyKernel",
    new { tinyData },
    forceBackend: AcceleratorType.CUDA
);
// Transfer time (50μs) >> Compute time (1μs)

// ✅ Good: CPU for small data
await orchestrator.ExecuteKernelAsync(
    "MyKernel",
    new { tinyData },
    forceBackend: AcceleratorType.CPU
);
// No transfer, compute time ~1μs

Performance Monitoring

OpenTelemetry Integration

services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddDotComputeInstrumentation())
    .WithTracing(tracing => tracing
        .AddDotComputeInstrumentation());

// Metrics collected:
// - dotcompute.kernel.executions (count)
// - dotcompute.kernel.duration (histogram)
// - dotcompute.memory.allocated (count)
// - dotcompute.memory.transferred (histogram)
// - dotcompute.backend.selection_time (histogram)
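
To see these metrics and traces during development, attach any standard OpenTelemetry exporter. A minimal sketch, assuming the OpenTelemetry.Exporter.Console package is referenced:

services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddDotComputeInstrumentation()
        .AddConsoleExporter())   // Print metric snapshots to stdout
    .WithTracing(tracing => tracing
        .AddDotComputeInstrumentation()
        .AddConsoleExporter());  // Print spans as kernels execute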

Real-Time Performance Dashboard

services.AddDotComputeRuntime(options =>
{
    options.EnableTelemetry = true;
    options.TelemetrySamplingRate = 0.1;  // Sample 10% of executions
});

// Query performance data
var telemetryProvider = services.GetRequiredService<ITelemetryProvider>();
var metrics = await telemetryProvider.GetMetricsAsync(TimeSpan.FromHours(1));

Console.WriteLine($"Total executions: {metrics.TotalExecutions}");
Console.WriteLine($"Average duration: {metrics.AverageDuration.TotalMilliseconds:F2}ms");
Console.WriteLine($"P50: {metrics.P50.TotalMilliseconds:F2}ms");
Console.WriteLine($"P95: {metrics.P95.TotalMilliseconds:F2}ms");
Console.WriteLine($"P99: {metrics.P99.TotalMilliseconds:F2}ms");

Performance Checklist

Before Optimization

  • [ ] Profile baseline performance
  • [ ] Identify bottleneck (CPU, GPU, memory, transfer)
  • [ ] Measure memory bandwidth utilization
  • [ ] Check cache hit rates
  • [ ] Analyze memory access patterns

Memory Optimization

  • [ ] Use memory pooling for frequent allocations
  • [ ] Reuse buffers across iterations
  • [ ] Use pinned memory for CPU-GPU transfers
  • [ ] Use unified memory on Apple Silicon
  • [ ] Batch small operations

Kernel Optimization

  • [ ] Ensure sequential/coalesced memory access
  • [ ] Reduce branching in GPU kernels
  • [ ] Reuse loaded values
  • [ ] Use appropriate precision (float vs double)
  • [ ] Check bounds only once per thread

Backend Selection

  • [ ] Profile on both CPU and GPU
  • [ ] Use automatic selection for production
  • [ ] Enable ML-optimized profile for long-running apps
  • [ ] Verify GPU is being used for large data

Monitoring

  • [ ] Enable telemetry in production
  • [ ] Monitor P95 and P99 latencies
  • [ ] Track memory usage trends
  • [ ] Set up alerts for performance regressions


Measure • Optimize • Validate • Repeat