GPU Optimization¶
Performance tuning for GPU-accelerated ring kernels.
Overview¶
This guide covers techniques to maximize GPU performance with PyDotCompute across CUDA (NVIDIA) and Metal (Apple Silicon) backends.
Memory Optimization¶
Minimize Host-Device Transfers¶
Transfers are expensive. Keep data on the GPU:
# BAD: Transfer every iteration
for batch in batches:
    gpu_data = cp.asarray(batch)     # Transfer
    result = kernel(gpu_data)
    cpu_result = cp.asnumpy(result)  # Transfer back
# GOOD: Batch transfers
all_data = cp.asarray(np.concatenate(batches)) # One transfer
results = process_all(all_data) # All on GPU
final = cp.asnumpy(results) # One transfer back
Use UnifiedBuffer Efficiently¶
from pydotcompute import UnifiedBuffer
# Minimize state transitions
buf = UnifiedBuffer((10000,), dtype=np.float32)
# Do ALL host work first
buf.host[:5000] = data_a
buf.host[5000:] = data_b
# Then sync once
buf.sync_to_device()
# Do ALL GPU work
result = gpu_kernel(buf.device)
# Sync back once
buf.sync_to_host()
Use Pinned Memory (CUDA)¶
For streaming workloads on NVIDIA GPUs:
# Enable pinned memory for fast transfers
buf = UnifiedBuffer((10000,), dtype=np.float32, pinned=True)
# DMA transfers are faster with pinned memory
Metal Advantage
On Apple Silicon with Metal, pinned memory is not needed. The unified memory architecture means CPU and GPU share the same physical memory, eliminating transfer overhead entirely.
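As a minimal illustration (assuming the mlx package is installed; MLX is not part of PyDotCompute), an MLX array lives in unified memory, so the CPU and GPU work on the same allocation with no pinned staging buffer or transfer step to manage:
import mlx.core as mx
import numpy as np

host = np.random.rand(10_000).astype(np.float32)
a = mx.array(host)        # Lives in unified memory; no pinned buffer needed
b = mx.sqrt(a) * 2.0      # Runs on the GPU by default on Apple Silicon
mx.eval(b)                # MLX is lazy; force the computation
print(np.array(b)[:5])    # Read back without an explicit device-to-host sync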
Memory Pooling¶
Reduce allocation overhead:
from pydotcompute.core.memory_pool import get_memory_pool
pool = get_memory_pool()
for batch in batches:
    buf = pool.acquire((batch_size,), dtype=np.float32)
    try:
        process(buf)
    finally:
        pool.release(buf)  # Returns to pool, not freed
Kernel Optimization¶
Block Size Selection¶
Choose appropriate thread block sizes:
from pydotcompute import kernel
# Good block sizes: 128, 256, 512
@kernel(block=(256,))
def my_kernel(data, out):
    i = cuda.grid(1)
    if i < len(data):
        out[i] = process(data[i])
Guidelines:
| Data Size (elements) | Block Size (threads) |
|---|---|
| < 1K | 64-128 |
| 1K - 100K | 256 |
| > 100K | 256-512 |
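For reference, a minimal launch-configuration sketch using plain Numba CUDA (not the @kernel decorator), showing how the grid size follows from the chosen block size:
import numpy as np
from numba import cuda

@cuda.jit
def scale_kernel(data, out):
    i = cuda.grid(1)
    if i < data.shape[0]:
        out[i] = data[i] * 2.0

n = 1_000_000
data = cuda.to_device(np.arange(n, dtype=np.float32))
out = cuda.device_array_like(data)

threads_per_block = 256                                              # From the table above
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block   # Ceiling division
scale_kernel[blocks_per_grid, threads_per_block](data, out)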
Memory Coalescing¶
Have consecutive threads access consecutive memory addresses so loads and stores coalesce:
# BAD: Strided access
@cuda.jit
def bad_kernel(data, out, stride):
    i = cuda.grid(1)
    # Threads access non-contiguous memory (addresses `stride` elements apart)
    out[i] = data[i * stride]

# GOOD: Coalesced access
@cuda.jit
def good_kernel(data, out):
    i = cuda.grid(1)
    # Threads access contiguous memory
    out[i] = data[i]
Shared Memory¶
Use shared memory for data reuse:
from numba import cuda, float32

@cuda.jit
def reduce_kernel(data, out):
    # Allocate shared memory (one element per thread in a 256-thread block)
    shared = cuda.shared.array(256, dtype=float32)
    i = cuda.grid(1)
    tid = cuda.threadIdx.x
    # Load to shared memory (coalesced)
    if i < len(data):
        shared[tid] = data[i]
    else:
        shared[tid] = 0
    cuda.syncthreads()
    # Tree reduction in shared memory (fast)
    s = 128
    while s > 0:
        if tid < s:
            shared[tid] += shared[tid + s]
        cuda.syncthreads()
        s //= 2
    if tid == 0:
        out[cuda.blockIdx.x] = shared[0]
Avoid Warp Divergence¶
Keep threads in a warp on the same path:
# BAD: Warp divergence
@cuda.jit
def bad_kernel(data, out):
    i = cuda.grid(1)
    if i % 2 == 0:  # Half the warp goes one way
        out[i] = process_a(data[i])
    else:           # Other half goes another
        out[i] = process_b(data[i])

# GOOD: No divergence
@cuda.jit
def good_kernel(data, out):
    i = cuda.grid(1)
    # All threads do same work
    out[i] = process(data[i])
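When the two code paths genuinely differ, one common alternative is to partition the inputs by branch on the host and launch one uniform kernel per group. A sketch, where process_a_kernel and process_b_kernel are hypothetical per-branch kernels:
import numpy as np
import cupy as cp

host = np.random.rand(10_000).astype(np.float32)
evens = cp.asarray(host[0::2])   # Elements that take the first path (contiguous copy)
odds = cp.asarray(host[1::2])    # Elements that take the second path

out_a = process_a_kernel(evens)  # Hypothetical kernel: every thread runs process_a
out_b = process_b_kernel(odds)   # Hypothetical kernel: every thread runs process_b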
Actor Optimization¶
Batch Processing¶
Process multiple messages per iteration:
@ring_kernel(kernel_id="batch_processor", queue_size=10000)
async def batch_processor(ctx):
batch_size = 100
while not ctx.should_terminate:
if not ctx.is_active:
await ctx.wait_active()
continue
try:
# Collect batch
batch = []
for _ in range(batch_size):
try:
msg = await ctx.receive(timeout=0.01)
batch.append(msg)
except asyncio.TimeoutError:
break
if batch:
# Process batch on GPU
results = process_batch_gpu(batch)
# Send responses
for msg, result in zip(batch, results):
await ctx.send(Response(
result=result,
correlation_id=msg.message_id,
))
except:
continue
Queue Size Tuning¶
Balance memory vs. latency:
# High throughput, more memory
@ring_kernel(kernel_id="high_throughput", queue_size=10000)
async def high_throughput(ctx):
    ...
# Low latency, less buffering
@ring_kernel(kernel_id="low_latency", queue_size=100)
async def low_latency(ctx):
    ...
Priority Processing¶
Handle urgent work first:
# High-priority messages processed first
urgent = Request(data=x, priority=255)
normal = Request(data=y, priority=128)
await runtime.send("worker", urgent) # Processed first
await runtime.send("worker", normal) # Processed second
Async Streams¶
Overlap compute and transfer:
import cupy as cp
@ring_kernel(kernel_id="stream_processor")
async def stream_processor(ctx):
# Create CUDA streams
stream1 = cp.cuda.Stream()
stream2 = cp.cuda.Stream()
current_buf = None
next_buf = None
while not ctx.should_terminate:
if not ctx.is_active:
await ctx.wait_active()
continue
try:
# Receive next work
msg = await ctx.receive(timeout=0.1)
# Process current while loading next
if current_buf is not None:
with stream1:
result = kernel(current_buf)
# Load next data
with stream2:
next_buf = cp.asarray(msg.data)
# Synchronize
stream1.synchronize()
stream2.synchronize()
if current_buf is not None:
await ctx.send(Response(result=result))
current_buf = next_buf
except asyncio.TimeoutError:
continue
Profiling¶
GPU Profiling¶
import cupy as cp
# Profile GPU operations
with cp.cuda.profile():
    result = my_kernel(data)
# Or use NVIDIA Nsight
# nsys profile python my_script.py
Actor Telemetry¶
async with RingKernelRuntime(enable_telemetry=True) as runtime:
    await runtime.launch("worker")
    await runtime.activate("worker")
    # Run workload...
    telemetry = runtime.get_telemetry("worker")
    print(f"Throughput: {telemetry.throughput:.1f} msg/s")
    print(f"Messages: {telemetry.messages_processed}")
Timing¶
import time
import cupy as cp
# Accurate GPU timing
cp.cuda.Device().synchronize() # Wait for GPU
start = time.perf_counter()
result = my_kernel(data)
cp.cuda.Device().synchronize() # Wait for GPU
elapsed = time.perf_counter() - start
print(f"Kernel time: {elapsed*1000:.2f} ms")
Common Bottlenecks¶
Problem: Low GPU Utilization¶
Symptoms: GPU usage < 50%
Solutions:
- Increase batch size
- Use larger queue sizes
- Reduce host-device transfers
- Use async streams
Problem: Memory Transfer Bound¶
Symptoms: High transfer time, low compute time
Solutions:
- Keep data on GPU longer
- Use pinned memory
- Batch transfers
- Overlap compute and transfer
Problem: Kernel Launch Overhead¶
Symptoms: Many small kernels, high overhead
Solutions:
- Fuse kernels (see the sketch after this list)
- Increase work per kernel
- Use persistent kernels (ring kernels!)
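As an example of fusion, here is a sketch with plain Numba CUDA: instead of launching one kernel per elementwise operation, a single kernel applies the whole pipeline in one pass over memory.
from numba import cuda

# BAD: two launches, two full passes over global memory
@cuda.jit
def scale_kernel(data, out):
    i = cuda.grid(1)
    if i < data.shape[0]:
        out[i] = data[i] * 2.0

@cuda.jit
def offset_kernel(data, out):
    i = cuda.grid(1)
    if i < data.shape[0]:
        out[i] = data[i] + 1.0

# GOOD: one launch, one pass, no intermediate buffer
@cuda.jit
def fused_kernel(data, out):
    i = cuda.grid(1)
    if i < data.shape[0]:
        out[i] = data[i] * 2.0 + 1.0
Ring kernels take this further: a persistent actor amortizes a single launch across many messages.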
Best Practices Summary¶
| Area | Recommendation |
|---|---|
| Transfers | Minimize, batch, use pinned memory (CUDA) |
| Memory | Pool buffers, use UnifiedBuffer |
| Kernels | Coalesced access, avoid divergence (CUDA) |
| Actors | Batch processing, tune queue size |
| Profiling | Measure before optimizing |
| Metal | Leverage unified memory, use MLX operations |
Next Steps¶
- Testing: Performance testing
- Building Actors: Actor design