GPU Computing Fundamentals¶
Understanding GPU architecture and why persistent kernels matter.
GPU Architecture¶
PyDotCompute supports multiple GPU backends:
- CUDA: NVIDIA GPUs (Windows, Linux)
- Metal: Apple Silicon GPUs via MLX (macOS)
- CPU: Fallback simulation for development/testing
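As a quick way to check which of these backends a machine can actually use, you can probe for the libraries each one builds on. This sketch uses only the standard library and is not a PyDotCompute API:

```python
import importlib.util

def available_backends() -> list[str]:
    """Probe for the libraries each backend builds on (illustrative only)."""
    backends = []
    if importlib.util.find_spec("cupy") is not None:   # CUDA backend uses CuPy
        backends.append("cuda")
    if importlib.util.find_spec("mlx") is not None:    # Metal backend uses MLX
        backends.append("metal")
    backends.append("cpu")                             # CPU fallback always works
    return backends

print(available_backends())  # e.g. ['cuda', 'cpu'] on a Linux CUDA box
```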
CPU vs GPU¶
```
CPU (Few powerful cores)       GPU (Many simple cores)
┌─────────────────────┐    ┌─────────────────────────────────┐
│   ┌─────┐ ┌─────┐   │    │ ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐ │
│   │Core │ │Core │   │    │ └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘ │
│   │  1  │ │  2  │   │    │ ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐ │
│   └─────┘ └─────┘   │    │ └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘ │
│   ┌─────┐ ┌─────┐   │    │ ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐ │
│   │Core │ │Core │   │    │ └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘ │
│   │  3  │ │  4  │   │    │        ... 1000s of cores       │
│   └─────┘ └─────┘   │    │ ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐ │
│                     │    │ └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘ │
└─────────────────────┘    └─────────────────────────────────┘
  Complex tasks              Parallel tasks
  Low latency                High throughput
```
NVIDIA GPU Hierarchy¶
```
GPU
└── Streaming Multiprocessors (SMs)
    └── Blocks (Thread Blocks)
        └── Warps (32 threads)
            └── Threads
```
| Level | Typical Count | Characteristics |
|---|---|---|
| SMs | 80-100+ | Independent processors |
| Blocks | 1000s | Scheduled to SMs |
| Warps | 1000s | 32 threads each, execute in lockstep |
| Threads | 100,000s | Lightweight |
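To make these levels concrete, here is a small pure-Python illustration of how a flat global thread index decomposes into block, warp, and lane coordinates (the block size of 256 is just an example choice):

```python
BLOCK_SIZE = 256  # threads per block (example choice)
WARP_SIZE = 32    # fixed at 32 threads on NVIDIA GPUs

def locate(global_thread_id: int) -> dict:
    """Decompose a flat thread index into block / warp / lane coordinates."""
    block = global_thread_id // BLOCK_SIZE
    thread_in_block = global_thread_id % BLOCK_SIZE
    warp = thread_in_block // WARP_SIZE   # which warp within the block
    lane = thread_in_block % WARP_SIZE    # position within the warp
    return {"block": block, "warp": warp, "lane": lane}

print(locate(1000))  # {'block': 3, 'warp': 7, 'lane': 8}
```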
Traditional GPU Programming¶
The Typical Flow¶
1. Allocate host memory
2. Initialize data on host
3. Allocate device memory
4. Copy data to device ← Transfer latency
5. Launch kernel ← Launch overhead
6. Wait for completion ← Synchronization
7. Copy results to host ← Transfer latency
8. Free device memory
Example (CUDA/CuPy)¶
```python
import cupy as cp
import numpy as np

# Host data
host_data = np.random.randn(1000000).astype(np.float32)

# Copy to device
device_data = cp.asarray(host_data)  # ~0.5ms for 4MB

# Kernel launch
result = cp.square(device_data)      # ~0.01ms

# Copy back
host_result = cp.asnumpy(result)     # ~0.5ms
```
Observation: Transfer time dominates computation time!
The Problem with Small Kernels¶
Traditional approach:
```
For each batch:
    copy_to_device()     ─── 500μs
    launch_kernel()      ───  10μs  (actual work)
    copy_from_device()   ─── 500μs

Total: 1010μs per batch
Efficiency: 10/1010 ≈ 1%
```
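The efficiency number falls out of a simple timing model. A few lines of Python make the relationship explicit, using the illustrative figures from above:

```python
def batch_efficiency(transfer_us: float, compute_us: float) -> float:
    """Fraction of each batch spent on useful work (copy in + compute + copy out)."""
    total_us = 2 * transfer_us + compute_us
    return compute_us / total_us

print(f"{batch_efficiency(transfer_us=500, compute_us=10):.1%}")  # -> 1.0%
```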
Persistent Kernels¶
The Innovation¶
Keep the kernel running and feed it data:
```
Persistent kernel approach:
    launch_kernel() once     ───  10μs  (one-time)
    For each batch:
        send_to_queue()      ───   1μs
        kernel_processes()   ───  10μs
        receive_from_queue() ───   1μs

Total: 12μs per batch (after launch)
Efficiency: 10/12 ≈ 83%
```
How Ring Kernels Work¶
```
┌──────────────────────────────────────────────────────────────┐
│                             GPU                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                      Ring Kernel                       │  │
│  │                                                        │  │
│  │    ┌─────────┐    ┌───────────┐    ┌─────────┐         │  │
│  │    │  Input  │───►│  Process  │───►│ Output  │         │  │
│  │    │  Queue  │    │  (Loop)   │    │  Queue  │         │  │
│  │    └─────────┘    └───────────┘    └─────────┘         │  │
│  │         ▲               │               │              │  │
│  │         │          ┌─────────┐          │              │  │
│  │         │          │  State  │          │              │  │
│  │         │          └─────────┘          │              │  │
│  └─────────│───────────────────────────────│──────────────┘  │
│            │                               │                 │
│            │    ┌────────────────────┐     │                 │
│            │    │   Unified Memory   │     │                 │
│            │    │   (Host-Device)    │     │                 │
│            │    └────────────────────┘     │                 │
│            │               ▲               ▼                 │
└────────────│───────────────│───────────────│─────────────────┘
             │               │               │
       ┌─────┴───────────────┴───────────────┴─────┐
       │                   HOST                    │
       │    send()                      receive()  │
       └───────────────────────────────────────────┘
```
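The pattern is easy to mimic on the CPU with a long-lived worker thread and two queues. The sketch below is only an analogy for the GPU-resident loop, not PyDotCompute code, but it mirrors the diagram: input queue, processing loop with persistent state, output queue:

```python
import queue
import threading

inbox: queue.Queue = queue.Queue()    # plays the role of the input queue
outbox: queue.Queue = queue.Queue()   # plays the role of the output queue

def ring_worker() -> None:
    """Launched once; loops until it receives a shutdown sentinel."""
    processed = 0                     # persistent state survives across messages
    while True:
        item = inbox.get()
        if item is None:              # sentinel ends the loop
            break
        processed += 1
        outbox.put((item * item, processed))

threading.Thread(target=ring_worker, daemon=True).start()  # one "launch"

for x in (1.0, 2.0, 3.0):             # many messages, no relaunch
    inbox.put(x)                      # send()
    print(outbox.get())               # receive() -> (1.0, 1), (4.0, 2), (9.0, 3)

inbox.put(None)                       # shut the worker down
```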
Memory Hierarchy¶
GPU Memory Types¶
```
┌─────────────────────────────────────────────────────────┐
│                      Global Memory                      │
│                 (Large, ~24-80GB, Slow)                 │
├─────────────────────────────────────────────────────────┤
│   ┌─────────────────┐      ┌─────────────────┐          │
│   │  Shared Memory  │      │  Shared Memory  │          │
│   │  (Fast, ~48KB)  │      │  (Fast, ~48KB)  │          │
│   │    Per Block    │      │    Per Block    │          │
│   └─────────────────┘      └─────────────────┘          │
│   ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐        │
│   │R │  │R │  │R │  │R │  │R │  │R │  │R │  │R │        │
│   │e │  │e │  │e │  │e │  │e │  │e │  │e │  │e │        │
│   │g │  │g │  │g │  │g │  │g │  │g │  │g │  │g │        │
│   └──┘  └──┘  └──┘  └──┘  └──┘  └──┘  └──┘  └──┘        │
│        Thread Registers (Fastest, Per-Thread)           │
└─────────────────────────────────────────────────────────┘
```
| Memory Type | Size | Bandwidth | Latency | Scope |
|---|---|---|---|---|
| Registers | ~256KB | ~20TB/s | 1 cycle | Thread |
| Shared | ~48KB | ~10TB/s | ~20 cycles | Block |
| L1 Cache | ~128KB | ~2TB/s | ~30 cycles | SM |
| L2 Cache | ~6MB | ~1TB/s | ~200 cycles | GPU |
| Global | 24-80GB | ~900GB/s | ~400 cycles | All |
| Host | GB-TB | ~25GB/s | ~10K cycles | CPU |
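These bandwidth figures explain the transfer times quoted earlier. Here is a back-of-envelope calculation for the 4MB array from the CuPy example; real transfers add latency and driver overhead on top, which is why the measured figure was ~0.5ms:

```python
def transfer_time_us(nbytes: int, bandwidth_gb_s: float) -> float:
    """Idealized transfer time from raw bandwidth, ignoring latency/overhead."""
    return nbytes / (bandwidth_gb_s * 1e9) * 1e6

four_mb = 4 * 1024 * 1024
print(f"host link (~25GB/s):      {transfer_time_us(four_mb, 25):.0f} us")   # ~168 us
print(f"global memory (~900GB/s): {transfer_time_us(four_mb, 900):.1f} us")  # ~4.7 us
```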
Unified Memory¶
PyDotCompute's UnifiedBuffer abstracts memory across backends:
With a discrete GPU (CUDA backend), access triggers page migration between host and device:

```python
import numpy as np
from pydotcompute import UnifiedBuffer

data = np.arange(1000, dtype=np.float32)  # example payload

# Single buffer, accessible from both host and device
buf = UnifiedBuffer((1000,), dtype=np.float32)

# Host access
buf.host[:] = data           # Automatic page migration

# Device access (kernel stands for any compiled device function)
result = kernel(buf.device)  # Data migrates to GPU

# Host access again
output = buf.host[:]         # Data migrates back
```
On Apple Silicon (Metal backend), no migration is needed at all:

```python
import numpy as np
from pydotcompute import UnifiedBuffer

data = np.arange(1000, dtype=np.float32)  # example payload

# On Apple Silicon, memory is truly unified
buf = UnifiedBuffer((1000,), dtype=np.float32)

# Host access
buf.host[:] = data

# Metal access (no physical transfer needed!)
metal_array = buf.metal  # Returns MLX array

# CPU and GPU share the same physical memory
output = buf.host[:]  # Virtually free
```
Apple Silicon Advantage
Apple Silicon's unified memory architecture means CPU and GPU share the same physical memory. This eliminates the traditional host-device transfer bottleneck, making Metal particularly efficient for streaming workloads.
Kernel Launch Overhead¶
What Happens at Launch¶
- Driver Setup: ~5-10μs
- Command Buffer: ~2-5μs
- Kernel Dispatch: ~1-2μs
- First Thread Start: ~5-10μs
Total: ~15-30μs per launch
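You can get a rough feel for this overhead with a crude CuPy micro-benchmark. Launches are asynchronous, so the loop below mostly measures CPU-side submission cost; exact numbers vary by driver and GPU, so treat them as indicative only:

```python
import time
import cupy as cp

x = cp.ones(32, dtype=cp.float32)      # tiny input so compute is negligible
cp.square(x)                           # warm up (compilation, memory pool)
cp.cuda.Stream.null.synchronize()

n = 1000
start = time.perf_counter()
for _ in range(n):
    cp.square(x)                       # one kernel launch per iteration
cp.cuda.Stream.null.synchronize()      # wait for all launches to finish
elapsed = time.perf_counter() - start

print(f"~{elapsed / n * 1e6:.1f} us per launch")  # typically single-digit to tens of us
```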
Why Persistent Kernels Help¶
For 100 small computations:

```
Traditional:
    100 × (15μs launch + 10μs compute) = 2500μs

Persistent:
    1 × 15μs launch + 100 × 10μs compute = 1015μs

Speedup: ≈2.5×
```
For streaming workloads, the difference is even larger.
Streaming Multiprocessors (SMs)¶
SM Structure¶
```
┌─────────────────────────────────────────────────┐
│                       SM                        │
├─────────────────────────────────────────────────┤
│   ┌─────────────────────────────────────────┐   │
│   │           Warp Schedulers (4)           │   │
│   └─────────────────────────────────────────┘   │
│                                                 │
│   ┌───────────┐  ┌───────────┐  ┌───────────┐   │
│   │   CUDA    │  │   CUDA    │  │   CUDA    │   │
│   │   Cores   │  │   Cores   │  │   Cores   │   │
│   │   (32)    │  │   (32)    │  │   (32)    │   │
│   └───────────┘  └───────────┘  └───────────┘   │
│                                                 │
│   ┌───────────────────┐  ┌───────────────────┐  │
│   │  Tensor Cores (4) │  │   Register File   │  │
│   └───────────────────┘  │      (256KB)      │  │
│                          └───────────────────┘  │
│   ┌───────────────────────────────────────────┐ │
│   │           Shared Memory (96KB)            │ │
│   └───────────────────────────────────────────┘ │
└─────────────────────────────────────────────────┘
```
Occupancy¶
Occupancy = Active Warps / Maximum Warps per SM
Higher occupancy hides latency:
```python
# Higher occupancy (more concurrent warps)
@kernel(block=(256,))  # 256 threads = 8 warps
def high_occupancy_kernel(data):
    ...

# Lower occupancy (fewer warps, more resources each)
@kernel(block=(64,))   # 64 threads = 2 warps
def low_occupancy_kernel(data):
    # More registers/shared memory per thread
    ...
```
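A rough occupancy calculation, for intuition only; real occupancy also depends on per-thread register and shared-memory usage, and the warp limit varies by architecture:

```python
WARP_SIZE = 32
MAX_WARPS_PER_SM = 48   # example limit; varies by GPU generation

def occupancy(threads_per_block: int, blocks_per_sm: int) -> float:
    """Active warps divided by the SM's maximum resident warps."""
    active_warps = (threads_per_block // WARP_SIZE) * blocks_per_sm
    return min(active_warps / MAX_WARPS_PER_SM, 1.0)

print(f"{occupancy(256, 6):.0%}")  # 8 warps x 6 blocks = 48/48 -> 100%
print(f"{occupancy(64, 6):.0%}")   # 2 warps x 6 blocks = 12/48 -> 25%
```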
PyDotCompute's Approach¶
PyDotCompute addresses these GPU challenges:
| Challenge | Traditional | PyDotCompute |
|---|---|---|
| Launch overhead | Every call | Once |
| Memory transfer | Every call | Minimized |
| State management | Manual | Automatic |
| Synchronization | Explicit | Message-based |
| Memory tracking | Manual | UnifiedBuffer |
| Backend portability | Vendor-specific | Multi-backend (CUDA, Metal, CPU) |
Next Steps¶
- DotCompute Comparison: Origin story
- Ring Kernels: Implementation