Memory Management¶
Managing host and device memory with UnifiedBuffer.
Overview¶
PyDotCompute provides UnifiedBuffer for seamless host-device memory management across CPU, CUDA (NVIDIA), and Metal (Apple Silicon) backends. It tracks which copy is current and automatically synchronizes when needed.
The Memory Challenge¶
GPU programming traditionally requires explicit memory management:
# Traditional approach (manual)
host_data = np.array([1, 2, 3], dtype=np.float32)
device_data = cuda.to_device(host_data) # Copy to GPU
kernel(device_data) # GPU computation
host_result = device_data.copy_to_host() # Copy back
Problems:
- Easy to forget synchronization
- Redundant copies
- Manual state tracking
- Error-prone
UnifiedBuffer Solution¶
UnifiedBuffer automates memory management:
from pydotcompute import UnifiedBuffer
# Create unified buffer
buf = UnifiedBuffer((1000,), dtype=np.float32)
# Work on host
buf.host[:] = np.random.randn(1000)
# Use on device (auto-syncs)
device_array = buf.device # Automatically copies to GPU
# Read back (auto-syncs if device modified)
result = buf.to_numpy()
Buffer States¶
The buffer tracks its state across host and device memory (CUDA or Metal):
┌─────────────────┐
│ UNINITIALIZED │ No data allocated
└────────┬────────┘
│ first access
┌────────▼────────┐
│ HOST_ONLY │ Data on host only
└────────┬────────┘
│ .device/.metal access
┌────────▼────────┐
│ SYNCHRONIZED │ Both copies match
└────────┬────────┘
│
┌────┴────┐
│ │
▼ ▼
┌────────┐ ┌────────┐
│HOST │ │DEVICE │
│DIRTY │ │DIRTY │
└────────┘ └────────┘
State Transitions¶
| From | Action | To |
|---|---|---|
| UNINITIALIZED | .host access | HOST_ONLY |
| HOST_ONLY | .device access | SYNCHRONIZED |
| SYNCHRONIZED | host modified | HOST_DIRTY |
| SYNCHRONIZED | device modified | DEVICE_DIRTY |
| HOST_DIRTY | .device access | SYNCHRONIZED |
| DEVICE_DIRTY | .host access | SYNCHRONIZED |
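The following sketch walks each row of the table in order, assuming the transitions behave exactly as tabulated (mark_device_dirty stands in for a GPU kernel writing to device memory):
buf = UnifiedBuffer((100,), dtype=np.float32)  # UNINITIALIZED
buf.host[:] = 0.0          # .host access → HOST_ONLY
_ = buf.device             # .device access → SYNCHRONIZED (host → device copy)
buf.host[0] = 1.0          # host modified → HOST_DIRTY
_ = buf.device             # .device access → SYNCHRONIZED again
buf.mark_device_dirty()    # device modified → DEVICE_DIRTY
result = buf.to_numpy()    # .host access → SYNCHRONIZED (device → host copy)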
Lazy Synchronization¶
Synchronization happens only when needed:
buf = UnifiedBuffer((1000,), dtype=np.float32)
# Write on host
buf.host[:] = data # State: HOST_DIRTY
# No sync yet - still HOST_DIRTY
print(buf.state) # HOST_DIRTY
# Access device - triggers sync
device_data = buf.device # Sync: host → device
print(buf.state) # SYNCHRONIZED
Explicit Synchronization¶
For performance-critical code, use explicit sync:
buf = UnifiedBuffer((1000,), dtype=np.float32)
# Prepare data
buf.host[:] = data
# Explicitly sync before kernel launch
buf.sync_to_device()
# Run GPU kernel (modifies device data)
my_kernel(buf.device)
# Mark device as modified
buf.mark_device_dirty()
# Explicitly sync before reading
buf.sync_to_host()
# Now safe to read
result = buf.host[:]
Pinned Memory (CUDA)¶
For faster CUDA transfers, use pinned (page-locked) memory:
# Pinned memory for frequent transfers
buf = UnifiedBuffer((1000,), dtype=np.float32, pinned=True)
# Transfers are faster due to DMA
buf.host[:] = data
device_view = buf.device # Faster copy
When to use pinned memory:
- Streaming workloads with frequent transfers
- Real-time processing
- Large batch operations
When to avoid:
- Limited system memory
- Many small buffers (overhead)
- Infrequent transfers
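A rough way to check whether pinning pays off on your hardware is to time repeated uploads with and without it. This is an illustrative sketch only: mark_host_dirty is assumed here as the host-side counterpart of mark_device_dirty, used to force a fresh copy on each iteration.
import time
import numpy as np
from pydotcompute import UnifiedBuffer

def avg_upload_ms(pinned: bool, n: int = 20) -> float:
    buf = UnifiedBuffer((10_000_000,), dtype=np.float32, pinned=pinned)
    buf.host[:] = 1.0
    start = time.perf_counter()
    for _ in range(n):
        buf.mark_host_dirty()  # assumed API: force the next sync to copy again
        buf.sync_to_device()
    return (time.perf_counter() - start) / n * 1e3

print(f"pageable: {avg_upload_ms(False):.2f} ms")
print(f"pinned:   {avg_upload_ms(True):.2f} ms")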
Metal and Unified Memory
On Apple Silicon with Metal, the unified memory architecture eliminates the need for explicit pinned memory. CPU and GPU share the same physical memory, making transfers virtually free.
Memory Pooling¶
Reduce allocation overhead with pooling:
from pydotcompute.core.memory_pool import get_memory_pool
pool = get_memory_pool()
# Acquire from pool (fast if cached)
buf = pool.acquire((1000,), dtype=np.float32)
# Use buffer...
buf.host[:] = data
process(buf.device)
# Release back to pool (not deallocated)
pool.release(buf)
# Next acquire may reuse the buffer
buf2 = pool.acquire((1000,), dtype=np.float32) # Same buffer!
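A buffer that is never released leaks out of the pool, so it helps to tie acquire and release together on all code paths. One convenient pattern (an illustrative wrapper, not part of the pydotcompute API; data and process are the placeholders from the example above) is a context manager:
from contextlib import contextmanager

@contextmanager
def pooled_buffer(shape, dtype):
    pool = get_memory_pool()
    buf = pool.acquire(shape, dtype=dtype)
    try:
        yield buf
    finally:
        pool.release(buf)  # returned to the pool even if an exception was raised

with pooled_buffer((1000,), np.float32) as buf:
    buf.host[:] = data
    process(buf.device)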
Large Data Handling¶
For very large data, avoid message serialization:
from dataclasses import dataclass
from uuid import uuid4

import numpy as np

@message
@dataclass
class ProcessRequest:
    # Don't include large arrays in messages!
    buffer_id: str  # Reference to a shared buffer
    offset: int
    size: int

# Shared buffer registry
buffers: dict[str, UnifiedBuffer] = {}

def create_work_buffer(data: np.ndarray) -> str:
    buf_id = str(uuid4())
    buf = UnifiedBuffer(data.shape, data.dtype)
    buf.copy_from(data)
    buffers[buf_id] = buf
    return buf_id

# Send just the reference
buf_id = create_work_buffer(large_array)
await runtime.send("processor", ProcessRequest(
    buffer_id=buf_id,
    offset=0,
    size=len(large_array),
))
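On the receiving side, the processor resolves the reference through the same registry instead of deserializing array data. The handler below is schematic: run_kernel is a placeholder, and how handlers are registered depends on your runtime setup.
def handle_process_request(msg: ProcessRequest) -> None:
    buf = buffers[msg.buffer_id]  # resolve the reference; no array data was copied
    view = buf.device[msg.offset : msg.offset + msg.size]
    run_kernel(view)  # placeholder: GPU work on the referenced slice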
Memory Patterns¶
Read-Modify-Write¶
buf = UnifiedBuffer((1000,), dtype=np.float32)
# Write on host
buf.host[:] = input_data # HOST_DIRTY
# Modify on device
gpu_kernel(buf.device) # Syncs first, then SYNCHRONIZED
buf.mark_device_dirty() # DEVICE_DIRTY
# Read back on host
result = buf.to_numpy() # Syncs back
Double Buffering¶
# Two buffers for overlap
buf_a = UnifiedBuffer((1000,), dtype=np.float32)
buf_b = UnifiedBuffer((1000,), dtype=np.float32)
while data_available():
    # Process buf_a on GPU while filling buf_b on CPU
    buf_a.sync_to_device()
    gpu_task = launch_async(kernel, buf_a.device)
    # Meanwhile, fill buf_b on host
    buf_b.host[:] = get_next_batch()
    # Wait for GPU
    await gpu_task
    # Swap buffers
    buf_a, buf_b = buf_b, buf_a
Batch Processing¶
pool = get_memory_pool()
results = []
for batch in batches:
    buf = pool.acquire(batch.shape, batch.dtype)
    try:
        buf.copy_from(batch)
        process_on_gpu(buf.device)
        results.append(buf.to_numpy().copy())
    finally:
        pool.release(buf)
Memory Best Practices¶
- Minimize Transfers: Keep data on the GPU as long as possible (see the sketch after this list)
- Use Pooling: Reduce allocation overhead with a memory pool
- Explicit Sync for Timing: Use explicit synchronization when benchmarking
- Pinned Memory for Streaming: Enable pinned memory for high-throughput streaming
- Batch Operations: Process multiple items per transfer
- Check State: Debug with buf.state
- Don't Serialize Large Data: Use buffer references instead
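For the first point, the savings come from chaining work on the device rather than bouncing through host memory between steps. A sketch (scale_kernel and offset_kernel are placeholder kernels, not library functions):
buf = UnifiedBuffer((1_000_000,), dtype=np.float32)
buf.host[:] = input_data
d = buf.device            # single host → device copy
scale_kernel(d)           # placeholder: runs entirely on the device
offset_kernel(d)          # placeholder: no intermediate host round trip
buf.mark_device_dirty()
result = buf.to_numpy()   # single device → host copy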
GPU Memory Monitoring¶
CUDA:
from pydotcompute import get_accelerator
acc = get_accelerator()
# Before allocation
free_before, total = acc.get_memory_info()
# Allocate
buf = UnifiedBuffer((10_000_000,), dtype=np.float32)
_ = buf.device # Force device allocation
# After allocation
free_after, _ = acc.get_memory_info()
print(f"Allocated: {(free_before - free_after) / 1e6:.1f} MB")
Metal:
from pydotcompute import get_accelerator
acc = get_accelerator()
# Metal memory info includes cache and peak usage
free, total = acc.get_memory_info()
print(f"Memory: {free / 1e9:.1f} GB free / {total / 1e9:.1f} GB total")
# Allocate
buf = UnifiedBuffer((10_000_000,), dtype=np.float32)
_ = buf.metal # Force Metal allocation (virtually free on unified memory)
Next Steps¶
- Lifecycle: Kernel state management
- GPU Optimization Guide: Performance tips