Memory Management

Managing host and device memory with UnifiedBuffer.

Overview

PyDotCompute provides UnifiedBuffer for seamless host-device memory management across CPU, CUDA (NVIDIA), and Metal (Apple Silicon) backends. It tracks which copy is current and automatically synchronizes when needed.

The Memory Challenge

GPU programming traditionally requires explicit memory management:

# Traditional approach (manual), shown here with Numba's CUDA API
import numpy as np
from numba import cuda

host_data = np.array([1, 2, 3], dtype=np.float32)
device_data = cuda.to_device(host_data)   # Copy to GPU
kernel(device_data)                       # GPU computation (kernel is a placeholder)
host_result = device_data.copy_to_host()  # Copy back

Problems:

  • Easy to forget synchronization
  • Redundant copies
  • Manual state tracking
  • Error-prone

UnifiedBuffer Solution

UnifiedBuffer automates memory management:

import numpy as np

from pydotcompute import UnifiedBuffer

# Create unified buffer
buf = UnifiedBuffer((1000,), dtype=np.float32)

# Work on host
buf.host[:] = np.random.randn(1000)

# Use on device (auto-syncs)
device_array = buf.device  # Automatically copies to GPU

# Read back (auto-syncs if device modified)
result = buf.to_numpy()

Buffer States

The buffer tracks its state across host and device memory (CUDA or Metal):

┌─────────────────┐
│  UNINITIALIZED  │  No data allocated
└────────┬────────┘
         │ first access
┌────────▼────────┐
│    HOST_ONLY    │  Data on host only
└────────┬────────┘
         │ .device/.metal access
┌────────▼────────┐
│  SYNCHRONIZED   │  Both copies match
└────────┬────────┘
    ┌────┴────┐
    │         │
    ▼         ▼
┌────────┐  ┌────────┐
│HOST    │  │DEVICE  │
│DIRTY   │  │DIRTY   │
└────────┘  └────────┘

State Transitions

From            Action            To
─────────────────────────────────────────────
UNINITIALIZED   .host access      HOST_ONLY
HOST_ONLY       .device access    SYNCHRONIZED
SYNCHRONIZED    host modified     HOST_DIRTY
SYNCHRONIZED    device modified   DEVICE_DIRTY
HOST_DIRTY      .device access    SYNCHRONIZED
DEVICE_DIRTY    .host access      SYNCHRONIZED
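
A minimal walkthrough of these transitions. The state names match what buf.state prints elsewhere on this page; exactly when each backend flips a state is implementation-specific, so treat the commented states as a sketch:

import numpy as np
from pydotcompute import UnifiedBuffer

buf = UnifiedBuffer((8,), dtype=np.float32)

buf.host[:] = 1.0   # .host access: UNINITIALIZED → HOST_ONLY
_ = buf.device      # .device access: HOST_ONLY → SYNCHRONIZED
buf.host[0] = 2.0   # host modified: SYNCHRONIZED → HOST_DIRTY
_ = buf.device      # .device access: HOST_DIRTY → SYNCHRONIZED
print(buf.state)    # SYNCHRONIZED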

Lazy Synchronization

Synchronization happens only when needed:

buf = UnifiedBuffer((1000,), dtype=np.float32)

# Write on host
buf.host[:] = data  # State: HOST_DIRTY

# No sync yet - still HOST_DIRTY
print(buf.state)  # HOST_DIRTY

# Access device - triggers sync
device_data = buf.device  # Sync: host → device
print(buf.state)  # SYNCHRONIZED

Explicit Synchronization

For performance-critical code, use explicit sync:

buf = UnifiedBuffer((1000,), dtype=np.float32)

# Prepare data
buf.host[:] = data

# Explicitly sync before kernel launch
buf.sync_to_device()

# Run GPU kernel (modifies device data)
my_kernel(buf.device)

# Mark device as modified
buf.mark_device_dirty()

# Explicitly sync before reading
buf.sync_to_host()

# Now safe to read
result = buf.host[:]
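
Explicit sync also keeps transfer cost out of timed regions when benchmarking. A sketch using the standard library's time.perf_counter; note that if the backend launches kernels asynchronously, a device-wide synchronize is also needed before stopping the timer, which this page does not cover:

import time

buf = UnifiedBuffer((1_000_000,), dtype=np.float32)
buf.host[:] = data
buf.sync_to_device()             # pay the transfer cost up front

start = time.perf_counter()
my_kernel(buf.device)            # timed region: kernel only, no copies
elapsed = time.perf_counter() - start
print(f"Kernel time: {elapsed * 1e3:.2f} ms")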

Pinned Memory (CUDA)

For faster CUDA transfers, use pinned (page-locked) memory:

# Pinned memory for frequent transfers
buf = UnifiedBuffer((1000,), dtype=np.float32, pinned=True)

# Transfers are faster due to DMA
buf.host[:] = data
device_view = buf.device  # Faster copy

When to use pinned memory:

  • Streaming workloads with frequent transfers
  • Real-time processing
  • Large batch operations

When to avoid:

  • Limited system memory
  • Many small buffers (overhead)
  • Infrequent transfers

Metal and Unified Memory

On Apple Silicon with Metal, the unified memory architecture eliminates the need for explicit pinned memory. CPU and GPU share the same physical memory, making transfers virtually free.
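
A short sketch of the same workflow on Apple Silicon; buf.metal is the Metal-side view used elsewhere on this page, and the "copy" is effectively free because CPU and GPU address the same physical pages:

import numpy as np
from pydotcompute import UnifiedBuffer

buf = UnifiedBuffer((1_000_000,), dtype=np.float32)
buf.host[:] = np.random.randn(1_000_000)

# On unified memory this access is essentially a no-op:
# no PCIe transfer, no staging copy.
gpu_view = buf.metal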

Memory Pooling

Reduce allocation overhead with pooling:

from pydotcompute.core.memory_pool import get_memory_pool

pool = get_memory_pool()

# Acquire from pool (fast if cached)
buf = pool.acquire((1000,), dtype=np.float32)

# Use buffer...
buf.host[:] = data
process(buf.device)

# Release back to pool (not deallocated)
pool.release(buf)

# Next acquire may reuse the buffer
buf2 = pool.acquire((1000,), dtype=np.float32)  # Same buffer!

Large Data Handling

For very large data, avoid serializing arrays into messages; pass a reference to a shared buffer instead:

from dataclasses import dataclass
from uuid import uuid4

@message
@dataclass
class ProcessRequest:
    # Don't include large arrays in messages!
    buffer_id: str  # Reference to shared buffer
    offset: int
    size: int

# Shared buffer registry
buffers: dict[str, UnifiedBuffer] = {}

def create_work_buffer(data: np.ndarray) -> str:
    buf_id = str(uuid4())
    buf = UnifiedBuffer(data.shape, dtype=data.dtype)
    buf.copy_from(data)
    buffers[buf_id] = buf
    return buf_id

# Send just the reference
buf_id = create_work_buffer(large_array)
await runtime.send("processor", ProcessRequest(
    buffer_id=buf_id,
    offset=0,
    size=len(large_array),
))
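
On the receiving side, the handler looks up the shared buffer by ID rather than deserializing the array. The handler shape below is hypothetical (only runtime.send appears on this page), and slicing the device view assumes the backend's device arrays support it:

async def handle_process(msg: ProcessRequest) -> None:
    buf = buffers[msg.buffer_id]                        # shared registry from above
    view = buf.device[msg.offset : msg.offset + msg.size]
    run_kernel(view)                                    # run_kernel is a placeholder
    buf.mark_device_dirty()                             # device output must sync back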

Memory Patterns

Read-Modify-Write

buf = UnifiedBuffer((1000,), dtype=np.float32)

# Read on host
buf.host[:] = input_data  # HOST_DIRTY

# Modify on device
gpu_kernel(buf.device)    # Syncs, then SYNCHRONIZED
buf.mark_device_dirty()   # DEVICE_DIRTY

# Read on host
result = buf.to_numpy()   # Syncs back

Double Buffering

# Two buffers so GPU work on one overlaps host fills of the other
# (data_available, launch_async, get_next_batch, kernel are placeholders)
buf_a = UnifiedBuffer((1000,), dtype=np.float32)
buf_b = UnifiedBuffer((1000,), dtype=np.float32)
buf_a.host[:] = get_next_batch()  # prime the first buffer

while data_available():
    # Process buf_a on GPU while filling buf_b on CPU
    buf_a.sync_to_device()
    gpu_task = launch_async(kernel, buf_a.device)

    # Meanwhile, fill buf_b on host
    buf_b.host[:] = get_next_batch()

    # Wait for GPU
    await gpu_task

    # Swap buffers
    buf_a, buf_b = buf_b, buf_a

Batch Processing

pool = get_memory_pool()
results = []

for batch in batches:
    buf = pool.acquire(batch.shape, batch.dtype)
    try:
        buf.copy_from(batch)
        process_on_gpu(buf.device)
        results.append(buf.to_numpy().copy())  # copy, so data outlives the pooled buffer
    finally:
        pool.release(buf)

Memory Best Practices

  1. Minimize Transfers: Keep data on the GPU as long as possible (see the sketch after this list)

  2. Use Pooling: Reduce allocation overhead

  3. Explicit Sync for Timing: Use explicit sync for benchmarks

  4. Pinned Memory for Streaming: Enable for high-throughput

  5. Batch Operations: Process multiple items per transfer

  6. Check State: Debug with buf.state

  7. Don't Serialize Large Data: Use buffer references
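
A sketch of practice 1: chain kernels on the device and transfer once at each end (step_one and step_two are placeholder kernels):

buf = UnifiedBuffer((1000,), dtype=np.float32)
buf.host[:] = data          # one transfer in (on first .device access)

d = buf.device
step_one(d)                 # intermediate results stay on the GPU
step_two(d)
buf.mark_device_dirty()

result = buf.to_numpy()     # one transfer out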

GPU Memory Monitoring

from pydotcompute import get_accelerator

acc = get_accelerator()

# Before allocation
free_before, total = acc.get_memory_info()

# Allocate
buf = UnifiedBuffer((10_000_000,), dtype=np.float32)
_ = buf.device  # Force device allocation

# After allocation
free_after, _ = acc.get_memory_info()

print(f"Allocated: {(free_before - free_after) / 1e6:.1f} MB")
from pydotcompute import get_accelerator

acc = get_accelerator()

# Metal memory info includes cache and peak usage
free, total = acc.get_memory_info()
print(f"Memory: {free / 1e9:.1f} GB free / {total / 1e9:.1f} GB total")

# Allocate
buf = UnifiedBuffer((10_000_000,), dtype=np.float32)
_ = buf.metal  # Force Metal allocation (virtually free on unified memory)

Next Steps