Memory Management Architecture

The Memory Management system provides a unified memory abstraction with high-performance pooling, zero-copy operations, and optimized data transfers across CPU and GPU devices.

Architecture Overview

Application
    ↓
IUnifiedMemoryManager (Unified API)
    ↓
┌───────────────────────────────────────────────┐
│        UnifiedMemoryManager                   │
├───────────────────────────────────────────────┤
│  - Buffer allocation and lifecycle            │
│  - Pool coordination                          │
│  - Transfer orchestration                     │
│  - P2P management                             │
└───────────────────────────────────────────────┘
    ↓                    ↓                    ↓
MemoryPool    TransferEngine    P2PManager
(Pooling)     (Transfers)       (GPU-to-GPU)
    ↓                    ↓                    ↓
┌─────────────────────────────────────────────────┐
│        OptimizedUnifiedBuffer<T>                │
├─────────────────────────────────────────────────┤
│  - Cross-device memory buffer                   │
│  - Zero-copy via Span<T>                        │
│  - Automatic synchronization                    │
│  - Lifecycle management                         │
└─────────────────────────────────────────────────┘
    ↓
Device Memory (CPU/CUDA/Metal)

Core Concepts

Unified Memory Abstraction

All memory operations go through a single interface regardless of backend:

public interface IUnifiedMemoryManager
{
    /// <summary>
    /// Allocates a unified buffer accessible from all devices
    /// </summary>
    Task<IUnifiedBuffer<T>> AllocateAsync<T>(
        long elementCount,
        AllocationMode mode = AllocationMode.Pooled,
        CancellationToken cancellationToken = default) where T : unmanaged;

    /// <summary>
    /// Creates buffer from existing CPU memory (zero-copy when possible)
    /// </summary>
    IUnifiedBuffer<T> CreateBuffer<T>(
        T[] hostData,
        BufferMode mode = BufferMode.ReadWrite) where T : unmanaged;

    /// <summary>
    /// Gets memory usage statistics
    /// </summary>
    MemoryStatistics GetStatistics();

    /// <summary>
    /// Returns buffer to pool for reuse
    /// </summary>
    ValueTask ReturnToPoolAsync<T>(
        IUnifiedBuffer<T> buffer) where T : unmanaged;
}

Design Benefits:

  • Backend Independence: Same code works on CPU, CUDA, Metal
  • Type Safety: Generic IUnifiedBuffer<T> prevents type errors
  • Zero-Copy: Span access avoids unnecessary copies
  • Automatic Pooling: Transparent buffer reuse
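
Putting the interface together, a minimal usage sketch (assuming a memoryManager instance obtained elsewhere, e.g., via dependency injection):

// Sketch: allocate from the pool, upload data, and recycle the buffer.
float[] input = new float[1024];

var buffer = await memoryManager.AllocateAsync<float>(input.Length);
try
{
    await buffer.CopyFromAsync(input);      // host -> device (zero-copy on CPU backends)
    // ... execute kernels against buffer here ...
}
finally
{
    await memoryManager.ReturnToPoolAsync(buffer);  // recycle rather than free
}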

IUnifiedBuffer Interface

The core memory abstraction:

public interface IUnifiedBuffer<T> : IDisposable, IAsyncDisposable where T : unmanaged
{
    /// <summary>
    /// Number of elements in buffer
    /// </summary>
    long Length { get; }

    /// <summary>
    /// Size in bytes
    /// </summary>
    long SizeInBytes { get; }

    /// <summary>
    /// Device where buffer is allocated
    /// </summary>
    IAccelerator Device { get; }

    /// <summary>
    /// Current location (Host, Device, Both)
    /// </summary>
    BufferLocation Location { get; }

    /// <summary>
    /// Copies data from host to device
    /// </summary>
    Task CopyFromAsync(
        ReadOnlySpan<T> source,
        long offset = 0,
        CancellationToken cancellationToken = default);

    /// <summary>
    /// Copies data from device to host
    /// </summary>
    Task CopyToAsync(
        Span<T> destination,
        long offset = 0,
        CancellationToken cancellationToken = default);

    /// <summary>
    /// Gets direct CPU access (zero-copy when possible)
    /// </summary>
    Span<T> AsSpan();

    /// <summary>
    /// Maps buffer to CPU memory (may require transfer)
    /// </summary>
    Task<Memory<T>> MapAsync(
        MapMode mode,
        CancellationToken cancellationToken = default);

    /// <summary>
    /// Unmaps buffer from CPU memory
    /// </summary>
    Task UnmapAsync(CancellationToken cancellationToken = default);
}
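
MapAsync/UnmapAsync are intended for bulk CPU-side edits of device-resident buffers. A hedged sketch of the expected pattern (MapMode.ReadWrite is assumed to be a member of the MapMode enum, which is not spelled out above):

// Sketch: map a device buffer into CPU memory, mutate it, then unmap.
Memory<float> mapped = await buffer.MapAsync(MapMode.ReadWrite);

Span<float> span = mapped.Span;
for (int i = 0; i < span.Length; i++)
{
    span[i] *= 2.0f;   // edit in place on the CPU
}

await buffer.UnmapAsync();  // flush changes back to the device if a copy was made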

Memory Pooling System

High-Performance Object Pool

The pooling system dramatically reduces allocations:

public class HighPerformanceObjectPool<T> where T : class
{
    private readonly ConcurrentBag<T>[] _pools;
    private readonly Func<T> _factory;
    private readonly Action<T>? _reset;
    private readonly int _maxItemsPerBucket;

    /// <summary>
    /// Gets object from pool or creates new one
    /// </summary>
    public T Rent()
    {
        // Try the bucket keyed by the current thread's ID first (lock-free, usually uncontended)
        int bucketIndex = Thread.CurrentThread.ManagedThreadId % _pools.Length;
        if (_pools[bucketIndex].TryTake(out var item))
        {
            return item;
        }

        // Try other buckets
        for (int i = 0; i < _pools.Length; i++)
        {
            if (i != bucketIndex && _pools[i].TryTake(out item))
            {
                return item;
            }
        }

        // Create new object if pool exhausted
        return _factory();
    }

    /// <summary>
    /// Returns object to pool for reuse
    /// </summary>
    public void Return(T item)
    {
        // Reset state if needed
        _reset?.Invoke(item);

        // Return to the bucket keyed by the current thread's ID
        // (note: ConcurrentBag.Count is O(n); acceptable for small, bounded buckets)
        int bucketIndex = Thread.CurrentThread.ManagedThreadId % _pools.Length;
        if (_pools[bucketIndex].Count < _maxItemsPerBucket)
        {
            _pools[bucketIndex].Add(item);
        }
        // else: discard (pool is full)
    }
}

Performance Characteristics:

  • Rent: < 100ns (pool hit), ~1μs (pool miss)
  • Return: < 50ns
  • Contention: Minimal (per-thread hashed buckets)
  • Allocation Reduction: 90%+ (measured in benchmarks)
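
A usage sketch for the pool; the constructor shown here (factory, reset, bucket count, bucket capacity) is an assumption, since only Rent and Return appear above:

// Hypothetical construction; parameter names are illustrative only.
var pool = new HighPerformanceObjectPool<List<int>>(
    factory: () => new List<int>(capacity: 1024),
    reset: list => list.Clear(),                  // wipe state before reuse
    bucketCount: Environment.ProcessorCount,
    maxItemsPerBucket: 32);

var scratch = pool.Rent();
try
{
    scratch.Add(42);       // use the pooled object
}
finally
{
    pool.Return(scratch);  // cleared by the reset action, then re-bucketed
}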

Memory Pool with Size Classes

The memory pool organizes buffers by size:

public class MemoryPool
{
    // 21 size classes: 1KB, 2KB, 4KB, ..., 512MB, 1GB
    private static readonly long[] SizeClasses = GenerateSizeClasses();

    private readonly ConcurrentDictionary<long, ConcurrentBag<IUnifiedBuffer<byte>>> _pools;

    /// <summary>
    /// Gets buffer from pool or allocates new
    /// </summary>
    public async Task<IUnifiedBuffer<T>> AllocateAsync<T>(
        long elementCount,
        IAccelerator device) where T : unmanaged
    {
        long sizeInBytes = elementCount * Unsafe.SizeOf<T>();
        long sizeClass = GetSizeClass(sizeInBytes);

        // Try to get from pool
        if (_pools.TryGetValue(sizeClass, out var pool) && pool.TryTake(out var buffer))
        {
            // Reuse the pooled buffer (it may be larger than requested);
            // Cast<T> reinterprets the pooled byte buffer as T (helper omitted here)
            return Cast<T>(buffer);
        }

        // Allocate new buffer
        return await device.AllocateAsync<T>(elementCount);
    }

    /// <summary>
    /// Returns buffer to pool
    /// </summary>
    public ValueTask ReturnAsync<T>(IUnifiedBuffer<T> buffer) where T : unmanaged
    {
        long sizeClass = GetSizeClass(buffer.SizeInBytes);

        var byteBuffer = Cast<byte>(buffer);
        _pools.GetOrAdd(sizeClass, _ => new ConcurrentBag<IUnifiedBuffer<byte>>())
            .Add(byteBuffer);

        return ValueTask.CompletedTask;
    }

    private static long GetSizeClass(long size)
    {
        // Binary search for the smallest size class >= size (O(log n));
        // requests never exceed the largest class (1GB) in this sketch
        int index = Array.BinarySearch(SizeClasses, size);
        return index >= 0 ? SizeClasses[index] : SizeClasses[~index];
    }

    private static long[] GenerateSizeClasses()
    {
        // 1KB, 2KB, 4KB, 8KB, ..., 512MB, 1GB
        var classes = new List<long>();
        for (long size = 1024; size <= 1024L * 1024 * 1024; size *= 2)
        {
            classes.Add(size);
        }
        return classes.ToArray();
    }
}

Benefits:

  • Fast Lookup: O(log n) size class determination
  • Minimal Waste: Power-of-2 sizing limits internal fragmentation (worked example below)
  • Concurrent: Lock-free concurrent bags per size class
  • Bounded: Configurable max pool size per class
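
For intuition on the waste bound: a request is rounded up to the nearest class, so internal fragmentation is always under 50%.

// Worked example of power-of-2 size classing (mirrors GetSizeClass above).
long requested = 300_000;                       // bytes
long sizeClass = 1024;
while (sizeClass < requested) sizeClass *= 2;   // -> 524,288 (the 512KB class)

// Slack: 524,288 - 300,000 = 224,288 bytes (~43% of the class; < 50% worst case)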

Pooling Strategy

public enum AllocationMode
{
    /// <summary>
    /// Use pooling when possible (default)
    /// </summary>
    Pooled,

    /// <summary>
    /// Always allocate new buffer (no pooling)
    /// </summary>
    Unpooled,

    /// <summary>
    /// Allocate pinned memory for fast CPU-GPU transfers
    /// </summary>
    Pinned,

    /// <summary>
    /// Use unified memory (shared between CPU and GPU)
    /// </summary>
    Unified
}
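
How the modes map to workloads, as a hedged sketch (using the AllocateAsync overload from IUnifiedMemoryManager above):

// Transient scratch buffer: pooled allocation (the default).
var scratch = await memoryManager.AllocateAsync<float>(1_000_000, AllocationMode.Pooled);

// Staging buffer reused across many host<->device copies: pinning pays for itself.
var staging = await memoryManager.AllocateAsync<byte>(64 * 1024 * 1024, AllocationMode.Pinned);

// On unified-memory hardware (e.g., Apple Silicon): no explicit transfers needed.
var shared = await memoryManager.AllocateAsync<float>(1_000_000, AllocationMode.Unified);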

Zero-Copy Operations

Span Access

CPU buffers support zero-copy access via Span<T>:

public class CpuBuffer<T> : IUnifiedBuffer<T> where T : unmanaged
{
    private T[] _data;

    /// <summary>
    /// Direct access to underlying array (zero-copy)
    /// </summary>
    public Span<T> AsSpan() => _data.AsSpan();

    /// <summary>
    /// Zero-copy read from buffer
    /// </summary>
    public Task CopyToAsync(
        Span<T> destination,
        long offset = 0,
        CancellationToken cancellationToken = default)
    {
        _data.AsSpan((int)offset, destination.Length).CopyTo(destination);
        return Task.CompletedTask; // copy is synchronous; no async work needed
    }

    /// <summary>
    /// Zero-copy write to buffer
    /// </summary>
    public Task CopyFromAsync(
        ReadOnlySpan<T> source,
        long offset = 0,
        CancellationToken cancellationToken = default)
    {
        source.CopyTo(_data.AsSpan((int)offset, source.Length));
        return Task.CompletedTask;
    }
}

Performance:

  • AsSpan(): < 10ns (no allocation; returns a stack-only view)
  • CopyTo/CopyFrom: Memory bandwidth limited (no overhead)
  • Benefit: 10x faster than allocating intermediate arrays

Pinned Memory

For GPU buffers, pinned memory accelerates transfers:

public class CudaPinnedBuffer<T> : IUnifiedBuffer<T> where T : unmanaged
{
    private GCHandle _pinnedHandle;
    private T[] _hostData;
    private IntPtr _devicePtr;

    public Task CopyFromAsync(
        ReadOnlySpan<T> source,
        long offset = 0,
        CancellationToken cancellationToken = default)
    {
        // 1. Copy into pinned host memory synchronously
        //    (a Span parameter cannot cross an await boundary, so this method is not async)
        source.CopyTo(_hostData.AsSpan((int)offset));

        // 2. DMA transfer to device (no intermediate staging copy)
        long byteOffset = offset * Unsafe.SizeOf<T>();
        return CudaMemcpyAsync(
            _devicePtr + (nint)byteOffset,
            _pinnedHandle.AddrOfPinnedObject() + (nint)byteOffset,
            source.Length * Unsafe.SizeOf<T>(),
            cudaMemcpyHostToDevice);
    }
}

Performance:

  • Pinned Memory: 2-3x faster transfers vs non-pinned
  • Overhead: One-time pinning cost (~100μs for 1MB)
  • Benefit: Worth it for buffers used multiple times

Unified Memory (CUDA/Metal)

Modern GPUs support unified memory:

public class UnifiedMemoryBuffer<T> : IUnifiedBuffer<T> where T : unmanaged
{
    private IntPtr _unifiedPtr; // Accessible from both CPU and GPU

    public Span<T> AsSpan()
    {
        // Direct CPU access with no copy (Span is limited to int.MaxValue elements)
        unsafe
        {
            return new Span<T>(_unifiedPtr.ToPointer(), (int)Length);
        }
    }

    public Task CopyFromAsync(
        ReadOnlySpan<T> source,
        long offset = 0,
        CancellationToken cancellationToken = default)
    {
        // Write directly into unified memory; no staging copy required
        source.CopyTo(AsSpan()[(int)offset..]);

        // Page migration to the GPU happens automatically on first access
        return Task.CompletedTask;
    }
}

Benefits (Apple Silicon M1/M2/M3):

  • Zero-Copy: CPU and GPU access same physical memory
  • 2-3x Speedup: No explicit transfers needed
  • Automatic: OS handles page migration
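
Combined with AllocationMode.Unified, the CPU-write-then-GPU-read round trip collapses into a sketch like this (kernel is the same assumed kernel object used in later examples):

// Sketch: write through the CPU span, launch a kernel, no explicit copy anywhere.
var buffer = await memoryManager.AllocateAsync<float>(4096, AllocationMode.Unified);

var span = buffer.AsSpan();            // same physical memory the GPU will see
for (int i = 0; i < span.Length; i++)
{
    span[i] = i;
}

await kernel.ExecuteAsync(buffer);     // pages migrate on first GPU access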

Data Transfer Engine

Advanced Memory Transfer Engine

Orchestrates concurrent and pipelined transfers:

public class AdvancedMemoryTransferEngine
{
    private readonly SemaphoreSlim _transferSemaphore;
    private readonly int _maxConcurrentTransfers;

    /// <summary>
    /// Copies data with automatic optimization
    /// </summary>
    public async Task TransferAsync<T>(
        IUnifiedBuffer<T> source,
        IUnifiedBuffer<T> destination,
        TransferOptions options) where T : unmanaged
    {
        // 1. Choose transfer strategy
        var strategy = ChooseStrategy(source, destination, options);

        // 2. Throttle concurrent transfers
        await _transferSemaphore.WaitAsync();

        try
        {
            // 3. Execute transfer
            await strategy.ExecuteAsync(source, destination);
        }
        finally
        {
            _transferSemaphore.Release();
        }
    }

    private ITransferStrategy ChooseStrategy(
        IUnifiedBuffer source,        // non-generic base of IUnifiedBuffer<T>
        IUnifiedBuffer destination,
        TransferOptions options)
    {
        // Same device: Use device-local copy
        if (source.Device == destination.Device)
        {
            return new DeviceLocalCopyStrategy();
        }

        // Both GPU: Use P2P if available
        if (source.Device.Type.IsGpu() && destination.Device.Type.IsGpu())
        {
            if (P2PManager.IsPeerAccessEnabled(source.Device, destination.Device))
            {
                return new P2PTransferStrategy();
            }
        }

        // CPU-GPU: Use pinned memory if available
        if (source.Device.Type == AcceleratorType.CPU ||
            destination.Device.Type == AcceleratorType.CPU)
        {
            return new PinnedMemoryTransferStrategy();
        }

        // Default: Staged transfer via host
        return new StagedTransferStrategy();
    }
}

Transfer Strategies

1. Device-Local Copy (fastest):

public class DeviceLocalCopyStrategy : ITransferStrategy
{
    public async Task ExecuteAsync(IUnifiedBuffer source, IUnifiedBuffer destination)
    {
        // Same device: Use device memcpy (no host involvement)
        await source.Device.MemcpyAsync(
            destination.DevicePointer,
            source.DevicePointer,
            source.SizeInBytes
        );
    }
}

Performance: Memory bandwidth limited (~900 GB/s on modern GPUs)

2. P2P Transfer (GPU to GPU):

public class P2PTransferStrategy : ITransferStrategy
{
    public async Task ExecuteAsync(IUnifiedBuffer source, IUnifiedBuffer destination)
    {
        // Direct GPU-to-GPU transfer (no host memory)
        await CudaMemcpyPeerAsync(
            destination.DevicePointer, destination.Device.Id,
            source.DevicePointer, source.Device.Id,
            source.SizeInBytes
        );
    }
}

Performance: NVLink: ~300 GB/s, PCIe 4.0: ~32 GB/s

3. Pinned Memory Transfer (CPU-GPU):

public class PinnedMemoryTransferStrategy : ITransferStrategy
{
    public async Task ExecuteAsync(IUnifiedBuffer source, IUnifiedBuffer destination)
    {
        // Allocate pinned staging buffer
        using var pinnedBuffer = await AllocatePinnedAsync(source.SizeInBytes);

        // Copy to pinned buffer
        await source.CopyToAsync(pinnedBuffer);

        // DMA transfer to GPU
        await destination.CopyFromAsync(pinnedBuffer);
    }
}

Performance: 2-3x faster than non-pinned (12-16 GB/s on PCIe 4.0)

4. Staged Transfer (fallback):

public class StagedTransferStrategy : ITransferStrategy
{
    public async Task ExecuteAsync<T>(
        IUnifiedBuffer<T> source,
        IUnifiedBuffer<T> destination) where T : unmanaged
    {
        // Stage through host memory (works for any device pair)
        var hostBuffer = new T[source.Length];
        await source.CopyToAsync(hostBuffer);

        // Copy from host to the destination device
        await destination.CopyFromAsync(hostBuffer);
    }
}

Performance: Slowest, but always works

Pipeline Optimization

For multiple buffers, transfers are launched as a throttled pipeline:

public async Task TransferBatchAsync<T>(
    IReadOnlyList<IUnifiedBuffer<T>> sources,
    IReadOnlyList<IUnifiedBuffer<T>> destinations) where T : unmanaged
{
    var tasks = new List<Task>();

    // Launch all transfers concurrently (up to max concurrent limit)
    for (int i = 0; i < sources.Count; i++)
    {
        tasks.Add(TransferAsync(sources[i], destinations[i]));

        // Throttle: wait if too many concurrent transfers
        if (tasks.Count >= _maxConcurrentTransfers)
        {
            await Task.WhenAny(tasks);
            tasks.RemoveAll(t => t.IsCompleted);
        }
    }

    // Wait for all remaining transfers
    await Task.WhenAll(tasks);
}

Benefit: Overlaps transfers with computation, 1.5-2x throughput improvement
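
The compute overlap itself happens in the caller; a double-buffering sketch (UploadAsync and ComputeAsync are hypothetical helpers standing in for a transfer and a kernel launch):

// While the kernel works on buffers[cur], the next batch uploads into buffers[next].
int cur = 0;
Task upload = UploadAsync(batches[0], buffers[0]);

for (int i = 0; i < batches.Count; i++)
{
    await upload;                                   // data for batch i is ready
    int next = 1 - cur;
    if (i + 1 < batches.Count)
    {
        upload = UploadAsync(batches[i + 1], buffers[next]);  // prefetch batch i+1
    }

    await ComputeAsync(buffers[cur]);               // overlaps with the prefetch
    cur = next;
}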

P2P Memory Management

Peer-to-Peer Transfer Manager

public class P2PManager
{
    private readonly Dictionary<(int, int), bool> _peerAccessMatrix = new();
    private readonly Dictionary<int, IntPtr> _contexts = new(); // CUcontext handle per device

    /// <summary>
    /// Enables direct memory access between two GPUs
    /// </summary>
    public Task<bool> EnablePeerAccessAsync(int deviceId1, int deviceId2)
    {
        // Check whether the hardware supports P2P between these devices
        cuDeviceCanAccessPeer(out int canAccess, deviceId1, deviceId2);

        if (canAccess == 0)
            return Task.FromResult(false);

        // Enable peer access in both directions
        cuCtxSetCurrent(_contexts[deviceId1]);
        cuCtxEnablePeerAccess(_contexts[deviceId2], 0);

        cuCtxSetCurrent(_contexts[deviceId2]);
        cuCtxEnablePeerAccess(_contexts[deviceId1], 0);

        _peerAccessMatrix[(deviceId1, deviceId2)] = true;
        _peerAccessMatrix[(deviceId2, deviceId1)] = true;

        return Task.FromResult(true);
    }

    /// <summary>
    /// Checks if P2P is enabled between devices
    /// </summary>
    public bool IsPeerAccessEnabled(IAccelerator device1, IAccelerator device2)
    {
        if (device1.Type != AcceleratorType.CUDA || device2.Type != AcceleratorType.CUDA)
            return false;

        return _peerAccessMatrix.GetValueOrDefault((device1.Id, device2.Id), false);
    }

    /// <summary>
    /// Transfers data directly between GPUs (P2P)
    /// </summary>
    public async Task TransferP2PAsync<T>(
        IUnifiedBuffer<T> source,
        IUnifiedBuffer<T> destination) where T : unmanaged
    {
        if (!IsPeerAccessEnabled(source.Device, destination.Device))
        {
            throw new InvalidOperationException("P2P not enabled between devices");
        }

        await CudaMemcpyPeerAsync(
            destination.DevicePointer, destination.Device.Id,
            source.DevicePointer, source.Device.Id,
            source.SizeInBytes
        );
    }
}

Requirements:

  • Same vendor and a compatible architecture (e.g., all NVIDIA GPUs)
  • NVLink or PCIe connectivity
  • Compute Capability 2.0+ for CUDA

Performance:

  • NVLink: Up to 300 GB/s (~9x PCIe 4.0 x16)
  • PCIe 4.0: ~32 GB/s
  • Benefit: No host memory involvement
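
End to end, a hedged usage sketch (construction details such as the per-device contexts are omitted):

// Sketch: enable P2P once at startup, then route GPU-to-GPU copies directly.
if (await p2pManager.EnablePeerAccessAsync(deviceId1: 0, deviceId2: 1))
{
    // bufferOnGpu0 / bufferOnGpu1 are IUnifiedBuffer<float> on devices 0 and 1
    await p2pManager.TransferP2PAsync(bufferOnGpu0, bufferOnGpu1);
}
else
{
    // Fall back to a staged transfer through host memory
}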

Memory Statistics and Monitoring

Memory Statistics

public class MemoryStatistics
{
    /// <summary>
    /// Total bytes allocated (excluding pool)
    /// </summary>
    public long TotalAllocated { get; set; }

    /// <summary>
    /// Total bytes in pool (available for reuse)
    /// </summary>
    public long TotalPooled { get; set; }

    /// <summary>
    /// Number of allocations from pool (cache hits)
    /// </summary>
    public long PoolHits { get; set; }

    /// <summary>
    /// Number of allocations that required new memory (cache misses)
    /// </summary>
    public long PoolMisses { get; set; }

    /// <summary>
    /// Current number of active buffers
    /// </summary>
    public int ActiveBuffers { get; set; }

    /// <summary>
    /// Peak memory usage (bytes)
    /// </summary>
    public long PeakUsage { get; set; }

    /// <summary>
    /// Pool hit rate (0.0 to 1.0)
    /// </summary>
    public double HitRate => PoolHits + PoolMisses > 0
        ? (double)PoolHits / (PoolHits + PoolMisses)
        : 0.0;
}

Usage Example

// Get memory statistics
var stats = memoryManager.GetStatistics();

logger.LogInformation(
    "Memory: {Active} buffers, {Allocated:N0} bytes allocated, " +
    "{Pooled:N0} bytes pooled, {HitRate:P1} pool hit rate",
    stats.ActiveBuffers,
    stats.TotalAllocated,
    stats.TotalPooled,
    stats.HitRate
);

// Example output:
// Memory: 42 buffers, 134,217,728 bytes allocated,
// 67,108,864 bytes pooled, 92.3% pool hit rate

Best Practices

Buffer Lifecycle

// ✅ GOOD: Use await using for automatic async disposal
public async Task ProcessDataAsync()
{
    await using var inputBuffer = await memoryManager.AllocateAsync<float>(1_000_000);
    await using var outputBuffer = await memoryManager.AllocateAsync<float>(1_000_000);

    await kernel.ExecuteAsync(inputBuffer, outputBuffer);
}

// ✅ GOOD: Explicit return to pool for reuse
public async Task ProcessBatchAsync()
{
    var buffer = await memoryManager.AllocateAsync<float>(1_000_000);

    try
    {
        await kernel.ExecuteAsync(buffer);
    }
    finally
    {
        await memoryManager.ReturnToPoolAsync(buffer);
    }
}

// ❌ BAD: Forget to dispose (memory leak)
public async Task LeakMemoryAsync()
{
    var buffer = await memoryManager.AllocateAsync<float>(1_000_000);
    await kernel.ExecuteAsync(buffer);
    // buffer never disposed or returned!
}

Transfer Optimization

// ✅ GOOD: Batch transfers for efficiency
public async Task TransferBatchAsync(List<float[]> data)
{
    var buffers = new List<IUnifiedBuffer<float>>();

    // Allocate all buffers
    foreach (var array in data)
    {
        buffers.Add(await memoryManager.AllocateAsync<float>(array.Length));
    }

    // Transfer concurrently
    var tasks = buffers.Zip(data, (buffer, array) => buffer.CopyFromAsync(array));
    await Task.WhenAll(tasks);
}

// ❌ BAD: Sequential transfers
public async Task TransferSequentiallyAsync(List<float[]> data)
{
    foreach (var array in data)
    {
        var buffer = await memoryManager.AllocateAsync<float>(array.Length);
        await buffer.CopyFromAsync(array); // Waits for each transfer!
    }
}

Allocation Strategy

// ✅ GOOD: Reuse buffers across iterations
public async Task ProcessStreamAsync(IAsyncEnumerable<float[]> stream)
{
    var buffer = await memoryManager.AllocateAsync<float>(1_000_000);

    await foreach (var data in stream)
    {
        await buffer.CopyFromAsync(data);
        await kernel.ExecuteAsync(buffer);
    }

    await buffer.DisposeAsync(); // One allocation for entire stream
}

// ❌ BAD: Allocate in loop
public async Task AllocateInLoopAsync(IAsyncEnumerable<float[]> stream)
{
    await foreach (var data in stream)
    {
        var buffer = await memoryManager.AllocateAsync<float>(data.Length);
        await kernel.ExecuteAsync(buffer);
        await buffer.DisposeAsync(); // Allocate/free every iteration!
    }
}

Performance Benchmarks

Allocation Performance

Operation            Time      Notes
Pool hit             < 1μs     Buffer available in pool
Pool miss (small)    ~10μs     Allocate < 1MB
Pool miss (large)    ~100μs    Allocate > 100MB
Pinned allocation    +100μs    One-time pinning cost
Disposal             < 1μs     Return to pool

Transfer Performance

Transfer Type            Bandwidth     Notes
CPU zero-copy            ~200 GB/s     Span access (memory-bandwidth limited)
CPU-GPU pinned           12-16 GB/s    PCIe 4.0 x16
CPU-GPU non-pinned       4-6 GB/s      Pageable memory must be staged by the driver
GPU-GPU P2P (NVLink)     ~300 GB/s     Direct GPU-to-GPU
GPU-GPU P2P (PCIe)       ~32 GB/s      PCIe 4.0 x16
GPU-GPU staged           ~6 GB/s       Via host memory (slowest)

Pool Efficiency

Measured in production workloads:

  • Pool Hit Rate: 90-95% (typical)
  • Allocation Reduction: 90%+ vs no pooling
  • Memory Overhead: ~10% (pooled but unused buffers)
  • Speedup: 11.2x for allocation-heavy workloads
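
As a back-of-envelope check against the allocation table above: the expected cost per allocation is hitRate × t(hit) + (1 − hitRate) × t(miss). At a 95% hit rate with ~1μs hits and ~100μs large-allocation misses, that is roughly 0.95 × 1μs + 0.05 × 100μs ≈ 6μs per allocation, versus ~100μs unpooled, which is consistent with the order-of-magnitude speedups measured above.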

Testing Strategy

Memory Tests

[Fact]
public async Task Allocate_ReturnsValidBuffer()
{
    // Arrange
    var manager = CreateMemoryManager();

    // Act
    using var buffer = await manager.AllocateAsync<float>(1000);

    // Assert
    Assert.NotNull(buffer);
    Assert.Equal(1000, buffer.Length);
    Assert.Equal(1000 * sizeof(float), buffer.SizeInBytes);
}

[Fact]
public async Task ReturnToPool_BufferCanBeReused()
{
    // Arrange
    var manager = CreateMemoryManager();
    var buffer1 = await manager.AllocateAsync<float>(1000);

    // Act
    await manager.ReturnToPoolAsync(buffer1);
    var buffer2 = await manager.AllocateAsync<float>(1000);

    // Assert
    Assert.Same(buffer1, buffer2); // Same buffer instance reused
}

[Fact]
public async Task Transfer_CopiesDataCorrectly()
{
    // Arrange
    var manager = CreateMemoryManager();
    var source = Enumerable.Range(0, 1000).Select(i => (float)i).ToArray();
    using var buffer = await manager.AllocateAsync<float>(1000);

    // Act
    await buffer.CopyFromAsync(source);
    var destination = new float[1000];
    await buffer.CopyToAsync(destination);

    // Assert
    Assert.Equal(source, destination);
}

Performance Tests

[Fact]
public async Task Pooling_ReducesAllocationTime()
{
    var manager = CreateMemoryManager();

    // Warm up pool
    var warmupBuffer = await manager.AllocateAsync<float>(1000);
    await manager.ReturnToPoolAsync(warmupBuffer);

    // Measure pool hit
    var sw = Stopwatch.StartNew();
    var buffer = await manager.AllocateAsync<float>(1000);
    sw.Stop();

    Assert.True(sw.Elapsed.TotalMicroseconds < 10, "Pool hit should be < 10μs");
}

Configuration

Memory Manager Options

public class UnifiedMemoryOptions
{
    /// <summary>
    /// Enable memory pooling
    /// </summary>
    public bool EnablePooling { get; set; } = true;

    /// <summary>
    /// Maximum pool size per size class (bytes)
    /// </summary>
    public long MaxPoolSizePerClass { get; set; } = 1024L * 1024 * 1024; // 1GB

    /// <summary>
    /// Maximum concurrent transfers
    /// </summary>
    public int MaxConcurrentTransfers { get; set; } = 4;

    /// <summary>
    /// Use pinned memory for CPU-GPU transfers
    /// </summary>
    public bool UsePinnedMemory { get; set; } = true;

    /// <summary>
    /// Enable P2P transfers when available
    /// </summary>
    public bool EnableP2P { get; set; } = true;

    /// <summary>
    /// Trim pool when memory pressure detected
    /// </summary>
    public bool TrimPoolOnPressure { get; set; } = true;
}

// Configuration
services.AddDotComputeRuntime(options =>
{
    options.Memory = new UnifiedMemoryOptions
    {
        EnablePooling = true,
        MaxPoolSizePerClass = 2L * 1024 * 1024 * 1024, // 2GB per class
        MaxConcurrentTransfers = 8
    };
});