Table of Contents

Metal Backend for Apple Silicon

The DotCompute Metal backend provides GPU acceleration for Apple Silicon (M1/M2/M3) and Intel Macs using Apple's Metal framework. This guide covers architecture, features, performance characteristics, and best practices.

Overview

The Metal backend delivers:

  • Native Apple Silicon support with unified memory architecture
  • Metal Performance Shaders (MPS) integration for optimized ML operations
  • Thread-safe parallel execution via command queue pooling
  • Binary kernel caching for sub-millisecond compilation
  • 90%+ memory pool efficiency with automatic buffer recycling

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    C# Managed Layer                         │
├─────────────────────────────────────────────────────────────┤
│  MetalAccelerator    │  MetalKernelCompiler   │ MPS Backend │
│  MetalMemoryPool     │  MetalCommandQueuePool │ Orchestrator│
├─────────────────────────────────────────────────────────────┤
│                    P/Invoke Interop                         │
├─────────────────────────────────────────────────────────────┤
│              libDotComputeMetal.dylib (Native)              │
│  DCMetalDevice.mm  │  DCMetalMPS.mm  │  Command Queue Pool  │
├─────────────────────────────────────────────────────────────┤
│              Apple Metal Framework + MPS                    │
└─────────────────────────────────────────────────────────────┘

Key Components

Component Purpose
MetalAccelerator Main entry point for GPU operations
MetalKernelCompiler Compiles MSL kernels with binary caching
MetalMemoryPoolManager Manages buffer allocation with pooling
MetalCommandQueuePool Thread-safe command queue management
MetalPerformanceShadersBackend MPS BLAS/CNN/Neural operations
MetalMPSOrchestrator Auto-selects MPS vs custom kernels

Getting Started

Prerequisites

  • macOS 10.15+ (Catalina or later)
  • .NET 9.0+
  • Xcode Command Line Tools (for native compilation)

Installation

# Add the Metal backend package
dotnet add package DotCompute.Backends.Metal

Basic Usage

using DotCompute.Backends.Metal;

// Create Metal accelerator
using var accelerator = MetalAccelerator.Create();

// Compile a kernel
var kernel = accelerator.CompileKernel(@"
    #include <metal_stdlib>
    using namespace metal;

    kernel void vector_add(
        device const float* a [[buffer(0)]],
        device const float* b [[buffer(1)]],
        device float* result [[buffer(2)]],
        uint id [[thread_position_in_grid]])
    {
        result[id] = a[id] + b[id];
    }
");

// Allocate buffers
var bufferA = accelerator.Allocate<float>(1024);
var bufferB = accelerator.Allocate<float>(1024);
var result = accelerator.Allocate<float>(1024);

// Execute kernel
kernel.Execute(bufferA, bufferB, result, gridSize: 1024);

Features

1. Unified Memory Architecture

Apple Silicon uses unified memory, eliminating CPU-GPU data transfers:

// Check if unified memory is available
if (accelerator.Capabilities.HasUnifiedMemory)
{
    // Direct memory access - no explicit transfers needed
    var buffer = accelerator.Allocate<float>(1024);
    buffer.CopyFrom(cpuData); // Direct copy, no staging
}

Performance Impact: 1.5-2.5x speedup for memory-bound operations compared to discrete GPU memory simulation.

2. Binary Kernel Caching

Compiled kernels are cached as binary archives for instant reuse:

// First compilation: ~200ms
var kernel1 = compiler.Compile(source, "my_kernel");

// Cache hit: ~5μs (40,000x faster)
var kernel2 = compiler.Compile(source, "my_kernel");

Cache files are stored in the application's cache directory and persist across sessions.

3. Command Queue Pool

Thread-safe parallel kernel execution without crashes:

// Parallel execution uses pooled command queues
await Task.WhenAll(
    Task.Run(() => kernel1.Execute(data1)),
    Task.Run(() => kernel2.Execute(data2)),
    Task.Run(() => kernel3.Execute(data3))
);

Implementation: The native layer maintains a thread-safe pool of MTLCommandQueue instances, preventing resource contention and ensuring safe concurrent execution.

4. Metal Performance Shaders Integration

MPS provides Apple-optimized implementations for common operations:

using var mps = new MetalMPSOrchestrator(accelerator);

// Matrix multiplication (auto-selects MPS vs custom kernel)
var result = mps.MatrixMultiply(matrixA, matrixB);

// Neural network operations
mps.ReLU(input, output);
mps.BatchNormalization(input, gamma, beta, mean, variance, output);

// CNN operations
mps.Convolution2D(input, kernel, output, stride: 1, padding: 0);

When to Use MPS:

  • Matrix sizes > 256x256 elements
  • CNN operations (convolution, pooling)
  • Neural network activations on large tensors
  • When leveraging Apple's optimized implementations matters more than raw flexibility

5. Memory Pool Management

Automatic buffer recycling reduces allocation overhead:

// Buffers are automatically pooled and recycled
for (int i = 0; i < 1000; i++)
{
    using var temp = accelerator.Allocate<float>(1024);
    // Buffer returned to pool on dispose
}
// 90%+ of allocations served from pool

Performance Characteristics

Benchmark Results (Apple M2)

Operation Performance Target Met
Kernel Cache Hit 5.5 μs ✅ (<1ms)
Command Buffer Acquisition 0.27 μs ✅ (<100μs)
Backend Cold Start 7.76 ms ✅ (<10ms)
Unified Memory Speedup 1.5-2.5x ✅ (>1.5x)
Parallel Execution Speedup 2.2x ✅ (>1.5x)

GPU Family Support

GPU Family Devices Features
Apple8 M2, M2 Pro/Max/Ultra Full MPS, advanced neural
Apple7 M1, M1 Pro/Max/Ultra Full MPS
Apple6 A14 MPS BLAS/CNN
Apple5 A13 Basic MPS
Mac2 Intel Macs Limited MPS

Thread Safety

Command Queue Pool Architecture

The Metal backend implements thread-safe parallel execution through:

  1. Shared Command Queue Pool: A map of device → command queues protected by mutex
  2. Per-Thread Command Buffers: Each execution gets its own command buffer from the shared queue
  3. Explicit Cleanup: Resources are properly released on dispose to prevent exit-time crashes
// Native implementation (DCMetalDevice.mm)
static std::mutex s_commandQueueMutex;
static std::map<void*, id<MTLCommandQueue>> s_commandQueues;

id<MTLCommandQueue> getOrCreateCommandQueue(id<MTLDevice> device) {
    std::lock_guard<std::mutex> lock(s_commandQueueMutex);
    // Thread-safe queue retrieval/creation
}

MPS Thread Safety

MPS operations use their own shared command queue pool with explicit cleanup:

// Proper disposal prevents exit-time crashes
using var mpsBackend = new MetalPerformanceShadersBackend(device, logger);
// Operations...
// Dispose calls native cleanup automatically

Best Practices

1. Kernel Optimization

// Use threadgroup memory for frequently accessed data
kernel void optimized_kernel(
    device float* data [[buffer(0)]],
    threadgroup float* shared [[threadgroup(0)]],
    uint tid [[thread_index_in_threadgroup]],
    uint gid [[thread_position_in_grid]])
{
    // Load to threadgroup memory
    shared[tid] = data[gid];
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Process with fast threadgroup access
    // ...
}

2. Buffer Reuse

// Reuse buffers instead of reallocating
var buffer = accelerator.Allocate<float>(size);
for (int iteration = 0; iteration < 100; iteration++)
{
    buffer.CopyFrom(newData);
    kernel.Execute(buffer);
}
buffer.Dispose();

3. Batch Operations

// Batch multiple operations in single command buffer
using var batch = accelerator.CreateCommandBatch();
batch.Add(kernel1, args1);
batch.Add(kernel2, args2);
batch.Add(kernel3, args3);
await batch.ExecuteAsync();

4. MPS vs Custom Kernels

// Use MPSOrchestrator for automatic selection
var orchestrator = new MetalMPSOrchestrator(accelerator);

// Orchestrator checks:
// - Data size (MPS for large, custom for small)
// - Operation type
// - Device capabilities
var result = orchestrator.MatrixMultiply(a, b); // Auto-selects best path

Troubleshooting

Common Issues

1. SIGSEGV on Disposal

  • Cause: Command queue cleanup conflicts with Objective-C ARC
  • Solution: Ensure proper disposal order; use latest native library with cleanup functions

2. Slow First Kernel Execution

  • Cause: JIT compilation of Metal shaders
  • Solution: Use warm-up calls or enable binary caching

3. Memory Pressure Warnings

  • Cause: Buffer pool exhaustion
  • Solution: Dispose buffers promptly; check pool configuration

4. Parallel Execution Crashes

  • Cause: Pre-v0.5 lacked command queue pooling
  • Solution: Update to latest version with MetalCommandQueuePool

Diagnostic Logging

Enable detailed Metal logging:

var accelerator = MetalAccelerator.Create(options =>
{
    options.EnableDebugLogging = true;
    options.LogLevel = LogLevel.Trace;
});

Native Library Issues

# Verify native library is loaded
otool -L libDotComputeMetal.dylib

# Check exported symbols
nm -gU libDotComputeMetal.dylib | grep DCMetal

# Rebuild native library
cd src/Backends/DotCompute.Backends.Metal/native
./build.sh

API Reference

MetalAccelerator

public class MetalAccelerator : IAccelerator
{
    // Create accelerator for default GPU
    public static MetalAccelerator Create();

    // Device capabilities
    public MetalCapabilities Capabilities { get; }

    // Memory allocation
    public IBuffer<T> Allocate<T>(int count) where T : unmanaged;

    // Kernel compilation
    public ICompiledKernel CompileKernel(string source, string entryPoint);
}

MetalCapabilities

public readonly struct MetalCapabilities
{
    public string DeviceName { get; }
    public string GPUFamily { get; }       // Apple7, Apple8, etc.
    public bool HasUnifiedMemory { get; }
    public int MaxThreadsPerThreadgroup { get; }
    public long MaxBufferLength { get; }
}

MetalMPSOrchestrator

public class MetalMPSOrchestrator : IDisposable
{
    // BLAS operations
    public void MatrixMultiply(ReadOnlySpan<float> a, ReadOnlySpan<float> b,
                               Span<float> c, int m, int n, int k);

    // Neural operations
    public void ReLU(ReadOnlySpan<float> input, Span<float> output);
    public void BatchNormalization(...);

    // CNN operations
    public void Convolution2D(...);
    public void MaxPooling2D(...);
}

Version History

Version Changes
0.5.0 Ring Kernel support, disposal hang fix, type casting fix, command queue pooling, MPS cleanup fix
0.4.2 Binary kernel caching, memory pool optimization
0.4.0 MPS integration, unified memory support
0.3.0 Initial Metal backend release

v0.5.0 Critical Fixes

Ring Kernel Message Passing Infrastructure (November 27, 2025):

  1. Disposal Hang Fix (MetalRingKernelRuntime.cs:516-530):

    • Problem: Synchronous wait on async GPU buffer disposal causing indefinite hangs
    • Solution: Wrapped CleanupKernelState in 3-second timeout using Task.Run with CancellationTokenSource
    • Impact: Disposal now completes reliably in <3 seconds instead of hanging indefinitely
    • File: src/Backends/DotCompute.Backends.Metal/RingKernels/MetalRingKernelRuntime.cs
  2. Type Casting Fix (MetalRingKernelRuntime.cs:176, 1166-1182):

    • Problem: Output message queues created with incorrect input type, causing InvalidCastException
    • Solution: Implemented GetKernelOutputType() method mapping kernel IDs to correct output types
    • Impact: PageRank pipeline now correctly handles: ContributionSender→PageRankContribution, RankAggregator→RankAggregationResult, ConvergenceChecker→ConvergenceCheckResult
    • File: src/Backends/DotCompute.Backends.Metal/RingKernels/MetalRingKernelRuntime.cs

Performance Validation (All targets met):

  • Kernel Compilation Cache: 4.326 μs (417x faster than 1ms target)
  • Command Buffer Acquisition: 0.24 μs (417x faster than 100μs target)
  • Backend Cold Start: 8.38 ms (16% under 10ms target)
  • Unified Memory Speedup: 2.33x (within 2-3x target range)
  • Parallel Execution: 2.22x (exceeds 1.5x target)

See Also