Table of Contents

Memory Ordering API: Causal Consistency for GPU Computing

Overview

The Memory Ordering API provides fine-grained control over memory consistency and operation ordering in GPU kernels. It enables correct implementation of lock-free data structures, producer-consumer patterns, and distributed coordination by enforcing causal relationships between memory operations across threads, devices, and the host CPU.

Key Features:

  • Three Consistency Models: Relaxed (1.0×), Release-Acquire (0.85×), Sequential (0.60×)
  • Three Fence Scopes: Thread-block (~10ns), Device (~100ns), System (~200ns)
  • Strategic Placement: Entry, Exit, After Writes, Before Reads, Full Barrier
  • Zero Configuration: Default relaxed model for maximum performance
  • Explicit Control: Per-kernel fence insertion for critical sections
Backend Thread-Block Device System Release-Acquire Sequential Min. Version
CUDA ✅ (CC 7.0+) CUDA 9.0+
OpenCL OpenCL 2.0+
Metal Metal 2.0+
CPU N/A N/A Volatile + Interlocked

Quick Start

Basic Setup: Enable Causal Ordering

using DotCompute.Abstractions.Memory;
using DotCompute.Backends.CUDA.Factory;

// Create accelerator
using var factory = new CudaAcceleratorFactory();
await using var accelerator = factory.CreateProductionAccelerator(0);

// Get memory ordering provider
var orderingProvider = accelerator.GetMemoryOrderingProvider();
if (orderingProvider == null)
{
    Console.WriteLine("Memory ordering not supported on this device");
    return;
}

// Enable causal ordering (Release-Acquire semantics)
orderingProvider.EnableCausalOrdering(true);
Console.WriteLine($"Consistency model: {orderingProvider.ConsistencyModel}");
Console.WriteLine($"Performance multiplier: {orderingProvider.GetOverheadMultiplier():F2}×");

Producer-Consumer Pattern with Explicit Fences

// Set Release-Acquire model for causal ordering
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);

// Producer: Insert release fence after writing data
orderingProvider.InsertFence(FenceType.Device, FenceLocation.Release);

// Consumer: Insert acquire fence before reading data
orderingProvider.InsertFence(FenceType.Device, FenceLocation.Acquire);

Multi-GPU System-Wide Synchronization

// Enable system-wide fences for cross-GPU coordination
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);

// Insert system fence for visibility across all GPUs and CPU
orderingProvider.InsertFence(FenceType.System, FenceLocation.FullBarrier);

Console.WriteLine($"Acquire-Release hardware support: {orderingProvider.IsAcquireReleaseSupported}");

Memory Consistency Models

1. Relaxed Consistency (Default)

Performance: 1.0× baseline (no overhead) Guarantees: None - operations may be reordered arbitrarily Use Case: Data-parallel algorithms with no inter-thread communication

orderingProvider.SetConsistencyModel(MemoryConsistencyModel.Relaxed);

How it works:

  • GPU default memory model: maximum performance, minimal synchronization
  • Operations may execute and complete in any order
  • Threads may observe writes in different orders
  • No happens-before relationships between operations

Example Behavior:

// Thread 1: Writes
data[0] = 42;
data[1] = 100;

// Thread 2: Reads
int x = data[1];  // May see 100
int y = data[0];  // May see 0 (old value)!

When to Use:

  • Map/reduce operations with independent elements
  • Element-wise array transformations
  • No shared state between threads
  • Manual fence management for critical sections only

Performance: 0.85× baseline (15% overhead) Guarantees: Causal ordering - writes before release are visible after acquire Use Case: Producer-consumer patterns, message passing, distributed data structures

orderingProvider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);

How it works:

  • Release operation: All prior writes become visible to other threads
  • Acquire operation: All subsequent reads observe values written before the release
  • Causality: If Thread A releases and Thread B acquires, B sees all of A's prior writes

Example Behavior:

// Producer (Thread 1)
data[0] = 42;
data[1] = 100;
__threadfence();           // Release fence
flag = READY;              // Signal ready

// Consumer (Thread 2)
while (flag != READY) { }  // Wait for signal
__threadfence();           // Acquire fence
int x = data[0];           // Guaranteed to see 42
int y = data[1];           // Guaranteed to see 100

When to Use:

  • Actor systems: Orleans.GpuBridge.Core message passing
  • Lock-free queues: Producer-consumer coordination
  • Distributed hash tables: Consistent key-value updates
  • Multi-GPU coordination: Cross-device data sharing

Implementation:

// CUDA Release (producer)
data[tid] = compute_value();
__threadfence();              // Release fence
atomicExch(&flag[tid], READY);

// CUDA Acquire (consumer)
while (atomicAdd(&flag[tid], 0) != READY) { }
__threadfence();              // Acquire fence
value = data[tid];

3. Sequential Consistency (Strongest)

Performance: 0.60× baseline (40% overhead) Guarantees: Total order - all threads observe operations in the same order Use Case: Complex algorithms requiring total order, or debugging race conditions

orderingProvider.SetConsistencyModel(MemoryConsistencyModel.Sequential);

How it works:

  • All memory operations appear to execute in a single global order
  • Every thread observes the same interleaving of operations
  • Fence before and after every memory operation
  • Strongest guarantee, highest overhead

Example Behavior:

// Thread 1
data[0] = 42;
data[1] = 100;

// Thread 2
int x = data[1];
int y = data[0];

// Sequential consistency guarantees:
// If x == 100, then y == 42 (never y == 0)

When to Use:

  • Algorithm correctness requires total order visibility
  • Debugging relaxed-model race conditions (disable after fixing)
  • Complex distributed algorithms (e.g., consensus protocols)
  • Performance is secondary to correctness

⚠️ Warning: 40% performance penalty. Avoid unless absolutely necessary. Start with Release-Acquire and add explicit fences only where needed.

Fence Types and Scopes

1. Thread-Block Fence

Latency: ~10 nanoseconds Scope: All threads in the same thread block Hardware: __threadfence_block() (CUDA), mem_fence(CLK_LOCAL_MEM_FENCE) (OpenCL)

orderingProvider.InsertFence(FenceType.ThreadBlock, FenceLocation.FullBarrier);

Use Cases:

  • Producer-consumer patterns within a block
  • Shared memory synchronization
  • Block-local data structure updates

Example CUDA Kernel:

extern "C" __global__ void block_producer_consumer(float* shared_data)
{
    __shared__ float buffer[256];
    int tid = threadIdx.x;

    // Producer threads (first half)
    if (tid < 128)
    {
        buffer[tid] = compute_value(tid);
        __threadfence_block();  // Release: writes visible
        buffer[tid + 128] = 1;  // Signal ready
    }

    // Consumer threads (second half)
    if (tid >= 128)
    {
        while (buffer[tid] != 1) { }  // Wait for signal
        __threadfence_block();        // Acquire: reads observe writes
        float value = buffer[tid - 128];
        shared_data[tid] = value;
    }
}

Performance: Fastest fence type (~10ns), ideal for intra-block coordination.

2. Device Fence

Latency: ~100 nanoseconds Scope: All threads on the same GPU Hardware: __threadfence() (CUDA), mem_fence(CLK_GLOBAL_MEM_FENCE) (OpenCL)

orderingProvider.InsertFence(FenceType.Device, FenceLocation.Release);

Use Cases:

  • Grid-wide producer-consumer patterns
  • Device-global data structure updates
  • Inter-block communication via global memory

Example CUDA Kernel:

extern "C" __global__ void device_wide_counter(int* counter, int* results)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // All blocks increment counter
    int my_ticket = atomicAdd(counter, 1);
    __threadfence();  // Device fence: ensure counter update visible

    // All blocks read final counter
    results[tid] = *counter;  // Should see total increments
}

Performance: Medium overhead (~100ns), required for cross-block coordination.

3. System Fence

Latency: ~200 nanoseconds Scope: All processors (CPU, all GPUs, all devices) Hardware: __threadfence_system() (CUDA) Requirements: Unified virtual addressing (UVA), CUDA 9.0+

orderingProvider.InsertFence(FenceType.System, FenceLocation.FullBarrier);

Use Cases:

  • GPU-CPU communication via mapped/pinned memory
  • Multi-GPU synchronization
  • System-wide distributed data structures
  • Causal message passing in Orleans.GpuBridge.Core

Example CUDA Kernel:

extern "C" __global__ void gpu_to_cpu_message(
    volatile int* cpu_visible_flag,
    volatile int* cpu_visible_data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid == 0)
    {
        // Write data
        *cpu_visible_data = 12345;
        __threadfence_system();  // System fence: visible to CPU

        // Signal CPU
        *cpu_visible_flag = 1;
    }
}

C# CPU Side:

// CPU waits for GPU signal
volatile int flag = 0;
volatile int data = 0;

// ... launch kernel with mapped memory ...

// Spin wait (or use better synchronization)
while (flag == 0) { Thread.SpinWait(100); }

// Data is guaranteed visible after fence
Console.WriteLine($"GPU wrote: {data}");  // Prints: GPU wrote: 12345

Performance: Slowest fence (~200ns), strongest guarantee (CPU + all GPUs).

Fence Location Strategies

Strategic Fence Placement

Fence locations control precise insertion points in kernel code, enabling fine-grained performance tuning.

1. Release Semantics (After Writes)

orderingProvider.InsertFence(FenceType.Device, FenceLocation.Release);

When to use: Producer threads publishing data

Example:

// Producer writes data
data[tid] = compute();

// Release fence: all writes visible
__threadfence();

// Signal ready (acquire by consumer)
flag[tid] = READY;

2. Acquire Semantics (Before Reads)

orderingProvider.InsertFence(FenceType.Device, FenceLocation.Acquire);

When to use: Consumer threads reading published data

Example:

// Wait for producer signal
while (flag[producer] != READY) { }

// Acquire fence: observe producer writes
__threadfence();

// Read data (guaranteed fresh)
value = data[producer];

3. Full Barrier (After Writes + Before Reads)

orderingProvider.InsertFence(FenceType.Device, FenceLocation.FullBarrier);

When to use: Bidirectional synchronization, strongest guarantee

Example:

// Write phase
data[tid] = compute();

// Full barrier: writes visible, reads fresh
__threadfence();

// Read phase (observes all writes)
value = data[(tid + 1) % N];

4. Kernel Entry/Exit

orderingProvider.InsertFence(FenceType.Device, FenceLocation.KernelEntry);
orderingProvider.InsertFence(FenceType.Device, FenceLocation.KernelExit);

When to use: Ensure consistent memory state across kernel boundaries

Example:

__global__ void with_entry_exit_fences(float* data)
{
    // [Entry fence inserted here by DotCompute]
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Kernel logic
    data[tid] = compute(data[tid]);

    // [Exit fence inserted here by DotCompute]
}

Performance Optimization: Minimal Fence Placement

✅ Efficient Pattern:

// Only fence where communication happens
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.Relaxed);

// Explicit fences at critical sections
orderingProvider.InsertFence(FenceType.ThreadBlock, FenceLocation.Release);  // After write
orderingProvider.InsertFence(FenceType.ThreadBlock, FenceLocation.Acquire);  // Before read

❌ Inefficient Pattern:

// Pervasive fencing (40% overhead)
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.Sequential);
// Fences inserted everywhere automatically
Strategy Overhead When to Use
Relaxed + Explicit Fences 5-10% Recommended: fence only critical sections
Release-Acquire Model 15% Good default: automatic causal ordering
Sequential Model 40% Last resort: debugging or complex algorithms

Use Cases

1. Producer-Consumer Pattern

Scenario: Multiple producer threads write data, consumer threads read when ready

// Configure Release-Acquire model
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);
orderingProvider.InsertFence(FenceType.Device, FenceLocation.Release);
orderingProvider.InsertFence(FenceType.Device, FenceLocation.Acquire);

CUDA Kernel:

__global__ void producer_consumer(float* data, volatile int* flags, int N)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int producer_id = tid % (N / 2);

    if (tid < N / 2)
    {
        // Producer: compute and publish
        data[producer_id] = expensive_compute(producer_id);
        __threadfence();  // Release fence
        flags[producer_id] = 1;  // Signal ready
    }
    else
    {
        // Consumer: wait and read
        int consumer_id = tid - N / 2;
        while (flags[consumer_id] == 0) { }  // Spin wait
        __threadfence();  // Acquire fence
        float value = data[consumer_id];  // Guaranteed fresh
        // ... process value ...
    }
}

2. Lock-Free Queue (MPSC)

Scenario: Multiple producers, single consumer, atomic enqueue/dequeue

// Release-Acquire for atomic operations
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);
orderingProvider.InsertFence(FenceType.Device, FenceLocation.FullBarrier);

CUDA Kernel:

__device__ void enqueue(int* queue, volatile int* head, int value)
{
    int pos = atomicAdd(head, 1);  // Atomic increment
    queue[pos] = value;            // Write value
    __threadfence();               // Release: ensure write visible
}

__device__ int dequeue(int* queue, volatile int* head, volatile int* tail)
{
    __threadfence();  // Acquire: observe enqueue writes
    int pos = atomicAdd(tail, 1);
    int value = queue[pos];
    return value;
}

3. Distributed Hash Table

Scenario: Concurrent key-value updates across multiple GPUs

// System-wide fences for multi-GPU coordination
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);
orderingProvider.InsertFence(FenceType.System, FenceLocation.FullBarrier);

CUDA Kernel:

__global__ void hash_table_update(
    KVPair* table,
    int* keys,
    int* values,
    int N)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= N) return;

    int key = keys[tid];
    int hash = key % TABLE_SIZE;

    // Atomic key-value update
    KVPair old = table[hash];
    table[hash] = {key, values[tid]};

    __threadfence_system();  // System fence: visible to all GPUs + CPU
}

4. Orleans.GpuBridge.Core Integration

Scenario: GPU-native actors with causal message ordering

// Enable causal ordering for actor message passing
orderingProvider.EnableCausalOrdering(true);
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);

// System fences for GPU-CPU communication
orderingProvider.InsertFence(FenceType.System, FenceLocation.Release);

Actor Message Send (GPU Kernel):

__global__ void actor_send_message(
    volatile Message* mailbox,
    volatile int* message_count,
    Message msg)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid != 0) return;

    // Write message payload
    int slot = atomicAdd(message_count, 1);
    mailbox[slot] = msg;

    // Release fence: ensure message visible to CPU
    __threadfence_system();
}

Actor Message Receive (CPU):

// Acquire fence: observe GPU writes
volatile int messageCount = 0;
while (messageCount == 0)
{
    Thread.MemoryBarrier();  // CPU acquire
    messageCount = gpuMailbox.MessageCount;
}

// Read message (guaranteed fresh)
Message msg = gpuMailbox.Messages[0];
Console.WriteLine($"Received: {msg.Payload}");

Performance Characteristics

Consistency Model Overhead

Model Performance Fence Frequency Use Case
Relaxed 1.0× None (manual) Data-parallel, no communication
Release-Acquire 0.85× At sync points Producer-consumer, actors
Sequential 0.60× Every operation Debugging, complex algorithms

Measured Performance (RTX 2000 Ada, 1M operations):

Relaxed:         2.14ms (baseline)
Release-Acquire: 2.52ms (18% overhead)
Sequential:      3.57ms (67% overhead)

Fence Type Overhead

Fence Type Latency Scope Typical Use
Thread-Block ~10ns Single block Intra-block coordination
Device ~100ns Single GPU Grid-wide coordination
System ~200ns CPU + all GPUs Multi-GPU + CPU sync

Amortization Strategy:

  • High-frequency fencing: Use Thread-Block fences (10ns)
  • Medium-frequency: Use Device fences (100ns)
  • Low-frequency: Use System fences only at coarse-grained boundaries

Hardware Acceleration

Backend Hardware Support Overhead
CUDA CC 7.0+ (Volta) Native acquire-release 0.85×
CUDA CC 5.0-6.x Software emulation 0.70×
OpenCL 2.0+ atomic_work_item_fence() 0.80×
Metal 2.0+ threadgroup_barrier() 0.82×
CPU Volatile + Interlocked 0.90×

Query Hardware Support:

if (orderingProvider.IsAcquireReleaseSupported)
{
    Console.WriteLine("Native hardware support: 0.85× overhead");
}
else
{
    Console.WriteLine("Software emulation: 0.70× overhead (higher cost)");
}

Hardware Requirements

CUDA Backend

Minimum Requirements:

  • Compute Capability 2.0 (Fermi) for thread-block and device fences
  • CUDA Toolkit 9.0 or later for full release-acquire support
  • NVIDIA Driver 384.81 or later

Recommended Configuration:

  • Compute Capability 7.0+ (Volta or newer) for native acquire-release
  • CUDA Toolkit 12.0 or later
  • NVIDIA Driver 525.60.13 or later
  • Unified Virtual Addressing (UVA) enabled for system fences
Feature CC 2.0 CC 5.0 CC 6.0 CC 7.0+
Thread-Block Fences
Device Fences
System Fences (UVA)
Native Acquire-Release
GPU Compute Capability Native Acq-Rel Tested
RTX 2000 Ada 8.9
RTX 4090 8.9
RTX 3090 8.6
RTX 2080 Ti 7.5
GTX 1080 Ti 6.1 ❌ (emulation)
GTX 980 Ti 5.2 ❌ (emulation)

OpenCL Backend

Minimum Requirements:

  • OpenCL 2.0 for atomic_work_item_fence()
  • mem_fence() with acquire/release flags

Supported Platforms:

  • NVIDIA: OpenCL 3.0 (via CUDA driver)
  • AMD: OpenCL 2.2 (via ROCm)
  • Intel: OpenCL 3.0 (via Level Zero)
  • ARM Mali: OpenCL 3.0
  • Qualcomm Adreno: OpenCL 2.0

Metal Backend

Minimum Requirements:

  • Metal 2.0 or later (macOS 10.13+)
  • threadgroup_barrier(), device_barrier() support

Supported Platforms:

  • Apple Silicon: M1/M2/M3/M4 series (full support)
  • Intel Macs: Metal 2.3+ (limited system fences)
  • iOS/iPadOS: Metal 2.0+ (mobile GPUs)

Best Practices

✅ Do

  1. Start with Relaxed, add fences only where needed

    orderingProvider.SetConsistencyModel(MemoryConsistencyModel.Relaxed);
    // Explicit fences at critical sections only
    orderingProvider.InsertFence(FenceType.ThreadBlock, FenceLocation.Release);
    
  2. Use Release-Acquire as default for coordinated algorithms

    orderingProvider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);
    // Good balance of correctness and performance
    
  3. Match fence scope to communication scope

    // Intra-block: Thread-Block fence (10ns)
    orderingProvider.InsertFence(FenceType.ThreadBlock, FenceLocation.FullBarrier);
    
    // Cross-block: Device fence (100ns)
    orderingProvider.InsertFence(FenceType.Device, FenceLocation.FullBarrier);
    
    // Multi-GPU: System fence (200ns)
    orderingProvider.InsertFence(FenceType.System, FenceLocation.FullBarrier);
    
  4. Check hardware support before using advanced features

    if (!orderingProvider.IsAcquireReleaseSupported)
    {
        Console.WriteLine("Warning: Acquire-release emulated, higher overhead");
    }
    
  5. Profile before and after adding memory ordering

    var sw = Stopwatch.StartNew();
    // ... kernel with fences ...
    sw.Stop();
    Console.WriteLine($"Overhead: {sw.ElapsedMilliseconds}ms");
    
  6. Use FenceLocation presets for common patterns

    orderingProvider.InsertFence(FenceType.Device, FenceLocation.Release);
    orderingProvider.InsertFence(FenceType.Device, FenceLocation.Acquire);
    orderingProvider.InsertFence(FenceType.Device, FenceLocation.FullBarrier);
    

❌ Don't

  1. Don't use Sequential unless absolutely necessary

    // ❌ BAD: 40% overhead for all operations
    orderingProvider.SetConsistencyModel(MemoryConsistencyModel.Sequential);
    
    // ✅ GOOD: Release-Acquire + explicit fences
    orderingProvider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);
    orderingProvider.InsertFence(FenceType.Device, FenceLocation.FullBarrier);
    
  2. Don't over-fence - profile to find minimal fencing

    // ❌ BAD: Excessive fencing degrades performance
    orderingProvider.InsertFence(FenceType.System, FenceLocation.KernelEntry);
    orderingProvider.InsertFence(FenceType.System, FenceLocation.FullBarrier);
    orderingProvider.InsertFence(FenceType.System, FenceLocation.KernelExit);
    
    // ✅ GOOD: Fence only at communication boundaries
    orderingProvider.InsertFence(FenceType.Device, FenceLocation.Release);
    
  3. Don't assume fences fix all race conditions

    // Fences ensure ordering, not atomicity
    // Use atomic operations for concurrent updates
    atomicAdd(&counter, 1);  // Atomic operation
    __threadfence();         // Fence for visibility
    
  4. Don't use system fences for intra-GPU communication

    // ❌ BAD: System fence overkill (200ns) for single-GPU
    orderingProvider.InsertFence(FenceType.System, FenceLocation.FullBarrier);
    
    // ✅ GOOD: Device fence sufficient (100ns)
    orderingProvider.InsertFence(FenceType.Device, FenceLocation.FullBarrier);
    
  5. Don't forget to check for null provider

    var provider = accelerator.GetMemoryOrderingProvider();
    if (provider == null)
    {
        throw new NotSupportedException("Memory ordering not available");
    }
    
  6. Don't change consistency model during kernel execution

    // ❌ BAD: Not thread-safe, undefined behavior
    Task.Run(() => orderingProvider.SetConsistencyModel(MemoryConsistencyModel.Sequential));
    
    // ✅ GOOD: Configure during initialization only
    orderingProvider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);
    // ... then launch kernels ...
    

Troubleshooting

Issue: Race Condition Despite Fences

Symptoms:

  • Non-deterministic results
  • Values appear stale or corrupted
  • Different threads observe different data

Diagnosis:

// Enable sequential consistency for debugging
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.Sequential);

// If race disappears, root cause is insufficient ordering
// If race persists, root cause is missing atomicity

Solutions:

  1. Check atomic operations:

    // ❌ BAD: Non-atomic read-modify-write
    counter = counter + 1;
    __threadfence();
    
    // ✅ GOOD: Atomic operation + fence
    atomicAdd(&counter, 1);
    __threadfence();
    
  2. Verify fence placement:

    // ❌ BAD: Fence after read (too late)
    value = data[tid];
    __threadfence();
    
    // ✅ GOOD: Fence before read (acquire)
    __threadfence();
    value = data[tid];
    
  3. Increase fence scope:

    // Try ThreadBlock → Device → System
    orderingProvider.InsertFence(FenceType.System, FenceLocation.FullBarrier);
    

Issue: Performance Degradation

Symptoms:

  • Kernel execution time increased significantly
  • Throughput reduced by 20-50%
  • Performance worse than CPU

Diagnosis:

// Measure overhead
var baseline = BenchmarkWithoutFences();
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);
var withFences = BenchmarkWithFences();

double overhead = (withFences - baseline) / baseline * 100;
Console.WriteLine($"Overhead: {overhead:F1}%");

Solutions:

  1. Reduce fence scope:

    // System (200ns) → Device (100ns) → ThreadBlock (10ns)
    orderingProvider.InsertFence(FenceType.ThreadBlock, FenceLocation.Release);
    
  2. Use relaxed model + explicit fences:

    orderingProvider.SetConsistencyModel(MemoryConsistencyModel.Relaxed);
    // Add fences only at critical sections (5-10% overhead vs 15-40%)
    
  3. Batch operations between fences:

    // ❌ BAD: Fence per operation (40% overhead)
    for (int i = 0; i < N; i++)
    {
        data[i] = compute(i);
        __threadfence();
    }
    
    // ✅ GOOD: Fence after batch (5% overhead)
    for (int i = 0; i < N; i++)
    {
        data[i] = compute(i);
    }
    __threadfence();
    

Issue: System Fences Not Working

Symptoms:

  • CPU doesn't observe GPU writes
  • Multi-GPU coordination fails
  • System fence throws NotSupportedException

Solutions:

  1. Check UVA support:

    nvidia-smi -q | grep "Unified Memory"
    # Should show: "Supported: Yes"
    
  2. Verify compute capability:

    var (major, _) = CudaCapabilityManager.GetTargetComputeCapability();
    if (major < 2)
    {
        Console.WriteLine("System fences require CC 2.0+ (UVA)");
    }
    
  3. Use pinned memory for CPU visibility:

    // Allocate pinned (page-locked) memory
    var pinnedBuffer = accelerator.AllocatePinned<int>(1024);
    // System fences ensure CPU sees GPU writes
    
  4. Fallback to explicit synchronization:

    // If system fences unavailable, use device synchronization
    await accelerator.SynchronizeAsync();
    // CPU can now safely read GPU memory
    

Issue: Inconsistent Calibration Results

Symptoms:

  • GetOverheadMultiplier() returns unexpected values
  • Performance varies between runs
  • Fence overhead higher than expected

Solutions:

  1. Warm up GPU before benchmarking:

    // Run dummy kernel to warm up GPU
    for (int i = 0; i < 5; i++)
    {
        await accelerator.ExecuteKernelAsync(warmupKernel);
    }
    await accelerator.SynchronizeAsync();
    
    // Now measure fence overhead
    var overhead = orderingProvider.GetOverheadMultiplier();
    
  2. Check for thermal throttling:

    nvidia-smi --query-gpu=temperature.gpu,clocks.gr --format=csv
    # High temperature → clock throttling → variable performance
    
  3. Lock GPU clocks for consistent benchmarks:

    sudo nvidia-smi -pm 1              # Enable persistence mode
    sudo nvidia-smi -lgc 1500          # Lock GPU clock to 1500MHz
    

API Reference

IMemoryOrderingProvider Interface

public interface IMemoryOrderingProvider
{
    /// <summary>
    /// Enables causal memory ordering (release-acquire semantics).
    /// </summary>
    void EnableCausalOrdering(bool enable = true);

    /// <summary>
    /// Inserts a memory fence at the specified location in kernel code.
    /// </summary>
    void InsertFence(FenceType type, FenceLocation? location = null);

    /// <summary>
    /// Configures the memory consistency model for kernel execution.
    /// </summary>
    void SetConsistencyModel(MemoryConsistencyModel model);

    /// <summary>
    /// Gets the current memory consistency model.
    /// </summary>
    MemoryConsistencyModel ConsistencyModel { get; }

    /// <summary>
    /// Gets whether the device supports acquire-release memory ordering.
    /// </summary>
    bool IsAcquireReleaseSupported { get; }

    /// <summary>
    /// Gets the overhead multiplier for the current consistency model.
    /// </summary>
    double GetOverheadMultiplier();
}

MemoryConsistencyModel Enum

public enum MemoryConsistencyModel
{
    /// <summary>
    /// Relaxed: No ordering guarantees (1.0× performance).
    /// </summary>
    Relaxed = 0,

    /// <summary>
    /// Release-Acquire: Causal ordering (0.85× performance).
    /// </summary>
    ReleaseAcquire = 1,

    /// <summary>
    /// Sequential: Total order (0.60× performance).
    /// </summary>
    Sequential = 2
}

FenceType Enum

public enum FenceType
{
    /// <summary>
    /// Thread-block scope (~10ns latency).
    /// </summary>
    ThreadBlock = 0,

    /// <summary>
    /// Device-wide scope (~100ns latency).
    /// </summary>
    Device = 1,

    /// <summary>
    /// System-wide scope (~200ns latency, requires UVA).
    /// </summary>
    System = 2
}

FenceLocation Class

public sealed class FenceLocation
{
    public int? InstructionIndex { get; init; }
    public bool AtEntry { get; init; }
    public bool AtExit { get; init; }
    public bool AfterWrites { get; init; }
    public bool BeforeReads { get; init; }

    // Presets
    public static FenceLocation Release { get; }      // AfterWrites
    public static FenceLocation Acquire { get; }      // BeforeReads
    public static FenceLocation FullBarrier { get; }  // AfterWrites + BeforeReads
    public static FenceLocation KernelEntry { get; }  // AtEntry
    public static FenceLocation KernelExit { get; }   // AtExit
}

Version History

v0.5.0-alpha (Current)

  • Initial release of Memory Ordering API
  • Three consistency models: Relaxed, Release-Acquire, Sequential
  • Three fence scopes: Thread-Block, Device, System
  • Strategic fence placement with FenceLocation
  • CUDA backend implementation with hardware detection
  • Comprehensive documentation and examples

Planned Enhancements (v0.6.0)

  • Automatic fence insertion during kernel compilation
  • Memory ordering visualization tools
  • Performance profiling integration
  • OpenCL and Metal backend implementations

Next: Timing API | Barrier API | Multi-GPU Guide