Memory Ordering API: Causal Consistency for GPU Computing
Overview
The Memory Ordering API provides fine-grained control over memory consistency and operation ordering in GPU kernels. It enables correct implementation of lock-free data structures, producer-consumer patterns, and distributed coordination by enforcing causal relationships between memory operations across threads, devices, and the host CPU.
Key Features:
- Three Consistency Models: Relaxed (1.0×), Release-Acquire (0.85×), Sequential (0.60×)
- Three Fence Scopes: Thread-block (~10ns), Device (~100ns), System (~200ns)
- Strategic Placement: Entry, Exit, After Writes, Before Reads, Full Barrier
- Zero Configuration: Default relaxed model for maximum performance
- Explicit Control: Per-kernel fence insertion for critical sections
| Backend | Thread-Block | Device | System | Release-Acquire | Sequential | Min. Version |
|---|---|---|---|---|---|---|
| CUDA | ✅ | ✅ | ✅ | ✅ (CC 7.0+) | ✅ | CUDA 9.0+ |
| OpenCL | ✅ | ✅ | ❌ | ✅ | ✅ | OpenCL 2.0+ |
| Metal | ✅ | ✅ | ❌ | ✅ | ✅ | Metal 2.0+ |
| CPU | ✅ | N/A | N/A | ✅ | ✅ | Volatile + Interlocked |
Quick Start
Basic Setup: Enable Causal Ordering
using DotCompute.Abstractions.Memory;
using DotCompute.Backends.CUDA.Factory;
// Create accelerator
using var factory = new CudaAcceleratorFactory();
await using var accelerator = factory.CreateProductionAccelerator(0);
// Get memory ordering provider
var orderingProvider = accelerator.GetMemoryOrderingProvider();
if (orderingProvider == null)
{
Console.WriteLine("Memory ordering not supported on this device");
return;
}
// Enable causal ordering (Release-Acquire semantics)
orderingProvider.EnableCausalOrdering(true);
Console.WriteLine($"Consistency model: {orderingProvider.ConsistencyModel}");
Console.WriteLine($"Performance multiplier: {orderingProvider.GetOverheadMultiplier():F2}×");
Producer-Consumer Pattern with Explicit Fences
// Set Release-Acquire model for causal ordering
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);
// Producer: Insert release fence after writing data
orderingProvider.InsertFence(FenceType.Device, FenceLocation.Release);
// Consumer: Insert acquire fence before reading data
orderingProvider.InsertFence(FenceType.Device, FenceLocation.Acquire);
Multi-GPU System-Wide Synchronization
// Enable system-wide fences for cross-GPU coordination
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);
// Insert system fence for visibility across all GPUs and CPU
orderingProvider.InsertFence(FenceType.System, FenceLocation.FullBarrier);
Console.WriteLine($"Acquire-Release hardware support: {orderingProvider.IsAcquireReleaseSupported}");
Memory Consistency Models
1. Relaxed Consistency (Default)
Performance: 1.0× baseline (no overhead)
Guarantees: None - operations may be reordered arbitrarily
Use Case: Data-parallel algorithms with no inter-thread communication
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.Relaxed);
How it works:
- GPU default memory model: maximum performance, minimal synchronization
- Operations may execute and complete in any order
- Threads may observe writes in different orders
- No happens-before relationships between operations
Example Behavior:
// Thread 1: Writes
data[0] = 42;
data[1] = 100;
// Thread 2: Reads
int x = data[1]; // May see 100
int y = data[0]; // May see 0 (old value)!
When to Use:
- Map/reduce operations with independent elements
- Element-wise array transformations
- No shared state between threads
- Manual fence management for critical sections only
2. Release-Acquire Consistency (Recommended)
Performance: 0.85× baseline (15% overhead)
Guarantees: Causal ordering - writes before a release are visible after the matching acquire
Use Case: Producer-consumer patterns, message passing, distributed data structures
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);
How it works:
- Release operation: All prior writes become visible to other threads
- Acquire operation: All subsequent reads observe values written before the release
- Causality: If Thread A releases and Thread B acquires, B sees all of A's prior writes
Example Behavior:
// Producer (Thread 1)
data[0] = 42;
data[1] = 100;
__threadfence(); // Release fence
flag = READY; // Signal ready
// Consumer (Thread 2)
while (flag != READY) { } // Wait for signal
__threadfence(); // Acquire fence
int x = data[0]; // Guaranteed to see 42
int y = data[1]; // Guaranteed to see 100
When to Use:
- Actor systems: Orleans.GpuBridge.Core message passing
- Lock-free queues: Producer-consumer coordination
- Distributed hash tables: Consistent key-value updates
- Multi-GPU coordination: Cross-device data sharing
Implementation:
// CUDA Release (producer)
data[tid] = compute_value();
__threadfence(); // Release fence
atomicExch(&flag[tid], READY);
// CUDA Acquire (consumer)
while (atomicAdd(&flag[tid], 0) != READY) { }
__threadfence(); // Acquire fence
value = data[tid];
3. Sequential Consistency (Strongest)
Performance: 0.60× baseline (40% overhead)
Guarantees: Total order - all threads observe operations in the same order
Use Case: Complex algorithms requiring total order, or debugging race conditions
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.Sequential);
How it works:
- All memory operations appear to execute in a single global order
- Every thread observes the same interleaving of operations
- Fence before and after every memory operation
- Strongest guarantee, highest overhead
Example Behavior:
// Thread 1
data[0] = 42;
data[1] = 100;
// Thread 2
int x = data[1];
int y = data[0];
// Sequential consistency guarantees:
// If x == 100, then y == 42 (never y == 0)
When to Use:
- Algorithm correctness requires total order visibility
- Debugging relaxed-model race conditions (disable after fixing)
- Complex distributed algorithms (e.g., consensus protocols)
- Performance is secondary to correctness
⚠️ Warning: 40% performance penalty. Avoid unless absolutely necessary. Start with Release-Acquire and add explicit fences only where needed.
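The warning suggests a debug-then-downgrade workflow. A minimal sketch, reusing the orderingProvider from the Quick Start:
// Escalate to Sequential only while hunting a suspected ordering bug
var original = orderingProvider.ConsistencyModel;
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.Sequential);
// ... rerun the failing kernel; if results stabilize, ordering was the cause ...
// Restore the cheaper model and add explicit fences at the identified sites
orderingProvider.SetConsistencyModel(original);
orderingProvider.InsertFence(FenceType.Device, FenceLocation.Release);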
Fence Types and Scopes
1. Thread-Block Fence
Latency: ~10 nanoseconds
Scope: All threads in the same thread block
Hardware: __threadfence_block() (CUDA), mem_fence(CLK_LOCAL_MEM_FENCE) (OpenCL)
orderingProvider.InsertFence(FenceType.ThreadBlock, FenceLocation.FullBarrier);
Use Cases:
- Producer-consumer patterns within a block
- Shared memory synchronization
- Block-local data structure updates
Example CUDA Kernel:
extern "C" __global__ void block_producer_consumer(float* shared_data)
{
// volatile: spinning consumers must re-read the flag from shared memory
// instead of a cached register value
volatile __shared__ float buffer[256];
int tid = threadIdx.x;
// Producer threads (first half)
if (tid < 128)
{
buffer[tid] = compute_value(tid);
__threadfence_block(); // Release: writes visible
buffer[tid + 128] = 1; // Signal ready
}
// Consumer threads (second half)
if (tid >= 128)
{
while (buffer[tid] != 1) { } // Wait for signal
__threadfence_block(); // Acquire: reads observe writes
float value = buffer[tid - 128];
shared_data[tid] = value;
}
}
Performance: Fastest fence type (~10ns), ideal for intra-block coordination.
2. Device Fence
Latency: ~100 nanoseconds
Scope: All threads on the same GPU
Hardware: __threadfence() (CUDA), mem_fence(CLK_GLOBAL_MEM_FENCE) (OpenCL)
orderingProvider.InsertFence(FenceType.Device, FenceLocation.Release);
Use Cases:
- Grid-wide producer-consumer patterns
- Device-global data structure updates
- Inter-block communication via global memory
Example CUDA Kernel:
extern "C" __global__ void device_wide_counter(int* counter, int* results)
{
int tid = blockIdx.x * blockDim.x + threadIdx.x;
// All blocks increment counter
int my_ticket = atomicAdd(counter, 1);
__threadfence(); // Device fence: make this thread's counter update visible
// A fence orders memory operations but does not wait for other blocks,
// so this read may still observe a partial count (see the two-pass launch below)
results[tid] = *counter;
}
Performance: Medium overhead (~100ns), required for cross-block coordination.
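Because a fence orders memory operations but does not wait for other blocks, the portable way to get "all increments, then all reads" is to split the work across two launches. A sketch, where incrementKernel and readKernel are hypothetical compiled-kernel handles:
// Pass 1: every block performs its atomicAdd on the counter
await accelerator.ExecuteKernelAsync(incrementKernel);
// Kernel completion synchronizes all blocks
await accelerator.SynchronizeAsync();
// Pass 2: every thread now reads the final counter value
await accelerator.ExecuteKernelAsync(readKernel);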
3. System Fence
Latency: ~200 nanoseconds
Scope: All processors (CPU, all GPUs, all devices)
Hardware: __threadfence_system() (CUDA)
Requirements: Unified virtual addressing (UVA), CUDA 9.0+
orderingProvider.InsertFence(FenceType.System, FenceLocation.FullBarrier);
Use Cases:
- GPU-CPU communication via mapped/pinned memory
- Multi-GPU synchronization
- System-wide distributed data structures
- Causal message passing in Orleans.GpuBridge.Core
Example CUDA Kernel:
extern "C" __global__ void gpu_to_cpu_message(
volatile int* cpu_visible_flag,
volatile int* cpu_visible_data)
{
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid == 0)
{
// Write data
*cpu_visible_data = 12345;
__threadfence_system(); // System fence: visible to CPU
// Signal CPU
*cpu_visible_flag = 1;
}
}
C# CPU Side:
// CPU waits for the GPU signal; mappedFlag and mappedData are the CPU views
// of the kernel's cpu_visible_flag and cpu_visible_data mapped buffers
// (C# does not allow volatile locals, so use Volatile.Read instead)
// ... launch kernel with mapped memory ...
// Spin wait (or use better synchronization)
while (Volatile.Read(ref mappedFlag[0]) == 0) { Thread.SpinWait(100); }
// Data is guaranteed visible after the system fence
Console.WriteLine($"GPU wrote: {mappedData[0]}"); // Prints: GPU wrote: 12345
Performance: Slowest fence (~200ns), strongest guarantee (CPU + all GPUs).
Fence Location Strategies
Strategic Fence Placement
Fence locations control precise insertion points in kernel code, enabling fine-grained performance tuning.
1. Release Semantics (After Writes)
orderingProvider.InsertFence(FenceType.Device, FenceLocation.Release);
When to use: Producer threads publishing data
Example:
// Producer writes data
data[tid] = compute();
// Release fence: all writes visible
__threadfence();
// Signal ready (acquire by consumer)
flag[tid] = READY;
2. Acquire Semantics (Before Reads)
orderingProvider.InsertFence(FenceType.Device, FenceLocation.Acquire);
When to use: Consumer threads reading published data
Example:
// Wait for producer signal
while (flag[producer] != READY) { }
// Acquire fence: observe producer writes
__threadfence();
// Read data (guaranteed fresh)
value = data[producer];
3. Full Barrier (After Writes + Before Reads)
orderingProvider.InsertFence(FenceType.Device, FenceLocation.FullBarrier);
When to use: Bidirectional synchronization, strongest guarantee
Example:
// Write phase
data[tid] = compute();
// Full barrier: writes visible, reads fresh
__threadfence();
// Read phase (observes all writes)
value = data[(tid + 1) % N];
4. Kernel Entry/Exit
orderingProvider.InsertFence(FenceType.Device, FenceLocation.KernelEntry);
orderingProvider.InsertFence(FenceType.Device, FenceLocation.KernelExit);
When to use: Ensure consistent memory state across kernel boundaries
Example:
__global__ void with_entry_exit_fences(float* data)
{
// [Entry fence inserted here by DotCompute]
int tid = blockIdx.x * blockDim.x + threadIdx.x;
// Kernel logic
data[tid] = compute(data[tid]);
// [Exit fence inserted here by DotCompute]
}
Performance Optimization: Minimal Fence Placement
✅ Efficient Pattern:
// Only fence where communication happens
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.Relaxed);
// Explicit fences at critical sections
orderingProvider.InsertFence(FenceType.ThreadBlock, FenceLocation.Release); // After write
orderingProvider.InsertFence(FenceType.ThreadBlock, FenceLocation.Acquire); // Before read
❌ Inefficient Pattern:
// Pervasive fencing (40% overhead)
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.Sequential);
// Fences inserted everywhere automatically
| Strategy | Overhead | When to Use |
|---|---|---|
| Relaxed + Explicit Fences | 5-10% | Recommended: fence only critical sections |
| Release-Acquire Model | 15% | Good default: automatic causal ordering |
| Sequential Model | 40% | Last resort: debugging or complex algorithms |
Use Cases
1. Producer-Consumer Pattern
Scenario: Multiple producer threads write data, consumer threads read when ready
// Configure Release-Acquire model
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);
orderingProvider.InsertFence(FenceType.Device, FenceLocation.Release);
orderingProvider.InsertFence(FenceType.Device, FenceLocation.Acquire);
CUDA Kernel:
__global__ void producer_consumer(float* data, volatile int* flags, int N)
{
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int producer_id = tid % (N / 2);
if (tid < N / 2)
{
// Producer: compute and publish
data[producer_id] = expensive_compute(producer_id);
__threadfence(); // Release fence
flags[producer_id] = 1; // Signal ready
}
else
{
// Consumer: wait and read
int consumer_id = tid - N / 2;
while (flags[consumer_id] == 0) { } // Spin wait
__threadfence(); // Acquire fence
float value = data[consumer_id]; // Guaranteed fresh
// ... process value ...
}
}
2. Lock-Free Queue (MPSC)
Scenario: Multiple producers, single consumer, atomic enqueue/dequeue
// Release-Acquire for atomic operations
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);
orderingProvider.InsertFence(FenceType.Device, FenceLocation.FullBarrier);
CUDA Kernel:
// A per-slot ready flag closes the gap between reserving a slot and writing it
__device__ void enqueue(int* queue, volatile int* ready, int* head, int value)
{
    int pos = atomicAdd(head, 1);       // Reserve a slot
    queue[pos] = value;                 // Write the value into the slot
    __threadfence();                    // Release: value visible before publish
    atomicExch((int*)&ready[pos], 1);   // Publish the slot to the consumer
}
__device__ int dequeue(int* queue, volatile int* ready, int* tail)
{
    int pos = atomicAdd(tail, 1);                     // Claim the next slot
    while (atomicAdd((int*)&ready[pos], 0) == 0) { }  // Wait until published
    __threadfence();                    // Acquire: observe the enqueued value
    return queue[pos];
}
3. Distributed Hash Table
Scenario: Concurrent key-value updates across multiple GPUs
// System-wide fences for multi-GPU coordination
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);
orderingProvider.InsertFence(FenceType.System, FenceLocation.FullBarrier);
CUDA Kernel:
__global__ void hash_table_update(
KVPair* table,
int* keys,
int* values,
int N)
{
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid >= N) return;
int key = keys[tid];
int hash = key % TABLE_SIZE;
// Simplified update: a plain struct store is not atomic; production code
// would pack key and value into one 64-bit word and publish it with
// atomicExch/atomicCAS
table[hash] = {key, values[tid]};
__threadfence_system(); // System fence: update visible to all GPUs + CPU
}
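Host-side, every GPU sharing the table needs the same configuration. A sketch reusing the factory from the Quick Start (the device count of 2 is an assumption):
// Apply identical ordering settings on each participating device
for (int device = 0; device < 2; device++)
{
    await using var acc = factory.CreateProductionAccelerator(device);
    var ordering = acc.GetMemoryOrderingProvider();
    ordering?.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);
    ordering?.InsertFence(FenceType.System, FenceLocation.FullBarrier);
}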
4. Orleans.GpuBridge.Core Integration
Scenario: GPU-native actors with causal message ordering
// Enable causal ordering for actor message passing
orderingProvider.EnableCausalOrdering(true);
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);
// System fences for GPU-CPU communication
orderingProvider.InsertFence(FenceType.System, FenceLocation.Release);
Actor Message Send (GPU Kernel):
__global__ void actor_send_message(
    Message* mailbox,
    volatile int* message_count,
    Message msg)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid != 0) return;
    // Write the message payload into the next free slot
    int slot = *message_count;
    mailbox[slot] = msg;
    // System fence: payload visible to the CPU before the count is published
    __threadfence_system();
    // Publish: the count increment is the signal the CPU spins on
    atomicAdd((int*)message_count, 1);
}
Actor Message Receive (CPU):
// Acquire on the CPU: re-read the published count until the GPU signals
// (C# volatile applies to fields, not locals, so use an explicit barrier)
int messageCount = 0;
while (messageCount == 0)
{
    Thread.MemoryBarrier(); // CPU acquire
    messageCount = gpuMailbox.MessageCount;
}
// Read message (guaranteed fresh)
Message msg = gpuMailbox.Messages[0];
Console.WriteLine($"Received: {msg.Payload}");
Performance Characteristics
Consistency Model Overhead
| Model | Performance | Fence Frequency | Use Case |
|---|---|---|---|
| Relaxed | 1.0× | None (manual) | Data-parallel, no communication |
| Release-Acquire | 0.85× | At sync points | Producer-consumer, actors |
| Sequential | 0.60× | Every operation | Debugging, complex algorithms |
Measured Performance (RTX 2000 Ada, 1M operations):
Relaxed: 2.14ms (baseline)
Release-Acquire: 2.52ms (18% overhead)
Sequential: 3.57ms (67% overhead)
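A minimal sketch of how such numbers can be reproduced, assuming a hypothetical RunKernelAsync helper that launches the same 1M-operation kernel:
async Task<double> MeasureAsync(MemoryConsistencyModel model)
{
    orderingProvider.SetConsistencyModel(model);
    await RunKernelAsync(); // warm-up launch, excluded from timing
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < 100; i++) { await RunKernelAsync(); }
    await accelerator.SynchronizeAsync();
    sw.Stop();
    return sw.Elapsed.TotalMilliseconds / 100; // mean time per launch
}
double relaxed = await MeasureAsync(MemoryConsistencyModel.Relaxed);
double relAcq = await MeasureAsync(MemoryConsistencyModel.ReleaseAcquire);
Console.WriteLine($"Release-Acquire overhead: {(relAcq / relaxed - 1) * 100:F0}%");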
Fence Type Overhead
| Fence Type | Latency | Scope | Typical Use |
|---|---|---|---|
| Thread-Block | ~10ns | Single block | Intra-block coordination |
| Device | ~100ns | Single GPU | Grid-wide coordination |
| System | ~200ns | CPU + all GPUs | Multi-GPU + CPU sync |
Amortization Strategy (see the sketch after this list):
- High-frequency fencing: Use Thread-Block fences (10ns)
- Medium-frequency: Use Device fences (100ns)
- Low-frequency: Use System fences only at coarse-grained boundaries
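One way to encode this rule is a helper that picks the narrowest fence covering the communication; CommunicationScope is a hypothetical enum introduced only for illustration:
// Pick the cheapest fence that still reaches every participating thread
enum CommunicationScope { IntraBlock, IntraDevice, CrossDevice }
static FenceType SelectFence(CommunicationScope scope) => scope switch
{
    CommunicationScope.IntraBlock => FenceType.ThreadBlock, // ~10ns
    CommunicationScope.IntraDevice => FenceType.Device,     // ~100ns
    _ => FenceType.System                                   // ~200ns
};
// Usage: orderingProvider.InsertFence(SelectFence(scope), FenceLocation.Release);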
Hardware Acceleration
| Backend | Hardware Support | Overhead |
|---|---|---|
| CUDA CC 7.0+ (Volta) | Native acquire-release | 0.85× |
| CUDA CC 5.0-6.x | Software emulation | 0.70× |
| OpenCL 2.0+ | atomic_work_item_fence() | 0.80× |
| Metal 2.0+ | threadgroup_barrier() | 0.82× |
| CPU | Volatile + Interlocked | 0.90× |
Query Hardware Support:
if (orderingProvider.IsAcquireReleaseSupported)
{
Console.WriteLine("Native hardware support: 0.85× overhead");
}
else
{
Console.WriteLine("Software emulation: 0.70× overhead (higher cost)");
}
Hardware Requirements
CUDA Backend
Minimum Requirements:
- Compute Capability 2.0 (Fermi) for thread-block and device fences
- CUDA Toolkit 9.0 or later for full release-acquire support
- NVIDIA Driver 384.81 or later
Recommended Configuration:
- Compute Capability 7.0+ (Volta or newer) for native acquire-release
- CUDA Toolkit 12.0 or later
- NVIDIA Driver 525.60.13 or later
- Unified Virtual Addressing (UVA) enabled for system fences
| Feature | CC 2.0 | CC 5.0 | CC 6.0 | CC 7.0+ |
|---|---|---|---|---|
| Thread-Block Fences | ✅ | ✅ | ✅ | ✅ |
| Device Fences | ✅ | ✅ | ✅ | ✅ |
| System Fences (UVA) | ✅ | ✅ | ✅ | ✅ |
| Native Acquire-Release | ❌ | ❌ | ❌ | ✅ |
Tested GPUs:
| GPU | Compute Capability | Native Acq-Rel | Tested |
|---|---|---|---|
| RTX 2000 Ada | 8.9 | ✅ | ✅ |
| RTX 4090 | 8.9 | ✅ | ✅ |
| RTX 3090 | 8.6 | ✅ | ✅ |
| RTX 2080 Ti | 7.5 | ✅ | ✅ |
| GTX 1080 Ti | 6.1 | ❌ (emulation) | ✅ |
| GTX 980 Ti | 5.2 | ❌ (emulation) | ✅ |
OpenCL Backend
Minimum Requirements:
- OpenCL 2.0 for atomic_work_item_fence()
- mem_fence() with acquire/release flags
Supported Platforms:
- NVIDIA: OpenCL 3.0 (via CUDA driver)
- AMD: OpenCL 2.2 (via ROCm)
- Intel: OpenCL 3.0 (via the oneAPI compute runtime)
- ARM Mali: OpenCL 3.0
- Qualcomm Adreno: OpenCL 2.0
Metal Backend
Minimum Requirements:
- Metal 2.0 or later (macOS 10.13+)
- threadgroup_barrier() with mem_threadgroup / mem_device flags
Supported Platforms:
- Apple Silicon: M1/M2/M3/M4 series (full support)
- Intel Macs: Metal 2.3+ (limited system fences)
- iOS/iPadOS: Metal 2.0+ (mobile GPUs)
Best Practices
✅ Do
Start with Relaxed, add fences only where needed
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.Relaxed);
// Explicit fences at critical sections only
orderingProvider.InsertFence(FenceType.ThreadBlock, FenceLocation.Release);
Use Release-Acquire as default for coordinated algorithms
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);
// Good balance of correctness and performance
Match fence scope to communication scope
// Intra-block: Thread-Block fence (10ns)
orderingProvider.InsertFence(FenceType.ThreadBlock, FenceLocation.FullBarrier);
// Cross-block: Device fence (100ns)
orderingProvider.InsertFence(FenceType.Device, FenceLocation.FullBarrier);
// Multi-GPU: System fence (200ns)
orderingProvider.InsertFence(FenceType.System, FenceLocation.FullBarrier);
Check hardware support before using advanced features
if (!orderingProvider.IsAcquireReleaseSupported)
{
    Console.WriteLine("Warning: Acquire-release emulated, higher overhead");
}
Profile before and after adding memory ordering
var sw = Stopwatch.StartNew();
// ... kernel with fences ...
sw.Stop();
Console.WriteLine($"Overhead: {sw.ElapsedMilliseconds}ms");
Use FenceLocation presets for common patterns
orderingProvider.InsertFence(FenceType.Device, FenceLocation.Release);
orderingProvider.InsertFence(FenceType.Device, FenceLocation.Acquire);
orderingProvider.InsertFence(FenceType.Device, FenceLocation.FullBarrier);
❌ Don't
Don't use Sequential unless absolutely necessary
// ❌ BAD: 40% overhead for all operations
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.Sequential);
// ✅ GOOD: Release-Acquire + explicit fences
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);
orderingProvider.InsertFence(FenceType.Device, FenceLocation.FullBarrier);
Don't over-fence - profile to find minimal fencing
// ❌ BAD: Excessive fencing degrades performance
orderingProvider.InsertFence(FenceType.System, FenceLocation.KernelEntry);
orderingProvider.InsertFence(FenceType.System, FenceLocation.FullBarrier);
orderingProvider.InsertFence(FenceType.System, FenceLocation.KernelExit);
// ✅ GOOD: Fence only at communication boundaries
orderingProvider.InsertFence(FenceType.Device, FenceLocation.Release);
Don't assume fences fix all race conditions
// Fences ensure ordering, not atomicity
// Use atomic operations for concurrent updates
atomicAdd(&counter, 1); // Atomic operation
__threadfence(); // Fence for visibility
Don't use system fences for intra-GPU communication
// ❌ BAD: System fence overkill (200ns) for single-GPU
orderingProvider.InsertFence(FenceType.System, FenceLocation.FullBarrier);
// ✅ GOOD: Device fence sufficient (100ns)
orderingProvider.InsertFence(FenceType.Device, FenceLocation.FullBarrier);
Don't forget to check for null provider
var provider = accelerator.GetMemoryOrderingProvider();
if (provider == null)
{
    throw new NotSupportedException("Memory ordering not available");
}
Don't change consistency model during kernel execution
// ❌ BAD: Not thread-safe, undefined behavior
Task.Run(() => orderingProvider.SetConsistencyModel(MemoryConsistencyModel.Sequential));
// ✅ GOOD: Configure during initialization only
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);
// ... then launch kernels ...
Troubleshooting
Issue: Race Condition Despite Fences
Symptoms:
- Non-deterministic results
- Values appear stale or corrupted
- Different threads observe different data
Diagnosis:
// Enable sequential consistency for debugging
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.Sequential);
// If race disappears, root cause is insufficient ordering
// If race persists, root cause is missing atomicity
Solutions:
Check atomic operations:
// ❌ BAD: Non-atomic read-modify-write
counter = counter + 1;
__threadfence();
// ✅ GOOD: Atomic operation + fence
atomicAdd(&counter, 1);
__threadfence();
Verify fence placement:
// ❌ BAD: Fence after read (too late)
value = data[tid];
__threadfence();
// ✅ GOOD: Fence before read (acquire)
__threadfence();
value = data[tid];
Increase fence scope:
// Try ThreadBlock → Device → System
orderingProvider.InsertFence(FenceType.System, FenceLocation.FullBarrier);
Issue: Performance Degradation
Symptoms:
- Kernel execution time increased significantly
- Throughput reduced by 20-50%
- Performance worse than CPU
Diagnosis:
// Measure overhead
var baseline = BenchmarkWithoutFences();
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);
var withFences = BenchmarkWithFences();
double overhead = (withFences - baseline) / baseline * 100;
Console.WriteLine($"Overhead: {overhead:F1}%");
Solutions:
Reduce fence scope:
// System (200ns) → Device (100ns) → ThreadBlock (10ns)
orderingProvider.InsertFence(FenceType.ThreadBlock, FenceLocation.Release);
Use relaxed model + explicit fences:
orderingProvider.SetConsistencyModel(MemoryConsistencyModel.Relaxed);
// Add fences only at critical sections (5-10% overhead vs 15-40%)
Batch operations between fences:
// ❌ BAD: Fence per operation (40% overhead)
for (int i = 0; i < N; i++)
{
    data[i] = compute(i);
    __threadfence();
}
// ✅ GOOD: Fence after batch (5% overhead)
for (int i = 0; i < N; i++)
{
    data[i] = compute(i);
}
__threadfence();
Issue: System Fences Not Working
Symptoms:
- CPU doesn't observe GPU writes
- Multi-GPU coordination fails
- System fence throws NotSupportedException
Solutions:
Check UVA support:
nvidia-smi -q | grep "Unified Memory"
# Should show: "Supported: Yes"
Verify compute capability:
var (major, _) = CudaCapabilityManager.GetTargetComputeCapability();
if (major < 2)
{
    Console.WriteLine("System fences require CC 2.0+ (UVA)");
}
Use pinned memory for CPU visibility:
// Allocate pinned (page-locked) memory
var pinnedBuffer = accelerator.AllocatePinned<int>(1024);
// System fences ensure CPU sees GPU writes
Fallback to explicit synchronization:
// If system fences unavailable, use device synchronization
await accelerator.SynchronizeAsync();
// CPU can now safely read GPU memory
Issue: Inconsistent Calibration Results
Symptoms:
- GetOverheadMultiplier() returns unexpected values
- Performance varies between runs
- Fence overhead higher than expected
Solutions:
Warm up GPU before benchmarking:
// Run dummy kernel to warm up GPU
for (int i = 0; i < 5; i++)
{
    await accelerator.ExecuteKernelAsync(warmupKernel);
}
await accelerator.SynchronizeAsync();
// Now measure fence overhead
var overhead = orderingProvider.GetOverheadMultiplier();
Check for thermal throttling:
nvidia-smi --query-gpu=temperature.gpu,clocks.gr --format=csv
# High temperature → clock throttling → variable performance
Lock GPU clocks for consistent benchmarks:
sudo nvidia-smi -pm 1  # Enable persistence mode
sudo nvidia-smi -lgc 1500  # Lock GPU clock to 1500 MHz
API Reference
IMemoryOrderingProvider Interface
public interface IMemoryOrderingProvider
{
/// <summary>
/// Enables causal memory ordering (release-acquire semantics).
/// </summary>
void EnableCausalOrdering(bool enable = true);
/// <summary>
/// Inserts a memory fence at the specified location in kernel code.
/// </summary>
void InsertFence(FenceType type, FenceLocation? location = null);
/// <summary>
/// Configures the memory consistency model for kernel execution.
/// </summary>
void SetConsistencyModel(MemoryConsistencyModel model);
/// <summary>
/// Gets the current memory consistency model.
/// </summary>
MemoryConsistencyModel ConsistencyModel { get; }
/// <summary>
/// Gets whether the device supports acquire-release memory ordering.
/// </summary>
bool IsAcquireReleaseSupported { get; }
/// <summary>
/// Gets the overhead multiplier for the current consistency model.
/// </summary>
double GetOverheadMultiplier();
}
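A short usage sketch tying these members together (preferring Relaxed when acquire-release is only emulated is a policy choice, not a library requirement):
var provider = accelerator.GetMemoryOrderingProvider()
    ?? throw new NotSupportedException("Memory ordering not available");
// Prefer causal ordering when the hardware supports it natively
provider.SetConsistencyModel(provider.IsAcquireReleaseSupported
    ? MemoryConsistencyModel.ReleaseAcquire
    : MemoryConsistencyModel.Relaxed);
Console.WriteLine($"Model: {provider.ConsistencyModel}, overhead: {provider.GetOverheadMultiplier():F2}×");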
MemoryConsistencyModel Enum
public enum MemoryConsistencyModel
{
/// <summary>
/// Relaxed: No ordering guarantees (1.0× performance).
/// </summary>
Relaxed = 0,
/// <summary>
/// Release-Acquire: Causal ordering (0.85× performance).
/// </summary>
ReleaseAcquire = 1,
/// <summary>
/// Sequential: Total order (0.60× performance).
/// </summary>
Sequential = 2
}
FenceType Enum
public enum FenceType
{
/// <summary>
/// Thread-block scope (~10ns latency).
/// </summary>
ThreadBlock = 0,
/// <summary>
/// Device-wide scope (~100ns latency).
/// </summary>
Device = 1,
/// <summary>
/// System-wide scope (~200ns latency, requires UVA).
/// </summary>
System = 2
}
FenceLocation Class
public sealed class FenceLocation
{
public int? InstructionIndex { get; init; }
public bool AtEntry { get; init; }
public bool AtExit { get; init; }
public bool AfterWrites { get; init; }
public bool BeforeReads { get; init; }
// Presets
public static FenceLocation Release { get; } // AfterWrites
public static FenceLocation Acquire { get; } // BeforeReads
public static FenceLocation FullBarrier { get; } // AfterWrites + BeforeReads
public static FenceLocation KernelEntry { get; } // AtEntry
public static FenceLocation KernelExit { get; } // AtExit
}
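The presets cover the common patterns; for anything else the init properties compose directly. A sketch (the instruction index 42 is arbitrary):
// Hypothetical custom placement: a fence after the writes at one
// specific instruction, rather than at a preset location
var custom = new FenceLocation { InstructionIndex = 42, AfterWrites = true };
orderingProvider.InsertFence(FenceType.Device, custom);
// Preset equivalents for the common cases:
orderingProvider.InsertFence(FenceType.Device, FenceLocation.Release);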
Related Documentation
- Timing API - High-precision GPU timestamps for profiling
- Barrier API - Hardware-accelerated thread synchronization
- Multi-GPU Guide - Cross-device coordination
- Orleans Integration - Actor-based distributed computing
- Performance Tuning - Optimization strategies
Version History
v0.5.0-alpha (Current)
- Initial release of Memory Ordering API
- Three consistency models: Relaxed, Release-Acquire, Sequential
- Three fence scopes: Thread-Block, Device, System
- Strategic fence placement with FenceLocation
- CUDA backend implementation with hardware detection
- Comprehensive documentation and examples
Planned Enhancements (v0.6.0)
- Automatic fence insertion during kernel compilation
- Memory ordering visualization tools
- Performance profiling integration
- OpenCL and Metal backend implementations
Next: Timing API | Barrier API | Multi-GPU Guide