Class CudaMemoryOrderingProvider
- Namespace
- DotCompute.Backends.CUDA.Memory
- Assembly
- DotCompute.Backends.CUDA.dll
CUDA-specific implementation of memory ordering primitives.
public sealed class CudaMemoryOrderingProvider : IMemoryOrderingProvider, IFenceInjectionService, IDisposable
- Inheritance
- object → CudaMemoryOrderingProvider
- Implements
- IMemoryOrderingProvider, IFenceInjectionService, IDisposable
Remarks
This provider implements causal memory ordering using CUDA's __threadfence_* intrinsics:
- __threadfence_block(): Thread-block scope (~10ns)
- __threadfence(): Device scope (~100ns)
- __threadfence_system(): System scope (~200ns)
Compute Capability Requirements:
- CC 2.0+: Thread-block and device fences
- CC 2.0+ with UVA: System fences (requires unified virtual addressing)
- CC 7.0+ (Volta): Hardware acquire-release support
Performance Characteristics:
- Relaxed model: 1.0× (baseline, no overhead)
- Release-Acquire model: 0.85× (15% overhead)
- Sequential model: 0.60× (40% overhead)
Thread Safety: Configuration methods (SetConsistencyModel, EnableCausalOrdering) are not thread-safe and should be called during initialization only. Fence insertion is safe to call concurrently from multiple threads.
Fence Injection: This provider also implements IFenceInjectionService to bridge fence requests with the kernel compiler. When InsertFence(FenceType, FenceLocation?) is called, the fence request is queued and can be retrieved during kernel compilation via GetPendingFences(). The compiler should inject the appropriate PTX fence instructions and then call ClearPendingFences().
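The request/drain/clear cycle described above might look like this from the compiler's side (a minimal sketch; the PTX-emission helper `EmitPtxFence` is a hypothetical placeholder for the compiler's own code generator):

```csharp
using DotCompute.Backends.CUDA.Memory;

var provider = new CudaMemoryOrderingProvider();

// Kernel author requests a device-scope fence.
provider.InsertFence(FenceType.Device);

// During compilation, the compiler drains the queue...
foreach (var fence in provider.GetPendingFences())
{
    EmitPtxFence(fence); // hypothetical: emits the matching PTX fence instruction
}

// ...and clears it so the next compilation starts fresh.
provider.ClearPendingFences();
```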
Constructors
CudaMemoryOrderingProvider(ILogger?)
Initializes a new CUDA memory ordering provider.
public CudaMemoryOrderingProvider(ILogger? logger = null)
Parameters
logger ILogger: Optional logger for diagnostic output.
Properties
ConsistencyModel
Gets the current memory consistency model.
public MemoryConsistencyModel ConsistencyModel { get; }
Property Value
- MemoryConsistencyModel
The active consistency model, or Relaxed if not explicitly set.
IsAcquireReleaseSupported
Gets whether the device supports acquire-release memory ordering.
public bool IsAcquireReleaseSupported { get; }
Property Value
- bool
True if release-acquire semantics are supported in hardware, false otherwise.
Remarks
Hardware Support:
- CUDA 9.0+ (Volta CC 7.0+): Native support
- OpenCL 2.0+: Via atomic_work_item_fence()
- Older GPUs: Software emulation via fences (higher overhead)
If false, the provider may emulate acquire-release using pervasive fences, increasing overhead from 15% to 30-40%.
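A caller can use this flag to budget for the emulation cost before opting into causal ordering (a sketch; the overhead figures are taken from the remarks above, and `acceptableOverhead` is an assumed application threshold):

```csharp
var provider = new CudaMemoryOrderingProvider();

// On Volta (CC 7.0+) acquire-release is native (~15% overhead);
// on older parts it is emulated with fences (~30-40% overhead).
double estimatedOverhead = provider.IsAcquireReleaseSupported ? 0.15 : 0.40;

const double acceptableOverhead = 0.20; // assumed application budget

if (estimatedOverhead <= acceptableOverhead)
{
    provider.EnableCausalOrdering(true);
}
```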
IsCausalOrderingEnabled
Gets whether causal memory ordering is currently enabled.
public bool IsCausalOrderingEnabled { get; }
Property Value
- bool
True if release-acquire semantics are active, false if using relaxed model.
Remarks
Indicates whether EnableCausalOrdering(bool) has been called with true.
PendingFenceCount
Gets the number of pending fence requests.
public int PendingFenceCount { get; }
Property Value
- int
The number of fence requests currently queued for injection.
SupportsSystemFences
Gets whether system-wide memory fences are supported.
public bool SupportsSystemFences { get; }
Property Value
- bool
True if FenceType.System is supported (CPU+GPU visibility), false otherwise.
Remarks
Requirements:
- CUDA: Compute Capability 2.0+ with Unified Virtual Addressing (UVA)
- OpenCL: Shared virtual memory (SVM) support
- Metal: Shared memory resources enabled
If false, system fences will throw NotSupportedException.
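Because an unsupported system fence throws, guard system-scope fences behind this property (a sketch using only the members documented here):

```csharp
var provider = new CudaMemoryOrderingProvider();

if (provider.SupportsSystemFences)
{
    // CPU+GPU visibility: requires UVA on CUDA.
    provider.InsertFence(FenceType.System);
}
else
{
    // Fall back to device scope; only blocks on this GPU observe the ordering.
    provider.InsertFence(FenceType.Device);
}
```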
Methods
ClearPendingFences()
Clears all pending fence requests after they have been processed.
public void ClearPendingFences()
Remarks
Call this after successfully compiling a kernel to avoid re-injecting the same fences in subsequent compilations.
Dispose()
Disposes the memory ordering provider.
public void Dispose()
EnableCausalOrdering(bool)
Enables causal memory ordering (release-acquire semantics).
public void EnableCausalOrdering(bool enable = true)
Parameters
enable bool: True to enable causal ordering, false to use the relaxed model.
Examples
// Host side: enable once during initialization (not thread-safe; see Remarks)
provider.EnableCausalOrdering(true);
// Kernel, producer thread (release)
data[tid] = compute();               // payload written first
flag[tid] = READY;                   // release: payload is visible before the flag
// Kernel, consumer thread (acquire)
while (flag[producer] != READY) { }  // acquire: spin until the flag is observed
value = data[producer];              // guaranteed to see the producer's write
Remarks
When enabled, memory operations use release-acquire semantics:
- Release (Write): All prior memory operations complete before the write is visible
- Acquire (Read): All subsequent memory operations see values written before the read
- Causality: If A writes and B reads, B observes all of A's prior writes
Implementation:
- CUDA: Automatic fence insertion before stores (release) and after loads (acquire)
- OpenCL: mem_fence() with acquire/release flags
- CPU: Volatile + Interlocked operations
Performance: ~15% overhead from fence insertion.
Thread Safety: This setting affects all kernels launched after this call. Not thread-safe; call during initialization only.
GetFencesForLocation(bool, bool, bool, bool)
Gets fences appropriate for the specified location in kernel code.
public IReadOnlyList<FenceRequest> GetFencesForLocation(bool atEntry = false, bool atExit = false, bool afterWrites = false, bool beforeReads = false)
Parameters
atEntry bool: Include fences marked for kernel entry.
atExit bool: Include fences marked for kernel exit.
afterWrites bool: Include fences marked for after write operations.
beforeReads bool: Include fences marked for before read operations.
Returns
- IReadOnlyList<FenceRequest>
Filtered collection of fence requests matching the criteria.
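During code generation, the compiler can query fences per injection site; a sketch (the `EmitFence` call is a hypothetical placeholder for the compiler's own emitter):

```csharp
// At kernel entry: inject entry fences.
foreach (var fence in provider.GetFencesForLocation(atEntry: true))
    EmitFence(fence);

// After each store: inject release-side fences.
foreach (var fence in provider.GetFencesForLocation(afterWrites: true))
    EmitFence(fence);

// At kernel exit: inject exit fences.
foreach (var fence in provider.GetFencesForLocation(atExit: true))
    EmitFence(fence);
```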
GetOverheadMultiplier()
Gets the overhead multiplier for the current consistency model.
public double GetOverheadMultiplier()
Returns
- double
Performance multiplier (1.0 = no overhead). Lower values indicate higher overhead.
Remarks
Use this to estimate performance impact of memory ordering:
- Relaxed: 1.0× (baseline)
- ReleaseAcquire: 0.85× (15% overhead)
- Sequential: 0.60× (40% overhead)
Actual overhead depends on memory access patterns: compute-bound kernels see minimal impact, while memory-bound kernels see higher impact.
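The multiplier can be used to project a kernel's expected throughput under the active model (a sketch; `baselineGflops` is an assumed measured value under the Relaxed model):

```csharp
var provider = new CudaMemoryOrderingProvider();
provider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);

double baselineGflops = 950.0;                        // assumed: measured under Relaxed
double multiplier = provider.GetOverheadMultiplier(); // 0.85 for ReleaseAcquire

double projected = baselineGflops * multiplier;       // rough estimate under the new model
```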
GetPendingFences()
Gets all pending fence requests that should be injected during the next compilation.
public IReadOnlyList<FenceRequest> GetPendingFences()
Returns
- IReadOnlyList<FenceRequest>
A collection of pending fence requests, ordered by request time.
Remarks
This method returns a snapshot of pending fences. The returned collection is not affected by subsequent calls to QueueFence(FenceRequest) or ClearPendingFences().
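The snapshot semantics mean a compilation pass can safely iterate the returned collection while new requests arrive (sketch using only the members documented here):

```csharp
var provider = new CudaMemoryOrderingProvider();
provider.InsertFence(FenceType.Device);

var snapshot = provider.GetPendingFences();

// A request queued after the snapshot...
provider.InsertFence(FenceType.ThreadBlock);

// ...does not appear in the snapshot already taken:
// snapshot.Count is unchanged, while provider.PendingFenceCount has grown.
```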
InsertFence(FenceType, FenceLocation?)
Inserts a memory fence at the specified location in kernel code.
public void InsertFence(FenceType type, FenceLocation? location = null)
Parameters
type FenceType: The fence scope (ThreadBlock, Device, or System).
location FenceLocation: Optional fence location specification. If null, the fence is inserted at the current location.
Examples
// Producer-consumer with explicit fences
provider.InsertFence(FenceType.Device, FenceLocation.Release); // After write
provider.InsertFence(FenceType.Device, FenceLocation.Acquire); // Before read
Remarks
Memory fences provide explicit control over operation ordering and visibility. Fences ensure that all memory operations before the fence complete before any operations after the fence begin.
Fence Types:
- ThreadBlock: Visibility within one block (~10ns)
- Device: Visibility across all blocks on one GPU (~100ns)
- System: Visibility across CPU, all GPUs (~200ns)
Strategic Placement:
- After writes: Release semantics (producer)
- Before reads: Acquire semantics (consumer)
- Both: Full barrier (strongest guarantee)
⚠️ Warning: Excessive fencing degrades performance. Profile your kernel to identify minimal fence placement that ensures correctness.
Exceptions
- NotSupportedException
Thrown when the specified fence type is not supported by the device.
QueueFence(FenceRequest)
Queues a new fence request for injection during the next kernel compilation.
public void QueueFence(FenceRequest request)
Parameters
request FenceRequest: The fence request to queue.
SetConsistencyModel(MemoryConsistencyModel)
Configures the memory consistency model for kernel execution.
public void SetConsistencyModel(MemoryConsistencyModel model)
Parameters
model MemoryConsistencyModel: The desired consistency model.
Remarks
The consistency model determines default ordering guarantees for all memory operations:
- Relaxed: No ordering (1.0× performance, default GPU model)
- ReleaseAcquire: Causal ordering (0.85× performance, recommended)
- Sequential: Total order (0.60× performance, strongest guarantee)
Model Selection Guide:
- Use Relaxed: Data-parallel algorithms with no inter-thread communication
- Use ReleaseAcquire: Message passing, actor systems, distributed data structures (default for Orleans.GpuBridge)
- Use Sequential: Complex algorithms requiring total order, or debugging race conditions
Thread Safety: This setting affects all kernels launched after this call. Not thread-safe; call during initialization only.
Performance Tip: Start with Relaxed and add explicit fences only where needed. This often outperforms pervasive models (ReleaseAcquire/Sequential).
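A typical initialization following the guide above (a sketch; it falls back to Relaxed plus an explicit fence when the requested model is unsupported, per the documented NotSupportedException):

```csharp
var provider = new CudaMemoryOrderingProvider();

try
{
    // Message-passing workload: causal ordering is the recommended model.
    provider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);
}
catch (NotSupportedException)
{
    // Fall back to the relaxed model and place fences by hand where needed.
    provider.SetConsistencyModel(MemoryConsistencyModel.Relaxed);
    provider.InsertFence(FenceType.Device, FenceLocation.Release);
}
```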
Exceptions
- NotSupportedException
Thrown when the specified consistency model is not supported by the device.