Class CudaMemoryOrderingProvider
- Namespace
- DotCompute.Backends.CUDA.Memory
- Assembly
- DotCompute.Backends.CUDA.dll
CUDA-specific implementation of memory ordering primitives.
public sealed class CudaMemoryOrderingProvider : IMemoryOrderingProvider, IFenceInjectionService, IDisposable
- Inheritance
- object → CudaMemoryOrderingProvider
- Implements
- IMemoryOrderingProvider, IFenceInjectionService, IDisposable
Remarks
This provider implements causal memory ordering using CUDA's __threadfence_* intrinsics:
- __threadfence_block(): Thread-block scope (~10ns)
- __threadfence(): Device scope (~100ns)
- __threadfence_system(): System scope (~200ns)
Compute Capability Requirements:
- CC 2.0+: Thread-block and device fences
- CC 2.0+ with UVA: System fences (requires unified virtual addressing)
- CC 7.0+ (Volta): Hardware acquire-release support
Performance Characteristics:
- Relaxed model: 1.0× (baseline, no overhead)
- Release-Acquire model: 0.85× (15% overhead)
- Sequential model: 0.60× (40% overhead)
Thread Safety: Configuration methods (SetConsistencyModel, EnableCausalOrdering) are not thread-safe and should be called during initialization only. Fence insertion is safe to call concurrently from multiple threads.
Fence Injection: This provider also implements IFenceInjectionService to bridge fence requests with the kernel compiler. When InsertFence(FenceType, FenceLocation?) is called, the fence request is queued and can be retrieved during kernel compilation via GetPendingFences(). The compiler should inject the appropriate PTX fence instructions and then call ClearPendingFences().
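The request/drain/clear cycle described above might look like this from the compiler's side (a minimal sketch; the PTX-emission helper `EmitPtxFence` is a hypothetical placeholder for the compiler's own code generator):

```csharp
using DotCompute.Backends.CUDA.Memory;

var provider = new CudaMemoryOrderingProvider();

// Kernel author requests a device-scope fence.
provider.InsertFence(FenceType.Device);

// During compilation, the compiler drains the queue...
foreach (var fence in provider.GetPendingFences())
{
    EmitPtxFence(fence); // hypothetical: emits the matching PTX fence instruction
}

// ...and clears it so the next compilation starts fresh.
provider.ClearPendingFences();
```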
Constructors
CudaMemoryOrderingProvider(ILogger?)
Initializes a new CUDA memory ordering provider.
public CudaMemoryOrderingProvider(ILogger? logger = null)
Parameters
logger ILogger: Optional logger for diagnostic output.
Properties
ConsistencyModel
Gets the current memory consistency model.
public MemoryConsistencyModel ConsistencyModel { get; }
Property Value
- MemoryConsistencyModel
The active consistency model, or Relaxed if not explicitly set.
IsAcquireReleaseSupported
Gets whether the device supports acquire-release memory ordering.
public bool IsAcquireReleaseSupported { get; }
Property Value
- bool
True if release-acquire semantics are supported in hardware, false otherwise.
Remarks
Hardware Support:
- CUDA 9.0+ (Volta CC 7.0+): Native support
- OpenCL 2.0+: Via atomic_work_item_fence()
- Older GPUs: Software emulation via fences (higher overhead)
If false, the provider may emulate acquire-release using pervasive fences, increasing overhead from 15% to 30-40%.
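A caller can use this flag to budget for the emulation cost before opting into causal ordering (a sketch; the overhead figures are taken from the remarks above, and `acceptableOverhead` is an assumed application threshold):

```csharp
var provider = new CudaMemoryOrderingProvider();

// On Volta (CC 7.0+) acquire-release is native (~15% overhead);
// on older parts it is emulated with fences (~30-40% overhead).
double estimatedOverhead = provider.IsAcquireReleaseSupported ? 0.15 : 0.40;

const double acceptableOverhead = 0.20; // assumed application budget

if (estimatedOverhead <= acceptableOverhead)
{
    provider.EnableCausalOrdering(true);
}
```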
IsCausalOrderingEnabled
Gets whether causal memory ordering is currently enabled.
public bool IsCausalOrderingEnabled { get; }
Property Value
- bool
True if release-acquire semantics are active, false if using relaxed model.
Remarks
Indicates whether EnableCausalOrdering(bool) has been called with true.
PendingFenceCount
Gets the number of pending fence requests.
public int PendingFenceCount { get; }
Property Value
- int
The number of fence requests currently queued for injection.
SupportsSystemFences
Gets whether system-wide memory fences are supported.
public bool SupportsSystemFences { get; }
Property Value
- bool
True if FenceType.System is supported (CPU+GPU visibility), false otherwise.
Remarks
Requirements:
- CUDA: Compute Capability 2.0+ with Unified Virtual Addressing (UVA)
- OpenCL: Shared virtual memory (SVM) support
- Metal: Shared memory resources enabled
If false, system fences will throw NotSupportedException.
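Because an unsupported system fence throws, guard system-scope fences behind this property (a sketch using only the members documented here):

```csharp
var provider = new CudaMemoryOrderingProvider();

if (provider.SupportsSystemFences)
{
    // CPU+GPU visibility: requires UVA on CUDA.
    provider.InsertFence(FenceType.System);
}
else
{
    // Fall back to device scope; only blocks on this GPU observe the ordering.
    provider.InsertFence(FenceType.Device);
}
```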
Methods
ClearPendingFences()
Clears all pending fence requests after they have been processed.
public void ClearPendingFences()
Remarks
Call this after successfully compiling a kernel to avoid re-injecting the same fences in subsequent compilations.
Dispose()
Disposes the memory ordering provider.
public void Dispose()
EnableCausalOrdering(bool)
Enables causal memory ordering (release-acquire semantics).
public void EnableCausalOrdering(bool enable = true)
Parameters
enable bool: True to enable causal ordering, false to use the relaxed model.
Examples
// Host side: enable once during initialization (not thread-safe; see Remarks)
provider.EnableCausalOrdering(true);
// Kernel, producer thread (release)
data[tid] = compute();               // payload written first
flag[tid] = READY;                   // release: payload is visible before the flag
// Kernel, consumer thread (acquire)
while (flag[producer] != READY) { }  // acquire: spin until the flag is observed
value = data[producer];              // guaranteed to see the producer's write
Remarks
When enabled, memory operations use release-acquire semantics:
- Release (Write): All prior memory operations complete before the write is visible
- Acquire (Read): All subsequent memory operations see values written before the read
- Causality: If A writes and B reads, B observes all of A's prior writes
Implementation:
- CUDA: Automatic fence insertion before stores (release) and after loads (acquire)
- OpenCL: mem_fence() with acquire/release flags
- CPU: Volatile + Interlocked operations
Performance: ~15% overhead from fence insertion.
Thread Safety: This setting affects all kernels launched after this call. Not thread-safe; call during initialization only.
GetFencesForLocation(bool, bool, bool, bool)
Gets fences appropriate for the specified location in kernel code.
public IReadOnlyList<FenceRequest> GetFencesForLocation(bool atEntry = false, bool atExit = false, bool afterWrites = false, bool beforeReads = false)
Parameters
atEntry bool: Include fences marked for kernel entry.
atExit bool: Include fences marked for kernel exit.
afterWrites bool: Include fences marked for after write operations.
beforeReads bool: Include fences marked for before read operations.
Returns
- IReadOnlyList<FenceRequest>
Filtered collection of fence requests matching the criteria.
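During code generation, the compiler can query fences per injection site; a sketch (the `EmitFence` call is a hypothetical placeholder for the compiler's own emitter):

```csharp
// At kernel entry: inject entry fences.
foreach (var fence in provider.GetFencesForLocation(atEntry: true))
    EmitFence(fence);

// After each store: inject release-side fences.
foreach (var fence in provider.GetFencesForLocation(afterWrites: true))
    EmitFence(fence);

// At kernel exit: inject exit fences.
foreach (var fence in provider.GetFencesForLocation(atExit: true))
    EmitFence(fence);
```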
GetOverheadMultiplier()
Gets the overhead multiplier for the current consistency model.
public double GetOverheadMultiplier()
Returns
- double
Performance multiplier (1.0 = no overhead). Lower values indicate higher overhead.
Remarks
Use this to estimate performance impact of memory ordering:
- Relaxed: 1.0× (baseline)
- ReleaseAcquire: 0.85× (15% overhead)
- Sequential: 0.60× (40% overhead)
Actual overhead depends on memory access patterns: compute-bound kernels see minimal impact, while memory-bound kernels see higher impact.
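The multiplier can be used to project a kernel's expected throughput under the active model (a sketch; `baselineGflops` is an assumed measured value under the Relaxed model):

```csharp
var provider = new CudaMemoryOrderingProvider();
provider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);

double baselineGflops = 950.0;                        // assumed: measured under Relaxed
double multiplier = provider.GetOverheadMultiplier(); // 0.85 for ReleaseAcquire

double projected = baselineGflops * multiplier;       // rough estimate under the new model
```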
GetPendingFences()
Gets all pending fence requests that should be injected during the next compilation.
public IReadOnlyList<FenceRequest> GetPendingFences()
Returns
- IReadOnlyList<FenceRequest>
A collection of pending fence requests, ordered by request time.
Remarks
This method returns a snapshot of pending fences. The returned collection is not affected by subsequent calls to QueueFence(FenceRequest) or ClearPendingFences().
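The snapshot semantics mean a compilation pass can safely iterate the returned collection while new requests arrive (sketch using only the members documented here):

```csharp
var provider = new CudaMemoryOrderingProvider();
provider.InsertFence(FenceType.Device);

var snapshot = provider.GetPendingFences();

// A request queued after the snapshot...
provider.InsertFence(FenceType.ThreadBlock);

// ...does not appear in the snapshot already taken:
// snapshot.Count is unchanged, while provider.PendingFenceCount has grown.
```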
InsertFence(FenceType, FenceLocation?)
Inserts a memory fence at the specified location in kernel code.
public void InsertFence(FenceType type, FenceLocation? location = null)
Parameters
type FenceType: The fence scope (ThreadBlock, Device, or System).
location FenceLocation: Optional fence location specification. If null, the fence is inserted at the current location.
Examples
// Producer-consumer with explicit fences
provider.InsertFence(FenceType.Device, FenceLocation.Release); // After write
provider.InsertFence(FenceType.Device, FenceLocation.Acquire); // Before read
Remarks
Memory fences provide explicit control over operation ordering and visibility. Fences ensure that all memory operations before the fence complete before any operations after the fence begin.
Fence Types:
- ThreadBlock: Visibility within one block (~10ns)
- Device: Visibility across all blocks on one GPU (~100ns)
- System: Visibility across CPU, all GPUs (~200ns)
Strategic Placement:
- After writes: Release semantics (producer)
- Before reads: Acquire semantics (consumer)
- Both: Full barrier (strongest guarantee)
⚠️ Warning: Excessive fencing degrades performance. Profile your kernel to identify minimal fence placement that ensures correctness.
Exceptions
- NotSupportedException
Thrown when the specified fence type is not supported by the device.
QueueFence(FenceRequest)
Queues a new fence request for injection during the next kernel compilation.
public void QueueFence(FenceRequest request)
Parameters
request FenceRequest: The fence request to queue.
SetConsistencyModel(MemoryConsistencyModel)
Configures the memory consistency model for kernel execution.
public void SetConsistencyModel(MemoryConsistencyModel model)
Parameters
model MemoryConsistencyModel: The desired consistency model.
Remarks
The consistency model determines default ordering guarantees for all memory operations:
- Relaxed: No ordering (1.0× performance, default GPU model)
- ReleaseAcquire: Causal ordering (0.85× performance, recommended)
- Sequential: Total order (0.60× performance, strongest guarantee)
Model Selection Guide:
- Use Relaxed: Data-parallel algorithms with no inter-thread communication
- Use ReleaseAcquire: Message passing, actor systems, distributed data structures (default for Orleans.GpuBridge)
- Use Sequential: Complex algorithms requiring total order, or debugging race conditions
Thread Safety: This setting affects all kernels launched after this call. Not thread-safe; call during initialization only.
Performance Tip: Start with Relaxed and add explicit fences only where needed. This often outperforms pervasive models (ReleaseAcquire/Sequential).
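A typical initialization following the guide above (a sketch; it falls back to Relaxed plus an explicit fence when the requested model is unsupported, per the documented NotSupportedException):

```csharp
var provider = new CudaMemoryOrderingProvider();

try
{
    // Message-passing workload: causal ordering is the recommended model.
    provider.SetConsistencyModel(MemoryConsistencyModel.ReleaseAcquire);
}
catch (NotSupportedException)
{
    // Fall back to the relaxed model and place fences by hand where needed.
    provider.SetConsistencyModel(MemoryConsistencyModel.Relaxed);
    provider.InsertFence(FenceType.Device, FenceLocation.Release);
}
```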
Exceptions
- NotSupportedException
Thrown when the specified consistency model is not supported by the device.