Enum FenceType
- Namespace: DotCompute.Abstractions.Memory
- Assembly: DotCompute.Abstractions.dll
Specifies the scope of memory fence operations for GPU synchronization.
public enum FenceType
Fields
Device = 1
Device-wide fence ensuring memory consistency across all thread blocks on a single GPU.
CUDA: __threadfence()
Visibility: All threads on the same GPU see memory updates after this fence. Does not guarantee visibility to host CPU or other GPUs.
Use Cases:
- Grid-wide producer-consumer patterns
- Device-global data structure updates
- Inter-block communication via global memory
Performance: ~100ns latency (medium overhead).
System = 2
System-wide fence ensuring memory consistency across CPU, GPU, and all devices.
CUDA: __threadfence_system()
Visibility: All processors in the system (CPU, all GPUs) see memory updates after this fence. Strongest consistency guarantee, highest overhead.
Use Cases:
- GPU-CPU communication via mapped/pinned memory
- Multi-GPU synchronization
- System-wide distributed data structures
- Causal message passing in Orleans.GpuBridge.Core
Performance: ~200ns latency (slowest, strongest guarantee).
Requirements: Unified virtual addressing (UVA) on CUDA.
ThreadBlock = 0
Thread-block scope fence ensuring memory consistency within a single thread block.
CUDA: __threadfence_block()
Visibility: All threads in the same thread block see memory updates after this fence. Does not guarantee visibility to threads in other blocks.
Use Cases:
- Producer-consumer patterns within a block
- Shared memory synchronization
- Block-local data structure updates
Performance: ~10ns latency (fastest fence type).
Remarks
Memory fences control the visibility of memory operations across different levels of the GPU memory hierarchy. The fence type determines which threads (and, for System, which other processors) are guaranteed to observe writes issued before the fence.
Platform Mapping:
- CUDA: __threadfence_block(), __threadfence(), __threadfence_system()
- OpenCL: mem_fence(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE)
- Metal: threadgroup_barrier(), device_barrier(), system_barrier()
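The CUDA row of this mapping can be sketched as a small lookup. This is an illustration only, not DotCompute's implementation: the `FenceType` enum below is a local mirror of the documented values so the snippet compiles on its own, and `FenceMapping.ToCudaIntrinsic` is a hypothetical helper name.

```csharp
using System;

// Local mirror of the documented enum values, declared here only so the
// sketch is self-contained. Real code would reference
// DotCompute.Abstractions.Memory.FenceType instead.
public enum FenceType
{
    ThreadBlock = 0,
    Device = 1,
    System = 2
}

// Hypothetical helper (an assumption for illustration, not a DotCompute API).
public static class FenceMapping
{
    // Returns the CUDA intrinsic call that the platform mapping lists
    // for the given fence scope.
    public static string ToCudaIntrinsic(FenceType fence) => fence switch
    {
        FenceType.ThreadBlock => "__threadfence_block()",
        FenceType.Device => "__threadfence()",
        FenceType.System => "__threadfence_system()",
        _ => throw new ArgumentOutOfRangeException(nameof(fence), fence, "Unknown fence type")
    };
}
```

A kernel generator could call `FenceMapping.ToCudaIntrinsic(FenceType.Device)` to emit `__threadfence()` between a data write and the flag write that publishes it.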
Performance vs. Scope Trade-off:
| Fence type | Approx. latency | Visibility |
|---|---|---|
| ThreadBlock | ~10ns (fastest) | Limited to one thread block |
| Device | ~100ns (medium) | Intra-device |
| System | ~200ns (slowest) | Inter-device |
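Because the enum values increase with scope (ThreadBlock = 0, Device = 1, System = 2), one way to apply this trade-off is to pick the narrowest fence that still covers every reader of the data. The following is a minimal sketch under that assumption: `FenceSelector`, `Narrowest`, `Covers`, and the two boolean parameters are hypothetical names, and the enum is again a local mirror of the documented values.

```csharp
// Local mirror of the documented enum values so the sketch is self-contained.
public enum FenceType
{
    ThreadBlock = 0,
    Device = 1,
    System = 2
}

// Hypothetical helper (an assumption for illustration, not a DotCompute API).
public static class FenceSelector
{
    // Picks the cheapest fence whose visibility still reaches all readers.
    public static FenceType Narrowest(bool readersInOtherBlocks, bool readersOutsideDevice)
    {
        if (readersOutsideDevice) return FenceType.System;  // host CPU or another GPU reads the data
        if (readersInOtherBlocks) return FenceType.Device;  // other thread blocks on the same GPU read it
        return FenceType.ThreadBlock;                       // only threads in the same block read it
    }

    // True when 'fence' is at least as wide as 'required'.
    // Relies on the documented values being ordered by scope.
    public static bool Covers(FenceType fence, FenceType required) =>
        (int)fence >= (int)required;
}
```

For example, `FenceSelector.Narrowest(readersInOtherBlocks: true, readersOutsideDevice: false)` returns `FenceType.Device`, which avoids paying the System-level cost when no host or peer-GPU reader is involved.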