Enum FenceType

Namespace
DotCompute.Abstractions.Memory
Assembly
DotCompute.Abstractions.dll

Specifies the scope of memory fence operations for GPU synchronization.

public enum FenceType

Fields

Device = 1

Device-wide fence ensuring memory consistency across all thread blocks on a single GPU.

CUDA: __threadfence()

Visibility: All threads on the same GPU see memory updates after this fence. Does not guarantee visibility to the host CPU or to other GPUs.

Use Cases:

  • Grid-wide producer-consumer patterns
  • Device-global data structure updates
  • Inter-block communication via global memory

Performance: ~100ns latency (medium overhead).

System = 2

System-wide fence ensuring memory consistency across CPU, GPU, and all devices.

CUDA: __threadfence_system()

Visibility: All processors in the system (CPU, all GPUs) see memory updates after this fence. This is the strongest consistency guarantee and carries the highest overhead.

Use Cases:

  • GPU-CPU communication via mapped/pinned memory
  • Multi-GPU synchronization
  • System-wide distributed data structures
  • Causal message passing in Orleans.GpuBridge.Core

Performance: ~200ns latency (slowest, strongest guarantee).

Requirements: Unified virtual addressing (UVA) on CUDA.

ThreadBlock = 0

Thread-block scope fence ensuring memory consistency within a single thread block.

CUDA: __threadfence_block()

Visibility: All threads in the same thread block see memory updates after this fence. Does not guarantee visibility to threads in other blocks.

Use Cases:

  • Producer-consumer patterns within a block
  • Shared memory synchronization
  • Block-local data structure updates

Performance: ~10ns latency (fastest fence type).

Remarks

Memory fences control the visibility of memory operations across different levels of the GPU memory hierarchy. The fence type determines which threads observe the memory consistency guarantees.

Platform Mapping:

  • CUDA: __threadfence_block(), __threadfence(), __threadfence_system()
  • OpenCL: mem_fence(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE)
  • Metal: threadgroup_barrier() with mem_flags::mem_threadgroup or mem_flags::mem_device
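
The CUDA row of the mapping above can be written as a small lookup. This is an illustrative sketch only: it redeclares FenceType locally, using the values documented on this page, so that it compiles on its own. FenceMapping and ToCudaIntrinsic are hypothetical names, not part of DotCompute.

```csharp
using System;

// Local mirror of the documented enum so the sketch is self-contained.
// In real code, reference DotCompute.Abstractions.Memory.FenceType instead.
public enum FenceType
{
    ThreadBlock = 0,
    Device = 1,
    System = 2
}

// Hypothetical helper (assumption: not an actual DotCompute API).
public static class FenceMapping
{
    // Returns the CUDA intrinsic this page lists for each fence scope.
    public static string ToCudaIntrinsic(FenceType fence) => fence switch
    {
        FenceType.ThreadBlock => "__threadfence_block()",
        FenceType.Device => "__threadfence()",
        FenceType.System => "__threadfence_system()",
        _ => throw new ArgumentOutOfRangeException(nameof(fence), fence, "Unknown fence type")
    };
}
```

A switch expression with an explicit default arm keeps the mapping exhaustive: an out-of-range value cast to FenceType fails loudly instead of silently selecting a weaker fence.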

Performance vs. Scope Trade-off:

  • ThreadBlock: Fastest (~10ns), limited visibility
  • Device: Medium (~100ns), intra-device visibility
  • System: Slowest (~200ns), inter-device visibility
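
One way to apply this trade-off is to pick the narrowest scope that still covers every reader of the data. The sketch below assumes that rule. FenceSelector and its parameters are hypothetical names introduced for illustration, and FenceType is redeclared locally with the documented values so the snippet stands alone.

```csharp
using System;

// Local mirror of the documented enum (values taken from this page).
public enum FenceType
{
    ThreadBlock = 0,
    Device = 1,
    System = 2
}

// Hypothetical helper (assumption: not an actual DotCompute API).
public static class FenceSelector
{
    // readersInOtherBlocks: threads in other blocks on the same GPU must see the writes.
    // readersOutsideDevice: the host CPU or another GPU must see the writes.
    public static FenceType Narrowest(bool readersInOtherBlocks, bool readersOutsideDevice)
    {
        // Widest requirement wins: visibility outside the device needs a System fence.
        if (readersOutsideDevice) return FenceType.System;
        // Cross-block readers on the same GPU need a Device fence.
        if (readersInOtherBlocks) return FenceType.Device;
        // Otherwise the cheapest, block-local fence is sufficient.
        return FenceType.ThreadBlock;
    }
}
```

For example, a kernel that only hands data between threads of one block would call Narrowest(false, false) and get FenceType.ThreadBlock, while one that publishes results to the host through mapped memory would pass readersOutsideDevice: true and get FenceType.System.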