Enum BarrierScope

Namespace: DotCompute.Abstractions.Barriers
Assembly: DotCompute.Abstractions.dll

Defines the synchronization scope for GPU thread barriers.

public enum BarrierScope

Fields

Grid = 1

Synchronize all threads across all blocks in the entire kernel grid.

Grid-wide barriers enable global synchronization for algorithms requiring inter-block communication. Requires cooperative kernel launch via cudaLaunchCooperativeKernel.

Use Cases:

  • Global reductions to single value
  • Iterative algorithms across entire dataset
  • Multi-step computations requiring global state

Requirements:

  • Compute Capability 6.0+ (Pascal or newer)
  • Cooperative kernel launch
  • Grid size ≤ the number of thread blocks that can be co-resident on the device

Performance: Grid barriers have ~1-10μs latency depending on grid size and device generation. Use sparingly; avoid placing them inside tight loops.
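
The DotCompute runtime handles the launch mode, but at the CUDA level a grid barrier corresponds to cooperative-groups grid synchronization plus a cooperative launch. A minimal CUDA sketch (kernel and variable names are illustrative, not part of the DotCompute API):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Each block produces a partial result, the whole grid synchronizes,
// then block 0 combines the partials.
__global__ void gridSyncExample(float* partials, float* result)
{
    cg::grid_group grid = cg::this_grid();

    if (threadIdx.x == 0)
        partials[blockIdx.x] = 1.0f;        // placeholder per-block work

    grid.sync();                            // grid-wide barrier (Grid scope)

    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float sum = 0.0f;
        for (unsigned i = 0; i < gridDim.x; ++i)
            sum += partials[i];
        *result = sum;
    }
}

// Host side: a grid barrier only works with a cooperative launch, e.g.
//   void* args[] = { &dPartials, &dResult };
//   cudaLaunchCooperativeKernel((void*)gridSyncExample, gridDim, blockDim, args);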

System = 4

Synchronize threads across multiple GPUs and the CPU in a system-wide barrier.

System-wide barriers enable synchronization across multiple GPUs and the host CPU. This is the most complex and slowest barrier type due to PCIe roundtrip latency.

Architecture: System barriers operate in three phases:

  1. Device-Local Phase: Each GPU executes a device-local barrier and system-wide memory fence (__threadfence_system)
  2. Cross-GPU Phase: The CPU waits for completion signals from all GPUs via CUDA events
  3. Resume Phase: The CPU signals all GPUs to continue via mapped host memory
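
A host-side sketch of these phases in CUDA, assuming the caller has, per device, launched a pre-barrier kernel ending in __threadfence_system, recorded an event right after it, and enqueued a post-barrier kernel that polls a flag in mapped pinned memory (all names are illustrative, not the library's actual implementation):

#include <cuda_runtime.h>

// Phase 2 and Phase 3 as seen from the CPU, for `deviceCount` GPUs.
void systemBarrier(int deviceCount, cudaEvent_t* arrived,
                   volatile int* resumeFlag /* pinned, mapped host memory */)
{
    // Phase 2: wait until every device has reached the barrier, i.e. every
    // pre-barrier kernel (Phase 1, device-local barrier + fence) has finished.
    for (int d = 0; d < deviceCount; ++d)
        cudaEventSynchronize(arrived[d]);

    // Phase 3: flip the mapped-memory flag that the post-barrier kernels
    // poll, releasing all devices to continue.
    *resumeFlag = 1;
}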

Use Cases:

  • Multi-GPU global reductions
  • Distributed graph algorithms across GPUs
  • Multi-GPU iterative solvers
  • System-wide checkpoint synchronization

Requirements:

  • Multiple CUDA devices (2-8 GPUs typical)
  • Compute Capability 6.0+ for efficient P2P (Pascal or newer)
  • P2P capability between all device pairs (recommended)
  • Pinned host memory for cross-device signaling

Performance: System barriers have ~1-10ms latency due to PCIe roundtrip and CPU coordination. Use sparingly; typically once per iteration in multi-GPU algorithms.

Limitations:

  • Maximum 8 GPUs typical (PCIe topology limit)
  • Performance degrades with increased GPU count
  • May not work reliably with integrated GPUs
  • Requires careful error handling for multi-device scenarios

ThreadBlock = 0

Synchronize all threads within a single thread block.

This is the most common and efficient barrier scope, mapping directly to __syncthreads() in CUDA. All threads in the block wait until every thread reaches the barrier.

Use Cases:

  • Shared memory synchronization
  • Reduction operations within a block
  • Stencil computations

Constraints: Maximum block size varies by device (typically 1024 threads).
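
In CUDA this scope corresponds directly to __syncthreads(). A typical use is a shared-memory reduction (a minimal sketch; assumes the block size is a power of two and dynamic shared memory sized to the block):

// Block-level sum; each __syncthreads() is a ThreadBlock-scope barrier.
__global__ void blockReduce(const float* input, float* blockSums)
{
    extern __shared__ float shared[];
    unsigned tid = threadIdx.x;

    shared[tid] = input[blockIdx.x * blockDim.x + tid];
    __syncthreads();                       // wait until shared memory is populated

    for (unsigned stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            shared[tid] += shared[tid + stride];
        __syncthreads();                   // barrier between reduction steps
    }

    if (tid == 0)
        blockSums[blockIdx.x] = shared[0];
}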

Tile = 3

Synchronize an arbitrary subset of threads (tile) within a thread block.

Tile barriers enable flexible synchronization patterns where only a subset of threads need to wait. Particularly useful for irregular workloads.

Use Cases:

  • Work-stealing algorithms
  • Dynamic partitioning
  • Hierarchical parallelism

Performance: Tile barriers are more flexible but slightly slower than thread-block barriers (~20ns vs ~10ns).
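
In CUDA terms this corresponds to cooperative-groups tiled partitions. A minimal sketch using 16-thread tiles (the 256-thread block size is an assumption made only to size the shared buffer):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Each block is divided into 16-thread tiles; tile.sync() synchronizes only
// the threads of one tile, not the whole block.
__global__ void tileSyncExample(float* data)
{
    __shared__ float scratch[256];                 // assumes blockDim.x == 256
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<16> tile = cg::tiled_partition<16>(block);

    unsigned rank = block.thread_rank();
    unsigned base = (rank / 16) * 16;              // first slot of this tile
    scratch[rank] = data[blockIdx.x * block.size() + rank];

    tile.sync();   // Tile-scope barrier: only these 16 threads wait

    // Safe to read values written by other threads of the same tile.
    data[blockIdx.x * block.size() + rank] = scratch[base + 15 - tile.thread_rank()];
}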

Warp = 2

Synchronize all threads within a single 32-thread warp (CUDA-specific).

Warp-level synchronization is implicit under lockstep execution, but explicit barriers are required for safe divergent execution, particularly under independent thread scheduling on Volta and newer. Maps to __syncwarp() in CUDA.

Use Cases:

  • Warp-level reductions
  • Ballot/shuffle operations
  • Warp-synchronous programming patterns

Note: Warp size is 32 threads on NVIDIA GPUs. Other vendors may differ.
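
A minimal CUDA sketch of a warp-synchronous reduction using __syncwarp() and shuffle intrinsics (assumes all 32 lanes are active):

// Warp-level sum; __syncwarp() is the Warp-scope barrier and reconverges the
// lanes before the shuffle exchange, which matters under independent thread
// scheduling (Volta and newer).
__device__ float warpReduceSum(float value)
{
    const unsigned mask = 0xffffffffu;     // all 32 lanes participate
    __syncwarp(mask);                      // Warp-scope barrier

    for (int offset = 16; offset > 0; offset >>= 1)
        value += __shfl_down_sync(mask, value, offset);

    return value;                          // lane 0 now holds the warp's sum
}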

Remarks

Barrier scope determines which threads participate in synchronization:

  • ThreadBlock: All threads in the same CUDA thread block
  • Grid: All threads across all blocks in the kernel grid (requires cooperative launch)
  • Warp: All threads in the same 32-thread warp
  • Tile: Arbitrary subset of threads within a block

Performance Characteristics:

  • ThreadBlock: ~10ns latency, hardware support, most common
  • Grid: ~1-10μs latency, requires cooperative launch, CC 6.0+
  • Warp: ~1ns latency, implicit in lockstep execution
  • Tile: ~20ns latency, flexible but slower than block

Hardware Requirements:

  • ThreadBlock: All CUDA devices (CC 1.0+)
  • Grid: Pascal and newer (CC 6.0+)
  • Warp: All CUDA devices (CC 1.0+)
  • Tile: Volta and newer (CC 7.0+) for best performance
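
These requirements can be verified at runtime before selecting a scope. A minimal CUDA host-side check (device index 0 is assumed):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int device = 0;
    int ccMajor = 0, ccMinor = 0, cooperative = 0;

    // Query compute capability and cooperative-launch support.
    cudaDeviceGetAttribute(&ccMajor, cudaDevAttrComputeCapabilityMajor, device);
    cudaDeviceGetAttribute(&ccMinor, cudaDevAttrComputeCapabilityMinor, device);
    cudaDeviceGetAttribute(&cooperative, cudaDevAttrCooperativeLaunch, device);

    std::printf("Compute capability: %d.%d\n", ccMajor, ccMinor);
    std::printf("ThreadBlock / Warp barriers: supported on all CUDA devices\n");
    std::printf("Grid barriers (cooperative launch): %s\n",
                cooperative ? "supported" : "not supported");
    std::printf("Tile barriers (best on CC 7.0+): %s\n",
                ccMajor >= 7 ? "yes" : "no");
    return 0;
}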