Enum BarrierScope
Namespace: DotCompute.Abstractions.Barriers
Assembly: DotCompute.Abstractions.dll
Defines the synchronization scope for GPU thread barriers.
public enum BarrierScope
Fields
Grid = 1
Synchronize all threads across all blocks in the entire kernel grid.
Grid-wide barriers enable global synchronization for algorithms requiring inter-block communication. Requires cooperative kernel launch via cudaLaunchCooperativeKernel.
Use Cases:
- Global reductions to single value
- Iterative algorithms across entire dataset
- Multi-step computations requiring global state
Requirements:
- Compute Capability 6.0+ (Pascal or newer)
- Cooperative kernel launch
- Grid size ≤ maximum concurrent kernel limit
Performance: Grid barriers have ~1-10μs latency depending on grid size and device generation. Use sparingly in tight loops.
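In CUDA, a grid-scope barrier is typically expressed through the cooperative groups API. A minimal sketch (kernel name, sizes, and the two phases are illustrative, not part of the DotCompute API):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Two-phase kernel: all blocks must finish phase 1 before any block
// starts phase 2, so a grid-wide barrier is required between them.
__global__ void twoPhaseKernel(float* data, int n)
{
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) data[i] *= 2.0f;    // phase 1: per-element work

    grid.sync();                   // BarrierScope.Grid

    if (i < n) data[i] += data[0]; // phase 2: reads a phase-1 result
}

// Host side: must use cudaLaunchCooperativeKernel, not <<<...>>>:
// void* args[] = { &devData, &n };
// cudaLaunchCooperativeKernel((void*)twoPhaseKernel,
//                             gridDim, blockDim, args, 0, 0);
```

grid.sync() traps at runtime if the kernel was not launched cooperatively, which is why the launch-API requirement above is mandatory rather than advisory.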
System = 4
Synchronize threads across multiple GPUs and the CPU in a system-wide barrier.
System-wide barriers enable synchronization across multiple GPUs and the host CPU. This is the most complex and slowest barrier type due to PCIe roundtrip latency.
Architecture: System barriers operate in three phases:
- Device-Local Phase: Each GPU executes device-local barrier (__threadfence_system)
- Cross-GPU Phase: CPU waits for all GPU events via CUDA events
- Resume Phase: CPU signals all GPUs to continue via mapped memory
Use Cases:
- Multi-GPU global reductions
- Distributed graph algorithms across GPUs
- Multi-GPU iterative solvers
- System-wide checkpoint synchronization
Requirements:
- Multiple CUDA devices (2-8 GPUs typical)
- Compute Capability 6.0+ for efficient P2P (Pascal or newer)
- P2P capability between all device pairs (recommended)
- Pinned host memory for cross-device signaling
Performance: System barriers have ~1-10ms latency due to PCIe roundtrip and CPU coordination. Use sparingly - typically once per iteration in multi-GPU algorithms.
Limitations:
- Maximum 8 GPUs typical (PCIe topology limit)
- Performance degrades with increased GPU count
- May not work reliably with integrated GPUs
- Requires careful error handling for multi-device scenarios
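The three-phase flow above can be sketched from the host side. This is a hedged illustration for two GPUs, not the DotCompute implementation; `resumeFlag` and the event names are hypothetical, and the in-kernel spin on the flag is assumed to exist in the launched kernels:

```cuda
// Assumes each device has launched a kernel that executes
// __threadfence_system() (device-local phase) and then spins on
// *resumeFlag, which lives in mapped pinned host memory
// (cudaHostAlloc with cudaHostAllocMapped).
cudaEvent_t done[2];
volatile int* resumeFlag; // visible to the CPU and both GPUs

for (int dev = 0; dev < 2; ++dev) {
    cudaSetDevice(dev);
    cudaEventCreate(&done[dev]);
    cudaEventRecord(done[dev], 0);     // marks end of device-local phase
}

// Cross-GPU phase: CPU waits until every device reaches the barrier.
for (int dev = 0; dev < 2; ++dev) {
    cudaSetDevice(dev);
    cudaEventSynchronize(done[dev]);
}

// Resume phase: CPU releases all GPUs through mapped memory.
*resumeFlag = 1;
```

The PCIe roundtrips hidden in cudaEventSynchronize and the mapped-memory write are what dominate the ~1-10ms latency quoted above.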
ThreadBlock = 0
Synchronize all threads within a single thread block.
This is the most common and efficient barrier scope, mapping directly to __syncthreads() in CUDA. All threads in the block wait until every thread reaches the barrier.
Use Cases:
- Shared memory synchronization
- Reduction operations within a block
- Stencil computations
Constraints: Maximum block size varies by device (typically 1024 threads).
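The shared-memory reduction use case maps directly onto __syncthreads(). A minimal sketch (kernel and buffer names are illustrative; the block size is assumed to be a power of two):

```cuda
// Each block sums blockDim.x inputs in shared memory and writes one
// partial sum. Every halving step must be separated by a block barrier
// so no thread reads a slot before its writer has finished.
__global__ void blockReduce(const float* in, float* out)
{
    extern __shared__ float sdata[];
    unsigned tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads(); // BarrierScope.ThreadBlock: all loads visible

    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads(); // barrier after every halving step
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}
```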
Tile = 3
Synchronize an arbitrary subset of threads (tile) within a thread block.
Tile barriers enable flexible synchronization patterns where only a subset of threads need to wait. Particularly useful for irregular workloads.
Use Cases:
- Work-stealing algorithms
- Dynamic partitioning
- Hierarchical parallelism
Performance: Tile barriers are more flexible but slightly slower than thread-block barriers (~20ns vs ~10ns).
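In CUDA, tile-scope synchronization is expressed with cooperative groups tiled partitions. A minimal sketch (the tile size of 16 and the kernel name are illustrative):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Split the block into 16-thread tiles; each tile synchronizes
// independently, so one tile never stalls waiting on another.
__global__ void tileExample(int* data)
{
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<16> tile = cg::tiled_partition<16>(block);

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = tile.thread_rank(); // rank 0..15 within the tile
    tile.sync();                  // BarrierScope.Tile: only 16 threads wait
}
```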
Warp = 2
Synchronize all threads within a single 32-thread warp (CUDA-specific).
Warp-level synchronization is implicit in lockstep execution, but explicit barriers enable safe divergent execution patterns. Maps to __syncwarp().
Use Cases:
- Warp-level reductions
- Ballot/shuffle operations
- Warp-synchronous programming patterns
Note: Warp size is 32 threads on NVIDIA GPUs. Other vendors may differ.
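A warp-level reduction combining __syncwarp() with shuffles, as in the use cases above; a sketch (the helper name is illustrative):

```cuda
// Sum a value across the active lanes of a warp. The explicit
// __syncwarp() reconverges lanes that may have diverged earlier,
// which Volta's independent thread scheduling no longer guarantees.
__device__ float warpSum(float v)
{
    unsigned mask = __activemask();
    __syncwarp(mask); // BarrierScope.Warp: reconverge active lanes
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(mask, v, offset);
    return v; // lane 0 holds the warp total
}
```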
Remarks
Barrier scope determines which threads participate in synchronization:
- ThreadBlock: All threads in the same CUDA thread block
- Grid: All threads across all blocks in the kernel grid (requires cooperative launch)
- Warp: All threads in the same 32-thread warp
- Tile: Arbitrary subset of threads within a block
Performance Characteristics:
| Scope | Characteristics |
|---|---|
| ThreadBlock | ~10ns latency, hardware support, most common |
| Grid | ~1-10μs latency, requires cooperative launch, CC 6.0+ |
| Warp | ~1ns latency, implicit in lockstep execution |
| Tile | ~20ns latency, flexible but slower than block |
Hardware Requirements:
- ThreadBlock: All CUDA devices (CC 1.0+)
- Grid: Pascal and newer (CC 6.0+)
- Warp: All CUDA devices (CC 1.0+)
- Tile: Volta and newer (CC 7.0+) for best performance