Enum BarrierScope

Namespace: DotCompute.Abstractions.Barriers
Assembly: DotCompute.Abstractions.dll

Defines the synchronization scope for GPU thread barriers.

public enum BarrierScope

Fields

Grid = 1

Synchronize all threads across all blocks in the entire kernel grid.

Grid-wide barriers enable global synchronization for algorithms requiring inter-block communication. Requires cooperative kernel launch via cudaLaunchCooperativeKernel.

Use Cases:

  • Global reductions to single value
  • Iterative algorithms across entire dataset
  • Multi-step computations requiring global state

Requirements:

  • Compute Capability 6.0+ (Pascal or newer)
  • Cooperative kernel launch
  • Grid size ≤ the number of thread blocks that can be co-resident on the device

Performance: Grid barriers have ~1-10μs latency depending on grid size and device generation. Use sparingly; avoid placing them inside tight loops.
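
The DotCompute runtime handles the launch mode, but at the CUDA level a grid barrier corresponds to cooperative-groups grid synchronization plus a cooperative launch. A minimal CUDA sketch (kernel and variable names are illustrative, not part of the DotCompute API):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Each block produces a partial result, the whole grid synchronizes,
// then block 0 combines the partials.
__global__ void gridSyncExample(float* partials, float* result)
{
    cg::grid_group grid = cg::this_grid();

    if (threadIdx.x == 0)
        partials[blockIdx.x] = 1.0f;        // placeholder per-block work

    grid.sync();                            // grid-wide barrier (Grid scope)

    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float sum = 0.0f;
        for (unsigned i = 0; i < gridDim.x; ++i)
            sum += partials[i];
        *result = sum;
    }
}

// Host side: a grid barrier only works with a cooperative launch, e.g.
//   void* args[] = { &dPartials, &dResult };
//   cudaLaunchCooperativeKernel((void*)gridSyncExample, gridDim, blockDim, args);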

System = 4

Synchronize threads across multiple GPUs and the CPU in a system-wide barrier.

System-wide barriers enable synchronization across multiple GPUs and the host CPU. This is the most complex and slowest barrier type due to PCIe roundtrip latency.

Architecture: System barriers operate in three phases:

  1. Device-Local Phase: Each GPU executes a device-local barrier and system-wide memory fence (__threadfence_system)
  2. Cross-GPU Phase: The CPU waits for completion signals from all GPUs via CUDA events
  3. Resume Phase: The CPU signals all GPUs to continue via mapped host memory
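
A host-side sketch of these phases in CUDA, assuming the caller has, per device, launched a pre-barrier kernel ending in __threadfence_system, recorded an event right after it, and enqueued a post-barrier kernel that polls a flag in mapped pinned memory (all names are illustrative, not the library's actual implementation):

#include <cuda_runtime.h>

// Phase 2 and Phase 3 as seen from the CPU, for `deviceCount` GPUs.
void systemBarrier(int deviceCount, cudaEvent_t* arrived,
                   volatile int* resumeFlag /* pinned, mapped host memory */)
{
    // Phase 2: wait until every device has reached the barrier, i.e. every
    // pre-barrier kernel (Phase 1, device-local barrier + fence) has finished.
    for (int d = 0; d < deviceCount; ++d)
        cudaEventSynchronize(arrived[d]);

    // Phase 3: flip the mapped-memory flag that the post-barrier kernels
    // poll, releasing all devices to continue.
    *resumeFlag = 1;
}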

Use Cases:

  • Multi-GPU global reductions
  • Distributed graph algorithms across GPUs
  • Multi-GPU iterative solvers
  • System-wide checkpoint synchronization

Requirements:

  • Multiple CUDA devices (2-8 GPUs typical)
  • Compute Capability 6.0+ for efficient P2P (Pascal or newer)
  • P2P capability between all device pairs (recommended)
  • Pinned host memory for cross-device signaling

Performance: System barriers have ~1-10ms latency due to PCIe roundtrip and CPU coordination. Use sparingly; typically once per iteration in multi-GPU algorithms.

Limitations:

  • Maximum 8 GPUs typical (PCIe topology limit)
  • Performance degrades with increased GPU count
  • May not work reliably with integrated GPUs
  • Requires careful error handling for multi-device scenarios

ThreadBlock = 0

Synchronize all threads within a single thread block.

This is the most common and efficient barrier scope, mapping directly to __syncthreads() in CUDA. All threads in the block wait until every thread reaches the barrier.

Use Cases:

  • Shared memory synchronization
  • Reduction operations within a block
  • Stencil computations

Constraints: Maximum block size varies by device (typically 1024 threads).
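
In CUDA this scope corresponds directly to __syncthreads(). A typical use is a shared-memory reduction (a minimal sketch; assumes the block size is a power of two and dynamic shared memory sized to the block):

// Block-level sum; each __syncthreads() is a ThreadBlock-scope barrier.
__global__ void blockReduce(const float* input, float* blockSums)
{
    extern __shared__ float shared[];
    unsigned tid = threadIdx.x;

    shared[tid] = input[blockIdx.x * blockDim.x + tid];
    __syncthreads();                       // wait until shared memory is populated

    for (unsigned stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            shared[tid] += shared[tid + stride];
        __syncthreads();                   // barrier between reduction steps
    }

    if (tid == 0)
        blockSums[blockIdx.x] = shared[0];
}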

Tile = 3

Synchronize an arbitrary subset of threads (tile) within a thread block.

Tile barriers enable flexible synchronization patterns where only a subset of threads need to wait. Particularly useful for irregular workloads.

Use Cases:

  • Work-stealing algorithms
  • Dynamic partitioning
  • Hierarchical parallelism

Performance: Tile barriers are more flexible but slightly slower than thread-block barriers (~20ns vs ~10ns).
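
In CUDA terms this corresponds to cooperative-groups tiled partitions. A minimal sketch using 16-thread tiles (the 256-thread block size is an assumption made only to size the shared buffer):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Each block is divided into 16-thread tiles; tile.sync() synchronizes only
// the threads of one tile, not the whole block.
__global__ void tileSyncExample(float* data)
{
    __shared__ float scratch[256];                 // assumes blockDim.x == 256
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<16> tile = cg::tiled_partition<16>(block);

    unsigned rank = block.thread_rank();
    unsigned base = (rank / 16) * 16;              // first slot of this tile
    scratch[rank] = data[blockIdx.x * block.size() + rank];

    tile.sync();   // Tile-scope barrier: only these 16 threads wait

    // Safe to read values written by other threads of the same tile.
    data[blockIdx.x * block.size() + rank] = scratch[base + 15 - tile.thread_rank()];
}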

Warp = 2

Synchronize all threads within a single 32-thread warp (CUDA-specific).

Warp-level synchronization is implicit under lockstep execution, but explicit barriers are required for safe divergent execution, particularly under independent thread scheduling on Volta and newer. Maps to __syncwarp() in CUDA.

Use Cases:

  • Warp-level reductions
  • Ballot/shuffle operations
  • Warp-synchronous programming patterns

Note: Warp size is 32 threads on NVIDIA GPUs. Other vendors may differ.
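
A minimal CUDA sketch of a warp-synchronous reduction using __syncwarp() and shuffle intrinsics (assumes all 32 lanes are active):

// Warp-level sum; __syncwarp() is the Warp-scope barrier and reconverges the
// lanes before the shuffle exchange, which matters under independent thread
// scheduling (Volta and newer).
__device__ float warpReduceSum(float value)
{
    const unsigned mask = 0xffffffffu;     // all 32 lanes participate
    __syncwarp(mask);                      // Warp-scope barrier

    for (int offset = 16; offset > 0; offset >>= 1)
        value += __shfl_down_sync(mask, value, offset);

    return value;                          // lane 0 now holds the warp's sum
}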

Remarks

Barrier scope determines which threads participate in synchronization:

  • ThreadBlock: All threads in the same CUDA thread block
  • Grid: All threads across all blocks in the kernel grid (requires cooperative launch)
  • Warp: All threads in the same 32-thread warp
  • Tile: Arbitrary subset of threads within a block

Performance Characteristics:

  • ThreadBlock: ~10ns latency, hardware support, most common
  • Grid: ~1-10μs latency, requires cooperative launch, CC 6.0+
  • Warp: ~1ns latency, implicit in lockstep execution
  • Tile: ~20ns latency, flexible but slower than block

Hardware Requirements:

  • ThreadBlock: All CUDA devices (CC 1.0+)
  • Grid: Pascal and newer (CC 6.0+)
  • Warp: All CUDA devices (CC 1.0+)
  • Tile: Volta and newer (CC 7.0+) for best performance
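
These requirements can be verified at runtime before selecting a scope. A minimal CUDA host-side check (device index 0 is assumed):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int device = 0;
    int ccMajor = 0, ccMinor = 0, cooperative = 0;

    // Query compute capability and cooperative-launch support.
    cudaDeviceGetAttribute(&ccMajor, cudaDevAttrComputeCapabilityMajor, device);
    cudaDeviceGetAttribute(&ccMinor, cudaDevAttrComputeCapabilityMinor, device);
    cudaDeviceGetAttribute(&cooperative, cudaDevAttrCooperativeLaunch, device);

    std::printf("Compute capability: %d.%d\n", ccMajor, ccMinor);
    std::printf("ThreadBlock / Warp barriers: supported on all CUDA devices\n");
    std::printf("Grid barriers (cooperative launch): %s\n",
                cooperative ? "supported" : "not supported");
    std::printf("Tile barriers (best on CC 7.0+): %s\n",
                ccMajor >= 7 ? "yes" : "no");
    return 0;
}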