Barriers and Memory Ordering
Status: ✅ Production Ready | Backends: CUDA, Metal, OpenCL | Last Updated: January 2025
Complete guide to GPU thread synchronization, barriers, and memory consistency in DotCompute.
Table of Contents
- Overview
- Thread Barriers
- Memory Ordering
- Kernel Attribute Properties
- Backend Support
- Performance Considerations
- Common Patterns
- Debugging
Overview
Modern GPUs execute thousands of threads concurrently, requiring careful synchronization for correctness. DotCompute provides two complementary mechanisms:
- Thread Barriers: Synchronize thread execution within a scope
- Memory Ordering: Control visibility and ordering of memory operations
// Example: Combining barriers with memory ordering
[Kernel(
UseBarriers = true,
BarrierScope = BarrierScope.ThreadBlock,
MemoryConsistency = MemoryConsistencyModel.ReleaseAcquire)]
public static void ProducerConsumer(Span<int> data, Span<int> flags)
{
int tid = Kernel.ThreadId.X;
// Phase 1: Write data
data[tid] = ComputeValue(tid);
Kernel.Barrier(); // Ensure all writes complete
// Phase 2: Set flag (release semantics)
flags[tid] = READY; // READY: an app-defined nonzero constant, e.g. const int READY = 1
Kernel.Barrier();
// Phase 3: Read neighbor's data (acquire semantics)
int neighbor = (tid + 1) % Kernel.BlockDim.X;
while (flags[neighbor] != READY) { } // Spin wait
int neighborData = data[neighbor]; // Guaranteed to see write
data[tid] = ProcessPair(data[tid], neighborData);
}
Thread Barriers
What are Barriers?
Barriers are synchronization points where all threads in a scope wait until every thread reaches that point; the minimal sketch after the list below shows the idea.
Key Concepts:
- Scope: Which threads synchronize together (ThreadBlock, Warp, Grid, etc.)
- Capacity: Number of threads that must reach the barrier
- Latency: Time cost of synchronization (~1-20ns depending on scope)
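In its simplest form, a barrier separates a write phase from a read phase. A minimal sketch using the attribute and `Kernel` helpers from this guide (a 256-thread block is assumed):
[Kernel(UseBarriers = true, BarrierScope = BarrierScope.ThreadBlock)]
public static void BarrierBasics(Span<float> data)
{
    var shared = Kernel.AllocateShared<float>(256);
    int tid = Kernel.ThreadId.X;
    shared[tid] = data[tid];             // Phase 1: each thread writes its own slot
    Kernel.Barrier();                    // All threads in the block wait here
    data[tid] = shared[(tid + 1) % 256]; // Phase 2: the neighbor's write is now visible
}
Without the barrier, a thread could read its neighbor's slot before the neighbor has written it.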
Barrier Scopes
public enum BarrierScope
{
    ThreadBlock = 0, // All threads in a block (~10-20ns)
    Grid = 1,        // All threads across all blocks (~1-10µs, CUDA only)
    Warp = 2,        // All threads in a 32-thread warp (~1-5ns)
    Tile = 3,        // Arbitrary subset of threads (~20ns)
    System = 4       // Multiple GPUs + CPU (~1-10ms)
}
ThreadBlock Barriers (Most Common)
Synchronizes all threads within a single thread block.
[Kernel(UseBarriers = true, BarrierScope = BarrierScope.ThreadBlock)]
public static void MatrixTranspose(
ReadOnlySpan<float> input,
Span<float> output,
int width,
int height)
{
// Allocate shared memory for tile
var tile = Kernel.AllocateShared<float>(32, 32);
int tx = Kernel.ThreadId.X;
int ty = Kernel.ThreadId.Y;
int bx = Kernel.BlockId.X;
int by = Kernel.BlockId.Y;
int x = bx * 32 + tx;
int y = by * 32 + ty;
// Load tile into shared memory
if (x < width && y < height)
{
tile[ty, tx] = input[y * width + x];
}
// Wait for all threads to finish loading
Kernel.Barrier(); // ThreadBlock barrier
// Write transposed tile to global memory
x = by * 32 + tx; // Swap dimensions
y = bx * 32 + ty;
if (x < height && y < width)
{
output[y * height + x] = tile[tx, ty]; // Note: swapped indices
}
}
Backend Mapping:
- CUDA: `__syncthreads()`
- Metal: `threadgroup_barrier(mem_flags::mem_device_and_threadgroup)`
- OpenCL: `barrier(CLK_GLOBAL_MEM_FENCE)`
Warp Barriers (Fine-Grained)
Synchronizes threads within a 32-thread warp for fine-grained operations.
[Kernel(UseBarriers = true, BarrierScope = BarrierScope.Warp)]
public static void WarpReduce(Span<float> values, Span<float> results)
{
int tid = Kernel.ThreadId.X;
int warpId = tid / 32;
int laneId = tid % 32;
float value = values[tid];
// Warp-level reduction using shuffle
for (int offset = 16; offset > 0; offset /= 2)
{
value += Kernel.ShuffleDown(value, offset);
Kernel.Barrier(); // Warp barrier (implicit in shuffle, but explicit for clarity)
}
// Lane 0 writes result
if (laneId == 0)
{
results[warpId] = value;
}
}
Backend Mapping:
- CUDA: `__syncwarp()` or implicit in shuffle operations
- Metal: `simdgroup_barrier(mem_flags::mem_none)`
- OpenCL: `sub_group_barrier()`
Grid Barriers (CUDA Only)
⚠️ Warning: Grid barriers require cooperative kernel launch and are NOT supported on Metal or OpenCL.
[Kernel(
Backends = KernelBackends.CUDA, // CUDA only!
UseBarriers = true,
BarrierScope = BarrierScope.Grid)]
public static void GlobalHistogram(
ReadOnlySpan<int> data,
Span<int> histogram,
Span<int> blockCounts)
{
int tid = Kernel.ThreadId.X;
int bid = Kernel.BlockId.X;
// Phase 1: Local histogram in each block
int value = data[tid];
Kernel.AtomicAdd(ref blockCounts[bid * 256 + value], 1);
// Grid barrier - wait for all blocks to finish local histograms
Kernel.GridBarrier(); // CUDA cooperative launch required
// Phase 2: Aggregate block histograms into global histogram
if (bid == 0) // Block 0 aggregates
{
int sum = 0;
for (int b = 0; b < Kernel.GridDim.X; b++)
{
sum += blockCounts[b * 256 + tid];
}
histogram[tid] = sum;
}
}
Backend Mapping:
- CUDA: `cooperative_groups::grid_group::sync()`
- Metal: ❌ Not supported (use multiple kernel dispatches instead)
- OpenCL: ❌ Not supported
Metal Alternative (Multiple Kernel Dispatches):
// Kernel 1: Local histograms
[Kernel(Backends = KernelBackends.Metal)]
public static void LocalHistogram(
ReadOnlySpan<int> data,
Span<int> blockCounts)
{
// Block-local histogram
}
// CPU-side synchronization
await accelerator.ExecuteKernelAsync(LocalHistogram, ...);
await accelerator.SynchronizeAsync(); // Implicit barrier
// Kernel 2: Aggregate
[Kernel(Backends = KernelBackends.Metal)]
public static void AggregateHistogram(
ReadOnlySpan<int> blockCounts,
Span<int> histogram)
{
// Aggregate block counts
}
Barrier Capacity
Specify the expected number of threads participating in synchronization:
[Kernel(
UseBarriers = true,
BarrierScope = BarrierScope.ThreadBlock,
BarrierCapacity = 256)] // 256 threads per block
public static void FixedSizeKernel(Span<float> data)
{
// Runtime validates that actual block size ≤ 256
Kernel.Barrier();
}
// Or let runtime calculate automatically (capacity = 0)
[Kernel(UseBarriers = true, BarrierCapacity = 0)] // Automatic
public static void FlexibleKernel(Span<float> data)
{
// Capacity calculated from BlockDimensions at launch time
Kernel.Barrier();
}
Memory Ordering
Why Memory Ordering Matters
GPUs use relaxed memory models by default for performance. Without explicit ordering:
// Thread 0 (Producer)
data[0] = 42; // Write 1
flag[0] = READY; // Write 2
// Thread 1 (Consumer)
while (flag[0] != READY) { } // Read flag
int value = data[0]; // Read data - MAY SEE OLD VALUE!
Problem: Thread 1 might see flag[0] == READY but still read stale data[0] due to reordering!
Solution: Memory consistency models ensure proper ordering.
Memory Consistency Models
public enum MemoryConsistencyModel
{
    Relaxed = 0,        // No ordering (1.0× performance)
    ReleaseAcquire = 1, // Causal ordering (0.85× performance)
    Sequential = 2      // Total order (0.60× performance)
}
Relaxed (Default)
No ordering guarantees. Fastest but requires manual fencing.
[Kernel(MemoryConsistency = MemoryConsistencyModel.Relaxed)]
public static void DataParallel(Span<float> data)
{
int idx = Kernel.ThreadId.X;
// No inter-thread communication - no ordering needed
data[idx] = MathF.Sqrt(data[idx]);
}
Use When:
- ✅ Purely data-parallel algorithms
- ✅ No inter-thread communication
- ✅ Each thread operates independently
Release-Acquire (Recommended for Message Passing)
Ensures causality: if A writes then B reads, B sees A's prior writes.
[Kernel(
MemoryConsistency = MemoryConsistencyModel.ReleaseAcquire,
UseBarriers = true)]
public static void MessagePassing(Span<int> messages, Span<int> flags)
{
int tid = Kernel.ThreadId.X;
int neighbor = (tid + 1) % Kernel.BlockDim.X;
// Producer: Write message
messages[tid] = ComputeMessage(tid);
// Implicit release: message write visible before flag
flags[tid] = READY;
Kernel.Barrier();
// Consumer: Wait for neighbor
while (flags[neighbor] != READY) { }
// Implicit acquire: see all writes before flag set
int msg = messages[neighbor]; // Guaranteed to see message write
ProcessMessage(msg);
}
Use When:
- ✅ Producer-consumer patterns
- ✅ Message passing between threads
- ✅ Distributed data structures
- ✅ Ring kernels with message queues
Performance: 15% overhead from fence insertion, acceptable for most workloads.
Sequential (Strongest, Highest Overhead)
Total order across all threads. Use sparingly.
[Kernel(MemoryConsistency = MemoryConsistencyModel.Sequential)]
public static void StrictOrdering(Span<int> data)
{
// All memory operations see total order
// 40% performance overhead!
}
Use When:
- ✅ Debugging race conditions
- ✅ Algorithms requiring global order (rare)
Avoid: Start with Relaxed or ReleaseAcquire first.
Causal Ordering (Convenience Property)
Shorthand for enabling Release-Acquire:
[Kernel(EnableCausalOrdering = true)] // Equivalent to MemoryConsistency = ReleaseAcquire
public static void SafeMessaging(Span<int> data, Span<int> flags)
{
// Release-acquire semantics automatically applied
}
Kernel Attribute Properties
Complete reference for barrier and memory ordering properties:
[Kernel(
// Barrier Configuration
UseBarriers = true, // Enable barriers
BarrierScope = BarrierScope.ThreadBlock, // Synchronization scope
BarrierCapacity = 256, // Expected thread count (0 = auto)
// Memory Ordering Configuration
MemoryConsistency = MemoryConsistencyModel.ReleaseAcquire, // Consistency model
EnableCausalOrdering = true, // Shorthand for ReleaseAcquire
// Other Properties
Backends = KernelBackends.CUDA | KernelBackends.Metal,
BlockDimensions = new int[] { 16, 16 },
GridDimensions = new int[] { 64, 64 })]
public static void FullyConfigured(Span<float> data)
{
// Kernel implementation
}
Property Defaults
| Property | Default | Ring Kernel Default |
|---|---|---|
| `UseBarriers` | `false` | `false` |
| `BarrierScope` | `ThreadBlock` | `ThreadBlock` |
| `BarrierCapacity` | `0` (auto) | `0` (auto) |
| `MemoryConsistency` | `Relaxed` | `ReleaseAcquire` † |
| `EnableCausalOrdering` | `false` | `true` † |
† Ring kernels default to safer settings because of their message-passing nature.
Backend Support
Feature Matrix
| Feature | CUDA | Metal | OpenCL | CPU |
|---|---|---|---|---|
| ThreadBlock Barriers | ✅ Full | ✅ Full | ✅ Full | ✅ Emulated |
| Warp Barriers | ✅ Native | ✅ Simdgroup | ✅ Subgroup | ❌ N/A |
| Grid Barriers | ✅ Cooperative | ❌ Not Supported | ❌ No | ❌ No |
| Tile Barriers | ✅ CC 7.0+ | ❌ No | ❌ No | ❌ No |
| System Barriers | ✅ Multi-GPU | ❌ No | ❌ No | ✅ Emulated |
| Relaxed Model | ✅ | ✅ | ✅ | ✅ |
| ReleaseAcquire | ✅ | ✅ | ✅ | ✅ |
| Sequential | ✅ | ✅ | ✅ | ✅ |
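When a kernel depends on a capability only some backends provide (such as grid barriers), it helps to guard the dispatch on the host. A minimal sketch, assuming a hypothetical `accelerator.BackendType` property (kernel arguments elided, as in the dispatch examples above); the fallback mirrors the Metal multi-dispatch alternative shown earlier:
// BackendType is an assumed property name; adapt to your accelerator API
if (accelerator.BackendType == KernelBackends.CUDA)
{
    // Grid barriers available: single cooperative dispatch
    await accelerator.ExecuteKernelAsync(GlobalHistogram, ...);
}
else
{
    // Fall back to two dispatches with host-side synchronization between them
    await accelerator.ExecuteKernelAsync(LocalHistogram, ...);
    await accelerator.SynchronizeAsync(); // Acts as a grid-wide barrier
    await accelerator.ExecuteKernelAsync(AggregateHistogram, ...);
}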
CUDA Implementation
// ThreadBlock barrier
__syncthreads();
// Warp barrier
__syncwarp(0xffffffff);
// Grid barrier (cooperative launch required)
cooperative_groups::grid_group g = cooperative_groups::this_grid();
g.sync();
// Memory fences
__threadfence(); // Device scope
__threadfence_block(); // Block scope
__threadfence_system(); // System scope
Metal Implementation
// ThreadBlock barrier
threadgroup_barrier(mem_flags::mem_device_and_threadgroup);
// Simdgroup (warp) barrier
simdgroup_barrier(mem_flags::mem_none);
// Memory fences
threadgroup_barrier(mem_flags::mem_device); // Device fence
threadgroup_barrier(mem_flags::mem_threadgroup); // Threadgroup fence
threadgroup_barrier(mem_flags::mem_texture); // Texture fence
Performance Considerations
Barrier Latency
| Barrier Scope | Typical Latency | When to Use |
|---|---|---|
| Warp | ~1-5ns | Fine-grained sync, shuffle operations |
| ThreadBlock | ~10-20ns | Shared memory sync, most common |
| Grid | ~1-10µs | Multi-block coordination (CUDA only) |
| System | ~1-10ms | Multi-GPU algorithms (rare) |
Memory Consistency Overhead
| Model | Performance Multiplier | Overhead | When to Use |
|---|---|---|---|
| Relaxed | 1.0× | 0% | Data-parallel, no communication |
| ReleaseAcquire | 0.85× | 15% | Message passing, recommended |
| Sequential | 0.60× | 40% | Debugging only |
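These multipliers vary by hardware and workload, so measure on your own device. A minimal host-side sketch, assuming two variants of the same kernel (hypothetical names `RelaxedVariant` and `SequentialVariant`) that differ only in their `MemoryConsistency` setting, with kernel arguments elided:
var sw = System.Diagnostics.Stopwatch.StartNew();
await accelerator.ExecuteKernelAsync(RelaxedVariant, ...);
await accelerator.SynchronizeAsync(); // Wait for GPU work to finish before reading the clock
var relaxedTime = sw.Elapsed;
sw.Restart();
await accelerator.ExecuteKernelAsync(SequentialVariant, ...);
await accelerator.SynchronizeAsync();
var sequentialTime = sw.Elapsed;
Console.WriteLine($"Sequential/Relaxed: {sequentialTime.TotalMilliseconds / relaxedTime.TotalMilliseconds:F2}x");
Run several warm-up iterations first; the initial dispatch includes kernel compilation and caching costs.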
Best Practices
✅ DO:
- Use narrowest barrier scope needed (Warp < ThreadBlock < Grid)
- Prefer ReleaseAcquire over Sequential
- Add barriers only where necessary
- Profile before and after adding barriers
- Use shared memory to reduce global memory traffic
❌ DON'T:
- Add barriers "just in case"
- Use Grid barriers on Metal (not supported)
- Use Sequential consistency in production (40% overhead)
- Overuse System-wide barriers (1-10ms each)
- Add barriers inside tight loops (see the sketch below)
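For the last point, a minimal kernel fragment (with `tid`, `steps`, `input`, and `shared` assumed as in the examples above): when loop iterations touch only thread-local data, one barrier after the loop is enough.
// ❌ Pays barrier latency on every iteration
float acc = 0.0f;
for (int i = 0; i < steps; i++)
{
    acc += input[tid * steps + i]; // Thread-local work only
    Kernel.Barrier();              // Unnecessary: nothing cross-thread is read yet
}
shared[tid] = acc;
// ✅ Hoisted: synchronize once, just before any cross-thread read
acc = 0.0f;
for (int i = 0; i < steps; i++)
{
    acc += input[tid * steps + i];
}
shared[tid] = acc;
Kernel.Barrier(); // Single synchronization point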
Example: Optimized Reduction
[Kernel(
UseBarriers = true,
BarrierScope = BarrierScope.ThreadBlock,
MemoryConsistency = MemoryConsistencyModel.Relaxed)] // Barriers already order the shared-memory traffic; no global-memory handshakes
public static void OptimizedReduce(
ReadOnlySpan<float> input,
Span<float> output,
int n)
{
var shared = Kernel.AllocateShared<float>(256);
int tid = Kernel.ThreadId.X;
int i = Kernel.BlockId.X * Kernel.BlockDim.X + tid;
// Load to shared memory
shared[tid] = (i < n) ? input[i] : 0.0f;
Kernel.Barrier(); // Necessary: sync after load
// Reduction in shared memory
for (int s = Kernel.BlockDim.X / 2; s > 0; s /= 2)
{
if (tid < s)
{
shared[tid] += shared[tid + s];
}
Kernel.Barrier(); // Necessary: sync after each reduction step
}
// Write result
if (tid == 0)
{
output[Kernel.BlockId.X] = shared[0];
}
}
Barrier Count: 1 + log2(blockSize) barriers
- 1 after load
- log2(blockSize) in reduction loop
- Total: 9 barriers for a 256-thread block
- Overhead: ~100-200ns total (acceptable)
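OptimizedReduce leaves one partial sum per block in output. A second dispatch can combine them; a minimal sketch, assuming the partial sums fit in a single 256-thread block (launch it after the first pass completes, as in the Metal multi-dispatch example):
[Kernel(UseBarriers = true, BarrierScope = BarrierScope.ThreadBlock)]
public static void FinalReduce(
    ReadOnlySpan<float> partialSums,
    Span<float> result,
    int numBlocks)
{
    var shared = Kernel.AllocateShared<float>(256);
    int tid = Kernel.ThreadId.X;
    // Load one partial sum per thread; pad the rest with zeros
    shared[tid] = (tid < numBlocks) ? partialSums[tid] : 0.0f;
    Kernel.Barrier();
    // Same tree reduction as the first pass
    for (int s = Kernel.BlockDim.X / 2; s > 0; s /= 2)
    {
        if (tid < s)
        {
            shared[tid] += shared[tid + s];
        }
        Kernel.Barrier();
    }
    if (tid == 0)
    {
        result[0] = shared[0];
    }
}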
Common Patterns
Pattern 1: Shared Memory Communication
[Kernel(UseBarriers = true)]
public static void SharedMemPattern(Span<float> data)
{
var shared = Kernel.AllocateShared<float>(256);
int tid = Kernel.ThreadId.X;
// Phase 1: Load
shared[tid] = data[tid];
Kernel.Barrier(); // Wait for all loads
// Phase 2: Process (read from shared)
float result = shared[tid] + shared[(tid + 1) % 256];
Kernel.Barrier(); // Wait for all reads
// Phase 3: Store
data[tid] = result;
}
Pattern 2: Producer-Consumer
[Kernel(
UseBarriers = true,
EnableCausalOrdering = true)] // Ensures flag/data visibility
public static void ProducerConsumer(
Span<int> data,
Span<int> flags)
{
int tid = Kernel.ThreadId.X;
int partner = (tid + Kernel.BlockDim.X / 2) % Kernel.BlockDim.X;
// Produce
data[tid] = ComputeValue(tid);
flags[tid] = READY;
Kernel.Barrier();
// Consume
while (flags[partner] != READY) { } // Spin wait
int partnerData = data[partner]; // Guaranteed to see write
data[tid] = Combine(data[tid], partnerData);
}
Pattern 3: Multi-Phase Algorithm
[Kernel(UseBarriers = true)]
public static void MultiPhase(Span<float> data)
{
var temp = Kernel.AllocateShared<float>(256);
int tid = Kernel.ThreadId.X;
// Phase 1: Forward pass
temp[tid] = data[tid];
Kernel.Barrier();
// Phase 2: Process neighbors
float left = temp[(tid - 1 + 256) % 256];
float right = temp[(tid + 1) % 256];
temp[tid] = (left + temp[tid] + right) / 3.0f;
Kernel.Barrier();
// Phase 3: Write back
data[tid] = temp[tid];
}
Pattern 4: Warp-Level Primitives
[Kernel(UseBarriers = true, BarrierScope = BarrierScope.Warp)]
public static void WarpScan(Span<int> data)
{
int tid = Kernel.ThreadId.X;
int laneId = tid % 32;
int value = data[tid];
// Inclusive scan using shuffle
for (int offset = 1; offset < 32; offset *= 2)
{
int temp = Kernel.ShuffleUp(value, offset);
if (laneId >= offset)
value += temp;
Kernel.Barrier(); // Warp sync
}
data[tid] = value;
}
Debugging
Detecting Race Conditions
Use Sequential consistency to debug suspected race conditions:
// Development/Debugging
[Kernel(MemoryConsistency = MemoryConsistencyModel.Sequential)]
public static void DebugKernel(Span<int> data)
{
// If this fixes the bug, you have a race condition
// Then fix with proper barriers and switch back to ReleaseAcquire
}
Cross-Backend Validation
// Test with both relaxed and strict ordering (RunKernelAsync: a test-harness helper that launches the kernel under the given model)
var resultRelaxed = await RunKernelAsync(
MemoryConsistencyModel.Relaxed);
var resultSequential = await RunKernelAsync(
MemoryConsistencyModel.Sequential);
if (!resultRelaxed.Equals(resultSequential))
{
Console.WriteLine("Race condition detected!");
// Add barriers or upgrade consistency model
}
Common Issues
Issue 1: Missing Barrier After Shared Memory Write
// ❌ BUG: Reading before all writes complete
shared[tid] = data[tid];
float value = shared[(tid + 1) % 256]; // BUG: May read stale data!
// ✅ FIX: Add barrier
shared[tid] = data[tid];
Kernel.Barrier(); // Wait for all writes
float value = shared[(tid + 1) % 256]; // OK: Sees all writes
Issue 2: Conditional Barrier (Deadlock)
// ❌ BUG: Not all threads reach barrier
if (tid < 128)
{
    Kernel.Barrier(); // Deadlock! Threads 128-255 never reach it
}
// ✅ FIX: All threads must reach the barrier
Kernel.Barrier(); // All threads participate
if (tid < 128)
{
    // Process
}
Issue 3: Grid Barrier on Metal
// ❌ BUG: Grid barriers not supported on Metal
[Kernel(
    Backends = KernelBackends.Metal,
    BarrierScope = BarrierScope.Grid)] // Runtime error!
public static void MetalGridBarrier() { }
// ✅ FIX: Use multiple kernel dispatches
[Kernel(Backends = KernelBackends.Metal)]
public static void Phase1() { }
// CPU synchronization between dispatches
await accelerator.SynchronizeAsync();
[Kernel(Backends = KernelBackends.Metal)]
public static void Phase2() { }
See Also
- Kernel Attribute Reference - Complete attribute documentation
- Ring Kernels - Message-passing patterns
- Performance Guide - Optimization strategies
- CUDA Programming - CUDA-specific features
- Metal Shading - Metal-specific features
Next: Explore ring kernel patterns for advanced message-passing algorithms.