Introduction to Ring Kernels
Ring Kernels are a revolutionary programming model in DotCompute that enables persistent GPU-resident computation with actor-style message passing. Unlike traditional kernels that launch, execute, and terminate for each invocation, Ring Kernels remain resident on the GPU, processing messages continuously with near-zero launch overhead.
What Are Ring Kernels?
Ring Kernels implement the persistent kernel pattern, where GPU compute units remain active in a processing loop, consuming messages from lock-free queues and producing results asynchronously. This enables entirely new programming paradigms on GPUs:
Traditional Kernel Model
Host → Launch Kernel → GPU Executes → Kernel Terminates → Host
(5-50μs overhead per launch)
Ring Kernel Model
Host → Launch Once → GPU Stays Resident → Process Messages Continuously
(0μs launch overhead after initial launch)
Key Concepts
1. Persistent Execution
Ring Kernels run in an infinite loop on the GPU, waiting for and processing messages as they arrive. The kernel lifecycle:
Launch → Activate → [Process Messages] → Deactivate → Terminate
             ↑                                │
             └────────────────────────────────┘
(a deactivated kernel can be re-activated without relaunching)
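Conceptually, the device side of a Ring Kernel is a polling loop. The sketch below is illustrative pseudo-C#: the control flags, the queues, and the Process method are placeholders rather than DotCompute API, and the actual generated kernel differs per backend:
// Conceptual device-side loop (control, Process are placeholders)
while (!control.TerminateRequested)                            // set by TerminateAsync
{
    if (control.IsActive && incoming.TryDequeue(out var msg))  // gated by Activate/Deactivate
    {
        var result = Process(msg);                             // user-defined message handler
        outgoing.Enqueue(result);                              // publish to the output queue
    }
}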
2. Lock-Free Message Passing
Messages are exchanged through lock-free ring buffers using atomic operations (see the sketch after this list):
- Enqueue: Compare-and-swap to claim slot, write message
- Dequeue: Compare-and-swap to claim message, read data
- Thread-safe: Multiple producers and consumers without locks
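As a rough illustration of the claim-then-write pattern, here is a minimal host-side C# sketch. The type and field names are illustrative rather than DotCompute API, capacity is assumed to be a power of two, and the production queues additionally carry per-slot sequence numbers so a consumer never reads a claimed-but-unwritten slot:
using System.Threading;

public sealed class RingBuffer<T>
{
    private readonly T[] _slots;
    private readonly int _mask;   // capacity - 1; capacity must be a power of two
    private int _head;            // next slot producers claim
    private int _tail;            // next slot consumers claim

    public RingBuffer(int capacity)
    {
        _slots = new T[capacity];
        _mask = capacity - 1;
    }

    public bool TryEnqueue(T item)
    {
        while (true)
        {
            int head = Volatile.Read(ref _head);
            if (head - Volatile.Read(ref _tail) >= _slots.Length)
                return false;                                  // queue full
            // Compare-and-swap to claim the slot, then write the message
            if (Interlocked.CompareExchange(ref _head, head + 1, head) == head)
            {
                _slots[head & _mask] = item;
                return true;
            }
        }
    }
}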
3. Actor-Style Programming
Each kernel instance acts as an independent actor (see the skeleton after this list) with:
- Mailbox: Input queue for receiving messages
- State: Persistent local state across messages
- Behavior: Message processing logic
- Output: Results sent to other actors or host
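In the attribute-based style used throughout this page, those four elements map onto a kernel class roughly as follows (CounterActor is an illustrative example, not a built-in type):
[RingKernel(Mode = RingKernelMode.Persistent, Domain = RingKernelDomain.ActorModel)]
public class CounterActor
{
    private long _count;                  // State: persists across messages

    public void ProcessMessage(int delta) // Mailbox: messages are delivered here
    {
        _count += delta;                  // Behavior: message processing logic
        SendResult(_count);               // Output: result to other actors or the host
    }
}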
Why Use Ring Kernels?
Performance Benefits
1. Eliminate Launch Overhead
- Traditional: 5-50μs per kernel launch
- Ring Kernel: One-time launch, then 0μs
2. High Message Throughput
- CPU simulation: ~10K-100K messages/sec
- GPU (CUDA): ~1M-10M messages/sec
- GPU (Metal/OpenCL): ~500K-5M messages/sec
3. Low Latency
- Traditional: Launch overhead + execution time
- Ring Kernel: Immediate message processing (no launch)
Programming Model Benefits
1. Reactive Programming
- Event-driven computation
- Asynchronous message handling
- Natural fit for streaming data
2. Actor Systems
- Isolated actors with message passing
- Location transparency
- Fault isolation
3. Graph Computation
- Vertex-centric algorithms (Pregel-style)
- Bulk synchronous parallel (BSP) patterns
- Dynamic workload distribution
Supported Backends
Ring Kernels work across all DotCompute backends:
| Backend | Status | Performance | Platform |
|---|---|---|---|
| CUDA | ✅ Production | ~1M-10M msgs/sec | NVIDIA GPUs |
| Metal | ✅ Production | ~500K-5M msgs/sec | Apple Silicon |
| OpenCL | ✅ Production | ~500K-5M msgs/sec | Cross-platform |
| CPU | ✅ Simulation | ~10K-100K msgs/sec | All platforms |
Use Cases
1. Graph Analytics
Problem: Traditional batch processing is inefficient for dynamic graphs
Solution: Vertex-centric message passing with Ring Kernels
// PageRank with Ring Kernels
[RingKernel(Mode = RingKernelMode.Persistent, Domain = RingKernelDomain.GraphAnalytics)]
public class PageRankVertex
{
private float _rank = 1.0f;
private int _outDegree;
public void ProcessMessage(VertexMessage msg)
{
// Apply damping to the aggregated neighbor contribution carried by the message
_rank = 0.15f + 0.85f * msg.Contribution;
// Split the updated rank across outgoing edges (guard against isolated vertices)
float contribution = _outDegree > 0 ? _rank / _outDegree : 0f;
foreach (var neighbor in GetOutEdges())
{
SendMessage(neighbor, new VertexMessage { Contribution = contribution });
}
}
}
2. Spatial Simulations
Problem: Stencil computations with frequent halo exchanges
Solution: Persistent kernels with local communication
// Heat diffusion simulation
[RingKernel(Mode = RingKernelMode.Persistent, Domain = RingKernelDomain.SpatialSimulation)]
public class HeatDiffusion
{
private float _temperature;
private const float Alpha = 0.1f;
public void ProcessMessage(HaloMessage msg)
{
// Update temperature from neighbors
_temperature = (1 - 4 * Alpha) * _temperature
+ Alpha * (msg.North + msg.South + msg.East + msg.West);
// Send updated value to neighbors
BroadcastToNeighbors(new HaloMessage { Value = _temperature });
}
}
3. Real-Time Event Processing
Problem: Low-latency stream processing on GPU
Solution: Event-driven Ring Kernels with immediate processing
// Real-time anomaly detection
[RingKernel(Mode = RingKernelMode.EventDriven, Domain = RingKernelDomain.ActorModel)]
public class AnomalyDetector
{
private MovingAverage _average = new();
private float _threshold = 3.0f;
public void ProcessEvent(SensorReading reading)
{
float deviation = Math.Abs(reading.Value - _average.Current);
if (deviation > _threshold * _average.StdDev)
{
// Anomaly detected - alert immediately
SendAlert(new AnomalyAlert
{
Timestamp = reading.Timestamp,
Value = reading.Value,
ExpectedMin = _average.Current - _average.StdDev,
ExpectedMax = _average.Current + _average.StdDev
});
}
_average.Update(reading.Value);
}
}
4. Distributed Actor Systems
Problem: Scalable actor-based computation
Solution: GPU-resident actors with mailbox-based communication
// Distributed key-value store actors
[RingKernel(Mode = RingKernelMode.Persistent, Domain = RingKernelDomain.ActorModel)]
public class KVStoreActor
{
private Dictionary<int, string> _storage = new();
public void ProcessMessage(KVMessage msg)
{
switch (msg.Type)
{
case MessageType.Get:
var value = _storage.TryGetValue(msg.Key, out var v) ? v : null;
Reply(new KVResponse { Key = msg.Key, Value = value });
break;
case MessageType.Put:
_storage[msg.Key] = msg.Value;
Reply(new KVResponse { Success = true });
break;
case MessageType.Delete:
var removed = _storage.Remove(msg.Key);
Reply(new KVResponse { Success = removed });
break;
}
}
}
Execution Modes
Ring Kernels support two execution modes:
Persistent Mode
Behavior: Kernel runs continuously until explicitly terminated
Best For:
- Long-running services
- Continuous stream processing
- Actor systems with steady workload
Trade-offs:
- ✅ Zero launch overhead
- ✅ Immediate message processing
- ❌ Consumes GPU resources continuously
[RingKernel(Mode = RingKernelMode.Persistent)]
public class PersistentProcessor { }
Event-Driven Mode
Behavior: Kernel activates on demand, processes a batch of messages, then idles
Best For:
- Bursty workloads
- Power-constrained devices
- Shared GPU resources
Trade-offs:
- ✅ Conserves GPU resources
- ✅ Automatic power management
- ❌ Small activation overhead (~1-10μs)
[RingKernel(Mode = RingKernelMode.EventDriven)]
public class EventDrivenProcessor { }
Message Passing Strategies
Choose the right strategy for your workload:
1. SharedMemory (Fastest)
Use For: Intra-block communication, low capacity (<64KB)
[RingKernel(MessagingStrategy = MessagePassingStrategy.SharedMemory)]
public class SharedMemoryKernel { }
Characteristics:
- ⚡ Lowest latency (~10ns access)
- 📊 Limited capacity (GPU shared memory size)
- 🔒 Lock-free with atomic operations
- ✅ Best for producer-consumer patterns
2. AtomicQueue (Scalable)
Use For: Inter-block communication, larger capacity
[RingKernel(MessagingStrategy = MessagePassingStrategy.AtomicQueue)]
public class GlobalMemoryKernel { }
Characteristics:
- ⚡ Medium latency (~100ns access)
- 📊 Large capacity (GPU global memory)
- 🔒 Lock-free with exponential backoff
- ✅ Best for distributed actors
3. P2P (Multi-GPU)
Use For: GPU-to-GPU direct transfers
[RingKernel(MessagingStrategy = MessagePassingStrategy.P2P)]
public class MultiGPUKernel { }
Characteristics:
- ⚡ Low latency (~1μs direct copy)
- 🔗 Requires P2P capable GPUs
- 📡 Direct GPU memory access
- ✅ Best for multi-GPU pipelines
4. NCCL (Collective)
Use For: Multi-GPU reductions and broadcasts
[RingKernel(MessagingStrategy = MessagePassingStrategy.NCCL)]
public class CollectiveKernel { }
Characteristics:
- ⚡ Optimized collective operations
- 🌐 Multi-node support
- 📊 Scales to hundreds of GPUs
- ✅ Best for distributed training
Synchronization and Memory Ordering
Ring kernels have unique synchronization needs due to message passing. Unlike regular kernels (which default to relaxed memory ordering), ring kernels default to Release-Acquire consistency for correct message visibility.
Barrier Support
Ring kernels support GPU thread barriers for coordinating threads within a kernel instance:
[RingKernel(
UseBarriers = true, // Enable barriers
BarrierScope = BarrierScope.ThreadBlock, // Sync within thread block
MemoryConsistency = MemoryConsistencyModel.ReleaseAcquire, // Default for ring kernels
EnableCausalOrdering = true)] // Default true for message passing
public static void RingKernelWithBarriers(
MessageQueue<float> incoming,
MessageQueue<float> outgoing)
{
var shared = Kernel.AllocateShared<float>(256);
int tid = Kernel.ThreadId.X;
// Phase 1: Load incoming messages into shared memory; default to 0 so
// threads without a message don't leave stale data in the reduction
shared[tid] = incoming.TryDequeue(out var msg) ? msg : 0f;
Kernel.Barrier(); // Wait for all threads
// Phase 2: Aggregate and send results
if (tid == 0)
{
float sum = 0;
for (int i = 0; i < 256; i++)
sum += shared[i];
outgoing.Enqueue(sum / 256.0f);
}
}
Ring Kernel vs Regular Kernel Defaults
Ring kernels have safer defaults for message passing:
| Property | Regular Kernel Default | Ring Kernel Default | Reason |
|---|---|---|---|
| MemoryConsistency | Relaxed | ReleaseAcquire | Message passing requires causality |
| EnableCausalOrdering | false | true | Ensures message visibility |
| Performance Overhead | 0% | ~15% | Acceptable for persistent kernels |
Key Insight: Ring kernels run persistently, so the 15% overhead of Release-Acquire consistency is amortized over the kernel's lifetime. This provides safety by default for message-passing patterns.
When to Use Barriers in Ring Kernels
Use Barriers:
- Coordinating shared memory access for message batching
- Implementing reduction operations on incoming messages
- Multi-phase message processing with dependencies
- Aggregating results before sending outgoing messages
Example: Message Batch Processing:
[RingKernel(
UseBarriers = true,
BarrierScope = BarrierScope.ThreadBlock)]
public static void BatchProcessor(
MessageQueue<int> incoming,
MessageQueue<int> outgoing)
{
var shared = Kernel.AllocateShared<int>(256);
int tid = Kernel.ThreadId.X;
// Each thread dequeues one message
shared[tid] = incoming.TryDequeue(out var msg) ? msg : 0;
Kernel.Barrier(); // Ensure all messages loaded
// Thread 0 aggregates batch
if (tid == 0)
{
int batchSum = 0;
for (int i = 0; i < 256; i++)
batchSum += shared[i];
outgoing.Enqueue(batchSum);
}
}
See Also: Barriers and Memory Ordering for comprehensive details
Domain Optimizations
Specify your application domain for automatic optimizations:
General
[RingKernel(Domain = RingKernelDomain.General)]
No specific optimizations. Good default.
GraphAnalytics
[RingKernel(Domain = RingKernelDomain.GraphAnalytics)]
Optimized for:
- Irregular memory access patterns
- Load imbalance
- Grid synchronization (BSP)
SpatialSimulation
[RingKernel(Domain = RingKernelDomain.SpatialSimulation)]
Optimized for:
- Regular memory access patterns
- Local communication
- Halo exchange
ActorModel
[RingKernel(Domain = RingKernelDomain.ActorModel)]
Optimized for:
- Message-heavy workloads
- Low-latency delivery
- Dynamic workload distribution
Getting Started
1. Define Your Ring Kernel
using DotCompute.Abstractions.RingKernels;
[RingKernel(
Mode = RingKernelMode.Persistent,
MessagingStrategy = MessagePassingStrategy.AtomicQueue,
Domain = RingKernelDomain.General)]
public class MyFirstRingKernel
{
private int _messageCount = 0;
public void ProcessMessage(int data)
{
// Process incoming message
int result = data * 2;
_messageCount++;
// Send result
SendResult(result);
}
}
2. Launch the Kernel
using DotCompute.Backends.CUDA.RingKernels; // or Metal, OpenCL
using DotCompute.Abstractions.RingKernels;
// Create runtime
var logger = loggerFactory.CreateLogger<CudaRingKernelRuntime>();
var compiler = new CudaRingKernelCompiler(compilerLogger);
var registry = new MessageQueueRegistry();
var runtime = new CudaRingKernelRuntime(logger, compiler, registry);
// Configure launch options (optional - defaults to ProductionDefaults)
var options = RingKernelLaunchOptions.ProductionDefaults();
// Or use:
// var options = RingKernelLaunchOptions.LowLatencyDefaults();
// var options = RingKernelLaunchOptions.HighThroughputDefaults();
// Launch kernel (stays resident)
await runtime.LaunchAsync("my_kernel", gridSize: 1, blockSize: 256, options);
// Activate processing
await runtime.ActivateAsync("my_kernel");
3. Send Messages
// Send 1000 messages
for (int i = 0; i < 1000; i++)
{
var message = KernelMessage<int>.CreateData(
senderId: 0,
receiverId: -1,
payload: i
);
await runtime.SendMessageAsync("my_kernel", message);
}
4. Monitor Status
// Get kernel status
var status = await runtime.GetStatusAsync("my_kernel");
Console.WriteLine($"Active: {status.IsActive}");
Console.WriteLine($"Messages Processed: {status.MessagesProcessed}");
// Get performance metrics
var metrics = await runtime.GetMetricsAsync("my_kernel");
Console.WriteLine($"Throughput: {metrics.ThroughputMsgsPerSec:F0} msgs/sec");
Console.WriteLine($"Avg Latency: {metrics.AvgProcessingTimeMs:F2}ms");
5. Cleanup
// Deactivate (pause processing)
await runtime.DeactivateAsync("my_kernel");
// Terminate (cleanup resources)
await runtime.TerminateAsync("my_kernel");
// Dispose runtime
await runtime.DisposeAsync();
Best Practices
1. Choose the Right Mode
- Persistent: Steady workloads, low latency critical
- EventDriven: Bursty workloads, power efficiency important
2. Size Your Queues Appropriately
- Too small: Messages dropped, throughput limited
- Too large: Memory waste, cache pollution
- Rule of thumb: 256-1024 messages per queue
3. Use Appropriate Message Sizes
- Keep messages small (< 256 bytes ideal)
- Use indirection for large data (pointers to buffers)
- Pad to avoid false sharing (64-byte cache lines), as shown in the sketch below
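For example, an explicit struct layout pads each message to one cache line (SensorMessage is an illustrative type, not part of DotCompute):
using System.Runtime.InteropServices;

// Pad the message to a full 64-byte cache line to avoid false sharing
[StructLayout(LayoutKind.Sequential, Size = 64)]
public struct SensorMessage
{
    public long Timestamp; // 8 bytes
    public float Value;    // 4 bytes
    // the remaining bytes are padding supplied by Size = 64
}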
4. Monitor Queue Utilization
var metrics = await runtime.GetMetricsAsync("kernel_id");
if (metrics.InputQueueUtilization > 0.8)
{
// Queue nearly full - increase capacity or add more kernels
}
5. Handle Termination Gracefully
// Set timeout for graceful shutdown
var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5));
await runtime.TerminateAsync("kernel_id", cts.Token);
Queue Configuration with RingKernelLaunchOptions
Ring Kernels use message queues for communication, and their behavior is fully configurable via the RingKernelLaunchOptions class (introduced in v0.5.3-alpha).
Configuration Properties
public sealed class RingKernelLaunchOptions
{
// Queue capacity (default: 4096, range: 16-1M, must be power-of-2)
public int QueueCapacity { get; set; } = 4096;
// Deduplication window (default: 1024, range: 16-1024)
public int DeduplicationWindowSize { get; set; } = 1024;
// Backpressure strategy (default: Block)
public BackpressureStrategy BackpressureStrategy { get; set; } = BackpressureStrategy.Block;
// Enable priority-based message ordering (default: false)
public bool EnablePriorityQueue { get; set; } = false;
}
Factory Methods
Production Defaults (Recommended for most use cases)
var options = RingKernelLaunchOptions.ProductionDefaults();
// QueueCapacity: 4096 messages (handles burst traffic, 2M+ msg/s)
// DeduplicationWindowSize: 1024 messages (covers recent messages)
// BackpressureStrategy: Block (no message loss)
// EnablePriorityQueue: false (maximize throughput)
await runtime.LaunchAsync("kernel_id", gridSize: 1, blockSize: 256, options);
Low-Latency Defaults (Sub-microsecond response)
var options = RingKernelLaunchOptions.LowLatencyDefaults();
// QueueCapacity: 256 messages (minimal memory footprint)
// DeduplicationWindowSize: 256 messages (proportional to capacity)
// BackpressureStrategy: Reject (fail-fast, no blocking)
// EnablePriorityQueue: false (FIFO is fastest)
await runtime.LaunchAsync("kernel_id", gridSize: 1, blockSize: 256, options);
High-Throughput Defaults (Batch processing)
var options = RingKernelLaunchOptions.HighThroughputDefaults();
// QueueCapacity: 16384 messages (large burst buffer)
// DeduplicationWindowSize: 1024 messages (maximum window)
// BackpressureStrategy: Block (no message loss)
// EnablePriorityQueue: false (maximize throughput)
await runtime.LaunchAsync("kernel_id", gridSize: 1, blockSize: 256, options);
Custom Configuration
var options = new RingKernelLaunchOptions
{
QueueCapacity = 8192, // Power-of-2 (16-1M)
DeduplicationWindowSize = 512, // 16-1024 messages
BackpressureStrategy = BackpressureStrategy.DropOldest, // Real-time telemetry
EnablePriorityQueue = true // Enable priority ordering
};
// Validate before launch
options.Validate(); // Throws ArgumentOutOfRangeException if invalid
await runtime.LaunchAsync("kernel_id", gridSize: 1, blockSize: 256, options);
Backpressure Strategies
Choose the right strategy for your workload:
| Strategy | Behavior | Use Case | Performance |
|---|---|---|---|
| Block | Wait for space in queue | Guaranteed delivery, no message loss | May stall producer |
| Reject | Return false immediately | Fire-and-forget, latency-sensitive | No blocking, predictable latency |
| DropOldest | Evict oldest message to make space | Real-time telemetry, latest data most important | No blocking, always succeeds |
| DropNew | Discard new message | Historical logging, preserve oldest data | No blocking, returns false |
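For example, a telemetry pipeline that only needs the freshest samples can opt into DropOldest through the options API shown above ("telemetry_kernel" is an illustrative kernel id):
var telemetryOptions = new RingKernelLaunchOptions
{
    QueueCapacity = 1024,                                   // power-of-2
    BackpressureStrategy = BackpressureStrategy.DropOldest  // keep the newest data
};
await runtime.LaunchAsync("telemetry_kernel", gridSize: 1, blockSize: 256, telemetryOptions);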
Configuration Guidelines
Queue Capacity Sizing
- Too Small: Messages dropped/rejected, throughput limited
- Too Large: Memory waste, cache pollution, stale data
- Formula: Capacity ≥ Peak Message Rate × Polling Interval × 2
- Example: 1000 msg/s × 10 ms × 2 = 20 messages minimum; use 64-256 for safety
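The formula can be wrapped in a small helper that also respects the power-of-2 capacity requirement (this helper is illustrative, not part of DotCompute):
static int SizeQueueCapacity(double peakMsgsPerSec, double pollingIntervalMs)
{
    // Capacity ≥ peak rate × polling interval × 2, with a safety floor of 64
    double required = peakMsgsPerSec * (pollingIntervalMs / 1000.0) * 2;
    int capacity = 64;
    while (capacity < required)
        capacity <<= 1;                 // round up to the next power of two
    return capacity;
}

// SizeQueueCapacity(1000, 10) returns 64: 20 messages needed, floored for safety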
Deduplication Window
- Cost: ~32 bytes × window size per queue
- Benefit: Prevents duplicate processing (useful for retry scenarios)
- Auto-Clamping: Window size is automatically clamped down to QueueCapacity when the capacity is smaller
Memory Usage
- Queue Structure: 64 bytes (head/tail/capacity/metadata)
- Message Storage: QueueCapacity × 32 bytes (for IRingKernelMessage types)
- Deduplication: DeduplicationWindowSize × 32 bytes (hash table)
- Example: 4096-message capacity (~128 KB) + 1024-entry dedup window (32 KB) ≈ 160 KB per queue
Performance Expectations
Message Throughput
| Backend | Single Kernel | Multi-Kernel (4x) |
|---|---|---|
| CUDA | 1-10M msgs/sec | 4-40M msgs/sec |
| Metal | 500K-5M msgs/sec | 2-20M msgs/sec |
| OpenCL | 500K-5M msgs/sec | 2-20M msgs/sec |
| CPU | 10-100K msgs/sec | 40-400K msgs/sec |
Latency
| Operation | Typical Latency |
|---|---|
| Launch (one-time) | 1-10ms |
| Activate/Deactivate | 10-100μs |
| Message enqueue (host) | 100-500ns |
| Message processing (GPU) | 10-100ns |
| Terminate | 10-100ms |
Comparison vs Traditional Kernels
For a workload with 1000 invocations:
Traditional Kernels:
- Launch overhead: 1000 × 25μs = 25ms
- Execution time: Variable
- Total: 25ms + execution
Ring Kernels:
- Launch overhead: 1 × 5ms = 5ms (one-time)
- Execution time: Variable (same as traditional)
- Total: 5ms + execution
Speedup: ~5x reduction in overhead for this example
Next Steps
Ready to dive deeper? Check out these resources:
- Advanced Ring Kernel Programming - Deep dive into patterns and optimization
- Ring Kernel API Reference - Complete API documentation
- Multi-GPU Programming - Using Ring Kernels across multiple GPUs
- Performance Tuning - Optimize your Ring Kernel performance
Summary
Ring Kernels enable persistent GPU-resident computation with:
- ✅ Zero launch overhead after initial launch
- ✅ Actor-style message passing with lock-free queues
- ✅ Cross-backend support (CUDA, Metal, OpenCL, CPU)
- ✅ Multiple execution modes and message strategies
- ✅ Domain-specific optimizations
Perfect for:
- Graph analytics and network algorithms
- Spatial simulations and stencil computations
- Real-time event processing and streaming
- Distributed actor systems
Start building high-performance, GPU-resident applications today!