Unified Ring Kernel System - Migration Guide

Overview

This guide helps the Orleans.GpuBridge team migrate from the legacy Ring Kernel system to the new Unified Ring Kernel System. The unified system uses a single RingKernelControlBlock* parameter and auto-generates handler stubs.

Current Status (v0.6.2)

All 5 Phases Complete:

Phase 1: MemoryPack Integration (43/43 tests passing)
Phase 2: CPU Ring Kernel Implementation (43/43 tests passing)
Phase 3: CUDA Ring Kernel Implementation (115/122 tests - 94.3%)
Phase 4: Multi-GPU Coordination (infrastructure complete)
Phase 5: Performance & Observability (94/94 tests passing)

Implemented:

Single RingKernelControlBlock* parameter (replaces multiple parameters)
Auto-generated handler function stubs
K2K messaging infrastructure in generated CUDA code
Pub/Sub topic routing infrastructure
Temporal timestamp fields in control block
RingKernelContext API for GPU operations in C#
Inline handler detection (methods with RingKernelContext parameter)
C# to CUDA transpiler for RingKernelContext calls
OpenTelemetry integration (ActivitySource, Meter)
Prometheus-compatible metrics exporter
Health check infrastructure

In Progress:

Full automatic C# to CUDA transpilation for method bodies
Source generator integration for compile-time transpilation

Breaking Changes Summary

Component	Old API	New API
Control Structure	Multiple `RingBuffer*` parameters	Single `RingKernelControlBlock*`
Termination Flag	`int* terminate_flag`	`control_block->should_terminate`
Message Counter	Separate parameter	`control_block->messages_processed`
Handler Definition	Separate `.cu` file	Auto-generated stub (customize in CUDA)
K2K Messaging	Manual implementation	Infrastructure generated automatically

Migration Steps

Step 1: Update RingKernel Attribute

Important: Ring Kernels MUST return void. This is enforced by the analyzer. The kernel method body should be empty - handler logic goes in the generated CUDA code.

Old API:

[RingKernel(
    KernelId = "my_kernel",
    Capacity = 1024,
    InputQueueSize = 256,
    OutputQueueSize = 256)]
public static void MyKernel(Span<byte> input, Span<byte> output)
{
    // Handler logic was in separate file
}

New API (Current):

[RingKernel(
    KernelId = "my_kernel",
    Capacity = 1024,
    InputQueueSize = 256,
    OutputQueueSize = 256,
    MaxInputMessageSizeBytes = 65792,
    MaxOutputMessageSizeBytes = 65792,
    ProcessingMode = RingProcessingMode.Continuous,
    EnableTimestamps = true)]
public static void MyKernel()
{
    // Empty - handler logic is in generated CUDA stub
    // Customize the generated process_my_message() function in CUDA
}

Step 2: Customize Generated CUDA Handler

The stub generator creates a handler function that you can customize:

// Auto-generated handler function - customize this
__device__ bool process_my_message(
    const unsigned char* msg_buffer,
    int msg_size,
    unsigned char* output_buffer,
    int* output_size_ptr,
    RingKernelControlBlock* control_block)
{
    // TODO: Add your message processing logic here
    // Default stub just validates and echoes

    if (msg_size <= 0) return false;

    // Copy input to output (echo behavior)
    for (int i = 0; i < msg_size; i++) {
        output_buffer[i] = msg_buffer[i];
    }
    *output_size_ptr = msg_size;

    return true;
}

Step 3: Update CUDA Code Generation Expectations

Old Generated CUDA Structure:

extern "C" __global__ void my_kernel(
    RingBuffer* input_ring,
    RingBuffer* output_ring,
    int* terminate_flag,
    int* message_count)
{
    while (!*terminate_flag) {
        // Manual message processing
    }
}

New Generated CUDA Structure:

// Auto-generated handler function
__device__ bool process_my_message(
    const unsigned char* msg_buffer,
    int msg_size,
    unsigned char* output_buffer,
    int* output_size_ptr,
    RingKernelControlBlock* control_block)
{
    // Stub implementation - customize as needed
    return true;
}

extern "C" __global__ void my_kernel_kernel(
    RingKernelControlBlock* control_block)
{
    // Unified control block contains all state
    while (!control_block->should_terminate) {
        // Auto-generated message processing loop
        bool success = process_my_message(...);
        if (success) {
            atomicAdd(&control_block->messages_processed, 1ULL);
        }
    }
}

Step 4: Configure K2K (Kernel-to-Kernel) Messaging

Attribute Configuration:

[RingKernel(
    KernelId = "producer",
    PublishesToKernels = new[] { "consumer" })]
public static void ProducerKernel()
{
    // Empty - K2K infrastructure generated in CUDA
}

[RingKernel(
    KernelId = "consumer",
    SubscribesToKernels = new[] { "producer" })]
public static void ConsumerKernel()
{
    // Empty - K2K infrastructure generated in CUDA
}

Generated CUDA Infrastructure:

// K2K message header structure
typedef struct {
    int source_kernel_id;
    int target_kernel_id;
    int message_type;
    int payload_size;
    long long timestamp;
} K2KMessageHeader;

// K2K send queue and receive channels are initialized in control block
// control_block->k2k_send_queue_ptr
// control_block->k2k_receive_channels_ptr

Step 5: Configure Orleans.GpuBridge.Core Integration

The new system supports Orleans-specific features via attribute configuration:

[RingKernel(
    KernelId = "orleans_actor",

    // Orleans.GpuBridge.Core Properties
    EnableTimestamps = true,           // GPU clock64() for temporal consistency
    ProcessingMode = RingProcessingMode.Continuous,  // or Batch, Adaptive
    MaxMessagesPerIteration = 100,     // Batch size limit
    MessageQueueSize = 512,            // Unified queue size

    // Memory consistency for actor systems
    MemoryConsistency = "ReleaseAcquire",
    EnableCausalOrdering = true,

    // Barriers for synchronization
    UseBarriers = true,
    BarrierScope = "ThreadBlock")]
public static void OrleansActorKernel()
{
    // Empty - implement handler logic in generated CUDA
}

RingKernelControlBlock Structure

The new unified control block replaces multiple parameters:

struct RingKernelControlBlock {
    // Lifecycle control
    int is_active;              // Atomic: 1 = active, 0 = inactive
    int should_terminate;       // Atomic: 1 = terminate, 0 = continue
    int has_terminated;         // Atomic: 1 = terminated, 0 = running
    int errors_encountered;     // Atomic error counter

    // Performance metrics
    long long messages_processed;    // Atomic message counter
    long long last_activity_ticks;   // Atomic timestamp

    // Queue pointers (device memory) - see note below
    long long input_queue_head_ptr;
    long long input_queue_tail_ptr;
    long long output_queue_head_ptr;
    long long output_queue_tail_ptr;

    // K2K messaging (if enabled)
    long long k2k_send_queue_ptr;
    long long k2k_receive_channels_ptr;
    int k2k_channel_count;

    // Temporal (if EnableTimestamps = true)
    long long hlc_physical;
    long long hlc_logical;
};

Queue Pointer Architecture (Important)

Note: In the current implementation (v0.6.2), queue pointers (input_queue_head_ptr, etc.) are set to 0 (nullptr). Message passing uses a bridged architecture instead of direct queue pointer access.

Why Bridged Architecture?

The control block's queue pointer fields were originally designed for direct int* head/tail pointers. However, the actual message queues use more complex structures:

CudaMessageQueue: Type-safe GPU queues with integrated head/tail management
MessageQueueBridge: Host-side bridges for serializing managed types to GPU memory

How Message Passing Works:

┌─────────────────────┐     ┌────────────────────┐     ┌─────────────────────┐
│   Host Application  │────▶│ MessageQueueBridge │────▶│   GPU Ring Kernel   │
│   (C# managed)      │◀────│   (Serialization)  │◀────│   (CUDA device)     │
└─────────────────────┘     └────────────────────┘     └─────────────────────┘
         │                           │                          │
         │ IRingKernelMessage        │ MemoryPack bytes         │ Raw byte[]
         │ (MemoryPackable)          │ Host↔Device transfer     │ processing
         └───────────────────────────┴──────────────────────────┘

Input Path: Host → Bridge serializes message → Copies to GPU buffer → Kernel processes raw bytes
Output Path: Kernel writes to output buffer → Bridge deserializes → Host receives typed message

Implications for Kernel Code:

The kernel receives messages as raw unsigned char* buffers, not typed queues
Queue pointer null checks in generated CUDA code prevent invalid memory access
Message boundaries are managed by the bridge, not kernel-side queue logic
Use msg_buffer and msg_size parameters in handler functions instead of queue pointers

Future Plans:

A future release may implement direct MessageQueue struct pointers for kernels that don't require managed type serialization, enabling:

Zero-copy GPU-to-GPU message passing
Lower latency for unmanaged message types
Direct kernel-to-kernel (K2K) queue access

Handler Function Signature

All handler functions follow this fixed signature:

__device__ bool process_{kernel_name}_message(
    const unsigned char* msg_buffer,    // Input message bytes
    int msg_size,                        // Input message size
    unsigned char* output_buffer,        // Output buffer to write to
    int* output_size_ptr,               // Pointer to set output size
    RingKernelControlBlock* control_block  // Access to kernel state
);

Return value:

true: Message processed successfully, increment counter
false: Message processing failed, don't increment counter

RingKernelContext API

The RingKernelContext API provides a C#-like interface that gets translated to CUDA intrinsics by the compiler.

Synchronization

Method	CUDA Translation	Description
`ctx.SyncThreads()`	`__syncthreads()`	Block-level barrier
`ctx.SyncGrid()`	`cooperative_groups::grid_group::sync()`	Grid-level barrier
`ctx.SyncWarp(mask)`	`__syncwarp(mask)`	Warp-level barrier
`ctx.ThreadFence()`	`__threadfence()`	Device-scope memory fence
`ctx.ThreadFenceBlock()`	`__threadfence_block()`	Block-scope memory fence
`ctx.ThreadFenceSystem()`	`__threadfence_system()`	System-scope memory fence
`ctx.NamedBarrier(name)`	Cross-kernel barrier	Multi-kernel synchronization

Atomic Operations

Method	CUDA Translation
`ctx.AtomicAdd(ref x, val)`	`atomicAdd(&x, val)`
`ctx.AtomicCAS(ref x, cmp, val)`	`atomicCAS(&x, cmp, val)`
`ctx.AtomicExch(ref x, val)`	`atomicExch(&x, val)`
`ctx.AtomicMin(ref x, val)`	`atomicMin(&x, val)`
`ctx.AtomicMax(ref x, val)`	`atomicMax(&x, val)`

Temporal APIs (HLC)

Method	Description
`ctx.Now()`	Returns HLC timestamp from GPU clock64()
`ctx.Tick()`	Increments HLC logical counter
`ctx.UpdateClock(received)`	Merges received timestamp (causality)

K2K Messaging

Method	Description
`ctx.SendToKernel<T>(kernelId, msg)`	Send message to another kernel
`ctx.TryReceiveFromKernel<T>(kernelId, out msg)`	Try to receive from kernel
`ctx.GetPendingMessageCount(kernelId)`	Get pending message count

Pub/Sub Topics

Method	Description
`ctx.PublishToTopic<T>(topic, msg)`	Publish message to topic subscribers
`ctx.TryReceiveFromTopic<T>(topic, out msg)`	Receive from subscribed topic

Warp-Level Primitives

Method	CUDA Translation
`ctx.WarpShuffle(val, srcLane)`	`__shfl_sync(mask, val, srcLane)`
`ctx.WarpShuffleDown(val, delta)`	`__shfl_down_sync(mask, val, delta)`
`ctx.WarpReduce(val)`	Warp-wide sum reduction
`ctx.WarpBallot(pred)`	`__ballot_sync(mask, pred)`
`ctx.WarpAll(pred)`	`__all_sync(mask, pred)`
`ctx.WarpAny(pred)`	`__any_sync(mask, pred)`

Testing Migration

Update Test Assertions

Old Test Pattern:

// Old: Check for individual ring buffer parameters
Assert.Contains("RingBuffer* input_ring", generatedCode);
Assert.Contains("int* terminate_flag", generatedCode);

New Test Pattern:

// New: Check for unified control block
Assert.Contains("RingKernelControlBlock* control_block", generatedCode);
Assert.Contains("control_block->should_terminate", generatedCode);
Assert.Contains("control_block->messages_processed", generatedCode);

// Check handler function is generated
Assert.Contains("__device__ bool process_", generatedCode);
Assert.Contains("_message(", generatedCode);

Test K2K Infrastructure

[Fact]
public void StubGenerator_WithK2KMessaging_GeneratesInfrastructure()
{
    var kernel = CreateKernelWithK2K();
    var generator = new CudaRingKernelStubGenerator(logger);

    var code = generator.GenerateKernelStub(kernel);

    Assert.Contains("k2k_send_queue", code);
    Assert.Contains("k2k_receive_channels", code);
    Assert.Contains("K2KMessageHeader", code);
}

Backward Compatibility

Note: This migration is NOT backward compatible. The old API has been removed entirely as per project requirements.

If you have existing .cu handler files:

Keep the handler logic in CUDA
Update it to match the new handler signature (5 parameters)
Rename to follow process_{name}_message convention
The generated kernel stub will call your handler function

Common Migration Issues

Issue 1: Handler Function Not Found

Error:

error: identifier "process_my_kernel_message" is undefined

Solution: Ensure the kernel method name follows the convention. The handler name is derived from the method name:

Method: MyKernel → Handler: process_my_message
Method: ProcessDataKernel → Handler: process_process_data_message

Issue 2: Parameter Type Mismatch

Error:

error: argument of type "int" is incompatible with parameter of type "int *"

Solution: The handler function signature is fixed. Ensure your handler uses:

int* output_size_ptr (pointer, not value)
RingKernelControlBlock* control_block (5th parameter)

Issue 3: Missing Control Block Members

Error:

error: 'RingKernelControlBlock' has no member named 'old_field'

Solution: Use the new control block member names:

terminate_flag → control_block->should_terminate
message_count → control_block->messages_processed

Issue 4: Kernel Method Has Parameters

Error:

DC007: Ring kernel methods must have no parameters (current implementation)

Solution: Remove parameters from the kernel method. The current implementation requires empty kernel methods:

// Wrong
public static void MyKernel(RingKernelContext ctx, Message msg) { }

// Correct (current)
public static void MyKernel() { }

WSL2 Compatibility

Overview

Ring Kernels running in WSL2 (Windows Subsystem for Linux 2) have specific limitations due to how the NVIDIA CUDA driver handles concurrent kernel execution and API calls.

Key Limitation: In WSL2, persistent Ring Kernels running infinite loops block CUDA API calls from the host. This prevents:

Reading/writing control blocks while kernel is running
Sending messages to the kernel
Querying kernel status
Proper kernel termination

EventDriven Mode

To address this limitation, DotCompute implements EventDriven mode for Ring Kernels:

[RingKernel("my_kernel", Mode = RingKernelMode.EventDriven, EventDrivenMaxIterations = 1000)]
public static class MyKernel
{
    // Kernel implementation
}

How it works:

Instead of an infinite loop, the kernel runs for a limited number of iterations (default: 1000)
When the iteration limit is reached, the kernel exits with has_terminated = 2 (relaunchable)
The runtime automatically relaunches the kernel if it hasn't been terminated
This creates windows for the host to update control blocks between kernel launches

WSL2 Auto-Detection

DotCompute automatically detects WSL2 and overrides kernel mode:

[WSL2] Overriding Persistent mode to EventDriven mode for WSL2 compatibility (kernel: my_kernel)

Detection method: The runtime checks for /proc/sys/fs/binfmt_misc/WSLInterop to determine if running in WSL2.

Auto-detection applies to:

Stub generation (compile-time)
Kernel launch (runtime)

Configuration Options

Property	Default	Description
`Mode`	`Persistent`	`Persistent` (infinite loop) or `EventDriven` (finite iterations)
`EventDrivenMaxIterations`	`1000`	Number of dispatch iterations before kernel exits

Tuning EventDrivenMaxIterations:

Higher values: Less kernel relaunch overhead, longer blocking periods
Lower values: More responsive to control block updates, more relaunch overhead
Recommended range: 100-10000 depending on message processing frequency

Termination Values

The control block's has_terminated field distinguishes termination types:

Value	Meaning	Action
`0`	Running	Kernel is active
`1`	Permanent termination	`should_terminate` was set, kernel should NOT be relaunched
`2`	Relaunchable exit	Iteration limit reached, kernel CAN be relaunched

Example: Explicit EventDriven Mode

// For cross-platform compatibility, explicitly use EventDriven mode
[RingKernel("cross_platform_kernel",
    Mode = RingKernelMode.EventDriven,
    EventDrivenMaxIterations = 500)]
public static class CrossPlatformKernel
{
    public static void Process()
    {
        // Process messages
    }
}

Known Limitations in WSL2

Pinned Memory: cuMemHostAlloc with CU_MEMHOSTALLOC_DEVICEMAP may fail; fallback to standard device memory is used
Cooperative Kernels: Limited support; grid-wide barriers may not function correctly
Kernel Launch Latency: Additional overhead from relaunch cycles (~1-5ms per relaunch)
Control Block Access: Non-blocking reads may timeout more frequently

Troubleshooting WSL2 Issues

Issue: CUDA_ERROR_NO_DEVICE (100)

# Set WSL2 library path before running tests
export LD_LIBRARY_PATH="/usr/lib/wsl/lib:$LD_LIBRARY_PATH"
dotnet test

Issue: Kernel appears to hang

Verify EventDriven mode is active (check console output for WSL2 override message)
Reduce EventDrivenMaxIterations for faster response
Check if should_terminate is being set correctly

Issue: High CPU usage from relaunch loop

The relaunch loop uses adaptive backoff (10ms-100ms delays)
Increase EventDrivenMaxIterations to reduce relaunch frequency

Support

Documentation: /docs/articles/guides/ring-kernels-advanced.md
Tests: /tests/Unit/DotCompute.Backends.CUDA.Tests/Compilation/
Issues: https://github.com/mivertowski/DotCompute/issues

Last Updated: January 2026 Applies To: DotCompute v0.6.2+ Authors: Michael Ivertowski

Table of Contents

Unified Ring Kernel System - Migration Guide

Overview

Current Status (v0.6.2)

Breaking Changes Summary

Migration Steps

Step 1: Update RingKernel Attribute

Step 2: Customize Generated CUDA Handler

Step 3: Update CUDA Code Generation Expectations

Step 4: Configure K2K (Kernel-to-Kernel) Messaging

Step 5: Configure Orleans.GpuBridge.Core Integration

RingKernelControlBlock Structure

Queue Pointer Architecture (Important)

Handler Function Signature

RingKernelContext API

Synchronization

Atomic Operations

Temporal APIs (HLC)

K2K Messaging

Pub/Sub Topics

Warp-Level Primitives

Testing Migration

Update Test Assertions

Test K2K Infrastructure

Backward Compatibility

Common Migration Issues

Issue 1: Handler Function Not Found

Issue 2: Parameter Type Mismatch

Issue 3: Missing Control Block Members

Issue 4: Kernel Method Has Parameters

WSL2 Compatibility

Overview

EventDriven Mode

WSL2 Auto-Detection

Configuration Options

Termination Values

Example: Explicit EventDriven Mode

Known Limitations in WSL2

Troubleshooting WSL2 Issues

Support