Table of Contents

Struct RingKernelContext

Namespace
DotCompute.Abstractions.RingKernels
Assembly
DotCompute.Abstractions.dll

Provides the runtime context for ring kernel execution, exposing barriers, temporal operations, kernel-to-kernel messaging, and GPU intrinsics.

public ref struct RingKernelContext
Inherited Members

Examples

[RingKernel(KernelId = "Worker")]
public static WorkResponse Process(WorkRequest req, RingKernelContext ctx)
{
    // Barrier synchronization
    ctx.SyncThreads();

    // Get current timestamp
    var timestamp = ctx.Now();

    // Perform computation
    var result = req.Value * 2;

    // Send to another kernel
    ctx.SendToKernel("Aggregator", new AggregateMessage(result));

    return new WorkResponse(result, timestamp);
}

Remarks

RingKernelContext is passed to unified ring kernel methods and provides access to:

  • Thread Identity: ThreadId, BlockId, WarpId for GPU threading
  • Barriers: SyncThreads, SyncGrid, SyncWarp, NamedBarrier for synchronization
  • Temporal: HLC timestamps for causal ordering
  • Memory Ordering: Thread fences for memory consistency
  • K2K Messaging: SendToKernel, TryReceiveFromKernel for actor communication
  • Pub/Sub: PublishToTopic, SubscribeToTopic for broadcast patterns
  • Atomics: AtomicAdd, AtomicCAS, AtomicExch for thread-safe operations

Code Generation: Method calls on this context are translated to CUDA intrinsics by the C# to CUDA translator. For example:

ctx.SyncThreads()     → __syncthreads()
ctx.Now()             → clock64() with HLC
ctx.AtomicAdd(ref x, 1) → atomicAdd(&x, 1)

Properties

BlockDim

Gets the number of threads per block.

public readonly int BlockDim { get; }

Property Value

int

Remarks

Translated to blockDim.x in CUDA.

BlockId

Gets the block index within the grid.

public readonly int BlockId { get; }

Property Value

int

Remarks

Translated to blockIdx.x in CUDA.

ErrorsEncountered

Gets the number of errors encountered during processing.

public int ErrorsEncountered { get; }

Property Value

int

GlobalThreadId

Gets the global thread index across all blocks.

public readonly int GlobalThreadId { get; }

Property Value

int

Remarks

Translated to blockIdx.x * blockDim.x + threadIdx.x in CUDA.

GridDim

Gets the number of blocks in the grid.

public readonly int GridDim { get; }

Property Value

int

Remarks

Translated to gridDim.x in CUDA.

InputQueuePendingCount

Gets the number of pending messages in the input queue.

public int InputQueuePendingCount { get; }

Property Value

int

The number of messages waiting to be processed.

IsInputQueueEmpty

Checks if the input queue is empty.

public bool IsInputQueueEmpty { get; }

Property Value

bool

true if there are no messages waiting.

IsOutputQueueFull

Checks if the output queue is full.

public bool IsOutputQueueFull { get; }

Property Value

bool

true if the output queue cannot accept more messages.

IsTerminationRequested

Checks if termination has been requested.

public bool IsTerminationRequested { get; }

Property Value

bool

true if the kernel should terminate.

KernelId

Gets the kernel identifier for this ring kernel instance.

public readonly string KernelId { get; }

Property Value

string

LaneId

Gets the lane index within the warp (ThreadId % 32).

public int LaneId { get; }

Property Value

int

Remarks

Derived from threadIdx.x % 32 in CUDA.

MessagesProcessed

Gets the total number of messages processed by this kernel.

public long MessagesProcessed { get; }

Property Value

long

OutputQueueFreeSlots

Gets the number of free slots in the output queue.

public int OutputQueueFreeSlots { get; }

Property Value

int

The number of messages that can be enqueued before the queue is full.

ThreadId

Gets the thread index within the block (0 to BlockDim-1).

public readonly int ThreadId { get; }

Property Value

int

Remarks

Translated to threadIdx.x in CUDA.

WarpId

Gets the warp index (ThreadId / 32).

public int WarpId { get; }

Property Value

int

Remarks

Derived from threadIdx.x / 32 in CUDA.

Methods

AtomicAdd(ref int, int)

Atomically adds a value to an integer and returns the old value.

public int AtomicAdd(ref int target, int value)

Parameters

target int

Reference to the target integer.

value int

The value to add.

Returns

int

The original value before the addition.

Remarks

Translated to atomicAdd(&target, value) in CUDA.

AtomicAdd(ref long, long)

Atomically adds a value to a long integer and returns the old value.

public long AtomicAdd(ref long target, long value)

Parameters

target long

Reference to the target long integer.

value long

The value to add.

Returns

long

The original value before the addition.

Remarks

CUDA Requirements: Native 64-bit atomics require CC 6.0+.

Translated to atomicAdd((unsigned long long*)&target, value) in CUDA.

AtomicAdd(ref float, float)

Atomically adds a value to a float and returns the old value.

public float AtomicAdd(ref float target, float value)

Parameters

target float

Reference to the target float.

value float

The value to add.

Returns

float

The original value before the addition.

Remarks

Translated to atomicAdd(&target, value) in CUDA.

AtomicAnd(ref int, int)

Atomically performs a bitwise AND operation and returns the old value.

public int AtomicAnd(ref int target, int value)

Parameters

target int

Reference to the target integer.

value int

The value to AND with.

Returns

int

The original value before the operation.

Remarks

Translated to atomicAnd(&target, value) in CUDA.

AtomicCAS(ref int, int, int)

Atomically compares and swaps an integer value.

public int AtomicCAS(ref int target, int compare, int value)

Parameters

target int

Reference to the target integer.

compare int

The value to compare against.

value int

The value to store if comparison succeeds.

Returns

int

The original value (allows caller to check if swap occurred).

Remarks

Translated to atomicCAS(&target, compare, value) in CUDA.

AtomicCAS(ref long, long, long)

Atomically compares and swaps a long integer value.

public long AtomicCAS(ref long target, long compare, long value)

Parameters

target long

Reference to the target long integer.

compare long

The value to compare against.

value long

The value to store if comparison succeeds.

Returns

long

The original value (allows caller to check if swap occurred).

Remarks

CUDA Requirements: Native 64-bit atomics require CC 6.0+.

Translated to atomicCAS((unsigned long long*)&target, compare, value) in CUDA.

AtomicExch(ref int, int)

Atomically exchanges an integer value and returns the old value.

public int AtomicExch(ref int target, int value)

Parameters

target int

Reference to the target integer.

value int

The value to store.

Returns

int

The original value before the exchange.

Remarks

Translated to atomicExch(&target, value) in CUDA.

AtomicMax(ref int, int)

Atomically computes the maximum and stores it.

public int AtomicMax(ref int target, int value)

Parameters

target int

Reference to the target integer.

value int

The value to compare.

Returns

int

The original value before the operation.

Remarks

Translated to atomicMax(&target, value) in CUDA.

AtomicMin(ref int, int)

Atomically computes the minimum and stores it.

public int AtomicMin(ref int target, int value)

Parameters

target int

Reference to the target integer.

value int

The value to compare.

Returns

int

The original value before the operation.

Remarks

Translated to atomicMin(&target, value) in CUDA.

AtomicOr(ref int, int)

Atomically performs a bitwise OR operation and returns the old value.

public int AtomicOr(ref int target, int value)

Parameters

target int

Reference to the target integer.

value int

The value to OR with.

Returns

int

The original value before the operation.

Remarks

Translated to atomicOr(&target, value) in CUDA.

AtomicSub(ref int, int)

Atomically subtracts a value from an integer and returns the old value.

public int AtomicSub(ref int target, int value)

Parameters

target int

Reference to the target integer.

value int

The value to subtract.

Returns

int

The original value before the subtraction.

Remarks

Translated to atomicSub(&target, value) in CUDA.

AtomicXor(ref int, int)

Atomically performs a bitwise XOR operation and returns the old value.

public int AtomicXor(ref int target, int value)

Parameters

target int

Reference to the target integer.

value int

The value to XOR with.

Returns

int

The original value before the operation.

Remarks

Translated to atomicXor(&target, value) in CUDA.

EnqueueOutput(ReadOnlySpan<byte>)

Enqueues raw bytes to the ring kernel's output queue.

public bool EnqueueOutput(ReadOnlySpan<byte> data)

Parameters

data ReadOnlySpan<byte>

The raw bytes to enqueue.

Returns

bool

true if the data was enqueued; false if the output queue is full.

Remarks

Use this overload when you have pre-serialized data or need to send raw bytes.

EnqueueOutput<T>(T)

Enqueues a message to the ring kernel's output queue.

public bool EnqueueOutput<T>(T message) where T : struct

Parameters

message T

The message to enqueue.

Returns

bool

true if the message was enqueued; false if the output queue is full.

Type Parameters

T

The message type (must have [RingKernelMessage] or [MemoryPackable] attribute).

Examples

[RingKernel(KernelId = "Processor")]
public static void ProcessData(RingKernelContext ctx, InputMessage input)
{
    var result = new OutputMessage { Value = input.Value * 2 };
    ctx.EnqueueOutput(result);
}

Remarks

Serializes the message using MemoryPack and copies it to the output ring buffer. This is the primary method for producing output from a ring kernel.

Thread Safety: Uses atomic operations to safely enqueue from multiple threads.

GetPendingMessageCount(string)

Gets the number of pending messages from a specific kernel.

public int GetPendingMessageCount(string sourceKernelId)

Parameters

sourceKernelId string

The kernel ID to check.

Returns

int

The number of messages waiting in the K2K queue.

NamedBarrier(int)

Synchronizes at a named barrier identified by integer ID.

public void NamedBarrier(int barrierId)

Parameters

barrierId int

The numeric identifier of the barrier.

Remarks

More efficient than string-based barriers for hot paths.

Translated to __barrier_sync(barrierId) on CC 7.0+.

NamedBarrier(string)

Synchronizes at a named barrier across multiple kernels.

public void NamedBarrier(string barrierName)

Parameters

barrierName string

The name of the cross-kernel barrier.

Remarks

Named barriers enable synchronization between different ring kernels. All kernels that declare this barrier in their NamedBarriers property must reach the barrier before any can proceed.

Requirements: Barrier must be declared in [RingKernel] attribute.

Now()

Gets the current HLC timestamp without advancing the clock.

public HlcTimestamp Now()

Returns

HlcTimestamp

The current hybrid logical clock timestamp.

Remarks

Reads the current GPU hardware timestamp (clock64()) combined with the logical counter for causal ordering.

Resolution: ~1ns on CC 6.0+, ~1μs on older GPUs.

PublishToTopic<T>(string, T)

Publishes a message to a topic (delivered to all subscribers).

public bool PublishToTopic<T>(string topic, T message) where T : struct

Parameters

topic string

The topic name to publish to.

message T

The message to broadcast.

Returns

bool

true if published successfully.

Type Parameters

T

The message type.

Remarks

Broadcasts a message to all kernels subscribed to the topic. Messages are copied to each subscriber's topic queue.

Requirements: Topic must be declared in PublishesToTopics.

ReportError()

Increments the error counter.

public void ReportError()

Remarks

Translated to atomicAdd(&control_block->errors_encountered, 1) in CUDA.

RequestTermination()

Signals that the kernel should terminate after completing current message.

public void RequestTermination()

Remarks

Sets control_block->should_terminate = 1 in CUDA.

Use for graceful shutdown in response to a shutdown message.

SendToKernel<T>(string, T)

Sends a message to another ring kernel (actor-to-actor communication).

public bool SendToKernel<T>(string targetKernelId, T message) where T : struct

Parameters

targetKernelId string

The kernel ID to send to.

message T

The message to send.

Returns

bool

true if the message was enqueued; false if the queue is full.

Type Parameters

T

The message type (must have [RingKernelMessage] attribute).

Remarks

Enqueues a message to the target kernel's K2K input queue. The message is serialized and copied to GPU-resident shared memory.

Requirements: Target must be declared in PublishesToKernels.

SyncGrid()

Synchronizes all threads across the entire grid.

public void SyncGrid()

Remarks

Translated to cooperative_groups::grid_group::sync() in CUDA.

Latency: ~1-10μs depending on grid size.

Requirements: Compute Capability 6.0+, cooperative launch mode.

SyncThreads()

Synchronizes all threads within the current block.

public void SyncThreads()

Remarks

Translated to __syncthreads() in CUDA.

Latency: ~10ns on modern GPUs.

Caution: All threads in the block must reach this barrier or deadlock occurs.

SyncWarp(uint)

Synchronizes all threads within the current warp.

public void SyncWarp(uint mask = 4294967295)

Parameters

mask uint

Bitmask indicating which threads participate (default: all threads, 0xFFFFFFFF).

Remarks

Translated to __syncwarp(mask) in CUDA.

Latency: ~1ns (essentially free on most GPUs).

ThreadFence()

Issues a thread fence for device-scope memory ordering.

public void ThreadFence()

Remarks

Translated to __threadfence() in CUDA.

Ensures all memory writes before the fence are visible to all threads on the device before any writes after the fence.

ThreadFenceBlock()

Issues a thread fence for block-scope memory ordering.

public void ThreadFenceBlock()

Remarks

Translated to __threadfence_block() in CUDA.

Ensures all memory writes before the fence are visible to all threads in the same block before any writes after the fence.

ThreadFenceSystem()

Issues a thread fence for system-scope memory ordering.

public void ThreadFenceSystem()

Remarks

Translated to __threadfence_system() in CUDA.

Ensures all memory writes before the fence are visible to all threads on all devices and the host CPU before any writes after the fence.

Latency: Higher than device-scope fence (~100-1000ns).

Tick()

Advances the local HLC clock (tick operation for local events).

public void Tick()

Remarks

Increments the logical counter component of the HLC. Use when a local event occurs that should be ordered after previous events.

TryReceiveFromKernel<T>(string, out T)

Attempts to receive a message from another ring kernel.

public bool TryReceiveFromKernel<T>(string sourceKernelId, out T message) where T : struct

Parameters

sourceKernelId string

The kernel ID to receive from.

message T

The received message if successful.

Returns

bool

true if a message was received; false if the queue is empty.

Type Parameters

T

The message type to receive.

Remarks

Dequeues a message from the source kernel's K2K output queue. Non-blocking.

Requirements: Source must be declared in SubscribesToKernels.

TryReceiveFromTopic<T>(string, out T)

Attempts to receive a message from a subscribed topic.

public bool TryReceiveFromTopic<T>(string topic, out T message) where T : struct

Parameters

topic string

The topic name to receive from.

message T

The received message if successful.

Returns

bool

true if a message was received; false if the queue is empty.

Type Parameters

T

The message type to receive.

Remarks

Requirements: Topic must be declared in SubscribesToTopics.

UpdateClock(HlcTimestamp)

Updates the local HLC from a received timestamp (merge operation).

public void UpdateClock(HlcTimestamp received)

Parameters

received HlcTimestamp

The timestamp received from another kernel/host.

Remarks

Merges the received timestamp with the local clock to maintain causal ordering. The local clock is set to max(local, received) + 1.

WarpAll(bool, uint)

Returns true if all active threads have true predicate.

public bool WarpAll(bool predicate, uint mask = 4294967295)

Parameters

predicate bool

The condition to test.

mask uint

Active thread mask (default: all threads).

Returns

bool

true if all active threads have true predicate.

Remarks

Translated to __all_sync(mask, predicate) in CUDA.

WarpAny(bool, uint)

Returns true if any active thread has true predicate.

public bool WarpAny(bool predicate, uint mask = 4294967295)

Parameters

predicate bool

The condition to test.

mask uint

Active thread mask (default: all threads).

Returns

bool

true if any active thread has true predicate.

Remarks

Translated to __any_sync(mask, predicate) in CUDA.

WarpBallot(bool, uint)

Returns a ballot of threads where the predicate is true.

public uint WarpBallot(bool predicate, uint mask = 4294967295)

Parameters

predicate bool

The condition to test.

mask uint

Active thread mask (default: all threads).

Returns

uint

Bitmask where bit i is set if thread i's predicate is true.

Remarks

Translated to __ballot_sync(mask, predicate) in CUDA.

WarpReduce(int, uint)

Performs a warp-wide reduction (sum).

public int WarpReduce(int value, uint mask = 4294967295)

Parameters

value int

The value to reduce.

mask uint

Active thread mask (default: all threads).

Returns

int

The sum of all values in the warp (returned to all lanes).

WarpShuffle(int, int, uint)

Shuffles a value from another lane in the warp.

public int WarpShuffle(int value, int srcLane, uint mask = 4294967295)

Parameters

value int

The value to share.

srcLane int

The source lane to read from.

mask uint

Active thread mask (default: all threads).

Returns

int

The value from the source lane.

Remarks

Translated to __shfl_sync(mask, value, srcLane) in CUDA.

WarpShuffleDown(int, int, uint)

Shuffles a value from a lane with relative offset.

public int WarpShuffleDown(int value, int delta, uint mask = 4294967295)

Parameters

value int

The value to share.

delta int

The offset from current lane.

mask uint

Active thread mask (default: all threads).

Returns

int

The value from lane (currentLane + delta) % 32.

Remarks

Translated to __shfl_down_sync(mask, value, delta) in CUDA.