Struct RingKernelContext
- Namespace
- DotCompute.Abstractions.RingKernels
- Assembly
- DotCompute.Abstractions.dll
Provides the runtime context for ring kernel execution, exposing barriers, temporal operations, kernel-to-kernel messaging, and GPU intrinsics.
public ref struct RingKernelContext
- Inherited Members
Examples
[RingKernel(KernelId = "Worker")]
public static WorkResponse Process(WorkRequest req, RingKernelContext ctx)
{
// Barrier synchronization
ctx.SyncThreads();
// Get current timestamp
var timestamp = ctx.Now();
// Perform computation
var result = req.Value * 2;
// Send to another kernel
ctx.SendToKernel("Aggregator", new AggregateMessage(result));
return new WorkResponse(result, timestamp);
}
Remarks
RingKernelContext is passed to unified ring kernel methods and provides access to:
- Thread Identity: ThreadId, BlockId, WarpId for GPU threading
- Barriers: SyncThreads, SyncGrid, SyncWarp, NamedBarrier for synchronization
- Temporal: HLC timestamps for causal ordering
- Memory Ordering: Thread fences for memory consistency
- K2K Messaging: SendToKernel, TryReceiveFromKernel for actor communication
- Pub/Sub: PublishToTopic, SubscribeToTopic for broadcast patterns
- Atomics: AtomicAdd, AtomicCAS, AtomicExch for thread-safe operations
Code Generation: Method calls on this context are translated to CUDA intrinsics by the C# to CUDA translator. For example:
ctx.SyncThreads() → __syncthreads()
ctx.Now() → clock64() with HLC
ctx.AtomicAdd(ref x, 1) → atomicAdd(&x, 1)
Properties
BlockDim
Gets the number of threads per block.
public readonly int BlockDim { get; }
Property Value
Remarks
Translated to blockDim.x in CUDA.
BlockId
Gets the block index within the grid.
public readonly int BlockId { get; }
Property Value
Remarks
Translated to blockIdx.x in CUDA.
ErrorsEncountered
Gets the number of errors encountered during processing.
public int ErrorsEncountered { get; }
Property Value
GlobalThreadId
Gets the global thread index across all blocks.
public readonly int GlobalThreadId { get; }
Property Value
Remarks
Translated to blockIdx.x * blockDim.x + threadIdx.x in CUDA.
GridDim
Gets the number of blocks in the grid.
public readonly int GridDim { get; }
Property Value
Remarks
Translated to gridDim.x in CUDA.
InputQueuePendingCount
Gets the number of pending messages in the input queue.
public int InputQueuePendingCount { get; }
Property Value
- int
The number of messages waiting to be processed.
IsInputQueueEmpty
Checks if the input queue is empty.
public bool IsInputQueueEmpty { get; }
Property Value
- bool
trueif there are no messages waiting.
IsOutputQueueFull
Checks if the output queue is full.
public bool IsOutputQueueFull { get; }
Property Value
- bool
trueif the output queue cannot accept more messages.
IsTerminationRequested
Checks if termination has been requested.
public bool IsTerminationRequested { get; }
Property Value
- bool
trueif the kernel should terminate.
KernelId
Gets the kernel identifier for this ring kernel instance.
public readonly string KernelId { get; }
Property Value
LaneId
Gets the lane index within the warp (ThreadId % 32).
public int LaneId { get; }
Property Value
Remarks
Derived from threadIdx.x % 32 in CUDA.
MessagesProcessed
Gets the total number of messages processed by this kernel.
public long MessagesProcessed { get; }
Property Value
OutputQueueFreeSlots
Gets the number of free slots in the output queue.
public int OutputQueueFreeSlots { get; }
Property Value
- int
The number of messages that can be enqueued before the queue is full.
ThreadId
Gets the thread index within the block (0 to BlockDim-1).
public readonly int ThreadId { get; }
Property Value
Remarks
Translated to threadIdx.x in CUDA.
WarpId
Gets the warp index (ThreadId / 32).
public int WarpId { get; }
Property Value
Remarks
Derived from threadIdx.x / 32 in CUDA.
Methods
AtomicAdd(ref int, int)
Atomically adds a value to an integer and returns the old value.
public int AtomicAdd(ref int target, int value)
Parameters
Returns
- int
The original value before the addition.
Remarks
Translated to atomicAdd(&target, value) in CUDA.
AtomicAdd(ref long, long)
Atomically adds a value to a long integer and returns the old value.
public long AtomicAdd(ref long target, long value)
Parameters
Returns
- long
The original value before the addition.
Remarks
CUDA Requirements: Native 64-bit atomics require CC 6.0+.
Translated to atomicAdd((unsigned long long*)&target, value) in CUDA.
AtomicAdd(ref float, float)
Atomically adds a value to a float and returns the old value.
public float AtomicAdd(ref float target, float value)
Parameters
Returns
- float
The original value before the addition.
Remarks
Translated to atomicAdd(&target, value) in CUDA.
AtomicAnd(ref int, int)
Atomically performs a bitwise AND operation and returns the old value.
public int AtomicAnd(ref int target, int value)
Parameters
Returns
- int
The original value before the operation.
Remarks
Translated to atomicAnd(&target, value) in CUDA.
AtomicCAS(ref int, int, int)
Atomically compares and swaps an integer value.
public int AtomicCAS(ref int target, int compare, int value)
Parameters
targetintReference to the target integer.
compareintThe value to compare against.
valueintThe value to store if comparison succeeds.
Returns
- int
The original value (allows caller to check if swap occurred).
Remarks
Translated to atomicCAS(&target, compare, value) in CUDA.
AtomicCAS(ref long, long, long)
Atomically compares and swaps a long integer value.
public long AtomicCAS(ref long target, long compare, long value)
Parameters
targetlongReference to the target long integer.
comparelongThe value to compare against.
valuelongThe value to store if comparison succeeds.
Returns
- long
The original value (allows caller to check if swap occurred).
Remarks
CUDA Requirements: Native 64-bit atomics require CC 6.0+.
Translated to atomicCAS((unsigned long long*)&target, compare, value) in CUDA.
AtomicExch(ref int, int)
Atomically exchanges an integer value and returns the old value.
public int AtomicExch(ref int target, int value)
Parameters
Returns
- int
The original value before the exchange.
Remarks
Translated to atomicExch(&target, value) in CUDA.
AtomicMax(ref int, int)
Atomically computes the maximum and stores it.
public int AtomicMax(ref int target, int value)
Parameters
Returns
- int
The original value before the operation.
Remarks
Translated to atomicMax(&target, value) in CUDA.
AtomicMin(ref int, int)
Atomically computes the minimum and stores it.
public int AtomicMin(ref int target, int value)
Parameters
Returns
- int
The original value before the operation.
Remarks
Translated to atomicMin(&target, value) in CUDA.
AtomicOr(ref int, int)
Atomically performs a bitwise OR operation and returns the old value.
public int AtomicOr(ref int target, int value)
Parameters
Returns
- int
The original value before the operation.
Remarks
Translated to atomicOr(&target, value) in CUDA.
AtomicSub(ref int, int)
Atomically subtracts a value from an integer and returns the old value.
public int AtomicSub(ref int target, int value)
Parameters
Returns
- int
The original value before the subtraction.
Remarks
Translated to atomicSub(&target, value) in CUDA.
AtomicXor(ref int, int)
Atomically performs a bitwise XOR operation and returns the old value.
public int AtomicXor(ref int target, int value)
Parameters
Returns
- int
The original value before the operation.
Remarks
Translated to atomicXor(&target, value) in CUDA.
EnqueueOutput(ReadOnlySpan<byte>)
Enqueues raw bytes to the ring kernel's output queue.
public bool EnqueueOutput(ReadOnlySpan<byte> data)
Parameters
dataReadOnlySpan<byte>The raw bytes to enqueue.
Returns
- bool
trueif the data was enqueued;falseif the output queue is full.
Remarks
Use this overload when you have pre-serialized data or need to send raw bytes.
EnqueueOutput<T>(T)
Enqueues a message to the ring kernel's output queue.
public bool EnqueueOutput<T>(T message) where T : struct
Parameters
messageTThe message to enqueue.
Returns
- bool
trueif the message was enqueued;falseif the output queue is full.
Type Parameters
TThe message type (must have [RingKernelMessage] or [MemoryPackable] attribute).
Examples
[RingKernel(KernelId = "Processor")]
public static void ProcessData(RingKernelContext ctx, InputMessage input)
{
var result = new OutputMessage { Value = input.Value * 2 };
ctx.EnqueueOutput(result);
}
Remarks
Serializes the message using MemoryPack and copies it to the output ring buffer. This is the primary method for producing output from a ring kernel.
Thread Safety: Uses atomic operations to safely enqueue from multiple threads.
GetPendingMessageCount(string)
Gets the number of pending messages from a specific kernel.
public int GetPendingMessageCount(string sourceKernelId)
Parameters
sourceKernelIdstringThe kernel ID to check.
Returns
- int
The number of messages waiting in the K2K queue.
NamedBarrier(int)
Synchronizes at a named barrier identified by integer ID.
public void NamedBarrier(int barrierId)
Parameters
barrierIdintThe numeric identifier of the barrier.
Remarks
More efficient than string-based barriers for hot paths.
Translated to __barrier_sync(barrierId) on CC 7.0+.
NamedBarrier(string)
Synchronizes at a named barrier across multiple kernels.
public void NamedBarrier(string barrierName)
Parameters
barrierNamestringThe name of the cross-kernel barrier.
Remarks
Named barriers enable synchronization between different ring kernels.
All kernels that declare this barrier in their NamedBarriers property
must reach the barrier before any can proceed.
Requirements: Barrier must be declared in [RingKernel] attribute.
Now()
Gets the current HLC timestamp without advancing the clock.
public HlcTimestamp Now()
Returns
- HlcTimestamp
The current hybrid logical clock timestamp.
Remarks
Reads the current GPU hardware timestamp (clock64()) combined with
the logical counter for causal ordering.
Resolution: ~1ns on CC 6.0+, ~1μs on older GPUs.
PublishToTopic<T>(string, T)
Publishes a message to a topic (delivered to all subscribers).
public bool PublishToTopic<T>(string topic, T message) where T : struct
Parameters
topicstringThe topic name to publish to.
messageTThe message to broadcast.
Returns
- bool
trueif published successfully.
Type Parameters
TThe message type.
Remarks
Broadcasts a message to all kernels subscribed to the topic. Messages are copied to each subscriber's topic queue.
Requirements: Topic must be declared in PublishesToTopics.
ReportError()
Increments the error counter.
public void ReportError()
Remarks
Translated to atomicAdd(&control_block->errors_encountered, 1) in CUDA.
RequestTermination()
Signals that the kernel should terminate after completing current message.
public void RequestTermination()
Remarks
Sets control_block->should_terminate = 1 in CUDA.
Use for graceful shutdown in response to a shutdown message.
SendToKernel<T>(string, T)
Sends a message to another ring kernel (actor-to-actor communication).
public bool SendToKernel<T>(string targetKernelId, T message) where T : struct
Parameters
targetKernelIdstringThe kernel ID to send to.
messageTThe message to send.
Returns
- bool
trueif the message was enqueued;falseif the queue is full.
Type Parameters
TThe message type (must have [RingKernelMessage] attribute).
Remarks
Enqueues a message to the target kernel's K2K input queue. The message is serialized and copied to GPU-resident shared memory.
Requirements: Target must be declared in PublishesToKernels.
SyncGrid()
Synchronizes all threads across the entire grid.
public void SyncGrid()
Remarks
Translated to cooperative_groups::grid_group::sync() in CUDA.
Latency: ~1-10μs depending on grid size.
Requirements: Compute Capability 6.0+, cooperative launch mode.
SyncThreads()
Synchronizes all threads within the current block.
public void SyncThreads()
Remarks
Translated to __syncthreads() in CUDA.
Latency: ~10ns on modern GPUs.
Caution: All threads in the block must reach this barrier or deadlock occurs.
SyncWarp(uint)
Synchronizes all threads within the current warp.
public void SyncWarp(uint mask = 4294967295)
Parameters
maskuintBitmask indicating which threads participate (default: all threads, 0xFFFFFFFF).
Remarks
Translated to __syncwarp(mask) in CUDA.
Latency: ~1ns (essentially free on most GPUs).
ThreadFence()
Issues a thread fence for device-scope memory ordering.
public void ThreadFence()
Remarks
Translated to __threadfence() in CUDA.
Ensures all memory writes before the fence are visible to all threads on the device before any writes after the fence.
ThreadFenceBlock()
Issues a thread fence for block-scope memory ordering.
public void ThreadFenceBlock()
Remarks
Translated to __threadfence_block() in CUDA.
Ensures all memory writes before the fence are visible to all threads in the same block before any writes after the fence.
ThreadFenceSystem()
Issues a thread fence for system-scope memory ordering.
public void ThreadFenceSystem()
Remarks
Translated to __threadfence_system() in CUDA.
Ensures all memory writes before the fence are visible to all threads on all devices and the host CPU before any writes after the fence.
Latency: Higher than device-scope fence (~100-1000ns).
Tick()
Advances the local HLC clock (tick operation for local events).
public void Tick()
Remarks
Increments the logical counter component of the HLC. Use when a local event occurs that should be ordered after previous events.
TryReceiveFromKernel<T>(string, out T)
Attempts to receive a message from another ring kernel.
public bool TryReceiveFromKernel<T>(string sourceKernelId, out T message) where T : struct
Parameters
sourceKernelIdstringThe kernel ID to receive from.
messageTThe received message if successful.
Returns
- bool
trueif a message was received;falseif the queue is empty.
Type Parameters
TThe message type to receive.
Remarks
Dequeues a message from the source kernel's K2K output queue. Non-blocking.
Requirements: Source must be declared in SubscribesToKernels.
TryReceiveFromTopic<T>(string, out T)
Attempts to receive a message from a subscribed topic.
public bool TryReceiveFromTopic<T>(string topic, out T message) where T : struct
Parameters
topicstringThe topic name to receive from.
messageTThe received message if successful.
Returns
- bool
trueif a message was received;falseif the queue is empty.
Type Parameters
TThe message type to receive.
Remarks
Requirements: Topic must be declared in SubscribesToTopics.
UpdateClock(HlcTimestamp)
Updates the local HLC from a received timestamp (merge operation).
public void UpdateClock(HlcTimestamp received)
Parameters
receivedHlcTimestampThe timestamp received from another kernel/host.
Remarks
Merges the received timestamp with the local clock to maintain causal ordering. The local clock is set to max(local, received) + 1.
WarpAll(bool, uint)
Returns true if all active threads have true predicate.
public bool WarpAll(bool predicate, uint mask = 4294967295)
Parameters
Returns
- bool
trueif all active threads have true predicate.
Remarks
Translated to __all_sync(mask, predicate) in CUDA.
WarpAny(bool, uint)
Returns true if any active thread has true predicate.
public bool WarpAny(bool predicate, uint mask = 4294967295)
Parameters
Returns
- bool
trueif any active thread has true predicate.
Remarks
Translated to __any_sync(mask, predicate) in CUDA.
WarpBallot(bool, uint)
Returns a ballot of threads where the predicate is true.
public uint WarpBallot(bool predicate, uint mask = 4294967295)
Parameters
Returns
- uint
Bitmask where bit i is set if thread i's predicate is true.
Remarks
Translated to __ballot_sync(mask, predicate) in CUDA.
WarpReduce(int, uint)
Performs a warp-wide reduction (sum).
public int WarpReduce(int value, uint mask = 4294967295)
Parameters
Returns
- int
The sum of all values in the warp (returned to all lanes).
WarpShuffle(int, int, uint)
Shuffles a value from another lane in the warp.
public int WarpShuffle(int value, int srcLane, uint mask = 4294967295)
Parameters
valueintThe value to share.
srcLaneintThe source lane to read from.
maskuintActive thread mask (default: all threads).
Returns
- int
The value from the source lane.
Remarks
Translated to __shfl_sync(mask, value, srcLane) in CUDA.
WarpShuffleDown(int, int, uint)
Shuffles a value from a lane with relative offset.
public int WarpShuffleDown(int value, int delta, uint mask = 4294967295)
Parameters
valueintThe value to share.
deltaintThe offset from current lane.
maskuintActive thread mask (default: all threads).
Returns
- int
The value from lane (currentLane + delta) % 32.
Remarks
Translated to __shfl_down_sync(mask, value, delta) in CUDA.