Class RingKernelAttribute

Namespace
DotCompute.Abstractions.Attributes
Assembly
DotCompute.Abstractions.dll

Marks a method as a ring kernel: a persistent GPU kernel with message-passing capabilities. Ring kernels stay resident on the GPU and can be activated and deactivated dynamically, which enables GPU-native actor programming models and complex communication patterns.

[AttributeUsage(AttributeTargets.Method, AllowMultiple = false, Inherited = false)]
public sealed class RingKernelAttribute : Attribute

Inheritance
RingKernelAttribute

Examples

[RingKernel(
    KernelId = "graph-worker",
    Domain = RingKernelDomain.GraphAnalytics,
    MessagingStrategy = MessagePassingStrategy.SharedMemory,
    Capacity = 1024)]
public static void GraphWorker(
    MessageQueue<VertexMessage> input,
    MessageQueue<VertexMessage> output,
    Span<float> vertexData)
{
    int vertexId = Kernel.ThreadId.X;

    while (input.TryDequeue(out var msg))
    {
        // Process message and update vertex data
        vertexData[vertexId] += msg.Value;

        // Send result to neighbors
        output.Enqueue(new VertexMessage { TargetVertex = msg.Sender, Value = vertexData[vertexId] });
    }
}

Remarks

Ring kernels differ from standard kernels in several key ways:

  • They remain persistent on the GPU after launch
  • They support message passing between kernels via lock-free queues
  • They can be activated/deactivated without termination
  • They enable GPU-native actor systems and complex algorithms

Ring kernels are ideal for:

  • Graph analytics (vertex-centric message passing)
  • Spatial simulations (halo exchange between blocks)
  • Actor model implementations (mailbox-based communication)
  • Streaming pipelines (producer-consumer patterns)

Properties

Backends

Gets or sets the target backends for this ring kernel. Multiple backends can be specified using bitwise OR flags.

public KernelBackends Backends { get; set; }

Property Value

KernelBackends

The supported backends as flags. Defaults to CUDA | OpenCL | Metal. The CPU backend can be added for testing and simulation.

Remarks

Backend selection affects:

  • Code generation strategies (each backend has different capabilities)
  • Performance characteristics (persistent kernels are optimal on GPU)
  • Availability (some backends may not support all features)

The runtime will select the best available backend from the specified options.
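As a rough illustration of combining backend flags, the fragment below restricts a kernel to CUDA and adds the CPU backend for testing. It is a sketch rather than a complete program: it assumes the DotCompute types are in scope, that the flag members are named CUDA and CPU as in the text above, and it reuses the parameter shape and the VertexMessage type from the class-level example. The kernel ID and method name are hypothetical.

```csharp
// Fragment only: requires the DotCompute assemblies and their namespaces.
// Flag member names (CUDA, CPU) are assumed from the property description.
[RingKernel(
    KernelId = "sim-worker", // hypothetical ID
    Backends = KernelBackends.CUDA | KernelBackends.CPU)]
public static void SimWorker(
    MessageQueue<VertexMessage> input,
    MessageQueue<VertexMessage> output)
{
    // Forward each message unchanged; a real kernel would do work here.
    while (input.TryDequeue(out var msg))
    {
        output.Enqueue(msg);
    }
}
```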

BarrierCapacity

Gets or sets the expected number of threads participating in barrier synchronization. Default is 0 (automatic based on block size).

public int BarrierCapacity { get; set; }

Property Value

int

BarrierScope

Gets or sets the synchronization scope for barriers used in this ring kernel. Default is ThreadBlock.

public BarrierScope BarrierScope { get; set; }

Property Value

BarrierScope

Capacity

Gets or sets the maximum number of work items in the ring buffer. This determines how many concurrent work items can be processed by the kernel.

public int Capacity { get; set; }

Property Value

int

The ring buffer capacity. Defaults to 1024 work items.

Remarks

Capacity affects:

  • GPU memory usage (larger capacity requires more memory)
  • Throughput (larger capacity may improve throughput for bursty workloads)
  • Latency (smaller capacity may reduce latency for steady workloads)

Choose capacity based on expected workload characteristics and available GPU memory.

Domain

Gets or sets domain-specific optimization hints for this ring kernel. Helps the compiler apply appropriate optimizations for the algorithm type.

public RingKernelDomain Domain { get; set; }

Property Value

RingKernelDomain

The kernel domain. Defaults to General.

Remarks

General:

  • No specific domain optimizations
  • Use for custom algorithms

GraphAnalytics:

  • Vertex-centric message passing patterns (Pregel model)
  • Optimized for sparse data structures
  • Best for PageRank, BFS, shortest paths

SpatialSimulation:

  • Stencil operations on regular grids
  • Optimized for halo exchange patterns
  • Best for wave propagation, heat transfer, fluid dynamics

ActorModel:

  • Mailbox-based message passing
  • Optimized for message routing and supervision
  • Best for concurrent systems, simulations

EnableCausalOrdering

Gets or sets whether to enable causal memory ordering (release-acquire semantics). Default is true for ring kernels (unlike regular kernels).

public bool EnableCausalOrdering { get; set; }

Property Value

bool

EnableTimestamps

Gets or sets whether to enable GPU hardware timestamp tracking for temporal consistency. Default is false.

public bool EnableTimestamps { get; set; }

Property Value

bool

true to capture GPU timestamps for temporal actor systems and causal ordering; false to disable timestamp tracking.

Remarks

When enabled, the kernel captures GPU hardware timestamps (via clock64()) for:

  • Hybrid Logical Clock (HLC) implementation
  • Vector clock synchronization across actors
  • Temporal pattern detection in message streams
  • Causal consistency validation

Timestamp Resolution: On GPUs with CUDA compute capability (CC) 6.0 or later, timestamps have ~1 ns resolution. Older GPUs may have coarser granularity. The Metal and OpenCL backends support timestamps via platform-specific APIs.

Performance Impact: Timestamp capture adds ~2-5ns overhead per message. For temporal actor systems, this overhead is typically acceptable and enables powerful causal ordering guarantees.
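To make the Hybrid Logical Clock use case more concrete, here is a minimal host-side sketch of the standard HLC update rules. It is not the code DotCompute generates; HlcTimestamp, HybridLogicalClock, and HlcDemo are hypothetical names, and the physical time is passed in as a plain number where a kernel would use a captured GPU timestamp.

```csharp
using System;

// Host-side sketch of Hybrid Logical Clock (HLC) update rules.
// All type and member names here are illustrative, not DotCompute APIs.
public readonly struct HlcTimestamp
{
    public HlcTimestamp(long physical, int logical)
    {
        Physical = physical;
        Logical = logical;
    }

    public long Physical { get; }
    public int Logical { get; }

    public override string ToString() => $"({Physical}, {Logical})";
}

public sealed class HybridLogicalClock
{
    private long _physical; // largest physical time seen so far
    private int _logical;   // tie-breaker when physical time does not advance

    // Local or send event: advance using the current physical reading.
    public HlcTimestamp Tick(long physicalNow)
    {
        long previous = _physical;
        _physical = Math.Max(previous, physicalNow);
        _logical = _physical == previous ? _logical + 1 : 0;
        return new HlcTimestamp(_physical, _logical);
    }

    // Receive event: merge the sender's timestamp with the local clock.
    public HlcTimestamp Receive(HlcTimestamp remote, long physicalNow)
    {
        long previous = _physical;
        _physical = Math.Max(Math.Max(previous, remote.Physical), physicalNow);

        if (_physical == previous && _physical == remote.Physical)
            _logical = Math.Max(_logical, remote.Logical) + 1;
        else if (_physical == previous)
            _logical += 1;
        else if (_physical == remote.Physical)
            _logical = remote.Logical + 1;
        else
            _logical = 0;

        return new HlcTimestamp(_physical, _logical);
    }
}

public static class HlcDemo
{
    public static void Main()
    {
        var clock = new HybridLogicalClock();
        Console.WriteLine(clock.Tick(100)); // (100, 0)
        Console.WriteLine(clock.Tick(100)); // (100, 1): same physical time, counter advances
        Console.WriteLine(clock.Receive(new HlcTimestamp(200, 5), 150)); // (200, 6)
        Console.WriteLine(clock.Tick(300)); // (300, 0): physical time moved ahead
    }
}
```

The logical counter only grows while the physical component stands still, which is why a fine-grained hardware timestamp keeps the counter small in practice.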

EventDrivenMaxIterations

Gets or sets the maximum number of loop iterations for EventDriven mode kernels. Default is 1000. After this many iterations, the kernel exits and can be relaunched.

public int EventDrivenMaxIterations { get; set; }

Property Value

int

The maximum number of dispatch loop iterations before kernel exits. Only applies when Mode is EventDriven.

Remarks

In WSL2, persistent kernels block CUDA API calls from the host, making control block updates impossible. EventDriven mode with a finite iteration count allows the kernel to exit periodically, enabling the host to update the control block and relaunch.

Trade-offs:

  • Lower values: More responsive to control changes, higher launch overhead
  • Higher values: Lower launch overhead, less responsive to control changes

Typical Values:

  • 100-500: High responsiveness (interactive applications)
  • 1000-5000: Balanced (typical workloads)
  • 10000+: Maximum throughput (batch processing)
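A declaration along the following lines would select EventDriven mode with a lower iteration cap for an interactive workload. This is a fragment under assumptions: the DotCompute types must be in scope, the kernel ID and method name are hypothetical, and the parameter shape and VertexMessage type are borrowed from the class-level example.

```csharp
// Fragment only: requires the DotCompute assemblies and their namespaces.
[RingKernel(
    KernelId = "interactive-worker",      // hypothetical ID
    Mode = RingKernelMode.EventDriven,
    EventDrivenMaxIterations = 500)]      // exit after 500 loop iterations so the host can relaunch
public static void InteractiveWorker(
    MessageQueue<VertexMessage> input,
    MessageQueue<VertexMessage> output)
{
    while (input.TryDequeue(out var msg))
    {
        output.Enqueue(msg);
    }
}
```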

InputMessageType

Gets or sets the input message type for this kernel.

public Type? InputMessageType { get; set; }

Property Value

Type

The type of input messages this kernel processes, or null for auto-detection.

Remarks

When specified, the kernel method should have a parameter of this type. The code generator will create serialization code for this type.

When null, the generator auto-detects the input type from the method's first parameter.

InputQueueSize

Gets or sets the input queue size for incoming messages. Controls the buffer size for messages sent to this kernel from other kernels or the host.

public int InputQueueSize { get; set; }

Property Value

int

The input queue size in number of messages. Defaults to 256 messages.

Remarks

A larger input queue allows more messages to be buffered, reducing the chance of dropped messages in high-throughput scenarios, but increases memory usage.

KernelId

Gets or sets the unique identifier for this ring kernel. Must be unique within the application to enable runtime management and message routing.

public string KernelId { get; set; }

Property Value

string

The kernel identifier string. Defaults to an empty string if not specified.

Remarks

The kernel ID is used for:

  • Launching and managing kernel instances at runtime
  • Routing messages between kernels
  • Collecting metrics and profiling data

MaxInputMessageSizeBytes

Gets or sets the maximum input message size in bytes. This configures the buffer size allocated for each individual input message.

public int MaxInputMessageSizeBytes { get; set; }

Property Value

int

The maximum size of a single input message. Defaults to 65792 bytes (64KB + 256-byte header).

Remarks

This value must be large enough to accommodate the largest serialized message that will be sent to this kernel. If a message exceeds this size, it will be truncated or rejected.

The default value of 65792 bytes (65536 + 256) is designed to handle:

  • 256-byte message header (routing, timestamp, metadata)
  • 64KB payload (MemoryPack serialized data)

Memory Impact: Total input queue memory = InputQueueSize × MaxInputMessageSizeBytes

Performance: Larger buffers use more GPU memory but prevent message truncation. Match this value to your actual message size for optimal memory usage.
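The memory formula above can be checked with a small stand-alone calculation. The helper below is hypothetical and not part of DotCompute; it also assumes that the 256-byte header counts toward MaxInputMessageSizeBytes when you pick a custom size, as the default value suggests.

```csharp
using System;

public static class QueueMemoryEstimate
{
    // Total bytes for one queue: slot count times per-slot buffer size.
    public static long QueueBytes(int queueSize, int maxMessageSizeBytes)
        => (long)queueSize * maxMessageSizeBytes;

    public static void Main()
    {
        const int headerBytes = 256;
        const int defaultPayloadBytes = 64 * 1024;
        int defaultSlotBytes = headerBytes + defaultPayloadBytes; // 65792

        // Defaults: 256 slots of 65792 bytes each.
        Console.WriteLine(QueueBytes(256, defaultSlotBytes)); // 16842752

        // Same queue depth with a 1 KB payload budget per message.
        Console.WriteLine(QueueBytes(256, headerBytes + 1024)); // 327680
    }
}
```

With the defaults, a single input queue takes roughly 16 MiB, so trimming the per-message size matters when many kernels are resident at once.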

MaxMessagesPerIteration

Gets or sets the maximum number of messages processed per dispatch loop iteration. Default is 0 (unlimited - process all available messages).

public int MaxMessagesPerIteration { get; set; }

Property Value

int

The maximum messages per iteration, or 0 for unlimited processing.

Remarks

Limiting messages per iteration ensures fairness when multiple ring kernels share GPU resources. Without a limit, high-volume actors can starve lower-volume actors by monopolizing execution time.

Fairness Patterns:

  • 16 - Bounded execution time, good for mixed workloads
  • 32 - Balance between fairness and efficiency
  • 0 - No limit, process entire queue (can cause starvation)

Performance Impact: Setting a limit adds a counter check per iteration (~1-2 cycles overhead). The fairness benefit typically outweighs this cost in multi-actor systems.

Interaction with ProcessingMode:

  • Continuous: Iteration limit applies to single-message iterations
  • Batch: Iteration limit × batch size = total messages processed
  • Adaptive: Limit applies to both continuous and batch phases
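The effect of the limit can be pictured with a host-side simulation. The sketch below is not the generated dispatch loop; FairDispatchSketch and DrainOnce are hypothetical names, and an ordinary Queue<int> stands in for a kernel's input queue.

```csharp
using System;
using System.Collections.Generic;

public static class FairDispatchSketch
{
    // Processes up to maxMessagesPerIteration items; 0 means no limit.
    // Returns the number of messages handled in this iteration.
    public static int DrainOnce(Queue<int> queue, int maxMessagesPerIteration)
    {
        int processed = 0;
        while (queue.Count > 0 &&
               (maxMessagesPerIteration == 0 || processed < maxMessagesPerIteration))
        {
            queue.Dequeue(); // stand-in for handling one message
            processed++;
        }
        return processed;
    }

    public static void Main()
    {
        var busy = new Queue<int>();
        for (int i = 0; i < 100; i++) busy.Enqueue(i);

        var quiet = new Queue<int>();
        for (int i = 0; i < 3; i++) quiet.Enqueue(i);

        // With a limit of 16, the busy actor yields after 16 messages,
        // so the quiet actor gets its turn while 84 messages still wait.
        Console.WriteLine(DrainOnce(busy, 16));  // 16
        Console.WriteLine(DrainOnce(quiet, 16)); // 3
        Console.WriteLine(busy.Count);           // 84

        // With no limit, one iteration empties the whole queue.
        Console.WriteLine(DrainOnce(busy, 0));   // 84
    }
}
```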

MaxOutputMessageSizeBytes

Gets or sets the maximum output message size in bytes. This configures the buffer size allocated for each individual output message.

public int MaxOutputMessageSizeBytes { get; set; }

Property Value

int

The maximum size of a single output message. Defaults to 65792 bytes (64KB + 256-byte header).

Remarks

This value must be large enough to accommodate the largest serialized message that will be sent from this kernel. If a message exceeds this size, it will be truncated or rejected.

The default value of 65792 bytes (65536 + 256) is designed to handle:

  • 256-byte message header (routing, timestamp, metadata)
  • 64KB payload (MemoryPack serialized data)

Memory Impact: Total output queue memory = OutputQueueSize × MaxOutputMessageSizeBytes

Performance: Larger buffers use more GPU memory but prevent message truncation. Match this value to your actual message size for optimal memory usage.

MemoryConsistency

Gets or sets the memory consistency model for this ring kernel's memory operations. Default is ReleaseAcquire (recommended for message passing).

public MemoryConsistencyModel MemoryConsistency { get; set; }

Property Value

MemoryConsistencyModel

MessageQueueSize

Gets or sets a unified message queue size that overrides both InputQueueSize and OutputQueueSize. Default is 0 (use InputQueueSize/OutputQueueSize separately).

public int MessageQueueSize { get; set; }

Property Value

int

The unified queue size for both input and output queues, or 0 to use separate sizes. Must be a power of 2 if non-zero.

Remarks

This property provides a convenient way to set both input and output queue sizes to the same value for symmetric message passing patterns. When set to a non-zero value, it overrides both InputQueueSize and OutputQueueSize.

Usage Examples:

  • 4096 - High-volume actor with symmetric send/receive
  • 8192 - Very high throughput processing
  • 0 - Use InputQueueSize/OutputQueueSize for asymmetric patterns

Memory Impact: Total queue memory = MessageQueueSize × (MaxInputMessageSizeBytes + MaxOutputMessageSizeBytes)
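Both the power-of-2 requirement and the memory formula are easy to check before applying the attribute. The helper below is a hypothetical stand-alone sketch, not a DotCompute API; it uses the documented default message sizes of 65792 bytes.

```csharp
using System;

public static class UnifiedQueueSizing
{
    // Valid values are 0 (use separate sizes) or a positive power of two.
    public static bool IsValidMessageQueueSize(int size)
        => size == 0 || (size > 0 && (size & (size - 1)) == 0);

    // MessageQueueSize x (MaxInputMessageSizeBytes + MaxOutputMessageSizeBytes)
    public static long TotalQueueBytes(int messageQueueSize, int maxInputBytes, int maxOutputBytes)
        => (long)messageQueueSize * ((long)maxInputBytes + maxOutputBytes);

    public static void Main()
    {
        Console.WriteLine(IsValidMessageQueueSize(4096)); // True
        Console.WriteLine(IsValidMessageQueueSize(5000)); // False

        // 4096 slots with default 65792-byte input and output messages.
        Console.WriteLine(TotalQueueBytes(4096, 65792, 65792)); // 538968064
    }
}
```

At the default message sizes, a unified queue size of 4096 already reserves about 514 MiB, so large queue sizes are usually paired with smaller per-message limits.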

MessagingStrategy

Gets or sets the message passing strategy used by this ring kernel. Different strategies offer different trade-offs in performance, complexity, and hardware support.

public MessagePassingStrategy MessagingStrategy { get; set; }

Property Value

MessagePassingStrategy

The messaging strategy. Defaults to SharedMemory.

Remarks

SharedMemory:

  • Lock-free ring buffers in GPU shared memory
  • Lowest latency for intra-block communication
  • Supported on all GPU backends

AtomicQueue:

  • Atomic operations with exponential backoff
  • Better for highly contended scenarios
  • More robust under heavy load

P2P:

  • Peer-to-peer GPU memory transfers (CUDA only)
  • Direct GPU-to-GPU communication without CPU involvement
  • Best for multi-GPU systems

NCCL:

  • NVIDIA Collective Communications Library (CUDA only)
  • Optimized for collective operations across multiple GPUs
  • Best for distributed training and large-scale simulations

Mode

Gets or sets the execution mode for this ring kernel. Determines whether the kernel runs continuously or activates on demand.

public RingKernelMode Mode { get; set; }

Property Value

RingKernelMode

The execution mode. Defaults to Persistent.

Remarks

Persistent mode:

  • Kernel runs continuously in a loop until termination
  • Lower latency for message processing
  • Higher GPU utilization
  • Best for streaming and real-time workloads

EventDriven mode:

  • Kernel activates only when messages arrive or events occur
  • Lower power consumption when idle
  • Better GPU sharing with other workloads
  • Best for bursty or sporadic workloads

NamedBarriers

Gets or sets named barriers that this kernel participates in.

public string[] NamedBarriers { get; set; }

Property Value

string[]

Array of named barrier identifiers this kernel uses, or empty for none.

Remarks

Named barriers enable synchronization across multiple kernels. All kernels participating in a named barrier must reach the barrier before any can proceed.

Use RingKernelContext.NamedBarrier(name) to synchronize at a named barrier.
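A kernel that joins a named barrier might be declared roughly as follows. This fragment is a sketch under assumptions: the DotCompute types must be in scope, and StepRequest, StepResponse, ComputeStep, the kernel ID, and the barrier name are all hypothetical. Whether UseBarriers must also be set is not stated here, so the fragment leaves it at its default.

```csharp
// Fragment only: requires the DotCompute assemblies and their namespaces.
[RingKernel(
    KernelId = "phase-worker",                 // hypothetical ID
    NamedBarriers = new[] { "phase-sync" })]   // hypothetical barrier name
public static StepResponse Process(StepRequest req, RingKernelContext ctx)
{
    var partial = ComputeStep(req); // hypothetical helper

    // Wait until every kernel participating in "phase-sync" reaches this point.
    ctx.NamedBarrier("phase-sync");

    return new StepResponse(partial);
}
```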

OutputMessageType

Gets or sets the output message type for this kernel.

public Type? OutputMessageType { get; set; }

Property Value

Type

The type of output messages this kernel produces, or null for auto-detection.

Remarks

When specified, the kernel method should return this type (or void for fire-and-forget). The code generator will create serialization code for this type.

When null, the generator auto-detects the output type from the method's return type.

OutputQueueSize

Gets or sets the output queue size for outgoing messages. Controls the buffer size for messages sent from this kernel to other kernels or the host.

public int OutputQueueSize { get; set; }

Property Value

int

The output queue size in number of messages. Defaults to 256 messages.

Remarks

A larger output queue allows more messages to be buffered, reducing backpressure in producer-consumer scenarios, but increases memory usage.

ProcessingMode

Gets or sets how the ring kernel processes messages from its input queue. Default is Continuous (single message per iteration for minimum latency).

public RingProcessingMode ProcessingMode { get; set; }

Property Value

RingProcessingMode

The processing mode: Continuous, Batch, or Adaptive.

Remarks

Processing mode affects the trade-off between latency and throughput:

Continuous Mode: Process one message per iteration.

  • Lowest latency (~100-500ns per message)
  • Best for latency-critical actor request-response
  • Lower peak throughput due to dispatch overhead

Batch Mode: Process multiple messages per iteration.

  • Highest throughput (amortizes dispatch overhead)
  • Best for high-volume data processing pipelines
  • Higher latency for individual messages

Adaptive Mode: Switch between Continuous and Batch based on queue depth.

  • Low latency when queue is shallow
  • High throughput when queue is deep
  • Recommended for variable workloads

PublishesToKernels

Gets or sets the kernel IDs that this kernel publishes to (sends messages to).

public string[] PublishesToKernels { get; set; }

Property Value

string[]

Array of kernel IDs for kernel-to-kernel message sending, or empty for none.

Remarks

Enables actor-to-actor communication patterns. When specified, this kernel can send messages to the listed kernels using RingKernelContext.SendToKernel.

The runtime validates that target kernels exist and allocates K2K message queues.

[RingKernel(
    KernelId = "Producer",
    PublishesToKernels = new[] { "Consumer", "Logger" })]
public static ProduceResponse Process(ProduceRequest req, RingKernelContext ctx)
{
    var result = ComputeResult(req);

    // Send to Consumer kernel (K2K messaging)
    ctx.SendToKernel("Consumer", new ConsumeMessage(result));

    // Also log to Logger kernel
    ctx.SendToKernel("Logger", new LogMessage($"Produced: {result}"));

    return new ProduceResponse(result);
}

PublishesToTopics

Gets or sets the topics this kernel can publish to.

public string[] PublishesToTopics { get; set; }

Property Value

string[]

Array of topic names for pub/sub publishing, or empty for none.

Remarks

Enables publish-subscribe communication patterns. Use RingKernelContext.PublishToTopic to send messages to all subscribers.

SubscribesToKernels

Gets or sets the kernel IDs that this kernel subscribes to (receives messages from).

public string[] SubscribesToKernels { get; set; }

Property Value

string[]

Array of kernel IDs for kernel-to-kernel message reception, or empty for none.

Remarks

Enables actor-to-actor communication patterns. When specified, this kernel can receive messages from the listed kernels using RingKernelContext.TryReceiveFromKernel.

The runtime allocates K2K message queues for each subscription.

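[RingKernel(
    KernelId = "Aggregator",
    SubscribesToKernels = new[] { "Worker1", "Worker2", "Worker3" })]
public static AggregateResponse Process(AggregateRequest req, RingKernelContext ctx)
{
    int received = 0;

    // Drain pending results from Worker1; repeat for Worker2 and Worker3 as needed
    while (ctx.TryReceiveFromKernel<WorkerResult>("Worker1", out var result))
    {
        // Process result from Worker1
        received++;
    }

    // The method must return a value; this constructor shape is illustrative only
    return new AggregateResponse(received);
}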

SubscribesToTopics

Gets or sets the topics this kernel subscribes to for pub/sub messaging.

public string[] SubscribesToTopics { get; set; }

Property Value

string[]

Array of topic names for pub/sub subscription, or empty for none.

Remarks

Enables publish-subscribe communication patterns. Messages published to a topic are delivered to all kernels subscribed to that topic.

Topics are useful for broadcast scenarios where multiple kernels need to receive the same message (e.g., configuration updates, heartbeats).
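A broadcast setup could look roughly like the fragment below, with one kernel publishing configuration updates and another subscribing to them. It is a sketch under assumptions: the DotCompute types must be in scope, the argument shape of PublishToTopic is assumed by analogy with SendToKernel, and every message type, kernel ID, and topic name is hypothetical. How a subscriber reads topic messages is not described on this page, so the subscriber body is left as a placeholder.

```csharp
// Fragment only: requires the DotCompute assemblies and their namespaces.
[RingKernel(
    KernelId = "ConfigSource",                     // hypothetical ID
    PublishesToTopics = new[] { "config-updates" })]
public static ConfigAck Publish(ConfigChange req, RingKernelContext ctx)
{
    // Assumed call shape: topic name followed by the message to broadcast.
    ctx.PublishToTopic("config-updates", new ConfigUpdate(req));
    return new ConfigAck();
}

[RingKernel(
    KernelId = "ConfigListener",                   // hypothetical ID
    SubscribesToTopics = new[] { "config-updates" })]
public static ListenerAck Listen(ListenerRequest req, RingKernelContext ctx)
{
    // Messages published to "config-updates" are delivered to this kernel.
    return new ListenerAck();
}
```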

UseBarriers

Gets or sets whether this ring kernel uses GPU thread barriers for synchronization. Default is false.

public bool UseBarriers { get; set; }

Property Value

bool