Architecture Overview

Current Status

RingKernel is under active development. The core runtime, CPU backend, CUDA backend, and WebGPU backend are functional with verified GPU execution.

Working today:

  • Runtime creation and kernel lifecycle management
  • CPU backend (fully functional)
  • CUDA backend (verified with real PTX kernels, ~93B elements/sec)
  • WebGPU backend (cross-platform via wgpu)
  • Message passing infrastructure (queues, serialization, HLC timestamps)
  • Pub/Sub messaging with topic wildcards
  • K2K (kernel-to-kernel) direct messaging
  • Telemetry and metrics collection
  • Rust-to-CUDA transpiler (ringkernel-cuda-codegen)
  • Rust-to-WGSL transpiler (ringkernel-wgpu-codegen)
  • 20+ working examples
  • 5 showcase applications: WaveSim, WaveSim3D, TxMon, AccNet, ProcInt
  • 520+ tests across the workspace

In progress:

  • Metal backend (scaffolded)

DotCompute Ring Kernel Architecture

The Ring Kernel system implements a GPU-native actor model with persistent state: each kernel runs as a long-lived, GPU-resident actor that processes messages from lock-free queues. It is a Rust port of DotCompute's Ring Kernel system.

Component Mapping

DotCompute Component     Rust Equivalent                  Purpose
IRingKernelRuntime       struct RingKernel                Runtime and kernel lifecycle
IRingKernelMessage       trait RingMessage                Type-safe message protocol
IMessageQueue<T>         trait MessageQueue<T>            Lock-free ring buffer
RingKernelContext        struct RingContext               GPU intrinsics facade
RingKernelControlBlock   #[repr(C)] struct ControlBlock   GPU-resident state (128 bytes)
HlcTimestamp             struct HlcTimestamp              Hybrid Logical Clock
MemoryPackSerializer     rkyv / zerocopy derive           Zero-copy serialization
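
As a hedged illustration of the serialization row, a message type can derive rkyv's traits for zero-copy archiving. The derives below are standard rkyv; the message struct itself is hypothetical, not taken from the crate:

use rkyv::{Archive, Deserialize, Serialize};

// Hypothetical request message; fields are illustrative only. rkyv archives
// this without an intermediate copy, filling the MemoryPackSerializer role
// from DotCompute.
#[derive(Archive, Serialize, Deserialize, Debug)]
pub struct VectorAddRequest {
    pub lhs: Vec<f32>,
    pub rhs: Vec<f32>,
}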

System Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────┐
│                         HOST (CPU) SIDE                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌─────────────────┐    ┌──────────────────┐    ┌───────────────────┐   │
│  │  Application    │───▶│ RingKernelRuntime│───▶│  Message Bridge   │   │
│  │  (async/await)  │    │  (lifecycle mgmt)│    │  (Host↔GPU DMA)   │   │
│  └─────────────────┘    └──────────────────┘    └─────────┬─────────┘   │
│                                │                          │              │
│                         ┌──────┴──────┐                   │              │
│                         ▼             ▼                   ▼              │
│                 ┌───────────┐  ┌───────────┐    ┌─────────────────┐     │
│                 │  Launch   │  │ Terminate │    │  Serialization  │     │
│                 │  Options  │  │  Handler  │    │  (rkyv/zerocopy)│     │
│                 └───────────┘  └───────────┘    └─────────────────┘     │
│                                                                          │
├──────────────────────────────────PCIe────────────────────────────────────┤
│                                                                          │
│                         DEVICE (GPU) SIDE                                │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │                    PERSISTENT KERNEL                             │    │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │    │
│  │  │ Control     │  │ Input Queue │  │ Message Handler         │  │    │
│  │  │ Block       │  │ (lock-free) │  │ (user-defined logic)    │  │    │
│  │  │ (128 bytes) │  │             │  │                         │  │    │
│  │  │ - is_active │  │ head ──────▶│  │ process(ctx, msg) {     │  │    │
│  │  │ - terminate │  │ tail ◀──────│  │   ctx.sync_threads();   │  │    │
│  │  │ - msg_count │  │ buffer[]    │  │   // GPU computation    │  │    │
│  │  │ - errors    │  │             │  │   ctx.enqueue_output(); │  │    │
│  │  └─────────────┘  └─────────────┘  │ }                       │  │    │
│  │                                     └─────────────────────────┘  │    │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │    │
│  │  │ Telemetry   │  │Output Queue │  │ K2K Messaging           │  │    │
│  │  │ Buffer      │  │ (lock-free) │  │ (kernel-to-kernel)      │  │    │
│  │  │ (64 bytes)  │  │             │  │                         │  │    │
│  │  │ - processed │  │ head ◀──────│  │ send_to_kernel("other") │  │    │
│  │  │ - latency   │  │ tail ──────▶│  │ recv_from_kernel()      │  │    │
│  │  │ - errors    │  │ buffer[]    │  │                         │  │    │
│  │  └─────────────┘  └─────────────┘  └─────────────────────────┘  │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Kernel Lifecycle State Machine

Kernels follow a deterministic state machine. By default, kernels auto-activate on launch.

        ┌──────────┐
        │ Launched │
        └────┬─────┘
             │ activate()
             ▼
        ┌──────────┐
   ┌────│  Active  │────┐
   │    └────┬─────┘    │
   │         │          │ deactivate() / suspend()
   │         │          ▼
   │         │    ┌────────────┐
   │         │    │Deactivated │
   │         │    └─────┬──────┘
   │         │          │
   │         │ terminate()
   │         ▼          │
   │    ┌──────────┐    │
   └───▶│Terminated│◀───┘
        └──────────┘

API Usage

// Launch with auto-activation (default)
let kernel = runtime.launch("processor", LaunchOptions::default()).await?;
assert!(kernel.is_active());

// Launch without auto-activation
let kernel = runtime.launch("processor",
    LaunchOptions::default().without_auto_activate()
).await?;
kernel.activate().await?;

// Suspend and resume
kernel.suspend().await?;  // alias for deactivate()
kernel.resume().await?;   // alias for activate()

// Check state
println!("State: {:?}", kernel.state());
println!("Active: {}", kernel.is_active());

// Clean shutdown
kernel.terminate().await?;

Message Flow

Host → GPU (Input)

1. Application calls kernel.send(message)
2. Message serialized via rkyv (zero-copy)
3. Bridge copies to pinned host buffer
4. DMA transfer to GPU input queue
5. Kernel dequeues and processes
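
A minimal host-side sketch of this path, assuming send() is async and fallible like the lifecycle calls above, and reusing the hypothetical VectorAddRequest:

// Steps 2-4 (serialize, pin, DMA) happen behind send(); the await completes
// once the message has been handed to the bridge for transfer.
kernel.send(VectorAddRequest {
    lhs: vec![1.0; 1024],
    rhs: vec![2.0; 1024],
}).await?;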

GPU → Host (Output)

1. Kernel calls ctx.enqueue_output(response)
2. Message written to GPU output queue
3. Bridge polls/DMA copies to host
4. Deserialized via rkyv
5. Future resolved, application receives response
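
On the host this surfaces as an awaitable response. The receive() name and typed return below are assumptions based on step 5, not verbatim API:

// Hypothetical typed receive; resolves once the bridge has polled the GPU
// output queue and rkyv has deserialized the payload.
let response: VectorAddResponse = kernel.receive().await?;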

GPU → GPU (K2K Messaging)

1. Kernel A calls ctx.send_to_kernel("B", msg)
2. Routing table lookup (O(1) hash)
3. Message copied to Kernel B's K2K queue
4. Kernel B calls ctx.try_receive_from_kernel("A")
5. Direct GPU memory access (no PCIe)
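
A sketch of the kernel-side handler using these calls; the process signature follows the shape in the architecture diagram, and the concrete types are assumptions:

// Runs on the GPU inside the persistent kernel loop.
fn process(ctx: &mut RingContext, msg: &Request) {
    // Step 1: route to kernel "B" via the O(1) routing table.
    ctx.send_to_kernel("B", msg);

    // Step 4: non-blocking poll of this kernel's K2K queue.
    if let Some(reply) = ctx.try_receive_from_kernel("A") {
        // Step 5: reply arrived over device memory, no PCIe round trip;
        // forward it to the host via the normal output path.
        ctx.enqueue_output(reply);
    }
}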

Memory Layout Requirements

Control Block (128 bytes, cache-line aligned)

use std::sync::atomic::{AtomicI32, AtomicI64};

#[repr(C, align(128))]
pub struct ControlBlock {
    // Flags (offset 0-15)
    pub is_active: AtomicI32,           // 0: inactive, 1: processing
    pub should_terminate: AtomicI32,    // 0: run, 1: shutdown
    pub has_terminated: AtomicI32,      // 0: running, 1: done, 2: relaunchable
    pub errors_encountered: AtomicI32,

    // Counters (offset 16-31)
    pub messages_processed: AtomicI64,
    pub last_activity_ticks: AtomicI64,

    // Input queue descriptors (offset 32-63)
    pub input_queue_head_ptr: u64,      // Device pointer
    pub input_queue_tail_ptr: u64,
    pub input_queue_buffer_ptr: u64,
    pub input_queue_capacity: i32,
    pub input_queue_message_size: i32,

    // Output queue descriptors (offset 64-95)
    pub output_queue_head_ptr: u64,
    pub output_queue_tail_ptr: u64,
    pub output_queue_buffer_ptr: u64,
    pub output_queue_capacity: i32,
    pub output_queue_message_size: i32,

    // Reserved (offset 96-127)
    _reserved: [u64; 4],
}
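
Because the GPU addresses these fields by byte offset, the 128-byte contract is worth pinning at compile time. A standard Rust idiom, not crate code:

// Fail the build if the layout ever drifts from the documented 128 bytes.
const _: () = assert!(core::mem::size_of::<ControlBlock>() == 128);
const _: () = assert!(core::mem::align_of::<ControlBlock>() == 128);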

Telemetry Buffer (64 bytes, cache-line aligned)

use std::sync::atomic::{AtomicI32, AtomicI64, AtomicU64};

#[repr(C, align(64))]
pub struct TelemetryBuffer {
    pub messages_processed: AtomicU64,
    pub messages_dropped: AtomicU64,
    pub last_processed_timestamp: AtomicI64,
    pub queue_depth: AtomicI32,
    pub total_latency_nanos: AtomicU64,
    pub max_latency_nanos: AtomicU64,
    pub min_latency_nanos: AtomicU64,
    pub error_code: AtomicI32,
}
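
A hedged sketch of reducing these counters to an average latency on the host; the helper is illustrative, not crate API:

use std::sync::atomic::Ordering;

// Average processing latency in nanoseconds; None before the first message.
fn avg_latency_nanos(t: &TelemetryBuffer) -> Option<u64> {
    let n = t.messages_processed.load(Ordering::Relaxed);
    (n > 0).then(|| t.total_latency_nanos.load(Ordering::Relaxed) / n)
}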

Backend Abstraction

RingKernel supports multiple GPU backends through the Backend enum:

Backend   Platform         Status       Notes
CPU       All              Working      Full functionality, ideal for development
CUDA      Linux, Windows   Working      Verified GPU execution, requires CUDA toolkit
Metal     macOS, iOS       Scaffolded   API defined, implementation pending
WebGPU    Cross-platform   Working      Via wgpu (Vulkan, Metal, DX12)

Backend Selection

// Auto-detect best available backend
let runtime = RingKernel::builder()
    .backend(Backend::Auto)  // CUDA → Metal → WebGPU → CPU
    .build()
    .await?;

// Force specific backend
let runtime = RingKernel::builder()
    .backend(Backend::Cuda)
    .build()
    .await?;

// Check active backend
println!("Using backend: {:?}", runtime.backend());

Performance (CUDA backend, RTX 2000 Ada)

  • Vector operations: ~75M elements/sec
  • Memory bandwidth: 7.6 GB/s HtoD, 1.4 GB/s DtoH

Next: Crate Structure