Execution Modes
RustKernels supports two execution modes with different performance characteristics. Understanding these modes is essential for choosing the right approach for your workload.
Overview
| Aspect | Batch Mode | Ring Mode |
|---|---|---|
| Latency | 10-50μs | 100-500ns |
| Launch Overhead | Higher | Minimal |
| State Location | CPU memory | GPU memory |
| Programming Model | Request/response | Actor messages |
| Best For | Heavy periodic computation | High-frequency streaming |
Batch Mode
Batch mode provides CPU-orchestrated kernel execution. The kernel is launched on-demand, executes on the GPU, and returns results to the CPU.
Characteristics
- Launch overhead: 10-50μs per invocation
- State management: State lives in CPU memory between calls
- Execution model: Synchronous request/response
- Data transfer: Input copied to GPU, output copied back
When to Use Batch Mode
- Heavy computational tasks (matrix operations, large graph processing)
- Periodic batch jobs (nightly risk calculations, weekly reports)
- Tasks where launch overhead is negligible compared to computation time
- When you need to process a complete dataset at once
Implementation
Batch kernels implement the BatchKernel trait:
use rustkernel_core::traits::{GpuKernel, BatchKernel};
use rustkernel_core::kernel::KernelMetadata;
pub struct MyBatchKernel {
metadata: KernelMetadata,
}
impl GpuKernel for MyBatchKernel {
fn metadata(&self) -> &KernelMetadata {
&self.metadata
}
}
impl BatchKernel<MyInput, MyOutput> for MyBatchKernel {
async fn execute(&self, input: MyInput) -> Result<MyOutput> {
// GPU computation here
Ok(output)
}
}
Usage Example
use rustkernel::graph::centrality::PageRank;
let kernel = PageRank::new();
// Prepare large graph input
let input = PageRankInput {
num_nodes: 1_000_000,
edges: load_edges_from_file()?,
damping_factor: 0.85,
max_iterations: 100,
tolerance: 1e-6,
};
// Execute - may take seconds for large graphs
let result = kernel.execute(input).await?;
println!("Top node: {} with score {:.4}",
result.top_node(),
result.scores[result.top_node()]
);
Ring Mode
Ring mode provides GPU-persistent actors. The kernel maintains state on the GPU and processes messages with minimal latency.
Characteristics
- Message latency: 100-500ns per message
- State persistence: State remains on GPU between messages
- Execution model: Asynchronous actor messages
- Data transfer: Only message payloads transferred
When to Use Ring Mode
- High-frequency operations (order matching, real-time scoring)
- Streaming workloads (continuous data feeds)
- When sub-millisecond latency is critical
- Incremental updates to persistent state
Implementation
Ring kernels implement the RingKernelHandler trait:
use rustkernel_core::traits::{GpuKernel, RingKernelHandler};
use rustkernel_core::ring::{RingContext, RingMessage};
pub struct MyRingKernel {
metadata: KernelMetadata,
}
impl GpuKernel for MyRingKernel {
fn metadata(&self) -> &KernelMetadata {
&self.metadata
}
}
impl RingKernelHandler<MyRequest, MyResponse> for MyRingKernel {
async fn handle(
&self,
ctx: &mut RingContext,
msg: MyRequest,
) -> Result<MyResponse> {
// Process message, update GPU state
Ok(response)
}
}
Ring Message Definition
Ring messages use fixed-point arithmetic and rkyv serialization:
use ringkernel_derive::RingMessage;
use rkyv::{Archive, Serialize, Deserialize};
#[derive(Debug, Clone, Archive, Serialize, Deserialize, RingMessage)]
#[archive(check_bytes)]
#[message(type_id = 200)] // Unique within domain range
pub struct ScoreQueryRequest {
#[message(id)]
pub id: MessageId,
pub node_id: u32,
}
#[derive(Debug, Clone, Archive, Serialize, Deserialize, RingMessage)]
#[archive(check_bytes)]
#[message(type_id = 201)]
pub struct ScoreQueryResponse {
#[message(id)]
pub id: MessageId,
pub score_fp: i64, // Fixed-point score
}
Usage Example
use rustkernel::graph::centrality::PageRankRing;
// Create ring kernel (initializes GPU state)
let ring = PageRankRing::new();
// Stream of edge updates
for edge in incoming_edges {
// Low-latency edge addition
ring.add_edge(edge.from, edge.to).await?;
}
// Query current scores (sub-millisecond)
let scores = ring.query_scores().await?;
// Trigger re-computation if needed
ring.recalculate().await?;
Choosing Between Modes
Decision Matrix
| Scenario | Recommended Mode | Reason |
|---|---|---|
| Nightly risk report | Batch | Large computation, latency not critical |
| Real-time fraud scoring | Ring | Sub-ms latency required |
| Graph analysis on static data | Batch | One-time computation |
| Order book matching | Ring | Continuous high-frequency updates |
| ML model inference (bulk) | Batch | Process entire batch at once |
| ML model inference (streaming) | Ring | Incremental predictions |
Hybrid Approach
Many applications combine both modes:
// Batch: Initial heavy computation
let graph_kernel = GraphBuilder::new();
let initial_graph = graph_kernel.build(edges).await?;
// Ring: Real-time updates
let ring_kernel = GraphRing::from(initial_graph);
loop {
// Process streaming updates with low latency
let update = receive_update().await;
ring_kernel.handle(update).await?;
// Periodically sync state back for batch analysis
if should_sync() {
let snapshot = ring_kernel.snapshot().await?;
batch_analysis(snapshot).await?;
}
}
Performance Considerations
Batch Mode Optimization
- Batch your inputs: Process multiple items in one call
- Minimize data transfer: Only send required fields
- Use async: Don’t block on kernel completion
// Good: Process many items at once
let results = kernel.execute_batch(items).await?;
// Avoid: Processing one at a time
for item in items {
let result = kernel.execute(item).await?; // High overhead
}
Ring Mode Optimization
- Keep messages small: Only transfer deltas
- Batch when possible: Group related messages
- Use K2K for coordination: Avoid CPU round-trips
// Good: Small incremental update
ring.update_score(node_id, delta).await?;
// Avoid: Transferring full state
ring.set_all_scores(full_score_vector).await?; // Large transfer
Iterative Kernels
Some algorithms naturally span multiple iterations (PageRank, K-Means). These implement IterativeKernel:
pub trait IterativeKernel<S, I, O>: GpuKernel {
fn initial_state(&self, input: &I) -> S;
async fn iterate(&self, state: &mut S, input: &I) -> Result<IterationResult<O>>;
fn converged(&self, state: &S, threshold: f64) -> bool;
}
Usage:
let kernel = PageRank::new();
// Create initial state
let mut state = kernel.initial_state(&input);
// Iterate until convergence
while !kernel.converged(&state, tolerance) {
let result = kernel.iterate(&mut state, &input).await?;
println!("Iteration {}: delta={:.6}", result.iteration, result.delta);
}
Next Steps
- Architecture Overview - System design and components
- Kernel Catalogue - Available kernels by domain
- K2K Messaging - Cross-kernel coordination