Concepts and Background
This guide explains the core concepts behind Orleans.GpuBridge.Core and the revolutionary GPU-native actor paradigm.
Table of Contents
- The Actor Model
- GPU Computing Fundamentals
- GPU-Native Actors
- Ring Kernels
- Deployment Models
- Temporal Alignment
- Hypergraph Actors
The Actor Model
Traditional Actor Systems
The actor model is a concurrent computation paradigm where:
- Actors are independent units of computation with private state
- Messages are sent asynchronously between actors
- Single-threaded execution within each actor ensures thread safety
- Location transparency allows actors to run anywhere in a cluster
Microsoft Orleans implements the virtual actor model:
public interface IMyActor : IGrainWithIntegerKey
{
Task<int> ProcessAsync(int value);
}
public class MyActor : Grain, IMyActor
{
private int _state = 0;
public Task<int> ProcessAsync(int value)
{
_state += value; // Thread-safe by design
return Task.FromResult(_state);
}
}
Actor Benefits
- Simplified concurrency - No locks or mutexes needed
- Horizontal scalability - Add more nodes to handle more actors
- Fault tolerance - Actors can be recreated after failures
- Location transparency - Call actors without knowing their location
GPU Computing Fundamentals
Why GPUs?
Modern GPUs offer exceptional parallel processing capabilities:
| Resource | CPU (AMD EPYC 7763) | GPU (NVIDIA A100) | Advantage |
|---|---|---|---|
| Cores | 64 | 6,912 CUDA cores | 108× |
| Memory Bandwidth | 200 GB/s | 1,935 GB/s | 10× |
| FP32 Performance | 2 TFLOPS | 19.5 TFLOPS | 10× |
| FP64 Performance | 1 TFLOPS | 9.7 TFLOPS | 10× |
Traditional GPU Programming
Traditional GPU programming (CUDA/OpenCL) requires:
- Explicit memory management - Allocate, copy, free
- Kernel launches - Each computation requires kernel launch (~5-20μs overhead)
- CPU-GPU synchronization - Wait for GPU completion
- Low-level languages - C/C++ with vendor extensions
Example CUDA code:
// CUDA kernel
__global__ void vectorAdd(float* a, float* b, float* c, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) c[i] = a[i] + b[i];
}
// Host code
float *d_a, *d_b, *d_c;
cudaMalloc(&d_a, n * sizeof(float));
cudaMalloc(&d_b, n * sizeof(float));
cudaMalloc(&d_c, n * sizeof(float));
cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);
vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
cudaMemcpy(h_c, d_c, n * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
This is complex, error-prone, and difficult to distribute.
GPU-Native Actors
The Revolutionary Paradigm
GPU-Native Actors combine the actor model with GPU computing in a fundamentally new way:
Traditional Approach: CPU actors offload work to GPU
- Actor runs on CPU
- Kernel launched for each computation
- 10-50μs kernel launch overhead
- State lives on CPU
GPU-Native Approach: Actors live permanently on GPU
- Actor state resides in GPU memory
- Ring kernel runs continuously
- Zero kernel launch overhead
- 100-500ns message latency
Architecture Comparison
Traditional GPU-Offload Model:
┌─────────────┐
│ CPU Actor │ ──launch──> ┌──────────┐
│ (State) │ │ GPU │
│ │ <──result─── │ (Kernel) │
└─────────────┘ └──────────┘
10-50μs overhead per call
GPU-Native Model:
┌──────────────────────────┐
│ GPU Ring Kernel │
│ ┌──────────┬───────────┐ │
│ │ Message │ Actor │ │
│ │ Queue │ State │ │
│ └──────────┴───────────┘ │
│ Runs continuously │
└──────────────────────────┘
100-500ns latency
Key Innovations
- Ring Kernels - Persistent GPU threads running infinite loops
- GPU-Resident State - Actor state never leaves GPU memory
- Message Queues on GPU - Lock-free queues in GPU memory
- Temporal Clocks on GPU - HLC and Vector Clocks maintained on GPU
- Zero-Copy Messaging - GPU-to-GPU communication via unified memory
Ring Kernels
What are Ring Kernels?
Ring kernels are GPU kernels that run as infinite dispatch loops:
// Ring kernel - runs forever on GPU
__global__ void ring_kernel(MessageQueue* queue, ActorState* state) {
// Thread persists indefinitely
while (true) {
// Dequeue message (non-blocking)
Message msg = queue->dequeue();
if (msg.type == UPDATE) {
// Process update
state->value += msg.data;
}
else if (msg.type == QUERY) {
// Respond to query
msg.respond(state->value);
}
// No kernel exit - loop continues
}
}
Benefits
- Zero launch overhead - Kernel launched once, runs forever
- Persistent state - State maintained across messages
- Sub-microsecond latency - Message processing at 100-500ns
- High throughput - 2M messages/second per actor
Memory Architecture
GPU Memory Layout:
┌─────────────────────────────────┐
│ Global Memory │
│ ┌─────────────────────────────┐ │
│ │ Actor 1 State │ │
│ │ - Message Queue (lock-free) │ │
│ │ - Actor Data │ │
│ │ - Temporal Clock (HLC) │ │
│ └─────────────────────────────┘ │
│ ┌─────────────────────────────┐ │
│ │ Actor 2 State │ │
│ │ - Message Queue │ │
│ │ - Actor Data │ │
│ │ - Temporal Clock │ │
│ └─────────────────────────────┘ │
│ ... │
└─────────────────────────────────┘
Deployment Models
Orleans.GpuBridge.Core supports two deployment models:
1. GPU-Offload Model (Traditional)
CPU actors offload compute to GPU:
[GpuAccelerated]
public class BatchProcessingGrain : Grain
{
[GpuKernel("kernels/Process")]
private IGpuKernel<float[], float[]> _kernel;
public async Task<float[]> ProcessBatchAsync(float[] batch)
{
// Kernel launches on demand
return await _kernel.ExecuteAsync(batch);
}
}
Best for:
- Infrequent GPU usage
- Large batch processing
- CPU-bound coordination logic
Performance:
- 10-50μs kernel launch overhead
- High throughput for large batches
2. GPU-Native Model (Revolutionary)
Actors live permanently on GPU:
[GpuAccelerated(Mode = GpuMode.Native)]
public class StreamProcessingGrain : Grain
{
[RingKernel("kernels/RingProcess")]
private IRingKernel<Event, Result> _kernel;
public async Task<Result> ProcessEventAsync(Event evt)
{
// Ring kernel processes without relaunch
return await _kernel.ExecuteAsync(evt);
}
}
Best for:
- High-frequency messaging
- Real-time stream processing
- Temporal graph analytics
- Low-latency requirements
Performance:
- Zero kernel launch overhead
- 100-500ns message latency
- 2M messages/second throughput
Temporal Alignment
The Challenge
Distributed systems require temporal ordering for:
- Causal consistency (A caused B)
- Conflict detection (concurrent updates)
- Behavioral analytics (event sequence patterns)
Hybrid Logical Clocks (HLC)
HLC combines physical time with logical counters:
public struct HybridLogicalClock
{
public long PhysicalTime; // Wall clock time (ns)
public long LogicalCounter; // Logical counter
public void Update(long eventTime)
{
var now = GetPhysicalTime();
PhysicalTime = Math.Max(Math.Max(PhysicalTime, eventTime), now);
LogicalCounter = (PhysicalTime == eventTime)
? LogicalCounter + 1
: 0;
}
}
GPU Implementation: HLC maintained in GPU memory at 20ns per update (vs 50ns on CPU)
Vector Clocks
Vector clocks track causal dependencies:
public class VectorClock
{
private Dictionary<string, long> _clocks;
public void Increment(string actorId)
{
_clocks[actorId]++;
}
public bool HappenedBefore(VectorClock other)
{
return _clocks.All(kv =>
kv.Value <= other._clocks.GetValueOrDefault(kv.Key));
}
}
GPU Implementation: Efficient GPU-parallel vector comparison
Use Cases
- Fraud detection - Detect causally related transactions
- Anomaly detection - Identify temporal pattern violations
- Distributed debugging - Trace causal relationships
- Real-time analytics - Maintain temporal graph structures
Hypergraph Actors
Beyond Binary Edges
Traditional graphs model binary relationships:
User1 ─likes→ Post1
Hypergraphs model multi-way relationships:
Transaction { Buyer, Seller, Bank, Product, Shipper }
GPU-Accelerated Hypergraphs
Orleans.GpuBridge.Core enables GPU-native hypergraph actors:
[GpuAccelerated(Mode = GpuMode.Native)]
public class HyperedgeGrain : Grain, IHyperedge
{
[RingKernel("kernels/PatternMatch")]
private IRingKernel<HypergraphQuery, bool> _kernel;
public async Task<bool> MatchesPatternAsync(HypergraphQuery query)
{
// GPU-accelerated pattern matching
return await _kernel.ExecuteAsync(query);
}
}
Performance:
- Pattern detection: <100μs (vs >10ms on CPU)
- 10-500× faster than traditional graph databases
- Real-time analytics on billion-edge hypergraphs
Use Cases
- Financial fraud detection - Multi-party transaction analysis
- Supply chain optimization - Multi-modal logistics
- Cybersecurity - Advanced persistent threat (APT) detection
- Healthcare - Multi-drug interaction analysis
Performance Characteristics
Latency Comparison
| Operation | CPU Actors | GPU-Offload | GPU-Native | Improvement |
|---|---|---|---|---|
| Message routing | 10-100μs | 10-100μs | 100-500ns | 20-200× |
| Kernel launch | N/A | 10-50μs | 0ns | ∞ |
| State access | 50ns | 50ns | 20ns | 2.5× |
| Temporal update | 50ns | 50ns | 20ns | 2.5× |
Throughput Comparison
| Workload | CPU Actors | GPU-Native | Improvement |
|---|---|---|---|
| Message processing | 15K/s | 2M/s | 133× |
| Vector operations | 100K/s | 1B/s | 10,000× |
| Hypergraph queries | 100/s | 10K/s | 100× |
Memory Bandwidth
| Location | Bandwidth | Latency |
|---|---|---|
| CPU RAM | 200 GB/s | 50ns |
| GPU Global Memory | 1,935 GB/s | 200ns |
| GPU L2 Cache | 6 TB/s | 50ns |
| GPU L1 Cache | 20 TB/s | 20ns |
GPU-native actors leverage 10-100× higher bandwidth for state access.
When to Use Each Model
Use GPU-Offload When:
✅ Infrequent GPU usage (< 10 calls/second per actor) ✅ Large batch processing (batch size > 10K elements) ✅ Complex CPU coordination logic ✅ Existing CPU-based workflows
Use GPU-Native When:
✅ High-frequency messaging (> 1K messages/second per actor) ✅ Real-time requirements (< 1ms latency) ✅ Temporal graph analytics ✅ Hypergraph pattern matching ✅ Stream processing pipelines ✅ Digital twins and simulation
Next Steps
Now that you understand the core concepts:
- Architecture Overview - Deep dive into system design
- GPU-Native Actors Guide - Build GPU-native applications
- Temporal Correctness - Implement HLC and Vector Clocks
- Hypergraph Actors - Build multi-way relationships
Further Reading
- Getting Started Guide - Build your first GPU-accelerated grain
- API Reference - Complete API documentation
- Orleans Documentation - Microsoft Orleans reference