Concepts and Background

This guide explains the core concepts behind Orleans.GpuBridge.Core and the revolutionary GPU-native actor paradigm.

The Actor Model

Traditional Actor Systems

The actor model is a concurrent computation paradigm where:

  • Actors are independent units of computation with private state
  • Messages are sent asynchronously between actors
  • Single-threaded execution within each actor ensures thread safety
  • Location transparency allows actors to run anywhere in a cluster

Microsoft Orleans implements the virtual actor model:

public interface IMyActor : IGrainWithIntegerKey
{
    Task<int> ProcessAsync(int value);
}

public class MyActor : Grain, IMyActor
{
    private int _state = 0;

    public Task<int> ProcessAsync(int value)
    {
        _state += value;  // Thread-safe by design
        return Task.FromResult(_state);
    }
}

Actor Benefits

  • Simplified concurrency - No locks or mutexes needed
  • Horizontal scalability - Add more nodes to handle more actors
  • Fault tolerance - Actors can be recreated after failures
  • Location transparency - Call actors without knowing their location (see the example below)
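
For example, in Orleans a caller obtains a grain reference purely by identity, and the runtime routes the call to whichever silo hosts the grain. A minimal sketch, assuming grainFactory is an injected IGrainFactory:

// Resolve a grain reference by identity alone - no host, no address
var actor = grainFactory.GetGrain<IMyActor>(42);

// The runtime locates (or activates) the grain and routes the call
int total = await actor.ProcessAsync(10);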

GPU Computing Fundamentals

Why GPUs?

Modern GPUs offer exceptional parallel processing capabilities:

Resource           CPU (AMD EPYC 7763)   GPU (NVIDIA A100)   Advantage
Cores              64                    6,912 CUDA cores    108×
Memory Bandwidth   200 GB/s              1,935 GB/s          10×
FP32 Performance   2 TFLOPS              19.5 TFLOPS         10×
FP64 Performance   1 TFLOPS              9.7 TFLOPS          10×

Traditional GPU Programming

Traditional GPU programming (CUDA/OpenCL) requires:

  1. Explicit memory management - Allocate, copy, free
  2. Kernel launches - Each computation requires a kernel launch (typically 10-50μs of overhead)
  3. CPU-GPU synchronization - Wait for GPU completion
  4. Low-level languages - C/C++ with vendor extensions

Example CUDA code:

// CUDA kernel
__global__ void vectorAdd(float* a, float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Host code (h_a, h_b, h_c are float arrays of length n on the host; error checks omitted)
float *d_a, *d_b, *d_c;
cudaMalloc(&d_a, n * sizeof(float));
cudaMalloc(&d_b, n * sizeof(float));
cudaMalloc(&d_c, n * sizeof(float));

cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);

vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

cudaMemcpy(h_c, d_c, n * sizeof(float), cudaMemcpyDeviceToHost);

cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);

This is complex, error-prone, and difficult to distribute.

GPU-Native Actors

The Revolutionary Paradigm

GPU-Native Actors combine the actor model with GPU computing in a fundamentally new way:

Traditional Approach: CPU actors offload work to GPU

  • Actor runs on CPU
  • Kernel launched for each computation
  • 10-50μs kernel launch overhead
  • State lives on CPU

GPU-Native Approach: Actors live permanently on GPU

  • Actor state resides in GPU memory
  • Ring kernel runs continuously
  • Zero kernel launch overhead
  • 100-500ns message latency

Architecture Comparison

Traditional GPU-Offload Model:
┌─────────────┐
│ CPU Actor   │ ──launch──> ┌──────────┐
│ (State)     │              │ GPU      │
│             │ <──result─── │ (Kernel) │
└─────────────┘              └──────────┘
    10-50μs overhead per call

GPU-Native Model:
┌──────────────────────────┐
│ GPU Ring Kernel          │
│ ┌──────────┬───────────┐ │
│ │ Message  │ Actor     │ │
│ │ Queue    │ State     │ │
│ └──────────┴───────────┘ │
│  Runs continuously       │
└──────────────────────────┘
    100-500ns latency

Key Innovations

  1. Ring Kernels - Persistent GPU threads running infinite loops
  2. GPU-Resident State - Actor state never leaves GPU memory
  3. Message Queues on GPU - Lock-free queues in GPU memory (see the sketch after this list)
  4. Temporal Clocks on GPU - HLC and Vector Clocks maintained on GPU
  5. Zero-Copy Messaging - GPU-to-GPU communication via unified memory
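
To make the lock-free queues in point 3 concrete, here is a minimal single-producer/single-consumer ring buffer in C#. This is an illustrative sketch of the queue's shape only: the real queues live in GPU memory and use device atomics, and the SpscRingQueue name is hypothetical.

using System.Threading;

public sealed class SpscRingQueue<T>
{
    private readonly T[] _slots;
    private long _head;  // advanced only by the consumer
    private long _tail;  // advanced only by the producer

    public SpscRingQueue(int capacity) => _slots = new T[capacity];

    public bool TryEnqueue(T item)
    {
        long tail = Volatile.Read(ref _tail);
        if (tail - Volatile.Read(ref _head) >= _slots.Length)
            return false;  // queue full

        _slots[tail % _slots.Length] = item;
        Volatile.Write(ref _tail, tail + 1);  // publish only after the slot is written
        return true;
    }

    public bool TryDequeue(out T item)
    {
        long head = Volatile.Read(ref _head);
        if (head >= Volatile.Read(ref _tail))
        {
            item = default!;
            return false;  // queue empty
        }

        item = _slots[head % _slots.Length];
        Volatile.Write(ref _head, head + 1);
        return true;
    }
}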

Ring Kernels

What are Ring Kernels?

Ring kernels are GPU kernels that run as infinite dispatch loops:

// Ring kernel (simplified pseudocode) - launched once, runs forever on GPU
__global__ void ring_kernel(MessageQueue* queue, ActorState* state) {
    // This thread persists for the lifetime of the actor
    while (true) {
        // Try to dequeue a message (non-blocking)
        Message msg;
        if (!queue->try_dequeue(&msg)) {
            continue;  // No message pending - poll again
        }

        if (msg.type == UPDATE) {
            // Apply the update to GPU-resident state
            state->value += msg.data;
        }
        else if (msg.type == QUERY) {
            // Write the current value into the message's response slot
            msg.respond(state->value);
        }

        // No kernel exit - the loop continues with the next message
    }
}

Benefits

  • Zero launch overhead - Kernel launched once, runs forever
  • Persistent state - State maintained across messages
  • Sub-microsecond latency - Message processing at 100-500ns
  • High throughput - 2M messages/second per actor

Memory Architecture

GPU Memory Layout:
┌─────────────────────────────────┐
│ Global Memory                   │
│ ┌─────────────────────────────┐ │
│ │ Actor 1 State               │ │
│ │ - Message Queue (lock-free) │ │
│ │ - Actor Data                │ │
│ │ - Temporal Clock (HLC)      │ │
│ └─────────────────────────────┘ │
│ ┌─────────────────────────────┐ │
│ │ Actor 2 State               │ │
│ │ - Message Queue             │ │
│ │ - Actor Data                │ │
│ │ - Temporal Clock            │ │
│ └─────────────────────────────┘ │
│ ...                             │
└─────────────────────────────────┘
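
As a rough sketch, each actor slot in the diagram could be described by a flat, fixed-size structure so the ring kernel can index actor N at a constant offset. The field names and sizes below are illustrative assumptions, not the library's actual layout:

using System.Runtime.InteropServices;

// Hypothetical per-actor slot in GPU global memory
[StructLayout(LayoutKind.Sequential)]
public struct ActorSlot
{
    // Lock-free inbox metadata (see the ring-buffer sketch above)
    public long QueueHead;
    public long QueueTail;
    public long QueueCapacity;

    // Temporal clock (see Hybrid Logical Clocks below)
    public long HlcPhysicalTime;
    public long HlcLogicalCounter;

    // Fixed-size region for actor-specific data
    [MarshalAs(UnmanagedType.ByValArray, SizeConst = 4096)]
    public byte[] State;
}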

Deployment Models

Orleans.GpuBridge.Core supports two deployment models:

1. GPU-Offload Model (Traditional)

CPU actors offload compute to GPU:

[GpuAccelerated]
public class BatchProcessingGrain : Grain
{
    [GpuKernel("kernels/Process")]
    private IGpuKernel<float[], float[]> _kernel;

    public async Task<float[]> ProcessBatchAsync(float[] batch)
    {
        // Kernel launches on demand
        return await _kernel.ExecuteAsync(batch);
    }
}

Best for:

  • Infrequent GPU usage
  • Large batch processing
  • CPU-bound coordination logic

Performance:

  • 10-50μs kernel launch overhead
  • High throughput for large batches

2. GPU-Native Model (Revolutionary)

Actors live permanently on GPU:

[GpuAccelerated(Mode = GpuMode.Native)]
public class StreamProcessingGrain : Grain
{
    [RingKernel("kernels/RingProcess")]
    private IRingKernel<Event, Result> _kernel;

    public async Task<Result> ProcessEventAsync(Event evt)
    {
        // Ring kernel processes without relaunch
        return await _kernel.ExecuteAsync(evt);
    }
}

Best for:

  • High-frequency messaging
  • Real-time stream processing
  • Temporal graph analytics
  • Low-latency requirements

Performance:

  • Zero kernel launch overhead
  • 100-500ns message latency
  • 2M messages/second throughput

Temporal Alignment

The Challenge

Distributed systems require temporal ordering for:

  • Causal consistency (A caused B)
  • Conflict detection (concurrent updates)
  • Behavioral analytics (event sequence patterns)

Hybrid Logical Clocks (HLC)

HLC combines physical time with logical counters:

public struct HybridLogicalClock
{
    public long PhysicalTime;   // Wall-clock time (ns)
    public long LogicalCounter; // Tie-breaker when physical time does not advance

    public void Update(long eventTime)
    {
        var now = GetPhysicalTime();
        var previous = PhysicalTime;
        PhysicalTime = Math.Max(Math.Max(previous, eventTime), now);

        // If the physical component did not advance, break the tie logically.
        // (A full HLC also merges the sender's logical counter on receive.)
        LogicalCounter = (PhysicalTime == previous || PhysicalTime == eventTime)
            ? LogicalCounter + 1
            : 0;
    }

    private static long GetPhysicalTime() =>
        DateTime.UtcNow.Ticks * 100;  // 1 tick = 100ns
}
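
A quick usage sketch (the remoteTime value is illustrative):

var clock = new HybridLogicalClock();

// Local event: merge only with the local wall clock
clock.Update(eventTime: 0);

// Message receive: also merge the sender's physical timestamp
long remoteTime = clock.PhysicalTime + 5;  // pretend the sender is 5ns ahead
clock.Update(remoteTime);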

GPU Implementation: HLC maintained in GPU memory at 20ns per update (vs 50ns on CPU)

Vector Clocks

Vector clocks track causal dependencies:

public class VectorClock
{
    private readonly Dictionary<string, long> _clocks = new();

    public void Increment(string actorId)
    {
        // Missing entries are treated as zero
        _clocks[actorId] = _clocks.GetValueOrDefault(actorId) + 1;
    }

    public bool HappenedBefore(VectorClock other)
    {
        // Every component <= the other's, and at least one strictly less
        return _clocks.All(kv =>
                   kv.Value <= other._clocks.GetValueOrDefault(kv.Key))
               && other._clocks.Any(kv =>
                   _clocks.GetValueOrDefault(kv.Key) < kv.Value);
    }
}
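
Two clocks that advance independently compare as concurrent in both directions, which is how conflicting updates are detected:

var a = new VectorClock();
var b = new VectorClock();
a.Increment("actor-1");
b.Increment("actor-2");

// Neither happened before the other: the updates are concurrent
bool concurrent = !a.HappenedBefore(b) && !b.HappenedBefore(a);  // true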

GPU Implementation: Efficient GPU-parallel vector comparison

Use Cases

  • Fraud detection - Detect causally related transactions
  • Anomaly detection - Identify temporal pattern violations
  • Distributed debugging - Trace causal relationships
  • Real-time analytics - Maintain temporal graph structures

Hypergraph Actors

Beyond Binary Edges

Traditional graphs model binary relationships:

User1 ─likes→ Post1

Hypergraphs model multi-way relationships:

Transaction { Buyer, Seller, Bank, Product, Shipper }
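
In code, a hyperedge is simply one edge over an arbitrary vertex set. The Hyperedge record below is an illustrative sketch, not the library's API:

using System.Collections.Generic;

// One hyperedge connects any number of vertices at once
public record Hyperedge(string Id, IReadOnlySet<string> Vertices);

// The five-party transaction above as a single hyperedge
var transaction = new Hyperedge(
    Id: "txn-1042",
    Vertices: new HashSet<string> { "buyer-7", "seller-3", "bank-1", "product-99", "shipper-2" });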

GPU-Accelerated Hypergraphs

Orleans.GpuBridge.Core enables GPU-native hypergraph actors:

[GpuAccelerated(Mode = GpuMode.Native)]
public class HyperedgeGrain : Grain, IHyperedge
{
    [RingKernel("kernels/PatternMatch")]
    private IRingKernel<HypergraphQuery, bool> _kernel;

    public async Task<bool> MatchesPatternAsync(HypergraphQuery query)
    {
        // GPU-accelerated pattern matching
        return await _kernel.ExecuteAsync(query);
    }
}

Performance:

  • Pattern detection: <100μs (vs >10ms on CPU)
  • 10-500× faster than traditional graph databases
  • Real-time analytics on billion-edge hypergraphs

Use Cases

  • Financial fraud detection - Multi-party transaction analysis
  • Supply chain optimization - Multi-modal logistics
  • Cybersecurity - Advanced persistent threat (APT) detection
  • Healthcare - Multi-drug interaction analysis

Performance Characteristics

Latency Comparison

Operation         CPU Actors   GPU-Offload   GPU-Native   Improvement
Message routing   10-100μs     10-100μs      100-500ns    20-200×
Kernel launch     N/A          10-50μs       0ns          eliminated
State access      50ns         50ns          20ns         2.5×
Temporal update   50ns         50ns          20ns         2.5×

Throughput Comparison

Workload             CPU Actors   GPU-Native   Improvement
Message processing   15K/s        2M/s         133×
Vector operations    100K/s       1B/s         10,000×
Hypergraph queries   100/s        10K/s        100×

Memory Bandwidth

Location            Bandwidth    Latency
CPU RAM             200 GB/s     50ns
GPU Global Memory   1,935 GB/s   200ns
GPU L2 Cache        6 TB/s       50ns
GPU L1 Cache        20 TB/s      20ns

GPU-native actors leverage 10-100× higher bandwidth for state access.

When to Use Each Model

Use GPU-Offload When:

  ✅ Infrequent GPU usage (< 10 calls/second per actor)
  ✅ Large batch processing (batch size > 10K elements)
  ✅ Complex CPU coordination logic
  ✅ Existing CPU-based workflows

Use GPU-Native When:

  ✅ High-frequency messaging (> 1K messages/second per actor)
  ✅ Real-time requirements (< 1ms latency)
  ✅ Temporal graph analytics
  ✅ Hypergraph pattern matching
  ✅ Stream processing pipelines
  ✅ Digital twins and simulation

Next Steps

Now that you understand the core concepts:

  1. Architecture Overview - Deep dive into system design
  2. GPU-Native Actors Guide - Build GPU-native applications
  3. Temporal Correctness - Implement HLC and Vector Clocks
  4. Hypergraph Actors - Build multi-way relationships

← Back to Getting Started | Next: Architecture →