Architecture Overview
This document provides a comprehensive overview of Orleans.GpuBridge.Core's architecture, design decisions, and internal components.
Table of Contents
- System Architecture
- Component Layers
- Core Components
- GPU Backend Abstraction
- Placement Strategies
- Memory Management
- Ring Kernel Lifecycle
- Design Decisions
- Performance Considerations
System Architecture
Orleans.GpuBridge.Core uses a layered architecture that integrates with Microsoft Orleans:
┌──────────────────────────────────────────────────────────┐
│ Application Layer │
│ - Business Logic (C#) │
│ - Grain Implementations │
│ - Type-safe Interfaces │
└──────────────────────────────────────────────────────────┘
│
┌──────────────────────────────────────────────────────────┐
│ Orleans.GpuBridge Abstractions │
│ - IGpuBridge, IGpuKernel<TIn,TOut> │
│ - [GpuAccelerated], [RingKernel] Attributes │
│ - GpuPipeline<TIn,TOut> Fluent API │
│ - Temporal Clock Interfaces (HLC, Vector Clocks) │
└──────────────────────────────────────────────────────────┘
│
┌──────────────────────────────────────────────────────────┐
│ Orleans.GpuBridge Runtime │
│ ┌────────────────┬──────────────┬───────────────────┐ │
│ │ KernelCatalog │ DeviceBroker │ PlacementStrategy │ │
│ ├────────────────┼──────────────┼───────────────────┤ │
│ │ RingManager │ MemoryPool │ TemporalClocks │ │
│ └────────────────┴──────────────┴───────────────────┘ │
└──────────────────────────────────────────────────────────┘
│
┌──────────────────────────────────────────────────────────┐
│ Orleans Distributed Runtime │
│ - Virtual Actor Model (Grains) │
│ - Location Transparency │
│ - Automatic Failover & Activation │
│ - Streaming & Persistence │
└──────────────────────────────────────────────────────────┘
│
┌──────────────────────────────────────────────────────────┐
│ GPU Backend Abstraction Layer │
│ ┌──────────────┬────────────┬────────────┐ │
│ │ DotCompute │ ILGPU │ CPU │ │
│ │ (CUDA/ROCm) │ (IL→GPU) │ (Fallback) │ │
│ └──────────────┴────────────┴────────────┘ │
└──────────────────────────────────────────────────────────┘
│
┌──────────────────────────────────────────────────────────┐
│ Hardware Layer │
│ - NVIDIA GPUs (CUDA 12.0+) │
│ - AMD GPUs (ROCm 5.0+) │
│ - Intel GPUs (oneAPI - future) │
└──────────────────────────────────────────────────────────┘
Component Layers
1. Abstractions Layer
Purpose: Define contracts and interfaces for GPU acceleration
Key Types:
// Core bridge interface
public interface IGpuBridge
{
Task<TOut> ExecuteKernelAsync<TIn, TOut>(string kernelId, TIn input);
IReadOnlyList<IGpuDevice> GetAvailableDevices();
}
// Kernel execution interface
public interface IGpuKernel<TIn, TOut> : IDisposable
{
Task<TOut> ExecuteAsync(TIn input);
}
// Ring kernel interface (persistent execution)
public interface IRingKernel<TIn, TOut> : IGpuKernel<TIn, TOut>
{
Task StartAsync(); // Launch ring kernel
Task StopAsync(); // Terminate ring kernel
}
// Temporal clock interfaces
public interface IHybridLogicalClock
{
HLCTimestamp Now();
void Update(HLCTimestamp remote);
}
public interface IVectorClock
{
void Increment(string actorId);
bool HappenedBefore(IVectorClock other);
}
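To make the IHybridLogicalClock contract more concrete, here is a minimal sketch of the standard hybrid logical clock update rules. It is not the library's implementation: the names HlcTimestamp and SketchHybridLogicalClock, the (PhysicalTime, Counter) timestamp shape, and the injected physical clock delegate are all assumptions made for illustration, and Update returns the merged timestamp here only to make the result easy to inspect.

```csharp
using System;

// Hypothetical timestamp shape; the real HLCTimestamp may differ.
public readonly record struct HlcTimestamp(long PhysicalTime, long Counter)
    : IComparable<HlcTimestamp>
{
    // Order by time first, then by counter to break ties.
    public int CompareTo(HlcTimestamp other)
    {
        int byTime = PhysicalTime.CompareTo(other.PhysicalTime);
        return byTime != 0 ? byTime : Counter.CompareTo(other.Counter);
    }
}

public sealed class SketchHybridLogicalClock
{
    private readonly Func<long> _physicalClock; // injected so tests can control time
    private readonly object _gate = new object();
    private long _logical;  // highest time value observed so far
    private long _counter;  // tie-breaker for events that share _logical

    public SketchHybridLogicalClock(Func<long> physicalClock)
    {
        _physicalClock = physicalClock;
    }

    // Timestamp a local event (or an outgoing message).
    public HlcTimestamp Now()
    {
        lock (_gate)
        {
            long wall = _physicalClock();
            if (wall > _logical) { _logical = wall; _counter = 0; }
            else { _counter++; } // wall clock has not advanced: bump the counter
            return new HlcTimestamp(_logical, _counter);
        }
    }

    // Merge a timestamp received from another node.
    public HlcTimestamp Update(HlcTimestamp remote)
    {
        lock (_gate)
        {
            long wall = _physicalClock();
            long merged = Math.Max(Math.Max(_logical, remote.PhysicalTime), wall);
            if (merged == _logical && merged == remote.PhysicalTime)
                _counter = Math.Max(_counter, remote.Counter) + 1;
            else if (merged == _logical)
                _counter++;
            else if (merged == remote.PhysicalTime)
                _counter = remote.Counter + 1;
            else
                _counter = 0; // the local wall clock is strictly ahead of both
            _logical = merged;
            return new HlcTimestamp(_logical, _counter);
        }
    }
}
```

Timestamps produced this way increase monotonically on a node even when its wall clock stalls or a remote node's clock runs ahead, which is the property the temporal interfaces rely on.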
Attributes:
// Mark grain as GPU-accelerated
[AttributeUsage(AttributeTargets.Class)]
public class GpuAcceleratedAttribute : Attribute
{
public GpuMode Mode { get; set; } = GpuMode.Offload;
public string? PreferredDevice { get; set; }
}
// Inject GPU kernel
[AttributeUsage(AttributeTargets.Field | AttributeTargets.Property)]
public class GpuKernelAttribute : Attribute
{
public string KernelId { get; }
public GpuKernelAttribute(string kernelId) => KernelId = kernelId;
}
// Inject ring kernel
[AttributeUsage(AttributeTargets.Field | AttributeTargets.Property)]
public class RingKernelAttribute : GpuKernelAttribute
{
public RingKernelAttribute(string kernelId) : base(kernelId) { }
}
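One way a runtime could discover these injection points is plain reflection over a grain type's fields and properties. The sketch below illustrates that idea only; how Orleans.GpuBridge.Core actually wires up kernels is not shown in this document. The attribute classes are stand-ins that mirror the shapes above so the example compiles on its own, and KernelInjectionScanner and DemoGrain are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Stand-ins mirroring the attribute shapes above.
[AttributeUsage(AttributeTargets.Field | AttributeTargets.Property)]
public class SketchGpuKernelAttribute : Attribute
{
    public string KernelId { get; }
    public SketchGpuKernelAttribute(string kernelId) => KernelId = kernelId;
}

public class SketchRingKernelAttribute : SketchGpuKernelAttribute
{
    public SketchRingKernelAttribute(string kernelId) : base(kernelId) { }
}

public static class KernelInjectionScanner
{
    // Returns (member name, kernel id, is ring kernel) for each annotated member.
    public static List<(string Member, string KernelId, bool IsRing)> Scan(Type grainType)
    {
        const BindingFlags flags =
            BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic;
        var found = new List<(string Member, string KernelId, bool IsRing)>();
        foreach (MemberInfo member in grainType.GetMembers(flags))
        {
            if (!(member is FieldInfo) && !(member is PropertyInfo)) continue;
            // Looking up the base attribute type also matches the derived ring attribute.
            var attr = member.GetCustomAttribute<SketchGpuKernelAttribute>();
            if (attr == null) continue;
            found.Add((member.Name, attr.KernelId, attr is SketchRingKernelAttribute));
        }
        return found;
    }
}

// Hypothetical grain-like class used only to exercise the scanner.
public class DemoGrain
{
    [SketchGpuKernel("kernels/VectorAdd")]
    public object Adder { get; set; } = new object();

    [SketchRingKernel("kernels/StateManager")]
    public object Ring { get; set; } = new object();
}
```

Because RingKernelAttribute derives from GpuKernelAttribute, a single lookup for the base type finds both kinds of member, and the `is` check distinguishes ring kernels afterwards.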
2. Runtime Layer
Purpose: Implement GPU integration with the Orleans runtime
KernelCatalog
Manages kernel registration and resolution:
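public class KernelCatalog : IKernelCatalog
{
    private readonly ConcurrentDictionary<string, KernelRegistration> _kernels = new();
    private readonly IDeviceBroker _deviceBroker;
    public KernelCatalog(IDeviceBroker deviceBroker) => _deviceBroker = deviceBroker;
    public void RegisterKernel<TIn, TOut>(
        string kernelId,
        Func<IGpuDevice, IGpuKernel<TIn, TOut>> factory)
    {
        // Re-registering an existing kernelId replaces the previous factory
        var registration = new KernelRegistration(kernelId, factory);
        _kernels[kernelId] = registration;
    }
    public IGpuKernel<TIn, TOut> ResolveKernel<TIn, TOut>(
        string kernelId,
        IGpuDevice? device = null)
    {
        // Fail with a descriptive error rather than a bare KeyNotFoundException
        if (!_kernels.TryGetValue(kernelId, out var registration))
            throw new InvalidOperationException($"Kernel '{kernelId}' is not registered.");
        // Fall back to the broker's default device when none is specified
        device ??= _deviceBroker.GetDefaultDevice();
        return registration.CreateInstance<TIn, TOut>(device);
    }
}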
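The catalog stores factories for many different TIn/TOut pairs in one dictionary, so the registration object has to erase the generic types on the way in and recover them on the way out. A minimal sketch of that pattern follows. KernelRegistration's real shape is not shown in this document, so SketchKernelRegistration, its Create factory method, and the stand-in interfaces are assumptions made for illustration.

```csharp
using System;

// Stand-in interfaces so the sketch compiles on its own.
public interface ISketchDevice { string Id { get; } }
public interface ISketchKernel<TIn, TOut> { TOut Execute(TIn input); }

// Holds a generic factory behind a non-generic wrapper and
// records the type pair so a mismatch can be reported clearly.
public sealed class SketchKernelRegistration
{
    private readonly Delegate _factory;
    public Type InputType { get; }
    public Type OutputType { get; }

    private SketchKernelRegistration(Delegate factory, Type inputType, Type outputType)
    {
        _factory = factory;
        InputType = inputType;
        OutputType = outputType;
    }

    public static SketchKernelRegistration Create<TIn, TOut>(
        Func<ISketchDevice, ISketchKernel<TIn, TOut>> factory)
        => new SketchKernelRegistration(factory, typeof(TIn), typeof(TOut));

    public ISketchKernel<TIn, TOut> CreateInstance<TIn, TOut>(ISketchDevice device)
    {
        // The type test succeeds only when the caller asks for the registered type pair.
        if (_factory is Func<ISketchDevice, ISketchKernel<TIn, TOut>> typed)
            return typed(device);
        throw new InvalidCastException(
            $"Kernel was registered as {InputType.Name} -> {OutputType.Name}, " +
            $"not {typeof(TIn).Name} -> {typeof(TOut).Name}.");
    }
}

// Hypothetical device and kernel used only to exercise the registration.
public sealed class SketchCpuDevice : ISketchDevice
{
    public string Id => "cpu-0";
}

public sealed class DoublingKernel : ISketchKernel<int, int>
{
    public int Execute(int input) => input * 2;
}
```

Resolving a kernel with the wrong type arguments then fails at resolution time with a message naming both type pairs, instead of surfacing later as an opaque cast error.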
DeviceBroker
Manages GPU device lifecycle:
public class DeviceBroker : IDeviceBroker
{
private readonly List<IGpuDevice> _devices;
private readonly DeviceLoadBalancer _loadBalancer;
public DeviceBroker(GpuBridgeOptions options)
{
_devices = DiscoverDevices(options);
_loadBalancer = new DeviceLoadBalancer(_devices);
}
public IGpuDevice GetDefaultDevice()
{
return _loadBalancer.SelectDevice();
}
public IGpuDevice? GetDeviceById(string deviceId)
{
return _devices.FirstOrDefault(d => d.Id == deviceId);
}
private List<IGpuDevice> DiscoverDevices(GpuBridgeOptions options)
{
var devices = new List<IGpuDevice>();
// Discover CUDA devices
devices.AddRange(CudaDeviceDiscovery.Discover());
// Discover ROCm devices
devices.AddRange(RocmDeviceDiscovery.Discover());
// Add CPU fallback
devices.Add(new CpuFallbackDevice());
return devices;
}
}
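DeviceLoadBalancer is referenced above but not shown. One plausible selection policy, sketched below under assumed names, is to prefer the least-utilized GPU and use the CPU fallback device only when no GPU was discovered. SketchDevice, SketchLoadBalancer, and the Utilization scale of 0.0 to 1.0 are hypothetical; the actual balancer may weigh other signals such as queue depth or free memory.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical device descriptor; the real IGpuDevice exposes more than this.
public sealed class SketchDevice
{
    public string Id { get; }
    public bool IsCpuFallback { get; }
    public double Utilization { get; set; } // 0.0 (idle) to 1.0 (saturated)

    public SketchDevice(string id, bool isCpuFallback = false)
    {
        Id = id;
        IsCpuFallback = isCpuFallback;
    }
}

public sealed class SketchLoadBalancer
{
    private readonly IReadOnlyList<SketchDevice> _devices;

    public SketchLoadBalancer(IReadOnlyList<SketchDevice> devices)
    {
        if (devices.Count == 0)
            throw new ArgumentException("At least one device is required.", nameof(devices));
        _devices = devices;
    }

    // Prefer any GPU over the CPU fallback; among devices of the same kind,
    // pick the one with the lowest utilization.
    public SketchDevice SelectDevice()
    {
        SketchDevice best = _devices[0];
        for (int i = 1; i < _devices.Count; i++)
        {
            SketchDevice candidate = _devices[i];
            bool betterKind = best.IsCpuFallback && !candidate.IsCpuFallback;
            bool sameKind = best.IsCpuFallback == candidate.IsCpuFallback;
            if (betterKind || (sameKind && candidate.Utilization < best.Utilization))
                best = candidate;
        }
        return best;
    }
}
```

Ranking the device kind ahead of utilization keeps an idle CPU from winning over a busy GPU, which matches the role of the CPU device as a fallback rather than a peer.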
RingManager
Manages ring kernel lifecycle:
public class RingManager : IRingManager
{
private readonly ConcurrentDictionary<Guid, RingKernelState> _rings;
public async Task<IRingKernel<TIn, TOut>> LaunchRingAsync<TIn, TOut>(
string kernelId,
IGpuDevice device)
{
var ringId = Guid.NewGuid();
var messageQueue = new GpuMessageQueue<TIn>(device);
var resultQueue = new GpuMessageQueue<TOut>(device);
// Launch persistent kernel
var kernel = await device.LaunchPersistentKernelAsync(
kernelId,
new[] { messageQueue.DevicePointer, resultQueue.DevicePointer }
);
var ring = new RingKernel<TIn, TOut>(kernel, messageQueue, resultQueue);
_rings[ringId] = new RingKernelState(ring);
return ring;
}
public async Task TerminateRingAsync(Guid ringId)
{
if (_rings.TryRemove(ringId, out var state))
{
await state.Kernel.StopAsync();
state.Dispose();
}
}
}
3. BridgeFX Layer
Purpose: High-level pipeline API for batch processing
public class GpuPipeline<TIn, TOut>
{
    private readonly IGrainFactory _grainFactory;
    private readonly string _kernelId;
    private int _batchSize = 1000;
    private int _maxConcurrency = 10;
    // Private constructor: instances are created through For(...)
    private GpuPipeline(IGrainFactory grainFactory, string kernelId)
    {
        _grainFactory = grainFactory;
        _kernelId = kernelId;
    }
    public static GpuPipeline<TIn, TOut> For(
        IGrainFactory grainFactory,
        string kernelId)
    {
        return new GpuPipeline<TIn, TOut>(grainFactory, kernelId);
    }
    public GpuPipeline<TIn, TOut> WithBatchSize(int size)
    {
        _batchSize = size;
        return this;
    }
    public GpuPipeline<TIn, TOut> WithMaxConcurrency(int max)
    {
        _maxConcurrency = max;
        return this;
    }
    public async Task<TOut[]> ExecuteAsync(TIn[] data)
    {
        // Partition data into batches
        var batches = data.Chunk(_batchSize).ToArray();
        // Cap the number of in-flight batches at _maxConcurrency
        using var throttle = new SemaphoreSlim(_maxConcurrency);
        var tasks = batches.Select(async (batch, index) =>
        {
            await throttle.WaitAsync();
            try
            {
                var grain = _grainFactory.GetGrain<IGpuBatchGrain>(index);
                return await grain.ProcessBatchAsync(_kernelId, batch);
            }
            finally
            {
                throttle.Release();
            }
        });
        // Task.WhenAll returns results in task order, so output order matches input order
        var results = await Task.WhenAll(tasks);
        return results.SelectMany(r => r).ToArray();
    }
}
4. Grains Layer
Purpose: Pre-built grain implementations for common patterns
[GpuAccelerated(Mode = GpuMode.Native)]
public class GpuResidentGrain<TState> : Grain, IGpuResidentGrain<TState>
where TState : new()
{
[RingKernel("kernels/StateManager")]
private IRingKernel<StateCommand, StateResponse> _kernel;
private TState _state = new();
public override async Task OnActivateAsync(CancellationToken ct)
{
// Initialize ring kernel with GPU-resident state
await _kernel.StartAsync();
await _kernel.ExecuteAsync(new InitCommand { State = _state });
await base.OnActivateAsync(ct);
}
public async Task<TResult> QueryAsync<TResult>(Func<TState, TResult> query)
{
var response = await _kernel.ExecuteAsync(
new QueryCommand { Query = query }
);
return (TResult)response.Result;
}
public async Task UpdateAsync(Action<TState> update)
{
await _kernel.ExecuteAsync(
new UpdateCommand { Update = update }
);
}
public override async Task OnDeactivateAsync(DeactivationReason reason, CancellationToken ct)
{
// Terminate ring kernel
await _kernel.StopAsync();
await base.OnDeactivateAsync(reason, ct);
}
}
GPU Backend Abstraction
DotCompute Backend
Purpose: Unified API for CUDA and ROCm
public interface IDotComputeDevice : IGpuDevice
{
Task<T[]> AllocateAsync<T>(int count) where T : unmanaged;
Task CopyToDeviceAsync<T>(T[] hostData, T[] deviceData) where T : unmanaged;
Task CopyToHostAsync<T>(T[] deviceData, T[] hostData) where T : unmanaged;
Task<IDotComputeKernel> LoadKernelAsync(string ptxPath, string entryPoint);
}
public interface IDotComputeKernel : IDisposable
{
Task LaunchAsync(Dim3 gridDim, Dim3 blockDim, params object[] args);
Task LaunchPersistentAsync(Dim3 gridDim, Dim3 blockDim, params object[] args);
}
ILGPU Backend
Purpose: Compile C# to GPU code
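public class ILGPUBackend : IGpuBackend, IDisposable
{
    private readonly Context _context;
    private readonly Accelerator _accelerator;
    public ILGPUBackend()
    {
        _context = Context.CreateDefault();
        _accelerator = _context.GetPreferredDevice(preferCPU: false)
            .CreateAccelerator(_context);
    }
    public IGpuKernel<TIn, TOut> CompileKernel<TIn, TOut>(
        Action<Index1D, TIn, TOut> kernelFunc)
        where TIn : struct
        where TOut : struct
    {
        // ILGPU loads kernels from a delegate to a static method, not from an expression tree
        var kernel = _accelerator.LoadAutoGroupedStreamKernel(kernelFunc);
        return new ILGPUKernel<TIn, TOut>(kernel, _accelerator);
    }
    public void Dispose()
    {
        // Dispose the accelerator before the context that created it
        _accelerator.Dispose();
        _context.Dispose();
    }
}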
Placement Strategies
GPU-Aware Placement
Orleans.GpuBridge.Core extends Orleans placement strategies for GPU-aware grain placement:
[Serializable]
public class GpuAwarePlacement : PlacementStrategy
{
public static GpuAwarePlacement Singleton { get; } = new();
}
public class GpuAwarePlacementDirector : IPlacementDirector
{
private readonly IDeviceBroker _deviceBroker;
public Task<SiloAddress> OnAddActivation(
PlacementStrategy strategy,
PlacementTarget target,
IPlacementContext context)
{
// Find silo with available GPU capacity
var silos = context.GetCompatibleSilos(target);
var bestSilo = silos
.Select(s => new
{
Silo = s,
GpuLoad = GetGpuLoad(s),
QueueDepth = GetQueueDepth(s)
})
.OrderBy(x => x.GpuLoad)
.ThenBy(x => x.QueueDepth)
.First();
return Task.FromResult(bestSilo.Silo);
}
private double GetGpuLoad(SiloAddress silo)
{
// Query GPU utilization via metrics
return _deviceBroker.GetDeviceForSilo(silo)?.Utilization ?? 1.0;
}
}
Memory Management
GPU Memory Pool
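public class GpuMemoryPool : IDisposable
{
    private readonly IGpuDevice _device;
    private readonly ConcurrentBag<MemoryBlock> _availableBlocks = new();
    private readonly List<MemoryBlock> _allocatedBlocks = new();
    public GpuMemoryPool(IGpuDevice device) => _device = device;
    public async Task<GpuBuffer<T>> AllocateAsync<T>(int count)
        where T : unmanaged
    {
        // Unsafe.SizeOf (System.Runtime.CompilerServices) gives the in-memory size;
        // Marshal.SizeOf reports the marshaled size, which differs for bool and char
        var size = count * Unsafe.SizeOf<T>();
        // Try to reuse an existing block
        if (_availableBlocks.TryTake(out var block))
        {
            if (block.Size >= size)
            {
                return new GpuBuffer<T>(block, count);
            }
            // Too small for this request: return it to the pool instead of dropping it
            _availableBlocks.Add(block);
        }
        // Allocate a new block
        var newBlock = await _device.AllocateAsync(size);
        lock (_allocatedBlocks) // List<T> is not thread-safe
        {
            _allocatedBlocks.Add(newBlock);
        }
        return new GpuBuffer<T>(newBlock, count);
    }
    public void Release<T>(GpuBuffer<T> buffer) where T : unmanaged
    {
        _availableBlocks.Add(buffer.Block);
    }
    public void Dispose()
    {
        lock (_allocatedBlocks)
        {
            foreach (var block in _allocatedBlocks)
            {
                block.Dispose();
            }
        }
    }
}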
Unified Memory
For GPU-native actors, unified memory simplifies data sharing:
public class UnifiedMemoryAllocator
{
public unsafe T* AllocateUnified<T>(int count) where T : unmanaged
{
void* ptr;
cudaMallocManaged(&ptr, count * sizeof(T), cudaMemAttachGlobal);
return (T*)ptr;
}
public unsafe void Prefetch<T>(T* ptr, int count, int deviceId)
where T : unmanaged
{
cudaMemPrefetchAsync(ptr, count * sizeof(T), deviceId, stream: 0);
}
}
Ring Kernel Lifecycle
Launch Sequence
sequenceDiagram
participant Grain
participant RingManager
participant GPU
Grain->>RingManager: LaunchRingAsync()
RingManager->>GPU: Allocate message queues
GPU-->>RingManager: Queue pointers
RingManager->>GPU: Launch persistent kernel
GPU-->>RingManager: Kernel handle
RingManager-->>Grain: IRingKernel
loop Message Processing
Grain->>IRingKernel: ExecuteAsync(message)
IRingKernel->>GPU: Enqueue message
GPU->>GPU: Ring kernel processes
GPU-->>IRingKernel: Dequeue result
IRingKernel-->>Grain: result
end
Grain->>IRingKernel: StopAsync()
IRingKernel->>GPU: Set termination flag
GPU->>GPU: Exit ring loop
IRingKernel->>GPU: Free resources
GPU-Side Implementation
// Ring kernel implementation
__global__ void ring_kernel(
MessageQueue<Input>* input_queue,
MessageQueue<Output>* output_queue,
ActorState* state,
volatile bool* terminate_flag)
{
while (!*terminate_flag) {
// Non-blocking dequeue
Input msg;
if (!input_queue->try_dequeue(&msg)) {
continue; // No message, keep polling
}
// Process message
Output result = process_message(msg, state);
// Enqueue result
output_queue->enqueue(result);
}
}
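The try_dequeue and enqueue calls in the kernel above imply a bounded queue shared between one producer and one consumer. As a host-side analogue in C#, here is a minimal single-producer/single-consumer ring buffer. It is a sketch of the general technique, not the library's GpuMessageQueue: SketchRingQueue is a hypothetical name, the power-of-two capacity is an assumption that allows index masking, and a real GPU queue would live in device or unified memory with device-side atomics.

```csharp
using System;
using System.Threading;

// Bounded queue for exactly one producer thread and one consumer thread.
public sealed class SketchRingQueue<T>
{
    private readonly T[] _slots;
    private readonly int _mask;
    private long _head; // next slot to read; advanced only by the consumer
    private long _tail; // next slot to write; advanced only by the producer

    public SketchRingQueue(int capacity)
    {
        if (capacity <= 0 || (capacity & (capacity - 1)) != 0)
            throw new ArgumentException("Capacity must be a positive power of two.", nameof(capacity));
        _slots = new T[capacity];
        _mask = capacity - 1;
    }

    // Non-blocking enqueue; returns false when the queue is full.
    public bool TryEnqueue(T item)
    {
        long tail = _tail;
        if (tail - Volatile.Read(ref _head) == _slots.Length) return false;
        _slots[tail & _mask] = item;
        Volatile.Write(ref _tail, tail + 1); // publish only after the slot is written
        return true;
    }

    // Non-blocking dequeue; returns false when the queue is empty.
    public bool TryDequeue(out T item)
    {
        long head = _head;
        if (head == Volatile.Read(ref _tail))
        {
            item = default!;
            return false;
        }
        item = _slots[head & _mask];
        Volatile.Write(ref _head, head + 1); // free the slot for the producer
        return true;
    }
}
```

The consumer side has the same shape as the kernel loop: poll TryDequeue, process the message, and push the result into a second queue running in the opposite direction.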
Design Decisions
1. Why Orleans?
Orleans provides:
- Virtual Actor Model - Simplified distributed programming
- Location Transparency - Actors accessible anywhere
- Automatic Failover - Built-in fault tolerance
- Streaming Support - Reactive event processing
- Production-Proven - Used in production at Microsoft (including the Halo game services) and elsewhere
2. Why CPU Fallback?
CPU fallback enables:
- Testing without GPU - Unit tests on CI servers
- Graceful Degradation - Continue operation on GPU failure
- Development Flexibility - Prototype without GPU hardware
- Hybrid Workloads - Some grains on CPU, some on GPU
3. Why Ring Kernels?
Ring kernels provide:
- No Per-Message Launch Overhead - The kernel is already running
- Persistent State - State never leaves GPU
- Sub-microsecond Latency - 100-500ns message processing
- High Throughput - 2M messages/second per actor
4. Why Multiple Backends?
Multiple backends provide:
- Vendor Independence - Support NVIDIA and AMD
- Optimization Options - Choose best backend per workload
- Future-Proofing - Easy to add new backends (Intel, Apple)
Performance Considerations
Batch Size Optimization
// Launch rates below assume a sustained input of 1M items/second
// Too small: High overhead
batchSize = 10; // 100,000 kernel launches/second
// Too large: High latency
batchSize = 1_000_000; // 1 launch/second, high memory
// Optimal: Balance throughput and latency
batchSize = 10_000; // 100 launches/second, good GPU utilization
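The trade-off reduces to two quantities: how many kernel launches per second a batch size implies, and how long a batch takes to fill before it can be launched. The helper below is a small sketch of that arithmetic. BatchSizing and its method names are hypothetical, and the 1M items/second input rate used in the note afterwards is an assumption chosen to match the 1-launch and 100-launch figures above.

```csharp
using System;

public static class BatchSizing
{
    // Kernel launches per second needed to keep up with the input rate.
    public static double LaunchesPerSecond(double itemsPerSecond, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        return itemsPerSecond / batchSize;
    }

    // Time for one batch to fill, which is a lower bound on the added latency.
    public static double BatchFillSeconds(double itemsPerSecond, int batchSize)
    {
        if (itemsPerSecond <= 0) throw new ArgumentOutOfRangeException(nameof(itemsPerSecond));
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        return batchSize / itemsPerSecond;
    }
}
```

At an assumed 1M items/second, a batch of 10,000 needs 100 launches per second and fills in 10 ms, while a batch of 1,000,000 needs a single launch per second but makes the first item wait a full second.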
Memory Transfer Overhead
Operation                  | Time
---------------------------|---------------------
CPU→GPU transfer (1MB)     | 50μs
GPU→CPU transfer (1MB)     | 50μs
GPU computation (1M ops)   | 100μs
Ring kernel message (100B) | 100ns (no transfer!)
Lesson: Minimize CPU↔GPU transfers. Keep data on GPU with ring kernels.
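To see how much of an offload round trip goes to data movement, the table's figures can be plugged into a simple cost model. The sketch below assumes that transfer and compute time scale linearly with size, which is a simplification, and OffloadCostModel and its members are hypothetical names used only for this estimate.

```csharp
using System;

public static class OffloadCostModel
{
    // Figures taken from the table above, in microseconds.
    public const double TransferPerMbMicros = 50.0;
    public const double ComputePerMillionOpsMicros = 100.0;

    // One offload round trip: copy in, compute, copy out.
    public static double OffloadMicros(double inputMb, double outputMb, double millionOps)
    {
        double transfer = (inputMb + outputMb) * TransferPerMbMicros;
        double compute = millionOps * ComputePerMillionOpsMicros;
        return transfer + compute;
    }

    // Share of the round trip spent moving data rather than computing.
    public static double TransferFraction(double inputMb, double outputMb, double millionOps)
    {
        double total = OffloadMicros(inputMb, outputMb, millionOps);
        if (total <= 0) throw new ArgumentException("Workload must be non-empty.");
        return (inputMb + outputMb) * TransferPerMbMicros / total;
    }
}
```

With 1MB in, 1MB out, and 1M operations, the round trip takes 200μs and half of it is transfer, which is the overhead ring kernels avoid by keeping state resident on the GPU.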
Next Steps
- Getting Started - Build your first GPU grain
- API Reference - Complete API documentation
- GPU-Native Actors - Advanced patterns
- Temporal Correctness - HLC and Vector Clocks