Implementation Roadmap
Orleans.GpuBridge.Core development phases and timeline.
Overview
Orleans.GpuBridge.Core development follows a phased approach, progressively building GPU-native actor capabilities on top of the Orleans runtime.
Version 0.1.0 (Current Release)
Status: Released
Theme: Foundation and Core Infrastructure
Completed Phases
Phase 1: Core Abstractions ✅
Goal: Establish foundational interfaces and contracts
- `IGpuBridge` - Main bridge interface for GPU operations
- `IGpuKernel<TIn,TOut>` - Kernel execution contract (see the usage sketch after this list)
- `[GpuAccelerated]` attribute for grain marking
- `GpuBridgeOptions` configuration
- `IHybridLogicalClock` and `IVectorClock` interfaces
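A minimal sketch of how these abstractions fit together inside a grain. The member name `ExecuteAsync` on `IGpuKernel<TIn,TOut>` and the grain interface `IVectorScaleGrain` are assumptions for illustration, not the published contract.

```csharp
// Hypothetical grain using the v0.1.0 abstractions; member names are assumed.
[GpuAccelerated]
public class VectorScaleGrain : Grain, IVectorScaleGrain
{
    private readonly IGpuKernel<float[], float[]> _kernel;

    // The kernel implementation is resolved via DI (see Phase 2).
    public VectorScaleGrain(IGpuKernel<float[], float[]> kernel) => _kernel = kernel;

    public Task<float[]> ScaleAsync(float[] input)
        => _kernel.ExecuteAsync(input);   // falls back to the CPU path when no GPU is present
}
```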
Phase 2: Runtime Infrastructure ✅
Goal: Build runtime support for GPU-accelerated grains
- `KernelCatalog` - Kernel registration and resolution
- `DeviceBroker` - GPU device management
- DI integration via `AddGpuBridge()` extension (see the host setup sketch after this list)
- CPU fallback implementation for all kernels
- Basic placement strategies
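A sketch of the silo host registration, assuming standard Orleans hosting. `AddGpuBridge()` is the extension named above, but the option names passed to it here are assumptions.

```csharp
// Hypothetical silo host setup; option names inside AddGpuBridge() are assumed.
using Microsoft.Extensions.Hosting;

var host = Host.CreateDefaultBuilder(args)
    .UseOrleans(silo => silo.UseLocalhostClustering())
    .ConfigureServices(services =>
        services.AddGpuBridge(options =>
        {
            options.EnableCpuFallback = true;   // assumed option name
        }))
    .Build();

await host.RunAsync();
```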
Phase 3: Pattern Detection Engine ✅
Goal: Temporal pattern matching capabilities
- Pattern definition DSL
- GPU-accelerated pattern matching
- Causal anomaly detection
- Temporal sequence recognition
- Real-time pattern detection
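To make the kind of check Phase 3 accelerates concrete, here is a simplified, CPU-only reference sketch of temporal sequence recognition: does a given ordered sequence of event types occur within a time window? It is a greedy single-pass matcher for illustration only, not the library's DSL or its GPU implementation.

```csharp
// Simplified greedy sequence matcher; a CPU reference sketch, not the GPU engine.
using System;
using System.Collections.Generic;

public static class SequenceDetector
{
    public static bool Matches(
        IReadOnlyList<(string Type, DateTimeOffset At)> events,   // assumed time-ordered
        IReadOnlyList<string> pattern,
        TimeSpan window)
    {
        int next = 0;
        DateTimeOffset? start = null;
        foreach (var e in events)
        {
            if (e.Type != pattern[next]) continue;
            start ??= e.At;                                 // remember when the match began
            if (e.At - start.Value > window) return false;  // match exceeded the window
            if (++next == pattern.Count) return true;       // full pattern seen in order
        }
        return false;
    }
}
```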
Phase 4: Causal Correctness ✅
Goal: Graph analysis and deadlock detection
- Causal graph construction
- Happened-before relationship tracking
- Deadlock detection algorithms
- Causal consistency verification
- Vector clock implementation
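The happened-before tracking above rests on the standard vector clock comparison. The following is a standalone illustration of that textbook rule, not the library's `IVectorClock` implementation.

```csharp
// a happened-before b iff every component of a is <= b and at least one is strictly less.
using System.Collections.Generic;
using System.Linq;

static class VectorClockOrder
{
    public static bool HappenedBefore(
        IReadOnlyDictionary<string, long> a,
        IReadOnlyDictionary<string, long> b)
    {
        bool strictlyLess = false;
        foreach (var node in a.Keys.Union(b.Keys))
        {
            long av = a.GetValueOrDefault(node);
            long bv = b.GetValueOrDefault(node);
            if (av > bv) return false;        // some component is greater: not ordered before b
            if (av < bv) strictlyLess = true;
        }
        return strictlyLess;                  // equal clocks are concurrent, not ordered
    }
}
```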
Phase 5: GPU Timing Extensions ✅
Goal: Temporal integration and memory ordering
- GPU clock source integration
- HLC implementation on GPU (20ns resolution)
- Memory ordering semantics (Relaxed, ReleaseAcquire, Sequential)
- Clock calibration between CPU/GPU
- Software PTP synchronization
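For reference, the hybrid logical clock follows the standard HLC merge rule. The sketch below shows that rule on the CPU for clarity; the GPU-resident Phase 5 implementation applies the same logic at the ~20ns cost quoted above.

```csharp
// Standard HLC receive rule (merge a remote timestamp with the local clock).
using System;

public struct Hlc
{
    public long Physical;  // wall-clock component (e.g. nanoseconds)
    public long Logical;   // tie-breaking counter

    public static Hlc Receive(Hlc local, Hlc remote, long nowPhysical)
    {
        long physical = Math.Max(nowPhysical, Math.Max(local.Physical, remote.Physical));
        long logical;
        if (physical == local.Physical && physical == remote.Physical)
            logical = Math.Max(local.Logical, remote.Logical) + 1;
        else if (physical == local.Physical)
            logical = local.Logical + 1;
        else if (physical == remote.Physical)
            logical = remote.Logical + 1;
        else
            logical = 0;
        return new Hlc { Physical = physical, Logical = logical };
    }
}
```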
Phase 6: Ring Kernel Bridge ✅
Goal: DotCompute backend integration
- DotCompute 0.5.1 NuGet integration
- Ring kernel runtime implementation
- GPU-resident message queues
- Persistent kernel dispatch loops
- EventDriven mode for WSL2 compatibility
- `[Kernel]` and `[RingKernel]` attribute support (sketched below)
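A rough sketch of how the two attributes differ in intent. The method shapes and parameters shown are assumptions about DotCompute's kernel model, not its documented API.

```csharp
// Hypothetical kernel declarations; signatures are illustrative assumptions.
public static class TemporalKernels
{
    [Kernel]  // one-shot kernel dispatched per call
    public static void Scale(Span<float> data, float factor)
    {
        for (int i = 0; i < data.Length; i++)
            data[i] *= factor;
    }

    [RingKernel]  // persistent kernel that drains a GPU-resident message queue
    public static void ProcessMessages(/* ring-buffer parameters elided */)
    {
        // In EventDriven mode (used on WSL2) the runtime relaunches this kernel
        // per message batch instead of keeping it resident, trading latency
        // for compatibility with GPU-PV.
    }
}
```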
v0.1.0 Features Summary
| Feature | Status | Notes |
|---|---|---|
| GPU kernel execution | ✅ | Via DotCompute 0.5.1 |
| CPU fallback | ✅ | All operations |
| Temporal clocks (HLC) | ✅ | GPU-native, 20ns |
| Vector clocks | ✅ | Causal ordering |
| Pattern detection | ✅ | Real-time |
| Ring kernels | ✅ | EventDriven mode |
| Orleans integration | ✅ | Full lifecycle |
| Documentation | ✅ | DocFX-based |
Future Versions
Version 0.2.0 (Planned)
Theme: Production Hardening
Phase 7: Queue-Depth Aware Placement
Goal: Intelligent grain placement based on GPU load
- Monitor ring kernel queue depths
- Dynamic load balancing across GPUs
- Placement director integration
- Silo preference for GPU locality
- Queue overflow handling
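One way Phase 7 could plug into Orleans is a custom placement director that prefers the least-loaded compatible silo. The sketch below assumes Orleans's `IPlacementDirector` extension point; `IGpuQueueDepthProvider` is a hypothetical abstraction introduced here for illustration.

```csharp
// Sketch of queue-depth aware placement; IGpuQueueDepthProvider is hypothetical.
using System.Linq;
using System.Threading.Tasks;
using Orleans.Runtime;
using Orleans.Runtime.Placement;

public interface IGpuQueueDepthProvider      // hypothetical queue-depth source
{
    int GetDepth(SiloAddress silo);
}

public sealed class QueueDepthAwarePlacementDirector : IPlacementDirector
{
    private readonly IGpuQueueDepthProvider _queueDepths;

    public QueueDepthAwarePlacementDirector(IGpuQueueDepthProvider queueDepths)
        => _queueDepths = queueDepths;

    public Task<SiloAddress> OnAddActivation(
        PlacementStrategy strategy, PlacementTarget target, IPlacementContext context)
    {
        // Prefer the compatible silo whose ring-kernel queues are least loaded.
        var chosen = context.GetCompatibleSilos(target)
            .OrderBy(s => _queueDepths.GetDepth(s))
            .First();
        return Task.FromResult(chosen);
    }
}
```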
Phase 8: GPU Memory Management
Goal: Efficient GPU memory utilization
- Memory pool management
- LRU eviction policies
- Memory pressure detection
- Automatic cleanup on deactivation
- Memory usage metrics
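The LRU eviction policy planned for Phase 8 follows the familiar map-plus-linked-list scheme. The generic sketch below illustrates the policy only; the actual GPU memory pool types are not part of v0.1.0.

```csharp
// Generic LRU eviction sketch; release() returns an evicted buffer to the allocator.
using System;
using System.Collections.Generic;

public sealed class LruPool<TKey, TBuffer> where TKey : notnull
{
    private readonly int _capacity;
    private readonly Dictionary<TKey, LinkedListNode<(TKey Key, TBuffer Buffer)>> _map = new();
    private readonly LinkedList<(TKey Key, TBuffer Buffer)> _order = new();

    public LruPool(int capacity) => _capacity = capacity;

    public void Add(TKey key, TBuffer buffer, Action<TBuffer> release)
    {
        if (_map.Count >= _capacity)
        {
            // Evict the least recently used buffer.
            var lru = _order.Last!;
            _order.RemoveLast();
            _map.Remove(lru.Value.Key);
            release(lru.Value.Buffer);
        }
        _map[key] = _order.AddFirst((key, buffer));
    }

    public bool TryGet(TKey key, out TBuffer buffer)
    {
        if (_map.TryGetValue(key, out var node))
        {
            _order.Remove(node);       // mark as most recently used
            _order.AddFirst(node);
            buffer = node.Value.Buffer;
            return true;
        }
        buffer = default!;
        return false;
    }
}
```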
Version 0.3.0 (Planned)
Theme: Multi-GPU and Distributed
Phase 9: Multi-GPU Support
Goal: Scale across multiple GPUs
- GPU-to-GPU communication (NVLink/PCIe)
- Cross-GPU actor migration
- GPU affinity for related actors
- Load balancing across GPUs
Phase 10: Distributed GPU Clusters
Goal: Cross-node GPU coordination
- GPUDirect RDMA integration
- Cross-silo GPU messaging
- Distributed ring kernels
- Global temporal ordering
Version 1.0.0 (Future)
Theme: Production-Ready
- Comprehensive monitoring and observability
- Performance tuning documentation
- Production deployment guides
- SLA guarantees
- Enterprise support options
Known Limitations
WSL2 Limitations
WSL2's GPU virtualization (GPU-PV) has fundamental limitations that affect GPU-native actors:
| Limitation | Impact | Workaround |
|---|---|---|
| No system-scope atomics | Persistent kernels don't see host memory changes | Use EventDriven mode |
| No unified memory coherence | GPU can't poll CPU flags | Start-active pattern |
| Message latency | ~5 seconds vs 100-500ns | EventDriven relaunch |
| Memory spill | Can't use system RAM | Monitor VRAM usage |
Recommendation: WSL2 is suitable for development and functional testing only. Production deployments should use native Linux.
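For WSL2 development hosts, the workarounds above amount to selecting EventDriven dispatch at configuration time. The option and enum names in this sketch are assumptions based on the mode described in the table, not confirmed configuration keys.

```csharp
// Hypothetical WSL2-friendly configuration; names are assumed.
builder.Services.AddGpuBridge(options =>
{
    // On WSL2, persistent kernels cannot observe host-side memory writes,
    // so the runtime relaunches kernels per message batch instead.
    options.RingKernelMode = RingKernelMode.EventDriven;  // assumed option/enum
});
```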
General Limitations (v0.1.0)
| Limitation | Planned Resolution |
|---|---|
| Single GPU per silo | Phase 9 (v0.3.0) |
| Manual placement hints | Phase 7 (v0.2.0) |
| No GPU-to-GPU messaging | Phase 9 (v0.3.0) |
| Limited memory management | Phase 8 (v0.2.0) |
Performance Targets
v0.1.0 Achieved
| Metric | Target | Achieved | Notes |
|---|---|---|---|
| HLC update | <50ns | 20ns | GPU-native |
| Message latency (native) | <1μs | 100-500ns | Persistent kernel |
| Message latency (WSL2) | <10s | ~5s | EventDriven mode |
| Kernel launch | <50ms | ~5ms | With warmup |
v0.2.0 Targets
| Metric | Target | Notes |
|---|---|---|
| Queue depth detection | <1ms | For placement |
| Memory pool allocation | <10μs | Pre-allocated |
| Load balancing latency | <100ms | Dynamic rebalancing |
v1.0.0 Targets
| Metric | Target | Notes |
|---|---|---|
| Cross-GPU latency | <10μs | NVLink |
| Cross-node GPU | <100μs | GPUDirect RDMA |
| Memory efficiency | >90% | Pool utilization |
Contributing
Phase Implementation Guidelines
- Research Phase: Document approach in `docs/architecture/`
- Implementation: Follow TDD, write tests first
- Integration: Ensure backward compatibility
- Documentation: Update relevant guides
- Review: PR with comprehensive testing
Priority Areas for Contribution
- Phase 7: Queue-depth placement (high impact)
- WSL2 workarounds (developer experience)
- Performance benchmarks (documentation)
- Multi-GPU prototypes (future planning)
See Also
- Ring Kernel Integration - Phase 6 details
- Hybrid Layered Architecture - Architecture overview
- Phase 7 Implementation Guide - Next phase details
- Performance Benchmarks - Current measurements
Implementation Roadmap: Building the future of GPU-native distributed computing.
Version: 0.1.0 Last Updated: 2025-11-28