Implementation Roadmap

Orleans.GpuBridge.Core development phases and timeline.

Overview

Orleans.GpuBridge.Core development follows a phased approach, progressively building GPU-native actor capabilities on top of the Orleans runtime.

Version 0.1.0 (Current Release)

Status: Released
Theme: Foundation and Core Infrastructure

Completed Phases

Phase 1: Core Abstractions ✅

Goal: Establish foundational interfaces and contracts

  • IGpuBridge - Main bridge interface for GPU operations
  • IGpuKernel<TIn,TOut> - Kernel execution contract
  • [GpuAccelerated] attribute for grain marking
  • GpuBridgeOptions configuration
  • IHybridLogicalClock and IVectorClock interfaces
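
Taken together, these contracts let an ordinary grain opt into GPU execution. The sketch below shows one plausible shape: the type and attribute names come from the list above, but every member signature is an illustrative assumption rather than the published API.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Orleans;

// Hypothetical member shape for IGpuKernel<TIn, TOut>; only the type name is confirmed above.
public interface IGpuKernel<TIn, TOut>
{
    // Runs one work item on the GPU, or on the Phase 2 CPU fallback when no device is available.
    ValueTask<TOut> ExecuteAsync(TIn input, CancellationToken cancellationToken = default);
}

public interface IImageScoringGrain : IGrainWithStringKey
{
    Task<float> ScoreAsync(byte[] image);
}

// The marker attribute flags the grain for GPU-aware activation and placement.
[GpuAccelerated]
public class ImageScoringGrain : Grain, IImageScoringGrain
{
    private readonly IGpuKernel<byte[], float> _scoreKernel;

    public ImageScoringGrain(IGpuKernel<byte[], float> scoreKernel) => _scoreKernel = scoreKernel;

    public Task<float> ScoreAsync(byte[] image) => _scoreKernel.ExecuteAsync(image).AsTask();
}
```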

Phase 2: Runtime Infrastructure ✅

Goal: Build runtime support for GPU-accelerated grains

  • KernelCatalog - Kernel registration and resolution
  • DeviceBroker - GPU device management
  • DI integration via AddGpuBridge() extension
  • CPU fallback implementation for all kernels
  • Basic placement strategies
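
A minimal silo registration might look like the following. AddGpuBridge() and GpuBridgeOptions are the names listed above, while the option properties and the exact extension target (IServiceCollection here) are assumptions made for illustration.

```csharp
using Microsoft.Extensions.Hosting;
using Orleans.Hosting;

var host = Host.CreateDefaultBuilder(args)
    .UseOrleans(silo =>
    {
        silo.UseLocalhostClustering();

        // Register the GPU bridge; the two option properties below are hypothetical.
        silo.Services.AddGpuBridge(options =>
        {
            options.PreferredDeviceIndex = 0;   // Which GPU the DeviceBroker should favour.
            options.EnableCpuFallback = true;   // Keep the Phase 2 CPU path available.
        });
    })
    .Build();

await host.RunAsync();
```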

Phase 3: Pattern Detection Engine ✅

Goal: Temporal pattern matching capabilities

  • Pattern definition DSL
  • GPU-accelerated pattern matching
  • Causal anomaly detection
  • Temporal sequence recognition
  • Real-time pattern detection
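
The pattern definition DSL has no published grammar in this roadmap, but the idea is to declare a temporal sequence once and let the GPU matcher evaluate it against live event streams. The fluent builder below is purely hypothetical and only illustrates the kind of statement such a DSL expresses.

```csharp
// Entirely illustrative: PatternBuilder, Event, FollowedBy, Within, and Times are invented
// names standing in for whatever the real DSL provides.
var bruteForcePattern = PatternBuilder
    .Sequence("failed-login-burst")
    .Event<LoginAttempt>(e => !e.Succeeded)
    .FollowedBy<LoginAttempt>(e => !e.Succeeded)
    .Within(TimeSpan.FromSeconds(5))
    .Times(atLeast: 3)
    .Build();
```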

Phase 4: Causal Correctness ✅

Goal: Graph analysis and deadlock detection

  • Causal graph construction
  • Happened-before relationship tracking
  • Deadlock detection algorithms
  • Causal consistency verification
  • Vector clock implementation
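
The happened-before relationship these bullets refer to reduces to a component-wise comparison of vector clocks. The helper below states the textbook rule; it is independent of the library's IVectorClock interface.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class VectorClockMath
{
    // A happened-before B when every component of A is <= the matching component of B
    // and at least one component is strictly smaller. Missing entries count as zero.
    public static bool HappenedBefore(IReadOnlyDictionary<string, long> a,
                                      IReadOnlyDictionary<string, long> b)
    {
        var strictlyLess = false;

        foreach (var (actor, ticksA) in a)
        {
            var ticksB = b.TryGetValue(actor, out var t) ? t : 0;
            if (ticksA > ticksB) return false;        // A observed an event B has not seen: not ordered.
            if (ticksA < ticksB) strictlyLess = true;
        }

        // B may also have advanced on actors A never heard from; that still satisfies A -> B.
        return strictlyLess || b.Any(kv => kv.Value > 0 && !a.ContainsKey(kv.Key));
    }
}
```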

Phase 5: GPU Timing Extensions ✅

Goal: Temporal integration and memory ordering

  • GPU clock source integration
  • HLC implementation on GPU (20ns resolution)
  • Memory ordering semantics (Relaxed, ReleaseAcquire, Sequential)
  • Clock calibration between CPU/GPU
  • Software PTP synchronization
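
The 20ns figure refers to the GPU-resident clock update; the merge rule itself is the standard hybrid-logical-clock algorithm. The CPU-side sketch below shows that rule for a received message and is not the library's IHybridLogicalClock implementation.

```csharp
using System;

public readonly record struct HlcTimestamp(long PhysicalNanos, int Logical);

public static class HlcMerge
{
    // Standard HLC receive rule: take the maximum physical component, and advance the
    // logical counter only when the physical component did not move forward.
    public static HlcTimestamp OnReceive(HlcTimestamp local, HlcTimestamp remote, long wallClockNanos)
    {
        var physical = Math.Max(wallClockNanos, Math.Max(local.PhysicalNanos, remote.PhysicalNanos));

        int logical;
        if (physical == local.PhysicalNanos && physical == remote.PhysicalNanos)
            logical = Math.Max(local.Logical, remote.Logical) + 1;
        else if (physical == local.PhysicalNanos)
            logical = local.Logical + 1;
        else if (physical == remote.PhysicalNanos)
            logical = remote.Logical + 1;
        else
            logical = 0;   // The wall clock moved past both timestamps.

        return new HlcTimestamp(physical, logical);
    }
}
```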

Phase 6: Ring Kernel Bridge ✅

Goal: DotCompute backend integration

  • DotCompute 0.5.1 NuGet integration
  • Ring kernel runtime implementation
  • GPU-resident message queues
  • Persistent kernel dispatch loops
  • EventDriven mode for WSL2 compatibility
  • [Kernel] and [RingKernel] attribute support
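
Conceptually, a ring kernel is a message loop that lives on the device: messages land in a GPU-resident queue and a dispatch loop drains them without returning to the host. The plain C# model below only illustrates that control flow, including why EventDriven mode exists; the real loops run as DotCompute kernels, whose API is not reproduced here.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public readonly record struct GpuMessage(long GrainId, int KernelId, ReadOnlyMemory<byte> Payload);

public sealed class RingDispatchModel
{
    private readonly Channel<GpuMessage> _queue = Channel.CreateBounded<GpuMessage>(1024);

    public ValueTask EnqueueAsync(GpuMessage message) => _queue.Writer.WriteAsync(message);

    // Persistent mode: one resident loop drains the queue as messages arrive (100-500ns on native Linux).
    public async Task RunPersistentAsync(Func<GpuMessage, ValueTask> handler, CancellationToken ct)
    {
        await foreach (var message in _queue.Reader.ReadAllAsync(ct))
            await handler(message);
    }

    // EventDriven mode (the WSL2-compatible path): the loop is relaunched per batch, because
    // GPU-PV lacks the system-scope atomics a persistent loop needs to observe host memory.
    public async Task RunEventDrivenBatchAsync(Func<GpuMessage, ValueTask> handler)
    {
        while (_queue.Reader.TryRead(out var message))
            await handler(message);
    }
}
```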

v0.1.0 Features Summary

Feature | Status | Notes
GPU kernel execution | ✅ | Via DotCompute 0.5.1
CPU fallback | ✅ | All operations
Temporal clocks (HLC) | ✅ | GPU-native, 20ns
Vector clocks | ✅ | Causal ordering
Pattern detection | ✅ | Real-time
Ring kernels | ✅ | EventDriven mode
Orleans integration | ✅ | Full lifecycle
Documentation | ✅ | DocFX-based

Future Versions

Version 0.2.0 (Planned)

Theme: Production Hardening

Phase 7: Queue-Depth Aware Placement

Goal: Intelligent grain placement based on GPU load

  • Monitor ring kernel queue depths
  • Dynamic load balancing across GPUs
  • Placement director integration
  • Silo preference for GPU locality
  • Queue overflow handling
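
Orleans already exposes the hook this phase needs: a custom IPlacementDirector chooses the target silo per activation. The sketch below is one plausible shape, with IGpuQueueDepthProvider standing in as a hypothetical source of the ring-kernel queue metrics Phase 7 would collect.

```csharp
using System.Linq;
using System.Threading.Tasks;
using Orleans.Runtime;
using Orleans.Runtime.Placement;

// Hypothetical metrics source; Phase 7 would back this with live ring-kernel queue depths.
public interface IGpuQueueDepthProvider
{
    int GetQueueDepth(SiloAddress silo);
}

public sealed class QueueDepthPlacementDirector : IPlacementDirector
{
    private readonly IGpuQueueDepthProvider _queueDepths;

    public QueueDepthPlacementDirector(IGpuQueueDepthProvider queueDepths) => _queueDepths = queueDepths;

    public Task<SiloAddress> OnAddActivation(
        PlacementStrategy strategy, PlacementTarget target, IPlacementContext context)
    {
        // Activate the grain on the compatible silo whose GPU queue is currently shallowest.
        var best = context.GetCompatibleSilos(target)
                          .OrderBy(silo => _queueDepths.GetQueueDepth(silo))
                          .First();
        return Task.FromResult(best);
    }
}
```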

Phase 8: GPU Memory Management

Goal: Efficient GPU memory utilization

  • Memory pool management
  • LRU eviction policies
  • Memory pressure detection
  • Automatic cleanup on deactivation
  • Memory usage metrics
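
The LRU eviction policy listed above can be stated compactly: track each grain's device buffer by last access and evict from the tail when a new allocation would exceed the pool's capacity. The sketch below shows only that bookkeeping; the buffer type and sizing are illustrative, not the planned Phase 8 API.

```csharp
using System.Collections.Generic;

public sealed class GpuBufferPoolModel
{
    private readonly long _capacityBytes;
    private long _usedBytes;

    // Most recently used entries sit at the head of the list; the index gives O(1) lookup.
    private readonly LinkedList<(string GrainKey, long Bytes)> _lru = new();
    private readonly Dictionary<string, LinkedListNode<(string GrainKey, long Bytes)>> _index = new();

    public GpuBufferPoolModel(long capacityBytes) => _capacityBytes = capacityBytes;

    public void Touch(string grainKey, long bytes)
    {
        if (_index.TryGetValue(grainKey, out var node))
        {
            _lru.Remove(node);      // Refresh recency on every access.
            _lru.AddFirst(node);
            return;
        }

        // Memory pressure: evict least recently used buffers until the new one fits.
        while (_usedBytes + bytes > _capacityBytes && _lru.Last is { } victim)
        {
            _usedBytes -= victim.Value.Bytes;
            _index.Remove(victim.Value.GrainKey);
            _lru.RemoveLast();
        }

        _index[grainKey] = _lru.AddFirst((grainKey, bytes));
        _usedBytes += bytes;
    }
}
```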

Version 0.3.0 (Planned)

Theme: Multi-GPU and Distributed

Phase 9: Multi-GPU Support

Goal: Scale across multiple GPUs

  • GPU-to-GPU communication (NVLink/PCIe)
  • Cross-GPU actor migration
  • GPU affinity for related actors
  • Load balancing across GPUs

Phase 10: Distributed GPU Clusters

Goal: Cross-node GPU coordination

  • GPUDirect RDMA integration
  • Cross-silo GPU messaging
  • Distributed ring kernels
  • Global temporal ordering

Version 1.0.0 (Future)

Theme: Production-Ready

  • Comprehensive monitoring and observability
  • Performance tuning documentation
  • Production deployment guides
  • SLA guarantees
  • Enterprise support options

Known Limitations

WSL2 Limitations

WSL2's GPU virtualization (GPU-PV) has fundamental limitations that affect GPU-native actors:

Limitation | Impact | Workaround
No system-scope atomics | Persistent kernels don't see host memory changes | Use EventDriven mode
No unified memory coherence | GPU can't poll CPU flags | Start-active pattern
Message latency | ~5 seconds vs 100-500ns | EventDriven relaunch
Memory spill | Can't use system RAM | Monitor VRAM usage

Recommendation: WSL2 is suitable for development and functional testing only. Production deployments should use native Linux.
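
For WSL2 development silos, EventDriven mode is selected through the bridge options. The property and enum names below are assumptions; only AddGpuBridge(), GpuBridgeOptions, and the EventDriven mode itself appear in this roadmap.

```csharp
// Hypothetical WSL2 development configuration.
siloBuilder.Services.AddGpuBridge(options =>
{
    // Relaunch kernels per batch instead of running persistent kernels, since GPU-PV
    // lacks the system-scope atomics persistent kernels need to observe host memory.
    options.ExecutionMode = GpuExecutionMode.EventDriven;   // Property and enum names are illustrative.
});
```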

General Limitations (v0.1.0)

Limitation | Planned Resolution
Single GPU per silo | Phase 9 (v0.3.0)
Manual placement hints | Phase 7 (v0.2.0)
No GPU-to-GPU messaging | Phase 9 (v0.3.0)
Limited memory management | Phase 8 (v0.2.0)

Performance Targets

v0.1.0 Achieved

Metric | Target | Achieved | Notes
HLC update | <50ns | 20ns | GPU-native
Message latency (native) | <1μs | 100-500ns | Persistent kernel
Message latency (WSL2) | <10s | ~5s | EventDriven mode
Kernel launch | <50ms | ~5ms | With warmup

v0.2.0 Targets

Metric | Target | Notes
Queue depth detection | <1ms | For placement
Memory pool allocation | <10μs | Pre-allocated
Load balancing latency | <100ms | Dynamic rebalancing

v1.0.0 Targets

Metric | Target | Notes
Cross-GPU latency | <10μs | NVLink
Cross-node GPU latency | <100μs | GPUDirect RDMA
Memory efficiency | >90% | Pool utilization

Contributing

Phase Implementation Guidelines

  1. Research Phase: Document approach in docs/architecture/
  2. Implementation: Follow TDD, write tests first
  3. Integration: Ensure backward compatibility
  4. Documentation: Update relevant guides
  5. Review: PR with comprehensive testing

Priority Areas for Contribution

  • Phase 7: Queue-depth placement (high impact)
  • WSL2 workarounds (developer experience)
  • Performance benchmarks (documentation)
  • Multi-GPU prototypes (future planning)

Implementation Roadmap: Building the future of GPU-native distributed computing.

Version: 0.1.0 | Last Updated: 2025-11-28