Implementation Roadmap
Orleans.GpuBridge.Core development phases and timeline.
Overview
Orleans.GpuBridge.Core development follows a phased approach, progressively building GPU-native actor capabilities on top of the Orleans runtime.
Version 0.1.0 (Current Release)
Status: Released
Theme: Foundation and Core Infrastructure
Completed Phases
Phase 1: Core Abstractions ✅
Goal: Establish foundational interfaces and contracts
- `IGpuBridge` - Main bridge interface for GPU operations
- `IGpuKernel<TIn,TOut>` - Kernel execution contract (see the usage sketch after this list)
- `[GpuAccelerated]` attribute for grain marking
- `GpuBridgeOptions` configuration
- `IHybridLogicalClock` and `IVectorClock` interfaces
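A minimal sketch of how these abstractions fit together inside a grain. The member name `ExecuteAsync` on `IGpuKernel<TIn,TOut>` and the grain interface `IVectorScaleGrain` are assumptions for illustration, not the published contract.

```csharp
// Hypothetical grain using the v0.1.0 abstractions; member names are assumed.
[GpuAccelerated]
public class VectorScaleGrain : Grain, IVectorScaleGrain
{
    private readonly IGpuKernel<float[], float[]> _kernel;

    // The kernel implementation is resolved via DI (see Phase 2).
    public VectorScaleGrain(IGpuKernel<float[], float[]> kernel) => _kernel = kernel;

    public Task<float[]> ScaleAsync(float[] input)
        => _kernel.ExecuteAsync(input);   // falls back to the CPU path when no GPU is present
}
```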
Phase 2: Runtime Infrastructure ✅
Goal: Build runtime support for GPU-accelerated grains
- `KernelCatalog` - Kernel registration and resolution
- `DeviceBroker` - GPU device management
- DI integration via `AddGpuBridge()` extension (see the host setup sketch after this list)
- CPU fallback implementation for all kernels
- Basic placement strategies
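A sketch of the silo host registration, assuming standard Orleans hosting. `AddGpuBridge()` is the extension named above, but the option names passed to it here are assumptions.

```csharp
// Hypothetical silo host setup; option names inside AddGpuBridge() are assumed.
using Microsoft.Extensions.Hosting;

var host = Host.CreateDefaultBuilder(args)
    .UseOrleans(silo => silo.UseLocalhostClustering())
    .ConfigureServices(services =>
        services.AddGpuBridge(options =>
        {
            options.EnableCpuFallback = true;   // assumed option name
        }))
    .Build();

await host.RunAsync();
```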
Phase 3: Pattern Detection Engine ✅
Goal: Temporal pattern matching capabilities
- Pattern definition DSL
- GPU-accelerated pattern matching
- Causal anomaly detection
- Temporal sequence recognition
- Real-time pattern detection
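To make the kind of check Phase 3 accelerates concrete, here is a simplified, CPU-only reference sketch of temporal sequence recognition: does a given ordered sequence of event types occur within a time window? It is a greedy single-pass matcher for illustration only, not the library's DSL or its GPU implementation.

```csharp
// Simplified greedy sequence matcher; a CPU reference sketch, not the GPU engine.
using System;
using System.Collections.Generic;

public static class SequenceDetector
{
    public static bool Matches(
        IReadOnlyList<(string Type, DateTimeOffset At)> events,   // assumed time-ordered
        IReadOnlyList<string> pattern,
        TimeSpan window)
    {
        int next = 0;
        DateTimeOffset? start = null;
        foreach (var e in events)
        {
            if (e.Type != pattern[next]) continue;
            start ??= e.At;                                 // remember when the match began
            if (e.At - start.Value > window) return false;  // match exceeded the window
            if (++next == pattern.Count) return true;       // full pattern seen in order
        }
        return false;
    }
}
```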
Phase 4: Causal Correctness ✅
Goal: Graph analysis and deadlock detection
- Causal graph construction
- Happened-before relationship tracking
- Deadlock detection algorithms
- Causal consistency verification
- Vector clock implementation
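The happened-before tracking above rests on the standard vector clock comparison. The following is a standalone illustration of that textbook rule, not the library's `IVectorClock` implementation.

```csharp
// a happened-before b iff every component of a is <= b and at least one is strictly less.
using System.Collections.Generic;
using System.Linq;

static class VectorClockOrder
{
    public static bool HappenedBefore(
        IReadOnlyDictionary<string, long> a,
        IReadOnlyDictionary<string, long> b)
    {
        bool strictlyLess = false;
        foreach (var node in a.Keys.Union(b.Keys))
        {
            long av = a.GetValueOrDefault(node);
            long bv = b.GetValueOrDefault(node);
            if (av > bv) return false;        // some component is greater: not ordered before b
            if (av < bv) strictlyLess = true;
        }
        return strictlyLess;                  // equal clocks are concurrent, not ordered
    }
}
```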
Phase 5: GPU Timing Extensions ✅
Goal: Temporal integration and memory ordering
- GPU clock source integration
- HLC implementation on GPU (20ns resolution)
- Memory ordering semantics (Relaxed, ReleaseAcquire, Sequential)
- Clock calibration between CPU/GPU
- Software PTP synchronization
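For reference, the hybrid logical clock follows the standard HLC merge rule. The sketch below shows that rule on the CPU for clarity; the GPU-resident Phase 5 implementation applies the same logic at the ~20ns cost quoted above.

```csharp
// Standard HLC receive rule (merge a remote timestamp with the local clock).
using System;

public struct Hlc
{
    public long Physical;  // wall-clock component (e.g. nanoseconds)
    public long Logical;   // tie-breaking counter

    public static Hlc Receive(Hlc local, Hlc remote, long nowPhysical)
    {
        long physical = Math.Max(nowPhysical, Math.Max(local.Physical, remote.Physical));
        long logical;
        if (physical == local.Physical && physical == remote.Physical)
            logical = Math.Max(local.Logical, remote.Logical) + 1;
        else if (physical == local.Physical)
            logical = local.Logical + 1;
        else if (physical == remote.Physical)
            logical = remote.Logical + 1;
        else
            logical = 0;
        return new Hlc { Physical = physical, Logical = logical };
    }
}
```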
Phase 6: Ring Kernel Bridge ✅
Goal: DotCompute backend integration
- DotCompute 0.5.1 NuGet integration
- Ring kernel runtime implementation
- GPU-resident message queues
- Persistent kernel dispatch loops
- EventDriven mode for WSL2 compatibility
- `[Kernel]` and `[RingKernel]` attribute support (sketched below)
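A rough sketch of how the two attributes differ in intent. The method shapes and parameters shown are assumptions about DotCompute's kernel model, not its documented API.

```csharp
// Hypothetical kernel declarations; signatures are illustrative assumptions.
public static class TemporalKernels
{
    [Kernel]  // one-shot kernel dispatched per call
    public static void Scale(Span<float> data, float factor)
    {
        for (int i = 0; i < data.Length; i++)
            data[i] *= factor;
    }

    [RingKernel]  // persistent kernel that drains a GPU-resident message queue
    public static void ProcessMessages(/* ring-buffer parameters elided */)
    {
        // In EventDriven mode (used on WSL2) the runtime relaunches this kernel
        // per message batch instead of keeping it resident, trading latency
        // for compatibility with GPU-PV.
    }
}
```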
v0.1.0 Features Summary
| Feature | Status | Notes |
|---|---|---|
| GPU kernel execution | ✅ | Via DotCompute 0.5.1 |
| CPU fallback | ✅ | All operations |
| Temporal clocks (HLC) | ✅ | GPU-native, 20ns |
| Vector clocks | ✅ | Causal ordering |
| Pattern detection | ✅ | Real-time |
| Ring kernels | ✅ | EventDriven mode |
| Orleans integration | ✅ | Full lifecycle |
| Documentation | ✅ | DocFX-based |
Future Versions
Version 0.2.0 (Planned)
Theme: Production Hardening
Phase 7: Queue-Depth Aware Placement
Goal: Intelligent grain placement based on GPU load
- Monitor ring kernel queue depths
- Dynamic load balancing across GPUs
- Placement director integration
- Silo preference for GPU locality
- Queue overflow handling
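One way Phase 7 could plug into Orleans is a custom placement director that prefers the least-loaded compatible silo. The sketch below assumes Orleans's `IPlacementDirector` extension point; `IGpuQueueDepthProvider` is a hypothetical abstraction introduced here for illustration.

```csharp
// Sketch of queue-depth aware placement; IGpuQueueDepthProvider is hypothetical.
using System.Linq;
using System.Threading.Tasks;
using Orleans.Runtime;
using Orleans.Runtime.Placement;

public interface IGpuQueueDepthProvider      // hypothetical queue-depth source
{
    int GetDepth(SiloAddress silo);
}

public sealed class QueueDepthAwarePlacementDirector : IPlacementDirector
{
    private readonly IGpuQueueDepthProvider _queueDepths;

    public QueueDepthAwarePlacementDirector(IGpuQueueDepthProvider queueDepths)
        => _queueDepths = queueDepths;

    public Task<SiloAddress> OnAddActivation(
        PlacementStrategy strategy, PlacementTarget target, IPlacementContext context)
    {
        // Prefer the compatible silo whose ring-kernel queues are least loaded.
        var chosen = context.GetCompatibleSilos(target)
            .OrderBy(s => _queueDepths.GetDepth(s))
            .First();
        return Task.FromResult(chosen);
    }
}
```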
Phase 8: GPU Memory Management
Goal: Efficient GPU memory utilization
- Memory pool management
- LRU eviction policies
- Memory pressure detection
- Automatic cleanup on deactivation
- Memory usage metrics
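The LRU eviction policy planned for Phase 8 follows the familiar map-plus-linked-list scheme. The generic sketch below illustrates the policy only; the actual GPU memory pool types are not part of v0.1.0.

```csharp
// Generic LRU eviction sketch; release() returns an evicted buffer to the allocator.
using System;
using System.Collections.Generic;

public sealed class LruPool<TKey, TBuffer> where TKey : notnull
{
    private readonly int _capacity;
    private readonly Dictionary<TKey, LinkedListNode<(TKey Key, TBuffer Buffer)>> _map = new();
    private readonly LinkedList<(TKey Key, TBuffer Buffer)> _order = new();

    public LruPool(int capacity) => _capacity = capacity;

    public void Add(TKey key, TBuffer buffer, Action<TBuffer> release)
    {
        if (_map.Count >= _capacity)
        {
            // Evict the least recently used buffer.
            var lru = _order.Last!;
            _order.RemoveLast();
            _map.Remove(lru.Value.Key);
            release(lru.Value.Buffer);
        }
        _map[key] = _order.AddFirst((key, buffer));
    }

    public bool TryGet(TKey key, out TBuffer buffer)
    {
        if (_map.TryGetValue(key, out var node))
        {
            _order.Remove(node);       // mark as most recently used
            _order.AddFirst(node);
            buffer = node.Value.Buffer;
            return true;
        }
        buffer = default!;
        return false;
    }
}
```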
Version 0.3.0 (Planned)
Theme: Multi-GPU and Distributed
Phase 9: Multi-GPU Support
Goal: Scale across multiple GPUs
- GPU-to-GPU communication (NVLink/PCIe)
- Cross-GPU actor migration
- GPU affinity for related actors
- Load balancing across GPUs
Phase 10: Distributed GPU Clusters
Goal: Cross-node GPU coordination
- GPUDirect RDMA integration
- Cross-silo GPU messaging
- Distributed ring kernels
- Global temporal ordering
Version 1.0.0 (Future)
Theme: Production-Ready
- Comprehensive monitoring and observability
- Performance tuning documentation
- Production deployment guides
- SLA guarantees
- Enterprise support options
Known Limitations
WSL2 Limitations
WSL2's GPU virtualization (GPU-PV) has fundamental limitations that affect GPU-native actors:
| Limitation | Impact | Workaround |
|---|---|---|
| No system-scope atomics | Persistent kernels don't see host memory changes | Use EventDriven mode |
| No unified memory coherence | GPU can't poll CPU flags | Start-active pattern |
| Message latency | ~5 seconds vs 100-500ns | EventDriven relaunch |
| Memory spill | Can't use system RAM | Monitor VRAM usage |
Recommendation: WSL2 is suitable for development and functional testing only. Production deployments should use native Linux.
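For WSL2 development hosts, the workarounds above amount to selecting EventDriven dispatch at configuration time. The option and enum names in this sketch are assumptions based on the mode described in the table, not confirmed configuration keys.

```csharp
// Hypothetical WSL2-friendly configuration; names are assumed.
builder.Services.AddGpuBridge(options =>
{
    // On WSL2, persistent kernels cannot observe host-side memory writes,
    // so the runtime relaunches kernels per message batch instead.
    options.RingKernelMode = RingKernelMode.EventDriven;  // assumed option/enum
});
```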
General Limitations (v0.1.0)
| Limitation | Planned Resolution |
|---|---|
| Single GPU per silo | Phase 9 (v0.3.0) |
| Manual placement hints | Phase 7 (v0.2.0) |
| No GPU-to-GPU messaging | Phase 9 (v0.3.0) |
| Limited memory management | Phase 8 (v0.2.0) |
Performance Targets
v0.1.0 Achieved
| Metric | Target | Achieved | Notes |
|---|---|---|---|
| HLC update | <50ns | 20ns | GPU-native |
| Message latency (native) | <1μs | 100-500ns | Persistent kernel |
| Message latency (WSL2) | <10s | ~5s | EventDriven mode |
| Kernel launch | <50ms | ~5ms | With warmup |
v0.2.0 Targets
| Metric | Target | Notes |
|---|---|---|
| Queue depth detection | <1ms | For placement |
| Memory pool allocation | <10μs | Pre-allocated |
| Load balancing latency | <100ms | Dynamic rebalancing |
v1.0.0 Targets
| Metric | Target | Notes |
|---|---|---|
| Cross-GPU latency | <10μs | NVLink |
| Cross-node GPU | <100μs | GPUDirect RDMA |
| Memory efficiency | >90% | Pool utilization |
Contributing
Phase Implementation Guidelines
- Research Phase: Document approach in `docs/architecture/`
- Implementation: Follow TDD, write tests first
- Integration: Ensure backward compatibility
- Documentation: Update relevant guides
- Review: PR with comprehensive testing
Priority Areas for Contribution
- Phase 7: Queue-depth placement (high impact)
- WSL2 workarounds (developer experience)
- Performance benchmarks (documentation)
- Multi-GPU prototypes (future planning)
See Also
- Ring Kernel Integration - Phase 6 details
- Hybrid Layered Architecture - Architecture overview
- Phase 7 Implementation Guide - Next phase details
- Performance Benchmarks - Current measurements
Implementation Roadmap: Building the future of GPU-native distributed computing.
Version: 0.1.0 Last Updated: 2025-11-28