WSL2 CUDA Limitations and Mitigations
This document describes the known limitations when running DotCompute CUDA Ring Kernels under WSL2 (Windows Subsystem for Linux 2) and the mitigations implemented in the codebase.
Overview
WSL2 provides GPU pass-through via NVIDIA's CUDA on WSL driver, but there are several limitations compared to native Linux or Windows CUDA:
- Pinned Memory Allocation Failures
- Driver API vs Runtime API Context Management
- Unified Memory Restrictions
- Cooperative Kernel Limitations
Detailed Limitations and Mitigations
1. Pinned Memory Allocation (cudaHostAlloc)
Limitation: In WSL2, cudaHostAlloc with pinned/mapped flags often fails with cudaErrorInvalidValue. This prevents the use of zero-copy memory access patterns.
Impact:
- Cannot use pinned host memory for direct GPU access
- cudaMemcpyAsync requires pinned host memory and fails with regular malloc buffers
Mitigation (implemented in RingKernelControlBlockHelper.cs):
// Try pinned allocation first
var stagingResult = CudaRuntime.cudaHostAlloc(ref stagingBuffer, size, pinnedFlags);
if (stagingResult != CudaError.Success)
{
    // Fall back to regular malloc - will use sync copies
    stagingBuffer = Marshal.AllocHGlobal(controlBlockSize);
    isStagingPinned = false;
}
When pinned allocation fails:
- Use regular Marshal.AllocHGlobal for staging buffers
- Track the allocation type with the IsStagingPinned flag
- Use synchronous cudaMemcpy instead of async cudaMemcpyAsync
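The fallback steps above can be sketched as a small owner type that remembers which allocator produced the staging buffer, so that cleanup uses the matching free call. This is a minimal sketch, not the code in RingKernelControlBlockHelper.cs: StagingBuffer, tryAllocPinned, and freePinned are hypothetical names, and the two delegates stand in for wrappers around cudaHostAlloc and cudaFreeHost.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper for illustration; not part of DotCompute.
public sealed class StagingBuffer : IDisposable
{
    private readonly Action<IntPtr> _freePinned;

    public IntPtr Pointer { get; private set; }

    // True when the buffer came from the pinned allocator; callers can use
    // this to choose between async and synchronous copies.
    public bool IsPinned { get; }

    private StagingBuffer(IntPtr pointer, bool isPinned, Action<IntPtr> freePinned)
    {
        Pointer = pointer;
        IsPinned = isPinned;
        _freePinned = freePinned;
    }

    // tryAllocPinned is assumed to return IntPtr.Zero on failure (for example
    // when the underlying cudaHostAlloc call reports an error).
    public static StagingBuffer Allocate(
        int sizeInBytes,
        Func<int, IntPtr> tryAllocPinned,
        Action<IntPtr> freePinned)
    {
        IntPtr pinned = tryAllocPinned(sizeInBytes);
        if (pinned != IntPtr.Zero)
        {
            return new StagingBuffer(pinned, true, freePinned);
        }

        // Fallback: pageable host memory, to be used with synchronous copies.
        IntPtr pageable = Marshal.AllocHGlobal(sizeInBytes);
        return new StagingBuffer(pageable, false, freePinned);
    }

    public void Dispose()
    {
        if (Pointer == IntPtr.Zero)
        {
            return; // already released
        }

        if (IsPinned)
        {
            _freePinned(Pointer);            // e.g. a cudaFreeHost wrapper
        }
        else
        {
            Marshal.FreeHGlobal(Pointer);    // matches Marshal.AllocHGlobal
        }

        Pointer = IntPtr.Zero;
    }
}
```

Keeping the flag next to the pointer means a pageable buffer cannot end up being passed to the pinned free routine, or the reverse.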
2. Driver API vs Runtime API Context Management
Limitation: WSL2 has issues with CUDA Driver API context handles becoming stale or invalid after Runtime API calls. Mixing Driver API (cuMemAlloc, cuMemcpyHtoD) with Runtime API (cudaMalloc, cudaMemcpy) causes context corruption.
Impact:
- Driver API allocated memory cannot be accessed via Runtime API
- Context handles become invalid after cudaSetDevice() calls
- Kernel launches fail with CUDA_ERROR_INVALID_HANDLE (400)
Mitigation (implemented in RingKernelControlBlockHelper.cs and CudaRingKernelRuntime.cs):
- Consistent API Usage: When using async control blocks in WSL2, use Runtime API exclusively for memory operations:
// Use Runtime API for device memory allocation
var deviceAllocResult = CudaRuntime.cudaMalloc(ref devicePtr, (ulong)controlBlockSize);
// Use Runtime API for memory copies
var copyResult = CudaRuntime.cudaMemcpy(
    devicePtr, hostPtr, (nuint)size, CudaMemcpyKind.HostToDevice);
- Context Restoration: After Runtime API calls, restore Driver API context for kernel operations:
// After Runtime API operations, restore Driver API context
var ctxRestoreResult = CudaRuntimeCore.cuCtxSetCurrent(state.Context);
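One way to make the restoration step hard to forget is a scope guard that rebinds the Driver API context when a block of Runtime API calls ends. The sketch below is an assumption about how this could be structured, not the project's code: DriverContextScope is a hypothetical name, and setCurrent stands in for a wrapper around cuCtxSetCurrent.

```csharp
using System;

// Hypothetical scope guard for illustration; not part of DotCompute.
public sealed class DriverContextScope : IDisposable
{
    private readonly IntPtr _context;
    private readonly Action<IntPtr> _setCurrent;
    private bool _disposed;

    // context: the Driver API context to restore when the scope ends.
    // setCurrent: assumed to wrap cuCtxSetCurrent and handle its result.
    public DriverContextScope(IntPtr context, Action<IntPtr> setCurrent)
    {
        _context = context;
        _setCurrent = setCurrent ?? throw new ArgumentNullException(nameof(setCurrent));
    }

    public void Dispose()
    {
        if (_disposed)
        {
            return; // restore only once
        }

        _disposed = true;
        _setCurrent(_context); // rebind the Driver API context to this thread
    }
}

// Intended usage (names on the right-hand side come from the snippets above
// and their exact types are assumed):
//
//   using (new DriverContextScope(state.Context,
//              ctx => CudaRuntimeCore.cuCtxSetCurrent(ctx)))
//   {
//       // Runtime API calls such as cudaMalloc and cudaMemcpy go here.
//   }
```

Because the restore runs in Dispose, it also executes when a Runtime API call inside the block throws.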
3. Unified Memory Restrictions
Limitation: WSL2 does not support concurrent CPU/GPU access to unified memory (cudaMallocManaged). Attempting to use unified memory with concurrent access flags fails or causes undefined behavior.
Impact:
- Cannot use cudaMallocManaged with cudaMemAttachGlobal
- Zero-copy patterns using unified memory don't work
- Cooperative kernels with host-side control block access are problematic
Mitigation (implemented in RingKernelControlBlockHelper.cs):
// WSL2 detection and fallback
if (pinnedAllocResult != CudaError.Success)
{
    Console.WriteLine("[DIAG] WSL2 detected - skipping unified memory");
    // Fall back to device-only memory with async staging
}
The async control block pattern provides an alternative:
- Device memory for the control block
- Separate staging buffer on host
- Explicit async/sync copies between them
4. Cooperative Kernel Limitations
Limitation: Cooperative kernel launch (cuLaunchCooperativeKernel) may not work reliably in WSL2 due to context management issues and memory restrictions.
Impact:
- Grid-wide synchronization may fail
- Cooperative groups functionality may be limited
Mitigation (implemented in CudaRingKernelRuntime.cs):
// WSL2 compatibility mode: Use non-cooperative kernel
if (isWsl2Environment)
{
    Console.WriteLine("[DIAG] Using NON-COOPERATIVE kernel (WSL2 compatibility mode)");
    launchResult = CudaApi.cuLaunchKernel(
        state.Function, gridSize, 1, 1, blockSize, 1, 1,
        0, state.Stream, argPtrsHandle.AddrOfPinnedObject(), IntPtr.Zero);
}
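The launch decision can be isolated in a pure function, which makes it easy to unit test without a GPU. This is a sketch under assumptions: LaunchMode and LaunchModeSelector are hypothetical names, and deviceSupportsCooperativeLaunch is assumed to come from querying CU_DEVICE_ATTRIBUTE_COOPERATIVE_LAUNCH through cuDeviceGetAttribute.

```csharp
using System;

// Hypothetical types for illustration; not part of DotCompute.
public enum LaunchMode
{
    Cooperative,
    Standard
}

public static class LaunchModeSelector
{
    // Picks the non-cooperative path under WSL2, and also on any device
    // that does not report cooperative launch support.
    public static LaunchMode Choose(bool isWsl2, bool deviceSupportsCooperativeLaunch)
    {
        if (isWsl2)
        {
            return LaunchMode.Standard; // WSL2 compatibility mode
        }

        return deviceSupportsCooperativeLaunch
            ? LaunchMode.Cooperative
            : LaunchMode.Standard;
    }
}
```

A kernel launched in Standard mode should not rely on grid-wide synchronization, so the kernel variant has to be chosen together with the launch mode.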
Environment Setup
Required Configuration
Set the library path for WSL2 CUDA drivers:
export LD_LIBRARY_PATH="/usr/lib/wsl/lib:$LD_LIBRARY_PATH"
Or use the provided test script which handles this automatically:
./scripts/run-tests.sh tests/Hardware/DotCompute.Hardware.Cuda.Tests/DotCompute.Hardware.Cuda.Tests.csproj
Diagnostic Output
The codebase includes extensive diagnostic logging with a [DIAG] prefix to help identify WSL2-specific issues:
- [DIAG] WSL2 detected - WSL2 fallback path activated
- [DIAG] Pinned alloc: InvalidValue - Pinned memory failed, using fallback
- [DIAG] Using NON-COOPERATIVE kernel - WSL2 compatibility kernel launch
Performance Implications
The WSL2 mitigations have the following performance characteristics:
| Feature | Native Linux | WSL2 |
|---|---|---|
| Pinned Memory | Zero-copy access | Sync copies required |
| Async Transfers | Full async support | Sync fallback |
| Control Block Reads | Non-blocking | Blocking sync copy |
| Cooperative Kernels | Full support | Non-cooperative fallback |
Estimated overhead: 10-30% slower for control block operations compared to native Linux with pinned memory.
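To check the estimate on a given machine, the same control block operation can be timed on native Linux and under WSL2 and the two averages compared. The helper below is a minimal sketch: ControlBlockTiming and its methods are hypothetical names, and the operation passed in is assumed to be a single control block read or write that blocks until the copy completes.

```csharp
using System;
using System.Diagnostics;

// Hypothetical measurement helper for illustration; not part of DotCompute.
public static class ControlBlockTiming
{
    // Returns the mean wall-clock time of one call to operation, in microseconds.
    public static double AverageMicroseconds(Action operation, int iterations, int warmup = 10)
    {
        if (operation is null) throw new ArgumentNullException(nameof(operation));
        if (iterations <= 0) throw new ArgumentOutOfRangeException(nameof(iterations));

        // Warm-up runs keep JIT compilation and first-call costs out of the result.
        for (int i = 0; i < warmup; i++)
        {
            operation();
        }

        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            operation();
        }
        stopwatch.Stop();

        return stopwatch.Elapsed.TotalMilliseconds * 1000.0 / iterations;
    }

    // Relative slowdown of measured versus baseline, as a percentage.
    public static double OverheadPercent(double baselineMicroseconds, double measuredMicroseconds)
    {
        return (measuredMicroseconds - baselineMicroseconds) / baselineMicroseconds * 100.0;
    }
}
```

For example, a native average of 100 microseconds against a WSL2 average of 120 microseconds gives an overhead of 20%.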
Affected Files
The following files contain WSL2-specific mitigations:
src/Backends/DotCompute.Backends.CUDA/RingKernels/RingKernelControlBlockHelper.cs
- Async control block allocation with pinned/malloc fallback
- Sync copy fallback for non-pinned staging
- Runtime API memory operations (cudaMalloc, cudaMemcpy)
- Read method uses Runtime API for device-to-host copies
- Write method uses Runtime API for host-to-device copies
- Proper resource cleanup for both allocation types
src/Backends/DotCompute.Backends.CUDA/RingKernels/CudaRingKernelRuntime.cs
- Driver API context restoration after Runtime API calls
- Non-cooperative kernel launch fallback
- WSL2 detection and compatibility mode
- ReadControlBlockAsync uses AsyncControlBlock path when available
- GetStatusAsync uses non-blocking read with cached values
Testing
Run the WSL2-compatible tests:
# Single test
export LD_LIBRARY_PATH="/usr/lib/wsl/lib:$LD_LIBRARY_PATH"
dotnet test --filter "PageRankContributionSender_ShouldActivateDeactivate"
# All Ring Kernel tests
./scripts/run-tests.sh --filter "RingKernel"
Known Issues
- Telemetry Buffer Allocation: The CudaTelemetryBuffer also uses pinned memory allocation, which fails in WSL2. Tests that use telemetry features (e.g., E2E_Telemetry_EnableAndVerifyAllocation) will fail until similar fallback logic is added.
- Test Isolation: Running multiple E2E tests sequentially can lead to CUDA context corruption. Running tests in isolation is recommended for troubleshooting.
- Error 700 (Illegal Address): This error can occur if Driver API and Runtime API memory are mixed incorrectly. The system now consistently uses Runtime API for async control block operations.
Future Improvements
- Consider implementing a full Runtime API path for kernel loading
- Add automatic WSL2 detection based on environment
- Implement hybrid mode that switches between pinned and sync paths based on operation success
- Add performance metrics to compare WSL2 vs native performance
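For the environment-based detection item, one common heuristic is to inspect the kernel release string, which on stock WSL2 kernels looks like 5.15.90.1-microsoft-standard-WSL2. The sketch below assumes that convention; Wsl2Detector is a hypothetical name, and a custom-built WSL2 kernel with a different release string would not be recognized.

```csharp
using System;
using System.IO;

// Hypothetical detector for illustration; not part of DotCompute.
public static class Wsl2Detector
{
    // Pure check on a kernel release string, kept separate so it can be tested.
    public static bool LooksLikeWsl2(string kernelRelease)
    {
        return !string.IsNullOrEmpty(kernelRelease)
            && kernelRelease.Contains("microsoft", StringComparison.OrdinalIgnoreCase)
            && kernelRelease.Contains("WSL2", StringComparison.OrdinalIgnoreCase);
    }

    // Reads the running kernel's release string; returns false on any platform
    // other than Linux or when the file cannot be read.
    public static bool IsWsl2()
    {
        if (!OperatingSystem.IsLinux())
        {
            return false;
        }

        const string releasePath = "/proc/sys/kernel/osrelease";
        try
        {
            return File.Exists(releasePath) && LooksLikeWsl2(File.ReadAllText(releasePath));
        }
        catch (IOException)
        {
            return false;
        }
        catch (UnauthorizedAccessException)
        {
            return false;
        }
    }
}
```

The result could feed a flag such as isWsl2Environment at startup, instead of inferring WSL2 from a failed pinned allocation.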