LINQ GPU Acceleration Guide
This guide explains how to use DotCompute.Linq's GPU kernel generation to accelerate LINQ queries with automatic optimization across CUDA, OpenCL, and Metal backends.
Overview
DotCompute.Linq provides production-ready GPU kernel generation from standard LINQ expressions. The system compiles LINQ operations into optimized GPU kernels, applying optimizations such as kernel fusion and filter compaction automatically.
Key Features
- Automatic GPU Compilation: LINQ expressions → optimized GPU kernels
- Three GPU Backends: CUDA (NVIDIA), OpenCL (cross-platform), Metal (Apple)
- Kernel Fusion: 50-80% bandwidth reduction for chained operations
- Filter Compaction: Atomic stream compaction for variable-length results
- Graceful Fallback: CPU execution when no GPU is available
Quick Start
Installation
```bash
dotnet add package DotCompute.Linq --version 0.2.0-alpha
dotnet add package DotCompute.Backends.CUDA --version 0.2.0-alpha    # NVIDIA GPUs
dotnet add package DotCompute.Backends.OpenCL --version 0.2.0-alpha  # Cross-platform GPU
dotnet add package DotCompute.Backends.Metal --version 0.2.0-alpha   # Apple GPUs
```
Basic Usage
```csharp
using DotCompute.Linq;

// Your data
float[] data = Enumerable.Range(0, 1_000_000)
    .Select(i => (float)i)
    .ToArray();

// Standard LINQ automatically compiled to GPU kernel
var result = data
    .AsComputeQueryable()
    .Where(x => x > 5000)
    .Select(x => x * 2.0f)
    .ToComputeArray();

// GPU acceleration is automatic!
// Expected: 10-30x speedup vs standard LINQ on 1M elements
```
Supported Operations
Map Operations (Select)
Transform each element with a function:
```csharp
// Multiply by 2 - compiles to GPU kernel
var doubled = data
    .AsComputeQueryable()
    .Select(x => x * 2.0f)
    .ToComputeArray();

// Complex transformations
var complex = data
    .AsComputeQueryable()
    .Select(x => (x * 3.0f) + 100.0f)
    .ToComputeArray();

// Math operations (MathF keeps the computation in single precision)
var computed = data
    .AsComputeQueryable()
    .Select(x => x * x + MathF.Sqrt(x))
    .ToComputeArray();
```
Generated CUDA Kernel:
extern "C" __global__ void Execute(const float* input, float* output, int length)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < length) {
output[idx] = (input[idx] * 2.0f);
}
}
Filter Operations (Where)
Select elements matching a predicate:
```csharp
// Simple filter - uses atomic stream compaction
var filtered = data
    .AsComputeQueryable()
    .Where(x => x > 1000.0f)
    .ToComputeArray();

// Complex predicates
var complexFilter = data
    .AsComputeQueryable()
    .Where(x => x > 500.0f && x < 2000.0f)
    .ToComputeArray();
```
Generated CUDA Kernel with Atomic Compaction:
extern "C" __global__ void Execute(
const float* input,
float* output,
int* outputCount,
int length)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < length) {
if ((input[idx] > 1000.0f)) {
// Atomically allocate output position
int outIdx = atomicAdd(outputCount, 1);
output[outIdx] = input[idx];
}
}
}
Reduce Operations (Sum, Average, etc.)
Aggregate operations:
```csharp
// Sum all elements
float sum = data
    .AsComputeQueryable()
    .Sum();

// Average
float average = data
    .AsComputeQueryable()
    .Average();

// Count matching elements
int count = data
    .AsComputeQueryable()
    .Count(x => x > 5000);
```
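For reference, GPU reductions such as Sum typically use the classic shared-memory tree pattern. The kernel below is an illustrative sketch of that pattern only, not the code DotCompute actually emits (which may use warp shuffles or a different multi-pass strategy):

```cuda
// Illustrative block-level sum reduction (not DotCompute's generated kernel).
// Launch with <<<numBlocks, blockSize, blockSize * sizeof(float)>>> so sdata has one slot per thread.
extern "C" __global__ void ReduceSum(const float* input, float* partialSums, int length)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread loads one element (0 for out-of-range threads)
    sdata[tid] = (idx < length) ? input[idx] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory: halve the number of active threads each step
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            sdata[tid] += sdata[tid + stride];
        }
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum; a second pass (or the host) combines the partials
    if (tid == 0) {
        partialSums[blockIdx.x] = sdata[0];
    }
}
```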
Kernel Fusion Optimization
One of the most powerful features is automatic kernel fusion: multiple LINQ operations are combined into a single GPU kernel.
Example: Map → Filter → Map
```csharp
// Three operations
var result = data
    .AsComputeQueryable()
    .Select(x => x * 2.0f)    // Map 1
    .Where(x => x > 1000.0f)  // Filter
    .Select(x => x + 100.0f)  // Map 2
    .ToComputeArray();        // Single fused GPU kernel!
```
Without Fusion (3 separate kernels):
```text
Kernel 1 (Map):    Read 1M elements → Write 1M elements
Kernel 2 (Filter): Read 1M elements → Write 500K elements (50% pass)
Kernel 3 (Map):    Read 500K elements → Write 500K elements

Total: 2.5M reads + 2M writes = 4.5M memory operations
```
With Fusion (1 kernel):
```text
Fused Kernel: Read 1M elements → Write 500K elements (only passing elements)

Total: 1M reads + 500K writes = 1.5M memory operations
```
Result: 66.7% bandwidth reduction (4.5M → 1.5M operations)
Generated Fused CUDA Kernel
extern "C" __global__ void Execute(const float* input, float* output, int length)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < length) {
// Fused operations: Map -> Filter -> Map
// Performance: Eliminates intermediate memory transfers
float value = input[idx];
bool passesFilter = true;
// Map: x * 2
if (passesFilter) {
value = (value * 2.0f);
}
// Filter: Check predicate
passesFilter = passesFilter && ((value > 1000.0f));
// Map: x + 100
if (passesFilter) {
value = (value + 100.0f);
}
// Write only if passed filter
if (passesFilter) {
output[idx] = value;
}
}
}
Supported Fusion Patterns
| Pattern | Description | Bandwidth Reduction |
|---|---|---|
| Map + Map | Two transformations | 50% (4 ops → 2 ops) |
| Map + Filter | Transform then filter | 50% (4 ops → 2 ops) |
| Filter + Map | Filter then transform | 50% (4 ops → 2 ops) |
| Map + Map + Map | Three transformations | 66.7% (6 ops → 2 ops) |
| Map + Filter + Map | Complex chain | 66.7% (6 ops → 2 ops) |

Op counts are per-element global memory reads and writes: each unfused stage costs one read and one write, while the fused kernel reads the input once and writes the output once (the 4.5M → 1.5M worked example above additionally accounts for a 50% filter pass rate).
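Longer chains save proportionally more bandwidth under the same counting model, since every additional unfused stage adds a read and a write of an intermediate buffer. The snippet below is a hypothetical four-stage chain (assuming the fusion engine handles chains of this length); the names and numbers are illustrative:

```csharp
// Unfused: 4 kernels × (1 read + 1 write) = 8 per-element memory ops.
// Fused:   1 read + 1 write = 2 per-element memory ops (~75% bandwidth reduction).
var pipeline = data
    .AsComputeQueryable()
    .Select(x => x * 2.0f)
    .Select(x => x + 1.0f)
    .Where(x => x > 100.0f)
    .Select(x => x / 3.0f)
    .ToComputeArray();
```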
Backend Selection
Automatic Selection
The system automatically selects the best backend:
```csharp
// Automatic backend selection
var result = data.AsComputeQueryable().Select(x => x * 2).ToComputeArray();
// Uses GPU if available, CPU otherwise
```
Manual Backend Selection
```csharp
using DotCompute.Core;

// Force CUDA backend (NVIDIA GPUs)
var cudaResult = data
    .AsComputeQueryable()
    .WithBackend(ComputeBackend.Cuda)
    .Select(x => x * 2)
    .ToComputeArray();

// Force OpenCL backend (cross-platform)
var openclResult = data
    .AsComputeQueryable()
    .WithBackend(ComputeBackend.OpenCL)
    .Select(x => x * 2)
    .ToComputeArray();

// Force Metal backend (Apple GPUs)
var metalResult = data
    .AsComputeQueryable()
    .WithBackend(ComputeBackend.Metal)
    .Select(x => x * 2)
    .ToComputeArray();

// Force CPU SIMD (always available)
var cpuResult = data
    .AsComputeQueryable()
    .WithBackend(ComputeBackend.CpuSimd)
    .Select(x => x * 2)
    .ToComputeArray();
```
Performance Expectations
Based on GPU architecture and workload characteristics:
Small Data (1K - 100K elements)
| Operation | CPU LINQ | GPU (CUDA/OpenCL/Metal) | Speedup |
|---|---|---|---|
| Map | ~0.5ms | ~0.2ms | 2-3x |
| Filter | ~0.4ms | ~0.2ms | 2x |
| Reduce | ~0.3ms | ~0.1ms | 3x |
Note: GPU overhead dominates for small datasets. Use CPU for < 10K elements.
Medium Data (100K - 1M elements)
| Operation | CPU LINQ | GPU (CUDA/OpenCL/Metal) | Speedup |
|---|---|---|---|
| Map | ~15ms | 0.5-1.5ms | 10-30x ✅ |
| Filter | ~12ms | 1-2ms | 6-12x ✅ |
| Reduce | ~10ms | 0.3-1ms | 10-33x ✅ |
Note: Sweet spot for GPU acceleration. Significant speedups with manageable overhead.
Large Data (10M+ elements)
| Operation | CPU LINQ | GPU (CUDA/OpenCL/Metal) | Speedup |
|---|---|---|---|
| Map | ~150ms | 3-5ms | 30-50x ✅ |
| Filter | ~120ms | 5-10ms | 12-24x ✅ |
| Reduce | ~100ms | 2-5ms | 20-50x ✅ |
Note: Maximum GPU efficiency. CPU SIMD narrows the gap, but the GPU still dominates.
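These figures are indicative. To measure on your own hardware, a minimal BenchmarkDotNet harness along the following lines compares standard LINQ against the compute pipeline; the DotCompute calls mirror the examples above, and the parameter values are only suggestions:

```csharp
using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using DotCompute.Linq;

public class LinqVsGpuBenchmark
{
    [Params(100_000, 1_000_000, 10_000_000)]
    public int N;

    private float[] _data = Array.Empty<float>();

    [GlobalSetup]
    public void Setup() => _data = Enumerable.Range(0, N).Select(i => (float)i).ToArray();

    [Benchmark(Baseline = true)]
    public float[] CpuLinq() => _data.Where(x => x > 1000f).Select(x => x * 2f).ToArray();

    [Benchmark]
    public float[] GpuLinq() => _data
        .AsComputeQueryable()
        .Where(x => x > 1000f)
        .Select(x => x * 2f)
        .ToComputeArray();
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<LinqVsGpuBenchmark>();
}
```

Run it in Release mode (`dotnet run -c Release`) so BenchmarkDotNet's numbers are meaningful.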
Hardware Requirements
CUDA Backend (NVIDIA GPUs)
Supported Architectures:
- Maxwell (CC 5.0-5.3): GTX 900 series, GTX Titan X
- Pascal (CC 6.0-6.2): GTX 1000 series, P100, Titan Xp
- Volta (CC 7.0-7.2): V100, Titan V
- Turing (CC 7.5): RTX 2000 series, T4, Titan RTX
- Ampere (CC 8.0-8.6): RTX 3000 series, A100, A10
- Ada Lovelace (CC 8.9): RTX 4000 series, L4, L40
Requirements:
- CUDA Toolkit 12.0 or later
- Compatible NVIDIA drivers
- Windows, Linux, or WSL2
OpenCL Backend (Cross-Platform)
Supported Vendors:
- NVIDIA: GeForce, Quadro, Tesla (via OpenCL or CUDA)
- AMD: Radeon RX, Radeon Pro, Instinct (via ROCm or AMDGPU Pro)
- Intel: Arc, Iris Xe, UHD Graphics (via Intel Compute Runtime)
- ARM Mali: Mobile and embedded GPUs
- Qualcomm Adreno: Mobile GPUs
Requirements:
- OpenCL 1.2+ runtime
- Vendor-specific OpenCL drivers
- Cross-platform support (Windows, Linux, macOS, mobile)
Metal Backend (Apple GPUs)
Supported Hardware:
- Apple Silicon: M1, M2, M3, M4 (Unified Memory)
- AMD Discrete GPUs in Intel Macs
- Intel Integrated Graphics in older Macs
Requirements:
- macOS 10.13+ (High Sierra or later)
- Metal 2.0+ support
- Native macOS only
Best Practices
1. Data Size Considerations
```csharp
float[] result;

// Good: Large datasets benefit from GPU
if (data.Length > 10_000)
{
    result = data.AsComputeQueryable().Select(x => x * 2).ToComputeArray();
}
else
{
    result = data.Select(x => x * 2).ToArray(); // Standard LINQ
}
```
2. Operation Chaining for Fusion
```csharp
// Good: Chain operations for kernel fusion
var result = data
    .AsComputeQueryable()
    .Select(x => x * 2)   // Will be fused
    .Where(x => x > 100)  // into a single kernel
    .Select(x => x + 50)  // automatically!
    .ToComputeArray();

// Avoid: Breaking the chain prevents fusion
var temp1 = data.AsComputeQueryable().Select(x => x * 2).ToComputeArray();
var temp2 = temp1.AsComputeQueryable().Where(x => x > 100).ToComputeArray();
// Creates 2 separate kernels instead of 1 fused kernel
```
3. Memory Management
```csharp
// Reuse buffers when possible
using var buffer = new UnifiedBuffer<float>(data.Length);

// Multiple queries on the same data
var result1 = data.AsComputeQueryable().Select(x => x * 2).ToComputeArray();
var result2 = data.AsComputeQueryable().Select(x => x * 3).ToComputeArray();
// Kernel compilation is cached automatically
```
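Because kernel compilation is cached, only the first execution of a given query shape should pay the compile cost. Assuming caching behaves as described, a rough way to observe it is to time repeated runs with `Stopwatch`:

```csharp
using System;
using System.Diagnostics;

var sw = Stopwatch.StartNew();
var first = data.AsComputeQueryable().Select(x => x * 2.0f).ToComputeArray();
Console.WriteLine($"First run (includes kernel compilation): {sw.ElapsedMilliseconds} ms");

sw.Restart();
var second = data.AsComputeQueryable().Select(x => x * 2.0f).ToComputeArray();
Console.WriteLine($"Second run (cached kernel): {sw.ElapsedMilliseconds} ms");
```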
4. Backend Selection Strategy
```csharp
// Let the system choose for optimal performance
var result = data
    .AsComputeQueryable()
    .Select(x => x * 2)
    .ToComputeArray(); // Automatic selection

// Override only when necessary
var gpuResult = data
    .AsComputeQueryable()
    .WithBackend(ComputeBackend.Cuda) // Force specific backend
    .Select(x => x * 2)
    .ToComputeArray();
```
Troubleshooting
GPU Not Detected
Problem: GPU acceleration not available.
Solutions:
- Verify GPU drivers are installed: `nvidia-smi` (CUDA) or `clinfo` (OpenCL); see the commands below
- Check the CUDA Toolkit version: `nvcc --version` (should report 12.0 or later)
- Ensure the correct NuGet packages are installed
- Check logs for GPU initialization errors
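For a quick toolchain sanity check, the standard vendor utilities (not part of DotCompute) report driver and runtime status:

```bash
# NVIDIA: driver version, GPU model, and utilization
nvidia-smi

# CUDA Toolkit version (the CUDA backend needs 12.0 or later)
nvcc --version

# OpenCL: list available platforms and devices
clinfo
```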
Poor Performance
Problem: GPU slower than expected.
Solutions:
- Data Size: Ensure dataset is large enough (> 10K elements)
- Kernel Fusion: Chain operations to enable fusion
- Backend: Try a different backend via `WithBackend()`
- Memory Transfers: Minimize CPU↔GPU transfers
- Profiling: Use BenchmarkDotNet to measure actual performance (see the benchmark sketch under Performance Expectations)
Compilation Errors
Problem: GPU kernel compilation fails.
Solutions:
- Check Expression: Ensure the LINQ expression uses only supported operations
- Type Support: Verify the data types are supported (byte, int, float, double)
- Fallback: The system automatically falls back to CPU on errors; the sketch below shows one way to confirm GPU and CPU results agree
- Logs: Check debug logs for detailed error messages
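If you suspect a kernel is miscompiling or silently falling back, one low-tech check is to compare the compute result against standard LINQ on the same data, using only the APIs shown in this guide (the tolerance value here is an arbitrary choice):

```csharp
// Order-preserving query (map only), so results can be compared element by element
var gpu = data.AsComputeQueryable().Select(x => x * 2.0f + 1.0f).ToComputeArray();
var cpu = data.Select(x => x * 2.0f + 1.0f).ToArray();

// Lengths and values should match; allow a small relative tolerance for floating point
bool match = gpu.Length == cpu.Length
    && gpu.Zip(cpu, (g, c) => Math.Abs(g - c) <= 1e-3f * Math.Max(1f, Math.Abs(c))).All(ok => ok);

Console.WriteLine(match
    ? "GPU and CPU results agree"
    : "Mismatch - check the debug logs for compilation errors or fallback messages");
```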
Next Steps
- Advanced Topics: GPU Kernel Generation Deep Dive
- Performance: Benchmarking LINQ Queries
- Examples: LINQ GPU Acceleration Examples
- API Reference: DotCompute.Linq API Documentation
Additional Resources
- GitHub: DotCompute Repository
- NuGet: DotCompute.Linq Package
- Issues: Report Bugs
- Discussions: Community Support