Quick Start Guide
Get up and running with DotCompute in minutes. This guide shows you how to write your first GPU-accelerated computation using the API patterns that are correct for v0.4.1-rc2.
📖 See Also: Working Reference Example for comprehensive examples and patterns.
Prerequisites
- .NET 9.0 SDK or later
- Visual Studio 2022 17.8+ or VS Code with C# extension
- (Optional) NVIDIA GPU with Compute Capability 5.0+ for CUDA support
- (Optional) macOS with Apple Silicon for Metal support
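Before installing, you can sanity-check these prerequisites from a terminal (nvidia-smi is only available when NVIDIA drivers are installed; skip it on non-NVIDIA machines):
# Verify the .NET SDK version (needs 9.0 or later)
dotnet --list-sdks
# (Optional) Verify the NVIDIA driver and GPU are visible
nvidia-smi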
Installation
Install DotCompute v0.4.1-rc2 via NuGet:
# Core packages (required)
dotnet add package DotCompute.Core --version 0.4.1-rc2
dotnet add package DotCompute.Abstractions --version 0.4.1-rc2
dotnet add package DotCompute.Runtime --version 0.4.1-rc2
# CPU backend (always recommended)
dotnet add package DotCompute.Backends.CPU --version 0.4.1-rc2
# GPU backends (optional)
dotnet add package DotCompute.Backends.CUDA --version 0.4.1-rc2 # NVIDIA GPUs
dotnet add package DotCompute.Backends.OpenCL --version 0.4.1-rc2 # Cross-platform GPU
dotnet add package DotCompute.Backends.Metal --version 0.4.1-rc2 # Apple Silicon
# Source generators (required for [Kernel] attribute)
dotnet add package DotCompute.Generators --version 0.4.1-rc2
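If you prefer editing the project file directly, the same packages can go into your .csproj as PackageReference items. A minimal sketch with the core packages, the CPU backend, and the generators; add GPU backends as needed:
<ItemGroup>
  <PackageReference Include="DotCompute.Core" Version="0.4.1-rc2" />
  <PackageReference Include="DotCompute.Abstractions" Version="0.4.1-rc2" />
  <PackageReference Include="DotCompute.Runtime" Version="0.4.1-rc2" />
  <PackageReference Include="DotCompute.Backends.CPU" Version="0.4.1-rc2" />
  <PackageReference Include="DotCompute.Generators" Version="0.4.1-rc2" />
</ItemGroup>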
Your First Kernel
The simplest way to create a compute kernel is with the [Kernel] attribute:
using DotCompute;
using DotCompute.Abstractions;
public static class MyKernels
{
[Kernel]
public static void VectorAdd(
ReadOnlySpan<float> a,
ReadOnlySpan<float> b,
Span<float> result)
{
int idx = Kernel.ThreadId.X;
if (idx < result.Length)
{
result[idx] = a[idx] + b[idx];
}
}
}
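The same one-thread-per-element pattern covers most element-wise operations. Here is a second sketch with a trailing scalar parameter, mirroring the int width parameter in the MatrixMultiply example later in this guide (float scalar parameters are an assumption to verify against the kernel reference):
using DotCompute;
public static class MoreKernels
{
    [Kernel]
    public static void Saxpy(
        ReadOnlySpan<float> x,
        ReadOnlySpan<float> y,
        Span<float> result,
        float scale) // scalar parameter, passed after the spans
    {
        // One element per thread, same indexing pattern as VectorAdd
        int idx = Kernel.ThreadId.X;
        if (idx < result.Length)
        {
            result[idx] = scale * x[idx] + y[idx];
        }
    }
}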
Executing with DotCompute Runtime
Use the unified AddDotComputeRuntime() method for complete setup:
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.DependencyInjection;
using DotCompute.Runtime;
using DotCompute.Abstractions.Interfaces;
using DotCompute.Abstractions.Factories;
// Setup host with DotCompute services
var host = Host.CreateApplicationBuilder(args);
host.Services.AddLogging();
// ✅ Single method registers ALL necessary services!
host.Services.AddDotComputeRuntime();
var app = host.Build();
// Prepare data
var a = new float[] { 1, 2, 3, 4, 5 };
var b = new float[] { 10, 20, 30, 40, 50 };
var result = new float[5];
// Execute kernel with automatic backend selection
var orchestrator = app.Services.GetRequiredService<IComputeOrchestrator>();
await orchestrator.ExecuteKernelAsync(
kernelName: "VectorAdd",
args: new object[] { a, b, result }
);
// Read results
Console.WriteLine(string.Join(", ", result)); // Output: 11, 22, 33, 44, 55
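Since the results land in an ordinary array, you can verify them on the host with plain C#, no DotCompute APIs required:
// Host-side check against a sequential reference computation
for (int i = 0; i < result.Length; i++)
{
    float expected = a[i] + b[i];
    if (Math.Abs(result[i] - expected) > 1e-6f)
        Console.WriteLine($"Mismatch at {i}: got {result[i]}, expected {expected}");
}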
Device Discovery and Selection
Enumerate and inspect available devices through the IUnifiedAcceleratorFactory:
// Resolve the accelerator factory from the host's service provider
var factory = app.Services.GetRequiredService<IUnifiedAcceleratorFactory>();
var devices = await factory.GetAvailableDevicesAsync();
Console.WriteLine($"Found {devices.Count} device(s):");
foreach (var device in devices)
{
Console.WriteLine($" - {device.Name} ({device.DeviceType})");
Console.WriteLine($" Memory: {device.TotalMemory / (1024.0 * 1024 * 1024):F2} GB");
}
// Find and use a specific device (e.g., CUDA)
var cudaDevice = devices.FirstOrDefault(d => d.DeviceType == "CUDA");
if (cudaDevice != null)
{
// Create accelerator for this device
using var accelerator = await factory.CreateAsync(cudaDevice);
Console.WriteLine($"Using GPU: {cudaDevice.Name}");
// Now use the accelerator for kernel compilation and execution
// (See backend-specific documentation for details)
}
Backend Selection
To control backend choice yourself, explicitly select which device to use:
// Get factory
var factory = app.Services.GetRequiredService<IUnifiedAcceleratorFactory>();
var devices = await factory.GetAvailableDevicesAsync();
// Select best available device (priority: CUDA > OpenCL > CPU)
var device = devices.FirstOrDefault(d => d.DeviceType == "CUDA")
?? devices.FirstOrDefault(d => d.DeviceType == "OpenCL")
?? devices.First();
// Create accelerator
using var accelerator = await factory.CreateAsync(device);
Console.WriteLine($"Using device: {device.Name} ({device.DeviceType})");
// Now compile and execute kernels on this accelerator
// (See backend-specific documentation for compilation details)
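If you select devices in more than one place, the priority chain is easy to factor into a helper. A sketch using only the factory calls shown above (the IAccelerator return type is an assumption; substitute whatever CreateAsync actually returns in your version):
// Sketch of a reusable helper encapsulating the CUDA > OpenCL > CPU priority
static async Task<IAccelerator> CreateBestAcceleratorAsync(IUnifiedAcceleratorFactory factory)
{
    var devices = await factory.GetAvailableDevicesAsync();
    var device = devices.FirstOrDefault(d => d.DeviceType == "CUDA")   // prefer NVIDIA
        ?? devices.FirstOrDefault(d => d.DeviceType == "OpenCL")       // then cross-platform GPU
        ?? devices.First();                                            // CPU is always present
    return await factory.CreateAsync(device);
}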
Matrix Operations Example
Define a matrix multiplication kernel:
public static class MatrixKernels
{
[Kernel]
public static void MatrixMultiply(
ReadOnlySpan<float> a,
ReadOnlySpan<float> b,
Span<float> result,
int width)
{
int row = Kernel.ThreadId.Y;
int col = Kernel.ThreadId.X;
if (row < width && col < width)
{
float sum = 0;
for (int k = 0; k < width; k++)
{
sum += a[row * width + k] * b[k * width + col];
}
result[row * width + col] = sum;
}
}
}
// Usage
var orchestrator = app.Services.GetRequiredService<IComputeOrchestrator>();
var matrixA = new float[9] { 1, 2, 3, 4, 5, 6, 7, 8, 9 };
var matrixB = new float[9] { 1, 0, 0, 0, 1, 0, 0, 0, 1 }; // Identity matrix
var result = new float[9];
await orchestrator.ExecuteKernelAsync(
"MatrixMultiply",
new object[] { matrixA, matrixB, result, 3 }
);
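Because matrixB is the identity matrix, the result should reproduce matrixA. Printing it row by row is plain C#:
// Print the 3x3 result; with an identity B, this reproduces matrixA
const int width = 3;
for (int row = 0; row < width; row++)
{
    Console.WriteLine(
        $"{result[row * width]}, {result[row * width + 1]}, {result[row * width + 2]}");
}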
Cross-Backend Debugging
Enable debugging services for cross-backend validation:
using DotCompute.Core.Debugging;
using Microsoft.Extensions.DependencyInjection;
// Add debugging services during setup
host.Services.AddProductionDebugging(options =>
{
options.Profile = DebugProfile.Development;
options.ValidateAllExecutions = true; // Validate CPU vs GPU results
});
// Debugging happens automatically during kernel execution
// Any discrepancies will be logged with detailed diagnostics
await orchestrator.ExecuteKernelAsync("VectorAdd", new object[] { a, b, result });
// Check logs for validation results
// If results don't match, detailed difference reports will be shown
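Validation diagnostics flow through the standard Microsoft.Extensions.Logging pipeline, so make sure a provider is registered and the minimum level is low enough to see them. For example, with console logging (AddConsole requires the Microsoft.Extensions.Logging.Console package; the exact level DotCompute uses for validation output is an assumption to check):
using Microsoft.Extensions.Logging;
host.Services.AddLogging(logging =>
{
    logging.AddConsole();                    // write diagnostics to stdout
    logging.SetMinimumLevel(LogLevel.Debug); // don't filter out validation details
});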
Performance Optimization
Enable Adaptive Backend Selection
using Microsoft.Extensions.DependencyInjection;
using DotCompute.Runtime;
// During host setup
host.Services.AddDotComputeRuntime();
host.Services.AddProductionOptimization(); // ML-based backend selection
var app = host.Build();
var orchestrator = app.Services.GetRequiredService<IComputeOrchestrator>();
// Orchestrator now uses machine learning to select optimal backend
await orchestrator.ExecuteKernelAsync("VectorAdd", new object[] { a, b, result });
Memory Pooling (Automatic)
// Memory pooling is automatic in DotCompute v0.4.1-rc2
// The runtime manages buffers efficiently, reducing allocations by 90%+
// Just use normal arrays - pooling happens automatically
var data = new float[1_000_000];
await orchestrator.ExecuteKernelAsync("ProcessData", new object[] { data });
// No manual pool management required!
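One practical consequence: reusing the same host array across calls gives the pooled buffers the best chance of being reused rather than reallocated. A sketch (RefreshInput is a hypothetical helper, not a DotCompute API):
// Reuse one host array across iterations so pooled buffers can be reused
var data = new float[1_000_000];
for (int i = 0; i < 10; i++)
{
    RefreshInput(data, i); // hypothetical helper that writes this iteration's input
    await orchestrator.ExecuteKernelAsync("ProcessData", new object[] { data });
}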
Batch Multiple Kernel Calls
// Execute multiple kernels efficiently
var tasks = new[]
{
orchestrator.ExecuteKernelAsync("Kernel1", new object[] { data1 }),
orchestrator.ExecuteKernelAsync("Kernel2", new object[] { data2 }),
orchestrator.ExecuteKernelAsync("Kernel3", new object[] { data3 })
};
await Task.WhenAll(tasks); // Parallel execution
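Note that awaiting Task.WhenAll rethrows only the first failure. To inspect every kernel's outcome, keep the task array and check each task afterwards; this is plain TPL, nothing DotCompute-specific:
try
{
    await Task.WhenAll(tasks);
}
catch
{
    // Task.WhenAll surfaces one exception; walk the tasks for the rest
    foreach (var task in tasks)
    {
        if (task.IsFaulted)
            Console.WriteLine($"Kernel failed: {task.Exception?.GetBaseException().Message}");
    }
}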
Advanced: Writing Raw MSL (Metal)
For Apple Silicon, you can write Metal Shading Language directly:
var mslCode = @"
#include <metal_stdlib>
using namespace metal;
kernel void vector_add(
const device float* a [[buffer(0)]],
const device float* b [[buffer(1)]],
device float* result [[buffer(2)]],
uint id [[thread_position_in_grid]])
{
result[id] = a[id] + b[id];
}
";
var kernel = await accelerator.CompileKernelAsync(
new KernelDefinition
{
Name = "vector_add",
Code = mslCode,
EntryPoint = "vector_add"
}
);
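Because raw MSL only compiles on macOS, it's worth guarding this path with a runtime platform check (OperatingSystem.IsMacOS() is standard .NET 5+):
// Guard the Metal-specific path; fall back to portable [Kernel] code elsewhere
if (OperatingSystem.IsMacOS())
{
    var metalKernel = await accelerator.CompileKernelAsync(
        new KernelDefinition { Name = "vector_add", Code = mslCode, EntryPoint = "vector_add" });
    // ... dispatch via the Metal accelerator (see backend docs)
}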
Native AOT Support
DotCompute is fully Native AOT compatible:
<PropertyGroup>
<PublishAot>true</PublishAot>
</PropertyGroup>
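Then publish with the standard AOT workflow; the runtime identifier below is just an example, so pick the one for your target platform:
# Example Native AOT publish for x64 Linux
dotnet publish -c Release -r linux-x64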
Startup time is under 10 ms with Native AOT.
Next Steps
- Kernel Attribute Reference - Learn all kernel features
- Performance Guide - Optimize your kernels
- CUDA Programming - Advanced CUDA features
- Algorithm Library - Pre-built operations
Example Projects
Check out complete examples in the repository:
- Vector Addition - samples/VectorAdd/
- Matrix Multiplication - samples/MatrixMultiply/
- Image Processing - samples/ImageFilters/
- Signal Processing - samples/FFT/
Getting Help
- Documentation: https://mivertowski.github.io/DotCompute/
- Issues: https://github.com/mivertowski/DotCompute/issues
- Discussions: https://github.com/mivertowski/DotCompute/discussions
Performance Tip: Start with CPU backend for development and debugging, then switch to GPU for production performance. DotCompute makes this transition seamless!