# Memory Fundamentals
This module covers GPU memory hierarchy, buffer types, and efficient data transfer patterns.
## GPU Memory Hierarchy

### Memory Types
| Memory Type | Scope | Speed | Size | Persistence |
|---|---|---|---|---|
| Registers | Per-thread | Fastest | ~256 KB per SM | Kernel lifetime |
| Shared Memory | Per-block | Very fast | 48-164 KB | Block lifetime |
| Global Memory | All threads | Slower | 4-80 GB | Until freed |
| Constant Memory | All threads | Fast (cached) | 64 KB | Until freed |
### Memory Access Pattern

```text
CPU Host Memory  ←→  GPU Global Memory  ←→  Registers/Shared
       [PCIe/NVLink]          [High bandwidth]
```
## Working with Memory in DotCompute
DotCompute simplifies memory management through automatic buffer handling:
### Automatic Memory Management

The `IComputeOrchestrator` handles memory transfers automatically:
```csharp
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.DependencyInjection;
using DotCompute.Runtime;
using DotCompute.Abstractions.Interfaces;

// Setup
var builder = Host.CreateApplicationBuilder(args);
builder.Services.AddDotComputeRuntime();
var app = builder.Build();
var orchestrator = app.Services.GetRequiredService<IComputeOrchestrator>();

// Prepare data (host arrays)
float[] inputA = new float[10000];
float[] inputB = new float[10000];
float[] result = new float[10000];

// Fill input data
for (int i = 0; i < 10000; i++)
{
    inputA[i] = i;
    inputB[i] = i * 2;
}

// Execute kernel - memory transfers handled automatically
await orchestrator.ExecuteKernelAsync(
    kernelName: "VectorAdd",
    args: new object[] { inputA, inputB, result }
);

// Results are automatically copied back to the host array
Console.WriteLine($"Result[0] = {result[0]}");     // 0
Console.WriteLine($"Result[100] = {result[100]}"); // 300 (100 + 200)
```
What happens automatically:
- Input arrays are copied to GPU memory
- Kernel executes on GPU
- Output array is copied back to host
- GPU memory is released
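One consequence of this model: because buffers are released when a call completes, invoking the same kernel twice re-transfers the data each time. A minimal illustration, reusing the arrays and orchestrator from above:

```csharp
// Each call performs its own host->device and device->host transfers,
// because the GPU buffers are released when the call completes.
await orchestrator.ExecuteKernelAsync("VectorAdd", new object[] { inputA, inputB, result });
await orchestrator.ExecuteKernelAsync("VectorAdd", new object[] { inputA, inputB, result });
// inputA and inputB crossed the bus twice; result round-tripped twice.
```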
### Understanding Span Parameters

In kernels, use `Span<T>` and `ReadOnlySpan<T>` for buffer parameters:
```csharp
using DotCompute.Generators.Kernel.Attributes;

public static partial class MyKernels
{
    [Kernel]
    public static void VectorAdd(
        ReadOnlySpan<float> a,  // Input (read-only on GPU)
        ReadOnlySpan<float> b,  // Input (read-only on GPU)
        Span<float> result)     // Output (read-write on GPU)
    {
        int idx = Kernel.ThreadId.X;
        if (idx < result.Length)
        {
            result[idx] = a[idx] + b[idx];
        }
    }
}
```
Parameter conventions:
- `ReadOnlySpan<T>`: Input data (GPU reads only)
- `Span<T>`: Output data (GPU reads and writes)
- Scalar types (`int`, `float`): Constants passed to all threads (see the example below)
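For instance, a scaled-add kernel (a hypothetical `Saxpy`, following the same pattern as `VectorAdd` above; the parameter order is illustrative) mixes both span kinds with a scalar:

```csharp
// Hypothetical example: 'scale' is a scalar, so every thread receives the same value.
[Kernel]
public static void Saxpy(
    float scale,            // Scalar constant, broadcast to all threads
    ReadOnlySpan<float> x,  // Input
    Span<float> y)          // Input and output
{
    int idx = Kernel.ThreadId.X;
    if (idx < y.Length)
    {
        y[idx] = scale * x[idx] + y[idx];
    }
}
```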
## Memory Coalescing
GPU memory is accessed in transactions (typically 32 or 128 bytes). Coalesced access patterns achieve maximum bandwidth.
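To see why this matters, consider a warp of 32 threads each reading a 4-byte `float`, assuming a 128-byte transaction size:

```text
Coalesced: 32 threads × 4 bytes = 128 contiguous bytes → 1 transaction
Stride 32: each thread lands in a different 128-byte segment → 32 transactions
```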
### Coalesced Access (Good)
Adjacent threads access adjacent memory locations:
```csharp
[Kernel]
public static void CoalescedAccess(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)
    {
        // Thread 0 reads input[0], thread 1 reads input[1], etc.
        // Memory transactions are combined
        output[idx] = input[idx] * 2.0f;
    }
}
```
### Strided Access (Bad)
Threads access non-adjacent locations:
```csharp
[Kernel]
public static void StridedAccess(ReadOnlySpan<float> input, Span<float> output, int stride)
{
    int idx = Kernel.ThreadId.X;

    // Guard both the write and the strided read
    if (idx < output.Length && idx * stride < input.Length)
    {
        // Thread 0 reads input[0], thread 1 reads input[stride], etc.
        // Multiple memory transactions required
        output[idx] = input[idx * stride] * 2.0f;
    }
}
```
### Performance Impact
| Access Pattern | Relative Performance |
|---|---|
| Coalesced | 100% (baseline) |
| Stride 2 | ~50% |
| Stride 32 | ~3% |
| Random | ~1-5% |
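Strided patterns often arise naturally from 2D data stored in row-major order. As a sketch (a hypothetical `ReadColumn` kernel, following the same conventions as above), reading one matrix column makes adjacent threads jump `width` elements apart:

```csharp
[Kernel]
public static void ReadColumn(
    ReadOnlySpan<float> matrix,  // Row-major: element (row, col) at row * width + col
    Span<float> column,
    int width,
    int col)
{
    int row = Kernel.ThreadId.X;
    if (row < column.Length && row * width + col < matrix.Length)
    {
        // Adjacent threads read elements 'width' floats apart: a strided pattern.
        column[row] = matrix[row * width + col];
    }
}
```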
## Buffer Best Practices

### 1. Minimize Transfers
Keep data on GPU between kernel calls when possible:
```csharp
// BAD: Transfer intermediate results unnecessarily
float[] temp = new float[size];

await orchestrator.ExecuteKernelAsync("Kernel1", new object[] { input, temp });
// temp is copied back to the CPU here

await orchestrator.ExecuteKernelAsync("Kernel2", new object[] { temp, output });
// temp is copied to the GPU again
```
For multi-stage pipelines, consider combining kernels into one (a sketch follows) or using advanced memory patterns (see the Intermediate Path).
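When the stages are elementwise, fusing them into a single kernel avoids the round trip entirely. A hypothetical fused version (the stage bodies are illustrative):

```csharp
// Hypothetical fused kernel: both stages run in one launch,
// so the intermediate value never leaves the GPU.
[Kernel]
public static void Kernel1Then2(ReadOnlySpan<float> input, Span<float> output)
{
    int idx = Kernel.ThreadId.X;
    if (idx < output.Length)
    {
        float temp = input[idx] * 2.0f;  // stage 1 (illustrative)
        output[idx] = temp + 1.0f;       // stage 2 (illustrative)
    }
}
```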
### 2. Batch Operations
```csharp
// BAD: Many small kernel calls
for (int i = 0; i < 1000; i++)
{
    float[] smallData = GetSmallChunk(i);
    await orchestrator.ExecuteKernelAsync("Process", new object[] { smallData });
}

// GOOD: Single large kernel call
float[] allData = GetAllData();
await orchestrator.ExecuteKernelAsync("ProcessBatch", new object[] { allData });
```
### 3. Choose the Right Data Types
```csharp
// Use float for most GPU operations (faster)
float[] data = new float[size];

// Use double only when precision is critical
double[] preciseData = new double[size];

// Note: float64 (double) may not be supported on all GPUs.
// Check device capabilities before using double.
```
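A related pitfall: in C#, an unsuffixed literal like `2.0` is a `double`, so mixing it into `float` math silently promotes the whole expression to double precision. Prefer the `f` suffix in kernels, as in this small sketch:

```csharp
[Kernel]
public static void Scale(Span<float> data)
{
    int idx = Kernel.ThreadId.X;
    if (idx < data.Length)
    {
        // data[idx] * 2.0 would promote the math to double; 2.0f keeps it in float.
        data[idx] = data[idx] * 2.0f;
    }
}
```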
## Practical Example: Image Grayscale
```csharp
using DotCompute.Generators.Kernel.Attributes;

public static partial class ImageKernels
{
    [Kernel]
    public static void Grayscale(
        ReadOnlySpan<byte> input,  // RGBA input
        Span<byte> output,         // Grayscale output
        int pixelCount)
    {
        int idx = Kernel.ThreadId.X;
        if (idx < pixelCount)
        {
            int inputIdx = idx * 4;  // RGBA = 4 bytes per pixel

            // Luminosity formula
            byte r = input[inputIdx];
            byte g = input[inputIdx + 1];
            byte b = input[inputIdx + 2];
            output[idx] = (byte)(0.299f * r + 0.587f * g + 0.114f * b);
        }
    }
}
```
```csharp
// Usage
public async Task ProcessImage(byte[] rgbaImage, int width, int height)
{
    // Register DotCompute before building the host
    var builder = Host.CreateApplicationBuilder();
    builder.Services.AddDotComputeRuntime();
    var app = builder.Build();
    var orchestrator = app.Services.GetRequiredService<IComputeOrchestrator>();

    int pixelCount = width * height;
    byte[] grayscale = new byte[pixelCount];

    await orchestrator.ExecuteKernelAsync(
        "Grayscale",
        new object[] { rgbaImage, grayscale, pixelCount }
    );

    // grayscale now contains the result
}
```
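To sanity-check the kernel, it is common to compare a few pixels against a CPU reference. A minimal hypothetical helper mirroring the kernel's formula:

```csharp
// Hypothetical CPU reference that mirrors the kernel's luminosity formula,
// useful for spot-checking a few pixels of the GPU output.
static byte GrayscaleCpu(byte[] rgba, int pixel)
{
    int i = pixel * 4;
    return (byte)(0.299f * rgba[i] + 0.587f * rgba[i + 1] + 0.114f * rgba[i + 2]);
}

// e.g. after ProcessImage completes:
// Debug.Assert(grayscale[0] == GrayscaleCpu(rgbaImage, 0));
```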
## Understanding Thread Configuration
DotCompute automatically configures thread dimensions, but understanding them helps write efficient kernels:
Total threads = GridSize × BlockSize

For 10,000 elements with BlockSize = 256:
- GridSize = ceil(10,000 / 256) = 40 blocks
- Total threads = 40 × 256 = 10,240 threads
- The extra 240 threads are handled by bounds checking (the helper below shows the arithmetic)
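The grid-size computation is a standard ceiling division; in C#:

```csharp
// Ceiling division: number of blocks needed to cover 'n' elements.
static int GridSize(int n, int blockSize) => (n + blockSize - 1) / blockSize;

// GridSize(10_000, 256) == 40; 40 * 256 = 10,240 threads for 10,000 elements.
```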
Why bounds checking is critical:
```csharp
[Kernel]
public static void Process(Span<float> data)
{
    int idx = Kernel.ThreadId.X;

    // ALWAYS check bounds - some threads may be beyond the data size
    if (idx < data.Length)
    {
        data[idx] = data[idx] * 2.0f;
    }
}
```
## Exercises

### Exercise 1: Transfer Benchmark
Measure how array size affects total execution time:
```csharp
using System.Diagnostics;

foreach (int size in new[] { 1000, 100_000, 10_000_000 })
{
    float[] data = new float[size];

    var sw = Stopwatch.StartNew();
    await orchestrator.ExecuteKernelAsync("VectorAdd", new object[] { data, data, data });
    Console.WriteLine($"Size: {size:N0}, Time: {sw.ElapsedMilliseconds}ms");
}
```
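Note that the first call typically includes one-time costs such as kernel compilation and device initialization, so consider an untimed warm-up run first:

```csharp
// Warm-up: absorb one-time initialization/compilation costs before measuring.
float[] warmup = new float[1000];
await orchestrator.ExecuteKernelAsync("VectorAdd", new object[] { warmup, warmup, warmup });
```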
### Exercise 2: Coalescing Impact

Compare the performance of coalesced and strided access patterns.
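A possible starting point, reusing the `CoalescedAccess` and `StridedAccess` kernels defined earlier and the timing approach from Exercise 1:

```csharp
const int size = 10_000_000;
float[] input = new float[size];
float[] output = new float[size];

var sw = Stopwatch.StartNew();
await orchestrator.ExecuteKernelAsync("CoalescedAccess", new object[] { input, output });
Console.WriteLine($"Coalesced: {sw.ElapsedMilliseconds}ms");

sw.Restart();
await orchestrator.ExecuteKernelAsync("StridedAccess", new object[] { input, output, 32 });
Console.WriteLine($"Stride 32: {sw.ElapsedMilliseconds}ms");
```

Keep in mind that total time includes transfers, and the strided variant touches fewer elements; normalizing per-element throughput is part of the exercise.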
### Exercise 3: Batch vs. Individual

Compare the performance of many small kernel calls against one large call.
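A possible starting point, adapting the good/bad pattern from the batching section (`chunks` and `allData` are hypothetical placeholders for your test data):

```csharp
var sw = Stopwatch.StartNew();
for (int i = 0; i < 1000; i++)
{
    // 'chunks' is a hypothetical float[][] of small inputs
    await orchestrator.ExecuteKernelAsync("Process", new object[] { chunks[i] });
}
Console.WriteLine($"1000 small calls: {sw.ElapsedMilliseconds}ms");

sw.Restart();
// 'allData' is a hypothetical float[] holding all chunks concatenated
await orchestrator.ExecuteKernelAsync("Process", new object[] { allData });
Console.WriteLine($"1 large call: {sw.ElapsedMilliseconds}ms");
```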
## Key Takeaways
- GPU memory hierarchy affects performance significantly
- DotCompute handles transfers automatically for simple cases
- Coalesced access provides maximum memory bandwidth
- Always include bounds checks in kernels
- Minimize transfers by batching operations
## Next Module
Learn how to choose and configure the appropriate compute backend.