Introduction to GPU Computing

This module introduces the fundamental concepts of GPU computing and helps you understand when GPU acceleration provides meaningful benefits.

CPU vs GPU Architecture

CPU: Few Powerful Cores

CPUs are optimized for sequential tasks with complex control flow:

  • 4-64 cores optimized for single-thread performance
  • Large caches (L1, L2, L3) for low-latency memory access
  • Branch prediction and out-of-order execution
  • Ideal for: business logic, I/O operations, complex algorithms

GPU: Many Simple Cores

GPUs are optimized for parallel data processing:

  • Thousands of cores optimized for throughput
  • High memory bandwidth (hundreds of GB/s)
  • SIMD execution (Single Instruction, Multiple Data; see the sketch after this list)
  • Ideal for: matrix operations, image processing, simulations
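
SIMD is the same idea that .NET exposes on CPUs through System.Numerics.Vector<T>. A minimal C# sketch of one instruction operating on multiple data lanes (the lane count depends on your hardware):

using System.Numerics;

// One SIMD add processes Vector<float>.Count lanes at once
// (4, 8, or 16 depending on the CPU's vector width).
int lanes = Vector<float>.Count;
float[] a = new float[lanes];
float[] b = new float[lanes];
for (int i = 0; i < lanes; i++) { a[i] = i; b[i] = lanes - i; }

var sum = new Vector<float>(a) + new Vector<float>(b);  // element-wise, single operation
Console.WriteLine(sum);

A GPU applies the same single-instruction-multiple-data principle, but across thousands of threads instead of a handful of vector lanes.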

When to Use GPU Acceleration

Good Candidates

Workload                      Why GPU Helps
Matrix multiplication         Highly parallel, regular memory access
Image/video processing        Independent pixel operations
Scientific simulations        Parallel numerical computations
Machine learning inference    Batch processing of data
Signal processing (FFT)       Parallel frequency-domain operations

Poor Candidates

Workload                            Why GPU Hurts
Small data sets (<1,000 elements)   Transfer overhead dominates
Sequential algorithms               Cannot be parallelized
Branch-heavy code                   GPU threads diverge
Random memory access                Poor memory coalescing
I/O-bound operations                GPU cannot help

GPU Programming Model

Threads and Thread Blocks

GPU computation is organized hierarchically:

Grid (entire computation)
└── Thread Blocks (groups of threads)
    └── Threads (individual execution units)

Key concepts:

  • Thread: Smallest unit of execution
  • Thread Block: Group of threads that can synchronize
  • Grid: Collection of thread blocks
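
To make the mapping concrete, here is a plain C# sketch (ordinary loops, no GPU API) of how a thread's global index is derived from its block index and its position within the block:

int n = 10_000;                               // elements to cover
int blockDim = 256;                           // threads per block (a common choice)
int gridDim = (n + blockDim - 1) / blockDim;  // blocks needed to cover all n elements

for (int blockIdx = 0; blockIdx < gridDim; blockIdx++)
{
    for (int threadIdx = 0; threadIdx < blockDim; threadIdx++)
    {
        int globalIdx = blockIdx * blockDim + threadIdx;
        if (globalIdx < n)
        {
            // On a GPU, this body runs as one independent thread.
        }
    }
}

The bounds check is needed because the grid is sized in whole blocks, so the last block usually contains threads with no element to process.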

Example: Vector Addition

Adding two arrays of 10,000 elements:

CPU approach: 1 thread processes 10,000 elements sequentially
GPU approach: 10,000 threads each process 1 element in parallel
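
The CPU baseline is literally a single loop; a minimal sketch:

// Sequential CPU baseline: one thread walks all 10,000 elements.
float[] a = new float[10_000];
float[] b = new float[10_000];
float[] result = new float[10_000];

for (int i = 0; i < result.Length; i++)
{
    result[i] = a[i] + b[i];
}

The GPU version shown in the next section removes the loop entirely: each thread reads its own index and handles exactly one element.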

DotCompute's Approach

DotCompute simplifies GPU programming through:

1. Declarative Kernel Definition

using DotCompute.Generators.Kernel.Attributes;

public static partial class MyKernels
{
    [Kernel]
    public static void VectorAdd(
        ReadOnlySpan<float> a,
        ReadOnlySpan<float> b,
        Span<float> result)
    {
        int idx = Kernel.ThreadId.X;  // global index of this thread
        if (idx < result.Length)      // guard: the grid may be larger than the data
        {
            result[idx] = a[idx] + b[idx];
        }
    }
}

2. Automatic Backend Selection

DotCompute detects available hardware and selects the best backend:

  • CUDA: NVIDIA GPUs (highest performance) - Production
  • CPU: SIMD-optimized fallback - Production
  • Metal: Apple Silicon (native macOS/iOS) - Experimental
  • OpenCL: AMD, Intel, and cross-vendor - Experimental
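
You can also inspect the enumerated devices and choose one explicitly. A hedged sketch, reusing the factory from the hands-on exercise below (the DeviceType comparison is illustrative; check the actual values your DotCompute version reports):

using System.Linq;

// Prefer the first non-CPU device if one exists, otherwise fall back to CPU.
// Assumes `factory` was obtained as shown in the hands-on exercise.
var devices = await factory.GetAvailableDevicesAsync();
var preferred = devices.FirstOrDefault(d =>
        !string.Equals(d.DeviceType.ToString(), "CPU", StringComparison.OrdinalIgnoreCase))
    ?? devices.First();

Console.WriteLine($"Selected device: {preferred.Name}");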

3. Source Generation

The [Kernel] attribute triggers compile-time code generation:

  • No runtime reflection
  • Native AOT compatible
  • Optimized for each backend
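
Because no runtime reflection is involved, kernel code should survive trimming and ahead-of-time compilation with the standard Native AOT publish flow, for example:

# Publish with Native AOT (adjust the runtime identifier for your platform)
dotnet publish -c Release -r linux-x64 -p:PublishAot=true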

Performance Expectations

Realistic Speedups

Operation                Data Size       Typical Speedup
Vector operations        1M elements     10-50x
Matrix multiplication    1024x1024       20-100x
Image convolution        4K image        30-80x
Reduction (sum)          10M elements    5-20x

Break-Even Points

GPU acceleration has overhead. Typical break-even points:

  • CPU SIMD: worthwhile from roughly 100 elements (~3.7x speedup over scalar code)
  • GPU: worthwhile from roughly 10,000 elements (21-92x speedup, workload-dependent)

Below these thresholds, CPU processing is often faster due to transfer overhead.
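
A quick back-of-envelope check makes the threshold intuitive (the bandwidth figure below is an assumption for illustration, not a measurement):

// Estimated PCIe transfer cost for 1M floats, assuming ~16 GB/s effective bandwidth.
const double bandwidthBytesPerSec = 16e9;
const int elements = 1_000_000;

double bytes = elements * sizeof(float);                 // 4 MB of data
double oneWayMs = bytes / bandwidthBytesPerSec * 1_000;  // ~0.25 ms per direction

Console.WriteLine($"~{oneWayMs:F2} ms each way; the kernel must save more than " +
                  $"{2 * oneWayMs:F2} ms of CPU time to come out ahead.");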

Hands-On Exercise

Before proceeding, verify your environment:

# Check .NET version
dotnet --version

# Create a test project
dotnet new console -n GpuTest
cd GpuTest

# Add DotCompute packages
dotnet add package DotCompute.Core
dotnet add package DotCompute.Abstractions
dotnet add package DotCompute.Runtime
dotnet add package DotCompute.Backends.CPU
dotnet add package DotCompute.Generators

Create Program.cs:

using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.DependencyInjection;
using DotCompute.Runtime;
using DotCompute.Abstractions.Factories;

// Build host with DotCompute services
var builder = Host.CreateApplicationBuilder(args);
builder.Services.AddDotComputeRuntime();
var app = builder.Build();

// Get the accelerator factory
var factory = app.Services.GetRequiredService<IUnifiedAcceleratorFactory>();

// Enumerate available devices
var devices = await factory.GetAvailableDevicesAsync();

Console.WriteLine("Available devices:");
foreach (var device in devices)
{
    Console.WriteLine($"  - {device.DeviceType}: {device.Name}");
    Console.WriteLine($"    Memory: {device.TotalMemory / (1024.0 * 1024 * 1024):F2} GB");
}

Run and verify you see at least one device (CPU is always available).

Key Takeaways

  1. GPUs excel at parallel data processing with thousands of simple cores
  2. Not all workloads benefit from GPU acceleration
  3. Data transfer overhead determines minimum efficient data size
  4. DotCompute abstracts complexity while providing full control when needed

Next Module

Your First Kernel →

Learn to write and execute a complete GPU kernel.