Ring Kernel Compilation Pipeline - Architecture Document

Version: 1.0
Date: January 2025
Status: Design Proposal
Author: DotCompute Team

Executive Summary

This document outlines the architecture and implementation plan for a complete Ring Kernel compilation pipeline that transforms C# methods annotated with [RingKernel] into GPU-executable PTX code. The current implementation uses placeholder PTX and never executes an actual GPU kernel. This design addresses the five critical gaps identified during GPU execution validation testing.

Table of Contents

  1. Current State Analysis
  2. Problem Statement
  3. Design Goals
  4. Architecture Overview
  5. Component Design
  6. Implementation Phases
  7. Testing Strategy
  8. Performance Considerations
  9. Risk Mitigation

Current State Analysis

Existing Infrastructure

Source Generators and Analyzers:

  • RingKernelAttributeAnalyzer.cs - Validates [RingKernel] attribute usage
  • RingKernelCodeBuilder.cs - Generates C# wrapper code for ring kernel methods
  • Source generator creates host-side invocation infrastructure

CUDA Backend:

  • CudaRingKernelRuntime.cs - Runtime execution engine
  • CudaRingKernelCompiler.cs - Kernel compilation infrastructure
  • CudaMessageQueueBridgeFactory.cs - Host ↔ Device message transfer
  • PTXCompiler.cs - General PTX compilation using NVRTC

Message Infrastructure:

  • MessageQueueBridge.cs - Bidirectional message transfer (recently enhanced with validation)
  • MemoryPackMessageSerializer.cs - High-performance serialization (2-5x faster than JSON)
  • IRingKernelMessage - Message interface with MessageId, MessageType, Timestamp
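
For reference, the message contract is roughly the following (a minimal sketch; the exact property types are assumptions based on the descriptions in this document):

public interface IRingKernelMessage
{
    Guid MessageId { get; }      // unique per message (128-bit)
    string MessageType { get; }  // message type discriminator
    long Timestamp { get; }      // Int64 timestamp
}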

Current Kernel Launch Flow

// CudaRingKernelRuntime.cs:328-329 (PLACEHOLDER - NO ACTUAL EXECUTION)
// TODO: Replace placeholder with actual cuLaunchCooperativeKernel call
// For now, just log the kernel function pointer
_logger.LogDebug("Ring kernel compiled successfully: {KernelPtr:X16}", kernelPtr);

Key Finding: The kernel is compiled to placeholder PTX but never actually launched on the GPU.

Placeholder PTX Generation

// CudaRingKernelRuntime.cs:1013-1053
private string GeneratePlaceholderPTX(string kernelId, int inputQueueCapacity, int outputQueueCapacity)
{
    // Generates simple PTX that just logs execution
    // DOES NOT process messages from input queue
    // DOES NOT write results to output queue
    // Serves only as a compilation test
}

Gap: This PTX is not derived from the actual [RingKernel] C# method; the user's kernel logic is ignored entirely.

Example Ring Kernel (Not Compiled)

// VectorAddRingKernel.cs
[RingKernel(
    KernelId = "VectorAdd",
    InputQueueCapacity = 128,
    OutputQueueCapacity = 128)]
public static void Execute(
    Span<long> timestamps,
    Span<VectorAddRequest> requestQueue,
    Span<VectorAddResponse> responseQueue,
    ReadOnlySpan<float> vectorA,
    ReadOnlySpan<float> vectorB,
    Span<float> result)
{
    // This C# code is NEVER compiled to PTX
    // The placeholder PTX ignores this logic entirely
}

Problem Statement

Critical Gaps

  1. No C# → PTX Compilation

    • [RingKernel] methods are not compiled to PTX
    • Placeholder PTX ignores user kernel logic
    • No integration between C# source and GPU execution
  2. No Cooperative Kernel Launch

    • cuLaunchCooperativeKernel call is commented out
    • Kernels are compiled but never executed
    • No message processing on GPU
  3. No Execution Validation

    • No CUDA event timing
    • No verification of GPU execution
    • No correctness tests for kernel output
  4. No Performance Metrics

    • No profiling infrastructure
    • No Nsight Compute integration
    • No latency/throughput measurements
  5. No End-to-End Testing

    • Tests only verify placeholder compilation
    • No tests for actual message processing
    • No GPU execution validation

Impact

  • Current Status: Ring Kernel system is a non-functional prototype
  • User Experience: Developers cannot write custom GPU kernels
  • Performance: Zero GPU acceleration (CPU fallback not implemented)
  • Production Readiness: Cannot be deployed in current state

Design Goals

Primary Goals

  1. Compile C# to PTX: Transform [RingKernel] methods into executable GPU code
  2. Execute on GPU: Launch compiled kernels using cooperative groups
  3. Process Messages: Read from input queue, execute logic, write to output queue
  4. Validate Execution: Prove kernels run on GPU with event timing
  5. Measure Performance: Integrate profiling and metrics collection

Non-Goals (Future Work)

  • Multi-GPU ring kernels (Phase 4 - future)
  • Dynamic kernel recompilation
  • JIT optimization
  • CPU fallback execution

Quality Requirements

  • Correctness: 100% of messages processed correctly
  • Performance: <50ns per message overhead (Ring Kernel latency budget)
  • Reliability: Graceful error handling, no GPU crashes
  • Testability: 90%+ code coverage, comprehensive validation tests

Architecture Overview

High-Level Flow

┌─────────────────────────────────────────────────────────────────┐
│ Developer writes [RingKernel] method in C#                      │
│                                                                  │
│  [RingKernel(KernelId = "VectorAdd")]                          │
│  public static void Execute(                                    │
│      Span<VectorAddRequest> requests,                          │
│      Span<VectorAddResponse> responses) { ... }                │
└─────────────────────┬───────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────────┐
│ Source Generator (RingKernelCodeBuilder)                        │
│  ✓ Generates C# host-side wrapper                              │
│  ✓ Generates CUDA kernel stub (NEW)                            │
│  ✓ Emits kernel signature metadata                             │
└─────────────────────┬───────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────────┐
│ CudaRingKernelCompiler                                          │
│  ✓ Detects [RingKernel] method signature                       │
│  ✓ Generates CUDA C++ kernel code                              │
│  ✓ Invokes PTXCompiler with NVRTC                              │
│  ✓ Loads compiled PTX module                                   │
│  ✓ Retrieves mangled function pointer                          │
└─────────────────────┬───────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────────┐
│ CudaRingKernelRuntime                                           │
│  ✓ Creates MessageQueueBridge instances                        │
│  ✓ Allocates GPU buffers for message queues                    │
│  ✓ Launches kernel with cuLaunchCooperativeKernel              │
│  ✓ Monitors execution with CUDA events                         │
│  ✓ Pumps messages Host ↔ Device                                │
└─────────────────────┬───────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────────┐
│ GPU Kernel Execution (Persistent)                               │
│  ✓ Cooperative thread block synchronization                    │
│  ✓ Reads messages from input queue buffer                      │
│  ✓ Executes user kernel logic                                  │
│  ✓ Writes results to output queue buffer                       │
│  ✓ Synchronizes with grid barrier                              │
└─────────────────────────────────────────────────────────────────┘

Key Components

Component                Responsibility                      Status
RingKernelCodeBuilder    Generate C# wrapper + CUDA stub     ✅ Partial (wrapper only)
CudaRingKernelCompiler   Compile C# → PTX                    ❌ Placeholder only
PTXCompiler              Invoke NVRTC, load module           ✅ Complete
CudaRingKernelRuntime    Launch kernel, manage lifecycle     ❌ No kernel launch
MessageQueueBridge       Host ↔ Device message transfer      ✅ Complete (validated)
CudaEventTimer           GPU execution timing                ❌ Not implemented

Component Design

1. Source Generator Enhancement

File: /src/Runtime/DotCompute.SourceGenerators/RingKernelCodeBuilder.cs

Current Behavior:

  • Generates C# host-side wrapper class
  • Creates method signature metadata

Required Changes:

// NEW: Generate CUDA kernel stub alongside C# wrapper
public void GenerateCudaKernelStub(RingKernelMethod kernel, StringBuilder output)
{
    output.AppendLine($"extern \"C\" __global__ void __launch_bounds__({kernel.ThreadsPerBlock})");
    output.AppendLine($"{kernel.KernelId}_kernel(");

    // Generate parameter list from C# signature
    output.AppendLine("    long* timestamps,");
    output.AppendLine($"    {GetCudaType(kernel.InputType)}* input_queue,");
    output.AppendLine($"    {GetCudaType(kernel.OutputType)}* output_queue,");

    // Additional kernel parameters from Span<> arguments
    foreach (var param in kernel.AdditionalParameters)
    {
        output.AppendLine($"    {GetCudaType(param.Type)}* {param.Name},");
    }

    output.AppendLine("    int input_capacity,");
    output.AppendLine("    int output_capacity)");
    output.AppendLine("{");
    output.AppendLine("    // Cooperative grid synchronization");
    output.AppendLine("    cooperative_groups::grid_group grid = cooperative_groups::this_grid();");
    output.AppendLine("    ");
    output.AppendLine("    // Persistent kernel loop");
    output.AppendLine("    while (true) {");
    output.AppendLine("        // TODO: Implement message processing logic");
    output.AppendLine("        grid.sync();");
    output.AppendLine("    }");
    output.AppendLine("}");
}

Design Decision: Generate the CUDA stub during the source generation phase, not at runtime. This lets developers inspect the generated CUDA code and debug it more easily.
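
For illustration, running the stub generator against the VectorAdd example would produce approximately this output (the RingKernelMethod shape is assumed from this design):

var sb = new StringBuilder();
builder.GenerateCudaKernelStub(vectorAddKernelMethod, sb);

// sb now contains, approximately:
//   #include <cooperative_groups.h>
//
//   extern "C" __global__ void __launch_bounds__(256)
//   VectorAdd_kernel(
//       long long* timestamps,
//       VectorAddRequest* input_queue,
//       VectorAddResponse* output_queue,
//       float* vectorA,
//       float* vectorB,
//       float* result,
//       int input_capacity,
//       int output_capacity)
//   { ... persistent loop with grid.sync() ... }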

2. C# → CUDA Type Mapping

File: /src/Backends/DotCompute.Backends.CUDA/Compilation/CudaTypeMapper.cs (NEW)

/// <summary>
/// Maps C# types to CUDA C++ types for kernel generation.
/// </summary>
public static class CudaTypeMapper
{
    public static string GetCudaType(Type csType)
    {
        return csType.Name switch
        {
            "Int32" => "int",
            "Int64" => "long long",
            "Single" => "float",
            "Double" => "double",
            "Byte" => "unsigned char",
            "Boolean" => "bool",
            _ when csType.IsGenericType &&
                   (csType.GetGenericTypeDefinition() == typeof(Span<>) ||
                    csType.GetGenericTypeDefinition() == typeof(ReadOnlySpan<>))
                => $"{GetCudaType(csType.GenericTypeArguments[0])}*",
            _ when csType.IsValueType
                => csType.Name, // Assume struct with same name exists in CUDA
            _ => throw new NotSupportedException($"Type {csType.Name} not supported in CUDA kernels")
        };
    }

    public static string GetMemoryPackSerializationSize(Type messageType)
    {
        // Fixed upper bound for MemoryPack messages, emitted into generated
        // CUDA source: header (256 bytes) + payload (up to 64 KB).
        // TODO: derive the actual size from the message type's fields.
        return "65792"; // 256 + 65536
    }
}
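
A quick illustration of the mapping (note that typeof(float).Name is "Single", hence the case labels above):

CudaTypeMapper.GetCudaType(typeof(int));              // "int"
CudaTypeMapper.GetCudaType(typeof(float));            // "float"
CudaTypeMapper.GetCudaType(typeof(Span<float>));      // "float*"
CudaTypeMapper.GetCudaType(typeof(VectorAddRequest)); // "VectorAddRequest" (value-type passthrough)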

3. Ring Kernel Compiler (Complete Rewrite)

File: /src/Backends/DotCompute.Backends.CUDA/RingKernels/CudaRingKernelCompiler.cs

Current Issues:

  • Only generates placeholder PTX
  • Ignores [RingKernel] method body
  • No integration with user C# code

New Architecture:

/// <summary>
/// Compiles [RingKernel] C# methods to CUDA PTX using multi-stage pipeline.
/// </summary>
public class CudaRingKernelCompiler
{
    private readonly ILogger<CudaRingKernelCompiler> _logger;
    private readonly ConcurrentDictionary<string, CompiledKernel> _compiledKernels = new();

    /// <summary>
    /// Compilation stages for Ring Kernel PTX generation.
    /// </summary>
    public enum CompilationStage
    {
        Discovery,      // Find [RingKernel] method via reflection
        Analysis,       // Extract method signature, parameters, types
        CudaGeneration, // Generate CUDA C++ kernel code
        PTXCompilation, // Compile CUDA → PTX with NVRTC
        ModuleLoad,     // Load PTX module into CUDA context
        Verification    // Verify function pointer retrieval
    }

    public async Task<CompiledKernel> CompileRingKernelAsync(
        string kernelId,
        CompilationOptions options,
        CancellationToken cancellationToken)
    {
        // Stage 1: Discovery - Find [RingKernel] method
        var kernelMethod = DiscoverRingKernelMethod(kernelId);
        if (kernelMethod == null)
        {
            throw new InvalidOperationException(
                $"Ring kernel '{kernelId}' not found. Ensure method has [RingKernel] attribute.");
        }

        // Stage 2: Analysis - Extract signature metadata
        var metadata = AnalyzeKernelSignature(kernelMethod);

        // Stage 3: CUDA Generation - Create CUDA C++ code
        var cudaSource = GenerateCudaKernelSource(metadata, options);

        // Stage 4: PTX Compilation - Invoke NVRTC
        var ptxBytes = await PTXCompiler.CompileToPtxAsync(
            cudaSource,
            kernelId,
            options with {
                IncludePaths = ["/usr/local/cuda/include"],
                CompilerFlags = ["--use_fast_math", "-std=c++17", "-rdc=true"]
            },
            _logger);

        // Stage 5: Module Load - Load into CUDA context
        var module = await LoadPTXModuleAsync(ptxBytes, kernelId);

        // Stage 6: Verification - Get kernel function pointer
        var functionPtr = GetKernelFunctionPointer(module, kernelId);

        var compiled = new CompiledKernel(
            kernelId,
            metadata,
            module,
            functionPtr,
            ptxBytes);

        _compiledKernels[kernelId] = compiled;

        _logger.LogInformation(
            "Compiled Ring Kernel '{KernelId}': PTX={PtxSize} bytes, " +
            "Input={InputType}, Output={OutputType}, Function={FunctionPtr:X16}",
            kernelId, ptxBytes.Length, metadata.InputType.Name,
            metadata.OutputType.Name, functionPtr.ToInt64());

        return compiled;
    }

    private MethodInfo? DiscoverRingKernelMethod(string kernelId)
    {
        // Search all loaded assemblies for [RingKernel] methods
        var assemblies = AppDomain.CurrentDomain.GetAssemblies();

        foreach (var assembly in assemblies)
        {
            foreach (var type in assembly.GetTypes())
            {
                foreach (var method in type.GetMethods(
                    BindingFlags.Public | BindingFlags.Static))
                {
                    var attr = method.GetCustomAttribute<RingKernelAttribute>();
                    if (attr?.KernelId == kernelId)
                    {
                        return method;
                    }
                }
            }
        }

        return null;
    }

    private KernelMetadata AnalyzeKernelSignature(MethodInfo method)
    {
        var parameters = method.GetParameters();

        // Ring kernel signature pattern:
        // param[0]: Span<long> timestamps
        // param[1]: Span<TInput> requestQueue
        // param[2]: Span<TOutput> responseQueue
        // param[3+]: Additional Span<T> parameters

        if (parameters.Length < 3)
        {
            throw new InvalidOperationException(
                $"Ring kernel must have at least 3 parameters: " +
                $"Span<long> timestamps, Span<TInput> requests, Span<TOutput> responses");
        }

        var inputType = ExtractSpanElementType(parameters[1].ParameterType);
        var outputType = ExtractSpanElementType(parameters[2].ParameterType);

        var additionalParams = parameters.Skip(3)
            .Select(p => new KernelParameter(
                p.Name!,
                ExtractSpanElementType(p.ParameterType)))
            .ToList();

        // Queue capacities come from the [RingKernel] attribute so the
        // runtime can size buffers and launch dimensions later.
        var attr = method.GetCustomAttribute<RingKernelAttribute>()!;

        return new KernelMetadata(
            method.Name,
            inputType,
            outputType,
            additionalParams,
            attr.InputQueueCapacity,
            attr.OutputQueueCapacity);
    }
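
    // Sketch of the helper used above (an assumption of this design; not in
    // the current codebase): unwraps Span<T>/ReadOnlySpan<T> to the element type T.
    private static Type ExtractSpanElementType(Type spanType)
    {
        if (spanType.IsGenericType &&
            (spanType.GetGenericTypeDefinition() == typeof(Span<>) ||
             spanType.GetGenericTypeDefinition() == typeof(ReadOnlySpan<>)))
        {
            return spanType.GenericTypeArguments[0];
        }

        throw new InvalidOperationException(
            $"Ring kernel parameters must be Span<T> or ReadOnlySpan<T>; got {spanType.Name}.");
    }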

    private string GenerateCudaKernelSource(
        KernelMetadata metadata,
        CompilationOptions options)
    {
        var sb = new StringBuilder();

        // CUDA headers
        sb.AppendLine("#include <cooperative_groups.h>");
        sb.AppendLine("#include <cuda_runtime.h>");
        sb.AppendLine();

        // MemoryPack message structure definitions
        sb.AppendLine($"// Input message: {metadata.InputType.Name}");
        sb.AppendLine(GenerateMessageStructure(metadata.InputType));
        sb.AppendLine();

        sb.AppendLine($"// Output message: {metadata.OutputType.Name}");
        sb.AppendLine(GenerateMessageStructure(metadata.OutputType));
        sb.AppendLine();

        // Kernel function
        sb.AppendLine($"extern \"C\" __global__ void __launch_bounds__(256)");
        sb.AppendLine($"{metadata.KernelName}_kernel(");
        sb.AppendLine($"    long long* timestamps,");
        sb.AppendLine($"    {CudaTypeMapper.GetCudaType(metadata.InputType)}* input_queue,");
        sb.AppendLine($"    {CudaTypeMapper.GetCudaType(metadata.OutputType)}* output_queue,");

        foreach (var param in metadata.AdditionalParameters)
        {
            sb.AppendLine($"    {CudaTypeMapper.GetCudaType(param.Type)}* {param.Name},");
        }

        sb.AppendLine($"    int input_capacity,");
        sb.AppendLine($"    int output_capacity)");
        sb.AppendLine("{");
        sb.AppendLine("    // Cooperative grid group for synchronization");
        sb.AppendLine("    namespace cg = cooperative_groups;");
        sb.AppendLine("    cg::grid_group grid = cg::this_grid();");
        sb.AppendLine();
        sb.AppendLine("    const int tid = blockIdx.x * blockDim.x + threadIdx.x;");
        sb.AppendLine();
        sb.AppendLine("    // Persistent kernel loop");
        sb.AppendLine("    while (true) {");
        sb.AppendLine("        // Poll for messages");
        sb.AppendLine("        if (tid < input_capacity) {");
        sb.AppendLine("            auto request = input_queue[tid];");
        sb.AppendLine("            ");
        sb.AppendLine("            // TODO: Translate C# kernel logic to CUDA");
        sb.AppendLine("            // This requires IL → CUDA transpilation (future work)");
        sb.AppendLine("            ");
        sb.AppendLine("            // Write response");
        sb.AppendLine("            if (tid < output_capacity) {");
        sb.AppendLine("                output_queue[tid] = response;");
        sb.AppendLine("            }");
        sb.AppendLine("        }");
        sb.AppendLine();
        sb.AppendLine("        // Synchronize all threads in grid");
        sb.AppendLine("        grid.sync();");
        sb.AppendLine("    }");
        sb.AppendLine("}");

        return sb.ToString();
    }

    private string GenerateMessageStructure(Type messageType)
    {
        // Generate CUDA struct matching MemoryPack serialization layout
        var sb = new StringBuilder();

        sb.AppendLine($"struct {messageType.Name} {{");

        // IRingKernelMessage base fields
        sb.AppendLine("    char message_id[16];      // Guid (128 bits)");
        sb.AppendLine("    char message_type[256];   // String (max 256 chars)");
        sb.AppendLine("    long long timestamp;      // Int64");

        // Message-specific fields
        foreach (var prop in messageType.GetProperties())
        {
            // Skip base interface fields (already emitted above). Note:
            // DeclaringType is the implementing class for interface members,
            // so match by name against the interface instead.
            if (typeof(IRingKernelMessage).GetProperty(prop.Name) != null)
                continue;

            var cudaType = CudaTypeMapper.GetCudaType(prop.PropertyType);
            sb.AppendLine($"    {cudaType} {ToCamelCase(prop.Name)};");
        }

        sb.AppendLine("};");

        return sb.ToString();
    }
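
    // Helper referenced above (a sketch; the real implementation may differ):
    // lower-cases the first character to match the CUDA field naming used here.
    private static string ToCamelCase(string name)
        => string.IsNullOrEmpty(name) ? name : char.ToLowerInvariant(name[0]) + name[1..];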
}

Key Design Decisions:

  1. Multi-Stage Compilation: Clear separation of Discovery → Analysis → Generation → Compilation → Load → Verification
  2. Reflection-Based Discovery: Find [RingKernel] methods at runtime (compatible with Native AOT via source generators)
  3. MemoryPack Structure Generation: Generate CUDA structs matching MemoryPack serialization layout
  4. Cooperative Groups: Use cooperative_groups::grid_group for grid-wide synchronization
  5. Persistent Kernel Loop: Kernel runs indefinitely, polling for messages

4. Cooperative Kernel Launch

File: /src/Backends/DotCompute.Backends.CUDA/RingKernels/CudaRingKernelRuntime.cs

Current Code (lines 328-329):

// TODO: Replace placeholder with actual cuLaunchCooperativeKernel call
// For now, just log the kernel function pointer
_logger.LogDebug("Ring kernel compiled successfully: {KernelPtr:X16}", kernelPtr);

New Implementation:

/// <summary>
/// Launches a ring kernel using cooperative groups for grid-wide synchronization.
/// </summary>
private async Task LaunchCooperativeKernelAsync(
    CompiledKernel kernel,
    GpuByteBuffer inputBuffer,
    GpuByteBuffer outputBuffer,
    CudaStream stream,
    CancellationToken cancellationToken)
{
    // Calculate grid dimensions
    var (gridDim, blockDim) = CalculateKernelDimensions(
        kernel.Metadata.InputQueueCapacity,
        _deviceProperties.MaxThreadsPerBlock);

    // Prepare kernel parameters (base signature only; additional Span<>
    // parameters from the kernel metadata would be appended the same way)
    var kernelParams = new IntPtr[]
    {
        Marshal.AllocHGlobal(IntPtr.Size), // timestamps
        Marshal.AllocHGlobal(IntPtr.Size), // input_queue
        Marshal.AllocHGlobal(IntPtr.Size), // output_queue
        Marshal.AllocHGlobal(sizeof(int)),  // input_capacity
        Marshal.AllocHGlobal(sizeof(int))   // output_capacity
    };

    try
    {
        // Marshal pointers
        Marshal.WriteIntPtr(kernelParams[0], _timestampBuffer.DevicePtr);
        Marshal.WriteIntPtr(kernelParams[1], inputBuffer.DevicePtr);
        Marshal.WriteIntPtr(kernelParams[2], outputBuffer.DevicePtr);
        Marshal.WriteInt32(kernelParams[3], kernel.Metadata.InputQueueCapacity);
        Marshal.WriteInt32(kernelParams[4], kernel.Metadata.OutputQueueCapacity);

        // Create CUDA events for timing
        var startEvent = await CudaEventTimer.CreateEventAsync(_cudaContext);
        var endEvent = await CudaEventTimer.CreateEventAsync(_cudaContext);

        // Record start event
        await CudaEventTimer.RecordEventAsync(startEvent, stream, _cudaContext);

        // Launch cooperative kernel
        _logger.LogInformation(
            "Launching cooperative kernel '{KernelId}': Grid={GridDim}, Block={BlockDim}",
            kernel.KernelId, gridDim, blockDim);

        var launchResult = CudaApi.cuLaunchCooperativeKernel(
            kernel.FunctionPtr,
            gridDim.x, gridDim.y, gridDim.z,
            blockDim.x, blockDim.y, blockDim.z,
            sharedMemBytes: 0,
            stream.Handle,
            kernelParams);

        if (launchResult != CudaError.Success)
        {
            throw new InvalidOperationException(
                $"Cooperative kernel launch failed: {launchResult}");
        }

        // Record end event
        await CudaEventTimer.RecordEventAsync(endEvent, stream, _cudaContext);

        // NOTE: a persistent kernel never finishes on its own, so a full
        // stream synchronize blocks until the kernel is signaled to
        // terminate; startup can instead be validated by querying the
        // start event. This sketch assumes such a termination signal.
        await stream.SynchronizeAsync(cancellationToken);

        // Measure elapsed time
        var elapsedMs = await CudaEventTimer.ElapsedTimeAsync(
            startEvent, endEvent, _cudaContext);

        _logger.LogInformation(
            "Cooperative kernel '{KernelId}' launched successfully. " +
            "Kernel startup time: {ElapsedMs:F3} ms",
            kernel.KernelId, elapsedMs);

        // Store timing event for validation tests
        _kernelTimingEvents[kernel.KernelId] = (startEvent, endEvent);
    }
    finally
    {
        // Free parameter memory
        foreach (var param in kernelParams)
        {
            Marshal.FreeHGlobal(param);
        }
    }
}

private (dim3 gridDim, dim3 blockDim) CalculateKernelDimensions(
    int messageCapacity,
    int maxThreadsPerBlock)
{
    // Use 256 threads per block (a good default for most GPUs)
    var threadsPerBlock = Math.Min(256, maxThreadsPerBlock);

    // Calculate blocks needed to cover message capacity
    var blocks = (messageCapacity + threadsPerBlock - 1) / threadsPerBlock;

    // Limit blocks to the device maximum. For cooperative launches, all
    // blocks must also be co-resident; see the occupancy sketch below.
    blocks = Math.Min(blocks, _deviceProperties.MaxGridDim.x);

    return (
        gridDim: new dim3((uint)blocks, 1, 1),
        blockDim: new dim3((uint)threadsPerBlock, 1, 1)
    );
}
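
For cooperative launches specifically, the block count must also be clamped to what can be co-resident on the device. A sketch of that clamp (the underlying cuOccupancyMaxActiveBlocksPerMultiprocessor driver call is real; the C# binding and property names shown are assumptions):

// Hypothetical binding over cuOccupancyMaxActiveBlocksPerMultiprocessor.
// Cooperative launches require every block to be resident simultaneously.
int maxBlocksPerSm = 0;
CudaApi.cuOccupancyMaxActiveBlocksPerMultiprocessor(
    ref maxBlocksPerSm,
    kernel.FunctionPtr,
    threadsPerBlock,
    dynamicSharedMemBytes: 0);

var maxCooperativeBlocks = maxBlocksPerSm * _deviceProperties.MultiProcessorCount;
blocks = Math.Min(blocks, maxCooperativeBlocks);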

5. CUDA Event Timing Infrastructure

File: /src/Backends/DotCompute.Backends.CUDA/Timing/CudaEventTimer.cs (NEW)

/// <summary>
/// Provides high-precision GPU timing using CUDA events.
/// </summary>
public static class CudaEventTimer
{
    /// <summary>
    /// Creates a CUDA event for timing.
    /// </summary>
    public static async Task<IntPtr> CreateEventAsync(IntPtr cudaContext)
    {
        return await Task.Run(() =>
        {
            CudaRuntime.cuCtxSetCurrent(cudaContext);

            var eventPtr = IntPtr.Zero;
            var result = CudaRuntime.cudaEventCreate(ref eventPtr);

            if (result != CudaError.Success)
            {
                throw new InvalidOperationException(
                    $"Failed to create CUDA event: {result}");
            }

            return eventPtr;
        });
    }

    /// <summary>
    /// Records a CUDA event in the specified stream.
    /// </summary>
    public static async Task RecordEventAsync(
        IntPtr eventPtr,
        CudaStream stream,
        IntPtr cudaContext)
    {
        await Task.Run(() =>
        {
            CudaRuntime.cuCtxSetCurrent(cudaContext);

            var result = CudaRuntime.cudaEventRecord(eventPtr, stream.Handle);

            if (result != CudaError.Success)
            {
                throw new InvalidOperationException(
                    $"Failed to record CUDA event: {result}");
            }
        });
    }

    /// <summary>
    /// Calculates elapsed time between two CUDA events in milliseconds.
    /// </summary>
    public static async Task<float> ElapsedTimeAsync(
        IntPtr startEvent,
        IntPtr endEvent,
        IntPtr cudaContext)
    {
        return await Task.Run(() =>
        {
            CudaRuntime.cuCtxSetCurrent(cudaContext);

            // Synchronize end event
            var syncResult = CudaRuntime.cudaEventSynchronize(endEvent);
            if (syncResult != CudaError.Success)
            {
                throw new InvalidOperationException(
                    $"Failed to synchronize CUDA event: {syncResult}");
            }

            // Calculate elapsed time
            var elapsedMs = 0f;
            var result = CudaRuntime.cudaEventElapsedTime(
                ref elapsedMs, startEvent, endEvent);

            if (result != CudaError.Success)
            {
                throw new InvalidOperationException(
                    $"Failed to calculate CUDA event elapsed time: {result}");
            }

            return elapsedMs;
        });
    }

    /// <summary>
    /// Destroys a CUDA event and releases resources.
    /// </summary>
    public static async Task DestroyEventAsync(
        IntPtr eventPtr,
        IntPtr cudaContext)
    {
        await Task.Run(() =>
        {
            CudaRuntime.cuCtxSetCurrent(cudaContext);

            var result = CudaRuntime.cudaEventDestroy(eventPtr);

            if (result != CudaError.Success)
            {
                throw new InvalidOperationException(
                    $"Failed to destroy CUDA event: {result}");
            }
        });
    }
}
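
A usage sketch for the timer (context and stream handles assumed to come from the runtime):

var start = await CudaEventTimer.CreateEventAsync(cudaContext);
var end = await CudaEventTimer.CreateEventAsync(cudaContext);

await CudaEventTimer.RecordEventAsync(start, stream, cudaContext);
// ... enqueue GPU work on the stream ...
await CudaEventTimer.RecordEventAsync(end, stream, cudaContext);

var elapsedMs = await CudaEventTimer.ElapsedTimeAsync(start, end, cudaContext);

await CudaEventTimer.DestroyEventAsync(start, cudaContext);
await CudaEventTimer.DestroyEventAsync(end, cudaContext);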

Implementation Phases

Phase 1: Foundation (Week 1-2)

Goal: Establish compilation pipeline infrastructure

Tasks:

  1. Create CudaTypeMapper for C# → CUDA type conversion
  2. Enhance RingKernelCodeBuilder to generate CUDA stubs
  3. Implement kernel discovery via reflection
  4. Create CudaEventTimer timing infrastructure

Deliverables:

  • Type mapping for primitive types and structs
  • CUDA kernel stub generation
  • Reflection-based kernel discovery
  • CUDA event timing API

Success Criteria:

  • ✅ Can discover [RingKernel] methods via reflection
  • ✅ Can generate CUDA kernel stub from C# signature
  • ✅ Can map C# types to CUDA types
  • ✅ Can create and record CUDA events

Phase 2: Compilation Pipeline (Week 3-4)

Goal: Implement full C# → PTX compilation

Tasks:

  1. Rewrite CudaRingKernelCompiler multi-stage pipeline
  2. Implement GenerateCudaKernelSource() with MemoryPack structs
  3. Integrate with existing PTXCompiler infrastructure
  4. Add PTX module loading and function pointer retrieval
  5. Implement cooperative kernel launch

Deliverables:

  • Complete compilation pipeline (Discovery → PTX)
  • MemoryPack-compatible CUDA structures
  • Cooperative kernel launch implementation
  • Function pointer verification

Success Criteria:

  • ✅ Can compile [RingKernel] method to valid PTX
  • ✅ PTX module loads successfully
  • ✅ Can retrieve kernel function pointer
  • ✅ cuLaunchCooperativeKernel succeeds without errors

Phase 3: Message Processing (Week 5-6)

Goal: Implement GPU-side message processing logic

Tasks:

  1. Implement message deserialization in CUDA kernel
  2. Add message validation on GPU (MessageId, MessageType checks)
  3. Implement kernel logic placeholder (developer customization point)
  4. Add response serialization and queue write
  5. Integrate with MessageQueueBridge for Host ↔ Device transfer

Deliverables:

  • GPU message deserialization
  • Message validation logic
  • Response queue writing
  • End-to-end message flow

Success Criteria:

  • ✅ Kernel reads valid messages from input queue
  • ✅ Kernel writes responses to output queue
  • ✅ Host receives responses via MessageQueueBridge
  • ✅ Message count matches expected (no anomalies)

Phase 4: Validation & Testing (Week 7-8)

Goal: Comprehensive testing and validation

Tasks:

  1. Create GPU execution validation tests
  2. Add CUDA event timing verification
  3. Implement correctness tests (vector addition, matrix multiply)
  4. Add performance benchmarks
  5. Integrate Nsight Compute profiling

Deliverables:

  • 20+ GPU execution validation tests
  • CUDA event timing tests
  • Correctness validation (VectorAdd, MatMul)
  • Performance benchmarks
  • Nsight Compute integration

Success Criteria:

  • ✅ 100% of messages processed correctly
  • ✅ CUDA event timing shows GPU execution
  • ✅ Kernel appears in Nsight Compute trace
  • ✅ Performance meets <50ns per message target

Testing Strategy

Unit Tests

Test Coverage:

  • CudaTypeMapperTests - C# → CUDA type mapping
  • KernelDiscoveryTests - Reflection-based method discovery
  • CudaKernelSourceGenerationTests - CUDA code generation
  • PTXCompilationTests - NVRTC compilation
  • CooperativeKernelLaunchTests - Kernel launch verification
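
As a concrete example, a CudaTypeMapperTests case might look like this (xUnit assumed):

public class CudaTypeMapperTests
{
    [Theory]
    [InlineData(typeof(int), "int")]
    [InlineData(typeof(float), "float")]
    [InlineData(typeof(long), "long long")]
    public void GetCudaType_MapsPrimitives(Type csType, string expected)
    {
        Assert.Equal(expected, CudaTypeMapper.GetCudaType(csType));
    }
}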

Integration Tests

Test Coverage:

  • EndToEndRingKernelTests - Full pipeline (C# → GPU execution)
  • MessageQueueIntegrationTests - Host ↔ Device message flow
  • CudaEventTimingTests - GPU execution timing validation

Hardware Tests

Test Coverage:

  • VectorAddRingKernelTests - Correctness validation
  • MatrixMultiplyRingKernelTests - Complex kernel validation
  • PerformanceBenchmarkTests - Latency/throughput measurements

Validation Criteria

Test Category   Pass Criteria
Type Mapping    100% of C# types map to valid CUDA types
Discovery       All [RingKernel] methods found
Compilation     PTX compiles without errors
Launch          cuLaunchCooperativeKernel returns Success
Timing          CUDA events show >0 ms elapsed time
Correctness     100% of messages processed correctly
Performance     <50ns per message overhead

Performance Considerations

Optimization Targets

Compilation Performance:

  • Target: <500ms for kernel compilation (NVRTC)
  • Strategy: Cache compiled PTX modules by kernel ID + input/output types
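
A possible shape for that cache key (illustrative; the names are assumptions):

// Hypothetical cache key combining kernel ID with message types.
private readonly record struct PtxCacheKey(string KernelId, Type InputType, Type OutputType);

// Look up before invoking NVRTC; compile only on a miss.
if (_ptxCache.TryGetValue(
        new PtxCacheKey(kernelId, metadata.InputType, metadata.OutputType),
        out var cachedPtx))
{
    return cachedPtx;
}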

Message Throughput:

  • Target: >20M messages/second (RTX 2000 Ada)
  • Strategy: Batch message processing, minimize grid synchronization

Latency:

  • Target: <50ns per message overhead
  • Strategy: Persistent kernel loop, avoid kernel launch overhead

Memory Optimization

Buffer Sizing:

  • Input queue: 128 messages × 65,792 bytes ≈ 8.4 MB
  • Output queue: 128 messages × 65,792 bytes ≈ 8.4 MB
  • Total GPU memory: ~17 MB per ring kernel instance
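
The same arithmetic as constants, for a quick sanity check of the figures above:

const int MessageSlotBytes = 256 + 65536;  // header + max payload = 65,792
const int QueueCapacity = 128;
const long QueueBytes = (long)QueueCapacity * MessageSlotBytes; // 8,421,376 ≈ 8.4 MB
const long PerKernelBytes = 2 * QueueBytes;                     // input + output ≈ 16.8 MB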

Memory Pooling:

  • Reuse GPU buffers across kernel invocations
  • Use existing MemoryPool infrastructure
  • Implement buffer compaction for sparse messages

CUDA Optimization

Thread Configuration:

  • Use 256 threads per block (optimal for most GPUs)
  • Launch enough blocks to cover message capacity
  • Leverage L1 cache for message buffers

Grid Synchronization:

  • Minimize grid.sync() calls (high overhead)
  • Use warp-level synchronization where possible
  • Consider per-CTA (thread block) processing

Risk Mitigation

Technical Risks

Risk 1: C# → CUDA Translation Complexity

  • Impact: High - Core functionality
  • Mitigation: Start with simple kernels (VectorAdd), expand incrementally
  • Fallback: Manual CUDA kernel development for complex logic

Risk 2: Cooperative Kernel Launch Failures

  • Impact: Medium - Requires specific GPU capabilities
  • Mitigation: Validate device supports cooperative launch during initialization
  • Fallback: Use standard kernel launch with manual synchronization
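
That initialization check might look like the following (the managed binding names are assumptions; CU_DEVICE_ATTRIBUTE_COOPERATIVE_LAUNCH is the actual driver attribute):

// Hypothetical wrapper over cuDeviceGetAttribute.
int supportsCooperative = 0;
var result = CudaApi.cuDeviceGetAttribute(
    ref supportsCooperative,
    CudaDeviceAttribute.CooperativeLaunch,  // CU_DEVICE_ATTRIBUTE_COOPERATIVE_LAUNCH
    deviceId);

if (result != CudaError.Success || supportsCooperative == 0)
{
    throw new NotSupportedException(
        "Device does not support cooperative kernel launch (compute capability 6.0+ required).");
}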

Risk 3: MemoryPack Serialization Overhead

  • Impact: Medium - Affects performance
  • Mitigation: Benchmark serialization latency, optimize hot paths
  • Fallback: Custom binary serialization for performance-critical kernels

Risk 4: NVRTC Compilation Errors

  • Impact: High - Blocks kernel execution
  • Mitigation: Generate valid CUDA C++ with extensive testing
  • Fallback: Provide detailed error messages, suggest fixes

Schedule Risks

Risk 1: Underestimated Complexity

  • Impact: High - Delays delivery
  • Mitigation: Incremental phases with clear milestones
  • Contingency: Prioritize core functionality, defer optimizations

Risk 2: Hardware Availability

  • Impact: Low - Testing blocked
  • Mitigation: Use existing RTX 2000 Ada (CC 8.9)
  • Contingency: Emulator testing for non-critical paths

Conclusion

This architecture document provides a comprehensive roadmap for implementing a production-ready Ring Kernel compilation pipeline. The multi-phase approach balances technical risk with incremental delivery, ensuring each component is thoroughly tested before integration.

Next Steps:

  1. Review and approve this architecture
  2. Begin Phase 1 implementation (Foundation)
  3. Create tracking issues for each phase
  4. Schedule weekly progress reviews

Timeline: 8 weeks to production-ready implementation
Quality Standard: Production-grade, no shortcuts, comprehensive testing


Document Revision History:

  • v1.0 (January 2025) - Initial architecture design