WGSL Code Generation

The ringkernel-wgpu-codegen crate provides a Rust-to-WGSL transpiler that lets you write GPU kernels in a Rust DSL and generate equivalent WGSL (WebGPU Shading Language) code.

Overview

The transpiler provides full feature parity with ringkernel-cuda-codegen, supporting the same three kernel types:

Global Kernels - Standard @compute entry points
Stencil Kernels - Tile-based kernels with GridPos abstraction
Ring Kernels - Persistent actor kernels (host-driven emulation)

The same Rust DSL code works with both CUDA and WGSL backends—just change the import.

Global Kernels

Generic compute shaders without stencil-specific patterns. Use DSL functions for thread/workgroup indices.

use ringkernel_wgpu_codegen::transpile_global_kernel;
use syn::parse_quote;

let kernel: syn::ItemFn = parse_quote! {
    fn saxpy(x: &[f32], y: &mut [f32], a: f32, n: i32) {
        let idx = block_idx_x() * block_dim_x() + thread_idx_x();
        if idx >= n { return; }
        y[idx as usize] = a * x[idx as usize] + y[idx as usize];
    }
};

let wgsl_code = transpile_global_kernel(&kernel)?;

Generated WGSL:

@group(0) @binding(0) var<storage, read> x: array<f32>;
@group(0) @binding(1) var<storage, read_write> y: array<f32>;
@group(0) @binding(2) var<storage, read> a: f32;
@group(0) @binding(3) var<storage, read> n: i32;

@compute @workgroup_size(256, 1, 1)
fn saxpy(
    @builtin(local_invocation_id) local_invocation_id: vec3<u32>,
    @builtin(workgroup_id) workgroup_id: vec3<u32>,
    @builtin(global_invocation_id) global_invocation_id: vec3<u32>,
    @builtin(num_workgroups) num_workgroups: vec3<u32>
) {
    var idx: i32 = i32(workgroup_id.x) * i32(256u) + i32(local_invocation_id.x);
    if (idx >= n) { return; }
    y[idx] = a * x[idx] + y[idx];
}

Stencil Kernels

For stencil computations (e.g., FDTD, convolutions), use the GridPos abstraction:

use ringkernel_wgpu_codegen::{transpile_stencil_kernel, StencilConfig};

let kernel: syn::ItemFn = parse_quote! {
    fn fdtd(p: &[f32], p_prev: &mut [f32], c2: f32, pos: GridPos) {
        let lap = pos.north(p) + pos.south(p) + pos.east(p) + pos.west(p)
                  - 4.0 * p[pos.idx()];
        p_prev[pos.idx()] = 2.0 * p[pos.idx()] - p_prev[pos.idx()] + c2 * lap;
    }
};

let config = StencilConfig::new("fdtd")
    .with_tile_size(16, 16)
    .with_halo(1);

let wgsl_code = transpile_stencil_kernel(&kernel, &config)?;

Stencil Intrinsics:

pos.idx() - Current cell linear index
pos.north(buf) - Access cell above (buf[idx - buffer_width])
pos.south(buf) - Access cell below (buf[idx + buffer_width])
pos.east(buf) - Access cell to the right (buf[idx + 1])
pos.west(buf) - Access cell to the left (buf[idx - 1])
pos.at(buf, dx, dy) - Access cell at relative offset

Ring Kernels

Persistent GPU kernels that process messages in a loop.

Important: WebGPU does not support true persistent kernels. Ring kernels are emulated using host-driven dispatch loops—the host re-dispatches the kernel until termination.

use ringkernel_wgpu_codegen::{transpile_ring_kernel, RingKernelConfig};

let handler: syn::ItemFn = parse_quote! {
    fn process(ctx: &RingContext, msg: &Request) -> Response {
        let tid = ctx.global_thread_id();
        ctx.sync_threads();
        let result = msg.value * 2.0;
        Response { value: result, id: tid as u64 }
    }
};

let config = RingKernelConfig::new("processor")
    .with_workgroup_size(256)
    .with_hlc(true);
    // Note: K2K is NOT supported in WGPU

let wgsl_code = transpile_ring_kernel(&handler, &config)?;

RingKernelConfig Options

Option	Default	Description
`workgroup_size`	256	Threads per workgroup
`enable_hlc`	false	Enable Hybrid Logical Clocks
`enable_k2k`	false	NOT SUPPORTED - will return error
`max_messages_per_dispatch`	1024	Messages processed per dispatch

K2K Limitation

Kernel-to-kernel (K2K) messaging is not supported in WGPU due to WebGPU’s execution model. Attempting to enable K2K will return an error:

let config = RingKernelConfig::new("processor")
    .with_k2k(true);  // This will error!

let result = transpile_ring_kernel(&handler, &config);
assert!(result.is_err());
// Error: "Kernel-to-kernel (K2K) messaging is not supported in WGPU"

Use host-mediated messaging instead for cross-kernel communication.

DSL Features

Control Flow

// If/else
if condition {
    do_something();
} else {
    do_other();
}

// Match (transpiles to switch/case)
match value {
    0 => handle_zero(),
    1 => handle_one(),
    _ => handle_default(),
}

// Early return
if idx >= n { return; }

Loops

// For loops (range patterns)
for i in 0..n {
    data[i] = 0.0;
}

// Inclusive range
for i in 0..=n {
    process(i);
}

// While loops
while condition {
    iterate();
}

// Infinite loops
loop {
    if done { break; }
    continue;
}

Note: Loops are forbidden in Stencil validation mode but allowed in Generic and RingKernel modes.

Shared Memory (Workgroup Variables)

use ringkernel_wgpu_codegen::{SharedTile, SharedArray, SharedVolume};

// 1D array: var<workgroup> cache: array<f32, 256>;
let cache: SharedArray<f32, 256>;

// 2D tile: var<workgroup> tile: array<array<f32, 16>, 16>;
let tile: SharedTile<f32, 16, 16>;

// 3D volume: var<workgroup> volume: array<array<array<f32, 8>, 8>, 8>;
let volume: SharedVolume<f32, 8, 8, 8>;

Higher-Dimensional Shared Memory

The transpiler supports 2D, 3D, and 4D+ shared memory arrays:

use ringkernel_wgpu_codegen::SharedMemoryConfig;

let mut shared = SharedMemoryConfig::new();

// 2D: array<array<T, Width>, Height>
shared.add_2d("tile", WgslType::F32, 16, 16);

// 3D: array<array<array<T, X>, Y>, Z>
shared.add_3d("volume", WgslType::F32, 8, 8, 8);

// 4D+: Linearized indexing with generated formula
shared.add_nd("hypercube", WgslType::F32, &[4, 4, 4, 4]);

Generated WGSL:

// 2D tile
var<workgroup> tile: array<array<f32, 16>, 16>;

// 3D volume
var<workgroup> volume: array<array<array<f32, 8>, 8>, 8>;

// 4D+ linearized (with helper function)
var<workgroup> hypercube: array<f32, 256>;  // 4*4*4*4 = 256
fn hypercube_index(w: u32, x: u32, y: u32, z: u32) -> u32 {
    return w * 64u + x * 16u + y * 4u + z;
}

Struct Literals

// Rust struct literal
let response = Response { value: result, id: tid as u64 };

// Transpiles to WGSL constructor:
// Response(result, u64(tid))

Type Mapping

Rust Type	WGSL Type	Notes
`f32`	`f32`	Direct mapping
`f64`	`f32`	Warning: Downcast (WGSL has no f64)
`i32`	`i32`	Direct mapping
`u32`	`u32`	Direct mapping
`i64`	`vec2<i32>`	Emulated as lo/hi pair
`u64`	`vec2<u32>`	Emulated as lo/hi pair
`bool`	`bool`	Direct mapping
`usize`	`u32`	WGSL uses 32-bit addressing
`&[T]`	`array<T>` (storage, read)	Storage buffer
`&mut [T]`	`array<T>` (storage, read_write)	Storage buffer

GPU Intrinsics

Thread/Workgroup Indices

Rust DSL	WGSL
`thread_idx_x()`	`local_invocation_id.x`
`thread_idx_y()`	`local_invocation_id.y`
`thread_idx_z()`	`local_invocation_id.z`
`block_idx_x()`	`workgroup_id.x`
`block_idx_y()`	`workgroup_id.y`
`block_idx_z()`	`workgroup_id.z`
`block_dim_x()`	`WORKGROUP_SIZE_X` (constant)
`global_thread_id()`	`global_invocation_id.x`
`grid_dim_x()`	`num_workgroups.x`

Synchronization

Rust DSL	WGSL
`sync_threads()`	`workgroupBarrier()`
`thread_fence()`	`storageBarrier()`
`thread_fence_block()`	`workgroupBarrier()`

Atomics

Rust DSL	WGSL
`atomic_add(ptr, val)`	`atomicAdd(ptr, val)`
`atomic_sub(ptr, val)`	`atomicSub(ptr, val)`
`atomic_min(ptr, val)`	`atomicMin(ptr, val)`
`atomic_max(ptr, val)`	`atomicMax(ptr, val)`
`atomic_exchange(ptr, val)`	`atomicExchange(ptr, val)`
`atomic_cas(ptr, cmp, val)`	`atomicCompareExchangeWeak(ptr, cmp, val)`
`atomic_load(ptr)`	`atomicLoad(ptr)`
`atomic_store(ptr, val)`	`atomicStore(ptr, val)`

Math Functions

Rust DSL	WGSL
`sqrt(x)`	`sqrt(x)`
`rsqrt(x)`	`inverseSqrt(x)`
`abs(x)`	`abs(x)`
`floor(x)`	`floor(x)`
`ceil(x)`	`ceil(x)`
`round(x)`	`round(x)`
`sin(x)`	`sin(x)`
`cos(x)`	`cos(x)`
`tan(x)`	`tan(x)`
`exp(x)`	`exp(x)`
`log(x)`	`log(x)`
`powf(x, y)`	`pow(x, y)`
`min(a, b)`	`min(a, b)`
`max(a, b)`	`max(a, b)`
`clamp(x, lo, hi)`	`clamp(x, lo, hi)`
`fma(a, b, c)`	`fma(a, b, c)`
`mix(a, b, t)`	`mix(a, b, t)`

Subgroup Operations (Warp Equivalents)

Note: These require the chromium_experimental_subgroups extension, which may not be available on all platforms.

Rust DSL	WGSL
`warp_shuffle(val, lane)`	`subgroupShuffle(val, lane)`
`warp_shuffle_up(val, delta)`	`subgroupShuffleUp(val, delta)`
`warp_shuffle_down(val, delta)`	`subgroupShuffleDown(val, delta)`
`warp_shuffle_xor(val, mask)`	`subgroupShuffleXor(val, mask)`
`warp_ballot(pred)`	`subgroupBallot(pred)`
`warp_all(pred)`	`subgroupAll(pred)`
`warp_any(pred)`	`subgroupAny(pred)`
`lane_id()`	`subgroup_invocation_id`
`warp_size()`	`subgroup_size`

64-bit Integer Emulation

WGSL 1.0 does not support 64-bit integers or atomics. The transpiler emulates them using lo/hi u32 pairs:

Type Representation

// u64 becomes vec2<u32> where:
// - x (lo) = lower 32 bits
// - y (hi) = upper 32 bits

Generated Helper Functions

The transpiler automatically includes these helpers for ring kernels:

// Read 64-bit value from atomic pair
fn read_u64(lo: ptr<storage, atomic<u32>, read_write>,
            hi: ptr<storage, atomic<u32>, read_write>) -> vec2<u32> {
    return vec2<u32>(atomicLoad(lo), atomicLoad(hi));
}

// Atomically increment 64-bit value
fn atomic_inc_u64(lo: ptr<storage, atomic<u32>, read_write>,
                  hi: ptr<storage, atomic<u32>, read_write>) {
    let old_lo = atomicAdd(lo, 1u);
    if (old_lo == 0xFFFFFFFFu) {
        atomicAdd(hi, 1u);
    }
}

// Compare two 64-bit values: returns -1, 0, or 1
fn compare_u64(a: vec2<u32>, b: vec2<u32>) -> i32 {
    if (a.y > b.y) { return 1; }
    if (a.y < b.y) { return -1; }
    if (a.x > b.x) { return 1; }
    if (a.x < b.x) { return -1; }
    return 0;
}

Validation Modes

The transpiler uses validation modes to control what constructs are allowed:

Mode	Loops	Description
`Stencil`	Forbidden	Classic stencil kernels (default for `transpile_stencil_kernel`)
`Generic`	Allowed	General-purpose kernels (default for `transpile_global_kernel`)
`RingKernel`	Required	Persistent actor kernels

WGSL vs CUDA Differences

Feature	CUDA	WGSL
Thread index	`threadIdx.x`	`local_invocation_id.x`
Block/workgroup	`blockIdx.x`	`workgroup_id.x`
Sync	`__syncthreads()`	`workgroupBarrier()`
Shared memory	`__shared__`	`var<workgroup>`
64-bit integers	Native	Emulated (vec2)
f64	Native	Not supported
Persistent kernels	GPU loop	Host-driven dispatch
K2K messaging	Supported	Not supported

Testing

The crate includes 55+ tests covering all features:

cargo test -p ringkernel-wgpu-codegen

Test categories:

Type mapping (primitives, slices, 64-bit emulation)
Expression transpilation (binary ops, method calls, indexing)
Intrinsics (thread indices, barriers, math)
Loops (for/while/loop patterns)
Validation (mode-specific constraints)
Higher-dimensional shared memory (2D, 3D, 4D+)

Next: Migration Guide

PreviousCUDA Codegen NextShowcase Applications