Class SharedMemoryAttribute

Namespace: DotCompute.Abstractions.Attributes
Assembly: DotCompute.Abstractions.dll

Declares a shared memory allocation for a kernel method. Shared memory (threadgroup memory in Metal, __shared__ in CUDA, __local in OpenCL) is fast on-chip memory shared between the threads of a single thread block/workgroup.

[AttributeUsage(AttributeTargets.Method, AllowMultiple = true, Inherited = false)]
public sealed class SharedMemoryAttribute : Attribute
Inheritance
object → Attribute → SharedMemoryAttribute

Examples

// Declare 256 floats of shared memory for a reduction
[Kernel(Backends = KernelBackends.CUDA | KernelBackends.Metal)]
[SharedMemory(typeof(float), "shared_data", Size = 256)]
public static void Reduction(ReadOnlySpan<float> input, Span<float> output)
{
    int tid = Kernel.ThreadId.X;
    int gid = Kernel.GlobalId.X;

    // Load into shared memory
    Kernel.SharedMemory<float>("shared_data")[tid] = input[gid];
    Kernel.Barrier();

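    // Parallel reduction in shared memory. The initial stride of 128 is half of
    // the 256-element array, so this loop assumes a thread block size of 256.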
    for (int stride = 128; stride > 0; stride >>= 1)
    {
        if (tid < stride)
            Kernel.SharedMemory<float>("shared_data")[tid] +=
                Kernel.SharedMemory<float>("shared_data")[tid + stride];
        Kernel.Barrier();
    }

    if (tid == 0)
        output[Kernel.BlockId.X] = Kernel.SharedMemory<float>("shared_data")[0];
}
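
The strided loop in the kernel above can be hard to follow without seeing the data move. The following is a minimal CPU-side sketch, not part of DotCompute, that replays the same tree reduction sequentially for a single block. ReductionSketch and ReduceBlock are hypothetical names, and a plain array stands in for shared memory.

```csharp
using System;

public static class ReductionSketch
{
    // Sequentially replays the strided tree reduction for one thread block.
    // The block size (input length) must be a non-zero power of two.
    public static float ReduceBlock(ReadOnlySpan<float> blockInput)
    {
        int blockSize = blockInput.Length;
        if (blockSize == 0 || (blockSize & (blockSize - 1)) != 0)
            throw new ArgumentException("Block size must be a non-zero power of two.", nameof(blockInput));

        float[] shared = blockInput.ToArray(); // stands in for shared memory

        for (int stride = blockSize / 2; stride > 0; stride >>= 1)
        {
            // On a GPU, these iterations run in parallel and end with a barrier.
            // Reads (tid + stride) and writes (tid) never overlap within a step,
            // so replaying them in order gives the same result.
            for (int tid = 0; tid < stride; tid++)
                shared[tid] += shared[tid + stride];
        }

        return shared[0];
    }
}
```

For the values 1 through 256, ReduceBlock returns 32896, the same sum thread 0 would write to output for that block.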

Remarks

Shared memory is significantly faster than global memory (typically 10-100x lower latency) and is essential for many GPU optimization patterns:

  • Reduction operations (sum, min, max across threads)
  • Tiled matrix multiplication (loading tiles into shared memory)
  • Stencil computations (caching neighbor values)
  • Inter-thread communication within a block

Backend Translation:

Backend   Translation
CUDA      __shared__ T name[Size];
Metal     threadgroup T* name [[threadgroup(N)]];
OpenCL    __local T name[Size];
CPU       Thread-local storage or stack allocation
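
To make the table concrete, the sketch below fills in the placeholders T, name, Size and N for a given declaration. It is a hypothetical formatter written for illustration only; SharedDeclarationSketch and Declare are assumed names, and this is not DotCompute's code generator.

```csharp
using System;

public static class SharedDeclarationSketch
{
    // Hypothetical formatter that mirrors the translation table above.
    // The CPU backend has no declaration form, so it is reported as unsupported.
    public static string Declare(string backend, string type, string name, int size, int metalIndex = 0)
        => backend switch
        {
            "CUDA"   => $"__shared__ {type} {name}[{size}];",
            "Metal"  => $"threadgroup {type}* {name} [[threadgroup({metalIndex})]];",
            "OpenCL" => $"__local {type} {name}[{size}];",
            _ => throw new ArgumentException($"No declaration form for backend '{backend}'.", nameof(backend)),
        };
}
```

With these assumptions, Declare("CUDA", "float", "shared_data", 256) yields `__shared__ float shared_data[256];`.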

Limitations:

  • Total shared memory per thread block is limited (typically 32-48 KB on modern GPUs)
  • Only accessible within a single thread block/workgroup
  • Must use barriers for synchronization when sharing data
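
Because the per-block limit applies to the sum of all shared allocations in a kernel, it can help to check the budget before launch. The sketch below is a hypothetical helper, not a DotCompute API. The 48 KB figure is an assumed limit taken from the range above, and the real limit should be queried from the target device.

```csharp
using System;
using System.Linq;

public static class SharedMemoryBudget
{
    // Assumed limit for illustration only; actual limits vary by device.
    public const int AssumedLimitBytes = 48 * 1024;

    // Each entry is (element count, bytes per element) for one shared array.
    public static bool Fits(params (int Elements, int ElementBytes)[] allocations)
        => allocations.Sum(a => (long)a.Elements * a.ElementBytes) <= AssumedLimitBytes;
}
```

Two 256-element float arrays (2048 bytes in total) fit, whereas a single array of 8192 doubles (65536 bytes) does not.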

Constructors

SharedMemoryAttribute(Type, string)

Initializes a new instance of the SharedMemoryAttribute class.

public SharedMemoryAttribute(Type elementType, string name)

Parameters

elementType Type

The type of elements in the shared memory array.

name string

The name used to reference this shared memory in the kernel.
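
Because the attribute is declared with AllowMultiple = true, a kernel method can carry one declaration per shared array. The sketch below shows this with a stand-in attribute that copies the documented constructor and the Size property, so the snippet runs without a DotCompute reference. The SharedMemorySketch namespace, Kernels and Inspector are hypothetical names used only for illustration.

```csharp
using System;
using System.Linq;
using System.Reflection;

namespace SharedMemorySketch
{
    // Stand-in that mirrors the documented declaration. The real type lives in
    // DotCompute.Abstractions.Attributes and has more properties.
    [AttributeUsage(AttributeTargets.Method, AllowMultiple = true, Inherited = false)]
    public sealed class SharedMemoryAttribute : Attribute
    {
        public SharedMemoryAttribute(Type elementType, string name)
        {
            ElementType = elementType;
            Name = name;
        }

        public Type ElementType { get; }
        public string Name { get; }
        public int Size { get; set; }
    }

    public static class Kernels
    {
        // One attribute per shared array; each needs a unique name.
        [SharedMemory(typeof(float), "shared_data", Size = 256)]
        [SharedMemory(typeof(int), "shared_indices", Size = 256)]
        public static void Example() { }
    }

    public static class Inspector
    {
        // Returns "name:type:size" for each declaration, sorted by name because
        // reflection does not guarantee attribute order.
        public static string[] Describe(MethodInfo method) =>
            method.GetCustomAttributes<SharedMemoryAttribute>()
                  .Select(a => $"{a.Name}:{a.ElementType.Name}:{a.Size}")
                  .OrderBy(s => s, StringComparer.Ordinal)
                  .ToArray();
    }
}
```

Calling Inspector.Describe on the Example method returns "shared_data:Single:256" and "shared_indices:Int32:256".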

Properties

Alignment

Gets or sets the memory alignment requirement in bytes.

public int Alignment { get; set; }

Property Value

int

The required memory alignment in bytes. Default is 0 (use natural alignment).

Remarks

Proper alignment can improve memory access performance:

  • 16 bytes - Optimal for float4/int4 vector types
  • 32 bytes - Cache line alignment on some GPUs
  • 128 bytes - Memory transaction alignment on NVIDIA GPUs

Backend Translation:

  • CUDA: __align__(N)
  • Metal: alignas(N)
  • OpenCL: Uses aligned allocation
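
When several shared arrays are packed into one block of memory, an alignment requirement determines where each array may start. The following sketch shows the usual round-up calculation for power-of-two alignments. It is a generic layout helper with hypothetical names (AlignmentSketch, AlignUp), not a description of how DotCompute places allocations.

```csharp
using System;

public static class AlignmentSketch
{
    // Rounds offset up to the next multiple of alignment.
    // The bit trick is only valid when alignment is a positive power of two.
    public static int AlignUp(int offset, int alignment)
    {
        if (alignment <= 0 || (alignment & (alignment - 1)) != 0)
            throw new ArgumentException("Alignment must be a positive power of two.", nameof(alignment));

        return (offset + alignment - 1) & ~(alignment - 1);
    }
}
```

For example, AlignUp(100, 16) returns 112, and AlignUp(128, 128) returns 128 because the offset is already aligned.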

ElementType

Gets the element type for the shared memory array.

public Type ElementType { get; }

Property Value

Type

The type of each element in the shared memory. Must be an unmanaged type (e.g., int, float, double, or an unmanaged struct).

MetalBindingIndex

Gets or sets the threadgroup binding index for Metal shaders.

public int MetalBindingIndex { get; set; }

Property Value

int

The threadgroup buffer binding index for Metal (0-based). Default is -1 (auto-assign based on declaration order).

Remarks

Metal requires explicit binding indices for threadgroup memory:

threadgroup float* shared_data [[threadgroup(0)]];
threadgroup int* shared_indices [[threadgroup(1)]];

When set to -1, binding indices are auto-assigned in declaration order. Set the index explicitly when multiple kernels must share a consistent binding layout.
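
The sketch below illustrates one plausible reading of auto-assignment: each -1 entry takes the lowest index not yet in use, in declaration order. How DotCompute handles a mix of explicit and automatic indices is not stated here, so the skip-used-indices policy and the duplicate check are assumptions, and MetalBindingSketch and Resolve are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class MetalBindingSketch
{
    // Resolves -1 entries to the lowest unused index, in declaration order.
    // Explicit (non-negative) indices are kept as declared.
    public static int[] Resolve(IReadOnlyList<int> declared)
    {
        var used = new HashSet<int>();
        foreach (int index in declared)
        {
            if (index >= 0 && !used.Add(index))
                throw new ArgumentException($"Binding index {index} is declared twice.");
        }

        var resolved = new int[declared.Count];
        int next = 0;
        for (int i = 0; i < declared.Count; i++)
        {
            if (declared[i] >= 0)
            {
                resolved[i] = declared[i];
                continue;
            }

            while (used.Contains(next)) next++; // skip indices already taken
            resolved[i] = next;
            used.Add(next);
        }

        return resolved;
    }
}
```

Under these assumptions, three automatic declarations resolve to 0, 1, 2, and the sequence -1, 0, -1 resolves to 1, 0, 2.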

Name

Gets the name used to reference this shared memory allocation in the kernel code.

public string Name { get; }

Property Value

string

A unique identifier for this shared memory block within the kernel. Used with Kernel.SharedMemory<T>(name) to access the memory.

Size

Gets or sets the number of elements in the shared memory array.

public int Size { get; set; }

Property Value

int

The size of the shared memory array in elements (not bytes). Default is 0, which indicates dynamic sizing based on thread block size.

Remarks

When Size is 0 (dynamic), the shared memory is typically sized to match the thread block size, or its size can be specified at kernel launch time.

Memory Calculation: Total bytes = Size * sizeof(ElementType)
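
As a worked instance of the formula, the sketch below computes the byte footprint of a declaration. The names SharedMemorySizing, EffectiveSize and TotalBytes are hypothetical. Resolving Size = 0 to the thread block size is an assumption based on the "typical" behavior described above, not a guarantee about what the runtime does.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class SharedMemorySizing
{
    // Assumption: a Size of 0 (dynamic) resolves to the thread block size.
    public static int EffectiveSize(int declaredSize, int blockSize)
        => declaredSize == 0 ? blockSize : declaredSize;

    // Total bytes = Size * sizeof(ElementType)
    public static long TotalBytes<T>(int elementCount) where T : unmanaged
        => (long)elementCount * Unsafe.SizeOf<T>();
}
```

For the reduction example, TotalBytes<float>(256) is 1024 bytes. A dynamically sized array of doubles launched with 512 threads per block would be TotalBytes<double>(EffectiveSize(0, 512)), or 4096 bytes.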

Common Sizes:

  • 256 - Standard block size
  • 512/1024 - Larger blocks, which can improve occupancy depending on the kernel's register and shared memory usage
  • 32 - Warp-sized for warp-level operations

ZeroInitialize

Gets or sets whether to zero-initialize the shared memory.

public bool ZeroInitialize { get; set; }

Property Value

bool

true to zero-initialize the shared memory before use; false for uninitialized memory (faster).

Remarks

Zero initialization adds overhead but ensures predictable initial values. Only enable when the algorithm requires known initial state.

Performance: Disabling saves ~1-2 clock cycles per element.
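
To show where the initialization cost comes from, the sketch below replays a common cooperative clearing pattern on the CPU: thread tid zeroes elements tid, tid + blockSize, and so on, so each element is written exactly once. This is a general GPU idiom shown for illustration, and it is an assumption that a backend would clear memory this way. ZeroInitSketch and Clear are hypothetical names.

```csharp
public static class ZeroInitSketch
{
    // Replays a cooperative clear and returns how many times each element
    // was written, so the single-write property can be checked.
    public static int[] Clear(float[] shared, int blockSize)
    {
        var writes = new int[shared.Length];

        for (int tid = 0; tid < blockSize; tid++) // parallel on a GPU
        {
            for (int i = tid; i < shared.Length; i += blockSize)
            {
                shared[i] = 0f;
                writes[i]++;
            }
        }

        // On a GPU, a barrier would follow here before any thread reads the array.
        return writes;
    }
}
```

The barrier after the clear is part of the overhead: no thread may read the array until every thread has finished its share of the writes.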