Class AdvancedSimdKernels

Namespace
DotCompute.Backends.CPU.Kernels
Assembly
DotCompute.Backends.CPU.dll

SIMD kernel implementations covering fused multiply-add (FMA), integer arithmetic, ARM NEON floating-point operations, gather/scatter, conditional selection, and reductions.

public static class AdvancedSimdKernels
Inheritance
AdvancedSimdKernels

Methods

OptimizedMatrixMultiplyFloat32(float*, float*, float*, int, int, int)

Cache-friendly blocked matrix multiplication with FMA optimization. Essential for linear algebra workloads.

public static void OptimizedMatrixMultiplyFloat32(float* a, float* b, float* c, int m, int n, int k)

Parameters

a float*
b float*
c float*
m int
n int
k int
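
The reference does not say how m, n, and k map onto the three matrices. The sketch below assumes the common convention (a is m x k, b is k x n, c is m x n, all row-major) and uses safe arrays instead of pointers. It only illustrates what "cache-friendly blocked" multiplication with FMA means; the class name, the block parameter, and the loop order are hypothetical and are not taken from the library.

```csharp
using System;

public static class BlockedMatMulSketch
{
    // Assumed layout (not confirmed by the reference): row-major,
    // a is m x k, b is k x n, c is m x n.
    public static void Multiply(float[] a, float[] b, float[] c,
                                int m, int n, int k, int block = 64)
    {
        Array.Clear(c, 0, m * n);

        // Walk the matrices tile by tile so each tile stays in cache
        // while it is being reused.
        for (int i0 = 0; i0 < m; i0 += block)
        for (int p0 = 0; p0 < k; p0 += block)
        for (int j0 = 0; j0 < n; j0 += block)
        {
            int iMax = Math.Min(i0 + block, m);
            int pMax = Math.Min(p0 + block, k);
            int jMax = Math.Min(j0 + block, n);

            for (int i = i0; i < iMax; i++)
            for (int p = p0; p < pMax; p++)
            {
                float aip = a[i * k + p];
                for (int j = j0; j < jMax; j++)
                {
                    // c[i,j] += a[i,p] * b[p,j], rounded once per step.
                    c[i * n + j] = MathF.FusedMultiplyAdd(aip, b[p * n + j], c[i * n + j]);
                }
            }
        }
    }
}
```

The innermost loop runs over contiguous elements of b and c, which is the access pattern a SIMD implementation would typically vectorize.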

VectorAddInt16(short*, short*, short*, long)

Vectorized 16-bit integer addition (common in image processing).

public static void VectorAddInt16(short* a, short* b, short* result, long elementCount)

Parameters

a short*
b short*
result short*
elementCount long

VectorAddInt32(int*, int*, int*, long)

Vectorized 32-bit integer addition with full SIMD support.

public static void VectorAddInt32(int* a, int* b, int* result, long elementCount)

Parameters

a int*
b int*
result int*
elementCount long

VectorAdvancedNeonFloat32(float*, float*, float*, float*, long, NeonOperation)

Comprehensive ARM NEON floating-point operations with full instruction coverage.

public static void VectorAdvancedNeonFloat32(float* a, float* b, float* c, float* result, long elementCount, NeonOperation operation)

Parameters

a float*
b float*
c float*
result float*
elementCount long
operation NeonOperation

VectorConditionalSelect(float*, float*, float*, float*, long, float)

Conditional selection: result[i] = condition[i] ? a[i] : b[i]. Uses SIMD masking to avoid per-element branches.

public static void VectorConditionalSelect(float* condition, float* a, float* b, float* result, long count, float threshold)

Parameters

condition float*
a float*
b float*
result float*
count long
threshold float
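
The reference does not state how the float condition array and threshold combine into a boolean. The sketch below assumes the predicate is condition[i] > threshold, which is only a guess at the intended semantics. It shows the masking technique using the portable System.Numerics.Vector<T> API rather than the library's own code path; the class and method names are hypothetical.

```csharp
using System;
using System.Numerics;

public static class ConditionalSelectSketch
{
    // Assumption: an element is "true" when condition[i] > threshold.
    public static void Select(float[] condition, float[] a, float[] b,
                              float[] result, float threshold)
    {
        int n = result.Length;
        int width = Vector<float>.Count;
        var vThreshold = new Vector<float>(threshold);

        int i = 0;
        for (; i <= n - width; i += width)
        {
            var vc = new Vector<float>(condition, i);
            var va = new Vector<float>(a, i);
            var vb = new Vector<float>(b, i);

            // Each lane of the mask is all ones where the comparison holds,
            // so the select picks a or b per lane without branching.
            Vector<int> mask = Vector.GreaterThan(vc, vThreshold);
            Vector.ConditionalSelect(mask, va, vb).CopyTo(result, i);
        }

        // Scalar tail for the elements that do not fill a whole vector.
        for (; i < n; i++)
        {
            result[i] = condition[i] > threshold ? a[i] : b[i];
        }
    }
}
```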

VectorFmaFloat32(float*, float*, float*, float*, long)

Vectorized FMA operation: result = a * b + c using hardware FMA instructions. A fused multiply-add rounds once instead of twice, so it is both more precise and faster than a separate multiply and add, which matters in scientific computing.

public static void VectorFmaFloat32(float* a, float* b, float* c, float* result, long elementCount)

Parameters

a float*
b float*
c float*
result float*
elementCount long
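
As a scalar reference for the element-wise semantics, the sketch below computes result[i] = a[i] * b[i] + c[i] with MathF.FusedMultiplyAdd, which rounds each element once. It uses safe arrays and hypothetical names; it is not the library's vectorized implementation, only a baseline one could compare its output against.

```csharp
using System;

public static class FmaReferenceSketch
{
    // Element-wise fused multiply-add over equal-length arrays.
    public static void Fma(float[] a, float[] b, float[] c, float[] result)
    {
        for (int i = 0; i < result.Length; i++)
        {
            // One rounding per element, unlike a[i] * b[i] + c[i],
            // which may round after the multiply and again after the add.
            result[i] = MathF.FusedMultiplyAdd(a[i], b[i], c[i]);
        }
    }
}
```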

VectorFmaFloat64(double*, double*, double*, double*, long)

Double precision FMA operation.

public static void VectorFmaFloat64(double* a, double* b, double* c, double* result, long elementCount)

Parameters

a double*
b double*
c double*
result double*
elementCount long

VectorGatherFloat32(float*, int*, float*, int)

Gather operation: loads elements from memory using indices. Critical for sparse data and indirect memory access patterns.

public static void VectorGatherFloat32(float* basePtr, int* indices, float* result, int count)

Parameters

basePtr float*
indices int*
result float*
count int

Remarks

Adoption site #1 for the .NET 10 SIMD surface. When AVX2 is available, this method uses Avx2.GatherVector256 to perform a true hardware gather of 8 floats in one instruction; otherwise it falls back to a scalar loop. On AVX-512 hosts it issues two 256-bit gathers back-to-back to cover 16 elements per iteration. .NET 10 SDK 10.0.106 does not expose Avx512F.GatherVector512, so stitching two AVX2 gathers is the best available option without dropping to P/Invoke.
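
A gather reads one source element per index. The scalar sketch below shows that behavior with safe arrays, assuming result[i] = basePtr[indices[i]] for i in [0, count). It corresponds to the scalar fallback in spirit only; the names are hypothetical and the hardware path is not reproduced here.

```csharp
using System;

public static class GatherSketch
{
    // result[i] = source[indices[i]] for every index supplied.
    public static void Gather(float[] source, int[] indices, float[] result)
    {
        for (int i = 0; i < indices.Length; i++)
        {
            result[i] = source[indices[i]];
        }
    }
}
```

Indices may repeat or appear in any order, which is what makes gathers useful for sparse and indirect access.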

VectorHorizontalSum(float*, long)

Optimized horizontal sum reduction with SIMD.

public static float VectorHorizontalSum(float* data, long count)

Parameters

data float*
count long

Returns

float
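
One common way to build a SIMD horizontal sum is to accumulate whole vectors lane by lane and reduce the lanes once at the end. The sketch below does this with the portable System.Numerics.Vector<T> API. It is an assumption about the general technique, not a description of how this method is implemented, and the names are hypothetical.

```csharp
using System;
using System.Numerics;

public static class HorizontalSumSketch
{
    public static float Sum(float[] data)
    {
        int width = Vector<float>.Count;
        var acc = Vector<float>.Zero;

        // Vertical adds: each lane keeps its own running partial sum.
        int i = 0;
        for (; i <= data.Length - width; i += width)
        {
            acc += new Vector<float>(data, i);
        }

        // Horizontal step: collapse the lanes into one scalar.
        float total = Vector.Sum(acc);

        // Scalar tail for leftover elements.
        for (; i < data.Length; i++)
        {
            total += data[i];
        }
        return total;
    }
}
```

Because the additions are reordered relative to a sequential loop, the result can differ from a scalar sum in the last few bits for general floating-point input.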

VectorMultiplyInt64(long*, long*, long*, long)

Vectorized 64-bit integer multiplication.

public static void VectorMultiplyInt64(long* a, long* b, long* result, long elementCount)

Parameters

a long*
b long*
result long*
elementCount long

VectorScatterFloat32(float*, int*, float*, int)

Scatter operation: stores elements to memory using indices.

public static void VectorScatterFloat32(float* values, int* indices, float* basePtr, int count)

Parameters

values float*
indices int*
basePtr float*
count int

Remarks

Adoption site #2 for the .NET 10 SIMD surface. .NET 10 SDK 10.0.106 does not expose Avx512F.Scatter in the x86 intrinsics surface, so the inner loop stays scalar. The previous code issued a pointless AVX-512 load of the values and indices, which the scalar loop then re-read from memory. Removing those dead loads cuts register pressure and lets the JIT optimize the inner loop through loop-invariant code motion (LICM) and its standard handling of scalar stores. If a future .NET SDK exposes scatter intrinsics, this is the single point to revisit.
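
A scatter is the inverse of a gather: each value is written to the destination slot named by its index. The scalar sketch below assumes basePtr[indices[i]] = values[i] for i in [0, count), using safe arrays and hypothetical names. It mirrors the scalar loop described above only in outline.

```csharp
using System;

public static class ScatterSketch
{
    // destination[indices[i]] = values[i] for every index supplied.
    public static void Scatter(float[] values, int[] indices, float[] destination)
    {
        for (int i = 0; i < indices.Length; i++)
        {
            destination[indices[i]] = values[i];
        }
    }
}
```

In this sequential version, when the same index appears more than once, the later write wins. Whether the library guarantees that ordering is not stated in the reference.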