Class AdvancedSimdKernels
- Namespace: DotCompute.Backends.CPU.Kernels
- Assembly: DotCompute.Backends.CPU.dll
Advanced SIMD kernel implementations with complete FMA, integer SIMD, enhanced ARM NEON, and modern vectorization techniques.
public static class AdvancedSimdKernels
- Inheritance: object → AdvancedSimdKernels
- Inherited Members
Methods
OptimizedMatrixMultiplyFloat32(float*, float*, float*, int, int, int)
Cache-friendly blocked matrix multiplication with FMA optimization. Essential for linear algebra workloads.
public static void OptimizedMatrixMultiplyFloat32(float* a, float* b, float* c, int m, int n, int k)
Parameters
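The cache-blocking idea behind this method can be sketched in safe C#. This is only an illustration, not the library's implementation: the class `BlockedMatMulSketch`, the `block` parameter, and the row-major layout (`a` is m×k, `b` is k×n, `c` is m×n) are all assumptions, since the page does not state the storage convention.

```csharp
using System;

// Hypothetical illustration; not part of DotCompute.
public static class BlockedMatMulSketch
{
    // Assumes row-major storage: a is m x k, b is k x n, c is m x n.
    public static void Multiply(float[] a, float[] b, float[] c,
                                int m, int n, int k, int block = 32)
    {
        Array.Clear(c, 0, m * n);

        // Walk the matrices tile by tile so each tile stays cache-resident.
        for (int i0 = 0; i0 < m; i0 += block)
        for (int p0 = 0; p0 < k; p0 += block)
        for (int j0 = 0; j0 < n; j0 += block)
        {
            int iMax = Math.Min(i0 + block, m);
            int pMax = Math.Min(p0 + block, k);
            int jMax = Math.Min(j0 + block, n);

            for (int i = i0; i < iMax; i++)
            for (int p = p0; p < pMax; p++)
            {
                float aip = a[i * k + p];
                // Fused multiply-add: c += a * b with a single rounding step.
                for (int j = j0; j < jMax; j++)
                    c[i * n + j] = MathF.FusedMultiplyAdd(aip, b[p * n + j], c[i * n + j]);
            }
        }
    }
}
```

The innermost loop runs over contiguous elements of `b` and `c`, which is the access pattern a SIMD kernel would vectorize.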
VectorAddInt16(short*, short*, short*, long)
Vectorized 16-bit integer addition (common in image processing).
public static void VectorAddInt16(short* a, short* b, short* result, long elementCount)
Parameters
VectorAddInt32(int*, int*, int*, long)
Vectorized 32-bit integer addition with full SIMD support.
public static void VectorAddInt32(int* a, int* b, int* result, long elementCount)
Parameters
VectorAdvancedNeonFloat32(float*, float*, float*, float*, long, NeonOperation)
Comprehensive ARM NEON floating-point operations with full instruction coverage.
public static void VectorAdvancedNeonFloat32(float* a, float* b, float* c, float* result, long elementCount, NeonOperation operation)
Parameters
VectorConditionalSelect(float*, float*, float*, float*, long, float)
Conditional selection: result[i] = condition[i] ? a[i] : b[i]. Uses SIMD masking to avoid branch divergence.
public static void VectorConditionalSelect(float* condition, float* a, float* b, float* result, long count, float threshold)
Parameters
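A minimal sketch of mask-based selection using the portable `System.Numerics.Vector<T>` API. The page does not say how `threshold` is applied, so this sketch assumes the mask is `condition[i] > threshold`; that rule and the class name `ConditionalSelectSketch` are assumptions, not the documented behavior.

```csharp
using System.Numerics;

// Hypothetical illustration; not part of DotCompute.
public static class ConditionalSelectSketch
{
    // ASSUMPTION: an element of a is chosen where condition[i] > threshold.
    public static void Select(float[] condition, float[] a, float[] b,
                              float[] result, float threshold)
    {
        int width = Vector<float>.Count;
        var limit = new Vector<float>(threshold);
        int i = 0;

        // Vector body: build a per-lane mask, then blend without branching.
        for (; i <= result.Length - width; i += width)
        {
            Vector<int> mask = Vector.GreaterThan(new Vector<float>(condition, i), limit);
            Vector<float> picked = Vector.ConditionalSelect(
                mask, new Vector<float>(a, i), new Vector<float>(b, i));
            picked.CopyTo(result, i);
        }

        // Scalar tail for the elements that do not fill a whole vector.
        for (; i < result.Length; i++)
            result[i] = condition[i] > threshold ? a[i] : b[i];
    }
}
```

Both inputs are loaded for every lane and the mask decides which one is kept, so the loop body contains no data-dependent branch.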
VectorFmaFloat32(float*, float*, float*, float*, long)
Vectorized FMA operation: result = a * b + c using hardware FMA instructions. Essential for scientific computing with optimal precision and performance.
public static void VectorFmaFloat32(float* a, float* b, float* c, float* result, long elementCount)
Parameters
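As a scalar reference for what each lane computes, `MathF.FusedMultiplyAdd` evaluates `a * b + c` with a single rounding step. The class `FmaReferenceSketch` below is a hypothetical helper for comparison and testing, not the library's vectorized code.

```csharp
using System;

// Hypothetical illustration; not part of DotCompute.
public static class FmaReferenceSketch
{
    // Element-wise result[i] = a[i] * b[i] + c[i], rounded once per element.
    public static void Fma(ReadOnlySpan<float> a, ReadOnlySpan<float> b,
                           ReadOnlySpan<float> c, Span<float> result)
    {
        for (int i = 0; i < result.Length; i++)
            result[i] = MathF.FusedMultiplyAdd(a[i], b[i], c[i]);
    }
}
```

A separate multiply followed by an add rounds twice, so a reference like this can differ in the last bit from non-fused code.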
VectorFmaFloat64(double*, double*, double*, double*, long)
Double-precision FMA operation: result = a * b + c.
public static void VectorFmaFloat64(double* a, double* b, double* c, double* result, long elementCount)
Parameters
VectorGatherFloat32(float*, int*, float*, int)
Gather operation: loads elements from memory using indices. Critical for sparse data and indirect memory access patterns.
public static void VectorGatherFloat32(float* basePtr, int* indices, float* result, int count)
Parameters
Remarks
Adoption site #1 for the .NET 10 SIMD surface. When AVX2 is available, this
method uses Avx2.GatherVector256 to perform a true hardware gather of 8 floats
in one instruction; otherwise it falls back to a scalar loop. On AVX-512 hosts
it issues two 256-bit gathers back-to-back to cover 16 elements per iteration.
.NET 10 SDK 10.0.106 does not expose Avx512F.GatherVector512, so stitching two
AVX2 gathers is the best available option without dropping to P/Invoke.
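The AVX2 path with a scalar fallback can be sketched roughly as follows. The class `GatherSketch` and its span-based signature are assumptions made for illustration; the code needs `AllowUnsafeBlocks` enabled and does not reproduce the two-gather AVX-512 variant.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical illustration; not part of DotCompute.
public static class GatherSketch
{
    // result[i] = source[indices[i]]; indices are assumed to be in range.
    public static unsafe void Gather(ReadOnlySpan<float> source,
                                     ReadOnlySpan<int> indices,
                                     Span<float> result)
    {
        int count = result.Length;
        fixed (float* src = source)
        fixed (int* idx = indices)
        fixed (float* dst = result)
        {
            int i = 0;
            if (Avx2.IsSupported)
            {
                // Hardware gather: 8 indexed loads per instruction.
                for (; i <= count - 8; i += 8)
                {
                    Vector256<int> lanes = Avx.LoadVector256(idx + i);
                    // Scale 4 = sizeof(float), so indices count elements, not bytes.
                    Vector256<float> values = Avx2.GatherVector256(src, lanes, 4);
                    Avx.Store(dst + i, values);
                }
            }
            // Scalar fallback, also used for the tail.
            for (; i < count; i++)
                dst[i] = src[idx[i]];
        }
    }
}
```

The scalar loop runs for any leftover elements and for the whole input on hardware without AVX2, so the result is the same on either path.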
VectorHorizontalSum(float*, long)
Optimized horizontal sum reduction with SIMD.
public static float VectorHorizontalSum(float* data, long count)
Parameters
Returns
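One common way to implement a SIMD horizontal sum is to accumulate per-lane partial sums and reduce them once at the end. The sketch below uses the portable `Vector<T>` API (`Vector.Sum` requires .NET 6 or later); the class `HorizontalSumSketch` is hypothetical and may not match the library's actual reduction strategy.

```csharp
using System.Numerics;

// Hypothetical illustration; not part of DotCompute.
public static class HorizontalSumSketch
{
    public static float Sum(float[] data)
    {
        int width = Vector<float>.Count;
        Vector<float> acc = Vector<float>.Zero;
        int i = 0;

        // Each lane accumulates its own partial sum.
        for (; i <= data.Length - width; i += width)
            acc += new Vector<float>(data, i);

        // Reduce the lanes to one scalar, then add the scalar tail.
        float total = Vector.Sum(acc);
        for (; i < data.Length; i++)
            total += data[i];
        return total;
    }
}
```

Because the additions happen in a different order than a sequential loop, the result can differ from a scalar sum by floating-point rounding.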
VectorMultiplyInt64(long*, long*, long*, long)
Vectorized 64-bit integer multiplication.
public static void VectorMultiplyInt64(long* a, long* b, long* result, long elementCount)
Parameters
VectorScatterFloat32(float*, int*, float*, int)
Scatter operation: stores elements to memory using indices.
public static void VectorScatterFloat32(float* values, int* indices, float* basePtr, int count)
Parameters
Remarks
Adoption site #2 for the .NET 10 SIMD surface. .NET 10 SDK 10.0.106 does not
expose Avx512F.Scatter in the x86 intrinsics surface, so this method keeps the
scalar inner loop. The previous code issued a redundant AVX-512 load of the
values and indices, which the scalar loop then re-read from memory. Removing
those dead loads cuts register pressure and lets the inner loop vectorize via
loop-invariant code motion (LICM) and the standard reuse of scalar stores. If a
future .NET SDK exposes scatter intrinsics, this is the single point to revisit.
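The scalar scatter semantics amount to the loop below. `ScatterSketch` and its span-based signature are hypothetical, written only to make the indexing explicit.

```csharp
using System;

// Hypothetical illustration; not part of DotCompute.
public static class ScatterSketch
{
    // destination[indices[i]] = values[i]; indices are assumed to be in range.
    public static void Scatter(ReadOnlySpan<float> values,
                               ReadOnlySpan<int> indices,
                               Span<float> destination)
    {
        for (int i = 0; i < values.Length; i++)
            destination[indices[i]] = values[i];
    }
}
```

In this sketch, when two entries of `indices` are equal the later write wins, and destination slots that no index refers to keep their previous contents.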