
Class AdvancedSimdKernels

Namespace
DotCompute.Backends.CPU.Kernels
Assembly
DotCompute.Backends.CPU.dll

Advanced SIMD kernel implementations with complete FMA support, integer SIMD, enhanced ARM NEON, and modern vectorization techniques.

public static class AdvancedSimdKernels
Inheritance
object → AdvancedSimdKernels

Methods

OptimizedMatrixMultiplyFloat32(float*, float*, float*, int, int, int)

Cache-friendly blocked matrix multiplication with FMA optimization. Essential for linear algebra workloads.

public static void OptimizedMatrixMultiplyFloat32(float* a, float* b, float* c, int m, int n, int k)

Parameters

a float*
b float*
c float*
m int
n int
k int
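
Example

A minimal usage sketch, not taken from the library's own samples. It assumes the common convention that a is a row-major m×k matrix, b is k×n, and c receives the m×n product, and that the project enables unsafe blocks (<AllowUnsafeBlocks>true</AllowUnsafeBlocks>).

using DotCompute.Backends.CPU.Kernels;

const int m = 4, n = 4, k = 4;
float[] a = new float[m * k];   // left matrix, assumed row-major m×k
float[] b = new float[k * n];   // right matrix, assumed row-major k×n
float[] c = new float[m * n];   // output matrix, m×n
for (int i = 0; i < a.Length; i++) a[i] = i;
for (int i = 0; i < b.Length; i++) b[i] = 1f;

unsafe
{
    fixed (float* pa = a, pb = b, pc = c)
    {
        // c receives the blocked, FMA-accelerated product of a and b.
        AdvancedSimdKernels.OptimizedMatrixMultiplyFloat32(pa, pb, pc, m, n, k);
    }
}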

VectorAddInt16(short*, short*, short*, long)

Vectorized 16-bit integer operations (common in image processing).

public static void VectorAddInt16(short* a, short* b, short* result, long elementCount)

Parameters

a short*
b short*
result short*
elementCount long
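
Example

A minimal usage sketch (unsafe blocks must be enabled in the project); the element values are illustrative only.

using DotCompute.Backends.CPU.Kernels;

short[] a = { 1, 2, 3, 4, 5, 6, 7, 8 };
short[] b = { 10, 20, 30, 40, 50, 60, 70, 80 };
short[] result = new short[a.Length];

unsafe
{
    fixed (short* pa = a, pb = b, pr = result)
    {
        // Element-wise 16-bit addition: result[i] = (short)(a[i] + b[i]).
        AdvancedSimdKernels.VectorAddInt16(pa, pb, pr, a.Length);
    }
}
// result now holds 11, 22, 33, ...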

VectorAddInt32(int*, int*, int*, long)

Vectorized 32-bit integer addition with full SIMD support.

public static void VectorAddInt32(int* a, int* b, int* result, long elementCount)

Parameters

a int*
b int*
result int*
elementCount long
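
Example

The same usage pattern as the 16-bit variant, sketched here for 32-bit integers; values are illustrative.

using DotCompute.Backends.CPU.Kernels;

int[] a = { 1, 2, 3, 4 };
int[] b = { 100, 200, 300, 400 };
int[] result = new int[a.Length];

unsafe
{
    fixed (int* pa = a, pb = b, pr = result)
    {
        // Element-wise 32-bit addition: result[i] = a[i] + b[i].
        AdvancedSimdKernels.VectorAddInt32(pa, pb, pr, a.Length);
    }
}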

VectorAdvancedNeonFloat32(float*, float*, float*, float*, long, NeonOperation)

Comprehensive ARM NEON floating-point operations with full instruction coverage.

public static void VectorAdvancedNeonFloat32(float* a, float* b, float* c, float* result, long elementCount, NeonOperation operation)

Parameters

a float*
b float*
c float*
result float*
elementCount long
operation NeonOperation
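
Example

A minimal sketch for the ARM NEON path. The NeonOperation member shown (NeonOperation.MultiplyAdd) is a placeholder name, not confirmed by this page; consult the NeonOperation enum for its actual members. Unsafe blocks must be enabled.

using DotCompute.Backends.CPU.Kernels;

float[] a = { 1f, 2f, 3f, 4f };
float[] b = { 5f, 6f, 7f, 8f };
float[] c = { 0.5f, 0.5f, 0.5f, 0.5f };
float[] result = new float[a.Length];

unsafe
{
    fixed (float* pa = a, pb = b, pc = c, pr = result)
    {
        // NeonOperation.MultiplyAdd is assumed here for illustration only.
        AdvancedSimdKernels.VectorAdvancedNeonFloat32(pa, pb, pc, pr, a.Length, NeonOperation.MultiplyAdd);
    }
}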

VectorConditionalSelect(float*, float*, float*, float*, long, float)

Conditional selection: result[i] = condition[i] ? a[i] : b[i]. Uses SIMD masking to avoid branch divergence.

public static void VectorConditionalSelect(float* condition, float* a, float* b, float* result, long count, float threshold)

Parameters

condition float*
a float*
b float*
result float*
count long
threshold float
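
Example

A minimal sketch. It assumes (based on the threshold parameter, not stated on this page) that each condition[i] is compared against threshold, selecting a[i] where the test passes and b[i] otherwise.

using DotCompute.Backends.CPU.Kernels;

float[] condition = { 0.1f, 0.9f, 0.4f, 0.7f };
float[] a = { 1f, 2f, 3f, 4f };       // taken where the condition passes
float[] b = { -1f, -2f, -3f, -4f };   // taken otherwise
float[] result = new float[a.Length];
const float threshold = 0.5f;         // assumed: condition[i] is compared against this value

unsafe
{
    fixed (float* pCond = condition, pa = a, pb = b, pr = result)
    {
        AdvancedSimdKernels.VectorConditionalSelect(pCond, pa, pb, pr, condition.Length, threshold);
    }
}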

VectorFmaFloat32(float*, float*, float*, float*, long)

Vectorized FMA operation (result = a * b + c) using hardware FMA instructions. Essential for scientific computing, where it provides optimal precision and performance.

public static void VectorFmaFloat32(float* a, float* b, float* c, float* result, long elementCount)

Parameters

a float*
b float*
c float*
result float*
elementCount long
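
Example

A minimal usage sketch (unsafe blocks must be enabled); array contents are illustrative.

using DotCompute.Backends.CPU.Kernels;

float[] a = { 1f, 2f, 3f, 4f };
float[] b = { 2f, 2f, 2f, 2f };
float[] c = { 0.5f, 0.5f, 0.5f, 0.5f };
float[] result = new float[a.Length];

unsafe
{
    fixed (float* pa = a, pb = b, pc = c, pr = result)
    {
        // result[i] = a[i] * b[i] + c[i], computed as a fused multiply-add per element.
        AdvancedSimdKernels.VectorFmaFloat32(pa, pb, pc, pr, a.Length);
    }
}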

VectorFmaFloat64(double*, double*, double*, double*, long)

Double precision FMA operation.

public static void VectorFmaFloat64(double* a, double* b, double* c, double* result, long elementCount)

Parameters

a double*
b double*
c double*
result double*
elementCount long
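
Example

The double-precision variant follows the same pattern; a minimal sketch with illustrative values.

using DotCompute.Backends.CPU.Kernels;

double[] a = { 1.0, 2.0, 3.0, 4.0 };
double[] b = { 2.0, 2.0, 2.0, 2.0 };
double[] c = { 0.5, 0.5, 0.5, 0.5 };
double[] result = new double[a.Length];

unsafe
{
    fixed (double* pa = a, pb = b, pc = c, pr = result)
    {
        AdvancedSimdKernels.VectorFmaFloat64(pa, pb, pc, pr, a.Length);
    }
}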

VectorGatherFloat32(float*, int*, float*, int)

Gather operation: loads elements from memory using indices. Critical for sparse data and indirect memory access patterns.

public static void VectorGatherFloat32(float* basePtr, int* indices, float* result, int count)

Parameters

basePtr float*
indices int*
result float*
count int
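
Example

A minimal sketch; the gather semantics shown in the comment (result[i] = basePtr[indices[i]]) are assumed from the summary, not spelled out on this page.

using DotCompute.Backends.CPU.Kernels;

float[] source = { 10f, 20f, 30f, 40f, 50f, 60f };
int[] indices = { 5, 0, 3, 1 };           // which elements of source to pull
float[] result = new float[indices.Length];

unsafe
{
    fixed (float* pSrc = source)
    fixed (int* pIdx = indices)
    fixed (float* pRes = result)
    {
        // Assumed semantics: result[i] = source[indices[i]].
        AdvancedSimdKernels.VectorGatherFloat32(pSrc, pIdx, pRes, indices.Length);
    }
}
// result is expected to contain 60, 10, 40, 20.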

VectorHorizontalSum(float*, long)

Optimized horizontal sum reduction with SIMD.

public static float VectorHorizontalSum(float* data, long count)

Parameters

data float*
count long

Returns

float
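
Example

A minimal sketch; it assumes the return value is the sum of the first count elements of data.

using System;
using DotCompute.Backends.CPU.Kernels;

float[] data = { 1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f };
float sum;

unsafe
{
    fixed (float* pData = data)
    {
        sum = AdvancedSimdKernels.VectorHorizontalSum(pData, data.Length);
    }
}
Console.WriteLine(sum);   // expected: 36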

VectorMultiplyInt64(long*, long*, long*, long)

Vectorized 64-bit integer multiplication.

public static void VectorMultiplyInt64(long* a, long* b, long* result, long elementCount)

Parameters

a long*
b long*
result long*
elementCount long
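
Example

A minimal sketch with illustrative values; unsafe blocks must be enabled.

using DotCompute.Backends.CPU.Kernels;

long[] a = { 2L, 3L, 4L, 5L };
long[] b = { 10L, 10L, 10L, 10L };
long[] result = new long[a.Length];

unsafe
{
    fixed (long* pa = a, pb = b, pr = result)
    {
        // Element-wise 64-bit multiplication: result[i] = a[i] * b[i].
        AdvancedSimdKernels.VectorMultiplyInt64(pa, pb, pr, a.Length);
    }
}
// result: 20, 30, 40, 50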

VectorScatterFloat32(float*, int*, float*, int)

Scatter operation: stores elements to memory using indices.

public static void VectorScatterFloat32(float* values, int* indices, float* basePtr, int count)

Parameters

values float*
indices int*
basePtr float*
count int
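
Example

A minimal sketch; the scatter semantics shown in the comment (basePtr[indices[i]] = values[i]) are assumed from the summary, not spelled out on this page.

using DotCompute.Backends.CPU.Kernels;

float[] values = { 1.5f, 2.5f, 3.5f };
int[] indices = { 4, 0, 2 };               // destination slots for each value
float[] destination = new float[6];

unsafe
{
    fixed (float* pVal = values)
    fixed (int* pIdx = indices)
    fixed (float* pDst = destination)
    {
        // Assumed semantics: destination[indices[i]] = values[i].
        AdvancedSimdKernels.VectorScatterFloat32(pVal, pIdx, pDst, indices.Length);
    }
}
// destination is expected to be { 2.5, 0, 3.5, 0, 1.5, 0 }.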