Enum OptimizationType

Namespace: DotCompute.Backends.CPU
Assembly: DotCompute.Backends.CPU.dll

CPU kernel optimization categories.

public enum OptimizationType

Fields

Cache = 3

Cache hierarchy optimization.

Optimizes data structures and algorithms to maximize L1/L2/L3 cache utilization. Critical for workloads where data exceeds L1 cache size but fits in L3.

Typical Speedup: 3-10x when moving from DRAM to L1 cache

Techniques: Cache blocking/tiling, temporal locality exploitation

Cache Sizes:

  • L1: 32-64 KB per core (~4-5 cycles latency)
  • L2: 256 KB - 1 MB per core (10-20 cycles)
  • L3: 8-128 MB shared (40-75 cycles)

Best For: Working sets 32 KB - 32 MB, temporal data reuse
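
A minimal sketch of cache blocking (loop tiling) on a row-major n x n matrix multiply; the MultiplyTiled helper below is hypothetical and not part of the DotCompute API.

using System;

static class TilingExample
{
    // Multiplies two n x n row-major matrices in tile x tile blocks so each
    // block of A, B, and C stays resident in L1/L2 while it is reused,
    // instead of being re-fetched from DRAM on every pass.
    // The output matrix c is assumed to be zero-initialized.
    public static void MultiplyTiled(float[] a, float[] b, float[] c, int n, int tile = 64)
    {
        for (int ii = 0; ii < n; ii += tile)
        for (int kk = 0; kk < n; kk += tile)
        for (int jj = 0; jj < n; jj += tile)
        {
            int iMax = Math.Min(ii + tile, n);
            int kMax = Math.Min(kk + tile, n);
            int jMax = Math.Min(jj + tile, n);

            for (int i = ii; i < iMax; i++)
            for (int k = kk; k < kMax; k++)
            {
                float aik = a[i * n + k];
                for (int j = jj; j < jMax; j++)
                    c[i * n + j] += aik * b[k * n + j];   // accumulate into the C tile
            }
        }
    }
}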

Memory = 2

Memory access pattern optimization.

Optimizes memory access patterns to improve cache hit rates and reduce memory bandwidth pressure. Includes loop tiling, data layout transformations, and prefetching strategies.

Typical Speedup: 2-4x for memory-bound operations

Techniques: Structure-of-Arrays (SoA) layout, loop blocking, prefetching

Metrics: Cache miss rate, memory bandwidth utilization

Best For: Strided access, matrix operations, pointer chasing
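
A minimal sketch of the Structure-of-Arrays layout transformation mentioned above; ParticleAoS and ParticlesSoA are hypothetical illustration-only types, not DotCompute APIs.

// AoS: each particle's six fields are interleaved in memory, so a kernel
// that only reads X and Vx still drags Y, Z, Vy, Vz through the cache.
struct ParticleAoS { public float X, Y, Z, Vx, Vy, Vz; }

// SoA: each field is a contiguous array, so the same kernel streams exactly
// the bytes it needs with unit stride and vectorizes cleanly.
sealed class ParticlesSoA
{
    public readonly float[] X, Vx;   // only the fields the kernel touches are shown

    public ParticlesSoA(int count) { X = new float[count]; Vx = new float[count]; }

    public void Integrate(float dt)
    {
        for (int i = 0; i < X.Length; i++)
            X[i] += Vx[i] * dt;      // sequential, cache-friendly access
    }
}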

Parallelization = 1

Multi-threading and task parallelization.

Distributes workload across multiple CPU cores using thread pools or parallel loops. Scales with core count but has synchronization overhead.

Typical Speedup: Near-linear up to physical core count

Overhead: Thread creation, context switching, synchronization

Sweet Spot: Work items > 10,000, minimal inter-thread communication

Best For: Large datasets, independent computations, embarrassingly parallel problems
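
A minimal sketch of distributing an embarrassingly parallel kernel across cores with Parallel.ForEach and a range partitioner; SaxpyParallel is a hypothetical helper, not a DotCompute API.

using System.Collections.Concurrent;
using System.Threading.Tasks;

static class ParallelExample
{
    // y = a * x + y, split into large contiguous ranges so scheduling and
    // synchronization overhead is amortized over many elements per task.
    public static void SaxpyParallel(float a, float[] x, float[] y)
    {
        Parallel.ForEach(Partitioner.Create(0, x.Length), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
                y[i] = a * x[i] + y[i];
        });
    }
}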

Threading = 4

Threading strategy and NUMA optimization.

Optimizes thread affinity, NUMA node allocation, and synchronization primitives. Critical for multi-socket systems and large core counts.

Typical Speedup: 1.5-3x on NUMA systems when properly tuned

Considerations:

  • Thread affinity to prevent thread migration
  • NUMA-aware memory allocation (local vs. remote)
  • False sharing prevention (64-byte cache line alignment)
  • Lock-free algorithms where possible

Best For: Multi-socket servers, workloads > core count, shared data structures
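
A minimal sketch of one of these considerations, false-sharing prevention: per-worker accumulators padded to the 64-byte cache line so neighbouring writes never contend. PaddedCounter and CountMatches are hypothetical illustration-only names, not DotCompute APIs.

using System;
using System.Runtime.InteropServices;
using System.Threading.Tasks;

static class FalseSharingExample
{
    // Each counter occupies its own 64-byte cache line, so workers writing
    // to adjacent array slots do not invalidate each other's lines.
    [StructLayout(LayoutKind.Explicit, Size = 64)]
    private struct PaddedCounter
    {
        [FieldOffset(0)] public long Value;
    }

    public static long CountMatches(int[] data, int target)
    {
        int workers = Environment.ProcessorCount;
        int chunk = (data.Length + workers - 1) / workers;
        var counters = new PaddedCounter[workers];

        Parallel.For(0, workers, w =>
        {
            long local = 0;                              // accumulate privately first
            int end = Math.Min((w + 1) * chunk, data.Length);
            for (int i = w * chunk; i < end; i++)
                if (data[i] == target) local++;
            counters[w].Value = local;                   // one write per worker
        });

        long total = 0;
        foreach (var c in counters) total += c.Value;
        return total;
    }
}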

Vectorization = 0

SIMD vectorization optimization.

Utilizes CPU vector instructions (SSE, AVX2, AVX-512, NEON) to process multiple data elements simultaneously. Most effective for data-parallel operations like element-wise arithmetic, reductions, and transformations.

Typical Speedup: 2-8x depending on vector width and data types

Requirements: Aligned memory access, no loop-carried data dependencies

Best For: Float/int arrays, matrix operations, image processing

Instruction Sets:

  • SSE: 4 floats, AVX/AVX2: 8 floats per instruction (x86-64)
  • AVX-512: 16 floats per instruction (Xeon, high-end Core)
  • NEON: 4 floats per instruction (ARM64)
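
A minimal sketch of portable SIMD with System.Numerics.Vector<T>, which the JIT lowers to SSE/AVX or NEON at run time; the SimdExample.Add helper is hypothetical, not a DotCompute API.

using System.Numerics;

static class SimdExample
{
    // Element-wise add: one SIMD operation handles Vector<float>.Count
    // lanes (4 with SSE/NEON, 8 with AVX2) per iteration.
    public static void Add(float[] a, float[] b, float[] result)
    {
        int width = Vector<float>.Count;
        int i = 0;

        for (; i <= a.Length - width; i += width)
        {
            var va = new Vector<float>(a, i);
            var vb = new Vector<float>(b, i);
            (va + vb).CopyTo(result, i);
        }

        for (; i < a.Length; i++)       // scalar tail for leftover elements
            result[i] = a[i] + b[i];
    }
}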

Remarks

Defines optimization strategies for CPU kernel execution. Used by the CPU kernel optimizer to analyze kernels and generate performance improvement recommendations.

Each optimization type targets a specific performance bottleneck and delivers measurable speedups when applied correctly.
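
A minimal sketch of consuming the enum when reporting recommendations; only OptimizationType itself comes from DotCompute, the Describe helper is hypothetical.

using DotCompute.Backends.CPU;

static class RecommendationExample
{
    // Maps each optimization category to a short, human-readable suggestion.
    public static string Describe(OptimizationType type) => type switch
    {
        OptimizationType.Vectorization   => "Use SIMD instructions (SSE/AVX2/AVX-512/NEON).",
        OptimizationType.Parallelization => "Distribute work across CPU cores.",
        OptimizationType.Memory          => "Restructure memory access (SoA layout, blocking, prefetch).",
        OptimizationType.Cache           => "Tile data to fit the L1/L2/L3 hierarchy.",
        OptimizationType.Threading       => "Tune thread affinity, NUMA placement, and padding.",
        _ => "Unknown optimization category."
    };
}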