Enum OptimizationType
Namespace: DotCompute.Backends.CPU
Assembly: DotCompute.Backends.CPU.dll
CPU kernel optimization categories.
public enum OptimizationType
Fields
Cache = 3
Cache hierarchy optimization.
Optimizes data structures and algorithms to maximize L1/L2/L3 cache utilization. Critical for workloads where data exceeds L1 cache size but fits in L3.
Typical Speedup: 3-10x when moving from DRAM to L1 cache
Techniques: Cache blocking/tiling, temporal locality exploitation
Cache Sizes:
- L1: 32-64 KB per core (4-5 cycles latency)
- L2: 256 KB - 1 MB per core (10-20 cycles)
- L3: 8-128 MB shared (40-75 cycles)
Best For: Working sets 32KB-32MB, temporal data reuse
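A minimal sketch of the cache blocking/tiling technique, applied here to a matrix transpose; the 64-element tile size is an assumption chosen so one float tile (64 × 64 × 4 bytes = 16 KB) fits in a 32 KB L1 data cache:

```csharp
using System;

// Tiled transpose: walks the matrix in Tile x Tile blocks so the rows
// being read and the columns being written stay resident in L1, instead
// of evicting each other on every row as a naive transpose would.
static void TransposeTiled(float[] src, float[] dst, int n)
{
    const int Tile = 64; // 16 KB per float tile; assumes a 32 KB L1 data cache
    for (int ii = 0; ii < n; ii += Tile)
        for (int jj = 0; jj < n; jj += Tile)
            for (int i = ii; i < Math.Min(ii + Tile, n); i++)
                for (int j = jj; j < Math.Min(jj + Tile, n); j++)
                    dst[j * n + i] = src[i * n + j];
}
```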
Memory = 2
Memory access pattern optimization.
Optimizes memory access patterns to improve cache hit rates and reduce memory bandwidth pressure. Includes loop tiling, data layout transformations, and prefetching strategies.
Typical Speedup: 2-4x for memory-bound operations
Techniques: Structure-of-Arrays (SoA) layout, loop blocking, prefetching
Metrics: Cache miss rate, memory bandwidth utilization
Best For: Strided access, matrix operations, pointer chasing
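A minimal sketch of the Structure-of-Arrays technique listed above; the Particle types are hypothetical, not part of DotCompute:

```csharp
// AoS: X, Y, Z interleave in memory, so a loop that reads only X still
// pulls Y and Z through the cache, wasting two thirds of each cache line.
struct ParticleAoS { public float X, Y, Z; }

// SoA: each component is contiguous, so a loop over X touches only the
// cache lines it needs; the unit-stride access also vectorizes cleanly.
sealed class ParticlesSoA
{
    public readonly float[] X, Y, Z;
    public ParticlesSoA(int n) { X = new float[n]; Y = new float[n]; Z = new float[n]; }
}
```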
Parallelization = 1
Multi-threading and task parallelization.
Distributes workload across multiple CPU cores using thread pools or parallel loops. Scales with core count but has synchronization overhead.
Typical Speedup: Near-linear up to physical core count
Overhead: Thread creation, context switching, synchronization
Sweet Spot: Work items > 10,000, minimal inter-thread communication
Best For: Large datasets, independent computations, embarrassingly parallel problems
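A minimal sketch of this strategy using Parallel.For; the element-wise scale kernel is illustrative:

```csharp
using System.Threading.Tasks;

// Element-wise scale distributed across the thread pool. Every iteration
// is independent, so no synchronization is needed beyond the implicit
// join; the runtime partitions the index range across available cores.
static void ScaleParallel(float[] data, float scale)
{
    Parallel.For(0, data.Length, i => data[i] *= scale);
}
```

For very small loop bodies, per-iteration delegate overhead dominates; Partitioner.Create from System.Collections.Concurrent can hand each worker a contiguous range instead.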
Threading = 4
Threading strategy and NUMA optimization.
Optimizes thread affinity, NUMA node allocation, and synchronization primitives. Critical for multi-socket systems and large core counts.
Typical Speedup: 1.5-3x on NUMA systems when properly tuned
Considerations:
- Thread affinity to prevent thread migration
- NUMA-aware memory allocation (local vs. remote)
- False sharing prevention (64-byte cache line alignment)
- Lock-free algorithms where possible
Best For: Multi-socket servers, workloads > core count, shared data structures
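A minimal sketch of the false-sharing point above: per-thread counters padded to their own 64-byte cache lines (the PaddedCounter name is illustrative):

```csharp
using System.Runtime.InteropServices;

// Each counter occupies a full 64-byte cache line, so increments from
// different cores never invalidate one another's lines.
[StructLayout(LayoutKind.Explicit, Size = 64)]
struct PaddedCounter
{
    [FieldOffset(0)] public long Value;
}

// Usage: allocate one PaddedCounter per thread; thread t increments only
// counters[t].Value, and the slots are summed once after the threads join.
```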
Vectorization = 0
SIMD vectorization optimization.
Utilizes CPU vector instructions (SSE, AVX2, AVX-512, NEON) to process multiple data elements simultaneously. Most effective for data-parallel operations like element-wise arithmetic, reductions, and transformations.
Typical Speedup: 2-8x depending on vector width and data types
Requirements: Aligned, contiguous memory access; no loop-carried data dependencies
Best For: Float/int arrays, matrix operations, image processing
Instruction Sets:
- SSE: 4 floats per instruction, AVX2: 8 floats per instruction (x86-64)
- AVX-512: 16 floats per instruction (Xeon, high-end Core)
- NEON: 4 floats per instruction (ARM64)
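A minimal sketch using System.Numerics.Vector&lt;T&gt;, which the JIT lowers to SSE/AVX on x86-64 and NEON on ARM64; the AddArrays name is illustrative:

```csharp
using System.Numerics;

// Element-wise addition processing Vector<float>.Count elements per
// instruction (e.g. 8 floats with AVX2), with a scalar tail loop.
static void AddArrays(float[] a, float[] b, float[] dst)
{
    int width = Vector<float>.Count;
    int i = 0;
    for (; i <= a.Length - width; i += width)
    {
        var va = new Vector<float>(a, i);
        var vb = new Vector<float>(b, i);
        (va + vb).CopyTo(dst, i);
    }
    for (; i < a.Length; i++) // remaining elements that don't fill a vector
        dst[i] = a[i] + b[i];
}
```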
Remarks
Defines optimization strategies for CPU kernel execution. Used by the CPU kernel optimizer to analyze kernels and generate performance improvement recommendations.
Each optimization type targets a specific performance bottleneck and provides a measurable speedup when applied correctly.
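For illustration only, a hypothetical switch over the enum when rendering a recommendation; the Describe helper is not part of the DotCompute API:

```csharp
static string Describe(OptimizationType type) => type switch
{
    OptimizationType.Vectorization   => "Use SIMD instructions (typ. 2-8x).",
    OptimizationType.Parallelization => "Distribute work across cores.",
    OptimizationType.Memory          => "Improve memory access patterns (typ. 2-4x).",
    OptimizationType.Cache           => "Apply cache blocking/tiling (typ. 3-10x).",
    OptimizationType.Threading       => "Tune thread affinity and NUMA placement.",
    _ => "Unknown optimization type."
};
```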