Statistical ML

Crate: rustkernel-ml · Kernels: 17 · Feature: ml (included in default features)

Machine learning kernels for clustering, anomaly detection, NLP, federated learning, and healthcare analytics.

Kernel Overview

Clustering (3)

| Kernel | ID | Modes | Description |
|--------|----|-------|-------------|
| KMeans | ml/kmeans | Batch, Ring | K-means++ clustering |
| DBSCAN | ml/dbscan | Batch | Density-based clustering |
| HierarchicalClustering | ml/hierarchical-clustering | Batch | Agglomerative clustering |

Anomaly Detection (3)

| Kernel | ID | Modes | Description |
|--------|----|-------|-------------|
| IsolationForest | ml/isolation-forest | Batch | Tree-based anomaly detection |
| LocalOutlierFactor | ml/local-outlier-factor | Batch, Ring | Density-based outlier detection |
| EnsembleVoting | ml/ensemble-voting | Batch, Ring | Combine multiple detectors |

Regression (2)

| Kernel | ID | Modes | Description |
|--------|----|-------|-------------|
| LinearRegression | ml/linear-regression | Batch, Ring | Ordinary least squares |
| RidgeRegression | ml/ridge-regression | Batch, Ring | L2-regularized regression |

NLP / Embeddings (2)

| Kernel | ID | Modes | Description |
|--------|----|-------|-------------|
| EmbeddingGeneration | ml/embedding-generation | Batch | Generate text embeddings from documents |
| SemanticSimilarity | ml/semantic-similarity | Batch | Compute similarity between document embeddings |

Federated Learning (1)

| Kernel | ID | Modes | Description |
|--------|----|-------|-------------|
| SecureAggregation | ml/secure-aggregation | Batch | Privacy-preserving distributed model training |

Healthcare Analytics (2)

| Kernel | ID | Modes | Description |
|--------|----|-------|-------------|
| DrugInteractionPrediction | ml/drug-interaction | Batch | Predict multi-drug interaction risks |
| ClinicalPathwayConformance | ml/clinical-pathway | Batch | Check treatment guideline compliance |

Streaming ML (2)

| Kernel | ID | Modes | Description |
|--------|----|-------|-------------|
| StreamingIsolationForest | ml/streaming-iforest | Batch, Ring | Online anomaly detection |
| AdaptiveThreshold | ml/adaptive-threshold | Batch, Ring | Self-adjusting anomaly thresholds |

Explainability (2)

| Kernel | ID | Modes | Description |
|--------|----|-------|-------------|
| SHAPValues | ml/shap-values | Batch | GPU-accelerated SHAP explanations |
| FeatureImportance | ml/feature-importance | Batch | Real-time feature attribution |

Kernel Details

KMeans

Partitions data into K clusters using the K-means++ initialization.

ID: ml/kmeans · Modes: Batch, Ring · Throughput: ~500,000 points/sec

Input

pub struct KMeansInput {
    /// Data points as flattened array
    pub points: Vec<f64>,
    /// Number of dimensions per point
    pub dimensions: u32,
    /// Number of clusters
    pub k: u32,
    /// Maximum iterations
    pub max_iterations: u32,
    /// Convergence tolerance
    pub tolerance: f64,
}

Output

pub struct KMeansOutput {
    /// Cluster assignment per point
    pub assignments: Vec<u32>,
    /// Centroids (k * dimensions)
    pub centroids: Vec<f64>,
    /// Iterations performed
    pub iterations: u32,
    /// Inertia (sum of squared distances)
    pub inertia: f64,
}

Example

use rustkernel::ml::clustering::{KMeans, KMeansInput};

let kernel = KMeans::new();

let input = KMeansInput {
    points: vec![
        1.0, 2.0,  // Point 0
        1.5, 1.8,  // Point 1
        5.0, 8.0,  // Point 2
        6.0, 9.0,  // Point 3
    ],
    dimensions: 2,
    k: 2,
    max_iterations: 100,
    tolerance: 1e-4,
};

let result = kernel.execute(input).await?;
println!("Clusters: {:?}", result.assignments);

IsolationForest

Detects anomalies by isolating observations with an ensemble of randomly constructed isolation trees.

ID: ml/isolation-forest · Modes: Batch

Input

pub struct IsolationForestInput {
    pub points: Vec<f64>,
    pub dimensions: u32,
    /// Number of trees
    pub n_estimators: u32,
    /// Subsample size
    pub max_samples: u32,
    /// Contamination ratio (expected anomaly rate)
    pub contamination: f64,
}

Output

pub struct IsolationForestOutput {
    /// Anomaly scores (higher = more anomalous)
    pub scores: Vec<f64>,
    /// Binary labels (-1 = anomaly, 1 = normal)
    pub labels: Vec<i32>,
}
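
Example

A usage sketch: the rustkernel::ml::anomaly module path matches the LocalOutlierFactor example below and the execute/await pattern mirrors KMeans; the field values here are illustrative, not recommended defaults.

use rustkernel::ml::anomaly::{IsolationForest, IsolationForestInput};

let kernel = IsolationForest::new();

let result = kernel.execute(IsolationForestInput {
    points: data_points,   // flattened (n_samples * dimensions)
    dimensions: 4,
    n_estimators: 100,
    max_samples: 256,
    contamination: 0.05,   // expect roughly 5% anomalies
}).await?;

// Indices of points labeled as anomalies (-1)
let anomalies: Vec<usize> = result.labels
    .iter()
    .enumerate()
    .filter(|(_, &label)| label == -1)
    .map(|(i, _)| i)
    .collect();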

LocalOutlierFactor

Measures local density deviation to identify outliers.

ID: ml/local-outlier-factor · Modes: Batch, Ring

Example

use rustkernel::ml::anomaly::{LocalOutlierFactor, LOFInput};

let kernel = LocalOutlierFactor::new();

let result = kernel.execute(LOFInput {
    points: data_points,
    dimensions: 3,
    k_neighbors: 20,
}).await?;

// LOF scores well above 1.0 indicate outliers; here points above 1.5 are flagged
let outliers: Vec<usize> = result.scores
    .iter()
    .enumerate()
    .filter(|(_, &s)| s > 1.5)
    .map(|(i, _)| i)
    .collect();

LinearRegression

Fits a linear model using ordinary least squares.

ID: ml/linear-regression · Modes: Batch, Ring

Input

pub struct LinearRegressionInput {
    /// Feature matrix (n_samples * n_features)
    pub features: Vec<f64>,
    /// Target values (n_samples)
    pub targets: Vec<f64>,
    /// Number of features
    pub n_features: u32,
    /// Whether to fit intercept
    pub fit_intercept: bool,
}

Output

pub struct LinearRegressionOutput {
    /// Coefficients (n_features, or n_features + 1 with intercept)
    pub coefficients: Vec<f64>,
    /// Intercept (if fit_intercept = true)
    pub intercept: f64,
    /// R-squared score
    pub r_squared: f64,
}
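
Example

A usage sketch; the rustkernel::ml::regression module path is an assumption (this page does not show it), and the tiny dataset is only for illustration.

use rustkernel::ml::regression::{LinearRegression, LinearRegressionInput}; // module path assumed

let kernel = LinearRegression::new();

let input = LinearRegressionInput {
    // Three samples with two features each, flattened row-major
    features: vec![
        1.0, 2.0,  // sample 0
        2.0, 3.0,  // sample 1
        3.0, 5.0,  // sample 2
    ],
    targets: vec![3.0, 5.0, 8.0],
    n_features: 2,
    fit_intercept: true,
};

let result = kernel.execute(input).await?;
println!("Coefficients: {:?}, R²: {:.3}", result.coefficients, result.r_squared);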

EmbeddingGeneration

Generates text embeddings from documents using TF-IDF and n-gram features.

ID: ml/embedding-generation · Modes: Batch

Example

use rustkernel::ml::nlp::{EmbeddingGeneration, EmbeddingConfig};

let kernel = EmbeddingGeneration::new();

let config = EmbeddingConfig {
    embedding_dim: 128,
    ngram_range: (1, 2),
    max_features: 10000,
    use_idf: true,
};

let documents = vec!["financial transaction", "bank transfer"];
let embeddings = kernel.generate(&documents, &config)?;

SemanticSimilarity

Computes cosine similarity between document embeddings.

ID: ml/semantic-similarity · Modes: Batch

Example

use rustkernel::ml::nlp::{SemanticSimilarity, SimilarityConfig};

let kernel = SemanticSimilarity::new();

let similar = kernel.find_similar(
    &embeddings,
    query_index,
    &SimilarityConfig { top_k: 10, threshold: 0.5, include_self: false }
)?;
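
The score used here is the cosine similarity described above; a minimal standalone sketch of that computation (an illustrative helper, not the kernel's internal code):

/// Cosine similarity between two equal-length embedding vectors (illustrative helper).
fn cosine_similarity(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 { 0.0 } else { dot / (norm_a * norm_b) }
}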

SecureAggregation

Privacy-preserving federated learning with differential privacy.

ID: ml/secure-aggregation · Modes: Batch

Aggregates model updates from multiple clients while preserving privacy through noise injection and gradient clipping.

Example

use rustkernel::ml::federated::{SecureAggregation, AggregationConfig};

let kernel = SecureAggregation::new();

let config = AggregationConfig {
    num_clients: 10,
    clip_threshold: 1.0,
    noise_multiplier: 0.1,
    secure_mode: true,
};

let global_update = kernel.aggregate(&client_updates, &config)?;
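
For intuition, the clipping and noise injection described above amount to bounding each client update's L2 norm at clip_threshold and adding Gaussian noise to the average. A minimal standalone sketch of that idea (not the kernel's implementation; the noise vector is assumed to be pre-sampled by the caller with standard deviation proportional to clip_threshold * noise_multiplier):

// Illustrative helpers only, not the crate's API.

/// Scale an update down in place if its L2 norm exceeds the clip threshold.
fn clip_update(update: &mut [f64], clip_threshold: f64) {
    let norm: f64 = update.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm > clip_threshold {
        let scale = clip_threshold / norm;
        for x in update.iter_mut() {
            *x *= scale;
        }
    }
}

/// Average the clipped client updates, then add pre-sampled Gaussian noise.
fn aggregate_clipped(updates: &mut [Vec<f64>], noise: &[f64], clip_threshold: f64) -> Vec<f64> {
    let n = updates.len() as f64;
    let mut global = vec![0.0; noise.len()];
    for update in updates.iter_mut() {
        clip_update(update, clip_threshold);
        for (g, u) in global.iter_mut().zip(update.iter()) {
            *g += *u / n;
        }
    }
    for (g, z) in global.iter_mut().zip(noise) {
        *g += *z;
    }
    global
}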

DrugInteractionPrediction

Predicts multi-drug interaction risks using hypergraph neural networks.

ID: ml/drug-interaction · Modes: Batch

Example

use rustkernel::ml::healthcare::{DrugInteractionPrediction, DrugProfile};

let kernel = DrugInteractionPrediction::new();

let drugs = vec![
    DrugProfile { id: "D001", features: moa_features.clone() },
    DrugProfile { id: "D002", features: target_features.clone() },
];

let result = kernel.predict(&drugs)?;
println!("Interaction risk: {:.2}", result.risk_score);

ClinicalPathwayConformance

Checks treatment event sequences against clinical guidelines.

ID: ml/clinical-pathway · Modes: Batch

Example

use rustkernel::ml::healthcare::{ClinicalPathwayConformance, ClinicalPathway};

let kernel = ClinicalPathwayConformance::new();

let pathway = ClinicalPathway {
    name: "Sepsis Protocol".to_string(),
    required_steps: vec!["blood_culture", "antibiotics", "fluids"],
    max_time_hours: 3.0,
};

let result = kernel.check_conformance(&events, &pathway)?;
println!("Conformance: {:.1}%", result.conformance_score * 100.0);

Ring Mode for Streaming ML

Ring mode enables online learning scenarios:

use rustkernel::ml::clustering::KMeansRing;

let ring = KMeansRing::new(5, 3); // k = 5 clusters, 3 dimensions per point

// Stream data points
for point in incoming_stream {
    // Assign to nearest cluster (sub-millisecond)
    let cluster = ring.assign_point(point).await?;

    // Periodically update centroids
    if should_update_centroids() {
        ring.update_centroids().await?;
    }
}

Performance Considerations

  1. Dimensionality: High dimensions slow down distance calculations
  2. Memory: KMeans stores all points; for very large datasets, consider a mini-batch update (see the sketch after this list)
  3. Initialization: K-means++ is more expensive but gives better results
  4. GPU utilization: Ensure batch sizes are large enough to saturate GPU
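
As a sketch of the mini-batch idea from point 2: process one small batch at a time and nudge each centroid toward its assigned points with a decaying per-centroid step size, so the full dataset never has to be resident at once. This is a standalone illustration, not a kernel provided by this crate.

/// One mini-batch k-means update step (illustrative, not part of the crate):
/// assign each batch point to its nearest centroid and move that centroid
/// toward the point with step size 1 / points_seen.
fn mini_batch_step(centroids: &mut [Vec<f64>], counts: &mut [u64], batch: &[Vec<f64>]) {
    for point in batch {
        // Nearest centroid by squared Euclidean distance
        let (best, _) = centroids
            .iter()
            .enumerate()
            .map(|(i, c)| {
                let d: f64 = c.iter().zip(point).map(|(a, b)| (a - b).powi(2)).sum();
                (i, d)
            })
            .min_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
            .unwrap();

        counts[best] += 1;
        let eta = 1.0 / counts[best] as f64;
        for (c, x) in centroids[best].iter_mut().zip(point) {
            *c += eta * (*x - *c);
        }
    }
}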