PyDotCompute

A GPU-native actor model with persistent kernels and message passing.

PyDotCompute is a Python port of DotCompute's Ring Kernel System, bringing the power of GPU-native actors to Python developers. It enables you to create persistent GPU kernels that communicate through high-performance message queues.

Quick Start: Get up and running with PyDotCompute in minutes. → Getting Started
Concepts: Understand the core concepts behind ring kernels and GPU actors. → Learn Concepts
API Reference: Complete API documentation for all modules. → API Docs
Practitioner's Guide: Best practices for building production GPU actors. → Guides

Why PyDotCompute?

Traditional GPU programming launches a kernel, waits for it to complete, and copies the results back to the host. Every invocation pays launch and copy overhead, which adds latency and scales poorly for streaming workloads.

PyDotCompute changes this paradigm:

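import asyncio

from pydotcompute import RingKernelRuntime, ring_kernel, message

@message
class DataPoint:
    value: float

def process(data):
    """Placeholder: replace with your own per-message computation."""
    return data

@ring_kernel(kernel_id="processor")
async def stream_processor(ctx):
    """Persistent actor processing streaming data."""
    while not ctx.should_terminate:
        data = await ctx.receive()
        result = process(data)
        await ctx.send(result)

async def main(data_stream):
    # `async with` is only valid inside a coroutine, so the runtime lives in main()
    async with RingKernelRuntime() as runtime:
        await runtime.launch("processor")
        await runtime.activate("processor")

        # Stream data continuously
        for point in data_stream:
            await runtime.send("processor", point)
            result = await runtime.receive("processor")

# Placeholder input: any iterable of DataPoint messages works here
asyncio.run(main([DataPoint(value=1.0), DataPoint(value=2.0)]))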
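
To make the pattern concrete without a GPU or PyDotCompute installed, the same receive/process/send loop can be sketched with plain `asyncio` queues. This is only an analogy for the actor model, not how PyDotCompute is implemented; the names `actor` and `run_stream`, the doubling step, and the `None` shutdown sentinel are all assumptions made for illustration.

```python
import asyncio


async def actor(inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    """Long-lived loop: stays alive between messages instead of being relaunched."""
    while True:
        item = await inbox.get()
        if item is None:  # hypothetical sentinel used here to request shutdown
            break
        await outbox.put(item * 2)  # stand-in for real processing


async def run_stream(values):
    inbox, outbox = asyncio.Queue(), asyncio.Queue()
    task = asyncio.create_task(actor(inbox, outbox))  # started once, reused for every message
    results = []
    for value in values:
        await inbox.put(value)
        results.append(await outbox.get())
    await inbox.put(None)  # ask the actor to stop, then wait for it to finish
    await task
    return results


results = asyncio.run(run_stream([1.0, 2.0, 3.0]))
print(results)
```

The actor task is created once and handles every message, which is the property the "Persistent" kernel lifetime refers to; only the queues are touched per message.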

Key Benefits

| Feature | Traditional GPU | PyDotCompute |
| --- | --- | --- |
| Kernel Lifetime | Per-invocation | Persistent |
| Communication | Memory copies | Message queues |
| Latency | High (launch overhead) | Low (always running) |
| Programming Model | Imperative | Actor-based |
| State Management | Manual | Automatic |

Performance Highlights

| Metric | Value |
| --- | --- |
| Message latency (p50) | 21μs |
| Message latency (p99) | 131μs |
| GPU graph processing | 1.7M edges/sec |
| Actor throughput | 76K msg/sec |
| Cython queue ops | 0.33μs |

Benchmarked with uvloop on Linux.
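
For readers unfamiliar with the p50 and p99 notation: these are the 50th and 99th percentiles of per-message latency. The sketch below is not the benchmark behind the figures above. It times round trips through plain `asyncio` queues and reports nearest-rank percentiles, purely to show how such numbers are derived. The helper names `percentile`, `echo`, and `measure` are made up for this example, and the results depend entirely on your machine.

```python
import asyncio
import math
import time


def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of the data at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


async def echo(inbox, outbox):
    """Echo every message back until a None sentinel arrives."""
    while True:
        msg = await inbox.get()
        if msg is None:
            return
        await outbox.put(msg)


async def measure(n=1000):
    inbox, outbox = asyncio.Queue(), asyncio.Queue()
    task = asyncio.create_task(echo(inbox, outbox))
    latencies_us = []
    for i in range(n):
        start = time.perf_counter()
        await inbox.put(i)
        await outbox.get()
        latencies_us.append((time.perf_counter() - start) * 1e6)  # seconds to microseconds
    await inbox.put(None)
    await task
    return latencies_us


latencies = asyncio.run(measure())
p50, p99 = percentile(latencies, 50), percentile(latencies, 99)
print(f"p50={p50:.1f}us p99={p99:.1f}us")
```

Reporting p99 alongside p50 matters for streaming workloads because the slowest messages, not the typical ones, determine how much buffering a pipeline needs.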

Features

Ring Kernel System: Persistent GPU kernels with infinite processing loops
High Performance: uvloop auto-installation for 21μs message latency
Performance Tiers: From uvloop (default) to Cython extensions
Message Passing: Type-safe, high-performance message serialization
Unified Memory: Transparent host-device memory with lazy synchronization
Lifecycle Management: Two-phase launch with graceful shutdown
GPU Telemetry: Real-time monitoring and performance metrics
Multi-Backend: CPU simulation, CUDA (NVIDIA), and Metal (Apple Silicon)
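
The two-phase launch mentioned above appears in the examples as `launch` followed by `activate`. One way to picture it is a small state machine that rejects out-of-order calls. The sketch below is an assumption about what such a lifecycle could look like: the `State` names, the `Lifecycle` class, and the transition table are hypothetical and are not taken from PyDotCompute's source.

```python
from enum import Enum, auto


class State(Enum):
    CREATED = auto()
    LAUNCHED = auto()    # phase 1: resources allocated, not yet processing
    ACTIVE = auto()      # phase 2: processing messages
    TERMINATED = auto()


# Allowed transitions (illustrative only, not PyDotCompute's actual state set).
TRANSITIONS = {
    State.CREATED: {State.LAUNCHED},
    State.LAUNCHED: {State.ACTIVE, State.TERMINATED},
    State.ACTIVE: {State.TERMINATED},
    State.TERMINATED: set(),
}


class Lifecycle:
    def __init__(self):
        self.state = State.CREATED

    def _move(self, target):
        if target not in TRANSITIONS[self.state]:
            raise RuntimeError(f"cannot go from {self.state.name} to {target.name}")
        self.state = target

    def launch(self):
        self._move(State.LAUNCHED)

    def activate(self):
        self._move(State.ACTIVE)

    def terminate(self):
        self._move(State.TERMINATED)


kernel = Lifecycle()
kernel.launch()
kernel.activate()
kernel.terminate()
print(kernel.state.name)
```

Separating launch from activation lets setup failures surface before any message is accepted, which is a common motivation for two-phase designs.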

Architecture Overview

PyDotCompute Ring Kernel System
├── Ring Kernels          │ Performance Tiers      │ GPU Backends
│   • RingKernelRuntime   │ • uvloop (21μs)        │ CUDA:
│   • FastMessageQueue    │ • ThreadedRingKernel   │ • Numba JIT, CuPy arrays
│   • @ring_kernel        │ • CythonRingKernel     │ • Zero-copy DMA, PTX caching
│   • @message            │ • FastSPSCQueue        │ Metal (macOS):
│                         │                        │ • MLX, Unified memory
├─────────────────────────┴────────────────────────┴─────────────────
│ Memory: UnifiedBuffer (.host, .device, .metal), MemoryPool, Accelerator
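
The diagram lists a `FastSPSCQueue`; SPSC conventionally stands for single-producer, single-consumer. As a rough illustration of the ring-buffer idea behind such queues, here is a minimal pure-Python sketch. It is not the Cython implementation, and the class name `RingBuffer` and its `try_put`/`try_get` methods are hypothetical.

```python
class RingBuffer:
    """Fixed-capacity FIFO: slots are reused in a circle instead of being reallocated."""

    def __init__(self, capacity):
        self._slots = [None] * capacity
        self._capacity = capacity
        self._head = 0  # total items read so far (advanced only by the consumer)
        self._tail = 0  # total items written so far (advanced only by the producer)

    def try_put(self, item):
        """Store item and return True, or return False if the buffer is full."""
        if self._tail - self._head == self._capacity:
            return False
        self._slots[self._tail % self._capacity] = item
        self._tail += 1
        return True

    def try_get(self):
        """Return (True, item) for the oldest item, or (False, None) if empty."""
        if self._head == self._tail:
            return False, None
        item = self._slots[self._head % self._capacity]
        self._head += 1
        return True, item


queue = RingBuffer(capacity=2)
print(queue.try_put("a"), queue.try_put("b"), queue.try_put("c"))
print(queue.try_get())
```

Because the producer only advances the write counter and the consumer only advances the read counter, this layout is the usual starting point for lock-free SPSC queues in lower-level languages.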

Installation

Basic (CPU only)

pip install pydotcompute

Fast (Recommended)

pip install "pydotcompute[fast]"

Includes uvloop for 21μs message latency.

With CUDA support

pip install "pydotcompute[cuda,fast]"

With Metal support (macOS)

pip install "pydotcompute[metal,fast]"

GPU acceleration on Apple Silicon using MLX.

Development

git clone https://github.com/mivertowski/PyDotCompute.git
cd PyDotCompute
pip install -e ".[dev]"

The extras are quoted because shells such as zsh otherwise treat the square brackets as a glob pattern.
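
After installing, it can be useful to confirm whether uvloop is actually importable in the current environment. The sketch below uses only the standard library plus uvloop's own `uvloop.run` (available in uvloop 0.18 and later). The helper names `available_loop` and `run_with_best_loop` are made up for this example. PyDotCompute is described as installing uvloop automatically, so this is a diagnostic aid rather than a required step.

```python
import asyncio
import importlib.util


def available_loop():
    """Report which event loop implementation can be used in this environment."""
    return "uvloop" if importlib.util.find_spec("uvloop") is not None else "asyncio"


def run_with_best_loop(coro):
    """Run a coroutine on uvloop when it is installed, otherwise on stock asyncio."""
    if available_loop() == "uvloop":
        import uvloop  # imported lazily so the script still works without it

        return uvloop.run(coro)
    return asyncio.run(coro)


async def add(a, b):
    return a + b


print(available_loop())
print(run_with_best_loop(add(2, 3)))
```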

Quick Example

import asyncio
from pydotcompute import RingKernelRuntime, ring_kernel, message

@message
class ComputeRequest:
    values: list[float]

@message
class ComputeResponse:
    result: float

@ring_kernel(kernel_id="summer", queue_size=1000)
async def sum_actor(ctx):
    while not ctx.should_terminate:
        request = await ctx.receive()
        total = sum(request.values)
        await ctx.send(ComputeResponse(result=total))

async def main():
    async with RingKernelRuntime() as runtime:
        await runtime.launch("summer")
        await runtime.activate("summer")

        await runtime.send("summer", ComputeRequest(values=[1.0, 2.0, 3.0, 4.0, 5.0]))
        response = await runtime.receive("summer")

        print(f"Sum: {response.result}")  # Sum: 15.0

asyncio.run(main())

Next Steps

Quick Start: Get running in 5 minutes
First Ring Kernel: Build your first GPU actor
Concepts: Deep dive into the architecture
API Reference: Complete API documentation