PageRank: Standard vs Distributed Actor (GPU Native) Approach
This document compares a traditional PageRank implementation with a distributed actor implementation built on GPU-native Ring Kernels in DotCompute.
Overview
PageRank is an algorithm used to rank nodes in a graph by their importance, originally developed by Google to rank web pages. The algorithm iteratively computes rank scores based on the link structure between nodes.
Standard PageRank Implementation
Algorithm Description
The standard PageRank algorithm follows an iterative approach:
PR(u) = (1-d)/N + d * Σ_{v ∈ B(u)} PR(v)/L(v)
Where:
- PR(u) = PageRank of page u
- d = damping factor (typically 0.85)
- N = total number of pages
- B(u) = set of pages linking to u
- PR(v) = PageRank of a page v that links to u
- L(v) = number of outbound links from page v
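To make the update rule concrete, here is a minimal CPU-only reference sketch (not DotCompute code); the inLinks and outDegree names are illustrative:
// Minimal CPU reference of the update rule above (illustrative only)
// inLinks[u]   = indices of pages linking to u, i.e. B(u)
// outDegree[v] = number of outbound links from v, i.e. L(v)
public static float[] PageRankReference(int[][] inLinks, int[] outDegree,
    float d = 0.85f, int iterations = 50)
{
    int n = inLinks.Length;
    var pr = new float[n];
    Array.Fill(pr, 1.0f / n);                 // PR[i] = 1/N initially
    for (int iter = 0; iter < iterations; iter++)
    {
        var next = new float[n];
        for (int u = 0; u < n; u++)
        {
            float sum = 0.0f;
            foreach (int v in inLinks[u])     // Σ over v ∈ B(u)
                sum += pr[v] / outDegree[v];  // PR(v) / L(v)
            next[u] = (1.0f - d) / n + d * sum;
        }
        pr = next;
    }
    return pr;
}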
Data Input Format
The standard approach uses a dense adjacency matrix or CSR (Compressed Sparse Row) format:
// Dense adjacency matrix (small graphs)
float[,] adjacencyMatrix = new float[N, N];
// CSR format (large sparse graphs). For the pull-based kernel below, the
// graph is stored by incoming edges (i.e., the transposed graph):
int[] rowPointers; // Start index of each node's incoming edges
int[] columnIndices; // Source node for each incoming edge
float[] values; // Edge weights (usually 1/outDegree of the source node)
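For example, a 3-node graph with edges 0→1, 0→2, and 2→1 looks like this in the incoming-edge CSR layout (a minimal illustration):
// Node 0 has no in-edges, node 1 has in-edges from {0, 2}, node 2 from {0}
int[] rowPointers   = { 0, 0, 2, 3 };       // N+1 = 4 entries
int[] columnIndices = { 0, 2, 0 };          // source node of each in-edge
float[] values      = { 0.5f, 1.0f, 0.5f }; // 1/outDegree: L(0)=2, L(2)=1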
Execution Flow
┌─────────────────────────────────────────────────────────────┐
│ Standard PageRank │
├─────────────────────────────────────────────────────────────┤
│ 1. Load entire graph into memory │
│ 2. Initialize ranks: PR[i] = 1/N for all nodes │
│ 3. For each iteration until convergence: │
│ a. For each node u: │
│ - Sum contributions from all incoming edges │
│ - Apply damping factor │
│ - Store new rank │
│ b. Check convergence (||PR_new - PR_old|| < epsilon) │
│ 4. Return final ranks │
└─────────────────────────────────────────────────────────────┘
GPU Implementation (Traditional)
// Traditional GPU kernel - processes all nodes per iteration
__global__ void pagerank_iteration(
float* ranks_in,
float* ranks_out,
int* row_ptrs,
int* col_indices,
float* values,
int num_nodes,
float damping)
{
int node = blockIdx.x * blockDim.x + threadIdx.x;
if (node >= num_nodes) return;
float sum = 0.0f;
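// Pull-based update: iterate over this node's incoming edges
// (row_ptrs/col_indices describe the transposed graph, see above)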
for (int e = row_ptrs[node]; e < row_ptrs[node + 1]; e++) {
int src = col_indices[e];
sum += values[e] * ranks_in[src];
}
ranks_out[node] = (1.0f - damping) / num_nodes + damping * sum;
}
Distributed Actor (GPU Native) Approach
Architecture Overview
The distributed actor model decomposes PageRank into independent, message-passing actors (Ring Kernels) that run persistently on the GPU:
┌───────────────────────────────────────────────────────────────────┐
│ Distributed Actor Ring Kernel System │
├───────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌──────────────────┐ ┌──────────────┐ │
│ │ Contribution │───▶│ Rank │───▶│ Convergence │ │
│ │ Sender │ │ Aggregator │ │ Checker │ │
│ │ (Producer) │ │ (Consumer/Prod) │ │ (Consumer) │ │
│ └─────────────────┘ └──────────────────┘ └──────────────┘ │
│ │ │ │ │
│ │ K2K Messages │ K2K Messages │ │
│ └───────────────────────┴──────────────────────┘ │
│ │
│ Each kernel runs PERSISTENTLY on GPU, processing messages │
│ asynchronously without CPU intervention │
└───────────────────────────────────────────────────────────────────┘
Data Input Format
The actor model uses a message-based edge representation:
// Edge information as discrete messages
[MemoryPackable]
public partial struct GraphEdgeInfo : IRingKernelMessage
{
public int SourceNodeId { get; set; }
public int TargetNodeId { get; set; }
public float ContributionWeight { get; set; } // 1/outDegree
public int Iteration { get; set; }
}
// Rank update messages between kernels
[MemoryPackable]
public partial struct RankContribution : IRingKernelMessage
{
public int TargetNodeId { get; set; }
public float ContributionValue { get; set; }
public int SourceIteration { get; set; }
}
Ring Kernel Definitions
// Kernel 1: Sends rank contributions along outgoing edges
[RingKernel("pagerank_contribution_sender")]
[InputMessage(typeof(GraphEdgeInfo))]
[OutputMessage(typeof(RankContribution))]
[PublishesTo("pagerank_rank_aggregator")]
public static class ContributionSenderKernel
{
public static void Process(
RingKernelContext ctx,
GraphEdgeInfo edge,
out RankContribution contribution)
{
float currentRank = ctx.GetNodeRank(edge.SourceNodeId);
contribution = new RankContribution
{
TargetNodeId = edge.TargetNodeId,
ContributionValue = currentRank * edge.ContributionWeight,
SourceIteration = edge.Iteration
};
}
}
// Kernel 2: Aggregates contributions and updates ranks
[RingKernel("pagerank_rank_aggregator")]
[InputMessage(typeof(RankContribution))]
[OutputMessage(typeof(ConvergenceCheck))]
[SubscribesTo("pagerank_contribution_sender")]
[PublishesTo("pagerank_convergence_checker")]
public static class RankAggregatorKernel
{
public static void Process(
RingKernelContext ctx,
RankContribution contribution,
out ConvergenceCheck check)
{
// Atomically aggregate contribution
float oldRank = ctx.GetNodeRank(contribution.TargetNodeId);
float newRank = ctx.AtomicAdd(
contribution.TargetNodeId,
contribution.ContributionValue);
check = new ConvergenceCheck
{
NodeId = contribution.TargetNodeId,
RankDelta = Math.Abs(newRank - oldRank),
Iteration = contribution.SourceIteration
};
}
}
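The aggregator emits a ConvergenceCheck message that is not defined in this document; a minimal sketch consistent with the fields used above (the actual definition may differ) would be:
// Sketch of the convergence message implied by RankAggregatorKernel
[MemoryPackable]
public partial struct ConvergenceCheck : IRingKernelMessage
{
    public int NodeId { get; set; }
    public float RankDelta { get; set; }
    public int Iteration { get; set; }
}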
Execution Flow
┌─────────────────────────────────────────────────────────────────┐
│ Distributed Actor Execution Flow │
├─────────────────────────────────────────────────────────────────┤
│ │
│ SETUP PHASE (Once): │
│ 1. Launch Ring Kernels (persistent GPU threads) │
│ 2. Initialize node ranks in shared GPU memory │
│ 3. Activate kernels for message processing │
│ │
│ EXECUTION PHASE (Continuous): │
│ 4. Stream edge messages to ContributionSender │
│ - CPU sends edges asynchronously │
│ - GPU processes immediately without sync │
│ │
│ 5. K2K Message Flow (GPU-only, no CPU involvement): │
│ ContributionSender ──▶ RankAggregator ──▶ ConvergenceChecker│
│ │
│ 6. Monitor convergence via telemetry (zero-copy polling) │
│ │
│ TEARDOWN PHASE: │
│ 7. Deactivate and terminate kernels │
│ 8. Read final ranks from GPU memory │
└─────────────────────────────────────────────────────────────────┘
Comparison: Pros and Cons
Standard PageRank
| Aspect | Pros | Cons |
|---|---|---|
| Simplicity | Simple to implement and understand | Limited to batch processing |
| Memory | Efficient CSR format for sparse graphs | Must load entire graph before processing |
| Debugging | Easy to trace execution step-by-step | N/A |
| Convergence | Global sync ensures consistent iterations | CPU-GPU sync overhead each iteration |
| Latency | N/A | High latency for streaming updates |
| Scaling | Works well for static graphs | Poor for dynamic graph updates |
Distributed Actor (GPU Native)
| Aspect | Pros | Cons |
|---|---|---|
| Streaming | Process edges as they arrive | More complex programming model |
| Latency | Sub-millisecond message processing | Initial kernel launch overhead |
| CPU Offload | GPU processes autonomously | Debugging more challenging |
| Dynamic Graphs | Natural support for edge insertions/deletions | Memory overhead for message queues |
| Scalability | Linear scaling with GPU count | Requires careful load balancing |
| Throughput | Higher sustained throughput | Convergence detection more complex |
Performance Characteristics
Standard Approach
Iteration Time = O(E) GPU work + O(1) CPU-GPU sync
Total Time = Iterations * (GPU_kernel + CPU_sync + Launch_overhead)
Typical per-iteration:
- Kernel launch: ~5-10μs
- GPU execution: ~100μs - 10ms (depends on graph size)
- Sync: ~10-50μs
- Total: ~115μs - 10ms per iteration
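As a rough worked example using midpoints of the figures above (actual numbers vary with hardware and graph size):
Example: 50 iterations at ~1ms of GPU work each
  50 * (7.5μs launch + 1ms kernel + 30μs sync) ≈ 52ms total,
  of which ~2ms (~4%) is launch and sync overhead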
Distributed Actor Approach
Setup Time = Kernel_launch + Memory_allocation (one-time)
Processing Time = Message_rate * Processing_latency (continuous)
Typical characteristics:
- Initial launch: ~50-100ms (PTX compilation + kernel start)
- Message latency: ~1-5μs per message (GPU-native)
- Throughput: 10M+ messages/second sustained
- Zero CPU intervention during processing
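For comparison, under the same hedged assumptions the actor pipeline pays its cost once up front:
Example: ~75ms one-time launch, then 1M streamed edge updates
  1M messages / 10M msg/s ≈ 100ms of sustained GPU processing,
  with each update visible ~1-5μs after its message arrives and no per-iteration CPU sync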
Use Case Recommendations
Use Standard PageRank When:
- Static graphs - Graph structure doesn't change during computation
- Batch processing - All edges known upfront
- Simple deployment - Need straightforward implementation
- Debugging priority - Requires easy step-through debugging
- Small to medium graphs - < 10M nodes, < 100M edges
Use Distributed Actor When:
- Streaming data - Edges arrive continuously
- Dynamic graphs - Frequent edge insertions/deletions
- Real-time requirements - Sub-millisecond update latency needed
- High throughput - Sustained millions of updates per second
- Large-scale graphs - 100M+ nodes, billions of edges
- Multi-GPU deployment - Need to scale across GPU cluster
Code Examples
Standard PageRank (DotCompute)
// Traditional batch approach
public async Task<float[]> ComputePageRank(
SparseGraph graph,
int maxIterations = 100,
float epsilon = 1e-6f)
{
var accelerator = _orchestrator.GetAccelerator(BackendType.CUDA);
// Allocate buffers ('ranks'/'newRanks' are swapped each iteration,
// so they cannot be declared with 'using var')
var ranks = accelerator.Allocate<float>(graph.NodeCount);
var newRanks = accelerator.Allocate<float>(graph.NodeCount);
using var rowPtrs = accelerator.Allocate<int>(graph.RowPointers);
using var colIdx = accelerator.Allocate<int>(graph.ColumnIndices);
using var weights = accelerator.Allocate<float>(graph.Values); // CSR edge weights (1/outDegree); assumes SparseGraph exposes them
try
{
    // Initialize ranks
    ranks.Fill(1.0f / graph.NodeCount);
    // Iterate until convergence
    for (int iter = 0; iter < maxIterations; iter++)
    {
        // Launch kernel (one full sweep over all nodes)
        await accelerator.LaunchAsync(
            "pagerank_iteration",
            gridSize: (graph.NodeCount + 255) / 256,
            blockSize: 256,
            ranks, newRanks, rowPtrs, colIdx, weights, graph.NodeCount, 0.85f);
        // Check convergence (requires CPU-GPU sync)
        float delta = await ComputeDelta(ranks, newRanks);
        // Swap buffers so 'ranks' always holds the most recent values
        (ranks, newRanks) = (newRanks, ranks);
        if (delta < epsilon) break;
    }
    return await ranks.CopyToHostAsync();
}
finally
{
    ranks.Dispose();
    newRanks.Dispose();
}
}
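The batch example calls a ComputeDelta helper that is not shown above. A simple, unoptimized sketch copies both rank vectors back to the host and compares them; the IBuffer<float> parameter type is assumed for illustration, and a production version would run a GPU reduction instead of two transfers:
// Naive convergence check: max absolute difference between old and new ranks
// (an L-infinity norm; an L1/L2 norm works equally well)
private static async Task<float> ComputeDelta(
    IBuffer<float> oldRanks,   // buffer type name is assumed for illustration
    IBuffer<float> newRanks)
{
    float[] a = await oldRanks.CopyToHostAsync();
    float[] b = await newRanks.CopyToHostAsync();
    float maxDelta = 0.0f;
    for (int i = 0; i < a.Length; i++)
        maxDelta = Math.Max(maxDelta, Math.Abs(a[i] - b[i]));
    return maxDelta;
}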
Distributed Actor PageRank (DotCompute Ring Kernels)
// Streaming actor approach
public async IAsyncEnumerable<RankUpdate> ComputePageRankStreaming( // async iterator, not Task<...>: required for 'yield return' below
    IAsyncEnumerable<GraphEdge> edgeStream,
    [EnumeratorCancellation] CancellationToken cancellationToken)
{
// Launch persistent kernels (one-time setup)
await _runtime.LaunchAsync("pagerank_contribution_sender", gridSize: 4, blockSize: 256);
await _runtime.LaunchAsync("pagerank_rank_aggregator", gridSize: 4, blockSize: 256);
await _runtime.LaunchAsync("pagerank_convergence_checker", gridSize: 1, blockSize: 256);
// Activate for processing
await _runtime.ActivateAsync("pagerank_contribution_sender");
await _runtime.ActivateAsync("pagerank_rank_aggregator");
await _runtime.ActivateAsync("pagerank_convergence_checker");
// Stream edges to GPU (non-blocking)
var sendTask = StreamEdgesToGpu(edgeStream, cancellationToken);
// Poll for rank updates (zero-copy)
await foreach (var update in PollRankUpdates(cancellationToken))
{
yield return update;
}
// Ensure the streaming task has finished and surface any errors
await sendTask;
// Cleanup
await _runtime.DeactivateAsync("pagerank_contribution_sender");
await _runtime.TerminateAsync("pagerank_contribution_sender");
// ... terminate other kernels
}
private async Task StreamEdgesToGpu(
IAsyncEnumerable<GraphEdge> edges,
CancellationToken ct)
{
await foreach (var edge in edges.WithCancellation(ct))
{
// Send edge as message - GPU processes immediately
await _runtime.SendMessageAsync("pagerank_contribution_sender",
new GraphEdgeInfo
{
SourceNodeId = edge.Source,
TargetNodeId = edge.Target,
ContributionWeight = 1.0f / edge.SourceOutDegree
});
}
}
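A caller might consume the streaming API as follows; this is a sketch, and the edge source plus the NodeId/Rank fields on RankUpdate are assumed for illustration:
// Consume rank updates as edges stream in, stopping after 30 seconds
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
try
{
    var edges = edgeSource.ReadEdgesAsync(cts.Token); // hypothetical edge source
    await foreach (var update in ComputePageRankStreaming(edges, cts.Token))
    {
        Console.WriteLine($"Node {update.NodeId}: rank {update.Rank:F6}");
    }
}
catch (OperationCanceledException)
{
    // Expected once the 30-second window elapses
}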
Memory Layout Comparison
Standard Approach
GPU Memory Layout:
┌──────────────────────────────────────────────────┐
│ ranks[] - N floats (node ranks) │
│ newRanks[] - N floats (temporary) │
│ rowPointers[] - (N+1) ints (CSR structure) │
│ columnIndices[]- E ints (edge destinations) │
│ values[] - E floats (edge weights) │
└──────────────────────────────────────────────────┘
Total: 2N*4 + (N+1)*4 + 2E*4 = 12N + 8E + 4 bytes
Distributed Actor Approach
GPU Memory Layout:
┌──────────────────────────────────────────────────┐
│ Control Blocks - 64 bytes per kernel │
│ Message Queues - Configurable (e.g., 1MB each)│
│ Shared Rank State - N floats (atomic access) │
│ Telemetry Buffers - 64 bytes per kernel │
│ K2K Channels - Variable (inter-kernel msgs) │
└──────────────────────────────────────────────────┘
Total: Per-kernel overhead + N*4 + queue capacity
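As a worked example (assuming 1M nodes, 10M edges, three kernels, and 1MB queues per kernel):
Standard:  12*1M + 8*10M + 4 bytes ≈ 12MB + 80MB ≈ 92MB
Actor:     1M * 4 bytes rank state + 3 * 1MB queues + 3 * ~128 bytes
           control/telemetry ≈ 7MB (plus K2K channel capacity)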
Conclusion
The choice between standard and distributed actor approaches depends on your specific requirements:
Standard PageRank is ideal for batch processing of static graphs where simplicity and debuggability are priorities.
Distributed Actor (GPU Native) excels in streaming scenarios, dynamic graphs, and real-time applications where low latency and high throughput are critical.
DotCompute's Ring Kernel system enables the distributed actor pattern without sacrificing native GPU performance, combining the programming-model benefits of actors with the computational efficiency of GPU acceleration.