
Architecture & Crate Topology

High-Level Architecture

```
┌─────────────────────────────────────────────────────────────────────────┐
│                            PostgreSQL Server                            │
│                                                                         │
│  ┌─────────────────────────┐   ┌─────────────────────────────────────┐  │
│  │      SQL Interface      │   │          Background Worker          │  │
│  │     (graph schema)      │   │                                     │  │
│  │                         │   │ WAL Listener (logical replication)  │  │
│  │  graph.traverse()       │   │ Periodic auto-vacuum                │  │
│  │  graph.search()         │   │ OOM monitoring                      │  │
│  │  graph.build()          │   │                                     │  │
│  │  graph.auto_discover()  │   │                                     │  │
│  └───────────┬─────────────┘   └─────────────────┬───────────────────┘  │
│              │                                   │                      │
│              ▼                                   ▼                      │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                        Graph Engine (Rust)                        │  │
│  │                                                                   │  │
│  │  ┌─────────────────────────────────────────────────────────────┐  │  │
│  │  │                    ACL Pre-Flight Check                     │  │  │
│  │  │    pg_class_aclcheck(table_oid, user_oid, ACL_SELECT)       │  │  │
│  │  │    O(1) check BEFORE dropping into BFS hot loop             │  │  │
│  │  └─────────────────────────────────────────────────────────────┘  │  │
│  │                                                                   │  │
│  │  ┌────────────────┐  ┌────────────────┐  ┌───────────────────┐    │  │
│  │  │   NodeStore    │  │   EdgeStore    │  │    TokenIndex     │    │  │
│  │  │  (SoA arrays)  │  │     (CSR)      │  │     (Roaring)     │    │  │
│  │  └────────────────┘  └────────────────┘  └───────────────────┘    │  │
│  │  ┌────────────────┐  ┌────────────────┐  ┌───────────────────┐    │  │
│  │  │  BloomFilter   │  │  FilterIndex   │  │  ResolutionIndex  │    │  │
│  │  │   (u64 sigs)   │  │  (Vec<u32>[])  │  │  (mmap'd sorted)  │    │  │
│  │  └────────────────┘  └────────────────┘  └───────────────────┘    │  │
│  │                                                                   │  │
│  │  ┌─────────────────────────────────────────────────────────────┐  │  │
│  │  │                        Safety Layer                         │  │  │
│  │  │  • ACL check before BFS (O(1), zero cost)                   │  │  │
│  │  │  • OOM guard (check RSS before allocation)                  │  │  │
│  │  │  • Circuit breakers (max_nodes, max_depth, frontier)        │  │  │
│  │  │  • panic = catch_unwind → PG error (never crashes PG)       │  │  │
│  │  └─────────────────────────────────────────────────────────────┘  │  │
│  └─────────────────────────────────┬─────────────────────────────────┘  │
│                                    │                                    │
│                                    ▼                                    │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                         Persistence Layer                         │  │
│  │        .graph files in $PGDATA/graph/ (mmap'd read-only)          │  │
│  │        Atomic rename crash recovery                               │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                Configuration (graph schema tables)                │  │
│  │        graph.registered_tables — node tables                      │  │
│  │        graph.registered_edges  — edge relationships               │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                   Your Tables (source of truth)                   │  │
│  │        customers, orders, products, invoices, ...                 │  │
│  └───────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘
```

Crate Layout

```
graph/
├── Cargo.toml              # pgrx extension crate
├── graph.control           # Postgres extension control file
├── sql/
│   └── graph--0.1.0.sql    # Extension SQL bootstrap
├── src/
│   ├── lib.rs              # pgrx entry point, _PG_init, SQL function registration
│   ├── engine.rs           # Graph engine orchestrator (owns all stores)
│   ├── node_store.rs       # SoA NodeStore — flat arrays
│   ├── edge_store.rs       # CSR EdgeStore — targets[] + type_ids[]
│   ├── bloom.rs            # 64-bit Bloom filter (xxhash3)
│   ├── token_index.rs      # TokenIndex with RoaringBitmap posting lists
│   ├── resolution_index.rs # (table_oid, pk) → node_index — mmap'd sorted array, binary search
│   ├── filter_index.rs     # Parallel Vec<u32> arrays for numeric column filtering
│   ├── bfs.rs              # BFS hot loop (unweighted, zero allocations)
│   ├── path_finder.rs      # Bidirectional BFS shortest path + path reconstruction
│   ├── builder.rs          # Initial graph construction from SQL tables
│   ├── discover.rs         # Schema auto-discovery (information_schema introspection)
│   ├── acl.rs              # ACL pre-flight check (pg_class_aclcheck)
│   ├── sync.rs             # Trigger-based and WAL-based graph sync
│   ├── bgworker.rs         # Background worker for WAL + auto-vacuum
│   ├── persistence.rs      # .graph file format, mmap, atomic rename
│   ├── config.rs           # GUC parameters and engine config
│   ├── catalog.rs          # Extension catalog tables (graph schema)
│   └── safety.rs           # OOM guard, catch_unwind wrappers, error handling
├── benches/
│   └── bfs_bench.rs        # Criterion.rs benchmarks
└── tests/
    ├── integration.rs      # pgTAP-style integration tests
    ├── oom.rs              # OOM graceful degradation tests
    ├── acl.rs              # Permission / ACL tests
    └── concurrent.rs       # Multi-backend concurrency tests
```

Test status: this layout is the target test structure. The public repository includes Rust unit tests plus a minimal pgrx/pg_regress harness. Additional SQL-path, role/ACL, SQLSTATE, sync, OOM, crash-recovery, and concurrency integration coverage is tracked in the public issue tracker.

Design Decisions

Single Crate

pgGraph is implemented as a single pgrx extension crate. Keeping the SQL bindings and the core engine together makes #[pg_extern] expansion straightforward and avoids premature crate boundaries around a compact engine.

Dedicated graph Schema

All SQL APIs are declared in the graph schema (#[pg_extern(schema = "graph")]), and extension bootstrap SQL creates/uses graph.* catalog tables. This:

  • Prevents collisions with user tables/functions in public
  • Makes it clear which engine is running: graph.traverse(...) is unambiguous
  • Follows the professional standard set by pg_stat_statements, pgvector, etc.
  • Keeps the user’s public schema pristine

REGCLASS for Table References

All functions accept table names as REGCLASS (e.g., 'users'::regclass). Postgres automatically resolves schema-qualified names, handles OIDs, and validates that the table exists at call time. This eliminates an entire class of “table not found” errors.


Data Flow

Auto-Discovery (graph.auto_discover())

```
graph.auto_discover() called
Step 1: Query information_schema.tables
        → Find all user tables in the target schema(s)
Step 2: Query information_schema.key_column_usage + table_constraints
        → Find all FK relationships
Step 3: Register every table as a node source (graph.registered_tables)
        Register every FK as an edge (graph.registered_edges)
Step 4: Automatically call graph.build()
Result: Entire schema is graph-queryable in ~10 seconds
```

Initial Build (graph.build())

```
graph.build() called from SQL
Step 1: Read catalog (graph.registered_tables, graph.registered_edges)
Step 2: OOM pre-check
        Estimate RAM required from table row counts (SELECT count(*))
        If estimated RAM > graph.memory_limit_mb → return ERROR with HINT
Step 3: For each registered table:
        COPY (SELECT id, col1, col2, ... FROM table) TO STDOUT
        ├── Allocate node_index via ResolutionIndex
        ├── Compute Bloom signature from properties
        ├── Insert tokens into TokenIndex (properties ≤ max_token_length)
        ├── Load filter column values into FilterIndex (parallel Vec<u32>)
        └── Write to NodeStore SoA arrays
Step 4: For each registered edge:
        COPY (SELECT from_col, to_col FROM table) TO STDOUT
        ├── Resolve both endpoints via ResolutionIndex
        └── Buffer (source, target, edge_type) tuple
Step 5: Build CSR EdgeStore from buffered edges
        Sort + deduplicate per node → build edge_offsets[] + targets[] + type_ids[]
Step 6: Persist to $PGDATA/graph/graph.graph
        Atomic rename for crash safety
Step 7: mmap the new file for serving queries
```
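Step 5 — turning the buffered (source, target, edge_type) tuples into CSR arrays — can be sketched as follows. This is a simplified, self-contained version; the function name `build_csr` is illustrative, not the actual builder.rs API:

```rust
/// Build CSR arrays from buffered (source, target, edge_type) tuples,
/// mirroring Step 5 of graph.build(): sort, deduplicate, then lay out
/// edge_offsets[] + targets[] + type_ids[] as flat arrays.
fn build_csr(
    num_nodes: u32,
    mut edges: Vec<(u32, u32, u8)>,
) -> (Vec<u32>, Vec<u32>, Vec<u8>) {
    // Sort by (source, target, type) and drop exact duplicates.
    edges.sort_unstable();
    edges.dedup();

    // First pass: count edges per source node.
    let mut edge_offsets = vec![0u32; num_nodes as usize + 1];
    for &(src, _, _) in &edges {
        edge_offsets[src as usize + 1] += 1;
    }
    // Prefix sum: node n's edges live in targets[offsets[n]..offsets[n+1]].
    for i in 1..edge_offsets.len() {
        edge_offsets[i] += edge_offsets[i - 1];
    }
    // Edges are already sorted by source, so targets/type_ids fall into place.
    let targets: Vec<u32> = edges.iter().map(|&(_, t, _)| t).collect();
    let type_ids: Vec<u8> = edges.iter().map(|&(_, _, ty)| ty).collect();
    (edge_offsets, targets, type_ids)
}
```

The prefix-sum layout is what makes neighbor expansion a contiguous slice read later in the BFS hot loop.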

Query Flow (graph.traverse())

```
SELECT * FROM graph.traverse('users', 'U-123', 4, ARRAY['transferred_to']);

Step 1: ACL Pre-Flight Check
        pg_class_aclcheck(seed_table_oid, GetUserId(), ACL_SELECT)
        If result ≠ ACLCHECK_OK → ERROR "Permission denied for table users"
        Cost: O(1), ~0 microseconds
Step 2: Resolve seed — ResolutionIndex: (users_oid, "U-123") → node_index
        If not found → ERROR: "Node not found: users.U-123"
Step 3: Compute filters (if any)
        ├── Edge type filter: resolve ARRAY['transferred_to'] → u8 type IDs
        ├── Bloom mask from filter predicates (if provided)
        └── TokenIndex bitmap from exact-match predicates (if provided)
Step 4: Execute BFS (pure Rust, catch_unwind wrapper)
        ├── VecDeque frontier (standard BFS queue)
        ├── RoaringBitmap visited (cycle detection)
        ├── For each node: check is_active, edge_type filter, Bloom sig
        ├── FilterIndex check: structured JSONB filter
        │   (e.g., graph.greater_than('amount', 10000))
        ├── Expand neighbors from CSR (sequential memory access)
        ├── Track parent[] for path reconstruction
        └── Circuit breakers (max_depth, max_nodes, max_frontier)
Step 5: Reconstruct paths from parent[] array
Step 6: pgrx TableIterator yields rows:
        (root_table REGCLASS, root_id TEXT, node_table REGCLASS, node_id TEXT,
         depth INT, path JSONB, edge_path JSONB, node JSONB)
```
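The core of Step 4 can be sketched with plain std. This is a deliberately simplified version: `visited` is a `Vec<bool>` here instead of a RoaringBitmap, and the is_active, edge-type, Bloom, and FilterIndex checks are omitted:

```rust
use std::collections::VecDeque;

/// Simplified BFS hot loop over CSR arrays. Returns (node, depth) pairs
/// in visit order. `max_depth` and `max_nodes` act as circuit breakers.
fn bfs(
    edge_offsets: &[u32],
    targets: &[u32],
    seed: u32,
    max_depth: u32,
    max_nodes: usize,
) -> Vec<(u32, u32)> {
    let mut visited = vec![false; edge_offsets.len() - 1];
    let mut frontier: VecDeque<(u32, u32)> = VecDeque::new(); // (node, depth)
    let mut out = Vec::new();

    visited[seed as usize] = true;
    frontier.push_back((seed, 0));

    while let Some((node, depth)) = frontier.pop_front() {
        if out.len() >= max_nodes {
            break; // circuit breaker: result cap reached
        }
        out.push((node, depth));
        if depth == max_depth {
            continue; // circuit breaker: depth cap reached
        }
        // Expand neighbors from the node's contiguous CSR slice —
        // sequential memory access, no pointer chasing.
        let lo = edge_offsets[node as usize] as usize;
        let hi = edge_offsets[node as usize + 1] as usize;
        for &next in &targets[lo..hi] {
            if !visited[next as usize] {
                visited[next as usize] = true;
                frontier.push_back((next, depth + 1));
            }
        }
    }
    out
}
```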

Shortest Path Flow (graph.shortest_path())

```
SELECT * FROM graph.shortest_path('users', 'U-123', 'companies', 'C-456');

Step 1: ACL check on both tables
Step 2: Resolve both endpoints → node indices
Step 3: Bidirectional BFS (unweighted, O(V + E))
        ├── Expand forward from source
        ├── Expand backward from target
        ├── Meet-in-the-middle: stop when frontiers intersect
        └── Path reconstruction from both parent[] arrays
Step 4: Return path as ordered rows:
        (step INT, node_table REGCLASS, node_id TEXT, edge_label TEXT)
        If no path exists → empty result set (not an error)
```

Note on bidirectional BFS and unidirectional edges: Bidirectional BFS expands the graph in reverse from the target node. This requires edges to exist in both directions. Since graph.add_edge() defaults to bidirectional := true, both forward and reverse edges are stored in the CSR and bidirectional BFS works correctly. However, if any edge along the path was registered with bidirectional := false, the backward expansion from the target may not traverse that edge — potentially missing the optimal path. When shortest_path() is called, graph checks a global flag (has_unidirectional_edges, set during build()) to determine the strategy: if the graph contains any unidirectional edges, shortest path falls back to single-direction BFS from the source to guarantee correctness.


Crash Safety

Principle: graph Must NEVER Crash Postgres

This is the #1 engineering constraint. A misbehaving extension that takes down the database is worse than a slow one.

Strategy 1: catch_unwind on All Entry Points

Every #[pg_extern] function wraps its Rust logic in std::panic::catch_unwind. If the Rust code panics (array bounds, integer overflow, logic bug), the panic is caught and converted to a Postgres ERROR — not a process abort.

```rust
#[pg_extern]
fn graph_traverse(...) -> TableIterator<...> {
    let result = std::panic::catch_unwind(AssertUnwindSafe(|| {
        // ... ACL check, BFS logic ...
    }));
    match result {
        Ok(rows) => TableIterator::new(rows.into_iter()),
        Err(_) => {
            ereport!(ERROR,
                "graph: internal error during traversal. \
                 Graph may need rebuild. See server log for details.");
        }
    }
}
```
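The same panic-to-error conversion can be demonstrated outside pgrx with plain std — a minimal sketch in which `ereport!` is replaced by a `Result`, and the function name `guarded` is illustrative:

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Run `f`, converting any panic (array bounds, overflow, logic bug)
/// into an Err instead of aborting the process — the same pattern each
/// #[pg_extern] entry point uses to turn a Rust panic into a PG ERROR.
fn guarded<T>(f: impl FnOnce() -> T) -> Result<T, String> {
    catch_unwind(AssertUnwindSafe(f))
        .map_err(|_| "graph: internal error. See server log for details.".to_string())
}
```

An out-of-bounds index inside the closure becomes an `Err` the caller can report, rather than a crash.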

Strategy 2: ACL Pre-Flight Check (Before BFS)

Before dropping into the Rust BFS hot loop, every function verifies the caller has SELECT permission on the seed table:

```rust
let user_oid = pgrx::pg_sys::GetUserId();
let acl_result = pgrx::pg_sys::pg_class_aclcheck(
    seed_table_oid,
    user_oid,
    pgrx::pg_sys::ACL_SELECT,
);
if acl_result != pgrx::pg_sys::ACLCHECK_OK {
    return Err(GraphError::AclDenied {
        table: table_name.to_string(),
    });
}
```

This is an O(1) check — essentially zero cost.

Strategy 3: Memory Budget Enforcement

Before any large allocation (graph build, vacuum), graph checks the current process RSS against graph.memory_limit_mb. If exceeded, it returns a clear SQL error with a HINT — never a crash.
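On Linux, one way to read the current RSS is `/proc/self/statm` (field layout per proc(5): total pages, then resident pages). The sketch below shows the shape of such a check under that assumption — the actual safety.rs implementation may read RSS differently:

```rust
use std::fs;

/// Parse resident set size in MB from /proc/self/statm content
/// ("size resident shared ..."). `page_size` is typically 4096 on Linux.
fn rss_mb_from_statm(statm: &str, page_size: u64) -> Option<u64> {
    let resident_pages: u64 = statm.split_whitespace().nth(1)?.parse().ok()?;
    Some(resident_pages * page_size / (1024 * 1024))
}

/// Budget check before a large allocation: return a clear error with a
/// HINT instead of letting the OOM killer take down the backend.
fn check_memory_budget(limit_mb: u64) -> Result<(), String> {
    let statm = fs::read_to_string("/proc/self/statm")
        .map_err(|e| e.to_string())?;
    let rss = rss_mb_from_statm(&statm, 4096).ok_or("unparseable statm")?;
    if rss > limit_mb {
        Err(format!(
            "graph: memory limit exceeded ({rss} MB > {limit_mb} MB). \
             HINT: raise graph.memory_limit_mb or reduce graph size."
        ))
    } else {
        Ok(())
    }
}
```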

Strategy 4: Graceful Read-Only Degradation

If the edge buffer fills up during trigger sync, graph enters read-only mode: existing graph continues serving queries, new mutations are dropped with a WARNING, and graph.status() reports sync_status = 'read_only'.
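The degradation mechanic can be sketched as a bounded mutation buffer that latches into read-only mode when full. Names here (`SyncBuffer`, `push_edge`) are illustrative, not the real sync.rs API:

```rust
/// Bounded buffer of pending graph mutations. Once full, it latches
/// into read-only mode: reads keep working, new mutations are dropped.
struct SyncBuffer {
    pending: Vec<(u32, u32)>, // buffered (source, target) mutations
    capacity: usize,
    read_only: bool,
}

impl SyncBuffer {
    fn new(capacity: usize) -> Self {
        Self { pending: Vec::new(), capacity, read_only: false }
    }

    /// Returns false (mutation dropped) once the buffer is full. In the
    /// extension, this is where the WARNING is raised and graph.status()
    /// begins reporting sync_status = 'read_only'.
    fn push_edge(&mut self, edge: (u32, u32)) -> bool {
        if self.read_only || self.pending.len() >= self.capacity {
            self.read_only = true;
            return false; // existing graph keeps serving queries
        }
        self.pending.push(edge);
        true
    }
}
```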


Memory Architecture

mmap-Based (Process-Local, Kernel-Shared)

Each backend process loads the graph via mmap from the .graph file. Since mmap shares physical pages across processes, the actual RAM cost is not multiplied per connection.

```
Backend 1: mmap("graph.graph") ─┐
Backend 2: mmap("graph.graph") ─┼── Same physical pages
Backend 3: mmap("graph.graph") ─┘    (OS page cache)
```

This is the same mechanism Postgres itself uses. No shared memory complexity, no LWLock contention for reads.


Weighted Edges (Dijkstra)

The default traversal mode is unweighted BFS. All edges have an implicit weight of 1. This keeps the hot loop at O(V + E) with dense Vec<u32> arrays and full CPU cache utilisation.

When any edge is registered with a weight_column, graph allocates a parallel weights[] array in the CSR and enables weighted shortest path via Dijkstra:

```rust
// Unweighted: BFS hot loop (O(V + E))
let mut frontier: VecDeque<u32> = VecDeque::new();

// Weighted: Dijkstra hot loop (O(V + E log V))
let mut heap: BinaryHeap<Reverse<(u32, u32)>> = BinaryHeap::new(); // (cost, node)
```

The CSR layout with optional weights:

```
EdgeStore (base):
  edge_offsets[]: Vec<u32> — where each node's edges begin
  targets[]:      Vec<u32> — neighbor node indices
  type_ids[]:     Vec<u8>  — edge type labels

EdgeStore (with weights, optional):
  weights[]:      Vec<u32> — edge cost (parallel to targets[])
```

This is an additive layout — unweighted BFS ignores the weights[] array entirely. Dijkstra is only invoked when the user explicitly calls weighted_shortest_path().
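Dijkstra over this layout reads `weights[]` in lockstep with `targets[]`. A simplified, self-contained sketch (the real path finder also tracks `parent[]` for path reconstruction, which is omitted here):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Dijkstra over the weighted CSR layout: weights[] is parallel to
/// targets[]. Returns the cheapest cost to reach each node from
/// `source` (u32::MAX = unreachable).
fn dijkstra(
    edge_offsets: &[u32],
    targets: &[u32],
    weights: &[u32],
    source: u32,
) -> Vec<u32> {
    let n = edge_offsets.len() - 1;
    let mut dist = vec![u32::MAX; n];
    // Reverse turns the max-heap into a min-heap ordered by cost.
    let mut heap: BinaryHeap<Reverse<(u32, u32)>> = BinaryHeap::new(); // (cost, node)
    dist[source as usize] = 0;
    heap.push(Reverse((0, source)));

    while let Some(Reverse((cost, node))) = heap.pop() {
        if cost > dist[node as usize] {
            continue; // stale heap entry for an already-settled node
        }
        let lo = edge_offsets[node as usize] as usize;
        let hi = edge_offsets[node as usize + 1] as usize;
        for i in lo..hi {
            let next = targets[i] as usize;
            let next_cost = cost + weights[i];
            if next_cost < dist[next] {
                dist[next] = next_cost;
                heap.push(Reverse((next_cost, targets[i])));
            }
        }
    }
    dist
}
```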


Key Design Decisions

1. Why pgrx (Not C)?

| Factor | C Extension | pgrx (Rust) |
| --- | --- | --- |
| Memory safety | Manual (segfaults, buffer overflows) | Compile-time guaranteed |
| Crash safety | Bugs crash Postgres | catch_unwind → PG ERROR |
| Build system | Makefiles + PGXS | cargo pgrx build |
| Testing | pg_regress (limited) | Rust unit tests + pgTAP |
| Performance | Native | Native (same LLVM backend) |

2. Why CSR (Not Adjacency List)?

| Property | Adjacency List | CSR |
| --- | --- | --- |
| Cache efficiency | Poor (scattered heap allocs) | Excellent (contiguous) |
| mmap compatibility | Impossible (heap pointers) | Natural (flat arrays) |
| Serialization | Expensive (recursive) | Zero-copy (direct mmap) |
| Memory overhead | 24 bytes/Vec + padding | 4 bytes/node (offset) |

3. Why Not Apache AGE?

| | Apache AGE | graph |
| --- | --- | --- |
| Data model | Stores graph in separate tables | Reads your existing tables |
| Query language | openCypher + SQL hybrid | SQL functions only |
| Storage | On-disk (heap tuples) | RAM-resident (CSR + mmap) |
| Latency | 50ms–500ms | 0.07ms–15ms |
| Setup | Load data into AGE format | auto_discover() — 10 seconds |
| Schema | Requires graph-specific DDL | Uses your existing schema |

graph is to AGE what Redis is to Postgres — a fast in-memory acceleration layer, not a replacement.

4. Configuration in Postgres Tables (Not YAML Files)

All graph configuration is stored in Postgres catalog tables within the graph schema:

  • graph.registered_tables — registered tables (oid, id_columns, columns, tenant_column)
  • graph.registered_edges — registered edges (from_oid, from_col, to_oid, to_col, label)
  • graph.registered_filter_columns — numeric columns registered for O(1) range filtering

This means configuration is backed up with pg_dump, versioned with schema migrations, queryable with SQL, and replicated with Postgres replication.
