Architecture & Crate Topology
High-Level Architecture
┌───────────────────────────────────────────────────────────────────────────┐
│ PostgreSQL Server │
│ │
│ ┌─────────────────────────┐ ┌────────────────────────────────────┐ │
│ │ SQL Interface │ │ Background Worker │ │
│ │ (graph schema) │ │ │ │
│ │ │ │ WAL Listener (logical replication) │ │
│ │ graph.traverse() │ │ Periodic auto-vacuum │ │
│ │ graph.search() │ │ OOM monitoring │ │
│ │ graph.build() │ │ │ │
│ │ graph.auto_discover() │ │ │ │
│ └───────────┬─────────────┘ └──────────────────┬─────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Graph Engine (Rust) │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ ACL Pre-Flight Check │ │ │
│ │ │ pg_class_aclcheck(table_oid, user_oid, ACL_SELECT) │ │ │
│ │ │ O(1) check BEFORE dropping into BFS hot loop │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌────────────────┐ ┌────────────────┐ ┌───────────────────┐ │ │
│ │ │ NodeStore │ │ EdgeStore │ │ TokenIndex │ │ │
│ │ │ (SoA arrays) │ │ (CSR) │ │ (Roaring) │ │ │
│ │ └────────────────┘ └────────────────┘ └───────────────────┘ │ │
│ │ ┌────────────────┐ ┌────────────────┐ ┌───────────────────┐ │ │
│ │ │ BloomFilter │ │ FilterIndex │ │ ResolutionIndex │ │ │
│ │ │ (u64 sigs) │ │ (Vec<u32>[]) │ │ (mmap'd sorted) │ │ │
│ │ └────────────────┘ └────────────────┘ └───────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ Safety Layer │ │ │
│ │ │ • ACL check before BFS (O(1), zero cost) │ │ │
│ │ │ • OOM guard (check RSS before allocation) │ │ │
│ │ │ • Circuit breakers (max_nodes, max_depth, frontier) │ │ │
│ │ │ • panic = catch_unwind → PG error (never crashes PG) │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────┬─────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Persistence Layer │ │
│ │ .graph files on $PGDATA/graph/ (mmap'd read-only) │ │
│ │ Atomic rename crash recovery │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Configuration (graph schema tables) │ │
│ │ graph.registered_tables — node tables │ │
│ │ graph.registered_edges — edge relationships │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Your Tables (source of truth) │ │
│ │ customers, orders, products, invoices, ... │ │
│ └───────────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────────────┘Crate Layout
graph/
├── Cargo.toml # pgrx extension crate
├── graph.control # Postgres extension control file
├── sql/
│ └── graph--0.1.0.sql # Extension SQL bootstrap
├── src/
│ ├── lib.rs # pgrx entry point, _PG_init, SQL function registration
│ ├── engine.rs # Graph engine orchestrator (owns all stores)
│ ├── node_store.rs # SoA NodeStore — flat arrays
│ ├── edge_store.rs # CSR EdgeStore — targets[] + type_ids[]
│ ├── bloom.rs # 64-bit Bloom filter (xxhash3)
│ ├── token_index.rs # TokenIndex with RoaringBitmap posting lists
│ ├── resolution_index.rs # (table_oid, pk) → node_index — mmap'd sorted array, binary search
│ ├── filter_index.rs # Parallel Vec<u32> arrays for numeric column filtering
│ ├── bfs.rs # BFS hot loop (unweighted, zero allocations)
│ ├── path_finder.rs # Bidirectional BFS shortest path + path reconstruction
│ ├── builder.rs # Initial graph construction from SQL tables
│ ├── discover.rs # Schema auto-discovery (information_schema introspection)
│ ├── acl.rs # ACL pre-flight check (pg_class_aclcheck)
│ ├── sync.rs # Trigger-based and WAL-based graph sync
│ ├── bgworker.rs # Background worker for WAL + auto-vacuum
│ ├── persistence.rs # .graph file format, mmap, atomic rename
│ ├── config.rs # GUC parameters and engine config
│ ├── catalog.rs # Extension catalog tables (graph schema)
│ └── safety.rs # OOM guard, catch_unwind wrappers, error handling
├── benches/
│ └── bfs_bench.rs # Criterion.rs benchmarks
└── tests/
├── integration.rs # pgTAP-style integration tests
├── oom.rs # OOM graceful degradation tests
├── acl.rs # Permission / ACL tests
└── concurrent.rs # Multi-backend concurrency testsTest status: this layout is the target test structure. The public repository includes Rust unit tests plus a minimal pgrx/pg_regress harness. Additional SQL-path, role/ACL, SQLSTATE, sync, OOM, crash-recovery, and concurrency integration coverage is tracked in the public issue tracker.
Design Decisions
Single Crate
pgGraph is implemented as a single pgrx extension crate. Keeping SQL
bindings and the core engine together keeps #[pg_extern] expansion
straightforward and avoids premature crate boundaries around a compact engine.
Dedicated graph Schema
All SQL APIs are declared in the graph schema (#[pg_extern(schema = "graph")]), and extension bootstrap SQL creates/uses graph.* catalog tables. This:
- Prevents collisions with user tables/functions in
public - Makes it clear which engine is running:
graph.traverse(...)is unambiguous - Follows the professional standard set by
pg_stat_statements,pgvector, etc. - Keeps the user’s
publicschema pristine
REGCLASS for Table References
All functions accept table names as REGCLASS (e.g., 'users'::regclass). Postgres automatically resolves schema-qualified names, handles OIDs, and validates that the table exists at call time. This eliminates an entire class of “table not found” errors.
Data Flow
Auto-Discovery (graph.auto_discover())
graph.auto_discover() called
│
▼
Step 1: Query information_schema.tables
→ Find all user tables in the target schema(s)
│
▼
Step 2: Query information_schema.key_column_usage + table_constraints
→ Find all FK relationships
│
▼
Step 3: Register every table as a node source (graph.registered_tables)
Register every FK as an edge (graph.registered_edges)
│
▼
Step 4: Automatically call graph.build()
│
▼
Result: Entire schema is graph-queryable in ~10 secondsInitial Build (graph.build())
graph.build() called from SQL
│
▼
Step 1: Read catalog (graph.registered_tables, graph.registered_edges)
│
▼
Step 2: OOM pre-check
Estimate RAM required from table row counts (SELECT count(*))
If estimated RAM > graph.memory_limit_mb → return ERROR with HINT
│
▼
Step 3: For each registered table:
COPY (SELECT id, col1, col2, ... FROM table) TO STDOUT
│
├── Allocate node_index via ResolutionIndex
├── Compute Bloom signature from properties
├── Insert tokens into TokenIndex (properties ≤ max_token_length)
├── Load filter column values into FilterIndex (parallel Vec<u32>)
└── Write to NodeStore SoA arrays
│
▼
Step 4: For each registered edge:
COPY (SELECT from_col, to_col FROM table) TO STDOUT
│
├── Resolve both endpoints via ResolutionIndex
└── Buffer (source, target, edge_type) tuple
│
▼
Step 5: Build CSR EdgeStore from buffered edges
Sort + deduplicate per node → build edge_offsets[] + targets[] + type_ids[]
│
▼
Step 6: Persist to $PGDATA/graph/graph.graph
Atomic rename for crash safety
│
▼
Step 7: mmap the new file for serving queriesQuery Flow (graph.traverse())
SELECT * FROM graph.traverse('users', 'U-123', 4, ARRAY['transferred_to']);
│
▼
Step 1: ACL Pre-Flight Check
pg_class_aclcheck(seed_table_oid, GetUserId(), ACL_SELECT)
If ACLCHECK_OK ≠ result → ERROR "Permission denied for table users"
Cost: O(1), ~0 microseconds
│
▼
Step 2: Resolve seed — ResolutionIndex: (users_oid, "U-123") → node_index
If not found → ERROR: "Node not found: users.U-123"
│
▼
Step 3: Compute filters (if any)
├── Edge type filter: resolve ARRAY['transferred_to'] → u8 type IDs
├── Bloom mask from filter predicates (if provided)
└── TokenIndex bitmap from exact-match predicates (if provided)
│
▼
Step 4: Execute BFS (pure Rust, catch_unwind wrapper)
├── VecDeque frontier (standard BFS queue)
├── RoaringBitmap visited (cycle detection)
├── For each node: check is_active, edge_type filter, Bloom sig
├── FilterIndex check: structured JSONB filter (e.g., graph.greater_than('amount', 10000))
├── Expand neighbors from CSR (sequential memory access)
├── Track parent[] for path reconstruction
└── Circuit breakers (max_depth, max_nodes, max_frontier)
│
▼
Step 5: Reconstruct paths from parent[] array
│
▼
Step 6: pgrx TableIterator yields rows:
(root_table REGCLASS, root_id TEXT, node_table REGCLASS, node_id TEXT,
depth INT, path JSONB, edge_path JSONB, node JSONB)Shortest Path Flow (graph.shortest_path())
SELECT * FROM graph.shortest_path('users', 'U-123', 'companies', 'C-456');
│
▼
Step 1: ACL check on both tables
│
▼
Step 2: Resolve both endpoints → node indices
│
▼
Step 3: Bidirectional BFS (unweighted, O(V + E))
├── Expand forward from source
├── Expand backward from target
├── Meet-in-the-middle: stop when frontiers intersect
└── Path reconstruction from both parent[] arrays
│
▼
Step 4: Return path as ordered rows:
(step INT, node_table REGCLASS, node_id TEXT, edge_label TEXT)
│
If no path exists → empty result set (not an error)Note on bidirectional BFS and unidirectional edges: Bidirectional BFS expands the graph in reverse from the target node. This requires edges to exist in both directions. Since
graph.add_edge()defaults tobidirectional := true, both forward and reverse edges are stored in the CSR and bidirectional BFS works correctly. However, if any edge along the path was registered withbidirectional := false, the backward expansion from the target may not traverse that edge — potentially missing the optimal path. Whenshortest_path()is called, graph checks a global flag (has_unidirectional_edges, set duringbuild()) to determine the strategy: if the graph contains any unidirectional edges, shortest path falls back to single-direction BFS from the source to guarantee correctness.
Crash Safety
Principle: graph Must NEVER Crash Postgres
This is the #1 engineering constraint. A misbehaving extension that takes down the database is worse than a slow one.
Strategy 1: catch_unwind on All Entry Points
Every #[pg_extern] function wraps its Rust logic in std::panic::catch_unwind. If the Rust code panics (array bounds, integer overflow, logic bug), the panic is caught and converted to a Postgres ERROR — not a process abort.
#[pg_extern]
fn graph_traverse(...) -> TableIterator<...> {
let result = std::panic::catch_unwind(AssertUnwindSafe(|| {
// ... ACL check, BFS logic ...
}));
match result {
Ok(rows) => TableIterator::new(rows.into_iter()),
Err(_) => {
ereport!(ERROR, "graph: internal error during traversal. \
Graph may need rebuild. See server log for details.");
}
}
}Strategy 2: ACL Pre-Flight Check (Before BFS)
Before dropping into the Rust BFS hot loop, every function verifies the caller has SELECT permission on the seed table:
let user_oid = pgrx::pg_sys::GetUserId();
let acl_result = pgrx::pg_sys::pg_class_aclcheck(
seed_table_oid,
user_oid,
pgrx::pg_sys::ACL_SELECT
);
if acl_result != pgrx::pg_sys::ACLCHECK_OK {
return Err(GraphError::AclDenied {
table: table_name.to_string(),
});
}This is an O(1) check — essentially zero cost.
Strategy 3: Memory Budget Enforcement
Before any large allocation (graph build, vacuum), graph checks the current process RSS against graph.memory_limit_mb. If exceeded, it returns a clear SQL error with a HINT — never a crash.
Strategy 4: Graceful Read-Only Degradation
If the edge buffer fills up during trigger sync, graph enters read-only mode: existing graph continues serving queries, new mutations are dropped with a WARNING, and graph.status() reports sync_status = 'read_only'.
Memory Architecture
mmap-Based (Process-Local, Kernel-Shared)
Each backend process loads the graph via mmap from the .graph file. Since mmap shares physical pages across processes, the actual RAM cost is not multiplied per connection.
Backend 1: mmap("graph.graph") ─┐
Backend 2: mmap("graph.graph") ─┼── Same physical pages
Backend 3: mmap("graph.graph") ─┘ (OS page cache)This is the same mechanism Postgres itself uses. No shared memory complexity, no LWLock contention for reads.
Weighted Edges (Dijkstra)
The default traversal mode is unweighted BFS. All edges have implicit weight = 1. This keeps the hot loop at $O(V + E)$ with dense Vec<u32> arrays and full CPU cache utilisation.
When any edge is registered with a weight_column, graph allocates a parallel weights[] array in the CSR and enables weighted shortest path via Dijkstra:
// Unweighted: BFS hot loop (O(V + E))
let mut frontier: VecDeque<u32> = VecDeque::new();
// Weighted: Dijkstra hot loop (O(V + E log V))
let mut heap: BinaryHeap<Reverse<(u32, u32)>> = BinaryHeap::new(); // (cost, node)The CSR layout with optional weights:
EdgeStore (base):
edge_offsets[]: Vec<u32> — where each node's edges begin
targets[]: Vec<u32> — neighbor node indices
type_ids[]: Vec<u8> — edge type labels
EdgeStore (with weights, optional):
weights[]: Vec<u32> — edge cost (parallel to targets[])This is an additive layout — unweighted BFS ignores the weights[] array entirely. Dijkstra is only invoked when the user explicitly calls weighted_shortest_path().
Key Design Decisions
1. Why pgrx (Not C)?
| Factor | C Extension | pgrx (Rust) |
|---|---|---|
| Memory safety | Manual (segfaults, buffer overflows) | Compile-time guaranteed |
| Crash safety | Bugs crash Postgres | catch_unwind → PG ERROR |
| Build system | Makefiles + PGXS | cargo pgrx build |
| Testing | pg_regress (limited) | Rust unit tests + pgTAP |
| Performance | Native | Native (same LLVM backend) |
2. Why CSR (Not Adjacency List)?
| Property | Adjacency List | CSR |
|---|---|---|
| Cache efficiency | Poor (scattered heap allocs) | Excellent (contiguous) |
| mmap compatibility | Impossible (heap pointers) | Natural (flat arrays) |
| Serialization | Expensive (recursive) | Zero-copy (direct mmap) |
| Memory overhead | 24 bytes/Vec + padding | 4 bytes/node (offset) |
3. Why Not Apache AGE?
| Apache AGE | graph | |
|---|---|---|
| Data model | Stores graph in separate tables | Reads your existing tables |
| Query language | openCypher + SQL hybrid | SQL functions only |
| Storage | On-disk (heap tuples) | RAM-resident (CSR + mmap) |
| Latency | 50ms–500ms | 0.07ms–15ms |
| Setup | Load data into AGE format | auto_discover() — 10 seconds |
| Schema | Requires graph-specific DDL | Uses your existing schema |
graph is to AGE what Redis is to Postgres — a fast in-memory acceleration layer, not a replacement.
4. Configuration in Postgres Tables (Not YAML Files)
All graph configuration is stored in Postgres catalog tables within the graph schema:
graph.registered_tables— registered tables (oid, id_columns, columns, tenant_column)graph.registered_edges— registered edges (from_oid, from_col, to_oid, to_col, label)graph.registered_filter_columns— numeric columns registered for O(1) range filtering
This means configuration is backed up with pg_dump, versioned with schema migrations, queryable with SQL, and replicated with Postgres replication.