Architecture & Crate Topology

High-Level Architecture


┌───────────────────────────────────────────────────────────────────────────┐
│                             PostgreSQL Server                             │
│                                                                           │
│  ┌─────────────────────────┐      ┌────────────────────────────────────┐  │
│  │ SQL Interface           │      │ Background Worker                  │  │
│  │ (graph schema)          │      │                                    │  │
│  │                         │      │ WAL Listener (logical replication) │  │
│  │ graph.traverse()        │      │ Periodic auto-vacuum               │  │
│  │ graph.search()          │      │ OOM monitoring                     │  │
│  │ graph.build()           │      │                                    │  │
│  │ graph.auto_discover()   │      │                                    │  │
│  └───────────┬─────────────┘      └──────────────────┬─────────────────┘  │
│              │                                       │                    │
│              ▼                                       ▼                    │
│   ┌───────────────────────────────────────────────────────────────────┐   │
│   │ Graph Engine (Rust)                                               │   │
│   │                                                                   │   │
│   │  ┌─────────────────────────────────────────────────────────────┐  │   │
│   │  │ ACL Pre-Flight Check                                        │  │   │
│   │  │ pg_class_aclcheck(table_oid, user_oid, ACL_SELECT)          │  │   │
│   │  │ O(1) check BEFORE dropping into BFS hot loop                │  │   │
│   │  └─────────────────────────────────────────────────────────────┘  │   │
│   │                                                                   │   │
│   │   ┌────────────────┐  ┌────────────────┐  ┌───────────────────┐   │   │
│   │   │ NodeStore      │  │ EdgeStore      │  │ TokenIndex        │   │   │
│   │   │ (SoA arrays)   │  │ (CSR)          │  │ (Roaring)         │   │   │
│   │   └────────────────┘  └────────────────┘  └───────────────────┘   │   │
│   │   ┌────────────────┐  ┌────────────────┐  ┌───────────────────┐   │   │
│   │   │ BloomFilter    │  │ FilterIndex    │  │ ResolutionIndex   │   │   │
│   │   │ (u64 sigs)     │  │ (Vec<u32>[])   │  │ (mmap'd sorted)   │   │   │
│   │   └────────────────┘  └────────────────┘  └───────────────────┘   │   │
│   │                                                                   │   │
│   │  ┌─────────────────────────────────────────────────────────────┐  │   │
│   │  │ Safety Layer                                                │  │   │
│   │  │ • ACL check before BFS (O(1), zero cost)                    │  │   │
│   │  │ • OOM guard (check RSS before allocation)                   │  │   │
│   │  │ • Circuit breakers (max_nodes, max_depth, frontier)         │  │   │
│   │  │ • panic = catch_unwind → PG error (never crashes PG)        │  │   │
│   │  └─────────────────────────────────────────────────────────────┘  │   │
│   └─────────────────────────────────┬─────────────────────────────────┘   │
│                                     │                                     │
│                                     ▼                                     │
│   ┌───────────────────────────────────────────────────────────────────┐   │
│   │ Persistence Layer                                                 │   │
│   │ .graph files on $PGDATA/graph/ (mmap'd read-only)                 │   │
│   │ Atomic rename crash recovery                                      │   │
│   └───────────────────────────────────────────────────────────────────┘   │
│                                                                           │
│   ┌───────────────────────────────────────────────────────────────────┐   │
│   │ Configuration (graph schema tables)                               │   │
│   │ graph.registered_tables — node tables                             │   │
│   │ graph.registered_edges — edge relationships                       │   │
│   └───────────────────────────────────────────────────────────────────┘   │
│                                                                           │
│   ┌───────────────────────────────────────────────────────────────────┐   │
│   │ Your Tables (source of truth)                                     │   │
│   │ customers, orders, products, invoices, ...                        │   │
│   └───────────────────────────────────────────────────────────────────┘   │
└───────────────────────────────────────────────────────────────────────────┘

Crate Layout


graph/
├── Cargo.toml                     # pgrx extension crate
├── graph.control                # Postgres extension control file
├── sql/
│   └── graph--0.1.0.sql         # Extension SQL bootstrap
├── src/
│   ├── lib.rs                     # pgrx entry point, _PG_init, SQL function registration
│   ├── engine.rs                  # Graph engine orchestrator (owns all stores)
│   ├── node_store.rs              # SoA NodeStore — flat arrays
│   ├── edge_store.rs              # CSR EdgeStore — targets[] + type_ids[]
│   ├── bloom.rs                   # 64-bit Bloom filter (xxhash3)
│   ├── token_index.rs             # TokenIndex with RoaringBitmap posting lists
│   ├── resolution_index.rs        # (table_oid, pk) → node_index — mmap'd sorted array, binary search
│   ├── filter_index.rs             # Parallel Vec<u32> arrays for numeric column filtering
│   ├── bfs.rs                     # BFS hot loop (unweighted, zero allocations)
│   ├── path_finder.rs             # Bidirectional BFS shortest path + path reconstruction
│   ├── builder.rs                 # Initial graph construction from SQL tables
│   ├── discover.rs                # Schema auto-discovery (information_schema introspection)
│   ├── acl.rs                     # ACL pre-flight check (pg_class_aclcheck)
│   ├── sync.rs                    # Trigger-based and WAL-based graph sync
│   ├── bgworker.rs                # Background worker for WAL + auto-vacuum
│   ├── persistence.rs             # .graph file format, mmap, atomic rename
│   ├── config.rs                  # GUC parameters and engine config
│   ├── catalog.rs                 # Extension catalog tables (graph schema)
│   └── safety.rs                  # OOM guard, catch_unwind wrappers, error handling
├── benches/
│   └── bfs_bench.rs               # Criterion.rs benchmarks
└── tests/
    ├── integration.rs             # pgTAP-style integration tests
    ├── oom.rs                     # OOM graceful degradation tests
    ├── acl.rs                     # Permission / ACL tests
    └── concurrent.rs              # Multi-backend concurrency tests

Test status: this layout is the target test structure. The public repository includes Rust unit tests plus a minimal pgrx/pg_regress harness. Additional SQL-path, role/ACL, SQLSTATE, sync, OOM, crash-recovery, and concurrency integration coverage is tracked in the public issue tracker.

Design Decisions

Single Crate

pgGraph is implemented as a single pgrx extension crate. Keeping SQL bindings and the core engine together keeps #[pg_extern] expansion straightforward and avoids premature crate boundaries around a compact engine.

Dedicated `graph` Schema

All SQL APIs are declared in the graph schema (#[pg_extern(schema = "graph")]), and extension bootstrap SQL creates/uses graph.* catalog tables. This:

Prevents collisions with user tables/functions in public
Makes it clear which engine is running: graph.traverse(...) is unambiguous
Follows the professional standard set by pg_stat_statements, pgvector, etc.
Keeps the user’s public schema pristine

`REGCLASS` for Table References

All functions accept table names as REGCLASS (e.g., 'users'::regclass). Postgres automatically resolves schema-qualified names, handles OIDs, and validates that the table exists at call time. This eliminates an entire class of “table not found” errors.

Data Flow

Auto-Discovery (`graph.auto_discover()`)


graph.auto_discover() called
        │
        ▼
Step 1: Query information_schema.tables
        → Find all user tables in the target schema(s)
        │
        ▼
Step 2: Query information_schema.key_column_usage + table_constraints
        → Find all FK relationships
        │
        ▼
Step 3: Register every table as a node source (graph.registered_tables)
        Register every FK as an edge (graph.registered_edges)
        │
        ▼
Step 4: Automatically call graph.build()
        │
        ▼
Result: Entire schema is graph-queryable in ~10 seconds

Initial Build (`graph.build()`)


graph.build() called from SQL
        │
        ▼
Step 1: Read catalog (graph.registered_tables, graph.registered_edges)
        │
        ▼
Step 2: OOM pre-check
        Estimate RAM required from table row counts (SELECT count(*))
        If estimated RAM > graph.memory_limit_mb → return ERROR with HINT
        │
        ▼
Step 3: For each registered table:
        COPY (SELECT id, col1, col2, ... FROM table) TO STDOUT
        │
        ├── Allocate node_index via ResolutionIndex
        ├── Compute Bloom signature from properties
        ├── Insert tokens into TokenIndex (properties ≤ max_token_length)
        ├── Load filter column values into FilterIndex (parallel Vec<u32>)
        └── Write to NodeStore SoA arrays
        │
        ▼
Step 4: For each registered edge:
        COPY (SELECT from_col, to_col FROM table) TO STDOUT
        │
        ├── Resolve both endpoints via ResolutionIndex
        └── Buffer (source, target, edge_type) tuple
        │
        ▼
Step 5: Build CSR EdgeStore from buffered edges
        Sort + deduplicate per node → build edge_offsets[] + targets[] + type_ids[]
        │
        ▼
Step 6: Persist to $PGDATA/graph/graph.graph
        Atomic rename for crash safety
        │
        ▼
Step 7: mmap the new file for serving queries

Query Flow (`graph.traverse()`)


SELECT * FROM graph.traverse('users', 'U-123', 4, ARRAY['transferred_to']);
        │
        ▼
Step 1: ACL Pre-Flight Check
        pg_class_aclcheck(seed_table_oid, GetUserId(), ACL_SELECT)
        If ACLCHECK_OK ≠ result → ERROR "Permission denied for table users"
        Cost: O(1), ~0 microseconds
        │
        ▼
Step 2: Resolve seed — ResolutionIndex: (users_oid, "U-123") → node_index
        If not found → ERROR: "Node not found: users.U-123"
        │
        ▼
Step 3: Compute filters (if any)
        ├── Edge type filter: resolve ARRAY['transferred_to'] → u8 type IDs
        ├── Bloom mask from filter predicates (if provided)
        └── TokenIndex bitmap from exact-match predicates (if provided)
        │
        ▼
Step 4: Execute BFS (pure Rust, catch_unwind wrapper)
        ├── VecDeque frontier (standard BFS queue)
        ├── RoaringBitmap visited (cycle detection)
        ├── For each node: check is_active, edge_type filter, Bloom sig
        ├── FilterIndex check: structured JSONB filter (e.g., graph.greater_than('amount', 10000))
        ├── Expand neighbors from CSR (sequential memory access)
        ├── Track parent[] for path reconstruction
        └── Circuit breakers (max_depth, max_nodes, max_frontier)
        │
        ▼
Step 5: Reconstruct paths from parent[] array
        │
        ▼
Step 6: pgrx TableIterator yields rows:
        (root_table REGCLASS, root_id TEXT, node_table REGCLASS, node_id TEXT,
         depth INT, path JSONB, edge_path JSONB, node JSONB)

Shortest Path Flow (`graph.shortest_path()`)


SELECT * FROM graph.shortest_path('users', 'U-123', 'companies', 'C-456');
        │
        ▼
Step 1: ACL check on both tables
        │
        ▼
Step 2: Resolve both endpoints → node indices
        │
        ▼
Step 3: Bidirectional BFS (unweighted, O(V + E))
        ├── Expand forward from source
        ├── Expand backward from target
        ├── Meet-in-the-middle: stop when frontiers intersect
        └── Path reconstruction from both parent[] arrays
        │
        ▼
Step 4: Return path as ordered rows:
        (step INT, node_table REGCLASS, node_id TEXT, edge_label TEXT)
        │
        If no path exists → empty result set (not an error)

Note on bidirectional BFS and unidirectional edges: Bidirectional BFS expands the graph in reverse from the target node. This requires edges to exist in both directions. Since graph.add_edge() defaults to bidirectional := true, both forward and reverse edges are stored in the CSR and bidirectional BFS works correctly. However, if any edge along the path was registered with bidirectional := false, the backward expansion from the target may not traverse that edge — potentially missing the optimal path. When shortest_path() is called, graph checks a global flag (has_unidirectional_edges, set during build()) to determine the strategy: if the graph contains any unidirectional edges, shortest path falls back to single-direction BFS from the source to guarantee correctness.

Crash Safety

Principle: graph Must NEVER Crash Postgres

This is the #1 engineering constraint. A misbehaving extension that takes down the database is worse than a slow one.

Strategy 1: `catch_unwind` on All Entry Points

Every #[pg_extern] function wraps its Rust logic in std::panic::catch_unwind. If the Rust code panics (array bounds, integer overflow, logic bug), the panic is caught and converted to a Postgres ERROR — not a process abort.


#[pg_extern]
fn graph_traverse(...) -> TableIterator<...> {
    let result = std::panic::catch_unwind(AssertUnwindSafe(|| {
        // ... ACL check, BFS logic ...
    }));
    
    match result {
        Ok(rows) => TableIterator::new(rows.into_iter()),
        Err(_) => {
            ereport!(ERROR, "graph: internal error during traversal. \
                     Graph may need rebuild. See server log for details.");
        }
    }
}

Strategy 2: ACL Pre-Flight Check (Before BFS)

Before dropping into the Rust BFS hot loop, every function verifies the caller has SELECT permission on the seed table:


let user_oid = pgrx::pg_sys::GetUserId();
let acl_result = pgrx::pg_sys::pg_class_aclcheck(
    seed_table_oid, 
    user_oid, 
    pgrx::pg_sys::ACL_SELECT
);
if acl_result != pgrx::pg_sys::ACLCHECK_OK {
    return Err(GraphError::AclDenied {
        table: table_name.to_string(),
    });
}

This is an O(1) check — essentially zero cost.

Strategy 3: Memory Budget Enforcement

Before any large allocation (graph build, vacuum), graph checks the current process RSS against graph.memory_limit_mb. If exceeded, it returns a clear SQL error with a HINT — never a crash.

Strategy 4: Graceful Read-Only Degradation

If the edge buffer fills up during trigger sync, graph enters read-only mode: existing graph continues serving queries, new mutations are dropped with a WARNING, and graph.status() reports sync_status = 'read_only'.

Memory Architecture

mmap-Based (Process-Local, Kernel-Shared)

Each backend process loads the graph via mmap from the .graph file. Since mmap shares physical pages across processes, the actual RAM cost is not multiplied per connection.


Backend 1:  mmap("graph.graph") ─┐
Backend 2:  mmap("graph.graph") ─┼── Same physical pages
Backend 3:  mmap("graph.graph") ─┘   (OS page cache)

This is the same mechanism Postgres itself uses. No shared memory complexity, no LWLock contention for reads.

Weighted Edges (Dijkstra)

The default traversal mode is unweighted BFS. All edges have implicit weight = 1. This keeps the hot loop at $O(V + E)$ with dense Vec<u32> arrays and full CPU cache utilisation.

When any edge is registered with a weight_column, graph allocates a parallel weights[] array in the CSR and enables weighted shortest path via Dijkstra:


// Unweighted: BFS hot loop (O(V + E))
let mut frontier: VecDeque<u32> = VecDeque::new();
 
// Weighted: Dijkstra hot loop (O(V + E log V))
let mut heap: BinaryHeap<Reverse<(u32, u32)>> = BinaryHeap::new(); // (cost, node)

The CSR layout with optional weights:


EdgeStore (base):
  edge_offsets[]:  Vec<u32>    — where each node's edges begin
  targets[]:       Vec<u32>    — neighbor node indices
  type_ids[]:      Vec<u8>     — edge type labels

EdgeStore (with weights, optional):
  weights[]:       Vec<u32>    — edge cost (parallel to targets[])

This is an additive layout — unweighted BFS ignores the weights[] array entirely. Dijkstra is only invoked when the user explicitly calls weighted_shortest_path().

Key Design Decisions

1. Why pgrx (Not C)?

Factor	C Extension	pgrx (Rust)
Memory safety	Manual (segfaults, buffer overflows)	Compile-time guaranteed
Crash safety	Bugs crash Postgres	`catch_unwind` → PG ERROR
Build system	Makefiles + PGXS	`cargo pgrx build`
Testing	pg_regress (limited)	Rust unit tests + pgTAP
Performance	Native	Native (same LLVM backend)

2. Why CSR (Not Adjacency List)?

Property	Adjacency List	CSR
Cache efficiency	Poor (scattered heap allocs)	Excellent (contiguous)
mmap compatibility	Impossible (heap pointers)	Natural (flat arrays)
Serialization	Expensive (recursive)	Zero-copy (direct mmap)
Memory overhead	24 bytes/Vec + padding	4 bytes/node (offset)

3. Why Not Apache AGE?

	Apache AGE	graph
Data model	Stores graph in separate tables	Reads your existing tables
Query language	openCypher + SQL hybrid	SQL functions only
Storage	On-disk (heap tuples)	RAM-resident (CSR + mmap)
Latency	50ms–500ms	0.07ms–15ms
Setup	Load data into AGE format	`auto_discover()` — 10 seconds
Schema	Requires graph-specific DDL	Uses your existing schema

graph is to AGE what Redis is to Postgres — a fast in-memory acceleration layer, not a replacement.

4. Configuration in Postgres Tables (Not YAML Files)

All graph configuration is stored in Postgres catalog tables within the graph schema:

graph.registered_tables — registered tables (oid, id_columns, columns, tenant_column)
graph.registered_edges — registered edges (from_oid, from_col, to_oid, to_col, label)
graph.registered_filter_columns — numeric columns registered for O(1) range filtering

This means configuration is backed up with pg_dump, versioned with schema migrations, queryable with SQL, and replicated with Postgres replication.