Memory Model & Capacity Planning
Memory Architecture
graph uses a layered memory model designed for maximum BFS throughput while minimising impact on the Postgres shared buffer pool.
Key Principle: Graph Memory is Separate from Postgres Shared Buffers
The graph index does NOT use Postgres shared buffers (shared_buffers). It lives in the Postgres backend’s private address space via mmap. This means:
- No contention with your table/index buffer cache.
- No LWLock overhead for graph reads.
- OS-managed page sharing — multiple backends reading the same .graph file share physical RAM pages via the kernel’s page cache.
┌──────────────────────────────────────────────────────┐
│ Postgres Process Memory                              │
│                                                      │
│  ┌───────────────────────────┐                       │
│  │ shared_buffers (128 MB+)  │  ← Your table data    │
│  │ WAL buffers               │                       │
│  │ Lock tables               │                       │
│  └───────────────────────────┘                       │
│                                                      │
│  ┌───────────────────────────┐                       │
│  │ Backend private memory    │                       │
│  │  ┌─────────────────────┐  │                       │
│  │  │ graph mmap          │  │  ← SHARED via OS      │
│  │  │ (graph.graph)       │  │    page cache         │
│  │  │                     │  │                       │
│  │  │ NodeStore arrays    │  │  ← Zero-copy          │
│  │  │ EdgeStore CSR       │  │                       │
│  │  │ ResolutionIndex     │  │    Sorted flat array  │
│  │  └─────────────────────┘  │                       │
│  │  ┌─────────────────────┐  │                       │
│  │  │ TokenIndex (heap)   │  │  ← Process-local      │
│  │  │ RoaringBitmap       │  │    (deserialized on   │
│  │  └─────────────────────┘  │    first load)        │
│  └───────────────────────────┘                       │
└──────────────────────────────────────────────────────┘
⚠️ Per-Backend Memory: TokenIndex Only. When loaded from the .graph file, NodeStore arrays, EdgeStore CSR, and ResolutionIndex are all served directly from mmap’d memory — zero per-process heap allocation, shared across all backends via the OS page cache. The only per-process duplication is the TokenIndex (RoaringBitmap posting lists, ~2–8 bytes/node). At 10M nodes with 50 connections, the total duplicated heap across all backends is ~4 GB (TokenIndex only) — not the ~59 GB it would be if NodeStore and EdgeStore were also duplicated.
Per-Node Memory Cost
Hot Path (BFS traversal reads these arrays)
| Array | Width | Purpose |
|---|---|---|
| edge_offsets[] | 4 bytes (u32) | CSR offset into edge array |
| property_signature[] | 8 bytes (u64) | Bloom filter signature |
| is_active[] | 0.125 bytes (bit) | Tombstone flag |
Hot path total: ~12.1 bytes/node
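The hot path is a plain struct-of-arrays layout: one flat array per field, indexed by node_idx. A minimal sketch of that layout, assuming the arrays are exposed as slices over the mmap’d sections (the names follow the table above and are illustrative, not the extension’s actual types):
// Illustrative struct-of-arrays view over the mmap'd hot-path sections.
struct NodeStoreView<'a> {
    edge_offsets: &'a [u32],        // node_count + 1 entries: CSR offsets
    property_signatures: &'a [u64], // one Bloom signature per node
    is_active: &'a [u8],            // packed tombstone bits, 1 bit per node
}

impl<'a> NodeStoreView<'a> {
    // 4 + 8 + 0.125 bytes per node, matching the ~12.1 bytes/node total above.
    fn is_live(&self, node_idx: usize) -> bool {
        (self.is_active[node_idx / 8] >> (node_idx % 8)) & 1 == 1
    }
}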
Warm Path — mmap’d (shared via OS page cache)
| Structure | Width | Purpose |
|---|---|---|
| ResolutionIndex | 16 bytes/node | (table_oid, pk_hash, node_idx) sorted flat array |
The ResolutionIndex is a sorted array of fixed-size 16-byte entries stored directly in the .graph file and mmap’d. All backends share the same physical memory pages — no deserialization, no per-process heap allocation. Lookups use binary search: O(log n), ~115 ns for 10M nodes.
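A minimal sketch of that lookup, assuming the entries are exposed as a slice sorted by (table_oid, pk_hash); the tuple layout here is illustrative rather than the on-disk 16-byte record:
// Binary-search the sorted ResolutionIndex for (table_oid, pk_hash).
fn resolve(entries: &[(u32, u64, u32)], table_oid: u32, pk_hash: u64) -> Option<u32> {
    entries
        .binary_search_by(|&(oid, hash, _)| (oid, hash).cmp(&(table_oid, pk_hash)))
        .ok()
        .map(|i| entries[i].2) // node_idx
}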
Warm Path — heap (per-process, duplicated per backend)
| Structure | Estimated Cost | Purpose |
|---|---|---|
| TokenIndex posting lists | ~2–8 bytes/node (amortised) | Property search |
| FilterIndex columns | 4 bytes/node/column (u32) | Numeric column value for range checks |
Note: The Bloom/property signature is already counted in the hot path above — it is the same property_signature[] array.
Heap warm path (per-process): ~2–8 bytes/node, plus 4 bytes/node for each registered filter column
FilterIndex sparsity warning: FilterIndex arrays are indexed by global node_idx, not per-table. Registering invoices.amount allocates 4 bytes for every node in the graph — not just invoice nodes. For a graph with 10M total nodes and 500K invoices, 9.5M entries are zeroed (wasting ~38 MB). Only register filter columns on dense/common tables, or accept the overhead.
Per-Edge Memory Cost
| Array | Width | Purpose |
|---|---|---|
| targets[] | 4 bytes (u32) | Neighbour node index |
| type_ids[] | 1 byte (u8) | Edge type ID |
| weights[] | 4 bytes (u32) | Edge weight (optional, Dijkstra only) |
Unweighted edges: 5 bytes/edge (BFS traversals — no weights[] array loaded)
Weighted edges: 9 bytes/edge (when weight_column is registered on any edge)
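For context, this is how the per-node and per-edge arrays fit together during traversal. A hedged sketch of CSR neighbour iteration, assuming slice views over the mmap’d sections (the real hot loop reads raw pointers, as shown later under Read Path):
// Walk the neighbours of `node_idx` via the CSR offset array.
// edge_offsets has node_count + 1 entries, so node_idx + 1 is always in bounds.
fn neighbours<'a>(
    edge_offsets: &'a [u32],
    targets: &'a [u32],
    type_ids: &'a [u8],
    node_idx: usize,
) -> impl Iterator<Item = (u32, u8)> + 'a {
    let start = edge_offsets[node_idx] as usize;
    let end = edge_offsets[node_idx + 1] as usize;
    targets[start..end]
        .iter()
        .copied()
        .zip(type_ids[start..end].iter().copied())
}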
Capacity Planning Table
| Scale | Nodes | Edges | Hot-Path RAM | Total RAM | Machine RAM |
|---|---|---|---|---|---|
| Small | 100K | 2M | 11 MB | ~30 MB | Any |
| Medium | 1M | 20M | 112 MB | ~300 MB | 4 GB+ |
| Large | 10M | 200M | 1.1 GB | ~3 GB | 16 GB+ |
| Enterprise | 50M | 1B | 5.6 GB | ~15 GB | 64 GB+ |
| Max (32-bit) | 500M | 10B | 56 GB | ~150 GB | 256 GB+ |
Rule of thumb: Allocate ~1.5× the hot-path estimate for total mmap’d extension memory (covers ResolutionIndex + OS page table overhead). Then add ~80 MB per backend for heap structures (TokenIndex). The mmap’d portion is shared across all backends.
⚠️ Peak memory during build()/vacuum(): During a rebuild, the old graph continues serving queries via mmap while the new CSR is constructed in memory, then written to .graph.tmp and atomically renamed. This means peak RAM usage during build() is approximately 2× the hot-path footprint — the old mmap’d graph plus the new in-memory graph. Plan accordingly: a 5 GB graph requires ~10 GB of available RAM during rebuilds.
Multi-connection formula: For production capacity planning:
Total RAM = (mmap_total × 1) + (heap_warm × max_connections) + shared_buffers + OS. The mmap_total includes NodeStore + EdgeStore + ResolutionIndex (shared, paid once). The heap_warm is the TokenIndex only (~2–8 bytes/node, paid per backend). Example: a 10M-node graph with 50 connections → (1.3 GB × 1) + (80 MB × 50) + shared_buffers ≈ 5.3 GB + shared_buffers. Compare to the old HashMap approach: ~17.2 GB.
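As a sanity check, the formula can be coded up directly. A rough estimator, assuming the per-node and per-edge costs listed above (an estimate only; it ignores shared_buffers, the OS, and FilterIndex columns):
// Rough RAM estimate in bytes for the multi-connection formula above.
fn estimate_ram_bytes(nodes: u64, edges: u64, weighted: bool, max_connections: u64) -> u64 {
    let per_edge: u64 = if weighted { 9 } else { 5 };
    let hot = nodes * 12 + nodes / 8 + edges * per_edge; // NodeStore + EdgeStore
    let resolution = nodes * 16;                         // ResolutionIndex (mmap'd, shared)
    let mmap_total = hot + resolution;                   // paid once
    let heap_warm = nodes * 8;                           // TokenIndex upper bound, per backend
    mmap_total + heap_warm * max_connections
}

// estimate_ram_bytes(10_000_000, 200_000_000, false, 50)
// ≈ 1.28 GB (mmap) + 50 × 80 MB (heap) ≈ 5.3 GB, matching the example above.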
.graph File Format
The graph is persisted to disk for crash recovery. The file format is designed for zero-copy mmap:
graph.graph layout:
┌────────────────────────────────────┐
│ Header (40 bytes)                  │
│   magic: [u8; 4] = b"PGGH"         │
│   version: u8 = 0x01 (32-bit)      │
│   flags: u8                        │
│     bit 0 = has weighted edges     │
│     bit 1 = has unidirectional     │
│     bits 2–7 = reserved            │
│   reserved: [u8; 2]                │
│   node_count: u32                  │
│   edge_count: u32                  │
│   token_index_offset: u64          │
│   filter_index_offset: u64         │
│   checksum: u32 (CRC32)            │
│   padding: [u8; 4]                 │
├────────────────────────────────────┤
│ Section: edge_offsets              │
│   (node_count + 1) × u32           │
├────────────────────────────────────┤
│ Section: property_signature        │
│   node_count × u64                 │
├────────────────────────────────────┤
│ Section: is_active (packed bits)   │
│   ceil(node_count / 8) bytes       │
├────────────────────────────────────┤
│ Section: edge_targets              │
│   edge_count × u32                 │
├────────────────────────────────────┤
│ Section: edge_type_ids             │
│   edge_count × u8                  │
├────────────────────────────────────┤
│ Section: edge_weights (optional)   │
│   edge_count × u32 (if weighted)   │
│   (absent if no weighted edges)    │
├────────────────────────────────────┤
│ Section: filter_index              │
│   filter_column_count: u16         │
│   for each filter column:          │
│     table_oid: u32                 │
│     column_name_len: u16           │
│     column_name: [u8; len]         │
│     values: node_count × u32       │
├────────────────────────────────────┤
│ Section: resolution_index          │
│   entry_count: u32                 │
│   entries[] (sorted by key):       │
│     table_oid: u32                 │
│     pk_hash: u64                   │
│     node_idx: u32                  │
│   (16 bytes/entry, binary search)  │
├────────────────────────────────────┤
│ Section: token_index               │
│   Bincode-serialised TokenIndex    │
├────────────────────────────────────┤
│ Section: edge_type_registry        │
│   Bincode-serialised Vec<String>   │
│   index 0 = untyped/null           │
│   user labels = indexes 1–254      │
└────────────────────────────────────┘
Integrity Rules
- On load: verify magic == b"PGGH".
- On load: verify version == 0x01.
- On load: verify the CRC32 checksum covers all sections.
- On failure: log WARNING, skip loading. Graph is empty until graph.build() is called.
- Never crash. File corruption → clear error message, not a segfault.
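A hedged sketch of those load-time checks against the 40-byte header laid out above (little-endian fields are assumed, the function name is illustrative, and the real loader also verifies the CRC32 over every section before trusting any of them):
// Validate the fixed header; return an error message instead of panicking,
// because file corruption must never crash the backend.
fn validate_header(bytes: &[u8]) -> Result<(u32, u32), String> {
    if bytes.len() < 40 {
        return Err("graph: file too short for header".into());
    }
    if &bytes[0..4] != b"PGGH" {
        return Err("graph: bad magic, not a .graph file".into());
    }
    if bytes[4] != 0x01 {
        return Err(format!("graph: unsupported version {:#04x}", bytes[4]));
    }
    let node_count = u32::from_le_bytes(bytes[8..12].try_into().unwrap());
    let edge_count = u32::from_le_bytes(bytes[12..16].try_into().unwrap());
    Ok((node_count, edge_count))
}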
Crash Recovery
Normal operation:
graph.graph exists, valid → mmap and serve
During rebuild/vacuum:
1. Write to graph.graph.tmp
2. fdatasync(graph.graph.tmp)
3. rename("graph.graph.tmp", "graph.graph") ← atomic on POSIX
4. Re-mmap the new file
Crash during rebuild:
graph.graph.tmp exists, graph.graph is intact
→ Delete .tmp, load existing .graph
→ Zero corruption, zero data loss
Crash with no .graph file:
→ Run graph.build() to reconstruct from tables
→ Tables are the source of truth
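The rebuild sequence above maps onto ordinary POSIX file operations. A minimal sketch in Rust, assuming the new image has already been serialised to a byte buffer (paths hard-coded for brevity):
use std::fs::{rename, File};
use std::io::Write;

// Write the new graph image, make it durable, then atomically swap it in.
fn swap_in_new_graph(bytes: &[u8]) -> std::io::Result<()> {
    let mut tmp = File::create("graph.graph.tmp")?;
    tmp.write_all(bytes)?;
    tmp.sync_data()?;                          // fdatasync(graph.graph.tmp)
    rename("graph.graph.tmp", "graph.graph")?; // atomic on POSIX
    // The caller re-mmaps graph.graph; existing readers keep the old mapping
    // until they reopen, so a crash at any point leaves a usable file behind.
    Ok(())
}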
mmap Behaviour
Read Path (BFS)
The .graph file’s NodeStore and EdgeStore sections are mmap’d directly. The BFS hot loop reads from these mmap’d arrays with zero-copy:
// The BFS reads mmap'd memory directly — no deserialization
let sig = unsafe { *self.property_signatures.add(node_idx) };
if (sig & bloom_mask) != bloom_mask { continue; }
Page Faults
On first access after startup, the OS will page-fault each accessed page of the mmap’d file into RAM. This causes a brief warm-up period:
| Graph Size | .graph File Size | Warm-up Time (NVMe) |
|---|---|---|
| 1M nodes | ~120 MB | ~0.5s |
| 10M nodes | ~1.2 GB | ~3s |
| 50M nodes | ~6 GB | ~12s |
Mitigation: Set graph.auto_load = on to madvise(MADV_WILLNEED) the entire file at extension load time, pre-populating the page cache before the first query arrives.
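That mitigation boils down to a single madvise call on the mapping. A hedged sketch using the raw libc binding (the extension’s internal wrapper may differ):
// Ask the kernel to begin faulting the whole mapping in ahead of first use.
// `addr`/`len` describe the mmap'd .graph region.
unsafe fn prefault(addr: *mut libc::c_void, len: usize) -> std::io::Result<()> {
    if libc::madvise(addr, len, libc::MADV_WILLNEED) != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}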
Memory Pressure
If the OS is under memory pressure, it may evict mmap’d pages. This causes page faults during BFS, which can spike latency from microseconds to milliseconds.
Mitigation for production:
- Ensure the machine has enough RAM for the graph + Postgres shared buffers + OS.
- Use mlock() on the hot-path sections (edge_offsets, property_signatures, edge_targets) to pin them in RAM; see the sketch below. This requires the Postgres process to have CAP_IPC_LOCK or a sufficient RLIMIT_MEMLOCK.
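A matching sketch for the mlock mitigation, again via the raw libc binding (illustrative; a failed call typically surfaces as EPERM or ENOMEM when the lock limit is too low):
// Pin a hot-path byte range so the kernel cannot evict it under pressure.
unsafe fn pin_region(addr: *const libc::c_void, len: usize) -> std::io::Result<()> {
    if libc::mlock(addr, len) != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}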
OOM Handling & Graceful Degradation
graph enforces a hard memory limit via graph.memory_limit_mb. If an operation would exceed this limit, graph returns a clear SQL error instead of crashing Postgres.
Behaviour by Operation
| Operation | OOM Behaviour |
|---|---|
| graph.build() | Pre-checks estimated RAM. Returns ERROR before allocating. |
| graph.traverse() | Circuit breakers (max_nodes, max_frontier) prevent runaway expansion. |
| graph.shortest_path() | Bidirectional BFS bounded by max_nodes. |
| Trigger sync | If edge buffer full → WARNING + enters read-only mode. |
| Vacuum | If CSR rebuild would exceed limit → ERROR, old graph continues serving. |
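The traversal circuit breakers in that table amount to cheap counter checks inside the BFS loop. A simplified sketch, assuming a neighbour callback and in-memory bookkeeping (the real traversal works over the CSR arrays shown earlier):
use std::collections::{HashSet, VecDeque};

// Breadth-first expansion that stops cleanly once either bound is hit.
fn bounded_bfs(
    neighbours: impl Fn(u32) -> Vec<u32>,
    start: u32,
    max_nodes: usize,
    max_frontier: usize,
) -> Result<usize, &'static str> {
    let mut visited: HashSet<u32> = HashSet::from([start]);
    let mut frontier: VecDeque<u32> = VecDeque::from([start]);
    while let Some(node) = frontier.pop_front() {
        for next in neighbours(node) {
            if !visited.insert(next) {
                continue; // already seen
            }
            frontier.push_back(next);
            if visited.len() > max_nodes {
                return Err("graph: max_nodes exceeded (circuit breaker)");
            }
            if frontier.len() > max_frontier {
                return Err("graph: max_frontier exceeded (circuit breaker)");
            }
        }
    }
    Ok(visited.len()) // nodes reached within the configured bounds
}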
Read-Only Mode
When the edge buffer fills (graph.edge_buffer_size exceeded) or memory is critically low:
- graph stops accepting new mutations via triggers.
- Existing graph continues serving queries at full speed.
- graph.status() reports sync_status = 'read_only'.
- A WARNING is logged: graph: entering read-only mode (edge buffer full).
- Recovery: call graph.vacuum() or graph.build() to return to normal.
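A hedged sketch of that degradation path, assuming a simple in-memory edge buffer (the names are illustrative; in the extension the message is raised as a Postgres WARNING rather than returned as a Rust error):
// Buffered edge mutations flip the graph to read-only instead of failing hard.
enum SyncStatus { Normal, ReadOnly }

struct EdgeBuffer {
    pending: Vec<(u32, u32)>, // (source, target) edges awaiting the next vacuum
    capacity: usize,          // graph.edge_buffer_size
    status: SyncStatus,
}

impl EdgeBuffer {
    fn push(&mut self, edge: (u32, u32)) -> Result<(), &'static str> {
        if self.pending.len() >= self.capacity {
            self.status = SyncStatus::ReadOnly;
            return Err("graph: entering read-only mode (edge buffer full)");
        }
        self.pending.push(edge);
        Ok(())
    }
}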
Configuration for Different Workloads
Small Database (< 1M nodes, shared hosting)
graph.memory_limit_mb = 512
graph.auto_load = off # Don't waste RAM if rarely queried
graph.sync_mode = 'trigger'
graph.vacuum_interval_secs = 300 # Vacuum every 5 minutes
graph.max_token_length = 128
Medium Database (1M–10M nodes, dedicated server)
graph.memory_limit_mb = 2048
graph.auto_load = on
graph.sync_mode = 'trigger'
graph.vacuum_interval_secs = 60
graph.max_token_length = 128
Large Database (10M–100M nodes, high-memory server)
graph.memory_limit_mb = 16384
graph.auto_load = on
graph.sync_mode = 'wal' # Zero write overhead
graph.vacuum_interval_secs = 30
graph.max_token_length = 64 # Reduce to save RAM
AI Agent Workload (latency-sensitive, many concurrent queries)
graph.memory_limit_mb = 8192
graph.auto_load = on
graph.sync_mode = 'wal'
graph.vacuum_interval_secs = 10 # Near-real-time edge freshness
graph.default_max_depth = 10 # Deep traversals for AI context
graph.max_nodes = 50000 # Bound per-query cost
graph.oom_action = 'readonly' # Don't error, just degrade