
Memory Model & Capacity Planning

Memory Architecture

graph uses a layered memory model designed for maximum BFS throughput while minimising impact on the Postgres shared buffer pool.

Key Principle: Graph Memory is Separate from Postgres Shared Buffers

The graph index does NOT use Postgres shared buffers (shared_buffers). It lives in the Postgres backend’s private address space via mmap. This means:

  1. No contention with your table/index buffer cache.
  2. No LWLock overhead for graph reads.
  3. OS-managed page sharing — multiple backends reading the same .graph file share physical RAM pages via the kernel’s page cache.
```
Postgres Process Memory
│
├── Shared memory
│   ├── shared_buffers (128 MB+)   ← your table data
│   ├── WAL buffers
│   └── Lock tables
│
└── Backend private memory
    ├── graph mmap (graph.graph)   ← SHARED via OS page cache
    │   ├── NodeStore arrays       ← zero-copy
    │   ├── EdgeStore CSR          ← zero-copy
    │   └── ResolutionIndex        ← sorted flat array
    └── TokenIndex (heap)          ← process-local
        └── RoaringBitmap          ← deserialized on first load
```

⚠️ Per-Backend Memory: TokenIndex Only. When loaded from the .graph file, the NodeStore arrays, EdgeStore CSR, and ResolutionIndex are all served directly from mmap’d memory — zero per-process heap allocation, shared across all backends via the OS page cache. The only per-process duplication is the TokenIndex (RoaringBitmap posting lists, ~2–8 bytes/node). At 10M nodes with 50 connections, that duplication costs ~80 MB per backend, ~4 GB in total — not the ~59 GB it would be if NodeStore and EdgeStore were also duplicated.


Per-Node Memory Cost

Hot Path (BFS traversal reads these arrays)

| Array | Width | Purpose |
|---|---|---|
| `edge_offsets[]` | 4 bytes (u32) | CSR offset into edge array |
| `property_signature[]` | 8 bytes (u64) | Bloom filter signature |
| `is_active[]` | 0.125 bytes (bit) | Tombstone flag |

Hot path total: ~12.1 bytes/node

Warm Path — mmap’d (shared via OS page cache)

| Structure | Width | Purpose |
|---|---|---|
| ResolutionIndex | 16 bytes/node | (table_oid, pk_hash, node_idx) sorted flat array |

The ResolutionIndex is a sorted array of fixed-size 16-byte entries stored directly in the .graph file and mmap’d. All backends share the same physical memory pages — no deserialization, no per-process heap allocation. Lookups use binary search: O(log n), ~115ns for 10M nodes.
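The lookup described above can be sketched as a plain binary search over a sorted slice. The struct and function names below are illustrative, not the extension's internals; the real entries are packed 16-byte records read straight from the mmap'd file.

```rust
// Hypothetical sketch of a ResolutionIndex lookup. Entries are kept sorted
// by (table_oid, pk_hash), so resolving a row to a node index is a binary
// search over the flat array — no hashing, no per-process heap.

struct ResolutionEntry {
    table_oid: u32,
    pk_hash: u64,
    node_idx: u32,
}

fn resolve(index: &[ResolutionEntry], table_oid: u32, pk_hash: u64) -> Option<u32> {
    index
        .binary_search_by(|e| (e.table_oid, e.pk_hash).cmp(&(table_oid, pk_hash)))
        .ok()
        .map(|i| index[i].node_idx)
}

fn main() {
    // Already sorted by (table_oid, pk_hash), as the .graph file guarantees.
    let index = [
        ResolutionEntry { table_oid: 16384, pk_hash: 7, node_idx: 0 },
        ResolutionEntry { table_oid: 16384, pk_hash: 42, node_idx: 1 },
        ResolutionEntry { table_oid: 16391, pk_hash: 3, node_idx: 2 },
    ];
    assert_eq!(resolve(&index, 16384, 42), Some(1));
    assert_eq!(resolve(&index, 16391, 99), None); // unknown key: no node
}
```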

Warm Path — heap (per-process, duplicated per backend)

| Structure | Estimated Cost | Purpose |
|---|---|---|
| TokenIndex posting lists | ~2–8 bytes/node (amortised) | Property search |
| FilterIndex columns | 4 bytes/node/column (u32) | Numeric column value for range checks |

Note: The Bloom/property signature is already counted in the hot path above — it is the same property_signature[] array.

Heap warm path (per-process): ~2–8 bytes/node (plus 4 bytes per registered filter column)

FilterIndex sparsity warning: FilterIndex arrays are indexed by global node_idx, not per-table. Registering invoices.amount allocates 4 bytes for every node in the graph — not just invoices. For a graph with 10M total nodes and 500K invoices, 9.5M entries are zeroed (wasting ~38 MB). Only register filter columns on dense/common tables, or accept the overhead.
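The sparsity arithmetic above can be checked directly. This is only a back-of-envelope helper, not part of the extension:

```rust
// Illustrative check of the FilterIndex sparsity warning: a registered
// filter column allocates 4 bytes (u32) for every node in the graph,
// so non-member nodes are pure overhead.

fn filter_index_waste_bytes(total_nodes: u64, table_nodes: u64) -> u64 {
    (total_nodes - table_nodes) * 4 // zeroed u32 entries for non-members
}

fn main() {
    // 10M-node graph, 500K invoices → ~38 MB of zeroed entries per column.
    assert_eq!(filter_index_waste_bytes(10_000_000, 500_000), 38_000_000);
}
```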

Per-Edge Memory Cost

| Array | Width | Purpose |
|---|---|---|
| `targets[]` | 4 bytes (u32) | Neighbour node index |
| `type_ids[]` | 1 byte (u8) | Edge type ID |
| `weights[]` | 4 bytes (u32) | Edge weight (optional, Dijkstra only) |

Unweighted edges: 5 bytes/edge (BFS traversals — no weights[] array loaded)
Weighted edges: 9 bytes/edge (when weight_column is registered on any edge)
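The CSR layout these arrays imply can be sketched in a few lines (function names are illustrative): node `i`'s neighbours occupy the half-open range `edge_offsets[i] .. edge_offsets[i + 1]` of `targets`.

```rust
// Minimal sketch of CSR neighbour iteration, assuming edge_offsets has
// node_count + 1 entries so the last node's range is well-defined.

fn neighbours<'a>(edge_offsets: &[u32], targets: &'a [u32], node: usize) -> &'a [u32] {
    let start = edge_offsets[node] as usize;
    let end = edge_offsets[node + 1] as usize;
    &targets[start..end]
}

fn main() {
    // 3 nodes: node 0 -> {1, 2}, node 1 -> {2}, node 2 -> {}
    let edge_offsets = [0u32, 2, 3, 3];
    let targets = [1u32, 2, 2];
    assert_eq!(neighbours(&edge_offsets, &targets, 0).to_vec(), vec![1, 2]);
    assert_eq!(neighbours(&edge_offsets, &targets, 2).len(), 0);
}
```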


Capacity Planning Table

| Scale | Nodes | Edges | Hot Path | Total RAM | Machine |
|---|---|---|---|---|---|
| Small | 100K | 2M | 11 MB | ~30 MB | Any |
| Medium | 1M | 20M | 112 MB | ~300 MB | 4 GB+ |
| Large | 10M | 200M | 1.1 GB | ~3 GB | 16 GB+ |
| Enterprise | 50M | 1B | 5.6 GB | ~15 GB | 64 GB+ |
| Max (32-bit) | 500M | 10B | 56 GB | ~150 GB | 256 GB+ |

Rule of thumb: Allocate ~1.5× the hot-path estimate for total mmap’d extension memory (covers ResolutionIndex + OS page table overhead). Then add ~80 MB per backend for heap structures (TokenIndex). The mmap’d portion is shared across all backends.

⚠️ Peak memory during build() / vacuum(): During a rebuild, the old graph continues serving queries via mmap while the new CSR is constructed in memory, then written to .graph.tmp, and atomically renamed. This means peak RAM usage during build() is approximately 2× the hot-path footprint — the old mmap’d graph plus the new in-memory graph. Plan accordingly: a 5 GB graph requires ~10 GB available RAM during rebuilds.

Multi-connection formula: For production capacity planning: Total RAM = (mmap_total × 1) + (heap_warm × max_connections) + shared_buffers + OS. The mmap_total includes NodeStore + EdgeStore + ResolutionIndex (shared, paid once). The heap_warm is TokenIndex only (~2–8 bytes/node, paid per-backend). Example: 10M-node graph with 50 connections → (1.3 GB × 1) + (80 MB × 50) + shared_buffers ≈ 5.3 GB + shared_buffers. Compare to the old HashMap approach: ~17.2 GB.
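The formula lends itself to a tiny calculator. This is a hedged sketch (names are illustrative); shared_buffers and OS overhead are left to the operator to add on top:

```rust
// Capacity estimate per the multi-connection formula: mmap'd structures are
// paid once, heap warm-path (TokenIndex) is paid once per backend.

fn graph_ram_bytes(mmap_total: u64, heap_warm_per_backend: u64, max_connections: u64) -> u64 {
    mmap_total + heap_warm_per_backend * max_connections
}

fn main() {
    // Worked example from the text: 10M-node graph, 50 connections.
    let total = graph_ram_bytes(1_300_000_000, 80_000_000, 50);
    assert_eq!(total, 5_300_000_000); // ≈ 5.3 GB, before shared_buffers + OS
}
```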


.graph File Format

The graph is persisted to disk for crash recovery. The file format is designed for zero-copy mmap:

```
graph.graph layout:
┌──────────────────────────────────┐
│ Header (40 bytes)                │
│   magic: [u8; 4] = b"PGGH"       │
│   version: u8 = 0x01 (32-bit)    │
│   flags: u8                      │
│     bit 0 = has weighted edges   │
│     bit 1 = has unidirectional   │
│     bits 2–7 = reserved          │
│   reserved: [u8; 2]              │
│   node_count: u32                │
│   edge_count: u32                │
│   token_index_offset: u64        │
│   filter_index_offset: u64       │
│   checksum: u32 (CRC32)          │
│   padding: [u8; 4]               │
├──────────────────────────────────┤
│ Section: edge_offsets            │
│   (node_count + 1) × u32         │
├──────────────────────────────────┤
│ Section: property_signature      │
│   node_count × u64               │
├──────────────────────────────────┤
│ Section: is_active (packed bits) │
│   ceil(node_count / 8) bytes     │
├──────────────────────────────────┤
│ Section: edge_targets            │
│   edge_count × u32               │
├──────────────────────────────────┤
│ Section: edge_type_ids           │
│   edge_count × u8                │
├──────────────────────────────────┤
│ Section: edge_weights (optional) │
│   edge_count × u32 (if weighted) │
│   (absent if no weighted edges)  │
├──────────────────────────────────┤
│ Section: filter_index            │
│   filter_column_count: u16       │
│   for each filter column:        │
│     table_oid: u32               │
│     column_name_len: u16         │
│     column_name: [u8; len]       │
│     values: node_count × u32     │
├──────────────────────────────────┤
│ Section: resolution_index        │
│   entry_count: u32               │
│   entries[] (sorted by key):     │
│     table_oid: u32               │
│     pk_hash: u64                 │
│     node_idx: u32                │
│ (16 bytes/entry, binary search)  │
├──────────────────────────────────┤
│ Section: token_index             │
│   Bincode-serialised TokenIndex  │
├──────────────────────────────────┤
│ Section: edge_type_registry      │
│   Bincode-serialised Vec<String> │
│   index 0 = untyped/null         │
│   user labels = indexes 1–254    │
└──────────────────────────────────┘
```

Integrity Rules

  1. On load: verify magic == b"PGGH".
  2. On load: verify version == 0x01.
  3. On load: verify CRC32 checksum covers all sections.
  4. On failure: log WARNING, skip loading. Graph is empty until graph.build() is called.
  5. Never crash. File corruption → clear error message, not a segfault.
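The load-time checks above can be sketched as follows. The field offsets follow the header diagram; the CRC32 variant, the exact byte range the checksum covers, and how the stored checksum is extracted are assumptions here, not the real implementation:

```rust
// Illustrative load-time validation: bad magic, wrong version, or a
// checksum mismatch all produce an error the caller can log as a WARNING,
// leaving the graph empty until graph.build() is called. Never a segfault.

fn crc32(data: &[u8]) -> u32 {
    // Bitwise CRC32 (IEEE polynomial), enough for a self-contained sketch.
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

fn validate_header(file: &[u8], body_crc: u32) -> Result<(), String> {
    if file.len() < 40 || &file[0..4] != b"PGGH" {
        return Err("graph: bad magic or truncated file, skipping load".into());
    }
    if file[4] != 0x01 {
        return Err(format!("graph: unsupported version {}", file[4]));
    }
    if crc32(&file[40..]) != body_crc {
        return Err("graph: checksum mismatch, skipping load".into());
    }
    Ok(())
}

fn main() {
    let mut file = vec![0u8; 40]; // 40-byte header
    file[0..4].copy_from_slice(b"PGGH");
    file[4] = 0x01;
    file.extend_from_slice(b"section bytes");
    let crc = crc32(b"section bytes");
    assert!(validate_header(&file, crc).is_ok());
    assert!(validate_header(b"XXXX", 0).is_err()); // corruption → error, not crash
}
```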

Crash Recovery

Normal operation: graph.graph exists and is valid → mmap and serve.

During rebuild/vacuum:

  1. Write to graph.graph.tmp
  2. fdatasync(graph.graph.tmp)
  3. rename("graph.graph.tmp", "graph.graph") ← atomic on POSIX
  4. Re-mmap the new file

Crash during rebuild: graph.graph.tmp exists and graph.graph is intact → delete the .tmp, load the existing .graph. Zero corruption, zero data loss.

Crash with no .graph file: run graph.build() to reconstruct from the tables — tables are the source of truth.
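The write-tmp / fdatasync / rename sequence can be sketched with the standard library (`sync_data` maps to fdatasync on Linux; `rename` is atomic on POSIX when source and destination share a filesystem). Function and file names here are illustrative:

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

// Sketch of an atomic file replacement: readers mmap'ing the old file keep
// serving it; a crash at any point leaves either the old or the new file
// intact, never a torn one.

fn atomic_write(path: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("graph.tmp");
    let mut f = File::create(&tmp)?;
    f.write_all(bytes)?;
    f.sync_data()?; // fdatasync: data durable before the swap
    fs::rename(&tmp, path)?; // atomic on POSIX
    Ok(())
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("demo.graph");
    atomic_write(&path, b"old")?;
    atomic_write(&path, b"new")?; // a crash before rename would leave "old"
    assert_eq!(fs::read(&path)?, b"new");
    fs::remove_file(&path)?;
    Ok(())
}
```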

mmap Behaviour

Read Path (BFS)

The .graph file’s NodeStore and EdgeStore sections are mmap’d directly. The BFS hot loop reads from these mmap’d arrays with zero-copy:

```rust
// The BFS reads mmap'd memory directly — no deserialization
let sig = unsafe { *self.property_signatures.add(node_idx) };
if (sig & bloom_mask) != bloom_mask {
    continue;
}
```
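For intuition, a 64-bit property signature can be modelled as a tiny Bloom filter: each token hashes to one of 64 bits, and a node can be skipped when its signature lacks a bit the query requires. The hash function and bit count below are assumptions for illustration, not the extension's internals:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical 64-bit Bloom signature: each token sets one bit.
fn token_bit(token: &str) -> u64 {
    let mut h = DefaultHasher::new();
    token.hash(&mut h);
    1u64 << (h.finish() % 64)
}

// A node's signature is the OR of its tokens' bits.
fn signature(tokens: &[&str]) -> u64 {
    tokens.iter().fold(0, |sig, t| sig | token_bit(t))
}

fn main() {
    let sig = signature(&["status:open", "region:eu"]);
    let mask = token_bit("status:open");
    // Missing bit → definite non-match, node skipped without touching the row.
    // All bits present → possible match (Bloom filters allow false positives).
    assert_eq!(sig & mask, mask);
    assert_eq!(signature(&[]) & mask, 0);
}
```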

Page Faults

On first access after startup, the OS will page-fault each accessed page of the mmap’d file into RAM. This causes a brief warm-up period:

| Graph Size | .graph File Size | Warm-up Time (NVMe) |
|---|---|---|
| 1M nodes | ~120 MB | ~0.5 s |
| 10M nodes | ~1.2 GB | ~3 s |
| 50M nodes | ~6 GB | ~12 s |

Mitigation: Set graph.auto_load = on to madvise(MADV_WILLNEED) the entire file at extension load time, pre-populating the page cache before the first query arrives.

Memory Pressure

If the OS is under memory pressure, it may evict mmap’d pages. This causes page faults during BFS, which can spike latency from microseconds to milliseconds.

Mitigation for production:

  1. Ensure the machine has enough RAM for the graph + Postgres shared buffers + OS.
  2. Use mlock() on the hot-path sections (edge_offsets, property_signatures, edge_targets) to pin them in RAM. This requires the Postgres process to have CAP_IPC_LOCK or sufficient RLIMIT_MEMLOCK.

OOM Handling & Graceful Degradation

graph enforces a hard memory limit via graph.memory_limit_mb. If an operation would exceed this limit, graph returns a clear SQL error instead of crashing Postgres.

Behaviour by Operation

| Operation | OOM Behaviour |
|---|---|
| graph.build() | Pre-checks estimated RAM. Returns ERROR before allocating. |
| graph.traverse() | Circuit breakers (max_nodes, max_frontier) prevent runaway expansion. |
| graph.shortest_path() | Bidirectional BFS bounded by max_nodes. |
| Trigger sync | If edge buffer full → WARNING + enters read-only mode. |
| Vacuum | If CSR rebuild would exceed limit → ERROR, old graph continues serving. |

Read-Only Mode

When the edge buffer fills (graph.edge_buffer_size exceeded) or memory is critically low:

  1. graph stops accepting new mutations via triggers.
  2. Existing graph continues serving queries at full speed.
  3. graph.status() reports sync_status = 'read_only'.
  4. A WARNING is logged: graph: entering read-only mode (edge buffer full).
  5. Recovery: call graph.vacuum() or graph.build() to return to normal.

Configuration for Different Workloads

Small Database (< 1M nodes, shared hosting)

```
graph.memory_limit_mb = 512
graph.auto_load = off             # Don't waste RAM if rarely queried
graph.sync_mode = 'trigger'
graph.vacuum_interval_secs = 300  # Vacuum every 5 minutes
graph.max_token_length = 128
```

Medium Database (1M–10M nodes, dedicated server)

```
graph.memory_limit_mb = 2048
graph.auto_load = on
graph.sync_mode = 'trigger'
graph.vacuum_interval_secs = 60
graph.max_token_length = 128
```

Large Database (10M–100M nodes, high-memory server)

```
graph.memory_limit_mb = 16384
graph.auto_load = on
graph.sync_mode = 'wal'           # Zero write overhead
graph.vacuum_interval_secs = 30
graph.max_token_length = 64       # Reduce to save RAM
```

AI Agent Workload (latency-sensitive, many concurrent queries)

```
graph.memory_limit_mb = 8192
graph.auto_load = on
graph.sync_mode = 'wal'
graph.vacuum_interval_secs = 10   # Near-real-time edge freshness
graph.default_max_depth = 10      # Deep traversals for AI context
graph.max_nodes = 50000           # Bound per-query cost
graph.oom_action = 'readonly'     # Don't error, just degrade
```