Memory Model & Capacity Planning
Memory Architecture
graph uses a layered memory model designed for maximum BFS throughput while minimising impact on the Postgres shared buffer pool.
Key Principle: Graph Memory is Separate from Postgres Shared Buffers
The graph index does NOT use Postgres shared buffers (shared_buffers). It lives in the Postgres backend’s private address space via mmap. This means:
- No contention with your table/index buffer cache.
- No LWLock overhead for graph reads.
- OS-managed page sharing — multiple backends reading the same .graph file share physical RAM pages via the kernel’s page cache.
┌──────────────────────────────────────────────────────┐
│ Postgres Process Memory                              │
│                                                      │
│  ┌───────────────────────────┐                       │
│  │ shared_buffers (128 MB+)  │  ← Your table data    │
│  │ WAL buffers               │                       │
│  │ Lock tables               │                       │
│  └───────────────────────────┘                       │
│                                                      │
│  ┌───────────────────────────┐                       │
│  │ Backend private memory    │                       │
│  │  ┌─────────────────────┐  │                       │
│  │  │ graph mmap          │  │  ← SHARED via OS      │
│  │  │ (graph.graph)       │  │    page cache         │
│  │  │                     │  │                       │
│  │  │ NodeStore arrays    │  │  ← Zero-copy          │
│  │  │ EdgeStore CSR       │  │                       │
│  │  │ ResolutionIndex     │  │    Sorted flat array  │
│  │  └─────────────────────┘  │                       │
│  │  ┌─────────────────────┐  │                       │
│  │  │ TokenIndex (heap)   │  │  ← Process-local      │
│  │  │ RoaringBitmap       │  │    (deserialized on   │
│  │  └─────────────────────┘  │    first load)        │
│  └───────────────────────────┘                       │
└──────────────────────────────────────────────────────┘
⚠️ Per-Backend Memory: TokenIndex Only. When loaded from the .graph file, NodeStore arrays, EdgeStore CSR, and ResolutionIndex are all served directly from mmap’d memory — zero per-process heap allocation, shared across all backends via the OS page cache. The only per-process duplication is the TokenIndex (RoaringBitmap posting lists, ~2–8 bytes/node). At 10M nodes with 50 connections, the total duplicated heap across all backends is ~4 GB (TokenIndex only) — not the ~59 GB it would be if NodeStore and EdgeStore were also duplicated.
Per-Node Memory Cost
Hot Path (BFS traversal reads these arrays)
| Array | Width | Purpose |
|---|---|---|
| edge_offsets[] | 4 bytes (u32) | CSR offset into edge array |
| property_signature[] | 8 bytes (u64) | Bloom filter signature |
| is_active[] | 0.125 bytes (bit) | Tombstone flag |
Hot path total: ~12.1 bytes/node
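The hot path is a plain struct-of-arrays layout: one flat array per field, indexed by node_idx. A minimal sketch of that layout, assuming the arrays are exposed as slices over the mmap’d sections (the names follow the table above and are illustrative, not the extension’s actual types):
// Illustrative struct-of-arrays view over the mmap'd hot-path sections.
struct NodeStoreView<'a> {
    edge_offsets: &'a [u32],        // node_count + 1 entries: CSR offsets
    property_signatures: &'a [u64], // one Bloom signature per node
    is_active: &'a [u8],            // packed tombstone bits, 1 bit per node
}

impl<'a> NodeStoreView<'a> {
    // 4 + 8 + 0.125 bytes per node, matching the ~12.1 bytes/node total above.
    fn is_live(&self, node_idx: usize) -> bool {
        (self.is_active[node_idx / 8] >> (node_idx % 8)) & 1 == 1
    }
}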
Warm Path — mmap’d (shared via OS page cache)
| Structure | Width | Purpose |
|---|---|---|
| ResolutionIndex | 16 bytes/node | (table_oid, pk_hash, node_idx) sorted flat array |
The ResolutionIndex is a sorted array of fixed-size 16-byte entries stored directly in the .graph file and mmap’d. All backends share the same physical memory pages — no deserialization, no per-process heap allocation. Lookups use binary search: O(log n), ~115 ns for 10M nodes.
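A minimal sketch of that lookup, assuming the entries are exposed as a slice sorted by (table_oid, pk_hash); the tuple layout here is illustrative rather than the on-disk 16-byte record:
// Binary-search the sorted ResolutionIndex for (table_oid, pk_hash).
fn resolve(entries: &[(u32, u64, u32)], table_oid: u32, pk_hash: u64) -> Option<u32> {
    entries
        .binary_search_by(|&(oid, hash, _)| (oid, hash).cmp(&(table_oid, pk_hash)))
        .ok()
        .map(|i| entries[i].2) // node_idx
}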
Warm Path — heap (per-process, duplicated per backend)
| Structure | Estimated Cost | Purpose |
|---|---|---|
| TokenIndex posting lists | ~2–8 bytes/node (amortised) | Property search |
| FilterIndex columns | 4 bytes/node/column (u32) | Numeric column value for range checks |
Note: The Bloom/property signature is already counted in the hot path above — it is the same property_signature[] array.
Heap warm path (per-process): ~2–8 bytes/node, plus 4 bytes/node for each registered filter column
FilterIndex sparsity warning: FilterIndex arrays are indexed by global node_idx, not per-table. Registering invoices.amount allocates 4 bytes for every node in the graph — not just invoice nodes. For a graph with 10M total nodes and 500K invoices, 9.5M entries are zeroed (wasting ~38 MB). Only register filter columns on dense/common tables, or accept the overhead.
Per-Edge Memory Cost
| Array | Width | Purpose |
|---|---|---|
| targets[] | 4 bytes (u32) | Neighbour node index |
| type_ids[] | 1 byte (u8) | Edge type ID |
| weights[] | 4 bytes (u32) | Edge weight (optional, Dijkstra only) |
Unweighted edges: 5 bytes/edge (BFS traversals — no weights[] array loaded)
Weighted edges: 9 bytes/edge (when weight_column is registered on any edge)
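For context, this is how the per-node and per-edge arrays fit together during traversal. A hedged sketch of CSR neighbour iteration, assuming slice views over the mmap’d sections (the real hot loop reads raw pointers, as shown later under Read Path):
// Walk the neighbours of `node_idx` via the CSR offset array.
// edge_offsets has node_count + 1 entries, so node_idx + 1 is always in bounds.
fn neighbours<'a>(
    edge_offsets: &'a [u32],
    targets: &'a [u32],
    type_ids: &'a [u8],
    node_idx: usize,
) -> impl Iterator<Item = (u32, u8)> + 'a {
    let start = edge_offsets[node_idx] as usize;
    let end = edge_offsets[node_idx + 1] as usize;
    targets[start..end]
        .iter()
        .copied()
        .zip(type_ids[start..end].iter().copied())
}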
Capacity Planning Table
| Scale | Nodes | Edges | Hot-Path RAM | Total RAM | Machine RAM |
|---|---|---|---|---|---|
| Small | 100K | 2M | 11 MB | ~30 MB | Any |
| Medium | 1M | 20M | 112 MB | ~300 MB | 4 GB+ |
| Large | 10M | 200M | 1.1 GB | ~3 GB | 16 GB+ |
| Enterprise | 50M | 1B | 5.6 GB | ~15 GB | 64 GB+ |
| Max (32-bit) | 500M | 10B | 56 GB | ~150 GB | 256 GB+ |
Rule of thumb: Allocate ~1.5× the hot-path estimate for total mmap’d extension memory (covers ResolutionIndex + OS page table overhead). Then add ~80 MB per backend for heap structures (TokenIndex). The mmap’d portion is shared across all backends.
⚠️ Peak memory during build()/vacuum(): During a rebuild, the old graph continues serving queries via mmap while the new CSR is constructed in memory, then written to .graph.tmp and atomically renamed. This means peak RAM usage during build() is approximately 2× the hot-path footprint — the old mmap’d graph plus the new in-memory graph. Plan accordingly: a 5 GB graph requires ~10 GB of available RAM during rebuilds.
Multi-connection formula: For production capacity planning:
Total RAM = (mmap_total × 1) + (heap_warm × max_connections) + shared_buffers + OS. The mmap_total includes NodeStore + EdgeStore + ResolutionIndex (shared, paid once). The heap_warm is the TokenIndex only (~2–8 bytes/node, paid per backend). Example: a 10M-node graph with 50 connections → (1.3 GB × 1) + (80 MB × 50) + shared_buffers ≈ 5.3 GB + shared_buffers. Compare to the old HashMap approach: ~17.2 GB.
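As a sanity check, the formula can be coded up directly. A rough estimator, assuming the per-node and per-edge costs listed above (an estimate only; it ignores shared_buffers, the OS, and FilterIndex columns):
// Rough RAM estimate in bytes for the multi-connection formula above.
fn estimate_ram_bytes(nodes: u64, edges: u64, weighted: bool, max_connections: u64) -> u64 {
    let per_edge: u64 = if weighted { 9 } else { 5 };
    let hot = nodes * 12 + nodes / 8 + edges * per_edge; // NodeStore + EdgeStore
    let resolution = nodes * 16;                         // ResolutionIndex (mmap'd, shared)
    let mmap_total = hot + resolution;                   // paid once
    let heap_warm = nodes * 8;                           // TokenIndex upper bound, per backend
    mmap_total + heap_warm * max_connections
}

// estimate_ram_bytes(10_000_000, 200_000_000, false, 50)
// ≈ 1.28 GB (mmap) + 50 × 80 MB (heap) ≈ 5.3 GB, matching the example above.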
.graph File Format
The graph is persisted to disk for crash recovery. The file format is designed for zero-copy mmap:
graph.graph layout:
┌────────────────────────────────────┐
│ Header (40 bytes)                  │
│   magic: [u8; 4] = b"PGGH"         │
│   version: u8 = 0x01 (32-bit)      │
│   flags: u8                        │
│     bit 0 = has weighted edges     │
│     bit 1 = has unidirectional     │
│     bits 2–7 = reserved            │
│   reserved: [u8; 2]                │
│   node_count: u32                  │
│   edge_count: u32                  │
│   token_index_offset: u64          │
│   filter_index_offset: u64         │
│   checksum: u32 (CRC32)            │
│   padding: [u8; 4]                 │
├────────────────────────────────────┤
│ Section: edge_offsets              │
│   (node_count + 1) × u32           │
├────────────────────────────────────┤
│ Section: property_signature        │
│   node_count × u64                 │
├────────────────────────────────────┤
│ Section: is_active (packed bits)   │
│   ceil(node_count / 8) bytes       │
├────────────────────────────────────┤
│ Section: edge_targets              │
│   edge_count × u32                 │
├────────────────────────────────────┤
│ Section: edge_type_ids             │
│   edge_count × u8                  │
├────────────────────────────────────┤
│ Section: edge_weights (optional)   │
│   edge_count × u32 (if weighted)   │
│   (absent if no weighted edges)    │
├────────────────────────────────────┤
│ Section: filter_index              │
│   filter_column_count: u16         │
│   for each filter column:          │
│     table_oid: u32                 │
│     column_name_len: u16           │
│     column_name: [u8; len]         │
│     values: node_count × u32       │
├────────────────────────────────────┤
│ Section: resolution_index          │
│   entry_count: u32                 │
│   entries[] (sorted by key):       │
│     table_oid: u32                 │
│     pk_hash: u64                   │
│     node_idx: u32                  │
│   (16 bytes/entry, binary search)  │
├────────────────────────────────────┤
│ Section: token_index               │
│   Bincode-serialised TokenIndex    │
├────────────────────────────────────┤
│ Section: edge_type_registry        │
│   Bincode-serialised Vec<String>   │
│   index 0 = untyped/null           │
│   user labels = indexes 1–254      │
└────────────────────────────────────┘
Integrity Rules
- On load: verify magic == b"PGGH".
- On load: verify version == 0x01.
- On load: verify the CRC32 checksum covers all sections.
- On failure: log WARNING, skip loading. Graph is empty until graph.build() is called.
- Never crash. File corruption → clear error message, not a segfault.
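A hedged sketch of those load-time checks against the 40-byte header laid out above (little-endian fields are assumed, the function name is illustrative, and the real loader also verifies the CRC32 over every section before trusting any of them):
// Validate the fixed header; return an error message instead of panicking,
// because file corruption must never crash the backend.
fn validate_header(bytes: &[u8]) -> Result<(u32, u32), String> {
    if bytes.len() < 40 {
        return Err("graph: file too short for header".into());
    }
    if &bytes[0..4] != b"PGGH" {
        return Err("graph: bad magic, not a .graph file".into());
    }
    if bytes[4] != 0x01 {
        return Err(format!("graph: unsupported version {:#04x}", bytes[4]));
    }
    let node_count = u32::from_le_bytes(bytes[8..12].try_into().unwrap());
    let edge_count = u32::from_le_bytes(bytes[12..16].try_into().unwrap());
    Ok((node_count, edge_count))
}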
Crash Recovery
Normal operation:
graph.graph exists, valid → mmap and serve
During rebuild/vacuum:
1. Write to graph.graph.tmp
2. fdatasync(graph.graph.tmp)
3. rename("graph.graph.tmp", "graph.graph") ← atomic on POSIX
4. Re-mmap the new file
Crash during rebuild:
graph.graph.tmp exists, graph.graph is intact
→ Delete .tmp, load existing .graph
→ Zero corruption, zero data loss
Crash with no .graph file:
→ Run graph.build() to reconstruct from tables
→ Tables are the source of truth
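The rebuild sequence above maps onto ordinary POSIX file operations. A minimal sketch in Rust, assuming the new image has already been serialised to a byte buffer (paths hard-coded for brevity):
use std::fs::{rename, File};
use std::io::Write;

// Write the new graph image, make it durable, then atomically swap it in.
fn swap_in_new_graph(bytes: &[u8]) -> std::io::Result<()> {
    let mut tmp = File::create("graph.graph.tmp")?;
    tmp.write_all(bytes)?;
    tmp.sync_data()?;                          // fdatasync(graph.graph.tmp)
    rename("graph.graph.tmp", "graph.graph")?; // atomic on POSIX
    // The caller re-mmaps graph.graph; existing readers keep the old mapping
    // until they reopen, so a crash at any point leaves a usable file behind.
    Ok(())
}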
mmap Behaviour
Read Path (BFS)
The .graph file’s NodeStore and EdgeStore sections are mmap’d directly. The BFS hot loop reads from these mmap’d arrays with zero-copy:
// The BFS reads mmap'd memory directly — no deserialization
let sig = unsafe { *self.property_signatures.add(node_idx) };
if (sig & bloom_mask) != bloom_mask { continue; }
Page Faults
On first access after startup, the OS will page-fault each accessed page of the mmap’d file into RAM. This causes a brief warm-up period:
| Graph Size | .graph File Size | Warm-up Time (NVMe) |
|---|---|---|
| 1M nodes | ~120 MB | ~0.5s |
| 10M nodes | ~1.2 GB | ~3s |
| 50M nodes | ~6 GB | ~12s |
Mitigation: Set graph.auto_load = on to madvise(MADV_WILLNEED) the entire file at extension load time, pre-populating the page cache before the first query arrives.
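That mitigation boils down to a single madvise call on the mapping. A hedged sketch using the raw libc binding (the extension’s internal wrapper may differ):
// Ask the kernel to begin faulting the whole mapping in ahead of first use.
// `addr`/`len` describe the mmap'd .graph region.
unsafe fn prefault(addr: *mut libc::c_void, len: usize) -> std::io::Result<()> {
    if libc::madvise(addr, len, libc::MADV_WILLNEED) != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}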
Memory Pressure
If the OS is under memory pressure, it may evict mmap’d pages. This causes page faults during BFS, which can spike latency from microseconds to milliseconds.
Mitigation for production:
- Ensure the machine has enough RAM for the graph + Postgres shared buffers + OS.
- Use mlock() on the hot-path sections (edge_offsets, property_signatures, edge_targets) to pin them in RAM; see the sketch below. This requires the Postgres process to have CAP_IPC_LOCK or a sufficient RLIMIT_MEMLOCK.
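A matching sketch for the mlock mitigation, again via the raw libc binding (illustrative; a failed call typically surfaces as EPERM or ENOMEM when the lock limit is too low):
// Pin a hot-path byte range so the kernel cannot evict it under pressure.
unsafe fn pin_region(addr: *const libc::c_void, len: usize) -> std::io::Result<()> {
    if libc::mlock(addr, len) != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}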
OOM Handling & Graceful Degradation
graph enforces a hard memory limit via graph.memory_limit_mb. If an operation would exceed this limit, graph returns a clear SQL error instead of crashing Postgres.
Behaviour by Operation
| Operation | OOM Behaviour |
|---|---|
| graph.build() | Pre-checks estimated RAM. Returns ERROR before allocating. |
| graph.traverse() | Circuit breakers (max_nodes, max_frontier) prevent runaway expansion. |
| graph.shortest_path() | Bidirectional BFS bounded by max_nodes. |
| Trigger sync | If edge buffer full → WARNING + enters read-only mode. |
| Vacuum | If CSR rebuild would exceed limit → ERROR, old graph continues serving. |
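The traversal circuit breakers in that table amount to cheap counter checks inside the BFS loop. A simplified sketch, assuming a neighbour callback and in-memory bookkeeping (the real traversal works over the CSR arrays shown earlier):
use std::collections::{HashSet, VecDeque};

// Breadth-first expansion that stops cleanly once either bound is hit.
fn bounded_bfs(
    neighbours: impl Fn(u32) -> Vec<u32>,
    start: u32,
    max_nodes: usize,
    max_frontier: usize,
) -> Result<usize, &'static str> {
    let mut visited: HashSet<u32> = HashSet::from([start]);
    let mut frontier: VecDeque<u32> = VecDeque::from([start]);
    while let Some(node) = frontier.pop_front() {
        for next in neighbours(node) {
            if !visited.insert(next) {
                continue; // already seen
            }
            frontier.push_back(next);
            if visited.len() > max_nodes {
                return Err("graph: max_nodes exceeded (circuit breaker)");
            }
            if frontier.len() > max_frontier {
                return Err("graph: max_frontier exceeded (circuit breaker)");
            }
        }
    }
    Ok(visited.len()) // nodes reached within the configured bounds
}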
Read-Only Mode
When the edge buffer fills (graph.edge_buffer_size exceeded) or memory is critically low:
- graph stops accepting new mutations via triggers.
- Existing graph continues serving queries at full speed.
- graph.status() reports sync_status = 'read_only'.
- A WARNING is logged: graph: entering read-only mode (edge buffer full).
- Recovery: call graph.vacuum() or graph.build() to return to normal.
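A hedged sketch of that degradation path, assuming a simple in-memory edge buffer (the names are illustrative; in the extension the message is raised as a Postgres WARNING rather than returned as a Rust error):
// Buffered edge mutations flip the graph to read-only instead of failing hard.
enum SyncStatus { Normal, ReadOnly }

struct EdgeBuffer {
    pending: Vec<(u32, u32)>, // (source, target) edges awaiting the next vacuum
    capacity: usize,          // graph.edge_buffer_size
    status: SyncStatus,
}

impl EdgeBuffer {
    fn push(&mut self, edge: (u32, u32)) -> Result<(), &'static str> {
        if self.pending.len() >= self.capacity {
            self.status = SyncStatus::ReadOnly;
            return Err("graph: entering read-only mode (edge buffer full)");
        }
        self.pending.push(edge);
        Ok(())
    }
}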
Configuration for Different Workloads
Small Database (< 1M nodes, shared hosting)
graph.memory_limit_mb = 512
graph.auto_load = off # Don't waste RAM if rarely queried
graph.sync_mode = 'trigger'
graph.vacuum_interval_secs = 300 # Vacuum every 5 minutes
graph.max_token_length = 128
Medium Database (1M–10M nodes, dedicated server)
graph.memory_limit_mb = 2048
graph.auto_load = on
graph.sync_mode = 'trigger'
graph.vacuum_interval_secs = 60
graph.max_token_length = 128
Large Database (10M–100M nodes, high-memory server)
graph.memory_limit_mb = 16384
graph.auto_load = on
graph.sync_mode = 'wal' # Zero write overhead
graph.vacuum_interval_secs = 30
graph.max_token_length = 64 # Reduce to save RAM
AI Agent Workload (latency-sensitive, many concurrent queries)
graph.memory_limit_mb = 8192
graph.auto_load = on
graph.sync_mode = 'wal'
graph.vacuum_interval_secs = 10 # Near-real-time edge freshness
graph.default_max_depth = 10 # Deep traversals for AI context
graph.max_nodes = 50000 # Bound per-query cost
graph.oom_action = 'readonly' # Don't error, just degrade