Thread

Stack-based tracking of per-node WAL/buffer usage

Lukas Fittl <lukas@fittl.com> — 2025-08-31T23:57:01Z
Hi,

Please find attached a patch series that introduces a new paradigm for how
per-node WAL/buffer usage is tracked, with two primary goals: (1) reduce
overhead of EXPLAIN ANALYZE, (2) enable future work like tracking estimated
distinct buffer hits [0].

Currently we utilize pgWalUsage/pgBufferUsage as global counters, and in
InstrStopNode we call the rather
expensive BufferUsageAccumDiff/WalUsageAccumDiff to know how much activity
happened within a given node cycle.

This proposal instead uses a stack, where each time we enter a node
(InstrStartNode) we point a new global (pgInstrStack) to the current stack
entry. Whilst we're in that node we increment buffer/WAL usage statistics
to the stack entry. On exit (InstrStopNode) we restore the previous entry.

This change provides about a 10% performance benefit for EXPLAIN ANALYZE on
paths that repeatedly enter InstrStopNode, e.g. SELECT COUNT(*):

CREATE TABLE test(id int);
INSERT INTO test SELECT * FROM generate_series(0, 1000000);

master (124ms, best out of 3):

postgres=# EXPLAIN (ANALYZE) SELECT COUNT(*) FROM test;
                                                       QUERY PLAN

------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16925.01..16925.02 rows=1 width=8) (actual
time=124.910..124.910 rows=1.00 loops=1)
   Buffers: shared hit=752 read=3673
   ->  Seq Scan on test  (cost=0.00..14425.01 rows=1000001 width=0) (actual
time=0.201..62.228 rows=1000001.00 loops=1)
         Buffers: shared hit=752 read=3673
 Planning Time: 0.116 ms
 Execution Time: 124.961 ms

patched (109ms, best out of 3):

postgres=# EXPLAIN (ANALYZE) SELECT COUNT(*) FROM test;
                                                       QUERY PLAN

------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16925.01..16925.02 rows=1 width=8) (actual
time=109.788..109.788 rows=1.00 loops=1)
   Buffers: shared hit=940 read=3485
   ->  Seq Scan on test  (cost=0.00..14425.01 rows=1000001 width=0) (actual
time=0.153..69.368 rows=1000001.00 loops=1)
         Buffers: shared hit=940 read=3485
 Planning Time: 0.134 ms
 Execution Time: 109.837 ms
(6 rows)

I have also prototyped a more ambitious approach that completely removes
pgWalUsage/pgBufferUsage (utilizing the stack-collected data for e.g.
pg_stat_statements), but for now this patch set does not include that
change, but instead keeps adding to these legacy globals as well.

Patches attached:

0001: Separate node instrumentation from other use of Instrumentation struct

    Previously different places (e.g. query "total time") were repurposing
the per-node Instrumentation struct. Instead, simplify the Instrumentation
struct to only track time, WAL/buffer usage, and tuple counts. Similarly,
drop the use of InstrEndLoop outside of per-node instrumentation. Introduce
the NodeInstrumentation struct to carry forward the per-node
instrumentation information.

0002: Replace direct changes of pgBufferUsage/pgWalUsage with INSTR_* macros

0003: Introduce stack for tracking per-node WAL/buffer usage

Feedback/thoughts welcome!

CCing Andres since he had expressed interest in this off-list.

[0]: See lightning talk slides from PGConf.Dev discussing an HLL-based
EXPLAIN (BUFFERS DISTINCT):
https://resources.pganalyze.com/pganalyze_PGConf.dev_2025_shared_blks_hit_distinct.pdf

Thanks,
Lukas

-- 
Lukas Fittl