Thread

  1. Stack-based tracking of per-node WAL/buffer usage

    Lukas Fittl <lukas@fittl.com> — 2025-08-31T23:57:01Z

    Hi,
    
    Please find attached a patch series that introduces a new paradigm for how
    per-node WAL/buffer usage is tracked, with two primary goals: (1) reduce
    overhead of EXPLAIN ANALYZE, (2) enable future work like tracking estimated
    distinct buffer hits [0].
    
    Currently we utilize pgWalUsage/pgBufferUsage as global counters, and in
    InstrStopNode we call the rather
    expensive BufferUsageAccumDiff/WalUsageAccumDiff to know how much activity
    happened within a given node cycle.
    
    This proposal instead uses a stack, where each time we enter a node
    (InstrStartNode) we point a new global (pgInstrStack) to the current stack
    entry. Whilst we're in that node we increment buffer/WAL usage statistics
    to the stack entry. On exit (InstrStopNode) we restore the previous entry.
    
    This change provides about a 10% performance benefit for EXPLAIN ANALYZE on
    paths that repeatedly enter InstrStopNode, e.g. SELECT COUNT(*):
    
    CREATE TABLE test(id int);
    INSERT INTO test SELECT * FROM generate_series(0, 1000000);
    
    master (124ms, best out of 3):
    
    postgres=# EXPLAIN (ANALYZE) SELECT COUNT(*) FROM test;
                                                           QUERY PLAN
    
    ------------------------------------------------------------------------------------------------------------------------
     Aggregate  (cost=16925.01..16925.02 rows=1 width=8) (actual
    time=124.910..124.910 rows=1.00 loops=1)
       Buffers: shared hit=752 read=3673
       ->  Seq Scan on test  (cost=0.00..14425.01 rows=1000001 width=0) (actual
    time=0.201..62.228 rows=1000001.00 loops=1)
             Buffers: shared hit=752 read=3673
     Planning Time: 0.116 ms
     Execution Time: 124.961 ms
    
    patched (109ms, best out of 3):
    
    postgres=# EXPLAIN (ANALYZE) SELECT COUNT(*) FROM test;
                                                           QUERY PLAN
    
    ------------------------------------------------------------------------------------------------------------------------
     Aggregate  (cost=16925.01..16925.02 rows=1 width=8) (actual
    time=109.788..109.788 rows=1.00 loops=1)
       Buffers: shared hit=940 read=3485
       ->  Seq Scan on test  (cost=0.00..14425.01 rows=1000001 width=0) (actual
    time=0.153..69.368 rows=1000001.00 loops=1)
             Buffers: shared hit=940 read=3485
     Planning Time: 0.134 ms
     Execution Time: 109.837 ms
    (6 rows)
    
    I have also prototyped a more ambitious approach that completely removes
    pgWalUsage/pgBufferUsage (utilizing the stack-collected data for e.g.
    pg_stat_statements), but for now this patch set does not include that
    change, but instead keeps adding to these legacy globals as well.
    
    Patches attached:
    
    0001: Separate node instrumentation from other use of Instrumentation struct
    
        Previously different places (e.g. query "total time") were repurposing
    the per-node Instrumentation struct. Instead, simplify the Instrumentation
    struct to only track time, WAL/buffer usage, and tuple counts. Similarly,
    drop the use of InstrEndLoop outside of per-node instrumentation. Introduce
    the NodeInstrumentation struct to carry forward the per-node
    instrumentation information.
    
    0002: Replace direct changes of pgBufferUsage/pgWalUsage with INSTR_* macros
    
    0003: Introduce stack for tracking per-node WAL/buffer usage
    
    Feedback/thoughts welcome!
    
    CCing Andres since he had expressed interest in this off-list.
    
    [0]: See lightning talk slides from PGConf.Dev discussing an HLL-based
    EXPLAIN (BUFFERS DISTINCT):
    https://resources.pganalyze.com/pganalyze_PGConf.dev_2025_shared_blks_hit_distinct.pdf
    
    Thanks,
    Lukas
    
    -- 
    Lukas Fittl