Thread

  1. Re: Adding basic NUMA awareness

    Jakub Wartak <jakub.wartak@enterprisedb.com> — 2025-12-02T12:26:33Z

    On Wed, Nov 26, 2025 at 5:19 PM Tomas Vondra <tomas@vondra.me> wrote:
    
    > Rebased patch series attached.
    
    Thanks. BTW, still with the old patchset series, one additional thing
    that I've found related to interleaving is that in
    CreateAnonymousSegment() with the default debug_numa='', we still
    issue numa_interleave_memory(ptr..). It should be optional (this
    affects earlier calls too). Tiny patch attached.
    
    > I think the MAP_POPULATE should be optional, enabled by GUC.
    
    OK, but you mean it's a new option to debug_numa, right? (not some
    separate GUC), so debug_numa='prefault' then?
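
    A minimal sketch of how such an option could gate MAP_POPULATE,
    assuming debug_numa is a comma-separated list. The flag names
    (NUMA_OPT_*) and both helper functions are hypothetical and are not
    taken from the patchset; only the option strings "buffers", "procs"
    and the proposed "prefault" come from this thread.

    ```c
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE         /* for MAP_ANONYMOUS and MAP_POPULATE */
    #endif
    #include <stddef.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Hypothetical flag bits; the names in the real patch may differ. */
    enum
    {
        NUMA_OPT_BUFFERS = 1 << 0,
        NUMA_OPT_PROCS = 1 << 1,
        NUMA_OPT_PREFAULT = 1 << 2
    };

    /*
     * Parse a comma-separated list such as "buffers,prefault".
     * Returns the flag mask (0 for an empty string), or -1 if an item
     * is not recognized.
     */
    int
    parse_debug_numa(const char *value)
    {
        static const struct
        {
            const char *name;
            int         flag;
        }           opts[] = {
            {"buffers", NUMA_OPT_BUFFERS},
            {"procs", NUMA_OPT_PROCS},
            {"prefault", NUMA_OPT_PREFAULT},
        };
        int         flags = 0;

        while (*value != '\0')
        {
            size_t      len = strcspn(value, ",");
            size_t      i;
            int         found = 0;

            for (i = 0; i < sizeof(opts) / sizeof(opts[0]); i++)
            {
                if (strlen(opts[i].name) == len &&
                    strncmp(value, opts[i].name, len) == 0)
                {
                    flags |= opts[i].flag;
                    found = 1;
                    break;
                }
            }
            if (!found)
                return -1;
            value += len;
            if (*value == ',')
                value++;
        }
        return flags;
    }

    /* Request MAP_POPULATE only when prefaulting was explicitly asked for. */
    int
    shmem_mmap_flags(int numa_flags)
    {
        int         mflags = MAP_SHARED | MAP_ANONYMOUS;

        if (numa_flags & NUMA_OPT_PREFAULT)
            mflags |= MAP_POPULATE;
        return mflags;
    }
    ```

    With this shape, debug_numa='' yields a mask of 0, so neither
    prefaulting nor any other NUMA-specific call would be issued unless
    its bit is set.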
    
    > > I would consider everything +/- 3% as noise (technically each branch
    > > was a different compilation/ELF binary, as changing this #define
    > > required to do so to get 4 vs 16; please see attached script). I miss
    > > the explanation why without HP it deteriorates so much with for c=1024
    > > with the patches.
    >
    > I wouldn't expect a big difference for "pgbench -S". That workload has
    > so much other fairly expensive stuff (e.g. initializing index scans
    > etc.), the cost of buffer replacement is going to be fairly limited.
    
    Right. OK, so I've got the seqconcurrentscans comparison done right,
    that is when prewarmed and not naturally filled:
    
    @master, 29GB/s mem bandwidth
    latency average = 1255.572 ms
    latency stddev = 417.162 ms
    tps = 50.451925 (without initial connection time)
    
    @v20251121 patchset, 41GB/s (~10GB/s per socket)
    latency average = 719.931 ms
    latency stddev = 14.874 ms
    tps = 88.362091 (without initial connection time)
    
    The main PMC difference seems to be much lower "backend cycles idle"
    (51% on master vs 31% with debug_numa="buffers,procs"), so less time
    is spent waiting on memory, hence the speedup and better IPC.
    
    Anyway, the biggest gripe right now (at least to me) is reliable
    benchmarking. The runs below are all apples-to-oranges comparisons
    (they measure different things although they look the same initially):
    - restart and just select from pg_shmem_allocations_numa or prewarm:
    this puts everything into 1 NUMA node with debug_numa='', because the
    prefaulting happens during the select from the view
    - restart and pgbench -i -s XX (same issue as above) then pgbench:
    you get the same, potentially everything on one NUMA node (because
    pgbench prefaults on just one)
    - restart and pgbench -c 64.. with debug_numa='' (off) MIGHT get a
    random NUMA layout; how is that supposed to be deterministic? At
    least with debug_numa='buffers' you get determinism.
    - the shared_buffers size vs the size of the dataset read: the moment
    you start doing something CPU intensive (or calling syscalls just for
    the VFS cache), the benefit seems to disappear, at least on my hardware
    
    Anyway, depending on the scenario I could get results varying from
    34tps to 88tps here. debug_numa='buffers,..' just gives assurance
    that the proper layout of shared memory is there (one could even
    argue that such performance deviations across runs are a bug ;)).
    
    > The regressions for numa/pgproc patches with 1024 clients are annoying,
    > but how realistic is such scenario? With 32/64 CPUs, having 1024 active
    > connections is a substantial overload. If we can fix this, great. But I
    > think such regression may be OK if we get benefits for reasonable setups
    > (with fewer clients).
    >
    > I don't know why it's happening, though. I haven't been testing cases
    > with so many clients (compared to the number of CPUs).
    
    The only thing that comes to my mind about the deterioration at high
    connection counts (AKA the -c 1024 scenario) with pgprocs is related
    to the question you raised in 0007: "Note: The scheduler may migrate
    the process to a different CPU/node later. Maybe we should consider
    pinning the process to the node?"
    
    I think the answer is yes, so fetch MyProc based on sched_getcpu()
    and then, maybe with an additional numa_flags & new PROCS_PIN_NODE,
    simply numa_run_on_node(node)? I've tried this:
    
    pgbench -c 1024 -j 64 -P 1 -T 30 -S -M prepared got:
    
    @numa-empty-debug_numa         ~434k TPS, ~12k CPU migrations/second
    @numa+buffers+pgproc           ~412k TPS, 7-8k CPU migrations/second
    @numa+buffers+pgproc+pinnode   ~434k TPS, still 7-8k CPU migrations/second
                                   (so the same)
    
    For the last one I've verified with bpftrace on
    tracepoint:sched:sched_migrate_task that there were no node-to-node
    process bounces anymore (there were for pgbench, but not for postgres
    itself with this numa_run_on_node()).
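
    For anyone who wants to try the pinning idea outside of postgres, a
    minimal standalone sketch follows. It approximates
    numa_run_on_node() without linking libnuma by reading the node's CPU
    list from sysfs and calling sched_setaffinity(). Both function names
    are hypothetical, and the loop assumes node numbers are contiguous
    starting at 0; this is not the code from the patch.

    ```c
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE         /* for sched_getcpu(), cpu_set_t, CPU_* */
    #endif
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    /*
     * Parse a sysfs cpulist such as "0-3,8-11" into *set.
     * Returns the number of CPUs in the set, or -1 on malformed input.
     */
    int
    parse_cpulist(const char *s, cpu_set_t *set)
    {
        CPU_ZERO(set);
        while (*s != '\0' && *s != '\n')
        {
            char       *end;
            long        lo = strtol(s, &end, 10);
            long        hi = lo;

            if (end == s)
                return -1;
            if (*end == '-')
            {
                s = end + 1;
                hi = strtol(s, &end, 10);
                if (end == s)
                    return -1;
            }
            if (lo < 0 || hi < lo || hi >= CPU_SETSIZE)
                return -1;
            for (long cpu = lo; cpu <= hi; cpu++)
                CPU_SET((int) cpu, set);
            s = (*end == ',') ? end + 1 : end;
        }
        return CPU_COUNT(set);
    }

    /*
     * Hypothetical helper: restrict the calling process to the CPUs of
     * the NUMA node it is currently running on.
     * Returns the node number, or -1 on failure.
     */
    int
    pin_to_current_node(void)
    {
        int         cpu = sched_getcpu();

        if (cpu < 0)
            return -1;
        for (int node = 0;; node++)
        {
            char        path[64];
            char        buf[4096];
            char       *line;
            cpu_set_t   set;
            FILE       *f;

            snprintf(path, sizeof(path),
                     "/sys/devices/system/node/node%d/cpulist", node);
            f = fopen(path, "r");
            if (f == NULL)
                return -1;      /* ran out of nodes, or no NUMA sysfs */
            line = fgets(buf, sizeof(buf), f);
            fclose(f);
            if (line == NULL || parse_cpulist(buf, &set) < 0)
                continue;       /* unreadable or CPU-less node, try next */
            if (CPU_ISSET(cpu, &set))
                return sched_setaffinity(0, sizeof(set), &set) == 0 ? node : -1;
        }
    }
    ```

    After such a call the scheduler can still move the process between
    CPUs of the same node, which would be consistent with the migration
    rate staying at the same level while node-to-node bounces disappear.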
    
    > > scenario II: pgbench -f seqconcurrscans.pgb; 64 partitions from
    > > pgbench --partitions=64 -i -s 2000 [~29GB] being hammered in modulo
    [..]
    >
    > Hmmm. I'd have expected better results for this workload. So I tried
    > re-running my seqscan benchmark on the 176-core instance, and I got this:
    
    [..]
    Thanks!
    
    > I did the benchmark for individual parts of the patch series. There's a
    > clear (~20%) speedup for 0005, but 0006 and 0007 make it go away. The
    > 0002/0003 regress it quite a bit. And with 128 clients there's no
    > improvement at all.
    [..]
    > Those are clearly much better results, so I guess the default number of
    > partitions may be too low.
    >
    > What bothers me is that this seems like a very narrow benchmark. I mean,
    > few systems are doing concurrent seqscans putting this much pressure on
    > buffer replacement. And once the plans start to do other stuff, the
    > contention on clock sweep seems to go down substantially (as shown by
    > the read-only pgbench). So the question is - is this really worth it?
    
    Are you thinking here about the whole NUMA patchset or just
    clocksweep? I think multiple clocksweeps are just not shining because
    other bottlenecks hammer the efficiency here. Andres talks about
    exactly this here: https://youtu.be/V75KpACdl6E?t=1990 (he mentions
    out-of-order execution; I see btrees in reports as top #1). So maybe
    it's just too early to see the results of this optimization?
    
    As for classic read-only pgbench -S, I still see roughly 1:8 local to
    remote (!) DRAM access (1 <-> 3 sockets) even with those patches, so
    potentially something could be improved in the far future for sure
    (that would require some memaddr monitoring for most remote DRAM
    misses <-> pg inter-shm ptr mapping; think of
    pg_shmem_allocations_numa with local/remote counters, or maybe just
    fall back to perf-c2c).
    
    To sum up, IMHO I understand this $thread's NUMA implementation as:
    - it's strictly a guard mechanism to get determinism (for most cases)
    -- it fixes "imbalance"
    - no performance boost for OLTP as such
    - for analytics it could be a win (in-memory workloads; well, PG is
    not fully built for this, but it could be one day, or already is with
    3rd party TAMs and extensions), and:
    -- we can provide a performance jump for seqconcurrentjobs or
    memory-fitting workloads (the patchset does this already). Note: I
    think PG will eventually get into such classes in the longer run; we
    are just ahead with NUMA, while PG is still without proper vectorized
    executor stuff.
    -- we could further enhance PQ here: the leader and PQ workers would
    stick to the same NUMA node with some affinity (see the earlier thread
    measurements for this [1]; we could have a session GUC to enable this
    for planned big PQ whole-NUMA SELECTs; this would probably be done
    close to dsm_impl_posix())
    - new idea: we could allow exposing tables(spaces) into NUMA nodes or
    make it a per-user toggle too while we are at it (imagine HTAP-like
    workloads: NUMA node #0 for OLTP, node #1 for analytics). Sounds cool
    and rather easy and has a valid use, but dunno if that would be
    really useful?
    
    Way out of scope:
    - superlocking btrees that Andres mentioned in his presentation
    
    -J.
    
    [1] - https://www.postgresql.org/message-id/attachment/178120/NUMA_pq_cpu_pinning_results.txt