Re: Adding basic NUMA awareness

Jakub Wartak <jakub.wartak@enterprisedb.com> — 2025-12-02T12:26:33Z
On Wed, Nov 26, 2025 at 5:19 PM Tomas Vondra <tomas@vondra.me> wrote:

> Rebased patch series attached.

Thanks. BTW, still on the old patchset series, one additional thing
I've found related to interleaving: in CreateAnonymousSegment(), even
with the default debug_numa='', we still issue
numa_interleave_memory(ptr..). It should be optional (this also
affects the earlier calls). Tiny patch attached.

> I think the MAP_POPULATE should be optional, enabled by GUC.

OK, but you mean a new option for debug_numa, right (not a separate
GUC)? So debug_numa='prefault' then?

> > I would consider everything +/- 3% as noise (technically each branch
> > was a different compilation/ELF binary, as changing this #define
> > required to do so to get 4 vs 16; please see attached script). I miss
> > the explanation why without HP it deteriorates so much with for c=1024
> > with the patches.
>
> I wouldn't expect a big difference for "pgbench -S". That workload has
> so much other fairly expensive stuff (e.g. initializing index scans
> etc.), the cost of buffer replacement is going to be fairly limited.

Right. OK, so I've now got the seqconcurrentscans comparison done
properly, that is, when prewarmed and not naturally filled:

@master, 29GB/s mem bandwidth
latency average = 1255.572 ms
latency stddev = 417.162 ms
tps = 50.451925 (without initial connection time)

@v20251121 patchset, 41GB/s (~10GB/s per socket)
latency average = 719.931 ms
latency stddev = 14.874 ms
tps = 88.362091 (without initial connection time)

The main PMC difference seems to be a much lower "backend cycles idle"
(51% on master vs 31% with debug_numa="buffers,procs"), so less time
is spent waiting on memory, hence the speedup and the better IPC.
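
For a quick sanity check of the two runs above, here is a small Python
sketch that only redoes the arithmetic on the reported pgbench numbers.
The variable names are mine (not from pgbench or the patchset), and the
coefficient of variation is just one possible way to express how much
tighter the latency got.

```python
# Figures copied from the two pgbench summaries above.
master = {"tps": 50.451925, "lat_ms": 1255.572, "stddev_ms": 417.162}
patched = {"tps": 88.362091, "lat_ms": 719.931, "stddev_ms": 14.874}

# Throughput ratio: patched tps relative to master tps.
speedup = patched["tps"] / master["tps"]

# Relative reduction of the average latency.
latency_drop = 1.0 - patched["lat_ms"] / master["lat_ms"]

# Coefficient of variation (stddev / mean) as a rough jitter measure.
cv_master = master["stddev_ms"] / master["lat_ms"]
cv_patched = patched["stddev_ms"] / patched["lat_ms"]

print(f"tps speedup:        {speedup:.2f}x")
print(f"avg latency drop:   {latency_drop:.1%}")
print(f"latency CV master:  {cv_master:.1%}")
print(f"latency CV patched: {cv_patched:.1%}")
```

That works out to roughly 1.75x the tps and about 43% lower average
latency, with the latency spread going from about a third of the mean
to about 2% of it.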

Anyway, the biggest gripe right now (at least for me) is reliable
benchmarking. The runs below are all apples-to-oranges comparisons
(they measure different things although they look the same initially):
- restart and just select from pg_shmem_allocations_numa, or prewarm:
this puts everything into 1 NUMA node with debug_numa='', because the
prefaulting happens during the select-from-view case
- restart and pgbench -i -s XX (same issue as above), then pgbench:
you get the same, everything potentially on one NUMA node (because
pgbench prefaults on just one)
- restart and pgbench -c 64.. with debug_numa='' (off) MIGHT get a
random NUMA layout; how is that supposed to be deterministic? At
least with debug_numa='buffers' you get determinism.
- the shared_buffers size vs the size of the dataset read: the moment
you start doing something CPU intensive (or e.g. issuing syscalls just
for the VFS cache), the benefit seems to disappear, at least on my
hardware
Anyway, depending on the scenario I could get results varying from
34 tps to 88 tps here. debug_numa='buffers,..' just gives assurance
that the proper layout of shared memory is there (one could even argue
that such performance deviations across runs are a bug ;)).

> The regressions for numa/pgproc patches with 1024 clients are annoying,
> but how realistic is such scenario? With 32/64 CPUs, having 1024 active
> connections is a substantial overload. If we can fix this, great. But I
> think such regression may be OK if we get benefits for reasonable setups
> (with fewer clients).
>
> I don't know why it's happening, though. I haven't been testing cases
> with so many clients (compared to the number of CPUs).

The only thing that comes to my mind about the deterioration at high
connection counts (AKA the -c 1024 scenario) with pgprocs is related
to the question you raised in 0007: "Note: The scheduler may migrate
the process to a different CPU/node later. Maybe we should consider
pinning the process to the node?"

I think the answer is yes: fetch MyProc based on sched_getcpu() and
then, maybe gated by an additional numa_flags bit (a new
PROCS_PIN_NODE), simply call numa_run_on_node(node)? I've tried this:

pgbench -c 1024 -j 64 -P 1 -T 30 -S -M prepared got:

@numa-empty-debug_numa         ~434k TPS, ~12k CPU migrations/second
@numa+buffers+pgproc           ~412k TPS, 7-8k CPU migrations/second
@numa+buffers+pgproc+pinnode   ~434k TPS, 7-8k CPU migrations/second
                               (so the same as without pinning)

For the last one I've verified with bpftrace on
tracepoint:sched:sched_migrate_task that postgres no longer performed
node-to-node process bounces (pgbench still did, but postgres itself
did not with this numa_run_on_node()).
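
To illustrate the distinction above (CPU migrations still happen, but
they stay within one node), here is a minimal user-space sketch in
Python. It is not the patch's code: the patch would call
numa_run_on_node() from C, while this sketch approximates the same
effect with os.sched_setaffinity() over the node's CPU list from
Linux sysfs. All function names are hypothetical, and the
(orig_cpu, dest_cpu) pairs are assumed to have been collected
separately, e.g. from the sched_migrate_task tracepoint.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a sysfs cpulist string such as '0-3,8-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # memory-only nodes have an empty cpulist
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def read_node_cpulists(base="/sys/devices/system/node"):
    """Return {node_id: cpulist string} from Linux sysfs (empty dict if absent)."""
    result = {}
    for path in glob.glob(os.path.join(base, "node[0-9]*", "cpulist")):
        node = int(re.search(r"node(\d+)", path).group(1))
        with open(path) as f:
            result[node] = f.read()
    return result


def cpu_to_node_map(node_cpulists):
    """Invert {node: cpulist string} into {cpu: node}."""
    mapping = {}
    for node, text in node_cpulists.items():
        for cpu in parse_cpulist(text):
            mapping[cpu] = node
    return mapping


def count_migrations(events, cpu_to_node):
    """Split (orig_cpu, dest_cpu) pairs into (same-node, cross-node) counts."""
    same = cross = 0
    for orig, dest in events:
        if cpu_to_node[orig] == cpu_to_node[dest]:
            same += 1
        else:
            cross += 1
    return same, cross


def pin_to_node(node, node_cpulists, pid=0):
    """Restrict pid (0 = calling process) to the CPUs of one node. Linux only."""
    os.sched_setaffinity(pid, parse_cpulist(node_cpulists[node]))
```

With a made-up two-node topology {0: "0-3", 1: "4-7"}, the pairs
(0,1) and (6,7) count as same-node migrations and (2,5) as a
cross-node one. A pinned backend should then only ever show up in the
first bucket, which matches what the tracepoint showed for postgres.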

> > scenario II: pgbench -f seqconcurrscans.pgb; 64 partitions from
> > pgbench --partitions=64 -i -s 2000 [~29GB] being hammered in modulo
[..]
>
> Hmmm. I'd have expected better results for this workload. So I tried
> re-running my seqscan benchmark on the 176-core instance, and I got this:

[..]
Thanks!

> I did the benchmark for individual parts of the patch series. There's a
> clear (~20%) speedup for 0005, but 0006 and 0007 make it go away. The
> 0002/0003 regress it quite a bit. And with 128 clients there's no
> improvement at all.
[..]
> Those are clearly much better results, so I guess the default number of
> partitions may be too low.
>
> What bothers me is that this seems like a very narrow benchmark. I mean,
> few systems are doing concurrent seqscans putting this much pressure on
> buffer replacement. And once the plans start to do other stuff, the
> contention on clock sweep seems to go down substantially (as shown by
> the read-only pgbench). So the question is - is this really worth it?

Are you thinking here about the whole NUMA patchset or just
clocksweep? I think the multiple clocksweeps are just not shining
because other bottlenecks hammer the efficiency here. Andres talks
about exactly this here https://youtu.be/V75KpACdl6E?t=1990 (he
mentions out-of-order execution; I see btrees as the top #1 in
reports). So maybe it's just too early to see the results of this
optimization?

As for the classic read-only pgbench -S, I still see roughly 1:8
local to remote (!) DRAM accesses (1 <-> 3 sockets) even with those
patches, so potentially something could be improved in the far future
for sure (that would require some memaddr monitoring to map the most
remote DRAM misses <-> pg inter-shm ptrs; think of
pg_shmem_allocations_numa with local/remote counters, or maybe just
fall back to perf-c2c).

To sum up, IMHO, I understand this $thread's NUMA implementation as:
- it's strictly a guard mechanism to get determinism (for most cases)
-- it fixes "imbalance"
- no performance boost for OLTP as such
- for analytics it could be a win (in-memory workloads; well, PG is
not fully built for this, but it could be one day, or already is with
3rd party TAMs and extensions), and:
-- we can provide a performance jump for seqconcurrentjobs or
memory-fitting workloads (the patchset does this already). Note: I
think PG will eventually get into such classes in the longer run; we
are just ahead with NUMA, while PG still lacks a proper vectorized
executor.
-- we could further enhance PQ here: the leader and PQ workers would
stick to the same NUMA node with some affinity (see the earlier thread
measurements for this [1]); we could have a session GUC to enable this
for planned big PQ whole-NUMA SELECTs, and this would probably be done
close to dsm_impl_posix()
- new idea: we could allow exposing tables(spaces) to NUMA nodes, or
make it a per-user toggle too while we are at it (imagine HTAP-like
workloads: NUMA node #0 for OLTP, node #1 for analytics). Sounds cool
and rather easy and has a valid use, but dunno if that would be really
useful?

Way out of scope:
- superlocking btrees that Andres mentioned in his presentation

-J.

[1] - https://www.postgresql.org/message-id/attachment/178120/NUMA_pq_cpu_pinning_results.txt