Re: Adding basic NUMA awareness

Jakub Wartak <jakub.wartak@enterprisedb.com> — 2025-11-17T09:23:50Z
On Tue, Nov 11, 2025 at 12:52 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> Hi,
>
> here's a rebased patch series, fixing most of the smaller issues from
> v20251101, and making cfbot happy (hopefully).

Hi Tomas,

> >>> 0007b: pg_buffercache_pgproc -- nitpick, but maybe it would be better
> >>> called pg_shm_pgproc?
> >>>
> >>
> >> Right. It does not belong to pg_buffercache at all, I just added it
> >> there because I've been messing with that code already.
> >
> > Please keep them in for at least some time (perhaps a standalone
> > patch marked as not intended to be committed would work?). I find the
> > view extremely useful as it will allow us to pinpoint local-vs-remote
> > NUMA fetches (we need to know the address).
> >
>
> Are you referring to the _pgproc view specifically, or also to the view
> with buffer partitions? I don't intend to remove the view for shared
> buffers, that's indeed useful.

Both, even the _pgproc.


> Hmmm, ok. Will check. But maybe let's not focus too much on the PGPROC
> partitioning, I don't think that's likely to go into 19.

Oh ok.

> >>> 0006d: I've got one SIGBUS during a call to select
> >>> pg_buffercache_numa_pages(); and it looks like that memory accessed is
> >>> simply not mapped? (bug)
[..]
> I didn't have time to look into all this info about mappings, io_uring
> yet, so no response from me.
>

Ok, so the proper HP + SIGBUS explanation:

Apologies, earlier I wrote that disabling THP works around this,
but I probably made an error there and used the wrong binary
(with MAP_POPULATE in PG_MMAP_FLAGS), so please ignore that.

1. Before starting PG, with shared_buffers=32GB, huge_pages=on (2MB
ones), vm.nr_hugepages=17715, 4 NUMA nodes, kernel 6.14.x,
max_connections=10k, wal_buffers=1GB:

node0/hugepages/hugepages-2048kB/free_hugepages:4429
node1/hugepages/hugepages-2048kB/free_hugepages:4429
node2/hugepages/hugepages-2048kB/free_hugepages:4429
node3/hugepages/hugepages-2048kB/free_hugepages:4428

2. Just start PG with the older NUMA patchset 20251101. Right after
startup there is a deficit of free huge pages across the NUMA nodes,
and one node (node1 here) has allocated far more than the others:

node0/hugepages/hugepages-2048kB/free_hugepages:4397
node1/hugepages/hugepages-2048kB/free_hugepages:3453
node2/hugepages/hugepages-2048kB/free_hugepages:4397
node3/hugepages/hugepages-2048kB/free_hugepages:4396
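
To quantify the deficit, the two listings can be subtracted per node. The following is a minimal sketch that only parses text in the format shown above; the function names are made up for illustration, and the 2 MB page size is taken from the setup described in step 1.

```python
import re

def parse_free_hugepages(text):
    """Parse 'nodeN/hugepages/hugepages-<size>kB/free_hugepages:COUNT' lines."""
    counts = {}
    pattern = r"node(\d+)/hugepages/hugepages-\d+kB/free_hugepages:(\d+)"
    for line in text.splitlines():
        m = re.match(pattern, line.strip())
        if m:
            counts[int(m.group(1))] = int(m.group(2))
    return counts

def consumed_per_node(before, after):
    """Huge pages taken from each node between the two snapshots."""
    return {node: before[node] - after[node] for node in sorted(before)}

# Snapshots copied from steps 1 and 2 above.
BEFORE = """\
node0/hugepages/hugepages-2048kB/free_hugepages:4429
node1/hugepages/hugepages-2048kB/free_hugepages:4429
node2/hugepages/hugepages-2048kB/free_hugepages:4429
node3/hugepages/hugepages-2048kB/free_hugepages:4428
"""
AFTER = """\
node0/hugepages/hugepages-2048kB/free_hugepages:4397
node1/hugepages/hugepages-2048kB/free_hugepages:3453
node2/hugepages/hugepages-2048kB/free_hugepages:4397
node3/hugepages/hugepages-2048kB/free_hugepages:4396
"""

consumed = consumed_per_node(parse_free_hugepages(BEFORE),
                             parse_free_hugepages(AFTER))
for node, pages in consumed.items():
    # 2 MB per huge page, as configured in step 1.
    print(f"node{node}: {pages} huge pages ({pages * 2} MB)")
```

With these numbers node1 has given up 976 huge pages, while the other three nodes have given up 32 each.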

3. Check the layout of numa_maps for the postmaster PID:

7fc9cb200000 default file=/anon_hugepage\040(deleted) huge dirty=517
mapmax=8 N1=517 kernelpagesize_kB=2048 [!!!]
7fca0d600000 bind:0 file=/anon_hugepage\040(deleted) huge dirty=32
mapmax=2 N0=32 kernelpagesize_kB=2048
7fca11600000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=32
mapmax=2 N1=32 kernelpagesize_kB=2048
7fca15600000 bind:2 file=/anon_hugepage\040(deleted) huge dirty=32
mapmax=2 N2=32 kernelpagesize_kB=2048
7fca19600000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=32
mapmax=2 N3=32 kernelpagesize_kB=2048
7fca1d600000 default file=/anon_hugepage\040(deleted) huge
7fca1d800000 bind:0 file=/anon_hugepage\040(deleted) huge
7fcc1d800000 bind:1 file=/anon_hugepage\040(deleted) huge
7fce1d800000 bind:2 file=/anon_hugepage\040(deleted) huge
7fd01d800000 bind:3 file=/anon_hugepage\040(deleted) huge
7fd21d800000 default file=/anon_hugepage\040(deleted) huge dirty=425
mapmax=8 N1=425 kernelpagesize_kB=2048 [!!!]

Your patch doesn't do anything special for anything other than
Buffer Blocks and PGPROC in the above picture, so the remaining
mmap()ed regions keep the "default" NUMA policy. Per the N1 entries
above, those take (517+425) * 2MB = ~1884 MB of really used memory,
all on node 1. PG does touch those regions on startup, but it doesn't
really touch Buffer Blocks. Anyway, this is what causes the missing
free huge pages on N1 (it generates pressure on node 1).
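
The (517+425) * 2MB figure can be reproduced mechanically from the numa_maps output. Below is a rough sketch, assuming each mapping is one unwrapped line as in /proc/PID/numa_maps (the listing above is wrapped by the mail client); the helper names are hypothetical and only the fields visible in the listing are parsed.

```python
import re

def parse_numa_maps_line(line):
    """Extract start address, policy, per-node page counts and page size."""
    fields = line.split()
    entry = {"start": int(fields[0], 16), "policy": fields[1],
             "pages_per_node": {}, "page_kb": None}
    for field in fields[2:]:
        m = re.fullmatch(r"N(\d+)=(\d+)", field)
        if m:
            entry["pages_per_node"][int(m.group(1))] = int(m.group(2))
        elif field.startswith("kernelpagesize_kB="):
            entry["page_kb"] = int(field.split("=", 1)[1])
    return entry

def touched_mb_by_policy(lines, policy):
    """Sum really-touched memory (in MB) per node for one policy."""
    totals = {}
    for line in lines:
        entry = parse_numa_maps_line(line)
        if entry["policy"] != policy or entry["page_kb"] is None:
            continue  # untouched regions carry no N<node>= fields
        for node, pages in entry["pages_per_node"].items():
            mb = pages * entry["page_kb"] // 1024
            totals[node] = totals.get(node, 0) + mb
    return totals

# Raw strings keep the literal "\040" from the listing intact.
SAMPLE = [
    r"7fc9cb200000 default file=/anon_hugepage\040(deleted) huge dirty=517 mapmax=8 N1=517 kernelpagesize_kB=2048",
    r"7fca0d600000 bind:0 file=/anon_hugepage\040(deleted) huge dirty=32 mapmax=2 N0=32 kernelpagesize_kB=2048",
    r"7fca1d600000 default file=/anon_hugepage\040(deleted) huge",
    r"7fd21d800000 default file=/anon_hugepage\040(deleted) huge dirty=425 mapmax=8 N1=425 kernelpagesize_kB=2048",
]

print(touched_mb_by_policy(SAMPLE, "default"))
```

Filtering on "bind:0" instead shows the 64 MB touched on node 0 by the explicitly bound region in the sample.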

So as it stands, the patchset is missing some form of balancing to
use equal memory across nodes:
- each node is forced to take a certain amount of Buffer Blocks (total
Buffer Blocks / number of NUMA nodes)
- yet we do nothing for the other regions and leave them at the
"defaults" (e.g. $SegHDR (start of shm) .. first Buffer Blocks). Those
are placed on the current node (due to the default policy), which in
turn causes this memory overallocation imbalance (so in the example N1
will get Buffer Blocks + everything else, but that only happens on
real access, not during mmap(), due to the lazy/first-touch policy)

Currently, anything that touches memory on the imbalanced NUMA node
with the deficit (N1 above), e.g. using pg_shm_allocations or
pg_buffercache, causes stress there and ends up in SIGBUS. This looks
to be by design on the Linux kernel side: exc_page_fault() ->
do_user_addr_fault() -> do_sigbus() AKA force_sig_fault(). But if I
hack PG to interleave (or just use numactl --interleave=all ...) so
that those 3 "default" regions are effectively interleaved instead,
I get "interleave" like this:

7fb2dd000000 interleave:0-3 file=/anon_hugepage\040(deleted) huge
dirty=517 mapmax=8 N0=129 N1=132 N2=128 N3=128 kernelpagesize_kB=2048
7fb31f400000 bind:0 file=/anon_hugepage\040(deleted) huge dirty=32
mapmax=2 N0=32 kernelpagesize_kB=2048
7fb323400000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=32
mapmax=2 N1=32 kernelpagesize_kB=2048
7fb327400000 bind:2 file=/anon_hugepage\040(deleted) huge dirty=32
mapmax=2 N2=32 kernelpagesize_kB=2048
7fb32b400000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=32
mapmax=2 N3=32 kernelpagesize_kB=2048
7fb32f400000 interleave:0-3 file=/anon_hugepage\040(deleted) huge
7fb32f600000 bind:0 file=/anon_hugepage\040(deleted) huge
7fb52f600000 bind:1 file=/anon_hugepage\040(deleted) huge
7fb72f600000 bind:2 file=/anon_hugepage\040(deleted) huge
7fb92f600000 bind:3 file=/anon_hugepage\040(deleted) huge
7fbb2f600000 interleave:0-3 file=/anon_hugepage\040(deleted) huge
dirty=425 N0=106 N1=106 N2=105 N3=108 kernelpagesize_kB=2048

then even after fully touching everything (via a select from
pg_shm_allocations), it runs, I get much better balance, and I won't
have SIGBUS issues:

node0/hugepages/hugepages-2048kB/free_hugepages:23
node1/hugepages/hugepages-2048kB/free_hugepages:23
node2/hugepages/hugepages-2048kB/free_hugepages:23
node3/hugepages/hugepages-2048kB/free_hugepages:22

This demonstrates that enough free memory is out there, it's just the
imbalance that causes SIGBUS. I hope this answers one of your main
questions from the very first messages: what we should do with the
remaining shared_buffer members. I would like to hear your thoughts on
this before I start benchmarking this for real. I didn't want to bench
it yet, as such interleaving could alter the test results.

Other things I've noticed:
- smaps Size: and Shared_Hugetlb: are misleading: they show really
touched memory, not assigned memory
- same goes for procfs's numa_maps: ignore the N[0-3] sizes, they
are only "really used", not assigned
- the best approach is to manually calculate the size from the
pointers/address range itself
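
As an illustration of the last point, region sizes can be derived by subtracting consecutive start addresses from the step 3 listing. This is only a sketch and rests on an assumption: that each region ends exactly where the next one begins. numa_maps shows start addresses only, so the real end addresses (and the size of the last region) would have to come from /proc/PID/maps.

```python
MIB = 1024 * 1024

def region_sizes(start_addresses_hex):
    """Return (start, size_in_bytes) pairs, assuming contiguous regions."""
    starts = sorted(int(addr, 16) for addr in start_addresses_hex)
    return [(f"{lo:x}", hi - lo) for lo, hi in zip(starts, starts[1:])]

# Start addresses copied from the numa_maps listing in step 3.
STARTS = [
    "7fc9cb200000", "7fca0d600000", "7fca11600000", "7fca15600000",
    "7fca19600000", "7fca1d600000", "7fca1d800000", "7fcc1d800000",
    "7fce1d800000", "7fd01d800000", "7fd21d800000",
]

sizes = region_sizes(STARTS)
for start, size in sizes:
    print(f"{start}: {size // MIB} MiB")
```

Under that assumption the four large bind:0..3 regions come out at 8192 MiB each, which adds up to the configured shared_buffers=32GB, even though numa_maps reports no N[0-3] counts for them because they have not been touched yet.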

-J.