Thread

  1. Re: Adding basic NUMA awareness

    Jakub Wartak <jakub.wartak@enterprisedb.com> — 2025-11-17T09:23:50Z

    On Tue, Nov 11, 2025 at 12:52 PM Tomas Vondra <tomas@vondra.me> wrote:
    >
    > Hi,
    >
    > here's a rebased patch series, fixing most of the smaller issues from
    > v20251101, and making cfbot happy (hopefully).
    
    Hi Tomas,
    
    > >>> 0007b: pg_buffercache_pgproc -- nitpick, but maybe it would be better
    > >>> called pg_shm_pgproc?
    > >>>
    > >>
    > >> Right. It does not belong to pg_buffercache at all, I just added it
    > >> there because I've been messing with that code already.
    > >
    > > Please keep them in for at least some time (perhaps a standalone
    > > patch marked as not intended to be committed would work?). I find the
    > > view extremely useful as it will allow us to pinpoint local-vs-remote
    > > NUMA fetches (we need to know the address).
    > >
    >
    > Are you referring to the _pgproc view specifically, or also to the view
    > with buffer partitions? I don't intend to remove the view for shared
    > buffers, that's indeed useful.
    
    Both, even the _pgproc.
    
    
    > Hmmm, ok. Will check. But maybe let's not focus too much on the PGPROC
    > partitioning, I don't think that's likely to go into 19.
    
    Oh ok.
    
    > >>> 0006d: I've got one SIGBUS during a call to select
    > >>> pg_buffercache_numa_pages(); and it looks like that memory accessed is
    > >>> simply not mapped? (bug)
    [..]
    > I didn't have time to look into all this info about mappings, io_uring
    > yet, so no response from me.
    >
    
    Ok, so here is the proper HP + SIGBUS explanation:
    
    Apologies, earlier I wrote that disabling THP works around this,
    but I probably made an error there and used the wrong binary
    (one with MAP_POPULATE in PG_MMAP_FLAGS), so please ignore that.
    
    1. Before starting PG, with shared_buffers=32GB, huge_pages=on (2MB
    ones), vm.nr_hugepages=17715, 4 NUMA nodes, kernel 6.14.x,
    max_connections=10k, wal_buffers=1GB:
    
    node0/hugepages/hugepages-2048kB/free_hugepages:4429
    node1/hugepages/hugepages-2048kB/free_hugepages:4429
    node2/hugepages/hugepages-2048kB/free_hugepages:4429
    node3/hugepages/hugepages-2048kB/free_hugepages:4428
    
    2. Just start up PG with the older NUMA patchset 20251101. There
    will be a deficit across NUMA nodes right after startup; typically one
    NUMA node will allocate much more than the others:
    
    node0/hugepages/hugepages-2048kB/free_hugepages:4397
    node1/hugepages/hugepages-2048kB/free_hugepages:3453
    node2/hugepages/hugepages-2048kB/free_hugepages:4397
    node3/hugepages/hugepages-2048kB/free_hugepages:4396
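
    The per-node deficit between steps 1 and 2 can be computed with a
    small sketch like the one below. It only parses the pasted
    "nodeN/...:COUNT" lines; the helper name parse_free_hugepages is
    made up for illustration, and the 2MB page size is taken from the
    setup above.

```python
PAGE_MB = 2  # huge page size from the setup above (2MB pages)


def parse_free_hugepages(text):
    """Map node id -> free huge page count from 'nodeN/...:COUNT' lines."""
    free = {}
    for line in text.strip().splitlines():
        path, _, count = line.strip().rpartition(":")
        node_dir = path.split("/", 1)[0]          # e.g. "node1"
        free[int(node_dir[len("node"):])] = int(count)
    return free


before = parse_free_hugepages("""
node0/hugepages/hugepages-2048kB/free_hugepages:4429
node1/hugepages/hugepages-2048kB/free_hugepages:4429
node2/hugepages/hugepages-2048kB/free_hugepages:4429
node3/hugepages/hugepages-2048kB/free_hugepages:4428
""")

after = parse_free_hugepages("""
node0/hugepages/hugepages-2048kB/free_hugepages:4397
node1/hugepages/hugepages-2048kB/free_hugepages:3453
node2/hugepages/hugepages-2048kB/free_hugepages:4397
node3/hugepages/hugepages-2048kB/free_hugepages:4396
""")

# Huge pages consumed on each node right after startup.
used = {node: before[node] - after[node] for node in before}

for node, pages in sorted(used.items()):
    print(f"node{node}: {pages} pages ({pages * PAGE_MB} MB)")
```

    With the numbers above, node1 has consumed 976 huge pages while the
    other nodes have consumed 32 each.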
    
    3. Check the layout of NUMA maps for the postmaster PID:
    
    7fc9cb200000 default file=/anon_hugepage\040(deleted) huge dirty=517
    mapmax=8 N1=517 kernelpagesize_kB=2048 [!!!]
    7fca0d600000 bind:0 file=/anon_hugepage\040(deleted) huge dirty=32
    mapmax=2 N0=32 kernelpagesize_kB=2048
    7fca11600000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=32
    mapmax=2 N1=32 kernelpagesize_kB=2048
    7fca15600000 bind:2 file=/anon_hugepage\040(deleted) huge dirty=32
    mapmax=2 N2=32 kernelpagesize_kB=2048
    7fca19600000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=32
    mapmax=2 N3=32 kernelpagesize_kB=2048
    7fca1d600000 default file=/anon_hugepage\040(deleted) huge
    7fca1d800000 bind:0 file=/anon_hugepage\040(deleted) huge
    7fcc1d800000 bind:1 file=/anon_hugepage\040(deleted) huge
    7fce1d800000 bind:2 file=/anon_hugepage\040(deleted) huge
    7fd01d800000 bind:3 file=/anon_hugepage\040(deleted) huge
    7fd21d800000 default file=/anon_hugepage\040(deleted) huge dirty=425
    mapmax=8 N1=425 kernelpagesize_kB=2048 [!!!]
    
    So your patch doesn't do anything special for anything other than
    Buffer Blocks and PGPROC in the above picture, so the default
    mmap() just keeps the "default" NUMA policy, which per the above takes
    (517+425) * 2MB = 1884 MB of really used memory, as per the N1 entries. PG
    does touch those regions on startup, but it doesn't really touch Buffer
    Blocks. Anyway, this causes the missing amount of free huge pages on
    N1 (it generates pressure on Node 1).
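
    The (517+425) figure can be reproduced by summing the N<node>=<pages>
    counters of the "default" policy regions. A minimal sketch, assuming
    each numa_maps entry has been joined back onto one line (the mail
    wraps them) and using a hypothetical helper pages_per_node:

```python
import re
from collections import Counter

PAGE_MB = 2  # kernelpagesize_kB=2048 in the entries above
NODE_PAGES = re.compile(r"\bN(\d+)=(\d+)")


def pages_per_node(numa_maps_text, policy_prefix="default"):
    """Sum touched pages per node over regions whose policy matches."""
    totals = Counter()
    for line in numa_maps_text.splitlines():
        fields = line.split()
        # fields[0] is the start address, fields[1] is the policy.
        if len(fields) < 2 or not fields[1].startswith(policy_prefix):
            continue
        for node, pages in NODE_PAGES.findall(line):
            totals[int(node)] += int(pages)
    return totals


# Subset of the entries shown above, one entry per line.
sample = r"""
7fc9cb200000 default file=/anon_hugepage\040(deleted) huge dirty=517 mapmax=8 N1=517 kernelpagesize_kB=2048
7fca0d600000 bind:0 file=/anon_hugepage\040(deleted) huge dirty=32 mapmax=2 N0=32 kernelpagesize_kB=2048
7fca1d600000 default file=/anon_hugepage\040(deleted) huge
7fd21d800000 default file=/anon_hugepage\040(deleted) huge dirty=425 mapmax=8 N1=425 kernelpagesize_kB=2048
"""

default_pages = pages_per_node(sample)
for node, pages in sorted(default_pages.items()):
    print(f"N{node}: {pages} pages = {pages * PAGE_MB} MB under 'default' policy")
```

    Passing policy_prefix="bind" or "interleave" instead gives the same
    breakdown for the other kinds of regions.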
    
    So as it stands, the patchset is missing some form of balancing to use
    equal memory across nodes:
    - each node is forced to get a certain amount of BufferBlocks/NUMA nodes blocks
    - yet we do nothing and leave the other regions at the "defaults"
    (e.g. $SegHDR (start of shm) .. first Buffers Block), as those are
    placed on the current node (due to the default policy), which in turn
    causes this memory overallocation imbalance (so in the example N1 will get
    Buffer Blocks + everything else, but that only happens on real access,
    not during mmap(), due to the lazy/first-touch policy)
    
    Currently, launching anything that touches memory on the imbalanced
    NUMA node with the deficit (N1 above), e.g. use of pg_shm_allocations
    or pg_buffercache, will cause stress there and end up in SIGBUS.
    This looks to be by design on the Linux kernel side: exc_page_fault() ->
    do_user_addr_fault() -> do_sigbus() AKA force_sig_fault(). But if I
    hack PG to do interleaving (or just use numactl --interleave=all ... )
    to effectively interleave those 3 "default" regions instead, I'll
    get "interleave" like that:
    
    7fb2dd000000 interleave:0-3 file=/anon_hugepage\040(deleted) huge
    dirty=517 mapmax=8 N0=129 N1=132 N2=128 N3=128 kernelpagesize_kB=2048
    7fb31f400000 bind:0 file=/anon_hugepage\040(deleted) huge dirty=32
    mapmax=2 N0=32 kernelpagesize_kB=2048
    7fb323400000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=32
    mapmax=2 N1=32 kernelpagesize_kB=2048
    7fb327400000 bind:2 file=/anon_hugepage\040(deleted) huge dirty=32
    mapmax=2 N2=32 kernelpagesize_kB=2048
    7fb32b400000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=32
    mapmax=2 N3=32 kernelpagesize_kB=2048
    7fb32f400000 interleave:0-3 file=/anon_hugepage\040(deleted) huge
    7fb32f600000 bind:0 file=/anon_hugepage\040(deleted) huge
    7fb52f600000 bind:1 file=/anon_hugepage\040(deleted) huge
    7fb72f600000 bind:2 file=/anon_hugepage\040(deleted) huge
    7fb92f600000 bind:3 file=/anon_hugepage\040(deleted) huge
    7fbb2f600000 interleave:0-3 file=/anon_hugepage\040(deleted) huge
    dirty=425 N0=106 N1=106 N2=105 N3=108 kernelpagesize_kB=2048
    
    then even after fully touching everything (via a select from
    pg_shm_allocations), it'll run, I'll get much better balance, and won't
    have SIGBUS issues:
    
    node0/hugepages/hugepages-2048kB/free_hugepages:23
    node1/hugepages/hugepages-2048kB/free_hugepages:23
    node2/hugepages/hugepages-2048kB/free_hugepages:23
    node3/hugepages/hugepages-2048kB/free_hugepages:22
    
    This demonstrates that enough free memory is out there, it's
    just the imbalance that causes SIGBUS. I hope this answers one of
    your main questions from the very first messages: what we should do
    with the remaining shared_buffer members. I would like to hear your
    thoughts on this before I start benchmarking this for real; I didn't
    want to bench it yet, as such interleaving could alter the test
    results.
    
    Other things I've noticed:
    - smaps Size: and Shared_Hugetlb: reporting is a lie: it shows only
    really touched memory, not assigned memory
    - same goes for procfs's numa_maps: ignore the N[0-3] sizes, they are
    only "really used", not assigned
    - the best approach is to manually calculate the size from the
    pointers/address range itself
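
    For that last point, a rough sketch of the calculation using the
    start addresses from the numa_maps listing in step 3. It assumes the
    listed regions are contiguous, so each region ends where the next
    one starts (numa_maps itself only prints start addresses); the
    helper name region_sizes_mib is made up for illustration.

```python
MIB = 1024 * 1024


def region_sizes_mib(entries):
    """Return (policy, size in MiB) per region, assuming contiguity.

    The last entry is skipped because its end address is unknown here.
    """
    sizes = []
    for (start, policy), (next_start, _) in zip(entries, entries[1:]):
        size_bytes = int(next_start, 16) - int(start, 16)
        sizes.append((policy, size_bytes // MIB))
    return sizes


# (start address, policy) pairs copied from the listing in step 3.
entries = [
    ("7fca1d600000", "default"),
    ("7fca1d800000", "bind:0"),
    ("7fcc1d800000", "bind:1"),
    ("7fce1d800000", "bind:2"),
    ("7fd01d800000", "bind:3"),
    ("7fd21d800000", "default"),
]

sizes = region_sizes_mib(entries)
for policy, mib in sizes:
    print(f"{policy:8s} {mib:6d} MiB")
```

    The four bind:N regions come out at 8192 MiB each, i.e. 32GB in
    total, which matches shared_buffers=32GB split across 4 nodes.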
    
    -J.