Thread

Commits

  1. Remove PG_MMAP_FLAGS from mem.h

  2. Improve runtime and output of tests for replication slots checkpointing.

  3. Revert support for improved tracking of nested queries

  4. Use exported symbols list on macOS for loadable modules as well

  5. Add support for basic NUMA awareness

  6. Avoid unnecessary copying of a string in pg_restore.c

  7. aio: Infrastructure for io_method=worker

  8. Improve InitShmemAccess() prototype

  1. Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2024-10-18T19:21:19Z

    TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
    changing shared memory mapping layout. Any feedback is appreciated.
    
    Hi,
    
    Being able to change PostgreSQL configuration on the fly is an important
    property for performance tuning, since it reduces the feedback time and
    invasiveness of the process. In certain cases it even becomes highly desired,
    e.g. when doing automatic tuning. But there are a couple of important
    configuration options that cannot be modified without a restart, the most
    notorious example being shared_buffers.
    
    I've been working recently on an idea for changing that, so that
    shared_buffers can be modified without a restart. To demonstrate the approach, I've prepared a
    PoC that ignores lots of stuff, but works in a limited set of use cases I was
    testing. I would like to discuss the idea and get some feedback.
    
    Patches 1-3 prepare the infrastructure and shared memory layout. They could be
    useful even with multithreaded PostgreSQL, when there will be no need for
    shared memory. I assume that in the multithreaded world there will still be a need
    for a contiguous chunk of memory to share between threads, and its layout would
    be similar to the one with shared memory mappings.
    
    Patch 4 actually does the resizing. It's shared memory specific of course, and
    utilizes the Linux-specific mremap, which leaves portability questions open.
    
    Patch 5 is somewhat independent, but quite convenient to have. It also utilizes
    Linux specific call memfd_create.
    
    The patch set still doesn't address lots of things, e.g. shared memory segment
    detach/reattach, portability questions, it doesn't touch EXEC_BACKEND code and
    huge pages.
    
    So far I was doing some rudimentary testing: spinning up PostgreSQL, then
    increasing shared_buffers and running pgbench with the scale factor large
    enough to extend the data set into newly allocated buffers:
    
        -- shared_buffers 128 MB
        =# SELECT * FROM pg_buffercache_summary();
         buffers_used | buffers_unused | buffers_dirty | buffers_pinned
        --------------+----------------+---------------+----------------
                  134 |          16250 |             1 |              0
    
        -- change shared_buffers to 512 MB
        =# select pg_reload_conf();
        =# SELECT * FROM pg_buffercache_summary();
         buffers_used | buffers_unused | buffers_dirty | buffers_pinned
        --------------+----------------+---------------+----------------
                  221 |          65315 |             1 |              0
    
        -- round of pgbench read-only load
        =# SELECT * FROM pg_buffercache_summary();
         buffers_used | buffers_unused | buffers_dirty | buffers_pinned
        --------------+----------------+---------------+----------------
                41757 |          23779 |           216 |              0
    
    Here is the breakdown:
    
    v1-0001-Allow-to-use-multiple-shared-memory-mappings.patch
    
    Preparation, introduces the possibility to work with many shmem mappings. To
    make it less invasive, I've duplicated the shmem API to extend it with the
    shmem_slot argument, while redirecting the original API to it. There are
    probably better ways of doing that, I'm open for suggestions.
    
    v1-0002-Allow-placing-shared-memory-mapping-with-an-offse.patch
    
    Implements a new layout of shared memory mappings to include room for resizing.
    I've done a couple of tests to verify that such space in between doesn't affect
    how the kernel calculates actual used memory, to make sure that e.g. cgroup
    will not trigger OOM. The only change seems to be in VmPeak, which is total
    mapped pages.
    
    v1-0003-Introduce-multiple-shmem-slots-for-shared-buffers.patch
    
    Splits shared_buffers into multiple slots, moving out structures that depend on
    NBuffers into separate mappings. There are two large gaps here:
    
    * Shmem size calculation for those mappings is not correct yet, it includes too
      many other things (no particular issues here, just haven't had time).
    * It makes hardcoded assumptions about what is the upper limit for resizing,
      which is currently low purely for experiments. Ideally there should be a new
      configuration option to specify the total available memory, which would be a
      base for subsequent calculations.
    
    v1-0004-Allow-to-resize-shared-memory-without-restart.patch
    
    Performs the shared_buffers change without a restart. The current approach is clumsy: it adds
    an assign hook for shared_buffers and goes from there using mremap to resize
    mappings. But I haven't immediately found any better approach. Currently it
    supports only an increase of shared_buffers.
    
    v1-0005-Use-anonymous-files-to-back-shared-memory-segment.patch
    
    Allows an anonymous file to back a shared mapping. This makes certain things
    easier, e.g. mappings visual representation, and gives an fd for possible
    future customizations.
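
    A minimal C sketch of the mechanism described in the breakdown above, not
    code from the patch set: a mapping backed by an anonymous file
    (memfd_create) is placed so that free address space follows it, and is
    later grown in place with mremap. The names DemoSegment,
    demo_segment_create and demo_segment_grow are hypothetical, and the sketch
    assumes Linux with glibc 2.27 or later and page-aligned sizes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical descriptor for one resizable shared mapping. */
typedef struct DemoSegment
{
    int     fd;         /* anonymous file backing the mapping */
    char   *addr;       /* start of the mapping */
    size_t  size;       /* currently mapped bytes */
    size_t  max_size;   /* address space kept free for growth */
} DemoSegment;

/*
 * Map `size` bytes and keep the address range up to `max_size` unoccupied.
 * Both sizes must be multiples of the page size. Error paths do not clean
 * up, to keep the sketch short.
 */
static int
demo_segment_create(DemoSegment *seg, size_t size, size_t max_size)
{
    size_t  page = (size_t) sysconf(_SC_PAGESIZE);
    void   *base;

    assert(size > 0 && size <= max_size);
    assert(size % page == 0 && max_size % page == 0);

    seg->fd = memfd_create("demo_shmem", MFD_CLOEXEC);
    if (seg->fd < 0 || ftruncate(seg->fd, (off_t) size) != 0)
        return -1;

    /* Find a free range of max_size bytes by reserving it inaccessibly. */
    base = mmap(NULL, max_size, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    /* Put the real mapping at the head of that range ... */
    seg->addr = mmap(base, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_FIXED, seg->fd, 0);
    if (seg->addr == MAP_FAILED)
        return -1;

    /* ... and release the tail so that mremap() can later grow into it. */
    if (size < max_size &&
        munmap((char *) base + size, max_size - size) != 0)
        return -1;

    seg->size = size;
    seg->max_size = max_size;
    return 0;
}

/* Grow the mapping in place; fail rather than move it. */
static int
demo_segment_grow(DemoSegment *seg, size_t new_size)
{
    void *p;

    if (new_size <= seg->size || new_size > seg->max_size)
        return -1;

    /* The backing file must cover the new size before it is touched. */
    if (ftruncate(seg->fd, (off_t) new_size) != 0)
        return -1;

    /* Without MREMAP_MAYMOVE the kernel extends in place or fails. */
    p = mremap(seg->addr, seg->size, new_size, 0);
    if (p == MAP_FAILED)
        return -1;

    assert(p == (void *) seg->addr);
    seg->size = new_size;
    return 0;
}
```

    Because MREMAP_MAYMOVE is not passed, a caller of demo_segment_grow either
    keeps the same start address or gets an error, so pointers into the
    existing part of the mapping stay valid.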
    
    In this thread I'm hoping to answer the following questions:
    
    * Are there any concerns about this approach?
    * What would be a better mechanism to handle resizing than an assign hook?
    * Assuming I'll be able to address already known missing bits, what are the
      chances the patch series could be accepted?
    
  2. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2024-11-01T15:27:50Z

    > On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
    >
    > TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
    > changing shared memory mapping layout. Any feedback is appreciated.
    
    It was pointed out to me that earlier this year there was a useful
    discussion about similar matters "PGC_SIGHUP shared_buffers?" [1]. From
    what I see the patch series falls into the "re-map" category in that
    thread.
    
    [1]: https://www.postgresql.org/message-id/flat/CA%2BTgmoaGCFPhMjz7veJOeef30%3DKdpOxgywcLwNbr-Gny-mXwcg%40mail.gmail.com
    
    
    
    
  3. Re: Changing shared_buffers without restart

    Vladlen Popolitov <v.popolitov@postgrespro.ru> — 2024-11-06T19:10:06Z

    Hi
    
    I tried to apply the patches, but failed. I suppose the problem is CRLF line endings in the patch files. After manually converting v1-0001 and v1-0002 from CRLF to LF, those patches applied, but that did not help for v1-0003 through v1-0005, which also hit other errors while being applied. Could you check the patch files and post them in the correct format?
    
    The new status of this patch is: Waiting on Author
    
  4. Re: Changing shared_buffers without restart

    Thomas Munro <thomas.munro@gmail.com> — 2024-11-07T01:05:52Z

    On Sat, Oct 19, 2024 at 8:21 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    > Currently it
    > supports only an increase of shared_buffers.
    
    Just BTW in case it is interesting, Palak and I experimented with how
    to shrink the buffer pool while PostgreSQL is running, while we were
    talking about 13453ee (which it shares infrastructure with).  This
    version fails if something is pinned and in the way of the shrink
    operation, but you could imagine other policies (wait, cancel it,
    ...):
    
    https://github.com/macdice/postgres/commit/db26fe0c98476cdbbd1bcf553f3b7864cb142247
    
    
    
    
  5. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2024-11-08T16:40:03Z

    > On Thu, Nov 07, 2024 at 02:05:52PM GMT, Thomas Munro wrote:
    > On Sat, Oct 19, 2024 at 8:21 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    > > Currently it
    > > supports only an increase of shared_buffers.
    >
    > Just BTW in case it is interesting, Palak and I experimented with how
    > to shrink the buffer pool while PostgreSQL is running, while we were
    > talking about 13453ee (which it shares infrastructure with).  This
    > version fails if something is pinned and in the way of the shrink
    > operation, but you could imagine other policies (wait, cancel it,
    > ...):
    >
    > https://github.com/macdice/postgres/commit/db26fe0c98476cdbbd1bcf553f3b7864cb142247
    
    Thanks, looks interesting. I'll try to experiment with that in the next
    version of the patch.
    
    
    
    
  6. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2024-11-08T16:43:06Z

    > On Wed, Nov 06, 2024 at 07:10:06PM GMT, Vladlen Popolitov wrote:
    > Hi
    >
    > I tried to apply the patches, but failed. I suppose the problem is CRLF line endings in the patch files. After manually converting v1-0001 and v1-0002 from CRLF to LF, those patches applied, but that did not help for v1-0003 through v1-0005, which also hit other errors while being applied. Could you check the patch files and post them in the correct format?
    >
    > The new status of this patch is: Waiting on Author
    
    Well, I'm going to rebase the patch if that's what you mean. But just
    FYI -- it could be applied without any issues to the base commit
    mentioned in the series.
    
    
    
    
  7. Re: Changing shared_buffers without restart

    Peter Eisentraut <peter@eisentraut.org> — 2024-11-19T12:57:00Z

    On 18.10.24 21:21, Dmitry Dolgov wrote:
    > v1-0001-Allow-to-use-multiple-shared-memory-mappings.patch
    > 
    > Preparation, introduces the possibility to work with many shmem mappings. To
    > make it less invasive, I've duplicated the shmem API to extend it with the
    > shmem_slot argument, while redirecting the original API to it. There are
    > probably better ways of doing that, I'm open for suggestions.
    
    After studying this a bit, I tend to think you should just change the 
    existing APIs in place.  So for example,
    
    void *ShmemAlloc(Size size);
    
    becomes
    
    void *ShmemAlloc(int shmem_slot, Size size);
    
    There aren't that many callers, and all these duplicated interfaces 
    almost add more new code than they save.
    
    It might be worth making exceptions for interfaces that are likely to be 
    used by extensions.  For example, I see pg_stat_statements using 
    ShmemInitStruct() and ShmemInitHash().  But that seems to be it.  Are 
    there any other examples out there?  Maybe there are many more that I 
    don't see right now.  But at least for the initialization functions, it 
    doesn't seem worth it to preserve the existing interfaces exactly.
    
    In any case, I think the slot number should be the first argument.  This 
    matches how MemoryContextAlloc() or also talloc() work.
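
    A minimal sketch of what the slot-first signature could look like,
    assuming a simple bump allocator per slot. DemoShmemAlloc,
    DemoShmemAllocLegacy and the DEMO_* constants are hypothetical
    illustrations, not PostgreSQL APIs, and static arrays stand in for the
    real shared mappings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NUM_SLOTS   2
#define DEMO_SLOT_SIZE   4096
#define DEMO_ALIGN       8
#define DEMO_MAIN_SLOT   0

/* Hypothetical stand-ins for separately mapped shared memory slots. */
static _Alignas(DEMO_ALIGN) unsigned char demo_memory[DEMO_NUM_SLOTS][DEMO_SLOT_SIZE];
static size_t demo_used[DEMO_NUM_SLOTS];

/* The slot comes first, like the context argument of MemoryContextAlloc(). */
static void *
DemoShmemAlloc(int shmem_slot, size_t size)
{
    size_t  aligned;
    void   *result;

    assert(shmem_slot >= 0 && shmem_slot < DEMO_NUM_SLOTS);

    /* Round up so that every returned pointer stays DEMO_ALIGN-aligned. */
    aligned = (size + DEMO_ALIGN - 1) & ~(size_t) (DEMO_ALIGN - 1);
    if (aligned < size || aligned > DEMO_SLOT_SIZE - demo_used[shmem_slot])
        return NULL;            /* overflow, or the slot is exhausted */

    result = &demo_memory[shmem_slot][demo_used[shmem_slot]];
    demo_used[shmem_slot] += aligned;
    return result;
}

/* Optional wrapper for callers, such as extensions, on the old signature. */
static void *
DemoShmemAllocLegacy(size_t size)
{
    return DemoShmemAlloc(DEMO_MAIN_SLOT, size);
}
```

    Keeping only a thin wrapper like DemoShmemAllocLegacy for the
    extension-facing entry points is one way to limit the duplicated
    interfaces to the few functions that external code is known to call.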
    
    (Now here is an idea:  Could these just be memory contexts?  Instead of 
    making six shared memory slots, could you make six memory contexts with 
    a special shared memory type.  And ShmemAlloc becomes the allocation 
    function, etc.?)
    
    I noticed the existing code made inconsistent use of PGShmemHeader * vs. 
    void *, which also bled into your patch.  I made the attached little 
    patch to clean that up a bit.
    
    I suggest splitting the struct ShmemSegment into one struct for the 
    three memory addresses and a separate array just for the slock_t's.  The 
    former struct can then stay private in storage/ipc/shmem.c, only the 
    locks need to be exported.
    
    Maybe rename ANON_MAPPINGS to something like NUM_ANON_MAPPINGS.
    
    Also, maybe some of this should be declared in storage/shmem.h rather 
    than in storage/pg_shmem.h.  We have the existing ShmemLock in there, so 
    it would be a bit confusing to have the per-segment locks elsewhere.
    
    
    > v1-0003-Introduce-multiple-shmem-slots-for-shared-buffers.patch
    > 
    > Splits shared_buffers into multiple slots, moving out structures that depend on
    > NBuffers into separate mappings. There are two large gaps here:
    > 
    > * Shmem size calculation for those mappings is not correct yet, it includes too
    >    many other things (no particular issues here, just haven't had time).
    > * It makes hardcoded assumptions about what is the upper limit for resizing,
    >    which is currently low purely for experiments. Ideally there should be a new
    >    configuration option to specify the total available memory, which would be a
    >    base for subsequent calculations.
    
    Yes, I imagine a shared_buffers_hard_limit setting.  We could maybe 
    default that to the total available memory, but it would also be good to 
    be able to specify it directly, for testing.
    
    
    > v1-0005-Use-anonymous-files-to-back-shared-memory-segment.patch
    > 
    > Allows an anonymous file to back a shared mapping. This makes certain things
    > easier, e.g. mappings visual representation, and gives an fd for possible
    > future customizations.
    
    I think this could be a useful patch just by itself, without the rest of 
    the series, because of
    
     > * By default, Linux will not add file-backed shared mappings into a
     > core dump, making it more convenient to work with them in PostgreSQL:
     > no more huge dumps to process.
    
    This could be a significant operational benefit.
    
    When you say "by default", is this adjustable?  Does someone actually 
    want the whole shared memory in their core file?  (If it's adjustable, 
    is it also adjustable for anonymous mappings?)
    
    I'm wondering about this change:
    
    -#define PG_MMAP_FLAGS (MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
    +#define PG_MMAP_FLAGS                  (MAP_SHARED|MAP_HASSEMAPHORE)
    
    It looks like this would affect all mmap() calls, not only the one 
    you're changing.  But that's the only one that uses this macro!  I don't 
    understand why we need this; I don't see anything in the commit log 
    about this ever being used for any portability.  I think we should just 
    get rid of it and have mmap() use the right flags directly.
    
    I see that FreeBSD has a memfd_create() function.  Might be worth a try.
    Obviously, this whole thing needs a configure test for memfd_create()
    anyway.
    
    I see that memfd_create() has a MFD_HUGETLB flag.  It's not very clear 
    how that interacts with the MAP_HUGETLB flag for mmap().  Do you need to 
    specify both of them if you want huge pages?
    
  8. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2024-11-19T13:29:12Z

    > On Tue, Nov 19, 2024 at 01:57:00PM GMT, Peter Eisentraut wrote:
    > On 18.10.24 21:21, Dmitry Dolgov wrote:
    > > v1-0001-Allow-to-use-multiple-shared-memory-mappings.patch
    > >
    > > Preparation, introduces the possibility to work with many shmem mappings. To
    > > make it less invasive, I've duplicated the shmem API to extend it with the
    > > shmem_slot argument, while redirecting the original API to it. There are
    > > probably better ways of doing that, I'm open for suggestions.
    >
    > After studying this a bit, I tend to think you should just change the
    > existing APIs in place.  So for example,
    >
    > void *ShmemAlloc(Size size);
    >
    > becomes
    >
    > void *ShmemAlloc(int shmem_slot, Size size);
    >
    > There aren't that many callers, and all these duplicated interfaces almost
    > add more new code than they save.
    >
    > It might be worth making exceptions for interfaces that are likely to be
    > used by extensions.  For example, I see pg_stat_statements using
    > ShmemInitStruct() and ShmemInitHash().  But that seems to be it.  Are there
    > any other examples out there?  Maybe there are many more that I don't see
    > right now.  But at least for the initialization functions, it doesn't seem
    > worth it to preserve the existing interfaces exactly.
    >
    > In any case, I think the slot number should be the first argument.  This
    > matches how MemoryContextAlloc() or also talloc() work.
    
    Yeah, agree. I'll reshape this part, thanks.
    
    > (Now here is an idea:  Could these just be memory contexts?  Instead of
    > making six shared memory slots, could you make six memory contexts with a
    > special shared memory type.  And ShmemAlloc becomes the allocation function,
    > etc.?)
    
    Sounds interesting. I don't know how well the memory context interface
    would fit here, but I'll do some investigation.
    
    > I noticed the existing code made inconsistent use of PGShmemHeader * vs.
    > void *, which also bled into your patch.  I made the attached little patch
    > to clean that up a bit.
    
    Right, it was bothering me the whole time, but not strongly enough to make
    me fix this in the PoC just yet.
    
    > I suggest splitting the struct ShmemSegment into one struct for the three
    > memory addresses and a separate array just for the slock_t's.  The former
    > struct can then stay private in storage/ipc/shmem.c, only the locks need to
    > be exported.
    >
    > Maybe rename ANON_MAPPINGS to something like NUM_ANON_MAPPINGS.
    >
    > Also, maybe some of this should be declared in storage/shmem.h rather than
    > in storage/pg_shmem.h.  We have the existing ShmemLock in there, so it would
    > be a bit confusing to have the per-segment locks elsewhere.
    >
    > [...]
    >
    > I'm wondering about this change:
    >
    > -#define PG_MMAP_FLAGS (MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
    > +#define PG_MMAP_FLAGS                  (MAP_SHARED|MAP_HASSEMAPHORE)
    >
    > It looks like this would affect all mmap() calls, not only the one you're
    > changing.  But that's the only one that uses this macro!  I don't understand
    > why we need this; I don't see anything in the commit log about this ever
    > being used for any portability.  I think we should just get rid of it and
    > have mmap() use the right flags directly.
    >
    > I see that FreeBSD has a memfd_create() function.  Might be worth a try.
    > Obviously, this whole thing needs a configure test for memfd_create()
    > anyway.
    
    Yep, those points make sense to me.
    
    > > v1-0005-Use-anonymous-files-to-back-shared-memory-segment.patch
    > >
    > > Allows an anonymous file to back a shared mapping. This makes certain things
    > > easier, e.g. mappings visual representation, and gives an fd for possible
    > > future customizations.
    >
    > I think this could be a useful patch just by itself, without the rest of the
    > series, because of
    >
    > > * By default, Linux will not add file-backed shared mappings into a
    > > core dump, making it more convenient to work with them in PostgreSQL:
    > > no more huge dumps to process.
    >
    > This could be significant operational benefit.
    >
    > When you say "by default", is this adjustable?  Does someone actually want
    > the whole shared memory in their core file?  (If it's adjustable, is it also
    > adjustable for anonymous mappings?)
    
    Yes, there is /proc/<pid>/coredump_filter [1], which lets you specify
    what to include. One can ask to exclude anon, file-backed and hugetlb
    shared memory, with the only caveat that it's per process. I guess
    normally no one wants to have a full shared memory in the coredump, but
    there could be exceptions.
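
    A small sketch of how the filter mask can be inspected from C. The bit
    positions follow the kernel documentation referenced as [1], which also
    gives 0x33 as the default value; the DEMO_CORE_* names and both functions
    are hypothetical helpers written for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Bit positions documented for /proc/<pid>/coredump_filter. */
enum
{
    DEMO_CORE_ANON_PRIVATE = 0,
    DEMO_CORE_ANON_SHARED = 1,
    DEMO_CORE_FILE_PRIVATE = 2,
    DEMO_CORE_FILE_SHARED = 3,
    DEMO_CORE_ELF_HEADERS = 4,
    DEMO_CORE_HUGETLB_PRIVATE = 5,
    DEMO_CORE_HUGETLB_SHARED = 6
};

/* True if mappings of the given kind are written to a core dump. */
static bool
demo_core_includes(unsigned long filter, int bit)
{
    return ((filter >> bit) & 1UL) != 0;
}

/* Read the calling process's filter; returns -1 if it cannot be read. */
static int
demo_read_core_filter(unsigned long *filter)
{
    FILE *f = fopen("/proc/self/coredump_filter", "r");
    int   matched;

    if (f == NULL)
        return -1;
    matched = fscanf(f, "%lx", filter);  /* the file holds a hexadecimal mask */
    fclose(f);
    return matched == 1 ? 0 : -1;
}
```

    With the documented default of 0x33, demo_core_includes reports that
    anonymous shared mappings are dumped while file-backed shared mappings
    are not, which is the difference the memfd-backed mapping relies on.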
    
    > I see that memfd_create() has a MFD_HUGETLB flag.  It's not very clear how
    > that interacts with the MAP_HUGETLB flag for mmap().  Do you need to specify
    > both of them if you want huge pages?
    
    Correct, both (one flag in memfd_create and one for mmap) are needed to
    use huge pages.
    
    [1]: https://www.kernel.org/doc/html/latest/filesystems/proc.html#proc-pid-coredump-filter-core-dump-filtering-settings
    
    
    
    
  9. Re: Changing shared_buffers without restart

    Peter Eisentraut <peter@eisentraut.org> — 2024-11-21T07:55:40Z

    On 19.11.24 14:29, Dmitry Dolgov wrote:
    >> I see that memfd_create() has a MFD_HUGETLB flag.  It's not very clear how
    >> that interacts with the MAP_HUGETLB flag for mmap().  Do you need to specify
    >> both of them if you want huge pages?
    > Correct, both (one flag in memfd_create and one for mmap) are needed to
    > use huge pages.
    
    I was worried because the FreeBSD man page says
    
    MFD_HUGETLB	  This flag is currently unsupported.
    
    It looks like FreeBSD doesn't have MAP_HUGETLB, so maybe this is irrelevant.
    
    But you should make sure in your patch that the right set of flags for 
    huge pages is passed.
    
    
    
    
    
  10. Re: Changing shared_buffers without restart

    Robert Haas <robertmhaas@gmail.com> — 2024-11-25T19:33:48Z

    On Fri, Oct 18, 2024 at 3:21 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    > TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
    > changing shared memory mapping layout. Any feedback is appreciated.
    
    A lot of people would like to have this feature, so I hope this
    proposal works out. Thanks for working on it.
    
    I think the idea of having multiple shared memory segments is
    interesting and makes sense, but I would prefer to see them called
    "segments" rather than "slots" just as we do for DSMs. The name
    "slot" is somewhat overused, and invites confusion with replication
    slots, inter alia. I think it's possible that having multiple fixed
    shared memory segments will spell trouble on Windows, where we already
    need to use a retry loop to try to get the main shared memory segment
    mapped at the correct address. If there are multiple segments and we
    need whatever ASLR stuff happens on Windows to not place anything else
    overlapping with any of them, that means there's more chances for
    stuff to fail than if we just need one address range to be free.
    Granted, the individual ranges are smaller, so maybe it's fine? But I
    don't know.
    
    The big thing that worries me is synchronization, and while I've only
    looked at the patch set briefly, it doesn't look to me as though
    there's enough machinery here to make that work correctly. Suppose
    that shared_buffers=8GB (a million buffers) and I change it to
    shared_buffers=16GB (2 million buffers). As soon as any one backend
    has seen that changed and expanded shared_buffers, there's a
    possibility that some other backend which has not yet seen the change
    might see a buffer number greater than a million. If it tries to use
    that buffer number before it absorbs the change, something bad will
    happen. The most obvious way for it to see such a buffer number - and
    possibly the only one - is to do a lookup in the buffer mapping table
    and find a buffer ID there that was inserted by some other backend
    that has already seen the change.
    
    Fixing this seems tricky. My understanding is that BufferGetBlock() is
    extremely performance-critical, so having to do a bounds check there
    to make sure that a given buffer number is in range would probably be
    bad for performance. Also, even if the overhead weren't prohibitive, I
    don't think we can safely stick code that unmaps and remaps shared
    memory segments into a function that currently just does math, because
    we've probably got places where we assume this operation can't fail --
    as well as places where we assume that if we call BufferGetBlock(i)
    and then BufferGetBlock(j), the second call won't change the answer to
    the first.
    
    It seems to me that it's probably only safe to swap out a backend's
    notion of where shared_buffers is located when the backend holds no
    buffer pins, and maybe not even all such places, because it would be a
    problem if a backend looks up the address of a buffer before actually
    pinning it, on the assumption that the answer can't change. I don't
    know if that ever happens, but it would be a legal coding pattern
    today. Doing it between statements seems safe as long as there are no
    cursors holding pins. Doing it in the middle of a statement is
    probably possible if we can verify that we're at a "safe" point in the
    code, but I'm not sure exactly which points are safe. If we have no
    code anywhere that assumes the address of an unpinned buffer can't
    change before we pin it, then I guess the check for pins is the only
    thing we need, but I don't know that to be the case.
    
    I guess I would have imagined that a change like this would have to be
    done in phases. In phase 1, we'd tell all of the backends that
    shared_buffers had expanded to some new, larger value; but the new
    buffers wouldn't be usable for anything yet. Then, once we confirmed
    that everyone had the memo, we'd tell all the backends that those
    buffers are now available for use. If shared_buffers were contracted,
    phase 1 would tell all of the backends that shared_buffers had
    contracted to some new, smaller value. Once a particular backend
    learns about that, they will refuse to put any new pages into those
    high-numbered buffers, but the existing contents would still be valid.
    Once everyone has been told about this, we can go through and evict
    all of those buffers, and then let everyone know that's done. Then
    they shrink their mappings.
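
    The phased scheme described above can be sketched as a small state
    machine. This is a single-process simplification under stated
    assumptions: DemoResizeState and the demo_* functions are hypothetical,
    real code would keep the state in shared memory with proper locking and
    signalling, and the eviction step that precedes a shrink is omitted.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NUM_BACKENDS 3

/* Hypothetical coordinator state for one resize operation. */
typedef struct DemoResizeState
{
    int  usable_buffers;    /* buffers that are currently in effect */
    int  pending_buffers;   /* announced target, not yet in effect */
    bool acked[DEMO_NUM_BACKENDS];
} DemoResizeState;

/* Phase 1: announce the new size; nothing becomes usable yet. */
static void
demo_announce(DemoResizeState *st, int new_buffers)
{
    st->pending_buffers = new_buffers;
    for (int i = 0; i < DEMO_NUM_BACKENDS; i++)
        st->acked[i] = false;
}

/* A backend acknowledges once it has absorbed the announcement. */
static void
demo_ack(DemoResizeState *st, int backend)
{
    assert(backend >= 0 && backend < DEMO_NUM_BACKENDS);
    st->acked[backend] = true;
}

/* Phase 2: switch over only after every backend has acknowledged. */
static bool
demo_try_activate(DemoResizeState *st)
{
    for (int i = 0; i < DEMO_NUM_BACKENDS; i++)
        if (!st->acked[i])
            return false;
    st->usable_buffers = st->pending_buffers;
    return true;
}

/*
 * Highest buffer count new pages may be placed in. While a resize is in
 * flight this is the smaller of the two sizes: new buffers are not used
 * early on expansion, and high-numbered buffers get no new pages on shrink.
 */
static int
demo_allocation_limit(const DemoResizeState *st)
{
    return st->pending_buffers < st->usable_buffers
        ? st->pending_buffers
        : st->usable_buffers;
}
```

    In this sketch an expansion from 100 to 200 buffers keeps the allocation
    limit at 100 until the last backend has acknowledged, which is the
    property that prevents a lagging backend from seeing a buffer number it
    cannot yet address.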
    
    It looks to me like the patch doesn't expand the buffer mapping table,
    which seems essential. But maybe I missed that.
    
    -- 
    Robert Haas
    EDB: http://www.enterprisedb.com
    
    
    
    
  11. Re: Changing shared_buffers without restart

    Peter Eisentraut <peter@eisentraut.org> — 2024-11-26T07:53:39Z

    On 19.11.24 14:29, Dmitry Dolgov wrote:
    >> I noticed the existing code made inconsistent use of PGShmemHeader * vs.
    >> void *, which also bled into your patch.  I made the attached little patch
    >> to clean that up a bit.
    > Right, it was bothering me the whole time, but not strong enough to make
    > me fix this in the PoC just yet.
    
    I committed a bit of this, so check that when you're rebasing your patch 
    set.
    
    
    
    
    
  12. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2024-11-26T19:17:58Z

    > On Mon, Nov 25, 2024 at 02:33:48PM GMT, Robert Haas wrote:
    >
    > I think the idea of having multiple shared memory segments is
    > interesting and makes sense, but I would prefer to see them called
    > "segments" rather than "slots" just as we do for DSMs. The name
    > "slot" is somewhat overused, and invites confusion with replication
    > slots, inter alia. I think it's possible that having multiple fixed
    > shared memory segments will spell trouble on Windows, where we already
    > need to use a retry loop to try to get the main shared memory segment
    > mapped at the correct address. If there are multiple segments and we
    > need whatever ASLR stuff happens on Windows to not place anything else
    > overlapping with any of them, that means there's more chances for
    > stuff to fail than if we just need one address range to be free.
    > Granted, the individual ranges are smaller, so maybe it's fine? But I
    > don't know.
    
    I haven't had a chance to experiment with that on Windows, but I'm
    hoping that in the worst case a fallback to a single mapping via the proposed
    infrastructure (with the consequent limitations) would be acceptable.
    
    > The big thing that worries me is synchronization, and while I've only
    > looked at the patch set briefly, it doesn't look to me as though
    > there's enough machinery here to make that work correctly. Suppose
    > that shared_buffers=8GB (a million buffers) and I change it to
    > shared_buffers=16GB (2 million buffers). As soon as any one backend
    > has seen that changed and expanded shared_buffers, there's a
    > possibility that some other backend which has not yet seen the change
    > might see a buffer number greater than a million. If it tries to use
    > that buffer number before it absorbs the change, something bad will
    > happen. The most obvious way for it to see such a buffer number - and
    > possibly the only one - is to do a lookup in the buffer mapping table
    > and find a buffer ID there that was inserted by some other backend
    > that has already seen the change.
    
    Right, I haven't put much effort into synchronization yet. It's on my
    list for the next iteration of the patch.
    
    > code, but I'm not sure exactly which points are safe. If we have no
    > code anywhere that assumes the address of an unpinned buffer can't
    > change before we pin it, then I guess the check for pins is the only
    > thing we need, but I don't know that to be the case.
    
    Probably I'm missing something here. What scenario do you have in mind,
    when the address of a buffer is changing?
    
    > I guess I would have imagined that a change like this would have to be
    > done in phases. In phase 1, we'd tell all of the backends that
    > shared_buffers had expanded to some new, larger value; but the new
    > buffers wouldn't be usable for anything yet. Then, once we confirmed
    > that everyone had the memo, we'd tell all the backends that those
    > buffers are now available for use. If shared_buffers were contracted,
    > phase 1 would tell all of the backends that shared_buffers had
    > contracted to some new, smaller value. Once a particular backend
    > learns about that, they will refuse to put any new pages into those
    > high-numbered buffers, but the existing contents would still be valid.
    > Once everyone has been told about this, we can go through and evict
    > all of those buffers, and then let everyone know that's done. Then
    > they shrink their mappings.
    
    Yep, sounds good. I was pondering a cruder approach, but doing
    this in phases seems to be the way to go.
    
    > It looks to me like the patch doesn't expand the buffer mapping table,
    > which seems essential. But maybe I missed that.
    
    Do you mean the "Shared Buffer Lookup Table"? It does expand it, but
    under the somewhat unfitting name STRATEGY_SHMEM_SLOT. But now that I look
    at the code, I see a few issues around that -- so I would have to
    improve it anyway, thanks for pointing that out.
    
    
    
    
  13. Re: Changing shared_buffers without restart

    Robert Haas <robertmhaas@gmail.com> — 2024-11-27T15:20:27Z

    On Tue, Nov 26, 2024 at 2:18 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    > I haven't had a chance to experiment with that on Windows, but I'm
    > hoping that in the worst case fallback to a single mapping via proposed
    > infrastructure (and the consequent limitations) would be acceptable.
    
    Yeah, if you can still fall back to a single mapping, I think that's
    OK. It would be nicer if it could work on every platform in the same
    way, but half a loaf is better than none.
    
    > > code, but I'm not sure exactly which points are safe. If we have no
    > > code anywhere that assumes the address of an unpinned buffer can't
    > > change before we pin it, then I guess the check for pins is the only
    > > thing we need, but I don't know that to be the case.
    >
    > Probably I'm missing something here. What scenario do you have in mind,
    > when the address of a buffer is changing?
    
    I was assuming that if you expand the mapping for shared_buffers, you
    can't count on the new mapping being at the same address as the old
    mapping. If you can, that makes things simpler, but what if the OS has
    mapped something else just afterward, in the address space that you're
    hoping to use when you expand the mapping?
    
    > > It looks to me like the patch doesn't expand the buffer mapping table,
    > > which seems essential. But maybe I missed that.
    >
    > Do you mean the "Shared Buffer Lookup Table"? It does expand it, but
    > under the somewhat unfitting name STRATEGY_SHMEM_SLOT. But now that I look
    > at the code, I see a few issues around that -- so I would have to
    > improve it anyway, thanks for pointing that out.
    
    Yeah, we -- or at least I -- usually call that the buffer mapping
    table. There are identifiers like BufMappingPartitionLock, for
    example. I'm slightly surprised that the ShmemInitHash() call uses
    something else as the identifier, but I guess that's how it is.
    
    -- 
    Robert Haas
    EDB: http://www.enterprisedb.com
    
    
    
    
  14. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2024-11-27T20:48:14Z

    > On Wed, Nov 27, 2024 at 10:20:27AM GMT, Robert Haas wrote:
    > > >
    > > > code, but I'm not sure exactly which points are safe. If we have no
    > > > code anywhere that assumes the address of an unpinned buffer can't
    > > > change before we pin it, then I guess the check for pins is the only
    > > > thing we need, but I don't know that to be the case.
    > >
    > > Probably I'm missing something here. What scenario do you have in mind,
    > > when the address of a buffer is changing?
    >
    > I was assuming that if you expand the mapping for shared_buffers, you
    > can't count on the new mapping being at the same address as the old
    > mapping. If you can, that makes things simpler, but what if the OS has
    > mapped something else just afterward, in the address space that you're
    > hoping to use when you expand the mapping?
    
    Yes, that's the whole point of the exercise with remap -- to keep
    addresses unchanged, making buffer management simpler and allowing
    mappings to be resized more quickly. The trade-off is that we would need
    to take care of where the shared mappings are placed.
    
    My understanding is that clashing of mappings (either at creation time
    or when resizing) could happen only within the process address space,
    and the assumption is that by the time we prepare the mapping layout all
    the other mappings for the process are already done. But I agree, it's
    an interesting question -- I'm going to investigate if those assumptions
    could be wrong under certain conditions. Currently if something else is
    mapped at the same address where we want to expand the mapping, we will
    get an error and can decide how to proceed (e.g. if it happens at
    creation time, proceed with a single mapping, otherwise ignore mapping
    resize).
    
    
    
    
  15. Re: Changing shared_buffers without restart

    Robert Haas <robertmhaas@gmail.com> — 2024-11-27T21:05:47Z

    On Wed, Nov 27, 2024 at 3:48 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    > My understanding is that clashing of mappings (either at creation time
    > or when resizing) could happen only within the process address space,
    > and the assumption is that by the time we prepare the mapping layout all
    > the other mappings for the process are already done.
    
    I don't think that's correct at all. First, the user could type LOAD
    'whatever' at any time. But second, even if they don't or you prohibit
    them from doing so, the process could allocate memory for any of a
    million different things, and that could require mapping a new region
    of memory, and the OS could choose to place that just after an
    existing mapping, or at least close enough that we can't expand the
    object size as much as desired.
    
    If we had an upper bound on the size of shared_buffers and could
    reserve that amount of address space at startup time but only actually
    map a portion of it, then we could later remap and expand into the
    reserved space. Without that, I think there's absolutely no guarantee
    that the amount of address space that we need is available when we
    want to extend a mapping.
    
    -- 
    Robert Haas
    EDB: http://www.enterprisedb.com
    
    
    
    
  16. Re: Changing shared_buffers without restart

    Jelte Fennema-Nio <postgres@jeltef.nl> — 2024-11-27T21:28:06Z

    On Wed, 27 Nov 2024 at 22:06, Robert Haas <robertmhaas@gmail.com> wrote:
    > If we had an upper bound on the size of shared_buffers
    
    I think a fairly reliable upper bound is the amount of physical memory
    on the system at time of postmaster start. We could make it a GUC to
    set the upper bound for the rare cases where people do stuff like
    adding swap space later or doing online VM growth. We could even have
    the default be something like 4x the physical memory to accommodate
    those people by default.
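
As a rough illustration of the default suggested above, the following C sketch derives an upper bound from the physical memory visible at startup. The function names and the multiplier parameter are hypothetical, not part of any patch in this thread, and `_SC_PHYS_PAGES` is a common extension (glibc, the BSDs) rather than strict POSIX.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Physical memory visible to the process, in bytes, or 0 if it cannot be
 * determined. _SC_PHYS_PAGES is an extension, not strict POSIX.
 */
static uint64_t physical_memory_bytes(void)
{
    long pages = sysconf(_SC_PHYS_PAGES);
    long page_size = sysconf(_SC_PAGESIZE);

    if (pages <= 0 || page_size <= 0)
        return 0;
    return (uint64_t) pages * (uint64_t) page_size;
}

/*
 * Hypothetical default for an upper-bound setting: a multiple of physical
 * memory (the message above suggests 4x), saturating instead of overflowing.
 */
static uint64_t default_max_shared_buffers(uint64_t phys_bytes,
                                           unsigned multiplier)
{
    assert(multiplier > 0);
    if (phys_bytes > UINT64_MAX / multiplier)
        return UINT64_MAX;
    return phys_bytes * multiplier;
}
```

A caller would compute `default_max_shared_buffers(physical_memory_bytes(), 4)` once at startup and treat a result of 0 as "unknown, require an explicit setting".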
    
    > reserve that amount of address space at startup time but only actually
    > map a portion of it
    
    Or is this the difficult part?
    
    
    
    
  17. Re: Changing shared_buffers without restart

    Andres Freund <andres@anarazel.de> — 2024-11-27T21:41:45Z

    Hi,
    
    On 2024-11-27 16:05:47 -0500, Robert Haas wrote:
    > On Wed, Nov 27, 2024 at 3:48 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    > > My understanding is that clashing of mappings (either at creation time
    > > or when resizing) could happen only within the process address space,
    > > and the assumption is that by the time we prepare the mapping layout all
    > > the other mappings for the process are already done.
    > 
    > I don't think that's correct at all. First, the user could type LOAD
    > 'whatever' at any time. But second, even if they don't or you prohibit
    > them from doing so, the process could allocate memory for any of a
    > million different things, and that could require mapping a new region
    > of memory, and the OS could choose to place that just after an
    > existing mapping, or at least close enough that we can't expand the
    > object size as much as desired.
    > 
    > If we had an upper bound on the size of shared_buffers and could
    > reserve that amount of address space at startup time but only actually
    > map a portion of it, then we could later remap and expand into the
    > reserved space. Without that, I think there's absolutely no guarantee
    > that the amount of address space that we need is available when we
    > want to extend a mapping.
    
    Strictly speaking we don't actually need to map shared buffers to the same
    location in each process... We do need that for most other uses of shared
    memory, including the buffer mapping table, but not for the buffer data
    itself.
    
    Whether it's worth the complexity of dealing with differing locations is
    another matter.
    
    Greetings,
    
    Andres Freund
    
    
    
    
  18. Re: Changing shared_buffers without restart

    Robert Haas <robertmhaas@gmail.com> — 2024-11-28T01:26:39Z

    On Wed, Nov 27, 2024 at 4:28 PM Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
    > On Wed, 27 Nov 2024 at 22:06, Robert Haas <robertmhaas@gmail.com> wrote:
    > > If we had an upper bound on the size of shared_buffers
    >
    > I think a fairly reliable upper bound is the amount of physical memory
    > on the system at time of postmaster start. We could make it a GUC to
    > set the upper bound for the rare cases where people do stuff like
    > adding swap space later or doing online VM growth. We could even have
    > the default be something like 4x the physical memory to accommodate
    > those people by default.
    
    Yes, Peter mentioned similar ideas on this thread last week.
    
    > > reserve that amount of address space at startup time but only actually
    > > map a portion of it
    >
    > Or is this the difficult part?
    
    I'm not sure how difficult this is, although I'm pretty sure that it's
    more difficult than adding a GUC. My point wasn't so much whether this
    is easy or hard but rather that it's essential if you want to avoid
    having addresses change when the resizing happens.
    
    -- 
    Robert Haas
    EDB: http://www.enterprisedb.com
    
    
    
    
  19. Re: Changing shared_buffers without restart

    Robert Haas <robertmhaas@gmail.com> — 2024-11-28T01:28:59Z

    On Wed, Nov 27, 2024 at 4:41 PM Andres Freund <andres@anarazel.de> wrote:
    > Strictly speaking we don't actually need to map shared buffers to the same
    > location in each process... We do need that for most other uses of shared
    > memory, including the buffer mapping table, but not for the buffer data
    > itself.
    
    Well, if it can move, then you have to make sure it doesn't move while
    someone's holding onto a pointer into it. I'm not exactly sure how
    hard it is to guarantee that, but we certainly do construct pointers
    into shared_buffers and use them at least for short periods of time,
    so it's not a purely academic concern.
    
    -- 
    Robert Haas
    EDB: http://www.enterprisedb.com
    
    
    
    
  20. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2024-11-28T16:30:32Z

    > On Wed, Nov 27, 2024 at 04:05:47PM GMT, Robert Haas wrote:
    > On Wed, Nov 27, 2024 at 3:48 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    > > My understanding is that clashing of mappings (either at creation time
    > > or when resizing) could happen only within the process address space,
    > > and the assumption is that by the time we prepare the mapping layout all
    > > the other mappings for the process are already done.
    >
    > I don't think that's correct at all. First, the user could type LOAD
    > 'whatever' at any time. But second, even if they don't or you prohibit
    > them from doing so, the process could allocate memory for any of a
    > million different things, and that could require mapping a new region
    > of memory, and the OS could choose to place that just after an
    > existing mapping, or at least close enough that we can't expand the
    > object size as much as desired.
    >
    > If we had an upper bound on the size of shared_buffers and could
    > reserve that amount of address space at startup time but only actually
    > map a portion of it, then we could later remap and expand into the
    > reserved space. Without that, I think there's absolutely no guarantee
    > that the amount of address space that we need is available when we
    > want to extend a mapping.
    
    I've just done a couple of experiments, and I think this could be addressed by
    careful placement of the mappings as well, based on two assumptions: for a new
    mapping the kernel always picks the lowest address that allows enough space,
    and the maximum amount of allocatable memory for other mappings could be derived
    from the total available memory. With that in mind, the shared mapping layout will
    have to have a large gap at the start, between the lowest address and the
    shared mappings used for buffers and the rest -- the gap where all the other
    mappings (allocations, libraries, madvise, etc.) will land. It's similar to the
    address space reservation you mentioned above, will significantly reduce the
    possibility of clashing, and looks something like this:
    
    	01339000-0139e000 [heap]
    	0139e000-014aa000 [heap]
    	7f2dd72f6000-7f2dfbc9c000 /memfd:strategy (deleted)
    	7f2e0209c000-7f2e269b0000 /memfd:checkpoint (deleted)
    	7f2e2cdb0000-7f2e516b4000 /memfd:iocv (deleted)
    	7f2e57ab4000-7f2e7c478000 /memfd:descriptors (deleted)
    	7f2ebc478000-7f2ee8d3c000 /memfd:buffers (deleted)
    	^ note the distance between two mappings,
    	  which is intended for resize
    	7f3168d3c000-7f318d600000 /memfd:main (deleted)
    	^ here is where the gap starts
    	7f4194c00000-7f4194e7d000
    	^ this one is an anonymous mapping created due to large
    	  memory allocation after shared mappings were created
    	7f4195000000-7f419527d000
    	7f41952dc000-7f4195416000
    	7f4195416000-7f4195600000 /dev/shm/PostgreSQL.2529797530
    	7f4195600000-7f41a311d000 /usr/lib/locale/locale-archive
    	7f41a317f000-7f41a3200000
    	7f41a3200000-7f41a3201000 /usr/lib64/libicudata.so.74.2
    
    The assumption about picking the lowest address is just how it works right now
    on Linux; this fact is already used in the patch. The idea that we could put an
    upper bound on the size of other mappings based on total available memory
    comes from the fact that anonymous mappings that are much larger than memory
    will fail without overcommit. With overcommit it becomes different, but if
    allocations are hitting that limit I can imagine there are bigger problems than
    shared buffer resize.
    
    This approach follows the same ideas already used in the patch, and has the
    same trade-offs: no address changes, but questions about portability.
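
The room left for resizing in a layout like the listing above can be checked mechanically. The sketch below parses the address range at the start of a `/proc/<pid>/maps`-style line and computes the distance between two mappings; the type and function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Address range of one mapping, as printed at the start of a maps line. */
typedef struct MapRange
{
    unsigned long long start;
    unsigned long long end;     /* exclusive */
} MapRange;

/* Parse the leading "start-end" hex pair; returns 0 on success, -1 on error. */
static int parse_maps_range(const char *line, MapRange *out)
{
    if (sscanf(line, "%llx-%llx", &out->start, &out->end) != 2)
        return -1;
    if (out->end < out->start)
        return -1;
    return 0;
}

/* Unmapped bytes between two ranges, where "lower" ends before "upper" starts. */
static unsigned long long gap_between(const MapRange *lower,
                                      const MapRange *upper)
{
    assert(lower->end <= upper->start);
    return upper->start - lower->end;
}
```

Applied to the `/memfd:buffers` and `/memfd:main` lines of the listing, this reports a gap of 0x280000000 bytes (10 GiB) available for growing the buffers mapping.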
    
    
    
    
  21. Re: Changing shared_buffers without restart

    Robert Haas <robertmhaas@gmail.com> — 2024-11-28T17:18:54Z

    On Thu, Nov 28, 2024 at 11:30 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    > on Linux, this fact is already used in the patch. The idea that we could put
    > upper boundary on the size of other mappings based on total available memory
    > comes from the fact that anonymous mappings, that are much larger than memory,
    > will fail without overcommit. With overcommit it becomes different, but if
    > allocations are hitting that limit I can imagine there are bigger problems than
    > shared buffer resize.
    >
    > This approach follows the same ideas already used in the patch, and has the
    > same trade-offs: no address changes, but questions about portability.
    
    I definitely welcome the fact that you have some platform-specific
    knowledge of the Linux behavior, because that's expertise that is
    obviously quite useful here and which I lack. I'm personally not
    overly concerned about whether it works on every other platform -- I
    would prefer an implementation that works everywhere, but I'd rather
    have one that works on Linux than have nothing. It's unclear to me why
    operating systems don't offer better primitives for this sort of thing
    -- in theory there could be a system call that sets aside a pool of
    address space and then other system calls that let you allocate
    shared/unshared memory within that space or even at specific
    addresses, but actually such things don't exist.
    
    All that having been said, what does concern me a bit is our ability
    to predict what Linux will do well enough to keep what we're doing
    safe; and also whether the Linux behavior might abruptly change in the
    future. Users would be sad if we released this feature and then a
    future kernel upgrade causes PostgreSQL to completely stop working. I
    don't know how the Linux kernel developers actually feel about this
    sort of thing, but if I imagine myself as a kernel developer, I can
    totally see myself saying "well, we never promised that this would
    work in any particular way, so we're free to change it whenever we
    like." We've certainly used that argument here countless times.
    
    -- 
    Robert Haas
    EDB: http://www.enterprisedb.com
    
    
    
    
  22. Re: Changing shared_buffers without restart

    Matthias van de Meent <boekewurm+postgres@gmail.com> — 2024-11-28T18:13:22Z

    On Thu, 28 Nov 2024 at 18:19, Robert Haas <robertmhaas@gmail.com> wrote:
    >
    > [...] It's unclear to me why
    > operating systems don't offer better primitives for this sort of thing
    > -- in theory there could be a system call that sets aside a pool of
    > address space and then other system calls that let you allocate
    > shared/unshared memory within that space or even at specific
    > addresses, but actually such things don't exist.
    
    Isn't that more a stdlib/malloc issue? AFAIK, Linux's mmap(2) syscall
    allows you to request memory from the OS at arbitrary addresses - it's
    just that stdlib's malloc doesn't expose the 'alloc at this address'
    part of that API.
    
    Windows seems to have an equivalent API in VirtualAlloc*. Both the
    Windows API and Linux's mmap have an optional address argument, which
    (when not NULL) is where the allocation will be placed (some
    conditions apply, based on flags and specific API used), so, assuming
    we have some control on where to allocate memory, we should be able to
    reserve enough memory by using these APIs.
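
For reference, here is a minimal Linux-only sketch of "allocate at this address". Without `MAP_FIXED` the address argument is only a hint, and with `MAP_FIXED` an existing mapping at that address is silently replaced, so the sketch uses `MAP_FIXED_NOREPLACE` (Linux 4.17 and later), which is not mentioned in the thread and is my own choice for illustration. The function name is hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE 0x100000    /* Linux 4.17+, generic value */
#endif

/*
 * Map len bytes of anonymous memory exactly at addr, or fail (returning
 * NULL) if any part of that range is already mapped.
 */
static void *map_exactly_at(void *addr, size_t len)
{
    void *p;

    assert(addr != NULL && len > 0);

    p = mmap(addr, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    if (p != addr)
    {
        /* Older kernels ignore the flag and treat addr as a mere hint. */
        munmap(p, len);
        errno = EEXIST;
        return NULL;
    }
    return p;
}
```

This shows the limitation discussed in the replies: the call can only succeed if nothing else has been mapped into the requested range in the meantime.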
    
    Kind regards,
    
    Matthias van de Meent
    Neon (https://neon.tech)
    
    
    
    
  23. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2024-11-28T18:45:49Z

    > On Thu, Nov 28, 2024 at 12:18:54PM GMT, Robert Haas wrote:
    >
    > All that having been said, what does concern me a bit is our ability
    > to predict what Linux will do well enough to keep what we're doing
    > safe; and also whether the Linux behavior might abruptly change in the
    > future. Users would be sad if we released this feature and then a
    > future kernel upgrade causes PostgreSQL to completely stop working. I
    > don't know how the Linux kernel developers actually feel about this
    > sort of thing, but if I imagine myself as a kernel developer, I can
    > totally see myself saying "well, we never promised that this would
    > work in any particular way, so we're free to change it whenever we
    > like." We've certainly used that argument here countless times.
    
    Agreed, at the moment I can't say for sure how reliable this behavior is
    in the long term. I'll try to see if there are ways to get more confidence
    about that.
    
    
    
    
  24. Re: Changing shared_buffers without restart

    Tom Lane <tgl@sss.pgh.pa.us> — 2024-11-28T18:57:15Z

    Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
    > On Thu, 28 Nov 2024 at 18:19, Robert Haas <robertmhaas@gmail.com> wrote:
    >> [...] It's unclear to me why
    >> operating systems don't offer better primitives for this sort of thing
    >> -- in theory there could be a system call that sets aside a pool of
    >> address space and then other system calls that let you allocate
    >> shared/unshared memory within that space or even at specific
    >> addresses, but actually such things don't exist.
    
    > Isn't that more a stdlib/malloc issue? AFAIK, Linux's mmap(2) syscall
    > allows you to request memory from the OS at arbitrary addresses - it's
    > just that stdlib's malloc doesn't expose the 'alloc at this address'
    > part of that API.
    
    I think what Robert is concerned about is that there is exactly 0
    guarantee that that will succeed, because you have no control over
    system-driven allocations of address space (for example, loading
    of extensions or JIT code).  In fact, given things like ASLR, there
    is pressure on the kernel crew to make that *less* predictable not
    more so.  So even if we devise a method that seems to work reliably
    today, we could have little faith that it would work with next year's
    kernels.
    
    			regards, tom lane
    
    
    
    
  25. Re: Changing shared_buffers without restart

    Matthias van de Meent <boekewurm+postgres@gmail.com> — 2024-11-29T00:56:30Z

    On Thu, 28 Nov 2024 at 19:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >
    > Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
    > > On Thu, 28 Nov 2024 at 18:19, Robert Haas <robertmhaas@gmail.com> wrote:
    > >> [...] It's unclear to me why
    > >> operating systems don't offer better primitives for this sort of thing
    > >> -- in theory there could be a system call that sets aside a pool of
    > >> address space and then other system calls that let you allocate
    > >> shared/unshared memory within that space or even at specific
    > >> addresses, but actually such things don't exist.
    >
    > > Isn't that more a stdlib/malloc issue? AFAIK, Linux's mmap(2) syscall
    > > allows you to request memory from the OS at arbitrary addresses - it's
    > > just that stdlib's malloc doesn't expose the 'alloc at this address'
    > > part of that API.
    >
    > I think what Robert is concerned about is that there is exactly 0
    > guarantee that that will succeed, because you have no control over
    > system-driven allocations of address space (for example, loading
    > of extensions or JIT code).  In fact, given things like ASLR, there
    > is pressure on the kernel crew to make that *less* predictable not
    > more so.
    
    I see what you mean, but I think that shouldn't be much of an issue.
    I'm not a kernel hacker, but I've never heard about anyone arguing to
    remove mmap's mapping-overwriting behavior for user-controlled
    mappings - it seems too useful as a way to guarantee relative memory
    addresses (agreed, there is now mseal(2), but that is the user asking
    for security on their own mapping, this isn't applied to arbitrary
    mappings).
    
    I mean, we can do the following to get a nice contiguous empty address
    space no other mmap(NULL)s will get put into:
    
        /* reserve size bytes of memory */
        base = mmap(NULL, size, PROT_NONE, ...flags, ...);
        /* use the first small_size bytes of that reservation */
        allocated_in_reserved = mmap(base, small_size, PROT_READ | PROT_WRITE,
                                     MAP_FIXED, ...);
    
    With the PROT_NONE protection option the OS doesn't actually allocate
    any backing memory, but guarantees no other mmap(NULL, ...) will get
    placed in that area such that it overlaps with that allocation until
    the area is munmap-ed, thus allowing us to reserve a chunk of address
    space without actually using (much) memory. Deallocations have to go
    through mmap(... PROT_NONE, ...) instead of munmap if we'd want to
    keep the full area reserved, but I think that's not that much of an
    issue.
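
A fuller version of the snippet above might look like the following sketch. The flag combinations are my assumptions for Linux (the original leaves them elided), the function names are hypothetical, and lengths are expected to be multiples of the page size.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Reserve size bytes of address space without any usable backing memory. */
static void *reserve_address_space(size_t size)
{
    void *base = mmap(NULL, size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

    return base == MAP_FAILED ? NULL : base;
}

/* Turn the first len bytes of the reservation into usable shared memory. */
static void *commit_prefix(void *base, size_t len)
{
    void *p = mmap(base, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

    /* MAP_FIXED either fails or returns exactly the requested address. */
    assert(p == MAP_FAILED || p == base);
    return p == MAP_FAILED ? NULL : p;
}

/*
 * Give the memory back while keeping the range reserved: map PROT_NONE
 * over it again instead of calling munmap(), as described above.
 */
static int release_prefix(void *base, size_t len)
{
    void *p = mmap(base, len, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE | MAP_FIXED,
                   -1, 0);

    return p == MAP_FAILED ? -1 : 0;
}
```

Note that `commit_prefix` creates a new anonymous shared object each time, so this form only suits a mapping set up before backends are forked; sharing a resizable region between existing processes would need a file descriptor behind it.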
    
    I also highly doubt Linux will remove or otherwise limit the PROT_NONE
    option to such a degree that we won't be able to "balloon" the memory
    address space for (e.g.) dynamic shared buffer resizing.
    
    See also: FreeBSD's MAP_GUARD mmap flag, Windows' MEM_RESERVE and
    MEM_RESERVE_PLACEHOLDER flags for VirtualAlloc[2][Ex].
    See also [0] where PROT_NONE is explicitly called out as a tool for
    reserving memory address space.
    
    > So even if we devise a method that seems to work reliably
    > today, we could have little faith that it would work with next year's
    > kernels.
    
    I really don't think that userspace memory address space reservations
    through e.g. PROT_NONE or MEM_RESERVE[_PLACEHOLDER] will be retired
    anytime soon, at least not without the relevant kernels also providing
    effective alternatives.
    
    
    Kind regards,
    
    Matthias van de Meent
    Neon (https://neon.tech)
    
    [0] https://www.gnu.org/software/libc/manual/html_node/Memory-Protection.html
    
    
    
    
  26. Re: Changing shared_buffers without restart

    Tom Lane <tgl@sss.pgh.pa.us> — 2024-11-29T01:42:47Z

    Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
    > I mean, we can do the following to get a nice contiguous empty address
    > space no other mmap(NULL)s will get put into:
    
    >     /* reserve size bytes of memory */
    >     base = mmap(NULL, size, PROT_NONE, ...flags, ...);
    >     /* use the first small_size bytes of that reservation */
    >     allocated_in_reserved = mmap(base, small_size, PROT_READ | PROT_WRITE,
    >                                  MAP_FIXED, ...);
    
    > With the PROT_NONE protection option the OS doesn't actually allocate
    > any backing memory, but guarantees no other mmap(NULL, ...) will get
    > placed in that area such that it overlaps with that allocation until
    > the area is munmap-ed, thus allowing us to reserve a chunk of address
    > space without actually using (much) memory.
    
    Well, that's all great if it works portably.  But I don't see one word
    in either POSIX or the Linux mmap(2) man page that promises those
    semantics for PROT_NONE.  I also wonder how well a giant chunk of
    "unbacked" address space will interoperate with the OOM killer,
    top(1)'s display of used memory, and other things that have caused us
    headaches with large shared-memory arenas.
    
    Maybe those issues are all in the past and this'll work great.
    I'm not holding my breath though.
    
    			regards, tom lane
    
    
    
    
  27. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2024-11-29T16:47:27Z

    > On Fri, Nov 29, 2024 at 01:56:30AM GMT, Matthias van de Meent wrote:
    >
    > I mean, we can do the following to get a nice contiguous empty address
    > space no other mmap(NULL)s will get put into:
    >
    >     /* reserve size bytes of memory */
    >     base = mmap(NULL, size, PROT_NONE, ...flags, ...);
    >     /* use the first small_size bytes of that reservation */
    >     allocated_in_reserved = mmap(base, small_size, PROT_READ | PROT_WRITE,
    >                                  MAP_FIXED, ...);
    >
    > With the PROT_NONE protection option the OS doesn't actually allocate
    > any backing memory, but guarantees no other mmap(NULL, ...) will get
    > placed in that area such that it overlaps with that allocation until
    > the area is munmap-ed, thus allowing us to reserve a chunk of address
    > space without actually using (much) memory.
    
    From what I understand it's not much different from the scenario when we
    just map as much as we want in advance. The actual memory will not be
    allocated in either case due to CoW, and oom_score seems to be the same. I
    agree it sounds attractive, but after some experimenting it looks like
    it won't work with huge pages inside a cgroup v2 (=container).
    
    The reason is Linux has recently learned to apply memory reservation
    limits on hugetlb inside a cgroup, which are applied to mmap. Nowadays
    this feature is often configured out of the box in various container
    orchestrators, meaning that a scenario "set hugetlb=1GB on a container,
    reserve 32GB with PROT_NONE" will fail. I've also tried to mix and
    match, reserve some address space via non-hugetlb mapping, and allocate
    a hugetlb out of it, but it doesn't work either (the smaller mmap
    complains about MAP_HUGETLB with EINVAL).
    
    
    
    
  28. Re: Changing shared_buffers without restart

    Andres Freund <andres@anarazel.de> — 2024-11-29T18:17:29Z

    Hi,
    
    On 2024-11-28 17:30:32 +0100, Dmitry Dolgov wrote:
    > The assumption about picking up a lowest address is just how it works right now
    > on Linux, this fact is already used in the patch. The idea that we could put
    > upper boundary on the size of other mappings based on total available memory
    > comes from the fact that anonymous mappings, that are much larger than memory,
    > will fail without overcommit.
    
    The overcommit issue shouldn't be a big hurdle - by mmap()ing with
    MAP_NORESERVE the space isn't reserved. Then madvise with MADV_POPULATE_WRITE
    can be used to actually populate the used range of the mapping and MADV_REMOVE
    can be used to shrink the mapping again.
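
The sequence described above could be sketched as follows, assuming Linux and an anonymous shared mapping. `MADV_POPULATE_WRITE` needs Linux 5.14 or later, so the sketch falls back to touching each page by hand when the kernel rejects it; the function names are hypothetical. Strictly speaking `MADV_REMOVE` releases the backing pages of a range, while the mapping itself stays in place.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23      /* Linux 5.14+, value from the UAPI headers */
#endif

/* Map the maximum size up front; MAP_NORESERVE avoids reserving it all. */
static void *map_max_size(size_t max_size)
{
    void *base = mmap(NULL, max_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

    return base == MAP_FAILED ? NULL : base;
}

/* Fault in the pages of the range that is actually going to be used. */
static int populate_range(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    volatile char *p = addr;
    size_t off;

    assert(addr != NULL);

    if (madvise(addr, len, MADV_POPULATE_WRITE) == 0)
        return 0;
    if (errno != EINVAL || page <= 0)
        return -1;

    /* Fallback for older kernels: write each page without changing it. */
    for (off = 0; off < len; off += (size_t) page)
        p[off] = p[off];
    return 0;
}

/* Release the backing pages of a range; later reads see zero-filled pages. */
static int discard_range(void *addr, size_t len)
{
    return madvise(addr, len, MADV_REMOVE);
}
```

Growing would then be `populate_range()` on the newly used tail and shrinking would be `discard_range()` on the tail being given up, with the addresses staying fixed in both directions.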
    
    
    > With overcommit it becomes different, but if allocations are hitting that
    > limit I can imagine there are bigger problems than shared buffer resize.
    
    I'm fairly sure it'll not work to just disregard issues around overcommit. An
    overly large memory allocation, without MAP_NORESERVE, will actually reduce
    the amount of memory that can be used for other allocations. That's obviously
    problematic, because you'll now have a smaller shared buffers, but can't use
    the memory for work_mem type allocations...
    
    Greetings,
    
    Andres Freund
    
    
    
    
  29. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2024-12-02T19:17:59Z

    > On Fri, Nov 29, 2024 at 05:47:27PM GMT, Dmitry Dolgov wrote:
    > > On Fri, Nov 29, 2024 at 01:56:30AM GMT, Matthias van de Meent wrote:
    > >
    > > I mean, we can do the following to get a nice contiguous empty address
    > > space no other mmap(NULL)s will get put into:
    > >
    > >     /* reserve size bytes of memory */
    > >     base = mmap(NULL, size, PROT_NONE, ...flags, ...);
    > >     /* use the first small_size bytes of that reservation */
    > >     allocated_in_reserved = mmap(base, small_size, PROT_READ | PROT_WRITE,
    > >                                  MAP_FIXED, ...);
    > >
    > > With the PROT_NONE protection option the OS doesn't actually allocate
    > > any backing memory, but guarantees no other mmap(NULL, ...) will get
    > > placed in that area such that it overlaps with that allocation until
    > > the area is munmap-ed, thus allowing us to reserve a chunk of address
    > > space without actually using (much) memory.
    >
    > From what I understand it's not much different from the scenario when we
    > just map as much as we want in advance. The actual memory will not be
    > allocated in either case due to CoW, and oom_score seems to be the same. I
    > agree it sounds attractive, but after some experimenting it looks like
    > it won't work with huge pages inside a cgroup v2 (=container).
    >
    > The reason is Linux has recently learned to apply memory reservation
    > limits on hugetlb inside a cgroup, which are applied to mmap. Nowadays
    > this feature is often configured out of the box in various container
    > orchestrators, meaning that a scenario "set hugetlb=1GB on a container,
    > reserve 32GB with PROT_NONE" will fail. I've also tried to mix and
    > match, reserve some address space via non-hugetlb mapping, and allocate
    > a hugetlb out of it, but it doesn't work either (the smaller mmap
    > complains about MAP_HUGETLB with EINVAL).
    
    I've asked about that in linux-mm [1]. To my surprise, the
    recommendations were to stick to creating a large mapping in advance,
    and slice smaller mappings out of that, which could be resized later.
    The OOM score should not be affected, and hugetlb could be avoided using
    MAP_NORESERVE flag for the initial mapping (I've experimented with that,
    seems to be working just fine, even if the slices are not using
    MAP_NORESERVE).
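
To make the "large mapping in advance, smaller resizable slice inside it" idea concrete, here is a simplified single-segment sketch. It assumes Linux, a memfd as backing (the earlier listing shows `/memfd:` mappings), and sizes that are multiples of the page size. The `Segment` type and function names are hypothetical, and `memfd_create` is invoked through `syscall(2)` only so that the sketch does not depend on the glibc wrapper being declared.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical descriptor for one resizable shared segment. */
typedef struct Segment
{
    int     fd;         /* memfd backing the segment */
    char   *addr;       /* start address, fixed for the segment's lifetime */
    size_t  size;       /* currently usable size */
    size_t  max_size;   /* address space set aside for the segment */
} Segment;

#define RESERVE_FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE)

static int segment_create(Segment *seg, const char *name,
                          size_t size, size_t max_size)
{
    void *base;

    assert(size > 0 && size <= max_size);

    /* Set aside max_size bytes of address space, then map the file into it. */
    base = mmap(NULL, max_size, PROT_NONE, RESERVE_FLAGS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    seg->fd = (int) syscall(SYS_memfd_create, name, 0U);
    if (seg->fd < 0 ||
        ftruncate(seg->fd, (off_t) size) != 0 ||
        mmap(base, size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, seg->fd, 0) == MAP_FAILED)
    {
        if (seg->fd >= 0)
            close(seg->fd);
        munmap(base, max_size);
        return -1;
    }
    seg->addr = base;
    seg->size = size;
    seg->max_size = max_size;
    return 0;
}

static int segment_resize(Segment *seg, size_t new_size)
{
    if (new_size == 0 || new_size > seg->max_size)
        return -1;

    if (new_size > seg->size)
    {
        /* Grow the file, then map only the new tail over the reservation. */
        if (ftruncate(seg->fd, (off_t) new_size) != 0)
            return -1;
        if (mmap(seg->addr + seg->size, new_size - seg->size,
                 PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
                 seg->fd, (off_t) seg->size) == MAP_FAILED)
            return -1;
    }
    else if (new_size < seg->size)
    {
        /* Put the tail back under the reservation, then shrink the file. */
        if (mmap(seg->addr + new_size, seg->size - new_size,
                 PROT_NONE, RESERVE_FLAGS | MAP_FIXED, -1, 0) == MAP_FAILED)
            return -1;
        if (ftruncate(seg->fd, (off_t) new_size) != 0)
            return -1;
    }
    seg->size = new_size;
    return 0;
}
```

In a multi-process setup every backend would have to perform the same remapping on its own copy of the address space, which is where the synchronization questions raised elsewhere in the thread come in; this sketch covers only the single-process mechanics.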
    
    I guess that would mean I'll try to experiment with this approach as
    well. But what do others think? How much research do we need to do, to gain
    some confidence about large shared mappings and make it realistically
    acceptable?
    
    [1]: https://lore.kernel.org/linux-mm/pr7zggtdgjqjwyrfqzusih2suofszxvlfxdptbo2smneixkp7i@nrmtbhemy3is/t/
    
    
    
    
  30. Re: Changing shared_buffers without restart

    Robert Haas <robertmhaas@gmail.com> — 2024-12-03T14:31:19Z

    On Mon, Dec 2, 2024 at 2:18 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    > I've asked about that in linux-mm [1]. To my surprise, the
    > recommendations were to stick to creating a large mapping in advance,
    > and slice smaller mappings out of that, which could be resized later.
    > The OOM score should not be affected, and hugetlb could be avoided using
    > MAP_NORESERVE flag for the initial mapping (I've experimented with that,
    > seems to be working just fine, even if the slices are not using
    > MAP_NORESERVE).
    >
    > I guess that would mean I'll try to experiment with this approach as
    > well. But what others think? How much research do we need to do, to gain
    > some confidence about large shared mappings and make it realistically
    > acceptable?
    
    Personally, I like this approach. It seems to me that this opens up
    the possibility of a system where the virtual addresses of data
    structures in shared memory never change, which I think will avoid an
    absolutely massive amount of implementation complexity. It's obviously
    not ideal that we have to specify in advance an upper limit on the
    potential size of shared_buffers, but we can live with it. It's better
    than what we have today; and certainly cloud providers will have no
    issue with pre-setting that to a reasonable value. I don't know if we
    can port it to other operating systems, but it seems at least possible
    that they offer similar primitives, or will in the future; if not, we
    can disable the feature on those platforms.
    
    I still think the synchronization is going to be tricky. For example
    when you go to shrink a mapping, you need to make sure that it's free
    of buffers that anyone might touch; and when you grow a mapping, you
    need to make sure that nobody tries to touch that address space before
    they grow the mapping, which goes back to my earlier point about
    someone doing a lookup into the buffer mapping table and finding a
    buffer number that is beyond the end of what they've already mapped.
    But I think it may be doable with sufficient cleverness.
    
    -- 
    Robert Haas
    EDB: http://www.enterprisedb.com
    
    
    
    
  31. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2024-12-17T14:10:11Z

    On Tue, Dec 3, 2024 at 8:01 PM Robert Haas <robertmhaas@gmail.com> wrote:
    >
    > On Mon, Dec 2, 2024 at 2:18 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    > > I've asked about that in linux-mm [1]. To my surprise, the
    > > recommendations were to stick to creating a large mapping in advance,
    > > and slice smaller mappings out of that, which could be resized later.
    > > The OOM score should not be affected, and hugetlb could be avoided using
    > > MAP_NORESERVE flag for the initial mapping (I've experimented with that,
    > > seems to be working just fine, even if the slices are not using
    > > MAP_NORESERVE).
    > >
    > > I guess that would mean I'll try to experiment with this approach as
    > > well. But what others think? How much research do we need to do, to gain
    > > some confidence about large shared mappings and make it realistically
    > > acceptable?
    >
    > Personally, I like this approach. It seems to me that this opens up
    > the possibility of a system where the virtual addresses of data
    > structures in shared memory never change, which I think will avoid an
    > absolutely massive amount of implementation complexity. It's obviously
    > not ideal that we have to specify in advance an upper limit on the
    > potential size of shared_buffers, but we can live with it. It's better
    > than what we have today; and certainly cloud providers will have no
    > issue with pre-setting that to a reasonable value. I don't know if we
    > can port it to other operating systems, but it seems at least possible
    > that they offer similar primitives, or will in the future; if not, we
    > can disable the feature on those platforms.
    >
    > I still think the synchronization is going to be tricky. For example
    > when you go to shrink a mapping, you need to make sure that it's free
    > of buffers that anyone might touch; and when you grow a mapping, you
    > need to make sure that nobody tries to touch that address space before
    > they grow the mapping, which goes back to my earlier point about
    > someone doing a lookup into the buffer mapping table and finding a
    > buffer number that is beyond the end of what they've already mapped.
    > But I think it may be doable with sufficient cleverness.
    >
    
    From the discussion so far, the protocol for each shared memory slot
    (or segment as suggested by Robert) seems to be the following.
    1. At the start create a memory mapping using mmap with maximum
    allocation (maxsize) with PROT_READ/PROT_WRITE and MAP_NORESERVE to
    reserve address space. Assume this is created at virtual address
    maddr.
    2. Resize it to the required size (size) using mremap() - this will be
    used to create shared memory objects
    3. Map a segment with PROT_NONE and MAP_NORESERVE at maddr + size.
    This segment would not allow any other mapping to be added in the
    required space. PROT_NONE will protect from unintentional writes/reads
    from this space.
    4. When resizing the segment remove the mapping created in step 3 and
    execute step 2 and 3 again. Synchronization, mentioned by Robert,
    should be carried out somewhere in this step.
    Note that the addresses need to be aligned as per mmap and mremap requirements.
    
    Please correct me if I am wrong.
    
    I wrote the attached simple program simulating this protocol. It seems
    to work as expected. However, mmap'ing with MAP_FIXED would still be
    able to dislodge the reserved memory. But that's true with any mapped
    segment; not just with reserved memory.
    
    A bit about the program: It reserves a 3MB memory segment and resizes
    it to 1MB, 2MB and back to 3MB, thus exercising both shrinking and
    enlarging the memory. It forks a child process after resizing the
    memory segment for the first time. At every step it makes sure that the
    parent and child processes can write and read at the boundaries of the
    resized memory segment. The program waits for getchar() at these steps,
    so if it seems to be stuck, try pressing Enter once or twice.
    
    I could verify the memory mappings, their sizes etc. by looking at
    /proc/PID/maps and /proc/PID/status but I did not find a way to verify
    the amount of memory actually allocated and verify that it's actually
    shrinking and expanding. Please let me know how to verify that.
    
    -- 
    Best Wishes,
    Ashutosh Bapat
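
A minimal sketch of the four-step protocol listed above, assuming Linux and page-aligned sizes. The `Segment` struct and the `segment_create`/`segment_resize` names are hypothetical and are not taken from the attached program or the patch set; synchronization between processes is left out entirely.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical bookkeeping for one resizable segment (not from the patch). */
typedef struct
{
    char   *base;       /* maddr: start address of the reservation */
    size_t  size;       /* currently usable bytes */
    size_t  maxsize;    /* reserved address space in bytes */
} Segment;

/* Step 4: drop the guard, resize in place, then re-create the guard. */
int
segment_resize(Segment *seg, size_t newsize)
{
    if (newsize == 0 || newsize > seg->maxsize)
        return -1;

    /* Remove the PROT_NONE guard so the mapping has room to grow. */
    if (seg->size < seg->maxsize &&
        munmap(seg->base + seg->size, seg->maxsize - seg->size) != 0)
        return -1;

    /* Step 2: without MREMAP_MAYMOVE the start address cannot change. */
    if (mremap(seg->base, seg->size, newsize, 0) == MAP_FAILED)
        return -1;
    seg->size = newsize;

    /* Step 3: guard the rest of the reservation against other mappings. */
    if (newsize < seg->maxsize &&
        mmap(seg->base + newsize, seg->maxsize - newsize, PROT_NONE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE | MAP_FIXED,
             -1, 0) == MAP_FAILED)
        return -1;
    return 0;
}

/* Step 1: map maxsize up front, then shrink to the initial size. */
int
segment_create(Segment *seg, size_t size, size_t maxsize)
{
    void   *p = mmap(NULL, maxsize, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

    if (p == MAP_FAILED)
        return -1;
    seg->base = p;
    seg->size = maxsize;
    seg->maxsize = maxsize;
    return segment_resize(seg, size);
}
```

Between the munmap() of the guard and the mmap() that re-creates it, another thread in the same process could in principle grab the freed address range, so this sketch assumes a single-threaded process.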
    
  32. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-01-13T08:11:06Z

    Hi Dmitry,
    
    On Tue, Dec 17, 2024 at 7:40 PM Ashutosh Bapat
    <ashutosh.bapat.oss@gmail.com> wrote:
    >
    > I could verify the memory mappings, their sizes etc. by looking at
    > /proc/PID/maps and /proc/PID/status but I did not find a way to verify
    > the amount of memory actually allocated and verify that it's actually
    > shrinking and expanding. Please let me know how to verify that.
    
    As mentioned somewhere upthread, mmap and mremap by themselves do not
    allocate any memory. Writing to the mapped region causes memory to be
    allocated, which shows up in VmRSS and RssShmem. The allocated memory
    does get resized when mremap() shrinks the mapped region.
    
    Attached are patches rebased on top of commit
    2a7b2d97171dd39dca7cefb91008a3c84ec003ba. I have also fixed
    compilation errors. Otherwise I haven't changed anything in the
    patches. The last patch adds some TODOs and questions that I think
    we need to address while completing this work; they are there just as
    a reminder for later. The TODO in postgres.c is related to your observation
    
    > Another rough edge is that a
    > backend, executing pg_reload_conf interactively, will not resize
    > mappings immediately, for some reason it will require another command.
    I don't have a solution right now, but at least the comment documents
    the reason and points to its origin.
    
    I am next looking at the problem of synchronizing the change across
    the backends.
    
    -- 
    Best Wishes,
    Ashutosh Bapat
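
One way to check the observation above from inside a test program is to read the counters from /proc/self/status before and after touching the mapping. This is a sketch only: `read_status_kb` and `map_and_touch` are hypothetical helper names, and it assumes a Linux /proc with `VmRSS` (and, on reasonably recent kernels, `RssShmem`) lines reported in kB.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Return the value (in kB) of a "Key:   N kB" line in /proc/self/status,
 * or -1 if the file or the key cannot be read.
 */
long
read_status_kb(const char *key)
{
    FILE   *f = fopen("/proc/self/status", "r");
    char    line[256];
    size_t  keylen = strlen(key);
    long    value = -1;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof(line), f) != NULL)
    {
        if (strncmp(line, key, keylen) == 0 && line[keylen] == ':')
        {
            if (sscanf(line + keylen + 1, "%ld", &value) != 1)
                value = -1;
            break;
        }
    }
    fclose(f);
    return value;
}

/* Map len bytes of shared anonymous memory and write to every byte. */
char *
map_and_touch(size_t len)
{
    char   *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return NULL;
    memset(p, 1, len);          /* the writes are what allocate pages */
    return p;
}
```

Calling read_status_kb("VmRSS") or read_status_kb("RssShmem") before and after map_and_touch(), and again after an mremap() shrink, gives the numbers to compare.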
    
  33. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-02-25T09:52:05Z

    > On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
    > TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
    > changing shared memory mapping layout. Any feedback is appreciated.
    
    Hi,
    
    Here is a new version of the patch, which contains a proposal about how to
    coordinate shared memory resizing between backends. The rest is more or less
    the same; feedback about coordination is appreciated. It's a lot to read, but
    the main differences are:
    
    1. Allowing a GUC value change to be decoupled from actually applying it, sort
    of a "pending" change. The idea is to let custom logic be triggered on an
    assign hook, and then take responsibility for what happens later and how it's
    going to be applied. This allows using the regular GUC infrastructure in cases
    where a value change requires some complicated processing. I was trying to keep
    the change from being too invasive, and GUC reporting is still missing.
    
    2. The shared memory resizing patch became more complicated thanks to some
    coordination between backends. The current implementation was chosen from a
    few more or less equal alternatives, which evolve along the following lines:
    
    * There should be one "coordinator" process overseeing the change. Having the
    postmaster fulfill this role, as in this patch, seems like a natural idea,
    but it poses certain challenges since it doesn't have locking infrastructure.
    Another option would be to elect a single backend to be the coordinator, which
    would handle the postmaster as a special case. If there is ever a
    "coordinator" worker in Postgres, that would be useful here.
    
    * The coordinator uses EmitProcSignalBarrier to reach out to all other backends
    and trigger the resize process. Backends join a Barrier to synchronize and wait
    until everyone is finished.
    
    * There is some resizing state stored in shared memory, which is there to
    handle backends that were for some reason late or didn't receive the signal.
    What to store there is open for discussion.
    
    * Since we want to make sure all processes share the same understanding of what
    the NBuffers value is, any failure is mostly a hard stop, since rolling back
    the change needs coordination as well, which sounds a bit too complicated for
    now.
    
    We've tested this change manually for now, although it might be useful to try
    out injection points. The testing strategy, which has caught plenty of bugs,
    was simply to run pgbench workload against a running instance and change
    shared_buffers on the fly. Some more subtle cases were verified by manually
    injecting delays to trigger expected scenarios.
    
    To reiterate, here is the patch breakdown:
    
    Patches 1-3 prepare the infrastructure and shared memory layout. They could be
    useful even with multithreaded PostgreSQL, when there will be no need for
    shared memory. I assume, in the multithreaded world there still will be need
    for a contiguous chunk of memory to share between threads, and its layout would
    be similar to the one with shared memory mappings. Note that patch 2 will
    go away as soon as I get to implementing shared memory address reservation,
    but for now it's needed.
    
    Patch 4 is a new addition to handle "pending" GUC changes.
    
    Patch 5 actually does the resizing. It's shared memory specific of course, and
    utilizes the Linux-specific mremap, which leaves portability questions open.
    
    Patch 6 is somewhat independent, but quite convenient to have. It also utilizes
    the Linux-specific call memfd_create.
    
    I would like to get some feedback on the synchronization part. While waiting
    I'll proceed implementing shared memory address space reservation and Ashutosh
    will continue with buffer eviction to support shared memory reduction.
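
To make the "resizing state stored in shared memory" bullet above more concrete, here is one possible shape for it, sketched with C11 atomics. Everything here is an assumption for illustration: the `ResizeState` and `BackendLocal` structs and both function names are hypothetical, the sketch assumes a single coordinator, and it leaves out the barrier wait and the actual remapping done by the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical state living in shared memory, written by the coordinator. */
typedef struct
{
    atomic_uint generation;         /* bumped once per resize request */
    atomic_int  target_nbuffers;    /* NBuffers value to converge on */
} ResizeState;

/* Hypothetical per-backend view of what has been applied so far. */
typedef struct
{
    unsigned    seen_generation;
    int         nbuffers;
} BackendLocal;

/* Coordinator side: publish the new target, then announce it. */
void
coordinator_request_resize(ResizeState *st, int new_nbuffers)
{
    atomic_store_explicit(&st->target_nbuffers, new_nbuffers,
                          memory_order_relaxed);
    /* Release ordering makes the target visible to whoever sees the bump. */
    atomic_fetch_add_explicit(&st->generation, 1, memory_order_release);
}

/*
 * Backend side: called when the signal arrives, or later by a backend that
 * missed it. Returns true if there was a pending change to apply.
 */
bool
backend_catch_up(ResizeState *st, BackendLocal *me)
{
    unsigned    gen = atomic_load_explicit(&st->generation,
                                           memory_order_acquire);

    if (gen == me->seen_generation)
        return false;               /* already up to date */
    me->nbuffers = atomic_load_explicit(&st->target_nbuffers,
                                        memory_order_relaxed);
    me->seen_generation = gen;
    return true;
}
```

A backend that never received the signal still converges the next time it calls backend_catch_up(), which is the property the shared state is meant to provide.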
    
  34. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-02-27T08:28:22Z

    > On Tue, Feb 25, 2025 at 10:52:05AM GMT, Dmitry Dolgov wrote:
    > > On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
    > > TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
    > > changing shared memory mapping layout. Any feedback is appreciated.
    >
    > Hi,
    >
    > Here is a new version of the patch, which contains a proposal about how to
    > coordinate shared memory resizing between backends. The rest is more or less
    > the same, a feedback about coordination is appreciated. It's a lot to read, but
    > the main difference is about:
    
    Just one note: there are still a couple of compilation warnings in the
    code, which I haven't addressed yet. Those will go away in the next
    version.
    
    
    
    
  35. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-02-28T11:52:29Z

    On Thu, Feb 27, 2025 at 1:58 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    >
    > > On Tue, Feb 25, 2025 at 10:52:05AM GMT, Dmitry Dolgov wrote:
    > > > On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
    > > > TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
    > > > changing shared memory mapping layout. Any feedback is appreciated.
    > >
    > > Hi,
    > >
    > > Here is a new version of the patch, which contains a proposal about how to
    > > coordinate shared memory resizing between backends. The rest is more or less
    > > the same, a feedback about coordination is appreciated. It's a lot to read, but
    > > the main difference is about:
    >
    > Just one note, there are still couple of compilation warnings in the
    > code, which I haven't addressed yet. Those will go away in the next
    > version.
    
    PFA the patchset which implements shrinking shared buffers.
    0001-0006 are the same as in the previous patchset
    0007 fixes compilation warnings from previous patches - I think those
    should be absorbed into their respective patches
    0008 adds TODOs that need some code changes or at least need some
    consideration. Some of them might point to the causes of Assertion
    failures seen with this patch set.
    0009 adds WIP support for shrinking shared buffers - I think this
    should be absorbed into 0005
    0010 WIP fix for Assertion failures seen from BgBufferSync() - I am
    still investigating those.
    
    I am using the attached script to shake the patch well. It runs
    pgbench and concurrently resizes shared_buffers. I am seeing
    Assertion failures when running the script in both cases, expanding
    and shrinking the buffers. I am investigating the failed
    Assert("strategy_delta >= 0") next.
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
  36. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-02-28T12:01:41Z

    On Tue, Feb 25, 2025 at 3:22 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    >
    > > On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
    > > TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
    > > changing shared memory mapping layout. Any feedback is appreciated.
    >
    > Hi,
    >
    > Here is a new version of the patch, which contains a proposal about how to
    > coordinate shared memory resizing between backends. The rest is more or less
    > the same, a feedback about coordination is appreciated. It's a lot to read, but
    > the main difference is about:
    
    Thanks Dmitry for the summary.
    
    >
    > 1. Allowing to decouple a GUC value change from actually applying it, sort of a
    > "pending" change. The idea is to let a custom logic be triggered on an assign
    > hook, and then take responsibility for what happens later and how it's going to
    > be applied. This allows to use regular GUC infrastructure in cases where value
    > change requires some complicated processing. I was trying to make the change
    > not so invasive, plus it's missing GUC reporting yet.
    >
    > 2. Shared memory resizing patch became more complicated thanks to some
    > coordination between backends. The current implementation was chosen from few
    > more or less equal alternatives, which are evolving along following lines:
    >
    > * There should be one "coordinator" process overseeing the change. Having
    > postmaster to fulfill this role like in this patch seems like a natural idea,
    > but it poses certain challenges since it doesn't have locking infrastructure.
    > Another option would be to elect a single backend to be a coordinator, which
    > will handle the postmaster as a special case. If there will ever be a
    > "coordinator" worker in Postgres, that would be useful here.
    >
    > * The coordinator uses EmitProcSignalBarrier to reach out to all other backends
    > and trigger the resize process. Backends join a Barrier to synchronize and wait
    > until everyone is finished.
    >
    > * There is some resizing state stored in shared memory, which is there to
    > handle backends that were for some reason late or didn't receive the signal.
    > What to store there is open for discussion.
    >
    > * Since we want to make sure all processes share the same understanding of what
    > NBuffers value is, any failure is mostly a hard stop, since to rollback the
    > change coordination is needed as well and sounds a bit too complicated for now.
    >
    
    I think we should add a way to monitor the progress of resizing; at
    least whether resizing is complete and whether the new GUC value is in
    effect.
    
    > We've tested this change manually for now, although it might be useful to try
    > out injection points. The testing strategy, which has caught plenty of bugs,
    > was simply to run pgbench workload against a running instance and change
    > shared_buffers on the fly. Some more subtle cases were verified by manually
    > injecting delays to trigger expected scenarios.
    
    I have shared a script with my changes but it's far from being full
    testing. We will need to use injection points to test specific
    scenarios.
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  37. Re: Changing shared_buffers without restart

    Ni Ku <jakkuniku@gmail.com> — 2025-03-20T08:55:47Z

    Dmitry / Ashutosh,
    Thanks for the patch set. I've been doing some testing with it and in
    particular want to see if this solution would work with hugepage bufferpool.
    
    I ran some simple tests (outside of PG) on linux kernel v6.1, which has
    this commit that added some hugepage support to mremap (
    https://patchwork.kernel.org/project/linux-mm/patch/20211013195825.3058275-1-almasrymina@google.com/
    ).
    
    From reading the kernel code and testing, for a hugepage-backed mapping it
    seems mremap supports only shrinking but not growing. Further, for
    shrinking, what I observed is that after mremap is called the hugepage
    memory is not released back to the OS; rather, it's released when the fd
    is closed (or when the memory is unmapped for a mapping created with
    MAP_ANONYMOUS). I'm not sure if this behavior is expected, but being able
    to release memory back to the OS immediately after mremap would be
    important for use cases such as supporting "serverless" PG instances on
    the cloud.
    
    I'm no expert in the linux kernel so I could be missing something. It'd be
    great if you or somebody can comment on these observations and whether this
    mremap-based solution would work with hugepage bufferpool.
    
    I also attached the test program in case someone can spot I did something
    wrong.
    
    Regards,
    
    Jack Ng
    
    
  38. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-03-20T10:21:18Z

    > On Thu, Mar 20, 2025 at 04:55:47PM GMT, Ni Ku wrote:
    >
    > I ran some simple tests (outside of PG) on linux kernel v6.1, which has
    > this commit that added some hugepage support to mremap (
    > https://patchwork.kernel.org/project/linux-mm/patch/20211013195825.3058275-1-almasrymina@google.com/
    > ).
    >
    > From reading the kernel code and testing, for a hugepage-backed mapping it
    > seems mremap supports only shrinking but not growing. Further, for
    > shrinking, what I observed is that after mremap is called the hugepage
    > memory is not released back to the OS; rather, it's released when the fd
    > is closed (or when the memory is unmapped for a mapping created with
    > MAP_ANONYMOUS). I'm not sure if this behavior is expected, but being able
    > to release memory back to the OS immediately after mremap would be
    > important for use cases such as supporting "serverless" PG instances on
    > the cloud.
    >
    > I'm no expert in the linux kernel so I could be missing something. It'd be
    > great if you or somebody can comment on these observations and whether this
    > mremap-based solution would work with hugepage bufferpool.
    
    Hm, I think you're right. I didn't realize there is such a limitation, but
    I just verified on the latest kernel build and hit the same condition you
    mentioned above when growing a hugetlb mapping. That's annoying of
    course, but I've got another approach I was originally experimenting
    with: instead of mremap, do munmap and mmap with the new size and rely
    on the anonymous fd to keep the memory content in between. I'm currently
    reworking the mmap'ing part of the patch, so let me check if this new
    approach is something we could universally rely on.
    
    
    
    
  39. Re: Changing shared_buffers without restart

    Ni Ku <jakkuniku@gmail.com> — 2025-03-21T08:48:30Z

    Thanks for your insights and confirmation, Dmitry.
    Right, I think the anonymous fd approach would work to keep the memory
    contents intact in between munmap and mmap with the new size, so bufferpool
    expansion would work.
    But it seems shrinking would still be problematic, since that approach
    requires the anonymous fd to remain open (for memory content protection),
    and so munmap would not release the memory back to the OS right away (gets
    released when the fd is closed). From testing this is true for hugepage
    memory at least.
    Is there a way around this? Or maybe I misunderstood what you have in mind
    ;)
    
    Regards,
    
    Jack Ng
    
    
  40. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-03-21T09:31:04Z

    > On Fri, Mar 21, 2025 at 04:48:30PM GMT, Ni Ku wrote:
    > Thanks for your insights and confirmation, Dmitry.
    > Right, I think the anonymous fd approach would work to keep the memory
    > contents intact in between munmap and mmap with the new size, so bufferpool
    > expansion would work.
    > But it seems shrinking would still be problematic, since that approach
    > requires the anonymous fd to remain open (for memory content protection),
    > and so munmap would not release the memory back to the OS right away (gets
    > released when the fd is closed). From testing this is true for hugepage
    > memory at least.
    > Is there a way around this? Or maybe I misunderstood what you have in mind
    > ;)
    
    The anonymous file will be truncated to its new, smaller size before
    mapping it a second time (I think this part is missing in your test
    example). To my understanding, after a quick look at do_vmi_align_munmap,
    this should be enough to make the memory reclaimable.
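
A rough sketch of the munmap / ftruncate / mmap sequence described above, assuming Linux with memfd_create available. The `FdSegment` struct and function names are hypothetical. The sketch uses regular pages (no MFD_HUGETLB) and lets the kernel pick the address on every remap, so unlike the reservation-based design discussed upthread the mapping address may change across a resize.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical handle for a segment backed by an anonymous file. */
typedef struct
{
    int     fd;
    char   *addr;
    size_t  size;
} FdSegment;

int
fdseg_create(FdSegment *seg, size_t size)
{
    void   *p;

    seg->fd = memfd_create("buffers", 0);
    if (seg->fd < 0)
        return -1;
    if (ftruncate(seg->fd, (off_t) size) != 0)
    {
        close(seg->fd);
        return -1;
    }
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, seg->fd, 0);
    if (p == MAP_FAILED)
    {
        close(seg->fd);
        return -1;
    }
    seg->addr = p;
    seg->size = size;
    return 0;
}

/* Unmap, truncate the file to the new size, then map it again. */
int
fdseg_resize(FdSegment *seg, size_t newsize)
{
    void   *p;

    if (munmap(seg->addr, seg->size) != 0)
        return -1;
    seg->addr = NULL;
    seg->size = 0;

    /* Truncating is what drops the pages beyond newsize when shrinking. */
    if (ftruncate(seg->fd, (off_t) newsize) != 0)
        return -1;

    p = mmap(NULL, newsize, PROT_READ | PROT_WRITE, MAP_SHARED, seg->fd, 0);
    if (p == MAP_FAILED)
        return -1;
    seg->addr = p;
    seg->size = newsize;
    return 0;
}
```

The fd stays open the whole time, which is what preserves the contents of the surviving range between munmap and the second mmap.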
    
    
    
    
  41. Re: Changing shared_buffers without restart

    Ni Ku <jakkuniku@gmail.com> — 2025-03-21T10:30:47Z

    You're right Dmitry, truncating the anonymous file before mapping it again
    does the trick! I see 'HugePages_Free' increases to the expected size right
    after the ftruncate call for shrinking.
    This alternative approach looks very promising. Thanks.
    
    Regards,
    
    Jack Ng
    
    
  42. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-04-07T06:20:46Z

    On Fri, Feb 28, 2025 at 5:31 PM Ashutosh Bapat
    <ashutosh.bapat.oss@gmail.com> wrote:
    >
    > I think we should add a way to monitor the progress of resizing; at
    > least whether resizing is complete and whether the new GUC value is in
    > effect.
    >
    
    I further tested this approach by tracing the barrier synchronization
    using the attached patch, which adds a bunch of elog() calls.
    I ran a pgbench load and simultaneously
    executed the following commands on a psql connection
    
    #alter system set shared_buffers to '200MB';
    ALTER SYSTEM
    #select pg_reload_conf();
     pg_reload_conf
    ----------------
     t
    (1 row)
    
    #show shared_buffers;
     shared_buffers
    ----------------
     200MB
    (1 row)
    
    #select count(*) from pg_stat_activity;
     count
    -------
         6
    (1 row)
    
    #select pg_backend_pid(); - the backend where all these commands were executed
     pg_backend_pid
    ----------------
             878405
    (1 row)
    
    I see the following in the postgresql error logs.
    
    2025-03-12 11:04:53.812 IST [878167] LOG: received SIGHUP, reloading
    configuration files
    2025-03-12 11:04:53.813 IST [878405] LOG: Handle a barrier for shmem
    resizing from 16384 to -1, 0
    2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem
    resizing from 16384 to -1, 0
    2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem
    resizing from 16384 to -1, 0
    2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem
    resizing from 16384 to -1, 0
    2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem
    resizing from 16384 to -1, 0
    
    -- not all backends have reloaded configuration.
    
    2025-03-12 11:04:53.813 IST [878173] LOG: Handle a barrier for shmem
    resizing from 16384 to 25600, 1
    2025-03-12 11:04:53.813 IST [878173] LOG: attached when barrier was at phase 0
    2025-03-12 11:04:53.813 IST [878173] LOG: reached barrier phase 1
    2025-03-12 11:04:53.813 IST [878171] LOG: Handle a barrier for shmem
    resizing from 16384 to 25600, 1
    2025-03-12 11:04:53.813 IST [878172] LOG: Handle a barrier for shmem
    resizing from 16384 to 25600, 1
    2025-03-12 11:04:53.813 IST [878171] LOG: attached when barrier was at phase 1
    2025-03-12 11:04:53.813 IST [878172] LOG: attached when barrier was at phase 1
    2025-03-12 11:04:53.813 IST [878340] LOG: Handle a barrier for shmem
    resizing from 16384 to 25600, 1
    2025-03-12 11:04:53.813 IST [878340] STATEMENT: UPDATE
    pgbench_branches SET bbalance = bbalance + 1367 WHERE bid = 8;
    2025-03-12 11:04:53.813 IST [878340] LOG: attached when barrier was at phase 1
    2025-03-12 11:04:53.813 IST [878340] STATEMENT: UPDATE
    pgbench_branches SET bbalance = bbalance + 1367 WHERE bid = 8;
    2025-03-12 11:04:53.813 IST [878338] LOG: Handle a barrier for shmem
    resizing from 16384 to 25600, 1
    2025-03-12 11:04:53.813 IST [878338] STATEMENT: UPDATE
    pgbench_accounts SET abalance = abalance + -209 WHERE aid = 453662;
    2025-03-12 11:04:53.813 IST [878339] LOG: Handle a barrier for shmem
    resizing from 16384 to 25600, 1
    2025-03-12 11:04:53.813 IST [878339] STATEMENT: UPDATE
    pgbench_accounts SET abalance = abalance + -3449 WHERE aid = 159726;
    2025-03-12 11:04:53.813 IST [878338] LOG: attached when barrier was at phase 1
    2025-03-12 11:04:53.813 IST [878338] STATEMENT: UPDATE
    pgbench_accounts SET abalance = abalance + -209 WHERE aid = 453662;
    2025-03-12 11:04:53.813 IST [878339] LOG: attached when barrier was at phase 1
    2025-03-12 11:04:53.813 IST [878339] STATEMENT: UPDATE
    pgbench_accounts SET abalance = abalance + -3449 WHERE aid = 159726;
    2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem
    resizing from 16384 to 25600, 1
    2025-03-12 11:04:53.813 IST [878341] STATEMENT: BEGIN;
    2025-03-12 11:04:53.814 IST [878341] LOG: attached when barrier was at phase 1
    2025-03-12 11:04:53.814 IST [878341] STATEMENT: BEGIN;
    2025-03-12 11:04:53.814 IST [878337] LOG: Handle a barrier for shmem
    resizing from 16384 to 25600, 1
    2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers
    SET tbalance = tbalance + -1996 WHERE tid = 392;
    2025-03-12 11:04:53.814 IST [878337] LOG: attached when barrier was at phase 1
    2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers
    SET tbalance = tbalance + -1996 WHERE tid = 392;
    2025-03-12 11:04:53.814 IST [878168] LOG: Handle a barrier for shmem
    resizing from 16384 to -1, 0
    2025-03-12 11:04:53.814 IST [878172] LOG: reached barrier phase 2
    2025-03-12 11:04:53.814 IST [878171] LOG: reached barrier phase 2
    2025-03-12 11:04:53.814 IST [878340] LOG: reached barrier phase 2
    2025-03-12 11:04:53.814 IST [878340] STATEMENT: UPDATE
    pgbench_branches SET bbalance = bbalance + 1367 WHERE bid = 8;
    2025-03-12 11:04:53.814 IST [878338] LOG: reached barrier phase 2
    2025-03-12 11:04:53.814 IST [878338] STATEMENT: UPDATE
    pgbench_accounts SET abalance = abalance + -209 WHERE aid = 453662;
    2025-03-12 11:04:53.814 IST [878341] LOG: reached barrier phase 2
    2025-03-12 11:04:53.814 IST [878341] STATEMENT: BEGIN;
    2025-03-12 11:04:53.814 IST [878337] LOG: reached barrier phase 2
    2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers
    SET tbalance = tbalance + -1996 WHERE tid = 392;
    2025-03-12 11:04:53.814 IST [878173] LOG: reached barrier phase 2
    2025-03-12 11:04:53.814 IST [878339] LOG: reached barrier phase 2
    2025-03-12 11:04:53.814 IST [878339] STATEMENT: UPDATE
    pgbench_accounts SET abalance = abalance + -3449 WHERE aid = 159726;
    2025-03-12 11:04:53.814 IST [878172] LOG: reached barrier phase 3
    2025-03-12 11:04:53.814 IST [878340] LOG: reached barrier phase 3
    2025-03-12 11:04:53.814 IST [878340] STATEMENT: UPDATE
    pgbench_branches SET bbalance = bbalance + 1367 WHERE bid = 8;
    2025-03-12 11:04:53.814 IST [878341] LOG: reached barrier phase 3
    2025-03-12 11:04:53.814 IST [878341] STATEMENT: BEGIN;
    2025-03-12 11:04:53.814 IST [878339] LOG: reached barrier phase 3
    2025-03-12 11:04:53.814 IST [878339] STATEMENT: UPDATE
    pgbench_accounts SET abalance = abalance + -3449 WHERE aid = 159726;
    2025-03-12 11:04:53.814 IST [878171] LOG: reached barrier phase 3
    2025-03-12 11:04:53.814 IST [878338] LOG: reached barrier phase 3
    2025-03-12 11:04:53.814 IST [878338] STATEMENT: UPDATE
    pgbench_accounts SET abalance = abalance + -209 WHERE aid = 453662;
    2025-03-12 11:04:53.814 IST [878337] LOG: reached barrier phase 3
    2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers
    SET tbalance = tbalance + -1996 WHERE tid = 392;
    2025-03-12 11:04:53.814 IST [878337] LOG: buffer resizing operation
    finished at phase 4
    2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers
    SET tbalance = tbalance + -1996 WHERE tid = 392;
    2025-03-12 11:04:53.814 IST [878168] LOG: Handle a barrier for shmem
    resizing from 16384 to 25600, 1
    2025-03-12 11:04:53.814 IST [878168] LOG: attached when barrier was at phase 0
    2025-03-12 11:04:53.814 IST [878168] LOG: reached barrier phase 1
    2025-03-12 11:04:53.814 IST [878168] LOG: reached barrier phase 2
    2025-03-12 11:04:53.814 IST [878168] LOG: buffer resizing operation
    finished at phase 3
    2025-03-12 11:04:53.815 IST [878169] LOG: Handle a barrier for shmem
    resizing from 16384 to 25600, 1
    2025-03-12 11:04:53.815 IST [878169] LOG: attached when barrier was at phase 0
    2025-03-12 11:04:53.815 IST [878169] LOG: reached barrier phase 1
    2025-03-12 11:04:53.815 IST [878169] LOG: reached barrier phase 2
    2025-03-12 11:04:53.815 IST [878169] LOG: buffer resizing operation
    finished at phase 3
    2025-03-12 11:04:55.965 IST [878405] LOG: Handle a barrier for shmem
    resizing from 16384 to -1, 0
    2025-03-12 11:04:55.965 IST [878405] LOG: Handle a barrier for shmem
    resizing from 16384 to -1, 0
    2025-03-12 11:04:55.965 IST [878405] LOG: Handle a barrier for shmem
    resizing from 16384 to 25600, 1
    2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;
    2025-03-12 11:04:55.965 IST [878405] LOG: attached when barrier was at phase 0
    2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;
    2025-03-12 11:04:55.965 IST [878405] LOG: reached barrier phase 1
    2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;
    2025-03-12 11:04:55.965 IST [878405] LOG: reached barrier phase 2
    2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;
    2025-03-12 11:04:55.965 IST [878405] LOG: buffer resizing operation
    finished at phase 3
    2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;
    
    To cut the story short: pid 173 (for the sake of brevity I am just
    mentioning the last three digits of each PID) attached to the barrier first
    and immediately reached phase 1. 171, 172, 340, 338, 339, 341 and 337
    all attached to the barrier in phase 1. All of these backends completed the
    phases in a synchronous fashion. But 168, 169 and 405 were yet to attach
    to the barrier since they hadn't loaded their configurations yet. Each
    of these backends then finished all phases independently of the others.
    
    For your reference
    #select pid, application_name, backend_type from pg_stat_activity
    where pid in (878169, 878168);
      pid   | application_name |   backend_type
    --------+------------------+-------------------
     878168 |                  | checkpointer
     878169 |                  | background writer
    (2 rows)
    
    This is because the BarrierArriveAndWait() only waits for all the
    attached backends. It doesn't wait for backends which are yet to
    attach. I think what we want is that *all* the backends execute all
    the phases synchronously and wait for the others to finish. If we don't do
    that, there's a possibility that some of them would see inconsistent
    buffer states or, even worse, may not have the necessary memory mapped and
    resized, thus causing segfaults. Am I correct?
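
    The behaviour seen in the log can be reproduced with a toy,
    single-threaded model of a dynamic barrier in which only attached
    participants are counted. This is a sketch for illustration only; it is
    not PostgreSQL's Barrier implementation, and ToyBarrier and the toy_*
    functions are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy dynamic barrier: a phase advances once every *attached* process arrives. */
typedef struct ToyBarrier
{
    int         phase;          /* current phase number */
    int         participants;   /* processes currently attached */
    int         arrived;        /* attached processes done with this phase */
} ToyBarrier;

/* Attach a process and return the phase it observed while attaching. */
static int
toy_attach(ToyBarrier *b)
{
    b->participants++;
    return b->phase;
}

/* Detach a process; the toy model only allows this between phases. */
static void
toy_detach(ToyBarrier *b)
{
    assert(b->participants > 0 && b->arrived == 0);
    b->participants--;
}

/*
 * Record an arrival.  Returns true if this arrival completed the phase;
 * a real barrier would block the caller until that happens.
 */
static bool
toy_arrive(ToyBarrier *b)
{
    assert(b->participants > 0);
    if (++b->arrived < b->participants)
        return false;
    b->arrived = 0;
    b->phase++;
    return true;
}

/*
 * Two early processes run all phases together and detach.  A late process
 * then attaches and runs the same phases.  Returns how many phases the late
 * process completed without waiting for anybody, or -1 on a model error.
 */
static int
demo_late_attach(int nphases)
{
    ToyBarrier  b = {0, 0, 0};
    int         solo_phases = 0;
    int         i;

    toy_attach(&b);             /* p1 */
    toy_attach(&b);             /* p2 */
    for (i = 0; i < nphases; i++)
    {
        bool        first = toy_arrive(&b);     /* p1 would wait here */
        bool        second = toy_arrive(&b);    /* p2 completes the phase */

        if (first || !second)
            return -1;
    }
    toy_detach(&b);
    toy_detach(&b);

    toy_attach(&b);             /* p3 arrives after everything is over */
    for (i = 0; i < nphases; i++)
    {
        if (toy_arrive(&b))
            solo_phases++;
    }
    toy_detach(&b);
    return solo_phases;
}
```

    In this model the late process passes every phase on its own, which
    matches what 168, 169 and 405 do in the log above.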
    
    I think what needs to be done is that every backend should wait for other
    backends to attach themselves to the barrier before moving to the
    first phase. One way I can think of is we use two signal barriers -
    one to ensure that all the backends have attached themselves and
    second for the actual resizing. But then the postmaster needs to wait for
    all the processes to process the first signal barrier. A postmaster can
    not wait on anything. Maybe there's a way to poll, but I didn't find
    it. Does that mean that we have to make some other backend a coordinator?
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  43. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-04-07T08:43:22Z

    > On Mon, Apr 07, 2025 at 11:50:46AM GMT, Ashutosh Bapat wrote:
    > This is because the BarrierArriveAndWait() only waits for all the
    > attached backends. It doesn't wait for backends which are yet to
    > attach. I think what we want is *all* the backends should execute all
    > the phases synchronously and wait for others to finish. If we don't do
    > that, there's a possibility that some of them would see inconsistent
    > buffer states or even worse may not have necessary memory mapped and
    > resized - thus causing segfaults. Am I correct?
    >
    > I think what needs to be done is that every backend should wait for other
    > backends to attach themselves to the barrier before moving to the
    > first phase. One way I can think of is we use two signal barriers -
    > one to ensure that all the backends have attached themselves and
    > second for the actual resizing. But then the postmaster needs to wait for
    > all the processes to process the first signal barrier. A postmaster can
    > not wait on anything. Maybe there's a way to poll, but I didn't find
    > it. Does that mean that we have to make some other backend a coordinator?
    
    Yes, you're right, a plain dynamic Barrier does not ensure all available
    processes will be synchronized. I was aware of the scenario you
    describe, it's mentioned in the comments for the resize function. I was
    under the impression this should be enough, but after some more thinking
    I'm not so sure anymore. Let me try to structure it as a list of
    possible corner cases that we need to worry about:
    
    * New backend spawned while we're busy resizing shared memory. Those
      should wait until the resizing is complete and get the new size as well.
    
    * Old backend receives a resize message, but exits before attempting to
      resize. Those should be excluded from coordination.
    
    * A backend is blocked and not responding before or after the
      ProcSignalBarrier message was sent. I'm thinking about a failure
      situation, when one rogue backend is doing something without checking
      for interrupts. We need to wait for those to become responsive, and
      potentially abort shared memory resize after some timeout.
    
    * Backends join the barrier in disjoint groups with some time in
      between, which is longer than what it takes to resize shared memory.
      That means that relying only on the shared dynamic barrier is not
      enough -- it will only synchronize the resize procedure within those
      groups.
    
    Out of those I think the third poses some problems, e.g. if we are shrinking
    the shared memory, but one backend is accessing the buffer pool without
    checking for interrupts. In the v3 implementation this won't be handled
    correctly, other backends will ignore such a rogue process. Independently
    from that we could reason about the logic much more easily if it's guaranteed
    that all the processes resizing shared memory will wait for each other to
    start simultaneously.
    
    Looks like to achieve that we need a slightly different combination of a
    global Barrier and ProcSignalBarrier mechanism. We can't use
    ProcSignalBarrier as it is, because processes need to wait for each
    other, and at the same time finish processing to bump the generation. We
    also can't use a simple dynamic Barrier due to the possibility of disjoint
    groups of processes. A static Barrier is not any easier, because we
    would need to somehow know the exact number of processes, which might change
    over time.
    
    I think a relatively elegant solution is to extend ProcSignalBarrier
    mechanism to track not only pss_barrierGeneration, as a sign that
    everything was processed, but also something like
    pss_barrierReceivedGeneration, indicating that the message was received
    everywhere but not processed yet. That would be enough to allow
    processes to wait until the resize message was received everywhere, then
    use a global Barrier to wait until all processes are finished.  It's
    somewhat similar to your proposal to use two signals, but has less
    implementation overhead.
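
    The idea of tracking a "received" generation next to the "processed" one
    can be sketched with a toy, single-process model. The field names mirror
    pss_barrierReceivedGeneration and pss_barrierGeneration from the
    paragraph above, but ToySlot, all_received and all_processed are
    hypothetical, and a real implementation would use atomics in shared
    memory rather than plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NSLOTS 3

/* Per-process slot, loosely modelled on the ProcSignal slot. */
typedef struct ToySlot
{
    uint64_t    received_gen;   /* last barrier generation seen */
    uint64_t    processed_gen;  /* last barrier generation fully handled */
} ToySlot;

/* True once every slot has at least seen generation `gen`. */
static bool
all_received(const ToySlot *slots, int n, uint64_t gen)
{
    int         i;

    assert(n > 0);
    for (i = 0; i < n; i++)
    {
        if (slots[i].received_gen < gen)
            return false;
    }
    return true;
}

/* True once every slot has finished handling generation `gen`. */
static bool
all_processed(const ToySlot *slots, int n, uint64_t gen)
{
    int         i;

    assert(n > 0);
    for (i = 0; i < n; i++)
    {
        if (slots[i].processed_gen < gen)
            return false;
    }
    return true;
}

/*
 * Walk through the intermediate states.  Each expected observation sets one
 * bit in the result, so 7 means all three observations held.
 */
static int
demo_received(void)
{
    ToySlot     slots[TOY_NSLOTS] = {{0, 0}, {0, 0}, {0, 0}};
    uint64_t    gen = 1;
    int         result = 0;

    /* Two processes have seen the message, the third has not. */
    slots[0].received_gen = gen;
    slots[1].received_gen = gen;
    if (!all_received(slots, TOY_NSLOTS, gen))
        result |= 1;            /* nobody may start resizing yet */

    /* The last process sees the message. */
    slots[2].received_gen = gen;
    if (all_received(slots, TOY_NSLOTS, gen))
        result |= 2;            /* now everyone can proceed together */

    /* Received everywhere, but not processed anywhere yet. */
    if (!all_processed(slots, TOY_NSLOTS, gen))
        result |= 4;

    return result;
}
```

    The point of the second counter is the middle state: the message is
    known to have reached everyone before any process starts the resize.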
    
    This would also allow different solutions regarding error handling. E.g.
    we could do an unbounded waiting for all processes we expect to resize,
    assuming that the user will be able to intervene and fix an issue if
    there is any. Or we can do a timed waiting, and abort the resize after
    some timeout if not all processes are ready yet. In the new v4 version
    of the patch the first option is implemented.
    
    On top of that there are the following changes:
    
    * Shared memory address space is now reserved for future usage, making
      shared memory segments clash (e.g. due to memory allocation)
      impossible.  There is a new GUC to control how much space to reserve,
      which is called max_available_memory -- on the assumption that most of
      the time it would make sense to set its value to the total amount of
      memory on the machine. I'm open for suggestions regarding the name.
    
    * There is one more patch to address hugepages remap. As mentioned in
      this thread above, Linux kernel has certain limitations when it comes
      to mremap for segments allocated with huge pages. To work around it, it's
      possible to replace mremap with a sequence of unmap and map again,
      relying on the anon file behind the segment to keep the memory
      content. I haven't found any downsides of this approach so far, but it
      makes the anonymous file patch 0007 mandatory.
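
    The two changes listed above can be combined in a small sketch: reserve
    a large address range up front, place a file-backed segment at its
    start, and later grow the segment in place by unmapping it and mapping a
    larger size at the same address. This is an illustration under stated
    assumptions, not the patch itself: it assumes Linux, uses regular pages
    instead of huge pages, uses memfd_create() for the anonymous file, and
    the function names and sizes are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for memfd_create() and MAP_NORESERVE */
#endif
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Reserve address space only: inaccessible, with no memory committed. */
static void *
reserve_range(size_t size)
{
    assert(size > 0);
    return mmap(NULL, size, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}

/* Map the first `size` bytes of `fd` at `base`, replacing what was there. */
static void *
map_segment(void *base, int fd, size_t size)
{
    return mmap(base, size, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_FIXED, fd, 0);
}

/* Returns 1 if the segment grows in place and keeps its contents. */
static int
demo_grow_in_reservation(void)
{
    size_t      page = (size_t) sysconf(_SC_PAGESIZE);
    size_t      reserved = 64 * page;
    size_t      small = 4 * page;
    size_t      large = 16 * page;
    char       *base;
    char       *seg;
    int         ok;
    int         fd = memfd_create("toy_segment", 0);

    if (fd == -1 || ftruncate(fd, (off_t) small) == -1)
        return 0;
    base = reserve_range(reserved);
    if (base == MAP_FAILED)
        return 0;
    seg = map_segment(base, fd, small);
    if (seg == MAP_FAILED)
        return 0;
    strcpy(seg, "before grow");

    /* Grow: unmap, extend the file, map the larger size at the same spot. */
    if (munmap(seg, small) == -1 || ftruncate(fd, (off_t) large) == -1)
        return 0;
    seg = map_segment(base, fd, large);
    if (seg == MAP_FAILED)
        return 0;
    seg[large - 1] = 'x';       /* the new tail is usable */

    ok = (seg == base) &&
        strcmp(seg, "before grow") == 0 && seg[large - 1] == 'x';
    munmap(base, reserved);     /* drops the segment and the reservation */
    close(fd);
    return ok;
}
```

    Because the growth happens inside a range the process already owns, the
    larger mapping cannot collide with an unrelated allocation, and the
    contents survive because they live in the file rather than in the
    mapping.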
    
  44. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-04-09T05:42:18Z

    On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    >
    > In the new v4 version
    > of the patch the first option is implemented.
    >
    
    The patches don't apply cleanly using git am, but patch -p1 applies
    them cleanly. However, I see the following compilation errors:
    RuntimeError: command "ninja" failed with error [1/1954] Generating
    src/include/utils/errcodes with a custom command
    [2/1954] Generating src/include/storage/lwlocknames_h with a custom command
    [3/1954] Generating src/include/utils/wait_event_names with a custom command
    [4/1954] Compiling C object src/port/libpgport.a.p/pg_popcount_aarch64.c.o
    [5/1954] Compiling C object src/port/libpgport.a.p/pg_numa.c.o
    FAILED: src/port/libpgport.a.p/pg_numa.c.o
    cc -Isrc/port/libpgport.a.p -Isrc/include
    -I../../coderoot/pg/src/include -fdiagnostics-color=always
    -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Werror -g
    -fno-strict-aliasing -fwrapv -fexcess-precision=standard -D_GNU_SOURCE
    -Wmissing-prototypes -Wpointer-arith -Werror=vla -Wendif-labels
    -Wmissing-format-attribute -Wimplicit-fallthrough=3
    -Wcast-function-type -Wshadow=compatible-local -Wformat-security
    -Wdeclaration-after-statement -Wno-format-truncation
    -Wno-stringop-truncation -fPIC -DFRONTEND -MD -MQ
    src/port/libpgport.a.p/pg_numa.c.o -MF
    src/port/libpgport.a.p/pg_numa.c.o.d -o
    src/port/libpgport.a.p/pg_numa.c.o -c
    ../../coderoot/pg/src/port/pg_numa.c
    In file included from ../../coderoot/pg/src/include/storage/spin.h:54,
                     from
    ../../coderoot/pg/src/include/storage/condition_variable.h:26,
                     from ../../coderoot/pg/src/include/storage/barrier.h:22,
                     from ../../coderoot/pg/src/include/storage/pg_shmem.h:27,
                     from ../../coderoot/pg/src/port/pg_numa.c:26:
    ../../coderoot/pg/src/include/storage/s_lock.h:93:2: error: #error
    "s_lock.h may not be included from frontend code"
       93 | #error "s_lock.h may not be included from frontend code"
          |  ^~~~~
    In file included from ../../coderoot/pg/src/port/pg_numa.c:26:
    ../../coderoot/pg/src/include/storage/pg_shmem.h:66:9: error: unknown
    type name ‘pg_atomic_uint32’
       66 |         pg_atomic_uint32        NSharedBuffers;
          |         ^~~~~~~~~~~~~~~~
    ../../coderoot/pg/src/include/storage/pg_shmem.h:68:9: error: unknown
    type name ‘pg_atomic_uint64’
       68 |         pg_atomic_uint64        Generation;
          |         ^~~~~~~~~~~~~~~~
    ../../coderoot/pg/src/port/pg_numa.c: In function ‘pg_numa_get_pagesize’:
    ../../coderoot/pg/src/port/pg_numa.c:117:17: error: too few arguments
    to function ‘GetHugePageSize’
      117 |                 GetHugePageSize(&os_page_size, NULL);
          |                 ^~~~~~~~~~~~~~~
    In file included from ../../coderoot/pg/src/port/pg_numa.c:26:
    ../../coderoot/pg/src/include/storage/pg_shmem.h:127:13: note: declared here
      127 | extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags,
          |             ^~~~~~~~~~~~~~~
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  45. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-04-09T07:45:32Z

    > On Wed, Apr 09, 2025 at 11:12:18AM GMT, Ashutosh Bapat wrote:
    > On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    > >
    > > In the new v4 version
    > > of the patch the first option is implemented.
    > >
    >
    > The patches don't apply cleanly using git am but patch -p1 applies
    > them cleanly. However I see following compilation errors
    > RuntimeError: command "ninja" failed with error
    
    Because it's relatively meaningless to apply a patch to the tip of the
    master around the release freeze time :) Commit 65c298f61fc has
    introduced new usage of GetHugePageSize, which was modified in my patch.
    I'm going to address it with the next rebased version, in the meantime
    you can always use the specified base commit to apply the changeset:
    
        base-commit: 5e1915439085014140314979c4dd5e23bd677cac
    
    
    
    
  46. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-04-09T07:50:16Z

    On Wed, Apr 9, 2025 at 1:15 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    >
    > > On Wed, Apr 09, 2025 at 11:12:18AM GMT, Ashutosh Bapat wrote:
    > > On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    > > >
    > > > In the new v4 version
    > > > of the patch the first option is implemented.
    > > >
    > >
    > > The patches don't apply cleanly using git am but patch -p1 applies
    > > them cleanly. However I see following compilation errors
    > > RuntimeError: command "ninja" failed with error
    >
    > Because it's relatively meaningless to apply a patch to the tip of the
    > master around the release freeze time :) Commit 65c298f61fc has
    > introduced new usage of GetHugePageSize, which was modified in my patch.
    > I'm going to address it with the next rebased version, in the meantime
    > you can always use the specified base commit to apply the changeset:
    >
    >     base-commit: 5e1915439085014140314979c4dd5e23bd677cac
    
    There is a higher chance that people will try these patches now than
    there was two days ago, and an even higher chance if they find that the
    patches apply easily.
    
    ../../coderoot/pg/src/include/storage/s_lock.h:93:2: error: #error
    "s_lock.h may not be included from frontend code"
    
    How about this? Why is that happening?
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  47. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-04-09T08:19:10Z

    > On Wed, Apr 09, 2025 at 01:20:16PM GMT, Ashutosh Bapat wrote:
    > ../../coderoot/pg/src/include/storage/s_lock.h:93:2: error: #error
    > "s_lock.h may not be included from frontend code"
    >
    > How about this? Why is that happening?
    
    The same -- as you can see it comes from compiling pg_numa.c, which
    it seems is used in frontend code and includes pg_shmem.h. I wanted to
    reshuffle includes in the patch anyway, that would be a good excuse to
    finally do this.
    
    
    
    
  48. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-04-11T14:34:39Z

    On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    >
    > Yes, you're right, plain dynamic Barrier does not ensure all available
    > processes will be synchronized. I was aware about the scenario you
    > describe, it's mentioned in commentaries for the resize function. I was
    > under the impression this should be enough, but after some more thinking
    > I'm not so sure anymore. Let me try to structure it as a list of
    > possible corner cases that we need to worry about:
    >
    > * New backend spawned while we're busy resizing shared memory. Those
    >   should wait until the resizing is complete and get the new size as well.
    >
    > * Old backend receives a resize message, but exits before attempting to
    >   resize. Those should be excluded from coordination.
    
    Should we detach barrier in on_exit()?
    
    >
    > * A backend is blocked and not responding before or after the
    >   ProcSignalBarrier message was sent. I'm thinking about a failure
    >   situation, when one rogue backend is doing something without checking
    >   for interrupts. We need to wait for those to become responsive, and
    >   potentially abort shared memory resize after some timeout.
    
    Right.
    
    >
    > I think a relatively elegant solution is to extend ProcSignalBarrier
    > mechanism to track not only pss_barrierGeneration, as a sign that
    > everything was processed, but also something like
    > pss_barrierReceivedGeneration, indicating that the message was received
    > everywhere but not processed yet. That would be enough to allow
    > processes to wait until the resize message was received everywhere, then
    > use a global Barrier to wait until all processes are finished.  It's
    > somehow similar to your proposal to use two signals, but has less
    > implementation overhead.
    
    The way it's implemented in v4 still has the disjoint group problem.
    Assume backends p1, p2, p3. All three of them are executing
    ProcessProcSignalBarrier(). All three of them updated
    pss_barrierReceivedGeneration
    
    /* The message is observed, record that */
    pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration,
    shared_gen);
    
    p1, p2 moved faster and reached following code from ProcessBarrierShmemResize()
    if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED)
      WaitForProcSignalBarrierReceived(pg_atomic_read_u64(&ShmemCtrl->Generation));
    
    Since all the processes have received the barrier message, p1, p2 move
    ahead and go through all the next phases and finish resizing even
    before p3 gets a chance to call ProcessBarrierShmemResize() and attach
    itself to Barrier. This could happen because it processed some other
    ProcSignalBarrier message. p1 and p2 won't wait for p3 since it has
    not attached itself to the barrier. Once p1, p2 finish, p3 will attach
    itself to the barrier and resize buffers again - reinitializing the
    shared memory, which might have already been modified by p1 or p2. Boom
    - there's memory corruption.
    
    Either every process has to make sure that all the other extant
    backends have attached themselves to the barrier OR somebody has to
    ensure that and signal all the backends to proceed. The implementation
    doesn't do either.
    
    >
    > * Shared memory address space is now reserved for future usage, making
    >   shared memory segments clash (e.g. due to memory allocation)
    >   impossible.  There is a new GUC to control how much space to reserve,
    >   which is called max_available_memory -- on the assumption that most of
    >   the time it would make sense to set its value to the total amount of
    >   memory on the machine. I'm open for suggestions regarding the name.
    
    With 0006 applied
    + /* Clean up some reserved space to resize into */
    + if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1)
    ze, m->shmem)));
    ... snip ...
    + ptr = mremap(m->shmem, m->shmem_size, new_size, 0);
    
    We unmap the portion of reserved address space where the existing
    segment would expand into. As long as we are just expanding this will
    work. I am wondering how would this work for shrinking buffers? What
    scheme do you have in mind?
    
    >
    > * There is one more patch to address hugepages remap. As mentioned in
    >   this thread above, Linux kernel has certain limitations when it comes
    >   to mremap for segments allocated with huge pages. To work around it's
    >   possible to replace mremap with a sequence of unmap and map again,
    >   relying on the anon file behind the segment to keep the memory
    >   content. I haven't found any downsides of this approach so far, but it
    >   makes the anonymous file patch 0007 mandatory.
    
    In 0008
    if (munmap(m->shmem, m->shmem_size) < 0)
    ... snip ...
    /* Resize the backing anon file. */
    if(ftruncate(m->segment_fd, new_size) == -1)
    ...
    /* Reclaim the space */
    ptr = mmap(m->shmem, new_size, PROT_READ | PROT_WRITE,
    mmap_flags | MAP_FIXED, m->segment_fd, 0);
    
    How are we preventing something from getting mapped into the space after
    m->shmem + newsize? We will need to add an unallocated but reserved
    address space mapping after m->shmem + newsize, right?
    
    --
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  49. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-04-11T15:01:31Z

    > On Fri, Apr 11, 2025 at 08:04:39PM GMT, Ashutosh Bapat wrote:
    > On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    > >
    > > Yes, you're right, plain dynamic Barrier does not ensure all available
    > > processes will be synchronized. I was aware about the scenario you
    > > describe, it's mentioned in commentaries for the resize function. I was
    > > under the impression this should be enough, but after some more thinking
    > > I'm not so sure anymore. Let me try to structure it as a list of
    > > possible corner cases that we need to worry about:
    > >
    > > * New backend spawned while we're busy resizing shared memory. Those
    > >   should wait until the resizing is complete and get the new size as well.
    > >
    > > * Old backend receives a resize message, but exits before attempting to
    > >   resize. Those should be excluded from coordination.
    >
    > Should we detach barrier in on_exit()?
    
    Yeah, good point.
    
    > > I think a relatively elegant solution is to extend ProcSignalBarrier
    > > mechanism to track not only pss_barrierGeneration, as a sign that
    > > everything was processed, but also something like
    > > pss_barrierReceivedGeneration, indicating that the message was received
    > > everywhere but not processed yet. That would be enough to allow
    > > processes to wait until the resize message was received everywhere, then
    > > use a global Barrier to wait until all processes are finished.  It's
    > > somehow similar to your proposal to use two signals, but has less
    > > implementation overhead.
    >
    > The way it's implemented in v4 still has the disjoint group problem.
    > Assume backends p1, p2, p3. All three of them are executing
    > ProcessProcSignalBarrier(). All three of them updated
    > pss_barrierReceivedGeneration
    >
    > /* The message is observed, record that */
    > pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration,
    > shared_gen);
    >
    > p1, p2 moved faster and reached following code from ProcessBarrierShmemResize()
    > if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED)
    >   WaitForProcSignalBarrierReceived(pg_atomic_read_u64(&ShmemCtrl->Generation));
    >
    > Since all the processes have received the barrier message, p1, p2 move
    > ahead and go through all the next phases and finish resizing even
    > before p3 gets a chance to call ProcessBarrierShmemResize() and attach
    > itself to Barrier. This could happen because it processed some other
    > ProcSignalBarrier message. p1 and p2 won't wait for p3 since it has
    > not attached itself to the barrier. Once p1, p2 finish, p3 will attach
    > itself to the barrier and resize buffers again - reinitializing the
    > shared memory, which might have already been modified by p1 or p2. Boom
    > - there's memory corruption.
    
    It won't reinitialize anything, since this logic is controlled by the
    ShmemCtrl->NSharedBuffers, if it's already updated nothing will be
    changed.
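
    One way to picture this guard is a resize routine that compares the
    shared buffer count with the target and does nothing if they already
    match. The sketch below is an assumption about the shape of such a
    check, not the code from the patch: ToyShmemCtrl, toy_resize and
    toy_init_calls are hypothetical, and the check-then-store is not atomic
    against concurrent resizers, which in the thread's design are
    synchronized by the barrier instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for the shared control structure holding the buffer count. */
typedef struct ToyShmemCtrl
{
    atomic_int  n_shared_buffers;
} ToyShmemCtrl;

/* Counts how often the expensive (re)initialization would have run. */
static int  toy_init_calls = 0;

/*
 * Resize to `target` buffers unless the shared state already says so.
 * Returns true if this caller did the work.
 */
static bool
toy_resize(ToyShmemCtrl *ctrl, int target)
{
    int         current = atomic_load(&ctrl->n_shared_buffers);

    assert(target > 0);
    if (current == target)
        return false;           /* somebody else already resized */

    toy_init_calls++;           /* stands for initializing buffer structures */
    atomic_store(&ctrl->n_shared_buffers, target);
    return true;
}

/* An early and a late process both handle the same resize request. */
static int
demo_late_resizer(void)
{
    ToyShmemCtrl ctrl;

    atomic_init(&ctrl.n_shared_buffers, 16384);
    toy_init_calls = 0;

    toy_resize(&ctrl, 25600);   /* early process performs the resize */
    toy_resize(&ctrl, 25600);   /* late process finds nothing to do */

    return toy_init_calls;
}
```

    The values 16384 and 25600 are the buffer counts from the log earlier
    in the thread. With such a guard the late process still has to remap its
    own address space, but it does not initialize the shared structures a
    second time.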
    
    About the race condition you mention, there is indeed a window between
    receiving the ProcSignalBarrier and attaching to the global Barrier in
    resize, but I don't think any process will be able to touch the buffer pool
    while inside this window. Even if it happens that the remapping itself
    was so blazing fast that this window was enough to make one process late
    (e.g. if it was busy handling some other signal as you mention), as I've
    shown above it shouldn't be a problem.
    
    I can experiment with this case though, maybe there is a way to
    completely close this window so that we don't have to think about even
    potential scenarios.
    
    > > * Shared memory address space is now reserved for future usage, making
    > >   shared memory segments clash (e.g. due to memory allocation)
    > >   impossible.  There is a new GUC to control how much space to reserve,
    > >   which is called max_available_memory -- on the assumption that most of
    > >   the time it would make sense to set its value to the total amount of
    > >   memory on the machine. I'm open for suggestions regarding the name.
    >
    > With 0006 applied
    > + /* Clean up some reserved space to resize into */
    > + if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1)
    > ze, m->shmem)));
    > ... snip ...
    > + ptr = mremap(m->shmem, m->shmem_size, new_size, 0);
    >
    > We unmap the portion of reserved address space where the existing
    > segment would expand into. As long as we are just expanding this will
    > work. I am wondering how would this work for shrinking buffers? What
    > scheme do you have in mind?
    
    I didn't like this part originally, and after changes to support hugetlb
    I think it's worth it just to replace mremap with munmap/mmap. That way
    there will be no such question, e.g. if a segment is getting shrunk
    the unmapped area will again become a part of the reserved space.
    
    > > * There is one more patch to address hugepages remap. As mentioned in
    > >   this thread above, Linux kernel has certain limitations when it comes
    > >   to mremap for segments allocated with huge pages. To work around it, it's
    > >   possible to replace mremap with a sequence of unmap and map again,
    > >   relying on the anon file behind the segment to keep the memory
    > >   content. I haven't found any downsides of this approach so far, but it
    > >   makes the anonymous file patch 0007 mandatory.
    >
    > In 0008
    > if (munmap(m->shmem, m->shmem_size) < 0)
    > ... snip ...
    > /* Resize the backing anon file. */
    > if(ftruncate(m->segment_fd, new_size) == -1)
    > ...
    > /* Reclaim the space */
    > ptr = mmap(m->shmem, new_size, PROT_READ | PROT_WRITE,
    > mmap_flags | MAP_FIXED, m->segment_fd, 0);
    >
    > How are we preventing something from getting mapped into the space after
    > m->shmem + newsize? We will need to add an unallocated but reserved
    > address space map after m->shmem+newsize, right?
    
    Nope, the segment is allocated from the reserved space already, with
    some chunk of it left after the segment's end for resizing purposes. We
    only take some part of the designated space, the rest is still reserved.
    
    
    
    
  50. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-04-14T05:10:28Z

    On Fri, Apr 11, 2025 at 8:31 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    >
    > > > I think a relatively elegant solution is to extend ProcSignalBarrier
    > > > mechanism to track not only pss_barrierGeneration, as a sign that
    > > > everything was processed, but also something like
    > > > pss_barrierReceivedGeneration, indicating that the message was received
    > > > everywhere but not processed yet. That would be enough to allow
    > > > processes to wait until the resize message was received everywhere, then
    > > > use a global Barrier to wait until all processes are finished.  It's
    > > > somehow similar to your proposal to use two signals, but has less
    > > > implementation overhead.
    > >
    > > The way it's implemented in v4 still has the disjoint group problem.
    > > Assume backends p1, p2, p3. All three of them are executing
    > > ProcessProcSignalBarrier(). All three of them updated
    > > pss_barrierReceivedGeneration
    > >
    > > /* The message is observed, record that */
    > > pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration,
    > > shared_gen);
    > >
    > > p1, p2 moved faster and reached following code from ProcessBarrierShmemResize()
    > > if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED)
    > >   WaitForProcSignalBarrierReceived(pg_atomic_read_u64(&ShmemCtrl->Generation));
    > >
    > > Since all the processes have received the barrier message, p1, p2 move
    > > ahead and go through all the next phases and finish resizing even
    > > before p3 gets a chance to call ProcessBarrierShmemResize() and attach
    > > itself to Barrier. This could happen because it processed some other
    > > ProcSignalBarrier message. p1 and p2 won't wait for p3 since it has
    > > not attached itself to the barrier. Once p1, p2 finish, p3 will attach
    > > itself to the barrier and resize buffers again - reinitializing the
    > > shared memory, which might have already been modified by p1 or p2. Boom
    > > - there's memory corruption.
    >
    > It won't reinitialize anything, since this logic is controlled by the
    > ShmemCtrl->NSharedBuffers, if it's already updated nothing will be
    > changed.
    
    Ah, I see it now
    if(pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)
    {
    
    Thanks for the clarification.
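    
    A minimal sketch of the guard quoted above, for readers following along.
    It uses C11 atomics instead of PostgreSQL's pg_atomic API, and
    ShmemCtrlSketch, resize_if_needed() and init_calls are hypothetical
    names, not code from the patch. It only shows why a process arriving
    late does not reinitialize anything; it does not model the barrier
    phases.
    
    ```c
    #include <assert.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    
    /* Hypothetical stand-in for the shared control struct in the patch. */
    typedef struct
    {
        _Atomic uint32_t NSharedBuffers;    /* size already applied to shmem */
    } ShmemCtrlSketch;
    
    /*
     * Resize only if the shared size differs from the requested one.
     * init_calls counts how often the (placeholder) initialization ran.
     * Returns true if this caller performed the resize.
     */
    static bool
    resize_if_needed(ShmemCtrlSketch *ctrl, uint32_t requested, int *init_calls)
    {
        assert(ctrl != NULL && init_calls != NULL);
    
        if (atomic_load(&ctrl->NSharedBuffers) == requested)
            return false;       /* already resized by someone else */
    
        (*init_calls)++;        /* placeholder for initializing new buffers */
        atomic_store(&ctrl->NSharedBuffers, requested);
        return true;
    }
    ```
    
    A second call with the same target is a no-op, which is the behaviour
    described for p3 in the scenario above.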
    
    However, when we put back the patches to shrink buffers, we will evict
    the extra buffers, and shrink - if all the processes haven't
    participated in the barrier by then, some of them may try to access
    those buffers - re-installing them and then bad things can happen.
    
    >
    > About the race condition you mention, there is indeed a window between
    > receiving the ProcSignalBarrier and attaching to the global Barrier in
    > resize, but I don't think any process will be able to touch buffer pool
    > while inside this window. Even if it happens that the remapping itself
    > was so blazing fast that this window was enough to make one process late
    > (e.g. if it was busy handling some other signal as you mention), as I've
    > shown above it shouldn't be a problem.
    >
    > I can experiment with this case though, maybe there is a way to
    > completely close this window, so as not to think about even potential
    > scenarios.
    
    The window may be small today but we have to make this future proof.
    Multiple ProcSignalBarrier messages may be processed in a single call
    to ProcessProcSignalBarrier() and if each of those takes as long as
    buffer resizing, the window will get bigger and bigger. So we have to
    close this window.
    
    >
    > > > * Shared memory address space is now reserved for future usage, making
    > > >   shared memory segments clash (e.g. due to memory allocation)
    > > >   impossible.  There is a new GUC to control how much space to reserve,
    > > >   which is called max_available_memory -- on the assumption that most of
    > > >   the time it would make sense to set its value to the total amount of
    > > >   memory on the machine. I'm open for suggestions regarding the name.
    > >
    > > With 0006 applied
    > > + /* Clean up some reserved space to resize into */
    > > + if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1)
    > > ze, m->shmem)));
    > > ... snip ...
    > > + ptr = mremap(m->shmem, m->shmem_size, new_size, 0);
    > >
    > > We unmap the portion of reserved address space where the existing
    > > segment would expand into. As long as we are just expanding this will
    > > work. I am wondering how would this work for shrinking buffers? What
    > > scheme do you have in mind?
    >
    > I didn't like this part originally, and after changes to support hugetlb
    > I think it's worth it just to replace mremap with munmap/mmap. That way
    > there will be no such question, e.g. if a segment is getting shrunk
    > the unmapped area will again become a part of the reserved space.
    >
    
    I might have not noticed it, but are we putting two mappings one
    reserved and one allocated in the same address space, so that when the
    allocated mapping shrinks or expands, the reserved mapping continues
    to prohibit any other mapping from appearing there? I looked at some
    of the previous emails, but didn't find anything that describes how
    the reserved mapped space is managed.
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  51. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-04-14T07:20:44Z

    > On Mon, Apr 14, 2025 at 10:40:28AM GMT, Ashutosh Bapat wrote:
    >
    > However, when we put back the patches to shrink buffers, we will evict
    > the extra buffers, and shrink - if all the processes haven't
    > participated in the barrier by then, some of them may try to access
    > those buffers - re-installing them and then bad things can happen.
    
    As I've mentioned above, I don't see how a process could try to access a
    buffer, if it's on the path between receiving the ProcSignalBarrier and
    attaching to the global shmem Barrier, even if we shrink buffers.
    AFAICT interrupt handlers should not touch buffers, and otherwise the
    process doesn't have any point within this window where it might do
    this. Do you have some particular scenario in mind?
    
    > I might have not noticed it, but are we putting two mappings one
    > reserved and one allocated in the same address space, so that when the
    > allocated mapping shrinks or expands, the reserved mapping continues
    > to prohibit any other mapping from appearing there? I looked at some
    > of the previous emails, but didn't find anything that describes how
    > the reserved mapped space is managed.
    
    I thought so, but this turns out to be incorrect. I've just done a small
    experiment -- looks like when reserving some space, mapping and
    unmapping a small segment from it leaves a non-mapped gap. That would
    mean that for shrinking, the newly available space has to be reserved again.
    
    
    
    
  52. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-04-14T08:58:59Z

    On Mon, Apr 14, 2025 at 12:50 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    >
    > > On Mon, Apr 14, 2025 at 10:40:28AM GMT, Ashutosh Bapat wrote:
    > >
    > > However, when we put back the patches to shrink buffers, we will evict
    > > the extra buffers, and shrink - if all the processes haven't
    > > participated in the barrier by then, some of them may try to access
    > > those buffers - re-installing them and then bad things can happen.
    >
    > As I've mentioned above, I don't see how a process could try to access a
    > buffer, if it's on the path between receiving the ProcSignalBarrier and
    > attaching to the global shmem Barrier, even if we shrink buffers.
    > AFAICT interrupt handlers should not touch buffers, and otherwise the
    > process doesn't have any point within this window where it might do
    > this. Do you have some particular scenario in mind?
    
    ProcessProcSignalBarrier() is not within an interrupt handler but it
    responds to a flag set by an interrupt handler. After calling
    pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration,
    shared_gen); it will enter the loop
    
    while (flags != 0)
    where it may process many barriers before processing
    PROCSIGNAL_BARRIER_SHMEM_RESIZE. Nothing stops the other barrier
    processing code from touching buffers. Right now it's just smgrrelease
    that gets called in the other barrier. But that's not guaranteed in
    future.
    
    >
    > > I might have not noticed it, but are we putting two mappings one
    > > reserved and one allocated in the same address space, so that when the
    > > allocated mapping shrinks or expands, the reserved mapping continues
    > > to prohibit any other mapping from appearing there? I looked at some
    > > of the previous emails, but didn't find anything that describes how
    > > the reserved mapped space is managed.
    >
    > I thought so, but this turns out to be incorrect. I've just done a small
    > experiment -- looks like when reserving some space, mapping and
    > unmapping a small segment from it leaves a non-mapped gap. That would
    > mean that for shrinking, the newly available space has to be reserved again.
    
    Right. That's what I thought. But I didn't see the corresponding code.
    So we have to keep track of two mappings for every segment - 1 for
    allocation and one for reserving space and resize those two while
    shrinking and expanding buffers. Am I correct?
    
    
    --
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  53. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-04-17T09:52:28Z

    Hi Dmitry,
    
    On Mon, Apr 14, 2025 at 12:50 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    >
    > > On Mon, Apr 14, 2025 at 10:40:28AM GMT, Ashutosh Bapat wrote:
    > >
    > > However, when we put back the patches to shrink buffers, we will evict
    > > the extra buffers, and shrink - if all the processes haven't
    > > participated in the barrier by then, some of them may try to access
    > > those buffers - re-installing them and then bad things can happen.
    >
    > As I've mentioned above, I don't see how a process could try to access a
    > buffer, if it's on the path between receiving the ProcSignalBarrier and
    > attaching to the global shmem Barrier, even if we shrink buffers.
    > AFAICT interrupt handlers should not touch buffers, and otherwise the
    > process doesn't have any point within this window where it might do
    > this. Do you have some particular scenario in mind?
    >
    > > I might have not noticed it, but are we putting two mappings one
    > > reserved and one allocated in the same address space, so that when the
    > > allocated mapping shrinks or expands, the reserved mapping continues
    > > to prohibit any other mapping from appearing there? I looked at some
    > > of the previous emails, but didn't find anything that describes how
    > > the reserved mapped space is managed.
    >
    > I thought so, but this turns out to be incorrect. I've just done a small
    > experiment -- looks like when reserving some space, mapping and
    > unmapping a small segment from it leaves a non-mapped gap. That would
    > mean that for shrinking, the newly available space has to be reserved again.
    
    In an offlist chat Thomas Munro mentioned that just ftruncate() would
    be enough to resize the shared memory without touching address maps
    using mmap and munmap().
    
    The ftruncate man page [1] seems to concur with him:
    
           If the effect of ftruncate() is to decrease the size of a memory
           mapped file or a shared memory object and whole pages beyond the
           new end were previously mapped, then the whole pages beyond the
           new end shall be discarded.
    
           References to discarded pages shall result in the generation of a
           SIGBUS signal.
    
           If the effect of ftruncate() is to increase the size of a memory
           object, it is unspecified whether the contents of any mapped pages
           between the old end-of-file and the new are flushed to the
           underlying object.
    
    ftruncate() when shrinking memory will release the extra pages and
    will also cause SIGBUS when memory beyond the size of the
    file is accessed, even if the actual address map is larger than the
    mapped file. The expanded memory is allocated as it is written to, and
    those pages also become visible in the underlying object.
    
    I played with the attached small program under a debugger, observing pmap
    and /proc/<pid>/status after every memory operation. The address map
    always shows the full 300MB mapping:
    00007fffd2200000 307200K rw-s- memfd:mmap_fd_exp (deleted)
    
    Immediately after mmap()
    RssShmem:              0 kB
    
    after first memset
    RssShmem:         307200 kB
    
    after ftruncate to 100MB (we don't need to wait for memset() to see
    the effect on RssShmem)
    RssShmem:         102400 kB
    
    after ftruncate to 200MB (requires memset to see effect on RssShmem)
    RssShmem:         102400 kB
    
    after memsetting up to 200MB
    RssShmem:         204800 kB
    
    All the observations concur with the man page.
    
    [1] https://man7.org/linux/man-pages/man3/ftruncate.3p.html#:~:text=If%20the%20effect%20of%20ftruncate,generation%20of%20a%20SIGBUS%20signal.
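    
    The attached program is not part of this archive. A minimal sketch of
    the same kind of experiment, assuming Linux (memfd_create) and sizes
    that are multiples of the page size, could look like this. SegSketch,
    seg_create(), seg_resize() and seg_file_size() are hypothetical names.
    The mapping is created once at the maximum size and only the backing
    object is resized afterwards.
    
    ```c
    #define _GNU_SOURCE
    #include <assert.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>
    
    #define MB ((size_t) 1024 * 1024)
    
    typedef struct
    {
        int     fd;         /* memfd backing the segment */
        char   *base;       /* start of the fixed-size mapping */
        size_t  map_size;   /* size of the mapping, never changes */
    } SegSketch;
    
    /* Create the backing object and map all of it once. */
    static int
    seg_create(SegSketch *s, size_t map_size)
    {
        assert(s != NULL && map_size > 0);
    
        s->fd = memfd_create("mmap_fd_exp", 0);
        if (s->fd < 0)
            return -1;
        if (ftruncate(s->fd, (off_t) map_size) < 0)
        {
            close(s->fd);
            return -1;
        }
        s->base = mmap(NULL, map_size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, s->fd, 0);
        if (s->base == MAP_FAILED)
        {
            close(s->fd);
            return -1;
        }
        s->map_size = map_size;
        return 0;
    }
    
    /*
     * Resize only the backing object; the address map is left alone.
     * After shrinking, touching bytes past new_size raises SIGBUS.
     */
    static int
    seg_resize(SegSketch *s, size_t new_size)
    {
        if (new_size > s->map_size)
            return -1;          /* cannot grow past the mapping itself */
        return ftruncate(s->fd, (off_t) new_size);
    }
    
    /* Current size of the backing object, or -1 on error. */
    static off_t
    seg_file_size(const SegSketch *s)
    {
        struct stat st;
    
        return fstat(s->fd, &st) == 0 ? st.st_size : (off_t) -1;
    }
    ```
    
    Calling seg_resize() with a smaller size and then a larger one leaves
    the regrown range reading as zeros, since the old pages were discarded.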
    
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  54. Re: Changing shared_buffers without restart

    Konstantin Knizhnik <knizhnik@garret.ru> — 2025-04-17T11:21:07Z

    On 25/02/2025 11:52 am, Dmitry Dolgov wrote:
    >> On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
    >> TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
    >> changing shared memory mapping layout. Any feedback is appreciated.
    
    Hi Dmitry,
    
    I am sorry that I have not participated in the discussion in this thread 
    from the very beginning, although I am also very interested in dynamic 
    shared buffer resizing and even proposed my own implementation of it:
    https://github.com/knizhnik/postgres/pull/2 based on memory ballooning
    and using `madvise`. And it really works (returns unused memory to the
    system).
    This PoC allowed me to understand the main drawbacks of this approach:
    
    1. Performance of the Postgres CLOCK page eviction algorithm depends on
    the number of shared buffers. My first naive attempt, just marking unused
    buffers as invalid, caused a significant performance degradation:
    
    pgbench -c 32 -j 4 -T 100 -P1 -M prepared -S
    
    (here `shared_buffers` is the maximal shared buffers size and
    `available_buffers` is the used part):
    
    | shared_buffers | available_buffers | TPS  |
    | -------------- | ----------------- | ---- |
    | 128MB          | -1                | 280k |
    | 1GB            | -1                | 324k |
    | 2GB            | -1                | 358k |
    | 32GB           | -1                | 350k |
    | 2GB            | 128Mb             | 130k |
    | 2GB            | 1Gb               | 311k |
    | 32GB           | 128Mb             | 13k  |
    | 32GB           | 1Gb               | 140k |
    | 32GB           | 2Gb               | 348k |
    
    My first thought was to replace clock with an LRU based on a doubly linked
    list. As there is no lockless doubly linked list implementation,
    it needs some global lock. This lock can become a bottleneck. The standard
    solution is partitioning: use N LRU lists instead of 1,
    just like the partitioned hash table used by the buffer manager to look up
    buffers. Actually we can use the same partition locks to protect the LRU lists.
    But it is not clear what to do with ring buffers (strategies). So I decided
    not to perform such a revolution in bufmgr, but to optimize clock to more
    efficiently skip reserved buffers.
    Just add a `skip_count` field to the buffer descriptor. And it helps! Now the
    worst case shared_buffers/available_buffers = 32GB/128MB
    shows the same performance, 280k, as shared_buffers=128MB without ballooning.
    
    2. There are several data structures in Postgres whose size depends on
    the number of buffers.
    In my patch I used the dynamic shared buffer size in some cases, but if a
    structure has to be allocated in shared memory then the maximal size still
    has to be used. We have the buffers themselves (8 kB per buffer), then
    the main BufferDescriptors array (64 B), the BufferIOCVArray (16 B),
    checkpoint's CkptBufferIds (20 B), and the hashmap on the buffer cache
    (24B+8B/entry).
    128 bytes per 8 kB buffer does not seem too large an overhead (~1%), but it
    may be quite noticeable with size differences larger than 2 orders of magnitude:
    e.g. to support scaling from 0.5GB to 128GB, with 128 bytes/buffer
    we'd have ~2GiB of static overhead on only 0.5GiB of actual buffers.
    
    3. `madvise` is not portable.
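    
    Point 1 mentions a skip_count field without showing how it is used. One
    plausible reading, sketched here under assumptions and not taken from
    the linked pull request, is that each reserved descriptor records how
    many reserved descriptors remain in its run, so the clock hand can jump
    over the whole run in one step. BufDescSketch, mark_reserved() and
    clock_sweep() are hypothetical names, and pinning and locking are
    left out.
    
    ```c
    #include <assert.h>
    #include <stddef.h>
    
    typedef struct
    {
        unsigned usage_count;   /* decremented by the sweep, as in CLOCK */
        unsigned skip_count;    /* assumed: >0 means reserved run of this length */
    } BufDescSketch;
    
    /* Mark bufs[start .. start+len-1] as reserved (not usable). */
    static void
    mark_reserved(BufDescSketch *bufs, size_t nbufs, size_t start, size_t len)
    {
        assert(start + len <= nbufs);
        for (size_t i = 0; i < len; i++)
            bufs[start + i].skip_count = (unsigned) (len - i);
    }
    
    /*
     * Return the index of a victim buffer, or -1 if every buffer is
     * reserved. *hand is the clock hand and is advanced in place.
     */
    static long
    clock_sweep(BufDescSketch *bufs, size_t nbufs, size_t *hand)
    {
        size_t idle = 0;        /* slots passed without seeing a usable buffer */
    
        while (idle < nbufs)
        {
            size_t here = *hand;
            BufDescSketch *d = &bufs[here];
    
            if (d->skip_count > 0)
            {
                /* Jump over the rest of the reserved run at once. */
                *hand = (here + d->skip_count) % nbufs;
                idle += d->skip_count;
                continue;
            }
    
            *hand = (here + 1) % nbufs;
            if (d->usage_count == 0)
                return (long) here;
            d->usage_count--;
            idle = 0;
        }
        return -1;
    }
    ```
    
    With this layout the cost of a sweep depends on the number of usable
    buffers and reserved runs rather than on the total array length.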
    
    Certainly you have moved much further in your proposal compared with my
    PoC (including huge pages support).
    But it is still not quite clear to me how you are going to solve the
    problems with large memory overhead in case of a ~100x variation of
    shared buffers size.
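    
    The overhead figure in point 2 can be checked with a few lines of
    arithmetic. This sketch assumes, as the message argues, that per-buffer
    metadata has to be sized for the maximum pool, and it uses the rounded
    128 bytes per buffer from the text. The constant and function names are
    made up for illustration.
    
    ```c
    #include <assert.h>
    #include <stdint.h>
    
    #define BLOCK_BYTES          8192u  /* one shared buffer */
    #define METADATA_PER_BUFFER   128u  /* rounded figure from the message */
    
    /* Metadata bytes needed when structures are sized for max_pool_bytes. */
    static uint64_t
    metadata_bytes_at_max(uint64_t max_pool_bytes)
    {
        assert(max_pool_bytes % BLOCK_BYTES == 0);
        return max_pool_bytes / BLOCK_BYTES * METADATA_PER_BUFFER;
    }
    ```
    
    For a 128 GiB maximum this gives 2^24 buffers times 128 bytes, i.e.
    2 GiB, which is four times the 0.5 GiB of buffers actually in use at
    the low end.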
    
    
    
    
  55. Re: Changing shared_buffers without restart

    Thomas Munro <thomas.munro@gmail.com> — 2025-04-17T15:54:31Z

    On Thu, Nov 21, 2024 at 8:55 PM Peter Eisentraut <peter@eisentraut.org> wrote:
    > On 19.11.24 14:29, Dmitry Dolgov wrote:
    > >> I see that memfd_create() has a MFD_HUGETLB flag.  It's not very clear how
    > >> that interacts with the MAP_HUGETLB flag for mmap().  Do you need to specify
    > >> both of them if you want huge pages?
    > > Correct, both (one flag in memfd_create and one for mmap) are needed to
    > > use huge pages.
    >
    > I was worried because the FreeBSD man page says
    >
    > MFD_HUGETLB       This flag is currently unsupported.
    >
    > It looks like FreeBSD doesn't have MAP_HUGETLB, so maybe this is irrelevant.
    >
    > But you should make sure in your patch that the right set of flags for
    > huge pages is passed.
    
    MFD_HUGETLB does actually work on FreeBSD, but the man page doesn't
    admit it (guessing an oversight, not sure, will see).  And you don't
    need the corresponding (non-existent) mmap flag.  You also have to
    specify a size eg MFD_HUGETLB | MFD_HUGE_2MB or you get ENOTSUPP, but
    other than that quirk I see it definitely working with eg procstat -v.
    That might be because FreeBSD doesn't have a default huge page size
    concept?  On Linux that's a boot time setting, I guess rarely changed.
    I contemplated that once before, when I wrote a quick demo patch[1] to
    implement huge_pages=on for FreeBSD (ie explicit rather than
    transparent).  I used a different function, not the Linuxoid one but
    it's the same under the covers, and I wrote:
    
    + /*
    + * Find the matching page size index, or if huge_page_size wasn't set,
    + * then skip the smallest size and take the next one after that.
    + */
    
    Swapping that topic back in, I was left wondering: (1) how to choose
    between SHM_LARGEPAGE_ALLOC_DEFAULT, a policy that will cause
    ftruncate() to try to defragment physical memory to fulfil your
    request and can eat some serious CPU, and SHM_LARGEPAGE_ALLOC_NOWAIT,
    and (2) if it's the second thing, well Linux is like that in respect
    of failing fast, but for it to succeed you have to configure
    nr_hugepages in the OS as a separate administrative step and *that's*
    when it does any defragmentation required, and that's another concept
    FreeBSD doesn't have.  It's a bit of a weird concept too, I mean those
    pages are not reserved for you in any way and anyone could nab them,
    which is undeniably practical but it lacks a few qualities one might
    hope for in a kernel facility...  IDK.  Anyway, the Linux-like
    memfd_create() always does it the _DEFAULT way.  Either way, we can't
    have identical "try" semantics: it'll actually put some effort into
    trying, perhaps burning many seconds of CPU.
    
    I took a peek at what we're doing for Windows and the man pages tell
    me that it's like that too.  I don't recall hearing any complaints
    about that, but it's gated on a Windows permission that I assume very
    few enabled, so "try" probably isn't trying for most systems.
    Quoting:
    
    "Large-page memory regions may be difficult to obtain after the system
    has been running for a long time because the physical space for each
    large page must be contiguous, but the memory may have become
    fragmented. Allocating large pages under these conditions can
    significantly affect system performance. Therefore, applications
    should avoid making repeated large-page allocations and instead
    allocate all large pages one time, at startup."
    
    For Windows we also interpret "on" with GetLargePageMinimum(), which
    sounds like my "second known page size" idea.
    
    To make Windows do the thing that this thread wants, I found a thread
    saying that calling VirtualAlloc(..., MEM_RESET) and then convincing
    every process to call VirtualUnlock(...) might work:
    
    https://groups.google.com/g/microsoft.public.win32.programmer.kernel/c/3SvznY38SSc/m/4Sx_xwon1vsJ
    
    I'm not sure what to do about the other Unixen.  One option is
    nothing, no feature, patches welcome.  Another is to use
    shm_open(<made up name>), like DSM segments, except we never need to
    reopen these ones so we could immediately call shm_unlink() to leave
    only a very short window to crash and leak a name.  It'd be low risk
    name pollution in a name space that POSIX forgot to provide any way to
    list.  The other idea is non-standard madvise tricks but they seem
    far too squishy to be part of a "portable" fallback if they even work
    at all, so it might be better not to have the feature than that I
    think.
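    
    A minimal sketch of the shm_open() plus immediate shm_unlink() idea
    described above, assuming a POSIX system. The function name and the
    name scheme are hypothetical, a real implementation would need a
    collision-safe name and retry logic, and on older glibc versions this
    needs linking with -lrt.
    
    ```c
    #include <assert.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>
    
    /*
     * Create an unnamed-in-practice POSIX shared memory object of the
     * given size and return its descriptor, or -1 on failure.
     */
    static int
    anon_shm_create(size_t size)
    {
        char    name[64];
        int     fd;
    
        assert(size > 0);
    
        /* Hypothetical name scheme, for illustration only. */
        snprintf(name, sizeof(name), "/pg_shmem_sketch_%ld", (long) getpid());
    
        fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR);
        if (fd < 0)
            return -1;
    
        /* Drop the name right away; the descriptor keeps the object alive. */
        shm_unlink(name);
    
        if (ftruncate(fd, (off_t) size) < 0)
        {
            close(fd);
            return -1;
        }
        return fd;
    }
    ```
    
    Because the name is unlinked immediately, the window in which a crash
    could leak it is limited to the gap between the two calls.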
    
    
    
    
  56. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-04-17T21:16:23Z

    > On Thu, Apr 17, 2025 at 03:22:28PM GMT, Ashutosh Bapat wrote:
    >
    > In an offlist chat Thomas Munro mentioned that just ftruncate() would
    > be enough to resize the shared memory without touching address maps
    > using mmap and munmap().
    >
    > ftruncate man page seems to concur with him
    >
    >        If the effect of ftruncate() is to decrease the size of a memory
    >        mapped file or a shared memory object and whole pages beyond the
    >        new end were previously mapped, then the whole pages beyond the
    >        new end shall be discarded.
    >
    >        References to discarded pages shall result in the generation of a
    >        SIGBUS signal.
    >
    >        If the effect of ftruncate() is to increase the size of a memory
    >        object, it is unspecified whether the contents of any mapped pages
    >        between the old end-of-file and the new are flushed to the
    >        underlying object.
    >
    > ftruncate() when shrinking memory will release the extra pages and
    > will also cause SIGBUS when memory beyond the size of the
    > file is accessed, even if the actual address map is larger than the
    > mapped file. The expanded memory is allocated as it is written to, and
    > those pages also become visible in the underlying object.
    
    Thanks for sharing. I need to do more thorough tests, but after a quick
    look I'm not sure about that. ftruncate will take care of the memory,
    but AFAICT the memory mapping will stay the same, is that what you mean?
    In that case if the segment got increased, the memory still can't be
    used because it's beyond the mapping end (at least in my test that's
    what happened). If the segment got shrunk, the memory couldn't be
    reclaimed, because, well, there is already a mapping. Or am I missing
    something?
    
    > > > I might have not noticed it, but are we putting two mappings one
    > > > reserved and one allocated in the same address space, so that when the
    > > > allocated mapping shrinks or expands, the reserved mapping continues
    > > > to prohibit any other mapping from appearing there? I looked at some
    > > > of the previous emails, but didn't find anything that describes how
    > > > the reserved mapped space is managed.
    > >
    > > I thought so, but this turns out to be incorrect. I've just done a small
    > > experiment -- looks like when reserving some space, mapping and
    > > unmapping a small segment from it leaves a non-mapped gap. That would
    > > mean that for shrinking, the newly available space has to be reserved again.
    >
    > Right. That's what I thought. But I didn't see the corresponding code.
    > So we have to keep track of two mappings for every segment - 1 for
    > allocation and one for reserving space and resize those two while
    > shrinking and expanding buffers. Am I correct?
    
    Not necessarily, depending on what we want. Again, I'll do a bit more testing,
    but after a quick check it seems that it's possible to "plug" the gap with a
    new reservation mapping, then reallocate it to another mapping or unmap both
    reservations (main and the "gap" one) at once. That would mean that for the
    current functionality we don't need to track reservation in any way more than
    just start and the end of the "main" reserved space. The only consequence I can
    imagine is possible fragmentation of the reserved space in case of frequent
    increase/decrease of a segment with even decreasing size. But since it's only
    reserved space, which will not really be used, it's probably not going to be a
    problem.
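    
    A minimal sketch of the "plug the gap" idea, assuming Linux and
    page-aligned sizes. It uses an anonymous shared mapping for the segment
    to stay short, whereas the patch maps a memfd, and all function names
    are hypothetical. Shrinking maps a fresh PROT_NONE reservation over the
    tail with MAP_FIXED instead of calling munmap(), so no unmapped hole
    appears inside the reserved range.
    
    ```c
    #define _GNU_SOURCE
    #include <assert.h>
    #include <stddef.h>
    #include <sys/mman.h>
    
    /* Reserve address space without making it accessible. */
    static void *
    reserve_space(size_t size)
    {
        void *p = mmap(NULL, size, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    
        return p == MAP_FAILED ? NULL : p;
    }
    
    /* Place a usable segment at the start of the reservation. */
    static int
    map_segment(void *base, size_t size)
    {
        void *p = mmap(base, size, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    
        return p == base ? 0 : -1;
    }
    
    /*
     * Shrink the segment from old_size to new_size by replacing its tail
     * with a new reservation; MAP_FIXED replaces the old pages in place.
     */
    static int
    shrink_segment(void *base, size_t old_size, size_t new_size)
    {
        char   *tail = (char *) base + new_size;
        void   *p;
    
        assert(new_size <= old_size);
        if (new_size == old_size)
            return 0;
    
        p = mmap(tail, old_size - new_size, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE | MAP_FIXED,
                 -1, 0);
        return p == (void *) tail ? 0 : -1;
    }
    ```
    
    Only the start and the end of the main reservation need to be
    remembered in this scheme, which matches the bookkeeping described
    above.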
    
    
    
    
  57. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-04-17T21:26:18Z

    > On Thu, Apr 17, 2025 at 02:21:07PM GMT, Konstantin Knizhnik wrote:
    >
    > 1. Performance of the Postgres CLOCK page eviction algorithm depends on the
    > number of shared buffers. My first naive attempt, just marking unused
    > buffers as invalid, caused a significant performance degradation
    
    Thanks for sharing!
    
    Right, but it concerns the case when the number of shared buffers is
    high, independently from whether it was changed online or with a
    restart, correct? In that case it's out of scope for this patch.
    
    > 2. There are several data structures in Postgres whose size depends on the
    > number of buffers.
    > In my patch I used the dynamic shared buffer size in some cases, but if a
    > structure has to be allocated in shared memory then the maximal size still
    > has to be used. We have the buffers themselves (8 kB per buffer), then the main
    > BufferDescriptors array (64 B), the BufferIOCVArray (16 B), checkpoint's
    > CkptBufferIds (20 B), and the hashmap on the buffer cache (24B+8B/entry).
    > 128 bytes per 8 kB buffer does not seem too large an overhead (~1%), but it
    > may be quite noticeable with size differences larger than 2 orders of magnitude:
    > e.g. to support scaling from 0.5GB to 128GB, with 128 bytes/buffer we'd
    > have ~2GiB of static overhead on only 0.5GiB of actual buffers.
    
    Not sure what you mean by using a maximal size, can you elaborate?
    
    In the current patch those structures are allocated as before, except
    each goes into a separate segment -- without any extra memory overhead
    as far as I see.
    
    > 3. `madvise` is not portable.
    
    The current implementation doesn't rely on madvise so far (it might for
    shared memory shrinking), but yeah there are plenty of other not very
    portable things (MAP_FIXED, memfd_create). All of that is mentioned in
    the corresponding patches as a limitation.
    
    
    
    
  58. Re: Changing shared_buffers without restart

    Ni Ku <jakkuniku@gmail.com> — 2025-04-17T23:05:36Z

    Hi Ashutosh / Dmitry,
    
    Thanks for the information and discussions, it's been very helpful.
    
    I also have a related question about how ftruncate() is used in the patch.
    In my testing I also see that when using ftruncate to shrink a shared
    segment, the memory is freed immediately after the call, even if other
    processes still have that memory mapped, and they will hit SIGBUS if they
    try to access that memory again as the manpage says.
    
    So am I correct to think that, to support the bufferpool shrinking case, it
    would not be safe to call ftruncate in AnonymousShmemResize as-is, since at
    that point other processes may still be using pages that belong to the
    truncated memory?
    It appears that for shrinking we should only call ftruncate when we're sure
    no process will access those pages again (eg, all processes have handled
    the resize interrupt signal barrier). I suppose this can be done by the
    resize coordinator after synchronizing with all the other processes.
    But in that case it seems we cannot use the postmaster as the coordinator
    then? b/c I see some code comments saying the postmaster does not have
    waiting infrastructure... (maybe even if the postmaster has waiting infra
    we don't want to use it anyway since it can be blocked for a long time and
    won't be able to serve other requests).
    
    Regards,
    
    Jack Ng
    
  59. Re: Changing shared_buffers without restart

    Thomas Munro <thomas.munro@gmail.com> — 2025-04-18T01:27:35Z

    On Fri, Apr 18, 2025 at 3:54 AM Thomas Munro <thomas.munro@gmail.com> wrote:
    > I contemplated that once before, when I wrote a quick demo patch[1] to
    > implement huge_pages=on for FreeBSD (ie explicit rather than
    > transparent).  I used a different function, not the Linuxoid one but
    
    Oops, I forgot to supply that link[1].  And by the way all that
    technical mumbo jumbo about FreeBSD was just me writing up why I
    didn't pull the trigger and add explicit huge_pages support for it.
    The short version is: you shouldn't try to use that flag at all on
    FreeBSD yet, as it's a separate research project to add that feature.
    I care about PostgreSQL/FreeBSD personally and may consider that again
    as I learn more about virtual memory topics, but actually its
    transparent super pages seem to do a pretty decent job already and
    people don't seem to want to turn them off.
    
    For an actionable plan that should be portable everywhere, how about
    this: use shm_open(<tempname>, O_CREAT | O_EXCL, S_IRUSR | S_IWUSR)
    followed by shm_unlink(<tempname>) to make this work on every Unix
    (FreeBSD could use its slightly better SHM_ANON as the name and skip
    the unlink), and redirect to memfd inside #ifdef __linux__.  One thing
    to consider is that shm_open() descriptors are implicitly set to
    FD_CLOEXEC per POSIX, so I think you need to clear that flag with
    fcntl() in EXEC_BACKEND builds, and then also set it again in children
    so that they don't pass the descriptor to subprograms they run with
    system() etc.  memfd_create() needs the same consideration, except its
    default is the other way: I think you need to supply the MFD_CLOEXEC
    flag explicitly, unless it's an EXEC_BACKEND build, and use the same
    fcntl() to clear it in children if it is.  To restate that the other
    way around, in non-EXEC_BACKEND builds shm_open() already does the
    right thing and memfd_create() needs MFD_CLOEXEC, with no extra steps
    after that.
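The FD_CLOEXEC handling described above could be sketched as follows. This is only an illustration: the helper names are hypothetical, and where exactly an EXEC_BACKEND build would call them is left open.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>

/* Set or clear FD_CLOEXEC on fd.  Returns 0 on success, -1 on error. */
int
set_close_on_exec(int fd, bool enable)
{
    int flags;

    assert(fd >= 0);

    flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;

    if (enable)
        flags |= FD_CLOEXEC;
    else
        flags &= ~FD_CLOEXEC;

    return fcntl(fd, F_SETFD, flags) < 0 ? -1 : 0;
}

/* Returns 1 if FD_CLOEXEC is set, 0 if it is clear, -1 on error. */
int
get_close_on_exec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) != 0;
}
```

Under that reading, an EXEC_BACKEND postmaster would clear the flag before launching a child so the descriptor survives exec, and the child would set it again once it has the descriptor.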
    
    The only systems I'm aware of that *don't* have shm_open() are (1)
    Android, but it's Linux so I assume it has memfd_create() (just for
    fun: you can run PostgreSQL on a phone with termux[2], and you can see
    that their package supplies a fake shm_open() that redirects to plain
    open(); I guess they didn't realise they could have supplied an ENOSYS
    dummy and just set dynamic_shared_memory_type=mmap instead, and we'd
    have done that for them!), and (2) the capability-based research OS
    projects like Capsicum (and probably the others like it) that rip out
    all the global namespace Unix APIs for approximately the same reason
    as Android (PostgreSQL can't run under those yet, but just for fun: I
    had PostgreSQL mostly working under Capsicum once, and noticed that
    the problems to be solved had significant overlap with the
    multithreading project: the global namespace stuff like signals/PIDs
    and onymous IPC go away, and the only other major thing is absolute
    paths, many of which are easily made relative to a pgdata fd and
    handled with openat() in fd.c, but I digress...).
    
    [1] https://www.postgresql.org/message-id/CA%2BhUKGLmBWHF6gusP55R7jVS1%3D6T%3DGphbZpUXiOgMMHDUkVCgw%40mail.gmail.com
    [2] https://github.com/termux/termux-packages/tree/master/packages/postgresql
    
    
    
    
  60. Re: Changing shared_buffers without restart

    Konstantin Knizhnik <knizhnik@garret.ru> — 2025-04-18T07:06:23Z

    On 18/04/2025 12:26 am, Dmitry Dolgov wrote:
    >> On Thu, Apr 17, 2025 at 02:21:07PM GMT, Konstantin Knizhnik wrote:
    >>
    >> 1. Performance of Postgres CLOCK page eviction algorithm depends on number
    >> of shared buffers. My first native attempt just to mark unused buffers as
    >> invalid cause significant degrade of performance
    > Thanks for sharing!
    >
    > Right, but it concerns the case when the number of shared buffers is
    > high, independently from whether it was changed online or with a
    > restart, correct? In that case it's out of scope for this patch.
    >
    >> 2. There are several data structures i Postgres which size depends on number
    >> of buffers.
    >> In my patch I used in some cases dynamic shared buffer size, but if this
    >> structure has to be allocated in shared memory then still maximal size has
    >> to be used. We have the buffers themselves (8 kB per buffer), then the main
    >> BufferDescriptors array (64 B), the BufferIOCVArray (16 B), checkpoint's
    >> CkptBufferIds (20 B), and the hashmap on the buffer cache (24B+8B/entry).
    >> 128 bytes per 8kb bytes seems to  large overhead (~1%) but but it may be
    >> quote noticeable with size differences larger than 2 orders of magnitude:
    >> E.g. to support scaling to from 0.5Gb to 128GB , with 128 bytes/buffer we'd
    >> have ~2GiB of static overhead on only 0.5GiB of actual buffers.
    > Not sure what do you mean by using a maximal size, can you elaborate.
    >
    > In the current patch those structures are allocated as before, except
    > each goes into a separate segment -- without any extra memory overhead
    > as far as I see.
    
    Thank you for the explanation. I am sorry that I had not investigated
    your patch carefully before writing: it seemed to me that you were
    placing only the content of shared buffers in a separate segment.
    Now I see that I was wrong, and this is actually the main difference from
    the memory ballooning approach I have used. Since you are allocating
    buffer descriptors and the hash table in the same segment,
    there is no extra memory overhead.
    The only drawback is that we are losing the content of shared buffers in
    case of a resize. That may be sad, but it looks like there is no better
    alternative.
    
    But there are still some dependencies on shared buffers size which are 
    not addressed in this PR.
    I am not sure how critical they are and whether it is possible to do
    something here, but at least I want to enumerate them:
    
    1. Checkpointer: the maximal number of checkpointer requests depends on
    NBuffers. So if we start with small shared buffers and then upscale, it
    may cause too-frequent checkpoints:
    
    Size
    CheckpointerShmemSize(void)
    ...
             size = add_size(size, mul_size(NBuffers, 
    sizeof(CheckpointerRequest)));
    
    CheckpointerShmemInit(void)
             CheckpointerShmem->max_requests = NBuffers;
    
    2. XLOG: number of xlog buffers is calculated depending on number of 
    shared buffers:
    
    XLOGChooseNumBuffers(void)
    {
    ...
          xbuffers = NBuffers / 32;
    
    This should not cause errors, but may be inefficient if, once again,
    we start with tiny shared buffers.
    
    3. AIO: AIO max concurrency is also calculated based on number of shared 
    buffers:
    
    AioChooseMaxConcurrency(void)
    {
    ...
    
         max_proportional_pins = NBuffers / max_backends;
    
    For small shared buffers (e.g. 1MB), there will be no concurrency at all.
    
    So none of these issues can cause an error, just some inefficient behavior.
    But if we want to start with very small shared buffers and then increase 
    them on demand,
    then it can be a problem.
    
    In all three of these cases NBuffers is used not just to calculate some
    threshold value, but also to determine the size of a structure in shared memory.
    The straightforward solution is to place them in the same segment as 
    shared buffers. But I am not sure how difficult it will be to implement.
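To make the arithmetic behind points 2 and 3 concrete, here is a small sketch that reproduces only the divisions shown in the excerpts. It assumes the default 8 kB block size; the function names are hypothetical, and the real PostgreSQL functions may apply additional clamping that the elided ("...") parts of the excerpts do not show.

```c
#include <assert.h>

#define BLCKSZ 8192     /* assumption: default block size */

/* Number of buffers for a given shared_buffers setting in bytes. */
long long
nbuffers_for(long long shared_buffers_bytes)
{
    return shared_buffers_bytes / BLCKSZ;
}

/* The "NBuffers / 32" step from the XLOGChooseNumBuffers() excerpt. */
long long
xlog_buffers_unclamped(long long nbuffers)
{
    return nbuffers / 32;
}

/* The "NBuffers / max_backends" step from the AioChooseMaxConcurrency() excerpt. */
long long
proportional_pins_unclamped(long long nbuffers, int max_backends)
{
    assert(max_backends > 0);
    return nbuffers / max_backends;
}
```

With shared_buffers = 1MB this yields 128 buffers, so the unclamped values are 4 xlog buffers and, for an example max_backends of 200, zero proportional pins. Values computed once at startup from such a small NBuffers stay that small after an online resize.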
    
    
    
    
    
    
    
    
  61. Re: Changing shared_buffers without restart

    Thomas Munro <thomas.munro@gmail.com> — 2025-04-18T09:17:21Z

    On Fri, Apr 18, 2025 at 7:25 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    > > On Thu, Apr 17, 2025 at 03:22:28PM GMT, Ashutosh Bapat wrote:
    > >
    > > In an offlist chat Thomas Munro mentioned that just ftruncate() would
    > > be enough to resize the shared memory without touching address maps
    > > using mmap and munmap().
    > >
    > > ftruncate man page seems to concur with him
    > >
    > >        If the effect of ftruncate() is to decrease the size of a memory
    > >        mapped file or a shared memory object and whole pages beyond the
    > >        new end were previously mapped, then the whole pages beyond the
    > >        new end shall be discarded.
    > >
    > >        References to discarded pages shall result in the generation of a
    > >        SIGBUS signal.
    > >
    > >        If the effect of ftruncate() is to increase the size of a memory
    > >        object, it is unspecified whether the contents of any mapped pages
    > >        between the old end-of-file and the new are flushed to the
    > >        underlying object.
    > >
    > > ftruncate() when shrinking memory will release the extra pages and
    > > also would cause segmentation fault when memory outside the size of
    > > file is accessed even if the actual address map is larger than the
    > > mapped file. The expanded memory is allocated as it is written to, and
    > > those pages also become visible in the underlying object.
    >
    > Thanks for sharing. I need to do more thorough tests, but after a quick
    > look I'm not sure about that. ftruncate will take care about the memory,
    > but AFAICT the memory mapping will stay the same, is that what you mean?
    > In that case if the segment got increased, the memory still can't be
    > used because it's beyond the mapping end (at least in my test that's
    > what happened). If the segment got shrinked, the memory couldn't be
    > reclaimed, because, well, there is already a mapping. Or do I miss
    > something?
    
    I was imagining that you might map some maximum possible size at the
    beginning to reserve the address space permanently, and then adjust
    the virtual memory object's size with ftruncate as required to provide
    backing.  Doesn't that achieve the goal with fewer steps, using only
    portable* POSIX stuff, and keeping all pointers stable?  I understand
    that pointer stability may not be required (I can see roughly how that
    argument is constructed), but isn't it still better to avoid having to
    prove that and deal with various other problems completely?  Is there
    a downside/cost to having a large mapping that is only partially
    backed?  I suppose choosing that number might offend you but at least
    there is an obvious upper bound: physical memory size.
    
    *You might also want to use fallocate after ftruncate on Linux to
    avoid SIGBUS on allocation failure on first touch page fault, which
    raises portability questions since it's unspecified whether you can do
    that with shm fds and fails on some systems, but let's call that an
    independent topic as it's not affected by this choice.
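The "reserve the maximum, then resize the backing object" idea above might look roughly like the sketch below. It is Linux-only because it uses memfd_create() for brevity; the struct and function names are hypothetical, huge pages are not considered, and none of this is code from the patch.

```c
#define _GNU_SOURCE             /* for memfd_create() and MAP_NORESERVE */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct ResizableSegment
{
    int     fd;         /* anonymous shared memory object */
    char   *base;       /* start of the mapping; never changes */
    size_t  reserved;   /* size of the address-space reservation */
    size_t  backed;     /* current size of the backing object */
} ResizableSegment;

/* Map `reserved` bytes once, but back only the first `initial` bytes. */
int
segment_create(ResizableSegment *seg, size_t reserved, size_t initial)
{
    void   *base = MAP_FAILED;

    assert(initial <= reserved);

    seg->fd = memfd_create("resizable-segment", MFD_CLOEXEC);
    if (seg->fd < 0)
        return -1;

    if (ftruncate(seg->fd, (off_t) initial) == 0)
        base = mmap(NULL, reserved, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_NORESERVE, seg->fd, 0);
    if (base == MAP_FAILED)
    {
        close(seg->fd);
        return -1;
    }

    seg->base = base;
    seg->reserved = reserved;
    seg->backed = initial;
    return 0;
}

/* Grow or shrink the backing; the mapping itself is left untouched. */
int
segment_resize(ResizableSegment *seg, size_t new_size)
{
    if (new_size > seg->reserved)
        return -1;
    if (ftruncate(seg->fd, (off_t) new_size) != 0)
        return -1;
    seg->backed = new_size;
    return 0;
}
```

Since seg->base never changes, pointers into the still-backed part stay valid across a resize. Touching memory beyond seg->backed raises SIGBUS, so callers have to stop using a range before it is truncated away.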
    
    
    
    
  62. Re: Changing shared_buffers without restart

    Andres Freund <andres@anarazel.de> — 2025-04-18T11:02:13Z

    Hi, 
    
    On April 18, 2025 11:17:21 AM GMT+02:00, Thomas Munro <thomas.munro@gmail.com> wrote:
    > Doesn't that achieve the goal with fewer steps, using only
    >portable* POSIX stuff, and keeping all pointers stable?  I understand
    >that pointer stability may not be required (I can see roughly how that
    >argument is constructed), but isn't it still better to avoid having to
    >prove that and deal with various other problems completely?  
    
    I think we should flat out reject any approach that does not maintain pointer stability.  It would restrict future optimizations a lot if we can't rely on that (e.g. not materializing tuples when transporting them from worker to leader; pointering datastructures in shared buffers).
    
    Greetings, 
    
    Andres
    -- 
    Sent from my Android device with K-9 Mail. Please excuse my brevity.
    
    
    
    
  63. Re: Changing shared_buffers without restart

    Thomas Munro <thomas.munro@gmail.com> — 2025-04-18T11:05:15Z

    On Fri, Apr 18, 2025 at 9:17 PM Thomas Munro <thomas.munro@gmail.com> wrote:
    > On Fri, Apr 18, 2025 at 7:25 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    > > Thanks for sharing. I need to do more thorough tests, but after a quick
    > > look I'm not sure about that. ftruncate will take care about the memory,
    > > but AFAICT the memory mapping will stay the same, is that what you mean?
    > > In that case if the segment got increased, the memory still can't be
    > > used because it's beyond the mapping end (at least in my test that's
    > > what happened). If the segment got shrinked, the memory couldn't be
    > > reclaimed, because, well, there is already a mapping. Or do I miss
    > > something?
    >
    > I was imagining that you might map some maximum possible size at the
    > beginning to reserve the address space permanently, and then adjust
    > the virtual memory object's size with ftruncate as required to provide
    > backing.  Doesn't that achieve the goal with fewer steps, using only
    > portable* POSIX stuff, and keeping all pointers stable?  I understand
    > that pointer stability may not be required (I can see roughly how that
    > argument is constructed), but isn't it still better to avoid having to
    > prove that and deal with various other problems completely?  Is there
    > a downside/cost to having a large mapping that is only partially
    > backed?  I suppose choosing that number might offend you but at least
    > there is an obvious upper bound: physical memory size.
    
    TIL that mmap(size, fd) will actually extend a hugetlb memfd as a side
    effect on Linux, as if you had called ftruncate on it (fully allocated
    huge pages I expected up to the object's size, just not magical size
    changes beyond that when I merely asked to map it).  That doesn't
    happen for regular page size, or for any page size on my local OS's
    shm objects and doesn't seem to fit mmap's job description given an
    fd*, but maybe I'm just confused.  Anyway, a  workaround seems to be
    to start out with PROT_NONE and MAP_NORESERVE, then mprotect(PROT_READ
    | PROT_WRITE) new regions after extending with ftruncate(), at least
    in simple tests...
    
    (*Hmm, wild uninformed speculation: perhaps the size-setting behaviour
    needed when hugetlbfs is used secretly to implement MAP_ANONYMOUS is
    being exposed also when a hugetlbfs fd is given explicitly to mmap,
    generating this bizarro side effect?)
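A minimal sketch of the workaround mentioned above: reserve with PROT_NONE and MAP_NORESERVE, then open up each new range with mprotect() after extending the object. It was written against regular pages only, the function names are hypothetical, and it does not try to reproduce the hugetlb behaviour described in the message.

```c
#define _GNU_SOURCE             /* MAP_NORESERVE; memfd_create() for callers */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Reserve address space over fd without making any of it accessible. */
void *
reserve_region(int fd, size_t reserved)
{
    void   *base = mmap(NULL, reserved, PROT_NONE,
                        MAP_SHARED | MAP_NORESERVE, fd, 0);

    return base == MAP_FAILED ? NULL : base;
}

/*
 * Extend the object to new_size, then make [old_size, new_size) readable
 * and writable.  Both sizes are assumed to be multiples of the page size.
 */
int
grow_region(int fd, void *base, size_t old_size, size_t new_size)
{
    assert(new_size >= old_size);

    if (ftruncate(fd, (off_t) new_size) != 0)
        return -1;
    if (new_size == old_size)
        return 0;
    return mprotect((char *) base + old_size, new_size - old_size,
                    PROT_READ | PROT_WRITE);
}
```

A caller would obtain fd from memfd_create() or shm_open(), call reserve_region() once with the maximum size, and call grow_region() each time the segment is extended.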
    
    
    
    
  64. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-04-21T09:29:59Z

    > On Fri, Apr 18, 2025 at 09:17:21PM GMT, Thomas Munro wrote:
    > I was imagining that you might map some maximum possible size at the
    > beginning to reserve the address space permanently, and then adjust
    > the virtual memory object's size with ftruncate as required to provide
    > backing.  Doesn't that achieve the goal with fewer steps, using only
    > portable* POSIX stuff, and keeping all pointers stable?
    
    Ah, I see what you folks mean. So in the latest patch there is a single large
    shared memory area reserved with PROT_NONE + MAP_NORESERVE. This area is
    logically divided between shmem segments, and each segment is mmap'd out of it
    and could be resized within these logical boundaries. Now the suggestion is to
    have one reserved area for each segment, and instead of really mmap'ing
    something out of it, manage memory via ftruncate.
    
    Yeah, that would work and would let us avoid MAP_FIXED and mremap, which are
    questionable from a portability point of view. This leaves memfd_create, and I'm
    still not completely clear on its portability -- it seems to be specific to
    Linux, but others provide compatible implementations as well.
    
    Let me experiment with this idea a bit, I would like to make sure there are no
    other limitations we might face.
    
    > I understand that pointer stability may not be required
    
    Just to clarify, the current patch maintains this property (stable pointers),
    which I also see as mandatory for any possible implementation.
    
    > *You might also want to use fallocate after ftruncate on Linux to
    > avoid SIGBUS on allocation failure on first touch page fault, which
    > raises portability questions since it's unspecified whether you can do
    > that with shm fds and fails on some systems, but it let's call that an
    > independent topic as it's not affected by this choice.
    
    I'm afraid it would be strictly necessary to do fallocate, otherwise we're
    back where we were before reservation accounting for huge pages in Linux (lots
    of people were facing unexpected SIGBUS when dealing with cgroups).
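The ftruncate-then-fallocate step could be sketched like this. It is a Linux-only illustration using memfd_create(); the function names are hypothetical, and as noted in the thread it is unspecified whether posix_fallocate() works on shared memory descriptors on other systems.

```c
#define _GNU_SOURCE             /* for memfd_create() */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Extend fd from old_size to new_size and allocate the new range up front,
 * so that an allocation failure is reported here instead of as SIGBUS on
 * first touch.  Returns 0 on success or an errno value on failure.
 */
int
extend_backing(int fd, off_t old_size, off_t new_size)
{
    int     rc;

    assert(old_size >= 0 && new_size > old_size);

    if (ftruncate(fd, new_size) != 0)
        return errno;

    /* posix_fallocate() returns the error number directly, not via errno. */
    do
    {
        rc = posix_fallocate(fd, old_size, new_size - old_size);
    } while (rc == EINTR);

    if (rc != 0 && ftruncate(fd, old_size) != 0)
    {
        /* Rollback failed as well; the original error is still reported. */
    }
    return rc;
}

/* Create an anonymous object of the given size with all pages allocated. */
int
create_backed_object(off_t size)
{
    int     fd = memfd_create("backed-object", MFD_CLOEXEC);

    if (fd >= 0 && extend_backing(fd, 0, size) != 0)
    {
        close(fd);
        fd = -1;
    }
    return fd;
}

/* Current size of the object, or -1 on error. */
off_t
backing_size(int fd)
{
    struct stat st;

    return fstat(fd, &st) == 0 ? st.st_size : (off_t) -1;
}
```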
    
    > TIL that mmap(size, fd) will actually extend a hugetlb memfd as a side
    > effect on Linux, as if you had called ftruncate on it (fully allocated
    > huge pages I expected up to the object's size, just not magical size
    > changes beyond that when I merely asked to map it).  That doesn't
    > happen for regular page size, or for any page size on my local OS's
    > shm objects and doesn't seem to fit mmap's job description given an
    > fd*, but maybe I'm just confused.  Anyway, a  workaround seems to be
    > to start out with PROT_NONE and MAP_NORESERVE, then mprotect(PROT_READ
    > | PROT_WRITE) new regions after extending with ftruncate(), at least
    > in simple tests...
    
    Right, it's similar to the currently implemented space reservation, which also
    goes with PROT_NONE and MAP_NORESERVE. I assume it boils down to the way
    memory reservation accounting works in Linux.
    
    
    
    
  65. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-04-21T09:33:02Z

    > On Thu, Apr 17, 2025 at 07:05:36PM GMT, Ni Ku wrote:
    > I also have a related question about how ftruncate() is used in the patch.
    > In my testing I also see that when using ftruncate to shrink a shared
    > segment, the memory is freed immediately after the call, even if other
    > processes still have that memory mapped, and they will hit SIGBUS if they
    > try to access that memory again as the manpage says.
    >
    > So am I correct to think that, to support the bufferpool shrinking case, it
    > would not be safe to call ftruncate in AnonymousShmemResize as-is, since at
    > that point other processes may still be using pages that belong to the
    > truncated memory?
    > It appears that for shrinking we should only call ftruncate when we're sure
    > no process will access those pages again (eg, all processes have handled
    > the resize interrupt signal barrier). I suppose this can be done by the
    > resize coordinator after synchronizing with all the other processes.
    > But in that case it seems we cannot use the postmaster as the coordinator
    > then? b/c I see some code comments saying the postmaster does not have
    > waiting infrastructure... (maybe even if the postmaster has waiting infra
    > we don't want to use it anyway since it can be blocked for a long time and
    > won't be able to serve other requests).
    
    There is already a coordination infrastructure, implemented in the patch
    0006, which will take care of this and prevent access to the shared
    memory until everything is resized.
    
    
    
    
  66. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-04-21T09:38:25Z

    > On Fri, Apr 18, 2025 at 10:06:23AM GMT, Konstantin Knizhnik wrote:
    > The only drawback is that we are loosing content of shared buffers in case
    > of resize. It may be sadly, but not looks like there is no better
    > alternative.
    
    No, why would we lose the content? If we do mremap, it will leave the
    content as it is. If we do munmap/mmap with an anonymous backing file,
    it will also keep the content in memory. The same goes for the other
    proposal about using ftruncate/fallocate only: both will leave the
    content untouched unless told to do otherwise.
    
    > But there are still some dependencies on shared buffers size which are not
    > addressed in this PR.
    > I am not sure how critical they are and is it possible to do something here,
    > but at least I want to enumerate them:
    
    Right, I'm aware of those (except the AIO one, which was added after
    the first version of the patch), and didn't address them yet for the
    same reason you've mentioned -- they're not hard errors, rather
    inefficiencies. But thanks for the reminder, I keep those in the back of
    my mind, and once the rest of the design has settled, I'll try
    to address them as well.
    
    
    
    
  67. Re: Changing shared_buffers without restart

    Thomas Munro <thomas.munro@gmail.com> — 2025-04-21T14:16:31Z

    On Mon, Apr 21, 2025 at 9:30 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    > Yeah, that would work and will allow to avoid MAP_FIXED and mremap, which are
    > questionable from portability point of view. This leaves memfd_create, and I'm
    > still not completely clear on it's portability -- it seems to be specific to
    > Linux, but others provide compatible implementation as well.
    
    Something like this should work, roughly based on DSM code except here
    we don't really need the name so we unlink it immediately, at the
    slight risk of leaking it if the postmaster is killed between those
    lines (maybe someone should go and tell POSIX to support the special
    name SHM_ANON or some other way to avoid that; I can't see any
    portable workaround).  Not tested/compiled, just a sketch:
    
    #ifdef HAVE_MEMFD_CREATE
      /* Anonymous shared memory region. */
      fd = memfd_create("foo", MFD_CLOEXEC | huge_pages_flags);
    #else
      /* Standard POSIX insists on a name, which we unlink immediately. */
      do
      {
          char tmp[80];

          /* POSIX recommends a leading slash for portable shm names. */
          snprintf(tmp, sizeof(tmp), "/PostgreSQL.%u",
                   pg_prng_uint32(&pg_global_prng_state));
          /* O_RDWR so the object can be resized and written through. */
          fd = shm_open(tmp, O_RDWR | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR);
          if (fd >= 0)
            shm_unlink(tmp);
      } while (fd < 0 && errno == EEXIST);  /* retry on name collision */
    #endif
    
    > Let me experiment with this idea a bit, I would like to make sure there are no
    > other limitations we might face.
    
    One thing I'm still wondering about is whether you really need all
    this multi-phase barrier stuff, or even need to stop other backends
    from running at all while doing the resize.  I guess that's related to
    your remapping scheme, but supposing you find the simple
    ftruncate()-only approach to be good, my next question is:  why isn't
    it enough to wait for all backends to agree to stop allocating new
    buffers in the range to be truncated, and then let them continue to
    run as normal?  As far as they would be concerned, the in-progress
    downsize has already happened, though it could be reverted later if
    the eviction phase fails.  Then the coordinator could start evicting
    buffers and truncating the shared memory object, which are
    phases/steps, sure, but it's not clear to me why they need other
    backends' help.
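One way to picture the ordering proposed above is the single-threaded toy model below. It is not PostgreSQL code: the names are hypothetical, there is no real concurrency or barrier, and eviction always succeeds, which the real thing cannot assume.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BUFFERS 16

typedef struct ToyPool
{
    int     alloc_limit;            /* backends only pick buffers below this */
    int     backed;                 /* buffers currently backed by memory */
    int     next_victim;            /* toy clock hand */
    bool    in_use[MAX_BUFFERS];    /* stand-in for "buffer holds a page" */
} ToyPool;

/* Step 1: publish the lower limit; backends keep running afterwards. */
void
shrink_begin(ToyPool *pool, int new_size)
{
    assert(new_size > 0 && new_size <= pool->backed);
    pool->alloc_limit = new_size;
    if (pool->next_victim >= new_size)
        pool->next_victim = 0;
}

/* What a backend does: take the next buffer, never at or above the limit. */
int
pick_buffer(ToyPool *pool)
{
    int     victim = pool->next_victim;

    pool->next_victim = (victim + 1) % pool->alloc_limit;
    pool->in_use[victim] = true;
    return victim;
}

/* Step 2: the coordinator alone evicts the tail and then "truncates". */
void
shrink_finish(ToyPool *pool)
{
    for (int i = pool->alloc_limit; i < pool->backed; i++)
        pool->in_use[i] = false;        /* eviction */
    pool->backed = pool->alloc_limit;   /* stands in for ftruncate() */
}
```

In this model nothing picks a buffer in the doomed range between shrink_begin() and shrink_finish(), which is why the second step needs no cooperation from the callers of pick_buffer().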
    
    It sounds like Windows might need a second ProcSignalBarrier poke in
    order to call VirtualUnlock() in every backend.  That's based on that
    Usenet discussion I lobbed in here the other day; I haven't tried it
    myself or fully grokked why it works, and there could well be other
    ways, IDK.  Assuming it's the right approach, between the first poke
    to make all backends accept the new lower size and the second poke to
    unlock the memory, I don't see why they need to wait.  I suppose it
    would be the same ProcSignalBarrier, but behave differently based on a
    control variable.  I suppose there could also be a third poke, if you
    want to consider the operation to be fully complete only once they
    have all actually done that unlock step, but it may also be OK not to
    worry about that, IDK.
    
    On the other hand, maybe it just feels less risky if you stop the
    whole world, or maybe you envisage parallelising the eviction work, or
    there is some correctness concern I haven't grokked yet, but what?
    
    > > *You might also want to use fallocate after ftruncate on Linux to
    > > avoid SIGBUS on allocation failure on first touch page fault, which
    > > raises portability questions since it's unspecified whether you can do
    > > that with shm fds and fails on some systems, but it let's call that an
    > > independent topic as it's not affected by this choice.
    >
    > I'm afraid it would be strictly neccessary to do fallocate, otherwise we're
    > back where we were before reservation accounting for huge pages in Linux (lot's
    > of people were facing unexpected SIGBUS when dealing with cgroups).
    
    Yeah.  FWIW here is where we decided to gate that on __linux__ while
    fixing that for DSM:
    
    https://www.postgresql.org/message-id/flat/CAEepm%3D0euOKPaYWz0-gFv9xfG%2B8ptAjhFjiQEX0CCJaYN--sDQ%40mail.gmail.com#c81b941d300f04d382472e6414cec1f4
    
    
    
    
  68. RE: Changing shared_buffers without restart

    Jack Ng <jack.ng@huawei.com> — 2025-05-06T04:23:07Z

    Thanks Dmitry. Right, the coordination mechanism in v4-0006 works as expected in various tests (sorry, I misunderstood some details initially).
    
    I also want to report a couple of minor issues found during testing (which you may be aware of already):
    
    1. For memory segments other than the first one ('main'), the start address passed to mmap may not be aligned to 4KB or the huge page size (since reserved_offset may not be aligned), which causes mmap to fail.
    
    2. Since the ratios for main/desc/iocv/checkpt/strategy in SHMEM_RESIZE_RATIO are relatively small, I think we need to guard against the case where 'max_available_memory' is too small for the required sizes of these segments (from CalculateShmemSize).
    For example, with max_available_memory=default and shared_buffers=128kB, 'main' still needs ~109MB, but only 10% of max_available_memory (~102MB) is reserved for it. Since the start address of the next segment is calculated from reserved_offset, the mappings overlap and cause memory problems later (I hit this after fixing 1.).
    I suppose we can change the minimum value of max_available_memory to be large enough, and may also adjust the ratios in SHMEM_RESIZE_RATIO to ensure the reserved space for those segments is sufficient.
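Both issues reduce to small checks, sketched below. The helper names are hypothetical and do not come from the patch; the 1 GiB figure in the usage note is only inferred from "~102MB is 10%" and is not a confirmed default.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Round offset up to the next multiple of alignment (a power of two). */
size_t
align_up(size_t offset, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

/*
 * Does a segment that needs `required` bytes fit into its share of the
 * reservation?  `percent` is the share of max_available given to it.
 */
bool
reservation_fits(size_t max_available, unsigned percent, size_t required)
{
    assert(percent <= 100);
    return max_available / 100 * percent >= required;
}
```

For issue 1, align_up(reserved_offset, page_size) would be applied before computing each segment's start address, with the 4KB or huge page size as the alignment. For issue 2, reservation_fits() with a 1 GiB reservation, a 10% share and a 109 MB requirement returns false, which corresponds to the overlap described above.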
    
    Regards,
    
    Jack Ng
    
    -----Original Message-----
    From: Dmitry Dolgov <9erthalion6@gmail.com> 
    Sent: Monday, April 21, 2025 5:33 AM
    To: Ni Ku <jakkuniku@gmail.com>
    Cc: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>; pgsql-hackers@postgresql.org; Robert Haas <robertmhaas@gmail.com>
    Subject: Re: Changing shared_buffers without restart
    
    > On Thu, Apr 17, 2025 at 07:05:36PM GMT, Ni Ku wrote:
    > I also have a related question about how ftruncate() is used in the patch.
    > In my testing I also see that when using ftruncate to shrink a shared 
    > segment, the memory is freed immediately after the call, even if other 
    > processes still have that memory mapped, and they will hit SIGBUS if 
    > they try to access that memory again as the manpage says.
    >
    > So am I correct to think that, to support the bufferpool shrinking 
    > case, it would not be safe to call ftruncate in AnonymousShmemResize 
    > as-is, since at that point other processes may still be using pages 
    > that belong to the truncated memory?
    > It appears that for shrinking we should only call ftruncate when we're 
    > sure no process will access those pages again (eg, all processes have 
    > handled the resize interrupt signal barrier). I suppose this can be 
    > done by the resize coordinator after synchronizing with all the other processes.
    > But in that case it seems we cannot use the postmaster as the 
    > coordinator then? b/c I see some code comments saying the postmaster 
    > does not have waiting infrastructure... (maybe even if the postmaster 
    > has waiting infra we don't want to use it anyway since it can be 
    > blocked for a long time and won't be able to serve other requests).
    
    There is already a coordination infrastructure, implemented in the patch 0006, which will take care of this and prevent access to the shared memory until everything is resized.
    
    
    
    
    
    
    
    
  69. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-05-06T08:05:21Z

    > On Tue, May 06, 2025 at 04:23:07AM GMT, Jack Ng wrote:
    > Thanks Dmitry. Right, the coordination mechanism in v4-0006 works as expected in various tests (sorry, I misunderstood some details initially).
    
    Great, thanks for checking.
    
    > I also want to report a couple of minor issues found during testing (which you may be aware of already):
    >
    > 1. For memory segments other the first one ('main'), the start address passed to mmap may not be aligned to 4KB or huge page size (since reserved_offset may not be aligned) and cause mmap to fail.
    >
    > 2. Since the ratio for main/desc/iocv/checkpt/strategy in SHMEM_RESIZE_RATIO  are relatively small, I think we need to guard against the case where 'max_available_memory' is too small for the required sizes of these segments (from CalculateShmemSize).
    > Like when max_available_memory=default and shared_numbers=128kB, 'main' still needs ~109MB, but since only 10% of max_available_memory is reserved for it (~102MB) and start address of the next segment is calculated based on reserved_offset, this would cause the mappings to overlap and memory problems later (I hit this after fixing 1.)
    > I suppose we can change the minimum value of max_available_memory to be large enough, and may also adjust the ratios in SHMEM_RESIZE_RATIO to ensure the reserved space of those segments are sufficient.
    
    Yeah, good points. I've introduced max_available_memory expecting some
    heated discussions about it, and thus didn't put lots of efforts into
    covering all the possible scenarios. But now I'm reworking it along the
    lines suggested by Thomas, and will address those as well. Thanks!
    
    
    
    
  70. RE: Changing shared_buffers without restart

    Jack Ng <jack.ng@huawei.com> — 2025-05-07T05:34:37Z

    > all the possible scenarios. But now I'm reworking it along the lines suggested
    > by Thomas, and will address those as well. Thanks!
    
    Thanks for the info, Dmitry.
    Just want to confirm my understanding of Thomas' suggestion and your discussions... I think the simpler and more portable solution goes something like the following? 
    
    * For each BP resource segment (main, desc, buffers, etc):
        1. create an anonymous file as backing
        2. mmap a large reserved shared memory area with PROT_READ/WRITE + MAP_NORESERVE using the anon fd
        3. use ftruncate to back the in-use region (and maybe posix_fallocate too to avoid SIGBUS on alloc failure during first-touch), but no need to create a memory mapping for it
        4. also no need to create a separate mapping for the reserved region (already covered by the mapping created in 2.)
    
    |-- Memory mapping (MAP_NORESERVE) for BUFFER --|
    |-- In-use region --|----- Reserved region -----|
    
    * During resize, simply calculate the new size and call ftruncate on each segment to adjust memory accordingly, no need to mmap/munmap or modify any memory mapping.
    
    I tried this approach with a test program (with huge pages), and both expand and shrink seem to work as expected --for shrink, the memory is freed right after the resize ftruncate.
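
The steps listed above can be sketched in C roughly as follows. This is a minimal illustration, not the test program mentioned in the message: `ResizableSegment`, `anonymous_file()`, `segment_create()` and `segment_resize()` are hypothetical names, huge pages are left out, and the `mkstemp()` fallback is only an assumption for systems without `memfd_create()`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical descriptor for one resizable segment; not a PostgreSQL type. */
typedef struct ResizableSegment
{
    int     fd;        /* anonymous backing file */
    char   *base;      /* start of the single, never-changing mapping */
    size_t  reserved;  /* size of the mapping (address space only) */
    size_t  in_use;    /* bytes currently backed by the file */
} ResizableSegment;

/* Step 1: an unnamed backing file (memfd where available, else a temp file). */
static int
anonymous_file(const char *name)
{
#ifdef MFD_CLOEXEC
    return memfd_create(name, MFD_CLOEXEC);
#else
    char path[] = "/tmp/shmem-sketch-XXXXXX";
    int  fd = mkstemp(path);

    (void) name;
    if (fd >= 0)
        unlink(path);           /* keep the fd, drop the name */
    return fd;
#endif
}

/* Steps 2-3: map the whole reservation once, back only the in-use part. */
static int
segment_create(ResizableSegment *seg, const char *name,
               size_t reserved, size_t in_use)
{
    assert(in_use <= reserved);

    seg->fd = anonymous_file(name);
    if (seg->fd < 0)
        return -1;
    if (ftruncate(seg->fd, (off_t) in_use) != 0)
    {
        close(seg->fd);
        return -1;
    }
    seg->base = mmap(NULL, reserved, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_NORESERVE, seg->fd, 0);
    if (seg->base == MAP_FAILED)
    {
        close(seg->fd);
        return -1;
    }
    seg->reserved = reserved;
    seg->in_use = in_use;
    return 0;
}

/* Resize: only the backing file changes, the mapping is left alone. */
static int
segment_resize(ResizableSegment *seg, size_t new_size)
{
    assert(new_size <= seg->reserved);

    if (ftruncate(seg->fd, (off_t) new_size) != 0)
        return -1;
    seg->in_use = new_size;
    return 0;
}
```

Touching an address between `in_use` and `reserved` would raise SIGBUS in this sketch, since that part of the mapping has no backing in the file; callers are assumed to stay within `in_use`.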
    
    Regards,
    
    Jack Ng
    
    
    
    
  71. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-05-09T14:43:02Z

    On Wed, May 7, 2025 at 11:04 AM Jack Ng <Jack.Ng@huawei.com> wrote:
    
    > > all the possible scenarios. But now I'm reworking it along the lines
    > suggested
    > > by Thomas, and will address those as well. Thanks!
    >
    > Thanks for the info, Dmitry.
    > Just want to confirm my understanding of Thomas' suggestion and your
    > discussions... I think the simpler and more portable solution goes
    > something like the following?
    >
    > * For each BP resource segment (main, desc, buffers, etc):
    >     1. create an anonymous file as backing
    >     2. mmap a large reserved shared memory area with PROT_READ/PROT_WRITE +
    > MAP_NORESERVE using the anon fd
    >     3. use ftruncate to back the in-use region (and maybe posix_fallocate
    > too to avoid SIGBUS on alloc failure during first-touch), but no need to
    > create a memory mapping for it
    >     4. also no need to create a separate mapping for the reserved region
    > (already covered by the mapping created in 2.)
    >
    > |-- Memory mapping (MAP_NORESERVE) for BUFFER --|
    > |-- In-use region --|----- Reserved region -----|
    >
    > * During resize, simply calculate the new size and call ftruncate on each
    > segment to adjust memory accordingly, no need to mmap/munmap or modify any
    > memory mapping.
    >
    >
    That's same as my understanding.
    
    
    > I tried this approach with a test program (with huge pages), and both
    > expand and shrink seem to work as expected --for shrink, the memory is
    > freed right after the resize ftruncate.
    >

    I thought I had shared a test program upthread, but I don't find it now.
    Attached here. Can you please share your test program?
    
    There are concerns around portability of this approach, though.
    
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
  72. RE: Changing shared_buffers without restart

    Jack Ng <jack.ng@huawei.com> — 2025-05-13T05:03:03Z

    Hi Ashutosh,
    
    > > * During resize, simply calculate the new size and call ftruncate on each
    > > segment to adjust memory accordingly, no need to mmap/munmap or modify any
    > > memory mapping.
    > >
    > >
    > That's same as my understanding.
    Great, thanks for confirming!
    
    > I thought I had shared a test program upthread, but I don't find it now. Attached here. Can you please share your test program?
    Sure, mine is attached here (it’s based on another test program you shared before :-)
    
    Regards,
    
    Jack Ng
    
  73. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-06-10T11:09:58Z

    On Mon, Apr 21, 2025 at 7:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
    
    > On Mon, Apr 21, 2025 at 9:30 PM Dmitry Dolgov <9erthalion6@gmail.com>
    > wrote:
    > > Yeah, that would work and will allow to avoid MAP_FIXED and mremap,
    > which are
    > > questionable from portability point of view. This leaves memfd_create,
    > and I'm
    > > still not completely clear on its portability -- it seems to be
    > specific to
    > > Linux, but others provide compatible implementation as well.
    >
    > Something like this should work, roughly based on DSM code except here
    > we don't really need the name so we unlink it immediately, at the
    > slight risk of leaking it if the postmaster is killed between those
    > lines (maybe someone should go and tell POSIX to support the special
    > name SHM_ANON or some other way to avoid that; I can't see any
    > portable workaround).  Not tested/compiled, just a sketch:
    >
    > #ifdef HAVE_MEMFD_CREATE
    >   /* Anonymous shared memory region. */
    >   fd = memfd_create("foo", MFD_CLOEXEC | huge_pages_flags);
    > #else
    >   /* Standard POSIX insists on a name, which we unlink immediately. */
    >   do
    >   {
    >       char tmp[80];
    >       snprintf(tmp, sizeof(tmp), "PostgreSQL.%u",
    >                pg_prng_uint32(&pg_global_prng_state));
    >       /* shm_open() needs an access mode and permission bits. */
    >       fd = shm_open(tmp, O_RDWR | O_CREAT | O_EXCL, 0600);
    >       if (fd >= 0)
    >         shm_unlink(tmp);
    >   } while (fd < 0 && errno == EEXIST);
    > #endif
    >
    > > Let me experiment with this idea a bit, I would like to make sure there
    > are no
    > > other limitations we might face.
    >
    > One thing I'm still wondering about is whether you really need all
    > this multi-phase barrier stuff, or even need to stop other backends
    > from running at all while doing the resize.  I guess that's related to
    > your remapping scheme, but supposing you find the simple
    > ftruncate()-only approach to be good, my next question is:  why isn't
    > it enough to wait for all backends to agree to stop allocating new
    > buffers in the range to be truncated, and then let them continue to
    > run as normal?  As far as they would be concerned, the in-progress
    > downsize has already happened, though it could be reverted later if
    > the eviction phase fails.  Then the coordinator could start evicting
    > buffers and truncating the shared memory object, which are
    > phases/steps, sure, but it's not clear to me why they need other
    > backends' help.
    >
    
    AFAIU, we required the phased approach since mremap needed to happen in
    every backend after buffer eviction but before making modifications to the
    shared memory. If we don't need to call mremap in every backend and just
    ftruncate + initializing memory (when expanding buffers) is enough, I think
    phased approach isn't needed. But I haven't tried it myself.
    
    Here's patchset rebased on 3feff3916ee106c084eca848527dc2d2c3ef4e89.
    0001 - 0008 are same as the previous patchset
    
    0009 adds support to shrink shared buffers. It has two changes: a. evict
    the buffers outside the new buffer size b. remove buffers with buffer id
    outside the new buffer size from the free list. If a buffer being evicted
    is pinned, the operation is aborted and a FATAL error is raised. I think we
    need to change this behaviour to be less severe like rolling back the
    operation or waiting for the pinned buffer to be unpinned etc. Better even
    if we could let users control the behaviour. But we need better
    infrastructure to do such things. That's one TODO left in the patch.
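
Part (b) of 0009, removing out-of-range buffers from the free list, can be illustrated with a toy model. This is a sketch under assumptions, not code from the patch: `ToyFreeList` and `freelist_drop_beyond()` are hypothetical, the two `FREENEXT_*` markers only mimic the naming PostgreSQL uses, and locking as well as the pinned-buffer check are left out.

```c
#include <assert.h>

#define FREENEXT_END_OF_LIST  (-1)
#define FREENEXT_NOT_IN_LIST  (-2)

/* Toy free list: next[i] links buffer i to the next free buffer id. */
typedef struct ToyFreeList
{
    int  first;   /* head of the list, or FREENEXT_END_OF_LIST */
    int  last;    /* tail of the list, or FREENEXT_END_OF_LIST */
    int *next;    /* one slot per buffer */
} ToyFreeList;

/*
 * Unlink every buffer whose id is >= new_nbuffers.
 * Returns the number of buffers removed from the list.
 */
static int
freelist_drop_beyond(ToyFreeList *fl, int new_nbuffers)
{
    int  removed = 0;
    int  last_kept = FREENEXT_END_OF_LIST;
    int *link = &fl->first;     /* the pointer that leads to the current id */

    assert(new_nbuffers > 0);

    while (*link != FREENEXT_END_OF_LIST)
    {
        int id = *link;

        if (id >= new_nbuffers)
        {
            /* Bypass this buffer and mark it as no longer on the list. */
            *link = fl->next[id];
            fl->next[id] = FREENEXT_NOT_IN_LIST;
            removed++;
        }
        else
        {
            last_kept = id;
            link = &fl->next[id];
        }
    }
    fl->last = last_kept;
    return removed;
}
```

Walking the list through a pointer to the previous link keeps the removal a single pass, without special-casing the head.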
    
    0010 is about reinitializing the StrategyControl area. Once we expand
    the buffers, the new buffers need to be added to the free list. Some
    StrategyControl area members (not all) need to be adjusted. That's what
    this patch does. But a deeper adjustment in BgBufferSync() and
    ClockSweepTick() is required. Further we need to do something about the
    buffer lookup table. More on that later in the email.
    
    0011-0012 fix compilation issues in these patches but those fixes are not
    correct. The patches are there so that binaries can be built without any
    compilation issues and someone can experiment with buffer resizing. Good
    thing is the compilation fixes are in SQL callable functions
    pg_get_shmem_pagesize() and pg_get_shmem_numa(). So there's no ill-effect
    because of these patches as long as those two functions are not called.
    
    Buffer lookup table resizing
    ------------------------------------
    The size of the buffer lookup table depends upon (number of shared
    buffers + number of partitions in the shared buffer lookup table). If we
    shrink the buffer pool, the buffer lookup table will become sparse but
    still useful. If we expand the buffers we need to expand the buffer lookup
    table too. That's not implemented in the current patchset. There are two
    solutions here:
    
    1. We map a lot of extra address space (not memory) initially to
    accommodate future expansion of the shared buffer pool. Let's say that the
    total address space is sufficient to accommodate Nx buffers. Simple solution
    is to allocate a buffer lookup table with Nx initial entries so that we
    don't have to resize the buffer lookup table ever. It will waste memory but
    we might be ok with that as a version 1 solution. According to my offline
    discussion with David Rowley, buffer lookups in sparse hash tables are
    inefficient because of more cacheline faults. Whether that translates to
    any noticeable performance degradation in TPS needs to be measured.
    
    2. Alternate solution is to resize the buffer mapping table as well. This
    means that we rehash all the entries again which may take a longer time and
    the partitions will remain locked for that amount of time. Not to mention
    this will require non-trivial change to dynahash implementation.
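
For option 1, the sizing arithmetic is simple enough to spell out. The sketch below assumes the default BLCKSZ of 8192 and 128 buffer mapping partitions; both functions are hypothetical helpers, not part of the patchset.

```c
#include <stdint.h>

#define BLCKSZ                 8192   /* assumed default block size */
#define NUM_BUFFER_PARTITIONS  128    /* assumed partition count */

/* Entries needed by the buffer lookup table for a given pool size. */
static uint64_t
lookup_table_entries(uint64_t nbuffers)
{
    return nbuffers + NUM_BUFFER_PARTITIONS;
}

/*
 * Entries to allocate up front if the table must never be resized:
 * size it for as many buffers as the reserved address space can hold.
 */
static uint64_t
max_lookup_table_entries(uint64_t reserved_buffer_bytes)
{
    return lookup_table_entries(reserved_buffer_bytes / BLCKSZ);
}
```

With 128MB of shared buffers (16384 buffers) the table needs 16512 entries, while a 1GB reservation would pin it at 131200 entries from the start, which is the memory overhead option 1 accepts.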
    
    Next I will look at BgBufferSync() and ClockSweepTick() adjustments and
    then buffer lookup table fix with approach 1.
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
  74. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-06-16T12:39:17Z

    On Tue, Jun 10, 2025 at 4:39 PM Ashutosh Bapat
    <ashutosh.bapat.oss@gmail.com> wrote:
    
    Here's patchset rebased on f85f6ab051b7cf6950247e5fa6072c4130613555
    with some more fixes as described below.
    
    > 0001 - 0008 are same as the previous patchset
    >
    > 0009 adds support to shrink shared buffers. It has two changes: a. evict the buffers outside the new buffer size b. remove buffers with buffer id outside the new buffer size from the free list. If a buffer being evicted is pinned, the operation is aborted and a FATAL error is raised. I think we need to change this behaviour to be less severe like rolling back the operation or waiting for the pinned buffer to be unpinned etc. Better even if we could let users control the behaviour. But we need better infrastructure to do such things. That's one TODO left in the patch.
    >
    
    Patches up to 0009 are same as the previous patch set.
    
    > 0010 is about reinitializing the Strategy reinitialization. Once we expand the buffers, the new buffers need to be added to the free list. Some StrategyControl area members (not all) need to be adjusted. That's what this patch does. But a deeper adjustment in BgBufferSync() and ClockSweepTick() is required. Further we need to do something about the buffer lookup table. More on that later in the email.
    
    0010 is improved with fixes for the background writer and ClockSweepTick().
    Now we just reset the information saved between calls to BgBufferSync
    since it doesn't make sense after NBuffers has changed. Also the
    members in StrategyControl related to ClockSweepTick are reset for the
    same reason. More details in the commit message.
    
    0011: GetBufferFromRing() invalidates the buffers beyond NBuffers
    since those may have been added before resizing and are not valid
    anymore. Details in commit message.
    
    >
    > 0011-0012 fix compilation issues in these patches but those fixes are not correct. The patches are there so that binaries can be built without any compilation issues and someone can experiment with buffer resizing. Good thing is the compilation fixes are in SQL callable functions pg_get_shmem_pagesize() and pg_get_shmem_numa(). So there's no ill-effect because of these patches as long as those two functions are not called.
    
    These patches are now 0012 and 0013 respectively.
    
    >
    > Buffer lookup table resizing
    > ------------------------------------
    > The size of the buffer lookup table depends upon (number of shared buffers + number of partitions in the shared buffer lookup table). If we shrink the buffer pool, the buffer lookup table will become sparse but still useful. If we expand the buffers we need to expand the buffer lookup table too. That's not implemented in the current patchset. There are two solutions here:
    >
    > 1. We map a lot of extra address space (not memory) initially to accommodate future expansion of the shared buffer pool. Let's say that the total address space is sufficient to accommodate Nx buffers. Simple solution is to allocate a buffer lookup table with Nx initial entries so that we don't have to resize the buffer lookup table ever. It will waste memory but we might be ok with that as a version 1 solution. According to my offline discussion with David Rowley, buffer lookups in sparse hash tables are inefficient because of more cacheline faults. Whether that translates to any noticeable performance degradation in TPS needs to be measured.
    >
    > 2. Alternate solution is to resize the buffer mapping table as well. This means that we rehash all the entries again which may take a longer time and the partitions will remain locked for that amount of time. Not to mention this will require non-trivial change to dynahash implementation.
    
    I haven't spent time on this yet.
    
    
    --
    Best Wishes,
    Ashutosh Bapat
    
  75. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-06-20T10:19:31Z

    Hi,
    
    > On Mon, Apr 21, 2025 at 7:47 PM Thomas Munro <thomas.munro@gmail.com>
    > wrote:
    >
    > One thing I'm still wondering about is whether you really need all
    > this multi-phase barrier stuff, or even need to stop other backends
    > from running at all while doing the resize.  I guess that's related to
    > your remapping scheme, but supposing you find the simple
    > ftruncate()-only approach to be good, my next question is:  why isn't
    > it enough to wait for all backends to agree to stop allocating new
    > buffers in the range to be truncated, and then let them continue to
    > run as normal?  As far as they would be concerned, the in-progress
    > downsize has already happened, though it could be reverted later if
    > the eviction phase fails.  Then the coordinator could start evicting
    > buffers and truncating the shared memory object, which are
    > phases/steps, sure, but it's not clear to me why they need other
    > backends' help.
    
    My intention behind keeping all backends waiting was to have a simple way of
    not only preventing them from allocating new buffers from the truncated range,
    but also eliminating any chance of them accessing those to-be-truncated
    buffers. In the end it's just easier (at least for me) to reason about
    correctness of the implementation this way.
    
    > On Tue, Jun 10, 2025 at 04:39:58PM +0530, Ashutosh Bapat wrote:
    >
    > Here's patchset rebased on f85f6ab051b7cf6950247e5fa6072c4130613555
    
    Thanks! I've reworked the series to implement the approach suggested by
    Thomas, and applied your patches to support buffers shrinking on top. I
    had to restructure the patch set, here is what it looks like right now:
    
    1. Preparation patches
    
    Changes that are needed to support the resizing functionality, but are not
    strictly related to it.
    
    * Process config reload in AIO workers. Corrects an omission discussed in [1].
    
    * Introduce pending flag for GUC assign hooks. This allows decoupling a GUC
    value change from actually applying it, making it a sort of "pending" change.
    The idea is to let custom logic be triggered in an assign hook, and then take
    responsibility for what happens later and how the change is going to be
    applied. Doesn't do GUC reporting yet.
    
    * Introduce pss_barrierReceivedGeneration. Allows distinguishing situations
    when a signal was processed everywhere from those when a signal was received
    everywhere.
    
    2. Resizing implementation
    
    * Allow to use multiple shared memory mappings. A preparation patch, extending
    the existing interface to support multiple shared memory segments.
    
    * Address space reservation for shared memory. Implement the new way of
    handling shared memory segments, now each segment can visually be represented
    as follows:
    
        /              Address space                 \
        +---------------<+>--------------------------+
        | Actual content | Address space reservation |
        | (memfd)        | (mmap, PROT_NONE)         |
        +---------------<+>--------------------------+
    
    The actual segment size is managed via ftruncate and mprotect. One interesting
    side effect I haven't fully understood yet, is that Linux doesn't seem to
    extend the existing mapping when doing mprotect on huge pages, it creates
    another mapping instead. E.g. when using normal page size and resizing shared
    memory we get:
    
    	7f4808600000-7f4817e00000 rw-s /memfd:buffers (deleted)
    	7f4817e00000-7f48a2000000 ---s /memfd:buffers (deleted)
    
    Doing the same with huge pages ends up looking like this:
    
    	7f4808600000-7f4817e00000 rw-s /memfd:buffers (deleted)
    	7f4817e00000-7f4830000000 rw-s /memfd:buffers (deleted)
    	7f4830000000-7f48a2000000 ---s /memfd:buffers (deleted)
    
    I'm still investigating whether it's a mistake on my side or a genuine Linux
    behavior. At the same time I don't see it as a large issue, the same situation
    could happen with the previous implementation as well.
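
To make the ftruncate plus mprotect idea concrete, here is a minimal sketch of how the boundary between actual content and reservation could be moved. It is not taken from the patch: `anonymous_file()`, `reserve_segment()` and `resize_segment()` are hypothetical names, huge pages are not covered, and the `mkstemp()` fallback is an assumption for systems without `memfd_create()`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Unnamed backing file: memfd where available, else an unlinked temp file. */
static int
anonymous_file(const char *name)
{
#ifdef MFD_CLOEXEC
    return memfd_create(name, MFD_CLOEXEC);
#else
    char path[] = "/tmp/shmem-sketch-XXXXXX";
    int  fd = mkstemp(path);

    (void) name;
    if (fd >= 0)
        unlink(path);
    return fd;
#endif
}

/* Map `reserved` bytes with no access rights: address space only. */
static char *
reserve_segment(int fd, size_t reserved)
{
    void *p = mmap(NULL, reserved, PROT_NONE,
                   MAP_SHARED | MAP_NORESERVE, fd, 0);

    return (p == MAP_FAILED) ? NULL : (char *) p;
}

/* Move the content/reservation boundary; sizes must be page multiples. */
static int
resize_segment(int fd, char *base, size_t old_size, size_t new_size)
{
    size_t page = (size_t) sysconf(_SC_PAGESIZE);

    assert(old_size % page == 0 && new_size % page == 0);

    if (new_size > old_size)
    {
        /* Grow: back the new range first, then make it accessible. */
        if (ftruncate(fd, (off_t) new_size) != 0)
            return -1;
        return mprotect(base + old_size, new_size - old_size,
                        PROT_READ | PROT_WRITE);
    }
    /* Shrink: revoke access to the tail, then release its backing. */
    if (new_size < old_size &&
        mprotect(base + new_size, old_size - new_size, PROT_NONE) != 0)
        return -1;
    return ftruncate(fd, (off_t) new_size);
}
```

Revoking access before truncating on shrink means a stray access to the tail fails with SIGSEGV on the protection check rather than SIGBUS on a missing page, which is one possible ordering and not necessarily the one the patch uses.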
    
    * Introduce multiple shmem segments for shared buffers. Modifies necessary bits
    to use new functionality.
    
    * Allow to resize shared memory without restart. Utilizes the infrastructure
    introduced so far to implement a stop-the-world resizing approach, where all
    active backends (and potentially new ones being spawned) wait until everyone
    gets the same shared memory size.
    
    When testing I've noticed that there seem to be concurrency issues with
    interrupts, where aio workers and checkpointer sometimes do not receive
    the resize signal correctly. I assume it has something to do with the
    significant behavior change -- config reload processing can now fire
    signals on its own.  Letting those backends always process config
    reload first seems to resolve (or at least hide) the issue, but I
    still need to understand what's going on there.
    
    3. Shared memory shrinking
    
    So far only shared memory increase was implemented. These patches from Ashutosh
    support shrinking as well, which is tricky due to the need for buffer eviction.
    
    * Support shrinking shared buffers
    * Reinitialize StrategyControl after resizing buffers
    * Additional validation for buffer in the ring
    
    > 0009 adds support to shrink shared buffers. It has two changes: a.
    > evict the buffers outside the new buffer size b. remove buffers with
    > buffer id outside the new buffer size from the free list. If a buffer
    > being evicted is pinned, the operation is aborted and a FATAL error is
    > raised. I think we need to change this behaviour to be less severe
    > like rolling back the operation or waiting for the pinned buffer to be
    > unpinned etc. Better even if we could let users control the behaviour.
    > But we need better infrastructure to do such things. That's one TODO
    > left in the patch.
    
    I haven't reviewed those, just tested a bit to finally include into the series.
    Note that I had to tweak two things:
    
    * The way it was originally implemented was sending the resize signal to postmaster
    before doing eviction, which could result in SIGBUS when accessing the LSN of a
    dirty buffer to be evicted. I've reshaped it a bit to make sure eviction always
    happens first.
    
    * It seems the CurrentResourceOwner could be missing sometimes, so I've added
    a band-aid checking its presence.
    
    One side note, during my testing I've noticed assert failures on
    pgstat_tracks_io_op inside a wal writer a few times. I couldn't reproduce it
    after the fixes above, but still it may indicate that something is off. E.g.
    it's somehow not expected that the wal writer will do buffer eviction IO (from
    what I understand, the current shrinking implementation allows that).
    
    > Buffer lookup table resizing
    > ------------------------------------
    > The size of the buffer lookup table depends upon (number of shared
    > buffers + number of partitions in the shared buffer lookup table). If we
    > shrink the buffer pool, the buffer lookup table will become sparse but
    > still useful. If we expand the buffers we need to expand the buffer lookup
    > table too. That's not implemented in the current patchset.
    
    Just FYI, buffer lookup table has its own STRATEGY_SHMEM_SEGMENT shared memory
    segment and is resized in the same way as others. There could be lots of
    details missing, but at least the corresponding resizable segment is already
    there.
    
    [1]: https://www.postgresql.org/message-id/flat/sh5uqe4a4aqo5zkkpfy5fobe2rg2zzouctdjz7kou4t74c66ql%40yzpkxb7pgoxf
    
  76. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-06-20T10:22:42Z

    > On Fri, Jun 20, 2025 at 12:19:31PM +0200, Dmitry Dolgov wrote:
    > Thanks! I've reworked the series to implement approach suggested by
    > Thomas, and applied your patches to support buffers shrinking on top. I
    > had to restructure the patch set, here is how it looks like right now:
    
    The base-commit was left in the cover letter which I didn't post, so for
    posterity:
    
        base-commit: 4464fddf7b50abe3dbb462f76fd925e10eedad1c
    
    
    
    
  77. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-07-02T12:35:37Z

    Hi Dmitry,
    Thanks for sharing the patches.
    
    On Fri, Jun 20, 2025 at 3:49 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    
    > 3. Shared memory shrinking
    >
    > So far only shared memory increase was implemented. These patches from Ashutosh
    > support shrinking as well, which is tricky due to the need for buffer eviction.
    >
    > * Support shrinking shared buffers
    > * Reinitialize StrategyControl after resizing buffers
    
    This applies to both shrinking and expansion of shared buffers. When
    expanding we need to add the new buffers to the freelist by changing
    next pointer of last buffer in the free list to point to the first new
    buffer.
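
The expansion case described above can be sketched with a toy free list. This is an illustration under assumptions, not the patch's code: `ToyFreeList` and `freelist_append_range()` are hypothetical, and locking is omitted.

```c
#include <assert.h>

#define FREENEXT_END_OF_LIST  (-1)

/* Toy free list: next[i] links buffer i to the next free buffer id. */
typedef struct ToyFreeList
{
    int  first;   /* head of the list, or FREENEXT_END_OF_LIST */
    int  last;    /* tail of the list, or FREENEXT_END_OF_LIST */
    int *next;    /* one slot per buffer, sized for the new pool */
} ToyFreeList;

/* Chain buffers [old_nbuffers, new_nbuffers) and hang them off the tail. */
static void
freelist_append_range(ToyFreeList *fl, int old_nbuffers, int new_nbuffers)
{
    assert(old_nbuffers >= 0);

    if (new_nbuffers <= old_nbuffers)
        return;                 /* nothing was added */

    /* Link the new buffers to each other, terminating at the last one. */
    for (int id = old_nbuffers; id < new_nbuffers; id++)
        fl->next[id] = (id + 1 < new_nbuffers) ? id + 1 : FREENEXT_END_OF_LIST;

    /* Point the old tail (or the head, if the list was empty) at them. */
    if (fl->first == FREENEXT_END_OF_LIST)
        fl->first = old_nbuffers;
    else
        fl->next[fl->last] = old_nbuffers;

    fl->last = new_nbuffers - 1;
}
```

The empty-list branch matters in practice: on a busy system the free list may well be exhausted by the time the pool is expanded.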
    
    > * Additional validation for buffer in the ring
    >
    > > 0009 adds support to shrink shared buffers. It has two changes: a.
    > > evict the buffers outside the new buffer size b. remove buffers with
    > > buffer id outside the new buffer size from the free list. If a buffer
    > > being evicted is pinned, the operation is aborted and a FATAL error is
    > > raised. I think we need to change this behaviour to be less severe
    > > like rolling back the operation or waiting for the pinned buffer to be
    > > unpinned etc. Better even if we could let users control the behaviour.
    > > But we need better infrastructure to do such things. That's one TODO
    > > left in the patch.
    >
    > I haven't reviewed those, just tested a bit to finally include into the series.
    > Note that I had to tweak two things:
    >
    > * The way it was originally implemented was sending resize signal to postmaster
    > before doing eviction, which could result in sigbus when accessing LSN of a
    > dirty buffer to be evicted. I've reshaped it a bit to make sure eviction always
    > happens first.
    
    Will take a look at this.
    
    >
    > * It seems the CurrentResource owner could be missing sometimes, so I've added
    > a band-aid checking its presence.
    >
    > One side note, during my testing I've noticed assert failures on
    > pgstat_tracks_io_op inside a wal writer a few times. I couldn't reproduce it
    > after the fixes above, but still it may indicate that something is off. E.g.
    > it's somehow not expected that the wal writer will do buffer eviction IO (from
    > what I understand, the current shrinking implementation allows that).
    
    Yes. I think, we have to find a better way to choose a backend which
    does the actual work. Eviction can be done in that backend itself.
    
    Compiler gives warning about an uninitialized variable, which seems to
    be a real bug. Fix attached.
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
  78. Re: Changing shared_buffers without restart

    Tomas Vondra <tomas@vondra.me> — 2025-07-04T00:06:16Z

    Hi Ashutosh and Dmitry,
    
    I took a look at this patch, because it's somewhat related to the NUMA
    patch series I posted a couple days ago, and I've been wondering if
    it makes some of the NUMA stuff harder or simpler.
    
    I don't think it makes a big difference (for the NUMA stuff). My main
    question was when would we adjust the "NUMA location" of parts of memory
    to keep stuff balanced, but this patch series already needs to update
    some of these structs (like the freelists), so those places would be
    updated to be NUMA-aware. Some of the changes could be made lazily,
    to minimize the amount of time when activity is stopped (like shifting
    the buffers to different NUMA nodes). It'd be harder if we wanted to
    resize e.g. PGPROC, but that's not the case. So I think this is fine.
    
    I agree it'd be useful to be able to resize shared buffers, without
    having to restart the instance (which is obviously very disruptive). So
    if we can make this work reliably, with reasonable trade-offs (both on
    the backends, and also the risks/complexity introduced by the feature),
    that'd be great.
    
    I'm far from an expert on mmap() and similar low-level stuff, but the
    current approach (reserving a big chunk of shared memory and slicing
    it by mmap() into smaller segments) seems reasonable.
    
    But I'm getting a bit lost in how exactly this interacts with things
    like overcommit, system memory accounting / OOM killer and this sort of
    stuff. I went through the thread and it seems to me the reserve+map
    approach works OK in this regard (and the messages on linux-mm seem to
    confirm this). But this information is scattered over many messages and
    it's hard to say for sure, because some of this might be relevant for
    an earlier approach, or a subtly different variant of it.
    
    A similar question is portability. The comments and commit messages
    seem to suggest most of this is linux-specific, and other platforms just
    don't have these capabilities. But there's a bunch of messages (mostly
    by Thomas Munro) that hint FreeBSD might be capable of this too, even if
    to some limited extent. And possibly even Windows/EXEC_BACKEND, although
    that seems much trickier.
    
    FWIW I think it's perfectly fine to only support resizing on selected
    platforms, especially considering Linux is the most widely used system
    for running Postgres. We still need to be able to build/run on other
    systems, of course. And maybe it'd be good to be able to disable this
    even on Linux, if that eliminates some overhead and/or risks for people
    who don't need the feature. Just a thought.
    
    Anyway, my main point is that this information is important, but very
    scattered over the thread. It's a bit foolish to expect everyone who
    wants to do a review to read the whole thread (which will inevitably
    grow longer over time), and assemble all these pieces again and again,
    following all the changes in the design etc. Few people will get over
    that hurdle, IMHO.
    
    So I think it'd be very helpful to write a README, explaining the
    current design/approach, and summarizing all these aspects in a single
    place. Including things like portability, interaction with the OS
    accounting, OOM killer, this kind of stuff. Some of this stuff may be
    already mentioned in code comments, but it's hard to find those.
    
    Especially worth documenting are the states the processes need to go
    through (using the barriers), and the transitions between them (i.e.
    what is allowed in each phase, what blocks can be visible, etc.).
    
    
    I'll go over some higher-level items first, and then over some comments
    for individual patches.
    
    
    1) no user docs
    
    There are no user .sgml docs, and maybe it's time to write some,
    explaining how to use this thing - how to configure it, how to trigger
    the resizing, etc. It took me a while to realize I need to do ALTER
    SYSTEM + pg_reload_conf() to kick this off.
    
    It should also document the user-visible limitations, e.g. what activity
    is blocked during the resizing, etc.
    
    
    2) pending GUC changes
    
    I'm somewhat skeptical about the GUC approach. I don't think it was
    designed with this kind of use case in mind, and so I think it's quite
    likely it won't be able to handle it well.
    
    For example, there's almost no validation of the values, so how do you
    ensure the new value makes sense? Because if it doesn't, it can easily
    crash the system (I've seen such crashes repeatedly, I'll get to that).
    Sure, you may do ALTER SYSTEM to set shared_buffers to nonsense and it
    won't start after restart/reboot, but crashing an instance is maybe a
    little bit more annoying.
    
    Let's say we did the ALTER SYSTEM + pg_reload_conf(), and it gets stuck
    waiting on something (can't evict a buffer or something). How do you
    cancel it, when the change is already written to the .auto.conf file?
    Can you simply do ALTER SYSTEM + pg_reload_conf() again?
    
    It also seems a bit strange that the "switch" gets to be driven by a
    randomly selected backend (unless I'm misunderstanding this bit). It
    seems to be true for the buffer eviction during shrinking, at least.
    
    Perhaps this should be a separate utility command, or maybe even just
    a new ALTER SYSTEM variant? Or even just a function, similar to what
    the "online checksums" patch did, possibly combined with a bgworker
    (but probably not needed, there are no db-specific tasks to do).
    
    
    3) max_available_memory
    
    Speaking of GUCs, I dislike how max_available_memory works. It seems a
    bit backwards to me. I mean, we're specifying shared_buffers (and some
    other parameters), and the system calculates the amount of shared memory
    needed. But the limit determines the total limit?
    
    I think the GUC should specify the maximum shared_buffers we want to
    allow, and then we'd work out the total to pre-allocate? Considering
    we're only allowing to resize shared_buffers, that should be pretty
    trivial. Yes, it might happen that the "total limit" happens to exceed
    the available memory or something, but we already have the problem
    with shared_buffers. Seems fine if we explain this in the docs, and
    perhaps print the calculated memory limit on start.
    
    In any case, we should not allow setting a value that ends up
    overflowing the internal reserved space. It's true we don't have a good
    way to do checks for GUCs, but it's a bit silly to crash because of
    hitting some non-obvious internal limit that we necessarily know about.
    
    Maybe this is a reason why GUC hooks are not a good way to set this.
    
    
    4) SHMEM_RESIZE_RATIO
    
    The SHMEM_RESIZE_RATIO thing seems a bit strange too. There's no way
    these ratios can make sense. For example, BLCKSZ is 8192 but the buffer
    descriptor is 64B. That's 128x difference, but the ratios say 0.6 and
    0.1, so 6x. Sure, we'll actually allocate only the memory we need, and
    the rest is only "reserved".
    
    However, that just makes the max_available_memory a bit misleading,
    because you can't ever use it. You can use the 60% for shared buffers
    (which is not mentioned anywhere, and good luck not overflowing that,
    as it's never checked), but those smaller regions are guaranteed to be
    mostly unused. Unfortunate.
    
    And it's not just a matter of fixing those ratios, because then someone
    rebuilds with 32kB blocks and you're in the same situation.
    
    Moreover, all of the above is for mappings sized based on NBuffers. But
    if we allocate 10% for MAIN_SHMEM_SEGMENT, won't that be a problem the
    moment someone increases max_connections, max_locks_per_transaction
    and possibly some other stuff?
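
    The mismatch can be worked through numerically. The sketch below uses
    the ratios quoted above (0.6 for buffer blocks, 0.1 for buffer
    descriptors, written as percentages to stay in integer arithmetic) and
    the 8192 / 64 byte sizes; the function names are hypothetical and this
    is not the patch's code.

    ```c
    #include <assert.h>
    #include <stddef.h>

    #define BLCKSZ                 8192
    #define BUF_DESC_SIZE          64
    #define BLOCKS_RATIO_PCT       60   /* 0.6 of the total limit */
    #define DESCRIPTORS_RATIO_PCT  10   /* 0.1 of the total limit */

    /* Largest NBuffers whose blocks fit into the blocks reservation. */
    static size_t
    max_buffers_for_limit(size_t limit_bytes)
    {
        return limit_bytes / 100 * BLOCKS_RATIO_PCT / BLCKSZ;
    }

    /* Bytes of the descriptor reservation that can never be used. */
    static size_t
    unused_descriptor_bytes(size_t limit_bytes)
    {
        size_t      reserved = limit_bytes / 100 * DESCRIPTORS_RATIO_PCT;
        size_t      needed = max_buffers_for_limit(limit_bytes) * BUF_DESC_SIZE;

        assert(needed <= reserved);
        return reserved - needed;
    }
    ```

    For an 800 MiB limit the blocks area tops out at 61440 buffers, which
    need 3932160 bytes of descriptors out of 83886080 reserved, i.e. under
    5% of that area is ever usable.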
    
    
    5) no tests
    
    I mentioned no "user docs", but the patch has 0 tests too. Which seems
    a bit strange for a patch of this age.
    
    A really serious part of the patch series seems to be the coordination
    of processes when going through the phases, enforced by the barriers.
    This seems like a perfect match for testing using injection points, and
    I know we did something like this in the online checksums patch, which
    needs to coordinate processes in a similar way.
    
    But even just a simple TAP test that does a bunch of (random?) resizes
    while running a pgbench seems better than no tests. (That's what I did
    manually, and it crashed right away.)
    
    There's a lot more stuff to test here, I think. Idle sessions with
    buffers pinned by open cursors, multiple backends doing ALTER SYSTEM
    + pg_reload_conf concurrently, other kinds of failures.
    
    
    6) SIGBUS failures
    
    As mentioned, I did some simple tests with shrink/resize with a pgbench
    in the background, and it almost immediately crashed for me :-( With a
    SIGBUS, which I think is fairly rare on x86 (definitely much less common
    than e.g. SIGSEGV).
    
    An example backtrace attached.
    
    
    7) EXEC_BACKEND, FreeBSD
    
    We clearly need to keep this working on systems without the necessary
    bits (so likely EXEC_BACKEND, FreeBSD etc.). But the builds currently
    fail in both cases, it seems.
    
    I think it's fine to not support resizing on every platform, otherwise
    we'd never get it, but it still needs to build. It would be good to not have
    two very different code versions, one for resizing and one without it,
    though. I wonder if we can just have the "no-resize" use the same struct
    (with the segments/mapping, ...) and all that, but skipping the space
    reservation.
    
    
    8) monitoring
    
    So, let's say I start a resize of shared buffers. How will I know what
    it's currently doing, how much longer it might take, what it's waiting
    for, etc.? I think it'd be good to have progress monitoring, through
    the regular system view (e.g. pg_stat_shmem_resize_progress?).
    
    
    10) what to do about stuck resize?
    
    AFAICS the resize can get stuck for various reasons, e.g. because it
    can't evict pinned buffers, possibly indefinitely. Not great, it's not
    clear to me if there's a way out (canceling the resize) after a timeout,
    or something like that? Not great to start an "online resize" only to
    get stuck with all activity blocked for indefinite amount of time, and
    get to restart anyway.
    
    Seems related to Thomas' message [2], but AFAICS the patch does not do
    anything about this yet, right? What's the plan here?
    
    
    11) preparatory actions?
    
    Even if it doesn't get stuck, some of the actions can take a while, like
    evicting dirty buffers before shrinking, etc. This is similar to what
    happens on restart, when the shutdown checkpoint can take a while, while
    the system is (partly) unavailable.
    
    The common mitigation is to do an explicit checkpoint right before the
    restart, to make the shutdown checkpoint cheap. Could we do something
    similar for the shrinking, e.g. flush buffers from the part to be
    removed before actually starting the resize?
    
    
    12) does this affect e.g. fork() costs?
    
    I wonder if this affects the cost of fork() in some undesirable way?
    Could it make fork() measurably more expensive?
    
    
    13) resize "state" is all over the place
    
    For me, a big hurdle when reasoning about the resizing correctness is
    that there's quite a lot of distinct pieces tracking what the current
    "state" is. I mean, there's:
    
     - ShmemCtrl->NSharedBuffers
     - NBuffers
     - NBuffersOld
     - NBuffersPending
     - ... (I'm sure I missed something)
    
    There's no cohesive description how this fits together, it seems a bit
    "ad hoc". Could be correct, but I find it hard to reason about.
    
    
    14) interesting messages from the thread
    
    While reading through the thread, I noticed a couple messages that I
    think are still relevant:
    
    - I see Peter E posted some review in 2024/11 [3], but it seems his
      comments were mostly ignored. I agree with most of them.
    
    - Robert mentioned a couple interesting failure scenarios in [4], not
      sure if all of this was handled. He however assumes pointers would
      not be stable (and that's something we should not allow, and the
      current approach works OK in this regard, I think). He also outlines
      how it'd happen in phases - this would be useful for the design README
      I think. It also reminds me of the "phases" in the checksums patch.
    
    - Robert asked [5] if Linux might abruptly break this, but I find that
      unlikely. We'd point out we rely on this, and they'd likely rethink.
      This would be made safer if this was specified by POSIX - taking that
      away once implemented seems way harder than for custom extensions.
      It's likely they'd not take away the feature without an alternative
      way to achieve the same effect, I think (yes, harder to maintain).
      Tom suggests [7] this is not in POSIX.
    
    - Matthias mentioned [6] similar flags on other operating systems. Could
      some of those be used to implement the same resizing?
    
    - Andres had an interesting comment about how overcommit interacts with
      MAP_NORESERVE. AFAIK it means we need the flag to not break overcommit
      accounting. There are also some comments about this from linux-mm people [9].
    
    - There seem to be some issues with releasing memory backing a mapping
      with hugetlb [10]. With the fd (and truncating the file), this seems
      to release the memory, but it's linux-specific? But most of this stuff
      is specific to linux, it seems. So is this a problem? With this it
      should be working even for hugetlb ...
    
    - It seems FreeBSD has MFD_HUGETLB [11], so maybe we could use this and
      make the hugetlb stuff work just like on Linux? Unclear. Also, I
      thought the mfd stuff is linux-specific ... or am I confused?
    
    - Andres objected to any approach without pointer stability, and I agree
      with that. If we can figure out such solution, of course.
    
    - Thomas asked [13] why we need to stop all the backends, instead of
      just waiting for them to acknowledge the new (smaller) NBuffers value
      and then let them continue. I also don't quite see why this should
      not work, and it'd limit the disruption when we have to wait for
      eviction of buffers pinned by paused cursors, etc.
    
    
    
    Now, some comments about the individual patches (some of this may be a
    bit redundant with the earlier points):
    
    
    v5-0001-Process-config-reload-in-AIO-workers.patch
    
    1) Hmmm, so which other workers may need such explicit handling? Do all
       other processes participate in procsignal stuff, or does anything
       need an explicit handling?
    
    
    v5-0002-Introduce-pending-flag-for-GUC-assign-hooks.patch
    
    No additional comments, see the points about resizing through a GUC
    callback with pending flag vs. a separate utility command, monitoring
    and so on.
    
    
    v5-0003-Introduce-pss_barrierReceivedGeneration.patch
    
    1) Do we actually need this? Isn't it enough to just have two barriers?
       Or a barrier + condition variable, or something like that.
    
    2) The comment talks about "coordinated way" when processing messages,
       but it's not very clear to me. It should explain what is needed and
       not possible with the current barrier code.
    
    3) This very much reminds me of what the online checksums patch needed to
       do, and we managed to do it using plain barriers. So why does this
       need this new thing? (No opinion on whether it's correct.)
    
    
    v5-0004-Allow-to-use-multiple-shared-memory-mappings.patch
    
    1) "int shmem_segment" - wouldn't it be better to have a separate enum
       for this? I mean, we'll have a predefined list of segments, right?
    
    2) typedef struct AnonymousMapping would deserve some comment
    
    3) ANON_MAPPINGS - Probably should be MAX_ANON_MAPPINGS? But we'll know
       how many we have, so why not to allocate exactly the right number?
       Or even just an array of structs, like in similar cases?
    
    4) static int next_free_segment = 0;
    
       We exactly know what segments we'll create and in which order, no? So
       why do we even bother with this next_free_segment thing? Can't we
       simply declare an array of AnonymousMapping elements, with all the
       elements, and then just walk it and calculate the sizes/pointers?
    
    5) I'm a bit confused about the segment/mapping difference. The patch
       seems to randomly mix those, or maybe I'm just confused. I mean,
       we are creating just one shmem segment, and the pieces are mappings,
       right? So why do we index them by "shmem_segment"?
    
       Also, consider
    
          CreateAnonymousSegment(AnonymousMapping *mapping)
    
       so is that creating a segment or mapping? Or what's the difference?
    
       Or are we creating multiple segments, and I missed that? Or are there
       different "segment" concepts, or what?
    
    6) There should probably be some sort of API wrapping the mappings, so
       that the various places don't need to mess with next_free_segment
       directly, etc. Perhaps PGSharedMemoryCreate() shouldn't do this, and
       should just pass the size to CreateAnonymousSegment(), which would
       find an empty slot in Mappings, etc.? Not sure that'll work, but it's a bit
       error-prone if a struct is modified from multiple places like this.
    
    7) We should remember which segments got to use huge pages and which
       did not. And we should make it optional for each segment. Although,
       maybe I'm just confused about the "segment" definition - if we only
       have one, that's where huge pages are applied.
    
       If we could have multiple segments for different segments (whatever
       that means), not sure what we'll report for cases when some segments
       get to use huge pages and others don't. Either because we don't want
       to use that for some segments, or because we happen to run out of
       the available huge pages.
    
    8) It seems PGSharedMemoryDetach got some significant changes, but the
       comment was not modified at all. I'd guess that means the comment is
       perhaps stale, or maybe there's something we should mention.
    
    9) I doubt the Assert on GetConfigOption needs to be repeated for all
       segments (in CreateSharedMemoryAndSemaphores).
    
    10) Why do we have the Mapping and Segments indexed in different ways?
        I mean, Mappings seem to be filled in FIFO (just grab the next free
        slot), while Segments are indexed by segment ID.
    
    11) Actually, what's the difference between the contents of Mappings
        and Segments? Isn't that the same thing, indexed in the same way?
        Or could it be unified? Or are they conceptually different thing?
    
    12) I believe we'll have a predefined list of segments, with fixed IDs,
        so why not just have a MAX of those IDs as the capacity?
    
    13) Would it be good to have some checks on shmem_segment values? That
        it's valid with respect to defined segments, etc. An assert, maybe?
        What about some asserts on the Mapping/Segment elements? To check
        that the element is sensible, and that the arrays "match" (if we
        need both).
    
    14) Some of the lines got pretty long, e.g. in pg_get_shmem_allocations.
        I suggest we define some macros to make this shorter, or something
        like that.
    
    15) I'd maybe rename ShmemSegment to PGShmemSegment, for consistency
        with PGShmemHeader?
    
    16) Is MAIN_SHMEM_SEGMENT something we want to expose in a public header
        file? Seems very much like an internal thing, people should access
        it only through APIs ...
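
    Points 1, 12 and 13 above can be combined into one small sketch: a
    dedicated enum with a trailing count member sizes the array, and a
    single accessor carries the validity assert. Only MAIN_SHMEM_SEGMENT
    appears in this review; the other segment names, the struct fields and
    GetShmemSegment() are hypothetical.

    ```c
    #include <assert.h>
    #include <stddef.h>

    typedef enum PGShmemSegmentId
    {
        MAIN_SHMEM_SEGMENT = 0,
        BUFFERS_SHMEM_SEGMENT,              /* hypothetical name */
        BUFFER_DESCRIPTORS_SHMEM_SEGMENT,   /* hypothetical name */
        NUM_SHMEM_SEGMENTS                  /* capacity; must stay last */
    } PGShmemSegmentId;

    typedef struct PGShmemSegment
    {
        void       *base;           /* start of the segment */
        size_t      size;           /* bytes currently in use */
        int         huge_pages_on;  /* did this segment get huge pages? */
    } PGShmemSegment;

    /* Sized by the enum, so there is no next-free-slot bookkeeping. */
    static PGShmemSegment Segments[NUM_SHMEM_SEGMENTS];

    /* The only way to reach a segment; rejects out-of-range IDs. */
    static PGShmemSegment *
    GetShmemSegment(PGShmemSegmentId id)
    {
        assert((int) id >= 0 && (int) id < NUM_SHMEM_SEGMENTS);
        return &Segments[id];
    }
    ```

    Adding a segment then means adding one enum member before
    NUM_SHMEM_SEGMENTS; the array capacity and the assert follow
    automatically.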
    
    
    v5-0005-Address-space-reservation-for-shared-memory.patch
    
    1) Shouldn't reserved_offset and huge_pages_on really be in the segment
       info? Or maybe even in mapping info? (again, maybe I'm confused
       about what these structs store)
    
    2) CreateSharedMemoryAndSemaphores comment is rather light on what it
       does, considering it now reserves space and then carves it into
       segments.
    
    3) So ReserveAnonymousMemory is what makes decisions about huge pages,
       for the whole reserved space / all segments in it. That's a bit
       unfortunate with respect to the desirability of some segments
       benefiting from huge pages and others not. Maybe we should have two
       "reserved" areas, one with huge pages, one without?
    
       I guess we don't want too many segments, because that might make
       fork() more expensive, etc. Just guessing, though. Also, how would
       this work with threading?
    
    4) Any particular reason to define max_available_memory as
       GUC_UNIT_BLOCKS and not GUC_UNIT_MB? Of course, if we change this
       to have "max shared buffers limit" then it'd make sense to use
       blocks, but "total limit" is not in blocks.
    
    5) The general approach seems sound to me, but I'm not expert on this.
       I wonder how portable this behavior is. I mean, will it work on other
       Unix systems / Windows? Is it POSIX or Linux extension?
    
    6) It might be a good idea to have Assert procedures to check mappings
       and segments (that it doesn't overflow reserved space, etc.). It
       took me ages to realize I can change shared_buffers to >60% of the
       limit, it'll happily oblige and then just crash with OOM when
       calling mprotect().
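
    For readers unfamiliar with the reserve-then-map idea, here is a
    minimal Linux-only sketch, assuming an anonymous shared mapping: the
    address range is reserved once with PROT_NONE and MAP_NORESERVE, and a
    prefix is later made usable with mprotect(), so the base pointer never
    moves. The function names are hypothetical and this is not the patch's
    code; the bounds check is the kind of guard point 6 asks for.

    ```c
    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    /* Reserve address space only: no access rights, no swap accounting. */
    static void *
    reserve_address_space(size_t reserved_size)
    {
        void       *base = mmap(NULL, reserved_size, PROT_NONE,
                                MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE,
                                -1, 0);

        return base == MAP_FAILED ? NULL : base;
    }

    /*
     * Make the first used_size bytes readable and writable. Returns 0 on
     * success, -1 if the request does not fit or mprotect() fails.
     */
    static int
    commit_prefix(void *base, size_t used_size, size_t reserved_size)
    {
        if (used_size > reserved_size)
            return -1;              /* would overflow the reservation */
        return mprotect(base, used_size, PROT_READ | PROT_WRITE);
    }
    ```

    Growing is another commit_prefix() call with a larger used_size;
    data written earlier stays at the same addresses.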
    
    
    v5-0006-Introduce-multiple-shmem-segments-for-shared-buff.patch
    
    1) I suspect the SHMEM_RESIZE_RATIO is the wrong direction, because it
       entirely ignores relationships between the parts. See the earlier
       comment about this.
    
    2) In fact, what happens if the user tries to resize to a value that is
       too large for one of the segments? How would the system know before
       starting the resize (and failing)?
    
    3) It seems wrong to modify the BufferManagerShmemSize like this. It's
       probably better to have a "...SegmentSize" function for individual
       segments, and let BufferManagerShmemSize() still return a sum of
       all segments.
    
    4) I think MaxAvailableMemory is the wrong abstraction, because that's
       not what people specify. See earlier comment.
    
    5) Let's say we change the shared memory size (ALTER SYSTEM), trigger
       the config reload (pg_reload_conf). But then we find that we can't
       actually shrink the buffers, for some unpredictable reason (e.g.
       there's pinned buffers). How do we "undo" the change? We can't
       really undo the ALTER SYSTEM, that's already written in the .conf
       and we don't know the old value, IIRC. Is it reasonable to start
       killing backends from the assign_hook or something? Seems weird.
    
    
    v5-0007-Allow-to-resize-shared-memory-without-restart.patch
    
    1) Why would AdjustShmemSize be needed? Isn't that a sign of a bug
       somewhere in the resizing?
    
    2) Isn't the pg_memory_barrier() in CoordinateShmemResize a bit weird?
       Why is it needed, exactly? If it's to flush stuff for processes
       consuming EmitProcSignalBarrier, isn't that too late? What if a
       process consumes the barrier between the emit and memory barrier?
    
    3) WaitOnShmemBarrier seems a bit under-documented.
    
    4) Is this actually adding buffers to the freelist? I see buf_init only
       links the new buffers by setting freeNext, but where are the new
       buffers added to the existing freelist?
    
    5) The issue with a new backend seeing an old NBuffers value reminds me
       of the "support enabling checksums online" thread, where we ran into
       similar race conditions. See message [1], the part about race #2
       (the other race might be relevant too, not sure). It's been a while,
       but I think our conclusion in that thread was that the "best" fix
       would be to change the order of steps in InitPostgres(), i.e. setup
       the ProcSignal stuff first, and only then "copy" the NBuffers value.
       And handle the possibility that we receive "duplicate" barriers.
    
    6) In fact, the online checksums thread seems like a possible source of
       inspiration for some of the issues, because it needs to do similar
       stuff (e.g. make sure all backends follow steps in a synchronized
       way, etc.). And it didn't need new types of Barrier to do that.
    
    7) Also, this seems like a perfect match for testing using injection
       points. In fact, there's not a single test in the whole patch series.
       Or a single line of .sgml docs, for that matter. It took me a while
       to realize I'm supposed to change the size by ALTER SYSTEM + reload
       the config.
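
    To make the question in point 4 concrete: linking the new buffers to
    each other is only half of the job, the new chain also has to be
    spliced onto the tail of the existing freelist. A simplified model,
    assuming array-indexed descriptors with a freeNext link and a
    first/last pair for the list; the types and the function name are
    hypothetical, and locking is left out entirely.

    ```c
    #include <assert.h>

    #define FREENEXT_END_OF_LIST  (-1)

    typedef struct BufDesc
    {
        int         freeNext;       /* index of next free buffer, or -1 */
    } BufDesc;

    typedef struct FreeList
    {
        int         firstFreeBuffer;    /* -1 when the list is empty */
        int         lastFreeBuffer;     /* undefined when the list is empty */
    } FreeList;

    /* Link buffers [old_n, new_n) together and append them to the list. */
    static void
    freelist_append_range(FreeList *fl, BufDesc *descs, int old_n, int new_n)
    {
        assert(old_n < new_n);

        /* step 1: chain the new buffers to each other */
        for (int i = old_n; i < new_n - 1; i++)
            descs[i].freeNext = i + 1;
        descs[new_n - 1].freeNext = FREENEXT_END_OF_LIST;

        /* step 2: attach the chain to the existing list */
        if (fl->firstFreeBuffer < 0)
            fl->firstFreeBuffer = old_n;                /* list was empty */
        else
            descs[fl->lastFreeBuffer].freeNext = old_n; /* splice onto tail */
        fl->lastFreeBuffer = new_n - 1;
    }
    ```

    Without step 2 the new buffers are linked among themselves but
    unreachable from firstFreeBuffer.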
    
    
    v5-0008-Support-shrinking-shared-buffers.patch
    
    1) Why is ShmemCtrl->evictor_pid reset in AnonymousShmemResize? Isn't
       there a place starting it and waiting for it to complete? Why
       shouldn't it do EvictExtraBuffers itself?
    
    2) Isn't the change to BufferManagerShmemInit wrong? How do we know the
       last buffer is still at the end of the freelist? Seems unlikely.
    
    3) Seems a bit strange to do it from a random backend. Shouldn't it
       be the responsibility of a process like checkpointer/bgwriter, or
       maybe a dedicated dynamic bgworker? Can we even rely on a backend
       to be available?
    
    4) Unsolved issues with buffers pinned for a long time. Could be an
       issue if the buffer is pinned indefinitely (e.g. cursor in idle
       connection), and the resizing blocks some activity (new connections
       or stuff like that).
    
    5) Funny that "AI suggests" something, but doesn't the block fail to
       reset nextVictimBuffer of the clocksweep? It may point to a buffer
       we're removing, and it'll be invalid, no?
    
    6) It's not clear to me in what situations this triggers (in the call
       to BufferManagerShmemInit)
    
       if (FirstBufferToInit < NBuffers) ...
    
    
    v5-0009-Reinitialize-StrategyControl-after-resizing-buffe.patch
    
    1) IMHO this should be included in the earlier resize/shrink patches,
       I don't see a reason to keep it separate (assuming this is the
       correct way, and the "init" is not).
    
    2) Doesn't StrategyPurgeFreeList already do some of this for the case
       of shrinking memory?
    
    3) Not great adding a bunch of static variables to bufmgr.c. Why do we
       need to make "everything" static global? Isn't it enough to make
       only the "valid" flag global? The rest can stay local, no?
    
       If everything needs to be global for some reason, could we at least
       make it a struct, to group the fields, not just separate random
       variables? And maybe at the top, not half-way through the file?
    
    4) Isn't the name BgBufferSyncAdjust misleading? It's not adjusting
       anything, it's just invalidating the info about past runs.
    
    5) I don't quite understand why BufferSync needs to do the dance with
       delay_shmem_resize.  I mean, we certainly should not run BufferSync
       from the code that resizes buffers, right? Certainly not after the
       eviction, from the part that actually rebuilds shmem structs etc.
       So perhaps something could trigger resize while we're running the
       BufferSync()? Isn't that a bit strange? If this flag is needed, it
       seems more like a band-aid for some issue in the architecture.
    
    6) Also, why should it be fine to get into a situation where some of
       the buffers might not be valid during shrinking? I mean, why should
       we need this check
       (pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)?
       It seems better to ensure we never get into "sync" in a way that
       might leave some of the buffers invalid. Seems way too low-level to
       care about whether a resize is happening.
    
    7) I don't understand the new condition for "Execute the LRU scan".
       Won't this stop LRU scan even in cases when we want it to happen?
       Don't we want to scan the buffers in the remaining part (after
       shrinking), for example? Also, we already checked this shmem flag at
       the beginning of the function - sure, it could change (if some other
       process modifies it), but does that make sense? Wouldn't it cause
       problems if it can change at an arbitrary point while running the
       BufferSync? IMHO just another sign it may not make sense to allow
       this, i.e. buffer sync should not run during the "actual" resize.
    
    
    v5-0010-Additional-validation-for-buffer-in-the-ring.patch
    
    1) So the problem is we might create a ring before shrinking shared
       buffers, and then GetBufferFromRing will see bogus buffers? OK, but
       we should be more careful with these checks, otherwise we'll miss
       real issues when we incorrectly get an invalid buffer. Can't the
       backends do this only when they for sure know we did shrink the
       shared buffers? Or maybe even handle that during the barrier?
    
    2) IMHO this is a sign that the "transitions" between different NBuffers
       values may not be clear enough, and we're allowing stuff to happen
       in the "blurry" area. I think that's likely to cause bugs (it did
       cause issues for the online checksums patch, I think).
    
    
    [1]
    https://www.postgresql.org/message-id/3372a09c-d1f6-4974-ad60-eec15ee0c734%40vondra.me
    
    [2]
    https://www.postgresql.org/message-id/CA%2BhUKGL5hW3i_pk5y_gcbF_C5kP-pWFjCuM8bAyCeHo3xUaH8g%40mail.gmail.com
    
    [3]
    https://www.postgresql.org/message-id/12add41a-7625-4639-a394-a5563e349322%40eisentraut.org
    
    [4]
    https://www.postgresql.org/message-id/CA%2BTgmoZFfn0E%2BEkUAjnv_QM_00eUJPkgCJKzm3n1G4itJKMSsA%40mail.gmail.com
    
    [5]
    https://www.postgresql.org/message-id/flat/cnthxg2eekacrejyeonuhiaezc7vd7o2uowlsbenxqfkjwgvwj%40qgzu6eoqrglb
    
    [6]
    https://www.postgresql.org/message-id/CAEze2WiMkmXUWg10y%2B_oGhJzXirZbYHB5bw0%3DVWte%2BYHwSBa%3DA%40mail.gmail.com
    
    [7] https://www.postgresql.org/message-id/397218.1732844567%40sss.pgh.pa.us
    
    [8]
    https://www.postgresql.org/message-id/gzhuqq3eszx7w46j5de5jehycygipsy7zmfrtdkhfbj5utl6zh%40sxyejudixdfe
    
    [9]
    https://lore.kernel.org/linux-mm/pr7zggtdgjqjwyrfqzusih2suofszxvlfxdptbo2smneixkp7i@nrmtbhemy3is/
    
    [10]
    https://www.postgresql.org/message-id/3qzw5fhhb3eqwl3huqabyxechbz7frxs2vk3hx3tb3h7euyvul%40pc2rmhehuglc
    
    [11]
    https://www.postgresql.org/message-id/CA%2BhUKGJ-RfwSe3%3DZS2HRV9rvgrZTJJButfE8Kh5C6Ta2Eb%2BmPQ%40mail.gmail.com
    
    [12]
    https://www.postgresql.org/message-id/94B56B9C-025A-463F-BC57-DF5B15B8E808%40anarazel.de
    
    [13]
    https://www.postgresql.org/message-id/CA%2BhUKGLQhsZ1dEf5Zo6JuPbs6n-qX%3DcTGy49feKf1iFA_TBP1g%40mail.gmail.com
    
    -- 
    Tomas Vondra
    
  79. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-07-04T14:41:51Z

    > On Fri, Jul 04, 2025 at 02:06:16AM +0200, Tomas Vondra wrote:
    > I took a look at this patch, because it's somewhat related to the NUMA
    > patch series I posted a couple days ago, and I've been wondering if
    > it makes some of the NUMA stuff harder or simpler.
    
    Thanks a lot for the review! That's plenty of feedback, and it will
    probably take me some time to answer all of it, but I still want to
    address a couple of the most important topics quickly.
    
    > But I'm getting a bit lost in how exactly this interacts with things
    > like overcommit, system memory accounting / OOM killer and this sort of
    > stuff. I went through the thread and it seems to me the reserve+map
    > approach works OK in this regard (and the messages on linux-mm seem to
    > confirm this). But this information is scattered over many messages and
    > it's hard to say for sure, because some of this might be relevant for
    > an earlier approach, or a subtly different variant of it.
    >
    > A similar question is portability. The comments and commit messages
    > seem to suggest most of this is linux-specific, and other platforms just
    > don't have these capabilities. But there's a bunch of messages (mostly
    > by Thomas Munro) that hint FreeBSD might be capable of this too, even if
    > to some limited extent. And possibly even Windows/EXEC_BACKEND, although
    > that seems much trickier.
    >
    > [...]
    >
    > So I think it'd be very helpful to write a README, explaining the
    > current design/approach, and summarizing all these aspects in a single
    > place. Including things like portability, interaction with the OS
    > accounting, OOM killer, this kind of stuff. Some of this stuff may be
    > already mentioned in code comments, but it's hard to find those.
    >
    > Especially worth documenting are the states the processes need to go
    > through (using the barriers), and the transitions between them (i.e.
    > what is allowed in each phase, what blocks can be visible, etc.).
    
    Agree, I'll add some comprehensive readme in the next version. Note,
    that on the topic of portability the latest version implements a new
    approach suggested by Thomas Munro, which reduces problematic parts to
    memfd_create only, which is mentioned as Linux specific in the
    documentation, but AFAICT has FreeBSD counterparts.
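
    As a rough illustration of the memfd_create-based idea (a sketch under
    assumptions, not the code from the patch series): the memory is backed
    by an anonymous file descriptor, and its size is changed with
    ftruncate(). The helper names are hypothetical.

    ```c
    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Create an fd-backed anonymous memory object; returns -1 on failure. */
    static int
    create_resizable_backing(const char *name, off_t size)
    {
        int         fd = memfd_create(name, 0);

        if (fd < 0)
            return -1;
        if (ftruncate(fd, size) != 0)
        {
            close(fd);
            return -1;
        }
        return fd;
    }

    /*
     * Grow or shrink the backing object. Shrinking drops the pages past
     * the new end; mappings that still cover that range stay in place,
     * and touching them raises SIGBUS.
     */
    static int
    resize_backing(int fd, off_t new_size)
    {
        return ftruncate(fd, new_size);
    }

    /* Current size of the backing object, or -1 on error. */
    static off_t
    backing_size(int fd)
    {
        struct stat st;

        return fstat(fd, &st) == 0 ? st.st_size : (off_t) -1;
    }
    ```

    The SIGBUS behaviour noted in the comment is the standard result of
    accessing a file mapping beyond the end of the file, which is one
    plausible way a shrink can crash a backend that still uses the old
    buffer count.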
    
    > 1) no user docs
    >
    > There are no user .sgml docs, and maybe it's time to write some,
    > explaining how to use this thing - how to configure it, how to trigger
    > the resizing, etc. It took me a while to realize I need to do ALTER
    > SYSTEM + pg_reload_conf() to kick this off.
    >
    > It should also document the user-visible limitations, e.g. what activity
    > is blocked during the resizing, etc.
    
    While the user interface is still under discussion, I agree, it makes
    sense to capture this information in sgml docs.
    
    > 2) pending GUC changes
    >
    > [...]
    >
    > It also seems a bit strange that the "switch" gets to be driven by a
    > randomly selected backend (unless I'm misunderstanding this bit). It
    > seems to be true for the buffer eviction during shrinking, at least.
    
    The resize itself is coordinated by the postmaster alone, not by a
    randomly selected backend. But looks like buffer eviction indeed can
    happen anywhere, which is what we were discussing in the previous
    messages.
    
    > Perhaps this should be a separate utility command, or maybe even just
    > a new ALTER SYSTEM variant? Or even just a function, similar to what
    > the "online checksums" patch did, possibly combined with a bgworker
    > (but probably not needed, there are no db-specific tasks to do).
    
    This is one topic we still actively discuss, but haven't had much
    feedback otherwise. The pros and cons seem to be clear:
    
    * Utilizing the existing GUC mechanism would allow treating
      shared_buffers as any other configuration, meaning that potential
      users of this feature don't have to do anything new to use it -- they
      still can use whatever method they prefer to apply new configuration
      (pg_reload_conf, pg_ctl reload, maybe even sending SIGHUP directly).
    
      I'm also wondering if it's only shared_buffers, or some other options
      could use similar approach.
    
    * Having a separate utility command is a mighty simplification, which
      helps avoid the problems you've described above.
    
    So far we've got two against one in favour of simple utility command, so
    we can as well go with that.
    
    > 3) max_available_memory
    >
    > Speaking of GUCs, I dislike how max_available_memory works. It seems a
    > bit backwards to me. I mean, we're specifying shared_buffers (and some
    > other parameters), and the system calculates the amount of shared memory
    > needed. But the limit determines the total limit?
    
    The reason it's so backwards is that it's coming from the need to
    specify how much memory we would like to reserve, and what would be the
    upper boundary for increasing shared_buffers. My intention is eventually
    to get rid of this GUC and figure its value at runtime as a function of
    the total available memory.
    
    > I think the GUC should specify the maximum shared_buffers we want to
    > allow, and then we'd work out the total to pre-allocate? Considering
    > we're only allowing to resize shared_buffers, that should be pretty
    > trivial. Yes, it might happen that the "total limit" happens to exceed
    > the available memory or something, but we already have the problem
    > with shared_buffers. Seems fine if we explain this in the docs, and
    > perhaps print the calculated memory limit on start.
    
    Somehow I'm not following what you suggest here. You mean having the
    maximum shared_buffers specified, but not as a separate GUC?
    
    > 4) SHMEM_RESIZE_RATIO
    >
    > The SHMEM_RESIZE_RATIO thing seems a bit strange too. There's no way
    > these ratios can make sense. For example, BLCKSZ is 8192 but the buffer
    > descriptor is 64B. That's 128x difference, but the ratios say 0.6 and
    > 0.1, so 6x. Sure, we'll actually allocate only the memory we need, and
    > the rest is only "reserved".
    
    SHMEM_RESIZE_RATIO is a temporary hack, waiting for a more decent
    solution, nothing more. I probably have to mention that in the
    comments.
    
    > Moreover, all of the above is for mappings sized based on NBuffers. But
    > if we allocate 10% for MAIN_SHMEM_SEGMENT, won't that be a problem the
    > moment someone increases max_connections, max_locks_per_transaction
    > and possibly some other stuff?
    
    Can you elaborate on what you mean by that? Increasing max_connections,
    etc. leads to increased memory consumption in the MAIN_SHMEM_SEGMENT,
    but the ratio is for memory reservation only.
    
    > 5) no tests
    >
    > I mentioned no "user docs", but the patch has 0 tests too. Which seems
    > a bit strange for a patch of this age.
    >
    > A really serious part of the patch series seems to be the coordination
    > of processes when going through the phases, enforced by the barriers.
    > This seems like a perfect match for testing using injection points, and
    > I know we did something like this in the online checksums patch, which
    > needs to coordinate processes in a similar way.
    
    That's exactly what we've been discussing recently, figuring out how to
    use injection points for testing. Keep in mind that the scope of this
    work turned out to be huge, and with just two people on board we're
    addressing one thing at a time.
    
    > But even just a simple TAP test that does a bunch of (random?) resizes
    > while running pgbench seems better than no tests. (That's what I did
    > manually, and it crashed right away.)
    
    This is the type of testing I was doing before posting the series. I
    assume you've crashed it on buffers shrinking, since you've got SIGBUS
    which would indicate that the memory is not available anymore. Before we
    go into debugging, just to be on the safe side I would like to make sure
    you were testing the latest patch version (there are some signs that
    it's not the case, about that later)?
    
    > 10) what to do about stuck resize?
    >
    > AFAICS the resize can get stuck for various reasons, e.g. because it
    > can't evict pinned buffers, possibly indefinitely. Not great, it's not
    > clear to me if there's a way out (canceling the resize) after a timeout,
    > or something like that? Not great to start an "online resize" only to
    > get stuck with all activity blocked for indefinite amount of time, and
    > get to restart anyway.
    >
    > Seems related to Thomas' message [2], but AFAICS the patch does not do
    > anything about this yet, right? What's the plan here?
    
    It's another open discussion right now, with an idea to eventually allow
    canceling after a timeout. I think canceling when stuck on buffer
    eviction should be pretty straightforward (the eviction must take place
    before actual shared memory resize, so we know nothing has changed yet),
    but in some other failure scenarios it would be harder (e.g. if one
    backend is stuck resizing, while others have succeeded -- this would
    require another round of synchronization and some way to figure out what
    is the current status).
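
    A minimal sketch of the "cancel after a timeout" idea for the eviction
    step, assuming eviction is retried until a deadline and nothing has been
    resized yet when it gives up. The function name, the try_evict callback
    and the retry delay are hypothetical; this is a toy model, not the
    patch's logic.

```python
import time
from typing import Callable, Iterable


def evict_before_shrink(
    buffers: Iterable[int],
    try_evict: Callable[[int], bool],
    timeout_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    retry_delay_s: float = 0.01,
) -> bool:
    """Try to evict every buffer in the to-be-removed range.

    Returns True when all buffers were evicted and False when the deadline
    passed first. Because eviction happens before the shared memory is
    actually resized, returning False leaves nothing to undo.
    """
    deadline = clock() + timeout_s
    pending = list(buffers)
    while pending:
        # Keep only the buffers that still cannot be evicted (e.g. pinned).
        pending = [buf for buf in pending if not try_evict(buf)]
        if not pending:
            break
        if clock() >= deadline:
            return False  # cancel the resize, shared memory is unchanged
        sleep(retry_delay_s)
    return True
```

    The harder cases mentioned above (one backend stuck mid-resize while
    others have succeeded) are not covered by a sketch like this.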
    
    > 11) preparatory actions?
    >
    > Even if it doesn't get stuck, some of the actions can take a while, like
    > evicting dirty buffers before shrinking, etc. This is similar to what
    > happens on restart, when the shutdown checkpoint can take a while, while
    > the system is (partly) unavailable.
    >
    > The common mitigation is to do an explicit checkpoint right before the
    > restart, to make the shutdown checkpoint cheap. Could we do something
    > similar for the shrinking, e.g. flush buffers from the part to be
    > removed before actually starting the resize?
    
    Yeah, that's a good idea, we will try to explore it.
    
    > 12) does this affect e.g. fork() costs?
    >
    > I wonder if this affects the cost of fork() in some undesirable way?
    > Could it make fork() measurably more expensive?
    
    The number of new mappings is quite limited, so I would not expect that.
    But I can measure the impact.
    
    > 14) interesting messages from the thread
    >
    > While reading through the thread, I noticed a couple messages that I
    > think are still relevant:
    
    Right, I'm aware there is a lot of not yet addressed feedback, even more
    than you've mentioned below. None of this feedback was ignored, we're
    just solving large problems step by step. So far the focus was on how to
    do memory reservation and to coordinate resize, and everybody is more
    than welcome to join. But thanks for collecting the list, I probably
    need to start tracking what was addressed and what was not.
    
    > - Robert asked [5] if Linux might abruptly break this, but I find that
    >   unlikely. We'd point out we rely on this, and they'd likely rethink.
    >   This would be made safer if this was specified by POSIX - taking that
    >   away once implemented seems way harder than for custom extensions.
    >   It's likely they'd not take away the feature without an alternative
    >   way to achieve the same effect, I think (yes, harder to maintain).
    >   Tom suggests [7] this is not in POSIX.
    
    This conversation was related to the original implementation, which was
    based on mremap and slicing of mappings. As I've mentioned, the new
    approach doesn't have most of those controversial points, it uses
    memfd_create and regular compatible mmap -- I don't see any of those
    changing their behavior any time soon.
    
    > - Andres had an interesting comment about how overcommit interacts with
    >   MAP_NORESERVE. AFAIK it means we need the flag to not break overcommit
    >   accounting. There are also some comments about it from linux-mm people [9].
    
    The new implementation uses MAP_NORESERVE for the mapping.
    
    > - There seem to be some issues with releasing memory backing a mapping
    >   with hugetlb [10]. With the fd (and truncating the file), this seems
    >   to release the memory, but it's linux-specific? But most of this stuff
    >   is specific to linux, it seems. So is this a problem? With this it
    >   should be working even for hugetlb ...
    
    Again, the new implementation got rid of problematic bits here, and I
    haven't found any weak points related to hugetlb in testing so far.
    
    > - It seems FreeBSD has MFD_HUGETLB [11], so maybe we could use this and
    >   make the hugetlb stuff work just like on Linux? Unclear. Also, I
    >   thought the mfd stuff is linux-specific ... or am I confused?
    
    Yep, probably.
    
    > - Thomas asked [13] why we need to stop all the backends, instead of
    >   just waiting for them to acknowledge the new (smaller) NBuffers value
    >   and then let them continue. I also don't quite see why this should
    >   not work, and it'd limit the disruption when we have to wait for
    >   eviction of buffers pinned by paused cursors, etc.
    
    I think I've replied to that one, the idea so far was to eliminate any
    chance of accessing to-be-truncated buffers and make it easier to reason
    about correctness of the implementation this way. I don't see any other
    way how to prevent backends from accessing buffers that may disappear
    without adding overhead on the read path, but if you folks have some
    ideas -- please share!
    
    > v5-0001-Process-config-reload-in-AIO-workers.patch
    >
    > 1) Hmmm, so which other workers may need such explicit handling? Do all
    >    other processes participate in procsignal stuff, or does anything
    >    need an explicit handling?
    
    So far I've noticed the issue only with io_workers and the checkpointer.
    
    > v5-0003-Introduce-pss_barrierReceivedGeneration.patch
    >
    > 1) Do we actually need this? Isn't it enough to just have two barriers?
    >    Or a barrier + condition variable, or something like that.
    
    The issue with two barriers is that they do not prevent disjoint groups,
    i.e. one backend joins the barrier, finishes the work and detaches from
    the barrier, then another backend joins. I'm not familiar with how this
    was solved for the online checksums patch though, will take a look. Having
    a barrier and a condition variable would be possible, but it's hard to
    figure out how many backends to wait for. All in all, a small extension
    to the ProcSignalBarrier feels to me much more elegant.
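
    A toy model of how a separate "received" generation could prevent the
    disjoint-groups problem described above: no participant is allowed to
    process a barrier until every participant has at least received it. The
    class and method names are hypothetical, and this only loosely mirrors
    the idea behind pss_barrierReceivedGeneration, not its implementation.

```python
class ToyProcSignal:
    """Single-threaded model of per-backend received/processed generations."""

    def __init__(self, backends):
        self.generation = 0
        self.received = {b: 0 for b in backends}
        self.processed = {b: 0 for b in backends}

    def emit(self) -> int:
        """Coordinator announces a new barrier generation."""
        self.generation += 1
        return self.generation

    def receive(self, backend) -> None:
        """A backend notes that it has seen the latest generation."""
        self.received[backend] = self.generation

    def all_received(self, generation: int) -> bool:
        return all(g >= generation for g in self.received.values())

    def process(self, backend, generation: int) -> bool:
        """Process only once everyone has received; otherwise keep waiting."""
        if not self.all_received(generation):
            return False
        self.processed[backend] = generation
        return True
```

    In this model a fast backend cannot finish and leave before a slow one
    has joined, because processing is gated on everyone having received the
    same generation.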
    
    > 2) The comment talks about "coordinated way" when processing messages,
    >    but it's not very clear to me. It should explain what is needed and
    >    not possible with the current barrier code.
    
    Yeah, I need to work on the commentaries across the patch. Here in
    particular it means any coordinated way, whatever that could be. I can
    add an example to clarify that part.
    
    > v5-0004-Allow-to-use-multiple-shared-memory-mappings.patch
    
    Most of the commentaries here and in the following patches are obviously
    reasonable and I'll incorporate them into the next version.
    
    > 5) I'm a bit confused about the segment/mapping difference. The patch
    >    seems to randomly mix those, or maybe I'm just confused. I mean,
    >    we are creating just shmem segment, and the pieces are mappings,
    >    right? So why do we index them by "shmem_segment"?
    
    Indeed, the patch uses "segment" and "mapping" interchangeably, I need
    to tighten it up. The relation is still one to one, thus there are
    multiple segments as well as mappings.
    
    > 7) We should remember which segments got to use huge pages and which
    >    did not. And we should make it optional for each segment. Although,
    >    maybe I'm just confused about the "segment" definition - if we only
    >    have one, that's where huge pages are applied.
    >
    >    If we could have multiple segments for different segments (whatever
    >    that means), not sure what we'll report for cases when some segments
    >    get to use huge pages and others don't.
    
    Exactly to avoid solving this, I've consciously decided to postpone
    implementing possibility to mix huge and regular pages so far. Any
    opinions, should a single reported value be removed and this information
    is instead represented as part of an informational view about shared
    memory (the one you were suggesting in this thread)?
    
    > 11) Actually, what's the difference between the contents of Mappings
    >     and Segments? Isn't that the same thing, indexed in the same way?
    >     Or could it be unified? Or are they conceptually different things?
    
    Unless I'm mixing something badly, the content is the same. The relation
    is a segment as a structure "contains" a mapping.
    
    > v5-0005-Address-space-reservation-for-shared-memory.patch
    >
    > 1) Shouldn't reserved_offset and huge_pages_on really be in the segment
    >    info? Or maybe even in mapping info? (again, maybe I'm confused
    >    about what these structs store)
    
    I don't think there is reserved_offset variable in the latest version
    anymore, can you please confirm you use it instead of the one I've
    posted in April?
    
    > 3) So ReserveAnonymousMemory is what makes decisions about huge pages,
    >    for the whole reserved space / all segments in it. That's a bit
    >    unfortunate with respect to the desirability of some segments
    >    benefiting from huge pages and others not. Maybe we should have two
    >    "reserved" areas, one with huge pages, one without?
    
    Again, there is no ReserveAnonymousMemory anymore, the new approach is
    to reserve the memory via separate mappings.
    
    >    I guess we don't want too many segments, because that might make
    >    fork() more expensive, etc. Just guessing, though. Also, how would
    >    this work with threading?
    
    I assume multithreading will render it unnecessary to use shared memory
    favoring some other types of memory usage, but the mechanism around it
    could still be the same.
    
    > 5) The general approach seems sound to me, but I'm not expert on this.
    >    I wonder how portable this behavior is. I mean, will it work on other
    >    Unix systems / Windows? Is it POSIX or Linux extension?
    
    Don't know yet, it's a topic for investigation.
    
    > v5-0006-Introduce-multiple-shmem-segments-for-shared-buff.patch
    >
    > 2) In fact, what happens if the user tries to resize to a value that is
    >    too large for one of the segments? How would the system know before
    >    starting the resize (and failing)?
    
    This type of situation is handled (doing hard stop) in the latest
    version, because all the necessary information is present in the mapping
    structure.
    
    > v5-0007-Allow-to-resize-shared-memory-without-restart.patch
    >
    > 1) Why would AdjustShmemSize be needed? Isn't that a sign of a bug
    >    somewhere in the resizing?
    
    When coordination with barriers kicks in, there is a cut off line after
    which any newly spawned backend will not be able to take part in it
    (e.g. it was too slow to init ProcSignal infrastructure).
    AdjustShmemSize is used to handle these cases.
    
    > 2) Isn't the pg_memory_barrier() in CoordinateShmemResize a bit weird?
    >    Why is it needed, exactly? If it's to flush stuff for processes
    >    consuming EmitProcSignalBarrier, isn't that too late? What if a
    >    process consumes the barrier between the emit and memory barrier?
    
    I think it's not needed, a leftover after code modifications.
    
    > v5-0008-Support-shrinking-shared-buffers.patch
    > v5-0009-Reinitialize-StrategyControl-after-resizing-buffe.patch
    > v5-0010-Additional-validation-for-buffer-in-the-ring.patch
    
    This reminds me I still need to review those, so Ashutosh probably can
    answer those questions better than I.
    
    
    
    
  80. Re: Changing shared_buffers without restart

    Tomas Vondra <tomas@vondra.me> — 2025-07-04T15:23:29Z

    On 7/4/25 16:41, Dmitry Dolgov wrote:
    >> On Fri, Jul 04, 2025 at 02:06:16AM +0200, Tomas Vondra wrote:
    >> I took a look at this patch, because it's somewhat related to the NUMA
    >> patch series I posted a couple days ago, and I've been wondering if
    >> it makes some of the NUMA stuff harder or simpler.
    > 
    > Thanks a lot for the review! It's plenty of feedback, and I'll
    > probably take time to answer all of it, but I still want to address
    > a couple of the most important topics quickly.
    > 
    >> But I'm getting a bit lost in how exactly this interacts with things
    >> like overcommit, system memory accounting / OOM killer and this sort of
    >> stuff. I went through the thread and it seems to me the reserve+map
    >> approach works OK in this regard (and the messages on linux-mm seem to
    >> confirm this). But this information is scattered over many messages and
    >> it's hard to say for sure, because some of this might be relevant for
    >> an earlier approach, or a subtly different variant of it.
    >>
    >> A similar question is portability. The comments and commit messages
    >> seem to suggest most of this is linux-specific, and other platforms just
    >> don't have these capabilities. But there's a bunch of messages (mostly
    >> by Thomas Munro) that hint FreeBSD might be capable of this too, even if
    >> to some limited extent. And possibly even Windows/EXEC_BACKEND, although
    >> that seems much trickier.
    >>
    >> [...]
    >>
    >> So I think it'd be very helpful to write a README, explaining the
    >> current design/approach, and summarizing all these aspects in a single
    >> place. Including things like portability, interaction with the OS
    >> accounting, OOM killer, this kind of stuff. Some of this stuff may be
    >> already mentioned in code comments, but it's hard to find those.
    >>
    >> Especially worth documenting are the states the processes need to go
    >> through (using the barriers), and the transitions between them (i.e.
    >> what is allowed in each phase, what blocks can be visible, etc.).
    > 
    > Agree, I'll add some comprehensive readme in the next version. Note,
    > that on the topic of portability the latest version implements a new
    > approach suggested by Thomas Munro, which reduces problematic parts to
    > memfd_create only, which is mentioned as Linux specific in the
    > documentation, but AFAICT has FreeBSD counterparts.
    > 
    
    OK. It's not entirely clear to me if this README should be temporary, or
    if it should eventually get committed. I'd probably vote to have a
    proper README explaining the basic design / resizing processes etc. It
    probably should not discuss portability in too much detail, that can get
    stale pretty quick.
    
    >> 1) no user docs
    >>
    >> There are no user .sgml docs, and maybe it's time to write some,
    >> explaining how to use this thing - how to configure it, how to trigger
    >> the resizing, etc. It took me a while to realize I need to do ALTER
    >> SYSTEM + pg_reload_conf() to kick this off.
    >>
    >> It should also document the user-visible limitations, e.g. what activity
    >> is blocked during the resizing, etc.
    > 
    > While the user interface is still under discussion, I agree, it makes
    > sense to capture this information in sgml docs.
    > 
    
    Yeah. Spelling out the "official" way to use something is helpful.
    
    >> 2) pending GUC changes
    >>
    >> [...]
    >>
    >> It also seems a bit strange that the "switch" gets to be driven by a
    >> randomly selected backend (unless I'm misunderstanding this bit). It
    >> seems to be true for the buffer eviction during shrinking, at least.
    > 
    > The resize itself is coordinated by the postmaster alone, not by a
    > randomly selected backend. But looks like buffer eviction indeed can
    > happen anywhere, which is what we were discussing in the previous
    > messages.
    > 
    >> Perhaps this should be a separate utility command, or maybe even just
    >> a new ALTER SYSTEM variant? Or even just a function, similar to what
    >> the "online checksums" patch did, possibly combined with a bgworker
    >> (but probably not needed, there are no db-specific tasks to do).
    > 
    > This is one topic we still actively discuss, but haven't had much
    > feedback otherwise. The pros and cons seem to be clear:
    > 
    > * Utilizing the existing GUC mechanism would allow treating
    >   shared_buffers as any other configuration, meaning that potential
    >   users of this feature don't have to do anything new to use it -- they
    >   still can use whatever method they prefer to apply new configuration
    >   (pg_reload_conf, pg_ctl reload, maybe even sending SIGHUP directly).
    > 
    >   I'm also wondering if it's only shared_buffers, or some other options
    >   could use similar approach.
    > 
    
    I don't know. What are the "potential users" of this feature? I don't
    recall any, but there may be some. How do we know the new pending flag
    will work for them too?
    
    > * Having a separate utility command is a mighty simplification, which
    >   helps avoiding problems you've described above.
    > 
    > So far we've got two against one in favour of simple utility command, so
    > we can as well go with that.
    > 
    
    Not sure voting is a good way to make design decisions ...
    
    >> 3) max_available_memory
    >>
    >> Speaking of GUCs, I dislike how max_available_memory works. It seems a
    >> bit backwards to me. I mean, we're specifying shared_buffers (and some
    >> other parameters), and the system calculates the amount of shared memory
    >> needed. But the limit determines the total limit?
    > 
    > The reason it's so backwards is that it's coming from the need to
    > specify how much memory we would like to reserve, and what would be the
    > upper boundary for increasing shared_buffers. My intention is eventually
    > to get rid of this GUC and figure its value at runtime as a function of
    > the total available memory.
    > 
    
    I understand why it's like this. It's simple, and people do want to
    limit the memory the instance will allocate. That's understandable. The
    trouble is it makes it very unclear what's the implied limit on shared
    buffers size. Maybe if there was a sensible way to expose that, we could
    keep the max_available_memory.
    
    But I don't think you can get rid of the GUC, at least not entirely. You
    need to leave some memory aside for queries, people may start multiple
    instances at once, ...
    
    >> I think the GUC should specify the maximum shared_buffers we want to
    >> allow, and then we'd work out the total to pre-allocate? Considering
    >> we're only allowing to resize shared_buffers, that should be pretty
    >> trivial. Yes, it might happen that the "total limit" happens to exceed
    >> the available memory or something, but we already have the problem
    >> with shared_buffers. Seems fine if we explain this in the docs, and
    >> perhaps print the calculated memory limit on start.
    > 
    > Somehow I'm not following what you suggest here. You mean having the
    > maximum shared_buffers specified, but not as a separate GUC?
    > 
    
    My suggestion was to have a guc max_shared_buffers. Based on that you
    can easily calculate the size of all other segments dependent on
    NBuffers, and reserve memory for that.
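
    A minimal sketch of that calculation, assuming a hypothetical
    max_shared_buffers setting and only the two per-buffer sizes mentioned
    in this thread (8192-byte blocks and 64-byte descriptors). Any other
    NBuffers-dependent structures are left out, and the helper names are
    illustrative rather than taken from the patch.

```python
BLCKSZ = 8192           # bytes per buffer block (figure from this thread)
DESCRIPTOR_SIZE = 64    # bytes per buffer descriptor (figure from this thread)


def parse_size(text: str) -> int:
    """Convert a size such as '16GB' into bytes (hypothetical helper)."""
    units = {"kB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}
    for suffix, factor in units.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    return int(text)


def reservation_for(max_shared_buffers: str) -> dict:
    """Address space to reserve per NBuffers-sized segment, in bytes."""
    max_nbuffers = parse_size(max_shared_buffers) // BLCKSZ
    segments = {
        "buffer_blocks": max_nbuffers * BLCKSZ,
        "buffer_descriptors": max_nbuffers * DESCRIPTOR_SIZE,
    }
    # The total is what could be printed at startup as the calculated limit.
    segments["total"] = sum(segments.values())
    return segments
```

    For example, reservation_for("16GB") reserves 16GB for blocks plus 128MB
    for descriptors, with no separate ratio to tune.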
    
    >> 4) SHMEM_RESIZE_RATIO
    >>
    >> The SHMEM_RESIZE_RATIO thing seems a bit strange too. There's no way
    >> these ratios can make sense. For example, BLCKSZ is 8192 but the buffer
    >> descriptor is 64B. That's 128x difference, but the ratios says 0.6 and
    >> 0.1, so 6x. Sure, we'll actually allocate only the memory we need, and
    >> the rest is only "reserved".
    > 
    > SHMEM_RESIZE_RATIO is a temporary hack, waiting for more decent
    > solution, nothing more. I probably have to mention that in the
    > commentaries.
    > 
    
    OK
    
    >> Moreover, all of the above is for mappings sized based on NBuffers. But
    >> if we allocate 10% for MAIN_SHMEM_SEGMENT, won't that be a problem the
    >> moment someone increases max_connections, max_locks_per_transaction
    >> and possibly some other stuff?
    > 
    > Can you elaborate, what do you mean by that? Increasing max_connections,
    > etc. leads to increased memory consumption in the MAIN_SHMEM_SEGMENT,
    > but the ratio is for memory reservation only.
    > 
    
    Stuff like PGPROC, fast-path locks etc. are allocated as part of
    MAIN_SHMEM_SEGMENT, right? Yet the ratio assigns 10% of the maximum
    space for that. If I significantly increase GUCs like max_connections or
    max_locks_per_transaction, how do you know it didn't exceed the 10%?
    
    >> 5) no tests
    >>
    >> I mentioned no "user docs", but the patch has 0 tests too. Which seems
    >> a bit strange for a patch of this age.
    >>
    >> A really serious part of the patch series seems to be the coordination
    >> of processes when going through the phases, enforced by the barriers.
    >> This seems like a perfect match for testing using injection points, and
    >> I know we did something like this in the online checksums patch, which
    >> needs to coordinate processes in a similar way.
    > 
    > That's exactly what we've been discussing recently, figuring out how to
    > use injection points for testing. Keep in mind that the scope of this
    > work turned out to be huge, and with just two people on board we're
    > addressing one thing at a time.
    > 
    
    Sure.
    
    >> But even just a simple TAP test that does a bunch of (random?) resizes
    >> while running pgbench seems better than no tests. (That's what I did
    >> manually, and it crashed right away.)
    > 
    > This is the type of testing I was doing before posting the series. I
    > assume you've crashed it on buffers shrinking, since you've got SIGBUS
    > which would indicate that the memory is not available anymore. Before we
    > go into debugging, just to be on the safe side I would like to make sure
    > you were testing the latest patch version (there are some signs that
    > it's not the case, about that later)?
    > 
    
    Maybe, I don't remember. But I also see crashes while expanding the
    buffers, with assert failure here:
    
    #4  0x0000556f159c43d1 in ExceptionalCondition (conditionName=0x556f15c00e00 "node->prev != INVALID_PROC_NUMBER || list->head == procno", fileName=0x556f15c00ce0 "../../../../src/include/storage/proclist.h", lineNumber=163) at assert.c:66
    #5  0x0000556f157a9831 in proclist_contains_offset (list=0x7f296333ce24, procno=140, node_offset=100) at ../../../../src/include/storage/proclist.h:163
    #6  0x0000556f157a9add in ConditionVariableTimedSleep (cv=0x7f296333ce20, timeout=-1, wait_event_info=134217782) at condition_variable.c:184
    #7  0x0000556f157a99c9 in ConditionVariableSleep (cv=0x7f296333ce20, wait_event_info=134217782) at condition_variable.c:98
    #8  0x0000556f157902df in BarrierArriveAndWait (barrier=0x7f296333ce08, wait_event_info=134217782) at barrier.c:191
    #9  0x0000556f156d1226 in ProcessBarrierShmemResize (barrier=0x7f296333ce08) at pg_shmem.c:1201
    
    
    >> 10) what to do about stuck resize?
    >>
    >> AFAICS the resize can get stuck for various reasons, e.g. because it
    >> can't evict pinned buffers, possibly indefinitely. Not great, it's not
    >> clear to me if there's a way out (canceling the resize) after a timeout,
    >> or something like that? Not great to start an "online resize" only to
    >> get stuck with all activity blocked for indefinite amount of time, and
    >> get to restart anyway.
    >>
    >> Seems related to Thomas' message [2], but AFAICS the patch does not do
    >> anything about this yet, right? What's the plan here?
    > 
    > It's another open discussion right now, with an idea to eventually allow
    > canceling after a timeout. I think canceling when stuck on buffer
    > eviction should be pretty straightforward (the eviction must take place
    > before actual shared memory resize, so we know nothing has changed yet),
    > but in some other failure scenarios it would be harder (e.g. if one
    > backend is stuck resizing, while others have succeeded -- this would
    > require another round of synchronization and some way to figure out what
    > is the current status).
    > 
    
    I think it'll be crucial to structure it so that it can't get stuck
    while resizing.
    
    >> 11) preparatory actions?
    >>
    >> Even if it doesn't get stuck, some of the actions can take a while, like
    >> evicting dirty buffers before shrinking, etc. This is similar to what
    >> happens on restart, when the shutdown checkpoint can take a while, while
    >> the system is (partly) unavailable.
    >>
    >> The common mitigation is to do an explicit checkpoint right before the
    >> restart, to make the shutdown checkpoint cheap. Could we do something
    >> similar for the shrinking, e.g. flush buffers from the part to be
    >> removed before actually starting the resize?
    > 
    > Yeah, that's a good idea, we will try to explore it.
    > 
    >> 12) does this affect e.g. fork() costs?
    >>
    >> I wonder if this affects the cost of fork() in some undesirable way?
    >> Could it make fork() measurably more expensive?
    > 
    > The number of new mappings is quite limited, so I would not expect that.
    > But I can measure the impact.
    > 
    >> 14) interesting messages from the thread
    >>
    >> While reading through the thread, I noticed a couple messages that I
    >> think are still relevant:
    > 
    > Right, I'm aware there is a lot of not yet addressed feedback, even more
    > than you've mentioned below. None of this feedback was ignored, we're
    > just solving large problems step by step. So far the focus was on how to
    > do memory reservation and to coordinate resize, and everybody is more
    > than welcome to join. But thanks for collecting the list, I probably
    > need to start tracking what was addressed and what was not.
    > 
    >> - Robert asked [5] if Linux might abruptly break this, but I find that
    >>   unlikely. We'd point out we rely on this, and they'd likely rethink.
    >>   This would be made safer if this was specified by POSIX - taking that
    >>   away once implemented seems way harder than for custom extensions.
    >>   It's likely they'd not take away the feature without an alternative
    >>   way to achieve the same effect, I think (yes, harder to maintain).
    >>   Tom suggests [7] this is not in POSIX.
    > 
    > This conversation was related to the original implementation, which was
    > based on mremap and slicing of mappings. As I've mentioned, the new
    > approach doesn't have most of those controversial points, it uses
    > memfd_create and regular compatible mmap -- I don't see any of those
    > changing their behavior any time soon.
    > 
    >> - Andres had an interesting comment about how overcommit interacts with
    >>   MAP_NORESERVE. AFAIK it means we need the flag to not break overcommit
    >>   accounting. There are also some comments about it from linux-mm people [9].
    > 
    > The new implementation uses MAP_NORESERVE for the mapping.
    > 
    >> - There seem to be some issues with releasing memory backing a mapping
    >>   with hugetlb [10]. With the fd (and truncating the file), this seems
    >>   to release the memory, but it's linux-specific? But most of this stuff
    >>   is specific to linux, it seems. So is this a problem? With this it
    >>   should be working even for hugetlb ...
    > 
    > Again, the new implementation got rid of problematic bits here, and I
    > haven't found any weak points related to hugetlb in testing so far.
    > 
    >> - It seems FreeBSD has MFD_HUGETLB [11], so maybe we could use this and
    >>   make the hugetlb stuff work just like on Linux? Unclear. Also, I
    >>   thought the mfd stuff is linux-specific ... or am I confused?
    > 
    > Yep, probably.
    > 
    >> - Thomas asked [13] why we need to stop all the backends, instead of
    >>   just waiting for them to acknowledge the new (smaller) NBuffers value
    >>   and then let them continue. I also don't quite see why this should
    >>   not work, and it'd limit the disruption when we have to wait for
    >>   eviction of buffers pinned by paused cursors, etc.
    > 
    > I think I've replied to that one, the idea so far was to eliminate any
    > chance of accessing to-be-truncated buffers and make it easier to reason
    > about correctness of the implementation this way. I don't see any other
    > way how to prevent backends from accessing buffers that may disappear
    > without adding overhead on the read path, but if you folks have some
    > ideas -- please share!
    > 
    >> v5-0001-Process-config-reload-in-AIO-workers.patch
    >>
    >> 1) Hmmm, so which other workers may need such explicit handling? Do all
    >>    other processes participate in procsignal stuff, or does anything
    >>    need an explicit handling?
    > 
    > So far I've noticed the issue only with io_workers and the checkpointer.
    > 
    >> v5-0003-Introduce-pss_barrierReceivedGeneration.patch
    >>
    >> 1) Do we actually need this? Isn't it enough to just have two barriers?
    >>    Or a barrier + condition variable, or something like that.
    > 
    > The issue with two barriers is that they do not prevent disjoint groups,
    > i.e. one backend joins the barrier, finishes the work and detaches from
    > the barrier, then another backend joins. I'm not familiar with how this
    > was solved for the online checksums patch though, will take a look. Having
    > a barrier and a condition variable would be possible, but it's hard to
    > figure out how many backends to wait for. All in all, a small extension
    > to the ProcSignalBarrier feels to me much more elegant.
    > 
    >> 2) The comment talks about "coordinated way" when processing messages,
    >>    but it's not very clear to me. It should explain what is needed and
    >>    not possible with the current barrier code.
    > 
    > Yeah, I need to work on the commentaries across the patch. Here in
    > particular it means any coordinated way, whatever that could be. I can
    > add an example to clarify that part.
    > 
    >> v5-0004-Allow-to-use-multiple-shared-memory-mappings.patch
    > 
    > Most of the commentaries here and in the following patches are obviously
    > reasonable and I'll incorporate them into the next version.
    > 
    >> 5) I'm a bit confused about the segment/mapping difference. The patch
    >>    seems to randomly mix those, or maybe I'm just confused. I mean,
    >>    we are creating just shmem segment, and the pieces are mappings,
    >>    right? So why do we index them by "shmem_segment"?
    > 
    > Indeed, the patch uses "segment" and "mapping" interchangeably, I need
    > to tighten it up. The relation is still one to one, thus there are
    > multiple segments as well as mappings.
    > 
    >> 7) We should remember which segments got to use huge pages and which
    >>    did not. And we should make it optional for each segment. Although,
    >>    maybe I'm just confused about the "segment" definition - if we only
    >>    have one, that's where huge pages are applied.
    >>
    >>    If we could have multiple segments for different segments (whatever
    >>    that means), not sure what we'll report for cases when some segments
    >>    get to use huge pages and others don't.
    > 
    > Exactly to avoid solving this, I've consciously decided to postpone
    > implementing possibility to mix huge and regular pages so far. Any
    > opinions, should a single reported value be removed and this information
    > is instead represented as part of an informational view about shared
    > memory (the one you were suggesting in this thread)?
    > 
    >> 11) Actually, what's the difference between the contents of Mappings
    >>     and Segments? Isn't that the same thing, indexed in the same way?
    >>     Or could it be unified? Or are they conceptually different thing?
    > 
    > Unless I'm mixing something badly, the content is the same. The relation
    > is a segment as a structure "contains" a mapping.
    > 
    
    Then, why do we need to track it in two places? Doesn't it just increase
    the likelihood that someone misses updating one of them?
    
    >> v5-0005-Address-space-reservation-for-shared-memory.patch
    >>
    >> 1) Shouldn't reserved_offset and huge_pages_on really be in the segment
    >>    info? Or maybe even in mapping info? (again, maybe I'm confused
    >>    about what these structs store)
    > 
    > I don't think there is reserved_offset variable in the latest version
    > anymore, can you please confirm you use it instead of the one I've
    > posted in April?
    > 
    >> 3) So ReserveAnonymousMemory is what makes decisions about huge pages,
    >>    for the whole reserved space / all segments in it. That's a bit
    >>    unfortunate with respect to the desirability of some segments
    >>    benefiting from huge pages and others not. Maybe we should have two
    >>    "reserved" areas, one with huge pages, one without?
    > 
    > Again, there is no ReserveAnonymousMemory anymore, the new approach is
    > to reserve the memory via separate mappings.
    > 
    
    Will check. These may indeed be stale comments, from looking at the
    earlier version of the patch (the last one from Ashutosh).
    
    >>    I guess we don't want too many segments, because that might make
    >>    fork() more expensive, etc. Just guessing, though. Also, how would
    >>    this work with threading?
    > 
    > I assume multithreading will render it unnecessary to use shared memory
    > favoring some other types of memory usage, but the mechanism around it
    > could still be the same.
    > 
    >> 5) The general approach seems sound to me, but I'm not expert on this.
    >>    I wonder how portable this behavior is. I mean, will it work on other
    >>    Unix systems / Windows? Is it POSIX or Linux extension?
    > 
    > Don't know yet, it's a topic for investigation.
    > 
    >> v5-0006-Introduce-multiple-shmem-segments-for-shared-buff.patch
    >>
    >> 2) In fact, what happens if the user tries to resize to a value that is
    >>    too large for one of the segments? How would the system know before
    >>    starting the resize (and failing)?
    > 
    > This type of situation is handled (doing hard stop) in the latest
    > version, because all the necessary information is present in the mapping
    > structure.
    > 
    
    I don't know, but crashing the instance (I assume that's what you mean
    by hard stop) does not seem like something we want to do. AFAIK the GUC
    hook should be able to determine if the value is too large, and reject
    it at that point. Not proceed and crash everything.
    
    >> v5-0007-Allow-to-resize-shared-memory-without-restart.patch
    >>
    >> 1) Why would AdjustShmemSize be needed? Isn't that a sign of a bug
    >>    somewhere in the resizing?
    > 
    > When coordination with barriers kicks in, there is a cut off line after
    > which any newly spawned backend will not be able to take part in it
    > (e.g. it was too slow to init ProcSignal infrastructure).
    > AdjustShmemSize is used to handle these cases.
    > 
    >> 2) Isn't the pg_memory_barrier() in CoordinateShmemResize a bit weird?
    >>    Why is it needed, exactly? If it's to flush stuff for processes
    >>    consuming EmitProcSignalBarrier, isn't that too late? What if a
    >>    process consumes the barrier between the emit and memory barrier?
    > 
    > I think it's not needed, a leftover after code modifications.
    > 
    >> v5-0008-Support-shrinking-shared-buffers.patch
    >> v5-0009-Reinitialize-StrategyControl-after-resizing-buffe.patch
    >> v5-0010-Additional-validation-for-buffer-in-the-ring.patch
    > 
    > This reminds me I still need to review those, so Ashutosh probably can
    > answer those questions better than I.
    
    
    
    -- 
    Tomas Vondra
    
    
    
    
    
  81. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-07-05T10:35:44Z

    > On Fri, Jul 04, 2025 at 05:23:29PM +0200, Tomas Vondra wrote:
    > >> 2) pending GUC changes
    > >>
    > >> Perhaps this should be a separate utility command, or maybe even just
    > >> a new ALTER SYSTEM variant? Or even just a function, similar to what
    > >> the "online checksums" patch did, possibly combined with a bgworker
    > >> (but probably not needed, there are no db-specific tasks to do).
    > >
    > > This is one topic we still actively discuss, but haven't had much
    > > feedback otherwise. The pros and cons seem to be clear:
    > >
    > > * Utilizing the existing GUC mechanism would allow treating
    > >   shared_buffers as any other configuration, meaning that potential
    > >   users of this feature don't have to do anything new to use it -- they
    > >   still can use whatever method they prefer to apply new configuration
    > >   (pg_reload_conf, pg_ctl reload, maybe even sending SIGHUP directly).
    > >
    > >   I'm also wondering if it's only shared_buffers, or some other options
    > >   could use similar approach.
    > >
    >
    > I don't know. What are the "potential users" of this feature? I don't
    > recall any, but there may be some. How do we know the new pending flag
    > will work for them too?
    
    It could be potentially useful for any GUC that controls a resource
    shared between backends and requires a restart. To make such a GUC
    changeable online, every backend has to perform some action, and they
    have to coordinate to make sure things are consistent -- exactly the use
    case we're trying to address; shared_buffers just happens to be one
    of such resources. While I agree that the currently implemented
    interface is wrong (e.g. it doesn't prevent pending GUCs from being
    stored in PG_AUTOCONF_FILENAME, which has to happen only when the new
    value is actually applied), it still makes sense to me to allow a more
    flexible lifecycle for certain GUCs.
    
    An example I could think of is shared_preload_libraries. If we ever want
    to do a hot reload of libraries, this will follow the procedure above:
    every backend has to do something like dlclose / dlopen and make sure
    that other backends have the same version of the library. Another, maybe
    less far-fetched, example is max_worker_processes, which AFAICT is mostly
    used to control the number of slots in shared memory (although it's also
    stored in the control file, which makes things more complicated).
    
    > > * Having a separate utility command is a mighty simplification, which
    > >   helps avoiding problems you've described above.
    > >
    > > So far we've got two against one in favour of simple utility command, so
    > > we can as well go with that.
    > >
    >
    > Not sure voting is a good way to make design decisions ...
    
    I'm somewhat torn between those two options myself. The more I think
    about this topic, the more I'm convinced that pending GUC makes sense, but
    the more work I see needed to implement that. Maybe a good middle ground
    is to go with a simple utility command, as Ashutosh was suggesting, and
    keep the pending GUC infrastructure on top of that as an optional patch.
    
    > >> 3) max_available_memory
    > >>
    > >> I think the GUC should specify the maximum shared_buffers we want to
    > >> allow, and then we'd work out the total to pre-allocate? Considering
    > >> we're only allowing to resize shared_buffers, that should be pretty
    > >> trivial. Yes, it might happen that the "total limit" happens to exceed
    > >> the available memory or something, but we already have the problem
    > >> with shared_buffers. Seems fine if we explain this in the docs, and
    > >> perhaps print the calculated memory limit on start.
    > >
    > > Somehow I'm not following what you suggest here. You mean having the
    > > maximum shared_buffers specified, but not as a separate GUC?
    >
    > My suggestion was to have a guc max_shared_buffers. Based on that you
    > can easily calculate the size of all other segments dependent on
    > NBuffers, and reserve memory for that.
    
    Got it, ok.
    
    > >> Moreover, all of the above is for mappings sized based on NBuffers. But
    > >> if we allocate 10% for MAIN_SHMEM_SEGMENT, won't that be a problem the
    > >> moment someone increases of max_connection, max_locks_per_transaction
    > >> and possibly some other stuff?
    > >
    > > Can you elaborate, what do you mean by that? Increasing max_connection,
    > > etc. leading to increased memory consumption in the MAIN_SHMEM_SEGMENT,
    > > but the ratio is for memory reservation only.
    > >
    >
    > Stuff like PGPROC, fast-path locks etc. are allocated as part of
    > MAIN_SHMEM_SEGMENT, right? Yet the ratio assigns 10% of the maximum
    > space for that. If I significantly increase GUCs like max_connections or
    > max_locks_per_transaction, how do you know it didn't exceed the 10%?
    
    Still don't see the problem. The 10% we're talking about is the reserved
    space, thus it affects only shared memory resizing operation and nothing
    else. The real memory allocated is less than or equal to the reserved
    size, but is allocated and managed completely in the same way as without
    the patch, including size calculations. If some GUCs are increased and
    drive real memory usage high, it will be handled as before. Are we on
    the same page about this?
    
    > >> 11) Actually, what's the difference between the contents of Mappings
    > >>     and Segments? Isn't that the same thing, indexed in the same way?
    > >>     Or could it be unified? Or are they conceptually different thing?
    > >
    > > Unless I'm mixing something badly, the content is the same. The relation
    > > is a segment as a structure "contains" a mapping.
    > >
    > Then, why do we need to track it in two places? Doesn't it just increase
    > the likelihood that someone misses updating one of them?
    
    To clarify, by "contents" I mean the shared memory content (the
    actual data) behind both "segment" and the "mapping", maybe you had
    something else in mind.
    
    On the surface of it those are two different data structures that have
    mostly different, but related, fields: a shared memory segment contains
    stuff needed for working with memory (header, base, end, lock), while a
    mapping has lower-level details, e.g. reserved space, fd, IPC key. The
    only common fields are size and address, maybe I can factor them out to
    not repeat.
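
    To make the split concrete, here is a rough sketch of how the two
    structures could look with the common fields factored out. All type
    and field names below are made up for illustration and are not the
    ones used in the patch; the lock and IPC key types are placeholders.

```c
#include <assert.h>
#include <stddef.h>

/* Fields that appear in both structures (hypothetical name). */
typedef struct ShmemRegionSketch
{
    void   *addr;       /* start address of the mapped memory */
    size_t  size;       /* currently used size, in bytes */
} ShmemRegionSketch;

/* Lower-level details: how the memory was obtained from the OS. */
typedef struct ShmemMappingSketch
{
    ShmemRegionSketch region;
    size_t  reserved;   /* reserved address space, expected >= region.size */
    int     fd;         /* backing file descriptor */
    long    ipc_key;    /* IPC key (placeholder type) */
} ShmemMappingSketch;

/* Higher-level view used when working with the memory. */
typedef struct ShmemSegmentSketch
{
    ShmemMappingSketch *mapping;    /* one-to-one with a mapping */
    void   *header;
    void   *base;
    void   *end;
    int     lock;       /* placeholder for the real lock type */
} ShmemSegmentSketch;

/* Bytes by which a mapping could still grow inside its reservation. */
static size_t
mapping_headroom(const ShmemMappingSketch *m)
{
    return (m->reserved >= m->region.size) ? m->reserved - m->region.size : 0;
}
```

    With this layout a segment reaches size and address through its
    mapping, so there is only one place to update them.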
    
    > >> 2) In fact, what happens if the user tries to resize to a value that is
    > >>    too large for one of the segments? How would the system know before
    > >>    starting the resize (and failing)?
    > >
    > > This type of situation is handled (doing hard stop) in the latest
    > > version, because all the necessary information is present in the mapping
    > > structure.
    > >
    > I don't know, but crashing the instance (I assume that's what you mean
    > by hard stop) does not seem like something we want to do. AFAIK the GUC
    > hook should be able to determine if the value is too large, and reject
    > it at that point. Not proceed and crash everything.
    
    I see, you're pointing out that it would be good to have more validation
    at the GUC level, right?
    
    
    
    
  82. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-07-06T13:01:34Z

    > On Fri, Jul 04, 2025 at 04:41:51PM +0200, Dmitry Dolgov wrote:
    > > v5-0003-Introduce-pss_barrierReceivedGeneration.patch
    > >
    > > 1) Do we actually need this? Isn't it enough to just have two barriers?
    > >    Or a barrier + condition variable, or something like that.
    >
    > The issue with two barriers is that they do not prevent disjoint groups,
    > i.e. one backend joins the barrier, finishes the work and detaches from
    > the barrier, then another backend joins. I'm not familiar with how this
    > was solved for online checksums patch though, will take a look. Having a
    > barrier and a condition variable would be possible, but it's hard to
    > figure out for how many backends to wait. All in all, a small extension
    > to the ProcSignalBarrier feels to me much more elegant.
    
    After quickly checking how the online checksums patch deals with the
    coordination, I've realized my answer here about the disjoint groups is
    not quite correct. You were asking about ProcSignalBarrier, I was
    answering about the barrier within the resizing logic. Here is how it
    looks to me:
    
    * We could follow the same way as the online checksums, launch a
      coordinator worker (Ashutosh was suggesting that, but no
      implementation has materialized yet) and fire two ProcSignalBarriers,
      one to kick off resizing and another one to finish it. Maybe it could
      even be three phases, one extra to tell backends to not pull in new
      buffers into the pool to help buffer eviction process.
    
    * This way any backend between the ProcSignalBarriers will be able to
      proceed with whatever it's doing, and there is a need to make sure it
      will not access buffers that will soon disappear. A suggestion so far
      was to get all backends to agree to not allocate any new buffers in the
      to-be-truncated range, but accessing already existing buffers that
      will soon go away is a problem as well. As far as I can tell there is
      no rock solid method to make sure a backend doesn't have a reference
      to such a buffer somewhere (this was discussed earlier in the
      thread), meaning that either a backend has to wait or buffers have to
      be checked every time on access.
    
    * Since the latter adds a performance overhead, we went with the former
      (making backends wait). And here is where all the complexity comes
      from, because waiting backends cannot reply on a ProcSignalBarrier and
      thus require some other approach. If I've overlooked any other
      alternative to backends waiting, let me know.
    
    > It also seems a bit strange that the "switch" gets to be driven by
    > a randomly selected backend (unless I'm misunderstanding this bit). It
    > seems to be true for the buffer eviction during shrinking, at least.
    
    But it looks like the eviction could indeed be improved via a new
    coordinator worker. Before resizing shared memory such a worker will
    first tell all the backends to not allocate new buffers via
    ProcSignalBarrier, then will do buffer eviction. Since backends don't
    need to be waiting after this type of ProcSignalBarrier, it should work
    and establish only one worker to do the eviction. But the second
    ProcSignalBarrier for resizing would still follow the current procedure
    with everybody waiting.
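
    A toy, single-process model of that sequence may help to follow it.
    It is only a sketch: the generation counter loosely mimics what
    ProcSignalBarrier tracks, all names are invented, backends "absorb"
    synchronously in a loop, and the part where everybody waits during
    the second barrier is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define N_BACKENDS 4

typedef enum { PHASE_IDLE, PHASE_NO_NEW_BUFFERS, PHASE_RESIZE } ResizePhase;

typedef struct
{
    uint64_t seen_generation;   /* last barrier generation absorbed */
    bool     may_allocate;      /* may allocate in the to-be-truncated range */
    int      nbuffers;          /* this backend's view of the pool size */
} BackendSketch;

typedef struct
{
    uint64_t      generation;   /* last emitted barrier generation */
    ResizePhase   phase;
    int           target_nbuffers;
    BackendSketch backends[N_BACKENDS];
} ClusterSketch;

/* Coordinator side: announce a new phase, return its generation. */
static uint64_t
emit_barrier(ClusterSketch *c, ResizePhase phase)
{
    c->phase = phase;
    return ++c->generation;
}

/* Backend side: react to the current phase once per generation. */
static void
absorb_barrier(const ClusterSketch *c, BackendSketch *b)
{
    if (b->seen_generation == c->generation)
        return;
    if (c->phase == PHASE_NO_NEW_BUFFERS)
        b->may_allocate = false;
    else if (c->phase == PHASE_RESIZE)
    {
        b->nbuffers = c->target_nbuffers;
        b->may_allocate = true;
    }
    b->seen_generation = c->generation;
}

static bool
all_absorbed(const ClusterSketch *c, uint64_t gen)
{
    for (int i = 0; i < N_BACKENDS; i++)
        if (c->backends[i].seen_generation < gen)
            return false;
    return true;
}

/* Two barriers: stop allocation, (evict), then switch to the new size. */
static bool
shrink_buffers(ClusterSketch *c, int new_nbuffers)
{
    uint64_t gen = emit_barrier(c, PHASE_NO_NEW_BUFFERS);

    for (int i = 0; i < N_BACKENDS; i++)
        absorb_barrier(c, &c->backends[i]);
    if (!all_absorbed(c, gen))
        return false;

    /* Only the coordinator would evict buffers >= new_nbuffers here. */

    c->target_nbuffers = new_nbuffers;
    gen = emit_barrier(c, PHASE_RESIZE);
    for (int i = 0; i < N_BACKENDS; i++)
        absorb_barrier(c, &c->backends[i]);
    if (!all_absorbed(c, gen))
        return false;

    c->phase = PHASE_IDLE;
    return true;
}
```

    Calling shrink_buffers() on a cluster whose backends all see 1000
    buffers, with a target of 500, walks through both generations and
    leaves every backend with the new size.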
    
    Does it make sense to you folks?
    
    
    
    
  83. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-07-06T13:21:08Z

    > On Sun, Jul 06, 2025 at 03:01:34PM +0200, Dmitry Dolgov wrote:
    > * This way any backend between the ProcSignalBarriers will be able to
    >   proceed with whatever it's doing, and there is a need to make sure it
    >   will not access buffers that will soon disappear. A suggestion so far
    >   was to get all backends to agree to not allocate any new buffers in the
    >   to-be-truncated range, but accessing already existing buffers that
    >   will soon go away is a problem as well. As far as I can tell there is
    >   no rock solid method to make sure a backend doesn't have a reference
    >   to such a buffer somewhere (this was discussed earlier in the
    >   thread), meaning that either a backend has to wait or buffers have to
    >   be checked every time on access.
    
    And sure enough, after I wrote this I realized there should be no
    such references after the buffer eviction and prohibiting new buffer
    allocation. I still need to check it though, because not only buffers,
    but other shared memory structures (whose number depends on NBuffers)
    will be truncated. But if they will also be handled by the eviction,
    then maybe everything is just fine.
    
    
    
    
  84. Re: Changing shared_buffers without restart

    Tomas Vondra <tomas@vondra.me> — 2025-07-07T11:57:42Z

    
    On 7/5/25 12:35, Dmitry Dolgov wrote:
    >> On Fri, Jul 04, 2025 at 05:23:29PM +0200, Tomas Vondra wrote:
    >>>> 2) pending GUC changes
    >>>>
    >>>> Perhaps this should be a separate utility command, or maybe even just
    >>>> a new ALTER SYSTEM variant? Or even just a function, similar to what
    >>>> the "online checksums" patch did, possibly combined with a bgworker
    >>>> (but probably not needed, there are no db-specific tasks to do).
    >>>
    >>> This is one topic we still actively discuss, but haven't had much
    >>> feedback otherwise. The pros and cons seem to be clear:
    >>>
    >>> * Utilizing the existing GUC mechanism would allow treating
    >>>   shared_buffers as any other configuration, meaning that potential
    >>>   users of this feature don't have to do anything new to use it -- they
    >>>   still can use whatever method they prefer to apply new configuration
    >>>   (pg_reload_conf, pg_ctl reload, maybe even sending SIGHUP directly).
    >>>
    >>>   I'm also wondering if it's only shared_buffers, or some other options
    >>>   could use similar approach.
    >>>
    >>
    >> I don't know. What are the "potential users" of this feature? I don't
    >> recall any, but there may be some. How do we know the new pending flag
    >> will work for them too?
    > 
    > It could be potentially useful for any GUC that controls a resource
    > shared between backend, and requires restart. To make this GUC
    > changeable online, every backend has to perform some action, and they
    > have to coordinate to make sure things are consistent -- exactly the use
    > case we're trying to address, shared_buffers is just happened to be one
    > of such resources. While I agree that the currently implemented
    > interface is wrong (e.g. it doesn't prevent pending GUCs from being
    > stored in PG_AUTOCONF_FILENAME, this has to happen only when the new
    > value is actually applied), it still makes sense to me to allow more
    > flexible lifecycle for certain GUC.
    > 
    > An example I could think of is shared_preload_libraries. If we ever want
    > to do a hot reload of libraries, this will follow the procedure above:
    > every backend has to do something like dlclose / dlopen and make sure
    > that other backends have the same version of the library. Another maybe
    > less far fetched example is max_worker_processes, which AFAICT is mostly
    > used to control number of slots in shared memory (although it's also
    > stored in the control file, which makes things more complicated).
    > 
    
    Not sure. My concern is the config reload / GUC assign hook was not
    designed with this use case in mind, and we'll run into issues. I also
    dislike the "async" nature of this, which makes it harder to e.g. abort
    the change, etc.
    
    >>> * Having a separate utility command is a mighty simplification, which
    >>>   helps avoiding problems you've described above.
    >>>
    >>> So far we've got two against one in favour of simple utility command, so
    >>> we can as well go with that.
    >>>
    >>
    >> Not sure voting is a good way to make design decisions ...
    > 
    > I'm somewhat torn between those two options myself. The more I think
    > about this topic, the more I convinced that pending GUC makes sense, but
    > the more work I see needed to implement that. Maybe a good middle ground
    > is to go with a simple utility command, as Ashutosh was suggesting, and
    > keep pending GUC infrastructure on top of that as an optional patch.
    > 
    
    What about a simple function? Probably not as clean as a proper utility
    command, and it implies a transaction - not sure if that could be a
    problem for some part of this.
    
    >>>> 3) max_available_memory
    >>>>
    >>>> I think the GUC should specify the maximum shared_buffers we want to
    >>>> allow, and then we'd work out the total to pre-allocate? Considering
    >>>> we're only allowing to resize shared_buffers, that should be pretty
    >>>> trivial. Yes, it might happen that the "total limit" happens to exceed
    >>>> the available memory or something, but we already have the problem
    >>>> with shared_buffers. Seems fine if we explain this in the docs, and
    >>>> perhaps print the calculated memory limit on start.
    >>>
    >>> Somehow I'm not following what you suggest here. You mean having the
    >>> maximum shared_buffers specified, but not as a separate GUC?
    >>
    >> My suggestion was to have a guc max_shared_buffers. Based on that you
    >> can easily calculate the size of all other segments dependent on
    >> NBuffers, and reserve memory for that.
    > 
    > Got it, ok.
    > 
    >>>> Moreover, all of the above is for mappings sized based on NBuffers. But
    >>>> if we allocate 10% for MAIN_SHMEM_SEGMENT, won't that be a problem the
    >>>> moment someone increases of max_connection, max_locks_per_transaction
    >>>> and possibly some other stuff?
    >>>
    >>> Can you elaborate, what do you mean by that? Increasing max_connection,
    >>> etc. leading to increased memory consumption in the MAIN_SHMEM_SEGMENT,
    >>> but the ratio is for memory reservation only.
    >>>
    >>
    >> Stuff like PGPROC, fast-path locks etc. are allocated as part of
    >> MAIN_SHMEM_SEGMENT, right? Yet the ratio assigns 10% of the maximum
    >> space for that. If I significantly increase GUCs like max_connections or
    >> max_locks_per_transaction, how do you know it didn't exceed the 10%?
    > 
    > Still don't see the problem. The 10% we're talking about is the reserved
    > space, thus it affects only shared memory resizing operation and nothing
    > else. The real memory allocated is less than or equal to the reserved
    > size, but is allocated and managed completely in the same way as without
    > the patch, including size calculations. If some GUCs are increased and
    > drive real memory usage high, it will be handled as before. Are we on
    > the same page about this?
    > 
    
    How do you know reserving 10% is sufficient? Imagine I set
    
      max_available_memory = '256MB'
      max_connections = 1000000
      max_locks_per_transaction = 10000
    
    How do you know it's not more than 10% of the available memory?
    
    FWIW if I add a simple assert to CreateAnonymousSegment
    
      Assert(mapping->shmem_reserved >= allocsize);
    
    it crashes even with just the max_available_memory=256MB
    
      #4  0x0000000000b74fbd in ExceptionalCondition (
          conditionName=0xe25920 "mapping->shmem_reserved >= allocsize",
          fileName=0xe251e7 "pg_shmem.c", lineNumber=878) at assert.c:66
    
    because we happen to execute it with this:
    
      mapping->shmem_reserved 26845184  allocsize 125042688
    
    I think I mentioned a similar crash earlier, not sure if that's the same
    issue or a different one.
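
    For what it's worth, the reserved value above is exactly what you
    get from taking 10% of 256MB and rounding up to a 4kB page. That the
    patch computes it this way is my assumption, and the helper names
    below are made up, but the arithmetic is easy to check:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round value up to the next multiple of align (align must be non-zero). */
static uint64_t
align_up(uint64_t value, uint64_t align)
{
    return (value + align - 1) / align * align;
}

/*
 * Address space reserved for one segment, given its percentage share.
 * ASSUMPTION: integer percentage, result rounded up to the page size.
 * The multiplication can overflow for absurdly large inputs.
 */
static uint64_t
reserved_for_segment(uint64_t max_available, unsigned percent,
                     uint64_t page_size)
{
    return align_up(max_available * percent / 100, page_size);
}

/* The condition the added Assert() checks. */
static bool
allocation_fits(uint64_t reserved, uint64_t allocsize)
{
    return allocsize <= reserved;
}
```

    With max_available = 256MB, percent = 10 and a 4096-byte page this
    yields 26845184, and allocation_fits(26845184, 125042688) is false,
    matching the failed assert.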
    
    >>>> 11) Actually, what's the difference between the contents of Mappings
    >>>>     and Segments? Isn't that the same thing, indexed in the same way?
    >>>>     Or could it be unified? Or are they conceptually different thing?
    >>>
    >>> Unless I'm mixing something badly, the content is the same. The relation
    >>> is a segment as a structure "contains" a mapping.
    >>>
    >> Then, why do we need to track it in two places? Doesn't it just increase
    >> the likelihood that someone misses updating one of them?
    > 
    > To clarify, under "contents" I mean the shared memory content (the
    > actual data) behind both "segment" and the "mapping", maybe you had
    > something else in mind.
    > 
    > On the surface of it those are two different data structures that have
    > mostly different, but related, fields: a shared memory segment contains
    > stuff needed for working with memory (header, base, end, lock), mapping
    > has more lower level details, e.g.  reserved space, fd, IPC key. The
    > only common fields are size and address, maybe I can factor them out to
    > not repeat.
    > 
    
    OK, I think I'm just confused by the ambiguous definitions of
    segment/mapping. It'd be good to document/explain this in a comment
    somewhere.
    
    >>>> 2) In fact, what happens if the user tries to resize to a value that is
    >>>>    too large for one of the segments? How would the system know before
    >>>>    starting the resize (and failing)?
    >>>
    >>> This type of situation is handled (doing hard stop) in the latest
    >>> version, because all the necessary information is present in the mapping
    >>> structure.
    >>>
    >> I don't know, but crashing the instance (I assume that's what you mean
    >> by hard stop) does not seem like something we want to do. AFAIK the GUC
    >> hook should be able to determine if the value is too large, and reject
    >> it at that point. Not proceed and crash everything.
    > 
    > I see, you're pointing out that it would be good to have more validation
    > at the GUC level, right?
    
    Well, that'd be a starting point. We definitely should not allow setting
    a value that ends up crashing an instance (it does not matter if it's
    because of FATAL or hitting a segfault/sigbus somewhere).
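
    Roughly what I have in mind, as a standalone sketch rather than a
    real GUC check hook: the struct, the function and the per-buffer
    sizes are all hypothetical. The point is only that every segment's
    reservation is checked before anything is resized, so a bad value
    is rejected instead of failing halfway through.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    const char *name;               /* label for error reporting */
    uint64_t    reserved;           /* reserved address space, in bytes */
    uint64_t    bytes_per_buffer;   /* space this segment needs per buffer */
} SegmentLimitSketch;

/*
 * Return true if new_nbuffers fits into every segment's reservation.
 * On failure return false and, if failed is not NULL, store the name of
 * the first segment that is too small.
 */
static bool
check_new_nbuffers(const SegmentLimitSketch *segs, size_t nsegs,
                   uint64_t new_nbuffers, const char **failed)
{
    for (size_t i = 0; i < nsegs; i++)
    {
        /* Divide instead of multiplying to avoid integer overflow. */
        if (segs[i].bytes_per_buffer != 0 &&
            new_nbuffers > segs[i].reserved / segs[i].bytes_per_buffer)
        {
            if (failed)
                *failed = segs[i].name;
            return false;
        }
    }
    if (failed)
        *failed = NULL;
    return true;
}
```

    A check hook (or the function/utility command, if we go that way)
    could call something like this and report the offending segment in
    the error message.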
    
    cheers
    
    -- 
    Tomas Vondra
    
    
    
    
    
  85. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-07-07T13:06:41Z

    > On Mon, Jul 07, 2025 at 01:57:42PM +0200, Tomas Vondra wrote:
    > > It could be potentially useful for any GUC that controls a resource
    > > shared between backend, and requires restart. To make this GUC
    > > changeable online, every backend has to perform some action, and they
    > > have to coordinate to make sure things are consistent -- exactly the use
    > > case we're trying to address, shared_buffers is just happened to be one
    > > of such resources. While I agree that the currently implemented
    > > interface is wrong (e.g. it doesn't prevent pending GUCs from being
    > > stored in PG_AUTOCONF_FILENAME, this has to happen only when the new
    > > value is actually applied), it still makes sense to me to allow more
    > > flexible lifecycle for certain GUC.
    > >
    > > An example I could think of is shared_preload_libraries. If we ever want
    > > to do a hot reload of libraries, this will follow the procedure above:
    > > every backend has to do something like dlclose / dlopen and make sure
    > > that other backends have the same version of the library. Another maybe
    > > less far fetched example is max_worker_processes, which AFAICT is mostly
    > > used to control number of slots in shared memory (although it's also
    > > stored in the control file, which makes things more complicated).
    > >
    >
    > Not sure. My concern is the config reload / GUC assign hook was not
    > designed with this use case in mind, and we'll run into issues. I also
    > dislike the "async" nature of this, which makes it harder to e.g. abort
    > the change, etc.
    
    Yes, GUC assign hook was not designed for that. That's why the idea is
    to extend the design and see if it will be good enough.
    
    > > I'm somewhat torn between those two options myself. The more I think
    > > about this topic, the more I convinced that pending GUC makes sense, but
    > > the more work I see needed to implement that. Maybe a good middle ground
    > > is to go with a simple utility command, as Ashutosh was suggesting, and
    > > keep pending GUC infrastructure on top of that as an optional patch.
    > >
    >
    > What about a simple function? Probably not as clean as a proper utility
    > command, and it implies a transaction - not sure if that could be a
    > problem for some part of this.
    
    I'm currently inclined towards this and one new worker to coordinate
    the process, with everything else provided as an optional follow-up
    step. Will try this out unless there are any objections.
    
    > >> Stuff like PGPROC, fast-path locks etc. are allocated as part of
    > >> MAIN_SHMEM_SEGMENT, right? Yet the ratio assigns 10% of the maximum
    > >> space for that. If I significantly increase GUCs like max_connections or
    > >> max_locks_per_transaction, how do you know it didn't exceed the 10%?
    > >
    > > Still don't see the problem. The 10% we're talking about is the reserved
    > > space, thus it affects only shared memory resizing operation and nothing
    > > else. The real memory allocated is less than or equal to the reserved
    > > size, but is allocated and managed completely in the same way as without
    > > the patch, including size calculations. If some GUCs are increased and
    > > drive real memory usage high, it will be handled as before. Are we on
    > > the same page about this?
    > >
    >
    > How do you know reserving 10% is sufficient? Imagine I set
    
    I see, I was convinced you were talking about changing something at
    runtime, which will hit the reservation boundary. But you mean all of
    that simply at startup, and yes, of course it will fail -- see the
    point about SHMEM_RATIO being just a temporary hack.
    
    
    
    
  86. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-07-07T13:42:50Z

    On Mon, Jul 7, 2025 at 6:36 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    >
    > > On Mon, Jul 07, 2025 at 01:57:42PM +0200, Tomas Vondra wrote:
    > > > It could be potentially useful for any GUC that controls a resource
    > > > shared between backend, and requires restart. To make this GUC
    > > > changeable online, every backend has to perform some action, and they
    > > > have to coordinate to make sure things are consistent -- exactly the use
    > > > case we're trying to address, shared_buffers is just happened to be one
    > > > of such resources. While I agree that the currently implemented
    > > > interface is wrong (e.g. it doesn't prevent pending GUCs from being
    > > > stored in PG_AUTOCONF_FILENAME, this has to happen only when the new
    > > > value is actually applied), it still makes sense to me to allow more
    > > > flexible lifecycle for certain GUC.
    > > >
    > > > An example I could think of is shared_preload_libraries. If we ever want
    > > > to do a hot reload of libraries, this will follow the procedure above:
    > > > every backend has to do something like dlclose / dlopen and make sure
    > > > that other backends have the same version of the library. Another maybe
    > > > less far fetched example is max_worker_processes, which AFAICT is mostly
    > > > used to control number of slots in shared memory (although it's also
    > > > stored in the control file, which makes things more complicated).
    > > >
    > >
    > > Not sure. My concern is the config reload / GUC assign hook was not
    > > designed with this use case in mind, and we'll run into issues. I also
    > > dislike the "async" nature of this, which makes it harder to e.g. abort
    > > the change, etc.
    >
    > Yes, the GUC assign hook was not designed for that. That's why the idea is
    > to extend the design and see if it will be good enough.
    >
    > > > I'm somewhat torn between those two options myself. The more I think
    > > > about this topic, the more I'm convinced that pending GUC makes sense, but
    > > > the more work I see needed to implement that. Maybe a good middle ground
    > > > is to go with a simple utility command, as Ashutosh was suggesting, and
    > > > keep pending GUC infrastructure on top of that as an optional patch.
    > > >
    > >
    > > What about a simple function? Probably not as clean as a proper utility
    > > command, and it implies a transaction - not sure if that could be a
    > > problem for some part of this.
    >
    > I'm currently inclined towards this and one new worker to coordinate
    > the process, with everything else provided as an optional follow-up
    > step. Will try this out unless there are any objections.
    
    I will reply to the questions but let me summarise my offlist
    discussion with Andres.
    
    I had proposed ALTER SYSTEM ... UPDATE ... approach in pgconf.dev for
    any system wide GUC change such as this. However, Andres pointed out
    that any UI proposal has to honour the current ability to edit
    postgresql.conf and trigger the change in a running server. ALTER
    SYSTEM ... UPDATE ... does not allow that. So, I think we have to
    build something similar or on top of the current ALTER SYSTEM ... SET
    + pg_reload_conf().
    
    My current proposal is ALTER SYSTEM ... SET + pg_reload_conf() with
    pending mark + pg_apply_pending_conf(<name of GUC>, <more
    parameters>). The third function would take a GUC name as parameter
    and complete the pending application change. If the proposed change is
    not valid, it will throw an error. If there are problems completing
    the change it will throw an error and keep the pending mark intact.
    Further the function can take GUC specific parameters which control
    the application process. For example, it could tell whether to
    wait for a backend to unpin a buffer or cancel that query or kill the
    backend or abort the application itself. If the operation takes too
    long, a user may want to cancel the function execution just like
    cancelling a query. Running two concurrent instances of the function,
    both applying the same GUC won't be allowed.
    
    Does that look good?
    
    -- 
    Best Wishes,
    Ashutosh Bapat
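
    The proposed lifecycle (set, reload with a pending mark, then an explicit
    apply step) can be modelled as a small state machine. The sketch below is a
    toy model only: the class and method names are hypothetical, and keeping the
    pending mark after a validation failure is an assumption, since the proposal
    states that only for failures during the application itself.

```python
import threading


class PendingGuc:
    """Hypothetical model of a GUC whose new value stays 'pending' until applied."""

    def __init__(self, name, value):
        self.name = name
        self.value = value      # value currently in effect
        self.pending = None     # value set and reloaded, but not yet applied
        self._apply_lock = threading.Lock()

    def reload(self, new_value):
        # Models ALTER SYSTEM ... SET + pg_reload_conf(): only marks the value pending.
        self.pending = new_value

    def apply_pending(self, validate, do_apply):
        # Models the proposed apply function: one concurrent application per GUC.
        if not self._apply_lock.acquire(blocking=False):
            raise RuntimeError(f"{self.name}: another apply is already running")
        try:
            if self.pending is None:
                raise ValueError(f"{self.name}: nothing pending")
            if not validate(self.pending):
                raise ValueError(f"{self.name}: invalid value {self.pending!r}")
            do_apply(self.pending)  # may raise; the pending mark then stays intact
            self.value, self.pending = self.pending, None
        finally:
            self._apply_lock.release()
```

    Here `do_apply` stands in for the GUC-specific work (waiting for unpins,
    cancelling queries, and so on); raising from it leaves both the old value
    and the pending mark untouched.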
    
    
    
    
  87. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-07-07T13:58:19Z

    > On Mon, Jul 07, 2025 at 07:12:50PM +0530, Ashutosh Bapat wrote:
    >
    > My current proposal is ALTER SYSTEM ... SET + pg_reload_conf() with
    > pending mark + pg_apply_pending_conf(<name of GUC>, <more
    > parameters>). The third function would take a GUC name as parameter
    > and complete the pending application change. If the proposed change is
    > not valid, it will throw an error. If there are problems completing
    > the change it will throw an error and keep the pending mark intact.
    > Further the function can take GUC specific parameters which control
    > the application process. For example, it could tell whether to
    > wait for a backend to unpin a buffer or cancel that query or kill the
    > backend or abort the application itself. If the operation takes too
    > long, a user may want to cancel the function execution just like
    > cancelling a query. Running two concurrent instances of the function,
    > both applying the same GUC won't be allowed.
    
    Yeah, it can look like this, but it's a large chunk of work, as is
    improving the current implementation. I'm still convinced that using the
    GUC mechanism one way or another is the right choice here, but maybe
    better as a follow-up step I was mentioning above -- simply to limit the
    scope and move step by step. How does it sound?
    
    Regarding the proposal, I'm somewhat uncomfortable with the fact that
    between those two function calls the system will be in an awkward state
    for some time, and how long that takes will not be controlled by
    the resizing logic anymore. But otherwise it seems to be equivalent to
    what we want to achieve in many other aspects.
    
    
    
    
  88. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-07-13T18:37:26Z

    > On Sun, Jul 06, 2025 at 03:21:08PM +0200, Dmitry Dolgov wrote:
    > > On Sun, Jul 06, 2025 at 03:01:34PM +0200, Dmitry Dolgov wrote:
    > > * This way any backend between the ProcSignalBarriers will be able to
    > >   proceed with whatever it's doing, and there is a need to make sure it
    > >   will not access buffers that will soon disappear. A suggestion so far
    > >   was to get all backends agree to not allocate any new buffers in the
    > >   to-be-truncated range, but accessing already existing buffers that
    > >   will soon go away is a problem as well. As far as I can tell there is
    > >   no rock solid method to make sure a backend doesn't have a reference
    > >   to such a buffer somewhere (this was discussed earlier in the
    > >   thread), meaning that either a backend has to wait or buffers have to
    > >   be checked every time on access.
    >
    > And sure enough, after I wrote this I've realized there should be no
    > such references after the buffer eviction and prohibiting new buffer
    > allocation. I still need to check it though, because not only buffers,
    > but other shared memory structures (which number depends on NBuffers)
    > will be truncated. But if they will also be handled by the eviction,
    > then maybe everything is just fine.
    
    Pondering more about this topic, I've realized there was one more
    problematic case mentioned by Robert early in the thread, which is
    relatively easy to construct:
    
    * When increasing shared buffers from NBuffers_small to NBuffers_large
      it's possible that one backend already has applied NBuffers_large,
      then allocated a buffer B from (NBuffers_small, NBuffers_large] and put
      it into the buffer lookup table.
    
    * In the meantime another backend still has NBuffers_small, but got
      buffer B from the lookup table.
    
    Currently it's being addressed via every backend waiting for each other,
    but I guess it could be as well managed via handling the freelist, so
    that only "available" buffers will be inserted into the lookup table.
    
    It's probably the only such case, but I can't tell that for sure (hard
    to say, maybe there are more tricky cases with the latest async io). If
    you folks have some other examples that may break, let me know. The
    idea behind making everyone wait was to be rock solid that no similar
    but unknown scenarios could damage the resize procedure.
    
    As for other structures, BufferBlocks, BufferDescriptors and
    BufferIOCVArray are all buffer indexed, so making sure shared memory
    resizing works for buffers should automatically mean the same for the
    rest. But CkptBufferIds is a different case, as it collects buffers to
    sync and processes them at a later point in time -- it has to be explicitly
    handled when shrinking shared memory I guess.
    
    Long story short, in the next version of the patch I'll try to
    experiment with a simplified design: a simple function to trigger
    resizing, launching a coordinator worker, with backends not waiting for
    each other and buffers first allocated and then marked as "available to
    use".
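
    The problematic case above can be reproduced in a toy model: one backend has
    already applied the larger NBuffers and published a new buffer in the shared
    lookup table, while another still works with the smaller value. The names
    below are hypothetical, buffer ids are 0-based for simplicity, and a plain
    dict stands in for the buffer lookup table.

```python
class Backend:
    """Toy backend holding a process-local copy of NBuffers (hypothetical model)."""

    def __init__(self, nbuffers):
        self.nbuffers = nbuffers

    def can_address(self, buf_id):
        # A backend can only use buffer ids below its local NBuffers (0-based here).
        return 0 <= buf_id < self.nbuffers


def stale_lookups(lookup_table, backend):
    """Return page tags whose buffer id lies outside what this backend can address."""
    return sorted(tag for tag, buf_id in lookup_table.items()
                  if not backend.can_address(buf_id))
```

    With `{"rel/0": 1, "rel/9": 6}` as the table, a backend still at NBuffers = 4
    would find `"rel/9"` pointing at a buffer it cannot address, whereas a backend
    at NBuffers = 8 sees nothing stale.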
    
    
    
    
  89. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-07-14T04:55:51Z

    On Mon, Jul 14, 2025 at 12:07 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    >
    > > On Sun, Jul 06, 2025 at 03:21:08PM +0200, Dmitry Dolgov wrote:
    > > > On Sun, Jul 06, 2025 at 03:01:34PM +0200, Dmitry Dolgov wrote:
    > > > * This way any backend between the ProcSignalBarriers will be able to
    > > >   proceed with whatever it's doing, and there is a need to make sure it
    > > >   will not access buffers that will soon disappear. A suggestion so far
    > > >   was to get all backends agree to not allocate any new buffers in the
    > > >   to-be-truncated range, but accessing already existing buffers that
    > > >   will soon go away is a problem as well. As far as I can tell there is
    > > >   no rock solid method to make sure a backend doesn't have a reference
    > > >   to such a buffer somewhere (this was discussed earlier in the
    > > >   thread), meaning that either a backend has to wait or buffers have to
    > > >   be checked every time on access.
    > >
    > > And sure enough, after I wrote this I've realized there should be no
    > > such references after the buffer eviction and prohibiting new buffer
    > > allocation. I still need to check it though, because not only buffers,
    > > but other shared memory structures (which number depends on NBuffers)
    > > will be truncated. But if they will also be handled by the eviction,
    > > then maybe everything is just fine.
    >
    > Pondering more about this topic, I've realized there was one more
    > problematic case mentioned by Robert early in the thread, which is
    > relatively easy to construct:
    >
    > * When increasing shared buffers from NBuffers_small to NBuffers_large
    >   it's possible that one backend already has applied NBuffers_large,
    >   then allocated a buffer B from (NBuffers_small, NBuffers_large] and put
    >   it into the buffer lookup table.
    >
    > * In the meantime another backend still has NBuffers_small, but got
    >   buffer B from the lookup table.
    >
    > Currently it's being addressed via every backend waiting for each other,
    > but I guess it could be as well managed via handling the freelist, so
    > that only "available" buffers will be inserted into the lookup table.
    
    I didn't get how that can be managed by the freelist. Buffers are also
    allocated through clocksweep, which needs to be managed as well.
    
    > Long story short, in the next version of the patch I'll try to
    > experiment with a simplified design: a simple function to trigger
    > resizing, launching a coordinator worker, with backends not waiting for
    > each other and buffers first allocated and then marked as "available to
    > use".
    
    Should all the backends wait between buffer allocation and them being
    marked as "available"? I assume that marking them as available means
    "declaring the new NBuffers". What about when shrinking the buffers?
    Do you plan to make all the backends wait while the coordinator is
    evicting buffers?
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  90. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-07-14T08:10:26Z

    > On Mon, Jul 14, 2025 at 10:25:51AM +0530, Ashutosh Bapat wrote:
    > > Currently it's being addressed via every backend waiting for each other,
    > > but I guess it could be as well managed via handling the freelist, so
    > > that only "available" buffers will be inserted into the lookup table.
    >
    > I didn't get how can that be managed by freelist? Buffers are also
    > allocated through clocksweep, which needs to be managed as well.
    
    The way it is implemented in the patch right now is new buffers are
    added into the freelist right away, when they're initialized by the
    virtue of nextFree. What I have in mind is to do this as the last step,
    when all backends have confirmed shared memory signal was absorbed. This
    would mean that StrategyControl will not return a buffer id from the
    freshly allocated range until everything is done, and no such buffer
    will be inserted into the buffer lookup table.
    
    You're right of course, a buffer id could be returned from the
    ClockSweep and from the custom strategy buffer ring. But from what I see
    those are picking a buffer from the set of already utilized buffers,
    meaning that for a buffer to land there it first has to go through
    StrategyControl->firstFreeBuffer, and hence the idea above will be a
    requirement for those as well.
    
    > > Long story short, in the next version of the patch I'll try to
    > > experiment with a simplified design: a simple function to trigger
    > > resizing, launching a coordinator worker, with backends not waiting for
    > > each other and buffers first allocated and then marked as "available to
    > > use".
    >
    > Should all the backends wait between buffer allocation and them being
    > marked as "available"? I assume that marking them as available means
    > "declaring the new NBuffers".
    
    Yep, making buffers available would be equivalent to declaring the new
    NBuffers. What I think is important to note here is that we use two
    mechanisms for coordination: the shared structure ShmemControl that
    shares the state of operation, and ProcSignal that tells backends to do
    something (change the memory mapping). Declaring the new NBuffers could
    be done via ShmemControl, atomically applying the new value, instead of
    sending a ProcSignal -- this way there is no need for backends to wait,
    but StrategyControl would need to use the ShmemControl instead of local
    copy of NBuffers. Does it make sense to you?
    
    > What about when shrinking the buffers? Do you plan to make all the
    > backends wait while the coordinator is evicting buffers?
    
    No, it was never planned like that, since it could easily end up with
    coordinator waiting for the backend to unpin a buffer, and the backend
    to wait for a signal from the coordinator.
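
    The idea of declaring the new NBuffers through shared state, rather than
    through a signal each backend must absorb, can be sketched as follows. This
    is a toy model under stated assumptions: `ShmemControlModel` and `Strategy`
    are hypothetical names, and a lock stands in for an atomic integer in shared
    memory.

```python
import threading


class ShmemControlModel:
    """Stand-in for a shared control structure; a lock emulates an atomic integer."""

    def __init__(self, nbuffers):
        self._lock = threading.Lock()
        self._nbuffers = nbuffers

    def read_nbuffers(self):
        with self._lock:
            return self._nbuffers

    def publish_nbuffers(self, value):
        with self._lock:
            self._nbuffers = value


class Strategy:
    """Victim selection that re-reads the shared value instead of a local copy."""

    def __init__(self, control):
        self.control = control
        self.next_victim = 0

    def next_buffer(self):
        nbuffers = self.control.read_nbuffers()
        victim = self.next_victim % nbuffers
        self.next_victim += 1
        return victim
```

    Because `next_buffer()` reads the shared value on every call, a single
    `publish_nbuffers()` is visible to all callers at once, without any of them
    having to wait for the others.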
    
    
    
    
  91. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-07-14T08:25:39Z

    On Mon, Jul 14, 2025 at 1:40 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    >
    > > On Mon, Jul 14, 2025 at 10:25:51AM +0530, Ashutosh Bapat wrote:
    > > > Currently it's being addressed via every backend waiting for each other,
    > > > but I guess it could be as well managed via handling the freelist, so
    > > > that only "available" buffers will be inserted into the lookup table.
    > >
    > > I didn't get how can that be managed by freelist? Buffers are also
    > > allocated through clocksweep, which needs to be managed as well.
    >
    > The way it is implemented in the patch right now is new buffers are
    > added into the freelist right away, when they're initialized by the
    > virtue of nextFree. What I have in mind is to do this as the last step,
    > when all backends have confirmed shared memory signal was absorbed. This
    > would mean that StrategyControl will not return a buffer id from the
    > freshly allocated range until everything is done, and no such buffer
    > will be inserted into the buffer lookup table.
    >
    > You're right of course, a buffer id could be returned from the
    > ClockSweep and from the custom strategy buffer ring. But from what I see
    > those are picking a buffer from the set of already utilized buffers,
    > meaning that for a buffer to land there it first has to go through
    > StrategyControl->firstFreeBuffer, and hence the idea above will be a
    > requirement for those as well.
    
    That isn't true. A buffer which was never in the free list can still
    be picked up by clock sweep. But you are raising a relevant point
    about StrategyControl below
    
    >
    > > > Long story short, in the next version of the patch I'll try to
    > > > experiment with a simplified design: a simple function to trigger
    > > > resizing, launching a coordinator worker, with backends not waiting for
    > > > each other and buffers first allocated and then marked as "available to
    > > > use".
    > >
    > > Should all the backends wait between buffer allocation and them being
    > > marked as "available"? I assume that marking them as available means
    > > "declaring the new NBuffers".
    >
    > Yep, making buffers available would be equivalent to declaring the new
    > NBuffers. What I think is important here is to note, that we use two
    > mechanisms for coordination: the shared structure ShmemControl that
    > shares the state of operation, and ProcSignal that tells backends to do
    > something (change the memory mapping). Declaring the new NBuffers could
    > be done via ShmemControl, atomically applying the new value, instead of
    > sending a ProcSignal -- this way there is no need for backends to wait,
    > but StrategyControl would need to use the ShmemControl instead of local
    > copy of NBuffers. Does it make sense to you?
    
    When expanding buffers, letting StrategyControl continue with the old
    NBuffers may work. When propagating the new buffer value we have to
    reinitialize StrategyControl to use new NBuffers. But when shrinking,
    the StrategyControl needs to be initialized with the new NBuffers,
    lest it picks a victim from buffers being shrunk. And then if the
    operation fails, we have to reinitialize the StrategyControl again
    with the old NBuffers.
    
    >
    > > What about when shrinking the buffers? Do you plan to make all the
    > > backends wait while the coordinator is evicting buffers?
    >
    > No, it was never planned like that, since it could easily end up with
    > coordinator waiting for the backend to unpin a buffer, and the backend
    > to wait for a signal from the coordinator.
    
    I agree with the deadlock situation. How do we prevent the backends
    from picking or continuing to work with a buffer from buffers being
    shrunk then? Each backend then has to do something about their
    respective pinned buffers.
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  92. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-07-14T08:54:38Z

    > On Mon, Jul 14, 2025 at 01:55:39PM +0530, Ashutosh Bapat wrote:
    > > You're right of course, a buffer id could be returned from the
    > > ClockSweep and from the custom strategy buffer ring. But from what I see
    > > those are picking a buffer from the set of already utilized buffers,
    > > meaning that for a buffer to land there it first has to go through
    > > StrategyControl->firstFreeBuffer, and hence the idea above will be a
    > > requirement for those as well.
    >
    > That isn't true. A buffer which was never in the free list can still
    > be picked up by clock sweep.
    
    How's that?
    
    > > Yep, making buffers available would be equivalent to declaring the new
    > > NBuffers. What I think is important here is to note, that we use two
    > > mechanisms for coordination: the shared structure ShmemControl that
    > > shares the state of operation, and ProcSignal that tells backends to do
    > > something (change the memory mapping). Declaring the new NBuffers could
    > > be done via ShmemControl, atomically applying the new value, instead of
    > > sending a ProcSignal -- this way there is no need for backends to wait,
    > > but StrategyControl would need to use the ShmemControl instead of local
    > > copy of NBuffers. Does it make sense to you?
    >
    > When expanding buffers, letting StrategyControl continue with the old
    > NBuffers may work. When propagating the new buffer value we have to
    > reinitialize StrategyControl to use new NBuffers. But when shrinking,
    > the StrategyControl needs to be initialized with the new NBuffers,
    > lest it picks a victim from buffers being shrunk. And then if the
    > operation fails, we have to reinitialize the StrategyControl again
    > with the old NBuffers.
    
    Right, those two cases will become more asymmetrical: for expanding,
    the number of available buffers would have to be propagated to the
    backends at the end, when they're ready; for shrinking, the number of
    available buffers would have to be propagated at the start, so that
    backends will stop allocating unavailable buffers.
    
    > > > What about when shrinking the buffers? Do you plan to make all the
    > > > backends wait while the coordinator is evicting buffers?
    > >
    > > No, it was never planned like that, since it could easily end up with
    > > coordinator waiting for the backend to unpin a buffer, and the backend
    > > to wait for a signal from the coordinator.
    >
    > I agree with the deadlock situation. How do we prevent the backends
    > from picking or continuing to work with a buffer from buffers being
    > shrunk then? Each backend then has to do something about their
    > respective pinned buffers.
    
    The idea I've got so far is to stop allocating buffers from the
    unavailable range and wait until backends unpin all unavailable buffers.
    We either wait unconditionally until it happens, or bail out after a
    certain timeout.
    
    It's probably possible to force backends to unpin buffers they work
    with, but it sounds much more problematic to me. What do you think?
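
    The asymmetry between expanding and shrinking, together with the
    wait-or-bail-out option, can be written down as a short sketch. The step
    names and both functions are hypothetical illustrations, not part of the
    patch; pin counts are modelled as a plain list indexed by a 0-based buffer id.

```python
import time


def resize_plan(old_nbuffers, new_nbuffers):
    """Return the ordered steps for a resize, following the asymmetry above."""
    if new_nbuffers > old_nbuffers:
        # Expanding: backends learn the new size last, once the memory is ready.
        return ["map_and_init_new_buffers", "publish_new_nbuffers"]
    if new_nbuffers < old_nbuffers:
        # Shrinking: backends learn the new size first, so they stop allocating
        # from the range that is about to disappear.
        return ["publish_new_nbuffers", "wait_for_unpin", "evict_and_unmap"]
    return []


def wait_for_unpin(pin_counts, first_unavailable, timeout_s, poll_s=0.01):
    """Poll until no buffer at or above first_unavailable is pinned, or time out."""
    deadline = time.monotonic() + timeout_s
    while True:
        if all(count == 0 for count in pin_counts[first_unavailable:]):
            return True
        if time.monotonic() >= deadline:
            return False  # caller bails out and keeps the old size
        time.sleep(poll_s)
```

    Passing a very large `timeout_s` approximates the "wait unconditionally"
    variant; a small one gives the bail-out behaviour.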
    
    
    
    
  93. Re: Changing shared_buffers without restart

    Thom Brown <thom@linux.com> — 2025-07-14T09:24:50Z

    On Mon, 14 Jul 2025, 09:54 Dmitry Dolgov, <9erthalion6@gmail.com> wrote:
    
    > > On Mon, Jul 14, 2025 at 01:55:39PM +0530, Ashutosh Bapat wrote:
    > > > You're right of course, a buffer id could be returned from the
    > > > ClockSweep and from the custom strategy buffer ring. But from what I
    > see
    > > > those are picking a buffer from the set of already utilized buffers,
    > > > meaning that for a buffer to land there it first has to go through
    > > > StrategyControl->firstFreeBuffer, and hence the idea above will be a
    > > > requirement for those as well.
    > >
    > > That isn't true. A buffer which was never in the free list can still
    > > be picked up by clock sweep.
    >
    > How's that?
    >
    
    Isn't it its job to find usable buffers from the used buffer list when no
    free ones are available? The next victim buffer can be selected (and
    cleaned if dirty) and then immediately used without touching the free list.
    
    Thom
    
  94. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-07-14T09:32:25Z

    > On Mon, Jul 14, 2025 at 10:24:50AM +0100, Thom Brown wrote:
    > On Mon, 14 Jul 2025, 09:54 Dmitry Dolgov, <9erthalion6@gmail.com> wrote:
    >
    > > > On Mon, Jul 14, 2025 at 01:55:39PM +0530, Ashutosh Bapat wrote:
    > > > > You're right of course, a buffer id could be returned from the
    > > > > ClockSweep and from the custom strategy buffer ring. But from what I
    > > see
    > > > > those are picking a buffer from the set of already utilized buffers,
    > > > > meaning that for a buffer to land there it first has to go through
    > > > > StrategyControl->firstFreeBuffer, and hence the idea above will be a
    > > > > requirement for those as well.
    > > >
    > > > That isn't true. A buffer which was never in the free list can still
    > > > be picked up by clock sweep.
    > >
    > > How's that?
    > >
    >
    > Isn't it its job to find usable buffers from the used buffer list when no
    > free ones are available? The next victim buffer can be selected (and
    > cleaned if dirty) and then immediately used without touching the free list.
    
    Ah, I see what you mean folks. But I'm talking here only about buffers
    which will be allocated after extending shared memory -- they must go
    through the freelist first (I don't see why not, any other options?),
    and clock sweep will have a chance to pick them up only afterwards. That
    makes the freelist sort of an entry point for those buffers.
    
    
    
    
  95. Re: Changing shared_buffers without restart

    Andres Freund <andres@anarazel.de> — 2025-07-14T12:56:56Z

    Hi,
    
    On 2025-07-14 11:32:25 +0200, Dmitry Dolgov wrote:
    > > On Mon, Jul 14, 2025 at 10:24:50AM +0100, Thom Brown wrote:
    > > On Mon, 14 Jul 2025, 09:54 Dmitry Dolgov, <9erthalion6@gmail.com> wrote:
    > >
    > > > > On Mon, Jul 14, 2025 at 01:55:39PM +0530, Ashutosh Bapat wrote:
    > > > > > You're right of course, a buffer id could be returned from the
    > > > > > ClockSweep and from the custom strategy buffer ring. But from what I
    > > > see
    > > > > > those are picking a buffer from the set of already utilized buffers,
    > > > > > meaning that for a buffer to land there it first has to go through
    > > > > > StrategyControl->firstFreeBuffer, and hence the idea above will be a
    > > > > > requirement for those as well.
    > > > >
    > > > > That isn't true. A buffer which was never in the free list can still
    > > > > be picked up by clock sweep.
    > > >
    > > > How's that?
    > > >
    > >
    > > Isn't it its job to find usable buffers from the used buffer list when no
    > > free ones are available? The next victim buffer can be selected (and
    > > cleaned if dirty) and then immediately used without touching the free list.
    > 
    > Ah, I see what you mean folks. But I'm talking here only about buffers
    > which will be allocated after extending shared memory -- they must go
    > through the freelist first (I don't see why not, any other options?),
    > and clock sweep will have a chance to pick them up only afterwards. That
    > makes the freelist sort of an entry point for those buffers.
    
    Clock sweep can find any buffer, independent of whether it's on the freelist.
    
    
    
    
  96. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-07-14T13:08:28Z

    > On Mon, Jul 14, 2025 at 08:56:56AM -0400, Andres Freund wrote:
    > > Ah, I see what you mean folks. But I'm talking here only about buffers
    > > which will be allocated after extending shared memory -- they must go
    > > through the freelist first (I don't see why not, any other options?),
    > > and clock sweep will have a chance to pick them up only afterwards. That
    > > makes the freelist sort of an entry point for those buffers.
    >
    > Clock sweep can find any buffer, independent of whether it's on the freelist.
    
    It does the search based on nextVictimBuffer, where the actual buffer
    will be a modulo of NBuffers, right? If that's correct and I get
    everything else right, that would mean as long as NBuffers stays the
    same (which is the case for the purposes of the current discussion) new
    buffers, allocated on top of NBuffers after shared memory increase, will
    not be picked by the clock sweep.
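
    The modulo argument can be checked with a one-line model of the clock hand.
    This is a simplified sketch, not the actual freelist code: the function name
    is hypothetical, and it only models the hand position, ignoring usage counts
    and pins.

```python
def clock_sweep_victims(next_victim_counter, nbuffers, ticks):
    """Buffer ids visited by `ticks` clock-hand advances, each taken modulo nbuffers."""
    return [(next_victim_counter + i) % nbuffers for i in range(ticks)]
```

    However long the hand runs, every id it produces is below `nbuffers`, so
    buffers placed above the current NBuffers stay invisible to the sweep until
    that value changes.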
    
    
    
    
  97. Re: Changing shared_buffers without restart

    Andres Freund <andres@anarazel.de> — 2025-07-14T13:14:26Z

    Hi,
    
    On 2025-07-14 15:08:28 +0200, Dmitry Dolgov wrote:
    > > On Mon, Jul 14, 2025 at 08:56:56AM -0400, Andres Freund wrote:
    > > > Ah, I see what you mean folks. But I'm talking here only about buffers
    > > > which will be allocated after extending shared memory -- they must go
    > > > through the freelist first (I don't see why not, any other options?),
    > > > and clock sweep will have a chance to pick them up only afterwards. That
    > > > makes the freelist sort of an entry point for those buffers.
    > >
    > > Clock sweep can find any buffer, independent of whether it's on the freelist.
    > 
    > It does the search based on nextVictimBuffer, where the actual buffer
    > will be a modulo of NBuffers, right? If that's correct and I get
    > everything else right, that would mean as long as NBuffers stays the
    > same (which is the case for the purposes of the current discussion) new
    > buffers, allocated on top of NBuffers after shared memory increase, will
    > not be picked by the clock sweep.
    
    Are you telling me that you'd put "new" buffers onto the freelist, before you
    increase NBuffers? That doesn't make sense.
    
    Orthogonally - there's discussion about simply removing the freelist.
    
    Greetings,
    
    Andres Freund
    
    
    
    
  98. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-07-14T13:20:03Z

    > On Mon, Jul 14, 2025 at 09:14:26AM -0400, Andres Freund wrote:
    > > > Clock sweep can find any buffer, independent of whether it's on the freelist.
    > >
    > > It does the search based on nextVictimBuffer, where the actual buffer
    > > will be a modulo of NBuffers, right? If that's correct and I get
    > > everything else right, that would mean as long as NBuffers stays the
    > > same (which is the case for the purposes of the current discussion) new
    > > buffers, allocated on top of NBuffers after shared memory increase, will
    > > not be picked by the clock sweep.
    >
    > Are you telling me that you'd put "new" buffers onto the freelist, before you
    > increase NBuffers? That doesn't make sense.
    
    Why?
    
    > Orthogonally - there's discussion about simply removing the freelist.
    
    Good to know, will take a look at that thread, thanks.
    
    
    
    
  99. Re: Changing shared_buffers without restart

    Andres Freund <andres@anarazel.de> — 2025-07-14T13:42:46Z

    Hi,
    
    On 2025-07-14 15:20:03 +0200, Dmitry Dolgov wrote:
    > > On Mon, Jul 14, 2025 at 09:14:26AM -0400, Andres Freund wrote:
    > > > > Clock sweep can find any buffer, independent of whether it's on the freelist.
    > > >
    > > > It does the search based on nextVictimBuffer, where the actual buffer
    > > > will be a modulo of NBuffers, right? If that's correct and I get
    > > > everything else right, that would mean as long as NBuffers stays the
    > > > same (which is the case for the purposes of the current discussion) new
    > > > buffers, allocated on top of NBuffers after shared memory increase, will
    > > > not be picked by the clock sweep.
    > >
    > > Are you telling me that you'd put "new" buffers onto the freelist, before you
    > > increase NBuffers? That doesn't make sense.
    > 
    > Why?
    
    I think it basically boils down to "That's not how it's supposed to work".
    
    If you have buffers that are not in the clock sweep they'll get unfairly high
    usage counts, as their usecount won't be decremented by the clock
    sweep, resulting in those buffers potentially being overly sticky after the
    s_b resize completed.
    
    It breaks the entirely reasonable check to verify that a buffer returned by
    StrategyGetBuffer() is within the buffer pool.
    
    Obviously, if we remove the freelist, not having the clock sweep find the
    buffer would mean it's unreachable.
    
    What on earth would be the point of putting a buffer on the freelist but not
    make it reachable by the clock sweep? To me that's just nonsensical.
    
    Greetings,
    
    Andres Freund
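
    A toy model of the reachability argument above (not part of the original
    message and not PostgreSQL code). It assumes a hypothetical fixed pool
    `toy_pool` and a hand `toy_next_victim` that stands in for
    nextVictimBuffer. Because the hand is taken modulo the current buffer
    count, headers beyond that count are never visited and their usage counts
    are never decremented.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_POOL_CAPACITY 8     /* headers that physically exist (hypothetical) */

typedef struct ToyBuffer
{
    int usage_count;            /* decremented each time the hand passes */
} ToyBuffer;

static ToyBuffer toy_pool[TOY_POOL_CAPACITY];
static uint32_t  toy_next_victim = 0;   /* stands in for nextVictimBuffer */

/*
 * Advance the clock hand until a buffer with usage_count == 0 is found.
 * Only indexes [0, nbuffers) are ever visited, because the hand is taken
 * modulo nbuffers.
 */
static int
toy_clock_sweep(int nbuffers)
{
    assert(nbuffers > 0 && nbuffers <= TOY_POOL_CAPACITY);

    for (;;)
    {
        int id = (int) (toy_next_victim++ % (uint32_t) nbuffers);

        if (toy_pool[id].usage_count == 0)
            return id;
        toy_pool[id].usage_count--;
    }
}
```

    With eight headers present but a count of four, repeated sweeps only ever
    touch the first four headers; the other four keep whatever usage count
    they had, which is the "overly sticky" effect described above.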
    
    
    
    
  100. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-07-14T14:01:50Z

    > On Mon, Jul 14, 2025 at 09:42:46AM -0400, Andres Freund wrote:
    > What on earth would be the point of putting a buffer on the freelist but not
    > make it reachable by the clock sweep? To me that's just nonsensical.
    
    To clarify, we're not talking about this scenario as "that's how it
    would work after the resize". The point is that to expand shared buffers
    they need to be initialized, included into the whole buffer machinery
    (freelist, clock sweep, etc.) and NBuffers has to be updated. Those
    steps are separated in time, and I'm currently trying to understand what
    are the consequences of performing them in different order and whether
    there are possible concurrency issues under various scenarios. Does this
    make more sense, or still not?
    
    
    
    
  101. Re: Changing shared_buffers without restart

    Greg Burd <greg@burd.me> — 2025-07-14T14:22:17Z

    > On Jul 14, 2025, at 10:01 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    > 
    >> On Mon, Jul 14, 2025 at 09:42:46AM -0400, Andres Freund wrote:
    >> What on earth would be the point of putting a buffer on the freelist but not
    >> make it reachable by the clock sweep? To me that's just nonsensical.
    > 
    > To clarify, we're not talking about this scenario as "that's how it
    > would work after the resize". The point is that to expand shared buffers
    > they need to be initialized, included into the whole buffer machinery
    > (freelist, clock sweep, etc.) and NBuffers has to be updated. Those
    > steps are separated in time, and I'm currently trying to understand what
    > are the consequences of performing them in different order and whether
    > there are possible concurrency issues under various scenarios. Does this
    > make more sense, or still not?
    
    Hello, first off thanks for working on the intricate issues related to resizing
    shared_buffers.
    
    Second, I'm new to this code, so take that into account, but I'm the person trying
    to remove the freelist entirely [1] so I have reviewed this code recently.
    
    I'd initialize them, expand BufferDescriptors, and adjust NBuffers.  The
    clock-sweep algorithm will eventually find them and make use of them.  The
    buf->freeNext should be FREENEXT_NOT_IN_LIST so that StrategyFreeBuffer() will
    do the work required to append it to the freelist after use.  AFAICT there is no
    need to add to the freelist up front.
    
    
    best.
    
    -greg
    
    [1] https://postgr.es/m/flat/E2D6FCDC-BE98-4F95-B45E-699C3E17BA10%40burd.me
    
    
    
    
    
  102. Re: Changing shared_buffers without restart

    Andres Freund <andres@anarazel.de> — 2025-07-14T14:23:23Z

    Hi,
    
    On 2025-07-14 16:01:50 +0200, Dmitry Dolgov wrote:
    > > On Mon, Jul 14, 2025 at 09:42:46AM -0400, Andres Freund wrote:
    > > What on earth would be the point of putting a buffer on the freelist but not
    > > make it reachable by the clock sweep? To me that's just nonsensical.
    > 
    > To clarify, we're not talking about this scenario as "that's how it
    > would work after the resize". The point is that to expand shared buffers
    > they need to be initialized, included into the whole buffer machinery
    > (freelist, clock sweep, etc.) and NBuffers has to be updated.
    
    It seems pretty obvious to me that the order has to be
    
    1) initialize buffer headers
    2) update NBuffers
    3) put them onto the freelist
    
    (with 3) hopefully becoming obsolete)
    
    
    > Those steps are separated in time, and I'm currently trying to understand
    > what are the consequences of performing them in different order and whether
    > there are possible concurrency issues under various scenarios. Does this
    > make more sense, or still not?
    
    I still don't understand why it'd ever make sense to put a buffer onto the
    freelist before updating NBuffers first.
    
    Greetings,
    
    Andres Freund
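
    The ordering listed above can be sketched as follows (not part of the
    original message and not PostgreSQL code). All names here, such as
    `toy_headers`, `toy_nbuffers` and `toy_expand`, are hypothetical. The
    sketch also assumes a single shared counter published with C11
    release/acquire ordering, whereas the thread discusses a per-process
    NBuffers, so it illustrates only the "initialize first, publish second"
    part.

```c
#include <assert.h>
#include <stdatomic.h>

#define TOY_MAX_BUFFERS 16      /* hypothetical upper bound on the pool */

typedef struct ToyHeader
{
    int initialized;
    int usage_count;
} ToyHeader;

static ToyHeader  toy_headers[TOY_MAX_BUFFERS];
static atomic_int toy_nbuffers; /* stands in for a shared buffer count */

static void
toy_expand(int new_nbuffers)
{
    int old = atomic_load_explicit(&toy_nbuffers, memory_order_relaxed);

    assert(new_nbuffers >= old && new_nbuffers <= TOY_MAX_BUFFERS);

    /* 1) initialize buffer headers while nobody can reach them yet */
    for (int i = old; i < new_nbuffers; i++)
    {
        toy_headers[i].usage_count = 0;
        toy_headers[i].initialized = 1;
    }

    /* 2) publish the new count; release pairs with the acquire below */
    atomic_store_explicit(&toy_nbuffers, new_nbuffers, memory_order_release);

    /* 3) putting the buffers onto a freelist would happen here (omitted) */
}

/* A reader that sees the new count also sees the initialized headers. */
static int
toy_buffer_is_usable(int id)
{
    int n = atomic_load_explicit(&toy_nbuffers, memory_order_acquire);

    return id >= 0 && id < n && toy_headers[id].initialized;
}
```

    With this order, a range check against the published count never accepts
    an id whose header has not been initialized yet.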
    
    
    
    
  103. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-07-14T14:39:33Z

    > On Mon, Jul 14, 2025 at 10:23:23AM -0400, Andres Freund wrote:
    > > Those steps are separated in time, and I'm currently trying to understand
    > > what are the consequences of performing them in different order and whether
    > > there are possible concurrency issues under various scenarios. Does this
    > > make more sense, or still not?
    >
    > I still don't understand why it'd ever make sense to put a buffer onto the
    > freelist before updating NBuffers first.
    
    Depending on how NBuffers is updated, different backends may have
    different value of NBuffers for a short time frame. In that case a
    scenario I'm trying to address is when one backend with the new NBuffers
    value allocates a new buffer and puts it into the buffer lookup table,
    where it could become reachable by another backend, which still has the
    old NBuffers value. Correct me if I'm wrong, but initializing buffer
    headers + updating NBuffers means clock sweep can now return one of
    those new buffers, opening the scenario above, right?
    
    
    
    
  104. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-07-14T14:43:22Z

    > On Mon, Jul 14, 2025 at 10:22:17AM -0400, Burd, Greg wrote:
    > I'd initialize them, expand BufferDescriptors, and adjust NBuffers.  The
    > clock-sweep algorithm will eventually find them and make use of them.  The
    > buf->freeNext should be FREENEXT_NOT_IN_LIST so that StrategyFreeBuffer() will
    > do the work required to append it to the freelist after use.  AFAICT there is no
    > need to add to the freelist up front.
    
    Yep, thanks. I think this approach may lead to a problem I'm trying to
    address with the buffer lookup table (just have described it in the
    message above). But if I'm wrong, that of course would be the way to go.
    
    
    
    
  105. RE: Changing shared_buffers without restart

    Jack Ng <jack.ng@huawei.com> — 2025-07-14T15:10:50Z

    If I understand correctly, putting a new buffer in the freelist before updating NBuffers could break existing logic that calls BufferIsValid(bufnum) and asserts bufnum <= NBuffers? (since a backend can grab the new buffer and check its validity before the coordinator can add it to the freelist.)
    
    But it seems updating NBuffers before adding new elements to the freelist could be problematic too?  Like if a new buffer is already chosen as a victim and then the coordinator adds it to the freelist, would that lead to "double-use"? (seems possible at least with current logic and serialization in StrategyGetBuffer). If that's a valid concern, would something like this work?
    
    1) initialize buffer headers, with a new state/flag to indicate "add-pending"
    2) update NBuffers
      -- add a check in clock-sweep logic for "add-pending" and skip them
    3) put them onto the freelist
    4) when a new element is grabbed from freelist, check for and reset add-pending flag. 
    
    This ensures the new element is always obtained from the freelist first, I think.
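
    A single-threaded toy version of the four steps above (not PostgreSQL
    code). Every name is hypothetical, including `ToyDesc`, `add_pending`,
    `toy_add_buffers`, `toy_clock_sweep` and `toy_get_buffer`, and the
    locking that a real implementation would need is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_CAPACITY     8
#define TOY_FREELIST_END (-1)

typedef struct ToyDesc
{
    bool add_pending;   /* hypothetical "add-pending" flag */
    int  usage_count;
    int  free_next;     /* next id on the freelist, or TOY_FREELIST_END */
} ToyDesc;

static ToyDesc  toy_descs[TOY_CAPACITY];
static int      toy_nbuffers;
static int      toy_first_free = TOY_FREELIST_END;
static unsigned toy_hand;

/* Steps 1-3: init with add-pending, raise the count, then link the freelist. */
static void
toy_add_buffers(int new_nbuffers)
{
    int old = toy_nbuffers;

    assert(new_nbuffers >= old && new_nbuffers <= TOY_CAPACITY);

    for (int i = old; i < new_nbuffers; i++)        /* step 1 */
    {
        toy_descs[i].add_pending = true;
        toy_descs[i].usage_count = 0;
        toy_descs[i].free_next = TOY_FREELIST_END;
    }
    toy_nbuffers = new_nbuffers;                    /* step 2 */
    for (int i = new_nbuffers - 1; i >= old; i--)   /* step 3, lowest id at head */
    {
        toy_descs[i].free_next = toy_first_free;
        toy_first_free = i;
    }
}

/* Clock sweep that skips add-pending buffers; needs one non-pending buffer. */
static int
toy_clock_sweep(void)
{
    for (;;)
    {
        int id = (int) (toy_hand++ % (unsigned) toy_nbuffers);

        if (toy_descs[id].add_pending)
            continue;
        if (toy_descs[id].usage_count == 0)
            return id;
        toy_descs[id].usage_count--;
    }
}

/* Step 4: freelist first, clearing add-pending; otherwise fall back to sweep. */
static int
toy_get_buffer(void)
{
    if (toy_first_free != TOY_FREELIST_END)
    {
        int id = toy_first_free;

        toy_first_free = toy_descs[id].free_next;
        toy_descs[id].free_next = TOY_FREELIST_END;
        toy_descs[id].add_pending = false;
        return id;
    }
    return toy_clock_sweep();
}
```

    In this model a newly added buffer can only be handed out through the
    freelist path, so the sweep cannot pick it while it is still being linked.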
    
    Jack
    
    >-----Original Message-----
    >From: Andres Freund <andres@anarazel.de>
    >Sent: Monday, July 14, 2025 10:23 AM
    >To: Dmitry Dolgov <9erthalion6@gmail.com>
    >Cc: Thom Brown <thom@linux.com>; Ashutosh Bapat
    ><ashutosh.bapat.oss@gmail.com>; Tomas Vondra <tomas@vondra.me>;
    >Thomas Munro <thomas.munro@gmail.com>; PostgreSQL-development <pgsql-
    >hackers@postgresql.org>; Jack Ng <Jack.Ng@huawei.com>; Ni Ku
    ><jakkuniku@gmail.com>
    >Subject: Re: Changing shared_buffers without restart
    >
    >Hi,
    >
    >On 2025-07-14 16:01:50 +0200, Dmitry Dolgov wrote:
    >> > On Mon, Jul 14, 2025 at 09:42:46AM -0400, Andres Freund wrote:
    >> > What on earth would be the point of putting a buffer on the freelist
    >> > but not make it reachable by the clock sweep? To me that's just nonsensical.
    >>
    >> To clarify, we're not talking about this scenario as "that's how it
    >> would work after the resize". The point is that to expand shared
    >> buffers they need to be initialized, included into the whole buffer
    >> machinery (freelist, clock sweep, etc.) and NBuffers has to be updated.
    >
    >It seems pretty obvious to me that the order has to be
    >
    >1) initialize buffer headers
    >2) update NBuffers
    >3) put them onto the freelist
    >
    >(with 3) hopefully becoming obsolete)
    >
    >
    >> Those steps are separated in time, and I'm currently trying to
    >> understand what are the consequences of performing them in different
    >> order and whether there are possible concurrency issues under various
    >> scenarios. Does this make more sense, or still not?
    >
    >I still don't understand why it'd ever make sense to put a buffer onto the freelist
    >before updating NBuffers first.
    >
    >Greetings,
    >
    >Andres Freund
    
    
    
    
  106. Re: Changing shared_buffers without restart

    Andres Freund <andres@anarazel.de> — 2025-07-14T15:11:36Z

    Hi, 
    
    On July 14, 2025 10:39:33 AM EDT, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    >> On Mon, Jul 14, 2025 at 10:23:23AM -0400, Andres Freund wrote:
    >> > Those steps are separated in time, and I'm currently trying to understand
    >> > what are the consequences of performing them in different order and whether
    >> > there are possible concurrency issues under various scenarios. Does this
    >> > make more sense, or still not?
    >>
    >> I still don't understand why it'd ever make sense to put a buffer onto the
    >> freelist before updating NBuffers first.
    >
    >Depending on how NBuffers is updated, different backends may have
    >different value of NBuffers for a short time frame. In that case a
    >scenario I'm trying to address is when one backend with the new NBuffers
    >value allocates a new buffer and puts it into the buffer lookup table,
    >where it could become reachable by another backend, which still has the
    >old NBuffers value. Correct me if I'm wrong, but initializing buffer
    >headers + updating NBuffers means clock sweep can now return one of
    >those new buffers, opening the scenario above, right?
    
    The same is true if you put buffers into the freelist. 
    
    Andres
    -- 
    Sent from my Android device with K-9 Mail. Please excuse my brevity.
    
    
    
    
  107. RE: Changing shared_buffers without restart

    Jack Ng <jack.ng@huawei.com> — 2025-07-14T15:18:10Z

    Just brain-storming here... would moving NBuffers to shared memory solve this specific issue? Though I'm pretty sure that would open up a new set of synchronization issues elsewhere, so I'm not sure if there's a net gain.
    
    Jack
    
    >-----Original Message-----
    >From: Andres Freund <andres@anarazel.de>
    >Sent: Monday, July 14, 2025 11:12 AM
    >To: Dmitry Dolgov <9erthalion6@gmail.com>
    >Cc: Thom Brown <thom@linux.com>; Ashutosh Bapat
    ><ashutosh.bapat.oss@gmail.com>; Tomas Vondra <tomas@vondra.me>;
    >Thomas Munro <thomas.munro@gmail.com>; PostgreSQL-development <pgsql-
    >hackers@postgresql.org>; Jack Ng <Jack.Ng@huawei.com>; Ni Ku
    ><jakkuniku@gmail.com>
    >Subject: Re: Changing shared_buffers without restart
    >
    >Hi,
    >
    >On July 14, 2025 10:39:33 AM EDT, Dmitry Dolgov <9erthalion6@gmail.com>
    >wrote:
    >>> On Mon, Jul 14, 2025 at 10:23:23AM -0400, Andres Freund wrote:
    >>> > Those steps are separated in time, and I'm currently trying to
    >>> > understand what are the consequences of performing them in
    >>> > different order and whether there are possible concurrency issues
    >>> > under various scenarios. Does this make more sense, or still not?
    >>>
    >>> I still don't understand why it'd ever make sense to put a buffer
    >>> onto the freelist before updating NBuffers first.
    >>
    >>Depending on how NBuffers is updated, different backends may have
    >>different value of NBuffers for a short time frame. In that case a
    >>scenario I'm trying to address is when one backend with the new
    >>NBuffers value allocates a new buffer and puts it into the buffer
    >>lookup table, where it could become reachable by another backend, which
    >>still has the old NBuffers value. Correct me if I'm wrong, but
    >>initializing buffer headers + updating NBuffers means clock sweep can
    >>now return one of those new buffers, opening the scenario above, right?
    >
    >The same is true if you put buffers into the freelist.
    >
    >Andres
    >--
    >Sent from my Android device with K-9 Mail. Please excuse my brevity.
    
  108. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-07-14T15:35:23Z

    > On Mon, Jul 14, 2025 at 11:11:36AM -0400, Andres Freund wrote:
    > Hi,
    >
    > On July 14, 2025 10:39:33 AM EDT, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    > >> On Mon, Jul 14, 2025 at 10:23:23AM -0400, Andres Freund wrote:
    > >> > Those steps are separated in time, and I'm currently trying to understand
    > >> > what are the consequences of performing them in different order and whether
    > >> > there are possible concurrency issues under various scenarios. Does this
    > >> > make more sense, or still not?
    > >>
    > >> I still don't understand why it'd ever make sense to put a buffer onto the
    > >> freelist before updating NBuffers first.
    > >
    > >Depending on how NBuffers is updated, different backends may have
    > >different value of NBuffers for a short time frame. In that case a
    > >scenario I'm trying to address is when one backend with the new NBuffers
    > >value allocates a new buffer and puts it into the buffer lookup table,
    > >where it could become reachable by another backend, which still has the
    > >old NBuffers value. Correct me if I'm wrong, but initializing buffer
    > >headers + updating NBuffers means clock sweep can now return one of
    > >those new buffers, opening the scenario above, right?
    >
    > The same is true if you put buffers into the freelist.
    
    Yep, but the question about clock sweep still stays. Anyway, thanks for
    the input, let me digest it and come up with more questions & patch
    series.
    
    
    
    
  109. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-07-14T16:32:35Z

    > On Mon, Jul 14, 2025 at 03:18:10PM +0000, Jack Ng wrote:
    > Just brain-storming here... would moving NBuffers to shared memory solve this specific issue? Though I'm pretty sure that would open up a new set of synchronization issues elsewhere, so I'm not sure if there's a net gain.
    
    It's in fact already happening, there is a shared structure that
    describes the resize status. But if I get everything right, it doesn't
    solve all the problems.
    
    
    
    
  110. Re: Changing shared_buffers without restart

    Jim Nasby <jnasby@upgrade.com> — 2025-07-14T22:55:13Z

    On Fri, Jul 4, 2025 at 9:42 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    
    > > On Fri, Jul 04, 2025 at 02:06:16AM +0200, Tomas Vondra wrote:
    >
    
    ...
    
    
    > > 10) what to do about stuck resize?
    > >
    > > AFAICS the resize can get stuck for various reasons, e.g. because it
    > > can't evict pinned buffers, possibly indefinitely. Not great, it's not
    > > clear to me if there's a way out (canceling the resize) after a timeout,
    > > or something like that? Not great to start an "online resize" only to
    > > get stuck with all activity blocked for indefinite amount of time, and
    > > get to restart anyway.
    > >
    > > Seems related to Thomas' message [2], but AFAICS the patch does not do
    > > anything about this yet, right? What's the plan here?
    >
    > It's another open discussion right now, with an idea to eventually allow
    > canceling after a timeout. I think canceling when stuck on buffer
    > eviction should be pretty straightforward (the eviction must take place
    > before actual shared memory resize, so we know nothing has changed yet),
    > but in some other failure scenarios it would be harder (e.g. if one
    > backend is stuck resizing, while other have succeeded -- this would
    > require another round of synchronization and some way to figure out what
    > is the current status).
    
    
    From a user standpoint, I would expect any kind of resize like this to be
    an online operation that happens in the background. If this is driven by a
    GUC I don't see how it could be anything else, but if something else is
    decided on I think it'd just be a pain to require a session to stay connected
    until a resize was complete. (Of course we'd need to provide some means of
    monitoring a resize that was in-process, perhaps via a pg_stat_progress
    view or a system function.)
    
    Also, while I haven't fully followed discussion about how to synchronize
    backends, I will say that I don't think it's at all unreasonable if a
    resize doesn't take full effect until every backend has at minimum ended
    any running transaction, or potentially even returned back to the
    equivalent of `PostgresMain()` for that type of backend. Obviously it'd be
    nicer to be more responsive than that, but I don't think the first version
    of the feature has to accomplish that.
    
    For that matter, I also feel it'd be fine if the first version didn't even
    support shrinking shared buffers.
    
    Finally, while shared buffers is the most visible target here, there are
    other shared memory settings that have a *much* smaller surface area, and
    in my experience are going to be much more valuable from a tuning
    perspective; notably wal_buffers and the MXID SLRUs (and possibly CLOG and
    subtrans). I say that because unless you're running a workload that
    entirely fits in shared buffers, or a *really* small shared buffers
    compared to system memory, increasing shared buffers quickly gets into
    diminishing returns. But since the default size for the other fixed sized
    areas is so much smaller than normal values for shared_buffers, increasing
    those areas can have a much, much larger impact on performance. (Especially
    for something like the MXID SLRUs.) I would certainly consider focusing on
    one of those areas before trying to tackle shared buffers.
    
  111. RE: Changing shared_buffers without restart

    Jack Ng <jack.ng@huawei.com> — 2025-07-15T22:52:01Z

    >> On Mon, Jul 14, 2025 at 03:18:10PM +0000, Jack Ng wrote:
    >> Just brain-storming here... would moving NBuffers to shared memory solve
    >this specific issue? Though I'm pretty sure that would open up a new set of
    >synchronization issues elsewhere, so I'm not sure if there's a net gain.
    >
    >It's in fact already happening, there is a shared structure that describes the
    >resize status. But if I get everything right, it doesn't solve all the problems.
    
    Hi Dmitry, 
    
    Just to clarify, you're not only referring to the ShmemControl::NSharedBuffers
    and related logic in the current patches, but actually getting rid of per-process
    NBuffers completely and use ShmemControl::NSharedBuffers everywhere instead (or
    something along those lines)? So that when the coordinator updates
    ShmemControl::NSharedBuffers, everyone sees the new value right away.
    I guess this is part of the "simplified design" you mentioned several posts earlier?
    
    I also thought about that approach more, and there seem to be new synchronization
    issues we would need to deal with, like:
    
    1. Mid-execution change of NBuffers in functions like BufferSync and BgBufferSync,
    which could cause correctness and performance issues. I suppose most of them
    are solvable with atomics and shared r/w locks etc, but at the cost of higher
    performance overheads.
    
    2. NBuffers becomes inconsistent with the underlying shared memory mappings for a
    period of time for each process. Currently both are updated in AnonymousShmemResize
    and AdjustShmemSize "atomically" for a process, so I wonder if letting them get
    out-of-sync (even for a brief period) could be problematic.
    
    I agree it doesn't seem to solve all the problems. It can simplify certain aspects
    of the design, but may also introduce new issues. Overall not a "silver bullet" :)
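
    One conceivable way to limit the mid-execution change mentioned in point 1
    is to read the shared count once per pass and use that snapshot as the
    loop bound. The sketch below is an assumption for illustration only (not
    from the patches, not PostgreSQL code); `toy_shared_nbuffers`,
    `toy_dirty` and `toy_count_dirty` are hypothetical names, and it does
    nothing about point 2.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_CAPACITY 8

static atomic_int toy_shared_nbuffers;      /* hypothetical shared buffer count */
static bool       toy_dirty[TOY_CAPACITY];  /* hypothetical per-buffer dirty flags */

/*
 * Count dirty buffers in one pass.  The shared count is read exactly once,
 * so the loop bound cannot change while the pass is running; a concurrent
 * resize is only noticed by the next pass.
 */
static int
toy_count_dirty(void)
{
    const int nbuffers = atomic_load(&toy_shared_nbuffers);
    int       ndirty = 0;

    assert(nbuffers >= 0 && nbuffers <= TOY_CAPACITY);

    for (int i = 0; i < nbuffers; i++)
    {
        if (toy_dirty[i])
            ndirty++;
    }
    return ndirty;
}
```

    The cost is one atomic load per pass rather than one per iteration, at the
    price of working with a slightly stale bound.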
    
    Jack
    
    
    
    
  112. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-07-16T14:48:06Z

    > On Tue, Jul 15, 2025 at 10:52:01PM +0000, Jack Ng wrote:
    > >> On Mon, Jul 14, 2025 at 03:18:10PM +0000, Jack Ng wrote:
    > >> Just brain-storming here... would moving NBuffers to shared memory solve
    > >this specific issue? Though I'm pretty sure that would open up a new set of
    > >synchronization issues elsewhere, so I'm not sure if there's a net gain.
    > >
    > >It's in fact already happening, there is a shared structure that describes the
    > >resize status. But if I get everything right, it doesn't solve all the problems.
    >
    > Just to clarify, you're not only referring to the ShmemControl::NSharedBuffers
    > and related logic in the current patches, but actually getting rid of per-process
    > NBuffers completely and use ShmemControl::NSharedBuffers everywhere instead (or
    > something along those lines)? So that when the coordinator updates
    > ShmemControl::NSharedBuffers, everyone sees the new value right away.
    > I guess this is part of the "simplified design" you mentioned several posts earlier?
    
    I was thinking more about something like NBuffersAvailable, which would
    control how victim buffers are getting picked, but there is a spectrum
    of different options to experiment with.
    
    > I also thought about that approach more, and there seem to be new synchronization
    > issues we would need to deal with, like:
    
    Potentially tricky change of NBuffers already happens in the current
    patch set, e.g. NBuffers is getting updated in ProcessProcSignalBarrier,
    which is called at the end of BufferSync loop iteration. By itself I
    don't see any obvious problems here except remembering buffer id in
    CkptBufferIds (I've mentioned this a few messages above).
    
    
    
    
  113. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-07-16T14:52:46Z

    > On Mon, Jul 14, 2025 at 05:55:13PM -0500, Jim Nasby wrote:
    >
    > Finally, while shared buffers is the most visible target here, there are
    > other shared memory settings that have a *much* smaller surface area, and
    > in my experience are going to be much more valuable from a tuning
    > perspective; notably wal_buffers and the MXID SLRUs (and possibly CLOG and
    > subtrans). I say that because unless you're running a workload that
    > entirely fits in shared buffers, or a *really* small shared buffers
    > compared to system memory, increasing shared buffers quickly gets into
    > diminishing returns. But since the default size for the other fixed sized
    > areas is so much smaller than normal values for shared_buffers, increasing
    > those areas can have a much, much larger impact on performance. (Especially
    > for something like the MXID SLRUs.) I would certainly consider focusing on
    > one of those areas before trying to tackle shared buffers.
    
    That's an interesting idea, thanks for sharing. The reason I'm
    concentrating on shared buffers is that it was frequently called out as
    a problem when trying to tune PostgreSQL automatically. In this context
    shared buffers is usually one of the most impactful knobs, yet one of
    the most painful to manage as well. But if the amount of complexity
    around resizable shared buffers proves insurmountable, yeah, it
    would make sense to consider simpler targets using the same mechanism.
    
    
    
    
  114. Re: Changing shared_buffers without restart

    Andres Freund <andres@anarazel.de> — 2025-07-16T15:44:16Z

    Hi,
    
    On 2025-07-14 17:55:13 -0500, Jim Nasby wrote:
    > I say that because unless you're running a workload that entirely fits in
    > shared buffers, or a *really* small shared buffers compared to system
    > memory, increasing shared buffers quickly gets into diminishing returns.
    
    I don't think that's true, at all, today. And it certainly won't be true in a
    world where we will be able to use direct_io for real workloads.
    
    Particularly for write heavy workloads, the difference between a small buffer
    pool and a large one can be *dramatic*, because the large buffer pool allows
    most writes to be done by checkpointer (and thus largely sequentially) rather
    than by backends and bgwriter (and thus largely randomly). Doing more writes
    sequentially helps with short-term performance, but *particularly* helps with
    sustained performance on SSDs. A larger buffer pool also reduces the *total*
    number of writes dramatically, because the same buffer will often be dirtied
    repeatedly within one checkpoint window.
    
    
    r/w pgbench is a workload that *undersells* the benefit of a larger
    shared_buffers, as each transaction is uncommonly small, making WAL flushes
    much more of a bottleneck (the access pattern is too uniform, too). But even
    for that the difference can be massive:
    
    
    A scale 500 pgbench with 48 clients:
    s_b= 512MB:
         averages 390MB/s of writes in steady state
         average TPS: 25072
    s_b=8192MB:
         averages  48MB/s of writes in steady state
         average TPS: 47901
    Nearly an order of magnitude difference in writes and nearly a 2x difference
    in TPS.
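
    The ratios behind that summary can be recomputed from the four figures
    quoted above. The sketch below is illustrative only; the struct and
    function names are hypothetical, and the per-transaction figure assumes
    1 MB = 1000 kB.

```c
#include <assert.h>

typedef struct ToyRun
{
    double shared_buffers_mb;   /* s_b setting */
    double write_mb_per_sec;    /* steady-state write rate */
    double tps;                 /* average transactions per second */
} ToyRun;

/* Figures quoted in the message above. */
static const ToyRun toy_small = {512.0, 390.0, 25072.0};
static const ToyRun toy_large = {8192.0, 48.0, 47901.0};

/* How many times more data the small pool writes per second. */
static double
toy_write_ratio(const ToyRun *small, const ToyRun *large)
{
    return small->write_mb_per_sec / large->write_mb_per_sec;
}

/* How many times more transactions the large pool completes per second. */
static double
toy_tps_ratio(const ToyRun *small, const ToyRun *large)
{
    return large->tps / small->tps;
}

/* Data written per transaction, in kB (decimal units assumed). */
static double
toy_write_kb_per_txn(const ToyRun *run)
{
    assert(run->tps > 0.0);
    return run->write_mb_per_sec * 1000.0 / run->tps;
}
```

    The write rate differs by a factor of a little over 8 and TPS by a factor
    of about 1.9. Dividing one by the other, the larger pool writes roughly
    one fifteenth as much data per transaction.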
    
    
    25%, the advice we give for shared_buffers, is literally close to the worst
    possible value. The only thing it maximizes is double buffering, while
    removing useful information about what to cache for how long from both
    postgres and the OS, leading to reduced cache hit rates.
    
    
    > But since the default size for the other fixed sized areas is so much
    > smaller than normal values for shared_buffers, increasing those areas can
    > have a much, much larger impact on performance. (Especially for something
    > like the MXID SLRUs.) I would certainly consider focusing on one of those
    > areas before trying to tackle shared buffers.
    
    I think that'd be a bad idea. There's simply no point in having the complexity
    in place to allow for dynamically resizing a few megabytes of buffers. You
    just configure them large enough (including probably increasing some of the
    defaults one of these years). Whereas you can't just do that for
    shared_buffers, as we're talking real memory. Ahead of time you do not know
    how much memory backends themselves need and the amount of memory in the
    system may change.
    
    Resizing shared_buffers is particularly important because it's becoming more
    important to be able to dynamically increase/decrease the resources of a
    running postgres instance to adjust for system load. Memory and CPUs can be
    hot added/removed from VMs, but we need to utilize them...
    
    Greetings,
    
    Andres Freund
    
    
    
    
  115. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-09-18T04:47:15Z

    Hi Tomas,
    Thanks for your detailed feedback. Sorry for replying late.
    
    On Fri, Jul 4, 2025 at 5:36 AM Tomas Vondra <tomas@vondra.me> wrote:
    >
    > v5-0008-Support-shrinking-shared-buffers.patch
    >
    > 1) Why is ShmemCtrl->evictor_pid reset in AnonymousShmemResize? Isn't
    >    there a place starting it and waiting for it to complete? Why
    >    shouldn't it do EvictExtraBuffers itself?
    >
    > 3) Seems a bit strange to do it from a random backend. Shouldn't it
    >    be the responsibility of a process like checkpointer/bgwriter, or
    >    maybe a dedicated dynamic bgworker? Can we even rely on a backend
    >    to be available?
    
    I will answer these two together. I don't think we should rely on a
    random backend. But that's what the rest of the patches did and
    patches to support shrinking followed them. But AFAIK, Dmitry is
    working on a set of changes which will make a non-postmaster backend
    the coordinator for the buffer pool resizing process. When that
    happens the same backend which initializes the expanded memory when
    expanding the buffer pool should also be responsible for evicting the
    buffers when shrinking the buffer pool. Will wait for Dmitry's next
    set of patches before making this change.
    
    >
    > 4) Unsolved issues with buffers pinned for a long time. Could be an
    >    issue if the buffer is pinned indefinitely (e.g. cursor in idle
    >    connection), and the resizing blocks some activity (new connections
    >    or stuff like that).
    
    In such cases we should cancel the operation or kill that backend (per
    user preference) after a (user-specified) timeout >= 0.
    We haven't yet figured out the details. I think the first version of
    the feature would just cancel the operation, if it encounters a pinned
    buffer.
    
    > 2) Isn't the change to BufferManagerShmemInit wrong? How do we know the
    >    last buffer is still at the end of the freelist? Seems unlikely.
    > 6) It's not clear to me in what situations this triggers (in the call
    >    to BufferManagerShmemInit)
    >
    >    if (FirstBufferToInit < NBuffers) ...
    >
    
    Will answer these two together. As the comment says FirstBufferToInit
    < NBuffers indicates two situations: When FirstBufferToInit = 0, it's
    the first time the buffer pool is being initialized. Otherwise it
    indicates expanding the buffer pool, in which case the last buffer
    will be a newly initialized buffer. All newly initialized buffers are
    linked into the freelist one after the other in the increasing order
    of their buffer ids by code a few lines above. Now that the free
    buffer list has been removed, we don't need to worry about it. In the
    next set of patches, I have removed this code.
    
    >
    > v5-0009-Reinitialize-StrategyControl-after-resizing-buffe.patch
    >
    > 1) IMHO this should be included in the earlier resize/shrink patches,
    >    I don't see a reason to keep it separate (assuming this is the
    >    correct way, and the "init" is not).
    
    These patches are separate just because Dmitry and I developed them
    separately. Once they are reviewed by Dmitry, we will squash them
    into a single patch. I am expecting that Dmitry's next patchset which
    will do significant changes to the synchronization will have a single
    patch for all code related to and consequential to resizing.
    
    >
    > 5) Funny that "AI suggests" something, but doesn't the block fail to
    >    reset nextVictimBuffer of the clocksweep? It may point to a buffer
    >    we're removing, and it'll be invalid, no?
    >
    
    The TODO no longer applies. There's code to reset the clocksweep in a
    separate patch. Sorry for not removing it earlier. It will be removed
    in the next set of patches.
    
    >
    > 2) Doesn't StrategyPurgeFreeList already do some of this for the case
    >    of shrinking memory?
    >
    > 3) Not great adding a bunch of static variables to bufmgr.c. Why do we
    >    need to make "everything" static global? Isn't it enough to make
    >    only the "valid" flag global? The rest can stay local, no?
    >
    >    If everything needs to be global for some reason, could we at least
    >    make it a struct, to group the fields, not just separate random
    >    variables? And maybe at the top, not half-way through the file?
    >
    > 4) Isn't the name BgBufferSyncAdjust misleading? It's not adjusting
    >    anything, it's just invalidating the info about past runs.
    
    I think there's a bit of refactoring possible here. Setting up the
    BgBufferSync state, resetting it when bgwriter_lru_maxpages <= 0 and
    then reinitializing it when bgwriter_lru_maxpages > 0, and actually
    performing the buffer sync is all packed into the same function
    BgBufferSync() right now. It makes this function harder to read. I
    think these functionalities should be separated into their own
    functions and use the appropriate one instead of BgBufferSyncAdjust(),
    whose name is misleading. The static global variables should all be
    packed into a structure which is passed as an argument to these
    functions. I need more time to study the code and refactor it that
    way. For now I have added a note to the commit message of this patch
    so that I will revisit it. I have renamed BgBufferSyncAdjust() to
    BgBufferSyncReset().
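
    A rough sketch of the refactoring direction described above, under stated
    assumptions: the field names mirror the static variables that BgBufferSync()
    keeps today as I read bufmgr.c, while the struct name and the two helper
    functions are hypothetical and do not exist in the patchset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical grouping of the state BgBufferSync() keeps between calls. */
typedef struct SketchBgBufferSyncState
{
    bool     saved_info_valid;      /* is the rest of the state usable? */
    int      prev_strategy_buf_id;
    uint32_t prev_strategy_passes;
    int      next_to_clean;
    uint32_t next_passes;
    float    smoothed_alloc;
    float    smoothed_density;
} SketchBgBufferSyncState;

/* Set up the state once; starting values are assumptions of this sketch. */
void
sketch_bg_buffer_sync_init(SketchBgBufferSyncState *state)
{
    state->saved_info_valid = false;
    state->prev_strategy_buf_id = 0;
    state->prev_strategy_passes = 0;
    state->next_to_clean = 0;
    state->next_passes = 0;
    state->smoothed_alloc = 0.0f;
    state->smoothed_density = 10.0f;
}

/*
 * Forget what was learned in past runs, e.g. after the buffer pool has been
 * resized.  Only the validity flag is cleared; the next sync run is expected
 * to recompute the other fields before using them.
 */
void
sketch_bg_buffer_sync_reset(SketchBgBufferSyncState *state)
{
    state->saved_info_valid = false;
}
```

    Passing the state explicitly would also make it possible to exercise the
    reset path in isolation, which is hard to do with function-local statics.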
    
    >
    > 5) I don't quite understand why BufferSync needs to do the dance with
    >    delay_shmem_resize.  I mean, we certainly should not run BufferSync
    >    from the code that resizes buffers, right? Certainly not after the
    >    eviction, from the part that actually rebuilds shmem structs etc.
    
    Right. But let me answer all three questions together.
    
    >    So perhaps something could trigger resize while we're running the
    >    BufferSync()? Isn't that a bit strange? If this flag is needed, it
    >    seems more like a band-aid for some issue in the architecture.
    >
    > 6) Also, why should it be fine to get into situation that some of the
    >    buffers might not be valid, during shrinking? I mean, why should
    >    this check (pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers).
    >    It seems better to ensure we never get into "sync" in a way that
    >    might lead some of the buffers invalid. Seems way too lowlevel to
    >    care about whether resize is happening.
    >
    > 7) I don't understand the new condition for "Execute the LRU scan".
    >    Won't this stop LRU scan even in cases when we want it to happen?
    >    Don't we want to scan the buffers in the remaining part (after
    >    shrinking), for example? Also, we already checked this shmem flag at
    >    the beginning of the function - sure, it could change (if some other
    >    process modifies it), but does that make sense? Wouldn't it cause
    >    problems if it can change at an arbitrary point while running the
    >    BufferSync? IMHO just another sign it may not make sense to allow
    >    this, i.e. buffer sync should not run during the "actual" resize.
    >
    
    ProcessBarrierShmemResize() which does the resizing is part of
    ProcessProcSignalBarrier() which in turn gets called from
    CHECK_FOR_INTERRUPTS(), which is called from multiple places, even
    from elog(). I am not able to find a call stack linking BgBufferSync()
    and ProcessProcSignalBarrier(). But I couldn't convince myself that it
    is true and will remain true in the future. I mean, the function loops
    through a large number of buffers and performs IO, both avenues to
    call CHECK_FOR_INTERRUPTS(). Hence that flag. Do you know what (part
    of code) guarantees that ProcessProcSignalBarrier() will never be
    called from BgBufferSync()?
    
    Note that resizing cannot begin until delay_shmem_resize is cleared, so
    while BgBufferSync is executing, no buffer can be invalidated and no
    new buffers can be added. But that comes at the cost of all other backends
    waiting until BgBufferSync finishes. We want to avoid that. The idea here
    is to make BgBufferSync stop as soon as it realises that the buffer
    resizing is "about to begin". But I think the condition looks wrong. I
    think the right condition would be NBufferPending != NBuffers or
    NBuffersOld. AFAIK, Dmitry is working on consolidating NBuffers*
    variables as you have requested elsewhere. Better even if we could
    somehow set a flag in shared memory indicating that the buffer
    resizing is "about to begin" and BgBufferSync() checks that flag. So I
    will wait for him to make that change and then change this condition.
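
    The "flag in shared memory" idea could look roughly like the sketch below.
    Everything here is hypothetical: SketchResizeCtl, its resize_pending field
    and sketch_buffer_sync() are illustration-only names, C11 atomics stand in
    for pg_atomic_*, and the announce_at parameter merely simulates another
    process raising the flag in the middle of the scan.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared-memory control block; not part of the patchset. */
typedef struct SketchResizeCtl
{
    atomic_bool resize_pending; /* set when a resize is "about to begin" */
} SketchResizeCtl;

/*
 * Scan up to nbuffers buffers, bailing out as soon as the flag is observed.
 * announce_at simulates a coordinator raising the flag right after buffer
 * announce_at has been handled (-1 means nobody raises it during the scan).
 * Returns the number of buffers actually scanned.
 */
int
sketch_buffer_sync(SketchResizeCtl *ctl, int nbuffers, int announce_at)
{
    int scanned = 0;

    for (int i = 0; i < nbuffers; i++)
    {
        /* Check before touching each buffer so the scan stops promptly. */
        if (atomic_load(&ctl->resize_pending))
            break;

        scanned++;              /* the real code would sync buffer i here */

        if (i == announce_at)
            atomic_store(&ctl->resize_pending, true);
    }
    return scanned;
}
```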
    
    >
    > v5-0010-Additional-validation-for-buffer-in-the-ring.patch
    >
    > 1) So the problem is we might create a ring before shrinking shared
    >    buffers, and then GetBufferFromRing will see bogus buffers? OK, but
    >    we should be more careful with these checks, otherwise we'll miss
    >    real issues when we incorrectly get an invalid buffer. Can't the
    >    backends do this only when they for sure know we did shrink the
    >    shared buffers? Or maybe even handle that during the barrier?
    >
    > 2) IMHO a sign there's the "transitions" between different NBuffers
    >    values may not be clear enough, and we're allowing stuff to happen
    >    in the "blurry" area. I think that's likely to cause bugs (it did
    >    cause issues for the online checksums patch, I think).
    >
    
    I think you are right that this might hide some bugs. Just like we
    remove buffers to be shrunk from the freelist only once, I wanted each
    backend to remove them from its buffer rings only once. But I couldn't find a
    way to make all the buffer rings for a given backend available to the
    barrier handling code. The rings are stored in Scan objects, which
    seem local to the executor nodes. Is there a way to make them
    available to barrier handling code (even if it has to walk an
    execution tree, let's say)?
    
    If there were only one scan, we could set a flag after
    shrinking, let GetBufferFromRing() purge all invalid buffers once when
    the flag is true, and then reset the flag. But there can be more than one scan
    happening and we don't know how many there are and when all of them
    have finished calling GetBufferFromRing() after shrinking. Suggestions
    to do this only once are welcome.
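
    For reference, the per-call validation being discussed can be sketched as
    follows. This is a simplified model and not the code from the patch:
    SketchRing and sketch_get_buffer_from_ring() are hypothetical, the ring has
    a fixed small capacity, and buffer numbers are assumed to be 1-based with 0
    meaning "no buffer", as with InvalidBuffer.

```c
#define SKETCH_INVALID_BUFFER 0     /* assumed to mirror InvalidBuffer */
#define SKETCH_RING_CAPACITY  8

/* Hypothetical, simplified stand-in for a backend-local buffer ring. */
typedef struct SketchRing
{
    int ring_size;                      /* number of slots in use */
    int current;                        /* slot handed out last */
    int buffers[SKETCH_RING_CAPACITY];  /* 1-based buffer numbers, 0 = empty */
} SketchRing;

/*
 * Advance to the next slot and return its buffer.  If the remembered buffer
 * lies beyond the current pool size (the pool was shrunk after the ring was
 * filled), forget it and report the slot as empty so that the caller falls
 * back to the regular allocation path.
 */
int
sketch_get_buffer_from_ring(SketchRing *ring, int nbuffers_now)
{
    int buf;

    if (++ring->current >= ring->ring_size)
        ring->current = 0;

    buf = ring->buffers[ring->current];
    if (buf == SKETCH_INVALID_BUFFER)
        return SKETCH_INVALID_BUFFER;

    if (buf > nbuffers_now)
    {
        /* Stale entry left over from before the shrink: purge it. */
        ring->buffers[ring->current] = SKETCH_INVALID_BUFFER;
        return SKETCH_INVALID_BUFFER;
    }
    return buf;
}
```

    The weakness raised in the review is visible here: the range check runs on
    every call and cannot tell a legitimately stale entry from a genuinely
    corrupted one.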
    
    I will send the next set of patches with my next email.
    
    
    --
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  116. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-09-18T04:55:29Z

    On Mon, Jun 16, 2025 at 6:09 PM Ashutosh Bapat
    <ashutosh.bapat.oss@gmail.com> wrote:
    >
    > >
    > > Buffer lookup table resizing
    > > ------------------------------------
    I looked at the interaction of shared buffer lookup table with buffer
    resizing as per the patches in [0]. Here's a list of my findings, issues
    and fixes.
    
    1. The basic structure of buffer lookup table (directory and control
    area etc.) is allocated in a shared memory segment dedicated to the
    buffer lookup table. However, the entries are allocated in the shared
    memory using ShmemAllocNoError() which allocates the entries in the
    main memory segment. In order for ShmemAllocNoError() to allocate
    entries in the dedicated shared memory segment, it should know the
    shared memory segment. We could do that by setting the segment number
    in element_alloc() before calling hashp->alloc(). This is similar to
    how ShmemAllocNoError() knows the memory context in which to allocate
    the entries on heap. But read on ...
    
    2. When the buffer pool is expanded, an "out of shared memory" error
    is thrown when more entries are added to the buffer lookup table. We
    could temporarily adjust that flag and allocate more entries. But the
    directory also needs to be expanded proportionately otherwise it may
    lead to more contention. Expanding directory is non-trivial since it's
    a contiguous chunk of memory, followed by other data structures.
    Further, expanding directory would require rehashing all the existing
    entries, which may impact the time taken by the resizing operation and
    how long other backends remain blocked.
    
    3. When the buffer pool is shrunk, there is no way to free the extra
    entries in such a way that a contiguous chunk of shared memory can be
    given back to the OS. If we implement it, we will need some way
    to compact the shrunk entries into a contiguous chunk of memory and unmap
    the remaining chunk. That's a significant amount of code.
    
    Given these things, I think we should set up the buffer lookup table
    to hold maximum entries required to expand the buffer pool to its
    maximum, right at the beginning. The maximum size to which buffer pool
    can grow is given by GUC max_available_memory (which is a misnomer and
    should be renamed to max_shared_buffers or something), introduced by
    previous set of patches [0]. We don't shrink or expand the buffer
    lookup table as we shrink and expand the buffer pool. With that the
    buffer lookup table can be located in the main memory segment itself
    and we don't have to fix ShmemAllocNoError().
    
    This has two side effects:
    1. A larger hash table makes hash table operations slower [2]. Its
    impact on actual queries needs to be studied.
    2. There's an increase in the total shared memory allocated upfront.
    Currently we allocate 150MB memory with all default GUC values. With
    this change we will allocate 250MB memory since max_available_memory
    (or rather max_shared_buffers) defaults to allow 524288 shared
    buffers. If we make max_shared_buffers to default to shared_buffers,
    it won't be a problem. However, when a user sets max_shared_buffers
    themselves, they have to be conscious of the fact that it will
    allocate more memory than necessary with given shared_buffers value.
    
    This fix is part of patch 0015.
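
    As a rough illustration of the sizing rule proposed above, the lookup
    table capacity would be derived from the upper bound rather than from the
    current pool size. The function below is a hypothetical sketch, not code
    from patch 0015; the extra entry per partition reflects my understanding of
    how the table is sized today, and 128 is assumed for the partition count.

```c
/* Assumed to match NUM_BUFFER_PARTITIONS; treat the value as an assumption. */
#define SKETCH_NUM_BUFFER_PARTITIONS 128

/*
 * Number of lookup-table entries to allocate at startup.  The table is sized
 * once for the largest pool the server may grow to, so it never has to be
 * expanded or shrunk when shared_buffers changes later.
 */
long
sketch_buf_table_entries(int shared_buffers, int max_shared_buffers)
{
    int cap = (max_shared_buffers > shared_buffers)
        ? max_shared_buffers
        : shared_buffers;

    return (long) cap + SKETCH_NUM_BUFFER_PARTITIONS;
}
```

    With 16384 current buffers and an upper bound of 524288 buffers, this asks
    for 524416 entries up front, which is where the extra startup memory
    mentioned above comes from.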
    
    The patchset contains more fixes and improvements as described below.
    
    Per TODO in the prologue of CalculateShmemSize(), more than necessary
    shared memory was mapped and allocated in the buffer manager related
    memory segments because of an error in that function; the amount of
    memory to be allocated in the main shared memory segment was added to
    every other shared memory segment. Thus shrinking those memory
    segments didn't actually affect the objects allocated in those.
    Because of that, we were not seeing SIGBUS even when the objects
    supposedly shrunk were accessed, masking bugs in the patches. In this
    patchset I have a working fix for CalculateShmemSize(). With that fix
    in place we see server crashing with SIGBUS in some resizing
    operations. Those cases need to be investigated. The fix changes its
    minions to a. return size of shared memory objects to be allocated in
    the main memory segment and b. add sizes of the shared memory objects
    to be allocated in other memory segments in the respective
    AnonymousMapping structures. This asymmetry between the main segment and
    the other segments exists to avoid changing the minions of
    CalculateShmemSize() too much. But I think we should eliminate the asymmetry
    and change every minion to add sizes to the respective segment's
    AnonymousMapping structure. The patch proposed at [3] would simplify
    CalculateShmemSize(), which should help eliminate the asymmetry.
    Along with refactoring CalculateShmemSize() I have added small fixes
    to update the total size and end address of shared memory mapping
    after resizing them and also to update the new allocated_sizes of
    resized structures in ShmemIndex entry. Patch 0009 includes these
    changes.
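
    The symmetric variant, where every minion adds its size to the segment it
    allocates from, could be sketched like this. The names are hypothetical
    (SketchMapping is only loosely modelled on AnonymousMapping), and the
    per-object sizes are placeholder constants chosen for the example, not the
    real structure sizes.

```c
#include <stddef.h>

/* Hypothetical segment identifiers for this sketch. */
enum
{
    SKETCH_MAIN_SEGMENT,
    SKETCH_BUFFERS_SEGMENT,
    SKETCH_DESCRIPTORS_SEGMENT,
    SKETCH_NUM_SEGMENTS
};

/* Loosely modelled on AnonymousMapping; only the size is tracked here. */
typedef struct SketchMapping
{
    size_t shmem_size;
} SketchMapping;

/* Every "minion" reports its needs against one specific segment. */
void
sketch_request(SketchMapping *mappings, int segment, size_t size)
{
    mappings[segment].shmem_size += size;
}

/*
 * Compute per-segment sizes.  Nothing requested for the main segment leaks
 * into the other segments, which is the error described above.
 */
void
sketch_calculate_shmem_sizes(SketchMapping *mappings, int nbuffers)
{
    for (int i = 0; i < SKETCH_NUM_SEGMENTS; i++)
        mappings[i].shmem_size = 0;

    /* Placeholder sizes: 4096 bytes of fixed state, 8192-byte blocks,
     * 64-byte descriptors. */
    sketch_request(mappings, SKETCH_MAIN_SEGMENT, 4096);
    sketch_request(mappings, SKETCH_BUFFERS_SEGMENT, (size_t) nbuffers * 8192);
    sketch_request(mappings, SKETCH_DESCRIPTORS_SEGMENT, (size_t) nbuffers * 64);
}
```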
    
    I found that the shared memory resizing synchronization is triggered
    even before setting up the shared buffers the first time after
    starting the server. That's not required and also can lead to issues
    because of trying to resize shared buffers which do not exist. A WIP
    fix is included as patch 0012. A TODO in the patch needs to be
    addressed. It should be squashed into an earlier patch 0011 when
    appropriate.
    
    While debugging the above mentioned issues, I found it useful to have
    an insight into the contents of buffer lookup table. Hence I added a
    system view exposing the contents of the buffer lookup table. This is
    added as patch 0001 in the attached patchset. I think it's useful to
    have this independent of this patchset to investigate inconsistencies
    between the contents of shared buffer pool and buffer lookup table.
    
    Again for debugging purposes, I have added a new column "segment" in
    pg_shmem_allocations reporting the shared memory segment in which the
    given allocation has happened. I have also added another view
    pg_shmem_segments to provide information about the shared memory
    segments. This view definition will change as we design shared memory
    mappings and shared memory segments better. So it's WIP and needs doc
    changes as well. I have included it in the patchset as patch 0011
    since it will be helpful to debug issues found in the patch when
    testing. The patch should be merged into patch 0007.
    
    Last but not least, patch 0016 contains two tests: a. a stress test
    that runs buffer resizing while pgbench is running, b. a SQL test that checks
    the sizes of segments and shared memory allocations after resizing.
    The stress test polls "show shared_buffers" output to know when the
    resizing is finished. I think we need a better interface to know when
    resizing has finished. Thanks a lot to my colleague Palak Chaturvedi for
    providing the initial draft of the test case.
    
    The patches are rebased on top of the latest master, which includes
    changes to remove free buffer list. That led to removing all the code
    in these patches dealing with free buffer list.
    
    I am intentionally keeping my changes (patches 0001, 0008 to 0012,
    0012 to 0016) separate from Dmitry's changes so that Dmitry can review
    them easily. The patches are arranged so that my patches are nearer to
    Dmitry's patches, into which, they should be squashed.
    
    Dmitry,
    I found that max_available_memory is PGC_SIGHUP. Is that intentional?
    I thought it's PGC_POSTMASTER since we can not reserve more address
    space without restarting postmaster. Left a TODO for this. I think we
    also need to change the name and description to better reflect its
    actual functionality.
    
    [0] https://www.postgresql.org/message-id/my4hukmejato53ef465ev7lk3sqiqvneh7436rz64wmtc7rbfj@hmuxsf2ngov2
    [1] https://www.postgresql.org/message-id/CAExHW5v0jh3F_wj86yC%3DqBfWk0uiT94qy%3DZ41uzAHLHh0SerRA%40mail.gmail.com
    [2] https://ashutoshpg.blogspot.com/2025/07/efficiency-of-sparse-hash-table.html
    [3] https://commitfest.postgresql.org/patch/5997/
    
    
    --
    Best Wishes,
    Ashutosh Bapat
    
  117. Re: Changing shared_buffers without restart

    Andres Freund <andres@anarazel.de> — 2025-09-18T13:52:03Z

    Hi,
    
    On 2025-09-18 10:25:29 +0530, Ashutosh Bapat wrote:
    > From d1ed934ccd02fca2c831e582b07a169e17d19f59 Mon Sep 17 00:00:00 2001
    > From: Dmitrii Dolgov <9erthalion6@gmail.com>
    > Date: Tue, 17 Jun 2025 15:14:33 +0200
    > Subject: [PATCH 02/16] Process config reload in AIO workers
    
    I think this is superfluous due to b8e1f2d96bb9
    
    > Currently AIO workers process interrupts only via CHECK_FOR_INTERRUPTS,
    > which does not include ConfigReloadPending. Thus we need to check for it
    > explicitly.
    
    
    > +/*
    > + * Process any new interrupts.
    > + */
    > +static void
    > +pgaio_worker_process_interrupts(void)
    > +{
    > +	/*
    > +	 * Reloading config can trigger further signals, complicating interrupts
    > +	 * processing -- so let it run first.
    > +	 *
    > +	 * XXX: Is there any need in memory barrier after ProcessConfigFile?
    > +	 */
    > +	if (ConfigReloadPending)
    > +	{
    > +		ConfigReloadPending = false;
    > +		ProcessConfigFile(PGC_SIGHUP);
    > +	}
    > +
    > +	if (ProcSignalBarrierPending)
    > +		ProcessProcSignalBarrier();
    > +}
    
    Given that even before b8e1f2d96bb9 method_worker.c used
    CHECK_FOR_INTERRUPTS(), which contains a ProcessProcSignalBarrier(), I don't
    know why that second check was added here?
    
    
    
    > From 0a13e56dceea8cc7a2685df7ee8cea434588681b Mon Sep 17 00:00:00 2001
    > From: Dmitrii Dolgov <9erthalion6@gmail.com>
    > Date: Sun, 6 Apr 2025 16:40:32 +0200
    > Subject: [PATCH 03/16] Introduce pending flag for GUC assign hooks
    > 
    > Currently an assign hook can perform some preprocessing of a new value,
    > but it cannot change the behavior, which dictates that the new value
    > will be applied immediately after the hook. Certain GUC options (like
    > shared_buffers, coming in subsequent patches) may need coordinating work
    > between backends to change, meaning we cannot apply it right away.
    > 
    > Add a new flag "pending" for an assign hook to allow the hook indicate
    > exactly that. If the pending flag is set after the hook, the new value
    > will not be applied and its handling becomes the hook's implementation
    > responsibility.
    
    I doubt it makes sense to add this to the GUC system. I think it'd be better
    to just use the GUC value as the desired "target" configuration and have a
    function or a show-only GUC for reporting the current size.
    
    I don't think you can just block application of the GUC until the resize is
    complete. E.g. what if the value was too big and the new configuration needs
    to be fixed to be lower?
    
    
    > From 0a55bc15dc3a724f03e674048109dac1f248c406 Mon Sep 17 00:00:00 2001
    > From: Dmitrii Dolgov <9erthalion6@gmail.com>
    > Date: Fri, 4 Apr 2025 21:46:14 +0200
    > Subject: [PATCH 04/16] Introduce pss_barrierReceivedGeneration
    > 
    > Currently WaitForProcSignalBarrier allows to make sure the message sent
    > via EmitProcSignalBarrier was processed by all ProcSignal mechanism
    > participants.
    > 
    > Add pss_barrierReceivedGeneration alongside with pss_barrierGeneration,
    > which will be updated when a process has received the message, but not
    > processed it yet. This makes it possible to support a new mode of
    > waiting, when ProcSignal participants want to synchronize message
    > processing. To do that, a participant can wait via
    > WaitForProcSignalBarrierReceived when processing a message, effectively
    > making sure that all processes are going to start processing
    > ProcSignalBarrier simultaneously.
    
    I doubt "online resizing" that requires synchronously processing the same
    event, can really be called "online". There can be significant delays in
    processing a barrier, stalling the entire server until that is reached seems
    like a complete no-go for production systems?
    
    
    > From 63fe27340656c52b13f4eecebd9e73d24efe5e33 Mon Sep 17 00:00:00 2001
    > From: Dmitrii Dolgov <9erthalion6@gmail.com>
    > Date: Fri, 28 Feb 2025 19:54:47 +0100
    > Subject: [PATCH 05/16] Allow to use multiple shared memory mappings
    > 
    > Currently all the work with shared memory is done via a single anonymous
    > memory mapping, which limits ways how the shared memory could be organized.
    > 
    > Introduce possibility to allocate multiple shared memory mappings, where
    > a single mapping is associated with a specified shared memory segment.
    > There is only fixed amount of available segments, currently only one
    > main shared memory segment is allocated. A new shared memory API is
    > introduced, extended with a segment as a new parameter. As a path of
    > least resistance, the original API is kept in place, utilizing the main
    > shared memory segment.
    
    
    > -#define MAX_ON_EXITS 20
    > +#define MAX_ON_EXITS 40
    
    Why does a patch like this contain changes like this mixed in with the rest?
    That's clearly not directly related to $subject.
    
    
    >  /* shared memory global variables */
    >  
    > -static PGShmemHeader *ShmemSegHdr;	/* shared mem segment header */
    > +ShmemSegment Segments[ANON_MAPPINGS];
    >  
    > -static void *ShmemBase;			/* start address of shared memory */
    > -
    > -static void *ShmemEnd;			/* end+1 address of shared memory */
    > -
    > -slock_t    *ShmemLock;			/* spinlock for shared memory and LWLock
    > -								 * allocation */
    > -
    > -static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
    
    Why do we need a separate ShmemLock for each segment? Besides being
    unnecessary, it seems like that prevents locking in a way that provides
    consistency across all segments.
    
    
    
    > From e2f48da8a8206711b24e34040d699431910fbf9c Mon Sep 17 00:00:00 2001
    > From: Dmitrii Dolgov <9erthalion6@gmail.com>
    > Date: Tue, 17 Jun 2025 11:47:04 +0200
    > Subject: [PATCH 06/16] Address space reservation for shared memory
    > 
    > Currently the shared memory layout is designed to pack everything tight
    > together, leaving no space between mappings for resizing. Here is how it
    > looks like for one mapping in /proc/$PID/maps, /dev/zero represents the
    > anonymous shared memory we talk about:
    >
    >     00400000-00490000         /path/bin/postgres
    >     ...
    >     012d9000-0133e000         [heap]
    >     7f443a800000-7f470a800000 /dev/zero (deleted)
    >     7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    >     7f4718400000-7f4718401000 /usr/lib64/libstdc++.so.6.0.34
    >     ...
    > 
    > Make the layout more dynamic via splitting every shared memory segment
    > into two parts:
    > 
    > * An anonymous file, which actually contains shared memory content. Such
    >   an anonymous file is created via memfd_create, it lives in memory,
    >   behaves like a regular file and semantically equivalent to an
    >   anonymous memory allocated via mmap with MAP_ANONYMOUS.
    > 
    > * A reservation mapping, which size is much larger than required shared
    >   segment size. This mapping is created with flags PROT_NONE (which
    >   makes sure the reserved space is not used), and MAP_NORESERVE (to not
    >   count the reserved space against memory limits). The anonymous file is
    >   mapped into this reservation mapping.
    
    The commit message fails to explain why, if we're already relying on
    MAP_NORESERVE, we need to do anything else? Why can't we just have one maximally
    sized allocation that's marked MAP_NORESERVE for all the parts that we don't
    yet need?
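
    For readers unfamiliar with the alternative being suggested, a minimal
    Linux-oriented sketch of "one maximally sized mapping" follows. It is an
    illustration only and not code from the patchset or from PostgreSQL: the
    function names are hypothetical, huge pages and error reporting are
    ignored, and it relies on the kernel backing pages of a MAP_NORESERVE
    mapping lazily, on first touch.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for MAP_ANONYMOUS / MAP_NORESERVE */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Map the maximum size the shared area may ever need.  With MAP_NORESERVE
 * the untouched part is not charged as committed memory; physical pages
 * are only allocated once a process writes to them.  Returns NULL on failure.
 */
void *
sketch_reserve_shared(size_t max_size)
{
    void *base = mmap(NULL, max_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

    return (base == MAP_FAILED) ? NULL : base;
}

/*
 * "Growing" is then just a matter of starting to use more of the mapping;
 * here the first used_size bytes are filled to force them into memory.
 */
void
sketch_touch_prefix(void *base, size_t used_size)
{
    memset(base, 0x5a, used_size);
}
```

    Because the mapping is created once and inherited by child processes,
    growing would not require any per-process remapping in this model. How to
    give memory back when shrinking is a separate question that this sketch
    does not address.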
    
    
    
    > There are also few unrelated advantages of using anon files:
    > 
    > * We've got a file descriptor, which could be used for regular file
    >   operations (modification, truncation, you name it).
    
    What is this an advantage for?
    
    
    > * The file could be given a name, which improves readability when it
    >   comes to process maps.
    
    > * By default, Linux will not add file-backed shared mappings into a core dump,
    >   making it more convenient to work with them in PostgreSQL: no more huge dumps
    >   to process.
    
    That's just as well a downside, because you now can't investigate some
    issues. This was already configurable via coredump_filter.
    
    
    
    > From 942b69a0876b0e83303e6704da54c4c002a5a2d8 Mon Sep 17 00:00:00 2001
    > From: Dmitrii Dolgov <9erthalion6@gmail.com>
    > Date: Tue, 17 Jun 2025 11:22:02 +0200
    > Subject: [PATCH 07/16] Introduce multiple shmem segments for shared buffers
    > 
    > Add more shmem segments to split shared buffers into following chunks:
    > * BUFFERS_SHMEM_SEGMENT: contains buffer blocks
    > * BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors
    > * BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers
    > * CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids
    > * STRATEGY_SHMEM_SEGMENT: contains buffer strategy status
    
    Why do all these need to be separate segments? Afaict we'll have to maximally
    size everything other than BUFFERS_SHMEM_SEGMENT at start?
    
    
    
    > From 78bc0a49f8ebe17927abd66164764745ecc6d563 Mon Sep 17 00:00:00 2001
    > From: Dmitrii Dolgov <9erthalion6@gmail.com>
    > Date: Tue, 17 Jun 2025 14:16:55 +0200
    > Subject: [PATCH 11/16] Allow to resize shared memory without restart
    > 
    > Add assign hook for shared_buffers to resize shared memory using space,
    > introduced in the previous commits without requiring PostgreSQL restart.
    > Essentially the implementation is based on two mechanisms: a
    > ProcSignalBarrier is used to make sure all processes are starting the
    > resize procedure simultaneously, and a global Barrier is used to
    > coordinate after that and make sure all finished processes are waiting
    > for others that are in progress.
    > 
    > The resize process looks like this:
    > 
    > * The GUC assign hook sets a flag to let the Postmaster know that resize
    >   was requested.
    > 
    > * Postmaster verifies the flag in the event loop, and starts the resize
    >   by emitting a ProcSignal barrier.
    > 
    > * All processes, that participate in ProcSignal mechanism, begin to
    >   process ProcSignal barrier. First a process waits until all processes
    >   have confirmed they received the message and can start simultaneously.
    
    As mentioned above, this basically makes the entire feature not really
    online. Besides the latency of some processes not getting to the barrier
    immediately, there's also the issue that actually reserving large amounts of
    memory can take a long time - during which all processes would be unavailable.
    
    I really don't see that being viable. It'd be one thing if that were a
    "temporary" restriction, but the whole design seems to be fairly centered
    around that.
    
    
    > * Every process recalculates shared memory size based on the new
    >   NBuffers, adjusts its size using ftruncate and adjust reservation
    >   permissions with mprotect. One elected process signals the postmaster
    >   to do the same.
    
    If we just used a single memory mapping with all unused parts marked
    MAP_NORESERVE, we wouldn't need this (and wouldn't need a fair bit of other
    work in this patchset)..
    
    
    > From experiment it turns out that shared mappings have to be extended
    > separately for each process that uses them. Another rough edge is that a
    > backend blocked on ReadCommand will not apply shared_buffers change
    > until it receives something.
    
    That's not a rough edge, that basically makes the feature unusable, no?
    
    
    > +-- Test 2: Set to 64MB  
    > +ALTER SYSTEM SET shared_buffers = '64MB';
    > +SELECT pg_reload_conf();
    > +SELECT pg_sleep(1);
    > +SHOW shared_buffers;
    
    Tests containing sleeps are a significant warning flag imo.
    
    Greetings,
    
    Andres Freund
    
    
    
    
  118. Re: Changing shared_buffers without restart

    Andres Freund <andres@anarazel.de> — 2025-09-18T14:05:01Z

    Hi,
    
    On 2025-09-18 09:52:03 -0400, Andres Freund wrote:
    > On 2025-09-18 10:25:29 +0530, Ashutosh Bapat wrote:
    > > From 0a55bc15dc3a724f03e674048109dac1f248c406 Mon Sep 17 00:00:00 2001
    > > From: Dmitrii Dolgov <9erthalion6@gmail.com>
    > > Date: Fri, 4 Apr 2025 21:46:14 +0200
    > > Subject: [PATCH 04/16] Introduce pss_barrierReceivedGeneration
    > > 
    > > Currently WaitForProcSignalBarrier allows to make sure the message sent
    > > via EmitProcSignalBarrier was processed by all ProcSignal mechanism
    > > participants.
    > > 
    > > Add pss_barrierReceivedGeneration alongside with pss_barrierGeneration,
    > > which will be updated when a process has received the message, but not
    > > processed it yet. This makes it possible to support a new mode of
    > > waiting, when ProcSignal participants want to synchronize message
    > > processing. To do that, a participant can wait via
    > > WaitForProcSignalBarrierReceived when processing a message, effectively
    > > making sure that all processes are going to start processing
    > > ProcSignalBarrier simultaneously.
    > 
    > I doubt "online resizing" that requires synchronously processing the same
    > event, can really be called "online". There can be significant delays in
    > processing a barrier, stalling the entire server until that is reached seems
    > like a complete no-go for production systems?
    
    > [...]
    
    > > From 78bc0a49f8ebe17927abd66164764745ecc6d563 Mon Sep 17 00:00:00 2001
    > > From: Dmitrii Dolgov <9erthalion6@gmail.com>
    > > Date: Tue, 17 Jun 2025 14:16:55 +0200
    > > Subject: [PATCH 11/16] Allow to resize shared memory without restart
    > > 
    > > Add assign hook for shared_buffers to resize shared memory using space,
    > > introduced in the previous commits without requiring PostgreSQL restart.
    > > Essentially the implementation is based on two mechanisms: a
    > > ProcSignalBarrier is used to make sure all processes are starting the
    > > resize procedure simultaneously, and a global Barrier is used to
    > > coordinate after that and make sure all finished processes are waiting
    > > for others that are in progress.
    > > 
    > > The resize process looks like this:
    > > 
    > > * The GUC assign hook sets a flag to let the Postmaster know that resize
    > >   was requested.
    > > 
    > > * Postmaster verifies the flag in the event loop, and starts the resize
    > >   by emitting a ProcSignal barrier.
    > > 
    > > * All processes, that participate in ProcSignal mechanism, begin to
    > >   process ProcSignal barrier. First a process waits until all processes
    > >   have confirmed they received the message and can start simultaneously.
    > 
    > As mentioned above, this basically makes the entire feature not really
    > online. Besides the latency of some processes not getting to the barrier
    > immediately, there's also the issue that actually reserving large amounts of
    > memory can take a long time - during which all processes would be unavailable.
    > 
    > I really don't see that being viable. It'd be one thing if that were a
    > "temporary" restriction, but the whole design seems to be fairly centered
    > around that.
    
    Besides not really being online, isn't this a recipe for endless undetected
    deadlocks? What if process A waits for a lock held by process B and process B
    arrives at the barrier? Process A won't ever get there, because process B
    can't make progress, because A is not making progress.
    
    Greetings,
    
    Andres Freund
    
    
    
    
  119. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-09-26T18:04:21Z

    Sorry for the late reply, folks.
    
    > On Thu, Sep 18, 2025 at 09:52:03AM -0400, Andres Freund wrote:
    > > From 0a13e56dceea8cc7a2685df7ee8cea434588681b Mon Sep 17 00:00:00 2001
    > > From: Dmitrii Dolgov <9erthalion6@gmail.com>
    > > Date: Sun, 6 Apr 2025 16:40:32 +0200
    > > Subject: [PATCH 03/16] Introduce pending flag for GUC assign hooks
    > > 
    > > Currently an assing hook can perform some preprocessing of a new value,
    > > but it cannot change the behavior, which dictates that the new value
    > > will be applied immediately after the hook. Certain GUC options (like
    > > shared_buffers, coming in subsequent patches) may need coordinating work
    > > between backends to change, meaning we cannot apply it right away.
    > > 
    > > Add a new flag "pending" for an assign hook to allow the hook indicate
    > > exactly that. If the pending flag is set after the hook, the new value
    > > will not be applied and it's handling becomes the hook's implementation
    > > responsibility.
    > 
    > I doubt it makes sense to add this to the GUC system. I think it'd be better
    > to just use the GUC value as the desired "target" configuration and have a
    > function or a show-only GUC for reporting the current size.
    > 
    > I don't think you can't just block application of the GUC until the resize is
    > complete. E.g. what if the value was too big and the new configuration needs
    > to fixed to be lower?
    
    I think it was a bit hasty to post another version of the patch without
    the design changes we've agreed upon last time. I'm still working on
    that (sorry, it takes time, I haven't written so much Perl for testing
    since forever), the current implementation doesn't include anything with
    GUC to simplify the discussion. I'm still convinced that multi-step GUC
    changing makes sense, but it has proven to be more complicated than I
    anticipated, so I'll spin up another thread to discuss when I come to
    it.
    
    > > From 0a55bc15dc3a724f03e674048109dac1f248c406 Mon Sep 17 00:00:00 2001
    > > From: Dmitrii Dolgov <9erthalion6@gmail.com>
    > > Date: Fri, 4 Apr 2025 21:46:14 +0200
    > > Subject: [PATCH 04/16] Introduce pss_barrierReceivedGeneration
    > > 
    > > Currently WaitForProcSignalBarrier allows to make sure the message sent
    > > via EmitProcSignalBarrier was processed by all ProcSignal mechanism
    > > participants.
    > > 
    > > Add pss_barrierReceivedGeneration alongside with pss_barrierGeneration,
    > > which will be updated when a process has received the message, but not
    > > processed it yet. This makes it possible to support a new mode of
    > > waiting, when ProcSignal participants want to synchronize message
    > > processing. To do that, a participant can wait via
    > > WaitForProcSignalBarrierReceived when processing a message, effectively
    > > making sure that all processes are going to start processing
    > > ProcSignalBarrier simultaneously.
    > 
    > I doubt "online resizing" that requires synchronously processing the same
    > event, can really be called "online". There can be significant delays in
    > processing a barrier, stalling the entire server until that is reached seems
    > like a complete no-go for production systems?
    > 
    > [...]
    
    > As mentioned above, this basically makes the entire feature not really
    > online. Besides the latency of some processes not getting to the barrier
    > immediately, there's also the issue that actually reserving large amounts of
    > memory can take a long time - during which all processes would be unavailable.
    > 
    > I really don't see that being viable. It'd be one thing if that were a
    > "temporary" restriction, but the whole design seems to be fairly centered
    > around that.
    >
    > [...]
    > 
    > Besides not really being online, isn't this a recipe for endless undetected
    > deadlocks? What if process A waits for a lock held by process B and process B
    > arrives at the barrier? Process A won't ever get there, because process B
    > can't make progress, because A is not making progress.
    
    Same as above, in the version I'm working on right now it's changed in
    favor of an approach that looks more like the one from the "online checksum
    change" patch. I've even stumbled upon cases where a process was just
    killed and never arrived at the barrier, so that was it. The new approach
    makes certain parts simpler, but requires managing backends with
    different understanding of how large shared memory segments are for some
    time interval. Introducing a new parameter "number of available buffers"
    seems to be helpful to address all cases I've found so far.
    
    Btw, under "online" resizing I mostly understood "without restart", the
    goal was not to make it really "online".
    
    > > -#define MAX_ON_EXITS 20
    > > +#define MAX_ON_EXITS 40
    > 
    > Why does a patch like this contain changes like this mixed in with the rest?
    > That's clearly not directly related to $subject.
    
    An artifact of rebasing, it belonged to 0007.
    
    > > From e2f48da8a8206711b24e34040d699431910fbf9c Mon Sep 17 00:00:00 2001
    > > From: Dmitrii Dolgov <9erthalion6@gmail.com>
    > > Date: Tue, 17 Jun 2025 11:47:04 +0200
    > > Subject: [PATCH 06/16] Address space reservation for shared memory
    > > 
    > > Currently the shared memory layout is designed to pack everything tight
    > > together, leaving no space between mappings for resizing. Here is how it
    > > looks like for one mapping in /proc/$PID/maps, /dev/zero represents the
    > > anonymous shared memory we talk about:
    > >
    > >     00400000-00490000         /path/bin/postgres
    > >     ...
    > >     012d9000-0133e000         [heap]
    > >     7f443a800000-7f470a800000 /dev/zero (deleted)
    > >     7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    > >     7f4718400000-7f4718401000 /usr/lib64/libstdc++.so.6.0.34
    > >     ...
    > > 
    > > Make the layout more dynamic via splitting every shared memory segment
    > > into two parts:
    > > 
    > > * An anonymous file, which actually contains shared memory content. Such
    > >   an anonymous file is created via memfd_create, it lives in memory,
    > >   behaves like a regular file and semantically equivalent to an
    > >   anonymous memory allocated via mmap with MAP_ANONYMOUS.
    > > 
    > > * A reservation mapping, which size is much larger than required shared
    > >   segment size. This mapping is created with flags PROT_NONE (which
    > >   makes sure the reserved space is not used), and MAP_NORESERVE (to not
    > >   count the reserved space against memory limits). The anonymous file is
    > >   mapped into this reservation mapping.
    > 
    > The commit message fails to explain why, if we're already relying on
    > MAP_NORESERVE, we need to anything else? Why can't we just have one maximally
    > sized allocation that's marked MAP_NORESERVE for all the parts that we don't
    > yet need?
    
    How do we return memory to the OS in that case? Currently it's done
    explicitly via truncating the anonymous file.
    
    > > * The file could be given a name, which improves readability when it
    > >   comes to process maps.
    > 
    > > * By default, Linux will not add file-backed shared mappings into a core dump,
    > >   making it more convenient to work with them in PostgreSQL: no more huge dumps
    > >   to process.
    > 
    > That's just as well a downside, because you now can't investigate some
    > issues. This was already configurable via coredump_filter.
    
    This behaviour is configured via coredump_filter as well, so just the
    default value has been changed.
     
    > > From 942b69a0876b0e83303e6704da54c4c002a5a2d8 Mon Sep 17 00:00:00 2001
    > > From: Dmitrii Dolgov <9erthalion6@gmail.com>
    > > Date: Tue, 17 Jun 2025 11:22:02 +0200
    > > Subject: [PATCH 07/16] Introduce multiple shmem segments for shared buffers
    > > 
    > > Add more shmem segments to split shared buffers into following chunks:
    > > * BUFFERS_SHMEM_SEGMENT: contains buffer blocks
    > > * BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors
    > > * BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers
    > > * CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids
    > > * STRATEGY_SHMEM_SEGMENT: contains buffer strategy status
    > 
    > Why do all these need to be separate segments? Afaict we'll have to maximally
    > size everything other than BUFFERS_SHMEM_SEGMENT at start?
    
    Why would they need to be maxed out at the start? So far my rule of
    thumb was one segment for one structure whose size depends on NBuffers,
    so that when changing NBuffers each segment could be adjusted
    independently.
    
    > > +-- Test 2: Set to 64MB  
    > > +ALTER SYSTEM SET shared_buffers = '64MB';
    > > +SELECT pg_reload_conf();
    > > +SELECT pg_sleep(1);
    > > +SHOW shared_buffers;
    > 
    > Tests containing sleeps are a significant warning flag imo.
    
    The tests I'm preparing so far avoid this by waiting in injection points.
    I haven't found anything similar in existing tests, but I assume such
    an approach is fine.
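
The two-part layout described in the quoted 06/16 commit message (a PROT_NONE | MAP_NORESERVE reservation with a memfd_create file mapped into it) can be illustrated with a minimal C sketch. It assumes Linux and page-aligned, non-zero sizes. The ResizableSegment type and the segment_create / segment_resize helpers are hypothetical names invented for this illustration; they are not code from the patch set, which may handle resizing differently.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical type for this sketch: one resizable shared segment. */
typedef struct ResizableSegment
{
    int     fd;        /* anonymous file (memfd) holding the content */
    char   *base;      /* start of the reserved address range */
    size_t  reserved;  /* size of the address space reservation */
    size_t  used;      /* size currently backed by the file */
} ResizableSegment;

/*
 * Reserve 'reserved' bytes of address space and map 'initial' bytes of a
 * fresh memfd at its start. Sizes must be non-zero multiples of the page
 * size. Error paths do not clean up, for brevity.
 */
int
segment_create(ResizableSegment *seg, const char *name,
               size_t reserved, size_t initial)
{
    assert(initial > 0 && initial <= reserved);

    /* Reservation: inaccessible, not counted against memory limits. */
    seg->base = mmap(NULL, reserved, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (seg->base == MAP_FAILED)
        return -1;

    seg->fd = memfd_create(name, 0);
    if (seg->fd < 0 || ftruncate(seg->fd, (off_t) initial) != 0)
        return -1;

    /* Map the file over the beginning of the reservation. */
    if (mmap(seg->base, initial, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, seg->fd, 0) == MAP_FAILED)
        return -1;

    seg->reserved = reserved;
    seg->used = initial;
    return 0;
}

/* Grow or shrink the file-backed part within the reservation. */
int
segment_resize(ResizableSegment *seg, size_t new_size)
{
    if (new_size == 0 || new_size > seg->reserved)
        return -1;

    if (new_size < seg->used)
    {
        /* Turn the tail back into a reservation, then truncate the file. */
        if (mmap(seg->base + new_size, seg->used - new_size, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE | MAP_FIXED,
                 -1, 0) == MAP_FAILED)
            return -1;
        if (ftruncate(seg->fd, (off_t) new_size) != 0)
            return -1;
    }
    else if (new_size > seg->used)
    {
        /* Extend the file, then map the new tail into the reservation. */
        if (ftruncate(seg->fd, (off_t) new_size) != 0)
            return -1;
        if (mmap(seg->base + seg->used, new_size - seg->used,
                 PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
                 seg->fd, (off_t) seg->used) == MAP_FAILED)
            return -1;
    }

    seg->used = new_size;
    return 0;
}
```

In this sketch, shrinking is what returns memory to the OS: truncating the anonymous file drops the pages beyond the new size, while the address range stays reserved for a later grow. In a multi-process server every backend would have to repeat the remapping step, which is the coordination problem the thread is discussing.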
    
    
    
    
  120. Re: Changing shared_buffers without restart

    Andres Freund <andres@anarazel.de> — 2025-09-26T18:36:43Z

    Hi,
    
    On 2025-09-26 20:04:21 +0200, Dmitry Dolgov wrote:
    > > On Thu, Sep 18, 2025 at 09:52:03AM -0400, Andres Freund wrote:
    > > > From 0a13e56dceea8cc7a2685df7ee8cea434588681b Mon Sep 17 00:00:00 2001
    > > > From: Dmitrii Dolgov <9erthalion6@gmail.com>
    > > > Date: Sun, 6 Apr 2025 16:40:32 +0200
    > > > Subject: [PATCH 03/16] Introduce pending flag for GUC assign hooks
    > > > 
    > > > Currently an assing hook can perform some preprocessing of a new value,
    > > > but it cannot change the behavior, which dictates that the new value
    > > > will be applied immediately after the hook. Certain GUC options (like
    > > > shared_buffers, coming in subsequent patches) may need coordinating work
    > > > between backends to change, meaning we cannot apply it right away.
    > > > 
    > > > Add a new flag "pending" for an assign hook to allow the hook indicate
    > > > exactly that. If the pending flag is set after the hook, the new value
    > > > will not be applied and it's handling becomes the hook's implementation
    > > > responsibility.
    > > 
    > > I doubt it makes sense to add this to the GUC system. I think it'd be better
    > > to just use the GUC value as the desired "target" configuration and have a
    > > function or a show-only GUC for reporting the current size.
    > > 
    > > I don't think you can't just block application of the GUC until the resize is
    > > complete. E.g. what if the value was too big and the new configuration needs
    > > to fixed to be lower?
    > 
    > I think it was a bit hasty to post another version of the patch without
    > the design changes we've agreed upon last time. I'm still working on
    > that (sorry, it takes time, I haven't wrote so much Perl for testing
    > since forever), the current implementation doesn't include anything with
    > GUC to simplify the discussion. I'm still convinced that multi-step GUC
    > changing makes sense, but it has proven to be more complicated than I
    > anticipated, so I'll spin up another thread to discuss when I come to
    > it.
    
    FWIW, I'm fairly convinced it's a completely dead end.
    
    
    
    
    > > > From e2f48da8a8206711b24e34040d699431910fbf9c Mon Sep 17 00:00:00 2001
    > > > From: Dmitrii Dolgov <9erthalion6@gmail.com>
    > > > Date: Tue, 17 Jun 2025 11:47:04 +0200
    > > > Subject: [PATCH 06/16] Address space reservation for shared memory
    > > > 
    > > > Currently the shared memory layout is designed to pack everything tight
    > > > together, leaving no space between mappings for resizing. Here is how it
    > > > looks like for one mapping in /proc/$PID/maps, /dev/zero represents the
    > > > anonymous shared memory we talk about:
    > > >
    > > >     00400000-00490000         /path/bin/postgres
    > > >     ...
    > > >     012d9000-0133e000         [heap]
    > > >     7f443a800000-7f470a800000 /dev/zero (deleted)
    > > >     7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    > > >     7f4718400000-7f4718401000 /usr/lib64/libstdc++.so.6.0.34
    > > >     ...
    > > > 
    > > > Make the layout more dynamic via splitting every shared memory segment
    > > > into two parts:
    > > > 
    > > > * An anonymous file, which actually contains shared memory content. Such
    > > >   an anonymous file is created via memfd_create, it lives in memory,
    > > >   behaves like a regular file and semantically equivalent to an
    > > >   anonymous memory allocated via mmap with MAP_ANONYMOUS.
    > > > 
    > > > * A reservation mapping, which size is much larger than required shared
    > > >   segment size. This mapping is created with flags PROT_NONE (which
    > > >   makes sure the reserved space is not used), and MAP_NORESERVE (to not
    > > >   count the reserved space against memory limits). The anonymous file is
    > > >   mapped into this reservation mapping.
    > > 
    > > The commit message fails to explain why, if we're already relying on
    > > MAP_NORESERVE, we need to anything else? Why can't we just have one maximally
    > > sized allocation that's marked MAP_NORESERVE for all the parts that we don't
    > > yet need?
    > 
    > How do we return memory to the OS in that case? Currently it's done
    > explicitly via truncating the anonymous file.
    
    madvise with MADV_DONTNEED or MADV_REMOVE.
    
    Greetings,
    
    Andres Freund
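
The alternative suggested here, one maximally sized MAP_NORESERVE mapping from which unneeded parts are given back with madvise, can be sketched in C as follows. This is a minimal illustration assuming Linux and a shared anonymous mapping; pool_map_max and pool_shrink are hypothetical helper names, not functions from PostgreSQL or the patch set. Only MADV_REMOVE is shown because, per the madvise(2) man page, it frees the backing store of a shared writable mapping and later reads of the range see zeros.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map the maximum size up front. With MAP_NORESERVE, pages that are never
 * touched are not backed by memory. Returns NULL on failure.
 */
char *
pool_map_max(size_t max_size)
{
    void *p = mmap(NULL, max_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

    return p == MAP_FAILED ? NULL : (char *) p;
}

/*
 * Give the pages in [new_size, old_size) back to the OS. The mapping
 * itself stays in place, so no process has to remap anything. Both sizes
 * must be multiples of the page size. Returns 0 on success.
 */
int
pool_shrink(char *base, size_t old_size, size_t new_size)
{
    if (new_size >= old_size)
        return 0;               /* nothing to release */

    return madvise(base + new_size, old_size - new_size, MADV_REMOVE);
}
```

A later "grow" in this scheme needs no system call at all: processes simply start touching the pages again. Whether this interacts well with huge pages and memory accounting is something the sketch does not address.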
    
    
    
    
  121. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-09-28T09:24:26Z

    > On Thu, Sep 18, 2025 at 10:25:29AM +0530, Ashutosh Bapat wrote:
    > Given these things, I think we should set up the buffer lookup table
    > to hold maximum entries required to expand the buffer pool to its
    > maximum, right at the beginning.
    
    Thanks for investigating. I think another option would be to rebuild the
    buffer lookup table (create a new table based on the new size and copy
    the data over from the original one) as part of the resize procedure,
    alongside buffer eviction and initialization. From what I recall
    the size of buffer lookup table is about two orders of magnitude lower
    than shared buffers, so the overhead should not be that large even for
    significant amount of buffers.
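
The "create a new table based on the new size and copy the data over" idea can be illustrated with a toy C sketch. It uses a small open-addressing hash table as a stand-in; PostgreSQL's real buffer lookup table is a partitioned dynahash table in shared memory, so every type and function name below (LookupTable, table_rebuild, and so on) is hypothetical and only shows the shape of the operation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-in for a buffer lookup table: key -> buffer id. */
typedef struct LookupEntry
{
    uint32_t key;
    int      buf_id;
    int      used;
} LookupEntry;

typedef struct LookupTable
{
    LookupEntry *slots;
    size_t       nslots;
    size_t       nused;
} LookupTable;

LookupTable *
table_create(size_t nslots)
{
    LookupTable *t;

    if (nslots == 0 || (t = malloc(sizeof(*t))) == NULL)
        return NULL;
    t->slots = calloc(nslots, sizeof(LookupEntry));
    if (t->slots == NULL)
    {
        free(t);
        return NULL;
    }
    t->nslots = nslots;
    t->nused = 0;
    return t;
}

void
table_destroy(LookupTable *t)
{
    free(t->slots);
    free(t);
}

/* Insert or update; returns -1 if the table is full. */
int
table_insert(LookupTable *t, uint32_t key, int buf_id)
{
    size_t i = (key * 2654435761u) % t->nslots;

    if (t->nused == t->nslots)
        return -1;
    while (t->slots[i].used)
    {
        if (t->slots[i].key == key)
        {
            t->slots[i].buf_id = buf_id;
            return 0;
        }
        i = (i + 1) % t->nslots;    /* linear probing */
    }
    t->slots[i].key = key;
    t->slots[i].buf_id = buf_id;
    t->slots[i].used = 1;
    t->nused++;
    return 0;
}

/* Returns the buffer id, or -1 if the key is not present. */
int
table_lookup(const LookupTable *t, uint32_t key)
{
    size_t i = (key * 2654435761u) % t->nslots;

    for (size_t n = 0; n < t->nslots && t->slots[i].used; n++)
    {
        if (t->slots[i].key == key)
            return t->slots[i].buf_id;
        i = (i + 1) % t->nslots;
    }
    return -1;
}

/*
 * Build a table with new_nslots slots, copy all entries, free the old one.
 * Returns NULL (leaving the old table intact) if the entries cannot fit.
 */
LookupTable *
table_rebuild(LookupTable *old, size_t new_nslots)
{
    LookupTable *new_table;

    if (new_nslots < old->nused)
        return NULL;
    if ((new_table = table_create(new_nslots)) == NULL)
        return NULL;
    for (size_t i = 0; i < old->nslots; i++)
        if (old->slots[i].used)
            table_insert(new_table, old->slots[i].key, old->slots[i].buf_id);
    table_destroy(old);
    return new_table;
}
```

The copy is linear in the number of entries, which is the cost being estimated above. The sketch ignores concurrency entirely: in a running server, lookups and inserts would have to be blocked or redirected while the copy is in progress.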
    
    
    
    
  122. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-09-29T06:51:08Z

    On Sun, Sep 28, 2025 at 2:54 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    >
    > > On Thu, Sep 18, 2025 at 10:25:29AM +0530, Ashutosh Bapat wrote:
    > > Given these things, I think we should set up the buffer lookup table
    > > to hold maximum entries required to expand the buffer pool to its
    > > maximum, right at the beginning.
    >
    > Thanks for investigating. I think another option would be to rebuild the
    > buffer lookup table (create a new table based on the new size and copy
    > the data over from the original one) as part of the resize procedure,
    > alongsize with buffers eviction and initialization. From what I recall
    > the size of buffer lookup table is about two orders of magnitude lower
    > than shared buffers, so the overhead should not be that large even for
    > significant amount of buffers.
    
    The proposal will work but will require significant work:
    
    1. The pointer to the shared buffer lookup table will change. The
    change needs to be absorbed by all the processes at the same time; we
    cannot have a few processes accessing the old lookup table and a few
    processes the new one. That has the potential to make many processes wait
    for a very long time. That can be fixed by accessing a new pointer when
    the next buffer lookup access happens by modifying BufTable*
    functions. But that means extra condition checks and some extra
    code in those hot paths. Not sure whether that's acceptable.
    2. The memory consumed by the old buffer lookup table will need to be
    "freed" to the OS. The only way to do so is by having a new memory
    segment (which can be unmapped) or unmapping portions of segment
    dedicated to the buffer lookup table. That's some more synchronization
    and additional wait times for backends.
    3. When the new shared buffer lookup table will be built, processes
    may be able to access it in shared mode but they may not be able to
    make changes to it (or else we need to make corresponding changes to
    new table as well). That means more restrictions on the running
    backends.
    
    I am not saying that we can not implement your idea, but maybe we
    could do that incrementally after basic resizing is in place.
    
    --
    Best Wishes,
    Ashutosh Bapat
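
The "extra condition check in the hot path" from point 1 could look roughly like the following C11 sketch: a shared generation counter is bumped after a rebuild, and each backend compares it against a cached copy before using its cached table pointer. All names (SharedControl, BackendCache, publish_table, current_table) are hypothetical, a single publisher is assumed, and the sketch says nothing about when the old table can safely be freed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Placeholder for the real table; only its identity matters here. */
typedef struct LookupTable
{
    int id;
} LookupTable;

/* State that would live in shared memory. */
typedef struct SharedControl
{
    atomic_uint  generation;    /* bumped after each rebuild */
    LookupTable *tables[2];     /* slot (generation % 2) is the current table */
} SharedControl;

/* Per-backend cached view. */
typedef struct BackendCache
{
    unsigned     seen_generation;
    LookupTable *table;
} BackendCache;

/* Publisher side: install a rebuilt table, then advance the generation. */
void
publish_table(SharedControl *ctl, LookupTable *new_table)
{
    unsigned next =
        atomic_load_explicit(&ctl->generation, memory_order_relaxed) + 1;

    ctl->tables[next % 2] = new_table;
    /* Release: the pointer store above is visible before the new generation. */
    atomic_store_explicit(&ctl->generation, next, memory_order_release);
}

/*
 * Hot path: one atomic load and one comparison per lookup. The cached
 * pointer is refreshed only when the generation has changed.
 */
LookupTable *
current_table(BackendCache *cache, SharedControl *ctl)
{
    unsigned gen =
        atomic_load_explicit(&ctl->generation, memory_order_acquire);

    if (gen != cache->seen_generation)
    {
        cache->table = ctl->tables[gen % 2];
        cache->seen_generation = gen;
    }
    return cache->table;
}
```

The per-lookup cost is the acquire load plus a normally well-predicted branch; whether that is acceptable in the BufTable* functions is exactly the open question raised above and would need measuring.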
    
    
    
    
  123. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-09-29T06:57:18Z

    On Fri, Sep 26, 2025 at 11:34 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    >
    > Sorry for late reply folks.
    >
    > > On Thu, Sep 18, 2025 at 09:52:03AM -0400, Andres Freund wrote:
    > > > From 0a13e56dceea8cc7a2685df7ee8cea434588681b Mon Sep 17 00:00:00 2001
    > > > From: Dmitrii Dolgov <9erthalion6@gmail.com>
    > > > Date: Sun, 6 Apr 2025 16:40:32 +0200
    > > > Subject: [PATCH 03/16] Introduce pending flag for GUC assign hooks
    > > >
    > > > Currently an assing hook can perform some preprocessing of a new value,
    > > > but it cannot change the behavior, which dictates that the new value
    > > > will be applied immediately after the hook. Certain GUC options (like
    > > > shared_buffers, coming in subsequent patches) may need coordinating work
    > > > between backends to change, meaning we cannot apply it right away.
    > > >
    > > > Add a new flag "pending" for an assign hook to allow the hook indicate
    > > > exactly that. If the pending flag is set after the hook, the new value
    > > > will not be applied and it's handling becomes the hook's implementation
    > > > responsibility.
    > >
    > > I doubt it makes sense to add this to the GUC system. I think it'd be better
    > > to just use the GUC value as the desired "target" configuration and have a
    > > function or a show-only GUC for reporting the current size.
    > >
    > > I don't think you can't just block application of the GUC until the resize is
    > > complete. E.g. what if the value was too big and the new configuration needs
    > > to fixed to be lower?
    >
    > I think it was a bit hasty to post another version of the patch without
    > the design changes we've agreed upon last time. I'm still working on
    > that (sorry, it takes time, I haven't wrote so much Perl for testing
    > since forever), the current implementation doesn't include anything with
    > GUC to simplify the discussion. I'm still convinced that multi-step GUC
    > changing makes sense, but it has proven to be more complicated than I
    > anticipated, so I'll spin up another thread to discuss when I come to
    > it.
    >
    > > > From 0a55bc15dc3a724f03e674048109dac1f248c406 Mon Sep 17 00:00:00 2001
    > > > From: Dmitrii Dolgov <9erthalion6@gmail.com>
    > > > Date: Fri, 4 Apr 2025 21:46:14 +0200
    > > > Subject: [PATCH 04/16] Introduce pss_barrierReceivedGeneration
    > > >
    > > > Currently WaitForProcSignalBarrier allows to make sure the message sent
    > > > via EmitProcSignalBarrier was processed by all ProcSignal mechanism
    > > > participants.
    > > >
    > > > Add pss_barrierReceivedGeneration alongside with pss_barrierGeneration,
    > > > which will be updated when a process has received the message, but not
    > > > processed it yet. This makes it possible to support a new mode of
    > > > waiting, when ProcSignal participants want to synchronize message
    > > > processing. To do that, a participant can wait via
    > > > WaitForProcSignalBarrierReceived when processing a message, effectively
    > > > making sure that all processes are going to start processing
    > > > ProcSignalBarrier simultaneously.
    > >
    > > I doubt "online resizing" that requires synchronously processing the same
    > > event, can really be called "online". There can be significant delays in
    > > processing a barrier, stalling the entire server until that is reached seems
    > > like a complete no-go for production systems?
    > >
    > > [...]
    >
    > > As mentioned above, this basically makes the entire feature not really
    > > online. Besides the latency of some processes not getting to the barrier
    > > immediately, there's also the issue that actually reserving large amounts of
    > > memory can take a long time - during which all processes would be unavailable.
    > >
    > > I really don't see that being viable. It'd be one thing if that were a
    > > "temporary" restriction, but the whole design seems to be fairly centered
    > > around that.
    > >
    > > [...]
    > >
    > > Besides not really being online, isn't this a recipe for endless undetected
    > > deadlocks? What if process A waits for a lock held by process B and process B
    > > arrives at the barrier? Process A won't ever get there, because process B
    > > can't make progress, because A is not making progress.
    >
    > Same as above, in the version I'm working right now it's changed in
    > favor of an approach that looks more like the one from "online checksum
    > change" patch. I've even stumbled upon a cases when a process was just
    > killed and never arrive at the barrier, so that was it. The new approach
    > makes certain parts simpler, but requires managing backends with
    > different understanding of how large shared memory segments are for some
    > time interval. Introducing a new parameter "number of available buffers"
    > seems to be helpful to address all cases I've found so far.
    >
    > Btw, under "online" resizing I mostly understood "without restart", the
    > goal was not to make it really "online".
    >
    > > > -#define MAX_ON_EXITS 20
    > > > +#define MAX_ON_EXITS 40
    > >
    > > Why does a patch like this contain changes like this mixed in with the rest?
    > > That's clearly not directly related to $subject.
    >
    > An artifact of rebasing, it belonged to 0007.
    >
    > > > From e2f48da8a8206711b24e34040d699431910fbf9c Mon Sep 17 00:00:00 2001
    > > > From: Dmitrii Dolgov <9erthalion6@gmail.com>
    > > > Date: Tue, 17 Jun 2025 11:47:04 +0200
    > > > Subject: [PATCH 06/16] Address space reservation for shared memory
    > > >
    > > > Currently the shared memory layout is designed to pack everything tight
    > > > together, leaving no space between mappings for resizing. Here is how it
    > > > looks like for one mapping in /proc/$PID/maps, /dev/zero represents the
    > > > anonymous shared memory we talk about:
    > > >
    > > >     00400000-00490000         /path/bin/postgres
    > > >     ...
    > > >     012d9000-0133e000         [heap]
    > > >     7f443a800000-7f470a800000 /dev/zero (deleted)
    > > >     7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    > > >     7f4718400000-7f4718401000 /usr/lib64/libstdc++.so.6.0.34
    > > >     ...
    > > >
    > > > Make the layout more dynamic via splitting every shared memory segment
    > > > into two parts:
    > > >
    > > > * An anonymous file, which actually contains shared memory content. Such
    > > >   an anonymous file is created via memfd_create, it lives in memory,
    > > >   behaves like a regular file and semantically equivalent to an
    > > >   anonymous memory allocated via mmap with MAP_ANONYMOUS.
    > > >
    > > > * A reservation mapping, which size is much larger than required shared
    > > >   segment size. This mapping is created with flags PROT_NONE (which
    > > >   makes sure the reserved space is not used), and MAP_NORESERVE (to not
    > > >   count the reserved space against memory limits). The anonymous file is
    > > >   mapped into this reservation mapping.
    > >
    > > The commit message fails to explain why, if we're already relying on
    > > MAP_NORESERVE, we need to anything else? Why can't we just have one maximally
    > > sized allocation that's marked MAP_NORESERVE for all the parts that we don't
    > > yet need?
    >
    > How do we return memory to the OS in that case? Currently it's done
    > explicitly via truncating the anonymous file.
    >
    > > > * The file could be given a name, which improves readability when it
    > > >   comes to process maps.
    > >
    > > > * By default, Linux will not add file-backed shared mappings into a core dump,
    > > >   making it more convenient to work with them in PostgreSQL: no more huge dumps
    > > >   to process.
    > >
    > > That's just as well a downside, because you now can't investigate some
    > > issues. This was already configurable via coredump_filter.
    >
    > This behaviour is configured via coredump_filter as well, so just the
    > default value has been changed.
    >
    > > > From 942b69a0876b0e83303e6704da54c4c002a5a2d8 Mon Sep 17 00:00:00 2001
    > > > From: Dmitrii Dolgov <9erthalion6@gmail.com>
    > > > Date: Tue, 17 Jun 2025 11:22:02 +0200
    > > > Subject: [PATCH 07/16] Introduce multiple shmem segments for shared buffers
    > > >
    > > > Add more shmem segments to split shared buffers into following chunks:
    > > > * BUFFERS_SHMEM_SEGMENT: contains buffer blocks
    > > > * BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors
    > > > * BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers
    > > > * CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids
    > > > * STRATEGY_SHMEM_SEGMENT: contains buffer strategy status
    > >
    > > Why do all these need to be separate segments? Afaict we'll have to maximally
    > > size everything other than BUFFERS_SHMEM_SEGMENT at start?
    >
    > Why would they need to me maxed out at the start? So far my rule of
    > thumb was one segment for one structure which size depends on NBuffers,
    > so that when changing NBuffers each segment could be adjusted
    > independently.
    >
    
    Offlist Andres expressed that having multiple shared memory segments
    may impact the time it takes to disconnect a backend. If the
    application is using all the configured number of backends, a slow
    disconnection will lead to a slow connection. If we want to go the
    route of multiple segments (as many as 5) it would make sense to
    measure that impact first.
    
    Maxing out at start avoids using multiple segments. Those segments
    have much much lower memory compared to the buffer blocks even when
    maxed out with a reasonable max_shared_buffers setting. We avoid
    complicating code for a small increase in shared memory.
    
    -- 
    Best Wishes,
    Ashutosh Bapat
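
The claim that the per-buffer auxiliary structures stay small even when maxed out can be checked with back-of-the-envelope arithmetic. The C sketch below uses assumed per-buffer sizes (8 kB block, 64-byte padded buffer descriptor, 16-byte I/O condition variable, 20-byte checkpoint sort item); these are assumptions for a typical 64-bit build and are not figures taken from this thread, so check them against the source tree before relying on the result. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed per-buffer sizes in bytes (not taken from the thread). */
#define BLOCK_SIZE        8192  /* default block size */
#define DESCRIPTOR_SIZE     64  /* padded buffer descriptor */
#define IO_CV_SIZE          16  /* padded I/O condition variable */
#define CKPT_ITEM_SIZE      20  /* checkpoint sort item */

/* Memory for the buffer blocks themselves. */
uint64_t
block_bytes(uint64_t nbuffers)
{
    return nbuffers * BLOCK_SIZE;
}

/* Memory for the per-buffer structures that would be maxed out at start. */
uint64_t
aux_bytes(uint64_t nbuffers)
{
    return nbuffers * (DESCRIPTOR_SIZE + IO_CV_SIZE + CKPT_ITEM_SIZE);
}
```

Under these assumptions the auxiliary structures cost 100 bytes per buffer, about 1.2% of the 8192-byte block. For a maximum of 64 GB of buffer blocks (8388608 buffers) that is 800 MiB allocated up front, independent of the current shared_buffers value.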
    
    
    
    
  124. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-10-01T09:10:28Z

    > On Mon, Sep 29, 2025 at 12:21:08PM +0530, Ashutosh Bapat wrote:
    > On Sun, Sep 28, 2025 at 2:54 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    > >
    > > > On Thu, Sep 18, 2025 at 10:25:29AM +0530, Ashutosh Bapat wrote:
    > > > Given these things, I think we should set up the buffer lookup table
    > > > to hold maximum entries required to expand the buffer pool to its
    > > > maximum, right at the beginning.
    > >
    > > Thanks for investigating. I think another option would be to rebuild the
    > > buffer lookup table (create a new table based on the new size and copy
    > > the data over from the original one) as part of the resize procedure,
    > > alongsize with buffers eviction and initialization. From what I recall
    > > the size of buffer lookup table is about two orders of magnitude lower
    > > than shared buffers, so the overhead should not be that large even for
    > > significant amount of buffers.
    > 
    > The proposal will work but will require significant work:
    > 
    > 1. The pointer to the shared buffer lookup table will change.
    
    Which pointers do you mean? AFAICT no operation on the buffer lookup table
    returns a pointer (they work with buffer id or a hash) and keys are
    compared by value as well.
    
    > we can not have few processes accessing old lookup table and few
    > processes new one. That has potential to make many processes wait for
    > a very long time.
    
    As I've mentioned above, the size of the buffer lookup table is a few
    orders of magnitude lower than shared buffers, so I doubt it would be "a
    very long time". But it can be measured.
    
    > 2. The memory consumed by the old buffer lookup table will need to be
    > "freed" to the OS. The only way to do so is by having a new memory
    > segment 
    
    The shared buffer lookup table already lives in its own segment as
    implemented in the current patch, so I don't see any problem here.
    
    I see you folks are inclined to keep some small segments static and
    allocate maximum allowed memory for it. It's an option, at the end of
    the day we need to experiment and measure both approaches.
    
    
    
    
  125. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-10-01T10:20:17Z

    On Wed, Oct 1, 2025 at 2:40 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    >
    > > On Mon, Sep 29, 2025 at 12:21:08PM +0530, Ashutosh Bapat wrote:
    > > On Sun, Sep 28, 2025 at 2:54 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    > > >
    > > > > On Thu, Sep 18, 2025 at 10:25:29AM +0530, Ashutosh Bapat wrote:
    > > > > Given these things, I think we should set up the buffer lookup table
    > > > > to hold maximum entries required to expand the buffer pool to its
    > > > > maximum, right at the beginning.
    > > >
    > > > Thanks for investigating. I think another option would be to rebuild the
    > > > buffer lookup table (create a new table based on the new size and copy
    > > > the data over from the original one) as part of the resize procedure,
    > > > alongsize with buffers eviction and initialization. From what I recall
    > > > the size of buffer lookup table is about two orders of magnitude lower
    > > > than shared buffers, so the overhead should not be that large even for
    > > > significant amount of buffers.
    > >
    > > The proposal will work but will require significant work:
    > >
    > > 1. The pointer to the shared buffer lookup table will change.
    >
    > Which pointers you mean? AFAICT no operation on the buffer lookup table
    > returns a pointer (they work with buffer id or a hash) and keys are
    > compared by value as well.
    
    The buffer lookup table itself.
    /* Pass location of hashtable header to hash_create */
    infoP->hctl = (HASHHDR *) location;
    
    >
    > > we can not have few processes accessing old lookup table and few
    > > processes new one. That has potential to make many processes wait for
    > > a very long time.
    >
    > As I've mentioned above, size of the buffer lookup table is few
    > magnitudes lower than shared buffers, so I doubt about "a very long
    > time". But it can be measured.
    >
    > > 2. The memory consumed by the old buffer lookup table will need to be
    > > "freed" to the OS. The only way to do so is by having a new memory
    > > segment
    >
    > Shared buffer lookup table already lives in it's own segment as
    > implemented in the current patch, so I don't see any problem here.
    
    The table is not a single chunk of memory. It's a few chunks spread
    across the shared memory segment. Freeing a lookup table is like
    freeing those chunks. We have ways to free tail parts of shared memory
    segments, but not chunks in-between.
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  126. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-10-01T10:42:38Z

    > On Wed, Oct 01, 2025 at 03:50:17PM +0530, Ashutosh Bapat wrote:
    > The buffer lookup table itself.
    > /* Pass location of hashtable header to hash_create */
    > infoP->hctl = (HASHHDR *) location;
    
    How does this affect any users of the lookup table, if they do not even
    get to see those?
    
    > > Shared buffer lookup table already lives in its own segment as
    > > implemented in the current patch, so I don't see any problem here.
    > 
    > The table is not a single chunk of memory. It's a few chunks spread
    > across the shared memory segment. Freeing a lookup table is like
    > freeing those chunks. We have ways to free tail parts of shared memory
    > segments, but not chunks in-between.
    
    Right, and the idea was to rebuild it completely to fit into the new
    size, not just chunk-by-chunk.
    
    
    
    
  127. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-10-13T15:58:09Z

    Hi,
    
    I started studying the interaction of the checkpointer process with
    buffer pool resizing. Soon I noticed that the checkpointer didn't load
    the config as frequently as other backends. When it is executing a
    checkpoint, it does not reload the config for the entire duration of
    the checkpoint for example. As the synchronization is implemented, in
    the set of patches so far, the checkpointer will not see the new value
    of shared_buffers and will not acknowledge the proc signal barrier and
    thus not enter the synchronized buffer resizing. However, other
    backends will notice that the checkpointer has received the proc
    signal barrier and will enter the synchronization process. Once the
    proc signal barrier is received by all the backends, the backends
    which have entered the synchronization process will move forward with
    resizing buffer pool leaving behind those who have received but not
    acknowledged the proc signal barrier. At the end there will be two
    sets of backends, one which have entered synchronization and see the
    buffer pool with new size and the other which haven't entered
    synchronization and do not see the buffer pool with new size. This
    leads to SIGBUS or SIGSEGV (signal 11) in the other set of backends. I saw this
    mostly with the checkpointer process but we also saw it with other
    types of backends.
    
    Every aspect of buffer resizing that I started looking at was blocked
    by this behaviour. Since there were already other suggestions and
    comments about the current UI as well as synchronization mechanism, I
    started implementing a different UI and synchronization as described
    below. The WIP implementation is available in the attached set of
    patches.
    
    Patches 0001 to 0016 are the same as the previous patchset. I haven't
    touched them in case someone would like to see an incremental change.
    However, it's getting unwieldy at this point, so I will squash
    relevant patches together and provide a patchset with fewer patches
    next.
    0017 reverts 0003 and gets rid of the "pending" GUC flag which is
    not required by the new UI. Both will vanish from the next patchset.
    0018 implements the new UI described below.
    
    New UI and synchronization
    ==========================
    
    0018 changes the way "shared_buffers" is handled.
    a. A new global variable NBuffersPending is used to hold the value of
    this GUC. When the server starts, shared memory required by the buffer
    manager is calculated using NBuffersPending instead of NBuffers. Once
    the shared memory is allocated, NBuffers is set to NBuffersPending.
    NBuffers thus shows the number of buffers in the buffer pool instead
    of the value of the GUC.
    b. "shared_buffers" is PGC_SIGHUP now so it can be changed using ALTER
    SYSTEM ... SET shared_buffers = ...; followed by SELECT
    pg_reload_conf(). But this does not resize the buffer pool. It
    merely sets NBuffersPending to the new value. A new function
    pg_resize_buffer_pool() (described later) can be used to resize the
    buffer pool to the pending value.
    c. show "shared_buffers" shows the value of NBuffers, and
    NBuffersPending if it differs from NBuffers. I think we need some
    adjustment here when the resizing is in progress since the value of
    NBuffers would be changed to the size of the active buffer pool
    (explained later in the email), but I haven't worked out those details
    yet.
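    
    A minimal sketch of how the SHOW output in (c) could be derived from the
    two variables, written as standalone C rather than code from patch 0018.
    The function name show_shared_buffers() and the 8kB block size are
    assumptions made for illustration only.
    
    ```c
    #include <assert.h>
    #include <stdio.h>
    #include <string.h>
    
    #define BLCKSZ_KB 8     /* assumed block size of 8kB */
    
    /* Stand-ins for the two variables described above. */
    int NBuffers = 16384;          /* buffers currently in the pool (128MB) */
    int NBuffersPending = 16384;   /* value last loaded from the GUC */
    
    /*
     * Render the current pool size; append the pending value only when it
     * differs from the current one.
     */
    const char *
    show_shared_buffers(char *buf, size_t len)
    {
        long cur_mb = (long) NBuffers * BLCKSZ_KB / 1024;
        long pend_mb = (long) NBuffersPending * BLCKSZ_KB / 1024;
    
        assert(len > 0);
        if (NBuffers == NBuffersPending)
            snprintf(buf, len, "%ldMB", cur_mb);
        else
            snprintf(buf, len, "%ldMB (pending: %ldMB)", cur_mb, pend_mb);
        return buf;
    }
    ```
    
    Reloading the config would only change NBuffersPending in this sketch;
    NBuffers would catch up once the resize function has run.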
    
    A new GUC max_shared_buffers sets the upper limit on "shared_buffers".
    It is PGC_POSTMASTER, so changing it requires a restart. This GUC
    is used (a) to reserve the address space for future expansion of the
    buffer pool and (b) to allocate memory for a maximally sized buffer lookup
    table at the server start. We may decide to use the GUC to maximally
    allocate data structures other than buffer blocks as suggested by
    Andres. But these patches don't do that. The default for this GUC is
    0, which means it will be the same as shared_buffers. This maintains
    backward compatibility and also allows systems, which do not want to
    resize shared buffer pool, to allocate minimum memory. When it is set
    to a value other than 0, it should be set to a value higher than the
    shared_buffers at the start.
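    
    The rule for the default can be sketched as follows. This is a toy in
    plain C, not code from the patch; effective_max_buffers() and
    resize_target_is_valid() are hypothetical helpers, and treating a target
    equal to the maximum as valid is an assumption.
    
    ```c
    #include <assert.h>
    #include <stdbool.h>
    
    /*
     * Number of buffers to reserve address space for.  A max_shared_buffers
     * of 0 means "same as shared_buffers".
     */
    int
    effective_max_buffers(int shared_buffers, int max_shared_buffers)
    {
        assert(shared_buffers > 0 && max_shared_buffers >= 0);
        return (max_shared_buffers == 0) ? shared_buffers : max_shared_buffers;
    }
    
    /* A resize target must fit into the range reserved at server start. */
    bool
    resize_target_is_valid(int target, int shared_buffers_at_start,
                           int max_shared_buffers)
    {
        return target > 0 &&
            target <= effective_max_buffers(shared_buffers_at_start,
                                            max_shared_buffers);
    }
    ```
    
    With the default of 0 the pool can still shrink below its starting size
    in this sketch, but it cannot grow beyond it.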
    
    We need to support the ALTER SYSTEM ... SET shared_buffers = "" for
    backward compatibility. The users will still be able to perform ALTER
    SYSTEM and restart the server with a newer size of buffer pool. Also
    this allows the new buffer pool size to be written to
    postgresql.auto.conf and persist it. With this we can simply use
    pg_reload_conf() to load the new value along with other GUC changes.
    pg_resize_buffer_pool() merely picks the new value from the backend
    where it is executed and resizes the buffer pool. It does not need the
    new value to be loaded in all the backends.
    
    We may want to use a new PGC_ for this GUC but PGC_SIGHUP suffices for
    the time being and it might be acceptable with clear documentation.
    
    pg_resize_buffer_pool() implements a phase-wise buffer pool resizing
    operation, but it does not block all the backends till the buffer pool
    resizing is finished. It works as follows (pasted from the prologue
    in patch 0018).
    
    When resizing, the buffer pool is divided into two portions:
    
    - active buffer pool, which is the part of the buffer pool which
    remains active even during resizing. Its size is given by
    activeNBuffers. Newly allocated buffers will have their buffer ids
    less than activeNBuffers.
    
    - in-transit buffer pool, which is the part of the buffer pool which
    may be accessible to some backends but not others depending upon the
    time when a given backend processes a shrink/expand barrier. When
    shrinking the buffer pool this is the part of the buffer pool which
    will be evicted. When expanding the buffer pool this is the expanded
    portion. Its size is given by transitNBuffers. The backends may see
    buffer ids up to transitNBuffers till the resizing finishes.
    
    Before starting resizing, activeNBuffers = transitNBuffers = NBuffers
    where NBuffers is the size of buffer pool before resizing. NewNBuffers
    is the new size of the shared buffer pool. After resizing finishes
    activeNBuffers = transitNBuffers = NBuffers = NewNBuffers.
    
    In order to synchronize with other running backends, the coordinator
    sends following ProcSignalBarriers in the order given below:
    
    1. When shrinking the shared buffer pool the coordinator sends
    SHBUF_SHRINK ProcSignalBarrier. Every backend sets activeNBuffers =
    NewNBuffers to restrict its buffer pool allocations to the new size of
    the buffer pool and acknowledges the ProcSignalBarrier. Once every
    backend has acknowledged, the coordinator evicts the buffers in the
    area being shrunk. Note that transitNBuffers is still NBuffers, so the
    backends may see buffer ids up to NBuffers from earlier allocations
    till eviction completes.
    
    2. In both cases, when expanding the buffer pool or shrinking the
    buffer pool, the coordinator sends SHBUF_RESIZE_MAP_AND_MEM
    ProcSignalBarrier after resizing the shared memory segments and
    initializing the required data structures if any. Every backend is
    expected to adjust their shared memory segment address maps (by
    calling AnonymousShmemResize()) and validate that their pointers to
    the shared buffers structure are valid and have the right size. When
    shrinking shared buffer pool transitNBuffers is set to NewNBuffers and
    the backends should no longer see buffer ids beyond NewNBuffers; the
    buffer resizing operation is finished at this stage. When expanding
    they should set transitNBuffers to NewNBuffers to accommodate for the
    backends which may accept the next barrier earlier than the others.
    Once every backend acknowledges this barrier, the coordinator sends
    the next barrier when expanding the buffer pool.
    
    3. When expanding the buffer pool, the coordinator sends SHBUF_EXPAND
    ProcSignalBarrier. The backends are expected to set activeNBuffers =
    NewNBuffers and start allocating buffers from the expanded range. The
    coordinator uses this barrier to know when all the backends have
    settled using the new size of the buffer pool.
    
    For either operation, at most two barriers are sent.
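    
    The barrier ordering above can be sketched as a small state machine.
    This is a standalone C toy, not code from the patch: BackendView,
    plan_barriers() and absorb_barrier() are hypothetical names, and the
    actual memory remapping and buffer eviction are left out. Only the
    barrier names and the activeNBuffers / transitNBuffers updates follow
    the description.
    
    ```c
    #include <assert.h>
    
    typedef enum
    {
        SHBUF_SHRINK,
        SHBUF_RESIZE_MAP_AND_MEM,
        SHBUF_EXPAND
    } ShBufBarrier;
    
    /* What one backend believes about the pool while resizing is going on. */
    typedef struct
    {
        int activeNBuffers;    /* upper bound for new buffer allocations */
        int transitNBuffers;   /* upper bound for buffer ids it may see */
    } BackendView;
    
    /* Fill seq with the barriers the coordinator sends; return their count. */
    int
    plan_barriers(int NBuffers, int NewNBuffers, ShBufBarrier seq[2])
    {
        assert(NBuffers > 0 && NewNBuffers > 0);
        if (NewNBuffers < NBuffers)
        {
            seq[0] = SHBUF_SHRINK;
            seq[1] = SHBUF_RESIZE_MAP_AND_MEM;
            return 2;
        }
        if (NewNBuffers > NBuffers)
        {
            seq[0] = SHBUF_RESIZE_MAP_AND_MEM;
            seq[1] = SHBUF_EXPAND;
            return 2;
        }
        return 0;               /* same size: nothing to do */
    }
    
    /* Apply one barrier to a backend's view (remapping itself is omitted). */
    void
    absorb_barrier(BackendView *v, ShBufBarrier b, int NewNBuffers)
    {
        switch (b)
        {
            case SHBUF_SHRINK:
            case SHBUF_EXPAND:
                v->activeNBuffers = NewNBuffers;
                break;
            case SHBUF_RESIZE_MAP_AND_MEM:
                v->transitNBuffers = NewNBuffers;
                break;
        }
        /* Allocations never go beyond what may be visible. */
        assert(v->activeNBuffers <= v->transitNBuffers);
    }
    ```
    
    In both directions activeNBuffers never exceeds transitNBuffers in this
    sketch, which is the property that lets backends absorb the barriers at
    different times.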
    
    All of this in action looks as follows (see the tests in the patch for
    more examples):
    SHOW shared_buffers; -- default
     shared_buffers
    ----------------
     128MB
    (1 row)
    
    ALTER SYSTEM SET shared_buffers = '64MB';
    SELECT pg_reload_conf();
     pg_reload_conf
    ----------------
     t
    (1 row)
    
    SHOW shared_buffers;
        shared_buffers
    -----------------------
     128MB (pending: 64MB)
    (1 row)
    
    SELECT pg_resize_shared_buffers();
     pg_resize_shared_buffers
    --------------------------
     t
    (1 row)
    
    SHOW shared_buffers;
     shared_buffers
    ----------------
     64MB
    (1 row)
    
    ALTER SYSTEM SET shared_buffers = '256MB';
    SELECT pg_reload_conf();
     pg_reload_conf
    ----------------
     t
    (1 row)
    
    SHOW shared_buffers;
        shared_buffers
    -----------------------
     64MB (pending: 256MB)
    (1 row)
    
    SELECT pg_resize_shared_buffers();
     pg_resize_shared_buffers
    --------------------------
     t
    (1 row)
    
    SHOW shared_buffers;
     shared_buffers
    ----------------
     256MB
    (1 row)
    
    On Thu, Sep 18, 2025 at 7:22 PM Andres Freund <andres@anarazel.de> wrote:
    >
    > > From 0a13e56dceea8cc7a2685df7ee8cea434588681b Mon Sep 17 00:00:00 2001
    > > From: Dmitrii Dolgov <9erthalion6@gmail.com>
    > > Date: Sun, 6 Apr 2025 16:40:32 +0200
    > > Subject: [PATCH 03/16] Introduce pending flag for GUC assign hooks
    > >
    > > Currently an assign hook can perform some preprocessing of a new value,
    > > but it cannot change the behavior, which dictates that the new value
    > > will be applied immediately after the hook. Certain GUC options (like
    > > shared_buffers, coming in subsequent patches) may need coordinating work
    > > between backends to change, meaning we cannot apply it right away.
    > >
    > > Add a new flag "pending" for an assign hook to allow the hook to indicate
    > > exactly that. If the pending flag is set after the hook, the new value
    > > will not be applied and its handling becomes the hook's implementation
    > > responsibility.
    >
    > I doubt it makes sense to add this to the GUC system. I think it'd be better
    > to just use the GUC value as the desired "target" configuration and have a
    > function or a show-only GUC for reporting the current size.
    
    This has been taken care of in the new implementation with a slightly
    different approach to the SHOW command as described above.
    
    >
    > I don't think you can just block application of the GUC until the resize is
    > complete. E.g. what if the value was too big and the new configuration needs
    > to be fixed to be lower?
    >
    
    With the above approach, the application of the GUC won't be blocked
    but if the size being applied is taking too long, the operation will
    be required to be cancelled before the new resize can happen. That's a
    part that needs some work. Chasing a moving target requires a very
    complex implementation, which would be good to avoid in the first
    version at least. However, we should leave room for that future
    enhancement. The current implementation gives that flexibility, I
    think.
    
    >
    > > From 0a55bc15dc3a724f03e674048109dac1f248c406 Mon Sep 17 00:00:00 2001
    > > From: Dmitrii Dolgov <9erthalion6@gmail.com>
    > > Date: Fri, 4 Apr 2025 21:46:14 +0200
    > > Subject: [PATCH 04/16] Introduce pss_barrierReceivedGeneration
    > >
    > > Currently WaitForProcSignalBarrier allows to make sure the message sent
    > > via EmitProcSignalBarrier was processed by all ProcSignal mechanism
    > > participants.
    > >
    > > Add pss_barrierReceivedGeneration alongside with pss_barrierGeneration,
    > > which will be updated when a process has received the message, but not
    > > processed it yet. This makes it possible to support a new mode of
    > > waiting, when ProcSignal participants want to synchronize message
    > > processing. To do that, a participant can wait via
    > > WaitForProcSignalBarrierReceived when processing a message, effectively
    > > making sure that all processes are going to start processing
    > > ProcSignalBarrier simultaneously.
    >
    > I doubt "online resizing" that requires synchronously processing the same
    > event, can really be called "online". There can be significant delays in
    > processing a barrier, stalling the entire server until that is reached seems
    > like a complete no-go for production systems?
    >
    > > From 78bc0a49f8ebe17927abd66164764745ecc6d563 Mon Sep 17 00:00:00 2001
    > > From: Dmitrii Dolgov <9erthalion6@gmail.com>
    > > Date: Tue, 17 Jun 2025 14:16:55 +0200
    > > Subject: [PATCH 11/16] Allow to resize shared memory without restart
    > >
    > > Add assign hook for shared_buffers to resize shared memory using space,
    > > introduced in the previous commits without requiring PostgreSQL restart.
    > > Essentially the implementation is based on two mechanisms: a
    > > ProcSignalBarrier is used to make sure all processes are starting the
    > > resize procedure simultaneously, and a global Barrier is used to
    > > coordinate after that and make sure all finished processes are waiting
    > > for others that are in progress.
    > >
    > > The resize process looks like this:
    > >
    > > * The GUC assign hook sets a flag to let the Postmaster know that resize
    > >   was requested.
    > >
    > > * Postmaster verifies the flag in the event loop, and starts the resize
    > >   by emitting a ProcSignal barrier.
    > >
    > > * All processes, that participate in ProcSignal mechanism, begin to
    > >   process ProcSignal barrier. First a process waits until all processes
    > >   have confirmed they received the message and can start simultaneously.
    >
    > As mentioned above, this basically makes the entire feature not really
    > online. Besides the latency of some processes not getting to the barrier
    > immediately, there's also the issue that actually reserving large amounts of
    > memory can take a long time - during which all processes would be unavailable.
    >
    > I really don't see that being viable. It'd be one thing if that were a
    > "temporary" restriction, but the whole design seems to be fairly centered
    > around that.
    
    In the new implementation regular backends are not stalled when the
    resizing is going on. They continue their work with possible temporary
    performance degradation (this needs to be measured).
    
    >
    > > From experiment it turns out that shared mappings have to be extended
    > > separately for each process that uses them. Another rough edge is that a
    > > backend blocked on ReadCommand will not apply shared_buffers change
    > > until it receives something.
    >
    > That's not a rough edge, that basically makes the feature unusable, no?
    
    New synchronization doesn't have this problem since it doesn't require
    every backend to load the new value. The value being loaded only in
    the backend where pg_resize_buffer_pool() is being run is enough.
    
    >
    > > From 942b69a0876b0e83303e6704da54c4c002a5a2d8 Mon Sep 17 00:00:00 2001
    > > From: Dmitrii Dolgov <9erthalion6@gmail.com>
    > > Date: Tue, 17 Jun 2025 11:22:02 +0200
    > > Subject: [PATCH 07/16] Introduce multiple shmem segments for shared buffers
    > >
    > > Add more shmem segments to split shared buffers into following chunks:
    > > * BUFFERS_SHMEM_SEGMENT: contains buffer blocks
    > > * BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors
    > > * BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers
    > > * CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids
    > > * STRATEGY_SHMEM_SEGMENT: contains buffer strategy status
    >
    > Why do all these need to be separate segments? Afaict we'll have to maximally
    > size everything other than BUFFERS_SHMEM_SEGMENT at start?
    >
    
    I am leaning towards that. I will implement that soon.
    
    On Wed, Oct 1, 2025 at 2:40 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    >
    >
    > I see you folks are inclined to keep some small segments static and
    > allocate maximum allowed memory for it. It's an option, at the end of
    > the day we need to experiment and measure both approaches.
    
    I did measure performance with a maximally sized buffer lookup table
    (shared_buffers = 128MB, max_shared_buffers = 10GB) on my laptop.
    There was no noticeable difference in the performance. I will post
    formal numbers with the next patchset.
    
    >
    >
    > > * Every process recalculates shared memory size based on the new
    > >   NBuffers, adjusts its size using ftruncate and adjust reservation
    > >   permissions with mprotect. One elected process signals the postmaster
    > >   to do the same.
    >
    > If we just used a single memory mapping with all unused parts marked
    > MAP_NORESERVE, we wouldn't need this (and wouldn't need a fair bit of other
    > work in this patchset)..
    >
    On Sat, Sep 27, 2025 at 12:06 AM Andres Freund <andres@anarazel.de> wrote:
    >
    > > How do we return memory to the OS in that case? Currently it's done
    > > explicitly via truncating the anonymous file.
    >
    > madvise with MADV_DONTNEED or MADV_REMOVE.
    
    The patchset still uses the ftruncate + mprotect. I have questions
    apart from portability concerns about your proposal. MADV_DONTNEED
    documentation says
        After a successful MADV_DONTNEED operation, the semantics of
        memory access in the specified region are changed: subsequent
        accesses of pages in the range will succeed, but will result in
        either repopulating the memory contents from the up-to-date
        contents of the underlying mapped file (for shared file
        mappings, shared anonymous mappings, and shmem-based techniques
        such as System V shared memory segments) or zero-fill-on-demand
        pages for anonymous private mappings.
    
        Note that, when applied to shared mappings, MADV_DONTNEED might
        not lead to immediate freeing of the pages in the range.  The
        kernel is free to delay freeing the pages until an appropriate
        moment.  The resident set size (RSS) of the calling process will
        be immediately reduced however.
    
        MADV_DONTNEED cannot be applied to locked pages, Huge TLB pages,
        or VM_PFNMAP pages.  (Pages marked with the kernel-internal
        VM_PFNMAP flag are special memory areas that are not managed by
        the virtual memory subsystem.  Such pages are typically created
        by device drivers that map the pages into user space.)
    
    and MADV_REMOVE (since Linux 2.6.16)
        Free up a given range of pages and its associated backing
        store.  This is equivalent to punching a hole in the
        corresponding byte range of the backing store (see
        fallocate(2)).  Subsequent accesses in the specified address
        range will see bytes containing zero.
    
        The specified address range must be mapped shared and writable.
        This flag cannot be applied to locked pages, Huge TLB pages, or
        VM_PFNMAP pages.
    
    Combining these two,
    1. The access to the freed memory doesn't give any error but returns
    0. Won't that lead to silent corruption?
    2. Those are not supported with Huge TLB pages. So they cannot be used
    when huge_pages = on?
    
    With the current approach, we get SIGBUS or SIGSEGV (signal 11) when the
    process tries to access the freed memory. That protection won't be there
    with madvise().
    
    The synchronization mechanism in this patch is inspired by Thomas's
    implementation posted in [1].
    
    I still need to go through Tomas's detailed comments and address those
    which still apply. And the patches are still WIP, with many TODOs. But
    I wanted to get some feedback on the proposed UI and synchronization
    as described above.
    
    I will be looking into the cases below one by one
    1. New backends join while the synchronization is going on. An
    existing backend exiting.
    2. Failure or crash in the backend which is executing pg_resize_buffer_pool()
    3. Fix crashes in the tests.
    
    [1] postgr.es/m/CA+hUKGL5hW3i_pk5y_gcbF_C5kP-pWFjCuM8bAyCeHo3xUaH8g@mail.gmail.com
    
    
    
    
    --
    Best Wishes,
    Ashutosh Bapat
    
  128. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-10-14T08:35:55Z

    As I've mentioned in our off list communication, I'm working on the new
    design and was planning to post some intermediate results in a couple of
    weeks. Thus I'm surprised that instead of aligning on plans you've
    decided to post your own version earlier. It most certainly doesn't make
    things easier for me, so what's your plan anyway? Are you trying to
    hijack the thread with your own patches? It doesn't strike me as
    a particularly constructive thing to do.
    
    
    
    
  129. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-10-16T16:25:05Z

    Hi Dmitry,
    
    On Tue, Oct 14, 2025 at 2:05 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    >
    > As I've mentioned in our off list communication, I'm working on the new
    > design and was planning to post some intermediate results in a couple of
    > weeks. Thus I'm surprised that instead of aligning on plans you've
    > decided to post your own version earlier. It most certainly doesn't make
    > things easier for me, so what's your plan anyway? Are you trying to
    > hijack the thread with your own patches? It doesn't strike me as
    > a particularly constructive thing to do.
    
    I am sorry if you have felt that way. That wasn't the intention.
    Please allow me to explain ...
    
    Tomas and Andres have pointed out some serious faults in the patchset
    posted at [2], including problems in my patches. There are many of
    those. If we work in parallel we can make good progress. So I
    continued working based on your last patchset [2], which was posted
    more than 3 months ago. Knowing that you are working on a design, I
    tried not to touch the synchronization and UI part and yet find
    solutions to some of the open problems (my patchset in [3] is a recent
    example). As I mentioned in my email at [1], every open question I
    tried to solve next was blocked because of a single problem, which I
    have described in my previous email - A problem in synchronization in
    the patchset at [2].
    
    Instead of just doing nothing, I thought I would try to implement the
    UI and synchronization that I had in mind. Once I implemented it and
    saw that it could address a few serious concerns raised by Andres, I
    thought I would share it with hackers to get some early feedback.
    Early feedback from people like Andres and Tomas is important to avoid
    going down the wrong path (and wasting time). Is there something wrong
    with that? BTW, this idea isn't new and it's certainly not only mine.
    It's a combination of an implementation shared by Thomas Munro [4] and
    an implementation I had shared with you offlist on 30th January 2025.
    I never saw any comments from you on the specific changes in those
    implementations, nor was anything from those patchsets absorbed
    in your patchsets.
    
    If I had posted my alternate solution in January itself, that
    might have been considered hijacking (that's a serious accusation,
    btw). But instead I worked with your patches, improving them as long
    as I could. Even the patchset I shared is still
    on top of your patchset in [2].
    
    I don't know your solution. But if it's similar to my proposal, we are
    in agreement and can work further in parallel on subproblems. If it's
    different, let's discuss pros and cons of both - maybe there is some
    value in letting those evolve in parallel and let the community choose
    the best, or choose best of both solutions giving rise to a new
    solution. My patchset might give you solutions/code for the problems
    you are trying to solve. It has tests which you can adapt to your
    solution. Many exciting possibilities lie ahead with multiple working
    solutions. Knowing nothing about the solution you are attempting, it's
    hard to know which of these apply and help you.
    
    [1] https://www.postgresql.org/message-id/CAExHW5sOu8%2B9h6t7jsA5jVcQ--N-LCtjkPnCw%2BrpoN0ovT6PHg%40mail.gmail.com
    [2] https://www.postgresql.org/message-id/my4hukmejato53ef465ev7lk3sqiqvneh7436rz64wmtc7rbfj%40hmuxsf2ngov2
    [3] https://www.postgresql.org/message-id/CAExHW5vB8sAmDtkEN5dcYYeBok3D8eAzMFCOH1k%2Bkrxht1yFjA%40mail.gmail.com
    [4] https://www.postgresql.org/message-id/CA%2BhUKGL5hW3i_pk5y_gcbF_C5kP-pWFjCuM8bAyCeHo3xUaH8g%40mail.gmail.com
    
    --
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  130. Re: Changing shared_buffers without restart

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-10-17T10:33:31Z

    > On Thu, Oct 16, 2025 at 09:55:05PM +0530, Ashutosh Bapat wrote:
    >
    > BTW, this idea isn't new and it's certainly not only mine.
    > It's a combination of an implementation shared by Thomas Munro [4] and
    > an implementation I had shared with you offlist on 30th January 2025.
    > I never saw any comments from you on the specific changes in those
    > implementations, nor was anything from those patchsets absorbed
    > in your patchsets.
    
    Well, this is simply not true. We had an extensive discussion for long
    time off-list and even a few video calls to talk through various design
    options and agree about next steps.
    
    > I don't know your solution. But if it's similar to my proposal, we are
    > in agreement and can work further in parallel on subproblems. If it's
    > different, let's discuss pros and cons of both - maybe there is some
    > value in letting those evolve in parallel and let the community choose
    > the best, or choose best of both solutions giving rise to a new
    > solution. My patchset might give you solutions/code for the problems
    > you are trying to solve. It has tests which you can adapt to your
    > solution. Many exciting possibilities lie ahead with multiple working
    > solutions. Knowing nothing about the solution you are attempting, it's
    > hard to know which of these apply and help you.
    
    I've shared many times on- and off-list the general directions I'm
    working in and even the expected timeline, so it's strange to state you
    don't know it.
    
    In the end you're free to do whatever you want, fortunately it's open
    source. But posting an alternative patch series and "let the community
    choose" does sound like hijacking to me, and a direct way to split and
    reduce already scarce review attention.
    
    
    
    
  131. Re: Changing shared_buffers without restart

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-11-14T11:53:21Z

    Hi,
    PFA new patchset with some TODOs from previous email addressed:
    
    On Mon, Oct 13, 2025 at 9:28 PM Ashutosh Bapat
    <ashutosh.bapat.oss@gmail.com> wrote:
    > 1. New backends join while the synchronization is going on.
    
    Done. Explained the solution below in detail.
    
    > An existing backend exiting.
    
    Not tested specifically, but should work.
    
    > 2. Failure or crash in the backend which is executing pg_resize_buffer_pool()
    
    still a TODO
    
    > 3. Fix crashes in the tests.
    
    Core regression tests pass, pg_buffercache regression tests pass and
    the tests for buffer resizing pass most of the time. So far I have
    seen three issues:
    1. An assertion from AIO worker - which happened only once and I
    couldn't reproduce again. Need to study interaction of AIO worker with
    buffer resizing.
    2. checkpointer crashes - which is one of the TODOs listed below.
    3. Also there's a shared memory id related failure, which I don't
    understand but which happens more frequently than the first one. Need
    to look into that.
    
    > go through Tomas's detailed comments and address those
    > which still apply.
    
    Still a TODO. But since many of those patches are revised heavily, I
    think many of the comments may have been addressed, some may not apply
    anymore.
    
    > And the patches are still WIP, with many TODOs. But I wanted to get some feedback on the proposed UI and synchronization
    
    This is still a request.
    
    > Patches 0001 to 0016 are the same as the previous patchset. I haven't
    > touched them in case someone would like to see an incremental change.
    > However, it's getting unwieldy at this point, so I will squash
    > relevant patches together and provide a patchset with fewer patches
    > next.
    
    I have squashed the patches into 3 so that it's easy to review, read
    and work with those patches. The work is still WIP and there are many
    TODOs in the patches.
    
    Patch 0001: SQL interface to read contents of buffer lookup table. It
    was there in the previous patchset as 0001 but in this patchset I have
    moved the SQL function to the pg_buffercache module and renamed it
    accordingly. I added this change because I found it useful to debug
    issues I found while testing buffer resizing patches. The issues were
    related to page->buffer mappings which existed in the buffer lookup
    table but were not present in the buffer descriptor array or buffer
    blocks. pg_buffercache, which traverses just the buffer descriptor
    array, isn't enough. Even without the resizing functionality this will
    help us catch situations where buffer descriptor array and buffer
    lookup table go out of sync. I plan to keep it in this patchset as a
    debugging tool. If other developers feel that it could be useful, I
    will propose it in a separate thread.
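
    To illustrate the kind of mismatch described above, here is a minimal
    sketch of such a consistency check. It is not the SQL function from
    the patch and does not use PostgreSQL's real structures: ToyPageTag,
    ToyLookupEntry, ToyBufferDesc and count_stale_mappings() are all
    hypothetical names. It only assumes that a lookup entry maps a page to
    a buffer id and that a descriptor records which page the buffer holds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified page identifier. */
typedef struct ToyPageTag { unsigned rel; unsigned block; } ToyPageTag;

/* Hypothetical lookup-table entry: page -> buffer id. */
typedef struct ToyLookupEntry { ToyPageTag tag; int buf_id; } ToyLookupEntry;

/* Hypothetical descriptor: which page the buffer holds, if any. */
typedef struct ToyBufferDesc { ToyPageTag tag; bool valid; } ToyBufferDesc;

static bool
tag_equal(ToyPageTag a, ToyPageTag b)
{
    return a.rel == b.rel && a.block == b.block;
}

/*
 * Count lookup entries that point at a buffer which is out of range,
 * not valid, or holding a different page than the entry claims.
 */
size_t
count_stale_mappings(const ToyLookupEntry *entries, size_t nentries,
                     const ToyBufferDesc *descs, int ndescs)
{
    size_t stale = 0;

    for (size_t i = 0; i < nentries; i++)
    {
        int id = entries[i].buf_id;

        if (id < 0 || id >= ndescs || !descs[id].valid ||
            !tag_equal(entries[i].tag, descs[id].tag))
            stale++;
    }
    return stale;
}
```

    A scan of the descriptor array alone cannot find these entries,
    because the stale mappings live only on the lookup-table side.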
    
    Patch 0002: This is a single patch squashing all patches (0005, 0006,
    0007, 0008, 0009 and 0010) related to shared memory management and
    address space reservation together. This patch allows the creation of
    multiple shared memory segments and also lays them out so as to make
    those resizable. The actual code to resize the segments is in the next
    patch. The APIs used for memory management and address space
    reservation are described later. Prominent changes from the previous
    patches are:
    1. It modifies CalculateShmemSize() so that it can work with multiple
    shared memory segments.
    2. It combines the AnonymousMapping and ShmemSegment structures,
    as suggested by Tomas upthread. The merger is still in progress:
    some old comments and variable names refer to memory
    mappings when they should mention shared memory segments. I will
    work on that when I start polishing this patch.
    3. The GUC to specify the maximum size of the buffer pool has been
    renamed and moved to the next patch, which deals with actual resizing.
    4. Changes to process config reload in AIO workers are removed. Those
    are not needed after 55b454d0e14084c841a034073abbf1a0ea937a45.
    
    Patch 0003: Implements the UI and synchronization described in the
    previous email [1] with additional improvements to support a new
    backend joining while resizing is in progress. This patch squashes
    patches 0002 - 0004 and patches 0011 onward from the previous
    patchset, but it also gets rid of a lot of code related to the old
    synchronization method and the old UI. The code related to resizing,
    including the implementation of pg_resize_shared_buffers(), is moved to
    storage/buffer/buf_resize.c, a new file. There is no change to the UI.
    Buffer resizing still looks as described in the previous
    email.
    
    > SHOW shared_buffers; -- default
    >  shared_buffers
    > ----------------
    >  128MB
    > (1 row)
    >
    > ALTER SYSTEM SET shared_buffers = '64MB';
    > SELECT pg_reload_conf();
    >  pg_reload_conf
    > ----------------
    >  t
    > (1 row)
    >
    > SHOW shared_buffers;
    >     shared_buffers
    > -----------------------
    >  128MB (pending: 64MB)
    > (1 row)
    >
    > SELECT pg_resize_shared_buffers();
    >  pg_resize_shared_buffers
    > --------------------------
    >  t
    > (1 row)
    >
    > SHOW shared_buffers;
    >  shared_buffers
    > ----------------
    >  64MB
    > (1 row)
    >
    > ALTER SYSTEM SET shared_buffers = '256MB';
    > SELECT pg_reload_conf();
    >  pg_reload_conf
    > ----------------
    >  t
    > (1 row)
    >
    > SHOW shared_buffers;
    >     shared_buffers
    > -----------------------
    >  64MB (pending: 256MB)
    > (1 row)
    >
    > SELECT pg_resize_shared_buffers();
    >  pg_resize_shared_buffers
    > --------------------------
    >  t
    > (1 row)
    >
    > SHOW shared_buffers;
    >  shared_buffers
    > ----------------
    >  256MB
    > (1 row)
    >
    
    The implementation uses a strategy similar to the one described in the
    previous email, with the changes described below.
    
    A new backend inherits the address space of shared memory segments and
    the local variable NBuffers through Postmaster. These are changed when
    resizing the buffer pool. And the same changes need to be applied to
    the Postmaster so that a new backend inherits them. Since Postmaster
    is not part of the ProcSignalBarrier mechanism, the coordinator has to
    send signals to the Postmaster separately. This has the following
    drawbacks:
    1. Additional code is needed to signal Postmaster.
    2. The coordinator has to wait for Postmaster to apply the changes
    separately, thus adding extra delays.
    3. Platforms which use fork() + exec() need additional complexity to
    transfer the state to a new child.
    4. If the postmaster is signaled after sending a barrier to other
    backends, the newly joined backend will miss the state update as well
    as the barrier. If the postmaster is signaled before sending a barrier
    to other backends, a newly joining backend will receive the barrier as
    well as state update from Postmaster. This means the barrier handling
    code is required to be idempotent. This will make the barrier handling
    code more complex and also constrained.
    
    Instead, the approach taken by Thomas Munro in [2] does not require
    updating the address space. It uses shared memory variables instead of
    process-local variables to save the state of the shared buffer
    pool. This patchset uses a similar approach, which
    1. avoids involving Postmaster in the resizing process
    2. makes the barrier handling code super thin.
    
    Shared Memory and address space management
    ==========================================
    An fd is created using memfd_create() to manage the size of the shared
    memory segment using ftruncate() and fallocate(). That fd is passed to
    mmap(), which reserves the maximum required address space and maps the
    anonymous file (and the backing memory) in that address space. mmap()
    uses MAP_NORESERVE so as not to allocate memory against the mapping. The
    size of the anonymous file controls the amount of memory allocated.
    For the main shared memory segment, the size of the reserved space is
    the same as the amount of memory required. But for shared buffer pool
    related segments the size of the reserved space is decided by GUC
    max_shared_buffers (mentioned in the previous email and quoted below).
    When resizing shared buffers only the anonymous file is resized and
    not the address space. I tested this protocol with an attached small
    program (mfdtruncate.c). Sharing it in case somebody finds it useful.
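
    For readers without the attachment, the protocol above can be sketched
    roughly as follows. This is not mfdtruncate.c and not code from the
    patches; ResizableSegment, segment_create() and segment_resize() are
    hypothetical names, the sketch is Linux-only, and it leaves out
    fallocate() and most error reporting.

```c
#define _GNU_SOURCE             /* for memfd_create() */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical illustration type, not a PostgreSQL structure. */
typedef struct ResizableSegment
{
    int     fd;         /* anonymous file backing the segment */
    char   *base;       /* start of the reserved address range */
    size_t  reserved;   /* bytes of address space reserved */
    size_t  size;       /* current size of the backing file */
} ResizableSegment;

/* Reserve 'reserved' bytes of address space, backed by 'size' bytes. */
int
segment_create(ResizableSegment *seg, size_t reserved, size_t size)
{
    void   *addr;

    assert(size <= reserved);

    seg->fd = memfd_create("resizable-segment", 0);
    if (seg->fd < 0)
        return -1;
    if (ftruncate(seg->fd, (off_t) size) != 0)
    {
        close(seg->fd);
        return -1;
    }

    /* Map the whole reservation; only the file's size is backed. */
    addr = mmap(NULL, reserved, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_NORESERVE, seg->fd, 0);
    if (addr == MAP_FAILED)
    {
        close(seg->fd);
        return -1;
    }
    seg->base = addr;
    seg->reserved = reserved;
    seg->size = size;
    return 0;
}

/* Grow or shrink the backing file; the mapping itself is untouched. */
int
segment_resize(ResizableSegment *seg, size_t new_size)
{
    if (new_size > seg->reserved)
        return -1;
    if (ftruncate(seg->fd, (off_t) new_size) != 0)
        return -1;
    seg->size = new_size;
    return 0;
}
```

    In this sketch the base address never changes across a resize, so
    pointers into the segment stay valid. Touching memory beyond the
    current file size raises SIGBUS, so callers must stay within
    seg->size.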
    
    Saving shared buffer pool sizes in the shared memory
    ====================================================
    When resizing, we need to track two ranges of buffers: 1. active
    buffers, which is the range of buffers from which new allocations
    happen at a given time, and 2. valid buffers, which is the range of
    buffers which are valid at a given time. When shrinking, the active
    buffers range is set to the new size while the valid buffers range
    remains at the old size until all the buffers outside the new size are
    evicted. When expanding, valid buffers and active buffers are both
    changed to the new size after memory is resized and the expanded data
    structures are initialized. The current global variable NBuffers is
    insufficient to track these two numbers.
    
    Instead we have a new member StrategyControl::activeNBuffers which
    tracks the active buffer range. The shared memory structure
    controlling the resizing operation (ShmemCtrl) has a member
    currentNBuffers which gives the range of valid shared
    buffers at a given point in time. (I am planning to merge ShmemCtrl
    and StrategyControl, so that we have all the metadata about shared
    buffers in one place in the shared memory.) These two numbers are
    saved in the shared memory for the reasons explained below and replace
    the current NBuffers. They are modified by the coordinator as the resizing
    progresses. Some usages of NBuffers are replaced by one of the two
    variables as appropriate but more work is required.
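
    The shrink and expand transitions of the two ranges can be sketched
    as below. This is a simplified model, not the patch's code:
    BufferPoolSizes and the function names are hypothetical, the fields
    only stand in for activeNBuffers and currentNBuffers, and the atomics
    or locking that real shared memory state would need are omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the sizes kept in shared memory. */
typedef struct BufferPoolSizes
{
    int active_nbuffers;    /* new allocations come from [0, active) */
    int valid_nbuffers;     /* buffers in [0, valid) may still hold pages */
} BufferPoolSizes;

/* Shrink, step 1: stop allocating beyond the new size. */
void
shrink_begin(BufferPoolSizes *s, int new_size)
{
    assert(new_size <= s->active_nbuffers);
    s->active_nbuffers = new_size;
}

/* Shrink, step 2: call once buffers in [active, valid) are evicted. */
void
shrink_finish(BufferPoolSizes *s)
{
    s->valid_nbuffers = s->active_nbuffers;
}

/* Expand: call after memory is resized and new entries initialized. */
void
expand(BufferPoolSizes *s, int new_size)
{
    assert(new_size >= s->valid_nbuffers);
    s->valid_nbuffers = new_size;
    s->active_nbuffers = new_size;
}

/* May a new page be placed in this buffer? */
bool
buffer_is_allocatable(const BufferPoolSizes *s, int buf_id)
{
    return buf_id >= 0 && buf_id < s->active_nbuffers;
}

/* May this buffer still hold a page that has to be dealt with? */
bool
buffer_is_valid(const BufferPoolSizes *s, int buf_id)
{
    return buf_id >= 0 && buf_id < s->valid_nbuffers;
}
```

    Between shrink_begin() and shrink_finish() the two numbers differ,
    which is exactly the window a single NBuffers value cannot describe.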
    
    Next I will be working on
    1. Background writer synchronization
    2. Checkpoint synchronization
    3. Make all the shared buffer pool structures, except buffer blocks,
    static and maximally allocated as suggested by Andres earlier. [3]
    4. Replace NBuffers usages as explained above
    5. Merge ShmemCtrl and StrategyControl as explained above
    6. Handle failures in resizing
    7. There have been concerns raised earlier that anonymous file backed
    memory is not dumped with core. I am thinking of not using an
    anonymous file for the main memory segment so that it gets dumped with
    core. But shared buffers still will be dumped. However, I am skeptical
    as to whether we need GBs (say) of shared buffers being dumped along
    with core or should we leave that choice to users.
    
    [1] https://www.postgresql.org/message-id/CAExHW5sOu8+9h6t7jsA5jVcQ--N-LCtjkPnCw+rpoN0ovT6PHg@mail.gmail.com
    [2] https://www.postgresql.org/message-id/CA%2BhUKGL5hW3i_pk5y_gcbF_C5kP-pWFjCuM8bAyCeHo3xUaH8g%40mail.gmail.com
    [3] https://www.postgresql.org/message-id/qltuzcdxapofdtb5mrd4em3bzu2qiwhp3cdwdsosmn7rhrtn4u%40yaogvphfwc4h
    
    
    
    
    --
    Best Wishes,
    Ashutosh Bapat