Re: Changing shared_buffers without restart

Tomas Vondra <tomas@vondra.me>

From: Tomas Vondra <tomas@vondra.me>
To: Dmitry Dolgov <9erthalion6@gmail.com>
Cc: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>, Thomas Munro <thomas.munro@gmail.com>, pgsql-hackers@postgresql.org, Jack Ng <Jack.Ng@huawei.com>, Ni Ku <jakkuniku@gmail.com>
Date: 2025-07-07T11:57:42Z
Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Remove PG_MMAP_FLAGS from mem.h

  2. Improve runtime and output of tests for replication slots checkpointing.

  3. Revert support for improved tracking of nested queries

  4. Use exported symbols list on macOS for loadable modules as well

  5. Add support for basic NUMA awareness

  6. Avoid unnecessary copying of a string in pg_restore.c

  7. aio: Infrastructure for io_method=worker

  8. Improve InitShmemAccess() prototype


On 7/5/25 12:35, Dmitry Dolgov wrote:
>> On Fri, Jul 04, 2025 at 05:23:29PM +0200, Tomas Vondra wrote:
>>>> 2) pending GUC changes
>>>>
>>>> Perhaps this should be a separate utility command, or maybe even just
>>>> a new ALTER SYSTEM variant? Or even just a function, similar to what
>>>> the "online checksums" patch did, possibly combined with a bgworder
>>>> (but probably not needed, there are no db-specific tasks to do).
>>>
>>> This is one topic we still actively discuss, but haven't had much
>>> feedback otherwise. The pros and cons seem to be clear:
>>>
>>> * Utilizing the existing GUC mechanism would allow treating
>>>   shared_buffers as any other configuration, meaning that potential
>>>   users of this feature don't have to do anything new to use it -- they
>>>   still can use whatever method they prefer to apply new configuration
>>>   (pg_reload_conf, pg_ctr reload, maybe even sending SIGHUP directly).
>>>
>>>   I'm also wondering if it's only shared_buffers, or some other options
>>>   could use similar approach.
>>>
>>
>> I don't know. What are the "potential users" of this feature? I don't
>> recall any, but there may be some. How do we know the new pending flag
>> will work for them too?
> 
> It could be potentialy useful for any GUC that controls a resource
> shared between backend, and requires restart. To make this GUC
> changeable online, every backend has to perform some action, and they
> have to coordinate to make sure things are consistent -- exactly the use
> case we're trying to address, shared_buffers is just happened to be one
> of such resources. While I agree that the currently implemented
> interface is wrong (e.g. it doesn't prevent pending GUCs from being
> stored in PG_AUTOCONF_FILENAME, this has to happen only when the new
> value is actually applied), it still makes sense to me to allow more
> flexible lifecycle for certain GUC.
> 
> An example I could think of is shared_preload_libraries. If we ever want
> to do a hot reload of libraries, this will follow the procedure above:
> every backend has to do something like dlclose / dlopen and make sure
> that other backends have the same version of the library. Another maybe
> less far fetched example is max_worker_processes, which AFAICT is mostly
> used to control number of slots in shared memory (altough it's also
> stored in the control file, which makes things more complicated).
> 

Not sure. My concern is the config reload / GUC assign hook was not
designed with this use case in mind, and we'll run into issues. I also
dislike the "async" nature of this, which makes it harder to e.g. abort
the change, etc.

>>> * Having a separate utility command is a mighty simplification, which
>>>   helps avoiding problems you've described above.
>>>
>>> So far we've got two against one in favour of simple utility command, so
>>> we can as well go with that.
>>>
>>
>> Not sure voting is a good way to make design decisions ...
> 
> I'm somewhat torn between those two options myself. The more I think
> about this topic, the more I convinced that pending GUC makes sense, but
> the more work I see needed to implement that. Maybe a good middle ground
> is to go with a simple utility command, as Ashutosh was suggesting, and
> keep pending GUC infrastructure on top of that as an optional patch.
> 

What about a simple function? Probably not as clean as a proper utility
command, and it implies a transaction - not sure if that could be a
problem for some part of this.

>>>> 3) max_available_memory
>>>>
>>>> I think the GUC should specify the maximum shared_buffers we want to
>>>> allow, and then we'd work out the total to pre-allocate? Considering
>>>> we're only allowing to resize shared_buffers, that should be pretty
>>>> trivial. Yes, it might happen that the "total limit" happens to exceed
>>>> the available memory or something, but we already have the problem
>>>> with shared_buffers. Seems fine if we explain this in the docs, and
>>>> perhaps print the calculated memory limit on start.
>>>
>>> Somehow I'm not following what you suggest here. You mean having the
>>> maximum shared_buffers specified, but not as a separate GUC?
>>
>> My suggestion was to have a guc max_shared_buffers. Based on that you
>> can easily calculate the size of all other segments dependent on
>> NBuffers, and reserve memory for that.
> 
> Got it, ok.
> 
>>>> Moreover, all of the above is for mappings sized based on NBuffers. But
>>>> if we allocate 10% for MAIN_SHMEM_SEGMENT, won't that be a problem the
>>>> moment someone increases of max_connection, max_locks_per_transaction
>>>> and possibly some other stuff?
>>>
>>> Can you elaborate, what do you mean by that? Increasing max_connection,
>>> etc. leading to increased memory consumption in the MAIN_SHMEM_SEGMENT,
>>> but the ratio is for memory reservation only.
>>>
>>
>> Stuff like PGPROC, fast-path locks etc. are allocated as part of
>> MAIN_SHMEM_SEGMENT, right? Yet the ratio assigns 10% of the maximum
>> space for that. If I significantly increase GUCs like max_connections or
>> max_locks_per_transaction, how do you know it didn't exceed the 10%?
> 
> Still don't see the problem. The 10% we're talking about is the reserved
> space, thus it affects only shared memory resizing operation and nothing
> else. The real memory allocated is less than or equal to the reserved
> size, but is allocated and managed completely in the same way as without
> the patch, including size calculations. If some GUCs are increased and
> drive real memory usage high, it will be handled as before. Are we on
> the same page about this?
> 

How do you know reserving 10% is sufficient? Imagine I set

  max_available_memory = '256MB'
  max_connections = 1000000
  max_locks_per_transaction = 10000

How do you know it's not more than 10% of the available memory?

FWIW if I add a simple assert to CreateAnonymousSegment

  Assert(mapping->shmem_reserved >= allocsize);

it crashes even with just the max_available_memory=256MB

  #4  0x0000000000b74fbd in ExceptionalCondition (conditionName=0xe25920
"mapping->shmem_reserved >= allocsize", fileName=0xe251e7 "pg_shmem.c",
lineNumber=878) at assert.c:66

because we happen to execute it with this:

  mapping->shmem_reserved 26845184  allocsize 125042688

I think I mentioned a similar crash earlier, not sure if that's the same
issue or a different one.

>>>> 11) Actually, what's the difference between the contents of Mappings
>>>>     and Segments? Isn't that the same thing, indexed in the same way?
>>>>     Or could it be unified? Or are they conceptually different thing?
>>>
>>> Unless I'm mixing something badly, the content is the same. The relation
>>> is a segment as a structure "contains" a mapping.
>>>
>> Then, why do we need to track it in two places? Doesn't it just increase
>> the likelihood that someone misses updating one of them?
> 
> To clarify, under "contents" I mean the shared memory content (the
> actual data) behind both "segment" and the "mapping", maybe you had
> something else in mind.
> 
> On the surface of it those are two different data structures that have
> mostly different, but related, fields: a shared memory segment contains
> stuff needed for working with memory (header, base, end, lock), mapping
> has more lower level details, e.g.  reserved space, fd, IPC key. The
> only common fields are size and address, maybe I can factor them out to
> not repeat.
> 

OK, I think I'm just confused by the ambiguous definitions of
segment/mapping. It'd be good to document/explain this in a comment
somewhere.

>>>> 2) In fact, what happens if the user tries to resize to a value that is
>>>>    too large for one of the segments? How would the system know before
>>>>    starting the resize (and failing)?
>>>
>>> This type of situation is handled (doing hard stop) in the latest
>>> version, because all the necessary information is present in the mapping
>>> structure.
>>>
>> I don't know, but crashing the instance (I assume that's what you mean
>> by hard stop) does not seem like something we want to do. AFAIK the GUC
>> hook should be able to determine if the value is too large, and reject
>> it at that point. Not proceed and crash everything.
> 
> I see, you're pointing out that it would be good to have more validation
> at the GUC level, right?

Well, that'd be a starting point. We definitely should not allow setting
a value that end up crashing an instance (it does not matter if it's
because of FATAL or hitting a segfault/sigbut somewhere).

cheers

-- 
Tomas Vondra