Re: Changing shared_buffers without restart

Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>

From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>

To: Dmitry Dolgov <9erthalion6@gmail.com>

Cc: pgsql-hackers@postgresql.org, Robert Haas <robertmhaas@gmail.com>

Date: 2025-04-11T14:34:39Z

Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

Remove PG_MMAP_FLAGS from mem.h
- c100340729b6 19 (unreleased) landed
Improve runtime and output of tests for replication slots checkpointing.
- 4464fddf7b50 18.0 cited
Revert support for improved tracking of nested queries
- f85f6ab051b7 18.0 cited
Use exported symbols list on macOS for loadable modules as well
- 3feff3916ee1 18.0 cited
Add support for basic NUMA awareness
- 65c298f61fc7 18.0 cited
Avoid unnecessary copying of a string in pg_restore.c
- 5e1915439085 18.0 cited
aio: Infrastructure for io_method=worker
- 55b454d0e140 18.0 cited
Improve InitShmemAccess() prototype
- 2a7b2d97171d 18.0 landed

On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> Yes, you're right, plain dynamic Barrier does not ensure all available
> processes will be synchronized. I was aware about the scenario you
> describe, it's mentioned in commentaries for the resize function. I was
> under the impression this should be enough, but after some more thinking
> I'm not so sure anymore. Let me try to structure it as a list of
> possible corner cases that we need to worry about:
>
> * New backend spawned while we're busy resizing shared memory. Those
>   should wait until the resizing is complete and get the new size as well.
>
> * Old backend receives a resize message, but exits before attempting to
>   resize. Those should be excluded from coordination.

Should we detach barrier in on_exit()?

>
> * A backend is blocked and not responding before or after the
>   ProcSignalBarrier message was sent. I'm thinking about a failure
>   situation, when one rogue backend is doing something without checking
>   for interrupts. We need to wait for those to become responsive, and
>   potentially abort shared memory resize after some timeout.

Right.

>
> I think a relatively elegant solution is to extend ProcSignalBarrier
> mechanism to track not only pss_barrierGeneration, as a sign that
> everything was processed, but also something like
> pss_barrierReceivedGeneration, indicating that the message was received
> everywhere but not processed yet. That would be enough to allow
> processes to wait until the resize message was received everywhere, then
> use a global Barrier to wait until all processes are finished.  It's
> somehow similar to your proposal to use two signals, but has less
> implementation overhead.

The way it's implemented in v4 still has the disjoint group problem.
Assume backends p1, p2, p3. All three of them are executing
ProcessProcSignalBarrier(). All three of them updated
pss_barrierReceivedGeneration

/* The message is observed, record that */
pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration,
shared_gen);

p1, p2 moved faster and reached following code from ProcessBarrierShmemResize()
if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED)
  WaitForProcSignalBarrierReceived(pg_atomic_read_u64(&ShmemCtrl->Generation));

Since all the processes have received the barrier message, p1, p2 move
ahead and go through all the next phases and finish resizing even
before p3 gets a chance to call ProcessBarrierShmemResize() and attach
itself to Barrier. This could happen because it processed some other
ProcSignalBarrier message. p1 and p2 won't wait for p3 since it has
not attached itself to the barrier. Once p1, p2 finish, p3 will attach
itself to the barrier and resize buffers again - reinitializing the
shared memory, which might has been already modified by p1 or p2. Boom
- there's memory corruption.

Either every process has to make sure that all the other extant
backends have attached themselves to the barrier OR somebody has to
ensure that and signal all the backends to proceed. The implementation
doesn't do either.

>
> * Shared memory address space is now reserved for future usage, making
>   shared memory segments clash (e.g. due to memory allocation)
>   impossible.  There is a new GUC to control how much space to reserve,
>   which is called max_available_memory -- on the assumption that most of
>   the time it would make sense to set its value to the total amount of
>   memory on the machine. I'm open for suggestions regarding the name.

With 0006 applied
+ /* Clean up some reserved space to resize into */
+ if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1)
ze, m->shmem)));
... snip ...
+ ptr = mremap(m->shmem, m->shmem_size, new_size, 0);

We unmap the portion of reserved address space where the existing
segment would expand into. As long as we are just expanding this will
work. I am wondering how would this work for shrinking buffers? What
scheme do you have in mind?

>
> * There is one more patch to address hugepages remap. As mentioned in
>   this thread above, Linux kernel has certain limitations when it comes
>   to mremap for segments allocated with huge pages. To work around it's
>   possible to replace mremap with a sequence of unmap and map again,
>   relying on the anon file behind the segment to keep the memory
>   content. I haven't found any downsides of this approach so far, but it
>   makes the anonymous file patch 0007 mandatory.

In 0008
if (munmap(m->shmem, m->shmem_size) < 0)
... snip ...
/* Resize the backing anon file. */
if(ftruncate(m->segment_fd, new_size) == -1)
...
/* Reclaim the space */
ptr = mmap(m->shmem, new_size, PROT_READ | PROT_WRITE,
mmap_flags | MAP_FIXED, m->segment_fd, 0);

How are we preventing something get mapped into the space after
m->shmem + newsize? We will need to add an unallocated but reserved
addressed space map after m->shmem+newsize right?

--
Best Wishes,
Ashutosh Bapat