Re: Changing shared_buffers without restart

Konstantin Knizhnik <knizhnik@garret.ru>

From: Konstantin Knizhnik <knizhnik@garret.ru>

To: Dmitry Dolgov <9erthalion6@gmail.com>, pgsql-hackers@postgresql.org

Cc: Robert Haas <robertmhaas@gmail.com>, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>

Date: 2025-04-17T11:21:07Z

Lists: pgsql-hackers

Commits


Remove PG_MMAP_FLAGS from mem.h
- c100340729b6 19 (unreleased) landed
Improve runtime and output of tests for replication slots checkpointing.
- 4464fddf7b50 18.0 cited
Revert support for improved tracking of nested queries
- f85f6ab051b7 18.0 cited
Use exported symbols list on macOS for loadable modules as well
- 3feff3916ee1 18.0 cited
Add support for basic NUMA awareness
- 65c298f61fc7 18.0 cited
Avoid unnecessary copying of a string in pg_restore.c
- 5e1915439085 18.0 cited
aio: Infrastructure for io_method=worker
- 55b454d0e140 18.0 cited
Improve InitShmemAccess() prototype
- 2a7b2d97171d 18.0 landed

On 25/02/2025 11:52 am, Dmitry Dolgov wrote:
>> On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
>> TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
>> changing shared memory mapping layout. Any feedback is appreciated.

Hi Dmitry,

I am sorry that I have not participated in the discussion in this thread 
from the very beginning, although I am also very interested in dynamic 
shared buffer resizing and even proposed my own implementation of it: 
https://github.com/knizhnik/postgres/pull/2 based on memory ballooning 
and using `madvise`. And it really works (returns unused memory to the 
system).
This PoC helped me understand the main drawbacks of this approach:

1. Performance of the Postgres CLOCK page eviction algorithm depends on 
the number of shared buffers. My first naive attempt, which just marked 
unused buffers as invalid, caused a significant performance degradation:

pgbench -c 32 -j 4 -T 100 -P1 -M prepared -S

(here `shared_buffers` is the maximal shared buffers size and 
`available_buffers` is the part actually used):

| shared_buffers | available_buffers | TPS  |
| -------------- | ----------------- | ---- |
| 128MB          | -1                | 280k |
| 1GB            | -1                | 324k |
| 2GB            | -1                | 358k |
| 32GB           | -1                | 350k |
| 2GB            | 128MB             | 130k |
| 2GB            | 1GB               | 311k |
| 32GB           | 128MB             | 13k  |
| 32GB           | 1GB               | 140k |
| 32GB           | 2GB               | 348k |

My first thought was to replace clock with an LRU based on a doubly 
linked list. Since there is no lockless doubly linked list 
implementation, it needs some global lock, and this lock can become a 
bottleneck. The standard solution is partitioning: use N LRU lists 
instead of 1, just like the partitioned hash table used by the buffer 
manager to look up buffers. Actually we can use the same partition locks 
to protect the LRU lists.
But it is not clear what to do with ring buffers (strategies). So I 
decided not to perform such a revolution in bufmgr, but to optimize 
clock to skip reserved buffers more efficiently.
Just add a `skip_count` field to the buffer descriptor. And it helps! 
Now the worst case, shared_buffers/available_buffers = 32GB/128MB, 
shows the same performance (280k TPS) as shared_buffers=128MB without 
ballooning.

2. There are several data structures in Postgres whose size depends on 
the number of buffers.
In my patch I used the dynamic shared buffer size in some cases, but if 
a structure has to be allocated in shared memory then the maximal size 
still has to be used. We have the buffers themselves (8 kB per buffer), 
then the main BufferDescriptors array (64 B), the BufferIOCVArray (16 B), 
checkpoint's CkptBufferIds (20 B), and the hashmap on the buffer cache 
(24B+8B/entry).
128 bytes per 8 kB does not seem like a large overhead (about 1.5%), but 
it may be quite noticeable with size differences larger than 2 orders of 
magnitude: e.g. to support scaling from 0.5GB to 128GB, with 128 
bytes/buffer we'd have ~2GiB of static overhead on only 0.5GiB of actual 
buffers.

3. `madvise` is not portable.
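
To illustrate point 1, here is a minimal sketch of how a clock sweep 
could use a `skip_count` field to jump over reserved (ballooned) buffers 
in one step instead of visiting them one by one. This is not code from 
the patch. The names `SketchBufferDesc`, `SketchPool`, 
`sketch_reserve_range` and `sketch_clock_sweep` are hypothetical, and the 
meaning given to `skip_count` here (the number of reserved buffers from 
this descriptor to the end of its run) is an assumption. Locking, pins 
and atomics are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified descriptor; NOT PostgreSQL's BufferDesc. */
typedef struct SketchBufferDesc
{
    int usage_count;    /* clock usage counter, never negative */
    int skip_count;     /* ASSUMPTION: > 0 means "reserved"; value is the
                         * number of reserved buffers from here to the end
                         * of the reserved run */
} SketchBufferDesc;

typedef struct SketchPool
{
    SketchBufferDesc *bufs;
    size_t      nbuffers;
    size_t      hand;       /* next buffer the clock sweep will look at */
} SketchPool;

/* Mark buffers [first, first + count) as reserved. */
static void
sketch_reserve_range(SketchPool *pool, size_t first, size_t count)
{
    assert(first + count <= pool->nbuffers);
    for (size_t i = 0; i < count; i++)
        pool->bufs[first + i].skip_count = (int) (count - i);
}

/*
 * Return the index of a victim buffer, or -1 if every buffer is reserved.
 * A reserved run costs one step regardless of its length.
 */
static long
sketch_clock_sweep(SketchPool *pool)
{
    size_t      skipped = 0;    /* reserved buffers jumped over since the
                                 * last usable buffer was seen */

    while (skipped < pool->nbuffers)
    {
        SketchBufferDesc *buf = &pool->bufs[pool->hand];

        if (buf->skip_count > 0)
        {
            /* Jump to the first buffer after the reserved run. */
            skipped += (size_t) buf->skip_count;
            pool->hand = (pool->hand + (size_t) buf->skip_count) % pool->nbuffers;
            continue;
        }

        skipped = 0;
        size_t      candidate = pool->hand;

        pool->hand = (pool->hand + 1) % pool->nbuffers;
        if (buf->usage_count == 0)
            return (long) candidate;
        buf->usage_count--;     /* give the buffer another chance */
    }
    return -1;
}
```

With 8 buffers of which the last 6 are reserved, the hand only ever 
inspects the 2 usable descriptors plus one descriptor per reserved run, 
so the cost of a sweep in this sketch follows the number of available 
buffers rather than the maximal pool size.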

Certainly you have moved much further in your proposal compared with my 
PoC (including huge pages support).
But it is still not quite clear to me how you are going to solve the 
problem of large memory overhead in case of a ~100x variation in 
shared buffers size.
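
The overhead arithmetic from point 2 can be checked with a few lines of 
C. This is only a sketch: it assumes the default 8 kB block size and the 
round figure of 128 bytes of per-buffer metadata quoted above, and the 
names `SKETCH_BLCKSZ`, `SKETCH_META_PER_BUFFER`, 
`sketch_static_metadata_bytes` and `sketch_overhead_ratio` are 
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BLCKSZ           UINT64_C(8192)  /* assumed default block size */
#define SKETCH_META_PER_BUFFER  UINT64_C(128)   /* round figure from point 2 */

/*
 * Metadata that must be allocated up front when the per-buffer structures
 * are sized for the maximal pool size.
 */
static uint64_t
sketch_static_metadata_bytes(uint64_t max_pool_bytes)
{
    assert(max_pool_bytes % SKETCH_BLCKSZ == 0);
    return (max_pool_bytes / SKETCH_BLCKSZ) * SKETCH_META_PER_BUFFER;
}

/*
 * Static metadata relative to the buffers actually in use:
 * 1.0 means the metadata is as large as the usable buffer pool.
 */
static double
sketch_overhead_ratio(uint64_t max_pool_bytes, uint64_t used_pool_bytes)
{
    assert(used_pool_bytes > 0);
    return (double) sketch_static_metadata_bytes(max_pool_bytes) /
        (double) used_pool_bytes;
}
```

For a maximum of 128 GiB this gives 2^24 buffers times 128 bytes, that 
is 2 GiB of metadata, which is four times the size of a 0.5 GiB pool of 
buffers actually in use.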

I