Thread

  1. Checkpointer write combining

    Melanie Plageman <melanieplageman@gmail.com> — 2025-09-02T21:10:43Z

    Hi,
    
    The attached patchset implements checkpointer write combining -- which
    makes immediate checkpoints at least 20% faster in my tests.
    Checkpointer achieves higher write throughput and higher write IOPs
    with the patch.
    
    Besides the immediate performance gain with the patchset, we will
    eventually need all writers to do write combining if we want to use
    direct IO. Additionally, I think the general shape I refactored
    BufferSync() into will be useful for AIO-ifying checkpointer.
    
    The patch set has preliminary patches (0001-0004) that implement eager
    flushing and write combining for bulkwrites (like COPY FROM). The
    functions used to flush a batch of writes for bulkwrites (see 0004)
    are reused for the checkpointer. The eager flushing component of this
    patch set has been discussed elsewhere [1].
    
    0005 implements a fix for XLogNeedsFlush() when called by checkpointer
    during an end-of-crash-recovery checkpoint. I've already started
    another thread about this [2], but the patch is required for the patch
    set to pass tests.
    
    One outstanding action item is to test to see if there are any
    benefits to spread checkpoints.
    
    More on how I measured the performance benefit to immediate checkpoints:
    
    I tuned checkpoint_completion_target, checkpoint_timeout, and min and
    max_wal_size to ensure no other checkpoints were initiated.
    
    With 16 GB shared buffers and io_combine_limit 128, I created a 15 GB
    table. To get consistent results, I used pg_prewarm to read the table
    into shared buffers, issued a checkpoint, then used Bilal's patch [3]
    to mark all the buffers as dirty again and issue another checkpoint.
    On a fast local SSD, this proved to be a consistent 20%+ speed up
    (~6.5 seconds to ~5 seconds).
    
    - Melanie
    
    [1] https://www.postgresql.org/message-id/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig@mail.gmail.com
    [2] https://www.postgresql.org/message-id/CAAKRu_a1vZRZRWO3_jv_X13RYoqLRVipGO0237g5PKzPa2YX6g%40mail.gmail.com
    [3] https://www.postgresql.org/message-id/flat/CAN55FZ0h_YoSqqutxV6DES1RW8ig6wcA8CR9rJk358YRMxZFmw%40mail.gmail.com
    
  2. Re: Checkpointer write combining

    Melanie Plageman <melanieplageman@gmail.com> — 2025-09-08T23:44:02Z

    On Tue, Sep 2, 2025 at 5:10 PM Melanie Plageman
    <melanieplageman@gmail.com> wrote:
    >
    > The attached patchset implements checkpointer write combining -- which
    > makes immediate checkpoints at least 20% faster in my tests.
    > Checkpointer achieves higher write throughput and higher write IOPs
    > with the patch.
    
    These needed a rebase. Attached v2.
    
    - Melanie
    
  3. Re: Checkpointer write combining

    Nazir Bilal Yavuz <byavuz81@gmail.com> — 2025-09-09T13:27:47Z

    Hi,
    
    Thank you for working on this!
    
    On Tue, 9 Sept 2025 at 02:44, Melanie Plageman
    <melanieplageman@gmail.com> wrote:
    >
    > On Tue, Sep 2, 2025 at 5:10 PM Melanie Plageman
    > <melanieplageman@gmail.com> wrote:
    > >
    > > The attached patchset implements checkpointer write combining -- which
    > > makes immediate checkpoints at least 20% faster in my tests.
    > > Checkpointer achieves higher write throughput and higher write IOPs
    > > with the patch.
    
    I did the same benchmark you did and I found it is %50 faster (16
    seconds to 8 seconds).
    
    
  4. Re: Checkpointer write combining

    Melanie Plageman <melanieplageman@gmail.com> — 2025-09-09T13:39:19Z

    On Tue, Sep 9, 2025 at 9:27 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
    >
    
    Thanks for the review!
    
    > From 053dd9d15416d76ce4b95044d848f51ba13a2d20 Mon Sep 17 00:00:00 2001
    > From: Melanie Plageman <melanieplageman@gmail.com>
    > Date: Tue, 2 Sep 2025 11:00:44 -0400
    > Subject: [PATCH v2 1/9] Refactor goto into for loop in GetVictimBuffer()
    >
    > @@ -731,6 +741,13 @@ StrategyRejectBuffer(BufferAccessStrategy
    > strategy, BufferDesc *buf, bool from_r
    >          strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
    >          return false;
    >
    > +    buf_state = LockBufHdr(buf);
    > +    lsn = BufferGetLSN(buf);
    > +    UnlockBufHdr(buf, buf_state);
    > +
    > +    if (!XLogNeedsFlush(lsn))
    > +        return true;
    >
    > I think this should return false.
    
    Oops, you're right. v3 attached with that mistake fixed.
    
    - Melanie
    
  5. Re: Checkpointer write combining

    Melanie Plageman <melanieplageman@gmail.com> — 2025-09-09T15:16:59Z

    On Tue, Sep 9, 2025 at 9:39 AM Melanie Plageman
    <melanieplageman@gmail.com> wrote:
    >
    > Oops, you're right. v3 attached with that mistake fixed.
    
    One more fix and a bit more cleanup in attached v4.
    
    - Melanie
    
  6. Re: Checkpointer write combining

    Melanie Plageman <melanieplageman@gmail.com> — 2025-09-09T17:55:08Z

    On Tue, Sep 9, 2025 at 11:16 AM Melanie Plageman
    <melanieplageman@gmail.com> wrote:
    >
    > One more fix and a bit more cleanup in attached v4.
    
    Okay one more version: I updated the thread on eager flushing the
    bulkwrite ring [1], and some updates were needed here.
    
    - Melanie
    
    [1] https://www.postgresql.org/message-id/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
    
  7. Re: Checkpointer write combining

    Jeff Davis <pgsql@j-davis.com> — 2025-09-10T01:28:55Z

    On Tue, 2025-09-09 at 13:55 -0400, Melanie Plageman wrote:
    > On Tue, Sep 9, 2025 at 11:16 AM Melanie Plageman
    > <melanieplageman@gmail.com> wrote:
    > > 
    > > One more fix and a bit more cleanup in attached v4.
    > 
    > Okay one more version: I updated the thread on eager flushing the
    > bulkwrite ring [1], and some updates were needed here.
    
    v5-0005 comments:
    
      * Please update the comment above the code change.
      * The last paragraph in the commit message has a typo: "potentially
    update the local copy of min recovery point, when xlog inserts are
    *not* allowed", right?
      * Shouldn't the code be consistent between XLogNeedsFlush() and
    XLogFlush()? The latter only checks for !XLogInsertAllowed(), whereas
    the former also checks for RecoveryInProgress().
    
    I'm still not sure I understand the problem situation this is fixing,
    but that's being discussed in another thread.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  8. Re: Checkpointer write combining

    Chao Li <li.evan.chao@gmail.com> — 2025-09-10T08:24:13Z

    
    > On Sep 10, 2025, at 01:55, Melanie Plageman <melanieplageman@gmail.com> wrote:
    > 
    > 
    > [1] https://www.postgresql.org/message-id/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
    > <v5-0004-Write-combining-for-BAS_BULKWRITE.patch><v5-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch><v5-0005-Fix-XLogNeedsFlush-for-checkpointer.patch><v5-0002-Split-FlushBuffer-into-two-parts.patch><v5-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v5-0006-Add-database-Oid-to-CkptSortItem.patch><v5-0007-Implement-checkpointer-data-write-combining.patch>
    
    
    1 - 0001
    ```
    --- a/src/backend/storage/buffer/freelist.c
    +++ b/src/backend/storage/buffer/freelist.c
    
    + * The buffer must pinned and content locked and the buffer header spinlock
    ```
    
    “Must pinned” -> “must be pinned"
    
    2 - 0001
    ```
    --- a/src/backend/storage/buffer/freelist.c
    +++ b/src/backend/storage/buffer/freelist.c
    
    +	if (XLogNeedsFlush(lsn))
    +	{
    +		/*
    +		 * Remove the dirty buffer from the ring; necessary to prevent an
    +		 * infinite loop if all ring members are dirty.
    +		 */
    +		strategy->buffers[strategy->current] = InvalidBuffer;
    +		return true;
    +	}
     
    -	return true;
    +	return false;
     }
    ```
    
    We can do:
    ```
             If (!XLogNeedsFlush(lan))
                Return false
    
            /* Remove the dirty buffer ….
             */
           Return true;
    }
    ```
    
    This way makes less diff.
    
    3 - 0002
    ```
    + * Prepare to write and write a dirty victim buffer.
    ```
    
    Prepare to write a dirty victim buffer.
    
    4 - 0002
    ```
    -			/* OK, do the I/O */
    -			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
    -			LWLockRelease(content_lock);
    -
    -			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
    -										  &buf_hdr->tag);
    +			CleanVictimBuffer(buf_hdr, &buf_state, from_ring, io_context);
    ```
    I saw CleanVictimBuffer() will get content_lock from bufdesc and release it, but it makes the code hard to understand. Readers might be confused that why content_lock is not released after CleanVictimBuffer() without further reading CleanVictimBuffer().
    
    I’d suggest pass content_lock to CleanVictimBuffer() as a parameter, which gives a clear hint that CleanVictimBuffer() will release the lock.
    
    5 - 0002
    ```
     	 * disastrous system-wide consequences.  To make sure that can't happen,
     	 * skip the flush if the buffer isn't permanent.
     	 */
    -	if (buf_state & BM_PERMANENT)
    -		XLogFlush(recptr);
    +	if (!XLogRecPtrIsInvalid(buffer_lsn))
    +		XLogFlush(buffer_lsn);
    ```
    
    Why this check is changed? Should the comment be updated accordingly as it says “if the buffer isn’t permanent”, which reflects to the old code.
    
    6 - 0003
    ```
    --- a/src/backend/storage/buffer/freelist.c
    +++ b/src/backend/storage/buffer/freelist.c
    
    + * patterns than lazily flushing buffers directly before reusing them.
    + */
    ```
    
    Here “directly” is kind ambiguous. It could mean “immediately before” or “without going through something else”. My understanding is “immediately”, If that is true, please change “directly” to “immediately” or just remove it.
    
    7 - 0003
    ```
    +void
    +StartStrategySweep(BufferAccessStrategy strategy)
    +{
    +	if (!strategy)
    +		return;
    ```
    
    I doubt if this “strategy” null check is needed. Because it is only called when strategy_supports_eager_flush() is true, and strategy_supports_eager_flush() has asserted “strategy”.
    
    And as a pair function, StrategySweepNextBuffer() doesn’t do null check nor assert strategy.
    
    8 - 0003
    ```
    bool
    +strategy_supports_eager_flush(BufferAccessStrategy strategy)
    ```
    
    This function is only used in bufmgr.c, can we move it there and make it static?
    
    9 - 0004
    ```
    --- a/src/backend/storage/buffer/bufmgr.c
    +++ b/src/backend/storage/buffer/bufmgr.c
    
    +	batch->start = batch->bufdescs[0]->tag.blockNum;
    +	batch->forkno = BufTagGetForkNum(&batch->bufdescs[0]->tag);
    +	batch->rlocator = BufTagGetRelFileLocator(&batch->bufdescs[0]->tag);
    +	batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
    +
    +	Assert(BlockNumberIsValid(batch->start));
    ```
    
    Why don’t assert immediately after batch->start is assigned? So upon error, smgropen() will not be called.
    
    10 - 0004
    ```
    --- a/src/backend/storage/buffer/bufmgr.c
    +++ b/src/backend/storage/buffer/bufmgr.c
    
    +	limit = Min(max_batch_size, limit);
    ```
    
    Do we need to check max_batch_size should be less than (MAX_IO_COMBINE_LIMIT-1)? Because BufWriteBatch.bufdescs is defined with length of MAX_IO_COMBINE_LIMIT, and the first place has been used to store “start”.
    
    11 - 0004
    ```
    --- a/src/backend/storage/buffer/bufmgr.c
    +++ b/src/backend/storage/buffer/bufmgr.c
    
    +	for (batch->n = 1; batch->n < limit; batch->n++)
    +	{
    +		Buffer		bufnum;
    +
    +		if ((bufnum = StrategySweepNextBuffer(strategy)) == InvalidBuffer)
    +			break;
    ```
    
    Is sweep next buffer right next to start? If yes, can we assert that? But my guess is no, if my guess is true, then is it possible that bufnum meets start? If that’s true, then we should check next buffer doesn’t equal to start.
    
    12 - 0004
    ```
    @@ -4306,19 +4370,22 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
     
     	if (from_ring && strategy_supports_eager_flush(strategy))
     	{
    +		uint32		max_batch_size = max_write_batch_size_for_strategy(strategy);
    ```
    
    I think max_batch_size can be attribute of strategy and set it when creating a strategy, so that we don’t need to calculate in every round of clean.
    
    13 - 0004
    ```
    +void
    +CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
    +					 WritebackContext *wb_context)
    +{
    +	ErrorContextCallback errcallback =
    +	{
    +		.callback = shared_buffer_write_error_callback,
    +		.previous = error_context_stack,
    +	};
    +
    +	error_context_stack = &errcallback;
    +	pgBufferUsage.shared_blks_written += batch->n;
    ```
    
    Should we only increase shared_blks_written only after the loop of write-back is done?
    
    14 - 0004
    ```
    --- a/src/backend/storage/buffer/freelist.c
    +++ b/src/backend/storage/buffer/freelist.c
    
    +uint32
    +max_write_batch_size_for_strategy(BufferAccessStrategy strategy)
    ```
    
    I think this function can be moved to bufmgr.c and make it static.
    
    15 - 0004
    ```
    --- a/src/backend/storage/page/bufpage.c
    +++ b/src/backend/storage/page/bufpage.c
    @@ -1546,3 +1546,23 @@ PageSetChecksumInplace(Page page, BlockNumber blkno)
     
     	((PageHeader) page)->pd_checksum = pg_checksum_page(page, blkno);
     }
    +
    +/*
    + * A helper to set multiple block's checksums.
    + */
    +void
    +PageSetBatchChecksumInplace(Page *pages, BlockNumber *blknos, uint32 length)
    ```
    
    We should mark blknos as const to indicate it is readonly: const BlockNumber *blknos, which will also prevent from incidentally change on blknos in within the function. 
    
    16 - 0005
    ```
     	 * instead. So "needs flush" is taken to mean whether minRecoveryPoint
     	 * would need to be updated.
     	 */
    -	if (RecoveryInProgress())
    +	if (RecoveryInProgress() && !XLogInsertAllowed())
    ```
    
    As a new check is added, the comment should be updated accordingly.
    
    17 - 0006
    ```
    --- a/src/include/storage/buf_internals.h
    +++ b/src/include/storage/buf_internals.h
    @@ -382,6 +382,7 @@ UnlockBufHdr(BufferDesc *desc, uint32 buf_state)
     typedef struct CkptSortItem
     {
     	Oid			tsId;
    +	Oid			db_id;
    ```
    
    I think “db_id” should be named “dbId” or “dbOid”. Let’s keep the name conversation consistent.
    
    18 - 0007
    ```
    --- a/src/backend/storage/buffer/bufmgr.c
    +++ b/src/backend/storage/buffer/bufmgr.c
    
    +	max_batch_size = checkpointer_max_batch_size();
    ```
    
    Look like we don’t need to calculate max_batch_size in the for loop.
    
    19 - 0007
    ```
    +		 * Each batch will have exactly one start and one max lsn and one
    +		 * length.
     		 */
    ```
    
    I don’t get what you want to explain with this comment. It sounds quite unnecessary.
    
    Best regards,
    —
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
  9. Re: Checkpointer write combining

    Melanie Plageman <melanieplageman@gmail.com> — 2025-09-11T23:11:12Z

    On Wed, Sep 10, 2025 at 4:24 AM Chao Li <li.evan.chao@gmail.com> wrote:
    >
    
    Thanks for the review!
    
    For any of your feedback that I simply implemented, I omitted an
    inline comment about it. Those changes are included in the attached
    v6. My inline replies below are only for feedback requiring more
    discussion.
    
    > On Sep 10, 2025, at 01:55, Melanie Plageman <melanieplageman@gmail.com> wrote:
    >
    > 2 - 0001
    > ```
    > --- a/src/backend/storage/buffer/freelist.c
    > +++ b/src/backend/storage/buffer/freelist.c
    >
    > + if (XLogNeedsFlush(lsn))
    > + {
    > + /*
    > + * Remove the dirty buffer from the ring; necessary to prevent an
    > + * infinite loop if all ring members are dirty.
    > + */
    > + strategy->buffers[strategy->current] = InvalidBuffer;
    > + return true;
    > + }
    >
    > - return true;
    > + return false;
    >  }
    > ```
    >
    > We can do:
    > ```
    >          If (!XLogNeedsFlush(lan))
    >             Return false
    >
    >         /* Remove the dirty buffer ….
    >          */
    >        Return true;
    > }
    > ```
    
    This would make the order of evaluation the same as master but I
    actually prefer it this way because then we only take the buffer
    header spinlock if there is a chance we will reject the buffer (e.g.
    we don't need to examine it for strategies except BAS_BULKREAD)
    
    > 4 - 0002
    > ```
    > - /* OK, do the I/O */
    > - FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
    > - LWLockRelease(content_lock);
    > -
    > - ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
    > -  &buf_hdr->tag);
    > + CleanVictimBuffer(buf_hdr, &buf_state, from_ring, io_context);
    > ```
    > I saw CleanVictimBuffer() will get content_lock from bufdesc and release it, but it makes the code hard to understand. Readers might be confused that why content_lock is not released after CleanVictimBuffer() without further reading CleanVictimBuffer().
    >
    > I’d suggest pass content_lock to CleanVictimBuffer() as a parameter, which gives a clear hint that CleanVictimBuffer() will release the lock.
    
    I think for this specific patch in the set your idea makes sense.
    However, in the later patch to do write combining, I release the
    content locks for the batch in CompleteWriteBatchIO() and having the
    start buffer's lock separate as a parameter would force me to have a
    special case handling this.
    
    I've added a comment to both CleanVictimBuffer() and its caller
    specifying that the lock must be held and that it will be released
    inside CleanVictimBuffer.
    
    > 5 - 0002
    > ```
    >   * disastrous system-wide consequences.  To make sure that can't happen,
    >   * skip the flush if the buffer isn't permanent.
    >   */
    > - if (buf_state & BM_PERMANENT)
    > - XLogFlush(recptr);
    > + if (!XLogRecPtrIsInvalid(buffer_lsn))
    > + XLogFlush(buffer_lsn);
    > ```
    >
    > Why this check is changed? Should the comment be updated accordingly as it says “if the buffer isn’t permanent”, which reflects to the old code.
    
    It's changed because I split the logic for flushing to that LSN and
    determining the LSN across the Prepare and Do functions. This is
    needed because when we do batches, we want to flush to the max LSN
    across all buffers in the batch.
    
    I check if the buffer is BM_PERMANENT in PrepareFlushBuffer(). You
    make a good point about my comment, though. I've moved it to
    PrepareFlushBuffer() and updated it.
    
    > 8 - 0003
    > ```
    > bool
    > +strategy_supports_eager_flush(BufferAccessStrategy strategy)
    > ```
    >
    > This function is only used in bufmgr.c, can we move it there and make it static?
    
    BufferAccessStrategyData is opaque to bufmgr.c. Only freelist.c can
    access it. I agree it is gross that I have these helpers and functions
    that would otherwise be static, though.
    
    > 10 - 0004
    > ```
    > --- a/src/backend/storage/buffer/bufmgr.c
    > +++ b/src/backend/storage/buffer/bufmgr.c
    >
    > + limit = Min(max_batch_size, limit);
    > ```
    >
    > Do we need to check max_batch_size should be less than (MAX_IO_COMBINE_LIMIT-1)? Because BufWriteBatch.bufdescs is defined with length of MAX_IO_COMBINE_LIMIT, and the first place has been used to store “start”.
    
    I assert that in StrategyMaxWriteBatchSize(). io_combine_limit is not
    allowed to exceed MAX_IO_COMBINE_LIMIT, so it shouldn't happen anyway,
    since we are capping ourselves at io_combine_limit. Or is that your
    point?
    
    > 11 - 0004
    > ```
    > --- a/src/backend/storage/buffer/bufmgr.c
    > +++ b/src/backend/storage/buffer/bufmgr.c
    >
    > + for (batch->n = 1; batch->n < limit; batch->n++)
    > + {
    > + Buffer bufnum;
    > +
    > + if ((bufnum = StrategySweepNextBuffer(strategy)) == InvalidBuffer)
    > + break;
    > ```
    >
    > Is sweep next buffer right next to start? If yes, can we assert that? But my guess is no, if my guess is true, then is it possible that bufnum meets start? If that’s true, then we should check next buffer doesn’t equal to start.
    
    Ah, great point. I didn't think about this. Our sweep will always
    start right after the start buffer, but then if it goes all the way
    around, it will "lap" the start buffer. Because of this and because I
    think it is weird to have the sweep variables in the
    BufferAccessStrategy object, I've changed my approach in attached v6.
    I set sweep_end to be the start block in the batch and then pass
    around a sweep cursor variable. Hitting sweep_end is the termination
    condition.
    
    > 12 - 0004
    > ```
    > @@ -4306,19 +4370,22 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
    >
    >   if (from_ring && strategy_supports_eager_flush(strategy))
    >   {
    > + uint32 max_batch_size = max_write_batch_size_for_strategy(strategy);
    > ```
    >
    > I think max_batch_size can be attribute of strategy and set it when creating a strategy, so that we don’t need to calculate in every round of clean.
    
    Actually, the max pin limit can change quite frequently. See
    GetAdditionalPinLimit()'s usage in read stream code. If the query is
    pinning other buffers in another part of the query, it can change our
    limit.
    
    I'm not sure if I should call GetAdditionalPinLImit() for each batch
    or for each run of batches (like in StrategyMaxWriteBatchSize()).
    Currently, I call it for each batch (in FindFlushAdjacents()). The
    read stream calls it pretty frequently (each
    read_stream_start_pending_read()). But, in the batch flush case,
    nothing could change between batches in a run of batches. So maybe I
    should move it up and out and make it per run of batches...
    
    > 13 - 0004
    > ```
    > +void
    > +CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
    > + WritebackContext *wb_context)
    > +{
    > + ErrorContextCallback errcallback =
    > + {
    > + .callback = shared_buffer_write_error_callback,
    > + .previous = error_context_stack,
    > + };
    > +
    > + error_context_stack = &errcallback;
    > + pgBufferUsage.shared_blks_written += batch->n;
    > ```
    >
    > Should we only increase shared_blks_written only after the loop of write-back is done?
    
    On master, FlushBuffer() does it after smgrwrite() (before writeback).
    I think pgBufferUsage is mainly used in EXPLAIN (also
    pg_stat_statements) which won't be used until the end of the query and
    won't be displayed if we error out.
    
    > 14 - 0004
    > ```
    > --- a/src/backend/storage/buffer/freelist.c
    > +++ b/src/backend/storage/buffer/freelist.c
    >
    > +uint32
    > +max_write_batch_size_for_strategy(BufferAccessStrategy strategy)
    > ```
    >
    > I think this function can be moved to bufmgr.c and make it static.
    
    This technically could be moved, but it is a function giving you
    information about a strategy which seemed to fit better in freelist.c.
    
    > 18 - 0007
    > ```
    > --- a/src/backend/storage/buffer/bufmgr.c
    > +++ b/src/backend/storage/buffer/bufmgr.c
    >
    > + max_batch_size = checkpointer_max_batch_size();
    > ```
    >
    > Look like we don’t need to calculate max_batch_size in the for loop.
    
    I don't think it's in the for loop.
    
    - Melanie
    
  10. Re: Checkpointer write combining

    Chao Li <li.evan.chao@gmail.com> — 2025-09-12T03:33:17Z

    
    > On Sep 12, 2025, at 07:11, Melanie Plageman <melanieplageman@gmail.com> wrote:
    > 
    > On Wed, Sep 10, 2025 at 4:24 AM Chao Li <li.evan.chao@gmail.com> wrote:
    >> 
    > 
    > Thanks for the review!
    > 
    > For any of your feedback that I simply implemented, I omitted an
    > inline comment about it. Those changes are included in the attached
    > v6. My inline replies below are only for feedback requiring more
    > discussion.
    > 
    >> On Sep 10, 2025, at 01:55, Melanie Plageman <melanieplageman@gmail.com> wrote:
    >> 
    >> 2 - 0001
    >> ```
    >> --- a/src/backend/storage/buffer/freelist.c
    >> +++ b/src/backend/storage/buffer/freelist.c
    >> 
    >> + if (XLogNeedsFlush(lsn))
    >> + {
    >> + /*
    >> + * Remove the dirty buffer from the ring; necessary to prevent an
    >> + * infinite loop if all ring members are dirty.
    >> + */
    >> + strategy->buffers[strategy->current] = InvalidBuffer;
    >> + return true;
    >> + }
    >> 
    >> - return true;
    >> + return false;
    >> }
    >> ```
    >> 
    >> We can do:
    >> ```
    >>         If (!XLogNeedsFlush(lan))
    >>            Return false
    >> 
    >>        /* Remove the dirty buffer ….
    >>         */
    >>       Return true;
    >> }
    >> ```
    > 
    > This would make the order of evaluation the same as master but I
    > actually prefer it this way because then we only take the buffer
    > header spinlock if there is a chance we will reject the buffer (e.g.
    > we don't need to examine it for strategies except BAS_BULKREAD)
    > 
    > 
    
    I don’t understand why the two versions are different:
    
        if (XLogNeedsFlush(lsn))
        {
            /*
             * Remove the dirty buffer from the ring; necessary to prevent an
             * infinite loop if all ring members are dirty.
             */
            strategy->buffers[strategy->current] = InvalidBuffer;
            return true;
        }
    
        return false;
    
    VS
    
        if (XLogNeedsFlush(lsn))
            return false;
    
        /*
         * Remove the dirty buffer from the ring; necessary to prevent an
         * infinite loop if all ring members are dirty.
        */
        strategy->buffers[strategy->current] = InvalidBuffer;
        return true;
    
    > 
    >> 10 - 0004
    >> ```
    >> --- a/src/backend/storage/buffer/bufmgr.c
    >> +++ b/src/backend/storage/buffer/bufmgr.c
    >> 
    >> + limit = Min(max_batch_size, limit);
    >> ```
    >> 
    >> Do we need to check max_batch_size should be less than (MAX_IO_COMBINE_LIMIT-1)? Because BufWriteBatch.bufdescs is defined with length of MAX_IO_COMBINE_LIMIT, and the first place has been used to store “start”.
    > 
    > I assert that in StrategyMaxWriteBatchSize(). io_combine_limit is not
    > allowed to exceed MAX_IO_COMBINE_LIMIT, so it shouldn't happen anyway,
    > since we are capping ourselves at io_combine_limit. Or is that your
    > point?
    > 
    
    Please ignore comment 10. I think I cross-line it in my original email. I added the comment, then lately I found you have checked MAX_IO_COMBINE_LIMIT in the other function, so tried to delete it by cross-lining the comment.
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
  11. Re: Checkpointer write combining

    Melanie Plageman <melanieplageman@gmail.com> — 2025-10-15T20:56:55Z

    On Thu, Sep 11, 2025 at 11:33 PM Chao Li <li.evan.chao@gmail.com> wrote:
    >
    > I don’t understand why the two versions are different:
    >
    > if (XLogNeedsFlush(lsn))
    > {
    > /*
    > * Remove the dirty buffer from the ring; necessary to prevent an
    > * infinite loop if all ring members are dirty.
    > */
    > strategy->buffers[strategy->current] = InvalidBuffer;
    > return true;
    > }
    >
    > return false;
    >
    > VS
    >
    > if (XLogNeedsFlush(lsn))
    > return false;
    
    I think you mean
    if (!XLogNeedsFlush(lsn))
    {
       return false;
    }
    // remove buffer
    return true
    
    is the same as
    
    if (XLogNeedsFlush(lsn))
    {
    //remove dirty buffer
    return true
    }
    return false;
    
    Which is true. I've changed it to be like that.
    
    Attached version 7 is rebased and has some bug fixes.
    
    I also added a bonus batch on the end (0007) that refactors
    SyncOneBuffer() to use the CAS loop pattern for pinning the buffer
    that Andres introduced in 5e89985928795f243. bgwriter is now the only
    user of SyncOneBuffer() and it rejects writing out buffers that are
    used, so it seemed like a decent use case for this.
    
    - Melanie
    
  12. Re: Checkpointer write combining

    Chao Li <li.evan.chao@gmail.com> — 2025-10-16T04:24:51Z

    Hi Milanie,
    
    Thanks for updating the patch set. I review 1-6 and got a few more small comments. I didn’t review 0007 as it’s marked as WIP.
    
    > 
    > - Melanie
    > <v7-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch><v7-0002-Split-FlushBuffer-into-two-parts.patch><v7-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v7-0004-Write-combining-for-BAS_BULKWRITE.patch><v7-0005-Add-database-Oid-to-CkptSortItem.patch><v7-0006-Implement-checkpointer-data-write-combining.patch><v7-0007-WIP-Refactor-SyncOneBuffer-for-bgwriter-only.patch>
    
    1 - 0001
    ```
    --- a/src/include/storage/buf_internals.h
    +++ b/src/include/storage/buf_internals.h
    @@ -421,6 +421,12 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
     /*
      * Internal buffer management routines
      */
    +
    +
    +/* Note: these two macros only work on shared buffers, not local ones! */
    ```
    
    Nit: here you added two empty lines, I think we need only 1.
    
    2 - 0002
    ```
    +static void
    +CleanVictimBuffer(BufferDesc *bufdesc,
    +				  bool from_ring, IOContext io_context)
    +{
     
    -	/* Find smgr relation for buffer */
    -	if (reln == NULL)
    -		reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
    +	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
    +	LWLock	   *content_lock;
    ``` 
    
    Nit: the empty line after “{“ should be removed.
    
    3 - 0003
    ```
    +/*
    + * Return the next buffer in the ring or InvalidBuffer if the current sweep is
    + * over.
    + */
    +Buffer
    +StrategySweepNextBuffer(BufferAccessStrategy strategy, int *sweep_cursor)
    +{
    +	if (++(*sweep_cursor) >= strategy->nbuffers)
    +		*sweep_cursor = 0;
    +
    +	return strategy->buffers[*sweep_cursor];
    +}
    ```
    
    Feels the function comment is a bit confusing, because the function code doesn’t really perform sweep, the function is just a getter. InvalidBuffer just implies the current sweep is over.
    
    Maybe rephrase to something like: “Return the next buffer in the range. If InvalidBuffer is returned, that implies the current sweep is done."
    
    4 - 0003
    ```
    static BufferDesc *
    NextStratBufToFlush(BufferAccessStrategy strategy,
    Buffer sweep_end,
    XLogRecPtr *lsn, int *sweep_cursor)
    ``
    
    “Strat” is confusing. I think it’s the short version of “Strategy”. As this is a static function, and other function names all have the whole word of “strategy”, why don’t also use the whole word in this function name as well?
    
    5 - 0004
    ```
    +uint32
    +StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
    +{
    +	uint32		max_possible_buffer_limit;
    +	uint32		max_write_batch_size;
    +	int			strategy_pin_limit;
    +
    +	max_write_batch_size = io_combine_limit;
    +
    +	strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
    +	max_possible_buffer_limit = GetPinLimit();
    +
    +	max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
    +	max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
    +	max_write_batch_size = Max(1, max_write_batch_size);
    +	max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
    +	Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
    +	return max_write_batch_size;
    +}
    ```
    
    This implementation is hard to understand. I tried to simplify it:
    ```
    uint32
    StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
    {
    	int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
    	uint32 max_write_batch_size = Min(GetPinLimit(), (uint32)strategy_pin_limit);
    
    	/* Clamp to io_combine_limit and enforce minimum of 1 */
    	if (max_write_batch_size > io_combine_limit)
    		max_write_batch_size = io_combine_limit;
    	if (max_write_batch_size == 0)
    		max_write_batch_size = 1;
    
    	Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
    	return max_write_batch_size;
    }
    ```
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
    
    
    
  13. Re: Checkpointer write combining

    BharatDB <bharatdbpg@gmail.com> — 2025-10-28T12:51:41Z

    On Thu, Oct 16, 2025 at 9:55 AM Chao Li <li.evan.chao@gmail.com> wrote:
    
    > Hi Milanie,
    >
    > Thanks for updating the patch set. I review 1-6 and got a few more small
    > comments. I didn’t review 0007 as it’s marked as WIP.
    >
    > >
    > > - Melanie
    > >
    > <v7-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch><v7-0002-Split-FlushBuffer-into-two-parts.patch><v7-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v7-0004-Write-combining-for-BAS_BULKWRITE.patch><v7-0005-Add-database-Oid-to-CkptSortItem.patch><v7-0006-Implement-checkpointer-data-write-combining.patch><v7-0007-WIP-Refactor-SyncOneBuffer-for-bgwriter-only.patch>
    >
    > 1 - 0001
    > ```
    > --- a/src/include/storage/buf_internals.h
    > +++ b/src/include/storage/buf_internals.h
    > @@ -421,6 +421,12 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner,
    > Buffer buffer)
    >  /*
    >   * Internal buffer management routines
    >   */
    > +
    > +
    > +/* Note: these two macros only work on shared buffers, not local ones! */
    > ```
    >
    > Nit: here you added two empty lines, I think we need only 1.
    >
    > 2 - 0002
    > ```
    > +static void
    > +CleanVictimBuffer(BufferDesc *bufdesc,
    > +                                 bool from_ring, IOContext io_context)
    > +{
    >
    > -       /* Find smgr relation for buffer */
    > -       if (reln == NULL)
    > -               reln = smgropen(BufTagGetRelFileLocator(&buf->tag),
    > INVALID_PROC_NUMBER);
    > +       XLogRecPtr      max_lsn = InvalidXLogRecPtr;
    > +       LWLock     *content_lock;
    > ```
    >
    > Nit: the empty line after “{“ should be removed.
    >
    > 3 - 0003
    > ```
    > +/*
    > + * Return the next buffer in the ring or InvalidBuffer if the current
    > sweep is
    > + * over.
    > + */
    > +Buffer
    > +StrategySweepNextBuffer(BufferAccessStrategy strategy, int *sweep_cursor)
    > +{
    > +       if (++(*sweep_cursor) >= strategy->nbuffers)
    > +               *sweep_cursor = 0;
    > +
    > +       return strategy->buffers[*sweep_cursor];
    > +}
    > ```
    >
    > Feels the function comment is a bit confusing, because the function code
    > doesn’t really perform sweep, the function is just a getter. InvalidBuffer
    > just implies the current sweep is over.
    >
    > Maybe rephrase to something like: “Return the next buffer in the range. If
    > InvalidBuffer is returned, that implies the current sweep is done."
    >
    > 4 - 0003
    > ```
    > static BufferDesc *
    > NextStratBufToFlush(BufferAccessStrategy strategy,
    > Buffer sweep_end,
    > XLogRecPtr *lsn, int *sweep_cursor)
    > ``
    >
    > “Strat” is confusing. I think it’s the short version of “Strategy”. As
    > this is a static function, and other function names all have the whole word
    > of “strategy”, why don’t also use the whole word in this function name as
    > well?
    >
    > 5 - 0004
    > ```
    > +uint32
    > +StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
    > +{
    > +       uint32          max_possible_buffer_limit;
    > +       uint32          max_write_batch_size;
    > +       int                     strategy_pin_limit;
    > +
    > +       max_write_batch_size = io_combine_limit;
    > +
    > +       strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
    > +       max_possible_buffer_limit = GetPinLimit();
    > +
    > +       max_write_batch_size = Min(strategy_pin_limit,
    > max_write_batch_size);
    > +       max_write_batch_size = Min(max_possible_buffer_limit,
    > max_write_batch_size);
    > +       max_write_batch_size = Max(1, max_write_batch_size);
    > +       max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
    > +       Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
    > +       return max_write_batch_size;
    > +}
    > ```
    >
    > This implementation is hard to understand. I tried to simplify it:
    > ```
    > uint32
    > StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
    > {
    >         int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
    >         uint32 max_write_batch_size = Min(GetPinLimit(),
    > (uint32)strategy_pin_limit);
    >
    >         /* Clamp to io_combine_limit and enforce minimum of 1 */
    >         if (max_write_batch_size > io_combine_limit)
    >                 max_write_batch_size = io_combine_limit;
    >         if (max_write_batch_size == 0)
    >                 max_write_batch_size = 1;
    >
    >         Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
    >         return max_write_batch_size;
    > }
    > ```
    >
    > Best regards,
    > --
    > Chao Li (Evan)
    > HighGo Software Co., Ltd.
    > https://www.highgo.com/
    >
    >
    >
    > Hello All,
    
    
    
    > As per reference to the previous mails, I understood the changes made and
    > had tried to replicate the patches into the source code for the bug fix but
    > it didn't show any significant bug. Also I ran some verification tests for
    > the recent changes related to batched write statistics during checkpoints.
    > Below are my observations and results:
    >
    > Test Setup
    >
    >    -
    >
    >    PostgreSQL version: 19devel (custom build)
    >    -
    >
    >    OS: Ubuntu Linux
    >    -
    >
    >    Port: 55432
    >    -
    >
    >    Database: postgres
    >    -
    >
    >    Test tool: pgbench
    >    -
    >
    >    Duration: 120 seconds
    >    -
    >
    >    Command used: pgbench -c 4 -j 4 -T 120 -p 55432 -d postgres
    >
    > Log Output
    
    After running the workload, I triggered a manual checkpoint and checked the
    latest log entry:
    2025-10-28 16:53:05.696 IST [11422] LOG:  checkpoint complete:
    wrote 1383 buffers (8.4%), wrote 3 SLRU buffers;
    write=0.023 s, sync=0.017 s, total=0.071 s;
    sync files=8, longest=0.004 s, average=0.003 s;
    distance=33437 kB, estimate=308790 kB;
    
    Observations:
    
    Metric
    
    Value
    
    Source
    
    Interpretation
    
    Buffers written
    
    1383
    
    >From log
    
    Consistent with moderate workload
    
    Checkpoint write time
    
    0.023 s
    
    >From log
    
    Realistic for ~11 MB write
    
    Checkpoint sync time
    
    0.017 s
    
    >From log
    
    Reasonable
    
    Total checkpoint time
    
    0.071 s
    
    >From log
    
    ≈ write + sync + small overhead
    
    CHECKPOINT runtime (psql)
    
    -
    
    -
    
    Fast, confirms idle background activity
    
    
    The total time closely matches the sum of write and sync times, with only a
    small overhead (expected for control file updates).
    Checkpoint stats in pg_stat_checkpointer also updated correctly, with no
    missing or duplicated values.
    Expected behavior observed:
    
       -
    
       write + sync ≈ total (no zero, truncation, or aggregation bug)
       -
    
       Buffer counts and timing scale realistically with workload
       -
    
       No evidence of under- or over-counted times
       -
    
       Checkpoint stats properly recorded in log and pg_stat_checkpointer
    
    Math check:
    0.023 s + 0.017 s = 0.040 s, total = 0.071 s => difference ≈ 0.03 s
    overhead => normal for control file + metadata writes.
    Comparison to Prior Reports:
    
    Test
    
    Pre-Patch
    
    Post-Patch
    
    Difference
    
    Checkpoint duration
    
    6.5 s
    
    5.0 s
    
    −23
    
    Heavy workload test
    
    16 s
    
    8 s
    
    −50
    
    Result: It shows consistent and stable timing even under moderate pgbench
    load — confirming the patch is integrated and functioning correctly.
    
    Final Status:
    
    Category
    
    Result
    
    Batched Write Stats Accuracy
    
    Verified OK
    
    Timing Aggregation Correctness
    
    Verified OK
    
    pg_stat_checkpointer Consistency
    
    Verified OK
    
    Log Formatting
    
    Normal
    
    Performance Regression
    
    None detected
    *Inference:*
    
    
       -
    
       The reported write, sync, and total timings are consistent and realistic.
       -
    
       No anomalies or timing mismatches were seen.
       -
    
       Checkpoint performance seems stable and accurate under moderate load.
       -
    
       No regression or counter accumulation issues detected.
    
    
    Also, I have been verifying output using manual queries recording the
    observations and find no significant delays. Observations are recorded
    below and the screenshots are attached herewith.
    
    Observations:
    
    Metric
    
    Before Activity
    
    After Activity
    
    Notes
    
    Buffers written
    
    27949
    
    27972
    
    Matches wrote 23 buffers from log
    
    Write time
    
    206882
    
    206887
    
    Matches write=0.005 s from log
    
    Sync time
    
    840
    
    877
    
    Matches sync=0.037 s from log
    
    num_done
    
    6
    
    7
    
    Completed successfully
    
    slru_written
    
    8
    
    9
    
    Matches wrote 1 SLRU buffer from log
    
    Stats reset
    
    yes
    
    yes
    
    successful
    
    num_requested
    
    7
    
    8
    
    manual checkpoint recorded
    
    For further review, I have attached the screenshots of my terminal
    herewith.
    Kindly review the observations and please let me know if any additional
    details need to be focused on.
    
    Thanking you.
    
    Regards,
    Soumya
    
  14. Re: Checkpointer write combining

    Melanie Plageman <melanieplageman@gmail.com> — 2025-11-03T23:34:26Z

    Thanks for continuing to review! I've revised the patches to
    incorporate all of your feedback except for where I mention below.
    
    There were failures in CI due to issues with max batch size, so
    attached v8 also seeks to fix those.
    
    - Melanie
    
    On Thu, Oct 16, 2025 at 12:25 AM Chao Li <li.evan.chao@gmail.com> wrote:
    >
    > 3 - 0003
    > ```
    > +/*
    > + * Return the next buffer in the ring or InvalidBuffer if the current sweep is
    > + * over.
    > + */
    > +Buffer
    > +StrategySweepNextBuffer(BufferAccessStrategy strategy, int *sweep_cursor)
    > +{
    > +       if (++(*sweep_cursor) >= strategy->nbuffers)
    > +               *sweep_cursor = 0;
    > +
    > +       return strategy->buffers[*sweep_cursor];
    > +}
    > ```
    >
    > Feels the function comment is a bit confusing, because the function code doesn’t really perform sweep, the function is just a getter. InvalidBuffer just implies the current sweep is over.
    >
    > Maybe rephrase to something like: “Return the next buffer in the range. If InvalidBuffer is returned, that implies the current sweep is done."
    
    Yes, actually I think having these helpers mention the sweep is more
    confusing than anything else. I've revised them to be named more
    generically and updated the comments accordingly.
    
    > 5 - 0004
    > ```
    > +uint32
    > +StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
    > +{
    > +       uint32          max_possible_buffer_limit;
    > +       uint32          max_write_batch_size;
    > +       int                     strategy_pin_limit;
    > +
    > +       max_write_batch_size = io_combine_limit;
    > +
    > +       strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
    > +       max_possible_buffer_limit = GetPinLimit();
    > +
    > +       max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
    > +       max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
    > +       max_write_batch_size = Max(1, max_write_batch_size);
    > +       max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
    > +       Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
    > +       return max_write_batch_size;
    > +}
    > ```
    >
    > This implementation is hard to understand. I tried to simplify it:
    > ```
    > uint32
    > StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
    > {
    >         int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
    >         uint32 max_write_batch_size = Min(GetPinLimit(), (uint32)strategy_pin_limit);
    >
    >         /* Clamp to io_combine_limit and enforce minimum of 1 */
    >         if (max_write_batch_size > io_combine_limit)
    >                 max_write_batch_size = io_combine_limit;
    >         if (max_write_batch_size == 0)
    >                 max_write_batch_size = 1;
    >
    >         Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
    >         return max_write_batch_size;
    > }
    > ```
    
    I agree that the implementation was hard to understand. I've not quite
    gone with your version but I have rewritten it like this:
    
    uint32
    StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
    {
        uint32        max_write_batch_size = Min(io_combine_limit,
    MAX_IO_COMBINE_LIMIT);
        int            strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
        uint32        max_possible_buffer_limit = GetPinLimit();
    
        /* Identify the minimum of the above */
        max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
        max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
    
        /* Must allow at least 1 IO for forward progress */
        max_write_batch_size = Max(1, max_write_batch_size);
    
        return max_write_batch_size;
    }
    
    Is this better?
    
    - Melanie
    
  15. Re: Checkpointer write combining

    BharatDB <bharatdbpg@gmail.com> — 2025-11-12T11:40:28Z

    Hi all,
    
    Considering the CI failures in earlier patch versions around “max batch
    size”, upon my observation I found the failures arise either from boundary
    conditions when io_combine_limit (GUC) is set larger than the compile-time
    MAX_IO_COMBINE_LIMIT or when pin limits return small/zero values due to
    which it produce out-of-range batch sizes or assertion failures in CI.
    
    Comparing the approaches suggested in the thread, I think implementing (GUC
    + compile-time cap first, and then pin limits) could be more effective in
    avoiding CI failures and also we should consider the following logic
    conditions:
    
       1.
    
       Set io_combine_limit == 0 explicitly (fallback to 1 for forward
       progress).
       2.
    
       Cap early to a conservative compile_cap (MAX_IO_COMBINE_LIMIT - 1) to
       avoid array overflow. Otherwise if we confirm all slots are usable,
       change to MAX_IO_COMBINE_LIMIT.
       3.
    
       Apply per-strategy pin and global pin limits only if they are positive.
       4.
    
       Use explicit typed comparisons to avoid signed/unsigned pitfalls and add
       a final Assert() to capture assumptions in CI.
    
    *Implementation logic:*
    
    uint32
    StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
    {
        uint32        max_write_batch_size;
        uint32        compile_cap = MAX_IO_COMBINE_LIMIT - 1;    /*
    conservative cap */
        int           strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
        uint32        max_possible_buffer_limit = GetPinLimit();
        max_write_batch_size = (io_combine_limit == 0) ? 1 : io_combine_limit;
        if (max_write_batch_size > compile_cap)
            max_write_batch_size = compile_cap;
        if (strategy_pin_limit > 0 &&
            (uint32) strategy_pin_limit < max_write_batch_size)
            max_write_batch_size = (uint32) strategy_pin_limit;
        if (max_possible_buffer_limit > 0 &&
            max_possible_buffer_limit < max_write_batch_size)
            max_write_batch_size = max_possible_buffer_limit;
        if (max_write_batch_size == 0)
            max_write_batch_size = 1;
        Assert(max_write_batch_size <= compile_cap);
        return max_write_batch_size;
    }
    
    I hope this will be helpful for proceeding further. Looking forward to
    more feedback.
    
    Thanking you.
    
    Regards,
    
    Soumya
    
    
    On Tue, Nov 4, 2025 at 5:04 AM Melanie Plageman <melanieplageman@gmail.com>
    wrote:
    
    > Thanks for continuing to review! I've revised the patches to
    > incorporate all of your feedback except for where I mention below.
    >
    > There were failures in CI due to issues with max batch size, so
    > attached v8 also seeks to fix those.
    >
    > - Melanie
    >
    > On Thu, Oct 16, 2025 at 12:25 AM Chao Li <li.evan.chao@gmail.com> wrote:
    > >
    > > 3 - 0003
    > > ```
    > > +/*
    > > + * Return the next buffer in the ring or InvalidBuffer if the current
    > sweep is
    > > + * over.
    > > + */
    > > +Buffer
    > > +StrategySweepNextBuffer(BufferAccessStrategy strategy, int
    > *sweep_cursor)
    > > +{
    > > +       if (++(*sweep_cursor) >= strategy->nbuffers)
    > > +               *sweep_cursor = 0;
    > > +
    > > +       return strategy->buffers[*sweep_cursor];
    > > +}
    > > ```
    > >
    > > Feels the function comment is a bit confusing, because the function code
    > doesn’t really perform sweep, the function is just a getter. InvalidBuffer
    > just implies the current sweep is over.
    > >
    > > Maybe rephrase to something like: “Return the next buffer in the range.
    > If InvalidBuffer is returned, that implies the current sweep is done."
    >
    > Yes, actually I think having these helpers mention the sweep is more
    > confusing than anything else. I've revised them to be named more
    > generically and updated the comments accordingly.
    >
    > > 5 - 0004
    > > ```
    > > +uint32
    > > +StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
    > > +{
    > > +       uint32          max_possible_buffer_limit;
    > > +       uint32          max_write_batch_size;
    > > +       int                     strategy_pin_limit;
    > > +
    > > +       max_write_batch_size = io_combine_limit;
    > > +
    > > +       strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
    > > +       max_possible_buffer_limit = GetPinLimit();
    > > +
    > > +       max_write_batch_size = Min(strategy_pin_limit,
    > max_write_batch_size);
    > > +       max_write_batch_size = Min(max_possible_buffer_limit,
    > max_write_batch_size);
    > > +       max_write_batch_size = Max(1, max_write_batch_size);
    > > +       max_write_batch_size = Min(max_write_batch_size,
    > io_combine_limit);
    > > +       Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
    > > +       return max_write_batch_size;
    > > +}
    > > ```
    > >
    > > This implementation is hard to understand. I tried to simplify it:
    > > ```
    > > uint32
    > > StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
    > > {
    > >         int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
    > >         uint32 max_write_batch_size = Min(GetPinLimit(),
    > (uint32)strategy_pin_limit);
    > >
    > >         /* Clamp to io_combine_limit and enforce minimum of 1 */
    > >         if (max_write_batch_size > io_combine_limit)
    > >                 max_write_batch_size = io_combine_limit;
    > >         if (max_write_batch_size == 0)
    > >                 max_write_batch_size = 1;
    > >
    > >         Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
    > >         return max_write_batch_size;
    > > }
    > > ```
    >
    > I agree that the implementation was hard to understand. I've not quite
    > gone with your version but I have rewritten it like this:
    >
    > uint32
    > StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
    > {
    >     uint32        max_write_batch_size = Min(io_combine_limit,
    > MAX_IO_COMBINE_LIMIT);
    >     int            strategy_pin_limit =
    > GetAccessStrategyPinLimit(strategy);
    >     uint32        max_possible_buffer_limit = GetPinLimit();
    >
    >     /* Identify the minimum of the above */
    >     max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
    >     max_write_batch_size = Min(max_possible_buffer_limit,
    > max_write_batch_size);
    >
    >     /* Must allow at least 1 IO for forward progress */
    >     max_write_batch_size = Max(1, max_write_batch_size);
    >
    >     return max_write_batch_size;
    > }
    >
    > Is this better?
    >
    > - Melanie
    >
    
  16. Re: Checkpointer write combining

    Chao Li <li.evan.chao@gmail.com> — 2025-11-13T08:29:50Z

    
    > On Nov 4, 2025, at 07:34, Melanie Plageman <melanieplageman@gmail.com> wrote:
    > 
    > Thanks for continuing to review! I've revised the patches to
    > incorporate all of your feedback except for where I mention below.
    > 
    > There were failures in CI due to issues with max batch size, so
    > attached v8 also seeks to fix those.
    > 
    > - Melanie
    > 
    > On Thu, Oct 16, 2025 at 12:25 AM Chao Li <li.evan.chao@gmail.com> wrote:
    >> 
    >> 3 - 0003
    >> ```
    >> +/*
    >> + * Return the next buffer in the ring or InvalidBuffer if the current sweep is
    >> + * over.
    >> + */
    >> +Buffer
    >> +StrategySweepNextBuffer(BufferAccessStrategy strategy, int *sweep_cursor)
    >> +{
    >> +       if (++(*sweep_cursor) >= strategy->nbuffers)
    >> +               *sweep_cursor = 0;
    >> +
    >> +       return strategy->buffers[*sweep_cursor];
    >> +}
    >> ```
    >> 
    >> Feels the function comment is a bit confusing, because the function code doesn’t really perform sweep, the function is just a getter. InvalidBuffer just implies the current sweep is over.
    >> 
    >> Maybe rephrase to something like: “Return the next buffer in the range. If InvalidBuffer is returned, that implies the current sweep is done."
    > 
    > Yes, actually I think having these helpers mention the sweep is more
    > confusing than anything else. I've revised them to be named more
    > generically and updated the comments accordingly.
    > 
    >> 5 - 0004
    >> ```
    >> +uint32
    >> +StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
    >> +{
    >> +       uint32          max_possible_buffer_limit;
    >> +       uint32          max_write_batch_size;
    >> +       int                     strategy_pin_limit;
    >> +
    >> +       max_write_batch_size = io_combine_limit;
    >> +
    >> +       strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
    >> +       max_possible_buffer_limit = GetPinLimit();
    >> +
    >> +       max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
    >> +       max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
    >> +       max_write_batch_size = Max(1, max_write_batch_size);
    >> +       max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
    >> +       Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
    >> +       return max_write_batch_size;
    >> +}
    >> ```
    >> 
    >> This implementation is hard to understand. I tried to simplify it:
    >> ```
    >> uint32
    >> StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
    >> {
    >>        int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
    >>        uint32 max_write_batch_size = Min(GetPinLimit(), (uint32)strategy_pin_limit);
    >> 
    >>        /* Clamp to io_combine_limit and enforce minimum of 1 */
    >>        if (max_write_batch_size > io_combine_limit)
    >>                max_write_batch_size = io_combine_limit;
    >>        if (max_write_batch_size == 0)
    >>                max_write_batch_size = 1;
    >> 
    >>        Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
    >>        return max_write_batch_size;
    >> }
    >> ```
    > 
    > I agree that the implementation was hard to understand. I've not quite
    > gone with your version but I have rewritten it like this:
    > 
    > uint32
    > StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
    > {
    >    uint32        max_write_batch_size = Min(io_combine_limit,
    > MAX_IO_COMBINE_LIMIT);
    >    int            strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
    >    uint32        max_possible_buffer_limit = GetPinLimit();
    > 
    >    /* Identify the minimum of the above */
    >    max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
    >    max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
    > 
    >    /* Must allow at least 1 IO for forward progress */
    >    max_write_batch_size = Max(1, max_write_batch_size);
    > 
    >    return max_write_batch_size;
    > }
    > 
    > Is this better?
    
    Yes, I think your version is safer because it enforces the max limit at runtime instead of only asserting it in debug builds.
    
    > 
    > - Melanie
    > <v8-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch><v8-0002-Split-FlushBuffer-into-two-parts.patch><v8-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v8-0004-Write-combining-for-BAS_BULKWRITE.patch><v8-0005-Add-database-Oid-to-CkptSortItem.patch><v8-0006-Implement-checkpointer-data-write-combining.patch><v8-0007-WIP-Refactor-SyncOneBuffer-for-bgwriter-only.patch>
    
    I quickly went through 0001-0006, looks good to me now. As 0007 has WIP in the subject, I skipped it.
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
    
    
    
  17. Re: Checkpointer write combining

    Melanie Plageman <melanieplageman@gmail.com> — 2025-11-18T18:29:56Z

    On Wed, Nov 12, 2025 at 6:40 AM BharatDB <bharatdbpg@gmail.com> wrote:
    >
    > Considering the CI failures in earlier patch versions around “max batch size”, upon my observation I found the failures arise either from boundary conditions when io_combine_limit (GUC) is set larger than the compile-time MAX_IO_COMBINE_LIMIT
    
    How could io_combine_limit be higher than MAX_IO_COMBINE_LIMIT? The
    GUC is capped to MAX_IO_COMBINE_LIMIT -- see guc_parameters.dat.
    
    > or when pin limits return small/zero values due to which it produce out-of-range batch sizes or assertion failures in CI.
    
    This is true. But I think it can be solved with a single comparison to
    1 as in the recent version.
    
    > Comparing the approaches suggested in the thread, I think implementing (GUC + compile-time cap first, and then pin limits) could be more effective in avoiding CI failures and also we should consider the following logic conditions:
    >
    > Set io_combine_limit == 0 explicitly (fallback to 1 for forward progress).
    
    The io_combine_limit GUC has a minimum value of 1 (in guc_parameters.dat)
    
    > Use explicit typed comparisons to avoid signed/unsigned pitfalls and add a final Assert() to capture assumptions in CI.
    
    I think my new version works.
    
        uint32        max_write_batch_size = Min(io_combine_limit,
    MAX_IO_COMBINE_LIMIT);
        int            strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
        uint32        max_possible_buffer_limit = GetPinLimit();
    
        /* Identify the minimum of the above */
        max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
        max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
    
        /* Must allow at least 1 IO for forward progress */
        max_write_batch_size = Max(1, max_write_batch_size);
    
        return max_write_batch_size;
    
    GetAccessStrategyPinLimit() is the only function returning a signed
    value here -- and it should always return a positive value (while I
    wish we would use unsigned integers when a value will never be
    negative, the strategy->nbuffers struct member was added a long time
    ago). Then the Min() macro should work correctly and produce a value
    that fits in a uint32 because of integer promotion rules.
    
    - Melanie
    
    
    
    
  18. Re: Checkpointer write combining

    Melanie Plageman <melanieplageman@gmail.com> — 2025-11-18T18:49:28Z

    On Thu, Nov 13, 2025 at 3:30 AM Chao Li <li.evan.chao@gmail.com> wrote:
    >
    > > On Nov 4, 2025, at 07:34, Melanie Plageman <melanieplageman@gmail.com> wrote:
    > >
    > >
    > > uint32
    > > StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
    > > {
    > >    uint32        max_write_batch_size = Min(io_combine_limit,
    > > MAX_IO_COMBINE_LIMIT);
    > >    int            strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
    > >    uint32        max_possible_buffer_limit = GetPinLimit();
    > >
    > >    /* Identify the minimum of the above */
    > >    max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
    > >    max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
    > >
    > >    /* Must allow at least 1 IO for forward progress */
    > >    max_write_batch_size = Max(1, max_write_batch_size);
    > >
    > >    return max_write_batch_size;
    > > }
    > >
    > > Is this better?
    >
    > Yes, I think your version is safer because it enforces the max limit at runtime instead of only asserting it in debug builds.
    
    Cool. I've attached a v9 which is rebased over recent bufmgr.c
    changes. In the process, I found a bit of cleanup to do.
    
    > I quickly went through 0001-0006, looks good to me now. As 0007 has WIP in the subject, I skipped it.
    
    I no longer remember why I made that patch WIP, so I've removed that
    designation.
    
    - Melanie
    
  19. Re: Checkpointer write combining

    Chao Li <li.evan.chao@gmail.com> — 2025-11-19T02:00:40Z

    
    > On Nov 19, 2025, at 02:49, Melanie Plageman <melanieplageman@gmail.com> wrote:
    > 
    > I no longer remember why I made that patch WIP, so I've removed that
    > designation.
    
    I just reviewed 0007. It removes the second parameter "bool skip_recently_used”  from SyncOneBuffer. The function is static and is only called in one place with skip_recently_used=true, thus removing the parameter seems reasonable, and without considering pinned buffer, the function is simplified a little bit.
    
    I only got a tiny comment:
    ```
    +		 * We can make these check without taking the buffer content lock so
    ``` 
    
    As you changed “this” to “these”, “check” should be changed to “checks” accordingly.
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
    
    
    
  20. Re: Checkpointer write combining

    Chao Li <li.evan.chao@gmail.com> — 2025-11-19T02:09:42Z

    
    > On Nov 19, 2025, at 10:00, Chao Li <li.evan.chao@gmail.com> wrote:
    > 
    > 
    > 
    >> On Nov 19, 2025, at 02:49, Melanie Plageman <melanieplageman@gmail.com> wrote:
    >> 
    >> I no longer remember why I made that patch WIP, so I've removed that
    >> designation.
    > 
    > I just reviewed 0007. It removes the second parameter "bool skip_recently_used”  from SyncOneBuffer. The function is static and is only called in one place with skip_recently_used=true, thus removing the parameter seems reasonable, and without considering pinned buffer, the function is simplified a little bit.
    > 
    > I only got a tiny comment:
    > ```
    > + * We can make these check without taking the buffer content lock so
    > ``` 
    > 
    > As you changed “this” to “these”, “check” should be changed to “checks” accordingly.
    > 
    
    I just got an compile error:
    ```
    bufmgr.c:3580:33: error: no member named 'dbId' in 'struct CkptSortItem'
     3580 |                                 batch.rlocator.dbOid = item.dbId;
          |                                                        ~~~~ ^
    bufmgr.c:3598:13: error: no member named 'dbId' in 'struct CkptSortItem'
     3598 |                         if (item.dbId != batch.rlocator.dbOid)
          |                             ~~~~ ^
    2 errors generated.
    make[4]: *** [bufmgr.o] Error 1
    ```
    
    I tried “make clean” and “make” again, which didn’t work. I think the error is introduced by 0006.
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
    
    
    
  21. Re: Checkpointer write combining

    BharatDB <bharatdbpg@gmail.com> — 2025-11-19T11:21:06Z

    Hi Melanie,
    
    Thank you for the detailed clarifications which helped me understand
    the constraints much more clearly. You are absolutely right regarding
    the key points you specified. My initial concern came from trying to
    avoid cases where strategy pin limits were unexpectedly small (0 or
    negative) or global pin limits temporarily shrink due to memory issue
    / fast cycling of pins.
    
    On Wed, Nov 19, 2025 at 12:00 AM Melanie Plageman
    <melanieplageman@gmail.com> wrote:
    >
    > On Wed, Nov 12, 2025 at 6:40 AM BharatDB <bharatdbpg@gmail.com> wrote:
    > >
    > > Considering the CI failures in earlier patch versions around “max batch size”, upon my observation I found the failures arise either from boundary conditions when io_combine_limit (GUC) is set larger than the compile-time MAX_IO_COMBINE_LIMIT
    >
    > How could io_combine_limit be higher than MAX_IO_COMBINE_LIMIT? The
    > GUC is capped to MAX_IO_COMBINE_LIMIT -- see guc_parameters.dat.
    
    After revisiting guc_parameters.dat, Now I see that the GUC is
    strictly capped at MAX_IO_COMBINE_LIMIT, so comparisons against larger
    values are unnecessary. Thus my earlier concern came from assuming
    some unbounded user-provided values are not applicable here.
    
    > > or when pin limits return small/zero values due to which it produce out-of-range batch sizes or assertion failures in CI.
    >
    > This is true. But I think it can be solved with a single comparison to
    > 1 as in the recent version.
     ok
    
    > > Comparing the approaches suggested in the thread, I think implementing (GUC + compile-time cap first, and then pin limits) could be more effective in avoiding CI failures and also we should consider the following logic conditions:
    > >
    > > Set io_combine_limit == 0 explicitly (fallback to 1 for forward progress).
    >
    > The io_combine_limit GUC has a minimum value of 1 (in guc_parameters.dat)
    Noted.
    
    > > Use explicit typed comparisons to avoid signed/unsigned pitfalls and add a final Assert() to capture assumptions in CI.
    >
    > I think my new version works.
    >
    >     uint32        max_write_batch_size = Min(io_combine_limit,
    > MAX_IO_COMBINE_LIMIT);
    >     int            strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
    >     uint32        max_possible_buffer_limit = GetPinLimit();
    >
    >     /* Identify the minimum of the above */
    >     max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
    >     max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
    >
    >     /* Must allow at least 1 IO for forward progress */
    >     max_write_batch_size = Max(1, max_write_batch_size);
    >
    >     return max_write_batch_size;
    >
    > GetAccessStrategyPinLimit() is the only function returning a signed
    > value here -- and it should always return a positive value (while I
    > wish we would use unsigned integers when a value will never be
    > negative, the strategy->nbuffers struct member was added a long time
    > ago). Then the Min() macro should work correctly and produce a value
    > that fits in a uint32 because of integer promotion rules.
    
    Th explanation about GetAccessStrategyPinLimit() (despite being int)
    makes sense and I agree that with the Min() macro and integer
    promotion rules, the outcome is always safe. Therefore, explicit typed
    comparisons are indeed redundant. However, after reviewing the
    existing code paths and your updated version,
    max_write_batch_size = Max(1, max_write_batch_size);
    => It is clear that both GetAccessStrategyPinLimit() and GetPinLimit()
    should always return sensible positive values and the fallback fully
    covers the forward-progress requirement. And I agree that it is both
    correct and sufficient for the CI failures we were seeing earlier.
    
    Thank you for helping me understand the reasoning behind this design.
    And this will be kept in mind for further work on implementing  write
    combining.
    I appreciate your patience and the opportunity to refine my
    assumptions. Looking forward to more suggestions and feedbacks.
    
    Thanking you.
    
    Regards,
    Soumya
    
    
    
    
  22. Re: Checkpointer write combining

    Melanie Plageman <melanieplageman@gmail.com> — 2025-11-19T22:12:49Z

    Thanks for the review!
    
    On Tue, Nov 18, 2025 at 9:10 PM Chao Li <li.evan.chao@gmail.com> wrote:
    >
    > > On Nov 19, 2025, at 10:00, Chao Li <li.evan.chao@gmail.com> wrote:
    >
    > > I just reviewed 0007. It removes the second parameter "bool skip_recently_used”  from SyncOneBuffer. The function is static and is only called in one place with skip_recently_used=true, thus removing the parameter seems reasonable, and without considering pinned buffer, the function is simplified a little bit.
    > >
    > > I only got a tiny comment:
    > > ```
    > > + * We can make these check without taking the buffer content lock so
    > > ```
    > >
    > > As you changed “this” to “these”, “check” should be changed to “checks” accordingly.
    
    I've made this change. Attached v10 has this.
    
    > I just got an compile error:
    > ```
    > bufmgr.c:3580:33: error: no member named 'dbId' in 'struct CkptSortItem'
    >  3580 |                                 batch.rlocator.dbOid = item.dbId;
    >       |                                                        ~~~~ ^
    > bufmgr.c:3598:13: error: no member named 'dbId' in 'struct CkptSortItem'
    >  3598 |                         if (item.dbId != batch.rlocator.dbOid)
    >       |                             ~~~~ ^
    > 2 errors generated.
    > make[4]: *** [bufmgr.o] Error 1
    > ```
    >
    > I tried “make clean” and “make” again, which didn’t work. I think the error is introduced by 0006.
    
    Are you sure you applied 0005? It is the one that adds dbId to CkptSortItem.
    
    - Melanie
    
  23. Re: Checkpointer write combining

    Chao Li <li.evan.chao@gmail.com> — 2025-11-20T02:31:43Z

    
    > On Nov 20, 2025, at 06:12, Melanie Plageman <melanieplageman@gmail.com> wrote:
    > 
    > Are you sure you applied 0005? It is the one that adds dbId to CkptSortItem.
    > 
    
    My bad. Yes, I missed to download and apply 0005.
    
    > - Melanie
    > <v10-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch><v10-0002-Split-FlushBuffer-into-two-parts.patch><v10-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v10-0004-Write-combining-for-BAS_BULKWRITE.patch><v10-0005-Add-database-Oid-to-CkptSortItem.patch><v10-0006-Implement-checkpointer-data-write-combining.patch><v10-0007-Refactor-SyncOneBuffer-for-bgwriter-use-only.patch>
    
    I went through the whole patch set again, and got a few more comments:
    
    1 - 0002
    ```
    +/*
    + * Prepare and write out a dirty victim buffer.
    + *
    + * Buffer must be pinned, the content lock must be held exclusively, and the
    + * buffer header spinlock must not be held. The exclusive lock is released and
    + * the buffer is returned pinned but not locked.
    + *
    + * bufdesc may be modified.
    + */
    +static void
    +CleanVictimBuffer(BufferDesc *bufdesc,
    +				  bool from_ring, IOContext io_context)
    +{
    +	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
    +	LWLock	   *content_lock;
     
    -	/* Find smgr relation for buffer */
    -	if (reln == NULL)
    -		reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
    +	/* Set up this victim buffer to be flushed */
    +	if (!PrepareFlushBuffer(bufdesc, &max_lsn))
    +		return;
    ```
    
    I believe when PrepareFlushBuffer(bufdesc, &max_lsn) is false, before “return”, we should release “content_lock”.
    
    Because the function comment clearly says “the content lock must be held exclusively”. Also, looking at the code where CleanVictimBuffer() is called:
    ```
    			if (StrategyRejectBuffer(strategy, buf_hdr, from_ring))
    			{
    				LWLockRelease(content_lock);
    				UnpinBuffer(buf_hdr);
    				continue;
    			}
    
    			/* Content lock is released inside CleanVictimBuffer */
    			CleanVictimBuffer(buf_hdr, from_ring, io_context);
    ```
    
    In the previous “if” clause, it releases content_lock.
    
    2 - 0002
    ```
    + * Buffer must be pinned, the content lock must be held exclusively, and the
    + * buffer header spinlock must not be held. The exclusive lock is released and
    + * the buffer is returned pinned but not locked.
    + *
    + * bufdesc may be modified.
    + */
    +static void
    +CleanVictimBuffer(BufferDesc *bufdesc,
    +				  bool from_ring, IOContext io_context)
    ```
    
    The function comment says "the content lock must be held exclusively”, but from the code:
    ```
    			content_lock = BufferDescriptorGetContentLock(buf_hdr);
    			if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
    			{
    				/*
    				 * Someone else has locked the buffer, so give it up and loop
    				 * back to get another one.
    				 */
    				UnpinBuffer(buf_hdr);
    				continue;
    			}
    
    			/*
    			 * If using a nondefault strategy, and writing the buffer would
    			 * require a WAL flush, let the strategy decide whether to go
    			 * ahead and write/reuse the buffer or to choose another victim.
    			 * We need the content lock to inspect the page LSN, so this can't
    			 * be done inside StrategyGetBuffer.
    			 */
    			if (StrategyRejectBuffer(strategy, buf_hdr, from_ring))
    			{
    				LWLockRelease(content_lock);
    				UnpinBuffer(buf_hdr);
    				continue;
    			}
    
    			/* Content lock is released inside CleanVictimBuffer */
    			CleanVictimBuffer(buf_hdr, from_ring, io_context);
    ```
    Content_lock is acquired with LW_SHARED.
    
    3 - 0003
    
    In CleanVictimBuffer(), more logic are added, but the content_lock leak problem is still there.
    
    4 - 0003
    ```
    +/*
    + * Returns true if the buffer needs WAL flushed before it can be written out.
    + * Caller must not already hold the buffer header spinlock.
    + */
    +static bool
    +BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn)
    +{
    +	uint32		buf_state = LockBufHdr(bufdesc);
    +
    +	*lsn = BufferGetLSN(bufdesc);
    +
    +	UnlockBufHdr(bufdesc);
    +
    +	/*
    +	 * See buffer flushing code for more details on why we condition this on
    +	 * the relation being logged.
    +	 */
    +	return buf_state & BM_PERMANENT && XLogNeedsFlush(*lsn);
    +}
    ```
    
    This is a new function. I am thinking that if we should only update “lsn” when not BM_PERMANENT? Because from the existing code:
    ```
    	if (buf_state & BM_PERMANENT)
    		XLogFlush(recptr);
    ```
    
    XLogFlush should only happen when BM_PERMANENT.
    
    5 - 0004 - I am thinking if that could be a race condition?
    
    PageSetBatchChecksumInplace() is called once after all pages were pinned earlier, but other backends may modify the page contents while the batch is being assembled, because batching only holds content_lock per page temporarily, not across the entire run.
    So that:
        • Page A pinned + content lock acquired + LSN read → content lock released
        • Another backend modifies Page A and sets new LSN, dirties buffer
        • Page A is written by this batch using an outdated checksum / outdated page contents
    
    6 - 0006 - Ah, 0006 seems to have solved comment 3 about BM_PERMANENT.
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
    
    
    
  24. Re: Checkpointer write combining

    Melanie Plageman <melanieplageman@gmail.com> — 2025-11-20T22:11:34Z

    Thanks for continuing to review this!
    
    On Wed, Nov 19, 2025 at 9:32 PM Chao Li <li.evan.chao@gmail.com> wrote:
    >
    > +static void
    > +CleanVictimBuffer(BufferDesc *bufdesc,
    > +                                 bool from_ring, IOContext io_context)
    > +{
    > +       XLogRecPtr      max_lsn = InvalidXLogRecPtr;
    > +       LWLock     *content_lock;
    >
    > -       /* Find smgr relation for buffer */
    > -       if (reln == NULL)
    > -               reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
    > +       /* Set up this victim buffer to be flushed */
    > +       if (!PrepareFlushBuffer(bufdesc, &max_lsn))
    > +               return;
    > ```
    >
    > I believe when PrepareFlushBuffer(bufdesc, &max_lsn) is false, before “return”, we should release “content_lock”.
    
    Oh wow, you are so right. That's a big mistake! I would have thought a
    test would fail, but it seems we don't have coverage of this. I've
    fixed the code in the attached v11. I'll see how difficult it is to
    write a test that covers the case where we try to do IO but someone
    else has already done it.
    
    > + * Buffer must be pinned, the content lock must be held exclusively, and the
    > + * buffer header spinlock must not be held. The exclusive lock is released and
    > + * the buffer is returned pinned but not locked.
    >
    > The function comment says "the content lock must be held exclusively”, but from the code:
    > Content_lock is acquired with LW_SHARED.
    
    Thanks, I've updated the comment.
    
    > +static bool
    > +BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn)
    > +{
    > +       uint32          buf_state = LockBufHdr(bufdesc);
    > +
    > +       *lsn = BufferGetLSN(bufdesc);
    > +
    > +       UnlockBufHdr(bufdesc);
    > +
    > +       /*
    > +        * See buffer flushing code for more details on why we condition this on
    > +        * the relation being logged.
    > +        */
    > +       return buf_state & BM_PERMANENT && XLogNeedsFlush(*lsn);
    > +}
    >
    > This is a new function. I am thinking that if we should only update “lsn” when not BM_PERMANENT?
    
    That makes sense. I don't use the lsn unless the buffer is logged
    (BM_PERMANENT), but I think it is weird for the function to set the
    LSN if it is returning false. I've modified the function to set it to
    InvalidXLogRecPtr when it would return false. I've also added an
    assert before XLogFlush() to make sure the buffer is logged before
    flushing the WAL.
    
    > 5 - 0004 - I am thinking if that could be a race condition?
    >
    > PageSetBatchChecksumInplace() is called once after all pages were pinned earlier, but other backends may modify the page contents while the batch is being assembled, because batching only holds content_lock per page temporarily, not across the entire run.
    > So that:
    >     • Page A pinned + content lock acquired + LSN read → content lock released
    >     • Another backend modifies Page A and sets new LSN, dirties buffer
    >     • Page A is written by this batch using an outdated checksum / outdated page contents
    
    So, there is a race condition but it is slightly different from the
    scenario you mention. We actually hold the content lock until after
    doing the write. That means someone else can't get an exclusive lock
    and modify the tuple data in the buffer. However, queries can set hint
    bits while only holding a share lock. That can update the LSN (if an
    FPI is required), which would cause bad things to happen when we write
    out a buffer with an outdated checksum. What we do in master is make a
    copy of the buffer, checksum it, and write out the copied buffer (see
    PageSetChecksumCopy() and its function comment).
    
    I have an XXX in PageSetBatchChecksumInplace() and in the commit
    message for this patch explaining that it isn't committable until some
    ongoing work Andres is doing adding a new lock type for setting hint
    bits is committed. So, this specific part of this patch is WIP.
    
    - Melanie