Thread

Checkpointer write combining

Melanie Plageman <melanieplageman@gmail.com> — 2025-09-02T21:10:43Z

Hi,

The attached patchset implements checkpointer write combining -- which
makes immediate checkpoints at least 20% faster in my tests.
Checkpointer achieves higher write throughput and higher write IOPs
with the patch.

Besides the immediate performance gain with the patchset, we will
eventually need all writers to do write combining if we want to use
direct IO. Additionally, I think the general shape I refactored
BufferSync() into will be useful for AIO-ifying checkpointer.

The patch set has preliminary patches (0001-0004) that implement eager
flushing and write combining for bulkwrites (like COPY FROM). The
functions used to flush a batch of writes for bulkwrites (see 0004)
are reused for the checkpointer. The eager flushing component of this
patch set has been discussed elsewhere [1].

0005 implements a fix for XLogNeedsFlush() when called by checkpointer
during an end-of-crash-recovery checkpoint. I've already started
another thread about this [2], but the patch is required for the patch
set to pass tests.

One outstanding action item is to test to see if there are any
benefits to spread checkpoints.

More on how I measured the performance benefit to immediate checkpoints:

I tuned checkpoint_completion_target, checkpoint_timeout, and min and
max_wal_size to ensure no other checkpoints were initiated.

With 16 GB shared buffers and io_combine_limit 128, I created a 15 GB
table. To get consistent results, I used pg_prewarm to read the table
into shared buffers, issued a checkpoint, then used Bilal's patch [3]
to mark all the buffers as dirty again and issue another checkpoint.
On a fast local SSD, this proved to be a consistent 20%+ speed up
(~6.5 seconds to ~5 seconds).

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAAKRu_a1vZRZRWO3_jv_X13RYoqLRVipGO0237g5PKzPa2YX6g%40mail.gmail.com
[3] https://www.postgresql.org/message-id/flat/CAN55FZ0h_YoSqqutxV6DES1RW8ig6wcA8CR9rJk358YRMxZFmw%40mail.gmail.com

Re: Checkpointer write combining

Melanie Plageman <melanieplageman@gmail.com> — 2025-09-08T23:44:02Z

On Tue, Sep 2, 2025 at 5:10 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> The attached patchset implements checkpointer write combining -- which
> makes immediate checkpoints at least 20% faster in my tests.
> Checkpointer achieves higher write throughput and higher write IOPs
> with the patch.

These needed a rebase. Attached v2.

- Melanie

Re: Checkpointer write combining

Nazir Bilal Yavuz <byavuz81@gmail.com> — 2025-09-09T13:27:47Z

Hi,

Thank you for working on this!

On Tue, 9 Sept 2025 at 02:44, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Tue, Sep 2, 2025 at 5:10 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > The attached patchset implements checkpointer write combining -- which
> > makes immediate checkpoints at least 20% faster in my tests.
> > Checkpointer achieves higher write throughput and higher write IOPs
> > with the patch.

I did the same benchmark you did and I found it is %50 faster (16
seconds to 8 seconds).

Re: Checkpointer write combining

Melanie Plageman <melanieplageman@gmail.com> — 2025-09-09T13:39:19Z

On Tue, Sep 9, 2025 at 9:27 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>

Thanks for the review!

> From 053dd9d15416d76ce4b95044d848f51ba13a2d20 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 2 Sep 2025 11:00:44 -0400
> Subject: [PATCH v2 1/9] Refactor goto into for loop in GetVictimBuffer()
>
> @@ -731,6 +741,13 @@ StrategyRejectBuffer(BufferAccessStrategy
> strategy, BufferDesc *buf, bool from_r
>          strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
>          return false;
>
> +    buf_state = LockBufHdr(buf);
> +    lsn = BufferGetLSN(buf);
> +    UnlockBufHdr(buf, buf_state);
> +
> +    if (!XLogNeedsFlush(lsn))
> +        return true;
>
> I think this should return false.

Oops, you're right. v3 attached with that mistake fixed.

- Melanie

Re: Checkpointer write combining

Melanie Plageman <melanieplageman@gmail.com> — 2025-09-09T15:16:59Z

On Tue, Sep 9, 2025 at 9:39 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> Oops, you're right. v3 attached with that mistake fixed.

One more fix and a bit more cleanup in attached v4.

- Melanie

Re: Checkpointer write combining

Melanie Plageman <melanieplageman@gmail.com> — 2025-09-09T17:55:08Z

On Tue, Sep 9, 2025 at 11:16 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> One more fix and a bit more cleanup in attached v4.

Okay one more version: I updated the thread on eager flushing the
bulkwrite ring [1], and some updates were needed here.

- Melanie

[1] https://www.postgresql.org/message-id/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com

Re: Checkpointer write combining

Jeff Davis <pgsql@j-davis.com> — 2025-09-10T01:28:55Z

On Tue, 2025-09-09 at 13:55 -0400, Melanie Plageman wrote:
> On Tue, Sep 9, 2025 at 11:16 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > 
> > One more fix and a bit more cleanup in attached v4.
> 
> Okay one more version: I updated the thread on eager flushing the
> bulkwrite ring [1], and some updates were needed here.

v5-0005 comments:

  * Please update the comment above the code change.
  * The last paragraph in the commit message has a typo: "potentially
update the local copy of min recovery point, when xlog inserts are
*not* allowed", right?
  * Shouldn't the code be consistent between XLogNeedsFlush() and
XLogFlush()? The latter only checks for !XLogInsertAllowed(), whereas
the former also checks for RecoveryInProgress().

I'm still not sure I understand the problem situation this is fixing,
but that's being discussed in another thread.

Regards,
	Jeff Davis

Re: Checkpointer write combining

Chao Li <li.evan.chao@gmail.com> — 2025-09-10T08:24:13Z

> On Sep 10, 2025, at 01:55, Melanie Plageman <melanieplageman@gmail.com> wrote:
> 
> 
> [1] https://www.postgresql.org/message-id/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
> <v5-0004-Write-combining-for-BAS_BULKWRITE.patch><v5-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch><v5-0005-Fix-XLogNeedsFlush-for-checkpointer.patch><v5-0002-Split-FlushBuffer-into-two-parts.patch><v5-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v5-0006-Add-database-Oid-to-CkptSortItem.patch><v5-0007-Implement-checkpointer-data-write-combining.patch>

1 - 0001
```
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c

+ * The buffer must pinned and content locked and the buffer header spinlock
```

“Must pinned” -> “must be pinned"

2 - 0001
```
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c

+	if (XLogNeedsFlush(lsn))
+	{
+		/*
+		 * Remove the dirty buffer from the ring; necessary to prevent an
+		 * infinite loop if all ring members are dirty.
+		 */
+		strategy->buffers[strategy->current] = InvalidBuffer;
+		return true;
+	}

-	return true;
+	return false;
 }
```

We can do:
```
         If (!XLogNeedsFlush(lan))
            Return false

        /* Remove the dirty buffer ….
         */
       Return true;
}
```

This way makes less diff.

3 - 0002
```
+ * Prepare to write and write a dirty victim buffer.
```

Prepare to write a dirty victim buffer.

4 - 0002
```
-			/* OK, do the I/O */
-			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-			LWLockRelease(content_lock);
-
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &buf_hdr->tag);
+			CleanVictimBuffer(buf_hdr, &buf_state, from_ring, io_context);
```
I saw CleanVictimBuffer() will get content_lock from bufdesc and release it, but it makes the code hard to understand. Readers might be confused that why content_lock is not released after CleanVictimBuffer() without further reading CleanVictimBuffer().

I’d suggest pass content_lock to CleanVictimBuffer() as a parameter, which gives a clear hint that CleanVictimBuffer() will release the lock.

5 - 0002
```
 	 * disastrous system-wide consequences.  To make sure that can't happen,
 	 * skip the flush if the buffer isn't permanent.
 	 */
-	if (buf_state & BM_PERMANENT)
-		XLogFlush(recptr);
+	if (!XLogRecPtrIsInvalid(buffer_lsn))
+		XLogFlush(buffer_lsn);
```

Why this check is changed? Should the comment be updated accordingly as it says “if the buffer isn’t permanent”, which reflects to the old code.

6 - 0003
```
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c

+ * patterns than lazily flushing buffers directly before reusing them.
+ */
```

Here “directly” is kind ambiguous. It could mean “immediately before” or “without going through something else”. My understanding is “immediately”, If that is true, please change “directly” to “immediately” or just remove it.

7 - 0003
```
+void
+StartStrategySweep(BufferAccessStrategy strategy)
+{
+	if (!strategy)
+		return;
```

I doubt if this “strategy” null check is needed. Because it is only called when strategy_supports_eager_flush() is true, and strategy_supports_eager_flush() has asserted “strategy”.

And as a pair function, StrategySweepNextBuffer() doesn’t do null check nor assert strategy.

8 - 0003
```
bool
+strategy_supports_eager_flush(BufferAccessStrategy strategy)
```

This function is only used in bufmgr.c, can we move it there and make it static?

9 - 0004
```
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c

+	batch->start = batch->bufdescs[0]->tag.blockNum;
+	batch->forkno = BufTagGetForkNum(&batch->bufdescs[0]->tag);
+	batch->rlocator = BufTagGetRelFileLocator(&batch->bufdescs[0]->tag);
+	batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+	Assert(BlockNumberIsValid(batch->start));
```

Why don’t assert immediately after batch->start is assigned? So upon error, smgropen() will not be called.

10 - 0004
```
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c

+	limit = Min(max_batch_size, limit);
```

Do we need to check max_batch_size should be less than (MAX_IO_COMBINE_LIMIT-1)? Because BufWriteBatch.bufdescs is defined with length of MAX_IO_COMBINE_LIMIT, and the first place has been used to store “start”.

11 - 0004
```
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c

+	for (batch->n = 1; batch->n < limit; batch->n++)
+	{
+		Buffer		bufnum;
+
+		if ((bufnum = StrategySweepNextBuffer(strategy)) == InvalidBuffer)
+			break;
```

Is sweep next buffer right next to start? If yes, can we assert that? But my guess is no, if my guess is true, then is it possible that bufnum meets start? If that’s true, then we should check next buffer doesn’t equal to start.

12 - 0004
```
@@ -4306,19 +4370,22 @@ CleanVictimBuffer(BufferAccessStrategy strategy,

 	if (from_ring && strategy_supports_eager_flush(strategy))
 	{
+		uint32		max_batch_size = max_write_batch_size_for_strategy(strategy);
```

I think max_batch_size can be attribute of strategy and set it when creating a strategy, so that we don’t need to calculate in every round of clean.

13 - 0004
```
+void
+CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
+					 WritebackContext *wb_context)
+{
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+	pgBufferUsage.shared_blks_written += batch->n;
```

Should we only increase shared_blks_written only after the loop of write-back is done?

14 - 0004
```
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c

+uint32
+max_write_batch_size_for_strategy(BufferAccessStrategy strategy)
```

I think this function can be moved to bufmgr.c and make it static.

15 - 0004
```
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1546,3 +1546,23 @@ PageSetChecksumInplace(Page page, BlockNumber blkno)

 	((PageHeader) page)->pd_checksum = pg_checksum_page(page, blkno);
 }
+
+/*
+ * A helper to set multiple block's checksums.
+ */
+void
+PageSetBatchChecksumInplace(Page *pages, BlockNumber *blknos, uint32 length)
```

We should mark blknos as const to indicate it is readonly: const BlockNumber *blknos, which will also prevent from incidentally change on blknos in within the function. 

16 - 0005
```
 	 * instead. So "needs flush" is taken to mean whether minRecoveryPoint
 	 * would need to be updated.
 	 */
-	if (RecoveryInProgress())
+	if (RecoveryInProgress() && !XLogInsertAllowed())
```

As a new check is added, the comment should be updated accordingly.

17 - 0006
```
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -382,6 +382,7 @@ UnlockBufHdr(BufferDesc *desc, uint32 buf_state)
 typedef struct CkptSortItem
 {
 	Oid			tsId;
+	Oid			db_id;
```

I think “db_id” should be named “dbId” or “dbOid”. Let’s keep the name conversation consistent.

18 - 0007
```
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c

+	max_batch_size = checkpointer_max_batch_size();
```

Look like we don’t need to calculate max_batch_size in the for loop.

19 - 0007
```
+		 * Each batch will have exactly one start and one max lsn and one
+		 * length.
 		 */
```

I don’t get what you want to explain with this comment. It sounds quite unnecessary.

Best regards,
—
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: Checkpointer write combining

Melanie Plageman <melanieplageman@gmail.com> — 2025-09-11T23:11:12Z

On Wed, Sep 10, 2025 at 4:24 AM Chao Li <li.evan.chao@gmail.com> wrote:
>

Thanks for the review!

For any of your feedback that I simply implemented, I omitted an
inline comment about it. Those changes are included in the attached
v6. My inline replies below are only for feedback requiring more
discussion.

> On Sep 10, 2025, at 01:55, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> 2 - 0001
> ```
> --- a/src/backend/storage/buffer/freelist.c
> +++ b/src/backend/storage/buffer/freelist.c
>
> + if (XLogNeedsFlush(lsn))
> + {
> + /*
> + * Remove the dirty buffer from the ring; necessary to prevent an
> + * infinite loop if all ring members are dirty.
> + */
> + strategy->buffers[strategy->current] = InvalidBuffer;
> + return true;
> + }
>
> - return true;
> + return false;
>  }
> ```
>
> We can do:
> ```
>          If (!XLogNeedsFlush(lan))
>             Return false
>
>         /* Remove the dirty buffer ….
>          */
>        Return true;
> }
> ```

This would make the order of evaluation the same as master but I
actually prefer it this way because then we only take the buffer
header spinlock if there is a chance we will reject the buffer (e.g.
we don't need to examine it for strategies except BAS_BULKREAD)

> 4 - 0002
> ```
> - /* OK, do the I/O */
> - FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
> - LWLockRelease(content_lock);
> -
> - ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
> -  &buf_hdr->tag);
> + CleanVictimBuffer(buf_hdr, &buf_state, from_ring, io_context);
> ```
> I saw CleanVictimBuffer() will get content_lock from bufdesc and release it, but it makes the code hard to understand. Readers might be confused that why content_lock is not released after CleanVictimBuffer() without further reading CleanVictimBuffer().
>
> I’d suggest pass content_lock to CleanVictimBuffer() as a parameter, which gives a clear hint that CleanVictimBuffer() will release the lock.

I think for this specific patch in the set your idea makes sense.
However, in the later patch to do write combining, I release the
content locks for the batch in CompleteWriteBatchIO() and having the
start buffer's lock separate as a parameter would force me to have a
special case handling this.

I've added a comment to both CleanVictimBuffer() and its caller
specifying that the lock must be held and that it will be released
inside CleanVictimBuffer.

> 5 - 0002
> ```
>   * disastrous system-wide consequences.  To make sure that can't happen,
>   * skip the flush if the buffer isn't permanent.
>   */
> - if (buf_state & BM_PERMANENT)
> - XLogFlush(recptr);
> + if (!XLogRecPtrIsInvalid(buffer_lsn))
> + XLogFlush(buffer_lsn);
> ```
>
> Why this check is changed? Should the comment be updated accordingly as it says “if the buffer isn’t permanent”, which reflects to the old code.

It's changed because I split the logic for flushing to that LSN and
determining the LSN across the Prepare and Do functions. This is
needed because when we do batches, we want to flush to the max LSN
across all buffers in the batch.

I check if the buffer is BM_PERMANENT in PrepareFlushBuffer(). You
make a good point about my comment, though. I've moved it to
PrepareFlushBuffer() and updated it.

> 8 - 0003
> ```
> bool
> +strategy_supports_eager_flush(BufferAccessStrategy strategy)
> ```
>
> This function is only used in bufmgr.c, can we move it there and make it static?

BufferAccessStrategyData is opaque to bufmgr.c. Only freelist.c can
access it. I agree it is gross that I have these helpers and functions
that would otherwise be static, though.

> 10 - 0004
> ```
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
>
> + limit = Min(max_batch_size, limit);
> ```
>
> Do we need to check max_batch_size should be less than (MAX_IO_COMBINE_LIMIT-1)? Because BufWriteBatch.bufdescs is defined with length of MAX_IO_COMBINE_LIMIT, and the first place has been used to store “start”.

I assert that in StrategyMaxWriteBatchSize(). io_combine_limit is not
allowed to exceed MAX_IO_COMBINE_LIMIT, so it shouldn't happen anyway,
since we are capping ourselves at io_combine_limit. Or is that your
point?

> 11 - 0004
> ```
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
>
> + for (batch->n = 1; batch->n < limit; batch->n++)
> + {
> + Buffer bufnum;
> +
> + if ((bufnum = StrategySweepNextBuffer(strategy)) == InvalidBuffer)
> + break;
> ```
>
> Is sweep next buffer right next to start? If yes, can we assert that? But my guess is no, if my guess is true, then is it possible that bufnum meets start? If that’s true, then we should check next buffer doesn’t equal to start.

Ah, great point. I didn't think about this. Our sweep will always
start right after the start buffer, but then if it goes all the way
around, it will "lap" the start buffer. Because of this and because I
think it is weird to have the sweep variables in the
BufferAccessStrategy object, I've changed my approach in attached v6.
I set sweep_end to be the start block in the batch and then pass
around a sweep cursor variable. Hitting sweep_end is the termination
condition.

> 12 - 0004
> ```
> @@ -4306,19 +4370,22 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
>
>   if (from_ring && strategy_supports_eager_flush(strategy))
>   {
> + uint32 max_batch_size = max_write_batch_size_for_strategy(strategy);
> ```
>
> I think max_batch_size can be attribute of strategy and set it when creating a strategy, so that we don’t need to calculate in every round of clean.

Actually, the max pin limit can change quite frequently. See
GetAdditionalPinLimit()'s usage in read stream code. If the query is
pinning other buffers in another part of the query, it can change our
limit.

I'm not sure if I should call GetAdditionalPinLImit() for each batch
or for each run of batches (like in StrategyMaxWriteBatchSize()).
Currently, I call it for each batch (in FindFlushAdjacents()). The
read stream calls it pretty frequently (each
read_stream_start_pending_read()). But, in the batch flush case,
nothing could change between batches in a run of batches. So maybe I
should move it up and out and make it per run of batches...

> 13 - 0004
> ```
> +void
> +CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
> + WritebackContext *wb_context)
> +{
> + ErrorContextCallback errcallback =
> + {
> + .callback = shared_buffer_write_error_callback,
> + .previous = error_context_stack,
> + };
> +
> + error_context_stack = &errcallback;
> + pgBufferUsage.shared_blks_written += batch->n;
> ```
>
> Should we only increase shared_blks_written only after the loop of write-back is done?

On master, FlushBuffer() does it after smgrwrite() (before writeback).
I think pgBufferUsage is mainly used in EXPLAIN (also
pg_stat_statements) which won't be used until the end of the query and
won't be displayed if we error out.

> 14 - 0004
> ```
> --- a/src/backend/storage/buffer/freelist.c
> +++ b/src/backend/storage/buffer/freelist.c
>
> +uint32
> +max_write_batch_size_for_strategy(BufferAccessStrategy strategy)
> ```
>
> I think this function can be moved to bufmgr.c and make it static.

This technically could be moved, but it is a function giving you
information about a strategy which seemed to fit better in freelist.c.

> 18 - 0007
> ```
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
>
> + max_batch_size = checkpointer_max_batch_size();
> ```
>
> Look like we don’t need to calculate max_batch_size in the for loop.

I don't think it's in the for loop.

- Melanie

Re: Checkpointer write combining

Chao Li <li.evan.chao@gmail.com> — 2025-09-12T03:33:17Z


> On Sep 12, 2025, at 07:11, Melanie Plageman <melanieplageman@gmail.com> wrote:
> 
> On Wed, Sep 10, 2025 at 4:24 AM Chao Li <li.evan.chao@gmail.com> wrote:
>> 
> 
> Thanks for the review!
> 
> For any of your feedback that I simply implemented, I omitted an
> inline comment about it. Those changes are included in the attached
> v6. My inline replies below are only for feedback requiring more
> discussion.
> 
>> On Sep 10, 2025, at 01:55, Melanie Plageman <melanieplageman@gmail.com> wrote:
>> 
>> 2 - 0001
>> ```
>> --- a/src/backend/storage/buffer/freelist.c
>> +++ b/src/backend/storage/buffer/freelist.c
>> 
>> + if (XLogNeedsFlush(lsn))
>> + {
>> + /*
>> + * Remove the dirty buffer from the ring; necessary to prevent an
>> + * infinite loop if all ring members are dirty.
>> + */
>> + strategy->buffers[strategy->current] = InvalidBuffer;
>> + return true;
>> + }
>> 
>> - return true;
>> + return false;
>> }
>> ```
>> 
>> We can do:
>> ```
>>         If (!XLogNeedsFlush(lan))
>>            Return false
>> 
>>        /* Remove the dirty buffer ….
>>         */
>>       Return true;
>> }
>> ```
> 
> This would make the order of evaluation the same as master but I
> actually prefer it this way because then we only take the buffer
> header spinlock if there is a chance we will reject the buffer (e.g.
> we don't need to examine it for strategies except BAS_BULKREAD)
> 
> 

I don’t understand why the two versions are different:

    if (XLogNeedsFlush(lsn))
    {
        /*
         * Remove the dirty buffer from the ring; necessary to prevent an
         * infinite loop if all ring members are dirty.
         */
        strategy->buffers[strategy->current] = InvalidBuffer;
        return true;
    }

    return false;

VS

    if (XLogNeedsFlush(lsn))
        return false;

    /*
     * Remove the dirty buffer from the ring; necessary to prevent an
     * infinite loop if all ring members are dirty.
    */
    strategy->buffers[strategy->current] = InvalidBuffer;
    return true;

> 
>> 10 - 0004
>> ```
>> --- a/src/backend/storage/buffer/bufmgr.c
>> +++ b/src/backend/storage/buffer/bufmgr.c
>> 
>> + limit = Min(max_batch_size, limit);
>> ```
>> 
>> Do we need to check max_batch_size should be less than (MAX_IO_COMBINE_LIMIT-1)? Because BufWriteBatch.bufdescs is defined with length of MAX_IO_COMBINE_LIMIT, and the first place has been used to store “start”.
> 
> I assert that in StrategyMaxWriteBatchSize(). io_combine_limit is not
> allowed to exceed MAX_IO_COMBINE_LIMIT, so it shouldn't happen anyway,
> since we are capping ourselves at io_combine_limit. Or is that your
> point?
> 

Please ignore comment 10. I think I cross-line it in my original email. I added the comment, then lately I found you have checked MAX_IO_COMBINE_LIMIT in the other function, so tried to delete it by cross-lining the comment.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: Checkpointer write combining

Melanie Plageman <melanieplageman@gmail.com> — 2025-10-15T20:56:55Z

On Thu, Sep 11, 2025 at 11:33 PM Chao Li <li.evan.chao@gmail.com> wrote:
>
> I don’t understand why the two versions are different:
>
> if (XLogNeedsFlush(lsn))
> {
> /*
> * Remove the dirty buffer from the ring; necessary to prevent an
> * infinite loop if all ring members are dirty.
> */
> strategy->buffers[strategy->current] = InvalidBuffer;
> return true;
> }
>
> return false;
>
> VS
>
> if (XLogNeedsFlush(lsn))
> return false;

I think you mean
if (!XLogNeedsFlush(lsn))
{
   return false;
}
// remove buffer
return true

is the same as

if (XLogNeedsFlush(lsn))
{
//remove dirty buffer
return true
}
return false;

Which is true. I've changed it to be like that.

Attached version 7 is rebased and has some bug fixes.

I also added a bonus batch on the end (0007) that refactors
SyncOneBuffer() to use the CAS loop pattern for pinning the buffer
that Andres introduced in 5e89985928795f243. bgwriter is now the only
user of SyncOneBuffer() and it rejects writing out buffers that are
used, so it seemed like a decent use case for this.

- Melanie

Re: Checkpointer write combining

Chao Li <li.evan.chao@gmail.com> — 2025-10-16T04:24:51Z

Hi Milanie,

Thanks for updating the patch set. I review 1-6 and got a few more small comments. I didn’t review 0007 as it’s marked as WIP.

> 
> - Melanie
> <v7-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch><v7-0002-Split-FlushBuffer-into-two-parts.patch><v7-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v7-0004-Write-combining-for-BAS_BULKWRITE.patch><v7-0005-Add-database-Oid-to-CkptSortItem.patch><v7-0006-Implement-checkpointer-data-write-combining.patch><v7-0007-WIP-Refactor-SyncOneBuffer-for-bgwriter-only.patch>

1 - 0001
```
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -421,6 +421,12 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 /*
  * Internal buffer management routines
  */
+
+
+/* Note: these two macros only work on shared buffers, not local ones! */
```

Nit: here you added two empty lines, I think we need only 1.

2 - 0002
```
+static void
+CleanVictimBuffer(BufferDesc *bufdesc,
+				  bool from_ring, IOContext io_context)
+{
 
-	/* Find smgr relation for buffer */
-	if (reln == NULL)
-		reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
+	LWLock	   *content_lock;
``` 

Nit: the empty line after “{“ should be removed.

3 - 0003
```
+/*
+ * Return the next buffer in the ring or InvalidBuffer if the current sweep is
+ * over.
+ */
+Buffer
+StrategySweepNextBuffer(BufferAccessStrategy strategy, int *sweep_cursor)
+{
+	if (++(*sweep_cursor) >= strategy->nbuffers)
+		*sweep_cursor = 0;
+
+	return strategy->buffers[*sweep_cursor];
+}
```

Feels the function comment is a bit confusing, because the function code doesn’t really perform sweep, the function is just a getter. InvalidBuffer just implies the current sweep is over.

Maybe rephrase to something like: “Return the next buffer in the range. If InvalidBuffer is returned, that implies the current sweep is done."

4 - 0003
```
static BufferDesc *
NextStratBufToFlush(BufferAccessStrategy strategy,
Buffer sweep_end,
XLogRecPtr *lsn, int *sweep_cursor)
``

“Strat” is confusing. I think it’s the short version of “Strategy”. As this is a static function, and other function names all have the whole word of “strategy”, why don’t also use the whole word in this function name as well?

5 - 0004
```
+uint32
+StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
+{
+	uint32		max_possible_buffer_limit;
+	uint32		max_write_batch_size;
+	int			strategy_pin_limit;
+
+	max_write_batch_size = io_combine_limit;
+
+	strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
+	max_possible_buffer_limit = GetPinLimit();
+
+	max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
+	max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
+	max_write_batch_size = Max(1, max_write_batch_size);
+	max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
+	Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
+	return max_write_batch_size;
+}
```

This implementation is hard to understand. I tried to simplify it:
```
uint32
StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
{
	int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
	uint32 max_write_batch_size = Min(GetPinLimit(), (uint32)strategy_pin_limit);

	/* Clamp to io_combine_limit and enforce minimum of 1 */
	if (max_write_batch_size > io_combine_limit)
		max_write_batch_size = io_combine_limit;
	if (max_write_batch_size == 0)
		max_write_batch_size = 1;

	Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
	return max_write_batch_size;
}
```

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: Checkpointer write combining

BharatDB <bharatdbpg@gmail.com> — 2025-10-28T12:51:41Z

On Thu, Oct 16, 2025 at 9:55 AM Chao Li <li.evan.chao@gmail.com> wrote:

> Hi Milanie,
>
> Thanks for updating the patch set. I review 1-6 and got a few more small
> comments. I didn’t review 0007 as it’s marked as WIP.
>
> >
> > - Melanie
> >
> <v7-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch><v7-0002-Split-FlushBuffer-into-two-parts.patch><v7-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v7-0004-Write-combining-for-BAS_BULKWRITE.patch><v7-0005-Add-database-Oid-to-CkptSortItem.patch><v7-0006-Implement-checkpointer-data-write-combining.patch><v7-0007-WIP-Refactor-SyncOneBuffer-for-bgwriter-only.patch>
>
> 1 - 0001
> ```
> --- a/src/include/storage/buf_internals.h
> +++ b/src/include/storage/buf_internals.h
> @@ -421,6 +421,12 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner,
> Buffer buffer)
>  /*
>   * Internal buffer management routines
>   */
> +
> +
> +/* Note: these two macros only work on shared buffers, not local ones! */
> ```
>
> Nit: here you added two empty lines, I think we need only 1.
>
> 2 - 0002
> ```
> +static void
> +CleanVictimBuffer(BufferDesc *bufdesc,
> +                                 bool from_ring, IOContext io_context)
> +{
>
> -       /* Find smgr relation for buffer */
> -       if (reln == NULL)
> -               reln = smgropen(BufTagGetRelFileLocator(&buf->tag),
> INVALID_PROC_NUMBER);
> +       XLogRecPtr      max_lsn = InvalidXLogRecPtr;
> +       LWLock     *content_lock;
> ```
>
> Nit: the empty line after “{“ should be removed.
>
> 3 - 0003
> ```
> +/*
> + * Return the next buffer in the ring or InvalidBuffer if the current
> sweep is
> + * over.
> + */
> +Buffer
> +StrategySweepNextBuffer(BufferAccessStrategy strategy, int *sweep_cursor)
> +{
> +       if (++(*sweep_cursor) >= strategy->nbuffers)
> +               *sweep_cursor = 0;
> +
> +       return strategy->buffers[*sweep_cursor];
> +}
> ```
>
> Feels the function comment is a bit confusing, because the function code
> doesn’t really perform sweep, the function is just a getter. InvalidBuffer
> just implies the current sweep is over.
>
> Maybe rephrase to something like: “Return the next buffer in the range. If
> InvalidBuffer is returned, that implies the current sweep is done."
>
> 4 - 0003
> ```
> static BufferDesc *
> NextStratBufToFlush(BufferAccessStrategy strategy,
> Buffer sweep_end,
> XLogRecPtr *lsn, int *sweep_cursor)
> ``
>
> “Strat” is confusing. I think it’s the short version of “Strategy”. As
> this is a static function, and other function names all have the whole word
> of “strategy”, why don’t also use the whole word in this function name as
> well?
>
> 5 - 0004
> ```
> +uint32
> +StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
> +{
> +       uint32          max_possible_buffer_limit;
> +       uint32          max_write_batch_size;
> +       int                     strategy_pin_limit;
> +
> +       max_write_batch_size = io_combine_limit;
> +
> +       strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
> +       max_possible_buffer_limit = GetPinLimit();
> +
> +       max_write_batch_size = Min(strategy_pin_limit,
> max_write_batch_size);
> +       max_write_batch_size = Min(max_possible_buffer_limit,
> max_write_batch_size);
> +       max_write_batch_size = Max(1, max_write_batch_size);
> +       max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
> +       Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
> +       return max_write_batch_size;
> +}
> ```
>
> This implementation is hard to understand. I tried to simplify it:
> ```
> uint32
> StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
> {
>         int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
>         uint32 max_write_batch_size = Min(GetPinLimit(),
> (uint32)strategy_pin_limit);
>
>         /* Clamp to io_combine_limit and enforce minimum of 1 */
>         if (max_write_batch_size > io_combine_limit)
>                 max_write_batch_size = io_combine_limit;
>         if (max_write_batch_size == 0)
>                 max_write_batch_size = 1;
>
>         Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
>         return max_write_batch_size;
> }
> ```
>
> Best regards,
> --
> Chao Li (Evan)
> HighGo Software Co., Ltd.
> https://www.highgo.com/
>
>
>
> Hello All,



> As per reference to the previous mails, I understood the changes made and
> had tried to replicate the patches into the source code for the bug fix but
> it didn't show any significant bug. Also I ran some verification tests for
> the recent changes related to batched write statistics during checkpoints.
> Below are my observations and results:
>
> Test Setup
>
>    -
>
>    PostgreSQL version: 19devel (custom build)
>    -
>
>    OS: Ubuntu Linux
>    -
>
>    Port: 55432
>    -
>
>    Database: postgres
>    -
>
>    Test tool: pgbench
>    -
>
>    Duration: 120 seconds
>    -
>
>    Command used: pgbench -c 4 -j 4 -T 120 -p 55432 -d postgres
>
> Log Output

After running the workload, I triggered a manual checkpoint and checked the
latest log entry:
2025-10-28 16:53:05.696 IST [11422] LOG:  checkpoint complete:
wrote 1383 buffers (8.4%), wrote 3 SLRU buffers;
write=0.023 s, sync=0.017 s, total=0.071 s;
sync files=8, longest=0.004 s, average=0.003 s;
distance=33437 kB, estimate=308790 kB;

Observations:

Metric

Value

Source

Interpretation

Buffers written

1383

>From log

Consistent with moderate workload

Checkpoint write time

0.023 s

>From log

Realistic for ~11 MB write

Checkpoint sync time

0.017 s

>From log

Reasonable

Total checkpoint time

0.071 s

>From log

≈ write + sync + small overhead

CHECKPOINT runtime (psql)

-

-

Fast, confirms idle background activity


The total time closely matches the sum of write and sync times, with only a
small overhead (expected for control file updates).
Checkpoint stats in pg_stat_checkpointer also updated correctly, with no
missing or duplicated values.
Expected behavior observed:

   -

   write + sync ≈ total (no zero, truncation, or aggregation bug)
   -

   Buffer counts and timing scale realistically with workload
   -

   No evidence of under- or over-counted times
   -

   Checkpoint stats properly recorded in log and pg_stat_checkpointer

Math check:
0.023 s + 0.017 s = 0.040 s, total = 0.071 s => difference ≈ 0.03 s
overhead => normal for control file + metadata writes.
Comparison to Prior Reports:

Test

Pre-Patch

Post-Patch

Difference

Checkpoint duration

6.5 s

5.0 s

−23

Heavy workload test

16 s

8 s

−50

Result: It shows consistent and stable timing even under moderate pgbench
load — confirming the patch is integrated and functioning correctly.

Final Status:

Category

Result

Batched Write Stats Accuracy

Verified OK

Timing Aggregation Correctness

Verified OK

pg_stat_checkpointer Consistency

Verified OK

Log Formatting

Normal

Performance Regression

None detected
*Inference:*


   -

   The reported write, sync, and total timings are consistent and realistic.
   -

   No anomalies or timing mismatches were seen.
   -

   Checkpoint performance seems stable and accurate under moderate load.
   -

   No regression or counter accumulation issues detected.


Also, I have been verifying output using manual queries recording the
observations and find no significant delays. Observations are recorded
below and the screenshots are attached herewith.

Observations:

Metric

Before Activity

After Activity

Notes

Buffers written

27949

27972

Matches wrote 23 buffers from log

Write time

206882

206887

Matches write=0.005 s from log

Sync time

840

877

Matches sync=0.037 s from log

num_done

6

7

Completed successfully

slru_written

8

9

Matches wrote 1 SLRU buffer from log

Stats reset

yes

yes

successful

num_requested

7

8

manual checkpoint recorded

For further review, I have attached the screenshots of my terminal
herewith.
Kindly review the observations and please let me know if any additional
details need to be focused on.

Thanking you.

Regards,
Soumya

Re: Checkpointer write combining

Melanie Plageman <melanieplageman@gmail.com> — 2025-11-03T23:34:26Z

Thanks for continuing to review! I've revised the patches to
incorporate all of your feedback except for where I mention below.

There were failures in CI due to issues with max batch size, so
attached v8 also seeks to fix those.

- Melanie

On Thu, Oct 16, 2025 at 12:25 AM Chao Li <li.evan.chao@gmail.com> wrote:
>
> 3 - 0003
> ```
> +/*
> + * Return the next buffer in the ring or InvalidBuffer if the current sweep is
> + * over.
> + */
> +Buffer
> +StrategySweepNextBuffer(BufferAccessStrategy strategy, int *sweep_cursor)
> +{
> +       if (++(*sweep_cursor) >= strategy->nbuffers)
> +               *sweep_cursor = 0;
> +
> +       return strategy->buffers[*sweep_cursor];
> +}
> ```
>
> Feels the function comment is a bit confusing, because the function code doesn’t really perform sweep, the function is just a getter. InvalidBuffer just implies the current sweep is over.
>
> Maybe rephrase to something like: “Return the next buffer in the range. If InvalidBuffer is returned, that implies the current sweep is done."

Yes, actually I think having these helpers mention the sweep is more
confusing than anything else. I've revised them to be named more
generically and updated the comments accordingly.

> 5 - 0004
> ```
> +uint32
> +StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
> +{
> +       uint32          max_possible_buffer_limit;
> +       uint32          max_write_batch_size;
> +       int                     strategy_pin_limit;
> +
> +       max_write_batch_size = io_combine_limit;
> +
> +       strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
> +       max_possible_buffer_limit = GetPinLimit();
> +
> +       max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
> +       max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
> +       max_write_batch_size = Max(1, max_write_batch_size);
> +       max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
> +       Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
> +       return max_write_batch_size;
> +}
> ```
>
> This implementation is hard to understand. I tried to simplify it:
> ```
> uint32
> StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
> {
>         int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
>         uint32 max_write_batch_size = Min(GetPinLimit(), (uint32)strategy_pin_limit);
>
>         /* Clamp to io_combine_limit and enforce minimum of 1 */
>         if (max_write_batch_size > io_combine_limit)
>                 max_write_batch_size = io_combine_limit;
>         if (max_write_batch_size == 0)
>                 max_write_batch_size = 1;
>
>         Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
>         return max_write_batch_size;
> }
> ```

I agree that the implementation was hard to understand. I've not quite
gone with your version but I have rewritten it like this:

uint32
StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
{
    uint32        max_write_batch_size = Min(io_combine_limit,
MAX_IO_COMBINE_LIMIT);
    int            strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
    uint32        max_possible_buffer_limit = GetPinLimit();

    /* Identify the minimum of the above */
    max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
    max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);

    /* Must allow at least 1 IO for forward progress */
    max_write_batch_size = Max(1, max_write_batch_size);

    return max_write_batch_size;
}

Is this better?

- Melanie

Re: Checkpointer write combining

BharatDB <bharatdbpg@gmail.com> — 2025-11-12T11:40:28Z

Hi all,

Considering the CI failures in earlier patch versions around “max batch
size”, upon my observation I found the failures arise either from boundary
conditions when io_combine_limit (GUC) is set larger than the compile-time
MAX_IO_COMBINE_LIMIT or when pin limits return small/zero values due to
which it produce out-of-range batch sizes or assertion failures in CI.

Comparing the approaches suggested in the thread, I think implementing (GUC
+ compile-time cap first, and then pin limits) could be more effective in
avoiding CI failures and also we should consider the following logic
conditions:

   1.

   Set io_combine_limit == 0 explicitly (fallback to 1 for forward
   progress).
   2.

   Cap early to a conservative compile_cap (MAX_IO_COMBINE_LIMIT - 1) to
   avoid array overflow. Otherwise if we confirm all slots are usable,
   change to MAX_IO_COMBINE_LIMIT.
   3.

   Apply per-strategy pin and global pin limits only if they are positive.
   4.

   Use explicit typed comparisons to avoid signed/unsigned pitfalls and add
   a final Assert() to capture assumptions in CI.

*Implementation logic:*

uint32
StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
{
    uint32        max_write_batch_size;
    uint32        compile_cap = MAX_IO_COMBINE_LIMIT - 1;    /*
conservative cap */
    int           strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
    uint32        max_possible_buffer_limit = GetPinLimit();
    max_write_batch_size = (io_combine_limit == 0) ? 1 : io_combine_limit;
    if (max_write_batch_size > compile_cap)
        max_write_batch_size = compile_cap;
    if (strategy_pin_limit > 0 &&
        (uint32) strategy_pin_limit < max_write_batch_size)
        max_write_batch_size = (uint32) strategy_pin_limit;
    if (max_possible_buffer_limit > 0 &&
        max_possible_buffer_limit < max_write_batch_size)
        max_write_batch_size = max_possible_buffer_limit;
    if (max_write_batch_size == 0)
        max_write_batch_size = 1;
    Assert(max_write_batch_size <= compile_cap);
    return max_write_batch_size;
}

I hope this will be helpful for proceeding further. Looking forward to
more feedback.

Thanking you.

Regards,

Soumya


On Tue, Nov 4, 2025 at 5:04 AM Melanie Plageman <melanieplageman@gmail.com>
wrote:

> Thanks for continuing to review! I've revised the patches to
> incorporate all of your feedback except for where I mention below.
>
> There were failures in CI due to issues with max batch size, so
> attached v8 also seeks to fix those.
>
> - Melanie
>
> On Thu, Oct 16, 2025 at 12:25 AM Chao Li <li.evan.chao@gmail.com> wrote:
> >
> > 3 - 0003
> > ```
> > +/*
> > + * Return the next buffer in the ring or InvalidBuffer if the current
> sweep is
> > + * over.
> > + */
> > +Buffer
> > +StrategySweepNextBuffer(BufferAccessStrategy strategy, int
> *sweep_cursor)
> > +{
> > +       if (++(*sweep_cursor) >= strategy->nbuffers)
> > +               *sweep_cursor = 0;
> > +
> > +       return strategy->buffers[*sweep_cursor];
> > +}
> > ```
> >
> > Feels the function comment is a bit confusing, because the function code
> doesn’t really perform sweep, the function is just a getter. InvalidBuffer
> just implies the current sweep is over.
> >
> > Maybe rephrase to something like: “Return the next buffer in the range.
> If InvalidBuffer is returned, that implies the current sweep is done."
>
> Yes, actually I think having these helpers mention the sweep is more
> confusing than anything else. I've revised them to be named more
> generically and updated the comments accordingly.
>
> > 5 - 0004
> > ```
> > +uint32
> > +StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
> > +{
> > +       uint32          max_possible_buffer_limit;
> > +       uint32          max_write_batch_size;
> > +       int                     strategy_pin_limit;
> > +
> > +       max_write_batch_size = io_combine_limit;
> > +
> > +       strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
> > +       max_possible_buffer_limit = GetPinLimit();
> > +
> > +       max_write_batch_size = Min(strategy_pin_limit,
> max_write_batch_size);
> > +       max_write_batch_size = Min(max_possible_buffer_limit,
> max_write_batch_size);
> > +       max_write_batch_size = Max(1, max_write_batch_size);
> > +       max_write_batch_size = Min(max_write_batch_size,
> io_combine_limit);
> > +       Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
> > +       return max_write_batch_size;
> > +}
> > ```
> >
> > This implementation is hard to understand. I tried to simplify it:
> > ```
> > uint32
> > StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
> > {
> >         int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
> >         uint32 max_write_batch_size = Min(GetPinLimit(),
> (uint32)strategy_pin_limit);
> >
> >         /* Clamp to io_combine_limit and enforce minimum of 1 */
> >         if (max_write_batch_size > io_combine_limit)
> >                 max_write_batch_size = io_combine_limit;
> >         if (max_write_batch_size == 0)
> >                 max_write_batch_size = 1;
> >
> >         Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
> >         return max_write_batch_size;
> > }
> > ```
>
> I agree that the implementation was hard to understand. I've not quite
> gone with your version but I have rewritten it like this:
>
> uint32
> StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
> {
>     uint32        max_write_batch_size = Min(io_combine_limit,
> MAX_IO_COMBINE_LIMIT);
>     int            strategy_pin_limit =
> GetAccessStrategyPinLimit(strategy);
>     uint32        max_possible_buffer_limit = GetPinLimit();
>
>     /* Identify the minimum of the above */
>     max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
>     max_write_batch_size = Min(max_possible_buffer_limit,
> max_write_batch_size);
>
>     /* Must allow at least 1 IO for forward progress */
>     max_write_batch_size = Max(1, max_write_batch_size);
>
>     return max_write_batch_size;
> }
>
> Is this better?
>
> - Melanie
>

Re: Checkpointer write combining

Chao Li <li.evan.chao@gmail.com> — 2025-11-13T08:29:50Z


> On Nov 4, 2025, at 07:34, Melanie Plageman <melanieplageman@gmail.com> wrote:
> 
> Thanks for continuing to review! I've revised the patches to
> incorporate all of your feedback except for where I mention below.
> 
> There were failures in CI due to issues with max batch size, so
> attached v8 also seeks to fix those.
> 
> - Melanie
> 
> On Thu, Oct 16, 2025 at 12:25 AM Chao Li <li.evan.chao@gmail.com> wrote:
>> 
>> 3 - 0003
>> ```
>> +/*
>> + * Return the next buffer in the ring or InvalidBuffer if the current sweep is
>> + * over.
>> + */
>> +Buffer
>> +StrategySweepNextBuffer(BufferAccessStrategy strategy, int *sweep_cursor)
>> +{
>> +       if (++(*sweep_cursor) >= strategy->nbuffers)
>> +               *sweep_cursor = 0;
>> +
>> +       return strategy->buffers[*sweep_cursor];
>> +}
>> ```
>> 
>> Feels the function comment is a bit confusing, because the function code doesn’t really perform sweep, the function is just a getter. InvalidBuffer just implies the current sweep is over.
>> 
>> Maybe rephrase to something like: “Return the next buffer in the range. If InvalidBuffer is returned, that implies the current sweep is done."
> 
> Yes, actually I think having these helpers mention the sweep is more
> confusing than anything else. I've revised them to be named more
> generically and updated the comments accordingly.
> 
>> 5 - 0004
>> ```
>> +uint32
>> +StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
>> +{
>> +       uint32          max_possible_buffer_limit;
>> +       uint32          max_write_batch_size;
>> +       int                     strategy_pin_limit;
>> +
>> +       max_write_batch_size = io_combine_limit;
>> +
>> +       strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
>> +       max_possible_buffer_limit = GetPinLimit();
>> +
>> +       max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
>> +       max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
>> +       max_write_batch_size = Max(1, max_write_batch_size);
>> +       max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
>> +       Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
>> +       return max_write_batch_size;
>> +}
>> ```
>> 
>> This implementation is hard to understand. I tried to simplify it:
>> ```
>> uint32
>> StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
>> {
>>        int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
>>        uint32 max_write_batch_size = Min(GetPinLimit(), (uint32)strategy_pin_limit);
>> 
>>        /* Clamp to io_combine_limit and enforce minimum of 1 */
>>        if (max_write_batch_size > io_combine_limit)
>>                max_write_batch_size = io_combine_limit;
>>        if (max_write_batch_size == 0)
>>                max_write_batch_size = 1;
>> 
>>        Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
>>        return max_write_batch_size;
>> }
>> ```
> 
> I agree that the implementation was hard to understand. I've not quite
> gone with your version but I have rewritten it like this:
> 
> uint32
> StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
> {
>    uint32        max_write_batch_size = Min(io_combine_limit,
> MAX_IO_COMBINE_LIMIT);
>    int            strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
>    uint32        max_possible_buffer_limit = GetPinLimit();
> 
>    /* Identify the minimum of the above */
>    max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
>    max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
> 
>    /* Must allow at least 1 IO for forward progress */
>    max_write_batch_size = Max(1, max_write_batch_size);
> 
>    return max_write_batch_size;
> }
> 
> Is this better?

Yes, I think your version is safer because it enforces the max limit at runtime instead of only asserting it in debug builds.

> 
> - Melanie
> <v8-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch><v8-0002-Split-FlushBuffer-into-two-parts.patch><v8-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v8-0004-Write-combining-for-BAS_BULKWRITE.patch><v8-0005-Add-database-Oid-to-CkptSortItem.patch><v8-0006-Implement-checkpointer-data-write-combining.patch><v8-0007-WIP-Refactor-SyncOneBuffer-for-bgwriter-only.patch>

I quickly went through 0001-0006, looks good to me now. As 0007 has WIP in the subject, I skipped it.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: Checkpointer write combining

Melanie Plageman <melanieplageman@gmail.com> — 2025-11-18T18:29:56Z

On Wed, Nov 12, 2025 at 6:40 AM BharatDB <bharatdbpg@gmail.com> wrote:
>
> Considering the CI failures in earlier patch versions around “max batch size”, upon my observation I found the failures arise either from boundary conditions when io_combine_limit (GUC) is set larger than the compile-time MAX_IO_COMBINE_LIMIT

How could io_combine_limit be higher than MAX_IO_COMBINE_LIMIT? The
GUC is capped to MAX_IO_COMBINE_LIMIT -- see guc_parameters.dat.

> or when pin limits return small/zero values due to which it produce out-of-range batch sizes or assertion failures in CI.

This is true. But I think it can be solved with a single comparison to
1 as in the recent version.

> Comparing the approaches suggested in the thread, I think implementing (GUC + compile-time cap first, and then pin limits) could be more effective in avoiding CI failures and also we should consider the following logic conditions:
>
> Set io_combine_limit == 0 explicitly (fallback to 1 for forward progress).

The io_combine_limit GUC has a minimum value of 1 (in guc_parameters.dat)

> Use explicit typed comparisons to avoid signed/unsigned pitfalls and add a final Assert() to capture assumptions in CI.

I think my new version works.

    uint32        max_write_batch_size = Min(io_combine_limit,
MAX_IO_COMBINE_LIMIT);
    int            strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
    uint32        max_possible_buffer_limit = GetPinLimit();

    /* Identify the minimum of the above */
    max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
    max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);

    /* Must allow at least 1 IO for forward progress */
    max_write_batch_size = Max(1, max_write_batch_size);

    return max_write_batch_size;

GetAccessStrategyPinLimit() is the only function returning a signed
value here -- and it should always return a positive value (while I
wish we would use unsigned integers when a value will never be
negative, the strategy->nbuffers struct member was added a long time
ago). Then the Min() macro should work correctly and produce a value
that fits in a uint32 because of integer promotion rules.

- Melanie

Re: Checkpointer write combining

Melanie Plageman <melanieplageman@gmail.com> — 2025-11-18T18:49:28Z

On Thu, Nov 13, 2025 at 3:30 AM Chao Li <li.evan.chao@gmail.com> wrote:
>
> > On Nov 4, 2025, at 07:34, Melanie Plageman <melanieplageman@gmail.com> wrote:
> >
> >
> > uint32
> > StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
> > {
> >    uint32        max_write_batch_size = Min(io_combine_limit,
> > MAX_IO_COMBINE_LIMIT);
> >    int            strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
> >    uint32        max_possible_buffer_limit = GetPinLimit();
> >
> >    /* Identify the minimum of the above */
> >    max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
> >    max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
> >
> >    /* Must allow at least 1 IO for forward progress */
> >    max_write_batch_size = Max(1, max_write_batch_size);
> >
> >    return max_write_batch_size;
> > }
> >
> > Is this better?
>
> Yes, I think your version is safer because it enforces the max limit at runtime instead of only asserting it in debug builds.

Cool. I've attached a v9 which is rebased over recent bufmgr.c
changes. In the process, I found a bit of cleanup to do.

> I quickly went through 0001-0006, looks good to me now. As 0007 has WIP in the subject, I skipped it.

I no longer remember why I made that patch WIP, so I've removed that
designation.

- Melanie

Re: Checkpointer write combining

Chao Li <li.evan.chao@gmail.com> — 2025-11-19T02:00:40Z

> On Nov 19, 2025, at 02:49, Melanie Plageman <melanieplageman@gmail.com> wrote:
> 
> I no longer remember why I made that patch WIP, so I've removed that
> designation.

I just reviewed 0007. It removes the second parameter "bool skip_recently_used”  from SyncOneBuffer. The function is static and is only called in one place with skip_recently_used=true, thus removing the parameter seems reasonable, and without considering pinned buffer, the function is simplified a little bit.

I only got a tiny comment:
```
+		 * We can make these check without taking the buffer content lock so
``` 

As you changed “this” to “these”, “check” should be changed to “checks” accordingly.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: Checkpointer write combining

Chao Li <li.evan.chao@gmail.com> — 2025-11-19T02:09:42Z


> On Nov 19, 2025, at 10:00, Chao Li <li.evan.chao@gmail.com> wrote:
> 
> 
> 
>> On Nov 19, 2025, at 02:49, Melanie Plageman <melanieplageman@gmail.com> wrote:
>> 
>> I no longer remember why I made that patch WIP, so I've removed that
>> designation.
> 
> I just reviewed 0007. It removes the second parameter "bool skip_recently_used”  from SyncOneBuffer. The function is static and is only called in one place with skip_recently_used=true, thus removing the parameter seems reasonable, and without considering pinned buffer, the function is simplified a little bit.
> 
> I only got a tiny comment:
> ```
> + * We can make these check without taking the buffer content lock so
> ``` 
> 
> As you changed “this” to “these”, “check” should be changed to “checks” accordingly.
> 

I just got an compile error:
```
bufmgr.c:3580:33: error: no member named 'dbId' in 'struct CkptSortItem'
 3580 |                                 batch.rlocator.dbOid = item.dbId;
      |                                                        ~~~~ ^
bufmgr.c:3598:13: error: no member named 'dbId' in 'struct CkptSortItem'
 3598 |                         if (item.dbId != batch.rlocator.dbOid)
      |                             ~~~~ ^
2 errors generated.
make[4]: *** [bufmgr.o] Error 1
```

I tried “make clean” and “make” again, which didn’t work. I think the error is introduced by 0006.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: Checkpointer write combining

BharatDB <bharatdbpg@gmail.com> — 2025-11-19T11:21:06Z

Hi Melanie,

Thank you for the detailed clarifications which helped me understand
the constraints much more clearly. You are absolutely right regarding
the key points you specified. My initial concern came from trying to
avoid cases where strategy pin limits were unexpectedly small (0 or
negative) or global pin limits temporarily shrink due to memory issue
/ fast cycling of pins.

On Wed, Nov 19, 2025 at 12:00 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Wed, Nov 12, 2025 at 6:40 AM BharatDB <bharatdbpg@gmail.com> wrote:
> >
> > Considering the CI failures in earlier patch versions around “max batch size”, upon my observation I found the failures arise either from boundary conditions when io_combine_limit (GUC) is set larger than the compile-time MAX_IO_COMBINE_LIMIT
>
> How could io_combine_limit be higher than MAX_IO_COMBINE_LIMIT? The
> GUC is capped to MAX_IO_COMBINE_LIMIT -- see guc_parameters.dat.

After revisiting guc_parameters.dat, Now I see that the GUC is
strictly capped at MAX_IO_COMBINE_LIMIT, so comparisons against larger
values are unnecessary. Thus my earlier concern came from assuming
some unbounded user-provided values are not applicable here.

> > or when pin limits return small/zero values due to which it produce out-of-range batch sizes or assertion failures in CI.
>
> This is true. But I think it can be solved with a single comparison to
> 1 as in the recent version.
 ok

> > Comparing the approaches suggested in the thread, I think implementing (GUC + compile-time cap first, and then pin limits) could be more effective in avoiding CI failures and also we should consider the following logic conditions:
> >
> > Set io_combine_limit == 0 explicitly (fallback to 1 for forward progress).
>
> The io_combine_limit GUC has a minimum value of 1 (in guc_parameters.dat)
Noted.

> > Use explicit typed comparisons to avoid signed/unsigned pitfalls and add a final Assert() to capture assumptions in CI.
>
> I think my new version works.
>
>     uint32        max_write_batch_size = Min(io_combine_limit,
> MAX_IO_COMBINE_LIMIT);
>     int            strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
>     uint32        max_possible_buffer_limit = GetPinLimit();
>
>     /* Identify the minimum of the above */
>     max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
>     max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
>
>     /* Must allow at least 1 IO for forward progress */
>     max_write_batch_size = Max(1, max_write_batch_size);
>
>     return max_write_batch_size;
>
> GetAccessStrategyPinLimit() is the only function returning a signed
> value here -- and it should always return a positive value (while I
> wish we would use unsigned integers when a value will never be
> negative, the strategy->nbuffers struct member was added a long time
> ago). Then the Min() macro should work correctly and produce a value
> that fits in a uint32 because of integer promotion rules.

Th explanation about GetAccessStrategyPinLimit() (despite being int)
makes sense and I agree that with the Min() macro and integer
promotion rules, the outcome is always safe. Therefore, explicit typed
comparisons are indeed redundant. However, after reviewing the
existing code paths and your updated version,
max_write_batch_size = Max(1, max_write_batch_size);
=> It is clear that both GetAccessStrategyPinLimit() and GetPinLimit()
should always return sensible positive values and the fallback fully
covers the forward-progress requirement. And I agree that it is both
correct and sufficient for the CI failures we were seeing earlier.

Thank you for helping me understand the reasoning behind this design.
And this will be kept in mind for further work on implementing  write
combining.
I appreciate your patience and the opportunity to refine my
assumptions. Looking forward to more suggestions and feedbacks.

Thanking you.

Regards,
Soumya

Re: Checkpointer write combining

Melanie Plageman <melanieplageman@gmail.com> — 2025-11-19T22:12:49Z

Thanks for the review!

On Tue, Nov 18, 2025 at 9:10 PM Chao Li <li.evan.chao@gmail.com> wrote:
>
> > On Nov 19, 2025, at 10:00, Chao Li <li.evan.chao@gmail.com> wrote:
>
> > I just reviewed 0007. It removes the second parameter "bool skip_recently_used”  from SyncOneBuffer. The function is static and is only called in one place with skip_recently_used=true, thus removing the parameter seems reasonable, and without considering pinned buffer, the function is simplified a little bit.
> >
> > I only got a tiny comment:
> > ```
> > + * We can make these check without taking the buffer content lock so
> > ```
> >
> > As you changed “this” to “these”, “check” should be changed to “checks” accordingly.

I've made this change. Attached v10 has this.

> I just got an compile error:
> ```
> bufmgr.c:3580:33: error: no member named 'dbId' in 'struct CkptSortItem'
>  3580 |                                 batch.rlocator.dbOid = item.dbId;
>       |                                                        ~~~~ ^
> bufmgr.c:3598:13: error: no member named 'dbId' in 'struct CkptSortItem'
>  3598 |                         if (item.dbId != batch.rlocator.dbOid)
>       |                             ~~~~ ^
> 2 errors generated.
> make[4]: *** [bufmgr.o] Error 1
> ```
>
> I tried “make clean” and “make” again, which didn’t work. I think the error is introduced by 0006.

Are you sure you applied 0005? It is the one that adds dbId to CkptSortItem.

- Melanie

Re: Checkpointer write combining

Chao Li <li.evan.chao@gmail.com> — 2025-11-20T02:31:43Z


> On Nov 20, 2025, at 06:12, Melanie Plageman <melanieplageman@gmail.com> wrote:
> 
> Are you sure you applied 0005? It is the one that adds dbId to CkptSortItem.
> 

My bad. Yes, I missed to download and apply 0005.

> - Melanie
> <v10-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch><v10-0002-Split-FlushBuffer-into-two-parts.patch><v10-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v10-0004-Write-combining-for-BAS_BULKWRITE.patch><v10-0005-Add-database-Oid-to-CkptSortItem.patch><v10-0006-Implement-checkpointer-data-write-combining.patch><v10-0007-Refactor-SyncOneBuffer-for-bgwriter-use-only.patch>

I went through the whole patch set again, and got a few more comments:

1 - 0002
```
+/*
+ * Prepare and write out a dirty victim buffer.
+ *
+ * Buffer must be pinned, the content lock must be held exclusively, and the
+ * buffer header spinlock must not be held. The exclusive lock is released and
+ * the buffer is returned pinned but not locked.
+ *
+ * bufdesc may be modified.
+ */
+static void
+CleanVictimBuffer(BufferDesc *bufdesc,
+				  bool from_ring, IOContext io_context)
+{
+	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
+	LWLock	   *content_lock;
 
-	/* Find smgr relation for buffer */
-	if (reln == NULL)
-		reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+	/* Set up this victim buffer to be flushed */
+	if (!PrepareFlushBuffer(bufdesc, &max_lsn))
+		return;
```

I believe when PrepareFlushBuffer(bufdesc, &max_lsn) is false, before “return”, we should release “content_lock”.

Because the function comment clearly says “the content lock must be held exclusively”. Also, looking at the code where CleanVictimBuffer() is called:
```
			if (StrategyRejectBuffer(strategy, buf_hdr, from_ring))
			{
				LWLockRelease(content_lock);
				UnpinBuffer(buf_hdr);
				continue;
			}

			/* Content lock is released inside CleanVictimBuffer */
			CleanVictimBuffer(buf_hdr, from_ring, io_context);
```

In the previous “if” clause, it releases content_lock.

2 - 0002
```
+ * Buffer must be pinned, the content lock must be held exclusively, and the
+ * buffer header spinlock must not be held. The exclusive lock is released and
+ * the buffer is returned pinned but not locked.
+ *
+ * bufdesc may be modified.
+ */
+static void
+CleanVictimBuffer(BufferDesc *bufdesc,
+				  bool from_ring, IOContext io_context)
```

The function comment says "the content lock must be held exclusively”, but from the code:
```
			content_lock = BufferDescriptorGetContentLock(buf_hdr);
			if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
			{
				/*
				 * Someone else has locked the buffer, so give it up and loop
				 * back to get another one.
				 */
				UnpinBuffer(buf_hdr);
				continue;
			}

			/*
			 * If using a nondefault strategy, and writing the buffer would
			 * require a WAL flush, let the strategy decide whether to go
			 * ahead and write/reuse the buffer or to choose another victim.
			 * We need the content lock to inspect the page LSN, so this can't
			 * be done inside StrategyGetBuffer.
			 */
			if (StrategyRejectBuffer(strategy, buf_hdr, from_ring))
			{
				LWLockRelease(content_lock);
				UnpinBuffer(buf_hdr);
				continue;
			}

			/* Content lock is released inside CleanVictimBuffer */
			CleanVictimBuffer(buf_hdr, from_ring, io_context);
```
Content_lock is acquired with LW_SHARED.

3 - 0003

In CleanVictimBuffer(), more logic are added, but the content_lock leak problem is still there.

4 - 0003
```
+/*
+ * Returns true if the buffer needs WAL flushed before it can be written out.
+ * Caller must not already hold the buffer header spinlock.
+ */
+static bool
+BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn)
+{
+	uint32		buf_state = LockBufHdr(bufdesc);
+
+	*lsn = BufferGetLSN(bufdesc);
+
+	UnlockBufHdr(bufdesc);
+
+	/*
+	 * See buffer flushing code for more details on why we condition this on
+	 * the relation being logged.
+	 */
+	return buf_state & BM_PERMANENT && XLogNeedsFlush(*lsn);
+}
```

This is a new function. I am thinking that if we should only update “lsn” when not BM_PERMANENT? Because from the existing code:
```
	if (buf_state & BM_PERMANENT)
		XLogFlush(recptr);
```

XLogFlush should only happen when BM_PERMANENT.

5 - 0004 - I am thinking if that could be a race condition?

PageSetBatchChecksumInplace() is called once after all pages were pinned earlier, but other backends may modify the page contents while the batch is being assembled, because batching only holds content_lock per page temporarily, not across the entire run.
So that:
    • Page A pinned + content lock acquired + LSN read → content lock released
    • Another backend modifies Page A and sets new LSN, dirties buffer
    • Page A is written by this batch using an outdated checksum / outdated page contents

6 - 0006 - Ah, 0006 seems to have solved comment 3 about BM_PERMANENT.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: Checkpointer write combining

Melanie Plageman <melanieplageman@gmail.com> — 2025-11-20T22:11:34Z

Thanks for continuing to review this!

On Wed, Nov 19, 2025 at 9:32 PM Chao Li <li.evan.chao@gmail.com> wrote:
>
> +static void
> +CleanVictimBuffer(BufferDesc *bufdesc,
> +                                 bool from_ring, IOContext io_context)
> +{
> +       XLogRecPtr      max_lsn = InvalidXLogRecPtr;
> +       LWLock     *content_lock;
>
> -       /* Find smgr relation for buffer */
> -       if (reln == NULL)
> -               reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
> +       /* Set up this victim buffer to be flushed */
> +       if (!PrepareFlushBuffer(bufdesc, &max_lsn))
> +               return;
> ```
>
> I believe when PrepareFlushBuffer(bufdesc, &max_lsn) is false, before “return”, we should release “content_lock”.

Oh wow, you are so right. That's a big mistake! I would have thought a
test would fail, but it seems we don't have coverage of this. I've
fixed the code in the attached v11. I'll see how difficult it is to
write a test that covers the case where we try to do IO but someone
else has already done it.

> + * Buffer must be pinned, the content lock must be held exclusively, and the
> + * buffer header spinlock must not be held. The exclusive lock is released and
> + * the buffer is returned pinned but not locked.
>
> The function comment says "the content lock must be held exclusively”, but from the code:
> Content_lock is acquired with LW_SHARED.

Thanks, I've updated the comment.

> +static bool
> +BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn)
> +{
> +       uint32          buf_state = LockBufHdr(bufdesc);
> +
> +       *lsn = BufferGetLSN(bufdesc);
> +
> +       UnlockBufHdr(bufdesc);
> +
> +       /*
> +        * See buffer flushing code for more details on why we condition this on
> +        * the relation being logged.
> +        */
> +       return buf_state & BM_PERMANENT && XLogNeedsFlush(*lsn);
> +}
>
> This is a new function. I am thinking that if we should only update “lsn” when not BM_PERMANENT?

That makes sense. I don't use the lsn unless the buffer is logged
(BM_PERMANENT), but I think it is weird for the function to set the
LSN if it is returning false. I've modified the function to set it to
InvalidXLogRecPtr when it would return false. I've also added an
assert before XLogFlush() to make sure the buffer is logged before
flushing the WAL.

> 5 - 0004 - I am thinking if that could be a race condition?
>
> PageSetBatchChecksumInplace() is called once after all pages were pinned earlier, but other backends may modify the page contents while the batch is being assembled, because batching only holds content_lock per page temporarily, not across the entire run.
> So that:
>     • Page A pinned + content lock acquired + LSN read → content lock released
>     • Another backend modifies Page A and sets new LSN, dirties buffer
>     • Page A is written by this batch using an outdated checksum / outdated page contents

So, there is a race condition but it is slightly different from the
scenario you mention. We actually hold the content lock until after
doing the write. That means someone else can't get an exclusive lock
and modify the tuple data in the buffer. However, queries can set hint
bits while only holding a share lock. That can update the LSN (if an
FPI is required), which would cause bad things to happen when we write
out a buffer with an outdated checksum. What we do in master is make a
copy of the buffer, checksum it, and write out the copied buffer (see
PageSetChecksumCopy() and its function comment).

I have an XXX in PageSetBatchChecksumInplace() and in the commit
message for this patch explaining that it isn't committable until some
ongoing work Andres is doing adding a new lock type for setting hint
bits is committed. So, this specific part of this patch is WIP.

- Melanie