Re: Checkpointer write combining

BharatDB <bharatdbpg@gmail.com>

From: BharatDB <bharatdbpg@gmail.com>
To: Chao Li <li.evan.chao@gmail.com>
Cc: Melanie Plageman <melanieplageman@gmail.com>, Nazir Bilal Yavuz <byavuz81@gmail.com>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>, Andres Freund <andres@anarazel.de>
Date: 2025-10-28T12:51:41Z
Lists: pgsql-hackers

Attachments

On Thu, Oct 16, 2025 at 9:55 AM Chao Li <li.evan.chao@gmail.com> wrote:

> Hi Milanie,
>
> Thanks for updating the patch set. I review 1-6 and got a few more small
> comments. I didn’t review 0007 as it’s marked as WIP.
>
> >
> > - Melanie
> >
> <v7-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch><v7-0002-Split-FlushBuffer-into-two-parts.patch><v7-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v7-0004-Write-combining-for-BAS_BULKWRITE.patch><v7-0005-Add-database-Oid-to-CkptSortItem.patch><v7-0006-Implement-checkpointer-data-write-combining.patch><v7-0007-WIP-Refactor-SyncOneBuffer-for-bgwriter-only.patch>
>
> 1 - 0001
> ```
> --- a/src/include/storage/buf_internals.h
> +++ b/src/include/storage/buf_internals.h
> @@ -421,6 +421,12 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner,
> Buffer buffer)
>  /*
>   * Internal buffer management routines
>   */
> +
> +
> +/* Note: these two macros only work on shared buffers, not local ones! */
> ```
>
> Nit: here you added two empty lines, I think we need only 1.
>
> 2 - 0002
> ```
> +static void
> +CleanVictimBuffer(BufferDesc *bufdesc,
> +                                 bool from_ring, IOContext io_context)
> +{
>
> -       /* Find smgr relation for buffer */
> -       if (reln == NULL)
> -               reln = smgropen(BufTagGetRelFileLocator(&buf->tag),
> INVALID_PROC_NUMBER);
> +       XLogRecPtr      max_lsn = InvalidXLogRecPtr;
> +       LWLock     *content_lock;
> ```
>
> Nit: the empty line after “{“ should be removed.
>
> 3 - 0003
> ```
> +/*
> + * Return the next buffer in the ring or InvalidBuffer if the current
> sweep is
> + * over.
> + */
> +Buffer
> +StrategySweepNextBuffer(BufferAccessStrategy strategy, int *sweep_cursor)
> +{
> +       if (++(*sweep_cursor) >= strategy->nbuffers)
> +               *sweep_cursor = 0;
> +
> +       return strategy->buffers[*sweep_cursor];
> +}
> ```
>
> Feels the function comment is a bit confusing, because the function code
> doesn’t really perform sweep, the function is just a getter. InvalidBuffer
> just implies the current sweep is over.
>
> Maybe rephrase to something like: “Return the next buffer in the range. If
> InvalidBuffer is returned, that implies the current sweep is done."
>
> 4 - 0003
> ```
> static BufferDesc *
> NextStratBufToFlush(BufferAccessStrategy strategy,
> Buffer sweep_end,
> XLogRecPtr *lsn, int *sweep_cursor)
> ``
>
> “Strat” is confusing. I think it’s the short version of “Strategy”. As
> this is a static function, and other function names all have the whole word
> of “strategy”, why don’t also use the whole word in this function name as
> well?
>
> 5 - 0004
> ```
> +uint32
> +StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
> +{
> +       uint32          max_possible_buffer_limit;
> +       uint32          max_write_batch_size;
> +       int                     strategy_pin_limit;
> +
> +       max_write_batch_size = io_combine_limit;
> +
> +       strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
> +       max_possible_buffer_limit = GetPinLimit();
> +
> +       max_write_batch_size = Min(strategy_pin_limit,
> max_write_batch_size);
> +       max_write_batch_size = Min(max_possible_buffer_limit,
> max_write_batch_size);
> +       max_write_batch_size = Max(1, max_write_batch_size);
> +       max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
> +       Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
> +       return max_write_batch_size;
> +}
> ```
>
> This implementation is hard to understand. I tried to simplify it:
> ```
> uint32
> StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
> {
>         int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
>         uint32 max_write_batch_size = Min(GetPinLimit(),
> (uint32)strategy_pin_limit);
>
>         /* Clamp to io_combine_limit and enforce minimum of 1 */
>         if (max_write_batch_size > io_combine_limit)
>                 max_write_batch_size = io_combine_limit;
>         if (max_write_batch_size == 0)
>                 max_write_batch_size = 1;
>
>         Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
>         return max_write_batch_size;
> }
> ```
>
> Best regards,
> --
> Chao Li (Evan)
> HighGo Software Co., Ltd.
> https://www.highgo.com/
>
>
>
> Hello All,



> As per reference to the previous mails, I understood the changes made and
> had tried to replicate the patches into the source code for the bug fix but
> it didn't show any significant bug. Also I ran some verification tests for
> the recent changes related to batched write statistics during checkpoints.
> Below are my observations and results:
>
> Test Setup
>
>    -
>
>    PostgreSQL version: 19devel (custom build)
>    -
>
>    OS: Ubuntu Linux
>    -
>
>    Port: 55432
>    -
>
>    Database: postgres
>    -
>
>    Test tool: pgbench
>    -
>
>    Duration: 120 seconds
>    -
>
>    Command used: pgbench -c 4 -j 4 -T 120 -p 55432 -d postgres
>
> Log Output

After running the workload, I triggered a manual checkpoint and checked the
latest log entry:
2025-10-28 16:53:05.696 IST [11422] LOG:  checkpoint complete:
wrote 1383 buffers (8.4%), wrote 3 SLRU buffers;
write=0.023 s, sync=0.017 s, total=0.071 s;
sync files=8, longest=0.004 s, average=0.003 s;
distance=33437 kB, estimate=308790 kB;

Observations:

Metric

Value

Source

Interpretation

Buffers written

1383

>From log

Consistent with moderate workload

Checkpoint write time

0.023 s

>From log

Realistic for ~11 MB write

Checkpoint sync time

0.017 s

>From log

Reasonable

Total checkpoint time

0.071 s

>From log

≈ write + sync + small overhead

CHECKPOINT runtime (psql)

-

-

Fast, confirms idle background activity


The total time closely matches the sum of write and sync times, with only a
small overhead (expected for control file updates).
Checkpoint stats in pg_stat_checkpointer also updated correctly, with no
missing or duplicated values.
Expected behavior observed:

   -

   write + sync ≈ total (no zero, truncation, or aggregation bug)
   -

   Buffer counts and timing scale realistically with workload
   -

   No evidence of under- or over-counted times
   -

   Checkpoint stats properly recorded in log and pg_stat_checkpointer

Math check:
0.023 s + 0.017 s = 0.040 s, total = 0.071 s => difference ≈ 0.03 s
overhead => normal for control file + metadata writes.
Comparison to Prior Reports:

Test

Pre-Patch

Post-Patch

Difference

Checkpoint duration

6.5 s

5.0 s

−23

Heavy workload test

16 s

8 s

−50

Result: It shows consistent and stable timing even under moderate pgbench
load — confirming the patch is integrated and functioning correctly.

Final Status:

Category

Result

Batched Write Stats Accuracy

Verified OK

Timing Aggregation Correctness

Verified OK

pg_stat_checkpointer Consistency

Verified OK

Log Formatting

Normal

Performance Regression

None detected
*Inference:*


   -

   The reported write, sync, and total timings are consistent and realistic.
   -

   No anomalies or timing mismatches were seen.
   -

   Checkpoint performance seems stable and accurate under moderate load.
   -

   No regression or counter accumulation issues detected.


Also, I have been verifying output using manual queries recording the
observations and find no significant delays. Observations are recorded
below and the screenshots are attached herewith.

Observations:

Metric

Before Activity

After Activity

Notes

Buffers written

27949

27972

Matches wrote 23 buffers from log

Write time

206882

206887

Matches write=0.005 s from log

Sync time

840

877

Matches sync=0.037 s from log

num_done

6

7

Completed successfully

slru_written

8

9

Matches wrote 1 SLRU buffer from log

Stats reset

yes

yes

successful

num_requested

7

8

manual checkpoint recorded

For further review, I have attached the screenshots of my terminal
herewith.
Kindly review the observations and please let me know if any additional
details need to be focused on.

Thanking you.

Regards,
Soumya