Checkpointer write combining

Melanie Plageman <melanieplageman@gmail.com>

From: Melanie Plageman <melanieplageman@gmail.com>

To: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>

Cc: Andres Freund <andres@anarazel.de>

Date: 2025-09-02T21:10:43Z

Lists: pgsql-hackers

Attachments

v1-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch (text/x-patch)
v1-0002-Split-FlushBuffer-into-two-parts.patch (text/x-patch)
v1-0003-Eagerly-flush-bulkwrite-strategy-ring.patch (text/x-patch)
v1-0004-Write-combining-for-BAS_BULKWRITE.patch (text/x-patch)
v1-0005-Fix-XLogNeedsFlush-for-checkpointer.patch (text/x-patch)
v1-0006-Add-database-Oid-to-CkptSortItem.patch (text/x-patch)
v1-0007-Implement-checkpointer-data-write-combining.patch (text/x-patch)

Hi,

The attached patch set implements checkpointer write combining, which
makes immediate checkpoints at least 20% faster in my tests. With the
patches applied, the checkpointer achieves higher write throughput and
higher write IOPS.
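
For readers new to the term, here is a minimal sketch of the general
idea behind write combining. It is not code from the patches: the
function name combine_writes, the (relation, block) tuples, and the
combine_limit parameter are all hypothetical stand-ins, and it assumes
only that dirty blocks are sorted and that adjacent blocks of the same
relation may be merged into one larger write up to a limit.

```python
from typing import Iterable, List, Tuple

# (relation id, block number); both are hypothetical stand-ins for
# whatever identifies a buffer's on-disk location.
Block = Tuple[int, int]


def combine_writes(dirty: Iterable[Block],
                   combine_limit: int) -> List[Tuple[int, int, int]]:
    """Group dirty blocks into (relation, first_block, nblocks) batches.

    A batch is extended only while the next block belongs to the same
    relation, is exactly adjacent to the end of the batch, and the
    batch holds fewer than combine_limit blocks.
    """
    if combine_limit < 1:
        raise ValueError("combine_limit must be at least 1")
    batches: List[Tuple[int, int, int]] = []
    for rel, blk in sorted(set(dirty)):
        if batches:
            last_rel, first, n = batches[-1]
            if last_rel == rel and blk == first + n and n < combine_limit:
                batches[-1] = (last_rel, first, n + 1)
                continue
        batches.append((rel, blk, 1))
    return batches


dirty_blocks = [(1, 10), (1, 11), (1, 12), (1, 14), (2, 0), (2, 1)]
print(combine_writes(dirty_blocks, combine_limit=2))
# [(1, 10, 2), (1, 12, 1), (1, 14, 1), (2, 0, 2)]
```

In this toy input, six single-block writes become four writes with a
limit of 2, and three with a limit large enough to cover every run.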

Besides the immediate performance gain from the patch set, we will
eventually need all writers to do write combining if we want to use
direct IO. Additionally, I think the general shape I refactored
BufferSync() into will be useful for AIO-ifying the checkpointer.

The patch set has preliminary patches (0001-0004) that implement eager
flushing and write combining for bulkwrites (like COPY FROM). The
functions used to flush a batch of writes for bulkwrites (see 0004)
are reused for the checkpointer. The eager flushing component of this
patch set has been discussed elsewhere [1].

0005 implements a fix for XLogNeedsFlush() when called by checkpointer
during an end-of-crash-recovery checkpoint. I've already started
another thread about this [2], but the patch is required for the patch
set to pass tests.

One outstanding action item is to test whether spread checkpoints
benefit as well.

More on how I measured the performance benefit to immediate checkpoints:

I tuned checkpoint_completion_target, checkpoint_timeout,
min_wal_size, and max_wal_size to ensure no other checkpoints were
initiated.

With shared_buffers set to 16 GB and io_combine_limit set to 128, I
created a 15 GB table. To get consistent results, I used pg_prewarm to
read the table into shared buffers, issued a checkpoint, then used
Bilal's patch [3] to mark all the buffers as dirty again and issued
another checkpoint. On a fast local SSD, this gave a consistent 20%+
speedup (~6.5 seconds to ~5 seconds).
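
As a quick check of how the two approximate timings relate to the
"20%+" figure, the arithmetic can be sketched as below. The helper
names are hypothetical, and the inputs are the rounded values quoted
above, so the results are only as precise as those.

```python
def time_reduction(before_s: float, after_s: float) -> float:
    """Fraction of wall-clock time saved: (before - after) / before."""
    return (before_s - after_s) / before_s


def throughput_gain(before_s: float, after_s: float) -> float:
    """Relative throughput increase for the same amount of work."""
    return before_s / after_s - 1.0


before_s, after_s = 6.5, 5.0  # approximate checkpoint durations, seconds
print(f"{time_reduction(before_s, after_s):.0%} less time")
print(f"{throughput_gain(before_s, after_s):.0%} more throughput")
# 23% less time
# 30% more throughput
```

Either way of expressing it is consistent with a speedup of more than
20%.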

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAAKRu_a1vZRZRWO3_jv_X13RYoqLRVipGO0237g5PKzPa2YX6g%40mail.gmail.com
[3] https://www.postgresql.org/message-id/flat/CAN55FZ0h_YoSqqutxV6DES1RW8ig6wcA8CR9rJk358YRMxZFmw%40mail.gmail.com