Thread
-
Sending unflushed WAL in physical replication
Rahila Syed <rahilasyed90@gmail.com> — 2025-09-25T19:02:28Z
Hi,

Please find attached a POC patch that introduces changes to the WAL sender and receiver, allowing WAL records to be sent to standbys before they are flushed to disk on the primary during physical replication. This is intended to improve replication latency by reducing the amount of WAL read from disk.

For large transactions, this approach ensures that the bulk of the transaction’s WAL records are already sent to the standby before the flush occurs on the primary. As a result, the flushes on the primary and the standby happen closer together, reducing replication lag.

Observations from the benchmark:

1. The patch improves TPS by ~13% in the sync replication setup. In repeated runs, the TPS increase is anywhere between 5% and 13%.
2. The WAL sender reads significantly less WAL from disk, indicating more efficient use of WAL buffers and reduced disk I/O.

Following are some details of the implementation:

1. The primary does not wait for the flush before starting to send data, so it is likely to send smaller chunks of data. To prevent network overload, changes are made to avoid sending excessively small packets.
2. The sender includes the current flush pointer in the replication protocol messages, so the standby knows up to which point WAL has been safely flushed on the primary.
3. The logic ensures that standbys do not apply transactions that have not been flushed on the primary, by updating the flushedUpto position on the standby only up to the flushPtr received from the primary.
4. WAL records received from the primary are written and can be flushed to disk on the standby, but are only marked as flushed up to the flushPtr reported by the primary.

Benchmark details are as follows:

Synchronous replication with remote write enabled.
Two Azure VMs: Central India (primary), Central US (standby).
OS: Ubuntu 24.04, VM size D4s (4 vCPUs, 16 GiB RAM).
With patch:
TPS: 115
WAL read from disk by WAL sender: ~40MB (read bytes from pg_stat_io)
WAL generated during the test: 772705760 bytes

Without the patch:
TPS: 102
WAL read from disk by WAL sender: ~79MB (read bytes from pg_stat_io)
WAL generated during the test: 760060792 bytes

Commit hash: b1187266e0

pgbench -c 32 -j 4 postgres -T 300 -f wal_test.sql

wal_test.sql (each transaction generates ~36KB of WAL):

\set delta random(1, 500)
BEGIN;
INSERT INTO wal_bloat_:delta (data) SELECT repeat('x', 8000) FROM generate_series(1, 80);

TODO:

1. Ensure there is a robust mechanism on the receiver to prevent WAL records that are not flushed on the primary from being applied on the standby, under any circumstances.
2. When smaller chunks of WAL are received on the standby, it can lead to more frequent disk write operations. To mitigate this, employing WAL buffers on the standby could be a more effective approach. Evaluate the performance impact of using WAL buffers on the standby.

A similar idea was proposed here:
Proposal: Allow walsenders to send WAL directly from wal_buffers to replicas
<https://www.postgresql.org/message-id/flat/CALj2ACXCSM%2BsTR%3D5NNRtmSQr3g1Vnr-yR91azzkZCaCJ7u4d4w%40mail.gmail.com>

This idea was also discussed here recently:
https://www.postgresql.org/message-id/fa2e932eeff472250e2dbacb49d8c43ad282fea9.camel%40j-davis.com

Kindly let me know your thoughts.

Thank you,
Rahila Syed
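
For readers following the thread, the standby-side rule described in implementation points 2 to 4 can be illustrated with a small model. This is only a sketch under stated assumptions, not code from the patch: the class StandbyWalState, its methods, and the use of plain integers for WAL positions are all hypothetical. It assumes the position the standby treats as flushed is the smaller of what it has flushed locally and the flush pointer last reported by the primary.

```python
from dataclasses import dataclass


@dataclass
class StandbyWalState:
    """Hypothetical standby bookkeeping; names are illustrative, not from the patch."""
    received_upto: int = 0       # highest WAL position received from the primary
    local_flushed_upto: int = 0  # highest WAL position flushed to the standby's disk
    primary_flush_ptr: int = 0   # latest flush pointer reported by the primary

    def on_wal_message(self, end_lsn: int, primary_flush_ptr: int) -> None:
        # Each message carries WAL up to end_lsn plus the primary's flush pointer.
        self.received_upto = max(self.received_upto, end_lsn)
        self.primary_flush_ptr = max(self.primary_flush_ptr, primary_flush_ptr)

    def on_local_flush(self) -> None:
        # The standby may flush everything it has received so far.
        self.local_flushed_upto = self.received_upto

    @property
    def flushed_upto(self) -> int:
        # Assumption: WAL counts as flushed (and so eligible for apply) only up to
        # the point that is durable on BOTH the standby and the primary.
        return min(self.local_flushed_upto, self.primary_flush_ptr)


state = StandbyWalState()

# The primary has generated WAL up to 1000 but flushed only up to 400.
state.on_wal_message(end_lsn=1000, primary_flush_ptr=400)
state.on_local_flush()
print(state.flushed_upto)  # 400: capped by the primary's flush pointer

# A later message reports that the primary has now flushed up to 1000.
state.on_wal_message(end_lsn=1200, primary_flush_ptr=1000)
print(state.flushed_upto)  # 1000: capped by the standby's own last flush
```

In this model the standby can hold, and even flush, WAL beyond the primary's flush pointer, but that WAL stays invisible to apply until a later message advances the pointer, which is the guarantee the first TODO item asks to make robust.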