
Re: Sending unflushed WAL in physical replication

SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> — 2025-09-27T08:23:47Z
Hi Rahila,

On Thu, Sep 25, 2025 at 12:02 PM Rahila Syed <rahilasyed90@gmail.com> wrote:

> Hi,
>
> Please find attached a POC patch that introduces changes to the WAL sender and
> receiver, allowing WAL records to be sent to standbys before they are flushed
> to disk on the primary during physical replication. This is intended to improve
> replication latency by reducing the amount of WAL read from disk.
> For large transactions, this approach ensures that the bulk of the transaction’s
> WAL records are already sent to the standby before the flush occurs on the
> primary.
> As a result, the flush on the primary and standby happen closer together,
> reducing replication lag.
>

At a high level, the idea LGTM.


>
> Observations from the benchmark:
> 1. The patch improves TPS by ~13% in the sync replication setup. In repeated runs,
> I see that the TPS increase is anywhere between 5% to 13%.
> 2. WAL sender reads significantly less WAL from disk, indicating more efficient use
> of WAL buffers and reduced disk I/O
>

Can you please measure the transaction commit latency improvement as well?
Commit latency = Primary_disk_flush_time + Standby_disk_flush_time +
network_roundtrip_time
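
One way to collect this could be pgbench's per-transaction log (`pgbench -l`), comparing runs with and without the patch. Below is a minimal sketch, assuming the documented log layout where the third field is the elapsed transaction time in microseconds. The helper names `parse_pgbench_log` and `summarize` are hypothetical and not part of any tool. Note that this measures end-to-end transaction latency as seen by the client, not the three components above separately.

```python
import math
import statistics


def parse_pgbench_log(lines):
    """Return per-transaction latencies in milliseconds.

    Assumes the pgbench -l line layout:
    client_id transaction_no time script_no time_epoch time_us [schedule_lag]
    """
    latencies_ms = []
    for line in lines:
        fields = line.split()
        # Skip blank/truncated lines and non-numeric entries such as "skipped".
        if len(fields) < 6 or not fields[2].isdigit():
            continue
        latencies_ms.append(int(fields[2]) / 1000.0)
    return latencies_ms


def summarize(latencies_ms):
    """Summarize a non-empty list of latencies (nearest-rank p99)."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(0.99 * len(ordered)))
    return {
        "count": len(ordered),
        "mean_ms": statistics.fmean(ordered),
        "p50_ms": statistics.median(ordered),
        "p99_ms": ordered[rank - 1],
    }
```

Running `summarize(parse_pgbench_log(open(path)))` on the log from each run would give mean and tail latencies to compare alongside the TPS numbers.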


>
> Following are some of the details of the implementation:
>
> 1. Primary does not wait for flush before starting to send data, so it is likely to
> send smaller chunks of data. To prevent network overload, changes are made to
> avoid sending excessively small packets.
> 2. The sender includes the current flush pointer in the replication protocol
> messages, so the standby knows up to which point WAL has been safely flushed
> on the primary.
> 3. The logic ensures that standbys do not apply transactions that have not
> been flushed on the primary, by updating the flushedUpto position on the standby
> only up to the flushPtr received from the primary.
> 4. WAL records received from the primary are written and can be flushed to
> disk on the standby, but are only marked as flushed up to the flushPtr reported by the
> primary.
>
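
To check my understanding of points 2 to 4, here is a minimal model of the standby-side bookkeeping as described. This is a sketch and not the patch's code. LSNs are modelled as plain integers, and the class and field names (`StandbyWalState`, `received_upto`, `locally_flushed_upto`, `primary_flush_ptr`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StandbyWalState:
    received_upto: int = 0         # end of WAL received from the primary
    locally_flushed_upto: int = 0  # end of WAL fsynced on the standby's disk
    primary_flush_ptr: int = 0     # latest flush pointer reported by the primary

    def on_wal_message(self, end_lsn, primary_flush_ptr):
        """Record a WAL message carrying data up to end_lsn plus the primary's flush pointer."""
        self.received_upto = max(self.received_upto, end_lsn)
        self.primary_flush_ptr = max(self.primary_flush_ptr, primary_flush_ptr)

    def on_local_flush(self):
        """The standby may flush everything it has received, ahead of the primary."""
        self.locally_flushed_upto = self.received_upto

    @property
    def flushed_upto(self):
        """Position exposed as flushed: never beyond what the primary has flushed."""
        return min(self.locally_flushed_upto, self.primary_flush_ptr)
```

In this model, a standby that has received and flushed WAL up to 1000 while the primary reports a flush pointer of 400 still exposes `flushed_upto == 400`, even though the extra WAL is physically on its disk.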

What happens in crash recovery scenarios? For example, when a standby crashes
and restarts, it replays until the end of its local WAL. In this case, it may end
up replaying WAL that was never flushed on the primary (if the primary goes
through crash recovery).
Shouldn't archiving on the standby hold off uploading WAL until that WAL has
been flushed on the primary?
The same applies to pg_receivewal.
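
As a rough illustration of the archiving concern, the kind of guard I have in mind would look something like the sketch below. It assumes LSNs are byte offsets, that segment N covers bytes [N * size, (N + 1) * size), and the default 16 MB segment size. The function name `segment_is_archivable` is hypothetical and does not exist in PostgreSQL.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default segment size; can be changed at initdb


def segment_is_archivable(segment_no, primary_flush_ptr,
                          segment_size=WAL_SEGMENT_SIZE):
    """Return True only if the whole segment lies at or below the primary's flush pointer."""
    segment_end = (segment_no + 1) * segment_size
    return segment_end <= primary_flush_ptr
```

With a primary flush pointer a little past the end of segment 1, segments 0 and 1 would be eligible for archiving, while segment 2 would have to wait even if the standby already has it fully on disk.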

Thanks,
Satya
