Thread

  1. Sending unflushed WAL in physical replication

    Rahila Syed <rahilasyed90@gmail.com> — 2025-09-25T19:02:28Z

    Hi,
    
    Please find attached a POC patch that introduces changes to the WAL sender
    and receiver, allowing WAL records to be sent to standbys before they are
    flushed to disk on the primary during physical replication. This is
    intended to improve replication latency by reducing the amount of WAL read
    from disk.
    For large transactions, this approach ensures that the bulk of the
    transaction’s WAL records are already sent to the standby before the flush
    occurs on the primary.
    As a result, the flushes on the primary and the standby happen closer
    together, reducing replication lag.
    
    Observations from the benchmark:
    1. The patch improves TPS by ~13% in the sync replication setup. Across
    repeated runs, the TPS increase ranges from 5% to 13%.
    2. The WAL sender reads significantly less WAL from disk, indicating more
    efficient use of WAL buffers and reduced disk I/O.
    
    Following are some of the details of the implementation:
    
    1. The primary does not wait for a flush before starting to send data, so
    it is likely to send smaller chunks of data. To prevent network overload,
    changes are made to avoid sending excessively small packets.
    2. The sender includes the current flush pointer in the replication
    protocol messages, so the standby knows up to which point WAL has been
    safely flushed on the primary.
    3. The logic ensures that standbys do not apply transactions that have not
    been flushed on the primary, by updating the flushedUpto position on the
    standby only up to the flushPtr received from the primary.
    4. WAL records received from the primary are written and can be flushed to
    disk on the standby, but are only marked as flushed up to the flushPtr
    reported by the primary.
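    
    To illustrate points 2 to 4, here is a minimal sketch of the receiver-side
    rule, written in Python rather than the patch's C. It is not taken from the
    patch: the class, its fields, and the use of plain integers as LSNs are
    assumptions made for illustration. Only flushedUpto and flushPtr correspond
    to names used above.

```python
from dataclasses import dataclass


@dataclass
class StandbyWalState:
    """Hypothetical model of receiver-side positions; LSNs are plain integers."""

    written_upto: int = 0        # end of WAL written locally on the standby
    local_flushed_upto: int = 0  # end of WAL fsynced locally on the standby
    primary_flush_ptr: int = 0   # latest flushPtr reported by the primary

    def receive(self, end_lsn: int, flush_ptr: int) -> None:
        """Record a WAL chunk ending at end_lsn plus the primary's flushPtr."""
        self.written_upto = max(self.written_upto, end_lsn)
        # The primary's flush pointer only moves forward; max() guards
        # against a stale value being applied out of order.
        self.primary_flush_ptr = max(self.primary_flush_ptr, flush_ptr)

    def flush_local(self) -> None:
        """Pretend to fsync everything written so far on the standby."""
        self.local_flushed_upto = self.written_upto

    @property
    def flushed_upto(self) -> int:
        """Position exposed as flushed: never past the primary's flushPtr."""
        return min(self.local_flushed_upto, self.primary_flush_ptr)


state = StandbyWalState()

# The standby holds WAL up to 1000, but the primary has flushed only to 400.
state.receive(end_lsn=1000, flush_ptr=400)
state.flush_local()
ahead_of_primary = state.flushed_upto   # 400

# A later message reports that the primary has flushed up to 1000.
state.receive(end_lsn=1000, flush_ptr=1000)
caught_up = state.flushed_upto          # 1000
```

    The point of the sketch is that the standby may hold, and even fsync, WAL
    beyond the primary's flush pointer while the position it treats as flushed
    stays clamped to that pointer.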
    
    Benchmark details are as follows:
    Synchronous replication with remote write enabled.
    Two Azure VMs: Central India (primary), Central US (standby).
    OS: Ubuntu 24.04, VM size D4s (4 vCPUs, 16 GiB RAM).
    
    With the patch:
    TPS: 115
    WAL read from disk by WAL sender: ~40MB (read bytes from pg_stat_io)
    WAL generated during the test: 772705760 bytes
    
    Without the patch:
    TPS: 102
    WAL read from disk by WAL sender: ~79MB (read bytes from pg_stat_io)
    WAL generated during the test: 760060792 bytes
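    
    For reference, the relative changes implied by these figures can be
    worked out as below. This is only arithmetic on the numbers reported
    above. The last step assumes that "MB" means 10^6 bytes, which the
    results do not state, so those two percentages are approximate.

```python
# Figures copied from the benchmark results above.
tps_patched, tps_baseline = 115, 102
wal_read_patched_mb, wal_read_baseline_mb = 40, 79   # approximate values
wal_generated_patched = 772_705_760    # bytes
wal_generated_baseline = 760_060_792   # bytes


def pct_change(new: float, old: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


tps_gain = pct_change(tps_patched, tps_baseline)
read_reduction = -pct_change(wal_read_patched_mb, wal_read_baseline_mb)

# Assumption: 1 MB = 10**6 bytes.
MB = 10**6
read_share_patched = wal_read_patched_mb * MB / wal_generated_patched * 100.0
read_share_baseline = wal_read_baseline_mb * MB / wal_generated_baseline * 100.0

print(f"TPS gain: {tps_gain:.1f}%")                        # 12.7%
print(f"WAL read from disk reduced by: {read_reduction:.1f}%")  # 49.4%
print(f"Share of generated WAL read from disk, patched: {read_share_patched:.1f}%")
print(f"Share of generated WAL read from disk, baseline: {read_share_baseline:.1f}%")
```

    Under that assumption, the WAL sender reads roughly 5% of the generated
    WAL from disk with the patch, against roughly 10% without it.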
    
    Commit hash: b1187266e0
    
    pgbench -c 32 -j 4 postgres -T 300 -f wal_test.sql
    
    wal_test.sql (each transaction generates ~36KB of WAL):
    \set delta random(1, 500)
    BEGIN;
    INSERT INTO wal_bloat_:delta (data)
    SELECT repeat('x', 8000)
    FROM generate_series(1, 80);
    
    TODO:
    1. Ensure there is a robust mechanism on the receiver to prevent WAL
    records that are not flushed on the primary from being applied on the
    standby, under any circumstances.
    2. When smaller chunks of WAL are received on the standby, it can lead to
    more frequent disk write operations. To mitigate this issue, employing WAL
    buffers on the standby could be a more effective approach. Evaluate the
    performance impact of using WAL buffers on the standby.
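    
    As a rough picture of what TODO item 2 is aiming at, the sketch below
    coalesces small incoming chunks in memory and counts how many write
    operations result. It is purely hypothetical: the class, the 8192-byte
    threshold, and the 512-byte chunk size are assumptions chosen for
    illustration, and none of it reflects the patch or PostgreSQL's WAL
    buffer implementation.

```python
class CoalescingWalBuffer:
    """Hypothetical standby-side buffer; not part of the posted patch."""

    def __init__(self, threshold_bytes: int) -> None:
        self.threshold_bytes = threshold_bytes
        self.pending = bytearray()
        self.write_calls = 0      # number of simulated disk writes
        self.bytes_written = 0

    def append(self, chunk: bytes) -> None:
        """Buffer a received chunk; write out once the threshold is reached."""
        self.pending += chunk
        if len(self.pending) >= self.threshold_bytes:
            self.drain()

    def drain(self) -> None:
        """Write out whatever is buffered as a single simulated write."""
        if self.pending:
            self.write_calls += 1
            self.bytes_written += len(self.pending)
            self.pending.clear()


chunk = b"x" * 512            # assumed size of one small WAL message
buf = CoalescingWalBuffer(threshold_bytes=8192)
for _ in range(100):
    buf.append(chunk)
buf.drain()                   # write out the tail, e.g. before a flush

unbuffered_writes = 100       # one write per received chunk
print(buf.write_calls, "writes instead of", unbuffered_writes)
```

    With these made-up sizes, 100 chunks result in 7 writes instead of 100.
    Whether a real gain shows up on the standby is exactly what the TODO item
    proposes to measure.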
    
    A similar idea was proposed here:
    Proposal: Allow walsenders to send WAL directly from wal_buffers to replicas
    <https://www.postgresql.org/message-id/flat/CALj2ACXCSM%2BsTR%3D5NNRtmSQr3g1Vnr-yR91azzkZCaCJ7u4d4w%40mail.gmail.com>
    This idea was also discussed recently here:
    https://www.postgresql.org/message-id/fa2e932eeff472250e2dbacb49d8c43ad282fea9.camel%40j-davis.com
    
    Kindly let me know your thoughts.
    
    Thank you,
    Rahila Syed