Thread

  1. Re: Sending unflushed WAL in physical replication

    SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> — 2025-09-27T08:23:47Z

    Hi Rahila,
    
    On Thu, Sep 25, 2025 at 12:02 PM Rahila Syed <rahilasyed90@gmail.com> wrote:
    
    > Hi,
    >
    > Please find attached a POC patch that introduces changes to the WAL sender
    > and receiver, allowing WAL records to be sent to standbys before they are
    > flushed to disk on the primary during physical replication. This is intended
    > to improve replication latency by reducing the amount of WAL read from disk.
    > For large transactions, this approach ensures that the bulk of the
    > transaction’s WAL records are already sent to the standby before the flush
    > occurs on the primary. As a result, the flush on the primary and standby
    > happen closer together, reducing replication lag.
    >
    
    At a high level, the idea LGTM.
    
    
    >
    > Observations from the benchmark:
    > 1. The patch improves TPS by ~13% in the sync replication setup. In repeated
    > runs, I see that the TPS increase is anywhere between 5% and 13%.
    > 2. WAL sender reads significantly less WAL from disk, indicating more
    > efficient use of WAL buffers and reduced disk I/O.
    >
    
    Can you please measure the transaction commit latency improvement as well?
    Commit latency = Primary_disk_flush_time + Standby_disk_flush_time +
    network_roundtrip_time
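
As a rough illustration of the breakdown above, the sketch below sums the three components and compares the result with a simplified best case in which the standby has already received the WAL, so its flush overlaps with the primary's. The function names, the overlap model, and the example numbers are assumptions made for illustration; they are not measurements or behaviour taken from the patch.

```python
def commit_latency_serial(primary_flush_ms, standby_flush_ms, network_rtt_ms):
    """Commit latency when the standby flush starts only after the primary flush."""
    return primary_flush_ms + standby_flush_ms + network_rtt_ms


def commit_latency_overlapped(primary_flush_ms, standby_flush_ms, network_rtt_ms):
    """Assumed best case: both flushes run concurrently, so only the slower
    of the two contributes, plus one network round trip."""
    return max(primary_flush_ms, standby_flush_ms) + network_rtt_ms


if __name__ == "__main__":
    # Hypothetical timings in milliseconds, chosen only to show the arithmetic.
    print(commit_latency_serial(2.0, 2.0, 1.0))
    print(commit_latency_overlapped(2.0, 2.0, 1.0))
```

Reporting both the measured commit latency and these individual components would show which term the patch actually shrinks.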
    
    
    >
    > Following are some of the details of the implementation:
    >
    > 1. Primary does not wait for flush before starting to send data, so it is
    > likely to send smaller chunks of data. To prevent network overload, changes
    > are made to avoid sending excessively small packets.
    > 2. The sender includes the current flush pointer in the replication protocol
    > messages, so the standby knows up to which point WAL has been safely flushed
    > on the primary.
    > 3. The logic ensures that standbys do not apply transactions that have not
    > been flushed on the primary, by updating the flushedUpto position on the
    > standby only up to the flushPtr received from the primary.
    > 4. WAL records received from the primary are written and can be flushed to
    > disk on the standby, but are only marked as flushed up to the flushPtr
    > reported by the primary.
    >
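
Points 2 to 4 above can be sketched as a small model of the standby's bookkeeping. This is a minimal sketch, not the patch's code: LSNs are modelled as plain integers, and `StandbyWalState`, `received_upto`, and `on_wal_message` are hypothetical names. Only `flushedUpto` and `flushPtr` come from the description, and they appear here as `flushed_upto` and `primary_flush_ptr`.

```python
from dataclasses import dataclass


@dataclass
class StandbyWalState:
    received_upto: int = 0  # end of WAL written locally (may be unflushed on primary)
    flushed_upto: int = 0   # position the standby treats as flushed / safe to apply

    def on_wal_message(self, data_end_lsn: int, primary_flush_ptr: int) -> None:
        """Handle one WAL message that carries the primary's flush pointer."""
        # WAL is accepted and written regardless of the primary's flush state.
        self.received_upto = max(self.received_upto, data_end_lsn)
        # Never advance past what the primary has flushed or what we hold locally.
        safe_upto = min(self.received_upto, primary_flush_ptr)
        self.flushed_upto = max(self.flushed_upto, safe_upto)


if __name__ == "__main__":
    state = StandbyWalState()
    state.on_wal_message(data_end_lsn=1000, primary_flush_ptr=400)
    print(state.received_upto, state.flushed_upto)
```

In this model the standby can hold WAL well ahead of `flushed_upto`, and that gap is what the crash recovery question concerns.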
    
    What happens in crash recovery scenarios? For example, when a standby crashes
    and restarts, it replays until the end of WAL. In this case, it may end up
    replaying WAL that was never flushed on the primary (if the primary itself
    goes through crash recovery).
    Shouldn't archiving on the standby hold off on uploading WAL until that WAL
    has been flushed on the primary?
    The same applies to pg_receivewal.
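
To make the concern concrete, here is a small sketch of the hazard with LSNs as plain integers. All names are hypothetical. The `cap_to_primary_flush` option is only an illustrative mitigation, assuming the standby persisted the last flush pointer it received; it is not something the patch is known to do.

```python
def replay_end_after_restart(local_wal_end, last_known_primary_flush,
                             cap_to_primary_flush):
    """Where a restarted standby stops replaying its local WAL."""
    if cap_to_primary_flush:
        # Hypothetical mitigation: stop at the last flush pointer seen.
        return min(local_wal_end, last_known_primary_flush)
    # Default behaviour described above: replay until the end of local WAL.
    return local_wal_end


def has_diverged(standby_replayed_upto, primary_end_after_recovery):
    """True if the standby replayed WAL the primary no longer has."""
    return standby_replayed_upto > primary_end_after_recovery


if __name__ == "__main__":
    # Standby holds WAL up to 1000; the primary had flushed only up to 400
    # before it crashed, so its own recovery ends at 400.
    uncapped = replay_end_after_restart(1000, 400, cap_to_primary_flush=False)
    capped = replay_end_after_restart(1000, 400, cap_to_primary_flush=True)
    print(has_diverged(uncapped, 400), has_diverged(capped, 400))
```

The same comparison would apply to a WAL archiver or pg_receivewal deciding how much received WAL is safe to publish.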
    
    Thanks,
    Satya
    