Thread

  1. [PATCH] Add archive_mode=follow_primary to prevent unarchived WAL on standby promotion

    Andrey Borodin <x4mmm@yandex-team.ru> — 2025-10-23T16:25:04Z

    Hi hackers,
    
    I'd like to propose a new archive_mode setting to address a gap in WAL 
    archiving for high availability streaming replication configurations.
    
    ## Problem
    
    In HA setups using streaming replication, a standby can be 
    promoted when the primary has failed. At that point some WAL segments 
    might not yet be archived. This creates gaps in the WAL archive, breaking 
    point-in-time recovery:
    
    1. Primary generates WAL, streams to standby
    2. Standby receives WAL, marks segments as .done immediately
    3. Standby deletes WAL during checkpoints
    4. Primary hasn't archived yet (archiver lag, network issues, etc.)
    5. Primary vanishes
    6. Standby gets promoted
    7. WAL history lost from archive
    
    This is particularly problematic in synchronous replication where 
    promotion might happen while the primary is still catching up on archival.
    
    A promoted standby might have some WAL segments from walreceiver and some 
    from the archive. In this case we need to archive only those segments which 
    were received but not confirmed as archived by the primary.
    
    ## Proposed Solution
    
    Add archive_mode=follow_primary, where standbys defer WAL deletion until 
    the primary confirms archival:
    
    - During recovery: standby creates .ready files for received segments
    - Periodically: standby queries primary for archive status via replication protocol
    - Primary responds: which segments are archived (no .ready file exists)
    - Standby marks those as .done and can safely delete them
    - On promotion: standby automatically archives remaining .ready segments
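
    As a rough illustration of the flow above, here is a toy Python model (not
    code from the patch). Segments are plain integers, the "ready"/"done"
    strings stand in for the .ready/.done files in archive_status, and the
    class and method names are hypothetical.

```python
# Toy model of the follow_primary lifecycle on a single standby.
# Nothing here touches real WAL files; all names are hypothetical.

class Standby:
    def __init__(self):
        self.status = {}     # segno -> "ready" | "done"
        self.archived = []   # segments this node archived itself

    def receive_segment(self, segno):
        # During recovery: a received segment is marked .ready, not .done.
        self.status[segno] = "ready"

    def pending(self):
        # Segments the standby would ask its upstream about.
        return sorted(s for s, st in self.status.items() if st == "ready")

    def apply_response(self, archived_upstream):
        # Upstream reported these as archived; they may now be recycled.
        for segno in archived_upstream:
            if self.status.get(segno) == "ready":
                self.status[segno] = "done"

    def promote(self):
        # After promotion the archiver picks up whatever is still .ready.
        for segno in self.pending():
            self.archived.append(segno)
            self.status[segno] = "done"


standby = Standby()
for segno in range(1, 6):
    standby.receive_segment(segno)

standby.apply_response([1, 2, 3])   # primary archived 1..3 before it failed
standby.promote()
print(standby.archived)             # [4, 5]
```

    In this model segments 4 and 5 are the ones that would otherwise be
    missing from the archive after failover.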
    
    ## Implementation
    
    The patch adds two replication protocol messages:
    - 'a' (PqReplMsg_ArchiveStatusQuery): standby → primary, sends (timeline, segno) pairs
    - 'A' (PqReplMsg_ArchiveStatusResponse): primary → standby, responds with archived pairs
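
    To give a feel for the message size, here is a minimal Python sketch of
    packing (timeline, segno) pairs. The byte layout is an assumption made
    only for this example (1-byte type, 2-byte count, then a uint32 timeline
    and uint64 segment number per pair, network byte order); the actual wire
    format is whatever the patch defines.

```python
import struct

# Assumed layout, for illustration only:
#   1-byte message type, 2-byte pair count,
#   then count * (uint32 timeline, uint64 segno), network byte order.
HEADER = struct.Struct(">cH")
PAIR = struct.Struct(">IQ")
MAX_SEGMENTS = 64   # per-query limit mentioned in this proposal


def encode_pairs(msg_type, pairs):
    """Serialize a query ('a') or response ('A') carrying (tli, segno) pairs."""
    if len(pairs) > MAX_SEGMENTS:
        raise ValueError("too many segments for one message")
    body = b"".join(PAIR.pack(tli, segno) for tli, segno in pairs)
    return HEADER.pack(msg_type, len(pairs)) + body


def decode_pairs(data):
    """Inverse of encode_pairs: returns (msg_type, list of (tli, segno))."""
    msg_type, count = HEADER.unpack_from(data, 0)
    pairs = [
        PAIR.unpack_from(data, HEADER.size + i * PAIR.size)
        for i in range(count)
    ]
    return msg_type, pairs


query = encode_pairs(b"a", [(1, segno) for segno in range(100, 164)])
print(len(query))   # 771 bytes for 64 pairs under this assumed layout
```

    Under these assumptions a full 64-segment query stays below 1KB.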
    
    Key changes:
    - walreceiver: XLogWalRcvSendArchiveQuery() scans archive_status, sends 
    queries. I particularly dislike the need to read the whole archive_status directory, 
    but found no better way.
    - walsender: ProcessStandbyArchiveQueryMessage() checks .ready files, responds.
    Fortunately, no potentially FS-heavy operations on the primary.
    - archiver: skips archiving during recovery if archive_mode=follow_primary.
    I considered creating a new kind of status file, but rejected the idea.
    - XLogWalRcvClose(): creates .ready files instead of .done in follow_primary mode
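
    The upstream-side check can be sketched in Python as follows. This is an
    illustration of the "no .ready file means archived" rule, not the C code
    in the patch; the function names are hypothetical and the default 16MB
    WAL segment size is assumed when building file names.

```python
import os
import tempfile

WAL_SEGMENT_SIZE = 16 * 1024 * 1024                 # default size, assumed
SEGMENTS_PER_XLOGID = 0x100000000 // WAL_SEGMENT_SIZE


def wal_file_name(tli, segno):
    # 24 hex digits: timeline, then the segment number split in two parts,
    # following PostgreSQL's WAL segment file naming.
    return "%08X%08X%08X" % (
        tli,
        segno // SEGMENTS_PER_XLOGID,
        segno % SEGMENTS_PER_XLOGID,
    )


def archived_pairs(archive_status_dir, pairs):
    """Return the queried pairs that have no .ready file on this node."""
    result = []
    for tli, segno in pairs:
        ready = os.path.join(
            archive_status_dir, wal_file_name(tli, segno) + ".ready"
        )
        if not os.path.exists(ready):   # one stat()-like call per queried pair
            result.append((tli, segno))
    return result


# Usage: segment 2 is still waiting for the archiver, segment 1 is not.
with tempfile.TemporaryDirectory() as status_dir:
    open(os.path.join(status_dir, wal_file_name(1, 2) + ".ready"), "w").close()
    print(archived_pairs(status_dir, [(1, 1), (1, 2)]))   # [(1, 1)]
```

    The cost on the answering side is one existence check per queried pair,
    with no directory scan.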
    
    Status requests happen at wal_receiver_status_interval (similar to hot_standby_feedback).
    Works with cascading replication - each standby queries its immediate upstream.
    Primary can be configured with archive_mode=follow_primary too.
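
    For cascading setups, a toy Python model of how confirmations would move
    down the chain one hop per status round. It assumes that an intermediate
    standby answers queries from its own .ready files in the same way the
    primary does; the node names and segment numbers are made up.

```python
# Toy cascade: primary -> standby_a -> standby_b.
# Each set holds the segments that still have a .ready file on that node.
ready = {
    "primary": {5},
    "standby_a": {3, 4, 5},
    "standby_b": {3, 4, 5},
}
upstream = {"standby_a": "primary", "standby_b": "standby_a"}


def poll(node):
    """One status round: clear .ready for segments the upstream reports archived."""
    confirmed = ready[node] - ready[upstream[node]]
    ready[node] -= confirmed
    return sorted(confirmed)


print(poll("standby_b"))   # []      standby_a still holds 3 and 4 as .ready
print(poll("standby_a"))   # [3, 4]  primary only has 5 left to archive
print(poll("standby_b"))   # [3, 4]  confirmation arrives one round later
```

    Under this assumption a downstream standby keeps segments at least one
    status interval longer than its upstream.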
    
    ## Testing
    
    Included TAP tests cover:
    - Basic archive status synchronization
    - Standby promotion triggering archival
    - Cascading standby configurations
    - Multiple standbys from same primary
    
    ## Performance Impact
    
    The overhead is minimal:
    - Standby: One archive_status directory scan per wal_receiver_status_interval
    - Primary: O(n) stat() calls where n = number of .ready files on standby
    - Network: Small message (~1KB for 64 segments)
    - Some space occupied by unarchived WALs on all standbys
    
    ## Open Questions
    
    1. **Naming**: Is "follow_primary" the best name? Alternatives considered:
       - standby
       - synchronized/sync  
       - coordinated
       - primary_sync
    
    2. **Query frequency**: Currently tied to wal_receiver_status_interval. 
       Should this be a separate GUC?
    
    3. **Message protocol**: Should we batch more segments per message? 
       Current limit is 64 per query. Maybe sort requests by LSN to pick the 64 oldest segments?
    
    4. **Backwards compatibility**: Primary must understand the protocol. 
       Should we version-check or gracefully degrade? I don't think an additional check is necessary, but I'm not sure.
       Currently, if a walreceiver with follow_primary connects to an old primary that 
       doesn't understand the 'a' message, the primary will log a protocol error 
       but replication will continue (the standby just won't get responses).
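
    For question 3, picking the oldest segments could look like the Python
    sketch below. It is only an illustration of the idea: the function name is
    hypothetical, and ordering by segment number (with timeline as a
    tie-breaker) is used as a stand-in for LSN order.

```python
import heapq

MAX_SEGMENTS_PER_QUERY = 64   # current per-query limit


def oldest_pending(pairs, limit=MAX_SEGMENTS_PER_QUERY):
    """Pick the `limit` oldest (timeline, segno) pairs for the next query."""
    # Segment numbers grow with LSN, so order by segno first and use the
    # timeline only to break ties between same-numbered segments.
    return heapq.nsmallest(limit, pairs, key=lambda p: (p[1], p[0]))


pending = [(1, segno) for segno in range(200, 0, -1)]   # 200 .ready segments
batch = oldest_pending(pending)
print(batch[0], batch[-1], len(batch))   # (1, 1) (1, 64) 64
```

    heapq.nsmallest avoids sorting the whole list when the backlog is much
    larger than one batch.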
    
    ## Future work
    
    I'd like to extend the archiver design to distribute archival work between cluster nodes. But
    that would be too big a project to do at once, so I decided to address the PITR continuity issue first.
    
    ## Patch
    
    The attached patch implements the feature with documentation and tests, but its main purpose is, of course, discussion. Does this approach seem like the right direction of development?
    Looking forward to feedback on the approach and any concerns.
    
    
    Best regards, Andrey Borodin.