[PATCH] Add archive_mode=follow_primary to prevent unarchived WAL on standby promotion

Andrey Borodin <x4mmm@yandex-team.ru> — 2025-10-23T16:25:04Z

Hi hackers,

I'd like to propose a new archive_mode setting to address a gap in WAL
archiving for high availability streaming replication configurations.

## Problem

In HA setups using streaming replication, a standby can be promoted after
the primary has failed, while some WAL segments have not yet been archived.
This creates gaps in the WAL archive, breaking point-in-time recovery:

1. Primary generates WAL, streams to standby
2. Standby receives WAL, marks segments as .done immediately
3. Standby deletes WAL during checkpoints
4. Primary hasn't archived yet (archiver lag, network issues, etc.)
5. Primary vanishes
6. Standby gets promoted
7. WAL history lost from archive

This is particularly problematic in synchronous replication where
promotion might happen while the primary is still catching up on archival.

A promoted standby might hold some WAL segments from walreceiver and some
from the archive. In this case we need to archive only those segments that
were received but not confirmed as archived by the primary.

## Proposed Solution

Add archive_mode=follow_primary, where standbys defer WAL deletion until
the primary confirms archival:

- During recovery: standby creates .ready files for received segments
- Periodically: standby queries primary for archive status via replication protocol
- Primary responds: which segments are archived (no .ready file exists)
- Standby marks those as .done and can safely delete them
- On promotion: standby automatically archives remaining .ready segments
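
As a rough illustration of the life cycle described in the list above (this is not code from the patch), the standby-side bookkeeping can be modelled in memory. The class and method names here are hypothetical; the real implementation works with files in archive_status.

```python
from enum import Enum


class Status(Enum):
    READY = ".ready"  # received, archival not yet confirmed by the primary
    DONE = ".done"    # primary confirmed archival; segment may be removed


class StandbyArchiveState:
    """Toy in-memory model of archive_status on a follow_primary standby."""

    def __init__(self):
        self.status = {}  # (timeline, segno) -> Status

    def segment_received(self, tli, segno):
        # In follow_primary mode a finished segment starts as .ready, not .done.
        self.status.setdefault((tli, segno), Status.READY)

    def pending_query(self):
        # Segments the standby would ask its upstream about.
        return sorted(k for k, s in self.status.items() if s is Status.READY)

    def primary_confirmed(self, archived):
        # Only flip segments that are actually held as .ready locally.
        for key in archived:
            if self.status.get(key) is Status.READY:
                self.status[key] = Status.DONE

    def removable(self):
        # Segments that checkpoints may now delete or recycle.
        return sorted(k for k, s in self.status.items() if s is Status.DONE)

    def promote(self):
        # After promotion the local archiver takes over whatever is still .ready.
        return self.pending_query()
```

In this model, the segments returned by promote() are exactly the ones that were received but never confirmed as archived, which is the set the promoted standby has to archive itself.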

## Implementation

The patch adds two replication protocol messages:
- 'a' (PqReplMsg_ArchiveStatusQuery): standby → primary, sends (timeline, segno) pairs
- 'A' (PqReplMsg_ArchiveStatusResponse): primary → standby, responds with archived pairs
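
To make the message exchange concrete, here is a minimal sketch of how such a pair list could be serialized. Only the type bytes 'a' and 'A' and the 64-pair limit come from this proposal; the wire layout (a 4-byte count followed by uint32 timeline and uint64 segno per pair, network byte order) is an assumption for illustration and may not match the patch.

```python
import struct

MSG_QUERY = b"a"      # standby -> primary (type byte from the proposal)
MSG_RESPONSE = b"A"   # primary -> standby (type byte from the proposal)
MAX_PAIRS = 64        # per-query limit mentioned in the proposal

# Assumed layout, not taken from the patch:
_HEADER = struct.Struct("!cI")  # message type byte, number of pairs
_PAIR = struct.Struct("!IQ")    # uint32 timeline, uint64 segment number


def encode(msg_type, pairs):
    """Serialize a list of (timeline, segno) pairs into one message."""
    if len(pairs) > MAX_PAIRS:
        raise ValueError("too many segments in one message")
    parts = [_HEADER.pack(msg_type, len(pairs))]
    parts.extend(_PAIR.pack(tli, segno) for tli, segno in pairs)
    return b"".join(parts)


def decode(buf):
    """Parse a message back into (msg_type, [(timeline, segno), ...])."""
    msg_type, count = _HEADER.unpack_from(buf, 0)
    if len(buf) != _HEADER.size + count * _PAIR.size:
        raise ValueError("message length does not match pair count")
    pairs = [
        _PAIR.unpack_from(buf, _HEADER.size + i * _PAIR.size)
        for i in range(count)
    ]
    return msg_type, pairs
```

Under this assumed layout a full 64-pair message takes 5 + 64 * 12 = 773 bytes.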

Key changes:
- walreceiver: XLogWalRcvSendArchiveQuery() scans archive_status, sends
queries. I particularly dislike the necessity to read the whole archive_status directory,
but found no better way.
- walsender: ProcessStandbyArchiveQueryMessage() checks .ready files, responds.
Fortunately, no potentially FS-heavy operations on Primary.
- archiver: skips archiving during recovery if archive_mode=follow_primary.
I considered creating a new kind of status file, but rejected the idea.
- XLogWalRcvClose(): creates .ready files instead of .done in follow_primary mode
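
The two filesystem-facing pieces above can be sketched as follows. This is a simplified model and not the patch code: the function names are hypothetical, the default 16MB wal_segment_size is assumed, and the "archived" rule is the one stated in this proposal (no .ready file exists on the primary).

```python
import os
import re

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumes the default wal_segment_size
SEGMENTS_PER_XLOGID = 0x100000000 // WAL_SEGMENT_SIZE

_WAL_NAME = re.compile(r"[0-9A-F]{24}")


def wal_file_name(tli, segno):
    # WAL segment files are named with three 8-digit hex fields.
    return "%08X%08X%08X" % (
        tli, segno // SEGMENTS_PER_XLOGID, segno % SEGMENTS_PER_XLOGID)


def parse_wal_file_name(name):
    if not _WAL_NAME.fullmatch(name):
        raise ValueError("not a WAL segment file name: %r" % name)
    tli = int(name[0:8], 16)
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return tli, log * SEGMENTS_PER_XLOGID + seg


def standby_pending_segments(status_dir):
    """Standby side: full scan of archive_status for segment .ready files."""
    pending = []
    for entry in os.listdir(status_dir):
        base, ext = os.path.splitext(entry)
        if ext != ".ready":
            continue
        try:
            pending.append(parse_wal_file_name(base))
        except ValueError:
            continue  # e.g. timeline history files; ignored in this sketch
    return sorted(pending)


def primary_archived(status_dir, pairs):
    """Primary side: one existence check per queried segment, no scan."""
    return [
        (tli, segno)
        for tli, segno in pairs
        if not os.path.exists(
            os.path.join(status_dir, wal_file_name(tli, segno) + ".ready"))
    ]
```

The asymmetry noted above is visible here: the standby has to list the whole directory, while the primary only probes the specific names it was asked about.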

Status requests happen at wal_receiver_status_interval (similar to hot_standby_feedback).
Works with cascading replication - each standby queries its immediate upstream.
Primary can be configured with archive_mode=follow_primary too.

## Testing

Included TAP tests cover:
- Basic archive status synchronization
- Standby promotion triggering archival
- Cascading standby configurations
- Multiple standbys from same primary

## Performance Impact

The overhead is minimal:
- Standby: One archive_status directory scan per wal_receiver_status_interval
- Primary: O(n) stat() calls where n = number of .ready files on standby
- Network: Small message (~1KB for 64 segments)
- Some space occupied by unarchived WALs on all standbys

## Open Questions

1. **Naming**: Is "follow_primary" the best name? Alternatives considered:
- standby
- synchronized/sync
- coordinated
- primary_sync

2. **Query frequency**: Currently tied to wal_receiver_status_interval.
Should this be a separate GUC?

3. **Message protocol**: Should we batch more segments per message?
Current limit is 64 per query. Maybe sort requests by LSN to pick the 64 oldest segments?

4. **Backwards compatibility**: Primary must understand the protocol.
Should we version-check or gracefully degrade? I don't think an additional check is necessary, but I'm not sure.
Currently, if a walreceiver with follow_primary connects to an old primary that
doesn't understand the 'a' message, the primary will log a protocol error
but replication will continue (the standby just won't get responses).
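
For question 3, picking the oldest segments for each query could look roughly like the sketch below. It is only an illustration under assumptions: the function name is hypothetical, and ordering by segment number with timeline as a tie-breaker is one possible reading of "sort by LSN".

```python
import heapq

MAX_SEGMENTS_PER_QUERY = 64  # current per-query limit in the proposal


def oldest_segments(pending, limit=MAX_SEGMENTS_PER_QUERY):
    """Return up to `limit` (timeline, segno) pairs, oldest first.

    Segment number tracks LSN, so it is the primary sort key;
    timeline only breaks ties between equal segment numbers.
    """
    return heapq.nsmallest(limit, pending, key=lambda p: (p[1], p[0]))
```

Using a bounded selection instead of a full sort keeps the cost close to linear in the number of .ready files even when the backlog is much larger than one batch.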

## Future work

I'd like to extend the archiver design to distribute archival work between cluster nodes. But
it would be too big a project to do at once, so I decided to address the PITR continuity issue first.

## Patch

The attached patch implements the feature with documentation and tests, but its main purpose is, of course, discussion. Does this approach seem like the right direction of development?
Looking forward to feedback on the approach and any concerns.

Best regards, Andrey Borodin.