v7_PG18-0001-Fix-the-race-condition-between-slot-wal-rese.patch

application/octet-stream
Filename: v7_PG18-0001-Fix-the-race-condition-between-slot-wal-rese.patch
Type: application/octet-stream
Part: 1
Message: RE: Newly created replication slot may be invalidated by checkpoint
From 02c50e07e2ce2e5fd293a553bae90ff2489f4b0d Mon Sep 17 00:00:00 2001
From: Zhijie Hou <houzj.fnst@fujitsu.com>
Date: Thu, 4 Dec 2025 14:22:53 +0800
Subject: [PATCH v7_PG18] Fix the race condition between slot wal reservation
 and checkpoint

Previously, a race condition existed where a newly created slot selected a WAL
position for reservation but had not yet updated the restart_lsn. Meanwhile, a
checkpoint could occur, choosing a later WAL position as the redo pointer. Due
to the lack of interlock between WAL reservation and checkpoints, this situation
could lead to the premature removal of WALs reserved by the new slot during a
checkpoint, causing the newly created slot to become invalidated.

This commit fixes it by taking exclusive lock on ReplicationSlotAllocationLock
when reserving WAL to serialize the minimum restart_lsn computation in
checkpoint process and WAL reservation, ensuring that:

* If the WAL reservation occurs first, the checkpoint must wait for the
restart_lsn to be updated before proceeding with WAL removal. This guarantees
that the most recent restart_lsn position is detected.

* If the checkpoint calls CheckPointReplicationSlots() first, then any
subsequent WAL reservation must take a position later than the redo pointer.
---
 src/backend/replication/slot.c | 97 ++++++++++++++++++----------------
 1 file changed, 51 insertions(+), 46 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 90f9e8068a6..4467496b795 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1530,63 +1530,62 @@ void
 ReplicationSlotReserveWal(void)
 {
 	ReplicationSlot *slot = MyReplicationSlot;
+	XLogSegNo	segno;
+	XLogRecPtr	restart_lsn;
 
 	Assert(slot != NULL);
 	Assert(slot->data.restart_lsn == InvalidXLogRecPtr);
 	Assert(slot->last_saved_restart_lsn == InvalidXLogRecPtr);
 
 	/*
-	 * The replication slot mechanism is used to prevent removal of required
-	 * WAL. As there is no interlock between this routine and checkpoints, WAL
-	 * segments could concurrently be removed when a now stale return value of
-	 * ReplicationSlotsComputeRequiredLSN() is used. In the unlikely case that
-	 * this happens we'll just retry.
+	 * Acquire an exclusive lock to prevent the checkpoint process from
+	 * concurrently calculating the minimum slot LSN (see
+	 * CheckPointReplicationSlots), ensuring that the reserved WAL cannot be
+	 * removed during a checkpoint.
+	 *
+	 * The mechanism is reliable because if WAL reservation occurs first, the
+	 * checkpoint must wait for the restart_lsn update before determining the
+	 * minimum non-removable LSN. On the other hand, if the checkpoint occurs
+	 * first, subsequent WAL reservations must choose positions beyond or
+	 * equal to the redo pointer of checkpoint.
 	 */
-	while (true)
-	{
-		XLogSegNo	segno;
-		XLogRecPtr	restart_lsn;
+	LWLockAcquire(ReplicationSlotAllocationLock, LW_EXCLUSIVE);
 
-		/*
-		 * For logical slots log a standby snapshot and start logical decoding
-		 * at exactly that position. That allows the slot to start up more
-		 * quickly. But on a standby we cannot do WAL writes, so just use the
-		 * replay pointer; effectively, an attempt to create a logical slot on
-		 * standby will cause it to wait for an xl_running_xact record to be
-		 * logged independently on the primary, so that a snapshot can be
-		 * built using the record.
-		 *
-		 * None of this is needed (or indeed helpful) for physical slots as
-		 * they'll start replay at the last logged checkpoint anyway. Instead
-		 * return the location of the last redo LSN. While that slightly
-		 * increases the chance that we have to retry, it's where a base
-		 * backup has to start replay at.
-		 */
-		if (SlotIsPhysical(slot))
-			restart_lsn = GetRedoRecPtr();
-		else if (RecoveryInProgress())
-			restart_lsn = GetXLogReplayRecPtr(NULL);
-		else
-			restart_lsn = GetXLogInsertRecPtr();
+	/*
+	 * For logical slots log a standby snapshot and start logical decoding at
+	 * exactly that position. That allows the slot to start up more quickly.
+	 * But on a standby we cannot do WAL writes, so just use the replay
+	 * pointer; effectively, an attempt to create a logical slot on standby
+	 * will cause it to wait for an xl_running_xact record to be logged
+	 * independently on the primary, so that a snapshot can be built using the
+	 * record.
+	 *
+	 * None of this is needed (or indeed helpful) for physical slots as
+	 * they'll start replay at the last logged checkpoint anyway. Instead
+	 * return the location of the last redo LSN. While that slightly increases
+	 * the chance that we have to retry, it's where a base backup has to start
+	 * replay at.
+	 */
+	if (SlotIsPhysical(slot))
+		restart_lsn = GetRedoRecPtr();
+	else if (RecoveryInProgress())
+		restart_lsn = GetXLogReplayRecPtr(NULL);
+	else
+		restart_lsn = GetXLogInsertRecPtr();
 
-		SpinLockAcquire(&slot->mutex);
-		slot->data.restart_lsn = restart_lsn;
-		SpinLockRelease(&slot->mutex);
+	SpinLockAcquire(&slot->mutex);
+	slot->data.restart_lsn = restart_lsn;
+	SpinLockRelease(&slot->mutex);
 
-		/* prevent WAL removal as fast as possible */
-		ReplicationSlotsComputeRequiredLSN();
+	/* prevent WAL removal as fast as possible */
+	ReplicationSlotsComputeRequiredLSN();
 
-		/*
-		 * If all required WAL is still there, great, otherwise retry. The
-		 * slot should prevent further removal of WAL, unless there's a
-		 * concurrent ReplicationSlotsComputeRequiredLSN() after we've written
-		 * the new restart_lsn above, so normally we should never need to loop
-		 * more than twice.
-		 */
-		XLByteToSeg(slot->data.restart_lsn, segno, wal_segment_size);
-		if (XLogGetLastRemovedSegno() < segno)
-			break;
-	}
+	XLByteToSeg(slot->data.restart_lsn, segno, wal_segment_size);
+	if (XLogGetLastRemovedSegno() >= segno)
+		elog(ERROR, "WAL required by replication slot %s has been removed concurrently",
+			 NameStr(slot->data.name));
+
+	LWLockRelease(ReplicationSlotAllocationLock);
 
 	if (!RecoveryInProgress() && SlotIsLogical(slot))
 	{
@@ -2094,6 +2093,12 @@ CheckPointReplicationSlots(bool is_shutdown)
 	 * acquiring a slot we cannot take the control lock - but that's OK,
 	 * because holding ReplicationSlotAllocationLock is strictly stronger, and
 	 * enough to guarantee that nobody can change the in_use bits on us.
+	 *
+	 * Additionally, acquiring the Allocation lock is necessary to serialize
+	 * the slot flush process with concurrent slot WAL reservation. This
+	 * ensures that the WAL position being reserved is either flushed to disk
+	 * or beyond or equal to the redo pointer (See ReplicationSlotReserveWal
+	 * for details).
 	 */
 	LWLockAcquire(ReplicationSlotAllocationLock, LW_SHARED);
 
-- 
2.51.1.windows.1