RE: Newly created replication slot may be invalidated by checkpoint
Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> — 2025-12-02T04:19:09Z
On Tuesday, December 2, 2025 1:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Nov 21, 2025 at 12:14 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > OK, I think it makes sense to start separate threads.
> >
> > I have split the patches based on the different bugs they
> > address and am sharing them here for reference.
> >
>
> I'm reviewing the 0001 patch and the problem that can be addressed by
> that patch. While the proposed patch addresses the race condition
> between a checkpointing and newly created slot, could the same issue
> happen between the checkpointing and copying a slot? I'm trying to
> understand when we have to acquire ReplicationSlotAllocationLock in an
> exclusive mode in the new lock scheme.

Thanks for reviewing!

I think the situation is somewhat different in copy_replication_slot().
As noted in the comments[1], it is considered acceptable for WAL
preceding the initial restart_lsn to be removed, because the source
slot's latest restart_lsn is copied again in the second phase, so the
WAL that is ultimately reserved is safe.

Aside from this specific case, I think it's necessary to acquire the
ReplicationSlotAllocationLock when reserving WAL for newly created
slots.

[1]
/*
 * We need to prevent the source slot's reserved WAL from being removed,
 * but we don't want to lock that slot for very long, and it can advance
 * in the meantime.  So obtain the source slot's data, and create a new
 * slot using its restart_lsn.  Afterwards we lock the source slot again
 * and verify that the data we copied (name, type) has not changed
 * incompatibly.  No inconvenient WAL removal can occur once the new slot
 * is created -- but since WAL removal could have occurred before we
 * managed to create the new slot, we advance the new slot's restart_lsn
 * to the source slot's updated restart_lsn the second time we lock it.
 */

Best Regards,
Hou zj