Thread

  1. RE: Assertion failure in SnapBuildInitialSnapshot()

    Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> — 2025-11-25T12:02:31Z

    On Tuesday, November 25, 2025 3:30 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
    > 
    > On Mon, Nov 24, 2025 at 10:48 AM Masahiko Sawada
    > <sawada.mshk@gmail.com> wrote:
    > >
    > > On Mon, Nov 24, 2025 at 1:46 AM Amit Kapila <amit.kapila16@gmail.com>
    > wrote:
    > > >
    > > > On Fri, Nov 21, 2025 at 9:17 AM Zhijie Hou (Fujitsu)
    > > > <houzj.fnst@fujitsu.com> wrote:
    > > > >
    > > > > On Thursday, November 13, 2025 12:56 PM Zhijie Hou (Fujitsu)
    > <houzj.fnst@fujitsu.com> wrote:
    > > > > >
    > > > >
    > > > > I have been thinking if there a way to avoid holding
    > > > > ReplicationSlotControlLock exclusively in
    > > > > ReplicationSlotsComputeRequiredXmin() because that could cause lock
    > contention when many slots exist and advancements occur frequently.
    > > > >
    > > > > Given that the bug arises from a race condition between slot
    > > > > creation and concurrent slot xmin computation, I think another way
    > > > > is that, we acquire the ReplicationSlotControlLock exclusively
    > > > > only during slot creation to do the initial update of the slot
    > > > > xmin. In ReplicationSlotsComputeRequiredXmin(), we still hold the
    > > > > ReplicationSlotControlLock in shared mode until the global slot
    > > > > xmin is updated in ProcArraySetReplicationSlotXmin(). This
    > > > > approach prevents concurrent computations and updates of new xmin
    > > > > horizons by other backends during the initial slot xmin update process,
    > while it still permits concurrent calls to
    > ReplicationSlotsComputeRequiredXmin().
    > > > >
    > > >
    > > > Yeah, this seems to work.
    > >
    > > +1
    > 
    > Given that the computation of xmin and catalog_xmin among all slots could
    > be executed concurrently, could the following scenario happen where
    > procArray->replication_slot_xmin and
    > procArray->replication_slot_catalog_xmin are retreat to a non-invalid
    > XID?
    > 
    > 1. Suppose the initial value procArray->replication_slot_catalog_xmin is 50.
    > 2. Process-A updates its owned slot's catalog_xmin to 100, and computes the
    > new catalog_xmin as 100 while holding ReplicationSlotControlLock in a shared
    > mode in ReplicationSlotsComputeRequiredLSN(). But it doesn't update the
    > procArray's catalog_xmin value yet.
    > 3. Process-B updates its owned slot's catalog_xmin to 150, and computes the
    > new catalog_xmin as 150.
    > 4. Process-B updates the procArray->replication_slot_catalog_xmin to 150.
    > 5. Process-A updates the procArray->repilcation_slot_catalog_xmin to 100,
    > which was 150.
    
    After further investigation, I think that steps 3 and 4 cannot occur because
    Process-B must have already encountered the catalog_xmin maintained by
    Process-A, either 50 or 100. Consequently, Process-B will refrain from updating
    the catalog_xmin to a more recent value, such as 150.
    
    > 
    > It might be worth adding an assertion to ProcArraySetReplicationSlotXmin(),
    > checking if the new xmin and catalog_xmin values are either >= the current
    > values or an InvalidTransactionId.
    
    I considered this scenario and identified a potential exception in the
    copy_replication_slot(). This function uses a two-phase copy process, the
    original restart_lsn is directly copied to the new slot during the first phase.
    However, the original slot.restart_lsn might advance between phases.
    Consequently, the newly created slot initially uses the outdated restart_lsn,
    which could cause the procArray->replication_slot_catalog_xmin to retreat. I
    think this behavior isn't harmful, as explained in the comments, because the new
    restart_lsn will be updated in the created slot during the second phase.
    
    Best Regards,
    Hou zj