Thread

  1. RE: Assertion failure in SnapBuildInitialSnapshot()

    Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> — 2025-12-10T03:32:39Z

    On Wednesday, December 10, 2025 7:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
    > 
    > On Tue, Nov 25, 2025 at 10:25 PM Zhijie Hou (Fujitsu)
    > <houzj.fnst@fujitsu.com> wrote:
    > >
    > > On Wednesday, November 26, 2025 2:57 AM Masahiko Sawada
    > <sawada.mshk@gmail.com> wrote:
    > > >
    > > > On Tue, Nov 25, 2025 at 4:02 AM Zhijie Hou (Fujitsu)
    > <houzj.fnst@fujitsu.com>
    > > > wrote:
    > > > >
    > > > > On Tuesday, November 25, 2025 3:30 AM Masahiko Sawada
    > > > <sawada.mshk@gmail.com> wrote:
    > > > > >
    > > > > >
    > > > > > Given that the computation of xmin and catalog_xmin among all slots
    > > > > > could be executed concurrently, could the following scenario happen
    > > > > > where
    > > > > > procArray->replication_slot_xmin and replication_slot_catalog_xmin
    > > > > > procArray->are retreat to a non-invalid
    > > > > > XID?
    > > > > >
    > > > > > 1. Suppose the initial value procArray->replication_slot_catalog_xmin
    > is
    > > > 50.
    > > > > > 2. Process-A updates its owned slot's catalog_xmin to 100, and
    > > > > > computes the new catalog_xmin as 100 while holding
    > > > > > ReplicationSlotControlLock in a shared mode in
    > > > > > ReplicationSlotsComputeRequiredLSN(). But it doesn't update the
    > > > procArray's catalog_xmin value yet.
    > > > > > 3. Process-B updates its owned slot's catalog_xmin to 150, and
    > > > > > computes the new catalog_xmin as 150.
    > > > > > 4. Process-B updates the procArray->replication_slot_catalog_xmin to
    > > > 150.
    > > > > > 5. Process-A updates the procArray->repilcation_slot_catalog_xmin to
    > > > > > 100, which was 150.
    > > > >
    > > > > After further investigation, I think that steps 3 and 4 cannot occur
    > > > > because Process-B must have already encountered the catalog_xmin
    > > > > maintained by Process-A, either 50 or 100. Consequently, Process-B
    > > > > will refrain from updating the catalog_xmin to a more recent value, such
    > as
    > > > 150.
    > > >
    > > > Right. But the following scenario seems to happen:
    > > >
    > > > 1. Both processes have a slot with effective_catalog_xmin = 100.
    > > > 2. Process-A updates effective_catalog_xmin to 150, and computes the
    > new
    > > > catalog_xmin as 100 because process-B slot still has
    > effective_catalog_xmin =
    > > > 100.
    > > > 3. Process-B updates effective_catalog_xmin to 150, and computes the
    > new
    > > > catalog_xmin as 150.
    > > > 4. Process-B updates procArray->replication_slot_catalog_xmin to 150.
    > > > 5. Process-A updates procArray->replication_slot_catalog_xmin to 100.
    > >
    > > I think this scenario can occur, but is not harmful. Because the catalog rows
    > > removed prior to xid:150 would no longer be used, as both slots have
    > advanced
    > > their catalog_xmin and flushed the value to disk. Therefore, even if
    > > replication_slot_catalog_xmin regresses, it should be OK.
    > >
    > > Considering all above, I think allowing concurrent xmin computation, as the
    > > patch does, is acceptable. What do you think ?
    > 
    > I agree with your analysis. Another thing I'd like to confirm is that
    > in an extreme case, if the server crashes suddenly after removing
    > catalog tuples older than XID 100 and logical decoding restarts, it
    > ends up missing necessary catalog tuples? I think it's not a problem
    > as long as the subscriber knows the next commit LSN they want but
    > could it be problematic if the user switches to use the logical
    > decoding SQL API? I might be worrying too much, though.
    
    I think this case is not a problem because:
    
    In LogicalConfirmReceivedLocation, the updated restart_lsn and catalog_xmin are
    flushed to disk before the effective_catalog_xmin is updated. Thus, once
    replication_slot_catalog_xmin advances to a certain value, even in the event of
    a crash, users won't encounter any removed tuples when consuming from slots
    after a restart. This is because all slots have their updated restart_lsn
    flushed to disk, ensuring that upon restarting, changes are decoded from the
    updated position where older catalog tuples are no longer needed.
    
    BTW, I assume you meant catalog tuples older than XID 150 are removed, since in
    the previous example, tuples older than XID 100 are already not useful.
    
    Best Regards,
    Hou zj