Thread

  1. RE: Assertion failure in SnapBuildInitialSnapshot()

    Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> — 2025-12-19T03:19:15Z

    On Friday, December 19, 2025 3:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
    > 
    > On Tue, Dec 9, 2025 at 7:32 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
    > wrote:
    > >
    > > On Wednesday, December 10, 2025 7:25 AM Masahiko Sawada
    > <sawada.mshk@gmail.com> wrote:
    > > >
    > > > On Tue, Nov 25, 2025 at 10:25 PM Zhijie Hou (Fujitsu)
    > > > <houzj.fnst@fujitsu.com> wrote:
    > > > >
    > > > > On Wednesday, November 26, 2025 2:57 AM Masahiko Sawada
    > > > <sawada.mshk@gmail.com> wrote:
    > > > > > Right. But the following scenario seems to happen:
    > > > > >
    > > > > > 1. Both processes have a slot with effective_catalog_xmin = 100.
    > > > > > 2. Process-A updates effective_catalog_xmin to 150, and computes
    > > > > > the
    > > > new
    > > > > > catalog_xmin as 100 because process-B slot still has
    > > > effective_catalog_xmin =
    > > > > > 100.
    > > > > > 3. Process-B updates effective_catalog_xmin to 150, and computes
    > > > > > the
    > > > new
    > > > > > catalog_xmin as 150.
    > > > > > 4. Process-B updates procArray->replication_slot_catalog_xmin to
    > 150.
    > > > > > 5. Process-A updates procArray->replication_slot_catalog_xmin to
    > 100.
    > > > >
    > > > > I think this scenario can occur, but is not harmful. Because the
    > > > > catalog rows removed prior to xid:150 would no longer be used, as
    > > > > both slots have
    > > > advanced
    > > > > their catalog_xmin and flushed the value to disk. Therefore, even
    > > > > if replication_slot_catalog_xmin regresses, it should be OK.
    > > > >
    > > > > Considering all above, I think allowing concurrent xmin
    > > > > computation, as the patch does, is acceptable. What do you think ?
    > > >
    > > > I agree with your analysis. Another thing I'd like to confirm is
    > > > that in an extreme case, if the server crashes suddenly after
    > > > removing catalog tuples older than XID 100 and logical decoding
    > > > restarts, it ends up missing necessary catalog tuples? I think it's
    > > > not a problem as long as the subscriber knows the next commit LSN
    > > > they want but could it be problematic if the user switches to use
    > > > the logical decoding SQL API? I might be worrying too much, though.
    > >
    > > I think this case is not a problem because:
    > >
    > > In LogicalConfirmReceivedLocation, the updated restart_lsn and
    > > catalog_xmin are flushed to disk before the effective_catalog_xmin is
    > > updated. Thus, once replication_slot_catalog_xmin advances to a
    > > certain value, even in the event of a crash, users won't encounter any
    > > removed tuples when consuming from slots after a restart. This is
    > > because all slots have their updated restart_lsn flushed to disk,
    > > ensuring that upon restarting, changes are decoded from the updated
    > position where older catalog tuples are no longer needed.
    > 
    > Agreed.
    > 
    > >
    > > BTW, I assume you meant catalog tuples older than XID 150 are removed,
    > > since in the previous example, tuples older than XID 100 are already not
    > useful.
    > 
    > Right. Thank you for pointing this out.
    > 
    > I think we can proceed with the idea proposed by Hou-san. Regarding the
    > patch, while it mostly looks good, it might be worth considering adding more
    > comments:
    > 
    > - If the caller passes already_locked=true to
    > ReplicationSlotsComputeRequiredXmin(), the lock order should also be
    > considered (must acquire RepliationSlotControlLock and then ProcArrayLock).
    > - ReplicationSlotsComputeRequiredXmin() can concurrently run by multiple
    > process, resulting in temporarily moving
    > procArray->replication_slot_catalog_xmin backward, but it's harmless
    > because a smaller catalog_xmin is conservative: it merely prevents VACUUM
    > from removing catalog tuples that could otherwise be pruned. It does not lead
    > to premature deletion of required data.
    
    Thanks for the comments. I added some more comments as suggested.
    
    Here is the updated patch.
    
    Best Regards,
    Hou zj