Thread

  1. Re: Assertion failure in SnapBuildInitialSnapshot()

    Masahiko Sawada <sawada.mshk@gmail.com> — 2025-12-30T18:52:43Z

    On Mon, Dec 29, 2025 at 7:55 PM Chao Li <li.evan.chao@gmail.com> wrote:
    >
    >
    >
    > > On Dec 30, 2025, at 06:14, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
    > >
    > > On Thu, Dec 18, 2025 at 7:19 PM Zhijie Hou (Fujitsu)
    > > <houzj.fnst@fujitsu.com> wrote:
    > >>
    > >> On Friday, December 19, 2025 3:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
    > >>>
    > >>> On Tue, Dec 9, 2025 at 7:32 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
    > >>> wrote:
    > >>>>
    > >>>> On Wednesday, December 10, 2025 7:25 AM Masahiko Sawada
    > >>> <sawada.mshk@gmail.com> wrote:
    > >>>>>
    > >>>>> On Tue, Nov 25, 2025 at 10:25 PM Zhijie Hou (Fujitsu)
    > >>>>> <houzj.fnst@fujitsu.com> wrote:
    > >>>>>>
    > >>>>>> On Wednesday, November 26, 2025 2:57 AM Masahiko Sawada
    > >>>>> <sawada.mshk@gmail.com> wrote:
    > >>>>>>> Right. But the following scenario seems to happen:
    > >>>>>>>
    > >>>>>>> 1. Both processes have a slot with effective_catalog_xmin = 100.
    > >>>>>>> 2. Process-A updates effective_catalog_xmin to 150, and computes
    > >>>>>>> the
    > >>>>> new
    > >>>>>>> catalog_xmin as 100 because process-B slot still has
    > >>>>> effective_catalog_xmin =
    > >>>>>>> 100.
    > >>>>>>> 3. Process-B updates effective_catalog_xmin to 150, and computes
    > >>>>>>> the
    > >>>>> new
    > >>>>>>> catalog_xmin as 150.
    > >>>>>>> 4. Process-B updates procArray->replication_slot_catalog_xmin to
    > >>> 150.
    > >>>>>>> 5. Process-A updates procArray->replication_slot_catalog_xmin to
    > >>> 100.
    > >>>>>>
    > >>>>>> I think this scenario can occur, but is not harmful. Because the
    > >>>>>> catalog rows removed prior to xid:150 would no longer be used, as
    > >>>>>> both slots have
    > >>>>> advanced
    > >>>>>> their catalog_xmin and flushed the value to disk. Therefore, even
    > >>>>>> if replication_slot_catalog_xmin regresses, it should be OK.
    > >>>>>>
    > >>>>>> Considering all above, I think allowing concurrent xmin
    > >>>>>> computation, as the patch does, is acceptable. What do you think ?
    > >>>>>
    > >>>>> I agree with your analysis. Another thing I'd like to confirm is
    > >>>>> that in an extreme case, if the server crashes suddenly after
    > >>>>> removing catalog tuples older than XID 100 and logical decoding
    > >>>>> restarts, it ends up missing necessary catalog tuples? I think it's
    > >>>>> not a problem as long as the subscriber knows the next commit LSN
    > >>>>> they want but could it be problematic if the user switches to use
    > >>>>> the logical decoding SQL API? I might be worrying too much, though.
    > >>>>
    > >>>> I think this case is not a problem because:
    > >>>>
    > >>>> In LogicalConfirmReceivedLocation, the updated restart_lsn and
    > >>>> catalog_xmin are flushed to disk before the effective_catalog_xmin is
    > >>>> updated. Thus, once replication_slot_catalog_xmin advances to a
    > >>>> certain value, even in the event of a crash, users won't encounter any
    > >>>> removed tuples when consuming from slots after a restart. This is
    > >>>> because all slots have their updated restart_lsn flushed to disk,
    > >>>> ensuring that upon restarting, changes are decoded from the updated
    > >>> position where older catalog tuples are no longer needed.
    > >>>
    > >>> Agreed.
    > >>>
    > >>>>
    > >>>> BTW, I assume you meant catalog tuples older than XID 150 are removed,
    > >>>> since in the previous example, tuples older than XID 100 are already not
    > >>> useful.
    > >>>
    > >>> Right. Thank you for pointing this out.
    > >>>
    > >>> I think we can proceed with the idea proposed by Hou-san. Regarding the
    > >>> patch, while it mostly looks good, it might be worth considering adding more
    > >>> comments:
    > >>>
    > >>> - If the caller passes already_locked=true to
    > >>> ReplicationSlotsComputeRequiredXmin(), the lock order should also be
    > >>> considered (must acquire RepliationSlotControlLock and then ProcArrayLock).
    > >>> - ReplicationSlotsComputeRequiredXmin() can concurrently run by multiple
    > >>> process, resulting in temporarily moving
    > >>> procArray->replication_slot_catalog_xmin backward, but it's harmless
    > >>> because a smaller catalog_xmin is conservative: it merely prevents VACUUM
    > >>> from removing catalog tuples that could otherwise be pruned. It does not lead
    > >>> to premature deletion of required data.
    > >>
    > >> Thanks for the comments. I added some more comments as suggested.
    > >>
    > >> Here is the updated patch.
    > >
    > > Thank you for updating the patch! The patch looks good to me.
    > >
    > > I've made minor changes to the comment and commit message and created
    > > patches for backbranches. I'm going to push them, barring any
    > > objections.
    > >
    > > Regards,
    > >
    > > --
    > > Masahiko Sawada
    > > Amazon Web Services: https://aws.amazon.com
    > > <master_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch><REL_18_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch><REL_17_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch><REL_16_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch><REL_14_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch><REL_15_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch><REL_13_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch>
    >
    >
    > I’ve just looked through the patch for master. The fix itself looks solid to me. I only noticed a few minor comment nits:
    >
    > 1
    > ```
    > + * ProcArrayLock, to prevent any undetectable deadlocks since this function
    > + * acquire them in that order.
    > ```
    >
    > acquire -> acquires
    >
    > 2
    > ```
    > +        * values, so no backend update the initial xmin for newly created slot
    > ```
    >
    > Update -> updates
    >
    > 3
    > ```
    > +        * slot machinery about the new limit. Once that's done the both locks
    > ```
    >
    > “The both locks”, feels like “the” is not needed.
    >
    
    Thank you for reviewing the patches! I'll incorporate these comments
    before push.
    
    I totally overlooked the fact that we no longer support PG13, so I'm
    going to push them down to 14.
    
    Regards,
    
    -- 
    Masahiko Sawada
    Amazon Web Services: https://aws.amazon.com