Thread

  1. RE: Newly created replication slot may be invalidated by checkpoint

    Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> — 2025-12-09T02:06:42Z

    On Tuesday, December 9, 2025 12:40 AM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:
    > 
    > On Monday, December 08, 2025 13:24 MSK, "Zhijie Hou (Fujitsu)"
    > <houzj.fnst@fujitsu.com> wrote:
    > 
    > > On Monday, December 8, 2025 5:47 PM Amit Kapila
    > <amit.kapila16@gmail.com> wrote:
    > > > > > Sawada-san/Vitaly, do you have any opinion on patch or the
    > > > > > direction to fix? The idea is to get this fixed for HEAD and 18,
    > > > > > then continue discussion for other bank-branches and the remaining
    > patches.
    > 
    > Hi Amit, Zhijie Hou
    > 
    > Thank you for preparing and comiting 0001 patch. I'm ok with it. I did some
    > auto testing of the patch and haven't found any problems. As I realized,
    > another two patches (0002, 0003) are still in review.
    
    Thanks for testing!
    
    > 
    > In my previous email I wrote about copy_replication_slot, where restart_lsn is
    > assigned without any locks, but I'm not sure that email was successfully
    > delivered. Masahiko Sagawa mentioned about it in one of the latest emails as
    > well. I also read the answer but not completely understood it at the moment,
    > sorry (need some more time to investigate). Anyway, I would prefer to use
    > locks in create_physical_replication_slot rather than rely on signals handling
    > which may be changed in the future.
    
    If we want to improve that, taking lock when updating restart_lsn does not work,
    because the initial restart_lsn is an old position copied from another slot,
    where the WALs could have already been removed, so unlike the mechanism
    mentioned in 006dd4b2, the lock cannot ensure the same. I think we might need
    to some other solutions for it which can be discussed separately.
    
    > 
    > One more thing, when we copy a logical replication slot,
    > DecodingContextFindStartpoint reads the WAL from the specified restart_lsn
    > which may be removed by a concurrent checkpoint. It can produce an error
    > and stop slot copying, I guess. This behaviour may be not desirable.
    
    Given that we don't search for a starting point for copied slots (we
    pass find_startpoint=false when copying), I don't think this issue exists.
    
    Best Regards,
    Hou zj