Thread

  1. Re: How can end users know the cause of LR slot sync delays?

    Shlok Kyal <shlok.kyal.oss@gmail.com> — 2025-11-12T10:55:32Z

    On Wed, 5 Nov 2025 at 11:49, shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Tue, Nov 4, 2025 at 3:17 PM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > On Mon, Nov 3, 2025 at 3:14 PM shveta malik <shveta.malik@gmail.com> wrote:
    > > >
    > > > >
    > > > > I have attached the latest patch.
    > > > >
    > > >
    > > > Thanks. I have started going through it.
    > > >
    > > > I’m not sure if this has already been discussed; I couldn’t find any
    > > > mention of it in the thread. Why don’t we persist
    > > > 'slot_sync_skip_reason' (it is outside of
    > > > ReplicationSlotPersistentData)? If a slot wasn’t synced during the
    > > > last cycle and the server restarts, it would be helpful to know the
    > > > reason it wasn’t synced prior to the node restart.
    > > >
    Actually I did not think in this direction. I think it will be useful
    to persist 'slot_sync_skip_reason'. I have made the change for the
    same in the latest patch.
    
    > >
    > > Please find a few more comments:
    > >
    > > 1)
    > > last_slot_sync_skip
    > >
    > > Will 'last_slotsync_skip_at' be a better name?
    > >
    > > Since we refer worker as slotsync in docs, I feel slotsync seems a
    > > more natural choice than slot_sync. Also '_at' gives clarity that it
    > > is about time rather than a boolean (which currently it seems like).
    > >
    > > Same goes for slot_sync_skip_count and slot_sync_skip_reason. Shall
    > > these be slotsync_skip_count and slotsync_skip_reason.
    > >
    Fixed it.
    
    > > 2)
    > > +update_slot_sync_skip_stats(ReplicationSlot *slot, SlotSyncSkipReason
    > > skip_reason,
    > > + bool acquire_slot)
    > >
    > > It looks like there is only one caller that passes acquire_slot as
    > > true, while all others pass false. Instead of keeping the acquire_slot
    > > parameter, will it be better if we remove it and add an
    > > Assert(MyReplication) to ensure the slot is already acquired? We can
    > > add a comment stating that this function expects the slot to be
    > > acquired by the caller. The one caller that currently passes
    > > acquire_slot = true can acquire the slot explicitly before invoking
    > > this function. Thoughts?
    This idea looks good to me. I have updated the patch accordingly.
    
    > >
    > > 3)
    > > In update_and_persist_local_synced_slot(), we get both
    > > 'found_consistent_snapshot' and 'remote_slot_precedes' from
    > > update_local_synced_slot(). But skipsync-reason for
    > > 'remote_slot_precedes' is updated inside update_local_synced_slot()
    > > while skipsync-reason for '!found_consistent_snapshot' is updated in
    > > caller update_and_persist_local_synced_slot.  Is there a reason for
    > > that?
    > >
    update_and_persist_local_synced_slot  is called when the synced slot
    is in temporary state and we are calling function
    'update_local_synced_slot'  directly for permanent slots.Slot sync
    skip when "remote_slot_precedes" is true can happen for both permanent
    and temporary slot. So I think we need to update the stats in
    "update_local_synced_slot"
    Whereas we are skipping slot sync only for temporary slots when a
    consistent snapshot is not found. So I added this in the function
    "update_and_persist_local_synced_slot".
    
    > > 4)
    > > What about the case where the slot is invalidated and sync is skipped?
    > > I do not see any stats for that. See 'Skip the sync of an invalidated
    > > slot' in synchronize_one_slot(). If it is already discussed and
    > > concluded, please add a comment.
    > >
    It was not discussed earlier.
    The pg_replication_slots already have a column name
    'invalidation_reason'. And when the remote slot is invalidated the
    local slot's is also invalidated. So, should we be required to
    maintain this in 'slotsync_skip_reason' as well? I think it would be
    kind of redundant? Thoughts?
    
    >
    > Few more on 001:
    >
    > 5)
    > The name SS_SKIP_MISSING_WAL_RECORD doesn’t seem appropriate. It
    > sounds more like some WAL issue, rather than indicating that the WAL
    > hasn’t been flushed. A better name could be SS_SKIP_WAL_NOT_FLUSHED.
    >
    I think that the suggested name is better. I have updated the patch accordingly.
    
    > 6)
    > Instead of calling 'update_slot_sync_skip_stats' at multiple places,
    > how about we just update the skip_reason everywhere and make a call to
    > 'update_slot_sync_skip_stats' only in synchronize_one_slot(). IMO,
    > that will look cleaner. Thoughts?
    >
    I tried this approach in [1] (see v2_approach2). This approach would
    require passing extra parameters to the functions.
    Here Amit suggested that we should try an approach where this can be
    avoided. So, I came up with the current approach. See [2].
    
    I have addressed the comments and attached the updated v7 patch.
    
    [1]: https://www.postgresql.org/message-id/CANhcyEXHcdoRRo0N0uib-t7mfkbotv%3DaYjAWAekDAbHCRe%2BBng%40mail.gmail.com
    [2]: https://www.postgresql.org/message-id/CAA4eK1KZLPxv7VBZf%3DBp9%3D-pzKNfvNmFDqFYYzwkowE4FpRs1A%40mail.gmail.com
    
    Thanks,
    Shlok Kyal