Thread

Re: How can end users know the cause of LR slot sync delays?

Shlok Kyal <shlok.kyal.oss@gmail.com> — 2025-11-12T10:55:32Z
On Wed, 5 Nov 2025 at 11:49, shveta malik <shveta.malik@gmail.com> wrote:
>
> On Tue, Nov 4, 2025 at 3:17 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Mon, Nov 3, 2025 at 3:14 PM shveta malik <shveta.malik@gmail.com> wrote:
> > >
> > > >
> > > > I have attached the latest patch.
> > > >
> > >
> > > Thanks. I have started going through it.
> > >
> > > I’m not sure if this has already been discussed; I couldn’t find any
> > > mention of it in the thread. Why don’t we persist
> > > 'slot_sync_skip_reason' (it is outside of
> > > ReplicationSlotPersistentData)? If a slot wasn’t synced during the
> > > last cycle and the server restarts, it would be helpful to know the
> > > reason it wasn’t synced prior to the node restart.
> > >
Actually I did not think in this direction. I think it will be useful
to persist 'slot_sync_skip_reason'. I have made the change for the
same in the latest patch.

> >
> > Please find a few more comments:
> >
> > 1)
> > last_slot_sync_skip
> >
> > Will 'last_slotsync_skip_at' be a better name?
> >
> > Since we refer worker as slotsync in docs, I feel slotsync seems a
> > more natural choice than slot_sync. Also '_at' gives clarity that it
> > is about time rather than a boolean (which currently it seems like).
> >
> > Same goes for slot_sync_skip_count and slot_sync_skip_reason. Shall
> > these be slotsync_skip_count and slotsync_skip_reason.
> >
Fixed it.

> > 2)
> > +update_slot_sync_skip_stats(ReplicationSlot *slot, SlotSyncSkipReason
> > skip_reason,
> > + bool acquire_slot)
> >
> > It looks like there is only one caller that passes acquire_slot as
> > true, while all others pass false. Instead of keeping the acquire_slot
> > parameter, will it be better if we remove it and add an
> > Assert(MyReplication) to ensure the slot is already acquired? We can
> > add a comment stating that this function expects the slot to be
> > acquired by the caller. The one caller that currently passes
> > acquire_slot = true can acquire the slot explicitly before invoking
> > this function. Thoughts?
This idea looks good to me. I have updated the patch accordingly.

> >
> > 3)
> > In update_and_persist_local_synced_slot(), we get both
> > 'found_consistent_snapshot' and 'remote_slot_precedes' from
> > update_local_synced_slot(). But skipsync-reason for
> > 'remote_slot_precedes' is updated inside update_local_synced_slot()
> > while skipsync-reason for '!found_consistent_snapshot' is updated in
> > caller update_and_persist_local_synced_slot.  Is there a reason for
> > that?
> >
update_and_persist_local_synced_slot  is called when the synced slot
is in temporary state and we are calling function
'update_local_synced_slot'  directly for permanent slots.Slot sync
skip when "remote_slot_precedes" is true can happen for both permanent
and temporary slot. So I think we need to update the stats in
"update_local_synced_slot"
Whereas we are skipping slot sync only for temporary slots when a
consistent snapshot is not found. So I added this in the function
"update_and_persist_local_synced_slot".

> > 4)
> > What about the case where the slot is invalidated and sync is skipped?
> > I do not see any stats for that. See 'Skip the sync of an invalidated
> > slot' in synchronize_one_slot(). If it is already discussed and
> > concluded, please add a comment.
> >
It was not discussed earlier.
The pg_replication_slots already have a column name
'invalidation_reason'. And when the remote slot is invalidated the
local slot's is also invalidated. So, should we be required to
maintain this in 'slotsync_skip_reason' as well? I think it would be
kind of redundant? Thoughts?

>
> Few more on 001:
>
> 5)
> The name SS_SKIP_MISSING_WAL_RECORD doesn’t seem appropriate. It
> sounds more like some WAL issue, rather than indicating that the WAL
> hasn’t been flushed. A better name could be SS_SKIP_WAL_NOT_FLUSHED.
>
I think that the suggested name is better. I have updated the patch accordingly.

> 6)
> Instead of calling 'update_slot_sync_skip_stats' at multiple places,
> how about we just update the skip_reason everywhere and make a call to
> 'update_slot_sync_skip_stats' only in synchronize_one_slot(). IMO,
> that will look cleaner. Thoughts?
>
I tried this approach in [1] (see v2_approach2). This approach would
require passing extra parameters to the functions.
Here Amit suggested that we should try an approach where this can be
avoided. So, I came up with the current approach. See [2].

I have addressed the comments and attached the updated v7 patch.

[1]: https://www.postgresql.org/message-id/CANhcyEXHcdoRRo0N0uib-t7mfkbotv%3DaYjAWAekDAbHCRe%2BBng%40mail.gmail.com
[2]: https://www.postgresql.org/message-id/CAA4eK1KZLPxv7VBZf%3DBp9%3D-pzKNfvNmFDqFYYzwkowE4FpRs1A%40mail.gmail.com

Thanks,
Shlok Kyal