Thread

  1. Re: How can end users know the cause of LR slot sync delays?

    Shlok Kyal <shlok.kyal.oss@gmail.com> — 2025-11-18T10:36:53Z

    On Fri, 14 Nov 2025 at 14:13, Hayato Kuroda (Fujitsu)
    <kuroda.hayato@fujitsu.com> wrote:
    >
    > Dear Shlok,
    >
    > Thanks for updating the patch. Few more comments.
    >
    > > > > > I’m not sure if this has already been discussed; I couldn’t find any
    > > > > > mention of it in the thread. Why don’t we persist
    > > > > > 'slot_sync_skip_reason' (it is outside of
    > > > > > ReplicationSlotPersistentData)? If a slot wasn’t synced during the
    > > > > > last cycle and the server restarts, it would be helpful to know the
    > > > > > reason it wasn’t synced prior to the node restart.
    > > > > >
    > > Actually I did not think in this direction. I think it will be useful
    > > to persist 'slot_sync_skip_reason'. I have made the change for the
    > > same in the latest patch.
    >
    > Hmm, I'm wondering it should be written on the disk. Other attributes on the disk
    > are essential to decode or replicate changes correctly, but sync status is not
    > used for the purpose. Personally considered, slot sync would re-start soon after
    > the reboot so that it is OK to start with empty. How about others?
    >
    > If we want to serialize the info, we should do further tasks:
    >  - update SLOT_VERSION
    >  - make the slot dirty then SaveSlotToPath() when the status is updated.
    >
    I agree with your point. Slot synchronization will restart shortly
    after a reboot, so it seems reasonable to begin with an empty state
    rather than persisting slot_sync_skip_reason.
    For now, I’ve updated the patch so that slot_sync_skip_reason is no
    longer persisted; its initialization is kept outside of
    ReplicationSlotPersistentData. I’d also like to hear what others
    think.
    
    > ```
    > +static void
    > +update_slot_sync_skip_stats(ReplicationSlot *slot, SlotSyncSkipReason skip_reason)
    > +{
    > +       Assert(MyReplicationSlot);
    > ```
    >
    > I think no need to require *slot as an argument. We can use the variable to shorten
    > like update_local_synced_slot().
    >
    Fixed
    
    > ```
    > # Verify pg_sync_replication_slots is failing
    > ok( $stderr =~ /skipping slot synchronization because the received slot sync/,
    >         'pg_sync_replication_slots failed as expected');
    > ```
    >
    > This may be matter of taste, but can you check whole of log message? Latter part
    > indicates the actual reason.
    >
    The latter part of the message contains LSN values, which are not
    stable across runs. To avoid hard-coding specific LSNs, I matched the
    fixed, non-variable parts of the message while still covering the
    reason for the failure.
    
    > ```
    > # Detach injection point
    > $standby->safe_psql(
    >         'postgres', q{
    >         SELECT injection_points_detach('slot-sync-skip');
    >         SELECT injection_points_wakeup('slot-sync-skip');
    > });
    > ```
    >
    > Not mandatory, but you can quit the background session if you release the
    > injection point.
    Fixed
    
    Apart from the above changes. I have renamed the functions to
    consistently use the term 'slotsync' instead of 'slot_sync'
    update_slot_sync_skip_stats -> update_slotsync_skip_stats
    pgstat_report_replslot_sync_skip -> pgstat_report_replslotsync_skip
    
    I have attached the updated v8 patch with the latest changes.
    
    Thanks,
    Shlok Kyal