Thread

  1. Re: How can end users know the cause of LR slot sync delays?

    Shlok Kyal <shlok.kyal.oss@gmail.com> — 2025-11-26T06:28:41Z

    On Wed, 26 Nov 2025 at 10:25, shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Wed, Nov 26, 2025 at 9:30 AM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
    > >
    > > On Wed, 26 Nov 2025 at 09:00, Hayato Kuroda (Fujitsu)
    > > <kuroda.hayato@fujitsu.com> wrote:
    > > >
    > > > Dear Hou,
    > > >
    > > > > I think we did not sync slots to standby2, so I removed the checks for that.
    > > > >
    > > > > I also adjusted the test in a way that it cleans up existing slots before starting
    > > > > new tests.
    > > >
    > > > Thanks for updating the patch. I confirmed on my env that your patch could be
    > > > applied cleanly and tests were passed. Pgperltidy say nothing for your patch.
    > > > Also, I preferred the current style.
    > > >
    > > > I think it is worth checking on BF to see how they say.
    > > >
    > >
    > > Thanks Amit for pushing the 0001 patch.
    > > Thanks Hou-san and Kuroda-san on fixing the test.
    > >
    > > I have rebased the 0002 patch on the current HEAD.
    > >
    >
    > Thanks. Please find a few comments:
    >
    > 1)
    >
    > +       The reason for the last slot synchronization skip. This field
    > is set only
    > +       for logical slots that are being synchronized from a primary
    > server (that
    > +       is, those whose <structfield>synced</structfield> field is
    > +       <literal>true</literal>). The value of this column has no meaning on the
    > +       primary server; it defaults to <literal>none</literal> for all
    > slots, but
    > +       may (if leftover from a promoted standby) also have a value other than
    > +       <literal>none</literal>. Possible values are:
    >
    > We can make this similar to existing fields (slotsync_skip_count and
    > slotsync_skip_at):
    >
    > Slot synchronization occurs only on standby servers and thus this column has
    > no meaning on the primary server.
    >
    > 2)
    > Doc on possible values of slotsync_skip_reason can be improved. Example
    >
    > +          <literal>remote_behind</literal> means that the last slot
    > +          synchronization was skipped because the slot is ahead of the
    > +          corresponding failover slot on the primary.
    >
    > We can get rid of 'last slot synchronization was skipped' from all the
    > reasons. (See 'invalidation_reason' possible values for reference, it
    > does not mention 'was invalidated' in any).
    >
    > 3)
    > +          <literal>wal_not_flushed</literal> means that the last slot
    > +          synchronization was skipped because the standby had not flushed the
    > +          WAL corresponding to the confirmed flush position on the remote slot.
    >
    > I am not sure if we need to mention 'confirmed flush position'. Shall we say:
    >
    > '.....because the standby had not flushed the WAL corresponding to the
    > position reserved on the remote slot'.
    > Thoughts?
    >
    I think the suggested wording would be more clear to understand for
    the user. Added the change.
    
    > 4)
    > +          <literal>none</literal> means that the last slot synchronization
    > +          completed successfully.
    >
    > Do we even need to mention 'none' in doc? 'invalidation_reason' does
    > not mention it.
    >
    > 5)
    > postgres=# select slot_name, invalidation_reason, slotsync_skip_reason
    > from pg_replication_slots;
    >    slot_name    | invalidation_reason | slotsync_skip_reason
    > ----------------+---------------------+----------------------
    >  failover_slot2 |                     | none
    >
    > Shall we keep 'slotsync_skip_reason' as NULL instead of 'none' similar
    > to invalidation_reason. Thoughts?
    >
    I agree with your suggestion. I have removed the 'none' value and used
    NULL instead.
    
    > 6)
    > If we plan to keep slotsync_skip_reason as NULL instead of 'none' for
    > non-skipped cases (above comment), then below code can be optimised as
    > we do not need to update 'none' as stats.
    > 'skip_reason' and last update() call can then be removed and we can
    > simply call update_slotsync_skip_stats() instead of 'skip_reason =
    > SS_SKIP_NO_CONSISTENT_SNAPSHOT'.
    >
    > update_local_synced_slot():
    >
    > + SlotSyncSkipReason skip_reason = SS_SKIP_NONE;
    > + update_slotsync_skip_stats(SS_SKIP_REMOTE_BEHIND);
    > + skip_reason = SS_SKIP_NO_CONSISTENT_SNAPSHOT;
    > + /* Update slot sync skip stats */
    > + update_slotsync_skip_stats(skip_reason);
    >
    I think we need this change even if we use NULL instead of 'none'.
    This change ensures that the slot sync reason is set to NULL if slot
    sync is successful.
    
    > 7)
    >
    > +static char *
    > +GetSlotSyncSkipReason(SlotSyncSkipReason reason)
    >
    > We are passing 'SlotSyncSkipReason' and function name says
    > 'GetSlotSyncSkipReason', looks confusing.
    > Shall we rename function name to GetSlotSyncSkipString or
    > GetSlotSyncSkipReasonName (similar to GetSlotInvalidationCauseName)
    >
    > 8)
    >
    > + * A slotsync skip typically occurs only for temporary slots. For
    > + * persistent slots it is extremely rare (e.g., cases like
    > + * SS_SKIP_WAL_NOT_FLUSHED or SS_SKIP_REMOTE_BEHIND). Also, temporary
    > + * slots are dropped after server restart, so there is no value in
    > + * persisting the slotsync_skip_reason.
    > + */
    > + SlotSyncSkipReason slotsync_skip_reason;
    >
    > I feel, 'Also-->Since' will make more sense here.
    >
    >
    > 9)
    > In doc for slotsync_skip_count and slotsync_skip_at
    >
    > + Slot
    > +        synchronization occur only on standby servers and thus this column has
    > +        no meaning on the primary server.
    >
    > occur --> occurs
    
    I have also addressed the remaining comments and attached the updated patch.
    
    Thanks,
    Shlok Kyal