Thread

  1. Re: How can end users know the cause of LR slot sync delays?

    Shlok Kyal <shlok.kyal.oss@gmail.com> — 2025-11-12T10:57:56Z

    On Fri, 31 Oct 2025 at 11:30, Hayato Kuroda (Fujitsu)
    <kuroda.hayato@fujitsu.com> wrote:
    >
    > Dear Shlok,
    >
    > Thanks for updating the patch. Few comments:
    >
    > ```
    >        The reason for the last slot synchronization skip. This field is set only
    >        for logical slots that are being synced from a primary server (that is,
    >        those whose <structfield>synced</structfield> field is
    >        <literal>true</literal>).
    > ```
    >
    > What happens if the slot has a skip reason and the standby is promoted?
    > Will the attribute be retained? If so, do we have to add some notes like "sync"?
    >
    I tested it and I agree that slot sync skip stats will be retained. I
    have updated the docs accordingly
    
    > ```
    > +/* Update slot sync skip stats */
    > +static void
    > +update_slot_sync_skip_stats(ReplicationSlot *slot, SlotSyncSkipReason skip_reason,
    > +                                                       bool acquire_slot)
    > ```
    >
    > Let's follow existing codes; ReplicationSlotSetInactiveSince(), third argument
    > can be `acquire_lock`.
    >
    I have updated it according to latest comment by Shveta in [1]
    
    > ```
    > +       /*
    > +        * Update the slot sync related stats in pg_stat_replication_slot when a
    > +        * slot sync is skipped
    > +        */
    > +       if (skip_reason != SS_SKIP_NONE)
    > +               pgstat_report_replslot_sync_skip(slot);
    > ```
    >
    > Is it OK to call pgstat_report_replslot_sync_skip() without any locks?
    >
    I think we need to acquire the slot before the call to
    pgstat_report_replslot_sync_skip. This ensures
    'pgstat_acquire_replslot' is called and the stats entry for the slot
    is already present when we are updating the slot. Also in a similar
    function 'pgstat_report_replslot', the comments says we should ensure
    that pgstat_create_replslot() or pgstat_acquire_replslot() is called
    before updating the stats.
    
    > ```
    >                 ReplicationSlotAcquire(NameStr(slot->data.name), true, true);
    > ```
    >
    > Can you clarify the reason error_if_invalid=true? Other codes in the file use
    > error_if_invalid=false.
    >
    I agree we should keep error_if_invalid=false because if
    error_if_invalid=true and we try to acquire an invalidated slot an
    error is thrown and it will exit the slotsync worker.
    
    > ```
    > +       /* Update the slot sync skip reason */
    > +       SpinLockAcquire(&slot->mutex);
    > +       if (slot->slot_sync_skip_reason != skip_reason)
    > +               slot->slot_sync_skip_reason = skip_reason;
    > +       SpinLockRelease(&slot->mutex);
    > ```
    >
    > Now the replication slot can be always acquired. Do we still have to acquire the
    > spinlock even for reading the value? In other words, can we move SpinLockAcquire()
    > and SpinLockRelease() into inside the if block?
    >
    I agree that we use SpinLockAcquire() and SpinLockRelease() inside the
    if block. I have fixed it.
    
    > ```
    > # Copyright (c) 2024-2025, PostgreSQL Global Development Group
    > ```
    >
    > I think 2024 can be removed.
    >
    Fixed
    
    > ```
    > my $primary = PostgreSQL::Test::Cluster->new('publisher');
    > ```
    >
    > s/publisher/primary/.
    >
    Fixed
    
    > ```
    > # Pause steaming replication connection so that standby can lag behind
    > unlink($primary->data_dir . '/pg_hba.conf');
    > $primary->append_conf(
    >         'pg_hba.conf', qq{
    > local   all             all                                     trust
    > host    all             all             127.0.0.1/32            trust
    > host    all             all             ::1/128                 trust
    > });
    > $primary->restart;
    > ```
    >
    > Not sure it can be called like "Pause". how about like:
    > ```
    > Update pg_hba.conf and restart primar to reject streaming replication connections.
    > WAL records won't be replicated to the standby until .conf is restored.
    > ```
    Fixed
    
    > ```
    > # Attempt to sync replication slots while standby is behind
    > ($result, $stdout, $stderr) =
    >   $standby->psql('postgres', "SELECT pg_sync_replication_slots();");
    > ```
    >
    > Can you verify the $stderr that synchornization was failed? I cannot find other
    > tests which checks the message. It is enough to do once.
    Added
    
    > ```
    > $result = $standby->safe_psql(
    >         'postgres',
    >         "SELECT slot_sync_skip_reason FROM pg_replication_slots
    >      WHERE slot_name = 'slot_sync' AND synced AND NOT temporary"
    > );
    > is($result, 'missing_wal_record', "slot sync skip when standby is behind");
    > ```
    >
    > I found the test does twice; can we remove second one?
    >
    Fixed
    
    > ```
    > # Cleanup: drop the logical slot and ensure standby catches up
    > $primary->safe_psql('postgres',
    >         "SELECT pg_drop_replication_slot('slot_sync')");
    > $primary->wait_for_replay_catchup($standby);
    >
    > $standby->safe_psql('postgres', "SELECT pg_sync_replication_slots();");
    >
    > # Test for case when slot sync is skipped when the remote slot is
    > # behind the local slot.
    > $primary->safe_psql('postgres',
    >         "SELECT pg_create_logical_replication_slot('slot_sync', 'test_decoding', false, false, true)"
    > );
    > ```
    >
    > Can we use reset function instead of dropping it?
    >
    Here the test checks 'remote_behind' case for first slot sync cycle
    (when slot is in temporary slot). To achieve that we need to recreate
    the slot.
    
    I have addressed the comment and attached the updated patch in [1].
    [1]: https://www.postgresql.org/message-id/CANhcyEXEyUt8TycOqkgdWT%2BNeJASM%3D1acU1dK64UNsJeKMQFQA%40mail.gmail.com
    
    Thanks,
    Shlok Kyal