Thread
-
Re: How can end users know the cause of LR slot sync delays?
Shlok Kyal <shlok.kyal.oss@gmail.com> — 2025-11-12T10:57:56Z
On Fri, 31 Oct 2025 at 11:30, Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Shlok, > > Thanks for updating the patch. Few comments: > > ``` > The reason for the last slot synchronization skip. This field is set only > for logical slots that are being synced from a primary server (that is, > those whose <structfield>synced</structfield> field is > <literal>true</literal>). > ``` > > What happens if the slot has a skip reason and the standby is promoted? > Will the attribute be retained? If so, do we have to add some notes like "sync"? > I tested it and I agree that slot sync skip stats will be retained. I have updated the docs accordingly > ``` > +/* Update slot sync skip stats */ > +static void > +update_slot_sync_skip_stats(ReplicationSlot *slot, SlotSyncSkipReason skip_reason, > + bool acquire_slot) > ``` > > Let's follow existing codes; ReplicationSlotSetInactiveSince(), third argument > can be `acquire_lock`. > I have updated it according to latest comment by Shveta in [1] > ``` > + /* > + * Update the slot sync related stats in pg_stat_replication_slot when a > + * slot sync is skipped > + */ > + if (skip_reason != SS_SKIP_NONE) > + pgstat_report_replslot_sync_skip(slot); > ``` > > Is it OK to call pgstat_report_replslot_sync_skip() without any locks? > I think we need to acquire the slot before the call to pgstat_report_replslot_sync_skip. This ensures 'pgstat_acquire_replslot' is called and the stats entry for the slot is already present when we are updating the slot. Also in a similar function 'pgstat_report_replslot', the comments says we should ensure that pgstat_create_replslot() or pgstat_acquire_replslot() is called before updating the stats. > ``` > ReplicationSlotAcquire(NameStr(slot->data.name), true, true); > ``` > > Can you clarify the reason error_if_invalid=true? Other codes in the file use > error_if_invalid=false. > I agree we should keep error_if_invalid=false because if error_if_invalid=true and we try to acquire an invalidated slot an error is thrown and it will exit the slotsync worker. > ``` > + /* Update the slot sync skip reason */ > + SpinLockAcquire(&slot->mutex); > + if (slot->slot_sync_skip_reason != skip_reason) > + slot->slot_sync_skip_reason = skip_reason; > + SpinLockRelease(&slot->mutex); > ``` > > Now the replication slot can be always acquired. Do we still have to acquire the > spinlock even for reading the value? In other words, can we move SpinLockAcquire() > and SpinLockRelease() into inside the if block? > I agree that we use SpinLockAcquire() and SpinLockRelease() inside the if block. I have fixed it. > ``` > # Copyright (c) 2024-2025, PostgreSQL Global Development Group > ``` > > I think 2024 can be removed. > Fixed > ``` > my $primary = PostgreSQL::Test::Cluster->new('publisher'); > ``` > > s/publisher/primary/. > Fixed > ``` > # Pause steaming replication connection so that standby can lag behind > unlink($primary->data_dir . '/pg_hba.conf'); > $primary->append_conf( > 'pg_hba.conf', qq{ > local all all trust > host all all 127.0.0.1/32 trust > host all all ::1/128 trust > }); > $primary->restart; > ``` > > Not sure it can be called like "Pause". how about like: > ``` > Update pg_hba.conf and restart primar to reject streaming replication connections. > WAL records won't be replicated to the standby until .conf is restored. > ``` Fixed > ``` > # Attempt to sync replication slots while standby is behind > ($result, $stdout, $stderr) = > $standby->psql('postgres', "SELECT pg_sync_replication_slots();"); > ``` > > Can you verify the $stderr that synchornization was failed? I cannot find other > tests which checks the message. It is enough to do once. Added > ``` > $result = $standby->safe_psql( > 'postgres', > "SELECT slot_sync_skip_reason FROM pg_replication_slots > WHERE slot_name = 'slot_sync' AND synced AND NOT temporary" > ); > is($result, 'missing_wal_record', "slot sync skip when standby is behind"); > ``` > > I found the test does twice; can we remove second one? > Fixed > ``` > # Cleanup: drop the logical slot and ensure standby catches up > $primary->safe_psql('postgres', > "SELECT pg_drop_replication_slot('slot_sync')"); > $primary->wait_for_replay_catchup($standby); > > $standby->safe_psql('postgres', "SELECT pg_sync_replication_slots();"); > > # Test for case when slot sync is skipped when the remote slot is > # behind the local slot. > $primary->safe_psql('postgres', > "SELECT pg_create_logical_replication_slot('slot_sync', 'test_decoding', false, false, true)" > ); > ``` > > Can we use reset function instead of dropping it? > Here the test checks 'remote_behind' case for first slot sync cycle (when slot is in temporary slot). To achieve that we need to recreate the slot. I have addressed the comment and attached the updated patch in [1]. [1]: https://www.postgresql.org/message-id/CANhcyEXEyUt8TycOqkgdWT%2BNeJASM%3D1acU1dK64UNsJeKMQFQA%40mail.gmail.com Thanks, Shlok Kyal