Thread
-
Re: How can end users know the cause of LR slot sync delays?
Shlok Kyal <shlok.kyal.oss@gmail.com> — 2025-10-30T12:16:02Z
On Mon, 20 Oct 2025 at 14:27, Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Shlok, > > > > 01. > > > ``` > > > + /* Update the slot sync reason */ > > > + SpinLockAcquire(&slot->mutex); > > > + if (slot->slot_sync_skip_reason != skip_reason) > > > + slot->slot_sync_skip_reason = skip_reason; > > > + SpinLockRelease(&slot->mutex); > > > ``` > > > > > > Per my understanding, spinlock is acquired when the attribute on the shared > > memory > > > is updating. Can you check other parts and follow the rukle? > > > > > > 02. > > > ``` > > > + SpinLockAcquire(&slot->mutex); > > > + synced = slot->data.synced; > > > + SpinLockRelease(&slot->mutex); > > > ``` > > > > > > Same as 1. > > > > > I checked and found the following comment: > > * - Individual fields are protected by mutex where only the backend owning > > * the slot is authorized to update the fields from its own slot. The > > * backend owning the slot does not need to take this lock when reading its > > * own fields, while concurrent backends not owning this slot should take the > > * lock when reading this slot's data. > > I realised that the patch does not entirely follow the above rule. As per my understanding rule says: 1. If we want to update a field of a slot we need to own the slot (or we can say acquire the slot) 2. In the above case we also need to take a Spinlock on the slot to update any field. 3. If we want to read a field and if we own the slot we do not need a spinlock on the slot. 4. If we want to read a field and if we do not own the slot we need a spinlock on the slot. For the current patch (v5), in the function "synchronize_one_slot" we are calling "update_slot_sync_skip_stats" even if we do not own the slot. So, as per the rule I have updated "update_slot_sync_skip_stats" to own the slot before updating. > > So for the above two cases we are updating the > > 'slot->slot_sync_skip_reason' and reading 'slot->data.synced' and this > > can happen before the slot sync worker acquires the slot or owns the > > slot. > > Also in the same code at a later stage we are again checking the > > synced flag and we do that while holding a spin lock. Based on these > > observations I think we should take Spinlock in both cases. > > Hmm, regarding the update_slot_sync_skip_stats(), the replication slot has already been > acquired except synchronize_one_slot() case. > Can we avoid acquiring the spinlock as much as possible by adding an argument? > Or it just introduces additional complexity? > After updating the code as per rule, I think we always have to take a Spinlock on the slot when we are updating any field. > > > 09. > > > ``` > > > +my $connstr_1 = $primary->connstr; > > > ``` > > > > > > Since this is an only connection string in the test, suffix _1 is not needed. > > > > > Fixed > > Same as the comment, can you replace "standby1" to "stanby"? > Fixed > > > > > 10. > > > ``` > > > +# Simulate standby connection failure by modifying pg_hba.conf > > > +unlink($primary->data_dir . '/pg_hba.conf'); > > > +$primary->append_conf('pg_hba.conf', > > > + qq{local all all > > trust} > > > +); > > > ``` > > > > > > What if the system does not have Unix domain socket? I'm afraid all connections > > > could be brocked in this case. > > > > > I have used an injection point to simulate this scenario instead of > > changing the contents of pg_hba.conf files. > > Can you clarify the reason why you used the injection point? > I'm not sure the injection point is beneficial here. I feel the point can be added > when we handle the timing issue, race condition etc, but walreceiver may not have > strong reasons to stop exact at that point. > > Regarding the content of pg_hba.conf, I felt below lines might be enough: > ``` > local all all trust > host all all 127.0.0.1/32 trust > ``` > I checked this. By default pg_hba.conf has contents as: ``` # "local" is for Unix domain socket connections only local all all trust # IPv4 local connections: host all all 127.0.0.1/32 trust # IPv6 local connections: host all all ::1/128 trust # Allow replication connections from localhost, by a user with the # replication privilege. local replication all trust host replication all 127.0.0.1/32 trust host replication all ::1/128 trust ``` Now for our test to prevent the streaming replication we can set pg_hba.conf to ``` local all all trust host all all 127.0.0.1/32 trust host all all ::1/128 trust ``` And then to restore streaming replication we can add following to pg_hba.conf: ``` local replication all trust host replication all 127.0.0.1/32 trust host replication all ::1/128 trust ``` I think this would be sufficient for our testing. Thoughts? > Also, here are comments for v5. > > ``` > + <para> > + Reason of the last slot synchronization skip. > + </para></entry> > ``` > > Possible values must be clarified. This was posted in [1] but seemed to be missed. Sorry, I missed it. I have updated it in the latest patch. > > ``` > + /* Update the slot sync reason */ > ``` > > It is better to clarify updating the *skip* reason Fixed > > ``` > - ReplicationSlot *slot; > + ReplicationSlot *slot = NULL; > ``` > > No need to initialize as NULL. > Fixed > ``` > +#include "utils/injection_point.h" > ... > + INJECTION_POINT("walreceiver", NULL); > ``` > > As I told above, I have a concern to add the injection point. I want to hear > other's opinion as well. > Removed it for now as per my analysis we can modify pg_hba.conf to simulate the scenario. > ``` > + else > + { > + /* Update the slot sync stats */ > + Assert(!found_consistent_snapshot || > + *found_consistent_snapshot); > + update_slot_sync_skip_stats(slot, SS_SKIP_NONE); > + } > ``` > > Your patch may have another issue; if both confirmed_flush_lsn are the same > but we do not have the consistent snapshot yet, we would get the assertion failure. > (Again, not sure it can really happen) > Can we use the condition as another if part? At that time we must clarify why > it is OK to pass in case of found_consistent_snapshot == NULL. > Fixed > ``` > +# Attach injection point to simulate wait > +$standby_psql->query_safe( > + q(select injection_points_attach('slot-sync-skip','wait'))); > ``` > > I have been considering whether we can remove the injection point here or not. > I think the point is used because the being synchronized slot is still temporary > one; they would be cleaned up by ReplicationSlotCleanup(). > Can we reproduce the skip event for the permanent slot? I cannot come up with, > but if possible no need to introduce the injection point. I tried reproducing it but was not able to come up with a test without injection point. Will further try to reproduce it without injection point. > > [1]: https://www.postgresql.org/message-id/OSCPR01MB14966A618A8C61EC3DEE486A4F517A%40OSCPR01MB14966.jpnprd01.prod.outlook.com > I have attached the latest patch. Thanks, Shlok Kyal