Thread

Re: How can end users know the cause of LR slot sync delays?

Shlok Kyal <shlok.kyal.oss@gmail.com> — 2025-10-30T12:16:02Z
On Mon, 20 Oct 2025 at 14:27, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Shlok,
>
> > > 01.
> > > ```
> > > +       /* Update the slot sync reason */
> > > +       SpinLockAcquire(&slot->mutex);
> > > +       if (slot->slot_sync_skip_reason != skip_reason)
> > > +               slot->slot_sync_skip_reason = skip_reason;
> > > +       SpinLockRelease(&slot->mutex);
> > > ```
> > >
> > > Per my understanding, spinlock is acquired when the attribute on the shared
> > memory
> > > is updating. Can you check other parts and follow the rukle?
> > >
> > > 02.
> > > ```
> > > +                       SpinLockAcquire(&slot->mutex);
> > > +                       synced = slot->data.synced;
> > > +                       SpinLockRelease(&slot->mutex);
> > > ```
> > >
> > > Same as 1.
> > >
> > I checked and found the following comment:
> >  * - Individual fields are protected by mutex where only the backend owning
> >  * the slot is authorized to update the fields from its own slot.  The
> >  * backend owning the slot does not need to take this lock when reading its
> >  * own fields, while concurrent backends not owning this slot should take the
> >  * lock when reading this slot's data.
> >
I realised that the patch does not entirely follow the above rule. As
per my understanding rule says:
1. If we want to update a field of a slot we need to own the slot (or
we can say acquire the slot)
2. In the above case we also need to take a Spinlock on the slot to
update any field.
3. If we want to read a field and if we own the slot we do not need a
spinlock on the slot.
4. If we want to read a field and if we do not own the slot we need a
spinlock on the slot.

For the current patch (v5), in the function "synchronize_one_slot" we
are calling "update_slot_sync_skip_stats" even if we do not own the
slot.
So, as per the rule I have updated  "update_slot_sync_skip_stats" to
own the slot before updating.

> > So for the above two cases we are updating the
> > 'slot->slot_sync_skip_reason' and reading 'slot->data.synced' and this
> > can happen before the slot sync worker acquires the slot or owns the
> > slot.
> > Also in the same code at a later stage we are again checking the
> > synced flag and we do that while holding a spin lock. Based on these
> > observations I think we should take Spinlock in both cases.
>
> Hmm, regarding the update_slot_sync_skip_stats(), the replication slot has already been
> acquired except synchronize_one_slot() case.
> Can we avoid acquiring the spinlock as much as possible by adding an argument?
> Or it just introduces additional complexity?
>
After updating the code as per rule, I think we always have to take a
Spinlock on the slot when we are updating any field.

> > > 09.
> > > ```
> > > +my $connstr_1 = $primary->connstr;
> > > ```
> > >
> > > Since this is an only connection string in the test, suffix _1 is not needed.
> > >
> > Fixed
>
> Same as the comment, can you replace "standby1" to "stanby"?
>
Fixed

> >
> > > 10.
> > > ```
> > > +# Simulate standby connection failure by modifying pg_hba.conf
> > > +unlink($primary->data_dir . '/pg_hba.conf');
> > > +$primary->append_conf('pg_hba.conf',
> > > +       qq{local   all             all
> > trust}
> > > +);
> > > ```
> > >
> > > What if the system does not have Unix domain socket? I'm afraid all connections
> > > could be brocked in this case.
> > >
> > I have used an injection point to simulate this scenario instead of
> > changing the contents of pg_hba.conf files.
>
> Can you clarify the reason why you used the injection point?
> I'm not sure the injection point is beneficial here. I feel the point can be added
> when we handle the timing issue, race condition etc, but walreceiver may not have
> strong reasons to stop exact at that point.
>
> Regarding the content of pg_hba.conf, I felt below lines might be enough:
> ```
> local   all             all                                     trust
> host    all             all             127.0.0.1/32            trust
> ```
>
I checked this.
By default pg_hba.conf has contents as:
```
# "local" is for Unix domain socket connections only
local   all             all                                     trust
# IPv4 local connections:
host    all             all             127.0.0.1/32            trust
# IPv6 local connections:
host    all             all             ::1/128                 trust
# Allow replication connections from localhost, by a user with the
# replication privilege.
local   replication     all                                     trust
host    replication     all             127.0.0.1/32            trust
host    replication     all             ::1/128                 trust
```
Now for our test to prevent the streaming replication we can set pg_hba.conf to
```
local   all             all                                     trust
host    all             all             127.0.0.1/32            trust
host    all             all             ::1/128                 trust
```
And then to restore streaming replication we can add following to pg_hba.conf:
```
local   replication     all                                     trust
host    replication     all             127.0.0.1/32            trust
host    replication     all             ::1/128                 trust
```
I think this would be sufficient for our testing. Thoughts?

> Also, here are comments for v5.
>
> ```
> +      <para>
> +       Reason of the last slot synchronization skip.
> +      </para></entry>
> ```
>
> Possible values must be clarified. This was posted in [1] but seemed to be missed.
Sorry, I missed it. I have updated it in the latest patch.

>
> ```
> +       /* Update the slot sync reason */
> ```
>
> It is better to clarify updating the *skip* reason
Fixed

>
> ```
> -       ReplicationSlot *slot;
> +       ReplicationSlot *slot = NULL;
> ```
>
> No need to initialize as NULL.
>
Fixed

> ```
> +#include "utils/injection_point.h"
> ...
> +                               INJECTION_POINT("walreceiver", NULL);
> ```
>
> As I told above, I have a concern to add the injection point. I want to hear
> other's opinion as well.
>
Removed it for now as per my analysis we can modify pg_hba.conf to
simulate the scenario.

> ```
> +                       else
> +                       {
> +                               /* Update the slot sync stats */
> +                               Assert(!found_consistent_snapshot ||
> +                                          *found_consistent_snapshot);
> +                               update_slot_sync_skip_stats(slot, SS_SKIP_NONE);
> +                       }
> ```
>
> Your patch may have another issue; if both confirmed_flush_lsn are the same
> but we do not have the consistent snapshot yet, we would get the assertion failure.
> (Again, not sure it can really happen)
> Can we use the condition as another if part? At that time we must clarify why
> it is OK to pass in case of found_consistent_snapshot == NULL.
>
Fixed

> ```
> +# Attach injection point to simulate wait
> +$standby_psql->query_safe(
> +       q(select injection_points_attach('slot-sync-skip','wait')));
> ```
>
> I have been considering whether we can remove the injection point here or not.
> I think the point is used because the being synchronized slot is still temporary
> one; they would be cleaned up by ReplicationSlotCleanup().
> Can we reproduce the skip event for the permanent slot? I cannot come up with,
> but if possible no need to introduce the injection point.
I tried reproducing it but was not able to come up with a test without
injection point. Will further try to reproduce it without injection
point.

>
> [1]: https://www.postgresql.org/message-id/OSCPR01MB14966A618A8C61EC3DEE486A4F517A%40OSCPR01MB14966.jpnprd01.prod.outlook.com
>

I have attached the latest patch.

Thanks,
Shlok Kyal