Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Enhance slot synchronization API to respect promotion signal.

  2. Fix inconsistent elevel in pg_sync_replication_slots() retry logic.

  3. Refactor slot synchronization logic in slotsync.c.

  4. Fix intermittent BF failure in 040_standby_failover_slots_sync.

  5. Add retry logic to pg_sync_replication_slots().

  6. Fix LOCK_TIMEOUT handling in slotsync worker.

  7. Add slotsync skip statistics.

  1. Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-06-24T10:40:58Z

    Hello,
    
    Creating this thread for a POC based on discussions in thread [1].
    Hou-san had created this patch, and I just cleaned up some documents,
    did some testing and now sharing the patch here.
    
    In this patch, the pg_sync_replication_slots() API now waits
    indefinitely for the remote slot to catch up. We could later add a
    timeout parameter to control maximum wait time if this approach seems
    acceptable. If there are more ideas on improving this patch, let me
    know.
    
    regards,
    Ajin Cherian
    [1]  - https://www.postgresql.org/message-id/CAF1DzPWTcg%2Bm%2Bx%2BoVVB%3Dy4q9%3DPYYsL_mujVp7uJr-_oUtWNGbA%40mail.gmail.com
    
  2. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-06-24T10:48:18Z

    Sorry, I attached the wrong file. Attaching the correct file now.
    
    regards,
    Ajin Cerian
    Fujitsu Australia
    
  3. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-07-02T09:55:56Z

    On Tue, Jun 24, 2025 at 4:11 PM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > Hello,
    >
    > Creating this thread for a POC based on discussions in thread [1].
    > Hou-san had created this patch, and I just cleaned up some documents,
    > did some testing and now sharing the patch here.
    >
    > In this patch, the pg_sync_replication_slots() API now waits
    > indefinitely for the remote slot to catch up. We could later add a
    > timeout parameter to control maximum wait time if this approach seems
    > acceptable. If there are more ideas on improving this patch, let me
    > know.
    
    +1 on the idea.
    I believe the timeout option may not be necessary here, since the API
    can be manually canceled if needed. Otherwise, the recommended
    approach is to let it complete. But I would like to know what others
    think here.
    
    Few comments:
    
    1)
    When the API is waiting for the primary to advance, standby fails to
    handle promotion requests. Promotion fails:
    ./pg_ctl -D ../../standbydb/ promote -w
    waiting for server to promote.................stopped waiting
    pg_ctl: server did not promote in time
    
    See the logs at [1]
    
    2)
    Also when the API is waiting for a long time, it just dumps the
    'waiting for remote_slot..' LOG only once. Do you think it makes sense
    to log it at a regular interval until the wait is over? See logs at
    [1]. It dumped the log once in 3minutes.
    
    3)
    + /*
    + * It is possible to get null value for restart_lsn if the slot is
    + * invalidated on the primary server, so handle accordingly.
    + */
    
    + if (new_invalidated || XLogRecPtrIsInvalid(new_restart_lsn))
    + {
    + /*
    + * The slot won't be persisted by the caller; it will be cleaned up
    + * at the end of synchronization.
    + */
    + ereport(WARNING,
    + errmsg("aborting initial sync for slot \"%s\"",
    +    remote_slot->name),
    + errdetail("This slot was invalidated on the primary server."));
    
    Which case are we referring to here where null restart_lsn would mean
    invalidation? Can you please point me to such code where it happens or
    a test-case which does that. I tried a few invalidation cases, but did
    not hit it.
    
    
    [1]:
    Log file:
    2025-07-02 14:38:09.851 IST [153187] LOG:  waiting for remote slot
    "failover_slot" LSN (0/3003F60) and catalog xmin (754) to pass local
    slot LSN (0/3003F60) and catalog xmin (767)
    2025-07-02 14:38:09.851 IST [153187] STATEMENT:  SELECT
    pg_sync_replication_slots();
    2025-07-02 14:41:36.200 IST [153164] LOG:  received promote request
    
    thanks
    Shveta
    
    
    
    
  4. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-07-09T04:53:40Z

    Please find few more comments:
    
    1)
    In  pg_sync_replication_slots() doc, we have this:
    
    "Note that this function is primarily intended for testing and
    debugging purposes and should be used with caution. Additionally, this
    function cannot be executed if ...."
    
    We can get rid of this info as well and change to:
    
    "Note that this function cannot be executed if...."
    
    2)
    We got rid of NOTE in logicaldecoding.sgml, but now the page does not
    mention pg_sync_replication_slots() at all. We need to bring back the
    change removed by [1] (or something on similar line) which is this:
    
    -     <command>CREATE SUBSCRIPTION</command> during slot creation, and
    then calling
    -     <link linkend="pg-sync-replication-slots">
    -     <function>pg_sync_replication_slots</function></link>
    -     on the standby. By setting <link linkend="guc-sync-replication-slots">
    +     <command>CREATE SUBSCRIPTION</command> during slot creation.
    +     Additionally, enabling <link linkend="guc-sync-replication-slots">
    +     <varname>sync_replication_slots</varname></link> on the standby
    +     is required. By enabling <link linkend="guc-sync-replication-slots">
    
    3)
    wait_for_primary_slot_catchup():
    + /*
    + * It is possible to get null values for confirmed_lsn and
    + * catalog_xmin if on the primary server the slot is just created with
    + * a valid restart_lsn and slot-sync worker has fetched the slot
    + * before the primary server could set valid confirmed_lsn and
    + * catalog_xmin.
    + */
    
    Do we need this special handling? We already have one such handling in
    synchronize_slots(). please see:
                    /*
                     * If restart_lsn, confirmed_lsn or catalog_xmin is
    invalid but the
                     * slot is valid, that means we have fetched the
    remote_slot in its
                     * RS_EPHEMERAL state. In such a case, don't sync it;
    we can always
                     * sync it in the next sync cycle when the remote_slot
    is persisted
                     * and has valid lsn(s) and xmin values.
                     */
                    if ((XLogRecPtrIsInvalid(remote_slot->restart_lsn) ||
                             XLogRecPtrIsInvalid(remote_slot->confirmed_lsn) ||
                             !TransactionIdIsValid(remote_slot->catalog_xmin)) &&
                            remote_slot->invalidated == RS_INVAL_NONE)
                            pfree(remote_slot);
    
    
    Due to the above check in synchronize_slots(), we will not reach
    wait_for_primary_slot_catchup() when any of confirmed_lsn or
    catalog_xmin is not initialized.
    
    
    [1]: https://www.postgresql.org/message-id/CAJpy0uAD_La2vi%2BB%2BiSBbCYTMayMstvbF9ndrAJysL9t5fHtbQ%40mail.gmail.com
    
    
    thanks
    Shveta
    
    
    
    
  5. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-07-16T09:30:17Z

    On Wed, Jul 2, 2025 at 7:56 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > Few comments:
    >
    > 1)
    > When the API is waiting for the primary to advance, standby fails to
    > handle promotion requests. Promotion fails:
    > ./pg_ctl -D ../../standbydb/ promote -w
    > waiting for server to promote.................stopped waiting
    > pg_ctl: server did not promote in time
    >
    > See the logs at [1]
    >
    
    I've modified this to handle promotion request and stop slot
    synchronization if standby is promoted.
    
    
    > 2)
    > Also when the API is waiting for a long time, it just dumps the
    > 'waiting for remote_slot..' LOG only once. Do you think it makes sense
    > to log it at a regular interval until the wait is over? See logs at
    > [1]. It dumped the log once in 3minutes.
    >
    
    I've modified it to log once every 10 seconds.
    
    > 3)
    > + /*
    > + * It is possible to get null value for restart_lsn if the slot is
    > + * invalidated on the primary server, so handle accordingly.
    > + */
    >
    > + if (new_invalidated || XLogRecPtrIsInvalid(new_restart_lsn))
    > + {
    > + /*
    > + * The slot won't be persisted by the caller; it will be cleaned up
    > + * at the end of synchronization.
    > + */
    > + ereport(WARNING,
    > + errmsg("aborting initial sync for slot \"%s\"",
    > +    remote_slot->name),
    > + errdetail("This slot was invalidated on the primary server."));
    >
    > Which case are we referring to here where null restart_lsn would mean
    > invalidation? Can you please point me to such code where it happens or
    > a test-case which does that. I tried a few invalidation cases, but did
    > not hit it.
    >
    
    I've removed all this code as the checks for null restart_lsn and
    other parameters are handled in earlier functions and we won't get
    here if these had null, I've added asserts to check that it is not
    null.
    
    
    On Wed, Jul 9, 2025 at 2:53 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > Please find few more comments:
    >
    > 1)
    > In  pg_sync_replication_slots() doc, we have this:
    >
    > "Note that this function is primarily intended for testing and
    > debugging purposes and should be used with caution. Additionally, this
    > function cannot be executed if ...."
    >
    > We can get rid of this info as well and change to:
    >
    > "Note that this function cannot be executed if...."
    
    Modified as requested.
    
    
    >
    > 2)
    > We got rid of NOTE in logicaldecoding.sgml, but now the page does not
    > mention pg_sync_replication_slots() at all. We need to bring back the
    > change removed by [1] (or something on similar line) which is this:
    >
    > -     <command>CREATE SUBSCRIPTION</command> during slot creation, and
    > then calling
    > -     <link linkend="pg-sync-replication-slots">
    > -     <function>pg_sync_replication_slots</function></link>
    > -     on the standby. By setting <link linkend="guc-sync-replication-slots">
    > +     <command>CREATE SUBSCRIPTION</command> during slot creation.
    > +     Additionally, enabling <link linkend="guc-sync-replication-slots">
    > +     <varname>sync_replication_slots</varname></link> on the standby
    > +     is required. By enabling <link linkend="guc-sync-replication-slots">
    >
    
    I've added that back.
    
    
    > 3)
    > wait_for_primary_slot_catchup():
    > + /*
    > + * It is possible to get null values for confirmed_lsn and
    > + * catalog_xmin if on the primary server the slot is just created with
    > + * a valid restart_lsn and slot-sync worker has fetched the slot
    > + * before the primary server could set valid confirmed_lsn and
    > + * catalog_xmin.
    > + */
    >
    > Do we need this special handling? We already have one such handling in
    > synchronize_slots(). please see:
    >                 /*
    >                  * If restart_lsn, confirmed_lsn or catalog_xmin is
    > invalid but the
    >                  * slot is valid, that means we have fetched the
    > remote_slot in its
    >                  * RS_EPHEMERAL state. In such a case, don't sync it;
    > we can always
    >                  * sync it in the next sync cycle when the remote_slot
    > is persisted
    >                  * and has valid lsn(s) and xmin values.
    >                  */
    >                 if ((XLogRecPtrIsInvalid(remote_slot->restart_lsn) ||
    >                          XLogRecPtrIsInvalid(remote_slot->confirmed_lsn) ||
    >                          !TransactionIdIsValid(remote_slot->catalog_xmin)) &&
    >                         remote_slot->invalidated == RS_INVAL_NONE)
    >                         pfree(remote_slot);
    >
    >
    > Due to the above check in synchronize_slots(), we will not reach
    > wait_for_primary_slot_catchup() when any of confirmed_lsn or
    > catalog_xmin is not initialized.
    >
    
    Yes,  you are correct. I've removed all that logic.
    
    The modified patch (v2) is attached.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  6. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-07-16T09:57:47Z

    On Wed, Jul 16, 2025 at 3:00 PM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > On Wed, Jul 2, 2025 at 7:56 PM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > Few comments:
    > >
    > > 1)
    > > When the API is waiting for the primary to advance, standby fails to
    > > handle promotion requests. Promotion fails:
    > > ./pg_ctl -D ../../standbydb/ promote -w
    > > waiting for server to promote.................stopped waiting
    > > pg_ctl: server did not promote in time
    > >
    > > See the logs at [1]
    > >
    >
    > I've modified this to handle promotion request and stop slot
    > synchronization if standby is promoted.
    >
    >
    > > 2)
    > > Also when the API is waiting for a long time, it just dumps the
    > > 'waiting for remote_slot..' LOG only once. Do you think it makes sense
    > > to log it at a regular interval until the wait is over? See logs at
    > > [1]. It dumped the log once in 3minutes.
    > >
    >
    > I've modified it to log once every 10 seconds.
    >
    > > 3)
    > > + /*
    > > + * It is possible to get null value for restart_lsn if the slot is
    > > + * invalidated on the primary server, so handle accordingly.
    > > + */
    > >
    > > + if (new_invalidated || XLogRecPtrIsInvalid(new_restart_lsn))
    > > + {
    > > + /*
    > > + * The slot won't be persisted by the caller; it will be cleaned up
    > > + * at the end of synchronization.
    > > + */
    > > + ereport(WARNING,
    > > + errmsg("aborting initial sync for slot \"%s\"",
    > > +    remote_slot->name),
    > > + errdetail("This slot was invalidated on the primary server."));
    > >
    > > Which case are we referring to here where null restart_lsn would mean
    > > invalidation? Can you please point me to such code where it happens or
    > > a test-case which does that. I tried a few invalidation cases, but did
    > > not hit it.
    > >
    >
    > I've removed all this code as the checks for null restart_lsn and
    > other parameters are handled in earlier functions and we won't get
    > here if these had null, I've added asserts to check that it is not
    > null.
    >
    >
    > On Wed, Jul 9, 2025 at 2:53 PM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > Please find few more comments:
    > >
    > > 1)
    > > In  pg_sync_replication_slots() doc, we have this:
    > >
    > > "Note that this function is primarily intended for testing and
    > > debugging purposes and should be used with caution. Additionally, this
    > > function cannot be executed if ...."
    > >
    > > We can get rid of this info as well and change to:
    > >
    > > "Note that this function cannot be executed if...."
    >
    > Modified as requested.
    >
    >
    > >
    > > 2)
    > > We got rid of NOTE in logicaldecoding.sgml, but now the page does not
    > > mention pg_sync_replication_slots() at all. We need to bring back the
    > > change removed by [1] (or something on similar line) which is this:
    > >
    > > -     <command>CREATE SUBSCRIPTION</command> during slot creation, and
    > > then calling
    > > -     <link linkend="pg-sync-replication-slots">
    > > -     <function>pg_sync_replication_slots</function></link>
    > > -     on the standby. By setting <link linkend="guc-sync-replication-slots">
    > > +     <command>CREATE SUBSCRIPTION</command> during slot creation.
    > > +     Additionally, enabling <link linkend="guc-sync-replication-slots">
    > > +     <varname>sync_replication_slots</varname></link> on the standby
    > > +     is required. By enabling <link linkend="guc-sync-replication-slots">
    > >
    >
    > I've added that back.
    >
    >
    > > 3)
    > > wait_for_primary_slot_catchup():
    > > + /*
    > > + * It is possible to get null values for confirmed_lsn and
    > > + * catalog_xmin if on the primary server the slot is just created with
    > > + * a valid restart_lsn and slot-sync worker has fetched the slot
    > > + * before the primary server could set valid confirmed_lsn and
    > > + * catalog_xmin.
    > > + */
    > >
    > > Do we need this special handling? We already have one such handling in
    > > synchronize_slots(). please see:
    > >                 /*
    > >                  * If restart_lsn, confirmed_lsn or catalog_xmin is
    > > invalid but the
    > >                  * slot is valid, that means we have fetched the
    > > remote_slot in its
    > >                  * RS_EPHEMERAL state. In such a case, don't sync it;
    > > we can always
    > >                  * sync it in the next sync cycle when the remote_slot
    > > is persisted
    > >                  * and has valid lsn(s) and xmin values.
    > >                  */
    > >                 if ((XLogRecPtrIsInvalid(remote_slot->restart_lsn) ||
    > >                          XLogRecPtrIsInvalid(remote_slot->confirmed_lsn) ||
    > >                          !TransactionIdIsValid(remote_slot->catalog_xmin)) &&
    > >                         remote_slot->invalidated == RS_INVAL_NONE)
    > >                         pfree(remote_slot);
    > >
    > >
    > > Due to the above check in synchronize_slots(), we will not reach
    > > wait_for_primary_slot_catchup() when any of confirmed_lsn or
    > > catalog_xmin is not initialized.
    > >
    >
    > Yes,  you are correct. I've removed all that logic.
    >
    > The modified patch (v2) is attached.
    >
    
    I am not able to apply the patch to the latest head or even to a week
    back version. Can you please check and rebase?
    
    thanks
    Shveta
    
    
    
    
  7. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-07-16T10:16:51Z

    > I am not able to apply the patch to the latest head or even to a week
    > back version. Can you please check and rebase?
    >
    > thanks
    > Shveta
    
    Rebased.
    
    Regards,
    Ajin Cherian
    Fujitsu Australia.
    
  8. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-07-17T04:04:25Z

    On Wed, Jul 16, 2025 at 3:47 PM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > > I am not able to apply the patch to the latest head or even to a week
    > > back version. Can you please check and rebase?
    > >
    > > thanks
    > > Shveta
    >
    > Rebased.
    >
    
    Thanks. Please find a few comments:
    
    
    1)
    /* Any slot with NULL in these fields should not have made it this far */
    
    It is good to get rid of the case where we had checks for NULL
    confirmed_lsn and catalog_xmin (i.e. when slot was in RS_EPHEMERAL
    state), as that has already been checked by synchronize_slots() and
    such a slot will not even reach wait_for_primary_slot_catchup(). But a
    slot can still be invalidated on primary anytime, and thus during this
    wait, we should check for primary's invalidation as we were doing in
    v1.
    
    2)
    + * If in SQL API synchronization, and we've been  promoted, then no point
    
     extra space before promoted.
    
    3)
    
    + if (!AmLogicalSlotSyncWorkerProcess() && PromoteIsTriggered())
    
    We don't need 'AmLogicalSlotSyncWorkerProcess' as that is already
    checked at the beginning of this function.
    
    4)
    + ereport(WARNING,
    + errmsg("aborting sync for slot \"%s\"",
    + remote_slot->name),
    + errdetail("Promotion occurred before this slot was fully"
    +   " synchronized."));
    + pfree(cmd.data);
    +
    + return false;
    
    a) Please add an error-code.
    
    b) Shall we change msg to
    
    errmsg("aborting sync for slot \"%s\"",
    remote_slot->name),
                        errhint("%s cannot be executed once promotion is
    triggered.",
    
    "pg_sync_replication_slots()")));
    
    
    
    5)
    Instead of using PromoteIsTriggered, shall we rely on
    'SlotSyncCtx->stopSignaled' as we do when we start this API.
    
    6)
    In logicaldecoding.sgml, we can get rid of "Additionally, enabling
    sync_replication_slots on the standby is required" to make it same as
    what we had prior to the patch I pointed earlier.
    
    Or better we can refine it to below. Thoughts?
    
    The logical replication slots on the primary can be enabled for
    synchronization to the hot standby by using the failover parameter of
    pg_create_logical_replication_slot, or by using the failover option of
    CREATE SUBSCRIPTION during slot creation. After that, synchronization
    can be performed either manually by calling pg_sync_replication_slots
    on the standby, or automatically by enabling sync_replication_slots on
    the standby. When sync_replication_slots is enabled, the failover
    slots are periodically synchronized by the slot sync worker. For the
    synchronization to work, .....
    
    thanks
    Shveta
    
    
    
    
  9. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Dilip Kumar <dilipbalaut@gmail.com> — 2025-07-18T04:43:54Z

    On Thu, Jul 17, 2025 at 9:34 AM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Wed, Jul 16, 2025 at 3:47 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > > > I am not able to apply the patch to the latest head or even to a week
    > > > back version. Can you please check and rebase?
    > > >
    > > > thanks
    > > > Shveta
    > >
    > > Rebased.
    > >
    >
    > Thanks. Please find a few comments:
    >
    >
    > 1)
    > /* Any slot with NULL in these fields should not have made it this far */
    >
    > It is good to get rid of the case where we had checks for NULL
    > confirmed_lsn and catalog_xmin (i.e. when slot was in RS_EPHEMERAL
    > state), as that has already been checked by synchronize_slots() and
    > such a slot will not even reach wait_for_primary_slot_catchup(). But a
    > slot can still be invalidated on primary anytime, and thus during this
    > wait, we should check for primary's invalidation as we were doing in
    > v1.
    >
    > 2)
    > + * If in SQL API synchronization, and we've been  promoted, then no point
    >
    >  extra space before promoted.
    >
    > 3)
    >
    > + if (!AmLogicalSlotSyncWorkerProcess() && PromoteIsTriggered())
    >
    > We don't need 'AmLogicalSlotSyncWorkerProcess' as that is already
    > checked at the beginning of this function.
    >
    > 4)
    > + ereport(WARNING,
    > + errmsg("aborting sync for slot \"%s\"",
    > + remote_slot->name),
    > + errdetail("Promotion occurred before this slot was fully"
    > +   " synchronized."));
    > + pfree(cmd.data);
    > +
    > + return false;
    >
    > a) Please add an error-code.
    >
    > b) Shall we change msg to
    >
    > errmsg("aborting sync for slot \"%s\"",
    > remote_slot->name),
    >                     errhint("%s cannot be executed once promotion is
    > triggered.",
    >
    > "pg_sync_replication_slots()")));
    >
    >
    >
    > 5)
    > Instead of using PromoteIsTriggered, shall we rely on
    > 'SlotSyncCtx->stopSignaled' as we do when we start this API.
    >
    > 6)
    > In logicaldecoding.sgml, we can get rid of "Additionally, enabling
    > sync_replication_slots on the standby is required" to make it same as
    > what we had prior to the patch I pointed earlier.
    >
    > Or better we can refine it to below. Thoughts?
    >
    > The logical replication slots on the primary can be enabled for
    > synchronization to the hot standby by using the failover parameter of
    > pg_create_logical_replication_slot, or by using the failover option of
    > CREATE SUBSCRIPTION during slot creation. After that, synchronization
    > can be performed either manually by calling pg_sync_replication_slots
    > on the standby, or automatically by enabling sync_replication_slots on
    > the standby. When sync_replication_slots is enabled, the failover
    > slots are periodically synchronized by the slot sync worker. For the
    > synchronization to work, .....
    
    I am wondering if we should provide an optional parameter to
    pg_sync_replication_slots(), to control whether to wait for the slot
    to be synced or just return with ERROR as it is doing now, default can
    be wait.
    
    -- 
    Regards,
    Dilip Kumar
    Google
    
    
    
    
  10. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-07-18T05:15:31Z

    On Fri, Jul 18, 2025 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
    >
    > On Thu, Jul 17, 2025 at 9:34 AM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > On Wed, Jul 16, 2025 at 3:47 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > > >
    > > > > I am not able to apply the patch to the latest head or even to a week
    > > > > back version. Can you please check and rebase?
    > > > >
    > > > > thanks
    > > > > Shveta
    > > >
    > > > Rebased.
    > > >
    > >
    > > Thanks. Please find a few comments:
    > >
    > >
    > > 1)
    > > /* Any slot with NULL in these fields should not have made it this far */
    > >
    > > It is good to get rid of the case where we had checks for NULL
    > > confirmed_lsn and catalog_xmin (i.e. when slot was in RS_EPHEMERAL
    > > state), as that has already been checked by synchronize_slots() and
    > > such a slot will not even reach wait_for_primary_slot_catchup(). But a
    > > slot can still be invalidated on primary anytime, and thus during this
    > > wait, we should check for primary's invalidation as we were doing in
    > > v1.
    > >
    > > 2)
    > > + * If in SQL API synchronization, and we've been  promoted, then no point
    > >
    > >  extra space before promoted.
    > >
    > > 3)
    > >
    > > + if (!AmLogicalSlotSyncWorkerProcess() && PromoteIsTriggered())
    > >
    > > We don't need 'AmLogicalSlotSyncWorkerProcess' as that is already
    > > checked at the beginning of this function.
    > >
    > > 4)
    > > + ereport(WARNING,
    > > + errmsg("aborting sync for slot \"%s\"",
    > > + remote_slot->name),
    > > + errdetail("Promotion occurred before this slot was fully"
    > > +   " synchronized."));
    > > + pfree(cmd.data);
    > > +
    > > + return false;
    > >
    > > a) Please add an error-code.
    > >
    > > b) Shall we change msg to
    > >
    > > errmsg("aborting sync for slot \"%s\"",
    > > remote_slot->name),
    > >                     errhint("%s cannot be executed once promotion is
    > > triggered.",
    > >
    > > "pg_sync_replication_slots()")));
    > >
    > >
    > >
    > > 5)
    > > Instead of using PromoteIsTriggered, shall we rely on
    > > 'SlotSyncCtx->stopSignaled' as we do when we start this API.
    > >
    > > 6)
    > > In logicaldecoding.sgml, we can get rid of "Additionally, enabling
    > > sync_replication_slots on the standby is required" to make it same as
    > > what we had prior to the patch I pointed earlier.
    > >
    > > Or better we can refine it to below. Thoughts?
    > >
    > > The logical replication slots on the primary can be enabled for
    > > synchronization to the hot standby by using the failover parameter of
    > > pg_create_logical_replication_slot, or by using the failover option of
    > > CREATE SUBSCRIPTION during slot creation. After that, synchronization
    > > can be performed either manually by calling pg_sync_replication_slots
    > > on the standby, or automatically by enabling sync_replication_slots on
    > > the standby. When sync_replication_slots is enabled, the failover
    > > slots are periodically synchronized by the slot sync worker. For the
    > > synchronization to work, .....
    >
    > I am wondering if we should provide an optional parameter to
    > pg_sync_replication_slots(), to control whether to wait for the slot
    > to be synced or just return with ERROR as it is doing now, default can
    > be wait.
    >
    
    Do you mean specifically in case of promotion or in general, as in do
    not wait for primary to catch-up (anytime) and exit and drop the
    temporary slot while exiting?
    
    thanks
    Shveta
    
    
    
    
  11. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Dilip Kumar <dilipbalaut@gmail.com> — 2025-07-18T05:22:37Z

    On Fri, Jul 18, 2025 at 10:45 AM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Fri, Jul 18, 2025 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
    > >
    > > On Thu, Jul 17, 2025 at 9:34 AM shveta malik <shveta.malik@gmail.com> wrote:
    > > >
    > > > On Wed, Jul 16, 2025 at 3:47 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > > > >
    > > > > > I am not able to apply the patch to the latest head or even to a week
    > > > > > back version. Can you please check and rebase?
    > > > > >
    > > > > > thanks
    > > > > > Shveta
    > > > >
    > > > > Rebased.
    > > > >
    > > >
    > > > Thanks. Please find a few comments:
    > > >
    > > >
    > > > 1)
    > > > /* Any slot with NULL in these fields should not have made it this far */
    > > >
    > > > It is good to get rid of the case where we had checks for NULL
    > > > confirmed_lsn and catalog_xmin (i.e. when slot was in RS_EPHEMERAL
    > > > state), as that has already been checked by synchronize_slots() and
    > > > such a slot will not even reach wait_for_primary_slot_catchup(). But a
    > > > slot can still be invalidated on primary anytime, and thus during this
    > > > wait, we should check for primary's invalidation as we were doing in
    > > > v1.
    > > >
    > > > 2)
    > > > + * If in SQL API synchronization, and we've been  promoted, then no point
    > > >
    > > >  extra space before promoted.
    > > >
    > > > 3)
    > > >
    > > > + if (!AmLogicalSlotSyncWorkerProcess() && PromoteIsTriggered())
    > > >
    > > > We don't need 'AmLogicalSlotSyncWorkerProcess' as that is already
    > > > checked at the beginning of this function.
    > > >
    > > > 4)
    > > > + ereport(WARNING,
    > > > + errmsg("aborting sync for slot \"%s\"",
    > > > + remote_slot->name),
    > > > + errdetail("Promotion occurred before this slot was fully"
    > > > +   " synchronized."));
    > > > + pfree(cmd.data);
    > > > +
    > > > + return false;
    > > >
    > > > a) Please add an error-code.
    > > >
    > > > b) Shall we change msg to
    > > >
    > > > errmsg("aborting sync for slot \"%s\"",
    > > > remote_slot->name),
    > > >                     errhint("%s cannot be executed once promotion is
    > > > triggered.",
    > > >
    > > > "pg_sync_replication_slots()")));
    > > >
    > > >
    > > >
    > > > 5)
    > > > Instead of using PromoteIsTriggered, shall we rely on
    > > > 'SlotSyncCtx->stopSignaled' as we do when we start this API.
    > > >
    > > > 6)
    > > > In logicaldecoding.sgml, we can get rid of "Additionally, enabling
    > > > sync_replication_slots on the standby is required" to make it same as
    > > > what we had prior to the patch I pointed earlier.
    > > >
    > > > Or better we can refine it to below. Thoughts?
    > > >
    > > > The logical replication slots on the primary can be enabled for
    > > > synchronization to the hot standby by using the failover parameter of
    > > > pg_create_logical_replication_slot, or by using the failover option of
    > > > CREATE SUBSCRIPTION during slot creation. After that, synchronization
    > > > can be performed either manually by calling pg_sync_replication_slots
    > > > on the standby, or automatically by enabling sync_replication_slots on
    > > > the standby. When sync_replication_slots is enabled, the failover
    > > > slots are periodically synchronized by the slot sync worker. For the
    > > > synchronization to work, .....
    > >
    > > I am wondering if we should provide an optional parameter to
    > > pg_sync_replication_slots(), to control whether to wait for the slot
    > > to be synced or just return with ERROR as it is doing now, default can
    > > be wait.
    > >
    >
    > Do you mean specifically in case of promotion or in general, as in do
    > not wait for primary to catch-up (anytime) and exit and drop the
    > temporary slot while exiting?
    
    I am specifically pointing to the exposed function
    pg_sync_replication_slots() which was earlier non-blocking and was
    giving an error if the primary slot for not catch-up, so we have
    improved the functionality in this thread by making it wait for
    primary to catch-up instead of just throwing an error.  But my
    question was since this is a user facing function so shall we keep the
    old behavior intact with some optional parameters so that if the user
    chooses not to wait they have options to do that?
    
    -- 
    Regards,
    Dilip Kumar
    Google
    
    
    
    
  12. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-07-18T05:55:47Z

    On Fri, Jul 18, 2025 at 10:52 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
    >
    > On Fri, Jul 18, 2025 at 10:45 AM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > On Fri, Jul 18, 2025 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
    > > >
    > > > On Thu, Jul 17, 2025 at 9:34 AM shveta malik <shveta.malik@gmail.com> wrote:
    > > > >
    > > > > On Wed, Jul 16, 2025 at 3:47 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > > > > >
    > > > > > > I am not able to apply the patch to the latest head or even to a week
    > > > > > > back version. Can you please check and rebase?
    > > > > > >
    > > > > > > thanks
    > > > > > > Shveta
    > > > > >
    > > > > > Rebased.
    > > > > >
    > > > >
    > > > > Thanks. Please find a few comments:
    > > > >
    > > > >
    > > > > 1)
    > > > > /* Any slot with NULL in these fields should not have made it this far */
    > > > >
    > > > > It is good to get rid of the case where we had checks for NULL
    > > > > confirmed_lsn and catalog_xmin (i.e. when slot was in RS_EPHEMERAL
    > > > > state), as that has already been checked by synchronize_slots() and
    > > > > such a slot will not even reach wait_for_primary_slot_catchup(). But a
    > > > > slot can still be invalidated on primary anytime, and thus during this
    > > > > wait, we should check for primary's invalidation as we were doing in
    > > > > v1.
    > > > >
    > > > > 2)
    > > > > + * If in SQL API synchronization, and we've been  promoted, then no point
    > > > >
    > > > >  extra space before promoted.
    > > > >
    > > > > 3)
    > > > >
    > > > > + if (!AmLogicalSlotSyncWorkerProcess() && PromoteIsTriggered())
    > > > >
    > > > > We don't need 'AmLogicalSlotSyncWorkerProcess' as that is already
    > > > > checked at the beginning of this function.
    > > > >
    > > > > 4)
    > > > > + ereport(WARNING,
    > > > > + errmsg("aborting sync for slot \"%s\"",
    > > > > + remote_slot->name),
    > > > > + errdetail("Promotion occurred before this slot was fully"
    > > > > +   " synchronized."));
    > > > > + pfree(cmd.data);
    > > > > +
    > > > > + return false;
    > > > >
    > > > > a) Please add an error-code.
    > > > >
    > > > > b) Shall we change msg to
    > > > >
    > > > > errmsg("aborting sync for slot \"%s\"",
    > > > > remote_slot->name),
    > > > >                     errhint("%s cannot be executed once promotion is
    > > > > triggered.",
    > > > >
    > > > > "pg_sync_replication_slots()")));
    > > > >
    > > > >
    > > > >
    > > > > 5)
    > > > > Instead of using PromoteIsTriggered, shall we rely on
    > > > > 'SlotSyncCtx->stopSignaled' as we do when we start this API.
    > > > >
    > > > > 6)
    > > > > In logicaldecoding.sgml, we can get rid of "Additionally, enabling
    > > > > sync_replication_slots on the standby is required" to make it same as
    > > > > what we had prior to the patch I pointed earlier.
    > > > >
    > > > > Or better we can refine it to below. Thoughts?
    > > > >
    > > > > The logical replication slots on the primary can be enabled for
    > > > > synchronization to the hot standby by using the failover parameter of
    > > > > pg_create_logical_replication_slot, or by using the failover option of
    > > > > CREATE SUBSCRIPTION during slot creation. After that, synchronization
    > > > > can be performed either manually by calling pg_sync_replication_slots
    > > > > on the standby, or automatically by enabling sync_replication_slots on
    > > > > the standby. When sync_replication_slots is enabled, the failover
    > > > > slots are periodically synchronized by the slot sync worker. For the
    > > > > synchronization to work, .....
    > > >
    > > > I am wondering if we should provide an optional parameter to
    > > > pg_sync_replication_slots(), to control whether to wait for the slot
    > > > to be synced or just return with ERROR as it is doing now, default can
    > > > be wait.
    > > >
    > >
    > > Do you mean specifically in case of promotion or in general, as in do
    > > not wait for primary to catch-up (anytime) and exit and drop the
    > > temporary slot while exiting?
    >
    > I am specifically pointing to the exposed function
    > pg_sync_replication_slots() which was earlier non-blocking and was
    > giving an error if the primary slot for not catch-up, so we have
    > improved the functionality in this thread by making it wait for
    > primary to catch-up instead of just throwing an error.  But my
    > question was since this is a user facing function so shall we keep the
    > old behavior intact with some optional parameters so that if the user
    > chooses not to wait they have options to do that?
    >
    
    Okay.  I see your point. Yes, it was non-blocking earlier but it was
    not giving ERROR, it was just dumping in logilfe that primary is
    behind and thus slot-sync could not be done.
    
    If we continue using the non-blocking mode, there’s a risk that the
    API may never successfully sync the slots. This is because it
    eventually drops the temporary slot on exit, and when it tries to
    create a new one later on subsequent call, it’s likely that the new
    slot will again be ahead of the primary. This may happen if we have
    continuous ongoing writes on the primary and the logical slot is not
    being consumed at the same pace.
    
    My preference would be to avoid including such an option as it is
    confusing. With such an option in place, users may think that
    slot-sync is completed while that may not be the case. But if it's
    necessary for backward compatibility, it should be okay to provide it
    as a non-default option as you suggested. Would like to know what
    others think of this.
    
    thanks
    Shveta
    
    
    
    
  13. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Dilip Kumar <dilipbalaut@gmail.com> — 2025-07-18T06:01:04Z

    On Fri, Jul 18, 2025 at 11:25 AM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > Okay.  I see your point. Yes, it was non-blocking earlier but it was
    > not giving ERROR, it was just dumping in logilfe that primary is
    > behind and thus slot-sync could not be done.
    >
    > If we continue using the non-blocking mode, there’s a risk that the
    > API may never successfully sync the slots. This is because it
    > eventually drops the temporary slot on exit, and when it tries to
    > create a new one later on subsequent call, it’s likely that the new
    > slot will again be ahead of the primary. This may happen if we have
    > continuous ongoing writes on the primary and the logical slot is not
    > being consumed at the same pace.
    >
    > My preference would be to avoid including such an option as it is
    > confusing. With such an option in place, users may think that
    > slot-sync is completed while that may not be the case.
    
    Fair enough
    
     But if it's
    > necessary for backward compatibility, it should be okay to provide it
    > as a non-default option as you suggested. Would like to know what
    > others think of this.
    
    I think we don't need to maintain backward compatibility here as it
    was not behaving sanely before, so I am fine with providing the
    behaviour we are doing with the patch.
    
    -- 
    Regards,
    Dilip Kumar
    Google
    
    
    
    
  14. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Amit Kapila <amit.kapila16@gmail.com> — 2025-07-19T11:39:49Z

    On Fri, Jul 18, 2025 at 11:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
    >
    > On Fri, Jul 18, 2025 at 11:25 AM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > Okay.  I see your point. Yes, it was non-blocking earlier but it was
    > > not giving ERROR, it was just dumping in logilfe that primary is
    > > behind and thus slot-sync could not be done.
    > >
    > > If we continue using the non-blocking mode, there’s a risk that the
    > > API may never successfully sync the slots. This is because it
    > > eventually drops the temporary slot on exit, and when it tries to
    > > create a new one later on subsequent call, it’s likely that the new
    > > slot will again be ahead of the primary. This may happen if we have
    > > continuous ongoing writes on the primary and the logical slot is not
    > > being consumed at the same pace.
    > >
    > > My preference would be to avoid including such an option as it is
    > > confusing. With such an option in place, users may think that
    > > slot-sync is completed while that may not be the case.
    >
    > Fair enough
    >
    
    I think if we want we may return bool and return false when sync is
    not complete say due to promotion or other reason like timeout.
    However, at this stage it is not very clear whether it will be useful
    to provide additional timeout parameter. But we can consider retruning
    true/false depending on whether we are successful in syncing the slots
    or not.
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  15. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-07-21T04:38:40Z

    On Sat, Jul 19, 2025 at 5:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    >
    > On Fri, Jul 18, 2025 at 11:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
    > >
    > > On Fri, Jul 18, 2025 at 11:25 AM shveta malik <shveta.malik@gmail.com> wrote:
    > > >
    > > > Okay.  I see your point. Yes, it was non-blocking earlier but it was
    > > > not giving ERROR, it was just dumping in logilfe that primary is
    > > > behind and thus slot-sync could not be done.
    > > >
    > > > If we continue using the non-blocking mode, there’s a risk that the
    > > > API may never successfully sync the slots. This is because it
    > > > eventually drops the temporary slot on exit, and when it tries to
    > > > create a new one later on subsequent call, it’s likely that the new
    > > > slot will again be ahead of the primary. This may happen if we have
    > > > continuous ongoing writes on the primary and the logical slot is not
    > > > being consumed at the same pace.
    > > >
    > > > My preference would be to avoid including such an option as it is
    > > > confusing. With such an option in place, users may think that
    > > > slot-sync is completed while that may not be the case.
    > >
    > > Fair enough
    > >
    >
    > I think if we want we may return bool and return false when sync is
    > not complete say due to promotion or other reason like timeout.
    > However, at this stage it is not very clear whether it will be useful
    > to provide additional timeout parameter. But we can consider retruning
    > true/false depending on whether we are successful in syncing the slots
    > or not.
    
    I am not very sure if in the current scenario, such a return-value
    will have any value addition. Since this function will be waiting
    indefinitely until all the slots are synced, it is supposed to return
    true in such normal scenarios. If it is interrupted by promotion or
    user cancels it manually, then it is supposed to return false. But in
    those cases, a more helpful approach would be to log a clear WARNING
    or ERROR message like "sync interrupted by promotion" (or similar
    reasons), rather than relying on a return value. In future, if we plan
    to add a timeout-parameter, then this return value makes more sense as
    in normal scenarios as well, as it can easily return false if the
    timeout value is short or the number of slots are huge or are stuck
    waiting on primary.
    
    Additionally, if we do return a value, there may be an expectation
    that the API should also provide details on the list of slots that
    couldn't be synced. That could introduce unnecessary complexity at
    this stage. We can avoid it for now and consider adding such
    enhancements later if we receive relevant customer feedback. Please
    note that our recommended approach for syncing slots still remains the
    'slot sync worker' method.
    
    thanks
    Shveta
    
    
    
    
  16. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Amit Kapila <amit.kapila16@gmail.com> — 2025-07-21T05:55:42Z

    On Mon, Jul 21, 2025 at 10:08 AM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Sat, Jul 19, 2025 at 5:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    > >
    > > On Fri, Jul 18, 2025 at 11:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
    > > >
    > > > On Fri, Jul 18, 2025 at 11:25 AM shveta malik <shveta.malik@gmail.com> wrote:
    > > > >
    > > > > Okay.  I see your point. Yes, it was non-blocking earlier but it was
    > > > > not giving ERROR, it was just dumping in logilfe that primary is
    > > > > behind and thus slot-sync could not be done.
    > > > >
    > > > > If we continue using the non-blocking mode, there’s a risk that the
    > > > > API may never successfully sync the slots. This is because it
    > > > > eventually drops the temporary slot on exit, and when it tries to
    > > > > create a new one later on subsequent call, it’s likely that the new
    > > > > slot will again be ahead of the primary. This may happen if we have
    > > > > continuous ongoing writes on the primary and the logical slot is not
    > > > > being consumed at the same pace.
    > > > >
    > > > > My preference would be to avoid including such an option as it is
    > > > > confusing. With such an option in place, users may think that
    > > > > slot-sync is completed while that may not be the case.
    > > >
    > > > Fair enough
    > > >
    > >
    > > I think if we want we may return bool and return false when sync is
    > > not complete say due to promotion or other reason like timeout.
    > > However, at this stage it is not very clear whether it will be useful
    > > to provide additional timeout parameter. But we can consider retruning
    > > true/false depending on whether we are successful in syncing the slots
    > > or not.
    >
    > I am not very sure if in the current scenario, such a return-value
    > will have any value addition. Since this function will be waiting
    > indefinitely until all the slots are synced, it is supposed to return
    > true in such normal scenarios. If it is interrupted by promotion or
    > user cancels it manually, then it is supposed to return false. But in
    > those cases, a more helpful approach would be to log a clear WARNING
    > or ERROR message like "sync interrupted by promotion" (or similar
    > reasons), rather than relying on a return value. In future, if we plan
    > to add a timeout-parameter, then this return value makes more sense as
    > in normal scenarios as well, as it can easily return false if the
    > timeout value is short or the number of slots are huge or are stuck
    > waiting on primary.
    >
    > Additionally, if we do return a value, there may be an expectation
    > that the API should also provide details on the list of slots that
    > couldn't be synced. That could introduce unnecessary complexity at
    > this stage. We can avoid it for now and consider adding such
    > enhancements later if we receive relevant customer feedback.
    >
    
    makes sense.
    
    > Please
    > note that our recommended approach for syncing slots still remains the
    > 'slot sync worker' method.
    >
    
    Right.
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  17. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Dilip Kumar <dilipbalaut@gmail.com> — 2025-07-21T05:59:53Z

    On Mon, Jul 21, 2025 at 10:08 AM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Sat, Jul 19, 2025 at 5:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    > >
    > > On Fri, Jul 18, 2025 at 11:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
    > > >
    > > > On Fri, Jul 18, 2025 at 11:25 AM shveta malik <shveta.malik@gmail.com> wrote:
    > > > >
    > > > > Okay.  I see your point. Yes, it was non-blocking earlier but it was
    > > > > not giving ERROR, it was just dumping in logilfe that primary is
    > > > > behind and thus slot-sync could not be done.
    > > > >
    > > > > If we continue using the non-blocking mode, there’s a risk that the
    > > > > API may never successfully sync the slots. This is because it
    > > > > eventually drops the temporary slot on exit, and when it tries to
    > > > > create a new one later on subsequent call, it’s likely that the new
    > > > > slot will again be ahead of the primary. This may happen if we have
    > > > > continuous ongoing writes on the primary and the logical slot is not
    > > > > being consumed at the same pace.
    > > > >
    > > > > My preference would be to avoid including such an option as it is
    > > > > confusing. With such an option in place, users may think that
    > > > > slot-sync is completed while that may not be the case.
    > > >
    > > > Fair enough
    > > >
    > >
    > > I think if we want we may return bool and return false when sync is
    > > not complete say due to promotion or other reason like timeout.
    > > However, at this stage it is not very clear whether it will be useful
    > > to provide additional timeout parameter. But we can consider retruning
    > > true/false depending on whether we are successful in syncing the slots
    > > or not.
    >
    > I am not very sure if in the current scenario, such a return-value
    > will have any value addition. Since this function will be waiting
    > indefinitely until all the slots are synced, it is supposed to return
    > true in such normal scenarios. If it is interrupted by promotion or
    > user cancels it manually, then it is supposed to return false. But in
    > those cases, a more helpful approach would be to log a clear WARNING
    > or ERROR message like "sync interrupted by promotion" (or similar
    > reasons), rather than relying on a return value. In future, if we plan
    > to add a timeout-parameter, then this return value makes more sense as
    > in normal scenarios as well, as it can easily return false if the
    > timeout value is short or the number of slots are huge or are stuck
    > waiting on primary.
    
    > Additionally, if we do return a value, there may be an expectation
    > that the API should also provide details on the list of slots that
    > couldn't be synced. That could introduce unnecessary complexity at
    > this stage. We can avoid it for now and consider adding such
    > enhancements later if we receive relevant customer feedback. Please
    > note that our recommended approach for syncing slots still remains the
    > 'slot sync worker' method.
    
    +1
    
    -- 
    Regards,
    Dilip Kumar
    Google
    
    
    
    
  18. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-07-31T09:40:57Z

    On Thu, Jul 17, 2025 at 2:04 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Wed, Jul 16, 2025 at 3:47 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > > > I am not able to apply the patch to the latest head or even to a week
    > > > back version. Can you please check and rebase?
    > > >
    > > > thanks
    > > > Shveta
    > >
    > > Rebased.
    > >
    >
    > Thanks. Please find a few comments:
    >
    >
    > 1)
    > /* Any slot with NULL in these fields should not have made it this far */
    >
    > It is good to get rid of the case where we had checks for NULL
    > confirmed_lsn and catalog_xmin (i.e. when slot was in RS_EPHEMERAL
    > state), as that has already been checked by synchronize_slots() and
    > such a slot will not even reach wait_for_primary_slot_catchup(). But a
    > slot can still be invalidated on primary anytime, and thus during this
    > wait, we should check for primary's invalidation as we were doing in
    > v1.
    >
    
    I've added back the check for invalidated slots.
    
    > 2)
    > + * If in SQL API synchronization, and we've been  promoted, then no point
    >
    >  extra space before promoted.
    
    Fixed.
    
    >
    > 3)
    >
    > + if (!AmLogicalSlotSyncWorkerProcess() && PromoteIsTriggered())
    >
    > We don't need 'AmLogicalSlotSyncWorkerProcess' as that is already
    > checked at the beginning of this function.
    >
    
    Fixed.
    
    > 4)
    > + ereport(WARNING,
    > + errmsg("aborting sync for slot \"%s\"",
    > + remote_slot->name),
    > + errdetail("Promotion occurred before this slot was fully"
    > +   " synchronized."));
    > + pfree(cmd.data);
    > +
    > + return false;
    >
    > a) Please add an error-code.
    >
    > b) Shall we change msg to
    >
    > errmsg("aborting sync for slot \"%s\"",
    > remote_slot->name),
    >                     errhint("%s cannot be executed once promotion is
    > triggered.",
    >
    > "pg_sync_replication_slots()")));
    >
    >
    
    Since there is already an error return in the start if promotion is
    triggered, I've kept the same error code and message here as well for
    consistency.
    
    >
    > 5)
    > Instead of using PromoteIsTriggered, shall we rely on
    > 'SlotSyncCtx->stopSignaled' as we do when we start this API.
    >
    
    Fixed.
    
    > 6)
    > In logicaldecoding.sgml, we can get rid of "Additionally, enabling
    > sync_replication_slots on the standby is required" to make it same as
    > what we had prior to the patch I pointed earlier.
    >
    > Or better we can refine it to below. Thoughts?
    >
    > The logical replication slots on the primary can be enabled for
    > synchronization to the hot standby by using the failover parameter of
    > pg_create_logical_replication_slot, or by using the failover option of
    > CREATE SUBSCRIPTION during slot creation. After that, synchronization
    > can be performed either manually by calling pg_sync_replication_slots
    > on the standby, or automatically by enabling sync_replication_slots on
    > the standby. When sync_replication_slots is enabled, the failover
    > slots are periodically synchronized by the slot sync worker. For the
    > synchronization to work, .....
    >
    
    Updated as above.
    
    Patch v3 attached.
    
    Regards,
    Ajin Cherian
    Fujitsu Australia
    
  19. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-08-01T06:32:06Z

    On Thu, Jul 31, 2025 at 3:11 PM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    >
    > Patch v3 attached.
    >
    
    Thanks for the patch. I tested it, please find a few comments:
    
    
    1)
    it hits an assert
    (slotsync_reread_config()-->Assert(sync_replication_slots)) when API
    is trying to sync and is in wait loop while in another session, I
    enable sync_replication_slots using:
    
    ALTER SYSTEM SET sync_replication_slots = 'on';
    SELECT pg_reload_conf();
    
    Assert:
    025-08-01 10:55:43.637 IST [118576] STATEMENT:  SELECT
    pg_sync_replication_slots();
    2025-08-01 10:55:51.730 IST [118563] LOG:  received SIGHUP, reloading
    configuration files
    2025-08-01 10:55:51.731 IST [118563] LOG:  parameter
    "sync_replication_slots" changed to "on"
    TRAP: failed Assert("sync_replication_slots"), File: "slotsync.c",
    Line: 1334, PID: 118576
    postgres: shveta postgres [local]
    SELECT(ExceptionalCondition+0xbb)[0x61df0160e090]
    postgres: shveta postgres [local] SELECT(+0x6520dc)[0x61df0133a0dc]
    2025-08-01 10:55:51.739 IST [118666] ERROR:  cannot synchronize
    replication slots concurrently
    postgres: shveta postgres [local] SELECT(+0x6522b2)[0x61df0133a2b2]
    postgres: shveta postgres [local] SELECT(+0x650664)[0x61df01338664]
    postgres: shveta postgres [local] SELECT(+0x650cf8)[0x61df01338cf8]
    postgres: shveta postgres [local] SELECT(+0x6513ea)[0x61df013393ea]
    postgres: shveta postgres [local] SELECT(+0x6519df)[0x61df013399df]
    postgres: shveta postgres [local]
    SELECT(SyncReplicationSlots+0xbb)[0x61df0133af60]
    postgres: shveta postgres [local]
    SELECT(pg_sync_replication_slots+0x1b1)[0x61df01357e52]
    
    2)
    + ereport(ERROR,
    + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
    + errmsg("cannot synchronize replication slots when"
    + " standby promotion is ongoing")));
    
    I think better error message will be:
    "exiting from slot synchronization as promotion is triggered"
    
    This will be better suited in log file as well after below wait statements:
    LOG:  continuing to wait for remote slot "failover_slot" LSN
    (0/3000060) and catalog xmin (755) to pass local slot LSN (0/3000060)
    and catalog xmin (757)
    STATEMENT:  SELECT pg_sync_replication_slots();
    
    3)
    API dumps this when it is waiting for primary:
    
    ----
    LOG:  could not synchronize replication slot "failover_slot2"
    DETAIL:  Synchronization could lead to data loss, because the remote
    slot needs WAL at LSN 0/03066E70 and catalog xmin 755, but the standby
    has LSN 0/03066E70 and catalog xmin 770.
    STATEMENT:  SELECT pg_sync_replication_slots();
    LOG:  waiting for remote slot "failover_slot2" LSN (0/3066E70) and
    catalog xmin (755) to pass local slot LSN (0/3066E70) and catalog xmin
    (770)
    STATEMENT:  SELECT pg_sync_replication_slots();
    LOG:  continuing to wait for remote slot "failover_slot2" LSN
    (0/3066E70) and catalog xmin (755) to pass local slot LSN (0/3066E70)
    and catalog xmin (770)
    STATEMENT:  SELECT pg_sync_replication_slots();
    ----
    
    Unsure if we shall still dump 'could not synchronize..' when it is
    going to retry until it succeeds? The concerned log gives a feeling
    that we are done trying and could not synchronize it. What do you
    think?
    
    thanks
    Shveta
    
    
    
    
  20. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-08-01T09:20:06Z

    On Fri, Aug 1, 2025 at 12:02 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Thu, Jul 31, 2025 at 3:11 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > >
    > > Patch v3 attached.
    > >
    >
    > Thanks for the patch. I tested it, please find a few comments:
    >
    >
    > 1)
    > it hits an assert
    > (slotsync_reread_config()-->Assert(sync_replication_slots)) when API
    > is trying to sync and is in wait loop while in another session, I
    > enable sync_replication_slots using:
    >
    > ALTER SYSTEM SET sync_replication_slots = 'on';
    > SELECT pg_reload_conf();
    >
    > Assert:
    > 025-08-01 10:55:43.637 IST [118576] STATEMENT:  SELECT
    > pg_sync_replication_slots();
    > 2025-08-01 10:55:51.730 IST [118563] LOG:  received SIGHUP, reloading
    > configuration files
    > 2025-08-01 10:55:51.731 IST [118563] LOG:  parameter
    > "sync_replication_slots" changed to "on"
    > TRAP: failed Assert("sync_replication_slots"), File: "slotsync.c",
    > Line: 1334, PID: 118576
    > postgres: shveta postgres [local]
    > SELECT(ExceptionalCondition+0xbb)[0x61df0160e090]
    > postgres: shveta postgres [local] SELECT(+0x6520dc)[0x61df0133a0dc]
    > 2025-08-01 10:55:51.739 IST [118666] ERROR:  cannot synchronize
    > replication slots concurrently
    > postgres: shveta postgres [local] SELECT(+0x6522b2)[0x61df0133a2b2]
    > postgres: shveta postgres [local] SELECT(+0x650664)[0x61df01338664]
    > postgres: shveta postgres [local] SELECT(+0x650cf8)[0x61df01338cf8]
    > postgres: shveta postgres [local] SELECT(+0x6513ea)[0x61df013393ea]
    > postgres: shveta postgres [local] SELECT(+0x6519df)[0x61df013399df]
    > postgres: shveta postgres [local]
    > SELECT(SyncReplicationSlots+0xbb)[0x61df0133af60]
    > postgres: shveta postgres [local]
    > SELECT(pg_sync_replication_slots+0x1b1)[0x61df01357e52]
    >
    > 2)
    > + ereport(ERROR,
    > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
    > + errmsg("cannot synchronize replication slots when"
    > + " standby promotion is ongoing")));
    >
    > I think better error message will be:
    > "exiting from slot synchronization as promotion is triggered"
    >
    > This will be better suited in log file as well after below wait statements:
    > LOG:  continuing to wait for remote slot "failover_slot" LSN
    > (0/3000060) and catalog xmin (755) to pass local slot LSN (0/3000060)
    > and catalog xmin (757)
    > STATEMENT:  SELECT pg_sync_replication_slots();
    >
    > 3)
    > API dumps this when it is waiting for primary:
    >
    > ----
    > LOG:  could not synchronize replication slot "failover_slot2"
    > DETAIL:  Synchronization could lead to data loss, because the remote
    > slot needs WAL at LSN 0/03066E70 and catalog xmin 755, but the standby
    > has LSN 0/03066E70 and catalog xmin 770.
    > STATEMENT:  SELECT pg_sync_replication_slots();
    > LOG:  waiting for remote slot "failover_slot2" LSN (0/3066E70) and
    > catalog xmin (755) to pass local slot LSN (0/3066E70) and catalog xmin
    > (770)
    > STATEMENT:  SELECT pg_sync_replication_slots();
    > LOG:  continuing to wait for remote slot "failover_slot2" LSN
    > (0/3066E70) and catalog xmin (755) to pass local slot LSN (0/3066E70)
    > and catalog xmin (770)
    > STATEMENT:  SELECT pg_sync_replication_slots();
    > ----
    >
    > Unsure if we shall still dump 'could not synchronize..' when it is
    > going to retry until it succeeds? The concerned log gives a feeling
    > that we are done trying and could not synchronize it. What do you
    > think?
    >
    
    A few more comments:
    
    4)
    
    + if (!tuplestore_gettupleslot(res->tuplestore, true, false, tupslot))
    + {
    + ereport(WARNING,
    + errmsg("aborting initial sync for slot \"%s\"",
    +    remote_slot->name),
    + errdetail("This slot was not found on the primary server."));
    +
    + pfree(cmd.data);
    + walrcv_clear_result(res);
    +
    + return false;
    + }
    
    We may have 'aborting sync for slot', can remove 'initial'.
    
    5)
    I tried a test where there were 4 slots on the publisher, where one
    was getting used while the others were not. Initiated
    pg_sync_replication_slots on standby. Forced unused slots to be
    invalidated by setting idle_replication_slot_timeout=60 on primary,
    due to which API finished but gave a warning:
    
    postgres=# SELECT pg_sync_replication_slots();
    WARNING:  aborting initial sync for slot "failover_slot"
    DETAIL:  This slot was invalidated on the primary server.
    WARNING:  aborting initial sync for slot "failover_slot2"
    DETAIL:  This slot was invalidated on the primary server.
    WARNING:  aborting initial sync for slot "failover_slot3"
    DETAIL:  This slot was invalidated on the primary server.
     pg_sync_replication_slots
    ---------------------------
    
    (1 row)
    
    Do we need these warnings here? I think we can have it as a LOG rather
    than having it on console. Thoughts?
    
    If we  inclined towards WARNING here, will ti be better to have it as
    a single line:
    
    WARNING:  aborting sync for slot "failover_slot" as the slot was
    invalidated on primary
    WARNING:  aborting sync for slot "failover_slot1" as the slot was
    invalidated on primary
    WARNING:  aborting sync for slot "failover_slot2" as the slot was
    invalidated on primary
    
    
    6)
    - * We do not drop the slot because the restart_lsn can be ahead of the
    - * current location when recreating the slot in the next cycle. It may
    - * take more time to create such a slot. Therefore, we keep this slot
    - * and attempt the synchronization in the next cycle.
    + * If we're in the slotsync worker, we retain the slot and retry in the
    + * next cycle. The restart_lsn might advance by then, allowing the slot
    + * to be created successfully later.
      */
    
    I like the previous command better as it was conveying the side-effect
    of dropping the slot herer. Can we try to incorporate the previous
    comment here and make it specific to slotsync workers?
    
    thanks
    Shveta
    
    
    
    
  21. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Amit Kapila <amit.kapila16@gmail.com> — 2025-08-04T06:00:49Z

    On Fri, Aug 1, 2025 at 2:50 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > 5)
    > I tried a test where there were 4 slots on the publisher, where one
    > was getting used while the others were not. Initiated
    > pg_sync_replication_slots on standby. Forced unused slots to be
    > invalidated by setting idle_replication_slot_timeout=60 on primary,
    > due to which API finished but gave a warning:
    >
    > postgres=# SELECT pg_sync_replication_slots();
    > WARNING:  aborting initial sync for slot "failover_slot"
    > DETAIL:  This slot was invalidated on the primary server.
    > WARNING:  aborting initial sync for slot "failover_slot2"
    > DETAIL:  This slot was invalidated on the primary server.
    > WARNING:  aborting initial sync for slot "failover_slot3"
    > DETAIL:  This slot was invalidated on the primary server.
    >  pg_sync_replication_slots
    > ---------------------------
    >
    > (1 row)
    >
    > Do we need these warnings here? I think we can have it as a LOG rather
    > than having it on console. Thoughts?
    >
    
    What is the behaviour of a slotsync worker in the same case? I don't
    see any such WARNING messages in the code of slotsync worker, so why
    do we want a different behaviour here?
    
    Few other comments:
    ======================
    1.
    update_and_persist_local_synced_slot()
    {
    ...
    + /*
    + * For SQL API synchronization, we wait for the remote slot to catch up
    + * here, since we can't assume the SQL API will be called again soon.
    + * We will retry the sync once the slot catches up.
    + *
    + * Note: This will return false if a promotion is triggered on the
    + * standby while waiting, in which case we stop syncing and drop the
    + * temporary slot.
    + */
    + if (!wait_for_primary_slot_catchup(wrconn, remote_slot))
    + return false;
    
    Why is the wait added at this level? Shouldn't it be at API level aka
    in SyncReplicationSlots() or pg_sync_replication_slots() similar to
    what we do in ReplSlotSyncWorkerMain() for slotsync workers?
    
    2.
     REPLICATION_SLOTSYNC_MAIN "Waiting in main loop of slot sync worker."
    ...
    +REPLICATION_SLOTSYNC_PRIMARY_CATCHUP "Waiting for the primary to catch-up."
    
    Can't we reuse existing waitevent REPLICATION_SLOTSYNC_MAIN? We may
    want to change the description. Is there a specific reason to add a
    new wait_event for this API?
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  22. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-08-04T06:49:40Z

    On Mon, Aug 4, 2025 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
    >
    > On Fri, Aug 1, 2025 at 2:50 PM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > 5)
    > > I tried a test where there were 4 slots on the publisher, where one
    > > was getting used while the others were not. Initiated
    > > pg_sync_replication_slots on standby. Forced unused slots to be
    > > invalidated by setting idle_replication_slot_timeout=60 on primary,
    > > due to which API finished but gave a warning:
    > >
    > > postgres=# SELECT pg_sync_replication_slots();
    > > WARNING:  aborting initial sync for slot "failover_slot"
    > > DETAIL:  This slot was invalidated on the primary server.
    > > WARNING:  aborting initial sync for slot "failover_slot2"
    > > DETAIL:  This slot was invalidated on the primary server.
    > > WARNING:  aborting initial sync for slot "failover_slot3"
    > > DETAIL:  This slot was invalidated on the primary server.
    > >  pg_sync_replication_slots
    > > ---------------------------
    > >
    > > (1 row)
    > >
    > > Do we need these warnings here? I think we can have it as a LOG rather
    > > than having it on console. Thoughts?
    > >
    >
    > What is the behaviour of a slotsync worker in the same case? I don't
    > see any such WARNING messages in the code of slotsync worker, so why
    > do we want a different behaviour here?
    >
    
    We don’t have continuous waiting in the slot-sync worker if the remote
    slot is behind the local slot. But if during the first sync cycle the
    remote slot is behind, we keep the local slot as a temporary slot. In
    the next sync cycle, if we find the remote slot is invalidated, we
    mark the local slot as invalidated too, keeping it in this temporary
    state. There are no LOG or WARNING messages in this case. When the
    slot-sync worker stops or shuts down (like during promotion), it
    cleans up this temporary slot.
    
    Now, for the API behavior: if the remote slot is behind the local
    slot, the API enters a wait loop and logs:
    
    LOG:  waiting for remote slot "failover_slot" LSN (0/3000060) and
    catalog xmin (755) to pass local slot LSN (0/3000060) and catalog xmin
    (770)
    
    If it keeps waiting, every 10 seconds it logs:
    LOG:  continuing to wait for remote slot "failover_slot" LSN
    (0/3000060) and catalog xmin (755) to pass local slot LSN (0/3000060)
    and catalog xmin (770)
    
    If the remote slot becomes invalidated during this wait, currently it
    logs a WARNING and moves to syncing the next slot:
    WARNING:  aborting initial sync for slot "failover_slot" as the slot
    was invalidated on primary
    
    I think this WARNING is too strong. We could change it to a LOG
    message instead, mark the local slot as invalidated, exit the wait
    loop, and move on to syncing the next slot.
    
    Even though this LOG is not there in slotsync worker case, I think it
    makes more sense in API case due to continuous LOGs which suggested
    that API was waiting to sync this slot in wait-loop and thus we shall
    conclude it either by saying wait-over (like we do in successful sync
    case) or we can say 'LOG: aborting wait as the remote slot was
    invalidated' instead of above WARNING message. What do you suggest?
    
    thanks
    Shveta
    
    
    
    
  23. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Amit Kapila <amit.kapila16@gmail.com> — 2025-08-04T10:11:20Z

    On Mon, Aug 4, 2025 at 12:19 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Mon, Aug 4, 2025 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
    > >
    > > On Fri, Aug 1, 2025 at 2:50 PM shveta malik <shveta.malik@gmail.com> wrote:
    > > >
    > > > 5)
    > > > I tried a test where there were 4 slots on the publisher, where one
    > > > was getting used while the others were not. Initiated
    > > > pg_sync_replication_slots on standby. Forced unused slots to be
    > > > invalidated by setting idle_replication_slot_timeout=60 on primary,
    > > > due to which API finished but gave a warning:
    > > >
    > > > postgres=# SELECT pg_sync_replication_slots();
    > > > WARNING:  aborting initial sync for slot "failover_slot"
    > > > DETAIL:  This slot was invalidated on the primary server.
    > > > WARNING:  aborting initial sync for slot "failover_slot2"
    > > > DETAIL:  This slot was invalidated on the primary server.
    > > > WARNING:  aborting initial sync for slot "failover_slot3"
    > > > DETAIL:  This slot was invalidated on the primary server.
    > > >  pg_sync_replication_slots
    > > > ---------------------------
    > > >
    > > > (1 row)
    > > >
    > > > Do we need these warnings here? I think we can have it as a LOG rather
    > > > than having it on console. Thoughts?
    > > >
    > >
    > > What is the behaviour of a slotsync worker in the same case? I don't
    > > see any such WARNING messages in the code of slotsync worker, so why
    > > do we want a different behaviour here?
    > >
    >
    > We don’t have continuous waiting in the slot-sync worker if the remote
    > slot is behind the local slot. But if during the first sync cycle the
    > remote slot is behind, we keep the local slot as a temporary slot. In
    > the next sync cycle, if we find the remote slot is invalidated, we
    > mark the local slot as invalidated too, keeping it in this temporary
    > state. There are no LOG or WARNING messages in this case. When the
    > slot-sync worker stops or shuts down (like during promotion), it
    > cleans up this temporary slot.
    >
    > Now, for the API behavior: if the remote slot is behind the local
    > slot, the API enters a wait loop and logs:
    >
    > LOG:  waiting for remote slot "failover_slot" LSN (0/3000060) and
    > catalog xmin (755) to pass local slot LSN (0/3000060) and catalog xmin
    > (770)
    >
    > If it keeps waiting, every 10 seconds it logs:
    > LOG:  continuing to wait for remote slot "failover_slot" LSN
    > (0/3000060) and catalog xmin (755) to pass local slot LSN (0/3000060)
    > and catalog xmin (770)
    >
    > If the remote slot becomes invalidated during this wait, currently it
    > logs a WARNING and moves to syncing the next slot:
    > WARNING:  aborting initial sync for slot "failover_slot" as the slot
    > was invalidated on primary
    >
    > I think this WARNING is too strong. We could change it to a LOG
    > message instead, mark the local slot as invalidated, exit the wait
    > loop, and move on to syncing the next slot.
    >
    > Even though this LOG is not there in slotsync worker case, I think it
    > makes more sense in API case due to continuous LOGs which suggested
    > that API was waiting to sync this slot in wait-loop and thus we shall
    > conclude it either by saying wait-over (like we do in successful sync
    > case) or we can say 'LOG: aborting wait as the remote slot was
    > invalidated' instead of above WARNING message. What do you suggest?
    >
    
    I also think LOG is a better choice for this because there is nothing
    we can expect users to do even after seeing this. I feel this is more
    of an info for users.
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  24. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-08-05T03:58:27Z

    On Mon, Aug 4, 2025 at 3:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    >
    > On Mon, Aug 4, 2025 at 12:19 PM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > On Mon, Aug 4, 2025 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
    > > >
    > > > On Fri, Aug 1, 2025 at 2:50 PM shveta malik <shveta.malik@gmail.com> wrote:
    > > > >
    > > > > 5)
    > > > > I tried a test where there were 4 slots on the publisher, where one
    > > > > was getting used while the others were not. Initiated
    > > > > pg_sync_replication_slots on standby. Forced unused slots to be
    > > > > invalidated by setting idle_replication_slot_timeout=60 on primary,
    > > > > due to which API finished but gave a warning:
    > > > >
    > > > > postgres=# SELECT pg_sync_replication_slots();
    > > > > WARNING:  aborting initial sync for slot "failover_slot"
    > > > > DETAIL:  This slot was invalidated on the primary server.
    > > > > WARNING:  aborting initial sync for slot "failover_slot2"
    > > > > DETAIL:  This slot was invalidated on the primary server.
    > > > > WARNING:  aborting initial sync for slot "failover_slot3"
    > > > > DETAIL:  This slot was invalidated on the primary server.
    > > > >  pg_sync_replication_slots
    > > > > ---------------------------
    > > > >
    > > > > (1 row)
    > > > >
    > > > > Do we need these warnings here? I think we can have it as a LOG rather
    > > > > than having it on console. Thoughts?
    > > > >
    > > >
    > > > What is the behaviour of a slotsync worker in the same case? I don't
    > > > see any such WARNING messages in the code of slotsync worker, so why
    > > > do we want a different behaviour here?
    > > >
    > >
    > > We don’t have continuous waiting in the slot-sync worker if the remote
    > > slot is behind the local slot. But if during the first sync cycle the
    > > remote slot is behind, we keep the local slot as a temporary slot. In
    > > the next sync cycle, if we find the remote slot is invalidated, we
    > > mark the local slot as invalidated too, keeping it in this temporary
    > > state. There are no LOG or WARNING messages in this case. When the
    > > slot-sync worker stops or shuts down (like during promotion), it
    > > cleans up this temporary slot.
    > >
    > > Now, for the API behavior: if the remote slot is behind the local
    > > slot, the API enters a wait loop and logs:
    > >
    > > LOG:  waiting for remote slot "failover_slot" LSN (0/3000060) and
    > > catalog xmin (755) to pass local slot LSN (0/3000060) and catalog xmin
    > > (770)
    > >
    > > If it keeps waiting, every 10 seconds it logs:
    > > LOG:  continuing to wait for remote slot "failover_slot" LSN
    > > (0/3000060) and catalog xmin (755) to pass local slot LSN (0/3000060)
    > > and catalog xmin (770)
    > >
    > > If the remote slot becomes invalidated during this wait, currently it
    > > logs a WARNING and moves to syncing the next slot:
    > > WARNING:  aborting initial sync for slot "failover_slot" as the slot
    > > was invalidated on primary
    > >
    > > I think this WARNING is too strong. We could change it to a LOG
    > > message instead, mark the local slot as invalidated, exit the wait
    > > loop, and move on to syncing the next slot.
    > >
    > > Even though this LOG is not there in slotsync worker case, I think it
    > > makes more sense in API case due to continuous LOGs which suggested
    > > that API was waiting to sync this slot in wait-loop and thus we shall
    > > conclude it either by saying wait-over (like we do in successful sync
    > > case) or we can say 'LOG: aborting wait as the remote slot was
    > > invalidated' instead of above WARNING message. What do you suggest?
    > >
    >
    > I also think LOG is a better choice for this because there is nothing
    > we can expect users to do even after seeing this. I feel this is more
    > of an info for users.
    >
    
    Yes, it is more of an info for users.
    
    > 1.
    > update_and_persist_local_synced_slot()
    > {
    > ...
    > + /*
    > + * For SQL API synchronization, we wait for the remote slot to catch up
    > + * here, since we can't assume the SQL API will be called again soon.
    > + * We will retry the sync once the slot catches up.
    > + *
    > + * Note: This will return false if a promotion is triggered on the
    > + * standby while waiting, in which case we stop syncing and drop the
    > + * temporary slot.
    > + */
    > + if (!wait_for_primary_slot_catchup(wrconn, remote_slot))
    > + return false;
    >
    > Why is the wait added at this level? Shouldn't it be at API level aka
    > in SyncReplicationSlots() or pg_sync_replication_slots() similar to
    > what we do in ReplSlotSyncWorkerMain() for slotsync workers?
    
    The initial goal was to perform a single sync cycle for all slots. The
    logic was simple, if any slot couldn’t be synced because its remote
    slot was lagging, we would wait for the remote slot to catch up, and
    only then move on to the next slot.
    
    But if we consider moving wait logic to SyncReplicationSlots(), we
    will necessarily have to go with the logic that it will attempt to
    sync all slots in the first sync cycle, skipping those where remote
    slots are lagging.  It will then continue with multiple sync cycles
    until all slots are successfully synced. But if new remote slots are
    added in the meantime, they will be picked up in the next cycle, and
    the API then has to wait on those as well and this cycle may go on for
    longer.
    
    If we want to avoid continuously syncing newly added slots in later
    cycles and instead focus only on the ones that failed to sync during
    the first attempt, one approach is to maintain a list of failed slots
    from the initial cycle and only retry those in subsequent attempts.
    But this will add complexity to the implementation.
    
    IMO, attempting multiple sync cycles essentially makes the API behave
    more like slotsyncworker, which might not be desirable. I feel that
    performing just one sync cycle in the API is more in line with the
    expected behavior. And for that, the current implementation of
    wait-logic seems simpler. But let me know if you think otherwise or I
    have not understood your point clearly. I am open to more
    approaches/ideas here.
    
    thanks
    Shveta
    
    
    
    
  25. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Amit Kapila <amit.kapila16@gmail.com> — 2025-08-05T06:22:39Z

    On Tue, Aug 5, 2025 at 9:28 AM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Mon, Aug 4, 2025 at 3:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    > >
    > > On Mon, Aug 4, 2025 at 12:19 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > If we want to avoid continuously syncing newly added slots in later
    > cycles and instead focus only on the ones that failed to sync during
    > the first attempt, one approach is to maintain a list of failed slots
    > from the initial cycle and only retry those in subsequent attempts.
    > But this will add complexity to the implementation.
    >
    
    There will be some additional code for this but overall it improves
    the code in the lower level functions. We may want to use the existing
    remote_slot list for this purpose.
    
    The current proposed change in low-level functions appears to be
    difficult to maintain, especially the change proposed in
    update_and_persist_local_synced_slot(). If we can find a better way to
    achieve the same then we can consider the current approach as well.
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  26. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-08-06T02:03:28Z

    On Fri, Aug 1, 2025 at 4:32 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    >
    > Thanks for the patch. I tested it, please find a few comments:
    >
    >
    > 1)
    > it hits an assert
    > (slotsync_reread_config()-->Assert(sync_replication_slots)) when API
    > is trying to sync and is in wait loop while in another session, I
    > enable sync_replication_slots using:
    >
    > ALTER SYSTEM SET sync_replication_slots = 'on';
    > SELECT pg_reload_conf();
    >
    > Assert:
    > 025-08-01 10:55:43.637 IST [118576] STATEMENT:  SELECT
    > pg_sync_replication_slots();
    > 2025-08-01 10:55:51.730 IST [118563] LOG:  received SIGHUP, reloading
    > configuration files
    > 2025-08-01 10:55:51.731 IST [118563] LOG:  parameter
    > "sync_replication_slots" changed to "on"
    > TRAP: failed Assert("sync_replication_slots"), File: "slotsync.c",
    > Line: 1334, PID: 118576
    > postgres: shveta postgres [local]
    > SELECT(ExceptionalCondition+0xbb)[0x61df0160e090]
    > postgres: shveta postgres [local] SELECT(+0x6520dc)[0x61df0133a0dc]
    > 2025-08-01 10:55:51.739 IST [118666] ERROR:  cannot synchronize
    > replication slots concurrently
    > postgres: shveta postgres [local] SELECT(+0x6522b2)[0x61df0133a2b2]
    > postgres: shveta postgres [local] SELECT(+0x650664)[0x61df01338664]
    > postgres: shveta postgres [local] SELECT(+0x650cf8)[0x61df01338cf8]
    > postgres: shveta postgres [local] SELECT(+0x6513ea)[0x61df013393ea]
    > postgres: shveta postgres [local] SELECT(+0x6519df)[0x61df013399df]
    > postgres: shveta postgres [local]
    > SELECT(SyncReplicationSlots+0xbb)[0x61df0133af60]
    > postgres: shveta postgres [local]
    > SELECT(pg_sync_replication_slots+0x1b1)[0x61df01357e52]
    >
    
    Fixed.
    
    > 2)
    > + ereport(ERROR,
    > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
    > + errmsg("cannot synchronize replication slots when"
    > + " standby promotion is ongoing")));
    >
    > I think better error message will be:
    > "exiting from slot synchronization as promotion is triggered"
    >
    > This will be better suited in log file as well after below wait statements:
    > LOG:  continuing to wait for remote slot "failover_slot" LSN
    > (0/3000060) and catalog xmin (755) to pass local slot LSN (0/3000060)
    > and catalog xmin (757)
    > STATEMENT:  SELECT pg_sync_replication_slots();
    >
    
    Fixed.
    
    > 3)
    > API dumps this when it is waiting for primary:
    >
    > ----
    > LOG:  could not synchronize replication slot "failover_slot2"
    > DETAIL:  Synchronization could lead to data loss, because the remote
    > slot needs WAL at LSN 0/03066E70 and catalog xmin 755, but the standby
    > has LSN 0/03066E70 and catalog xmin 770.
    > STATEMENT:  SELECT pg_sync_replication_slots();
    > LOG:  waiting for remote slot "failover_slot2" LSN (0/3066E70) and
    > catalog xmin (755) to pass local slot LSN (0/3066E70) and catalog xmin
    > (770)
    > STATEMENT:  SELECT pg_sync_replication_slots();
    > LOG:  continuing to wait for remote slot "failover_slot2" LSN
    > (0/3066E70) and catalog xmin (755) to pass local slot LSN (0/3066E70)
    > and catalog xmin (770)
    > STATEMENT:  SELECT pg_sync_replication_slots();
    > ----
    >
    > Unsure if we shall still dump 'could not synchronize..' when it is
    > going to retry until it succeeds? The concerned log gives a feeling
    > that we are done trying and could not synchronize it. What do you
    > think?
    >
    
    I've modified the log to now say, "initial sync of replication slot
    \"%s\" failed; will keep retrying"
    
    
    On Fri, Aug 1, 2025 at 7:20 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    >
    > A few more comments:
    >
    > 4)
    >
    > + if (!tuplestore_gettupleslot(res->tuplestore, true, false, tupslot))
    > + {
    > + ereport(WARNING,
    > + errmsg("aborting initial sync for slot \"%s\"",
    > +    remote_slot->name),
    > + errdetail("This slot was not found on the primary server."));
    > +
    > + pfree(cmd.data);
    > + walrcv_clear_result(res);
    > +
    > + return false;
    > + }
    >
    > We may have 'aborting sync for slot', can remove 'initial'.
    
    Fixed.
    
    >
    > 5)
    > I tried a test where there were 4 slots on the publisher, where one
    > was getting used while the others were not. Initiated
    > pg_sync_replication_slots on standby. Forced unused slots to be
    > invalidated by setting idle_replication_slot_timeout=60 on primary,
    > due to which API finished but gave a warning:
    >
    > postgres=# SELECT pg_sync_replication_slots();
    > WARNING:  aborting initial sync for slot "failover_slot"
    > DETAIL:  This slot was invalidated on the primary server.
    > WARNING:  aborting initial sync for slot "failover_slot2"
    > DETAIL:  This slot was invalidated on the primary server.
    > WARNING:  aborting initial sync for slot "failover_slot3"
    > DETAIL:  This slot was invalidated on the primary server.
    >  pg_sync_replication_slots
    > ---------------------------
    >
    > (1 row)
    >
    > Do we need these warnings here? I think we can have it as a LOG rather
    > than having it on console. Thoughts?
    >
    > If we  inclined towards WARNING here, will ti be better to have it as
    > a single line:
    >
    > WARNING:  aborting sync for slot "failover_slot" as the slot was
    > invalidated on primary
    > WARNING:  aborting sync for slot "failover_slot1" as the slot was
    > invalidated on primary
    > WARNING:  aborting sync for slot "failover_slot2" as the slot was
    > invalidated on primary
    >
    
    I've changed it to LOG now.
    
    >
    > 6)
    > - * We do not drop the slot because the restart_lsn can be ahead of the
    > - * current location when recreating the slot in the next cycle. It may
    > - * take more time to create such a slot. Therefore, we keep this slot
    > - * and attempt the synchronization in the next cycle.
    > + * If we're in the slotsync worker, we retain the slot and retry in the
    > + * next cycle. The restart_lsn might advance by then, allowing the slot
    > + * to be created successfully later.
    >   */
    >
    > I like the previous command better as it was conveying the side-effect
    > of dropping the slot herer. Can we try to incorporate the previous
    > comment here and make it specific to slotsync workers?
    >
    
    Reverted to the previous version.
    
    Attaching patch v4 which addresses these comments.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  27. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-08-06T02:05:11Z

    On Tue, Aug 5, 2025 at 4:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    >
    > On Tue, Aug 5, 2025 at 9:28 AM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > On Mon, Aug 4, 2025 at 3:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    > > >
    > > > On Mon, Aug 4, 2025 at 12:19 PM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > If we want to avoid continuously syncing newly added slots in later
    > > cycles and instead focus only on the ones that failed to sync during
    > > the first attempt, one approach is to maintain a list of failed slots
    > > from the initial cycle and only retry those in subsequent attempts.
    > > But this will add complexity to the implementation.
    > >
    >
    > There will be some additional code for this but overall it improves
    > the code in the lower level functions. We may want to use the existing
    > remote_slot list for this purpose.
    >
    > The current proposed change in low-level functions appears to be
    > difficult to maintain, especially the change proposed in
    > update_and_persist_local_synced_slot(). If we can find a better way to
    > achieve the same then we can consider the current approach as well.
    >
    
    Next patch, I'll work on addressing this comment. I'll need to
    restructure the code to make this happen.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
    
    
    
  28. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-08-06T03:17:49Z

    On Wed, Aug 6, 2025 at 7:35 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > On Tue, Aug 5, 2025 at 4:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    > >
    > > On Tue, Aug 5, 2025 at 9:28 AM shveta malik <shveta.malik@gmail.com> wrote:
    > > >
    > > > On Mon, Aug 4, 2025 at 3:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    > > > >
    > > > > On Mon, Aug 4, 2025 at 12:19 PM shveta malik <shveta.malik@gmail.com> wrote:
    > > >
    > > > If we want to avoid continuously syncing newly added slots in later
    > > > cycles and instead focus only on the ones that failed to sync during
    > > > the first attempt, one approach is to maintain a list of failed slots
    > > > from the initial cycle and only retry those in subsequent attempts.
    > > > But this will add complexity to the implementation.
    > > >
    > >
    > > There will be some additional code for this but overall it improves
    > > the code in the lower level functions. We may want to use the existing
    > > remote_slot list for this purpose.
    > >
    > > The current proposed change in low-level functions appears to be
    > > difficult to maintain, especially the change proposed in
    > > update_and_persist_local_synced_slot(). If we can find a better way to
    > > achieve the same then we can consider the current approach as well.
    > >
    >
    > Next patch, I'll work on addressing this comment. I'll need to
    > restructure the code to make this happen.
    >
    
    Okay, thanks Ajin. I will resume review after this comment is
    addressed as I am assuming that the new logic will get rid of most of
    the current wait logic and thus it makes sense to review it after it
    is addressed.
    
    thanks
    Shveta
    
    
    
    
  29. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-08-08T13:21:59Z

    On Wed, Aug 6, 2025 at 8:48 AM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Wed, Aug 6, 2025 at 7:35 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > > On Tue, Aug 5, 2025 at 4:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    > > >
    > > > On Tue, Aug 5, 2025 at 9:28 AM shveta malik <shveta.malik@gmail.com> wrote:
    > > > >
    > > > > On Mon, Aug 4, 2025 at 3:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    > > > > >
    > > > > > On Mon, Aug 4, 2025 at 12:19 PM shveta malik <shveta.malik@gmail.com> wrote:
    > > > >
    > > > > If we want to avoid continuously syncing newly added slots in later
    > > > > cycles and instead focus only on the ones that failed to sync during
    > > > > the first attempt, one approach is to maintain a list of failed slots
    > > > > from the initial cycle and only retry those in subsequent attempts.
    > > > > But this will add complexity to the implementation.
    > > > >
    > > >
    > > > There will be some additional code for this but overall it improves
    > > > the code in the lower level functions. We may want to use the existing
    > > > remote_slot list for this purpose.
    > > >
    > > > The current proposed change in low-level functions appears to be
    > > > difficult to maintain, especially the change proposed in
    > > > update_and_persist_local_synced_slot(). If we can find a better way to
    > > > achieve the same then we can consider the current approach as well.
    > > >
    > >
    > > Next patch, I'll work on addressing this comment. I'll need to
    > > restructure the code to make this happen.
    > >
    >
    > Okay, thanks Ajin. I will resume review after this comment is
    > addressed as I am assuming that the new logic will get rid of most of
    > the current wait logic and thus it makes sense to review it after it
    > is addressed.
    
    There's also a minor merge conflict because func.sgml is not split
    into multiple files.
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  30. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-08-11T08:05:43Z

    On Tue, Aug 5, 2025 at 4:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    >
    > On Tue, Aug 5, 2025 at 9:28 AM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > On Mon, Aug 4, 2025 at 3:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    > > >
    > > > On Mon, Aug 4, 2025 at 12:19 PM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > If we want to avoid continuously syncing newly added slots in later
    > > cycles and instead focus only on the ones that failed to sync during
    > > the first attempt, one approach is to maintain a list of failed slots
    > > from the initial cycle and only retry those in subsequent attempts.
    > > But this will add complexity to the implementation.
    > >
    >
    > There will be some additional code for this but overall it improves
    > the code in the lower level functions. We may want to use the existing
    > remote_slot list for this purpose.
    >
    > The current proposed change in low-level functions appears to be
    > difficult to maintain, especially the change proposed in
    > update_and_persist_local_synced_slot(). If we can find a better way to
    > achieve the same then we can consider the current approach as well.
    
    Right. I've reworked the design to have the wait at a much lower
    level. I've also used a single WAIT EVENT -
    REPLICATION_SLOTSYNC_PRIMARY_CATCHUP for both the slotsync worker and
    the sync API.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  31. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-08-11T08:06:52Z

    On Fri, Aug 8, 2025 at 11:22 PM Ashutosh Bapat
    <ashutosh.bapat.oss@gmail.com> wrote:
    >
    >
    > There's also a minor merge conflict because func.sgml is not split
    > into multiple files.
    >
    
    Yes, I fixed this.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
    
    
    
  32. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-08-13T04:47:23Z

    On Mon, Aug 11, 2025 at 1:37 PM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > On Fri, Aug 8, 2025 at 11:22 PM Ashutosh Bapat
    > <ashutosh.bapat.oss@gmail.com> wrote:
    > >
    > >
    > > There's also a minor merge conflict because func.sgml is not split
    > > into multiple files.
    > >
    >
    > Yes, I fixed this.
    >
    
    Thanks for the patch. Please find a few comments:
    
    1)
    We can merge refresh_remote_slots and fetch_remote_slots by passing an
    argument of remote_list. If no remote_list passed, fetch all failover
    slots, else extend the query and fetch only the listed ones.
    
    2)
    We can get rid of 'sync_iterations' and the logic within, as I think
    there is no need to distinguish between slotsync and API in terms of
    logs.
    
    3)
    sync_start_pending is not needed to be passed to
    update_and_persist_local_synced_slot(), as the output of this function
    is good enough to tell whether slot is persisted or not.
    
    4)
    Also how about having sync-pending in SlotSyncCtxStruct. It can be set
    unconditionally by both slotsync and API, but will be used by API. I
    think it can simplify the code.
    
    5)
    We can get rid of 'pending_sync_start_slots', as it is not being used anywhere.
    
    6)
    Also we can mention in comments as to why we are using the old
    remote_slots list in refresh_remote_slots() during subsequent cycles
    of API rather than using only the pending-slot list.
    
    thanks
    Shveta
    
    
    
    
  33. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-08-14T01:58:26Z

    On Wed, Aug 13, 2025 at 2:47 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Mon, Aug 11, 2025 at 1:37 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > > On Fri, Aug 8, 2025 at 11:22 PM Ashutosh Bapat
    > > <ashutosh.bapat.oss@gmail.com> wrote:
    > > >
    > > >
    > > > There's also a minor merge conflict because func.sgml is not split
    > > > into multiple files.
    > > >
    > >
    > > Yes, I fixed this.
    > >
    >
    > Thanks for the patch. Please find a few comments:
    >
    > 1)
    > We can merge refresh_remote_slots and fetch_remote_slots by passing an
    > argument of remote_list. If no remote_list passed, fetch all failover
    > slots, else extend the query and fetch only the listed ones.
    >
    
    Done.
    
    > 2)
    > We can get rid of 'sync_iterations' and the logic within, as I think
    > there is no need to distinguish between slotsync and API in terms of
    > logs.
    >
    
    Done.
    
    > 3)
    > sync_start_pending is not needed to be passed to
    > update_and_persist_local_synced_slot(), as the output of this function
    > is good enough to tell whether slot is persisted or not.
    >
    > 4)
    > Also how about having sync-pending in SlotSyncCtxStruct. It can be set
    > unconditionally by both slotsync and API, but will be used by API. I
    > think it can simplify the code.
    >
    
    Done.
    
    > 5)
    > We can get rid of 'pending_sync_start_slots', as it is not being used anywhere.
    >
    
    Fixed.
    
    > 6)
    > Also we can mention in comments as to why we are using the old
    > remote_slots list in refresh_remote_slots() during subsequent cycles
    > of API rather than using only the pending-slot list.
    
    Done.
    
    Patch v6 attached.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  34. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-08-14T06:43:59Z

    On Thu, Aug 14, 2025 at 7:28 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    >
    > Patch v6 attached.
    >
    
    Thanks Ajin. Please find my comments:
    
    1)
    SyncReplicationSlots:
    + remote_slots = fetch_or_refresh_remote_slots(wrconn, NIL);
    +
    + /* Retry until all slots are sync ready atleast */
    + for (;;)
    + {
    + bool some_slot_updated = false;
    +
    + /*
    + * Refresh the remote slot data. We keep using the original slot
    + * list, even if some slots are already sync ready, so that all
    + * slots are updated with the latest status from the primary.
    + */
    + remote_slots = fetch_or_refresh_remote_slots(wrconn, remote_slots);
    
    When the API  begins, it seems we are fetching remote_list twice
    before we even sync it once. We can get rid of
    'fetch_or_refresh_remote_slots' from outside the loop and retain the
    inside one. At first call, remote_slots will be NIL and thus it will
    fetch all slots and in subsequent calls, it will be populated one.
    
    
    2)
    SyncReplicationSlots:
    + /*
    + * The syscache access in fetch_or_refresh_remote_slots() needs a
    + * transaction env.
    + */
    + if (!IsTransactionState()) {
    + StartTransactionCommand();
    + started_tx = true;
    + }
    
    + if (started_tx)
    + CommitTransactionCommand();
    
    Shall we move these 2 inside fetch_or_refresh_remote_slots() (both
    worker and APi flow) similar to how validate_remote_info() also has it
    inside?
    
    3)
    SyncReplicationSlots:
    + /* Done if all slots are atleast sync ready */
    + if (!SlotSyncCtx->slot_not_persisted)
    + break;
    + else
    + {
    + /* wait for 2 seconds before retrying */
    + wait_for_slot_activity(some_slot_updated, true);
    
    No need to have 'else' block here. The code can be put without having
    'else' because 'if' when true, breaks from the loop.
    
    
    4)
    'fetch_or_refresh_remote_slots' can be renamed to 'fetch_remote_slots'
    simply and then a comment can define an extra argument. Because
    ultimately we are re-fetching some/all slots in both cases.
    
    5)
    In the case of API, wait_for_slot_activity() does not change its wait
    time based on 'some_slot_updated'. I think we can pull 'WaitLatch,
    ResetLatch' in API-function itself and lets not change worker's
    wait_for_slot_activity().
    
    6)
    fetch_or_refresh_remote_slots:
    + {
    + if (is_refresh)
    + {
    + ereport(WARNING,
    + errmsg("could not fetch updated failover logical slots info"
    +    " from the primary server: %s",
    +    res->err));
    + pfree(query.data);
    + return remote_slot_list; /* Return original list on refresh failure */
    + }
    + else
    + {
    + ereport(ERROR,
    + errmsg("could not fetch failover logical slots info from the primary
    server: %s",
    +    res->err));
    + }
    + }
    
    I think there is no need for different behaviour here for worker and
    API. Since worker errors-out here, we can make API also error-out.
    
    7)
    +fetch_or_refresh_remote_slots(WalReceiverConn *wrconn, List *remote_slot_list)
    
    We can name the argument as 'target_slot_list' and replace the name
    'updated_slot_list' with 'remote_slot_list'.
    
    
    8)
    + /* If refreshing, free the original list structures */
    + if (is_refresh)
    + {
    + foreach(lc, remote_slot_list)
    + {
    + RemoteSlot *old_slot = (RemoteSlot *) lfirst(lc);
    + pfree(old_slot);
    + }
    + list_free(remote_slot_list);
    + }
    
    We can get rid of 'is_refresh' and can simply check if
    'target_slot_list != NIL', free it. We can use list_free_deep instead
    of freeing each element. Having said that, it looks slightly odd to
    free the list in this function, I will think  more here. Meanwhile, we
    can do this.
    
    
    9)
    -update_and_persist_local_synced_slot(RemoteSlot *remote_slot, Oid remote_dbid)
    +update_and_persist_local_synced_slot(WalReceiverConn * wrconn,
    + RemoteSlot * remote_slot, Oid remote_dbid)
    
    We can get rid of wrconn as we are not using it. Same with wrconn
    argument for synchronize_one_slot()
    
    10)
    + /* used by pg_sync_replication_slots() API only */
    + bool slot_not_persisted;
    
    We can move comment outside structure. We can first define it and then
    say the above line.
    
    
    11)
    + SlotSyncCtx->slot_not_persisted = false;
    
    This may overwrite the 'slot_not_persisted' set for the previous slot
    and ultimately make it 'false' at the end of cycle even though we had
    few not-persisted slots in the beginning of cycle. Should it be:
    
    SlotSyncCtx->slot_not_persisted |= false;
    
    12)
    Shall we rename this to : slot_persistence_pending (based on many
    other modules using similar names: detach_pending, send_pending,
    callback_pending)?
    
    13)
    - errmsg("could not synchronize replication slot \"%s\"",
    -    remote_slot->name),
    - errdetail("Synchronization could lead to data loss, because the
    remote slot needs WAL at LSN %X/%08X and catalog xmin %u, but the
    standby has LSN %X/%08X and catalog xmin %u.",
    -   LSN_FORMAT_ARGS(remote_slot->restart_lsn),
    -   remote_slot->catalog_xmin,
    -   LSN_FORMAT_ARGS(slot->data.restart_lsn),
    -   slot->data.catalog_xmin));
    + errmsg("Replication slot \"%s\" is not sync ready; will keep retrying",
    +    remote_slot->name),
    + errdetail("Attempting Synchronization could lead to data loss,
    because the remote slot needs WAL at LSN %X/%08X and catalog xmin %u,
    but the standby has LSN %X/%08X and catalog xmin %u.",
    + LSN_FORMAT_ARGS(remote_slot->restart_lsn),
    + remote_slot->catalog_xmin,
    + LSN_FORMAT_ARGS(slot->data.restart_lsn),
    + slot->data.catalog_xmin));
    
    We can retain the same message as it was put after a lot of
    discussion. We can attempt to change if others comment. The idea is
    since a worker dumps it in each subsequent cycle (if such a situation
    arises), on the same basis now the API can also do so because it is
    also performing multiple cycles now. Earlier I had suggested changing
    it for API based on messages 'continuing to wait..' which  are no
    longer there now.
    
    thanks
    Shveta
    
    
    
    
  35. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-08-14T12:44:29Z

    On Thu, Aug 14, 2025 at 12:14 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    >
    > 8)
    > + /* If refreshing, free the original list structures */
    > + if (is_refresh)
    > + {
    > + foreach(lc, remote_slot_list)
    > + {
    > + RemoteSlot *old_slot = (RemoteSlot *) lfirst(lc);
    > + pfree(old_slot);
    > + }
    > + list_free(remote_slot_list);
    > + }
    >
    > We can get rid of 'is_refresh' and can simply check if
    > 'target_slot_list != NIL', free it. We can use list_free_deep instead
    > of freeing each element. Having said that, it looks slightly odd to
    > free the list in this function, I will think  more here. Meanwhile, we
    > can do this.
    >
    
    +1.  The function prologue doesn't mention that the original list is
    deep freed. So a caller may try to access it after this call, which
    will lead to a crash. As a safe programming practice we should let the
    caller free the original list if it is not needed anymore OR modify
    the input list in-place and return it for the convenience of the
    caller like all list_* interfaces. At least we should document this
    behavior in the function prologue. You could also use foreach_ptr
    instead of foreach.
    
    > 13)
    > - errmsg("could not synchronize replication slot \"%s\"",
    > -    remote_slot->name),
    > - errdetail("Synchronization could lead to data loss, because the
    > remote slot needs WAL at LSN %X/%08X and catalog xmin %u, but the
    > standby has LSN %X/%08X and catalog xmin %u.",
    > -   LSN_FORMAT_ARGS(remote_slot->restart_lsn),
    > -   remote_slot->catalog_xmin,
    > -   LSN_FORMAT_ARGS(slot->data.restart_lsn),
    > -   slot->data.catalog_xmin));
    > + errmsg("Replication slot \"%s\" is not sync ready; will keep retrying",
    > +    remote_slot->name),
    > + errdetail("Attempting Synchronization could lead to data loss,
    > because the remote slot needs WAL at LSN %X/%08X and catalog xmin %u,
    > but the standby has LSN %X/%08X and catalog xmin %u.",
    > + LSN_FORMAT_ARGS(remote_slot->restart_lsn),
    > + remote_slot->catalog_xmin,
    > + LSN_FORMAT_ARGS(slot->data.restart_lsn),
    > + slot->data.catalog_xmin));
    >
    > We can retain the same message as it was put after a lot of
    > discussion. We can attempt to change if others comment. The idea is
    > since a worker dumps it in each subsequent cycle (if such a situation
    > arises), on the same basis now the API can also do so because it is
    > also performing multiple cycles now. Earlier I had suggested changing
    > it for API based on messages 'continuing to wait..' which  are no
    > longer there now.
    
    Also we usually don't use capital letters at the start of the error
    message. Any reason this is different?
    
    Some more
    
    + * When called from pg_sync_replication_slots, use a fixed 2
    + * second wait time.
    
    Function prologue doesn't mention this. Probably the prologue should
    contain only the first sentence there. Rest of the prologues just
    repeat comments in the function. The function is small enough that a
    reader could read the details from the function instead of the
    prologue.
    
    + wait_time = SLOTSYNC_API_NAPTIME_MS;
    + } else {
    
    } else and { should be on separate lines.
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  36. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-08-19T05:24:57Z

    On Thu, Aug 14, 2025 at 4:44 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Thu, Aug 14, 2025 at 7:28 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > >
    > > Patch v6 attached.
    > >
    >
    > Thanks Ajin. Please find my comments:
    >
    > 1)
    > SyncReplicationSlots:
    > + remote_slots = fetch_or_refresh_remote_slots(wrconn, NIL);
    > +
    > + /* Retry until all slots are sync ready atleast */
    > + for (;;)
    > + {
    > + bool some_slot_updated = false;
    > +
    > + /*
    > + * Refresh the remote slot data. We keep using the original slot
    > + * list, even if some slots are already sync ready, so that all
    > + * slots are updated with the latest status from the primary.
    > + */
    > + remote_slots = fetch_or_refresh_remote_slots(wrconn, remote_slots);
    >
    > When the API  begins, it seems we are fetching remote_list twice
    > before we even sync it once. We can get rid of
    > 'fetch_or_refresh_remote_slots' from outside the loop and retain the
    > inside one. At first call, remote_slots will be NIL and thus it will
    > fetch all slots and in subsequent calls, it will be populated one.
    >
    
    Fixed.
    
    >
    > 2)
    > SyncReplicationSlots:
    > + /*
    > + * The syscache access in fetch_or_refresh_remote_slots() needs a
    > + * transaction env.
    > + */
    > + if (!IsTransactionState()) {
    > + StartTransactionCommand();
    > + started_tx = true;
    > + }
    >
    > + if (started_tx)
    > + CommitTransactionCommand();
    >
    > Shall we move these 2 inside fetch_or_refresh_remote_slots() (both
    > worker and APi flow) similar to how validate_remote_info() also has it
    > inside?
    >
    
    I tried this but it doesn't work because when the transaction is
    committed, the memory allocation for the remote slots are also freed.
    So, this needs to be on the outside.
    
    > 3)
    > SyncReplicationSlots:
    > + /* Done if all slots are atleast sync ready */
    > + if (!SlotSyncCtx->slot_not_persisted)
    > + break;
    > + else
    > + {
    > + /* wait for 2 seconds before retrying */
    > + wait_for_slot_activity(some_slot_updated, true);
    >
    > No need to have 'else' block here. The code can be put without having
    > 'else' because 'if' when true, breaks from the loop.
    >
    
    Fixed.
    
    >
    > 4)
    > 'fetch_or_refresh_remote_slots' can be renamed to 'fetch_remote_slots'
    > simply and then a comment can define an extra argument. Because
    > ultimately we are re-fetching some/all slots in both cases.
    >
    
    Done.
    
    > 5)
    > In the case of API, wait_for_slot_activity() does not change its wait
    > time based on 'some_slot_updated'. I think we can pull 'WaitLatch,
    > ResetLatch' in API-function itself and lets not change worker's
    > wait_for_slot_activity().
    >
    
    Done.
    
    > 6)
    > fetch_or_refresh_remote_slots:
    > + {
    > + if (is_refresh)
    > + {
    > + ereport(WARNING,
    > + errmsg("could not fetch updated failover logical slots info"
    > +    " from the primary server: %s",
    > +    res->err));
    > + pfree(query.data);
    > + return remote_slot_list; /* Return original list on refresh failure */
    > + }
    > + else
    > + {
    > + ereport(ERROR,
    > + errmsg("could not fetch failover logical slots info from the primary
    > server: %s",
    > +    res->err));
    > + }
    > + }
    >
    > I think there is no need for different behaviour here for worker and
    > API. Since worker errors-out here, we can make API also error-out.
    >
    
    Fixed.
    
    > 7)
    > +fetch_or_refresh_remote_slots(WalReceiverConn *wrconn, List *remote_slot_list)
    >
    > We can name the argument as 'target_slot_list' and replace the name
    > 'updated_slot_list' with 'remote_slot_list'.
    >
    
    Fixed.
    
    >
    > 8)
    > + /* If refreshing, free the original list structures */
    > + if (is_refresh)
    > + {
    > + foreach(lc, remote_slot_list)
    > + {
    > + RemoteSlot *old_slot = (RemoteSlot *) lfirst(lc);
    > + pfree(old_slot);
    > + }
    > + list_free(remote_slot_list);
    > + }
    >
    > We can get rid of 'is_refresh' and can simply check if
    > 'target_slot_list != NIL', free it. We can use list_free_deep instead
    > of freeing each element. Having said that, it looks slightly odd to
    > free the list in this function, I will think  more here. Meanwhile, we
    > can do this.
    >
    
    Fixed.
    
    >
    > 9)
    > -update_and_persist_local_synced_slot(RemoteSlot *remote_slot, Oid remote_dbid)
    > +update_and_persist_local_synced_slot(WalReceiverConn * wrconn,
    > + RemoteSlot * remote_slot, Oid remote_dbid)
    >
    > We can get rid of wrconn as we are not using it. Same with wrconn
    > argument for synchronize_one_slot()
    >
    
    Done.
    
    > 10)
    > + /* used by pg_sync_replication_slots() API only */
    > + bool slot_not_persisted;
    >
    > We can move comment outside structure. We can first define it and then
    > say the above line.
    >
    
    Done.
    
    >
    > 11)
    > + SlotSyncCtx->slot_not_persisted = false;
    >
    > This may overwrite the 'slot_not_persisted' set for the previous slot
    > and ultimately make it 'false' at the end of cycle even though we had
    > few not-persisted slots in the beginning of cycle. Should it be:
    >
    > SlotSyncCtx->slot_not_persisted |= false;
    >
    
    Fixed.
    
    > 12)
    > Shall we rename this to : slot_persistence_pending (based on many
    > other modules using similar names: detach_pending, send_pending,
    > callback_pending)?
    >
    
    Done.
    
    > 13)
    > - errmsg("could not synchronize replication slot \"%s\"",
    > -    remote_slot->name),
    > - errdetail("Synchronization could lead to data loss, because the
    > remote slot needs WAL at LSN %X/%08X and catalog xmin %u, but the
    > standby has LSN %X/%08X and catalog xmin %u.",
    > -   LSN_FORMAT_ARGS(remote_slot->restart_lsn),
    > -   remote_slot->catalog_xmin,
    > -   LSN_FORMAT_ARGS(slot->data.restart_lsn),
    > -   slot->data.catalog_xmin));
    > + errmsg("Replication slot \"%s\" is not sync ready; will keep retrying",
    > +    remote_slot->name),
    > + errdetail("Attempting Synchronization could lead to data loss,
    > because the remote slot needs WAL at LSN %X/%08X and catalog xmin %u,
    > but the standby has LSN %X/%08X and catalog xmin %u.",
    > + LSN_FORMAT_ARGS(remote_slot->restart_lsn),
    > + remote_slot->catalog_xmin,
    > + LSN_FORMAT_ARGS(slot->data.restart_lsn),
    > + slot->data.catalog_xmin));
    >
    > We can retain the same message as it was put after a lot of
    > discussion. We can attempt to change if others comment. The idea is
    > since a worker dumps it in each subsequent cycle (if such a situation
    > arises), on the same basis now the API can also do so because it is
    > also performing multiple cycles now. Earlier I had suggested changing
    > it for API based on messages 'continuing to wait..' which  are no
    > longer there now.
    >
    
    Done.
    
    
    On Thu, Aug 14, 2025 at 10:44 PM Ashutosh Bapat
    <ashutosh.bapat.oss@gmail.com> wrote:
    >
    > On Thu, Aug 14, 2025 at 12:14 PM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > >
    > > 8)
    > > + /* If refreshing, free the original list structures */
    > > + if (is_refresh)
    > > + {
    > > + foreach(lc, remote_slot_list)
    > > + {
    > > + RemoteSlot *old_slot = (RemoteSlot *) lfirst(lc);
    > > + pfree(old_slot);
    > > + }
    > > + list_free(remote_slot_list);
    > > + }
    > >
    > > We can get rid of 'is_refresh' and can simply check if
    > > 'target_slot_list != NIL', free it. We can use list_free_deep instead
    > > of freeing each element. Having said that, it looks slightly odd to
    > > free the list in this function, I will think  more here. Meanwhile, we
    > > can do this.
    > >
    >
    > +1.  The function prologue doesn't mention that the original list is
    > deep freed. So a caller may try to access it after this call, which
    > will lead to a crash. As a safe programming practice we should let the
    > caller free the original list if it is not needed anymore OR modify
    > the input list in-place and return it for the convenience of the
    > caller like all list_* interfaces. At least we should document this
    > behavior in the function prologue. You could also use foreach_ptr
    > instead of foreach.
    >
    
    I've changed the logic so that it is the responsibility of the caller
    to free the list.
    
    > > 13)
    > > - errmsg("could not synchronize replication slot \"%s\"",
    > > -    remote_slot->name),
    > > - errdetail("Synchronization could lead to data loss, because the
    > > remote slot needs WAL at LSN %X/%08X and catalog xmin %u, but the
    > > standby has LSN %X/%08X and catalog xmin %u.",
    > > -   LSN_FORMAT_ARGS(remote_slot->restart_lsn),
    > > -   remote_slot->catalog_xmin,
    > > -   LSN_FORMAT_ARGS(slot->data.restart_lsn),
    > > -   slot->data.catalog_xmin));
    > > + errmsg("Replication slot \"%s\" is not sync ready; will keep retrying",
    > > +    remote_slot->name),
    > > + errdetail("Attempting Synchronization could lead to data loss,
    > > because the remote slot needs WAL at LSN %X/%08X and catalog xmin %u,
    > > but the standby has LSN %X/%08X and catalog xmin %u.",
    > > + LSN_FORMAT_ARGS(remote_slot->restart_lsn),
    > > + remote_slot->catalog_xmin,
    > > + LSN_FORMAT_ARGS(slot->data.restart_lsn),
    > > + slot->data.catalog_xmin));
    > >
    > > We can retain the same message as it was put after a lot of
    > > discussion. We can attempt to change if others comment. The idea is
    > > since a worker dumps it in each subsequent cycle (if such a situation
    > > arises), on the same basis now the API can also do so because it is
    > > also performing multiple cycles now. Earlier I had suggested changing
    > > it for API based on messages 'continuing to wait..' which  are no
    > > longer there now.
    >
    > Also we usually don't use capital letters at the start of the error
    > message. Any reason this is different?
    >
    
    Retained the old message.
    
    
    > Some more
    >
    > + * When called from pg_sync_replication_slots, use a fixed 2
    > + * second wait time.
    >
    > Function prologue doesn't mention this. Probably the prologue should
    > contain only the first sentence there. Rest of the prologues just
    > repeat comments in the function. The function is small enough that a
    > reader could read the details from the function instead of the
    > prologue.
    >
    > + wait_time = SLOTSYNC_API_NAPTIME_MS;
    > + } else {
    >
    > } else and { should be on separate lines.
    >
    
    I've removed the changes in this function and it is now the same as before.
    
    Attaching patch v7 addressing all the above comments.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  37. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-08-19T09:42:05Z

    On Tue, Aug 19, 2025 at 10:55 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > Attaching patch v7 addressing all the above comments.
    >
    
    Thank You for the patches. Please find a few comments:
    
    1)
    We are not resetting 'slot_persistence_pending' to false anywhere. So
    once it hits the flow which sets it to true, it will never become
    false even if remote-slot catches up in subsequent cycles, resulting
    in a hang of the API. We shall reset it before starting a new
    iteration in SyncReplicationSlots().
    
    2)
    We need to clean 'slot_persistence_pending' in reset_syncing_flag() as
    well which is called at the end of API or in failure of API. Even
    though the slot sync worker is not using it, we should clean it up in
    slotsync_worker_onexit() as well.
    
    3)
    + /* slot has been persisted, no need to retry */
    + SlotSyncCtx->slot_persistence_pending |= false;
    +
    
    This will not be needed once we reset this flag before each iteration
    in SyncReplicationSlots()
    
    4)
    -synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
    +synchronize_one_slot(WalReceiverConn * wrconn, RemoteSlot * remote_slot,
    + Oid remote_dbid)
    
    wrconn not used anywhere.
    
    5)
    + bool is_refresh = (target_slot_list!= NIL);
    
    is_refresh is not needed. We can simply check if
    target_slot_list!=NIL, then append it to cmd.
    
    6)
     * If remote_slot_list is NIL, fetches all failover logical slots from the
     * primary server. If remote_slot_list is provided, refreshes only those
     * specific slots with current values from the primary server.
    
    The usage of the word 'refreshing' is confusing. Since we are
    allocating a new remote-list everytime (instead of reusing or
    refreshing previous one), we can simply say:
    
    ------
    Fetches the failover logical slots info from the primary server
    
    If target_slot_list is NIL, fetches all failover logical slots from
    the primary server, otherwise fetches only the ones mentioned in
    target_slot_list
    ------
    
    The 'Parameters:' can also be adjusted accordingly.
    
    
    7)
     * Returns a list of RemoteSlot structures. If refreshing and the query fails,
     * returns the original list. Slots that no longer exist on the primary will
     * be removed from the list.
    
     This can be removed.
    
    
    8)
    - * If restart_lsn, confirmed_lsn or catalog_xmin is invalid but the
    - * slot is valid, that means we have fetched the remote_slot in its
    - * RS_EPHEMERAL state. In such a case, don't sync it; we can always
    - * sync it in the next sync cycle when the remote_slot is persisted
    - * and has valid lsn(s) and xmin values.
    - *
    - * XXX: In future, if we plan to expose 'slot->data.persistency' in
    - * pg_replication_slots view, then we can avoid fetching RS_EPHEMERAL
    - * slots in the first place.
    + * Apply ephemeral slot filtering. Skip slots that are in RS_EPHEMERAL
    + * state (invalid LSNs/xmin but not explicitly invalidated).
    
    We can retain the original comment.
    
    9)
    Apart from above, there are many changes (alignement, comments etc)
    which are not related to this particular improvement. We can get rid
    of those changes. The patch should have the changes pertaining to
    current improvement alone.
    
    thanks
    Shveta
    
    
    
    
  38. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Kirill Reshke <reshkekirill@gmail.com> — 2025-08-19T10:57:29Z

    On Tue, 19 Aug 2025 at 10:25, Ajin Cherian <itsajin@gmail.com> wrote:
    >
    
    > Attaching patch v7 addressing all the above comments.
    
    
    Looks like this thread is not attached to any commitfest entry?
    If so, can you please create one[0]? This will be beneficial for
    thread, both simplifying patch review and (possibly) increasing the
    number of reviewers.
    
    [0] https://commitfest.postgresql.org/55/new/
    
    -- 
    Best regards,
    Kirill Reshke
    
    
    
    
  39. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-08-20T05:22:58Z

    On Tue, Aug 19, 2025 at 7:42 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Tue, Aug 19, 2025 at 10:55 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > > Attaching patch v7 addressing all the above comments.
    > >
    >
    > Thank You for the patches. Please find a few comments:
    >
    > 1)
    > We are not resetting 'slot_persistence_pending' to false anywhere. So
    > once it hits the flow which sets it to true, it will never become
    > false even if remote-slot catches up in subsequent cycles, resulting
    > in a hang of the API. We shall reset it before starting a new
    > iteration in SyncReplicationSlots().
    >
    > 2)
    > We need to clean 'slot_persistence_pending' in reset_syncing_flag() as
    > well which is called at the end of API or in failure of API. Even
    > though the slot sync worker is not using it, we should clean it up in
    > slotsync_worker_onexit() as well.
    >
    
    Done.
    
    > 3)
    > + /* slot has been persisted, no need to retry */
    > + SlotSyncCtx->slot_persistence_pending |= false;
    > +
    >
    > This will not be needed once we reset this flag before each iteration
    > in SyncReplicationSlots()
    >
    
    Removed.
    
    > 4)
    > -synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
    > +synchronize_one_slot(WalReceiverConn * wrconn, RemoteSlot * remote_slot,
    > + Oid remote_dbid)
    >
    > wrconn not used anywhere.
    >
    
    Removed.
    
    > 5)
    > + bool is_refresh = (target_slot_list!= NIL);
    >
    > is_refresh is not needed. We can simply check if
    > target_slot_list!=NIL, then append it to cmd.
    >
    
    Changed.
    
    > 6)
    >  * If remote_slot_list is NIL, fetches all failover logical slots from the
    >  * primary server. If remote_slot_list is provided, refreshes only those
    >  * specific slots with current values from the primary server.
    >
    > The usage of the word 'refreshing' is confusing. Since we are
    > allocating a new remote-list everytime (instead of reusing or
    > refreshing previous one), we can simply say:
    >
    > ------
    > Fetches the failover logical slots info from the primary server
    >
    > If target_slot_list is NIL, fetches all failover logical slots from
    > the primary server, otherwise fetches only the ones mentioned in
    > target_slot_list
    > ------
    >
    > The 'Parameters:' can also be adjusted accordingly.
    >
    
    Done.
    
    >
    > 7)
    >  * Returns a list of RemoteSlot structures. If refreshing and the query fails,
    >  * returns the original list. Slots that no longer exist on the primary will
    >  * be removed from the list.
    >
    >  This can be removed.
    >
    
    Done.
    
    >
    > 8)
    > - * If restart_lsn, confirmed_lsn or catalog_xmin is invalid but the
    > - * slot is valid, that means we have fetched the remote_slot in its
    > - * RS_EPHEMERAL state. In such a case, don't sync it; we can always
    > - * sync it in the next sync cycle when the remote_slot is persisted
    > - * and has valid lsn(s) and xmin values.
    > - *
    > - * XXX: In future, if we plan to expose 'slot->data.persistency' in
    > - * pg_replication_slots view, then we can avoid fetching RS_EPHEMERAL
    > - * slots in the first place.
    > + * Apply ephemeral slot filtering. Skip slots that are in RS_EPHEMERAL
    > + * state (invalid LSNs/xmin but not explicitly invalidated).
    >
    > We can retain the original comment.
    >
    
    Done.
    
    > 9)
    > Apart from above, there are many changes (alignement, comments etc)
    > which are not related to this particular improvement. We can get rid
    > of those changes. The patch should have the changes pertaining to
    > current improvement alone.
    >
    
    I've removed them.
    Attaching patch v8 addressing the above comments.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  40. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-08-20T05:24:23Z

    On Tue, Aug 19, 2025 at 8:57 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
    >
    > On Tue, 19 Aug 2025 at 10:25, Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    >
    > > Attaching patch v7 addressing all the above comments.
    >
    >
    > Looks like this thread is not attached to any commitfest entry?
    > If so, can you please create one[0]? This will be beneficial for
    > thread, both simplifying patch review and (possibly) increasing the
    > number of reviewers.
    >
    > [0] https://commitfest.postgresql.org/55/new/
    
    Done.
    https://commitfest.postgresql.org/patch/5976/
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
    
    
    
  41. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-08-22T05:44:45Z

    On Wed, Aug 20, 2025 at 10:53 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    >
    > I've removed them.
    > Attaching patch v8 addressing the above comments.
    >
    
    Thanks for the patch. Please find a few comments:
    
    1)
    When the API is in progress, and meanwhile in another session we turn
    off hot_standby_feedback, the API session terminates abnormally.
    
    postgres=# SELECT pg_sync_replication_slots();
    server closed the connection unexpectedly
    
    It seems slotsync_reread_config() is not adjusted for API. It does
    proc_exit assuming it is a worker process.
    
    2)
    slotsync_worker_onexit():
    
            SlotSyncCtx->slot_persistence_pending = false;
    
            /*
             * If syncing_slots is true, it indicates that the process errored out
             * without resetting the flag. So, we need to clean up shared memory and
             * reset the flag here.
             */
            if (syncing_slots)
            {
                    SlotSyncCtx->syncing = false;
                    syncing_slots = false;
            }
    
    Shall we reset slot_persistence_pending inside 'if (syncing_slots)'?
    slot_persistence_pending can not be true without syncing_slots being
    true.
    
    3)
    reset_syncing_flag():
    
      SpinLockAcquire(&SlotSyncCtx->mutex);
      SlotSyncCtx->syncing = false;
    + SlotSyncCtx->slot_persistence_pending = false;
      SpinLockRelease(&SlotSyncCtx->mutex);
    
    Here we are changing slot_persistence_pending under mutex, while at
    other places, it is not protected by mutex. Is it intentional here?
    
    4)
    On rethinking, we maintain anything in shared memory if it has to be
    shared between a few processes. 'slot_persistence_pending' OTOH is
    required to be set and accessed by only one process at a time. Shall
    we move it out of SlotSyncCtxStruct and keep it static similar to
    'syncing_slots'? Rest of the setting, resetting flow remains the same.
    What do you think?
    
    
    5)
    /* Build the query based on whether we're fetching all or refreshing
    specific slots */
    
    Perhaps we can shift this comment to where we actually append
    target_slot_list. Better to have it something like:
    'If target_slot_list is provided, construct the query only to fetch given slots'
    
    
    thanks
    Shveta
    
    
    
    
  42. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-08-26T04:28:30Z

    On Fri, Aug 22, 2025 at 3:44 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Wed, Aug 20, 2025 at 10:53 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > >
    > > I've removed them.
    > > Attaching patch v8 addressing the above comments.
    > >
    >
    > Thanks for the patch. Please find a few comments:
    >
    > 1)
    > When the API is in progress, and meanwhile in another session we turn
    > off hot_standby_feedback, the API session terminates abnormally.
    >
    > postgres=# SELECT pg_sync_replication_slots();
    > server closed the connection unexpectedly
    >
    > It seems slotsync_reread_config() is not adjusted for API. It does
    > proc_exit assuming it is a worker process.
    >
    
    I've removed the API calling ProcessSlotSyncInterrupts() as I don't
    think the API needs to specifically handle shutdown requests, it works
    without calling this. And the other config checks, I don't think the
    API needs to handle, I think we should leave it to the user.
    
    > 2)
    > slotsync_worker_onexit():
    >
    >         SlotSyncCtx->slot_persistence_pending = false;
    >
    >         /*
    >          * If syncing_slots is true, it indicates that the process errored out
    >          * without resetting the flag. So, we need to clean up shared memory and
    >          * reset the flag here.
    >          */
    >         if (syncing_slots)
    >         {
    >                 SlotSyncCtx->syncing = false;
    >                 syncing_slots = false;
    >         }
    >
    > Shall we reset slot_persistence_pending inside 'if (syncing_slots)'?
    > slot_persistence_pending can not be true without syncing_slots being
    > true.
    >
    > 3)
    > reset_syncing_flag():
    >
    >   SpinLockAcquire(&SlotSyncCtx->mutex);
    >   SlotSyncCtx->syncing = false;
    > + SlotSyncCtx->slot_persistence_pending = false;
    >   SpinLockRelease(&SlotSyncCtx->mutex);
    >
    > Here we are changing slot_persistence_pending under mutex, while at
    > other places, it is not protected by mutex. Is it intentional here?
    >
    > 4)
    > On rethinking, we maintain anything in shared memory if it has to be
    > shared between a few processes. 'slot_persistence_pending' OTOH is
    > required to be set and accessed by only one process at a time. Shall
    > we move it out of SlotSyncCtxStruct and keep it static similar to
    > 'syncing_slots'? Rest of the setting, resetting flow remains the same.
    > What do you think?
    >
    
    Yes, I agree. I have modified it accordingly.
    
    >
    > 5)
    > /* Build the query based on whether we're fetching all or refreshing
    > specific slots */
    >
    > Perhaps we can shift this comment to where we actually append
    > target_slot_list. Better to have it something like:
    > 'If target_slot_list is provided, construct the query only to fetch given slots'
    >
    
    Changed.
    
    Attaching patch v9 addressing the above comments.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  43. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-08-29T05:01:39Z

    On Tue, Aug 26, 2025 at 9:58 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > On Fri, Aug 22, 2025 at 3:44 PM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > On Wed, Aug 20, 2025 at 10:53 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > > >
    > > >
    > > > I've removed them.
    > > > Attaching patch v8 addressing the above comments.
    > > >
    > >
    > > Thanks for the patch. Please find a few comments:
    > >
    > > 1)
    > > When the API is in progress, and meanwhile in another session we turn
    > > off hot_standby_feedback, the API session terminates abnormally.
    > >
    > > postgres=# SELECT pg_sync_replication_slots();
    > > server closed the connection unexpectedly
    > >
    > > It seems slotsync_reread_config() is not adjusted for API. It does
    > > proc_exit assuming it is a worker process.
    > >
    >
    > I've removed the API calling ProcessSlotSyncInterrupts() as I don't
    > think the API needs to specifically handle shutdown requests, it works
    > without calling this. And the other config checks, I don't think the
    > API needs to handle, I think we should leave it to the user.
    >
    > > 2)
    > > slotsync_worker_onexit():
    > >
    > >         SlotSyncCtx->slot_persistence_pending = false;
    > >
    > >         /*
    > >          * If syncing_slots is true, it indicates that the process errored out
    > >          * without resetting the flag. So, we need to clean up shared memory and
    > >          * reset the flag here.
    > >          */
    > >         if (syncing_slots)
    > >         {
    > >                 SlotSyncCtx->syncing = false;
    > >                 syncing_slots = false;
    > >         }
    > >
    > > Shall we reset slot_persistence_pending inside 'if (syncing_slots)'?
    > > slot_persistence_pending can not be true without syncing_slots being
    > > true.
    > >
    > > 3)
    > > reset_syncing_flag():
    > >
    > >   SpinLockAcquire(&SlotSyncCtx->mutex);
    > >   SlotSyncCtx->syncing = false;
    > > + SlotSyncCtx->slot_persistence_pending = false;
    > >   SpinLockRelease(&SlotSyncCtx->mutex);
    > >
    > > Here we are changing slot_persistence_pending under mutex, while at
    > > other places, it is not protected by mutex. Is it intentional here?
    > >
    > > 4)
    > > On rethinking, we maintain anything in shared memory if it has to be
    > > shared between a few processes. 'slot_persistence_pending' OTOH is
    > > required to be set and accessed by only one process at a time. Shall
    > > we move it out of SlotSyncCtxStruct and keep it static similar to
    > > 'syncing_slots'? Rest of the setting, resetting flow remains the same.
    > > What do you think?
    > >
    >
    > Yes, I agree. I have modified it accordingly.
    >
    > >
    > > 5)
    > > /* Build the query based on whether we're fetching all or refreshing
    > > specific slots */
    > >
    > > Perhaps we can shift this comment to where we actually append
    > > target_slot_list. Better to have it something like:
    > > 'If target_slot_list is provided, construct the query only to fetch given slots'
    > >
    >
    > Changed.
    >
    > Attaching patch v9 addressing the above comments.
    >
    
    Thank You for the patches. Please find a few comments.
    
    1)
    Change not needed:
    
      * without waiting for SLOTSYNC_RESTART_INTERVAL_SEC.
    + *
      */
    
    2)
    Regarding the naptime in API, I was thinking why not to use
    wait_for_slot_activity() directly? If there are no slots activity, it
    will keep on doubling the naptime starting from 2sec till max of
    30sec. Thoughts?
    
    We can rename MIN_SLOTSYNC_WORKER_NAPTIME_MS and
    MAX_SLOTSYNC_WORKER_NAPTIME_MS to MIN_SLOTSYNC_NAPTIME_MS and
    MAX_SLOTSYNC_NAPTIME_MS in such a case.
    
    3)
    + *   target_slot_list - List of RemoteSlot structures to refresh, or NIL to
    + *                      fetch all failover slots
    
    Can we please change it to:
    
    List of failover logical slots to fetch from primary, or NIL to fetch
    all failover logical slots
    
    4)
    In the worker, before each call to synchronize_slots(), we are
    starting a new transaction. It aligns with the previous implementation
    where StartTransaction was inside synchronize_slots(). But in API, we
    are doing StartTransaction once outside of the loop instead of doing
    before each synchronize_slots(), is it intentional? It may keep the
    transaction open for a long duration for the case where slots are not
    getting persisted soon.
    
    5)
    With ProcessSlotSyncInterrupts() being removed from API, can you
    please check the behaviour of API on smart-shutdown and rest of the
    shutdown modes? It should behave like other APIs. And what happens if
    I change primary_conninfo to some non-existing server when the API is
    running. Does it error out or keep retrying?
    
    thanks
    Shveta
    
    
    
    
  44. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-08-29T06:12:39Z

    On Fri, Aug 29, 2025 at 3:01 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Tue, Aug 26, 2025 at 9:58 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > 4)
    > In the worker, before each call to synchronize_slots(), we are
    > starting a new transaction. It aligns with the previous implementation
    > where StartTransaction was inside synchronize_slots(). But in API, we
    > are doing StartTransaction once outside of the loop instead of doing
    > before each synchronize_slots(), is it intentional? It may keep the
    > transaction open for a long duration for the case where slots are not
    > getting persisted soon.
    >
    
    I’ll address your other comments separately, but I wanted to respond
    to this one first. I did try the approach you suggested, but the issue
    is that we use the remote_slots list across loop iterations. If we end
    the transaction at the end of each iteration, the list gets freed and
    is no longer available for the next pass. Each iteration relies on the
    remote_slots list from the previous one to build the new list, which
    is why we can’t free it inside the loop.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
    
    
    
  45. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-08-29T08:50:41Z

    On Fri, Aug 29, 2025 at 11:42 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > On Fri, Aug 29, 2025 at 3:01 PM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > On Tue, Aug 26, 2025 at 9:58 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > > 4)
    > > In the worker, before each call to synchronize_slots(), we are
    > > starting a new transaction. It aligns with the previous implementation
    > > where StartTransaction was inside synchronize_slots(). But in API, we
    > > are doing StartTransaction once outside of the loop instead of doing
    > > before each synchronize_slots(), is it intentional? It may keep the
    > > transaction open for a long duration for the case where slots are not
    > > getting persisted soon.
    > >
    >
    > I’ll address your other comments separately, but I wanted to respond
    > to this one first. I did try the approach you suggested, but the issue
    > is that we use the remote_slots list across loop iterations. If we end
    > the transaction at the end of each iteration, the list gets freed and
    > is no longer available for the next pass. Each iteration relies on the
    > remote_slots list from the previous one to build the new list, which
    > is why we can’t free it inside the loop.
    
    Isn't that just a matter of allocating the list in appropriate long
    lived memory context?
    
    Here are some more comments
    <para>
    - The logical replication slots on the primary can be synchronized to
    - the hot standby by using the <literal>failover</literal> parameter of
    + The logical replication slots on the primary can be enabled for
    + synchronization to the hot standby by using the
    + <literal>failover</literal> parameter of
    
    This change corresponds to an existing feature. Should be a separate
    patch, which we may want to backport.
    
    - on the standby, the failover slots can be synchronized periodically in
    + <command>CREATE SUBSCRIPTION</command> during slot creation. After that,
    + synchronization can be be performed either manually by calling
    + <link linkend="pg-sync-replication-slots">
    ... snip ...
    - <note>
    - <para>
    - While enabling <link linkend="guc-sync-replication-slots">
    - <varname>sync_replication_slots</varname></link> allows for automatic
    - periodic synchronization of failover slots, they can also be manually
    ... snip ...
    
    I like the current documentation which separates the discussion of two
    methods. I think we should just improve the second paragraph instead
    of deleting it and changing the first one.
    
    * without waiting for SLOTSYNC_RESTART_INTERVAL_SEC.
    + *
    
    unnecessary blank line
    
    +/*
    + * Flag used by pg_sync_replication_slots()
    + * to do retries if the slot did not persist while syncing.
    + */
    +static bool slot_persistence_pending = false;
    
    I don't think we need to keep a global variable for this. The variable
    is used only inside SyncReplicationSlots() and the call depth is not
    more than a few calls. From synchronize_slots(), before which the
    variable is reset and after which the variable is checked, to
    update_and_persist_local_synced_slot() which sets the variable, all
    the functions return bool. All of them can be made to return an
    integer status instead indicating the result of the operation. If we
    do so we could check the return value of synchronize_slots() to decide
    whether to retry or not, isntead of maintaining a global variable
    which has a much wider scope than required. It's difficult to keep it
    updated over the time.
    
    + * Parameters:
    + * wrconn - Connection to the primary server
    + * target_slot_list - List of RemoteSlot structures to refresh, or NIL to
    + * fetch all failover slots
    *
    - * Returns TRUE if any of the slots gets updated in this sync-cycle.
    
    Need to describe the return value.
    
    */
    -static bool
    -synchronize_slots(WalReceiverConn *wrconn)
    +static List *
    +fetch_remote_slots(WalReceiverConn *wrconn, List *target_slot_list)
    
    I like the way this function is broken down into two. That breaking down is
    useful even without this feature.
    
    - /* Construct the remote_slot tuple and synchronize each slot locally */
    + /* Process the slot information */
    
    Probably these comments don't make much sense or they repeat what's
    already there in the function prologue.
    
    else
    - /* Create list of remote slots */
    + /* Add to updated list */
    
    Probably these comments don't make much sense or they repeat what's
    already there in the function prologue.
    
    @@ -1276,7 +1331,7 @@ wait_for_slot_activity(bool some_slot_updated)
    
    The function is too cute to be useful. The code should be part of
    ReplSlotSyncWorkerMain() just like other worker's main functions.
    
    void
    SyncReplicationSlots(WalReceiverConn *wrconn)
    {
    PG_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, PointerGetDatum(wrconn));
    {
    
    Shouldn't this function call CheckForInterrupts() somewhere in the
    loop since it could be potentially an infinite loop?
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  46. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-08-29T09:07:32Z

    On Fri, Aug 29, 2025 at 2:20 PM Ashutosh Bapat
    <ashutosh.bapat.oss@gmail.com> wrote:
    >
    > On Fri, Aug 29, 2025 at 11:42 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > > On Fri, Aug 29, 2025 at 3:01 PM shveta malik <shveta.malik@gmail.com> wrote:
    > > >
    > > > On Tue, Aug 26, 2025 at 9:58 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > > > 4)
    > > > In the worker, before each call to synchronize_slots(), we are
    > > > starting a new transaction. It aligns with the previous implementation
    > > > where StartTransaction was inside synchronize_slots(). But in API, we
    > > > are doing StartTransaction once outside of the loop instead of doing
    > > > before each synchronize_slots(), is it intentional? It may keep the
    > > > transaction open for a long duration for the case where slots are not
    > > > getting persisted soon.
    > > >
    > >
    > > I’ll address your other comments separately, but I wanted to respond
    > > to this one first. I did try the approach you suggested, but the issue
    > > is that we use the remote_slots list across loop iterations. If we end
    > > the transaction at the end of each iteration, the list gets freed and
    > > is no longer available for the next pass. Each iteration relies on the
    > > remote_slots list from the previous one to build the new list, which
    > > is why we can’t free it inside the loop.
    >
    > Isn't that just a matter of allocating the list in appropriate long
    > lived memory context?
    >
    
    +1. Since we're reallocating the list each time we fetch it from the
    remote server, it's not suitable for long-lived memory storage.
    Instead, should we extract the slot names during the initial fetch of
    failover slots and store them in the appropriate memory context? This
    extraction would only need to happen only when
    slot_persistence_pending is true during the first sync cycle.
    
    > Here are some more comments
    > <para>
    > - The logical replication slots on the primary can be synchronized to
    > - the hot standby by using the <literal>failover</literal> parameter of
    > + The logical replication slots on the primary can be enabled for
    > + synchronization to the hot standby by using the
    > + <literal>failover</literal> parameter of
    >
    > This change corresponds to an existing feature. Should be a separate
    > patch, which we may want to backport.
    >
    > - on the standby, the failover slots can be synchronized periodically in
    > + <command>CREATE SUBSCRIPTION</command> during slot creation. After that,
    > + synchronization can be be performed either manually by calling
    > + <link linkend="pg-sync-replication-slots">
    > ... snip ...
    > - <note>
    > - <para>
    > - While enabling <link linkend="guc-sync-replication-slots">
    > - <varname>sync_replication_slots</varname></link> allows for automatic
    > - periodic synchronization of failover slots, they can also be manually
    > ... snip ...
    >
    > I like the current documentation which separates the discussion of two
    > methods. I think we should just improve the second paragraph instead
    > of deleting it and changing the first one.
    >
    > * without waiting for SLOTSYNC_RESTART_INTERVAL_SEC.
    > + *
    >
    > unnecessary blank line
    >
    > +/*
    > + * Flag used by pg_sync_replication_slots()
    > + * to do retries if the slot did not persist while syncing.
    > + */
    > +static bool slot_persistence_pending = false;
    >
    > I don't think we need to keep a global variable for this. The variable
    > is used only inside SyncReplicationSlots() and the call depth is not
    > more than a few calls. From synchronize_slots(), before which the
    > variable is reset and after which the variable is checked, to
    > update_and_persist_local_synced_slot() which sets the variable, all
    > the functions return bool. All of them can be made to return an
    > integer status instead indicating the result of the operation. If we
    > do so we could check the return value of synchronize_slots() to decide
    > whether to retry or not, isntead of maintaining a global variable
    > which has a much wider scope than required. It's difficult to keep it
    > updated over the time.
    >
    > + * Parameters:
    > + * wrconn - Connection to the primary server
    > + * target_slot_list - List of RemoteSlot structures to refresh, or NIL to
    > + * fetch all failover slots
    > *
    > - * Returns TRUE if any of the slots gets updated in this sync-cycle.
    >
    > Need to describe the return value.
    >
    > */
    > -static bool
    > -synchronize_slots(WalReceiverConn *wrconn)
    > +static List *
    > +fetch_remote_slots(WalReceiverConn *wrconn, List *target_slot_list)
    >
    > I like the way this function is broken down into two. That breaking down is
    > useful even without this feature.
    >
    > - /* Construct the remote_slot tuple and synchronize each slot locally */
    > + /* Process the slot information */
    >
    > Probably these comments don't make much sense or they repeat what's
    > already there in the function prologue.
    >
    > else
    > - /* Create list of remote slots */
    > + /* Add to updated list */
    >
    > Probably these comments don't make much sense or they repeat what's
    > already there in the function prologue.
    >
    > @@ -1276,7 +1331,7 @@ wait_for_slot_activity(bool some_slot_updated)
    >
    > The function is too cute to be useful. The code should be part of
    > ReplSlotSyncWorkerMain() just like other worker's main functions.
    >
    
    I was thinking we can retain wait_for_slot_activity() as this can even
    be invoked from API flow. See my comment# 2 in [1]
    
    [1]: https://www.postgresql.org/message-id/CAJpy0uASzojKbzinpNu29xuYGsSRnSo%3D22CLhXaSt_43TVoBhQ%40mail.gmail.com
    
    thanks
    Shveta
    
    
    
    
  47. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-08-29T10:44:14Z

    On Fri, Aug 29, 2025 at 2:37 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Fri, Aug 29, 2025 at 2:20 PM Ashutosh Bapat
    > <ashutosh.bapat.oss@gmail.com> wrote:
    > >
    > > @@ -1276,7 +1331,7 @@ wait_for_slot_activity(bool some_slot_updated)
    > >
    > > The function is too cute to be useful. The code should be part of
    > > ReplSlotSyncWorkerMain() just like other worker's main functions.
    > >
    >
    > I was thinking we can retain wait_for_slot_activity() as this can even
    > be invoked from API flow. See my comment# 2 in [1]
    
    We want the SQL callable function to finish as fast as possible, and
    make all the slots sync ready as fast as possible. So a shorter nap
    time makes sense. We don't want to increase it per iteration. But sync
    worker is a long running worker and can afford to wait longer. In fact
    it should wait longer so as not to load the primary and the standby.
    Given that the naptimes in both cases can not be controlled by the
    same logic, I think it's better not to use the same function. Each of
    them should have separate code for napping. That way the logic which
    decides the nap time is closer to the code that naps making it more
    readable.
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  48. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-09-01T03:30:37Z

    On Fri, Aug 29, 2025 at 4:14 PM Ashutosh Bapat
    <ashutosh.bapat.oss@gmail.com> wrote:
    >
    > On Fri, Aug 29, 2025 at 2:37 PM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > On Fri, Aug 29, 2025 at 2:20 PM Ashutosh Bapat
    > > <ashutosh.bapat.oss@gmail.com> wrote:
    > > >
    > > > @@ -1276,7 +1331,7 @@ wait_for_slot_activity(bool some_slot_updated)
    > > >
    > > > The function is too cute to be useful. The code should be part of
    > > > ReplSlotSyncWorkerMain() just like other worker's main functions.
    > > >
    > >
    > > I was thinking we can retain wait_for_slot_activity() as this can even
    > > be invoked from API flow. See my comment# 2 in [1]
    >
    > We want the SQL callable function to finish as fast as possible, and
    > make all the slots sync ready as fast as possible. So a shorter nap
    > time makes sense. We don't want to increase it per iteration. But sync
    > worker is a long running worker and can afford to wait longer. In fact
    > it should wait longer so as not to load the primary and the standby.
    > Given that the naptimes in both cases can not be controlled by the
    > same logic, I think it's better not to use the same function.
    
    Okay, makes sense.
    
    thanks
    Shveta
    
    
    
    
  49. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-09-03T06:28:02Z

    On Fri, Aug 29, 2025 at 3:01 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Tue, Aug 26, 2025 at 9:58 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > Thank You for the patches. Please find a few comments.
    >
    > 1)
    > Change not needed:
    >
    >   * without waiting for SLOTSYNC_RESTART_INTERVAL_SEC.
    > + *
    >   */
    >
    
    Removed.
    
    > 2)
    > Regarding the naptime in API, I was thinking why not to use
    > wait_for_slot_activity() directly? If there are no slots activity, it
    > will keep on doubling the naptime starting from 2sec till max of
    > 30sec. Thoughts?
    >
    > We can rename MIN_SLOTSYNC_WORKER_NAPTIME_MS and
    > MAX_SLOTSYNC_WORKER_NAPTIME_MS to MIN_SLOTSYNC_NAPTIME_MS and
    > MAX_SLOTSYNC_NAPTIME_MS in such a case.
    >
    
    Not changing this since further discussion clarified this.
    
    > 3)
    > + *   target_slot_list - List of RemoteSlot structures to refresh, or NIL to
    > + *                      fetch all failover slots
    >
    > Can we please change it to:
    >
    > List of failover logical slots to fetch from primary, or NIL to fetch
    > all failover logical slots
    >
    
    Changed the variable itself.
    
    > 4)
    > In the worker, before each call to synchronize_slots(), we are
    > starting a new transaction. It aligns with the previous implementation
    > where StartTransaction was inside synchronize_slots(). But in API, we
    > are doing StartTransaction once outside of the loop instead of doing
    > before each synchronize_slots(), is it intentional? It may keep the
    > transaction open for a long duration for the case where slots are not
    > getting persisted soon.
    >
    
    I've added a new memory context to handle slot_names
    
    > 5)
    > With ProcessSlotSyncInterrupts() being removed from API, can you
    > please check the behaviour of API on smart-shutdown and rest of the
    > shutdown modes? It should behave like other APIs. And what happens if
    > I change primary_conninfo to some non-existing server when the API is
    > running. Does it error out or keep retrying?
    >
    
    I've tested with different types of shutdown and it seems to be
    handled corerctly. However, yes, if configuration changed, the API
    does not handle. I've written a new function
    slotsync_api_reread_config() to specifically handle configuration
    changes in API context as it is different from slotsync worker logic.
    
    
    On Fri, Aug 29, 2025 at 6:50 PM Ashutosh Bapat
    <ashutosh.bapat.oss@gmail.com> wrote:
    >
    > On Fri, Aug 29, 2025 at 11:42 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > > On Fri, Aug 29, 2025 at 3:01 PM shveta malik <shveta.malik@gmail.com> wrote:
    > > >
    > > > On Tue, Aug 26, 2025 at 9:58 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > > > 4)
    > > > In the worker, before each call to synchronize_slots(), we are
    > > > starting a new transaction. It aligns with the previous implementation
    > > > where StartTransaction was inside synchronize_slots(). But in API, we
    > > > are doing StartTransaction once outside of the loop instead of doing
    > > > before each synchronize_slots(), is it intentional? It may keep the
    > > > transaction open for a long duration for the case where slots are not
    > > > getting persisted soon.
    > > >
    > >
    > > I’ll address your other comments separately, but I wanted to respond
    > > to this one first. I did try the approach you suggested, but the issue
    > > is that we use the remote_slots list across loop iterations. If we end
    > > the transaction at the end of each iteration, the list gets freed and
    > > is no longer available for the next pass. Each iteration relies on the
    > > remote_slots list from the previous one to build the new list, which
    > > is why we can’t free it inside the loop.
    >
    > Isn't that just a matter of allocating the list in appropriate long
    > lived memory context?
    >
    
    Yes, changed this.
    
    > Here are some more comments
    > <para>
    > - The logical replication slots on the primary can be synchronized to
    > - the hot standby by using the <literal>failover</literal> parameter of
    > + The logical replication slots on the primary can be enabled for
    > + synchronization to the hot standby by using the
    > + <literal>failover</literal> parameter of
    >
    > This change corresponds to an existing feature. Should be a separate
    > patch, which we may want to backport.
    >
    
    Removed.
    
    > - on the standby, the failover slots can be synchronized periodically in
    > + <command>CREATE SUBSCRIPTION</command> during slot creation. After that,
    > + synchronization can be be performed either manually by calling
    > + <link linkend="pg-sync-replication-slots">
    > ... snip ...
    > - <note>
    > - <para>
    > - While enabling <link linkend="guc-sync-replication-slots">
    > - <varname>sync_replication_slots</varname></link> allows for automatic
    > - periodic synchronization of failover slots, they can also be manually
    > ... snip ...
    >
    > I like the current documentation which separates the discussion of two
    > methods. I think we should just improve the second paragraph instead
    > of deleting it and changing the first one.
    >
    > * without waiting for SLOTSYNC_RESTART_INTERVAL_SEC.
    > + *
    >
    > unnecessary blank line
    >
    
    Changed.
    
    > +/*
    > + * Flag used by pg_sync_replication_slots()
    > + * to do retries if the slot did not persist while syncing.
    > + */
    > +static bool slot_persistence_pending = false;
    >
    > I don't think we need to keep a global variable for this. The variable
    > is used only inside SyncReplicationSlots() and the call depth is not
    > more than a few calls. From synchronize_slots(), before which the
    > variable is reset and after which the variable is checked, to
    > update_and_persist_local_synced_slot() which sets the variable, all
    > the functions return bool. All of them can be made to return an
    > integer status instead indicating the result of the operation. If we
    > do so we could check the return value of synchronize_slots() to decide
    > whether to retry or not, isntead of maintaining a global variable
    > which has a much wider scope than required. It's difficult to keep it
    > updated over the time.
    >
    
    The problem is that all those calls synchronize_slots() and
    update_and_persist_local_synced_slot() are shared with the slotsync
    worker logic and API. Hence, changing this will affect slotsync_worker
    logic as well. While the API needs to spefically retry only if the
    initial sync fails, the slotsync worker will always be retrying. I
    feel using a global variable is a more convenient way of doing this.
    
    > + * Parameters:
    > + * wrconn - Connection to the primary server
    > + * target_slot_list - List of RemoteSlot structures to refresh, or NIL to
    > + * fetch all failover slots
    > *
    > - * Returns TRUE if any of the slots gets updated in this sync-cycle.
    >
    > Need to describe the return value.
    >
    
    Added.
    
    > */
    > -static bool
    > -synchronize_slots(WalReceiverConn *wrconn)
    > +static List *
    > +fetch_remote_slots(WalReceiverConn *wrconn, List *target_slot_list)
    >
    > I like the way this function is broken down into two. That breaking down is
    > useful even without this feature.
    >
    > - /* Construct the remote_slot tuple and synchronize each slot locally */
    > + /* Process the slot information */
    >
    > Probably these comments don't make much sense or they repeat what's
    > already there in the function prologue.
    >
    > else
    > - /* Create list of remote slots */
    > + /* Add to updated list */
    >
    > Probably these comments don't make much sense or they repeat what's
    > already there in the function prologue.
    >
    
    Removed these comments.
    
    > @@ -1276,7 +1331,7 @@ wait_for_slot_activity(bool some_slot_updated)
    >
    > The function is too cute to be useful. The code should be part of
    > ReplSlotSyncWorkerMain() just like other worker's main functions.
    >
    
    But this wouldn't be part of this feature.
    
    > void
    > SyncReplicationSlots(WalReceiverConn *wrconn)
    > {
    > PG_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, PointerGetDatum(wrconn));
    > {
    >
    > Shouldn't this function call CheckForInterrupts() somewhere in the
    > loop since it could be potentially an infinite loop?
    
    I've tested this and I see that interrupts are being handled by
    sending SIGQUIT and SIGINT to the backend process.
    
    Attaching v10 with the above changes.
    
    regards,
    Ajin Cerian
    Fujitsu Australia
    
  50. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-09-03T08:47:04Z

    On Wed, Sep 3, 2025 at 11:58 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > Attaching v10 with the above changes.
    >
    
    The patch does not apply on HEAD. Can you please rebase?
    
    thanks
    Shveta
    
    
    
    
  51. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-09-03T09:49:39Z

    On Wed, Sep 3, 2025 at 6:47 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Wed, Sep 3, 2025 at 11:58 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > > Attaching v10 with the above changes.
    > >
    >
    > The patch does not apply on HEAD. Can you please rebase?
    
    Rebased and made a small change as well to use TopMemoryContext rather
    than create a new context for slot_list.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  52. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-09-04T06:35:16Z

    On Wed, Sep 3, 2025 at 3:19 PM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > On Wed, Sep 3, 2025 at 6:47 PM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > On Wed, Sep 3, 2025 at 11:58 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > > >
    > > > Attaching v10 with the above changes.
    > > >
    > >
    > > The patch does not apply on HEAD. Can you please rebase?
    >
    > Rebased and made a small change as well to use TopMemoryContext rather
    > than create a new context for slot_list.
    >
    
    Thanks for the patch. Please find a few comments:
    
    1)
    /* Clean up slot_names if allocated in TopMemoryContext */
    if (slot_names)
    {
    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
    list_free_deep(slot_names);
    MemoryContextSwitchTo(oldcontext);
    }
    
    I think we can free slot_names without switching the context. Can you
    please check this?
    
    2)
    We should add a comment for:
    a) why we are using the slot-names from the first cycle instead of
    fetching all failover slots in each cycle.
    b) why are we relocating remote_slot list everytime.
    
    
    3)
    @@ -1130,7 +1180,7 @@ slotsync_reread_config(void)
    
    - Assert(sync_replication_slots);
    + Assert(!AmLogicalSlotSyncWorkerProcess() || sync_replication_slots);
    
    Do we still need this change after slotsync_api_reread_config?
    
    4)
    +static void ProcessSlotSyncInterrupts(void);
    
    This is not needed.
    
    
    5)
    
    + /* update flag, so that we retry */
    + slot_persistence_pending = true;
    
    Can we tweak it to: 'Update the flag so that the API can retry'
    
    6)
    SyncReplicationSlots():
    + /* Free the current remote_slots list */
    + if (remote_slots)
    + list_free_deep(remote_slots);
    
    Do we need a 'remote_slots' check, won't it manage it internally?  We
    don't have it in ReplSlotSyncWorkerMain().
    
    7)
    slotsync_api_reread_config
    
    + ereport(ERROR,
    + (errcode(ERRCODE_CONFIG_FILE_ERROR),
    + errmsg("cannot continue slot synchronization due to parameter changes"),
    + errdetail("Critical replication parameters (primary_conninfo,
    primary_slot_name, or hot_standby_feedback) have changed since
    pg_sync_replication_slots() started."),
    + errhint("Retry pg_sync_replication_slots() to use the updated
    configuration.")));
    
    I am unsure if we need to mention '(primary_conninfo,
    primary_slot_name, or hot_standby_feedback)', but would like to know
    what others think.
    
    thanks
    Shveta
    
    
    
    
  53. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-09-05T13:21:23Z

    On Wed, Sep 3, 2025 at 11:58 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    >
    > On Fri, Aug 29, 2025 at 6:50 PM Ashutosh Bapat
    > <ashutosh.bapat.oss@gmail.com> wrote:
    > >
    > > On Fri, Aug 29, 2025 at 11:42 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > > +/*
    > > + * Flag used by pg_sync_replication_slots()
    > > + * to do retries if the slot did not persist while syncing.
    > > + */
    > > +static bool slot_persistence_pending = false;
    > >
    > > I don't think we need to keep a global variable for this. The variable
    > > is used only inside SyncReplicationSlots() and the call depth is not
    > > more than a few calls. From synchronize_slots(), before which the
    > > variable is reset and after which the variable is checked, to
    > > update_and_persist_local_synced_slot() which sets the variable, all
    > > the functions return bool. All of them can be made to return an
    > > integer status instead indicating the result of the operation. If we
    > > do so we could check the return value of synchronize_slots() to decide
    > > whether to retry or not, isntead of maintaining a global variable
    > > which has a much wider scope than required. It's difficult to keep it
    > > updated over the time.
    > >
    >
    > The problem is that all those calls synchronize_slots() and
    > update_and_persist_local_synced_slot() are shared with the slotsync
    > worker logic and API. Hence, changing this will affect slotsync_worker
    > logic as well. While the API needs to spefically retry only if the
    > initial sync fails, the slotsync worker will always be retrying. I
    > feel using a global variable is a more convenient way of doing this.
    
    AFAICS, it's a matter of expanding the scope of what's returned by
    those functions. The worker may not want to use the whole expanded
    scope but the API will use it. That shouldn't change the functionality
    of the worker, but it will help avoid the global variable - which have
    much wider scope and their maintenance can be prone to bugs.
    
    >
    > > @@ -1276,7 +1331,7 @@ wait_for_slot_activity(bool some_slot_updated)
    > >
    > > The function is too cute to be useful. The code should be part of
    > > ReplSlotSyncWorkerMain() just like other worker's main functions.
    > >
    >
    > But this wouldn't be part of this feature.
    >
    > > void
    > > SyncReplicationSlots(WalReceiverConn *wrconn)
    > > {
    > > PG_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, PointerGetDatum(wrconn));
    > > {
    > >
    > > Shouldn't this function call CheckForInterrupts() somewhere in the
    > > loop since it could be potentially an infinite loop?
    >
    > I've tested this and I see that interrupts are being handled by
    > sending SIGQUIT and SIGINT to the backend process.
    
    Can you please point me to the code (the call to
    CHECK_FOR_INTERRUPTS()) which processes these interrupts while
    pg_sync_replication_slots() is executing, especially when the function
    is waiting while syncing a slot.
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  54. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ashutosh Sharma <ashu.coek88@gmail.com> — 2025-09-05T18:09:45Z

    Hi,
    
    On Wed, Sep 3, 2025 at 3:20 PM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > On Wed, Sep 3, 2025 at 6:47 PM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > On Wed, Sep 3, 2025 at 11:58 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > > >
    > > > Attaching v10 with the above changes.
    > > >
    > >
    > > The patch does not apply on HEAD. Can you please rebase?
    >
    > Rebased and made a small change as well to use TopMemoryContext rather
    > than create a new context for slot_list.
    >
    
    If the remote slot is inactive since long and lagging behind the
    standby, this function (pg_sync_replication_slots) could end up
    waiting indefinitely. While it can certainly be cancelled manually,
    that behavior might not be ideal for everyone. That’s my
    understanding; please let me know if you see it differently.
    
    --
    With Regards,
    Ashutosh Sharma.
    
    
    
    
  55. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ashutosh Sharma <ashu.coek88@gmail.com> — 2025-09-05T18:34:56Z

    On Fri, Sep 5, 2025 at 6:52 PM Ashutosh Bapat
    <ashutosh.bapat.oss@gmail.com> wrote:
    >
    > On Wed, Sep 3, 2025 at 11:58 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > >
    > > On Fri, Aug 29, 2025 at 6:50 PM Ashutosh Bapat
    > > <ashutosh.bapat.oss@gmail.com> wrote:
    > > >
    > > > On Fri, Aug 29, 2025 at 11:42 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > > > +/*
    > > > + * Flag used by pg_sync_replication_slots()
    > > > + * to do retries if the slot did not persist while syncing.
    > > > + */
    > > > +static bool slot_persistence_pending = false;
    > > >
    > > > I don't think we need to keep a global variable for this. The variable
    > > > is used only inside SyncReplicationSlots() and the call depth is not
    > > > more than a few calls. From synchronize_slots(), before which the
    > > > variable is reset and after which the variable is checked, to
    > > > update_and_persist_local_synced_slot() which sets the variable, all
    > > > the functions return bool. All of them can be made to return an
    > > > integer status instead indicating the result of the operation. If we
    > > > do so we could check the return value of synchronize_slots() to decide
    > > > whether to retry or not, isntead of maintaining a global variable
    > > > which has a much wider scope than required. It's difficult to keep it
    > > > updated over the time.
    > > >
    > >
    > > The problem is that all those calls synchronize_slots() and
    > > update_and_persist_local_synced_slot() are shared with the slotsync
    > > worker logic and API. Hence, changing this will affect slotsync_worker
    > > logic as well. While the API needs to spefically retry only if the
    > > initial sync fails, the slotsync worker will always be retrying. I
    > > feel using a global variable is a more convenient way of doing this.
    >
    > AFAICS, it's a matter of expanding the scope of what's returned by
    > those functions. The worker may not want to use the whole expanded
    > scope but the API will use it. That shouldn't change the functionality
    > of the worker, but it will help avoid the global variable - which have
    > much wider scope and their maintenance can be prone to bugs.
    >
    > >
    > > > @@ -1276,7 +1331,7 @@ wait_for_slot_activity(bool some_slot_updated)
    > > >
    > > > The function is too cute to be useful. The code should be part of
    > > > ReplSlotSyncWorkerMain() just like other worker's main functions.
    > > >
    > >
    > > But this wouldn't be part of this feature.
    > >
    > > > void
    > > > SyncReplicationSlots(WalReceiverConn *wrconn)
    > > > {
    > > > PG_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, PointerGetDatum(wrconn));
    > > > {
    > > >
    > > > Shouldn't this function call CheckForInterrupts() somewhere in the
    > > > loop since it could be potentially an infinite loop?
    > >
    > > I've tested this and I see that interrupts are being handled by
    > > sending SIGQUIT and SIGINT to the backend process.
    >
    > Can you please point me to the code (the call to
    > CHECK_FOR_INTERRUPTS()) which processes these interrupts while
    > pg_sync_replication_slots() is executing, especially when the function
    > is waiting while syncing a slot.
    >
    
    I noticed that the function libpqrcv_processTuples, which is invoked
    by fetch_remote_slots, includes a CHECK_FOR_INTERRUPTS call. This is
    currently helping in processing interrupts while we are in an infinite
    loop within SyncReplicationSlots(). I’m just pointing this out based
    on my observation while reviewing the changes in this patch. Ajin,
    please correct me if I’m mistaken. If not, can we always rely on this
    particular check for interrupts.
    
    --
    With Regards,
    Ashutosh Sharma.
    
    
    
    
  56. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Amit Kapila <amit.kapila16@gmail.com> — 2025-09-06T03:43:49Z

    On Fri, Sep 5, 2025 at 11:39 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
    >
    > On Wed, Sep 3, 2025 at 3:20 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > > On Wed, Sep 3, 2025 at 6:47 PM shveta malik <shveta.malik@gmail.com> wrote:
    > > >
    > > > On Wed, Sep 3, 2025 at 11:58 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > > > >
    > > > > Attaching v10 with the above changes.
    > > > >
    > > >
    > > > The patch does not apply on HEAD. Can you please rebase?
    > >
    > > Rebased and made a small change as well to use TopMemoryContext rather
    > > than create a new context for slot_list.
    > >
    >
    > If the remote slot is inactive since long and lagging behind the
    > standby, this function (pg_sync_replication_slots) could end up
    > waiting indefinitely. While it can certainly be cancelled manually,
    > that behavior might not be ideal for everyone. That’s my
    > understanding; please let me know if you see it differently.
    >
    
    Such a case can be addressed by having additional timeout parameters.
    We can do that as an additional patch if the use case is important
    enough to address.
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  57. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-09-08T04:20:04Z

    On Sat, Sep 6, 2025 at 12:05 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
    >
    > On Fri, Sep 5, 2025 at 6:52 PM Ashutosh Bapat
    > <ashutosh.bapat.oss@gmail.com> wrote:
    > >
    > > On Wed, Sep 3, 2025 at 11:58 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > > >
    > > > I've tested this and I see that interrupts are being handled by
    > > > sending SIGQUIT and SIGINT to the backend process.
    > >
    > > Can you please point me to the code (the call to
    > > CHECK_FOR_INTERRUPTS()) which processes these interrupts while
    > > pg_sync_replication_slots() is executing, especially when the function
    > > is waiting while syncing a slot.
    > >
    >
    > I noticed that the function libpqrcv_processTuples, which is invoked
    > by fetch_remote_slots, includes a CHECK_FOR_INTERRUPTS call. This is
    > currently helping in processing interrupts while we are in an infinite
    > loop within SyncReplicationSlots(). I’m just pointing this out based
    > on my observation while reviewing the changes in this patch. Ajin,
    > please correct me if I’m mistaken. If not, can we always rely on this
    > particular check for interrupts.
    
    It doesn't seem good to rely on CHECKF_FOR_INTERRUPTS from so far
    away. It's better to have one being called from SyncReplicationSlots()
    which has the wait loop. That's how the other functions which have
    potentially long wait loops do.
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  58. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-09-08T04:21:23Z

    On Sat, Sep 6, 2025 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
    >
    > On Fri, Sep 5, 2025 at 11:39 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
    > > If the remote slot is inactive since long and lagging behind the
    > > standby, this function (pg_sync_replication_slots) could end up
    > > waiting indefinitely. While it can certainly be cancelled manually,
    > > that behavior might not be ideal for everyone. That’s my
    > > understanding; please let me know if you see it differently.
    > >
    >
    > Such a case can be addressed by having additional timeout parameters.
    > We can do that as an additional patch if the use case is important
    > enough to address.
    >
    
    Or we could rely on statement_timeout or the user cancelling the query
    explicitly.
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  59. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ashutosh Sharma <ashu.coek88@gmail.com> — 2025-09-08T04:26:49Z

    On Mon, Sep 8, 2025 at 9:51 AM Ashutosh Bapat
    <ashutosh.bapat.oss@gmail.com> wrote:
    >
    > On Sat, Sep 6, 2025 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
    > >
    > > On Fri, Sep 5, 2025 at 11:39 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
    > > > If the remote slot is inactive since long and lagging behind the
    > > > standby, this function (pg_sync_replication_slots) could end up
    > > > waiting indefinitely. While it can certainly be cancelled manually,
    > > > that behavior might not be ideal for everyone. That’s my
    > > > understanding; please let me know if you see it differently.
    > > >
    > >
    > > Such a case can be addressed by having additional timeout parameters.
    > > We can do that as an additional patch if the use case is important
    > > enough to address.
    > >
    >
    > Or we could rely on statement_timeout or the user cancelling the query
    > explicitly.
    >
    
    Sure. thanks Amit and Ashutosh.
    
    --
    With Regards,
    Ashutosh Sharma.
    
    
    
    
  60. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-09-08T04:31:16Z

    On Fri, Sep 5, 2025 at 6:51 PM Ashutosh Bapat
    <ashutosh.bapat.oss@gmail.com> wrote:
    >
    > On Wed, Sep 3, 2025 at 11:58 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > >
    > > On Fri, Aug 29, 2025 at 6:50 PM Ashutosh Bapat
    > > <ashutosh.bapat.oss@gmail.com> wrote:
    > > >
    > > > On Fri, Aug 29, 2025 at 11:42 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > > > +/*
    > > > + * Flag used by pg_sync_replication_slots()
    > > > + * to do retries if the slot did not persist while syncing.
    > > > + */
    > > > +static bool slot_persistence_pending = false;
    > > >
    > > > I don't think we need to keep a global variable for this. The variable
    > > > is used only inside SyncReplicationSlots() and the call depth is not
    > > > more than a few calls. From synchronize_slots(), before which the
    > > > variable is reset and after which the variable is checked, to
    > > > update_and_persist_local_synced_slot() which sets the variable, all
    > > > the functions return bool. All of them can be made to return an
    > > > integer status instead indicating the result of the operation. If we
    > > > do so we could check the return value of synchronize_slots() to decide
    > > > whether to retry or not, isntead of maintaining a global variable
    > > > which has a much wider scope than required. It's difficult to keep it
    > > > updated over the time.
    > > >
    > >
    > > The problem is that all those calls synchronize_slots() and
    > > update_and_persist_local_synced_slot() are shared with the slotsync
    > > worker logic and API. Hence, changing this will affect slotsync_worker
    > > logic as well. While the API needs to spefically retry only if the
    > > initial sync fails, the slotsync worker will always be retrying. I
    > > feel using a global variable is a more convenient way of doing this.
    >
    > AFAICS, it's a matter of expanding the scope of what's returned by
    > those functions. The worker may not want to use the whole expanded
    > scope but the API will use it. That shouldn't change the functionality
    > of the worker, but it will help avoid the global variable - which have
    > much wider scope and their maintenance can be prone to bugs.
    >
    
    I think this can be done.
    
    > >
    > > > @@ -1276,7 +1331,7 @@ wait_for_slot_activity(bool some_slot_updated)
    > > >
    > > > The function is too cute to be useful. The code should be part of
    > > > ReplSlotSyncWorkerMain() just like other worker's main functions.
    > > >
    > >
    > > But this wouldn't be part of this feature.
    > >
    > > > void
    > > > SyncReplicationSlots(WalReceiverConn *wrconn)
    > > > {
    > > > PG_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, PointerGetDatum(wrconn));
    > > > {
    > > >
    > > > Shouldn't this function call CheckForInterrupts() somewhere in the
    > > > loop since it could be potentially an infinite loop?
    > >
    > > I've tested this and I see that interrupts are being handled by
    > > sending SIGQUIT and SIGINT to the backend process.
    >
    > Can you please point me to the code (the call to
    > CHECK_FOR_INTERRUPTS()) which processes these interrupts while
    > pg_sync_replication_slots() is executing, especially when the function
    > is waiting while syncing a slot.
    >
    > > I noticed that the function libpqrcv_processTuples, which is invoked
    > > by fetch_remote_slots, includes a CHECK_FOR_INTERRUPTS call. This is
    > > currently helping in processing interrupts while we are in an infinite
    > > loop within SyncReplicationSlots(). I’m just pointing this out based
    > > on my observation while reviewing the changes in this patch. Ajin,
    > > please correct me if I’m mistaken. If not, can we always rely on this
    > > particular check for interrupts.
    >
    > It doesn't seem good to rely on CHECKF_FOR_INTERRUPTS from so far
    > away. It's better to have one being called from SyncReplicationSlots()
    > which has the wait loop. That's how the other functions which have
    > potentially long wait loops do.
    
    +1
    
    thanks
    Shveta
    
    
    
    
  61. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ashutosh Sharma <ashu.coek88@gmail.com> — 2025-09-08T04:33:20Z

    Hi,
    
    Sharing some of my review comments, please look if these make sense to you.
    
    On Wed, Sep 3, 2025 at 3:20 PM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > On Wed, Sep 3, 2025 at 6:47 PM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > On Wed, Sep 3, 2025 at 11:58 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > > >
    > > > Attaching v10 with the above changes.
    > > >
    > >
    > > The patch does not apply on HEAD. Can you please rebase?
    >
    > Rebased and made a small change as well to use TopMemoryContext rather
    > than create a new context for slot_list.
    >
    
    + /*
    + * If we've been promoted, then no point
    + * continuing.
    + */
    + if (SlotSyncCtx->stopSignaled)
    + {
    + ereport(ERROR,
    + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
    + errmsg("exiting from slot synchronization as"
    + " promotion is triggered")));
    + break;
    + }
    "break" statement here looks redundant to me.
    
    --
    
    + *
    + * Repeatedly fetches and updates replication slot information from the
    + * primary until all slots are at least "sync ready". Retry is done after 2
    + * sec wait. Exits early if promotion is triggered or certain critical
    + * configuration parameters have changed.
      */
    
    wait for 2 seconds before retrying - the constant is
    SLOTSYNC_API_NAPTIME_MS, so technically it may *not* always be 2s if
    the macro changes. Maybe reword to “wait for SLOTSYNC_API_NAPTIME_MS
    before retrying” would look better?
    
    --
    
    /* Retry until all slots are sync ready atleast */
    
    and
    
    
    /* Done if all slots are atleast sync ready */
    
    atleast -> "at least". I am just making this comment because at few
    places in the same file I see "at least" and not "atleast".
    
    --
    
    +static void ProcessSlotSyncInterrupts(void);
    
    Is this change related to this patch?
    
    --
    
    +     <command>CREATE SUBSCRIPTION</command> during slot creation. After that,
    +     synchronization can be be performed either manually by calling
    +     <link linkend="pg-sync-replication-slots">
    
    double "be".
    
    --
    With Regards,
    Ashutosh Sharma.
    
    
    
    
  62. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-09-09T12:06:47Z

    On Thu, Sep 4, 2025 at 4:35 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Wed, Sep 3, 2025 at 3:19 PM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > Thanks for the patch. Please find a few comments:
    >
    > 1)
    > /* Clean up slot_names if allocated in TopMemoryContext */
    > if (slot_names)
    > {
    > oldcontext = MemoryContextSwitchTo(TopMemoryContext);
    > list_free_deep(slot_names);
    > MemoryContextSwitchTo(oldcontext);
    > }
    >
    > I think we can free slot_names without switching the context. Can you
    > please check this?
    >
    
    Removed this.
    
    > 2)
    > We should add a comment for:
    > a) why we are using the slot-names from the first cycle instead of
    > fetching all failover slots in each cycle.
    > b) why are we relocating remote_slot list everytime.
    >
    
    I have added a comment, let me know if any more is required.
    
    >
    > 3)
    > @@ -1130,7 +1180,7 @@ slotsync_reread_config(void)
    >
    > - Assert(sync_replication_slots);
    > + Assert(!AmLogicalSlotSyncWorkerProcess() || sync_replication_slots);
    >
    > Do we still need this change after slotsync_api_reread_config?
    >
    
    Removed.
    
    > 4)
    > +static void ProcessSlotSyncInterrupts(void);
    >
    > This is not needed.
    >
    
    Removed.
    
    >
    > 5)
    >
    > + /* update flag, so that we retry */
    > + slot_persistence_pending = true;
    >
    > Can we tweak it to: 'Update the flag so that the API can retry'
    
    Updated.
    
    >
    > 6)
    > SyncReplicationSlots():
    > + /* Free the current remote_slots list */
    > + if (remote_slots)
    > + list_free_deep(remote_slots);
    >
    > Do we need a 'remote_slots' check, won't it manage it internally?  We
    > don't have it in ReplSlotSyncWorkerMain().
    >
    
    Changed.
    
    > 7)
    > slotsync_api_reread_config
    >
    > + ereport(ERROR,
    > + (errcode(ERRCODE_CONFIG_FILE_ERROR),
    > + errmsg("cannot continue slot synchronization due to parameter changes"),
    > + errdetail("Critical replication parameters (primary_conninfo,
    > primary_slot_name, or hot_standby_feedback) have changed since
    > pg_sync_replication_slots() started."),
    > + errhint("Retry pg_sync_replication_slots() to use the updated
    > configuration.")));
    >
    > I am unsure if we need to mention '(primary_conninfo,
    > primary_slot_name, or hot_standby_feedback)', but would like to know
    > what others think.
    >
    
    Leaving this as is now.
    
    On Mon, Sep 8, 2025 at 2:20 PM Ashutosh Bapat
    <ashutosh.bapat.oss@gmail.com> wrote:
    >
    > On Sat, Sep 6, 2025 at 12:05 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
    > >
    > > On Fri, Sep 5, 2025 at 6:52 PM Ashutosh Bapat
    > > <ashutosh.bapat.oss@gmail.com> wrote:
    > > >
    > > > On Wed, Sep 3, 2025 at 11:58 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > > > >
    > > > > I've tested this and I see that interrupts are being handled by
    > > > > sending SIGQUIT and SIGINT to the backend process.
    > > >
    > > > Can you please point me to the code (the call to
    > > > CHECK_FOR_INTERRUPTS()) which processes these interrupts while
    > > > pg_sync_replication_slots() is executing, especially when the function
    > > > is waiting while syncing a slot.
    > > >
    > >
    > > I noticed that the function libpqrcv_processTuples, which is invoked
    > > by fetch_remote_slots, includes a CHECK_FOR_INTERRUPTS call. This is
    > > currently helping in processing interrupts while we are in an infinite
    > > loop within SyncReplicationSlots(). I’m just pointing this out based
    > > on my observation while reviewing the changes in this patch. Ajin,
    > > please correct me if I’m mistaken. If not, can we always rely on this
    > > particular check for interrupts.
    >
    > It doesn't seem good to rely on CHECKF_FOR_INTERRUPTS from so far
    > away. It's better to have one being called from SyncReplicationSlots()
    > which has the wait loop. That's how the other functions which have
    > potentially long wait loops do.
    >
    
    Ok, I agree. Added the Check.
    
    
    On Mon, Sep 8, 2025 at 2:33 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
    >
    > Hi,
    >
    > Sharing some of my review comments, please look if these make sense to you.
    >
    >
    > + /*
    > + * If we've been promoted, then no point
    > + * continuing.
    > + */
    > + if (SlotSyncCtx->stopSignaled)
    > + {
    > + ereport(ERROR,
    > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
    > + errmsg("exiting from slot synchronization as"
    > + " promotion is triggered")));
    > + break;
    > + }
    > "break" statement here looks redundant to me.
    >
    
    Removed.
    
    > --
    >
    > + *
    > + * Repeatedly fetches and updates replication slot information from the
    > + * primary until all slots are at least "sync ready". Retry is done after 2
    > + * sec wait. Exits early if promotion is triggered or certain critical
    > + * configuration parameters have changed.
    >   */
    >
    > wait for 2 seconds before retrying - the constant is
    > SLOTSYNC_API_NAPTIME_MS, so technically it may *not* always be 2s if
    > the macro changes. Maybe reword to “wait for SLOTSYNC_API_NAPTIME_MS
    > before retrying” would look better?
    >
    
    I've removed the reference to 2 seconds.
    
    > --
    >
    > /* Retry until all slots are sync ready atleast */
    >
    > and
    >
    >
    > /* Done if all slots are atleast sync ready */
    >
    > atleast -> "at least". I am just making this comment because at few
    > places in the same file I see "at least" and not "atleast".
    >
    
    Changed.
    
    > --
    >
    > +static void ProcessSlotSyncInterrupts(void);
    >
    > Is this change related to this patch?
    >
    
    It was used earlier, and forgot to change it. Removed now.
    
    > --
    >
    > +     <command>CREATE SUBSCRIPTION</command> during slot creation. After that,
    > +     synchronization can be be performed either manually by calling
    > +     <link linkend="pg-sync-replication-slots">
    >
    > double "be".
    >
    
    Fixed
    
    On Fri, Sep 5, 2025 at 11:21 PM Ashutosh Bapat
    <ashutosh.bapat.oss@gmail.com> wrote:
    >
    > On Wed, Sep 3, 2025 at 11:58 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > AFAICS, it's a matter of expanding the scope of what's returned by
    > those functions. The worker may not want to use the whole expanded
    > scope but the API will use it. That shouldn't change the functionality
    > of the worker, but it will help avoid the global variable - which have
    > much wider scope and their maintenance can be prone to bugs.
    >
    > >
    
    Ok, added a new return parameter for these functions that will return
    if there are any slots pending persistence and removed the global
    variable.
    
    Attached v11 patch addressing the above comments.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  63. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-09-10T04:45:31Z

    On Tue, Sep 9, 2025 at 5:37 PM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > Attached v11 patch addressing the above comments.
    >
    
    Please find a few comments:
    
    1)
    
    + Retry is done after 2
    + * sec wait. Exits early if promotion is triggered or certain critical
    
    We can say: Retry is done after SLOTSYNC_API_NAPTIME_MS wait.
    
    2)
    + /*
    + * Fetch remote slot info for the given slot_names. If slot_names is NIL,
    + * fetch all failover-enabled slots. Note that we reuse slot_names from
    + * the previous iteration; re-fetching all failover slots each time could
    + * cause an endless loop.
    + */
    
    a)
    the previous iteration --> the first iteration.
    
    b) Also we can mention the reason why we take names from first
    iteration instead of going for pending ones alone, something like:
    
    Instead of reprocessing only the pending slots in each iteration, it's
    better to process all the slots received in the first iteration.
    This ensures that by the time we're done, all slots reflect the latest values.
    
    3)
    + remote_slots = fetch_remote_slots(wrconn, slot_names);
    +
    +
    + /* Attempt to synchronize slots */
    + synchronize_slots(wrconn, remote_slots, &slot_persistence_pending);
    
    One extra blank line can be removed
    
    4)
    
    + /* Clean up slot_names if allocated in TopMemoryContext */
    + if (slot_names)
    + list_free_deep(slot_names);
    
    Can we please move it before 'ReplicationSlotCleanup'.
    
    5)
    In case of error in subsequent iteration, slot_names allocated from
    TopMemoryContext will be left unfreed?
    
    6)
    + ListCell   *lc;
    + bool first_slot = true;
    
    Shall we move these two to concerned if-block:
    if (slot_names != NIL)
    
    7)
     * The slot_persistence_pending flag is used by the pg_sync_replication_slots
     * API to track if any slots could not be persisted and need to be retried.
    
    a) Instead of mentioning only about slot_persistence_pending argument
    in concerned function's header, we shall define all the arguments.
    
    b) We can remove the 'flag' term from the comments as it is a
    function-argument now.
    
    8)
    I think we should add briefly in the header of the file about the new
    behaviour of API.
    
    
    thanks
    Shveta
    
    
    
    
  64. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-09-15T12:47:43Z

    On Wed, Sep 10, 2025 at 2:45 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Tue, Sep 9, 2025 at 5:37 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > > Attached v11 patch addressing the above comments.
    > >
    >
    > Please find a few comments:
    >
    > 1)
    >
    > + Retry is done after 2
    > + * sec wait. Exits early if promotion is triggered or certain critical
    >
    > We can say: Retry is done after SLOTSYNC_API_NAPTIME_MS wait.
    >
    
    Changed.
    
    > 2)
    > + /*
    > + * Fetch remote slot info for the given slot_names. If slot_names is NIL,
    > + * fetch all failover-enabled slots. Note that we reuse slot_names from
    > + * the previous iteration; re-fetching all failover slots each time could
    > + * cause an endless loop.
    > + */
    >
    > a)
    > the previous iteration --> the first iteration.
    >
    > b) Also we can mention the reason why we take names from first
    > iteration instead of going for pending ones alone, something like:
    >
    > Instead of reprocessing only the pending slots in each iteration, it's
    > better to process all the slots received in the first iteration.
    > This ensures that by the time we're done, all slots reflect the latest values.
    >
    > 3)
    > + remote_slots = fetch_remote_slots(wrconn, slot_names);
    > +
    > +
    > + /* Attempt to synchronize slots */
    > + synchronize_slots(wrconn, remote_slots, &slot_persistence_pending);
    >
    > One extra blank line can be removed
    >
    
    Fixed.
    
    > 4)
    >
    > + /* Clean up slot_names if allocated in TopMemoryContext */
    > + if (slot_names)
    > + list_free_deep(slot_names);
    >
    > Can we please move it before 'ReplicationSlotCleanup'.
    >
    
    Fixed.
    
    > 5)
    > In case of error in subsequent iteration, slot_names allocated from
    > TopMemoryContext will be left unfreed?
    >
    
    I've changed the logic so that even on error, slot_names are freed.
    
    > 6)
    > + ListCell   *lc;
    > + bool first_slot = true;
    >
    > Shall we move these two to concerned if-block:
    > if (slot_names != NIL)
    >
    
    Changed.
    
    > 7)
    >  * The slot_persistence_pending flag is used by the pg_sync_replication_slots
    >  * API to track if any slots could not be persisted and need to be retried.
    >
    > a) Instead of mentioning only about slot_persistence_pending argument
    > in concerned function's header, we shall define all the arguments.
    >
    > b) We can remove the 'flag' term from the comments as it is a
    > function-argument now.
    >
    
    Changed.
    
    > 8)
    > I think we should add briefly in the header of the file about the new
    > behaviour of API.
    >
    
    Added.
    
    Attaching patch v12 addressing these comments.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  65. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-09-16T06:23:36Z

    On Mon, Sep 15, 2025 at 6:17 PM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > On Wed, Sep 10, 2025 at 2:45 PM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > On Tue, Sep 9, 2025 at 5:37 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > > >
    > > > Attached v11 patch addressing the above comments.
    > > >
    > >
    > > Please find a few comments:
    > >
    > > 1)
    > >
    > > + Retry is done after 2
    > > + * sec wait. Exits early if promotion is triggered or certain critical
    > >
    > > We can say: Retry is done after SLOTSYNC_API_NAPTIME_MS wait.
    > >
    >
    > Changed.
    >
    > > 2)
    > > + /*
    > > + * Fetch remote slot info for the given slot_names. If slot_names is NIL,
    > > + * fetch all failover-enabled slots. Note that we reuse slot_names from
    > > + * the previous iteration; re-fetching all failover slots each time could
    > > + * cause an endless loop.
    > > + */
    > >
    > > a)
    > > the previous iteration --> the first iteration.
    > >
    > > b) Also we can mention the reason why we take names from first
    > > iteration instead of going for pending ones alone, something like:
    > >
    > > Instead of reprocessing only the pending slots in each iteration, it's
    > > better to process all the slots received in the first iteration.
    > > This ensures that by the time we're done, all slots reflect the latest values.
    > >
    > > 3)
    > > + remote_slots = fetch_remote_slots(wrconn, slot_names);
    > > +
    > > +
    > > + /* Attempt to synchronize slots */
    > > + synchronize_slots(wrconn, remote_slots, &slot_persistence_pending);
    > >
    > > One extra blank line can be removed
    > >
    >
    > Fixed.
    >
    > > 4)
    > >
    > > + /* Clean up slot_names if allocated in TopMemoryContext */
    > > + if (slot_names)
    > > + list_free_deep(slot_names);
    > >
    > > Can we please move it before 'ReplicationSlotCleanup'.
    > >
    >
    > Fixed.
    >
    > > 5)
    > > In case of error in subsequent iteration, slot_names allocated from
    > > TopMemoryContext will be left unfreed?
    > >
    >
    > I've changed the logic so that even on error, slot_names are freed.
    
    I see that you have freed 'slot_names' in config-changed and promotion
    case but the ERROR can come from other flows as well. The idea was to
    somehow to free it (if possible) in slotsync_failure_callback() by
    passing it as an argument, like we do for 'wrconn'.
    
    > > 6)
    > > + ListCell   *lc;
    > > + bool first_slot = true;
    > >
    > > Shall we move these two to concerned if-block:
    > > if (slot_names != NIL)
    > >
    >
    > Changed.
    >
    > > 7)
    > >  * The slot_persistence_pending flag is used by the pg_sync_replication_slots
    > >  * API to track if any slots could not be persisted and need to be retried.
    > >
    > > a) Instead of mentioning only about slot_persistence_pending argument
    > > in concerned function's header, we shall define all the arguments.
    > >
    > > b) We can remove the 'flag' term from the comments as it is a
    > > function-argument now.
    > >
    >
    > Changed.
    >
    > > 8)
    > > I think we should add briefly in the header of the file about the new
    > > behaviour of API.
    > >
    >
    > Added.
    >
    > Attaching patch v12 addressing these comments.
    >
    
    Thank You for the patch. Please find a few comments:
    
    1)
    + bool slot_persistence_pending = false;
    
    We can move this declaration outside of the loop. And I think we don't
    need to initialize as we are resetting it to false before each
    iteration.
    
    2)
    
    + /* Switch to long-lived TopMemoryContext to store slot names */
    + oldcontext = MemoryContextSwitchTo(TopMemoryContext);
    +
    + /* Extract slot names from the remote slots */
    + slot_names = extract_slot_names(remote_slots);
    +
    + MemoryContextSwitchTo(oldcontext);
    
    I think it will be better if we move 'MemoryContextSwitchTo' calls
    inside extract_slot_names() itself. The entire logic related to
    'slot_names' will then be consolidated in one place
    
    3)
    + * The slot_persistence_pending flag is used by the pg_sync_replication_slots
    + * API to track if any slots could not be persisted and need to be retried.
    
    Can we change it to below. We can have it started in a new line after
    a blank line (see how remote_slot_precedes, found_consistent_snapshot
    are defined)
    
    *slot_persistence_pending is set to true if any of the slots fail to
    persist. It is utilized by the pg_sync_replication_slots() API.
    
    Please change it in both synchronize_one_slot() and
    update_and_persist_local_synced_slot()
    
    4)
    a)
    + Update the
    + * slot_persistence_pending flag, so the API can retry.
      */
    
    b)
     /* update the flag, so that the API can retry */
    
    It will be good if we can remove 'flag' usage from both occurrences in
    update_and_persist_local_synced_slot().
    
    5)
    Similar to ProcessSlotSyncInterrupts() for worker, shall we have one
    such function for API which can have all 3 things:
    
    {
    /*
     * If we've been promoted, then no point
     * continuing.
     */
    if (SlotSyncCtx->stopSignaled)
    {
       ereport(ERROR,
      (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
      errmsg("exiting from slot synchronization as"
      " promotion is triggered")));
    }
    
    CHECK_FOR_INTERRUPTS();
    
    if (ConfigReloadPending)
        slotsync_api_reread_config();
    }
    
    And similar to the worker case, we can have it checked in the
    beginning of the loop. Thoughts?
    
    thanks
    Shveta
    
    
    
    
  66. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ashutosh Sharma <ashu.coek88@gmail.com> — 2025-09-16T11:41:58Z

    On Tue, Sep 16, 2025 at 11:53 AM shveta malik <shveta.malik@gmail.com>
    wrote:
    >
    > On Mon, Sep 15, 2025 at 6:17 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > > On Wed, Sep 10, 2025 at 2:45 PM shveta malik <shveta.malik@gmail.com>
    wrote:
    > > >
    > > > On Tue, Sep 9, 2025 at 5:37 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > > > >
    > > > > Attached v11 patch addressing the above comments.
    > > > >
    > > >
    > > > Please find a few comments:
    > > >
    > > > 1)
    > > >
    > > > + Retry is done after 2
    > > > + * sec wait. Exits early if promotion is triggered or certain
    critical
    > > >
    > > > We can say: Retry is done after SLOTSYNC_API_NAPTIME_MS wait.
    > > >
    > >
    > > Changed.
    > >
    > > > 2)
    > > > + /*
    > > > + * Fetch remote slot info for the given slot_names. If slot_names is
    NIL,
    > > > + * fetch all failover-enabled slots. Note that we reuse slot_names
    from
    > > > + * the previous iteration; re-fetching all failover slots each time
    could
    > > > + * cause an endless loop.
    > > > + */
    > > >
    > > > a)
    > > > the previous iteration --> the first iteration.
    > > >
    > > > b) Also we can mention the reason why we take names from first
    > > > iteration instead of going for pending ones alone, something like:
    > > >
    > > > Instead of reprocessing only the pending slots in each iteration, it's
    > > > better to process all the slots received in the first iteration.
    > > > This ensures that by the time we're done, all slots reflect the
    latest values.
    > > >
    > > > 3)
    > > > + remote_slots = fetch_remote_slots(wrconn, slot_names);
    > > > +
    > > > +
    > > > + /* Attempt to synchronize slots */
    > > > + synchronize_slots(wrconn, remote_slots, &slot_persistence_pending);
    > > >
    > > > One extra blank line can be removed
    > > >
    > >
    > > Fixed.
    > >
    > > > 4)
    > > >
    > > > + /* Clean up slot_names if allocated in TopMemoryContext */
    > > > + if (slot_names)
    > > > + list_free_deep(slot_names);
    > > >
    > > > Can we please move it before 'ReplicationSlotCleanup'.
    > > >
    > >
    > > Fixed.
    > >
    > > > 5)
    > > > In case of error in subsequent iteration, slot_names allocated from
    > > > TopMemoryContext will be left unfreed?
    > > >
    > >
    > > I've changed the logic so that even on error, slot_names are freed.
    >
    > I see that you have freed 'slot_names' in config-changed and promotion
    > case but the ERROR can come from other flows as well. The idea was to
    > somehow to free it (if possible) in slotsync_failure_callback() by
    > passing it as an argument, like we do for 'wrconn'.
    >
    
    Are you suggesting introducing a structure (for example, SlotSyncContext as
    shown below) that encapsulates both wrconn and slot_names, and then passing
    a pointer to this structure as the Datum argument to the
    slotsync_failure_callback cleanup function, so that the callback can handle
    freeing wrconn and slot_names and maybe some other members within the
    structure that allocate memory?
    
    /*
     * Extended structure that can hold both connection and slot_names info
     */
    typedef struct SlotSyncContext
    {
    
    WalReceiverConn *wrconn;          /* Must be first for compatibility */
    List     *slot_names;                    /* Pointer to slot_names list */
    bool extended;                           /* Flag to indicate extended
    context */
    
    } SlotSyncContext;
    
    SyncReplicationSlots(WalReceiverConn *wrconn)
    {
    
    SlotSyncContext sync_ctx;
    ...
    ...
    /* Initialize extended context */
    sync_ctx.wrconn = wrconn;
    sync_ctx.slot_names_ptr = &slot_names;
    sync_ctx.extended = true;
    
    
    PG_ENSURE_ERROR_CLEANUP(slotsync_failure_callback,
    PointerGetDatum(&sync_ctx));
    
    
    ...
    }
    
    --
    With Regards,
    Ashutosh Sharma.
    
  67. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-09-16T12:12:42Z

    On Tue, Sep 16, 2025 at 5:12 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
    >
    >
    > On Tue, Sep 16, 2025 at 11:53 AM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > On Mon, Sep 15, 2025 at 6:17 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > > >
    > > > On Wed, Sep 10, 2025 at 2:45 PM shveta malik <shveta.malik@gmail.com> wrote:
    > > > >
    > > > > On Tue, Sep 9, 2025 at 5:37 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > > > > >
    > > > > > Attached v11 patch addressing the above comments.
    > > > > >
    > > > >
    > > > > Please find a few comments:
    > > > >
    > > > > 1)
    > > > >
    > > > > + Retry is done after 2
    > > > > + * sec wait. Exits early if promotion is triggered or certain critical
    > > > >
    > > > > We can say: Retry is done after SLOTSYNC_API_NAPTIME_MS wait.
    > > > >
    > > >
    > > > Changed.
    > > >
    > > > > 2)
    > > > > + /*
    > > > > + * Fetch remote slot info for the given slot_names. If slot_names is NIL,
    > > > > + * fetch all failover-enabled slots. Note that we reuse slot_names from
    > > > > + * the previous iteration; re-fetching all failover slots each time could
    > > > > + * cause an endless loop.
    > > > > + */
    > > > >
    > > > > a)
    > > > > the previous iteration --> the first iteration.
    > > > >
    > > > > b) Also we can mention the reason why we take names from first
    > > > > iteration instead of going for pending ones alone, something like:
    > > > >
    > > > > Instead of reprocessing only the pending slots in each iteration, it's
    > > > > better to process all the slots received in the first iteration.
    > > > > This ensures that by the time we're done, all slots reflect the latest values.
    > > > >
    > > > > 3)
    > > > > + remote_slots = fetch_remote_slots(wrconn, slot_names);
    > > > > +
    > > > > +
    > > > > + /* Attempt to synchronize slots */
    > > > > + synchronize_slots(wrconn, remote_slots, &slot_persistence_pending);
    > > > >
    > > > > One extra blank line can be removed
    > > > >
    > > >
    > > > Fixed.
    > > >
    > > > > 4)
    > > > >
    > > > > + /* Clean up slot_names if allocated in TopMemoryContext */
    > > > > + if (slot_names)
    > > > > + list_free_deep(slot_names);
    > > > >
    > > > > Can we please move it before 'ReplicationSlotCleanup'.
    > > > >
    > > >
    > > > Fixed.
    > > >
    > > > > 5)
    > > > > In case of error in subsequent iteration, slot_names allocated from
    > > > > TopMemoryContext will be left unfreed?
    > > > >
    > > >
    > > > I've changed the logic so that even on error, slot_names are freed.
    > >
    > > I see that you have freed 'slot_names' in config-changed and promotion
    > > case but the ERROR can come from other flows as well. The idea was to
    > > somehow to free it (if possible) in slotsync_failure_callback() by
    > > passing it as an argument, like we do for 'wrconn'.
    > >
    >
    > Are you suggesting introducing a structure (for example, SlotSyncContext as shown below) that encapsulates both wrconn and slot_names, and then passing a pointer to this structure as the Datum argument to the slotsync_failure_callback cleanup function, so that the callback can handle freeing wrconn and slot_names and maybe some other members within the structure that allocate memory?
    >
    
    Yes, as I do not see any other simpler way to take care of this
    memory-free in all ERROR scenarios.
    
    > /*
    >  * Extended structure that can hold both connection and slot_names info
    >  */
    > typedef struct SlotSyncContext
    > {
    >
    > WalReceiverConn *wrconn;          /* Must be first for compatibility */
    > List     *slot_names;                    /* Pointer to slot_names list */
    > bool extended;                           /* Flag to indicate extended context */
    >
    > } SlotSyncContext;
    >
    > SyncReplicationSlots(WalReceiverConn *wrconn)
    > {
    >
    > SlotSyncContext sync_ctx;
    > ...
    > ...
    > /* Initialize extended context */
    > sync_ctx.wrconn = wrconn;
    > sync_ctx.slot_names_ptr = &slot_names;
    > sync_ctx.extended = true;
    >
    >
    > PG_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, PointerGetDatum(&sync_ctx));
    >
    
    Yes, like this.
    
    thanks
    Shveta
    
    
    
    
  68. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-09-22T10:51:14Z

    On Tue, Sep 16, 2025 at 4:23 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Mon, Sep 15, 2025 at 6:17 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > Thank You for the patch. Please find a few comments:
    >
    > 1)
    > + bool slot_persistence_pending = false;
    >
    > We can move this declaration outside of the loop. And I think we don't
    > need to initialize as we are resetting it to false before each
    > iteration.
    >
    
    Fixed.
    
    > 2)
    >
    > + /* Switch to long-lived TopMemoryContext to store slot names */
    > + oldcontext = MemoryContextSwitchTo(TopMemoryContext);
    > +
    > + /* Extract slot names from the remote slots */
    > + slot_names = extract_slot_names(remote_slots);
    > +
    > + MemoryContextSwitchTo(oldcontext);
    >
    > I think it will be better if we move 'MemoryContextSwitchTo' calls
    > inside extract_slot_names() itself. The entire logic related to
    > 'slot_names' will then be consolidated in one place
    >
    
    Changed,
    
    > 3)
    > + * The slot_persistence_pending flag is used by the pg_sync_replication_slots
    > + * API to track if any slots could not be persisted and need to be retried.
    >
    > Can we change it to below. We can have it started in a new line after
    > a blank line (see how remote_slot_precedes, found_consistent_snapshot
    > are defined)
    >
    > *slot_persistence_pending is set to true if any of the slots fail to
    > persist. It is utilized by the pg_sync_replication_slots() API.
    >
    > Please change it in both synchronize_one_slot() and
    > update_and_persist_local_synced_slot()
    >
    
    Changed.
    
    > 4)
    > a)
    > + Update the
    > + * slot_persistence_pending flag, so the API can retry.
    >   */
    >
    > b)
    >  /* update the flag, so that the API can retry */
    >
    > It will be good if we can remove 'flag' usage from both occurrences in
    > update_and_persist_local_synced_slot().
    >
    
    Changed.
    
    > 5)
    > Similar to ProcessSlotSyncInterrupts() for worker, shall we have one
    > such function for API which can have all 3 things:
    >
    > {
    > /*
    >  * If we've been promoted, then no point
    >  * continuing.
    >  */
    > if (SlotSyncCtx->stopSignaled)
    > {
    >    ereport(ERROR,
    >   (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
    >   errmsg("exiting from slot synchronization as"
    >   " promotion is triggered")));
    > }
    >
    > CHECK_FOR_INTERRUPTS();
    >
    > if (ConfigReloadPending)
    >     slotsync_api_reread_config();
    > }
    >
    > And similar to the worker case, we can have it checked in the
    > beginning of the loop. Thoughts?
    >
    
    Changed it and added a function - ProcessSlotSyncAPIChanges()
    
    Created a patch v13 with these changes.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  69. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-09-23T04:59:18Z

    On Mon, Sep 22, 2025 at 4:21 PM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    >
    > Created a patch v13 with these changes.
    >
    
    Please find a few comments:
    
    1)
    + /* update the failure structure so that it can be freed on error */
    + fparams.slot_names = slot_names;
    +
    
    Since slot_names is assigned only once, we can make the above
    assignment as well only once, inside the if-block where we initialize
    slot_names.
    
    2)
    extract_slot_names():
    
    + foreach(lc, remote_slots)
    + {
    + RemoteSlot *remote_slot = (RemoteSlot *) lfirst(lc);
    + char       *slot_name;
    +
    + /* Switch to long-lived TopMemoryContext to store slot names */
    + oldcontext = MemoryContextSwitchTo(TopMemoryContext);
    +
    + slot_name = pstrdup(remote_slot->name);
    + slot_names = lappend(slot_names, slot_name);
    +
    + MemoryContextSwitchTo(oldcontext);
    + }
    
    It will be better to move 'MemoryContextSwitchTo' calls outside of the
    loop. No need to switch the context for each slot.
    
    3)
    ProcessSlotSyncAPIChanges() gives a feeling that it is actually
    processing API changes where instead it is processing interrupts or
    config changes. Can we please rename to ProcessSlotSyncAPIInterrupts()
    
    4)
    I prefer version 11's slotsync_api_reread_config() over current
    slotsync_api_config_changed(). There, even error was taken care of
    inside the function, which to me looked better and similar to how
    slotsync worker deals with it.
    
    I have made some comment changes, attached the patch. Please include
    it if you find it okay.
    
    thanks
    Shveta
    
  70. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-09-23T06:41:56Z

    On Tue, Sep 23, 2025 at 10:29 AM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Mon, Sep 22, 2025 at 4:21 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > >
    > > Created a patch v13 with these changes.
    > >
    >
    > Please find a few comments:
    >
    > 1)
    > + /* update the failure structure so that it can be freed on error */
    > + fparams.slot_names = slot_names;
    > +
    >
    > Since slot_names is assigned only once, we can make the above
    > assignment as well only once, inside the if-block where we initialize
    > slot_names.
    >
    > 2)
    > extract_slot_names():
    >
    > + foreach(lc, remote_slots)
    > + {
    > + RemoteSlot *remote_slot = (RemoteSlot *) lfirst(lc);
    > + char       *slot_name;
    > +
    > + /* Switch to long-lived TopMemoryContext to store slot names */
    > + oldcontext = MemoryContextSwitchTo(TopMemoryContext);
    > +
    > + slot_name = pstrdup(remote_slot->name);
    > + slot_names = lappend(slot_names, slot_name);
    > +
    > + MemoryContextSwitchTo(oldcontext);
    > + }
    >
    > It will be better to move 'MemoryContextSwitchTo' calls outside of the
    > loop. No need to switch the context for each slot.
    >
    > 3)
    > ProcessSlotSyncAPIChanges() gives a feeling that it is actually
    > processing API changes where instead it is processing interrupts or
    > config changes. Can we please rename to ProcessSlotSyncAPIInterrupts()
    >
    > 4)
    > I prefer version 11's slotsync_api_reread_config() over current
    > slotsync_api_config_changed(). There, even error was taken care of
    > inside the function, which to me looked better and similar to how
    > slotsync worker deals with it.
    >
    > I have made some comment changes, attached the patch. Please include
    > it if you find it okay.
    >
    
    Tested the patch, few more suggestions
    
    5)
    Currently the error message is:
    
    postgres=# SELECT pg_sync_replication_slots();
    ERROR:  cannot continue slot synchronization due to parameter changes
    DETAIL:  Critical replication parameters (primary_conninfo,
    primary_slot_name, or hot_standby_feedback) have changed  since
    pg_sync_replication_slots() started.
    HINT:  Retry pg_sync_replication_slots() to use the updated configuration.
    
    a)
    To be consistent with other error-messages, can we change ERROR msg
    to: 'cannot continue replication slots synchronization due to
    parameter changes'
    
    b)
    There is double space in DETAIL msg: "have changed  since"
    
    Will it be better to shorten the DETAIL as: 'One or more of
    primary_conninfo, primary_slot_name, or hot_standby_feedback were
    modified.'
    
    6)
    postgres=# SELECT pg_sync_replication_slots();
    ERROR:  exiting from slot synchronization as promotion is triggered
    
    Shall we rephrase it similar to the previous message: 'cannot continue
    replication slots synchronization as standby promotion is triggered'
    
    thanks
    Shveta
    
    
    
    
  71. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-09-24T12:05:10Z

    On Tue, Sep 23, 2025 at 2:59 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Mon, Sep 22, 2025 at 4:21 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > >
    > > Created a patch v13 with these changes.
    > >
    >
    > Please find a few comments:
    >
    > 1)
    > + /* update the failure structure so that it can be freed on error */
    > + fparams.slot_names = slot_names;
    > +
    >
    > Since slot_names is assigned only once, we can make the above
    > assignment as well only once, inside the if-block where we initialize
    > slot_names.
    >
    
    Changed.
    
    > 2)
    > extract_slot_names():
    >
    > + foreach(lc, remote_slots)
    > + {
    > + RemoteSlot *remote_slot = (RemoteSlot *) lfirst(lc);
    > + char       *slot_name;
    > +
    > + /* Switch to long-lived TopMemoryContext to store slot names */
    > + oldcontext = MemoryContextSwitchTo(TopMemoryContext);
    > +
    > + slot_name = pstrdup(remote_slot->name);
    > + slot_names = lappend(slot_names, slot_name);
    > +
    > + MemoryContextSwitchTo(oldcontext);
    > + }
    >
    > It will be better to move 'MemoryContextSwitchTo' calls outside of the
    > loop. No need to switch the context for each slot.
    >
    
    Changed.
    
    > 3)
    > ProcessSlotSyncAPIChanges() gives a feeling that it is actually
    > processing API changes where instead it is processing interrupts or
    > config changes. Can we please rename to ProcessSlotSyncAPIInterrupts()
    >
    
    Changed.
    
    > 4)
    > I prefer version 11's slotsync_api_reread_config() over current
    > slotsync_api_config_changed(). There, even error was taken care of
    > inside the function, which to me looked better and similar to how
    > slotsync worker deals with it.
    >
    
    Changed.
    
    > I have made some comment changes, attached the patch. Please include
    > it if you find it okay.
    
    Incorporated.
    
    
    
    On Tue, Sep 23, 2025 at 4:42 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    >
    > Tested the patch, few more suggestions
    >
    > 5)
    > Currently the error message is:
    >
    > postgres=# SELECT pg_sync_replication_slots();
    > ERROR:  cannot continue slot synchronization due to parameter changes
    > DETAIL:  Critical replication parameters (primary_conninfo,
    > primary_slot_name, or hot_standby_feedback) have changed  since
    > pg_sync_replication_slots() started.
    > HINT:  Retry pg_sync_replication_slots() to use the updated configuration.
    >
    > a)
    > To be consistent with other error-messages, can we change ERROR msg
    > to: 'cannot continue replication slots synchronization due to
    > parameter changes'
    >
    
    Changed.
    
    > b)
    > There is double space in DETAIL msg: "have changed  since"
    >
    > Will it be better to shorten the DETAIL as: 'One or more of
    > primary_conninfo, primary_slot_name, or hot_standby_feedback were
    > modified.'
    >
    
    Changed.
    
    > 6)
    > postgres=# SELECT pg_sync_replication_slots();
    > ERROR:  exiting from slot synchronization as promotion is triggered
    >
    > Shall we rephrase it similar to the previous message: 'cannot continue
    > replication slots synchronization as standby promotion is triggered'
    >
    Changed.
    
    Attaching patch v14 incorporating the above changes.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  72. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-09-26T10:14:36Z

    On Wed, Sep 24, 2025 at 5:35 PM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > On Tue, Sep 23, 2025 at 2:59 PM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > On Mon, Sep 22, 2025 at 4:21 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > > >
    > > >
    > > > Created a patch v13 with these changes.
    > > >
    > >
    > > Please find a few comments:
    > >
    > > 1)
    > > + /* update the failure structure so that it can be freed on error */
    > > + fparams.slot_names = slot_names;
    > > +
    > >
    > > Since slot_names is assigned only once, we can make the above
    > > assignment as well only once, inside the if-block where we initialize
    > > slot_names.
    > >
    >
    > Changed.
    >
    > > 2)
    > > extract_slot_names():
    > >
    > > + foreach(lc, remote_slots)
    > > + {
    > > + RemoteSlot *remote_slot = (RemoteSlot *) lfirst(lc);
    > > + char       *slot_name;
    > > +
    > > + /* Switch to long-lived TopMemoryContext to store slot names */
    > > + oldcontext = MemoryContextSwitchTo(TopMemoryContext);
    > > +
    > > + slot_name = pstrdup(remote_slot->name);
    > > + slot_names = lappend(slot_names, slot_name);
    > > +
    > > + MemoryContextSwitchTo(oldcontext);
    > > + }
    > >
    > > It will be better to move 'MemoryContextSwitchTo' calls outside of the
    > > loop. No need to switch the context for each slot.
    > >
    >
    > Changed.
    >
    > > 3)
    > > ProcessSlotSyncAPIChanges() gives a feeling that it is actually
    > > processing API changes where instead it is processing interrupts or
    > > config changes. Can we please rename to ProcessSlotSyncAPIInterrupts()
    > >
    >
    > Changed.
    >
    > > 4)
    > > I prefer version 11's slotsync_api_reread_config() over current
    > > slotsync_api_config_changed(). There, even error was taken care of
    > > inside the function, which to me looked better and similar to how
    > > slotsync worker deals with it.
    > >
    >
    > Changed.
    >
    > > I have made some comment changes, attached the patch. Please include
    > > it if you find it okay.
    >
    > Incorporated.
    >
    >
    >
    > On Tue, Sep 23, 2025 at 4:42 PM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > >
    > > Tested the patch, few more suggestions
    > >
    > > 5)
    > > Currently the error message is:
    > >
    > > postgres=# SELECT pg_sync_replication_slots();
    > > ERROR:  cannot continue slot synchronization due to parameter changes
    > > DETAIL:  Critical replication parameters (primary_conninfo,
    > > primary_slot_name, or hot_standby_feedback) have changed  since
    > > pg_sync_replication_slots() started.
    > > HINT:  Retry pg_sync_replication_slots() to use the updated configuration.
    > >
    > > a)
    > > To be consistent with other error-messages, can we change ERROR msg
    > > to: 'cannot continue replication slots synchronization due to
    > > parameter changes'
    > >
    >
    > Changed.
    
    It seems it is missed to be changed perhaps due to bringing back
    previous implementation of slotsync_api_reread_config()
    
    >
    > > b)
    > > There is double space in DETAIL msg: "have changed  since"
    > >
    > > Will it be better to shorten the DETAIL as: 'One or more of
    > > primary_conninfo, primary_slot_name, or hot_standby_feedback were
    > > modified.'
    > >
    >
    > Changed.
    >
    
    This too is missed.
    
    > > 6)
    > > postgres=# SELECT pg_sync_replication_slots();
    > > ERROR:  exiting from slot synchronization as promotion is triggered
    > >
    > > Shall we rephrase it similar to the previous message: 'cannot continue
    > > replication slots synchronization as standby promotion is triggered'
    > >
    > Changed.
    >
    > Attaching patch v14 incorporating the above changes.
    >
    
    Few trivial comments:
    
    1)
    slotsync_api_reread_config():
    + * Returns true if conninfo, primary_slot_name or hot_standby_feedback changed.
    
    This comment is no longer valid, we can remove it.
    
    2)
      /* We are done with sync, so reset sync flag */
      reset_syncing_flag();
    +
    
    This extra blank line is not needed.
    
    thanks
    Shveta
    
    
    
    
  73. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-10-02T11:23:34Z

    On Fri, Sep 26, 2025 at 8:14 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    >
    > It seems it is missed to be changed perhaps due to bringing back
    > previous implementation of slotsync_api_reread_config()
    >
    > >
    > > > b)
    > > > There is double space in DETAIL msg: "have changed  since"
    > > >
    > > > Will it be better to shorten the DETAIL as: 'One or more of
    > > > primary_conninfo, primary_slot_name, or hot_standby_feedback were
    > > > modified.'
    > > >
    > >
    > > Changed.
    > >
    
    Fixed.
    
    >
    > This too is missed.
    >
    > > > 6)
    > > > postgres=# SELECT pg_sync_replication_slots();
    > > > ERROR:  exiting from slot synchronization as promotion is triggered
    > > >
    > > > Shall we rephrase it similar to the previous message: 'cannot continue
    > > > replication slots synchronization as standby promotion is triggered'
    > > >
    > > Changed.
    > >
    > > Attaching patch v14 incorporating the above changes.
    > >
    >
    
    This was rephrased.
    
    > Few trivial comments:
    >
    > 1)
    > slotsync_api_reread_config():
    > + * Returns true if conninfo, primary_slot_name or hot_standby_feedback changed.
    >
    > This comment is no longer valid, we can remove it.
    >
    
    Removed.
    
    > 2)
    >   /* We are done with sync, so reset sync flag */
    >   reset_syncing_flag();
    > +
    >
    > This extra blank line is not needed.
    
    Fixed.
    
    Attaching patch v15 addressing these comments.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  74. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-10-06T04:52:09Z

    On Thu, Oct 2, 2025 at 4:53 PM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    >
    > Attaching patch v15 addressing these comments.
    >
    
    It seems v15 had a compilation issue. Resolved it. Attaching v15 again.
    
    thanks
    Shveta
    
  75. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-10-06T06:12:46Z

    On Mon, Oct 6, 2025 at 10:22 AM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Thu, Oct 2, 2025 at 4:53 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > >
    > > Attaching patch v15 addressing these comments.
    > >
    >
    > It seems v15 had a compilation issue. Resolved it. Attaching v15 again.
    >
    
    Verified patch, works well.
    
    I have changed doc as per Ashutosh's suggestion in [1]. Please include
    if you find it okay. Attached the patch as txt file.
    
    [1]: https://www.postgresql.org/message-id/CAExHW5vCLTMQcFKZXrT8bjZpQWvhBUL7Ge6Ufb5oSLh0bp10PA%40mail.gmail.com
    
    thanks
    Shveta
    
  76. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-10-07T09:53:52Z

    Hello Hackers,
    
    In an offline discussion, I was considering adding a TAP test for this
    patch. However, testing the pg_sync_replication_slots() API’s wait
    logic requires a delay of at least 2 seconds, since that’s the
    interval the API sleeps before retrying. I’m not sure it’s acceptable
    to add a TAP test that increases runtime by 2 seconds.
    I’m also wondering if 2 seconds is too long for the API to wait?
    Should we reduce it to something like 200 ms instead? I’d appreciate
    your feedback.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
    
    
    
  77. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-10-07T10:16:52Z

    On Tue, Oct 7, 2025 at 3:24 PM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > Hello Hackers,
    >
    > In an offline discussion, I was considering adding a TAP test for this
    > patch. However, testing the pg_sync_replication_slots() API’s wait
    > logic requires a delay of at least 2 seconds, since that’s the
    > interval the API sleeps before retrying. I’m not sure it’s acceptable
    > to add a TAP test that increases runtime by 2 seconds.
    > I’m also wondering if 2 seconds is too long for the API to wait?
    > Should we reduce it to something like 200 ms instead? I’d appreciate
    > your feedback.
    >
    
    I feel a shorter nap will be good since it is an API and should finish
    fast. But too short a nap may result in too many primary pings
    specially when primary-slots are not advancing. But that case should
    be a rare one. Shall we have a nap of say 500ms? It is neither too
    short nor too long. Thoughts?
    
    thanks
    Shveta
    
    
    
    
  78. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-10-07T11:19:26Z

    On Tue, Oct 7, 2025 at 3:47 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Tue, Oct 7, 2025 at 3:24 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > > Hello Hackers,
    > >
    > > In an offline discussion, I was considering adding a TAP test for this
    > > patch. However, testing the pg_sync_replication_slots() API’s wait
    > > logic requires a delay of at least 2 seconds, since that’s the
    > > interval the API sleeps before retrying. I’m not sure it’s acceptable
    > > to add a TAP test that increases runtime by 2 seconds.
    > > I’m also wondering if 2 seconds is too long for the API to wait?
    > > Should we reduce it to something like 200 ms instead? I’d appreciate
    > > your feedback.
    > >
    >
    > I feel a shorter nap will be good since it is an API and should finish
    > fast. But too short a nap may result in too many primary pings
    > specially when primary-slots are not advancing. But that case should
    > be a rare one. Shall we have a nap of say 500ms? It is neither too
    > short nor too long. Thoughts?
    
    Shorter nap times mean higher possibility of wasted CPU cycles - that
    should be avoided. Doing that for a test's sake seems wrong. Is there
    a way that the naptime can controlled by external factors such as
    likelihood of an advanced slot (just firing bullets in the dark) or is
    the naptime controllable by user interface like GUC? The test can use
    those interfaces.
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  79. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-10-07T11:43:17Z

    On Tue, Oct 7, 2025 at 4:49 PM Ashutosh Bapat
    <ashutosh.bapat.oss@gmail.com> wrote:
    >
    > On Tue, Oct 7, 2025 at 3:47 PM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > On Tue, Oct 7, 2025 at 3:24 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > > >
    > > > Hello Hackers,
    > > >
    > > > In an offline discussion, I was considering adding a TAP test for this
    > > > patch. However, testing the pg_sync_replication_slots() API’s wait
    > > > logic requires a delay of at least 2 seconds, since that’s the
    > > > interval the API sleeps before retrying. I’m not sure it’s acceptable
    > > > to add a TAP test that increases runtime by 2 seconds.
    > > > I’m also wondering if 2 seconds is too long for the API to wait?
    > > > Should we reduce it to something like 200 ms instead? I’d appreciate
    > > > your feedback.
    > > >
    > >
    > > I feel a shorter nap will be good since it is an API and should finish
    > > fast. But too short a nap may result in too many primary pings
    > > specially when primary-slots are not advancing. But that case should
    > > be a rare one. Shall we have a nap of say 500ms? It is neither too
    > > short nor too long. Thoughts?
    >
    > Shorter nap times mean higher possibility of wasted CPU cycles - that
    > should be avoided. Doing that for a test's sake seems wrong. Is there
    > a way that the naptime can controlled by external factors such as
    > likelihood of an advanced slot (just firing bullets in the dark) or is
    > the naptime controllable by user interface like GUC? The test can use
    > those interfaces.
    >
    
    Yes, we can control naptime based on the fact whether any slots are
    being advanced on primary. This is how a slotsync worker does. It
    keeps on doubling the naptime if there is no activity on primary
    starting from 200ms till max of 30 sec. As soon as activity happens,
    naptime is reduced to 200ms again.
    
    thanks
    Shveta
    
    
    
    
  80. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Amit Kapila <amit.kapila16@gmail.com> — 2025-10-09T08:43:48Z

    On Tue, Oct 7, 2025 at 5:13 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Tue, Oct 7, 2025 at 4:49 PM Ashutosh Bapat
    > <ashutosh.bapat.oss@gmail.com> wrote:
    > >
    > > Shorter nap times mean higher possibility of wasted CPU cycles - that
    > > should be avoided. Doing that for a test's sake seems wrong. Is there
    > > a way that the naptime can controlled by external factors such as
    > > likelihood of an advanced slot (just firing bullets in the dark) or is
    > > the naptime controllable by user interface like GUC? The test can use
    > > those interfaces.
    > >
    >
    > Yes, we can control naptime based on the fact whether any slots are
    > being advanced on primary. This is how a slotsync worker does. It
    > keeps on doubling the naptime if there is no activity on primary
    > starting from 200ms till max of 30 sec. As soon as activity happens,
    > naptime is reduced to 200ms again.
    >
    
    Is there a reason why we don't want to use the same naptime strategy
    for API and worker?
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  81. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-10-09T08:59:58Z

    On Thu, Oct 9, 2025 at 2:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    >
    > On Tue, Oct 7, 2025 at 5:13 PM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > On Tue, Oct 7, 2025 at 4:49 PM Ashutosh Bapat
    > > <ashutosh.bapat.oss@gmail.com> wrote:
    > > >
    > > > Shorter nap times mean higher possibility of wasted CPU cycles - that
    > > > should be avoided. Doing that for a test's sake seems wrong. Is there
    > > > a way that the naptime can controlled by external factors such as
    > > > likelihood of an advanced slot (just firing bullets in the dark) or is
    > > > the naptime controllable by user interface like GUC? The test can use
    > > > those interfaces.
    > > >
    > >
    > > Yes, we can control naptime based on the fact whether any slots are
    > > being advanced on primary. This is how a slotsync worker does. It
    > > keeps on doubling the naptime if there is no activity on primary
    > > starting from 200ms till max of 30 sec. As soon as activity happens,
    > > naptime is reduced to 200ms again.
    > >
    >
    > Is there a reason why we don't want to use the same naptime strategy
    > for API and worker?
    >
    
    There was a suggestion at [1] for a shorter naptime in case of API.
    
    [1]: https://www.postgresql.org/message-id/CAExHW5sQLJGhEA%2B9ZFVwZUpqfFFP5KPn9w64t3uiHSuiEH-9mQ%40mail.gmail.com
    
    thanks
    Shveta
    
    
    
    
  82. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Amit Kapila <amit.kapila16@gmail.com> — 2025-10-09T09:12:02Z

    On Tue, Oct 7, 2025 at 4:49 PM Ashutosh Bapat
    <ashutosh.bapat.oss@gmail.com> wrote:
    >
    > On Tue, Oct 7, 2025 at 3:47 PM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > On Tue, Oct 7, 2025 at 3:24 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > > >
    > > > Hello Hackers,
    > > >
    > > > In an offline discussion, I was considering adding a TAP test for this
    > > > patch. However, testing the pg_sync_replication_slots() API’s wait
    > > > logic requires a delay of at least 2 seconds, since that’s the
    > > > interval the API sleeps before retrying. I’m not sure it’s acceptable
    > > > to add a TAP test that increases runtime by 2 seconds.
    > > > I’m also wondering if 2 seconds is too long for the API to wait?
    > > > Should we reduce it to something like 200 ms instead? I’d appreciate
    > > > your feedback.
    > > >
    > >
    > > I feel a shorter nap will be good since it is an API and should finish
    > > fast. But too short a nap may result in too many primary pings
    > > specially when primary-slots are not advancing. But that case should
    > > be a rare one. Shall we have a nap of say 500ms? It is neither too
    > > short nor too long. Thoughts?
    >
    > Shorter nap times mean higher possibility of wasted CPU cycles - that
    > should be avoided.
    >
    
    This seems to be exactly opposite of what you argued previously in email [1].
    
    >
     Doing that for a test's sake seems wrong.
    >
    
    Yeah, if test writing is important to cover this case then we can even
    consider using an injection point.
    
    >
     Is there
    > a way that the naptime can controlled by external factors such as
    > likelihood of an advanced slot
    >
    
    We already do this for the worker where the naptime is increased
    gradually when there is no activity on the primary. It is better to
    use the same strategy here. This API is not going to be used
    frequently; rather I would say, one would like to use it just before
    planned switchover. So, I feel it is okay even if the wait time is
    slightly higher when actually required. This would prevent adding
    additional code maintenance for API and worker.
    
    [1] - https://www.postgresql.org/message-id/CAExHW5sQLJGhEA%2B9ZFVwZUpqfFFP5KPn9w64t3uiHSuiEH-9mQ%40mail.gmail.com
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  83. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-10-09T09:26:28Z

    On Thu, Oct 9, 2025 at 2:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    >
    > On Tue, Oct 7, 2025 at 4:49 PM Ashutosh Bapat
    > <ashutosh.bapat.oss@gmail.com> wrote:
    > >
    > > On Tue, Oct 7, 2025 at 3:47 PM shveta malik <shveta.malik@gmail.com> wrote:
    > > >
    > > > On Tue, Oct 7, 2025 at 3:24 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > > > >
    > > > > Hello Hackers,
    > > > >
    > > > > In an offline discussion, I was considering adding a TAP test for this
    > > > > patch. However, testing the pg_sync_replication_slots() API’s wait
    > > > > logic requires a delay of at least 2 seconds, since that’s the
    > > > > interval the API sleeps before retrying. I’m not sure it’s acceptable
    > > > > to add a TAP test that increases runtime by 2 seconds.
    > > > > I’m also wondering if 2 seconds is too long for the API to wait?
    > > > > Should we reduce it to something like 200 ms instead? I’d appreciate
    > > > > your feedback.
    > > > >
    > > >
    > > > I feel a shorter nap will be good since it is an API and should finish
    > > > fast. But too short a nap may result in too many primary pings
    > > > specially when primary-slots are not advancing. But that case should
    > > > be a rare one. Shall we have a nap of say 500ms? It is neither too
    > > > short nor too long. Thoughts?
    > >
    > > Shorter nap times mean higher possibility of wasted CPU cycles - that
    > > should be avoided.
    > >
    >
    > This seems to be exactly opposite of what you argued previously in email [1].
    >
    > >
    >  Doing that for a test's sake seems wrong.
    > >
    >
    > Yeah, if test writing is important to cover this case then we can even
    > consider using an injection point.
    >
    
    That observation was made to make my point that the logic to decide
    naptime in function and in worker should be separate. The naptime in
    the function can be significantly smaller than the naptime in the
    worker. But making it shorter just for the test's sake isn't a good
    idea. If we could use injection points, better.
    
    
    > >
    >  Is there
    > > a way that the naptime can controlled by external factors such as
    > > likelihood of an advanced slot
    > >
    >
    > We already do this for the worker where the naptime is increased
    > gradually when there is no activity on the primary. It is better to
    > use the same strategy here. This API is not going to be used
    > frequently; rather I would say, one would like to use it just before
    > planned switchover. So, I feel it is okay even if the wait time is
    > slightly higher when actually required. This would prevent adding
    > additional code maintenance for API and worker.
    
    That makes sense.
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  84. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-10-13T07:57:36Z

    On Tue, Oct 7, 2025 at 10:43 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    >
    > Yes, we can control naptime based on the fact whether any slots are
    > being advanced on primary. This is how a slotsync worker does. It
    > keeps on doubling the naptime if there is no activity on primary
    > starting from 200ms till max of 30 sec. As soon as activity happens,
    > naptime is reduced to 200ms again.
    >
    
    I have modified the patch to use the same wait logic as the slotsync
    worker. I have also incorporated the document changes that you shared.
    Attaching v16 with the  above changes.
    
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  85. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-10-15T04:26:55Z

    On Mon, Oct 13, 2025 at 6:57 PM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > I have modified the patch to use the same wait logic as the slotsync
    > worker. I have also incorporated the document changes that you shared.
    > Attaching v16 with the  above changes.
    >
    
    Updated the patch with a tap test.
    Attaching patch v17 which has a tap test to test the feature added.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  86. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-10-15T08:38:24Z

    On Wed, Oct 15, 2025 at 9:57 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    >
    > Updated the patch with a tap test.
    > Attaching patch v17 which has a tap test to test the feature added.
    >
    
    Thanks for the patch. I noticed that in the case of API, we are
    passing 'some_slot_updated' as always false to
    wait_for_slot_activity(). Shouldn't we pass it as actual value just
    like slotsync worker does? There may be a case that in a given cycle,
    one of the temp slots is persisted or one of the persisted slots is
    updated, in such a case we should not double the naptime. The naptime
    doubling logic is only when there is no activity happening on primary.
    
    thanks
    Shveta
    
    
    
    
  87. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-10-15T09:29:24Z

    On Wed, Oct 15, 2025 at 2:08 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Wed, Oct 15, 2025 at 9:57 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > >
    > > Updated the patch with a tap test.
    > > Attaching patch v17 which has a tap test to test the feature added.
    > >
    >
    
    Test also needs correction. It seems the existing test of 'Test
    logical failover slots corresponding to different plugins can be
    synced to the standby.' is disturbed. If it is already tested and need
    not be covered again, then comments need to be changed to clarify
    that; otherwise the test needs to be brought back.
    
    thanks
    Shveta
    
    
    
    
  88. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-10-22T04:55:17Z

    On Wed, Oct 15, 2025 at 7:38 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Wed, Oct 15, 2025 at 9:57 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > >
    > > Updated the patch with a tap test.
    > > Attaching patch v17 which has a tap test to test the feature added.
    > >
    >
    > Thanks for the patch. I noticed that in the case of API, we are
    > passing 'some_slot_updated' as always false to
    > wait_for_slot_activity(). Shouldn't we pass it as actual value just
    > like slotsync worker does? There may be a case that in a given cycle,
    > one of the temp slots is persisted or one of the persisted slots is
    > updated, in such a case we should not double the naptime. The naptime
    > doubling logic is only when there is no activity happening on primary.
    >
    
    I've modified this accordingly
    
    On Wed, Oct 15, 2025 at 8:29 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Wed, Oct 15, 2025 at 2:08 PM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > On Wed, Oct 15, 2025 at 9:57 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > > >
    > > >
    > > > Updated the patch with a tap test.
    > > > Attaching patch v17 which has a tap test to test the feature added.
    > > >
    > >
    >
    > Test also needs correction. It seems the existing test of 'Test
    > logical failover slots corresponding to different plugins can be
    > synced to the standby.' is disturbed. If it is already tested and need
    > not be covered again, then comments need to be changed to clarify
    > that; otherwise the test needs to be brought back.
    
    I've modified the comments to reflect the new changes.
    
    attaching patch v18 with the above changes.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  89. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-10-24T09:29:37Z

    On Wed, Oct 22, 2025 at 10:25 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    >
    > I've modified the comments to reflect the new changes.
    >
    > attaching patch v18 with the above changes.
    >
    
    Thanks for the patch. The test is still not clear. Can we please add
    the test after the test of "Test logical failover slots corresponding
    to different plugins" finishes instead of adding it in between?
    
    thanks
    Shveta
    
    
    
    
  90. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-10-27T07:47:28Z

    On Fri, Oct 24, 2025 at 8:29 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Wed, Oct 22, 2025 at 10:25 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > >
    > > I've modified the comments to reflect the new changes.
    > >
    > > attaching patch v18 with the above changes.
    > >
    >
    > Thanks for the patch. The test is still not clear. Can we please add
    > the test after the test of "Test logical failover slots corresponding
    > to different plugins" finishes instead of adding it in between?
    >
    
    I've rewritten the tests again to make this possible. Attaching v19
    which has the modified tap test.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  91. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Japin Li <japinli@hotmail.com> — 2025-10-27T09:22:13Z

    Hi, Ajin
    
    Thanks for updating the patch.
    
    On Mon, 27 Oct 2025 at 18:47, Ajin Cherian <itsajin@gmail.com> wrote:
    > On Fri, Oct 24, 2025 at 8:29 PM shveta malik <shveta.malik@gmail.com> wrote:
    >>
    >> On Wed, Oct 22, 2025 at 10:25 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >> >
    >> >
    >> > I've modified the comments to reflect the new changes.
    >> >
    >> > attaching patch v18 with the above changes.
    >> >
    >>
    >> Thanks for the patch. The test is still not clear. Can we please add
    >> the test after the test of "Test logical failover slots corresponding
    >> to different plugins" finishes instead of adding it in between?
    >>
    >
    > I've rewritten the tests again to make this possible. Attaching v19
    > which has the modified tap test.
    
    Here are some comments on the new patch.
    
    1. Given the existence of the foreach_ptr macro, we can switch the usage of
    foreach to foreach_ptr.
    
    diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
    index 1b78ffc5ff1..5db51407a82 100644
    --- a/src/backend/replication/logical/slotsync.c
    +++ b/src/backend/replication/logical/slotsync.c
    @@ -872,7 +872,6 @@ fetch_remote_slots(WalReceiverConn *wrconn, List *slot_names)
     
     	if (slot_names != NIL)
     	{
    -		ListCell   *lc;
     		bool		first_slot = true;
     
     		/*
    @@ -880,10 +879,8 @@ fetch_remote_slots(WalReceiverConn *wrconn, List *slot_names)
     		 */
     		appendStringInfoString(&query, " AND slot_name IN (");
     
    -		foreach(lc, slot_names)
    +		foreach_ptr(char, slot_name, slot_names)
     		{
    -			char *slot_name = (char *) lfirst(lc);
    -
     			if (!first_slot)
     				appendStringInfoString(&query, ", ");
     
    @@ -1872,15 +1869,13 @@ static List *
     extract_slot_names(List *remote_slots)
     {
     	List		*slot_names = NIL;
    -	ListCell	*lc;
     	MemoryContext oldcontext;
     
     	/* Switch to long-lived TopMemoryContext to store slot names */
     	oldcontext = MemoryContextSwitchTo(TopMemoryContext);
     
    -	foreach(lc, remote_slots)
    +	foreach_ptr(RemoteSlot, remote_slot, remote_slots)
     	{
    -		RemoteSlot *remote_slot = (RemoteSlot *) lfirst(lc);
     		char       *slot_name;
     
     		slot_name = pstrdup(remote_slot->name);
    
    2. To append a signal character, switch from appendStringInfoString() to the
    more efficient appendStringInfoChar().
    
    + 		appendStringInfoString(&query, ")");
    
    3. The query memory can be released immediately after walrcv_exec() because
    there are no subsequent references.
    
    @@ -895,6 +892,7 @@ fetch_remote_slots(WalReceiverConn *wrconn, List *slot_names)
     
     	/* Execute the query */
     	res = walrcv_exec(wrconn, query.data, SLOTSYNC_COLUMN_COUNT, slotRow);
    +	pfree(query.data);
     	if (res->status != WALRCV_OK_TUPLES)
     		ereport(ERROR,
     				errmsg("could not fetch failover logical slots info from the primary server: %s",
    @@ -975,7 +973,6 @@ fetch_remote_slots(WalReceiverConn *wrconn, List *slot_names)
     	}
     
     	walrcv_clear_result(res);
    -	pfree(query.data);
     
     	return remote_slot_list;
     }
    
    > 
    > Fujitsu Australia
    >
    > [2. text/x-diff; v19-0001-Improve-initial-slot-synchronization-in-pg_sync_.patch]...
    
    -- 
    Regards,
    Japin Li
    ChengDu WenWu Information Technology Co., Ltd.
    
    
    
    
  92. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-10-30T10:18:31Z

    On Mon, Oct 27, 2025 at 8:22 PM Japin Li <japinli@hotmail.com> wrote:
    >
    >
    > Hi, Ajin
    >
    > Thanks for updating the patch.
    >
    > On Mon, 27 Oct 2025 at 18:47, Ajin Cherian <itsajin@gmail.com> wrote:
    > > On Fri, Oct 24, 2025 at 8:29 PM shveta malik <shveta.malik@gmail.com> wrote:
    > >>
    > >> On Wed, Oct 22, 2025 at 10:25 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > >> >
    > >> >
    > >> > I've modified the comments to reflect the new changes.
    > >> >
    > >> > attaching patch v18 with the above changes.
    > >> >
    > >>
    > >> Thanks for the patch. The test is still not clear. Can we please add
    > >> the test after the test of "Test logical failover slots corresponding
    > >> to different plugins" finishes instead of adding it in between?
    > >>
    > >
    > > I've rewritten the tests again to make this possible. Attaching v19
    > > which has the modified tap test.
    >
    > Here are some comments on the new patch.
    >
    > 1. Given the existence of the foreach_ptr macro, we can switch the usage of
    > foreach to foreach_ptr.
    >
    > diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
    > index 1b78ffc5ff1..5db51407a82 100644
    > --- a/src/backend/replication/logical/slotsync.c
    > +++ b/src/backend/replication/logical/slotsync.c
    > @@ -872,7 +872,6 @@ fetch_remote_slots(WalReceiverConn *wrconn, List *slot_names)
    >
    >         if (slot_names != NIL)
    >         {
    > -               ListCell   *lc;
    >                 bool            first_slot = true;
    >
    >                 /*
    > @@ -880,10 +879,8 @@ fetch_remote_slots(WalReceiverConn *wrconn, List *slot_names)
    >                  */
    >                 appendStringInfoString(&query, " AND slot_name IN (");
    >
    > -               foreach(lc, slot_names)
    > +               foreach_ptr(char, slot_name, slot_names)
    >                 {
    > -                       char *slot_name = (char *) lfirst(lc);
    > -
    >                         if (!first_slot)
    >                                 appendStringInfoString(&query, ", ");
    >
    > @@ -1872,15 +1869,13 @@ static List *
    >  extract_slot_names(List *remote_slots)
    >  {
    >         List            *slot_names = NIL;
    > -       ListCell        *lc;
    >         MemoryContext oldcontext;
    >
    >         /* Switch to long-lived TopMemoryContext to store slot names */
    >         oldcontext = MemoryContextSwitchTo(TopMemoryContext);
    >
    > -       foreach(lc, remote_slots)
    > +       foreach_ptr(RemoteSlot, remote_slot, remote_slots)
    >         {
    > -               RemoteSlot *remote_slot = (RemoteSlot *) lfirst(lc);
    >                 char       *slot_name;
    >
    >                 slot_name = pstrdup(remote_slot->name);
    >
    > 2. To append a signal character, switch from appendStringInfoString() to the
    > more efficient appendStringInfoChar().
    >
    > +               appendStringInfoString(&query, ")");
    >
    > 3. The query memory can be released immediately after walrcv_exec() because
    > there are no subsequent references.
    >
    > @@ -895,6 +892,7 @@ fetch_remote_slots(WalReceiverConn *wrconn, List *slot_names)
    >
    >         /* Execute the query */
    >         res = walrcv_exec(wrconn, query.data, SLOTSYNC_COLUMN_COUNT, slotRow);
    > +       pfree(query.data);
    >         if (res->status != WALRCV_OK_TUPLES)
    >                 ereport(ERROR,
    >                                 errmsg("could not fetch failover logical slots info from the primary server: %s",
    > @@ -975,7 +973,6 @@ fetch_remote_slots(WalReceiverConn *wrconn, List *slot_names)
    >         }
    >
    >         walrcv_clear_result(res);
    > -       pfree(query.data);
    >
    >         return remote_slot_list;
    >  }
    >
    
    Thanks for your review, Japin. Here's patch v20 addressing the comments.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  93. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Chao Li <li.evan.chao@gmail.com> — 2025-10-30T11:15:34Z

    Hi Ajin,
    
    I have reviewed v20 and got a few comments:
    
    > On Oct 30, 2025, at 18:18, Ajin Cherian <itsajin@gmail.com> wrote:
    > 
    > <v20-0001-Improve-initial-slot-synchronization-in-pg_sync_.patch>
    
    1 - slotsync.c
    ```
    +		if (slot_names)
    +			list_free_deep(slot_names);
     
     		/* Cleanup the synced temporary slots */
     		ReplicationSlotCleanup(true);
    @@ -1762,5 +2026,5 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
     		/* We are done with sync, so reset sync flag */
     		reset_syncing_flag();
     	}
    -	PG_END_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, PointerGetDatum(wrconn));
    +	PG_END_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, PointerGetDatum(&fparams));
    ```
    
    I am afraid there is a risk of double memory free. Slot_names has been assigned to fparams.slot_names within the  for loop, and it’s freed after the loop. If something gets wrong and slotsync_failure_callback() is called, the function will free fparams.slot_names again.
    
    2 - slotsync.c
    ```
    +			/*
    +			 * Fetch remote slot info for the given slot_names. If slot_names is NIL,
    +			 * fetch all failover-enabled slots. Note that we reuse slot_names from
    +			 * the first iteration; re-fetching all failover slots each time could
    +			 * cause an endless loop. Instead of reprocessing only the pending slots
    +			 * in each iteration, it's better to process all the slots received in
    +			 * the first iteration. This ensures that by the time we're done, all
    +			 * slots reflect the latest values.
    +			 */
    +			remote_slots = fetch_remote_slots(wrconn, slot_names);
    +
    +			/* Attempt to synchronize slots */
    +			some_slot_updated = synchronize_slots(wrconn, remote_slots,
    +												  &slot_persistence_pending);
    +
    +			/*
    +			 * If slot_persistence_pending is true, extract slot names
    +			 * for future iterations (only needed if we haven't done it yet)
    +			 */
    +			if (slot_names == NIL && slot_persistence_pending)
    +			{
    +				slot_names = extract_slot_names(remote_slots);
    +
    +				/* Update the failure structure so that it can be freed on error */
    +				fparams.slot_names = slot_names;
    +			}
    ```
    
    I am thinking if that could be a problem. As you now extract_slot_names() only in the first iteration, if a slot is dropped, and a new slot comes with the same name, will the new slot be incorrectly synced?
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
    
    
    
  94. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Japin Li <japinli@hotmail.com> — 2025-10-31T04:42:27Z

    On Thu, 30 Oct 2025 at 21:18, Ajin Cherian <itsajin@gmail.com> wrote:
    > On Mon, Oct 27, 2025 at 8:22 PM Japin Li <japinli@hotmail.com> wrote:
    >>
    >>
    >> Hi, Ajin
    >>
    >> Thanks for updating the patch.
    >>
    >> On Mon, 27 Oct 2025 at 18:47, Ajin Cherian <itsajin@gmail.com> wrote:
    >> > On Fri, Oct 24, 2025 at 8:29 PM shveta malik <shveta.malik@gmail.com> wrote:
    >> >>
    >> >> On Wed, Oct 22, 2025 at 10:25 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >> >> >
    >> >> >
    >> >> > I've modified the comments to reflect the new changes.
    >> >> >
    >> >> > attaching patch v18 with the above changes.
    >> >> >
    >> >>
    >> >> Thanks for the patch. The test is still not clear. Can we please add
    >> >> the test after the test of "Test logical failover slots corresponding
    >> >> to different plugins" finishes instead of adding it in between?
    >> >>
    >> >
    >> > I've rewritten the tests again to make this possible. Attaching v19
    >> > which has the modified tap test.
    >>
    >> Here are some comments on the new patch.
    >>
    >> 1. Given the existence of the foreach_ptr macro, we can switch the usage of
    >> foreach to foreach_ptr.
    >>
    >> diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
    >> index 1b78ffc5ff1..5db51407a82 100644
    >> --- a/src/backend/replication/logical/slotsync.c
    >> +++ b/src/backend/replication/logical/slotsync.c
    >> @@ -872,7 +872,6 @@ fetch_remote_slots(WalReceiverConn *wrconn, List *slot_names)
    >>
    >>         if (slot_names != NIL)
    >>         {
    >> -               ListCell   *lc;
    >>                 bool            first_slot = true;
    >>
    >>                 /*
    >> @@ -880,10 +879,8 @@ fetch_remote_slots(WalReceiverConn *wrconn, List *slot_names)
    >>                  */
    >>                 appendStringInfoString(&query, " AND slot_name IN (");
    >>
    >> -               foreach(lc, slot_names)
    >> +               foreach_ptr(char, slot_name, slot_names)
    >>                 {
    >> -                       char *slot_name = (char *) lfirst(lc);
    >> -
    >>                         if (!first_slot)
    >>                                 appendStringInfoString(&query, ", ");
    >>
    >> @@ -1872,15 +1869,13 @@ static List *
    >>  extract_slot_names(List *remote_slots)
    >>  {
    >>         List            *slot_names = NIL;
    >> -       ListCell        *lc;
    >>         MemoryContext oldcontext;
    >>
    >>         /* Switch to long-lived TopMemoryContext to store slot names */
    >>         oldcontext = MemoryContextSwitchTo(TopMemoryContext);
    >>
    >> -       foreach(lc, remote_slots)
    >> +       foreach_ptr(RemoteSlot, remote_slot, remote_slots)
    >>         {
    >> -               RemoteSlot *remote_slot = (RemoteSlot *) lfirst(lc);
    >>                 char       *slot_name;
    >>
    >>                 slot_name = pstrdup(remote_slot->name);
    >>
    >> 2. To append a signal character, switch from appendStringInfoString() to the
    >> more efficient appendStringInfoChar().
    >>
    >> +               appendStringInfoString(&query, ")");
    >>
    >> 3. The query memory can be released immediately after walrcv_exec() because
    >> there are no subsequent references.
    >>
    >> @@ -895,6 +892,7 @@ fetch_remote_slots(WalReceiverConn *wrconn, List *slot_names)
    >>
    >>         /* Execute the query */
    >>         res = walrcv_exec(wrconn, query.data, SLOTSYNC_COLUMN_COUNT, slotRow);
    >> +       pfree(query.data);
    >>         if (res->status != WALRCV_OK_TUPLES)
    >>                 ereport(ERROR,
    >>                                 errmsg("could not fetch failover logical slots info from the primary server: %s",
    >> @@ -975,7 +973,6 @@ fetch_remote_slots(WalReceiverConn *wrconn, List *slot_names)
    >>         }
    >>
    >>         walrcv_clear_result(res);
    >> -       pfree(query.data);
    >>
    >>         return remote_slot_list;
    >>  }
    >>
    >
    > Thanks for your review, Japin. Here's patch v20 addressing the comments.
    >
    
    Thanks for updating the patch.  Here are some comments on v20.
    
    1. Since the content is unchanged, no modification is needed here.
    
    -		 * We do not drop the slot because the restart_lsn can be ahead of the
    -		 * current location when recreating the slot in the next cycle. It may
    -		 * take more time to create such a slot. Therefore, we keep this slot
    -		 * and attempt the synchronization in the next cycle.
    +		 * We do not drop the slot because the restart_lsn can be
    +		 * ahead of the current location when recreating the slot in
    +		 * the next cycle. It may take more time to create such a
    +		 * slot. Therefore, we keep this slot and attempt the
    +		 * synchronization in the next cycle.
    
    2. Could we align the parameter comment style for synchronize_slots() and
    fetch_remote_slots() for better consistency?
    
    3. Is this redundant? It was already initialized to false during declaration.
    
    +			/* Reset flag before every iteration */
    +			slot_persistence_pending = false;
    
    4. A minor nitpick.  The opening brace should be on a new line for style
    consistency.
    
    +			if (!IsTransactionState()) {
    +				StartTransactionCommand();
    +				started_tx = true;
    +			}
    
    5. Given that fparams.slot_names is a list, I suggest we replace NULL with NIL
    for type consistency.
    
    +	fparams.slot_names = NULL;
    
    
    -- 
    Regards,
    Japin Li
    ChengDu WenWu Information Technology Co., Ltd.
    
    
    
    
  95. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Japin Li <japinli@hotmail.com> — 2025-10-31T05:01:51Z

    On Thu, 30 Oct 2025 at 19:15, Chao Li <li.evan.chao@gmail.com> wrote:
    > Hi Ajin,
    >
    > I have reviewed v20 and got a few comments:
    >
    >> On Oct 30, 2025, at 18:18, Ajin Cherian <itsajin@gmail.com> wrote:
    >> 
    >> <v20-0001-Improve-initial-slot-synchronization-in-pg_sync_.patch>
    >
    > 1 - slotsync.c
    > ```
    > +		if (slot_names)
    > +			list_free_deep(slot_names);
    >  
    >  		/* Cleanup the synced temporary slots */
    >  		ReplicationSlotCleanup(true);
    > @@ -1762,5 +2026,5 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
    >  		/* We are done with sync, so reset sync flag */
    >  		reset_syncing_flag();
    >  	}
    > -	PG_END_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, PointerGetDatum(wrconn));
    > +	PG_END_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, PointerGetDatum(&fparams));
    > ```
    >
    > I am afraid there is a risk of double memory free. Slot_names has been assigned to fparams.slot_names within the  for loop, and it’s freed after the loop. If something gets wrong and slotsync_failure_callback() is called, the function will free fparams.slot_names again.
    >
    
    Agreed.
    
    Maybe we should set the fparams.slot_names to NIL immediately after freeing
    the memory.
    
    > 2 - slotsync.c
    > ```
    > +			/*
    > +			 * Fetch remote slot info for the given slot_names. If slot_names is NIL,
    > +			 * fetch all failover-enabled slots. Note that we reuse slot_names from
    > +			 * the first iteration; re-fetching all failover slots each time could
    > +			 * cause an endless loop. Instead of reprocessing only the pending slots
    > +			 * in each iteration, it's better to process all the slots received in
    > +			 * the first iteration. This ensures that by the time we're done, all
    > +			 * slots reflect the latest values.
    > +			 */
    > +			remote_slots = fetch_remote_slots(wrconn, slot_names);
    > +
    > +			/* Attempt to synchronize slots */
    > +			some_slot_updated = synchronize_slots(wrconn, remote_slots,
    > +												  &slot_persistence_pending);
    > +
    > +			/*
    > +			 * If slot_persistence_pending is true, extract slot names
    > +			 * for future iterations (only needed if we haven't done it yet)
    > +			 */
    > +			if (slot_names == NIL && slot_persistence_pending)
    > +			{
    > +				slot_names = extract_slot_names(remote_slots);
    > +
    > +				/* Update the failure structure so that it can be freed on error */
    > +				fparams.slot_names = slot_names;
    > +			}
    > ```
    >
    > I am thinking if that could be a problem. As you now extract_slot_names() only in the first iteration, if a slot is dropped, and a new slot comes with the same name, will the new slot be incorrectly synced?
    >
    
    The slot name alone is insufficient to distinguish between the old and new
    slots.  In this case, the new slot state will overwrite the old.  I see no
    harm in this behavior, but please confirm if this is the desired behavior.
    
    -- 
    Regards,
    Japin Li
    ChengDu WenWu Information Technology Co., Ltd.
    
    
    
    
  96. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-10-31T05:34:31Z

    On Thu, Oct 30, 2025 at 3:48 PM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    >
    > Thanks for your review, Japin. Here's patch v20 addressing the comments.
    >
    
    Thank You for the patch. Please find a few comment son test:
    
    
    1)
    +# until the slot becomes sync-ready (when the standby catches up to the
    +# slot's restart_lsn).
    
    I think it should be 'when the primary server catches up' or 'when the
    remote slot catches up with the locally reserved position.'
    
    2)
    +# Attempt to synchronize slots using API. This will initially fail because
    +# the slot is not yet sync-ready (standby hasn't caught up to slot's
    restart_lsn),
    +# but the API will wait and retry. Call the API in a background process.
    
    a)
    'This will initially fail ' seems like the API will give an error,
    which is not the case
    
    b) 'standby hasn't caught up to slot's restart_lsn' is not correct.
    
    We can rephrase to:
    # Attempt to synchronize slots using the API. The API will continue
    retrying synchronization until the remote slot catches up with the
    locally reserved position.
    
    3)
    +# Enable the Subscription, so that the slot catches up
    
    slot --> remote slot
    
    4)
    +# Create xl_running_xacts records on the primary for which the
    standby is waiting
    
    Shall we rephrase to below or anything better if you have?:
    Create xl_running_xacts on the primary to speed up restart_lsn advancement.
    
    5)
    +# Confirm that the logical failover slot is created on the standby and is
    +# flagged as 'synced'
    
    Suggestion:
    Verify that the logical failover slot is created on the standby,
    marked as 'synced', and persisted.
    
    (It is important to mention persisted because even temporary slot is
    marked as synced)
    
    thanks
    Shveta
    
    
    
    
  97. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-10-31T05:51:16Z

    On Fri, Oct 31, 2025 at 11:04 AM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Thu, Oct 30, 2025 at 3:48 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > >
    > > Thanks for your review, Japin. Here's patch v20 addressing the comments.
    > >
    >
    > Thank You for the patch. Please find a few comment son test:
    >
    >
    > 1)
    > +# until the slot becomes sync-ready (when the standby catches up to the
    > +# slot's restart_lsn).
    >
    > I think it should be 'when the primary server catches up' or 'when the
    > remote slot catches up with the locally reserved position.'
    >
    > 2)
    > +# Attempt to synchronize slots using API. This will initially fail because
    > +# the slot is not yet sync-ready (standby hasn't caught up to slot's
    > restart_lsn),
    > +# but the API will wait and retry. Call the API in a background process.
    >
    > a)
    > 'This will initially fail ' seems like the API will give an error,
    > which is not the case
    >
    > b) 'standby hasn't caught up to slot's restart_lsn' is not correct.
    >
    > We can rephrase to:
    > # Attempt to synchronize slots using the API. The API will continue
    > retrying synchronization until the remote slot catches up with the
    > locally reserved position.
    >
    > 3)
    > +# Enable the Subscription, so that the slot catches up
    >
    > slot --> remote slot
    >
    > 4)
    > +# Create xl_running_xacts records on the primary for which the
    > standby is waiting
    >
    > Shall we rephrase to below or anything better if you have?:
    > Create xl_running_xacts on the primary to speed up restart_lsn advancement.
    >
    > 5)
    > +# Confirm that the logical failover slot is created on the standby and is
    > +# flagged as 'synced'
    >
    > Suggestion:
    > Verify that the logical failover slot is created on the standby,
    > marked as 'synced', and persisted.
    >
    > (It is important to mention persisted because even temporary slot is
    > marked as synced)
    >
    
    Shall we remove this change as it does not belong to the current patch
    directly? I think it was a suggestion earlier, but we shall remove it.
    
    6)
    -# Confirm the synced slot 'lsub1_slot' is retained on the new primary
    +# Confirm that the synced slots 'lsub1_slot' and 'snap_test_slot' are
    retained on the new primary
     is( $standby1->safe_psql(
      'postgres',
      q{SELECT count(*) = 2 FROM pg_replication_slots WHERE slot_name IN
    ('lsub1_slot', 'snap_test_slot') AND synced AND NOT temporary;}
    +
    
    thanks
    Shveta
    
    
    
    
  98. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-11-03T10:33:03Z

    On Thu, Oct 30, 2025 at 10:16 PM Chao Li <li.evan.chao@gmail.com> wrote:
    >
    > Hi Ajin,
    >
    > I have reviewed v20 and got a few comments:
    >
    > > On Oct 30, 2025, at 18:18, Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > > <v20-0001-Improve-initial-slot-synchronization-in-pg_sync_.patch>
    >
    > 1 - slotsync.c
    > ```
    > +               if (slot_names)
    > +                       list_free_deep(slot_names);
    >
    >                 /* Cleanup the synced temporary slots */
    >                 ReplicationSlotCleanup(true);
    > @@ -1762,5 +2026,5 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
    >                 /* We are done with sync, so reset sync flag */
    >                 reset_syncing_flag();
    >         }
    > -       PG_END_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, PointerGetDatum(wrconn));
    > +       PG_END_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, PointerGetDatum(&fparams));
    > ```
    >
    > I am afraid there is a risk of double memory free. Slot_names has been assigned to fparams.slot_names within the  for loop, and it’s freed after the loop. If something gets wrong and slotsync_failure_callback() is called, the function will free fparams.slot_names again.
    >
    
    Yes, good catch. I have changed to set fparams.slot_names to NIL after
    freeing it, so that it isn't freed in slotsync_failure_callback().
    
    > 2 - slotsync.c
    > ```
    > +                       /*
    > +                        * Fetch remote slot info for the given slot_names. If slot_names is NIL,
    > +                        * fetch all failover-enabled slots. Note that we reuse slot_names from
    > +                        * the first iteration; re-fetching all failover slots each time could
    > +                        * cause an endless loop. Instead of reprocessing only the pending slots
    > +                        * in each iteration, it's better to process all the slots received in
    > +                        * the first iteration. This ensures that by the time we're done, all
    > +                        * slots reflect the latest values.
    > +                        */
    > +                       remote_slots = fetch_remote_slots(wrconn, slot_names);
    > +
    > +                       /* Attempt to synchronize slots */
    > +                       some_slot_updated = synchronize_slots(wrconn, remote_slots,
    > +                                                                                                 &slot_persistence_pending);
    > +
    > +                       /*
    > +                        * If slot_persistence_pending is true, extract slot names
    > +                        * for future iterations (only needed if we haven't done it yet)
    > +                        */
    > +                       if (slot_names == NIL && slot_persistence_pending)
    > +                       {
    > +                               slot_names = extract_slot_names(remote_slots);
    > +
    > +                               /* Update the failure structure so that it can be freed on error */
    > +                               fparams.slot_names = slot_names;
    > +                       }
    > ```
    >
    > I am thinking if that could be a problem. As you now extract_slot_names() only in the first iteration, if a slot is dropped, and a new slot comes with the same name, will the new slot be incorrectly synced?
    
    It doesn't matter, because the new slot will anyway have a later
    restart_lsn and xmin, and all other attributes of the slot are also
    updated as part of the sync. So, the old slot on the standby will
    resemble the new slot on the primary.
    
    On Fri, Oct 31, 2025 at 3:42 PM Japin Li <japinli@hotmail.com> wrote:
    >
    > Thanks for updating the patch.  Here are some comments on v20.
    >
    > 1. Since the content is unchanged, no modification is needed here.
    >
    > -                * We do not drop the slot because the restart_lsn can be ahead of the
    > -                * current location when recreating the slot in the next cycle. It may
    > -                * take more time to create such a slot. Therefore, we keep this slot
    > -                * and attempt the synchronization in the next cycle.
    > +                * We do not drop the slot because the restart_lsn can be
    > +                * ahead of the current location when recreating the slot in
    > +                * the next cycle. It may take more time to create such a
    > +                * slot. Therefore, we keep this slot and attempt the
    > +                * synchronization in the next cycle.
    >
    
    Changed.
    
    > 2. Could we align the parameter comment style for synchronize_slots() and
    > fetch_remote_slots() for better consistency?
    >
    
    Fixed.
    
    > 3. Is this redundant? It was already initialized to false during declaration.
    >
    > +                       /* Reset flag before every iteration */
    > +                       slot_persistence_pending = false;
    >
    
    Removed.
    
    > 4. A minor nitpick.  The opening brace should be on a new line for style
    > consistency.
    >
    > +                       if (!IsTransactionState()) {
    > +                               StartTransactionCommand();
    > +                               started_tx = true;
    > +                       }
    >
    
    Fixed.
    
    > 5. Given that fparams.slot_names is a list, I suggest we replace NULL with NIL
    > for type consistency.
    >
    > +       fparams.slot_names = NULL;
    >
    
    Changed.
    
    
    
    On Fri, Oct 31, 2025 at 4:34 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Thu, Oct 30, 2025 at 3:48 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > >
    > > Thanks for your review, Japin. Here's patch v20 addressing the comments.
    > >
    >
    > Thank You for the patch. Please find a few comment son test:
    >
    >
    > 1)
    > +# until the slot becomes sync-ready (when the standby catches up to the
    > +# slot's restart_lsn).
    >
    > I think it should be 'when the primary server catches up' or 'when the
    > remote slot catches up with the locally reserved position.'
    >
    
    Changed.
    
    > 2)
    > +# Attempt to synchronize slots using API. This will initially fail because
    > +# the slot is not yet sync-ready (standby hasn't caught up to slot's
    > restart_lsn),
    > +# but the API will wait and retry. Call the API in a background process.
    >
    > a)
    > 'This will initially fail ' seems like the API will give an error,
    > which is not the case
    >
    > b) 'standby hasn't caught up to slot's restart_lsn' is not correct.
    >
    > We can rephrase to:
    > # Attempt to synchronize slots using the API. The API will continue
    > retrying synchronization until the remote slot catches up with the
    > locally reserved position.
    >
    
    changed accordingly.
    
    > 3)
    > +# Enable the Subscription, so that the slot catches up
    >
    > slot --> remote slot
    >
    > 4)
    > +# Create xl_running_xacts records on the primary for which the
    > standby is waiting
    >
    > Shall we rephrase to below or anything better if you have?:
    > Create xl_running_xacts on the primary to speed up restart_lsn advancement.
    >
    > 5)
    > +# Confirm that the logical failover slot is created on the standby and is
    > +# flagged as 'synced'
    >
    > Suggestion:
    > Verify that the logical failover slot is created on the standby,
    > marked as 'synced', and persisted.
    >
    > (It is important to mention persisted because even temporary slot is
    > marked as synced)
    >
    
    changed as recommended.
    
    I have addressed the above comments in patch v21.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  99. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-11-04T06:23:34Z

    >
    > I have addressed the above comments in patch v21.
    >
    
    Thank You. Please find a few comments:
    
    1)
    + fparams.slot_names = slot_names = NIL;
    
    I think it is not needed to set slot_names to NIL.
    
    2)
    -    WAIT_EVENT_REPLICATION_SLOTSYNC_MAIN);
    +    WAIT_EVENT_REPLICATION_SLOTSYNC_PRIMARY_CATCHUP);
    
    The new name does not seem appropriate. For the slotsync-worker case,
    even when the primary is not behind, the worker still waits but it is
    not waiting for primary to catch-up. I could not find a better name
    except the original one 'WAIT_EVENT_REPLICATION_SLOTSYNC_MAIN'. We can
    change the explanation to :
    
    "Waiting in main loop of slot sync worker and slot sync API."
    Or
    "Waiting in main loop of slot synchronization."
    
    If anyone has any better name suggestions, we can consider changing.
    
    3)
    
    +# Attempt to synchronize slots using API. The API will continue retrying
    +# synchronization until the remote slot catches up with the locally reserved
    +# position. The API will not return until this happens, to be able to make
    +# further calls, call the API in a background process.
    
    Shall we remove 'with the locally reserved position', it’s already
    explained in the test header and the comment is good enough even
    without it.
    
    4)
    +# Confirm log that the slot has been synced after becoming sync-ready.
    
    Shall we just say:
    Confirm from the log that the slot is sync-ready now.
    
    5)
     # Synchronize the primary server slots to the standby.
     $standby1->safe_psql('postgres', "SELECT pg_sync_replication_slots();");
    @@ -945,6 +1007,7 @@ $subscriber1->safe_psql('postgres',
     is( $standby1->safe_psql(
      'postgres',
      q{SELECT count(*) = 2 FROM pg_replication_slots WHERE slot_name IN
    ('lsub1_slot', 'snap_test_slot') AND synced AND NOT temporary;}
    +
      ),
    
    Redundant change.
    
    thanks
    Shveta
    
    
    
    
  100. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-11-06T07:53:02Z

    On Tue, Nov 4, 2025 at 5:23 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > >
    > > I have addressed the above comments in patch v21.
    > >
    >
    > Thank You. Please find a few comments:
    >
    > 1)
    > + fparams.slot_names = slot_names = NIL;
    >
    > I think it is not needed to set slot_names to NIL.
    >
    > 2)
    > -    WAIT_EVENT_REPLICATION_SLOTSYNC_MAIN);
    > +    WAIT_EVENT_REPLICATION_SLOTSYNC_PRIMARY_CATCHUP);
    >
    > The new name does not seem appropriate. For the slotsync-worker case,
    > even when the primary is not behind, the worker still waits but it is
    > not waiting for primary to catch-up. I could not find a better name
    > except the original one 'WAIT_EVENT_REPLICATION_SLOTSYNC_MAIN'. We can
    > change the explanation to :
    >
    > "Waiting in main loop of slot sync worker and slot sync API."
    > Or
    > "Waiting in main loop of slot synchronization."
    >
    > If anyone has any better name suggestions, we can consider changing.
    
    Changed as suggested above.
    
    >
    > 3)
    >
    > +# Attempt to synchronize slots using API. The API will continue retrying
    > +# synchronization until the remote slot catches up with the locally reserved
    > +# position. The API will not return until this happens, to be able to make
    > +# further calls, call the API in a background process.
    >
    > Shall we remove 'with the locally reserved position', it’s already
    > explained in the test header and the comment is good enough even
    > without it.
    >
    
    Changed.
    
    > 4)
    > +# Confirm log that the slot has been synced after becoming sync-ready.
    >
    > Shall we just say:
    > Confirm from the log that the slot is sync-ready now.
    >
    > 5)
    >  # Synchronize the primary server slots to the standby.
    >  $standby1->safe_psql('postgres', "SELECT pg_sync_replication_slots();");
    > @@ -945,6 +1007,7 @@ $subscriber1->safe_psql('postgres',
    >  is( $standby1->safe_psql(
    >   'postgres',
    >   q{SELECT count(*) = 2 FROM pg_replication_slots WHERE slot_name IN
    > ('lsub1_slot', 'snap_test_slot') AND synced AND NOT temporary;}
    > +
    >   ),
    >
    > Redundant change.
    
    Removed.
    
    Attaching patch v22 addressing the above comments.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  101. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Japin Li <japinli@hotmail.com> — 2025-11-07T05:06:26Z

    On Thu, 06 Nov 2025 at 18:53, Ajin Cherian <itsajin@gmail.com> wrote:
    > On Tue, Nov 4, 2025 at 5:23 PM shveta malik <shveta.malik@gmail.com> wrote:
    >>
    >> >
    >> > I have addressed the above comments in patch v21.
    >> >
    >>
    >> Thank You. Please find a few comments:
    >>
    >> 1)
    >> + fparams.slot_names = slot_names = NIL;
    >>
    >> I think it is not needed to set slot_names to NIL.
    >>
    >> 2)
    >> -    WAIT_EVENT_REPLICATION_SLOTSYNC_MAIN);
    >> +    WAIT_EVENT_REPLICATION_SLOTSYNC_PRIMARY_CATCHUP);
    >>
    >> The new name does not seem appropriate. For the slotsync-worker case,
    >> even when the primary is not behind, the worker still waits but it is
    >> not waiting for primary to catch-up. I could not find a better name
    >> except the original one 'WAIT_EVENT_REPLICATION_SLOTSYNC_MAIN'. We can
    >> change the explanation to :
    >>
    >> "Waiting in main loop of slot sync worker and slot sync API."
    >> Or
    >> "Waiting in main loop of slot synchronization."
    >>
    >> If anyone has any better name suggestions, we can consider changing.
    >
    > Changed as suggested above.
    >
    >>
    >> 3)
    >>
    >> +# Attempt to synchronize slots using API. The API will continue retrying
    >> +# synchronization until the remote slot catches up with the locally reserved
    >> +# position. The API will not return until this happens, to be able to make
    >> +# further calls, call the API in a background process.
    >>
    >> Shall we remove 'with the locally reserved position', it’s already
    >> explained in the test header and the comment is good enough even
    >> without it.
    >>
    >
    > Changed.
    >
    >> 4)
    >> +# Confirm log that the slot has been synced after becoming sync-ready.
    >>
    >> Shall we just say:
    >> Confirm from the log that the slot is sync-ready now.
    >>
    >> 5)
    >>  # Synchronize the primary server slots to the standby.
    >>  $standby1->safe_psql('postgres', "SELECT pg_sync_replication_slots();");
    >> @@ -945,6 +1007,7 @@ $subscriber1->safe_psql('postgres',
    >>  is( $standby1->safe_psql(
    >>   'postgres',
    >>   q{SELECT count(*) = 2 FROM pg_replication_slots WHERE slot_name IN
    >> ('lsub1_slot', 'snap_test_slot') AND synced AND NOT temporary;}
    >> +
    >>   ),
    >>
    >> Redundant change.
    >
    > Removed.
    >
    > Attaching patch v22 addressing the above comments.
    
    @@ -62,8 +62,8 @@ LOGICAL_APPLY_MAIN	"Waiting in main loop of logical replication apply process."
     LOGICAL_LAUNCHER_MAIN	"Waiting in main loop of logical replication launcher process."
     LOGICAL_PARALLEL_APPLY_MAIN	"Waiting in main loop of logical replication parallel apply process."
     RECOVERY_WAL_STREAM	"Waiting in main loop of startup process for WAL to arrive, during streaming recovery."
    -REPLICATION_SLOTSYNC_MAIN	"Waiting in main loop of slot sync worker."
     REPLICATION_SLOTSYNC_SHUTDOWN	"Waiting for slot sync worker to shut down."
    +REPLICATION_SLOTSYNC_MAIN	"Waiting in main loop of slot synchronization."
     SYSLOGGER_MAIN	"Waiting in main loop of syslogger process."
     WAL_RECEIVER_MAIN	"Waiting in main loop of WAL receiver process."
     WAL_SENDER_MAIN	"Waiting in main loop of WAL sender process."
    
    I've noticed that all events are sorted alphabetically.  I think we should keep
    the order of REPLICATION_SLOTSYNC_MAIN unchanged.
    
    -- 
    Regards,
    Japin Li
    ChengDu WenWu Information Technology Co., Ltd.
    
    
    
    
  102. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-11-10T09:31:01Z

    On Fri, Nov 7, 2025 at 10:36 AM Japin Li <japinli@hotmail.com> wrote:
    >
    > >
    > > Attaching patch v22 addressing the above comments.
    >
    > @@ -62,8 +62,8 @@ LOGICAL_APPLY_MAIN    "Waiting in main loop of logical replication apply process."
    >  LOGICAL_LAUNCHER_MAIN  "Waiting in main loop of logical replication launcher process."
    >  LOGICAL_PARALLEL_APPLY_MAIN    "Waiting in main loop of logical replication parallel apply process."
    >  RECOVERY_WAL_STREAM    "Waiting in main loop of startup process for WAL to arrive, during streaming recovery."
    > -REPLICATION_SLOTSYNC_MAIN      "Waiting in main loop of slot sync worker."
    >  REPLICATION_SLOTSYNC_SHUTDOWN  "Waiting for slot sync worker to shut down."
    > +REPLICATION_SLOTSYNC_MAIN      "Waiting in main loop of slot synchronization."
    >  SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
    >  WAL_RECEIVER_MAIN      "Waiting in main loop of WAL receiver process."
    >  WAL_SENDER_MAIN        "Waiting in main loop of WAL sender process."
    >
    > I've noticed that all events are sorted alphabetically.  I think we should keep
    > the order of REPLICATION_SLOTSYNC_MAIN unchanged.
    >
    
    +1.
    
    Few trivial comments:
    
    1)
    Since we have always used the term 'SQL function' rather than API in
    existing code, shall we change all references of API to 'SQL function'
    in current patch:
    
    + * If the pg_sync_replication API is used to sync the slots, and if the slots
    "If the SQL function pg_sync_replication_slots() is used.."
    
    + * the reasons mentioned above, then the API also waits and retries until the
    API --> SQL function
    
    + * persist. It is utilized by the pg_sync_replication_slots() API.
     pg_sync_replication_slots() API --> SQL function pg_sync_replication_slots()
    
    + * the API can retry.
    API --> SQL function
    
    + /* Set this, so that API can retry */
    API --> SQL function
    
    + * persist. It is utilized by the pg_sync_replication_slots() API.
    pg_sync_replication_slots() API --> SQL function pg_sync_replication_slots()
    
    + * slot_persistence_pending - boolean used by pg_sync_replication_slots
    + *   API to track if any slots could not be
    pg_sync_replication_slots  API --> SQL function pg_sync_replication_slots()
    
    + * Interrupt handler for pg_sync_replication_slots() API.
    pg_sync_replication_slots() API --> SQL function pg_sync_replication_slots()
    
    
    2)
    ProcessSlotSyncAPIInterrupts
    slotsync_api_reread_config
    -- These also have API in it, but I do not have any better name
    suggestions here, we can retain the current ones and see what others
    say.
    
    3)
    /*
     * Re-read the config file.
     *
     * Exit if any of the slot sync GUCs have changed. The postmaster will
     * restart it.
     */
    static void
    slotsync_reread_config(void)
    
    Shall we change this existing comment to: Re-read the config file for
    slot sync worker.
    
    4)
    
    +/*
    + * Re-read the config file and check for critical parameter changes.
    + *
    + */
    +static void
    +slotsync_api_reread_config(void)
    
    Shall we change comment to:
    /*
     * Re-read the config file for SQL function pg_sync_replication_slots()
     *
     * Emit error if any of the slot sync GUCs have changed.
     */
    
    thanks
    Shveta
    
    
    
    
  103. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-11-10T10:44:27Z

    On Mon, Nov 10, 2025 at 3:01 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    >
    >
    > 2)
    > ProcessSlotSyncAPIInterrupts
    > slotsync_api_reread_config
    > -- These also have API in it, but I do not have any better name
    > suggestions here, we can retain the current ones and see what others
    > say.
    
    ProcessSlotSyncInterrupts() handles shutdown waiting,
    ProcessSlotSyncAPIInterrupts doesn't. Why is this difference? It will
    be good to explain why we need two different functions for worker and
    SQL function and also explain the difference between them.
    
    $primary->psql('postgres', "SELECT pg_drop_replication_slot('lsub2_slot');");
    +$primary->psql('postgres', "SELECT pg_drop_replication_slot('lsub1_slot');");
    
    I think, the intention behind dropping the slot is to be able to
    create it again for the next test. But there is no comment here
    explaining that. There is also no comment explaining why we are
    dropping both slots here; when the test only needs dropping one.
    That's going to create confusion. One might think that all the slots
    need to be dropped at this stage, and drop and create any future slots
    that are used by prior code, for example. At the end of this test, we
    recreate the slot using pg_create_logical_replication_slot(), which is
    different method of creating slot than this test does. Though I can
    understand the reason, it's not apparent. Generally reusing slot names
    across multiple tests (in this file) is a source of confusion. But at
    least for the test you are adding, you could use a different slot name
    to avoid confusion.
    +# Create some DDL on the primary so that the slot lags behind the standby
    +$primary->safe_psql('postgres', "CREATE TABLE push_wal (a int);");
    +
    +# Attempt to synchronize slots using API. The API will continue retrying
    +# synchronization until the remote slot catches up.
    +# The API will not return until this happens, to be able to make
    +# further calls, call the API in a background process.
    +my $log_offset = -s $standby1->logfile;
    +
    +my $h = $standby1->background_psql('postgres', on_error_stop => 0);
    +
    +$h->query_until(qr//, "SELECT pg_sync_replication_slots();\n");
    
    If the standby does not receive the WAL corresponding to the DDL
    before this function is executed, the slot will get synchronized
    immediately. I think we have to make sure that the standby has
    received the DDL before executing this function.
    
    Also most of the code which uses query_until has pattern like this:
    $h->query_until(qr/start/, q{\echo start
    SQL command});
    But we expect an empty string here. Why this difference?
    
    I think we need a similar test to test promotion while the function is
    waiting for the slot to become sync-ready.
    
    SyncReplicationSlots() and the main loop in ReplSlotSyncWorkerMain()
    are similar with some functional differences. Some part of their code
    needs to be kept in sync in future. How do we achieve that? At least
    we need a comment saying so in each of those patches and keep those
    two codes in proximity.
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  104. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-11-12T08:24:18Z

    On Mon, Nov 10, 2025 at 8:31 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Fri, Nov 7, 2025 at 10:36 AM Japin Li <japinli@hotmail.com> wrote:
    > >
    > > >
    > > > Attaching patch v22 addressing the above comments.
    > >
    > > @@ -62,8 +62,8 @@ LOGICAL_APPLY_MAIN    "Waiting in main loop of logical replication apply process."
    > >  LOGICAL_LAUNCHER_MAIN  "Waiting in main loop of logical replication launcher process."
    > >  LOGICAL_PARALLEL_APPLY_MAIN    "Waiting in main loop of logical replication parallel apply process."
    > >  RECOVERY_WAL_STREAM    "Waiting in main loop of startup process for WAL to arrive, during streaming recovery."
    > > -REPLICATION_SLOTSYNC_MAIN      "Waiting in main loop of slot sync worker."
    > >  REPLICATION_SLOTSYNC_SHUTDOWN  "Waiting for slot sync worker to shut down."
    > > +REPLICATION_SLOTSYNC_MAIN      "Waiting in main loop of slot synchronization."
    > >  SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
    > >  WAL_RECEIVER_MAIN      "Waiting in main loop of WAL receiver process."
    > >  WAL_SENDER_MAIN        "Waiting in main loop of WAL sender process."
    > >
    > > I've noticed that all events are sorted alphabetically.  I think we should keep
    > > the order of REPLICATION_SLOTSYNC_MAIN unchanged.
    > >
    >
    > +1.
    >
    
    Yes, changed.
    
    > Few trivial comments:
    >
    > 1)
    > Since we have always used the term 'SQL function' rather than API in
    > existing code, shall we change all references of API to 'SQL function'
    > in current patch:
    >
    > + * If the pg_sync_replication API is used to sync the slots, and if the slots
    > "If the SQL function pg_sync_replication_slots() is used.."
    >
    > + * the reasons mentioned above, then the API also waits and retries until the
    > API --> SQL function
    >
    > + * persist. It is utilized by the pg_sync_replication_slots() API.
    >  pg_sync_replication_slots() API --> SQL function pg_sync_replication_slots()
    >
    > + * the API can retry.
    > API --> SQL function
    >
    > + /* Set this, so that API can retry */
    > API --> SQL function
    >
    > + * persist. It is utilized by the pg_sync_replication_slots() API.
    > pg_sync_replication_slots() API --> SQL function pg_sync_replication_slots()
    >
    > + * slot_persistence_pending - boolean used by pg_sync_replication_slots
    > + *   API to track if any slots could not be
    > pg_sync_replication_slots  API --> SQL function pg_sync_replication_slots()
    >
    > + * Interrupt handler for pg_sync_replication_slots() API.
    > pg_sync_replication_slots() API --> SQL function pg_sync_replication_slots()
    >
    >
    > 2)
    > ProcessSlotSyncAPIInterrupts
    > slotsync_api_reread_config
    > -- These also have API in it, but I do not have any better name
    > suggestions here, we can retain the current ones and see what others
    > say.
    >
    
    Changed.
    
    > 3)
    > /*
    >  * Re-read the config file.
    >  *
    >  * Exit if any of the slot sync GUCs have changed. The postmaster will
    >  * restart it.
    >  */
    > static void
    > slotsync_reread_config(void)
    >
    > Shall we change this existing comment to: Re-read the config file for
    > slot sync worker.
    >
    
    > 4)
    >
    > +/*
    > + * Re-read the config file and check for critical parameter changes.
    > + *
    > + */
    > +static void
    > +slotsync_api_reread_config(void)
    >
    > Shall we change comment to:
    > /*
    >  * Re-read the config file for SQL function pg_sync_replication_slots()
    >  *
    >  * Emit error if any of the slot sync GUCs have changed.
    >  */
    >
    Changed.
    
    
    
    
    On Mon, Nov 10, 2025 at 9:44 PM Ashutosh Bapat
    <ashutosh.bapat.oss@gmail.com> wrote:
    >
    > On Mon, Nov 10, 2025 at 3:01 PM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > >
    > >
    > > 2)
    > > ProcessSlotSyncAPIInterrupts
    > > slotsync_api_reread_config
    > > -- These also have API in it, but I do not have any better name
    > > suggestions here, we can retain the current ones and see what others
    > > say.
    >
    > ProcessSlotSyncInterrupts() handles shutdown waiting,
    > ProcessSlotSyncAPIInterrupts doesn't. Why is this difference? It will
    > be good to explain why we need two different functions for worker and
    > SQL function and also explain the difference between them.
    >
    
    I've updated the function header to explain this. The slot sync worker
    is a specific background worker while the API runs in the regular
    backend, so special handling is not needed.
    
    
    > $primary->psql('postgres', "SELECT pg_drop_replication_slot('lsub2_slot');");
    > +$primary->psql('postgres', "SELECT pg_drop_replication_slot('lsub1_slot');");
    >
    > I think, the intention behind dropping the slot is to be able to
    > create it again for the next test. But there is no comment here
    > explaining that. There is also no comment explaining why we are
    > dropping both slots here; when the test only needs dropping one.
    > That's going to create confusion. One might think that all the slots
    > need to be dropped at this stage, and drop and create any future slots
    > that are used by prior code, for example. At the end of this test, we
    > recreate the slot using pg_create_logical_replication_slot(), which is
    > different method of creating slot than this test does. Though I can
    > understand the reason, it's not apparent. Generally reusing slot names
    > across multiple tests (in this file) is a source of confusion. But at
    > least for the test you are adding, you could use a different slot name
    > to avoid confusion.
    
    I've added a comment there that dropping both the slots is required
    for the next test. Also I cannot change the name of the slot as the
    next tests need the same slot synced.
    
    > +# Create some DDL on the primary so that the slot lags behind the standby
    > +$primary->safe_psql('postgres', "CREATE TABLE push_wal (a int);");
    > +
    > +# Attempt to synchronize slots using API. The API will continue retrying
    > +# synchronization until the remote slot catches up.
    > +# The API will not return until this happens, to be able to make
    > +# further calls, call the API in a background process.
    > +my $log_offset = -s $standby1->logfile;
    > +
    > +my $h = $standby1->background_psql('postgres', on_error_stop => 0);
    > +
    > +$h->query_until(qr//, "SELECT pg_sync_replication_slots();\n");
    >
    > If the standby does not receive the WAL corresponding to the DDL
    > before this function is executed, the slot will get synchronized
    > immediately. I think we have to make sure that the standby has
    > received the DDL before executing this function.
    
    Yes, I've added a line to make sure that the stanbdy has caught up.
    
    >
    > Also most of the code which uses query_until has pattern like this:
    > $h->query_until(qr/start/, q{\echo start
    > SQL command});
    > But we expect an empty string here. Why this difference?
    >
    
    I've modified it as suggested.
    
    > I think we need a similar test to test promotion while the function is
    > waiting for the slot to become sync-ready.
    >
    
    Unfortunately, that will make this test too long if I add one more
    wait loop for slot sync.
    
    > SyncReplicationSlots() and the main loop in ReplSlotSyncWorkerMain()
    > are similar with some functional differences. Some part of their code
    > needs to be kept in sync in future. How do we achieve that? At least
    > we need a comment saying so in each of those patches and keep those
    > two codes in proximity.
    >
    
    I've added a comment in the header of ReplSlotSyncWorkerMain to suggest this.
    
    Attaching patch v23 addressing these comments.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  105. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Amit Kapila <amit.kapila16@gmail.com> — 2025-11-19T06:10:42Z

    On Wed, Nov 12, 2025 at 1:54 PM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > Attaching patch v23 addressing these comments.
    >
    
    Few comments:
    =============
    1.
    In contrast, automatic synchronization
           via <varname>sync_replication_slots</varname> provides continuous slot
           updates, enabling seamless failover and supporting high availability.
    -      Therefore, it is the recommended method for synchronizing slots.
    
    I think a slotsync worker should still be a recommended method. So, we
    shouldn't remove the last line.
    
    2. I think we can unify slotsync_api_reread_config() and
    ProcessSlotSyncAPIInterrupts() with corresponding existing functions
    for slotsync worker. Having separate functions for API and worker to
    handle interrupts looks odd and bug-prone w.r.t future changes in this
    area.
    
    3.
    +/*
    + * Helper function to extract slot names from a list of remote slots
    + */
    +static List *
    +extract_slot_names(List *remote_slots)
    +{
    + List *slot_names = NIL;
    + MemoryContext oldcontext;
    +
    + /* Switch to long-lived TopMemoryContext to store slot names */
    + oldcontext = MemoryContextSwitchTo(TopMemoryContext);
    +
    + foreach_ptr(RemoteSlot, remote_slot, remote_slots)
    
    Why did we allocate this memory in TopMemoryContext here? We only need
    till this function executes, isn't CurrentMemoryContext (which I think
    is ExprContext) sufficient, if not, why? If we use some
    function/query-level context then we don't need to make it part of
    SlotSyncApiFailureParams. If we can get rid of slot_names from
    SlotSyncApiFailureParams then we probably don't need struct
    SlotSyncApiFailureParams.
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  106. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-11-19T07:04:54Z

    On Wed, Nov 12, 2025 at 1:54 PM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    >
    > Attaching patch v23 addressing these comments.
    >
    
    Thanks for the patch.
    
    I observed that if the API is taking a nap in between slot sync cycles
    and a promotion is triggered during that time, the promotion has to
    wait for the entire nap period to finish before slot-sync stops and
    the process can continue. There should be a mechanism to wake up the
    backend so the API can exit early once stopSignaled is set. How about
    doing SetLatch for the process doing synchronization in
    ShutDownSlotSync()?
    
    thanks
    Shveta
    
    
    
    
  107. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-11-21T03:44:25Z

    On Wed, Nov 19, 2025 at 5:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    >
    > On Wed, Nov 12, 2025 at 1:54 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > > Attaching patch v23 addressing these comments.
    > >
    >
    > Few comments:
    > =============
    > 1.
    > In contrast, automatic synchronization
    >        via <varname>sync_replication_slots</varname> provides continuous slot
    >        updates, enabling seamless failover and supporting high availability.
    > -      Therefore, it is the recommended method for synchronizing slots.
    >
    > I think a slotsync worker should still be a recommended method. So, we
    > shouldn't remove the last line.
    >
    
    Changed.
    
    > 2. I think we can unify slotsync_api_reread_config() and
    > ProcessSlotSyncAPIInterrupts() with corresponding existing functions
    > for slotsync worker. Having separate functions for API and worker to
    > handle interrupts looks odd and bug-prone w.r.t future changes in this
    > area.
    >
    
    I’ve refactored and unified slotsync_api_reread_config() and
    ProcessSlotSyncAPIInterrupts() with their existing counterparts. As
    part of this, I switched the shutdown signal for slot-syncing workers
    and backends from SIGINT to SIGUSR1, so that all slot-synchronization
    processes use a consistent signaling model. Background workers handle
    SIGINT via StatementCancelHandler, which is not suitable for
    coordinated shutdowns, but SIGUSR1 reliably sets the process latch and
    wakes up both worker types. Once awakened, these processes invoke
    ProcessSlotSyncInterrupts(), which now checks
    SlotSyncCtx->stopSignaled to perform a clean shutdown with appropriate
    logs.
    
    > 3.
    > +/*
    > + * Helper function to extract slot names from a list of remote slots
    > + */
    > +static List *
    > +extract_slot_names(List *remote_slots)
    > +{
    > + List *slot_names = NIL;
    > + MemoryContext oldcontext;
    > +
    > + /* Switch to long-lived TopMemoryContext to store slot names */
    > + oldcontext = MemoryContextSwitchTo(TopMemoryContext);
    > +
    > + foreach_ptr(RemoteSlot, remote_slot, remote_slots)
    >
    > Why did we allocate this memory in TopMemoryContext here? We only need
    > till this function executes, isn't CurrentMemoryContext (which I think
    > is ExprContext) sufficient, if not, why? If we use some
    > function/query-level context then we don't need to make it part of
    > SlotSyncApiFailureParams. If we can get rid of slot_names from
    > SlotSyncApiFailureParams then we probably don't need struct
    > SlotSyncApiFailureParams.
    >
    
    In further testing, I've found that a Transaction is always started
    when pg_sync_replication_slots() is called, and there was no need to
    start new transactions and there was no need for a seperate memory
    context for remote_slots. I think the confusion was probably from an
    earlier bug in my code. I have added an assert to make sure that
    transactions are started when in the API (should I make it an error
    instead?).
    
    On Wed, Nov 19, 2025 at 6:05 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Wed, Nov 12, 2025 at 1:54 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > >
    > > Attaching patch v23 addressing these comments.
    > >
    >
    > Thanks for the patch.
    >
    > I observed that if the API is taking a nap in between slot sync cycles
    > and a promotion is triggered during that time, the promotion has to
    > wait for the entire nap period to finish before slot-sync stops and
    > the process can continue. There should be a mechanism to wake up the
    > backend so the API can exit early once stopSignaled is set. How about
    > doing SetLatch for the process doing synchronization in
    > ShutDownSlotSync()?
    >
    
    Yes. To address this, I now also store the background worker pid which
    is calling pg_sync_replication_slots() in SlotSyncCtx->pid, and I've
    modified the ShutDownSlotSync logic to issue a SIGUSR1 which will
    SetLatch on SlotSyncCtx->pid.
    
    Attaching patch v24, addressing the above comments.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  108. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-11-21T09:10:14Z

    On Fri, Nov 21, 2025 at 9:14 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    >
    > Attaching patch v24, addressing the above comments.
    >
    
    Thanks for the patch. Please find a few comments:
    
    
    1)
    Instead of passing an argument to slotsync_reread_config and
    ProcessSlotSyncInterrupts, we can use 'AmLogicalSlotSyncWorkerProcess'
    to distinguish the worker and API.
    
    2)
    Also, since we are not using a separate memory context, we don't need
    to make structure 'SlotSyncApiFailureParams' for slot_names failure.
    slot_names will be freed with the memory-context itself when
    exec_simple_query finishes.
    
    3)
    - if (old_sync_replication_slots != sync_replication_slots)
    + /* Worker-specific check for sync_replication_slots change */
    + if (!from_api && old_sync_replication_slots != sync_replication_slots)
      {
      ereport(LOG,
    - /* translator: %s is a GUC variable name */
    - errmsg("replication slot synchronization worker will shut down
    because \"%s\" is disabled", "sync_replication_slots"));
    + /* translator: %s is a GUC variable name */
    + errmsg("replication slot synchronization worker will shut down
    because \"%s\" is disabled",
    +    "sync_replication_slots"));
      proc_exit(0);
      }
    
    Here, we need not to have different flow for api and worker. Both can
    quit sync when this parameter is changed. The idea is if someone
    enables 'sync_replication_slots' when API is working, that means we
    need to start slot-sync worker, so it is okay if the API notices this
    and exits too.
    
    4)
    + if (from_api)
    + elevel = ERROR;
    + else
    + elevel = LOG;
    
    - proc_exit(0);
    + ereport(elevel,
    + errmsg("replication slot synchronization will stop because of a
    parameter change"));
    +
    
    We can do:
    ereport(AmLogicalSlotSyncWorkerProcess() ? LOG : ERROR, ...);
    
    5)
     SlotSyncCtx->syncing = true;
    
    + /* The worker pid must not be already assigned in SlotSyncCtx */
    + Assert(SlotSyncCtx->pid == InvalidPid);
    +
    
    We can shift Assert before we set the shared-memory flag 'SlotSyncCtx->syncing'.
    
    thanks
    Shveta
    
    
    
    
  109. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-11-24T06:42:28Z

    On Fri, Nov 21, 2025 at 8:10 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Fri, Nov 21, 2025 at 9:14 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > >
    > > Attaching patch v24, addressing the above comments.
    > >
    >
    > Thanks for the patch. Please find a few comments:
    >
    >
    > 1)
    > Instead of passing an argument to slotsync_reread_config and
    > ProcessSlotSyncInterrupts, we can use 'AmLogicalSlotSyncWorkerProcess'
    > to distinguish the worker and API.
    >
    
    Changed as such.
    
    > 2)
    > Also, since we are not using a separate memory context, we don't need
    > to make structure 'SlotSyncApiFailureParams' for slot_names failure.
    > slot_names will be freed with the memory-context itself when
    > exec_simple_query finishes.
    >
    
    Removed.
    
    > 3)
    > - if (old_sync_replication_slots != sync_replication_slots)
    > + /* Worker-specific check for sync_replication_slots change */
    > + if (!from_api && old_sync_replication_slots != sync_replication_slots)
    >   {
    >   ereport(LOG,
    > - /* translator: %s is a GUC variable name */
    > - errmsg("replication slot synchronization worker will shut down
    > because \"%s\" is disabled", "sync_replication_slots"));
    > + /* translator: %s is a GUC variable name */
    > + errmsg("replication slot synchronization worker will shut down
    > because \"%s\" is disabled",
    > +    "sync_replication_slots"));
    >   proc_exit(0);
    >   }
    >
    > Here, we need not to have different flow for api and worker. Both can
    > quit sync when this parameter is changed. The idea is if someone
    > enables 'sync_replication_slots' when API is working, that means we
    > need to start slot-sync worker, so it is okay if the API notices this
    > and exits too.
    >
    
    changed but used a different error message.
    
    > 4)
    > + if (from_api)
    > + elevel = ERROR;
    > + else
    > + elevel = LOG;
    >
    > - proc_exit(0);
    > + ereport(elevel,
    > + errmsg("replication slot synchronization will stop because of a
    > parameter change"));
    > +
    >
    > We can do:
    > ereport(AmLogicalSlotSyncWorkerProcess() ? LOG : ERROR, ...);
    >
    
    changed as such.
    
    > 5)
    >  SlotSyncCtx->syncing = true;
    >
    > + /* The worker pid must not be already assigned in SlotSyncCtx */
    > + Assert(SlotSyncCtx->pid == InvalidPid);
    > +
    >
    > We can shift Assert before we set the shared-memory flag 'SlotSyncCtx->syncing'.
    >
    
    Done.
    
    Attaching patch v25 addressing the above comments.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  110. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-11-26T05:01:21Z

    On Mon, Nov 24, 2025 at 12:12 PM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > Attaching patch v25 addressing the above comments.
    >
    
    The patch needs rebase due to recent commit 76b78721.
    
    thanks
    Shveta
    
    
    
    
  111. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-11-26T08:42:29Z

     A few comments:
    
    1)
    +/*
    + * Structure holding parameters that need to be freed on error in
    + * pg_sync_replication_slots()
    + */
    +typedef struct SlotSyncApiFailureParams
    +{
    + WalReceiverConn *wrconn;
    + List *slot_names;
    +} SlotSyncApiFailureParams;
    +
    
    We can get rid of it now as we do not use it.
    
    2)
    ProcessSlotSyncInterrupts():
    
    + if (!AmLogicalSlotSyncWorkerProcess())
    + ereport(ERROR,
    + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
    + errmsg("cannot continue replication slots synchronization"
    +    " as standby promotion is triggered"));
    + else
    + {
    
    Can we please reverse the if-else i.e. first worker and then API.
    Negated if-condition can be avoided in this case.
    
    3)
    
    slotsync_reread_config():
    + /* Worker-specific check for sync_replication_slots change */
    
    Now since we check for both API and worker, this comment is not needed.
    
    4)
    - ereport(LOG,
    - errmsg("replication slot synchronization worker will restart because
    of a parameter change"));
    
    - /*
    - * Reset the last-start time for this worker so that the postmaster
    - * can restart it without waiting for SLOTSYNC_RESTART_INTERVAL_SEC.
    - */
    - SlotSyncCtx->last_start_time = 0;
    + ereport(AmLogicalSlotSyncWorkerProcess() ? LOG : ERROR,
    + errmsg("replication slot synchronization will stop because of a
    parameter change"));
    
    Here, we should retain the same old message for worker i.e. 'worker
    will restart..'. instead of 'synchronization will stop'. I find the
    old message better in this case.
    
    5)
    slotsync_reread_config() is slightly difficult to follow.
    I think in the case of API, we can display a common error message
    instead of 2 different messages for 'sync_replication_slot change' and
    the rest of the parameters. We can mark if any of the parameters
    changed in both 'if' blocks and if the current process has not exited,
    then at the end based on 'parameter-changed', we can deal with API by
    giving a common message. Something like:
    
    /*
     * If we have reached here with a parameter change, we must be running
     * in SQL function, emit error in such a case.
     */
    if (parameter_changed (new variable))
    {
    Assert (!AmLogicalSlotSyncWorkerProcess);
    ereport(ERROR,
    errmsg("replication slot synchronization will stop because of a
    parameter change"));
    }
    
    thanks
    Shveta
    
    
    
    
  112. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-11-28T04:46:18Z

    On Wed, Nov 26, 2025 at 7:42 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    >  A few comments:
    >
    > 1)
    > +/*
    > + * Structure holding parameters that need to be freed on error in
    > + * pg_sync_replication_slots()
    > + */
    > +typedef struct SlotSyncApiFailureParams
    > +{
    > + WalReceiverConn *wrconn;
    > + List *slot_names;
    > +} SlotSyncApiFailureParams;
    > +
    >
    > We can get rid of it now as we do not use it.
    >
    
    Removed.
    
    > 2)
    > ProcessSlotSyncInterrupts():
    >
    > + if (!AmLogicalSlotSyncWorkerProcess())
    > + ereport(ERROR,
    > + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
    > + errmsg("cannot continue replication slots synchronization"
    > +    " as standby promotion is triggered"));
    > + else
    > + {
    >
    > Can we please reverse the if-else i.e. first worker and then API.
    > Negated if-condition can be avoided in this case.
    >
    
    Changed.
    
    > 3)
    >
    > slotsync_reread_config():
    > + /* Worker-specific check for sync_replication_slots change */
    >
    > Now since we check for both API and worker, this comment is not needed.
    >
    
    Removed.
    
    > 4)
    > - ereport(LOG,
    > - errmsg("replication slot synchronization worker will restart because
    > of a parameter change"));
    >
    > - /*
    > - * Reset the last-start time for this worker so that the postmaster
    > - * can restart it without waiting for SLOTSYNC_RESTART_INTERVAL_SEC.
    > - */
    > - SlotSyncCtx->last_start_time = 0;
    > + ereport(AmLogicalSlotSyncWorkerProcess() ? LOG : ERROR,
    > + errmsg("replication slot synchronization will stop because of a
    > parameter change"));
    >
    > Here, we should retain the same old message for worker i.e. 'worker
    > will restart..'. instead of 'synchronization will stop'. I find the
    > old message better in this case.
    >
    > 5)
    > slotsync_reread_config() is slightly difficult to follow.
    > I think in the case of API, we can display a common error message
    > instead of 2 different messages for 'sync_replication_slot change' and
    > the rest of the parameters. We can mark if any of the parameters
    > changed in both 'if' blocks and if the current process has not exited,
    > then at the end based on 'parameter-changed', we can deal with API by
    > giving a common message. Something like:
    >
    > /*
    >  * If we have reached here with a parameter change, we must be running
    >  * in SQL function, emit error in such a case.
    >  */
    > if (parameter_changed (new variable))
    > {
    > Assert (!AmLogicalSlotSyncWorkerProcess);
    > ereport(ERROR,
    > errmsg("replication slot synchronization will stop because of a
    > parameter change"));
    > }
    >
    
    Fixed as above.
    
    I've addressed the above comments as well as rebased the patch based
    on changes in commit 76b7872 in patch v26
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  113. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Japin Li <japinli@hotmail.com> — 2025-11-28T06:03:44Z

    On Fri, 28 Nov 2025 at 15:46, Ajin Cherian <itsajin@gmail.com> wrote:
    > On Wed, Nov 26, 2025 at 7:42 PM shveta malik <shveta.malik@gmail.com>
    > wrote:
    >>
    >>  A few comments:
    >>
    >> 1)
    >> +/*
    >> + * Structure holding parameters that need to be freed on error in
    >> + * pg_sync_replication_slots()
    >> + */
    >> +typedef struct SlotSyncApiFailureParams
    >> +{
    >> + WalReceiverConn *wrconn;
    >> + List *slot_names;
    >> +} SlotSyncApiFailureParams;
    >> +
    >>
    >> We can get rid of it now as we do not use it.
    >>
    >
    > Removed.
    >
    >> 2)
    >> ProcessSlotSyncInterrupts():
    >>
    >> + if (!AmLogicalSlotSyncWorkerProcess())
    >> + ereport(ERROR,
    >> + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
    >> + errmsg("cannot continue replication slots synchronization"
    >> +    " as standby promotion is triggered"));
    >> + else
    >> + {
    >>
    >> Can we please reverse the if-else i.e. first worker and then API.
    >> Negated if-condition can be avoided in this case.
    >>
    >
    > Changed.
    >
    >> 3)
    >>
    >> slotsync_reread_config():
    >> + /* Worker-specific check for sync_replication_slots change */
    >>
    >> Now since we check for both API and worker, this comment is not
    >> needed.
    >>
    >
    > Removed.
    >
    >> 4)
    >> - ereport(LOG,
    >> - errmsg("replication slot synchronization worker will restart
    >> because
    >> of a parameter change"));
    >>
    >> - /*
    >> - * Reset the last-start time for this worker so that the postmaster
    >> - * can restart it without waiting for
    >> SLOTSYNC_RESTART_INTERVAL_SEC.
    >> - */
    >> - SlotSyncCtx->last_start_time = 0;
    >> + ereport(AmLogicalSlotSyncWorkerProcess() ? LOG : ERROR,
    >> + errmsg("replication slot synchronization will stop because of a
    >> parameter change"));
    >>
    >> Here, we should retain the same old message for worker i.e. 'worker
    >> will restart..'. instead of 'synchronization will stop'. I find the
    >> old message better in this case.
    >>
    >> 5)
    >> slotsync_reread_config() is slightly difficult to follow.
    >> I think in the case of API, we can display a common error message
    >> instead of 2 different messages for 'sync_replication_slot change'
    >> and
    >> the rest of the parameters. We can mark if any of the parameters
    >> changed in both 'if' blocks and if the current process has not
    >> exited,
    >> then at the end based on 'parameter-changed', we can deal with API
    >> by
    >> giving a common message. Something like:
    >>
    >> /*
    >>  * If we have reached here with a parameter change, we must be
    >> running
    >>  * in SQL function, emit error in such a case.
    >>  */
    >> if (parameter_changed (new variable))
    >> {
    >> Assert (!AmLogicalSlotSyncWorkerProcess);
    >> ereport(ERROR,
    >> errmsg("replication slot synchronization will stop because of a
    >> parameter change"));
    >> }
    >>
    >
    > Fixed as above.
    >
    > I've addressed the above comments as well as rebased the patch based
    > on changes in commit 76b7872 in patch v26
    >
    
    1.
    Initialize slot_persistence_pending to false (to avoid uninitialized values, or
    initialize to true by mistaken) in update_and_persist_local_synced_slot(). This
    aligns with the handling of found_consistent_snapshot and remote_slot_precedes
    in update_local_synced_slot().
    
    diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
    index 20eada3393..c55ba11f17 100644
    --- a/src/backend/replication/logical/slotsync.c
    +++ b/src/backend/replication/logical/slotsync.c
    @@ -617,6 +617,9 @@ update_and_persist_local_synced_slot(RemoteSlot *remote_slot, Oid remote_dbid,
            bool            found_consistent_snapshot = false;
            bool            remote_slot_precedes = false;
    
    +       if (slot_persistence_pending)
    +               *slot_persistence_pending = false;
    +
            /* Slotsync skip stats are handled in function update_local_synced_slot() */
            (void) update_local_synced_slot(remote_slot, remote_dbid,
                                                                            &found_consistent_snapshot,
    
    2.
    This change seems unnecessary。
    
     static void
    -slotsync_reread_config(void)
    +slotsync_reread_config()
    
     static void
    -ProcessSlotSyncInterrupts(void)
    +ProcessSlotSyncInterrupts()
    
    3.
    Since we are already caching the result of AmLogicalSlotSyncWorkerProcess() in
    a local worker variable, how about applying this replacement:
    s/if (AmLogicalSlotSyncWorkerProcess())/if (worker)/g?
    
    +	bool		worker = AmLogicalSlotSyncWorkerProcess();
    +	bool		parameter_changed = false;
     
    -	Assert(sync_replication_slots);
    +	if (AmLogicalSlotSyncWorkerProcess())
    +		Assert(sync_replication_slots);
    
    -- 
    Regards,
    Japin Li
    ChengDu WenWu Information Technology Co., Ltd.
    
    
    
    
  114. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-11-28T06:25:18Z

    On Fri, Nov 28, 2025 at 10:16 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    >
    > Fixed as above.
    >
    > I've addressed the above comments as well as rebased the patch based
    > on changes in commit 76b7872 in patch v26
    >
    
    Thanks for the patch. Please find a few trivial comments:
    
    1)
    + if (AmLogicalSlotSyncWorkerProcess())
    + Assert(sync_replication_slots);
    
    Here too we can use 'worker'.
    
    2)
    + /* check for sync_replication_slots change */
    
    check --> Check
    
    3)
    Assert (!worker)
    Extra space in between.
    
    4)
    check_and_set_sync_info() and ShutDownSlotSync() refers to the pid as
    worker_pid. But now it could be backend-pid as well.
    Using 'worker' in this variable could be misleading. Shall we make it
    sync_process_pid?
    
    5)
    /*
     * Interrupt handler for main loop of slot sync worker.
     */
    static void
    ProcessSlotSyncInterrupts()
    
    We can modify the comment to include API as well.
    
    6)
    /*
     * Shut down the slot sync worker.
     *
     * This function sends signal to shutdown slot sync worker, if required. It
     * also waits till the slot sync worker has exited or
     * pg_sync_replication_slots() has finished.
     */
    void
    ShutDownSlotSync(void)
    
    We should change comments to give details on API as well.
    
    7)
    +# Remove the standby from the synchronized_standby_slots list and reload the
    +# configuration.
    +$primary->adjust_conf('postgresql.conf', 'synchronized_standby_slots', "''");
    +$primary->reload;
    
    We can update the comment to below for better clarity:
    Remove the dropped sb1_slot from the ...
    
    8)
    +# Attempt to synchronize slots using API. The API will continue retrying
    +# synchronization until the remote slot catches up.
    +# The API will not return until this happens, to be able to make
    +# further calls, call the API in a background process.
    
    We can move these comments atop:
    my $h = $standby2->background_psql('postgres', on_error_stop => 0);
    
    thanks
    Shveta
    
    
    
    
  115. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-12-02T08:27:57Z

    On Fri, Nov 28, 2025 at 5:03 PM Japin Li <japinli@hotmail.com> wrote:
    >
    > 1.
    > Initialize slot_persistence_pending to false (to avoid uninitialized values, or
    > initialize to true by mistaken) in update_and_persist_local_synced_slot(). This
    > aligns with the handling of found_consistent_snapshot and remote_slot_precedes
    > in update_local_synced_slot().
    >
    > diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
    > index 20eada3393..c55ba11f17 100644
    > --- a/src/backend/replication/logical/slotsync.c
    > +++ b/src/backend/replication/logical/slotsync.c
    > @@ -617,6 +617,9 @@ update_and_persist_local_synced_slot(RemoteSlot *remote_slot, Oid remote_dbid,
    >         bool            found_consistent_snapshot = false;
    >         bool            remote_slot_precedes = false;
    >
    > +       if (slot_persistence_pending)
    > +               *slot_persistence_pending = false;
    > +
    >         /* Slotsync skip stats are handled in function update_local_synced_slot() */
    >         (void) update_local_synced_slot(remote_slot, remote_dbid,
    >                                                                         &found_consistent_snapshot,
    >
    
    I don't understand what the comment is here.
    
    > 2.
    > This change seems unnecessary。
    >
    >  static void
    > -slotsync_reread_config(void)
    > +slotsync_reread_config()
    >
    >  static void
    > -ProcessSlotSyncInterrupts(void)
    > +ProcessSlotSyncInterrupts()
    >
    
    Fixed.
    
    > 3.
    > Since we are already caching the result of AmLogicalSlotSyncWorkerProcess() in
    > a local worker variable, how about applying this replacement:
    > s/if (AmLogicalSlotSyncWorkerProcess())/if (worker)/g?
    >
    > +       bool            worker = AmLogicalSlotSyncWorkerProcess();
    > +       bool            parameter_changed = false;
    >
    > -       Assert(sync_replication_slots);
    > +       if (AmLogicalSlotSyncWorkerProcess())
    > +               Assert(sync_replication_slots);
    >
    
    Fixed.
    
    
    
    On Fri, Nov 28, 2025 at 5:25 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > Thanks for the patch. Please find a few trivial comments:
    >
    > 1)
    > + if (AmLogicalSlotSyncWorkerProcess())
    > + Assert(sync_replication_slots);
    >
    > Here too we can use 'worker'.
    >
    
    Fixed.
    
    > 2)
    > + /* check for sync_replication_slots change */
    >
    > check --> Check
    >
    
    Fixed.
    
    > 3)
    > Assert (!worker)
    > Extra space in between.
    >
    
    Fixed
    
    > 4)
    > check_and_set_sync_info() and ShutDownSlotSync() refers to the pid as
    > worker_pid. But now it could be backend-pid as well.
    > Using 'worker' in this variable could be misleading. Shall we make it
    > sync_process_pid?
    >
    
    Changed.
    
    > 5)
    > /*
    >  * Interrupt handler for main loop of slot sync worker.
    >  */
    > static void
    > ProcessSlotSyncInterrupts()
    >
    > We can modify the comment to include API as well.
    
    Changed.
    
    >
    > 6)
    > /*
    >  * Shut down the slot sync worker.
    >  *
    >  * This function sends signal to shutdown slot sync worker, if required. It
    >  * also waits till the slot sync worker has exited or
    >  * pg_sync_replication_slots() has finished.
    >  */
    > void
    > ShutDownSlotSync(void)
    >
    > We should change comments to give details on API as well.
    >
    
    changed.
    
    > 7)
    > +# Remove the standby from the synchronized_standby_slots list and reload the
    > +# configuration.
    > +$primary->adjust_conf('postgresql.conf', 'synchronized_standby_slots', "''");
    > +$primary->reload;
    >
    > We can update the comment to below for better clarity:
    > Remove the dropped sb1_slot from the ...
    >
    
    Changed.
    
    > 8)
    > +# Attempt to synchronize slots using API. The API will continue retrying
    > +# synchronization until the remote slot catches up.
    > +# The API will not return until this happens, to be able to make
    > +# further calls, call the API in a background process.
    >
    > We can move these comments atop:
    > my $h = $standby2->background_psql('postgres', on_error_stop => 0);
    >
    
    Changed.
    
    Attached patch v27 addresses the above comments.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  116. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-12-02T09:35:26Z

    On Tue, Dec 2, 2025 at 1:58 PM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    >
    > Attached patch v27 addresses the above comments.
    >
    
    Thanks for the patch. Please find a few comments:
    
    1)
    + /* The worker pid must not be already assigned in SlotSyncCtx */
    + Assert(SlotSyncCtx->pid == InvalidPid);
    +
    
    We can mention just 'pid' here instead of 'worker pid'
    
    2)
    + /*
    + * The syscache access in fetch_or_refresh_remote_slots() needs a
    + * transaction env.
    + */
    fetch_or_refresh_remote_slots --> fetch_remote_slots
    
    3)
     SyncReplicationSlots(WalReceiverConn *wrconn)
     {
    +
    
    We can get rid of this blank line at the start of the function.
    
    4)
     /*
      * Shut down the slot sync worker.
      *
    - * This function sends signal to shutdown slot sync worker, if required. It
    + * This function sends signal to shutdown the slot sync process, if
    required. It
      * also waits till the slot sync worker has exited or
      * pg_sync_replication_slots() has finished.
      */
    Shall we change comment to something like (rephrase if required):
    
    Shut down the slot synchronization.
    This function wakes up the slot sync process (either worker or backend
    running SQL function) and sets stopSignaled=true
    so that worker can exit or SQL function pg_sync_replication_slots()
    can finish. It also waits till the slot sync worker has exited or
    pg_sync_replication_slots() has finished.
    
    5)
    We should change the comment atop 'SlotSyncCtxStruct' as well to
    mention that this pid is either the slot sync worker's pid or
    backend's pid running the SQL function. It is needed by the startup
    process to wake these up, so that they can stop synchronization on
    seeing stopSignaled.  <please rephrase as needed>
    
    6)
    + ereport(LOG,
    + errmsg("replication slot synchronization worker is shutting down on
    receiving SIGUSR1"));
    
    SIGUSR1 was actually just a wake-up signal. We may change the comment to:
    replication slot synchronization worker is shutting down as promotion
    is triggered.
    
    7)
    update_synced_slots_inactive_since:
    /* The slot sync worker or SQL function mustn't be running by now */
    - Assert((SlotSyncCtx->pid == InvalidPid) && !SlotSyncCtx->syncing);
    + Assert(!SlotSyncCtx->syncing);
    
    Regarding this, I see that 'update_synced_slots_inactive_since' is
    only called when we are sure that 'syncing' is false. So shouldn't pid
    be also Invalid by that time? Even if it was backend's pid to start
    with, but since backend has stopped syncing (finished or error-ed
    out),
    pid should be reset to Invalid in such a case. And this Assert need
    not to be changed.
    
    8)
    
    + if (sync_process_pid != InvalidPid)
    + kill(sync_process_pid, SIGUSR1);
    
    We can write comments to say wake-up slot sync process.
    
    thanks
    Shveta
    
    
    
    
  117. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-12-03T03:21:35Z

    On Tue, Dec 2, 2025 at 8:35 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Tue, Dec 2, 2025 at 1:58 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > >
    > > Attached patch v27 addresses the above comments.
    > >
    >
    > Thanks for the patch. Please find a few comments:
    >
    > 1)
    > + /* The worker pid must not be already assigned in SlotSyncCtx */
    > + Assert(SlotSyncCtx->pid == InvalidPid);
    > +
    >
    > We can mention just 'pid' here instead of 'worker pid'
    >
    
    Changed.
    
    > 2)
    > + /*
    > + * The syscache access in fetch_or_refresh_remote_slots() needs a
    > + * transaction env.
    > + */
    > fetch_or_refresh_remote_slots --> fetch_remote_slots
    >
    > 3)
    >  SyncReplicationSlots(WalReceiverConn *wrconn)
    >  {
    > +
    >
    > We can get rid of this blank line at the start of the function.
    >
    
    Fixed.
    
    > 4)
    >  /*
    >   * Shut down the slot sync worker.
    >   *
    > - * This function sends signal to shutdown slot sync worker, if required. It
    > + * This function sends signal to shutdown the slot sync process, if
    > required. It
    >   * also waits till the slot sync worker has exited or
    >   * pg_sync_replication_slots() has finished.
    >   */
    > Shall we change comment to something like (rephrase if required):
    >
    > Shut down the slot synchronization.
    > This function wakes up the slot sync process (either worker or backend
    > running SQL function) and sets stopSignaled=true
    > so that worker can exit or SQL function pg_sync_replication_slots()
    > can finish. It also waits till the slot sync worker has exited or
    > pg_sync_replication_slots() has finished.
    >
    
    Changed.
    
    > 5)
    > We should change the comment atop 'SlotSyncCtxStruct' as well to
    > mention that this pid is either the slot sync worker's pid or
    > backend's pid running the SQL function. It is needed by the startup
    > process to wake these up, so that they can stop synchronization on
    > seeing stopSignaled.  <please rephrase as needed>
    >
    
    Changed.
    
    > 6)
    > + ereport(LOG,
    > + errmsg("replication slot synchronization worker is shutting down on
    > receiving SIGUSR1"));
    >
    > SIGUSR1 was actually just a wake-up signal. We may change the comment to:
    > replication slot synchronization worker is shutting down as promotion
    > is triggered.
    >
    
    Changed.
    
    > 7)
    > update_synced_slots_inactive_since:
    > /* The slot sync worker or SQL function mustn't be running by now */
    > - Assert((SlotSyncCtx->pid == InvalidPid) && !SlotSyncCtx->syncing);
    > + Assert(!SlotSyncCtx->syncing);
    >
    > Regarding this, I see that 'update_synced_slots_inactive_since' is
    > only called when we are sure that 'syncing' is false. So shouldn't pid
    > be also Invalid by that time? Even if it was backend's pid to start
    > with, but since backend has stopped syncing (finished or error-ed
    > out),
    > pid should be reset to Invalid in such a case. And this Assert need
    > not to be changed.
    >
    
    Fixed.
    
    > 8)
    >
    > + if (sync_process_pid != InvalidPid)
    > + kill(sync_process_pid, SIGUSR1);
    >
    > We can write comments to say wake-up slot sync process.
    >
    
    Added comments.
    
    Attaching patch v28 addressing these comments.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  118. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-12-03T04:00:49Z

    On Wed, Dec 3, 2025 at 8:51 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > On Tue, Dec 2, 2025 at 8:35 PM shveta malik <shveta.malik@gmail.com> wrote:
    > >
    > > On Tue, Dec 2, 2025 at 1:58 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > > >
    > > >
    > > > Attached patch v27 addresses the above comments.
    > > >
    > >
    > > Thanks for the patch. Please find a few comments:
    > >
    > > 1)
    > > + /* The worker pid must not be already assigned in SlotSyncCtx */
    > > + Assert(SlotSyncCtx->pid == InvalidPid);
    > > +
    > >
    > > We can mention just 'pid' here instead of 'worker pid'
    > >
    >
    > Changed.
    >
    > > 2)
    > > + /*
    > > + * The syscache access in fetch_or_refresh_remote_slots() needs a
    > > + * transaction env.
    > > + */
    > > fetch_or_refresh_remote_slots --> fetch_remote_slots
    > >
    > > 3)
    > >  SyncReplicationSlots(WalReceiverConn *wrconn)
    > >  {
    > > +
    > >
    > > We can get rid of this blank line at the start of the function.
    > >
    >
    > Fixed.
    >
    > > 4)
    > >  /*
    > >   * Shut down the slot sync worker.
    > >   *
    > > - * This function sends signal to shutdown slot sync worker, if required. It
    > > + * This function sends signal to shutdown the slot sync process, if
    > > required. It
    > >   * also waits till the slot sync worker has exited or
    > >   * pg_sync_replication_slots() has finished.
    > >   */
    > > Shall we change comment to something like (rephrase if required):
    > >
    > > Shut down the slot synchronization.
    > > This function wakes up the slot sync process (either worker or backend
    > > running SQL function) and sets stopSignaled=true
    > > so that worker can exit or SQL function pg_sync_replication_slots()
    > > can finish. It also waits till the slot sync worker has exited or
    > > pg_sync_replication_slots() has finished.
    > >
    >
    > Changed.
    >
    > > 5)
    > > We should change the comment atop 'SlotSyncCtxStruct' as well to
    > > mention that this pid is either the slot sync worker's pid or
    > > backend's pid running the SQL function. It is needed by the startup
    > > process to wake these up, so that they can stop synchronization on
    > > seeing stopSignaled.  <please rephrase as needed>
    > >
    >
    > Changed.
    >
    > > 6)
    > > + ereport(LOG,
    > > + errmsg("replication slot synchronization worker is shutting down on
    > > receiving SIGUSR1"));
    > >
    > > SIGUSR1 was actually just a wake-up signal. We may change the comment to:
    > > replication slot synchronization worker is shutting down as promotion
    > > is triggered.
    > >
    >
    > Changed.
    >
    > > 7)
    > > update_synced_slots_inactive_since:
    > > /* The slot sync worker or SQL function mustn't be running by now */
    > > - Assert((SlotSyncCtx->pid == InvalidPid) && !SlotSyncCtx->syncing);
    > > + Assert(!SlotSyncCtx->syncing);
    > >
    > > Regarding this, I see that 'update_synced_slots_inactive_since' is
    > > only called when we are sure that 'syncing' is false. So shouldn't pid
    > > be also Invalid by that time? Even if it was backend's pid to start
    > > with, but since backend has stopped syncing (finished or error-ed
    > > out),
    > > pid should be reset to Invalid in such a case. And this Assert need
    > > not to be changed.
    > >
    >
    > Fixed.
    >
    > > 8)
    > >
    > > + if (sync_process_pid != InvalidPid)
    > > + kill(sync_process_pid, SIGUSR1);
    > >
    > > We can write comments to say wake-up slot sync process.
    > >
    >
    > Added comments.
    >
    > Attaching patch v28 addressing these comments.
    >
    
    Thanks for the patch. A few trivial comments:
    
    1)
    + ereport(ERROR,
    + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
    + errmsg("cannot continue replication slots synchronization"
    +    " as standby promotion is triggered"));
    
    slots->slot, since in every error-message we use 'replication slot
    synchronization'
    
    2)
    + * The pid is either the slot sync worker's pid or the backend's pid running
    + * the SQL function pg_sync_replication_slots(). It is needed by the startup
    + * process to wake these up, so that they can stop synchronization on seeing
    + * stopSignaled on promotion.
    + * Setting stopSignaled is also used to handle the race condition when the
    
    Can we rephrase slightly to indicate clearly that it is the startup
    process which sets 'stopSignaled' during promotion. Suggestion:
    
    The pid is either the slot sync worker’s pid or the backend’s pid running
    the SQL function pg_sync_replication_slots(). When the startup process
    sets stopSignaled during promotion, it uses this pid to wake the
    currently synchronizing process so that the process can immediately stop its
    synchronization work upon seeing stopSignaled set to true.
    Setting stopSignaled....
    
    thanks
    Shveta
    
    
    
    
  119. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Japin Li <japinli@hotmail.com> — 2025-12-03T05:05:51Z

    On Tue, 02 Dec 2025 at 19:27, Ajin Cherian <itsajin@gmail.com> wrote:
    > On Fri, Nov 28, 2025 at 5:03 PM Japin Li <japinli@hotmail.com> wrote:
    >>
    >> 1.
    >> Initialize slot_persistence_pending to false (to avoid uninitialized values, or
    >> initialize to true by mistaken) in update_and_persist_local_synced_slot(). This
    >> aligns with the handling of found_consistent_snapshot and remote_slot_precedes
    >> in update_local_synced_slot().
    >>
    >> diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
    >> index 20eada3393..c55ba11f17 100644
    >> --- a/src/backend/replication/logical/slotsync.c
    >> +++ b/src/backend/replication/logical/slotsync.c
    >> @@ -617,6 +617,9 @@ update_and_persist_local_synced_slot(RemoteSlot *remote_slot, Oid remote_dbid,
    >>         bool            found_consistent_snapshot = false;
    >>         bool            remote_slot_precedes = false;
    >>
    >> +       if (slot_persistence_pending)
    >> +               *slot_persistence_pending = false;
    >> +
    >>         /* Slotsync skip stats are handled in function update_local_synced_slot() */
    >>         (void) update_local_synced_slot(remote_slot, remote_dbid,
    >>                                                                         &found_consistent_snapshot,
    >>
    >
    > I don't understand what the comment is here.
    
    I mean, we should always set the slot_persistence_pending variable to false
    immediately when entering the update_and_persist_local_synced_slot() function.
    
    For example:
    
    bool	slot_persistence_pending = true;
    
    update_and_persist_local_synced_slot(..., &slot_persistence_pending);
    
    /* Here the slot_persistence_pending is always true, is this expected? */
    
    -- 
    Regards,
    Japin Li
    ChengDu WenWu Information Technology Co., Ltd.
    
    
    
    
  120. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Amit Kapila <amit.kapila16@gmail.com> — 2025-12-03T11:19:03Z

    On Wed, Dec 3, 2025 at 8:51 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > Attaching patch v28 addressing these comments.
    >
    
    Can we extract the part of the patch that handles SIGUSR1 signal
    separately as a first patch and the remaining as a second patch?
    Please do mention the reason in the commit message as to why we are
    changing the signal for SIGINT to SIGUSR1.
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  121. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-12-04T05:21:39Z

    On Wed, Dec 3, 2025 at 10:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    >
    > On Wed, Dec 3, 2025 at 8:51 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > > Attaching patch v28 addressing these comments.
    > >
    >
    > Can we extract the part of the patch that handles SIGUSR1 signal
    > separately as a first patch and the remaining as a second patch?
    > Please do mention the reason in the commit message as to why we are
    > changing the signal for SIGINT to SIGUSR1.
    >
    
    I have extracted out the SIGUSR1 signal handling changes separately
    into a patch and sharing. I will share the next patch later.
    Let me know if there are any comments for this patch.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  122. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-12-04T06:03:50Z

    On Thu, Dec 4, 2025 at 10:51 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > On Wed, Dec 3, 2025 at 10:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    > >
    > > On Wed, Dec 3, 2025 at 8:51 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > > >
    > > > Attaching patch v28 addressing these comments.
    > > >
    > >
    > > Can we extract the part of the patch that handles SIGUSR1 signal
    > > separately as a first patch and the remaining as a second patch?
    > > Please do mention the reason in the commit message as to why we are
    > > changing the signal for SIGINT to SIGUSR1.
    > >
    >
    > I have extracted out the SIGUSR1 signal handling changes separately
    > into a patch and sharing. I will share the next patch later.
    > Let me know if there are any comments for this patch.
    >
    
    I have just 2 trivial comments for v29-001:
    
    1)
    -   * receives a SIGINT from the startup process, or when there is an error.
    +   * receives a SIGUSR1 from the startup process, or when there is an error.
    
    In above we should mention stopSignaled rather than SIGUSR1, as
    SIGUSR1 is just a wakeup signal and not termination signal.
    
     2)
    +    else
    +      ereport(ERROR,
    +          errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
    +          errmsg("cannot continue replication slot synchronization"
    +               " as standby promotion is triggered"));
    
    Please mention that it is SQL-function in the comment for else-block.
    
    ~~
    
    I tested the touched scenarios and here are the LOGs:
    
    a)
    When promotion is ongoing and the startup  process has terminated
    slot-sync worker but if the postmaster has not noticed that, it may
    end up starting slotsync worker again. For that scenario, we get
    these:
    
    11:03:19.712 IST [151559] LOG:  replication slot synchronization
    worker is shutting down as promotion is triggered
    11:03:19.726 IST [151629] LOG:  slot sync worker started
    11:03:19.795 IST [151629] LOG:  replication slot synchronization
    worker is shutting down as promotion is triggered
    
    b)
    On promotion, API gets this (originating from ProcessSlotSyncInterrupts now):
    postgres=# SELECT pg_sync_replication_slots();
    ERROR:  cannot continue replication slot synchronization as standby
    promotion is triggered
    
    c)
    If any parameter is changed between ValidateSlotSyncParams() and
    ProcessSlotSyncInterrupts() for API, we get this:
    postgres=# SELECT pg_sync_replication_slots();
    ERROR:  replication slot synchronization will stop because of a parameter change
    
    --on re-run (originating from ValidateSlotSyncParams())
    postgres=# SELECT pg_sync_replication_slots();
    ERROR:  replication slot synchronization requires
    "hot_standby_feedback" to be enabled
    
    ~~
    
    The tested scenarios' behaviour looks good to me.
    
    thanks
    Shveta
    
    
    
    
  123. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-12-04T09:01:04Z

    On Thu, Dec 4, 2025 at 5:04 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    >
    > I have just 2 trivial comments for v29-001:
    >
    > 1)
    > -   * receives a SIGINT from the startup process, or when there is an error.
    > +   * receives a SIGUSR1 from the startup process, or when there is an error.
    >
    > In above we should mention stopSignaled rather than SIGUSR1, as
    > SIGUSR1 is just a wakeup signal and not termination signal.
    >
    
    Fixed.
    
    >  2)
    > +    else
    > +      ereport(ERROR,
    > +          errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
    > +          errmsg("cannot continue replication slot synchronization"
    > +               " as standby promotion is triggered"));
    >
    > Please mention that it is SQL-function in the comment for else-block.
    >
    
    Fixed.
    
    > ~~
    >
    > I tested the touched scenarios and here are the LOGs:
    >
    
    Thanks for testing!
    
    Attaching patch v30 with the above changes addressed. I've also run
    pgindent on the changes.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  124. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-12-09T10:33:59Z

    Hi all,
    
    Since commit 04396ea [1] has been pushed, which included part of the
    changes this patch set was addressing, I have updated and rebased the
    patch set to incorporate those changes.
    
    The patch set now contains two patches:
    
    0001 – Modify the pg_sync_replication_slots API to also handle
    promotion signals and stop synchronization, similar to the slot sync
    worker.
    0002 – Improve pg_sync_replication_slots to wait for and persist slots
    until they are sync-ready.
    
    Please review the updated patch set (v31).
    
    Regards,
    Ajin Cherian
    Fujitsu Australia
    
    [1] https://github.com/postgres/postgres/commit/04396eacd3faeaa4fa3d084a6749e4e384bdf0db
    
  125. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-12-09T11:45:24Z

    On Tue, Dec 9, 2025 at 4:04 PM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > Hi all,
    >
    > Since commit 04396ea [1] has been pushed, which included part of the
    > changes this patch set was addressing, I have updated and rebased the
    > patch set to incorporate those changes.
    >
    > The patch set now contains two patches:
    >
    > 0001 – Modify the pg_sync_replication_slots API to also handle
    > promotion signals and stop synchronization, similar to the slot sync
    > worker.
    > 0002 – Improve pg_sync_replication_slots to wait for and persist slots
    > until they are sync-ready.
    >
    > Please review the updated patch set (v31).
    >
    
    Thanks. Please find a few comments on 001:
    
    1)
    Commit message:
    "That meant backends
    that perform slot synchronization via the pg_sync_replication_slots()
    SQL function were not signalled at all because their PIDs were not
    recorded in the slot-sync context."
    
    We should phrase it in the singular ('backend'), since multiple
    backends cannot run simultaneously because concurrent sync is not
    allowed.
    Using the term 'their PIDs' gives an impression that there could be
    multiple PIDs, which is not the case.
    
    2)
      primary_slotname_changed = strcmp(old_primary_slotname, PrimarySlotName) != 0;
    +
      pfree(old_primary_conninfo);
    
    This change to put blank line is not needed.
    
    3)
    
    + /* Check for sync_replication_slots change */
    + /* Check for parameter changes common to both API and worker */
    
    IMO, these comments are not needed as it is self-explanatory. Even if
    we plan to put these, both should be same, either both mentioning API
    and worker or neither.
    
    4)
    - * receives a stop request from the startup process, or when there is an
    + * because receives a stop request from the startup process, or when
    there is an
    
    I think, this change is done by mistake.
    
    5)
    + * Signal slotsync worker or backend process running
    pg_sync_replication_slots()
    + * if running. The process will stop upon detecting that the stopSignaled
    + * flag is set to true.
    
    Comment looks slightly odd. Shall we say:
    
    Signal the slotsync worker or the backend process running
    pg_sync_replication_slots(), if either one is active. The process...
    
    thanks
    Shveta
    
    
    
    
  126. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-12-10T02:23:10Z

    On Tue, Dec 9, 2025 at 10:45 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Tue, Dec 9, 2025 at 4:04 PM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > > Hi all,
    > >
    > > Since commit 04396ea [1] has been pushed, which included part of the
    > > changes this patch set was addressing, I have updated and rebased the
    > > patch set to incorporate those changes.
    > >
    > > The patch set now contains two patches:
    > >
    > > 0001 – Modify the pg_sync_replication_slots API to also handle
    > > promotion signals and stop synchronization, similar to the slot sync
    > > worker.
    > > 0002 – Improve pg_sync_replication_slots to wait for and persist slots
    > > until they are sync-ready.
    > >
    > > Please review the updated patch set (v31).
    > >
    >
    > Thanks. Please find a few comments on 001:
    >
    > 1)
    > Commit message:
    > "That meant backends
    > that perform slot synchronization via the pg_sync_replication_slots()
    > SQL function were not signalled at all because their PIDs were not
    > recorded in the slot-sync context."
    >
    > We should phrase it in the singular ('backend'), since multiple
    > backends cannot run simultaneously because concurrent sync is not
    > allowed.
    > Using the term 'their PIDs' gives an impression that there could be
    > multiple PIDs, which is not the case.
    >
    
    Fixed.
    
    > 2)
    >   primary_slotname_changed = strcmp(old_primary_slotname, PrimarySlotName) != 0;
    > +
    >   pfree(old_primary_conninfo);
    >
    > This change to put blank line is not needed.
    >
    
    Removed.
    
    > 3)
    >
    > + /* Check for sync_replication_slots change */
    > + /* Check for parameter changes common to both API and worker */
    >
    > IMO, these comments are not needed as it is self-explanatory. Even if
    > we plan to put these, both should be same, either both mentioning API
    > and worker or neither.
    >
    
    Removed.
    
    > 4)
    > - * receives a stop request from the startup process, or when there is an
    > + * because receives a stop request from the startup process, or when
    > there is an
    >
    > I think, this change is done by mistake.
    >
    
    Yes, removed.
    
    > 5)
    > + * Signal slotsync worker or backend process running
    > pg_sync_replication_slots()
    > + * if running. The process will stop upon detecting that the stopSignaled
    > + * flag is set to true.
    >
    > Comment looks slightly odd. Shall we say:
    >
    > Signal the slotsync worker or the backend process running
    > pg_sync_replication_slots(), if either one is active. The process...
    >
    
    Changed.
    
    Attaching patch v32 addressing the above comments.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  127. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Chao Li <li.evan.chao@gmail.com> — 2025-12-10T02:28:39Z

    
    > On Dec 9, 2025, at 18:33, Ajin Cherian <itsajin@gmail.com> wrote:
    > 
    > Hi all,
    > 
    > Since commit 04396ea [1] has been pushed, which included part of the
    > changes this patch set was addressing, I have updated and rebased the
    > patch set to incorporate those changes.
    > 
    > The patch set now contains two patches:
    > 
    > 0001 – Modify the pg_sync_replication_slots API to also handle
    > promotion signals and stop synchronization, similar to the slot sync
    > worker.
    > 0002 – Improve pg_sync_replication_slots to wait for and persist slots
    > until they are sync-ready.
    > 
    > Please review the updated patch set (v31).
    > 
    > Regards,
    > Ajin Cherian
    > Fujitsu Australia
    > 
    > [1] https://github.com/postgres/postgres/commit/04396eacd3faeaa4fa3d084a6749e4e384bdf0db
    > <v31-0001-Signal-backends-running-pg_sync_replication_slot.patch><v31-0002-Improve-initial-slot-synchronization-in-pg_sync_.patch>
    
    Hi Ajin,
    
    I’d like to revisit this patch, but looks like 04396eacd3faeaa4fa3d084a6749e4e384bdf0db has some conflicts to this patch. So can you please rebase this patch?
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
    
    
    
  128. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-12-10T02:40:10Z

    On Wed, Dec 10, 2025 at 1:29 PM Chao Li <li.evan.chao@gmail.com> wrote:
    >
    > Hi Ajin,
    >
    > I’d like to revisit this patch, but looks like 04396eacd3faeaa4fa3d084a6749e4e384bdf0db has some conflicts to this patch. So can you please rebase this patch?
    >
    > Best regards,
    > --
    
    It's been rebased. Have a look at the latest version.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
    
    
    
  129. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-12-10T04:05:01Z

    On Wed, Dec 10, 2025 at 8:10 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > On Wed, Dec 10, 2025 at 1:29 PM Chao Li <li.evan.chao@gmail.com> wrote:
    > >
    > > Hi Ajin,
    > >
    > > I’d like to revisit this patch, but looks like 04396eacd3faeaa4fa3d084a6749e4e384bdf0db has some conflicts to this patch. So can you please rebase this patch?
    > >
    > > Best regards,
    > > --
    >
    > It's been rebased. Have a look at the latest version.
    >
    
    Few comments on 001:
    
    1)
    /*
    * Emit an error if a promotion or a concurrent sync call is in progress.
    * Otherwise, advertise that a sync is in progress.
    */
    static void
    check_and_set_sync_info
    
    We need to change this comment because now this function does not
    handle promotion case.
    
    2)
    +  if (sync_process_pid!= InvalidPid)
    +    kill(sync_process_pid, SIGUSR1);
    
    We need to have space between sync_process_pid and '!='
    
    3)
    + * Exit or throw errors if relevant GUCs have changed depending on whether
    
    errors->error
    
    4)
    In slotsync_reread_config(), even when we mark parameter_changed=true
    in the first if-block, we still go to the second if-block which was
    not needed.  So shall we make second if-block as else-if to avoid
    this? Thoughts?
    
    5)
    As discussed in [1], we can make this change in ProcessSlotSyncInterrupts():
    
    'replication slot synchronization worker is shutting down because
    promotion is triggered'
    to
    'replication slot synchronization worker will stop because promotion
    is triggered'
    
    [1]: https://www.postgresql.org/message-id/6AE56C64-F760-4CBD-BABF-72633D3F7B5E%40gmail.com
    
    thanks
    Shveta
    
    
    
    
  130. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Chao Li <li.evan.chao@gmail.com> — 2025-12-10T04:26:39Z

    
    > On Dec 10, 2025, at 10:40, Ajin Cherian <itsajin@gmail.com> wrote:
    > 
    > On Wed, Dec 10, 2025 at 1:29 PM Chao Li <li.evan.chao@gmail.com> wrote:
    >> 
    >> Hi Ajin,
    >> 
    >> I’d like to revisit this patch, but looks like 04396eacd3faeaa4fa3d084a6749e4e384bdf0db has some conflicts to this patch. So can you please rebase this patch?
    >> 
    >> Best regards,
    >> --
    > 
    > It's been rebased. Have a look at the latest version.
    > 
    
    Here are some comments for v32.
    
    1 - 0001
    ```
    - * The slot sync worker's pid is needed by the startup process to shut it
    - * down during promotion. The startup process shuts down the slot sync worker
    - * and also sets stopSignaled=true to handle the race condition when the
    + * The pid is either the slot sync worker's pid or the backend's pid running
    ```
    
    I think we should add single quotes for “pid” and “stopSignaled". Looking at other comment lines, structure fields all have single-quoted:
    ```
    * The 'syncing' flag is needed to prevent concurrent slot syncs to avoid slot
    
    * The 'last_start_time' is needed by postmaster to start the slot sync worker
    ```
    
    2 - 0001 - the same code block as 1
    
    I wonder how to distinct if the “pid” is a slot sync worker or a backend process?
    
    3 - 0001
    ```
    +	bool		worker = AmLogicalSlotSyncWorkerProcess();
    ```
    
    The variable name “worker” doesn’t indicate a bool type, maybe renamed to “is_slotsync_worker”.
    
    4 - 0001
    ```
    +	/*
    +	 * If we have reached here with a parameter change, we must be running in
    +	 * SQL function, emit error in such a case.
    +	 */
    +	if (parameter_changed)
    +	{
    +		Assert(!worker);
    +		ereport(ERROR,
    +				errmsg("replication slot synchronization will stop because of a parameter change"));
     	}
    ```
    
    The Assert(!worker) feels redundant, because it then immediately error out.
    
    5 - 0001
    ```
    + * Exit or throw errors if relevant GUCs have changed depending on whether
    + * called from slotsync worker or from SQL function pg_sync_replication_slots()
    ```
    
    Let’s change “slotsync” to “slot sync” because elsewhere comments all use “slot sync”, just to keep consistent.
    
    6 - 0001
    ```
    - * Interrupt handler for main loop of slot sync worker.
    + * Interrupt handler for main loop of slot sync worker and
    + * SQL function pg_sync_replication_slots().
    ```
    
    Missing “the” before “SQL function”. This comment applies to multiple places.
    
    7 - 0001
    ```
    +	if (sync_process_pid!= InvalidPid)
    +		kill(sync_process_pid, SIGUSR1);
    ```
    
    Nit: missing a white space before “!="
    
    8 - 0002
    ```
    +	if (slot_names != NIL)
     	{
    -		StartTransactionCommand();
    -		started_tx = true;
    +		bool		first_slot = true;
    +
    +		/*
    +		 * Construct the query to fetch only the specified slots
    +		 */
    +		appendStringInfoString(&query, " AND slot_name IN (");
    +
    +		foreach_ptr(char, slot_name, slot_names)
    +		{
    +			if (!first_slot)
    +				appendStringInfoString(&query, ", ");
    +
    +			appendStringInfo(&query, "'%s'", slot_name);
    +			first_slot = false;
    +		}
    +		appendStringInfoChar(&query, ')');
     	}
    ```
    
    The logic of appending “, “ can be slightly simplified as:
    ```
    If (slot_names != NIL)
    {
        Const char *sep = “”;
        appendStringInfoString(&query, " AND slot_name IN (“);
        foreach_ptr(char, slot_name, slot_names)
        {
            appendStringInfo(&query, “%s'%s'", sep, slot_name);
            sep = “, “;
        }
    }
    ```
    
    That saves a “if” check and a appendStringInfoString().
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
    
    
    
  131. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-12-10T05:07:34Z

    On Wed, Dec 10, 2025 at 3:05 PM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Wed, Dec 10, 2025 at 8:10 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > > On Wed, Dec 10, 2025 at 1:29 PM Chao Li <li.evan.chao@gmail.com> wrote:
    > > >
    > > > Hi Ajin,
    > > >
    > > > I’d like to revisit this patch, but looks like 04396eacd3faeaa4fa3d084a6749e4e384bdf0db has some conflicts to this patch. So can you please rebase this patch?
    > > >
    > > > Best regards,
    > > > --
    > >
    > > It's been rebased. Have a look at the latest version.
    > >
    >
    > Few comments on 001:
    >
    > 1)
    > /*
    > * Emit an error if a promotion or a concurrent sync call is in progress.
    > * Otherwise, advertise that a sync is in progress.
    > */
    > static void
    > check_and_set_sync_info
    >
    > We need to change this comment because now this function does not
    > handle promotion case.
    
    Fixed.
    
    >
    > 2)
    > +  if (sync_process_pid!= InvalidPid)
    > +    kill(sync_process_pid, SIGUSR1);
    >
    > We need to have space between sync_process_pid and '!='
    >
    
    Fixed.
    
    > 3)
    > + * Exit or throw errors if relevant GUCs have changed depending on whether
    >
    > errors->error
    >
    
    Fixed.
    
    > 4)
    > In slotsync_reread_config(), even when we mark parameter_changed=true
    > in the first if-block, we still go to the second if-block which was
    > not needed.  So shall we make second if-block as else-if to avoid
    > this? Thoughts?
    >
    
    Changed as suggested.
    
    > 5)
    > As discussed in [1], we can make this change in ProcessSlotSyncInterrupts():
    >
    > 'replication slot synchronization worker is shutting down because
    > promotion is triggered'
    > to
    > 'replication slot synchronization worker will stop because promotion
    > is triggered'
    >
    > [1]: https://www.postgresql.org/message-id/6AE56C64-F760-4CBD-BABF-72633D3F7B5E%40gmail.com
    >
    
    Changed.
    
    
    On Wed, Dec 10, 2025 at 3:27 PM Chao Li <li.evan.chao@gmail.com> wrote:
    >
    >
    >
    >
    > Here are some comments for v32.
    >
    > 1 - 0001
    > ```
    > - * The slot sync worker's pid is needed by the startup process to shut it
    > - * down during promotion. The startup process shuts down the slot sync worker
    > - * and also sets stopSignaled=true to handle the race condition when the
    > + * The pid is either the slot sync worker's pid or the backend's pid running
    > ```
    >
    > I think we should add single quotes for “pid” and “stopSignaled". Looking at other comment lines, structure fields all have single-quoted:
    > ```
    > * The 'syncing' flag is needed to prevent concurrent slot syncs to avoid slot
    >
    > * The 'last_start_time' is needed by postmaster to start the slot sync worker
    > ```
    >
    
    Changed as suggested.
    
    > 2 - 0001 - the same code block as 1
    >
    > I wonder how to distinct if the “pid” is a slot sync worker or a backend process?
    >
    
    No, there is now way currently and iis not really required with the
    current logic.
    
    
    > 3 - 0001
    > ```
    > +       bool            worker = AmLogicalSlotSyncWorkerProcess();
    > ```
    >
    > The variable name “worker” doesn’t indicate a bool type, maybe renamed to “is_slotsync_worker”.
    >
    
    Changed as suggested.
    
    > 4 - 0001
    > ```
    > +       /*
    > +        * If we have reached here with a parameter change, we must be running in
    > +        * SQL function, emit error in such a case.
    > +        */
    > +       if (parameter_changed)
    > +       {
    > +               Assert(!worker);
    > +               ereport(ERROR,
    > +                               errmsg("replication slot synchronization will stop because of a parameter change"));
    >         }
    > ```
    >
    > The Assert(!worker) feels redundant, because it then immediately error out.
    >
    
    I don't think it is redundant as Asserts are used to catch unexpected
    code paths in testing.
    
    > 5 - 0001
    > ```
    > + * Exit or throw errors if relevant GUCs have changed depending on whether
    > + * called from slotsync worker or from SQL function pg_sync_replication_slots()
    > ```
    >
    > Let’s change “slotsync” to “slot sync” because elsewhere comments all use “slot sync”, just to keep consistent.
    >
    
    Changed.
    
    > 6 - 0001
    > ```
    > - * Interrupt handler for main loop of slot sync worker.
    > + * Interrupt handler for main loop of slot sync worker and
    > + * SQL function pg_sync_replication_slots().
    > ```
    >
    > Missing “the” before “SQL function”. This comment applies to multiple places.
    >
    
    Changed.
    
    
    > 7 - 0001
    > ```
    > +       if (sync_process_pid!= InvalidPid)
    > +               kill(sync_process_pid, SIGUSR1);
    > ```
    >
    > Nit: missing a white space before “!="
    >
    
    Fixed.
    
    > 8 - 0002
    > ```
    > +       if (slot_names != NIL)
    >         {
    > -               StartTransactionCommand();
    > -               started_tx = true;
    > +               bool            first_slot = true;
    > +
    > +               /*
    > +                * Construct the query to fetch only the specified slots
    > +                */
    > +               appendStringInfoString(&query, " AND slot_name IN (");
    > +
    > +               foreach_ptr(char, slot_name, slot_names)
    > +               {
    > +                       if (!first_slot)
    > +                               appendStringInfoString(&query, ", ");
    > +
    > +                       appendStringInfo(&query, "'%s'", slot_name);
    > +                       first_slot = false;
    > +               }
    > +               appendStringInfoChar(&query, ')');
    >         }
    > ```
    >
    > The logic of appending “, “ can be slightly simplified as:
    > ```
    > If (slot_names != NIL)
    > {
    >     Const char *sep = “”;
    >     appendStringInfoString(&query, " AND slot_name IN (“);
    >     foreach_ptr(char, slot_name, slot_names)
    >     {
    >         appendStringInfo(&query, “%s'%s'", sep, slot_name);
    >         sep = “, “;
    >     }
    > }
    > ```
    >
    > That saves a “if” check and a appendStringInfoString().
    
    I'm not sure if this is much of an improvement, I like the current
    approach and matches with similar coding patterns in the code base.
    
    Attaching v34 addressing the above comments.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  132. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Yilin Zhang <jiezhilove@126.com> — 2025-12-10T05:22:39Z

    
    At 2025-12-10 13:07:34, "Ajin Cherian" <itsajin@gmail.com> wrote:
    >
    >I'm not sure if this is much of an improvement, I like the current
    >approach and matches with similar coding patterns in the code base.
    >
    >Attaching v34 addressing the above comments.
    >
    
    Hi,
    Few comments for v34.
    
    1 - 0002
    
    ```
    
    --- a/src/backend/replication/logical/slotsync.c
    
    +++ b/src/backend/replication/logical/slotsync.c
    
    @@ -39,6 +39,12 @@
    
      * the last cycle. Refer to the comments above wait_for_slot_activity() for
    
      * more details.
    
      *
    
    + * If the SQL function pg_sync_replication is used to sync the slots, and if
    
    ```
    
    Typo, it should be "pg_sync_replication_slots()" instead of "pg_sync_replication".
    
    
    2 - 0002
    
    ```
    
    +		/*
    
    +		 * The syscache access in fetch_or_refresh_remote_slots() needs a
    
    +		 * transaction env.
    
    +		 */
    
    ```
    
    Typo, it should be "fetch_remote_slots()" instead of "fetch_or_refresh_remote_slots()".
    
    
    3 - 0002
    
    ```
    
    +			appendStringInfo(&query, "'%s'", slot_name);
    
    ```
    
    Instead of manually add single quotes to slot name, consider using quote_literal_cstr().
    
    
    While I was reviewing patch v32, Ajin Cherian had already submitted patch v34, but these issues still persisted.
    
    
    Best regards,
    
    --
    
    Yilin Zhang
    
    
  133. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Amit Kapila <amit.kapila16@gmail.com> — 2025-12-10T12:23:50Z

    On Wed, Dec 10, 2025 at 10:37 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > Attaching v34 addressing the above comments.
    >
    
    0001 looks mostly good to me. I have made minor edits in the comments
    and added error_code for one of the error messages. Please check
    attached and let me know what you think?
    
    -- 
    With Regards,
    Amit Kapila.
    
  134. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Chao Li <li.evan.chao@gmail.com> — 2025-12-10T13:34:26Z

    
    > On Dec 10, 2025, at 20:23, Amit Kapila <amit.kapila16@gmail.com> wrote:
    > 
    > On Wed, Dec 10, 2025 at 10:37 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >> 
    >> Attaching v34 addressing the above comments.
    >> 
    > 
    > 0001 looks mostly good to me. I have made minor edits in the comments
    > and added error_code for one of the error messages. Please check
    > attached and let me know what you think?
    > 
    > -- 
    > With Regards,
    > Amit Kapila.
    > <v35-0001-Signal-backends-running-pg_sync_replication_slot.patch>
    
    I saw you have integrated the missed change from the last push into v35, and v35 looks good to me.
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
    
    
    
  135. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-12-11T03:07:03Z

    On Wed, Dec 10, 2025 at 5:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    >
    >
    > 0001 looks mostly good to me. I have made minor edits in the comments
    > and added error_code for one of the error messages. Please check
    > attached and let me know what you think?
    >
    
    v35 looks good to me.
    
    thanks
    Shveta
    
    
    
    
  136. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-12-11T05:15:16Z

    On Wed, Dec 10, 2025 at 4:23 PM Yilin Zhang <jiezhilove@126.com> wrote:
    >
    > Hi,
    > Few comments for v34.
    >
    
    Thanks for your review!
    
    I've addressed your comments.
    
    As patch 0001 has been pushed. I've rebased and created a new version
    v36 with the remaining patch.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  137. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-12-11T08:57:38Z

    On Thu, Dec 11, 2025 at 10:45 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > On Wed, Dec 10, 2025 at 4:23 PM Yilin Zhang <jiezhilove@126.com> wrote:
    > >
    > > Hi,
    > > Few comments for v34.
    > >
    >
    > Thanks for your review!
    >
    > I've addressed your comments.
    >
    > As patch 0001 has been pushed. I've rebased and created a new version
    > v36 with the remaining patch.
    >
    
    Verified, the patch works well.
    
    thanks
    Shveta
    
    
    
    
  138. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Amit Kapila <amit.kapila16@gmail.com> — 2025-12-11T12:23:22Z

    On Thu, Dec 11, 2025 at 10:45 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > As patch 0001 has been pushed. I've rebased and created a new version
    > v36 with the remaining patch.
    >
    
    I have made a number of changes in code comments and docs. Kindly
    review and if you are okay with these then include them in the next
    version.
    
    -- 
    With Regards,
    Amit Kapila.
    
  139. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Chao Li <li.evan.chao@gmail.com> — 2025-12-11T22:49:03Z

    
    > On Dec 11, 2025, at 20:23, Amit Kapila <amit.kapila16@gmail.com> wrote:
    > 
    > On Thu, Dec 11, 2025 at 10:45 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >> 
    >> As patch 0001 has been pushed. I've rebased and created a new version
    >> v36 with the remaining patch.
    >> 
    > 
    > I have made a number of changes in code comments and docs. Kindly
    > review and if you are okay with these then include them in the next
    > version.
    > 
    
    This diff enhanced docs and comments, overall looks good to me. A few nit comments:
    
    1
    ```
    - * Returns:
    - *	List of remote slot information structures. Returns NIL if no slot
    - *	is found.
    - *
    + * Returns list of remote slot information structures, if any, otherwise,
    + * NIL if no slot is found.
    ```
    
    I think “a” is needed before “list”, and “if any, otherwise,” looks rarely seen in code comments. So suggesting: 
    ```
     * Returns a list of remote slot information structures, or NIL if none
     * are found.
    ```
    
    2
    ```
    - * Parameters:
    - * wrconn - Connection to the primary server
    - * remote_slot_list - List of RemoteSlot structures to synchronize.
    - * slot_persistence_pending - boolean used by SQL function
    - * 							  pg_sync_replication_slots() to track if any slots
    - * 							  could not be persisted and need to be retried.
    + * If slot_persistence_pending is not NULL, it will be set to true if one or
    + * more slots could not be persisted.
    ```
    
    The changed version loses the meaning of retry. So, suggesting:
    ```
     * If slot_persistence_pending is not NULL, it will be set to true if one
     * or more slots could not be persisted.  This allows callers such as
     * pg_sync_replication_slots() to retry those slots.
    ```
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
    
    
    
  140. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Ajin Cherian <itsajin@gmail.com> — 2025-12-12T00:05:04Z

    On Thu, Dec 11, 2025 at 11:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    >
    > On Thu, Dec 11, 2025 at 10:45 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > > As patch 0001 has been pushed. I've rebased and created a new version
    > > v36 with the remaining patch.
    > >
    >
    > I have made a number of changes in code comments and docs. Kindly
    > review and if you are okay with these then include them in the next
    > version.
    >
    
    I have included these changes as well as comments by Chao. Attaching
    v37 with the changes.
    
    regards,
    Ajin Cherian
    Fujitsu Australia
    
  141. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    shveta malik <shveta.malik@gmail.com> — 2025-12-12T03:22:55Z

    On Fri, Dec 12, 2025 at 5:35 AM Ajin Cherian <itsajin@gmail.com> wrote:
    >
    > On Thu, Dec 11, 2025 at 11:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    > >
    > > On Thu, Dec 11, 2025 at 10:45 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > > >
    > > > As patch 0001 has been pushed. I've rebased and created a new version
    > > > v36 with the remaining patch.
    > > >
    > >
    > > I have made a number of changes in code comments and docs. Kindly
    > > review and if you are okay with these then include them in the next
    > > version.
    > >
    >
    > I have included these changes as well as comments by Chao. Attaching
    > v37 with the changes.
    >
    
    Thanks. v37 LGTM.
    
    thanks
    Shveta
    
    
    
    
  142. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Amit Kapila <amit.kapila16@gmail.com> — 2025-12-15T11:06:20Z

    On Fri, Dec 12, 2025 at 8:53 AM shveta malik <shveta.malik@gmail.com> wrote:
    >
    > On Fri, Dec 12, 2025 at 5:35 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > >
    > >
    > > I have included these changes as well as comments by Chao. Attaching
    > > v37 with the changes.
    > >
    >
    > Thanks. v37 LGTM.
    >
    
    Pushed.
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  143. RE: Improve pg_sync_replication_slots() to wait for primary to advance

    Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> — 2025-12-17T10:28:28Z

    On Monday, December 15, 2025 7:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    > 
    > On Fri, Dec 12, 2025 at 8:53 AM shveta malik <shveta.malik@gmail.com>
    > wrote:
    > >
    > > On Fri, Dec 12, 2025 at 5:35 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > > >
    > > >
    > > > I have included these changes as well as comments by Chao. Attaching
    > > > v37 with the changes.
    > > >
    > >
    > > Thanks. v37 LGTM.
    > >
    > 
    > Pushed.
    
    My college reported a related BF failure[1] to me off-list.
    
    After analyzing, I think the issue is that the newly added test in
    040_standby_failover_slots_sync synchronizes a replication slot to the standby
    server without configuring synchronized_standby_slots. This omission allows
    logical failover slots to advance beyond the designated physical replication
    slot, resulting in intermittent synchronization failures.
    
    I confirmed the same from the log where the slotsync failed due to the
    reason mentioned above:
    
    --
    2025-12-15 12:30:33.502 CET [3015371][client backend][1/2:0] ERROR:  skipping slot synchronization because the received slot sync LSN 0/06017C90 for slot "lsub1_slot" is ahead of the standby position 0/06017C58
    2025-12-15 12:30:33.502 CET [3015371][client backend][1/2:0] STATEMENT:  SELECT pg_sync_replication_slots();
    --
    
    Here is a small patch to fix it.
    
    [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2025-12-15%2011%3A25%3A38
    
    Best Regards,
    Hou zj
    
  144. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Andres Freund <andres@anarazel.de> — 2025-12-17T17:09:25Z

    Hi,
    
    On 2025-12-17 10:28:28 +0000, Zhijie Hou (Fujitsu) wrote:
    > On Monday, December 15, 2025 7:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    > > 
    > > On Fri, Dec 12, 2025 at 8:53 AM shveta malik <shveta.malik@gmail.com>
    > > wrote:
    > > >
    > > > On Fri, Dec 12, 2025 at 5:35 AM Ajin Cherian <itsajin@gmail.com> wrote:
    > > > >
    > > > >
    > > > > I have included these changes as well as comments by Chao. Attaching
    > > > > v37 with the changes.
    > > > >
    > > >
    > > > Thanks. v37 LGTM.
    > > >
    > > 
    > > Pushed.
    > 
    > My college reported a related BF failure[1] to me off-list.
    
    FWIW, this also fails semi-regularly in CI. E.g.
    https://cirrus-ci.com/task/6281872222715904
    https://cirrus-ci.com/task/5243530626465792
    
    Greetings,
    
    Andres Freund
    
    
    
    
  145. Re: Improve pg_sync_replication_slots() to wait for primary to advance

    Amit Kapila <amit.kapila16@gmail.com> — 2025-12-18T11:39:37Z

    On Wed, Dec 17, 2025 at 3:58 PM Zhijie Hou (Fujitsu)
    <houzj.fnst@fujitsu.com> wrote:
    >
    > Here is a small patch to fix it.
    >
    
    Thanks, I've pushed the patch. BTW, looking at the code of slot_sync
    API code path, I could think of the following improvements.
    
    *
    if (remote_slot->confirmed_lsn > latestFlushPtr)
    {
    update_slotsync_skip_stats(SS_SKIP_WAL_NOT_FLUSHED);
    
    /*
    * Can get here only if GUC 'synchronized_standby_slots' on the
    * primary server was not configured correctly.
    */
    ereport(AmLogicalSlotSyncWorkerProcess() ? LOG : ERROR,
    errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
    
    Can we change this ERROR to LOG even for API as now the API also
    retires to sync the slots during initial sync?
    
    * The use of the slot_persistence_pending flag in the internal APIs
    seems to be the reverse of what it should be. I mean to say that
    initially it should be true and when we actually persist the slot then
    we can set it to false.
    
    * We can retry to sync all the slots present in the primary at the
    start of API, not only temporary slots. If we do this then the
    previous point may not be required. Also, please mention something
    like: "It retries cyclically until all the failover slots that existed
    on primary at the start of the function call are synchronized." in the
    function description [1] as well.
    
    [1] - https://www.postgresql.org/docs/devel/functions-admin.html#FUNCTIONS-REPLICATION
    -- 
    With Regards,
    Amit Kapila.