Re: Orphaned records in pg_replication_origin_status after subscription drop

Amit Kapila <amit.kapila16@gmail.com>

From: Amit Kapila <amit.kapila16@gmail.com>

To: Michael Paquier <michael@paquier.xyz>

Cc: Postgres hackers <pgsql-hackers@lists.postgresql.org>

Date: 2025-12-19T10:18:21Z

Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

Update comments to reflect changes in 8e0d32a4a1.
- 98e8fe57c24f 19 (unreleased) landed
- 214c17bd623e 18.2 landed
- a1cdb8120110 17.8 landed
Fix orphaned origin in shared memory after DROP SUBSCRIPTION
- 25a04aa52b52 14.21 landed
- 90d1beef62de 15.16 landed
- e22e9ab0cd45 16.12 landed
- e063ccc722e4 17.8 landed
- b07c326192d0 18.2 landed
- 8e0d32a4a1bf 19 (unreleased) landed

On Fri, Dec 19, 2025 at 10:42 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> Some colleagues have reported that it is possible to finish with
> orphaned records in pg_replication_origin_status, as an effect of
> table synchronization workers that miss some cleanup actions around
> replorigin_drop_by_name() when a subscription that spawned these
> workers (which themselves create origins through replorigin_create()
> called in LogicalRepSyncTableStart()) is dropped.
>
> Please find attached a small script, courtesy of Tenglong Gu and
> Daisuke Higuchi, that have been digging into all that.
>
> This problem is true since ce0fdbfe9722 that has added the creation of
> a replication origin for table synchronization workers, as of 14 up to
> HEAD.  One issue with these origin states is that they are sticky: a
> restart of the node that held the subscription keeps them around due
> to them getting flushed in replorigin_checkpoint.  If my understanding
> is right, this also means that origin slots are still tracked as in
> use, implying that replorigin_session_setup() would not be able to
> reuse them as far as I understand.
>
> So, in summary (please correct me if necessary), DropSubscription()
> attempts a round of cleanup for all the replication origins with a
> loop over ReplicationOriginNameForLogicalRep(), and while in this case
> we attempt to drop the origins created by the workers that are tracked
> as orphaned, the syscache lookup of REPLORIGIDENT fails within the
> DROP SUBSCRIPTION because replorigin_by_name() cannot find an entry:
> the replication origin is already gone, but its state has persisted in
> memory.  Then, why would the replication origin be already gone with
> its state in memory not cleaned up yet?  Well, the problematic test
> case shows that the subscription is dropped while the spawned
> tablesync workers do the initial table copy, and the replication
> origin is created in the same transaction as the COPY.  DROP
> SUBSCRIPTION tells the tablesync workers to stop, they stop with the
> COPY failing and the origin is not seen as something that exists for
> the session that drops the subscription.  The replication state in
> memory goes out of sync.
>
> So it looks like to me that we are missing a code path where the
> replication origin is dropped but we just ignore to reset the
> replication state in memory, leading to bloat of the origin slots
> available.  worker.c does things differently: a small transaction is
> used for the origin creation, hence we would never really see the
> replication state memory going out-of-sync with the replorigin
> catalog.
>
> One solution would be for tablesync.c to do the same as worker.c: we
> could use a short transaction that sets the memory state and creates
> the replication origin, then move to a second transaction for the
> COPY.  A second solution, slightly more complicated, is to create some
> specific logic to reset the progress state of the origin that's been
> created in the transaction that runs the initial COPY, which I guess
> should be in the shape of a transaction abort callback.  All these
> symptoms are pointing out that it is not a good idea, IMO, to expect a
> replication origin to exist when dropping a subscription when an
> origin has been created in the context of a transaction that can take
> a very long time to run, aka the initial tablesync's COPY, so using a
> short transaction feels like a good thing to do here.
>

Yeah, either of these ways are okay. The only minor point in creating
replication origin in a separate transaction is that the other part of
operations on origin still need be done at the current place, so the
origin handling will look a bit separated. So, I have a slight
preference towards PG_ENSURE_ERROR_CLEANUP solution where the callback
should clear origin state via replorigin_state_clear.

-- 
With Regards,
Amit Kapila.