Re: Orphaned records in pg_replication_origin_status after subscription drop
Amit Kapila <amit.kapila16@gmail.com>
From: Amit Kapila <amit.kapila16@gmail.com>
To: Michael Paquier <michael@paquier.xyz>
Cc: Postgres hackers <pgsql-hackers@lists.postgresql.org>
Date: 2025-12-19T10:18:21Z
Lists: pgsql-hackers
Commits
Same data as JSON:
GET /api/v1/messages/:b64id/commits
the thread's linked commits as JSON, with link sources.
API reference →
-
Update comments to reflect changes in 8e0d32a4a1.
- 98e8fe57c24f 19 (unreleased) landed
- 214c17bd623e 18.2 landed
- a1cdb8120110 17.8 landed
-
Fix orphaned origin in shared memory after DROP SUBSCRIPTION
- 25a04aa52b52 14.21 landed
- 90d1beef62de 15.16 landed
- e22e9ab0cd45 16.12 landed
- e063ccc722e4 17.8 landed
- b07c326192d0 18.2 landed
- 8e0d32a4a1bf 19 (unreleased) landed
On Fri, Dec 19, 2025 at 10:42 AM Michael Paquier <michael@paquier.xyz> wrote: > > Some colleagues have reported that it is possible to finish with > orphaned records in pg_replication_origin_status, as an effect of > table synchronization workers that miss some cleanup actions around > replorigin_drop_by_name() when a subscription that spawned these > workers (which themselves create origins through replorigin_create() > called in LogicalRepSyncTableStart()) is dropped. > > Please find attached a small script, courtesy of Tenglong Gu and > Daisuke Higuchi, that have been digging into all that. > > This problem is true since ce0fdbfe9722 that has added the creation of > a replication origin for table synchronization workers, as of 14 up to > HEAD. One issue with these origin states is that they are sticky: a > restart of the node that held the subscription keeps them around due > to them getting flushed in replorigin_checkpoint. If my understanding > is right, this also means that origin slots are still tracked as in > use, implying that replorigin_session_setup() would not be able to > reuse them as far as I understand. > > So, in summary (please correct me if necessary), DropSubscription() > attempts a round of cleanup for all the replication origins with a > loop over ReplicationOriginNameForLogicalRep(), and while in this case > we attempt to drop the origins created by the workers that are tracked > as orphaned, the syscache lookup of REPLORIGIDENT fails within the > DROP SUBSCRIPTION because replorigin_by_name() cannot find an entry: > the replication origin is already gone, but its state has persisted in > memory. Then, why would the replication origin be already gone with > its state in memory not cleaned up yet? Well, the problematic test > case shows that the subscription is dropped while the spawned > tablesync workers do the initial table copy, and the replication > origin is created in the same transaction as the COPY. DROP > SUBSCRIPTION tells the tablesync workers to stop, they stop with the > COPY failing and the origin is not seen as something that exists for > the session that drops the subscription. The replication state in > memory goes out of sync. > > So it looks like to me that we are missing a code path where the > replication origin is dropped but we just ignore to reset the > replication state in memory, leading to bloat of the origin slots > available. worker.c does things differently: a small transaction is > used for the origin creation, hence we would never really see the > replication state memory going out-of-sync with the replorigin > catalog. > > One solution would be for tablesync.c to do the same as worker.c: we > could use a short transaction that sets the memory state and creates > the replication origin, then move to a second transaction for the > COPY. A second solution, slightly more complicated, is to create some > specific logic to reset the progress state of the origin that's been > created in the transaction that runs the initial COPY, which I guess > should be in the shape of a transaction abort callback. All these > symptoms are pointing out that it is not a good idea, IMO, to expect a > replication origin to exist when dropping a subscription when an > origin has been created in the context of a transaction that can take > a very long time to run, aka the initial tablesync's COPY, so using a > short transaction feels like a good thing to do here. > Yeah, either of these ways are okay. The only minor point in creating replication origin in a separate transaction is that the other part of operations on origin still need be done at the current place, so the origin handling will look a bit separated. So, I have a slight preference towards PG_ENSURE_ERROR_CLEANUP solution where the callback should clear origin state via replorigin_state_clear. -- With Regards, Amit Kapila.