Thread
-
Re: Improve conflict detection when replication origins are reused
shveta malik <shveta.malik@gmail.com> — 2026-05-15T09:56:58Z
On Fri, May 15, 2026 at 8:56 AM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Thu, May 14, 2026 at 8:35 AM Nisha Moond <nisha.moond412@gmail.com> wrote:
> >
> > Hi hackers,
> >
> > While reviewing the issue reported at [1] and the proposed solutions
> > at [2], I noticed a related problem: false negative conflict detection
> > when a 'ReplOriginId' gets reused.
> >
> > In logical replication, conflict detection relies on the tuple’s
> > replication origin ('roident'). The problem is that if a subscription
> > is dropped and a new subscription later reuses the same origin ID, the
> > apply worker may incorrectly treat incoming changes as “its own”
> > changes and skip conflict detection.
> >
> > A simple example:
> > 1. Create subscription 'sub1' with 'roident = 1'
> > 2. Replicate some rows into table 't1'
> > 3. Drop 'sub1'
> > 4. Create another subscription 'sub2'
> > 5. 'sub2' reuses 'roident = 1'
> > 6. New updates arrive for rows previously written by 'sub1'
> > At this point, conflict detection sees:
> >     tuple_origin == current_origin
> >
> > and incorrectly assumes the row was written by the current
> > subscription instance, so no 'update_origin_differ' conflict is
> > raised.
>
> I agree with the problem statement. I will prioritize the review soon.
>
> > This may look harmless in this simple setup, but it becomes
> > problematic if the new subscription is connected to a different
> > publisher, because real conflicts can then be silently missed.
> >
> > I explored two possible approaches to solve this:
> >
> > Approach 1. Zero out old origin IDs in commit_ts data when dropping a
> > subscription
> > ----------------------
> > - When a subscription is dropped and its replication origin becomes
> > free, scan all 'commit_ts' SLRU entries and replace that old origin ID
> > with 'InvalidRepOriginId (0)'.
> > - So rows previously written by the old subscription would no longer
> > appear to belong to any active replication origin.
> > - A new subscription reusing the same 'roident' will always conflict
> > with origin '0'.
> >
> > Pros:
> > - Fixes the stale-origin problem completely and may also help solve
> > the tablesync-origin issue discussed in [1]
> > - No additional checks needed during conflict detection
> >
> > Cons:
> > - Requires scanning the entire 'commit_ts' SLRU during DROP
> > SUBSCRIPTION, so it can become very expensive on large systems
> > - Not crash-safe currently (patch):
> >   - if the server crashes midway, some entries may still contain the
> >     old origin ID
> >   - after restart, reused origins can again lead to missed conflicts
> > - Making this fully crash-safe would likely require WAL logging or
> > recovery-time reprocessing.
> >
> > Approach 2. Store replication origin creation time
> > ----------------------
> > - Add a creation timestamp for each replication origin
> > - During conflict check:
> >     if tuple_origin != current_origin
> >         -> existing behavior
> >     if tuple_origin == current_origin
> >         -> compare tuple commit timestamp with origin creation time
> >         if tuple_commit_ts <= origin_creation_time
> >             -> treat as an origin reuse case and raise conflict
> >
> > Pros:
> > -------
> > - No additional processing during DROP SUBSCRIPTION
> > - Lightweight runtime check (just one timestamp comparison)
> > - Naturally crash-safe since origin creation is WAL-logged already
> >
> > Cons:
> > - Requires a catalog schema change
> > - The <= comparison can produce false-positive conflicts for rows
> > committed at the exact same microsecond as origin creation
> > - May require additional handling for upgraded origins
> >
> > IMO, the second approach currently looks more practical because it
> > avoids the heavy SLRU scan and crash-recovery complexity.
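
For reference, the Approach 2 check quoted above can be sketched as a toy Python model. This is not the patch's code: the function name `origin_differs`, the use of plain integers as microsecond timestamps, and collapsing "existing behavior" into "raise the conflict" are all assumptions made for illustration.

```python
def origin_differs(tuple_origin: int, tuple_commit_ts: int,
                   current_origin: int, origin_creation_ts: int) -> bool:
    """Toy model: return True when update_origin_differ would be raised."""
    if tuple_origin != current_origin:
        # Existing behavior, simplified here to "a different origin wrote it".
        return True
    # Same origin ID: if the row was committed at or before this origin
    # instance was created, treat it as an origin-reuse case.
    return tuple_commit_ts <= origin_creation_ts


# sub1 (roident 1) wrote the row at t=200; sub2 reused roident 1 at t=500.
reused = origin_differs(1, 200, 1, 500)
# Row written by the current origin instance after it was created.
own_row = origin_differs(1, 600, 1, 500)
# Edge case from the Cons list: commit in the same microsecond as creation.
same_instant = origin_differs(1, 500, 1, 500)
print(reused, own_row, same_instant)  # True False True
```

The third call shows the false positive listed under Cons: with `<=`, a row committed in the same microsecond as origin creation is reported as a conflict even if the current origin wrote it.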
> >
> > Attached:
> > - Patch for approach 1
> > - Patch for approach 2
> > - A TAP test reproducing the issue
> >
> > Note: The patches are manually tested for the reported issue, but not
> > yet tested for performance or additional edge cases.
> >
> > Feedback and suggestions are welcome.
> >
> > [1] https://www.postgresql.org/message-id/CALDaNm3Y6Y4Mub6QC8fZKnNy5jZspELQYCoQF_FL2Zwzweu%3Dog%40mail.gmail.com
> > [2] https://www.postgresql.org/message-id/CAA4eK1LxGXR7jOAKh0B8N362S-Q3b6GhBxxcV_HxUaicEPq5Cg%40mail.gmail.com
> >
> > --

Nisha, I think we will get the same problem in another scenario too:

create pub1-server1
create pub1-server2
create sub1-server3; subscribing to pub1-server1

--On both server1 and server2, insert the same set of rows:
insert into tab1 values (10), (20), (30);

Sub1 (server3) will get the rows from server1.

Now alter sub1 to connect to server2 (you will have to create the slot
manually on server2):
SELECT pg_create_logical_replication_slot('sub1', 'pgoutput', false, false, false);

--Now perform the update on server2:
update tab1 set i=11 where i=10;

The subscriber on server3 will receive the update from server2 and will
update the row originally inserted by server1 without raising
update_origin_differ.

Can you please confirm if my understanding of the problem statement is
correct and if the scenario above will also result in a similar
situation? IIUC, in such a case, the proposed solutions may not work
directly and will need to be further evolved. I will think more once
you confirm my understanding.

thanks
Shveta
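
To make the concern concrete, here is a small Python model of the scenario above. It is only a sketch under stated assumptions: the `Origin` class, its `created_at` field, and the integer timestamps are hypothetical, and it assumes that changing the subscription's connection keeps the existing replication origin (same 'roident', same creation time) rather than recreating it.

```python
from dataclasses import dataclass


@dataclass
class Origin:
    roident: int
    created_at: int   # hypothetical creation timestamp from Approach 2
    publisher: str    # server the subscription currently points at


def approach2_conflict(tuple_origin: int, tuple_commit_ts: int,
                       origin: Origin) -> bool:
    """Toy version of the Approach 2 check (not the patch's code)."""
    if tuple_origin != origin.roident:
        return True
    return tuple_commit_ts <= origin.created_at


# t=100: sub1 is created on server3, subscribing to server1.
sub1 = Origin(roident=1, created_at=100, publisher="server1")

# t=200: row i=10 arrives from server1 and is stamped with roident 1.
row = {"i": 10, "origin": sub1.roident, "commit_ts": 200, "source": "server1"}

# t=300: sub1 is altered to connect to server2.
# Assumption: the origin is kept, so roident and created_at are unchanged,
# and no DROP SUBSCRIPTION runs, so Approach 1 never rewrites commit_ts.
sub1.publisher = "server2"

# t=400: the UPDATE from server2 reaches the row inserted by server1.
conflict_raised = approach2_conflict(row["origin"], row["commit_ts"], sub1)
publisher_changed = row["source"] != sub1.publisher
print(conflict_raised, publisher_changed)  # False True
```

In this model the row really does come from a different publisher, yet neither approach fires: the origin ID matches and the row's commit timestamp is later than the origin's creation time.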