Re:RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

ocean_li_996 <ocean_li_996@163.com>

From: ocean_li_996 <ocean_li_996@163.com>
To: "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>
Cc: "'Alexander Lakhin'" <exclusion@gmail.com>, "pgsql-bugs@lists.postgresql.org" <pgsql-bugs@lists.postgresql.org>, "feichanghong@qq.com" <feichanghong@qq.com>, amit.kapila16@gmail.com
Date: 2024-03-11T16:43:01Z
Lists: pgsql-bugs

Attachments

Dear Hayato:


At 2024-03-08 15:35:16, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:
>Dear Haiyang, Alexander,
>
>Below part describes the second failure. The issue would be occurred when:
>
>1) Logical decoding starts from the middle of a sub-transaction, and
>2) NEW_CID record is initially decoded in the sub-transaction, and
>3) An arbitrary changes are decoded in the sub-transaction.

>...


## Issue 1
Thanks for your  reproducer. I have investigated this issue. The scenario that caused the issue
is indeed as you described above. I had not realized before that a serialized snapshot file from
one replication slot could be utilized by another replication slot to achieve a consistent state. 
This issue would not arise with only one replication slot, as SnapBuildXactNeedsSkip ensures that.


Using the spec test case you provided (without any fix patch), I got another Stuck trace on REL_14_STABLE :
```
#0  0x00007f93604e0277 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f93604e1968 in __GI_abort () at abort.c:90
#2  0x0000000000b084da in ExceptionalCondition (conditionName=0xcc94c8 "prev_first_lsn < cur_txn->first_lsn", errorType=0xcc92c4 "FailedAssertion", fileName=0xcc9310 "reorderbuffer.c",
    lineNumber=916) at assert.c:69
#3  0x00000000008df064 in AssertTXNLsnOrder (rb=0x2a48590) at reorderbuffer.c:916
#4  0x00000000008de9a1 in ReorderBufferTXNByXid (rb=0x2a48590, xid=756, create=true, is_new=0x0, lsn=24749416, create_as_top=true) at reorderbuffer.c:669
#5  0x00000000008e2d00 in ReorderBufferAddNewTupleCids (rb=0x2a48590, xid=756, lsn=24749416, node=..., tid=..., cmin=1, cmax=4294967295, combocid=4294967295) at reorderbuffer.c:3198
#6  0x00000000008e7ed7 in SnapBuildProcessNewCid (builder=0x2a44570, xid=757, lsn=24749416, xlrec=0x2a16d38) at snapbuild.c:823
#7  0x00000000008d14b3 in DecodeHeap2Op (ctx=0x2a324f0, buf=0x7ffd886a4bc0) at decode.c:471
#8  0x00000000008d0bfa in LogicalDecodingProcessRecord (ctx=0x2a324f0, record=0x2a328f0) at decode.c:151
#9  0x00000000008d8688 in pg_logical_slot_get_changes_guts (fcinfo=0x2a206b0, confirm=true, binary=false) at logicalfuncs.c:296
#10 0x00000000008d87bd in pg_logical_slot_get_changes (fcinfo=0x2a206b0) at logicalfuncs.c:365
```



Then, I think the simplest process would be:
```
ReorderBufferProcessXid(xid = 757, lsn)
	-> A ReorderBufferTXN for subtransaciton (757) was generated.
	   The first_lsn was the head of NEW_CID record.
	DecodeHeap2Op(info = XLOG_HEAP2_NEW_CID)
		SnapBuildProcessNewCid()	
			ReorderBufferAddNewTupleCids(..., xlrec->top_xid = 756,...)
				ReorderBufferTXNByXid(xid = 756, lsn)
					-> A ReorderBufferTXN for top-transaciton (756) was generated.
					   The first_lsn was the head of NEW_CID record. It caused an Assertion failure.
```
> I think the straightforward fix is to associate them to top-sub relationship,
> and attached patch did it.
LGFM, I think it is suitable assign subtransaction after calling ReorderBufferAddNewTupleCids().


## Issue 2
Inspired by your spec case, I've reorganized the spec case provided in [2]. The new test in attachment
is able to reproduce the issue mentioned in [1] even before commit 6b77048e5.


The approach in [3] is also LGFM.


[1] https://www.postgresql.org/message-id/18369-ad61699bf91c5bc0@postgresql.org
[2] https://www.postgresql.org/message-id/6d0e80d6.c1fc.18deeb8120a.Coremail.ocean_li_996@163.com
[3] https://www.postgresql.org/message-id/CAA4eK1KpW5pHMwMp9hfXYvOeEU5Rcbhoc7FxtBOGPgKeyYLDmA@mail.gmail.com


Best Regards,
Haiyang Li