Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Fix catalog lookup due to wrong snapshot for subtransactions during decoding.

  1. BUG #18369: logical decoding core on AssertTXNLsnOrder()

    PG Bug reporting form <noreply@postgresql.org> — 2024-02-28T07:53:30Z

    The following bug has been logged on the website:
    
    Bug reference:      18369
    Logged by:          haiyang li
    Email address:      ocean_li_996@163.com
    PostgreSQL version: 14.11
    Operating system:   centos7 5.10.84 x86_64
    Description:        
    
    When testing on logical replication module, we encountered a core dump
    issue. The stack trace from the core file is:
    
    ##
    [New LWP 113877]
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib64/libthread_db.so.1".
    Core was generated by `postgres(5432): normal_user dml_full0
    11.164.97.22[37210]SELECT               ".
    Program terminated with signal SIGABRT, Aborted.
    #0  0x00007fe2074a9277 in raise () from /lib64/libc.so.6
    #0  0x00007fe2074a9277 in raise () from /lib64/libc.so.6
    #1  0x00007fe2074aa968 in abort () from /lib64/libc.so.6
    #2  0x00000000010f8d67 in ExceptionalCondition (conditionName=0x17edb50
    "!(prev_first_lsn < cur_txn->first_lsn)", errorType=0x17ed93c
    "FailedAssertion", fileName=0x17ed990 "reorderbuffer.c", lineNumber=762) at
    assert.c:46
    #3  0x0000000000e6145c in AssertTXNLsnOrder (rb=0x4558060) at
    reorderbuffer.c:762
    #4  0x0000000000e60ead in ReorderBufferTXNByXid (rb=0x4558060, xid=19937,
    create=true, is_new=0x0, lsn=12640708096, create_as_top=true) at
    reorderbuffer.c:610
    #5  0x0000000000e6415c in ReorderBufferXidSetCatalogChanges (rb=0x4558060,
    xid=19937, lsn=12640708096) at reorderbuffer.c:2298
    #6  0x0000000000e6bb4b in SnapBuildXidSetCatalogChanges (builder=0x456e160,
    xid=19933, subxcnt=17, subxacts=0x44e08d8, lsn=12640708096) at
    snapbuild.c:2172
    #7  0x0000000000e54fb3 in DecodeCommit (ctx=0x452bc10, buf=0x7ffe4f03af70,
    parsed=0x7ffe4f03ae20, xid=19933) at decode.c:631
    #8  0x0000000000e54556 in DecodeXactOp (ctx=0x452bc10, buf=0x7ffe4f03af70)
    at decode.c:268
    #9  0x0000000000e54124 in LogicalDecodingProcessRecord (ctx=0x452bc10,
    record=0x452bed0) at decode.c:120
    #10 0x0000000000e5adc6 in pg_logical_slot_get_changes_guts
    (fcinfo=0x7ffe4f03b2d0, confirm=true, binary=false) at logicalfuncs.c:329
    #11 0x0000000000e5af76 in pg_logical_slot_get_changes
    (fcinfo=0x7ffe4f03b2d0) at logicalfuncs.c:393
    ...
    #33 0x0000000000f1c0b9 in exec_simple_query (query_string=0x425bb80 "SELECT
    * FROM pg_logical_slot_get_changes("test_logical_decode_slot_0", NULL,
    NULL)") at postgres.c:1570
    ...
    (gdb) f 3
    #3  0x0000000000e6145c in AssertTXNLsnOrder (rb=0x4558060) at
    reorderbuffer.c:762
    (gdb) p /x MyReplicationSlot->data.restart_lsn
    $1 = 0x2f171b8a8
    (gdb) p /x cur_txn->first_lsn
    $2 = 0x2f171e600
    (gdb) p NInitialRunningXacts
    $3 = 1
    (gdb) p *InitialRunningXacts
    $4 = 19933
    ##
    
    As indicated, the problem occurred at the AssertTXNLsnOrder function.
    Moreover, this issue occurred when pg_logical_slot_get_changes function was
    called again because NInitialRunningXacts != 0. 
    
    1) The WAL records from restart_lsn to the corresponding lsn when the issue
    occurred,
    2) personal analysis of the problem,
    3) the steps to reproduce the issue,
    4) personal proposed solution
    will be posted later under this thread.
    
    
  2. Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

    ocean_li_996 <ocean_li_996@163.com> — 2024-02-28T07:57:37Z

    At 2024-02-28 15:53:30, "PG Bug reporting form" <noreply@postgresql.org> wrote:
    
    
    >1) The WAL records from restart_lsn to the corresponding lsn when the issue
    >occurred,
    >2) personal analysis of the problem,
    >3) the steps to reproduce the issue,
    >4) personal proposed solution
    >will be posted later under this thread.
    
    >
    
    
    1) The WAL records from restart_lsn to the corresponding lsn when the issue occurred is supported in attachment file 1.
    
    
    2) As indicated in 1), some invalidation messages are generated in 19933 top xact. After the decoding restarted, the invalidation messages will make 19933 top xact and its subtransaction(s) to be marked as containing catalog change while processing its commit record(see SnapBuildXidSetCatalogChanges() ). In this step, the corresponding subxacts which never procedded before are added into ReorderBuffer with the same first_lsn as top-level xact.  Then, the check in AssertTXNLsnOrder() will failed if  the number of subxact mentioned above more than 1.
    
    
    3) The patch to reproduce the issue is supported in attachment file 2.  DML on temporary table can consume xid and not log any WAL RECORD except it's the firtst subxact of top xact(log ASSIGNMENT record). So we use DML on temporary table to generate two "never procedded before" sunxacts in on top xact. 
    
    
    4) Since it is already known to be a subxact before being add into ReorderBuffer, I think an appropriate fix is extending the ReorderBufferXidSetCatalogChanges function with an is_top parameter to indicate whether the xact is a top-level xact. 
    For a subxact, it would not be added to the toplevel_by_lsn list and would not undergo the AssertTXNLsnOrder check. Of course, it is necessary to introduce a check to verify whether a node is in the list when attempting to remove a node from toplevel_by_lsn.  
    The specific fix patch is provided in Attachment 3.
    
    
    Thanks
    Haiyang Li
  3. Re:Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

    ocean_li_996 <ocean_li_996@163.com> — 2024-02-28T08:20:17Z

    This issue exists in PG 12 -15. 
    
    
    
    
    
    
    
    At 2024-02-28 15:57:37, "ocean_li_996" <ocean_li_996@163.com> wrote:
    
    At 2024-02-28 15:53:30, "PG Bug reporting form" <noreply@postgresql.org> wrote:
    
    >1) The WAL records from restart_lsn to the corresponding lsn when the issue
    >occurred,
    >2) personal analysis of the problem,
    >3) the steps to reproduce the issue,
    >4) personal proposed solution
    >will be posted later under this thread.
    
    >
    
    
    1) The WAL records from restart_lsn to the corresponding lsn when the issue occurred is supported in attachment file 1.
    
    
    2) As indicated in 1), some invalidation messages are generated in 19933 top xact. After the decoding restarted, the invalidation messages will make 19933 top xact and its subtransaction(s) to be marked as containing catalog change while processing its commit record(see SnapBuildXidSetCatalogChanges() ). In this step, the corresponding subxacts which never procedded before are added into ReorderBuffer with the same first_lsn as top-level xact.  Then, the check in AssertTXNLsnOrder() will failed if  the number of subxact mentioned above more than 1.
    
    
    3) The patch to reproduce the issue is supported in attachment file 2.  DML on temporary table can consume xid and not log any WAL RECORD except it's the firtst subxact of top xact(log ASSIGNMENT record). So we use DML on temporary table to generate two "never procedded before" sunxacts in on top xact. 
    
    
    4) Since it is already known to be a subxact before being add into ReorderBuffer, I think an appropriate fix is extending the ReorderBufferXidSetCatalogChanges function with an is_top parameter to indicate whether the xact is a top-level xact. 
    For a subxact, it would not be added to the toplevel_by_lsn list and would not undergo the AssertTXNLsnOrder check. Of course, it is necessary to introduce a check to verify whether a node is in the list when attempting to remove a node from toplevel_by_lsn.  
    The specific fix patch is provided in Attachment 3.
    
    
    Thanks
    Haiyang Li
  4. Re: BUG #18369: logical decoding core on AssertTXNLsnOrder()

    Alexander Lakhin <exclusion@gmail.com> — 2024-02-28T09:00:00Z

    Hello Haiyang Li,
    
    28.02.2024 11:20, ocean_li_996 wrote:
    >
    > This issue exists in PG 12 -15.
    >
    > At 2024-02-28 15:57:37, "ocean_li_996" <ocean_li_996@163.com> wrote:
    >
    >     At 2024-02-28 15:53:30, "PG Bug reporting form" <noreply@postgresql.org> wrote:
    >
    >     >1) The WAL records from restart_lsn to the corresponding lsn when the issue
    >     >occurred,
    >     >2) personal analysis of the problem,
    >     >3) the steps to reproduce the issue,
    >     >4) personal proposed solution
    >     >will be posted later under this thread.
    >     >
    >
    
    Please see the similar issue discussed last year:
    https://www.postgresql.org/message-id/f158d9ca-2057-2836-a522-0b1278be5a53%40gmail.com
    
    With your patch applied (on REL_14_STABLE) I still get:
    TRAP: FailedAssertion("builder->next_phase_at == InvalidTransactionId", File: "snapbuild.c", Line: 1623, PID: 92772)
    
    when running make -s installcheck-force -C contrib/test_decoding...
    as specified in that message.
    (You might need to disable REGRESS tests in the Makefile to reach
    isolation tests.)
    
    Best regards,
    Alexander
  5. Re:Re: BUG #18369: logical decoding core on AssertTXNLsnOrder()

    ocean_li_996 <ocean_li_996@163.com> — 2024-02-29T02:25:37Z

    Hi Alexander,
    
    
    At 2024-02-28 17:00:00, "Alexander Lakhin" <exclusion@gmail.com> wrote:
    > Please see the similar issue discussed last year:
    > https://www.postgresql.org/message-id/f158d9ca-2057-2836-a522-0b1278be5a53%40gmail.com
    Well, I have to say that the whole thread is a bit long. AFAIC, the two issues exhibited the same symptoms, but they occured in different scenarios. The patch I provided may not solve the problem you're referring to.
    
    
    > With your patch applied (on REL_14_STABLE) I still get:
    > TRAP: FailedAssertion("builder->next_phase_at == InvalidTransactionId", File: "snapbuild.c", Line: 1623, PID: 92772)
    >
    > when running make -s installcheck-force -C contrib/test_decoding...
    > as specified in that message.
    > (You might need to disable REGRESS tests in the Makefile to reach
    > isolation tests.)
    I'm not sure if I fully get your mind. I disabled the REGRESS tests in the test_decoding/Makefile and then manually run the command "make -s installcheck-force -C contrib/test_decoding" a few times. I didn't encounter the issue you mentioned. Is this problem consistently reproducible in your environment? BTW, the  issue mentioned in [1] is more similar to your problem. And the patch was not applied to v14.  Maybe it is another issue.
    
    
    [1]: https://www.postgresql.org/message-id/7e4d4a80-3e3c-231f-f886-6cada2aa582b@gmail.com
    
    
    Thanks
    Haiyang Li
  6. Re: BUG #18369: logical decoding core on AssertTXNLsnOrder()

    Alexander Lakhin <exclusion@gmail.com> — 2024-02-29T10:00:00Z

    Hi Haiyang,
    
    29.02.2024 05:25, ocean_li_996 wrote:
    > > With your patch applied (on REL_14_STABLE) I still get:
    > > TRAP: FailedAssertion("builder->next_phase_at == InvalidTransactionId", File: "snapbuild.c", Line: 1623, PID: 92772)
    > >
    > > when running make -s installcheck-force -C contrib/test_decoding...
    > > as specified in that message.
    > > (You might need to disable REGRESS tests in the Makefile to reach
    > > isolation tests.)
    > I'm not sure if I fully get your mind. I disabled the REGRESS tests in the test_decoding/Makefile and then manually 
    > run the command "make -s installcheck-force -C contrib/test_decoding" a few times. I didn't encounter the issue you 
    > mentioned. Is this problem consistently reproducible in your environment? BTW, the  issue mentioned in [1] is more 
    > similar to your problem. And the patch was not applied to v14.  Maybe it is another issue.
    >
    
    You can try the following script (assuming a server with the test_decoding
    module installed is running):
    rm -rf contrib/test_decoding_*
    numclients=5
    for ((c=1;c<=numclients;c++)); do
        cp -r contrib/test_decoding contrib/test_decoding_$c
        sed "s/REGRESS = /# REGRESS =/" -i contrib/test_decoding_$c/Makefile
        sed "s/isolation_slot/isolation_slot_$c/" -i contrib/test_decoding_$c/specs/catalog_change_snapshot.spec # Use 
    independent slots
        sed "$(printf '$p; %.0s' `seq 50`)" -i contrib/test_decoding_$c/specs/catalog_change_snapshot.spec # Repeat the last 
    permutation 50 times
    done
    for ((c=1;c<=numclients;c++)); do
        EXTRA_REGRESS_OPTS="--dbname=regress_$c" make -s installcheck-force -C contrib/test_decoding_$c USE_MODULE_DB=1 
     >"installcheck-$c.log" 2>&1 &
    done
    wait
    
    Though that's really not directly related to the current issue (sorry for
    the wrong direction, my point was that there are still living bugs in this
    area).
    
    I've found that your added test case fails on REL_15_STABLE starting from
    b793a416b (6b77048e5 on REL_14_STABLE), so it looks like this is a new
    defect introduced in REL_14_STABLE, REL_15_STABLE recently with the fix for
    bug #18280.
    
    As to REL_13_STABLE/REL_12_STABLE the failure reproduced starting from
    commits 38dbaaf27/02600886c, a result of the aforementioned discussion:
    https://www.postgresql.org/message-id/CAA4eK1Lx%3Dg09z2k9Teq9ca1eRzfpfxJwFdjyHNwgEKv69KWhrQ%40mail.gmail.com
    
    Best regards,
    Alexander
  7. Re:Re: BUG #18369: logical decoding core on AssertTXNLsnOrder()

    ocean_li_996 <ocean_li_996@163.com> — 2024-02-29T16:44:29Z

    At 2024-02-29 18:00:00, "Alexander Lakhin" <exclusion@gmail.com> wrote:
    
    
    > You can try the following script (assuming a server with the test_decoding
    > module installed is running):
    > rm -rf contrib/test_decoding_*
    > numclients=5
    > for ((c=1;c<=numclients;c++)); do
    >    cp -r contrib/test_decoding contrib/test_decoding_$c
    >    sed "s/REGRESS = /# REGRESS =/" -i contrib/test_decoding_$c/Makefile
    >    sed "s/isolation_slot/isolation_slot_$c/" -i contrib/test_decoding_$c/specs/catalog_change_snapshot.spec # Use independent slots
    >    sed "$(printf '$p; %.0s' `seq 50`)" -i contrib/test_decoding_$c/specs/catalog_change_snapshot.spec # Repeat the last permutation 50 times
    > done
    > for ((c=1;c<=numclients;c++)); do
    >    EXTRA_REGRESS_OPTS="--dbname=regress_$c" make -s installcheck-force -C contrib/test_decoding_$c USE_MODULE_DB=1  >"installcheck-$c.log" 2>&1 &
    > done
    > wait
    Thanks! Before and after applying the changes on REL_14_STABLE, I executed the script (with numclients = 50) four times, respectively. 
    Unfortunately, I wasn't able to replicate the issue you mentioned.
    
    
    > Though that's really not directly related to the current issue (sorry for
    > the wrong direction, my point was that there are still living bugs in this
    > area).
    Got it! I concur with your statement. OTOH, there is no evidence to indicate that the issue is a result of
    v1-0002-Fix-Coredump-On-AssertTXNLsnOrder.patch. 
    
    
    > I've found that your added test case fails on REL_15_STABLE starting from
    > b793a416b (6b77048e5 on REL_14_STABLE), so it looks like this is a new
    > defect introduced in REL_14_STABLE, REL_15_STABLE recently with the fix for
    > bug #18280.
    Oops, I forgot to mention this information in the email. Indeed, the test I provided couldn't reproduce the issue before fixing bug #18280 
    While I haven't tested it, I belive that we can get another reproducing test with a little more complexity (such as needing two transactions
    in sequence).
    
    
    > As to REL_13_STABLE/REL_12_STABLE the failure reproduced starting from
    > commits 38dbaaf27/02600886c, a result of the aforementioned discussion:
    > https://www.postgresql.org/message-id/CAA4eK1Lx%3Dg09z2k9Teq9ca1eRzfpfxJwFdjyHNwgEKv69KWhrQ%40mail.gmail.com
    Indeed.
    
    
    Back to the issue in this thread, are there any suggestions or discussion on the fix patch? 
    
    
    Best Regards
    Haiyang Li.
  8. RE: BUG #18369: logical decoding core on AssertTXNLsnOrder()

    Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> — 2024-03-04T12:52:22Z

    Dear Alexander,
    
    I also ran your script after applying patches on REL_14_STABLE, but I got another failure.
    I have not analyzed yet so I share the result as-is.
    Was I missing something?
    
    ```
    #0  0x00007f4f033de387 in raise () from /lib64/libc.so.6
    #1  0x00007f4f033dfa78 in abort () from /lib64/libc.so.6
    #2  0x0000000000b08152 in ExceptionalCondition (conditionName=0xcc8228 "prev_first_lsn < cur_txn->first_lsn",
        errorType=0xcc8024 "FailedAssertion", fileName=0xcc8070 "reorderbuffer.c", lineNumber=916) at assert.c:69
    #3  0x00000000008e01ad in AssertTXNLsnOrder (rb=0x19a95c0) at reorderbuffer.c:916
    #4  0x00000000008dfae8 in ReorderBufferTXNByXid (rb=0x19a95c0, xid=1753, create=true, is_new=0x0, lsn=37430336,
        create_as_top=true) at reorderbuffer.c:669
    #5  0x00000000008e3e04 in ReorderBufferAddNewTupleCids (rb=0x19a95c0, xid=1753, lsn=37430336, node=..., tid=..., cmin=1,
        cmax=4294967295, combocid=4294967295) at reorderbuffer.c:3200
    #6  0x00000000008e9018 in SnapBuildProcessNewCid (builder=0x19af5f0, xid=1754, lsn=37430336, xlrec=0x197db38)
        at snapbuild.c:823
    #7  0x00000000008d2616 in DecodeHeap2Op (ctx=0x1999550, buf=0x7ffcc39d7310) at decode.c:470
    #8  0x00000000008d1df5 in LogicalDecodingProcessRecord (ctx=0x1999550, record=0x1999910) at decode.c:150
    #9  0x00000000008d9782 in pg_logical_slot_get_changes_guts (fcinfo=0x19874b0, confirm=true, binary=false)
        at logicalfuncs.c:296
    #10 0x00000000008d98b7 in pg_logical_slot_get_changes (fcinfo=0x19874b0) at logicalfuncs.c:365
    #11 0x0000000000740001 in ExecMakeTableFunctionResult (setexpr=0x1985a38, econtext=0x19858f0, argContext=0x1987390,
        expectedDesc=0x19ae898, randomAccess=false) at execSRF.c:234
    #12 0x000000000075be2b in FunctionNext (node=0x19856d8) at nodeFunctionscan.c:95
    #13 0x000000000074184a in ExecScanFetch (node=0x19856d8, accessMtd=0x75bd7a <FunctionNext>,
        recheckMtd=0x75c175 <FunctionRecheck>) at execScan.c:132
    #14 0x00000000007418eb in ExecScan (node=0x19856d8, accessMtd=0x75bd7a <FunctionNext>, recheckMtd=0x75c175 <FunctionRecheck>)
        at execScan.c:198
    #15 0x000000000075c1bf in ExecFunctionScan (pstate=0x19856d8) at nodeFunctionscan.c:270
    #16 0x000000000073db6e in ExecProcNodeFirst (node=0x19856d8) at execProcnode.c:464
    #17 0x00000000007322ac in ExecProcNode (node=0x19856d8) at ../../../src/include/executor/executor.h:260
    #18 0x0000000000734b36 in ExecutePlan (estate=0x19854a0, planstate=0x19856d8, use_parallel_mode=false, operation=CMD_SELECT,
        sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x19c2008, execute_once=true) at execMain.c:1551
    #19 0x0000000000732916 in standard_ExecutorRun (queryDesc=0x1975860, direction=ForwardScanDirection, count=0,
        execute_once=true) at execMain.c:361
    --Type <RET> for more, q to quit, c to continue without paging--
    #20 0x0000000000732744 in ExecutorRun (queryDesc=0x1975860, direction=ForwardScanDirection, count=0, execute_once=true)
        at execMain.c:305
    #21 0x000000000097c752 in PortalRunSelect (portal=0x191dd40, forward=true, count=0, dest=0x19c2008) at pquery.c:921
    #22 0x000000000097c411 in PortalRun (portal=0x191dd40, count=9223372036854775807, isTopLevel=true, run_once=true,
        dest=0x19c2008, altdest=0x19c2008, qc=0x7ffcc39d7b00) at pquery.c:765
    #23 0x00000000009760d8 in exec_simple_query (
        query_string=0x18bb7a0 "SELECT data FROM pg_logical_slot_get_changes('isolation_slot_5', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');") at postgres.c:1213
    #24 0x000000000097a5eb in PostgresMain (argc=1, argv=0x7ffcc39d7d90, dbname=0x18e5c28 "regress_5",
        username=0x18e5c08 "hayato") at postgres.c:4513
    #25 0x00000000008b58b6 in BackendRun (port=0x18dd880) at postmaster.c:4540
    #26 0x00000000008b5232 in BackendStartup (port=0x18dd880) at postmaster.c:4262
    #27 0x00000000008b16ff in ServerLoop () at postmaster.c:1748
    #28 0x00000000008b0fd1 in PostmasterMain (argc=5, argv=0x18b6200) at postmaster.c:1420
    #29 0x00000000007b16f8 in main (argc=5, argv=0x18b6200) at main.c:209
    ```
    
    Best Regards,
    Hayato Kuroda
    FUJITSU LIMITED
    https://www.fujitsu.com/<https://www.fujitsu.com/global/>
    
    
  9. Re: BUG #18369: logical decoding core on AssertTXNLsnOrder()

    Alexander Lakhin <exclusion@gmail.com> — 2024-03-05T04:00:00Z

    Hello Kuroda-san,
    
    04.03.2024 15:52, Hayato Kuroda (Fujitsu) wrote:
    >
    > Dear Alexander,
    >
    > I also ran your script after applying patches on REL_14_STABLE, but I got another failure.
    >
    > I have not analyzed yet so I share the result as-is.
    >
    > Was I missing something?
    >
    >
    > ```
    >
    > #0 0x00007f4f033de387 in raise () from /lib64/libc.so.6
    >
    > #1 0x00007f4f033dfa78 in abort () from /lib64/libc.so.6
    >
    > #2 0x0000000000b08152 in ExceptionalCondition (conditionName=0xcc8228 "prev_first_lsn < cur_txn->first_lsn",
    >
    >     errorType=0xcc8024 "FailedAssertion", fileName=0xcc8070 "reorderbuffer.c", lineNumber=916) at assert.c:69
    >
    > #3 0x00000000008e01ad in AssertTXNLsnOrder (rb=0x19a95c0) at reorderbuffer.c:916
    >
    > #4 0x00000000008dfae8 in ReorderBufferTXNByXid (rb=0x19a95c0, xid=1753, create=true, is_new=0x0, lsn=37430336,
    >
    >     create_as_top=true) at reorderbuffer.c:669
    >
    >
    
    Yes, you're right. That script produces another failure as I mentioned
    upthread, but still in AssertTXNLsnOrder().
    I had thought that the fix should cover that case too, but then I found out
    that the defect in question is rather new, so maybe the more focused fix is
    really preferable.
    
    Best regards,
    Alexander
  10. RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

    Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> — 2024-03-05T09:24:05Z

    Dear Haiyang Li,
    
    Thanks for reporting. I could reproduce the  issue on PG12-15.
    I thought you have already known the reason, but let me share my analysis for
    the confirmation. The root cause is missing consideration for temporary tables.
    
    ## Premise
    
    INSERT/UPDATE/DELETE operations for temporary tables are not recorded to WAL files,
    but xid would be involved for such operations.
    So there is a possibility that a transaction does not have related WAL records
    if only temp tables are modified within the transaction.
    
    Basically such transactions would not be decoded.
    
    ## Found issue
    
    ### Empty transaction is decoded on PG14 and PG15
    
    However, there is a room for generating ReorderBufferTxn for empty transactions,
    which was introduced by 6b77048e5. Conditions are:
    
    1. There are sub transactions which modify only temp tables, and
    2. the top transaction modifies the catalog.
    
    The call-stack toward the generation is below.
    
    ```
    ReorderBufferTXNByXid(create = true, create_as_top = true)
    ReorderBufferXidSetCatalogChanges()    // for sub transactions
    SnapBuildXidSetCatalogChanges()          // for top transaction
    DecodeCommit()                                      // for top transaction
    ```
    
    The path has been introduced by 6b77048e5.
    Previously, calling ReorderBufferXidSetCatalogChanges() for sub transactions
    would be skipped, if they do not have catalog changes or they have not decoded yet.
    However, the commit ensures sub transactions must be marked as containing
    catalog changes, and this also enforces to decode transactions even if it is
    empty.
    
    ### Assertion failure
    
    The empty transactions would be created as top transactions. At that time,
    AssertTXNLsnOrder() is called so that we ensured that first_lsn of top-transactions
    must be strictly higher than previous. But they can be the same if there are more
    than two empty transactions. It led an assertion failure.
    
    ### Considerations on PG12 and PG13
    
    Same failure can be occurred on the PG12 and 13, and the background is bit different.
    343afa967 removed a ReorderBufferAssignChild() from SnapBuildXidSetCatalogChanges().
    The function allowed empty transactions being marked as sub-trans, so there had
    been no problem in past. After the commit, assignments were removed, so that the
    empty transactions would be generated as top-transactions.
    
    ## Possible solutions
    
    I think there are several solutions.
    Note that I assumed here that fixes for all the versions should be almost the same.
    
    * Ease the condition in AssertTXNLsnOrder(). If the decoded transaction is empty,
      it can be allowed that the first_lsn is same as previous one.
      PSA file to see my consideration.
    * Generate a ReorderBufferTXN as sub transaction when we are in this path.
      The approach has already been shared by you. However, note that this needs to
      extend the ReorderBufferXidSetCatalogChanges function, and breaks ABI
      compatibility [1].
    * Avoid calling ReorderBufferXidSetCatalogChanges() if the target transaction
      has not been decoded. An concern is that ReorderBuffer does not provide an API
      for checking whether the transaction has been already decoded or not.
    
    I will keep analyzing more and share further updates if found.
    Thought?
    
    [1]: https://wiki.postgresql.org/wiki/Committing_checklist#Maintaining_ABI_compatibility_while_backpatching
    
    Best Regards,
    Hayato Kuroda
    FUJITSU LIMITED
    https://www.fujitsu.com/global/ 
    
    
  11. Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

    ocean_li_996 <ocean_li_996@163.com> — 2024-03-06T03:33:45Z

    Dear Hayato Kuroda,
    
    
    Thanks for your attention.
    
    
    At 2024-03-05 17:24:05, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:
    >Dear Haiyang Li,
    >
    >...
    >## Found issue
    >
    >### Empty transaction is decoded on PG14 and PG15
    >
    >However, there is a room for generating ReorderBufferTxn for empty transactions,
    >which was introduced by 6b77048e5. Conditions are:
    >
    >1. There are sub transactions which modify only temp tables, and
    >2. the top transaction modifies the catalog.
    >
    >The call-stack toward the generation is below.
    >
    >```
    >ReorderBufferTXNByXid(create = true, create_as_top = true)
    >ReorderBufferXidSetCatalogChanges()    // for sub transactions
    >SnapBuildXidSetCatalogChanges()          // for top transaction
    >DecodeCommit()                                      // for top transaction
    >```
    >
    >The path has been introduced by 6b77048e5.
    >Previously, calling ReorderBufferXidSetCatalogChanges() for sub transactions
    >would be skipped, if they do not have catalog changes or they have not decoded yet.
    >However, the commit ensures sub transactions must be marked as containing
    >catalog changes, and this also enforces to decode transactions even if it is
    >empty.
    >
    >### Assertion failure
    >
    >The empty transactions would be created as top transactions. At that time,
    >AssertTXNLsnOrder() is called so that we ensured that first_lsn of top-transactions
    >must be strictly higher than previous. But they can be the same if there are more
    >than two empty transactions. It led an assertion failure.
    
    >
    Your analysis is correct for me.  Actually, I mentioned in [1] that I can reproduce this issue before 6b77048e5. 
    After some attempts and analysis, I also believe that the issue will only occur after 6b77048e5.
    
    
    > ...
    >
    >## Possible solutions
    >
    >I think there are several solutions.
    >Note that I assumed here that fixes for all the versions should be almost the same.
    >
    >* Ease the condition in AssertTXNLsnOrder(). If the decoded transaction is empty,
    >  it can be allowed that the first_lsn is same as previous one.
    
    >  PSA file to see my consideration.
    LGFM.  For my observation, the most case failed on AsserTXNOrder is checking empty
    decoded transaction. Maybe we should pay more attention to review ReorderBufferTXNIsEmpty.
    
    
    >* Generate a ReorderBufferTXN as sub transaction when we are in this path.
    >  The approach has already been shared by you. However, note that this needs to
    >  extend the ReorderBufferXidSetCatalogChanges function, and breaks ABI
    
    >  compatibility [1].
    Yes, It breaks ABI compatibility.  
    
    
    >* Avoid calling ReorderBufferXidSetCatalogChanges() if the target transaction
    >  has not been decoded. An concern is that ReorderBuffer does not provide an API
    >  for checking whether the transaction has been already decoded or not.
    
    >
    I think this idear is  a little complex, especially when considering version compatibility.
    
    
    [1] https://www.postgresql.org/message-id/6444e39.131bc.18df5c0cae3.Coremail.ocean_li_996@163.com
    Best Regards,
    Haiyang Li
  12. RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

    Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> — 2024-03-07T04:45:15Z

    Dear Haiyang, Alexander,
    
    >LGFM.  For my observation, the most case failed on AsserTXNOrder is checking empty
    >decoded transaction. Maybe we should pay more attention to review ReorderBufferTXNIsEmpty.
    
    While checking on my fresh eyes, I thought the code might be wrong. The first_lsn
    would be same as previous one, when the *PREVIOUS transaction* (not cur_txn) was
    empty. Thought?
    
    Also I found the crash reported on [1] was not resolved by the patch. I'm
    still analyzing but I have not found the good reproducer yet which could be done
    by spec file. Can someone find the workload?
    
    [1]: https://www.postgresql.org/message-id/TYCPR01MB12077573479C5A2471BDE8065F5232%40TYCPR01MB12077.jpnprd01.prod.outlook.com
    
    Best Regards,
    Hayato Kuroda
    FUJITSU LIMITED
    https://www.fujitsu.com/global/ 
    
    
    
    
    
  13. RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

    Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> — 2024-03-08T07:35:16Z

    Dear Haiyang, Alexander,
    
    I analyzed the second failure reported in [1]. The failure happened on the all
    Supported branches. Attached patches fixes two failures [1] [2] on PG12-PG15,
    and only the failure [1] on PG16-HEAD.
    
    Below part describes the second failure. The issue would be occurred when:
    
    1) Logical decoding starts from the middle of a sub-transaction, and
    2) NEW_CID record is initially decoded in the sub-transaction, and
    3) An arbitrary changes are decoded in the sub-transaction.
    
    ## Stuck trace
    
    Just in case, below is a stuck trace I got.
    
    ```
    #0  0x00007f85e64af387 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
    #1  0x00007f85e64b0a78 in __GI_abort () at abort.c:90
    #2  0x0000000000b082d3 in ExceptionalCondition (conditionName=0xcc83e7 "prev_txn_is_empty", 
        errorType=0xcc8184 "FailedAssertion", fileName=0xcc81d0 "reorderbuffer.c", lineNumber=932) at assert.c:69
    #3  0x00000000008e023d in AssertTXNLsnOrder (rb=0x1a69e00) at reorderbuffer.c:932
    #4  0x00000000008e3cf5 in ReorderBufferSetBaseSnapshot (rb=0x1a69e00, xid=757, lsn=24749472, snap=0x1a660e8)
        at reorderbuffer.c:3140
    #5  0x00000000008e910a in SnapBuildProcessChange (builder=0x1a65de0, xid=757, lsn=24749472) at snapbuild.c:799
    #6  0x00000000008d2740 in DecodeHeapOp (ctx=0x1a53d60, buf=0x7ffe24280460) at decode.c:521
    #7  0x00000000008d1df7 in LogicalDecodingProcessRecord (ctx=0x1a53d60, record=0x1a54160) at decode.c:154
    #8  0x00000000008d9766 in pg_logical_slot_get_changes_guts (fcinfo=0x1a41f20, confirm=true, binary=false)
        at logicalfuncs.c:296
    #9  0x00000000008d989b in pg_logical_slot_get_changes (fcinfo=0x1a41f20) at logicalfuncs.c:365
    ...
    ```
    
    ## Reproducer
    
    Each attached patches contained the reproducer.
    As I said above, logical decoding is started from the middle of the sub-transaction
    initial decoded record must be NEW_CID. Below shows the wal records from the
    decoding start point.
    
    ```
    $ pg_waldump tmp_check_iso/data/pg_wal/000000010000000000000001 -s 0/179A4E8
    rmgr: ..., tx:          0, lsn: 0/0179A4E8, prev 0/0179A4A8, desc: CHECKPOINT_ONLINE... 
    rmgr: ..., tx:        757, lsn: 0/0179A560, prev 0/0179A4E8, desc: NEW_CID ...
    rmgr: ..., tx:        757, lsn: 0/0179A5A0, prev 0/0179A560, desc: INSERT+INIT...
    rmgr: ..., tx:        756, lsn: 0/0179A5E0, prev 0/0179A5A0, desc: COMMIT 2024-03-07 13:54:56.243746 UTC; subxacts: 757
    rmgr: ..., tx:          0, lsn: 0/0179A618, prev 0/0179A5E0, desc: RUNNING_XACTS ...
    ```
    
    ## Analysis
    
    SnapBuildProcessNewCid() generates a two transactions (xid = 757 and 756) based on
    the same wal record, so both entries have same first_lsn in ReorderBufferTXN.
    However, they are not associated as top-sub relationship so that they are pushed to
    toplevel_by_lsn list. Since AssertTXNLsnOrder() does not assumes the case which
    two entries have same first_lsn, it leads an assertion failure. Note that the
    sub-transaction is generated earlier than top one and sub is not an empty, an
    initial fix was not sufficient.
    
    ### Detailed analysis
    
    I added a debug variable in AssertTXNLsnOrder to preserve the previous entry in
    toplevel_by_lsn loop, and I found that two ReorderBufferTNXes have same first_lsn.
    Also, according to above, 757 is a sub-transaction of 756.
    
    ```
    (gdb) p *prev
    $1 = {txn_flags = 1, xid = 757, toplevel_xid = 0, gid = 0x0, first_lsn = 24749408, ...nentries = 1, ..., ntuplecids = 0,...}
    (gdb) p *cur_txn 
    $2 = {txn_flags = 0, xid = 756, toplevel_xid = 0, gid = 0x0, first_lsn = 24749408, ...nentries = 0, ..., ntuplecids = 1,...}
    ```
    
    Based on above and some debug outputs, I considered a scenario. Below flow showed
    a case when the sub-transaction (xid = 757) was decoded. 
    
    ```
    DecodeHeap2Op(info = XLOG_HEAP2_NEW_CID)
    	SnapBuildProcessNewCid()
    		ReorderBufferXidSetCatalogChanges(xid, lsn)
    			ReorderBufferTXNByXid(xid, lsn)
    				-> A ReorderBufferTXN for subtransaciton (757) was generated.
    				   The first_lsn was the head of NEW_CID record.
    		ReorderBufferAddNewCommandId(xid, lsn)
    			ReorderBufferQueueChange(xid, lsn)
    				-> A ReorderBufferChange was queued to the subtransaction (757)
    ...
    		ReorderBufferAddNewTupleCids(xlrec->top_xid, lsn)
    			ReorderBufferTXNByXid()
    				-> A ReorderBufferTXN for top-transaciton (756) was generated.
    				   The first_lsn was the head of NEW_CID record.
    ...
    DecodeHeapOp(info = XLOG_HEAP_INSERT)
    	SnapBuildProcessChange(xid)
    		ReorderBufferSetBaseSnapshot(xid, lsn)
    			AssertTXNLsnOrder()
    				-> A subtransaction was found and it was not an empty transaction.
    				-> Next entry was a top-transaction.
    				   The previous entry was not empty, and they had same first_lsn. It caused an Assertion failure.
    ```
    
    ## How to fix
    
    I think the straightforward fix is to associate them to top-sub relationship,
    and attached patch did it.
    
    Thought?
    
    [1]: https://www.postgresql.org/message-id/TYCPR01MB12077573479C5A2471BDE8065F5232%40TYCPR01MB12077.jpnprd01.prod.outlook.com
    [2]: https://www.postgresql.org/message-id/18369-ad61699bf91c5bc0%40postgresql.org
    
    Best Regards,
    Hayato Kuroda
    FUJITSU LIMITED
    https://www.fujitsu.com/ 
    
    
  14. Re: RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

    Amit Kapila <amit.kapila16@gmail.com> — 2024-03-08T12:59:21Z

    On Wed, Mar 6, 2024 at 9:04 AM ocean_li_996 <ocean_li_996@163.com> wrote:
    >
    > Thanks for your attention.
    >
    > At 2024-03-05 17:24:05, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:
    >
    > >Dear Haiyang Li,
    > >
    > >...
    > >## Found issue
    > >
    > >### Empty transaction is decoded on PG14 and PG15
    > >
    > >However, there is a room for generating ReorderBufferTxn for empty transactions,
    > >which was introduced by 6b77048e5. Conditions are:
    > >
    > >1. There are sub transactions which modify only temp tables, and
    > >2. the top transaction modifies the catalog.
    > >
    > >The call-stack toward the generation is below.
    > >
    > >```
    > >ReorderBufferTXNByXid(create = true, create_as_top = true)
    > >ReorderBufferXidSetCatalogChanges()    // for sub transactions
    > >SnapBuildXidSetCatalogChanges()          // for top transaction
    > >DecodeCommit()                                      // for top transaction
    > >```
    > >
    > >The path has been introduced by 6b77048e5.
    > >Previously, calling ReorderBufferXidSetCatalogChanges() for sub transactions
    > >would be skipped, if they do not have catalog changes or they have not decoded yet.
    > >However, the commit ensures sub transactions must be marked as containing
    > >catalog changes, and this also enforces to decode transactions even if it is
    > >empty.
    > >
    > >### Assertion failure
    > >
    > >The empty transactions would be created as top transactions. At that time,
    > >AssertTXNLsnOrder() is called so that we ensured that first_lsn of top-transactions
    > >must be strictly higher than previous. But they can be the same if there are more
    > >than two empty transactions. It led an assertion failure.
    > >
    > Your analysis is correct for me.  Actually, I mentioned in [1] that I can reproduce this issue before 6b77048e5.
    > After some attempts and analysis, I also believe that the issue will only occur after 6b77048e5.
    >
    > > ...
    > >
    > >## Possible solutions
    > >
    > >I think there are several solutions.
    > >Note that I assumed here that fixes for all the versions should be almost the same.
    > >
    > >* Ease the condition in AssertTXNLsnOrder(). If the decoded transaction is empty,
    > >  it can be allowed that the first_lsn is same as previous one.
    > >  PSA file to see my consideration.
    > LGFM.  For my observation, the most case failed on AsserTXNOrder is checking empty
    > decoded transaction. Maybe we should pay more attention to review ReorderBufferTXNIsEmpty.
    >
    > >* Generate a ReorderBufferTXN as sub transaction when we are in this path.
    > >  The approach has already been shared by you. However, note that this needs to
    > >  extend the ReorderBufferXidSetCatalogChanges function, and breaks ABI
    > >  compatibility [1].
    > Yes, It breaks ABI compatibility.
    >
    
    One possibility is introducing ReorderBufferXidSetCatalogChangesEx, a
    new API with a bool parameter. I don't know if this is a good idea but
    I prefer not to tinker with asserts as proposed by Kuroda-San in
    another approach.
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  15. RE: RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

    Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> — 2024-03-09T04:50:02Z

    Dear Amit, 
    
    OK, so let's consider the approach. Note that at this stage, another failure [1]
    was ignored.
    
    Here are patches for all supported versions. The failure on PG12-PG15 has been
    fixed on my env, but you can confirm as well.
    
    The approach is almost same as what initially shared by Haiyang [2]. However,
    instead of extending ReorderBufferXidSetCatalogChanges(), a new function
    ReorderBufferXidSetCatalogChangesEx() was added.
    
    Note again that there were changes also in ReorderBufferAssignChild() and
    ReorderBufferCleanupTXN(). Extended ReorderBufferXidSetCatalogChanges would the
    create ReorderBufferTXN not as top, however, these transaction would not be
    associated with the top one. So there is a possibility that txn->node is invalid.
    IIUC, only ReorderBufferAssignChild() calls ReorderBufferTXNByXid with create = true
    and create_as_top = false, and they would be immediately associated in below.
    
    ```
    	/* add to subtransaction list */
    	dlist_push_tail(&txn->subtxns, &subtxn->node);
    	txn->nsubtxns++;
    ```
    
    [1]: https://www.postgresql.org/message-id/TYCPR01MB12077573479C5A2471BDE8065F5232%40TYCPR01MB12077.jpnprd01.prod.outlook.com
    [2]: https://www.postgresql.org/message-id/6d0e80d6.c1fc.18deeb8120a.Coremail.ocean_li_996%40163.com
    
    Best Regards,
    Hayato Kuroda
    FUJITSU LIMITED
    https://www.fujitsu.com/ 
    
    
  16. Re:RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

    ocean_li_996 <ocean_li_996@163.com> — 2024-03-11T16:43:01Z

    Dear Hayato:
    
    
    At 2024-03-08 15:35:16, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:
    >Dear Haiyang, Alexander,
    >
    >Below part describes the second failure. The issue would be occurred when:
    >
    >1) Logical decoding starts from the middle of a sub-transaction, and
    >2) NEW_CID record is initially decoded in the sub-transaction, and
    >3) An arbitrary changes are decoded in the sub-transaction.
    
    >...
    
    
    ## Issue 1
    Thanks for your  reproducer. I have investigated this issue. The scenario that caused the issue
    is indeed as you described above. I had not realized before that a serialized snapshot file from
    one replication slot could be utilized by another replication slot to achieve a consistent state. 
    This issue would not arise with only one replication slot, as SnapBuildXactNeedsSkip ensures that.
    
    
    Using the spec test case you provided (without any fix patch), I got another Stuck trace on REL_14_STABLE :
    ```
    #0  0x00007f93604e0277 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
    #1  0x00007f93604e1968 in __GI_abort () at abort.c:90
    #2  0x0000000000b084da in ExceptionalCondition (conditionName=0xcc94c8 "prev_first_lsn < cur_txn->first_lsn", errorType=0xcc92c4 "FailedAssertion", fileName=0xcc9310 "reorderbuffer.c",
        lineNumber=916) at assert.c:69
    #3  0x00000000008df064 in AssertTXNLsnOrder (rb=0x2a48590) at reorderbuffer.c:916
    #4  0x00000000008de9a1 in ReorderBufferTXNByXid (rb=0x2a48590, xid=756, create=true, is_new=0x0, lsn=24749416, create_as_top=true) at reorderbuffer.c:669
    #5  0x00000000008e2d00 in ReorderBufferAddNewTupleCids (rb=0x2a48590, xid=756, lsn=24749416, node=..., tid=..., cmin=1, cmax=4294967295, combocid=4294967295) at reorderbuffer.c:3198
    #6  0x00000000008e7ed7 in SnapBuildProcessNewCid (builder=0x2a44570, xid=757, lsn=24749416, xlrec=0x2a16d38) at snapbuild.c:823
    #7  0x00000000008d14b3 in DecodeHeap2Op (ctx=0x2a324f0, buf=0x7ffd886a4bc0) at decode.c:471
    #8  0x00000000008d0bfa in LogicalDecodingProcessRecord (ctx=0x2a324f0, record=0x2a328f0) at decode.c:151
    #9  0x00000000008d8688 in pg_logical_slot_get_changes_guts (fcinfo=0x2a206b0, confirm=true, binary=false) at logicalfuncs.c:296
    #10 0x00000000008d87bd in pg_logical_slot_get_changes (fcinfo=0x2a206b0) at logicalfuncs.c:365
    ```
    
    
    
    Then, I think the simplest process would be:
    ```
    ReorderBufferProcessXid(xid = 757, lsn)
    	-> A ReorderBufferTXN for subtransaciton (757) was generated.
    	   The first_lsn was the head of NEW_CID record.
    	DecodeHeap2Op(info = XLOG_HEAP2_NEW_CID)
    		SnapBuildProcessNewCid()	
    			ReorderBufferAddNewTupleCids(..., xlrec->top_xid = 756,...)
    				ReorderBufferTXNByXid(xid = 756, lsn)
    					-> A ReorderBufferTXN for top-transaciton (756) was generated.
    					   The first_lsn was the head of NEW_CID record. It caused an Assertion failure.
    ```
    > I think the straightforward fix is to associate them to top-sub relationship,
    > and attached patch did it.
    LGFM, I think it is suitable assign subtransaction after calling ReorderBufferAddNewTupleCids().
    
    
    ## Issue 2
    Inspired by your spec case, I've reorganized the spec case provided in [2]. The new test in attachment
    is able to reproduce the issue mentioned in [1] even before commit 6b77048e5.
    
    
    The approach in [3] is also LGFM.
    
    
    [1] https://www.postgresql.org/message-id/18369-ad61699bf91c5bc0@postgresql.org
    [2] https://www.postgresql.org/message-id/6d0e80d6.c1fc.18deeb8120a.Coremail.ocean_li_996@163.com
    [3] https://www.postgresql.org/message-id/CAA4eK1KpW5pHMwMp9hfXYvOeEU5Rcbhoc7FxtBOGPgKeyYLDmA@mail.gmail.com
    
    
    Best Regards,
    Haiyang Li
  17. RE: Re:RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

    Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> — 2024-03-12T10:22:59Z

    Dear Haiyang,
    
    Thanks for checking! This reply was still focused only on "Issue 2" in your notation.
    
    >## Issue 2
    >Inspired by your spec case, I've reorganized the spec case provided in [2]. The new test in attachment
    >is  able to reproduce the issue mentioned in [1] even before commit 6b77048e5.
    
    Good findings. I also confirmed the workload could fail after reverting the 6b77048e5.
    Also confirmed that the patch [1] could fix the workload as well.
    
    permutation "s0_init" "s0_begin" "s0_savepoint" "s0_create_part1" "s0_savepoint_release"
                         "s2_init" "s1_checkpoint" "s1_get_changes" "s0_commit" "s2_get_changes"
    
    ## Analysis
    
    The point was that the serialized snapshot by another replication slot can be reused.
    When the first get_change is called, a consistent snapshot can be serialized because
    of the XLOG_RUNNING_XACTS record (see later).
    The get_changes for the second slot reuses so that it can read WAL records property.
    (If the first slot does not exist, the status of the snapshot would be
    SNAPBUILD_BUILDING_SNAPSHOT. So no records are read)
    
    In the second get_changes, below records are read. First (LOCK, RUNNING_XACTS)
    pair is generated from the slot creation, and second pair comes from the
    CHECKPOINT. I.e., it reads all records from the slot generation.
    
    ```
    ...lsn: 0/01906DB8, prev 0/01906D58, desc: LOCK ...
    ...lsn: 0/01906DF0, prev 0/01906DB8, desc: RUNNING_XACTS ...
    ...lsn: 0/01906E30, prev 0/01906DF0, desc: LOCK ...
    ...lsn: 0/01906E68, prev 0/01906E30, desc: RUNNING_XACTS ...
    ...lsn: 0/01906EA8, prev 0/01906E68, desc: CHECKPOINT_ONLINE ...
    ...lsn: 0/01906F20, prev 0/01906EA8, desc: COMMIT ... subxacts: 728; ... inval msgs: ...
    ```
    
    Also the final COMMIT record contains the info for a subtransaction and
    XACT_XINFO_HAS_INVALS flag, so DecodeCommit()->SnapBuildXidSetCatalogChanges()
    is called transactions.
    
    After that, two ReorderBufferTXNs are created with the same LSN, it fails the
    assertion in AssertTXNLsnOrder().
    
    I will update the patch if above analysis is correct.
    
    >The approach in [3] is also LGFM.
    
    Thanks. I agreed that we should not ease condition for Assert() as much as possible.
    
    [1]: https://www.postgresql.org/message-id/TYCPR01MB1207790E98F0A563280CC39FCF5262%40TYCPR01MB12077.jpnprd01.prod.outlook.com
    
    Best Regards,
    Hayato Kuroda
    FUJITSU LIMITED
    https://www.fujitsu.com/global/ 
    
    
    
    
    
  18. Re:RE: Re:RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

    ocean_li_996 <ocean_li_996@163.com> — 2024-03-12T15:07:22Z

    Dear Hayato,
    >
    >I will update the patch if above analysis is correct.
    >
    Your analysis is correct for me.
    
    
    Best Regards
    Haiyang Li
  19. RE: Re:RE: Re:RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

    Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> — 2024-03-13T03:24:01Z

    Dear Haiyang Li,
    
    > Your analysis is correct for me.
    
    Thanks for the confirmation. Here are updated patch for REL12-REL15.
    
    Best Regards,
    Hayato Kuroda
    FUJITSU LIMITED
    https://www.fujitsu.com/global/ 
    
    
  20. Re: BUG #18369: logical decoding core on AssertTXNLsnOrder()

    feichanghong <feichanghong@qq.com> — 2024-03-13T03:33:55Z

    
    > On Mar 13, 2024, at 11:24, Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
    > 
    > Dear Haiyang Li,
    > 
    >> Your analysis is correct for me.
    > 
    > Thanks for the confirmation. Here are updated patch for REL12-REL15.
    > 
    Is it better for ReorderBufferXidSetCatalogChanges to directly call the
    ReorderBufferXidSetCatalogChangesEx function? , which can reduce the cost of
    later maintenance.
    
    
    Best Regards,
    Fei Changhong
    Alibaba Cloud Computing Ltd.
    
    
  21. RE: BUG #18369: logical decoding core on AssertTXNLsnOrder()

    Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> — 2024-03-13T04:39:08Z

    Dear Fei,
    
    Thanks for giving a comment!
    
    >Is it better for ReorderBufferXidSetCatalogChanges to directly call the
    >ReorderBufferXidSetCatalogChangesEx function? , which can reduce the cost of
    >later maintenance.
    
    Indeed, PSA fixed version.
    
    Best Regards,
    Hayato Kuroda
    FUJITSU LIMITED
    https://www.fujitsu.com/global/ 
    
    
  22. Re:RE: BUG #18369: logical decoding core on AssertTXNLsnOrder()

    ocean_li_996 <ocean_li_996@163.com> — 2024-03-13T05:48:58Z

    Dear Hayato,
    
    At 2024-03-13 12:39:08, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:
    >Dear Fei,
    >
    >Thanks for giving a comment!
    >
    >>Is it better for ReorderBufferXidSetCatalogChanges to directly call the
    >>ReorderBufferXidSetCatalogChangesEx function? , which can reduce the cost of
    >>later maintenance.
    >
    >Indeed, PSA fixed version.
    
    >
    I have confirmed  that this patch can fix the iseeue well in this notation.
    Everything is ok for me. I have no more comments on this patch if no
    further discussions.
    Best Regards
    Haiyang Li
    
    
  23. RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

    Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> — 2024-03-13T09:54:36Z

    Dear hackers,
    
    Regarding the second failure, which was reported the upstream.
    
    While reading some other threads, I found that the similar phenomenon has already
    been reported in [1]. The primal reason was that slots reused the serialized
    snapshot but it was not correct. Therefore, I thought the second failure should
    be discussed on the referred thread and here we should focused on the recently
    proposed fix by me.
    (This meant that my proposed workload may not work well after the fix. Will modify
    the test later if needed.)
    
    [1]: https://www.postgresql.org/message-id/flat/2444AA15-D21B-4CCE-8052-52C7C2DAFE5C%40amazon.com
    
    Best Regards,
    Hayato Kuroda
    FUJITSU LIMITED
    https://www.fujitsu.com/ 
    
    
    
    
    
  24. Re:Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

    ocean_li_996 <ocean_li_996@163.com> — 2024-03-14T00:49:39Z

    Dear Hayato,
    
    
    To ensure we are on the same page, I think we need to clarify a few things. 
    In this thread, we have actually encountered three distinct issues. Chronologically,
    
    
    Issue #1 is presented in [1], which has been figured out and provided an appropriate fix patch in [5]. 
    
    
    Issue #2 is presented in [2], which has been figured out. But we have not yet provided a fix patch. 
    
    
    Issue #3 is presented in [3], which was initially thought to be the same as issue #2. However, 
    you have found that it might be related to the issue discussed in [4].
    
    
    
    
    If my understanding is correct, I think that multiple issues should be resolved independently.
    In this thread, we should resolve issue #1 using the patch provided in [5]. If we want to prevent 
    the newly added spec test triggering other issues (such as issue #3), revert the spec test to
    the test cases provided in [6] may be a good choice.
    
    
    The issue  #2 can be discussed in a new thread. The issue #3 can be discussed in [4] thread.
    Thought?
    
    
    
    
    [1] https://www.postgresql.org/message-id/18369-ad61699bf91c5bc0@postgresql.org
    [2] https://postgresql.org/message-id/TYCPR01MB12077573479C5A2471BDE8065F5232@TYCPR01MB12077.jpnprd01.prod.outlook.com
    [3] https://www.postgresql.org/message-id/TYCPR01MB1207717063D701F597EF98A0CF5272@TYCPR01MB12077.jpnprd01.prod.outlook.com
    [4] https://www.postgresql.org/message-id/29273AF3-9684-4069-9257-D05090B03B99@amazon.com
    [5] https://www.postgresql.org/message-id/TYCPR01MB12077E7BD3320FFD9024A3DE9F52A2@TYCPR01MB12077.jpnprd01.prod.outlook.com
    [6] https://www.postgresql.org/message-id/6d0e80d6.c1fc.18deeb8120a.Coremail.ocean_li_996@163.com
    
    
    
    
    Best Regards,
    Haiyang Li
    
    
    
    
  25. RE: Re:Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

    Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> — 2024-03-14T01:40:32Z

    Dear Haiyang Li,
    
    > Issue #1 is presented in [1], which has been figured out and provided an appropriate fix patch in [5].
    
    Right.
    
    > Issue #2 is presented in [2], which has been figured out. But we have not yet provided a fix patch.
    > Issue #3 is presented in [3], which was initially thought to be the same as issue #2. However, 
    > you have found that it might be related to the issue discussed in [4].
    
    Sorry for confusing. I meant to say that issues #2 and #3 are the same, and there is a related
    discussion in [4]. 
    
    > If my understanding is correct, I think that multiple issues should be resolved independently.
    > In this thread, we should resolve issue #1 using the patch provided in [5]. If we want to prevent 
    > the newly added spec test triggering other issues (such as issue #3), revert the spec test to
    > the test cases provided in [6] may be a good choice.
    
    Absolutely correct.
    
    > The issue  #2 can be discussed in a new thread. The issue #3 can be discussed in [4] thread.
    > Thought?
    
    Again, I think #2 and #3 are the same, so it should be discussed in [4].
    
    [4]: https://www.postgresql.org/message-id/29273AF3-9684-4069-9257-D05090B03B99@amazon.com
    
    Best Regards,
    Hayato Kuroda
    FUJITSU LIMITED
    https://www.fujitsu.com/global/ 
    
    
    
    
    
  26. Re:RE: Re:Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

    ocean_li_996 <ocean_li_996@163.com> — 2024-03-14T02:14:03Z

    Dear Hayato,
    
    
    Thanks for your confirmation.
    >> In this thread, we should resolve issue #1 using the patch provided in [5]. If we want to prevent 
    >> the newly added spec test triggering other issues (such as issue #3), revert the spec test to
    >> the test cases provided in [6] may be a good choice.
    >
    >Absolutely correct.
    
    If the spec test need to be reverted, the patch in [5] may be updated again.
    
    
    
    
    >Sorry for confusing. I meant to say that issues #2 and #3 are the same, and there is a related
    
    >discussion in [4]. 
    Ok, I will investigate this issue again. And I will post my analysis in [4] if something new is found.
    
    
    [4] https://www.postgresql.org/message-id/29273AF3-9684-4069-9257-D05090B03B99@amazon.com
    [5] https://www.postgresql.org/message-id/TYCPR01MB12077E7BD3320FFD9024A3DE9F52A2@TYCPR01MB12077.jpnprd01.prod.outlook.com
    
    
    Best Regards,
    Haiyang Li
  27. Re: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

    Amit Kapila <amit.kapila16@gmail.com> — 2024-03-15T12:46:02Z

    On Wed, Mar 13, 2024 at 3:24 PM Hayato Kuroda (Fujitsu)
    <kuroda.hayato@fujitsu.com> wrote:
    >
    > Regarding the second failure, which was reported the upstream.
    >
    > While reading some other threads, I found that the similar phenomenon has already
    > been reported in [1]. The primal reason was that slots reused the serialized
    > snapshot but it was not correct. Therefore, I thought the second failure should
    > be discussed on the referred thread and here we should focused on the recently
    > proposed fix by me.
    > (This meant that my proposed workload may not work well after the fix. Will modify
    > the test later if needed.)
    >
    
    I agree with this. Is there a reason you are not using the test for
    the issue originally reported in this thread?
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  28. Re: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

    Amit Kapila <amit.kapila16@gmail.com> — 2024-03-16T11:51:59Z

    On Fri, Mar 15, 2024 at 6:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    >
    > On Wed, Mar 13, 2024 at 3:24 PM Hayato Kuroda (Fujitsu)
    > <kuroda.hayato@fujitsu.com> wrote:
    > >
    > > Regarding the second failure, which was reported the upstream.
    > >
    > > While reading some other threads, I found that the similar phenomenon has already
    > > been reported in [1]. The primal reason was that slots reused the serialized
    > > snapshot but it was not correct. Therefore, I thought the second failure should
    > > be discussed on the referred thread and here we should focused on the recently
    > > proposed fix by me.
    > > (This meant that my proposed workload may not work well after the fix. Will modify
    > > the test later if needed.)
    > >
    >
    > I agree with this. Is there a reason you are not using the test for
    > the issue originally reported in this thread?
    >
    
    To verify my theory, I tried by slight modification in the test
    proposed by your patch:
    proposed:
    "s0_init" "s0_begin" "s0_savepoint" "s0_create_part1"
    "s0_savepoint_release" "s2_init" "s1_checkpoint" "s1_get_changes"
    "s0_commit" "s2_get_changes" "s2_stop"
    
    changed:
    "s0_init" "s0_begin" "s0_savepoint" "s0_create_part1"
    "s0_savepoint_release" "s2_init" "s1_checkpoint" "s1_get_changes"
    "so_insert_part1" "s0_commit" "s2_get_changes" "s2_stop"
    
    Note that so_insert_part1 (define it as: Insert into tbl1_part
    values(2);) before s0_commit and I get the error: "ERROR:  could not
    map filenode "base/5/16387" to relation OID". So, I feel such a test
    case needs the fix being discussed in the thread [1]. I don't know
    whether there will be any relation between the fixes for both the
    issues but it seems better to evaluate the other one [1] as well
    before moving forward with the fix in this thread.
    
    [1] - https://www.postgresql.org/message-id/29273AF3-9684-4069-9257-D05090B03B99@amazon.com
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  29. RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

    Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> — 2024-03-18T06:44:46Z

    Dear Amit,
    
    > I agree with this. Is there a reason you are not using the test for
    > the issue originally reported in this thread?
    
    Firstly the modified version was used to shorten the test workload.
    However, the optimization assumed the BUG which may be fixed in future.
    Here are new patch set which recent changes for testcases were reverted.
    
    Best Regards,
    Hayato Kuroda
    FUJITSU LIMITED
    https://www.fujitsu.com/ 
    
    
  30. Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

    ocean_li_996 <ocean_li_996@163.com> — 2024-03-19T02:58:38Z

    Dear Hayato, Amit,
    
    
    
    
    At 2024-03-18 14:44:46, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:
    
    >Dear Amit,
    >
    >> I agree with this. Is there a reason you are not using the test for
    >> the issue originally reported in this thread?
    >
    >Firstly the modified version was used to shorten the test workload.
    >However, the optimization assumed the BUG which may be fixed in future.
    >Here are new patch set which recent changes for testcases were reverted.
    >
    >Best Regards,
    >Hayato Kuroda
    >FUJITSU LIMITED
    >https://www.fujitsu.com/ 
    
    >
    
    
    The spec test added in latest patch only involves a single logical replication slot.
    IIUC, the issue can be fixed independently, without having to wait for the issue
    in [1] to be resolved first.
    [1]  https://www.postgresql.org/message-id/29273AF3-9684-4069-9257-D05090B03B99@amazon.com
    
    
    Best Regards,
    Haiyang Li
  31. Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

    ocean_li_996 <ocean_li_996@163.com> — 2024-05-16T07:22:53Z

    Hi all:
    
    
    
    
    Two months have passed by now, and I'm wondering if the issue mentioned in [1] has been resolved yet? 
    
    
    
    
    Returning to this question, does anyone have any new opinions on it? If not, should we consider fixing this problem?
    
    Or is this considered a bug that's not worth fixing?
    
    
    
    [1]  https://www.postgresql.org/message-id/29273AF3-9684-4069-9257-D05090B03B99@amazon.com
    
    
    Best Regards,
    Haiyang Li
  32. Re: BUG #18369: logical decoding core on AssertTXNLsnOrder()

    Masahiko Sawada <sawada.mshk@gmail.com> — 2024-07-12T02:30:18Z

    On Thu, May 16, 2024 at 4:23 PM ocean_li_996 <ocean_li_996@163.com> wrote:
    >
    > Hi all:
    >
    >
    > Two months have passed by now, and I'm wondering if the issue mentioned in [1] has been resolved yet?
    
    I've pushed the fix, bb19b70081.
    
    >
    > Returning to this question, does anyone have any new opinions on it? If not, should we consider fixing this problem?
    >
    > Or is this considered a bug that's not worth fixing?
    
    I think we should fix this bug too. I've reviewed the proposed v4
    patch for pg15 and here are comments:
    
    ---
    +# DML operations on temporary table temp_t consume XIDs, but do not generate
    +# WAL records. The last decoding restarts from the first checkpoint and doesn't
    +# get any information of last two substransactions that performed
    s0_insert_temp.
    +# While processing the commit record for the corresponding top-level
    +# transaction which will be marked as containing catalog change even before
    +# commit, we ensure that the corresponding substransaction is added into
    +# ReorderBuffer as subxact.
    +permutation "s0_create_temp" "s0_init" "s0_begin" "s0_txid_current"
    "s1_checkpoint" "s0_savepoint" "s0_insert_temp" "s0_savepoint_release"
    "s0_savepoint" "s0_insert_temp" "s0_savepoint_release" "s0_savepoint"
    "s0_insert_temp" "s0_savepoint_release" "s0_create_part1"
    "s1_get_changes" "s0_commit" "s1_get_changes"
    
    I think the new test case can be simplified to something like:
    
    permutation "s0_create_temp" "s0_init" "s0_begin" "s0_txid_current"
    "s1_checkpoint" "s0_truncate" "s0_savepoint" "s0_insert_temp"
    "s0_savepoint" "s0_insert_temp" "s0_savepoint" "s0_insert_temp"
    "s1_get_changes" "s0_commit" "s1_get_changes"
    
    It has fewer steps and emits fewer WAL records.
    
    ---
    +                        *
    +                        * Note: There is a possibility that the
    transaction was created
    +                        * as not the top-level txn, but the flag was
    not set. In this
    +                        * case, the transaction was not added to the
    list. This happens if
    +                        * sub-txns are first recognized and decoded
    by a COMMIT record.
                             */
    -                       dlist_delete(&subtxn->node);
    +                       if (subtxn->node.next != NULL)
    +                               dlist_delete(&subtxn->node);
                    }
    
    I think that the new if statement can be merged to the above else
    block (i.e, "else if (subtx->node.next != NULL)").
    
    It's not clear to me what the "the flag" in the new comment refers to.
    How about rephrasing it to:
    
    /*
     * We already saw this transaction, but might have initially added
     * it to the list of top-level txns.  Now that we know it's not
     * top-level, remove it from there if it's already attached to the
     * list.
     */
    
    ---
            /*
    -        * Remove TXN from its containing list.
    +        * Remove TXN from its containing list if registered.
             *
    
    How about "if attached"? On HEAD, we use the terms "detached" for
    dlist_node (see dlist_node_is_detached() in ilist.h).
    
    ---
    +void
    +ReorderBufferXidSetCatalogChangesEx(ReorderBuffer *rb, TransactionId xid,
    +                                   XLogRecPtr lsn, bool is_top)
     {
        ReorderBufferTXN *txn;
    
    -   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
    +   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, is_top);
    
    and
    
     -           ReorderBufferXidSetCatalogChanges(builder->reorder,
    subxacts[i], lsn);
     +           ReorderBufferXidSetCatalogChangesEx(builder->reorder,
    subxacts[i], lsn, false);
    
    As Kuroda-san previously mentioned[1], with this change, we would end
    up with a transaction entry that is not the top-level transaction but
    is not assigned to any top-level transaction. Such transactions are
    created in SnapBuildXidSetCatalogChanges(), used solely for reference
    by SnapBuildCommitTxn(), and promptly cleaned up afterward. Therefore
    I don't believe it would pose a significant issue, but I will
    contemplate it further. I feel it's somewhat of a hacky solution just
    to avoid AssertTXNLsnOrder().
    
    In SnapBuildXidSetCatalogChanges(), since we know that all XIDs in the
    subxacts array are subtransactions of the transaction 'xid', another
    way to fix it would be to properly assign all subtransactions to the
    top-level transaction (i.e., by calling ReorderBufferAssignChild())
    before setting each of them as catalog-changes. However, this change
    alone is insufficient, as it has been possible that we accumulate
    invalidation messages into a transaction that will later be identified
    as a subtransaction. IOW, another assertion in ReorderBufferForget()
    would fail:
    
        /*
         * Process cache invalidation messages if there are any. Even if we're not
         * interested in the transaction's contents, it could have manipulated the
         * catalog and we need to update the caches according to that.
         */
        if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
            ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
                                               txn->invalidations);
        else
            Assert(txn->ninvalidations == 0);
    
    Note that when assigning a subtransaction to its top-level
    transaction, we transfer the subtransaction's snapshot to its
    top-level transaction but not for the invalidation messages in
    ReorderBufferTXN. Consequently, it's possible to have subtransactions
    that lack the base snapshot but possess invalidation messages.
    Therefore, we would need to check if txn->ninvalidations == 0 only for
    top-level transactions. That is:
    
    -   else
    +   else if (!rbtxn_is_known_subxact(txn))
            Assert(txn->ninvalidations == 0);
    
    I'm uncertain which approach is better or if there exists an even
    superior approach. I'll continue contemplating it. Feedback is highly
    welcomed.
    
    Regards,
    
    [1] https://www.postgresql.org/message-id/TYCPR01MB1207790E98F0A563280CC39FCF5262%40TYCPR01MB12077.jpnprd01.prod.outlook.com
    
    --
    Masahiko Sawada
    Amazon Web Services: https://aws.amazon.com