BUG #18055: logical decoding core on AllocateSnapshotBuilder()
From: PG Bug reporting form <noreply@postgresql.org>
To: pgsql-bugs@lists.postgresql.org
Cc: ocean_li_996@163.com
Date: 2023-08-14T16:04:48Z
Lists: pgsql-bugs
Commits
Fix uninitialized access to InitialRunningXacts during decoding after ERROR.
- c7256e6564fa 15.5 landed
- f7d25117ba87 14.10 landed
- c570bb4d61b6 13.13 landed
- 7e57208ed51a 12.17 landed
- feb4e218e5f9 11.22 landed
The following bug has been logged on the website:

Bug reference:      18055
Logged by:          ocean li
Email address:      ocean_li_996@163.com
PostgreSQL version: 11.9
Operating system:   centos7 5.10.84 x86_64
Description:

To test the logical decoding module, I use the *pg_logical_slot_get_changes* function. Sometimes I get a core dump whose stack looks like this:

==>
#0  0x00007f744a7b9277 in raise () from /lib64/libc.so.6
#1  0x00007f744a7ba968 in abort () from /lib64/libc.so.6
#2  0x00000000010edd37 in ExceptionalCondition (conditionName=0x17e2b58 "!(NInitialRunningXacts == 0 && InitialRunningXacts == ((void *)0))", errorType=0x17e2b45 "FailedAssertion", fileName=0x17e2b39 "snapbuild.c", lineNumber=381) at assert.c:46
#3  0x0000000000e60b46 in AllocateSnapshotBuilder (reorder=0x551ea98, xmin_horizon=0, start_lsn=1267160216, need_full_snapshot=false) at snapbuild.c:381
#4  0x0000000000e50f70 in StartupDecodingContext (output_plugin_options=0x0, start_lsn=1267160216, xmin_horizon=0, need_full_snapshot=false, fast_forward=false, read_page=0xe53023 <logical_read_local_xlog_page>, prepare_write=0xe52df6 <LogicalOutputPrepareWrite>, do_write=0xe52e24 <LogicalOutputWrite>, update_progress=0x0) at logical.c:191
#5  0x0000000000e518b8 in CreateDecodingContext (start_lsn=1267160216, output_plugin_options=0x0, fast_forward=false, read_page=0xe53023 <logical_read_local_xlog_page>, prepare_write=0xe52df6 <LogicalOutputPrepareWrite>, do_write=0xe52e24 <LogicalOutputWrite>, update_progress=0x0) at logical.c:486
#6  0x0000000000e53735 in pg_logical_slot_get_changes_guts (fcinfo=0x7ffcd879e3d0, confirm=true, binary=false) at logicalfuncs.c:259
#7  0x0000000000e53b1c in pg_logical_slot_get_changes (fcinfo=0x7ffcd879e3d0) at logicalfuncs.c:393
#8  0x00000000010ff89e in FunctionCallInvokeCheckSPL (fcinfo=0x7ffcd879e3d0) at fmgr.c:2262
...
==>

In frame #3 of the stack above, one of the cores showed NInitialRunningXacts equal to 2 and InitialRunningXacts not NULL.
The intended use of NInitialRunningXacts and InitialRunningXacts is clear. As far as I know, the core may be caused as follows: an ERROR is raised during a call to *pg_logical_slot_get_changes_guts*, and the PG_CATCH() block does not reset NInitialRunningXacts and InitialRunningXacts. A later call to pg_logical_slot_get_changes_guts may then hit the assertion and dump core. Unfortunately, I couldn't find a minimal reproduction case. However, I observed an *ERROR: canceling statement due to statement timeout* logged before each core occurred. (For some reason, I can't provide the log itself.)
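
The suspected sequence can be illustrated with a small standalone C sketch. This is not PostgreSQL code: it only borrows the two variable names from the report and uses setjmp/longjmp as a rough stand-in for PG_TRY()/PG_CATCH(). The functions builder_state_is_clean(), decode_once() and reset_builder_state() are hypothetical names invented for this illustration, and the reset helper is one conceivable remedy, not necessarily what the committed fix does.

```c
#include <setjmp.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the file-level variables named in the report (types simplified). */
static unsigned int *InitialRunningXacts = NULL;
static int NInitialRunningXacts = 0;

/* Assumption: a single jmp_buf roughly models the error-recovery jump. */
static jmp_buf error_context;

/* Same condition as the failed assertion: both variables must be unset. */
bool builder_state_is_clean(void)
{
    return NInitialRunningXacts == 0 && InitialRunningXacts == NULL;
}

/* Hypothetical remedy: explicitly clear the state, e.g. after an error. */
void reset_builder_state(void)
{
    InitialRunningXacts = NULL;
    NInitialRunningXacts = 0;
}

/*
 * Simulates one decoding call. Returns whether the state was clean on entry,
 * i.e. whether the assertion from the report would have passed.
 */
bool decode_once(bool raise_error)
{
    static unsigned int xids[2] = {100, 101};   /* made-up transaction ids */
    bool entered_clean = builder_state_is_clean();

    if (setjmp(error_context) == 0)
    {
        /* "PG_TRY" part: remember the initially running transactions. */
        InitialRunningXacts = xids;
        NInitialRunningXacts = 2;

        if (raise_error)
            longjmp(error_context, 1);          /* e.g. a statement timeout */

        /* Normal completion clears the state again. */
        reset_builder_state();
    }
    else
    {
        /* "PG_CATCH" part: nothing resets the two variables here. */
    }
    return entered_clean;
}
```

In this model, a call that ends in an error leaves NInitialRunningXacts at 2 and InitialRunningXacts non-NULL, so the next decode_once() call reports an unclean entry state, which corresponds to the values seen in frame #3. Calling reset_builder_state() after the error makes the following call start clean again.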