RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder()
Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
From: "Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>
To: "ocean_li_996@163.com" <ocean_li_996@163.com>
Cc: "pgsql-bugs@lists.postgresql.org" <pgsql-bugs@lists.postgresql.org>, Amit
Kapila <amit.kapila16@gmail.com>
Date: 2023-08-17T06:51:02Z
Lists: pgsql-bugs
Commits
Same data as JSON:
GET /api/v1/messages/:b64id/commits
the thread's linked commits as JSON, with link sources.
API reference →
-
Fix uninitialized access to InitialRunningXacts during decoding after ERROR.
- c7256e6564fa 15.5 landed
- f7d25117ba87 14.10 landed
- c570bb4d61b6 13.13 landed
- 7e57208ed51a 12.17 landed
- feb4e218e5f9 11.22 landed
Attachments
- v2-0001-Cleanup-decoding-context-and-slots-in-error-cases.patch (application/octet-stream) patch v2-0001
On Tuesday, August 15, 2023 12:05 AM PG Bug reporting form <noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference: 18055
> Logged by: ocean li
> Email address: ocean_li_996@163.com
> PostgreSQL version: 11.9
> Operating system: centos7 5.10.84 x86_64
> Description:
>
> For testing logical decoding module, *pg_logical_slot_get_changes* function
> is used. Sometimes i got an core whose stack was like that:
> ...
> And in level #3 of stack above, NInitialRunningXacts is 2 and
> InitialRunningXacts is not NULL observed in one of cores.
>
> Using of NInitialRunningXacts and InitialRunningXacts are clear. Currently, the
> core, as far as i know, maybe caused by this way: an ERROR raised when calling
> *pg_logical_slot_get_changes_guts* function. The code part of
> PG_CATCH() doses not reset NInitialRunningXacts and InitialRunningXacts.
> Then, calling pg_logical_slot_get_changes_guts again, the core may occur.
> Unfortunately, I couldn't find a minimal reproduction case. However, I
> observed an *ERROR: canceling statement due to statement timeout* logged
> before each core occurred. (For some reason, I can't provide the information of
> log)
Thanks for the report and the fix!
I can also reproduce by the steps[1] in PG15~11(Note that we need to change
LOG_SNAPSHOT_INTERVAL_MS to a bigger number to avoid extra running xacts wal
records).
About the patch:
---
/* free context, call shutdown callback */
FreeDecodingContext(ctx);
ReplicationSlotRelease();
InvalidateSystemCaches();
}
PG_CATCH();
{
+
+ /* free context, call shutdown callback */
+ FreeDecodingContext(ctx);
+
+ ReplicationSlotRelease();
I think we could not directly call the cleanup functions here. There could be
two risks:
1) it's possible either 'ctx' hasn't been initialized before the error happen,
we don't need to clean it up in this case
2) It's possible 'ctx' or 'MyReplicationSlot' been be cleaned up normally
before entering the catch() block which means we will double cleanup(free) it.
To improve this, I think we can use PG_FINALLY() here and move all these cleanup functions
in it and do the null check for 'ctx' before cleaning up.
Just share one small patch based on your fix for reference, feel free to update it.
[1]
Session 1:
DROP TABLE IF EXISTS tbl1;
CREATE TABLE tbl1 (val1 integer, val2 integer);
CREATE TABLE user_cat (val1 integer) WITH (user_catalog_table = true);
SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
BEGIN;
SAVEPOINT sp1;
INSERT INTO tbl1 VALUES (1);
Session 2:
CHECKPOINT;
SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
Session 1:
TRUNCATE tbl1;
COMMIT;
BEGIN;
INSERT INTO tbl1 VALUES (1);
Session 2:
CHECKPOINT;
SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
Session 1:
COMMIT;
Session 2:
-- Use gdb or modify the code to make the first function call fail.
SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
-- Since the previous command failed and the NInitialRunningXacts was not reset, we should see the Assert() failure this time.
SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
Best Regards,
Hou zj