Thread
-
Re: [PATCH] pg_surgery: Fix WAL corruption from concurrent heap_force_kill
Zsolt Parragi <zsolt.parragi@percona.com> — 2026-05-05T20:42:43Z
Hello! I verified both the patch, and that the test case catches the bug without the patch, the overall change seems good to me. + if (did_modify_vm) + INJECTION_POINT_CACHED("heap-force-kill-before-vm-wal", NULL); + if (did_modify_vm && RelationNeedsWAL(rel)) + { log_newpage_buffer(vmbuf, false); + LockBuffer(vmbuf, BUFFER_LOCK_UNLOCK); + } Is the additional if intentional here? Based on the name it seems like it could be simply part of the next if, currently it also fires for unlogged tables. +# Give session 2 time to reach the VM buffer lock (or complete if +# unfixed). We cannot reliably detect blocking from Perl, so just +# sleep briefly. +use Time::HiRes qw(usleep); +usleep(500_000); Wouldn't this be possibly unstable on CI? The following seems to work for me, both for detecting the issue in the unpatched version and to result in a quicker continuation in the patched version: (~1.55sec total execution time compared to ~2sec original) (with the patch, s2 wait for the loc, without the patch it finishes and becomes idle) my $s2 = $node->background_psql('postgres'); my $s2_pid = $s2->query_safe(q{SELECT pg_backend_pid()}); chomp $s2_pid; $s2->query_until(qr/starting_s2/, q(\echo starting_s2 SELECT heap_force_kill('test_vm'::regclass, ARRAY['(1,1)']::tid[]); \echo s2_done )); use Time::HiRes qw(usleep); my $observed = ''; my $deadline = time() + 10; while (time() < $deadline) { $observed = $node->safe_psql('postgres', qq{ SELECT format('%s/%s/%s', coalesce(wait_event_type, 'NULL'), coalesce(wait_event, 'NULL'), coalesce(state, 'NULL')) FROM pg_stat_activity WHERE pid = $s2_pid }); last if $observed =~ m{^Buffer/BufferExclusive/} || $observed =~ m{/idle$}; usleep(50_000); }