Re: [PATCH] Fix ProcKill lock-group vs procLatch recycle race
Vlad Lesin <vladlesin@gmail.com>
From: Vlad Lesin <vladlesin@gmail.com>
To: Michael Paquier <michael@paquier.xyz>,
Andrey Borodin <x4mmm@yandex-team.ru>
Cc: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Date: 2026-05-15T12:29:30Z
Lists: pgsql-hackers
Commits
Same data as JSON:
GET /api/v1/messages/:b64id/commits
the thread's linked commits as JSON, with link sources.
API reference →
-
Fix procLatch ownership race in ProcKill()
- 84b9d6bceab6 19 (unreleased) cited
Attachments
- vVL2-0001-injection_points-extract-condition-header.patch (text/x-patch) patch 0001
- vVL2-0002-injection_points-wake-every-matching-waiter.patch (text/x-patch) patch 0002
- vVL2-0003-add-prockill-lockgroup-regression-test.patch (text/x-patch) patch 0003
- vVL2-0004-fix-lockgroup-double-push-and-leak.patch (text/x-patch) patch 0004
- vVL2-0005-fix-prockill-lockgroup-procLatch-recycle-race.patch (text/x-patch) patch 0005
Hello Michael, thank you for your review. I am sending a new series of patches that incorporate your code review notes. The version prefix is "vVL2-" to distinguish Andrey's patches from mine. On 5/11/26 04:58, Michael Paquier wrote: > On Thu, May 07, 2026 at 04:57:28PM +0500, Andrey Borodin wrote: >> Done so. I still keep Vlad's injection_point_condition.h though. >> It seems useful. But no more Meson\Makefile changes, and new sql stuff lives >> now near removable_cutoff() SQL. > > Because you want to have your callbacks in a new regress_prockill.c > while being able to reuse the structures. I don't see why not on > clarity ground. The creation of injection_point_condition.h could be > done in its own commit, independent of the rest, but that's just me > being pedantic about such matters. I think that this should be > backpatched anyway, that could make the introduction of new tests > easier. I moved the creation of injection_point_condition.h into a separate commit in 0001 patch. Additionally, I noticed that if several processes are waiting on the same injection point, only one of them will be awakened by a single injection_points_wakeup() call. I am not sure if this behavior is intentional; please let me know. For context, MySQL has a similar feature called debug sync points [0], which allows waking up several threads waiting on them. I suppose this same semantics would be useful for injection points as well to simplify tests. I implemented this behavior in the 0002 patch and utilized it in 0003. If you disagree with this approach, I can rewrite the test to use two separate injection points for the lock group leader and follower. > +# Two backends form a lock group, then are terminated concurrently. The test > +# uses injection points placed inside ProcKill() to pause both victims there > +# and verify that a freshly forked backend can claim the recycled PGPROC slot > +# without hitting "latch already owned by PID ..." PANIC. > > This claim is slightly incorrect here. If I undo the fix, place the > two INJECTION_POINT macros and re-run the test, the test does not > produce a PANIC. In order to reproduce a PANIC, it seems that we > would need something more complicated with an extra point after a > DisownLatch(). That is not required to me, still this is misleading > and could be improved, I guess.. Yes, you are right, I fixed the race in 0003 patch with additional injection point. > I think that we should refactor the patch into two pieces, > for clarity: > - Patch 1 refactors the freelist manipulations and introduces the two > new boolean flags, documenting more precily how the freelists are > manipulated and why things are done this way (leader, self, etc.). > - Patch 2, which is about the earlier DisownLatch(), then becomes a > fix of its own. That's the primary fix we care about, to avoid the > entries to be recycled due to a too-early PGPROC entry marked as > available. Done. I have split this into two separate patches, 0004 and 0005. The test in 0003 fails on patch 0004 and passes on 0005. > Also, I suspect that the test is racy? For one, I strongly suspect > that you are going to need tweaks similar 011_lock_stats.pl in > test_misc where we rely on some query_until() with psql banners, to > make sure that the waits actually happen.. I got the point about > pg_stat_activity not being something we can rely on here. That's > something that would be easier to see on an overloaded machine. The test should no longer be racy following the changes in 0003. -- Best regards, Vlad [0] https://dev.mysql.com/doc/dev/mysql-server/9.6.0/PAGE_DEBUG_SYNC.html