Thread
-
Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue
Matheus Alcantara <matheusssilv97@gmail.com> — 2025-10-25T00:32:54Z
Thanks for the review! On Thu Oct 23, 2025 at 8:42 PM -03, Masahiko Sawada wrote: > On Wed, Oct 22, 2025 at 10:25 AM Matheus Alcantara > <matheusssilv97@gmail.com> wrote: >> >> On Wed Oct 22, 2025 at 12:49 PM -03, Álvaro Herrera wrote: >> > On 2025-Oct-22, Matheus Alcantara wrote: >> > >> >> I'm wondering if the 002_aborted_tx_notifies.pl is still needed with >> >> this architecture being used. I think that it's not, but perhaps is a >> >> good test to keep it? >> > >> > I'd rather have tests than not, but I'd think it needs to be behind >> > PG_TEST_EXTRA because of things like >> > >> > +$node->safe_psql('postgres', 'select consume_xids(10000000);'); >> > >> Attached v10 with wrapping into PG_TEST_EXTRA. > > Thank you for updating the patches! > > I've reviewed the patches and here are the comments. > > v10-0001-Prevent-VACUUM-from-truncating-XIDs-still-presen.patch: > > + * > + * Returns InvalidTransactionId if there are no unprocessed notifications. > > This comment is not accurate since the function returns > InvalidTransactionId even if all unprocessed notifications are created > on other databases. > Fixed > --- > +TransactionId > +GetOldestQueuedNotifyXid(void) > +{ > > How about renaming it to something like GetOldestNotifyTransactionId()? > Yeap, sounds better, fixed. > --- > + /* page_buffer must be adequately aligned, so use a union */ > + union > + { > + char buf[QUEUE_PAGESIZE]; > + AsyncQueueEntry align; > + } page_buffer; > > asyncQueueReadAllNotifications() uses this union too, so how about > define this type as AlignedQueueEntryPage or something and use it in > both functions? > Good point, fixed. > --- > + > + LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE); > + > > I don't think we need an exclusive lock here. > Fixed. > --- > In GetOldestQueuedNotifyXid() why do we keep holding NotifyQueueLock > while calculating the oldest XID in the queue? > Yeah, I don't think that it's necessary. The asyncQueueReadAllNotifications() for example only hold the lock when reading the QUEUE_BACKEND_POS and QUEUE_HEAD. I think that it's a similar case here, fixed. > --- > + > + /* Process all entries on this page up to head */ > + while (curoffset + QUEUEALIGN(AsyncQueueEntryEmptySize) <= > QUEUE_PAGESIZE && > + !QUEUE_POS_EQUAL(pos, head)) > + { > + qe = (AsyncQueueEntry *) (page_buffer.buf + curoffset); > (snip) > + /* Advance to next entry */ > + if (asyncQueueAdvance(&pos, qe->length)) > + break; /* advanced to next page */ > + > > I think the check "curoffset + QUEUEALIGN(AsyncQueueEntryEmptySize) <= > QUEUE_PAGESIZE" is essentially the same as the one we do in > asyncQueueAdvance(). I think we can refactor the inner loop in > GetOldestQueuedNotifyXid() to something like: > > for (;;) > { > bool reachedEndOfPage; > > qe = (AsyncQueueEntry *) (page_buffer.buf + curoffset); > > // check qe->xid here... > > reachedEndOfPage = asyncQueueAdvance(&pos, qe->length); > > if (reachedEndOfPage || QUEUE_POS_EQUAL(pos, head)) > break; > } > Fixed > --- > v10-0002-Add-tap-tests-for-listen-notify-vacuum-freeze.patch: > > This new test directory needs to have .gitignore. > Fixed > --- > +# Copyright (c) 2024-2025, PostgreSQL Global Development Group > + > > It's better to have a short description of this test here. > Fixed > --- > +use File::Path qw(mkpath); > > It seems not necessary. > Fixed > --- > +$node->safe_psql('postgres', > + 'CREATE TABLE t AS SELECT g AS a, g+2 AS b from > generate_series(1,100000) g;' > > Why does it need to insert many rows? > I think that this value was left when I was writing the first versions of this test, I don't think that it's necessary. I reduced to just 10 rows. > --- > +# --- Session 2, multiple notify's, and commit --- > +for my $i (1 .. 10) > +{ > + $node->safe_psql( > + 'postgres', " > + BEGIN; > + NOTIFY s, '$i'; > + COMMIT;"); > +} > > Why does it need to send a notification 10 times? > I just wanted to test with multiple notifications, we can reduce to two or three, there is not specific reason to send 10 notifications. > --- > +# Consume enough XIDs to trigger truncation > +$node->safe_psql('postgres', 'select consume_xids(10000000);'); > + > > I guess it consumes XID too much to trigger truncation. Given the one > clog segment file is 256kB in size, it's enough to consume 1,050,000 > XIDs to move to the next segment file. > Fixed > --- > +# Get the new datfrozenxid after vacuum freeze to ensure that is advanced but > +# we can still get the notification status of the notification > +my $datafronzenxid_freeze = $node->safe_psql('postgres', "select > datfrozenxid from pg_database where datname = 'postgres'"); > +ok($datafronzenxid_freeze > $datafronzenxid, 'datfrozenxid is advanced'); > > I think this test passes in all cases, so it is meaningless. Instead, > what we need to check in terms of datfrozenxid is that its value > doesn't get greater than the XID we used for the notification, even > after vacuum freeze. I think we remember the XID used for NOTIFY and > check if the old/new datfrozenxid values are older than that value. > I think that was the goal of this test. I've swap the values to make it more clear. Please let me know if I misunderstood your point here. > --- > +is($notifications_count, 10, 'received all committed notifications'); > + > +done_testing(); > > After consuming the unconsumed notifications, let's do vacuum freeze > and check if datfrozenxid now can be advanced. > Good point, during this I realize that the datafrozenxid was not being advanced even after the queue being consumed. This was because when a backend listener is consuming the notifications it only update their own queue tail pointer and not the global shared tail pointer, so I've included a call to asyncQueueAdvanceTail() at the beginning of GetOldestNotifyTransactionId() to do this. > --- > I would expect to add 002_aborted_tx_notifies.pl in a separate patch > since it's not related to this bug fix. > On the new attached version I've moved this test to a separated patch. > --- > +# Test checks that listeners do not receive notifications from aborted > +# transaction even if notifications have been added to the listen/notify > +# queue. To reproduce it we use the fact that serializable conflicts > +# are checked after tx adds notifications to the queue. > > I wonder if we could implement this test using the isolation test > instead of the tap test. Is there any reason why you used a tap test > for that? > On this new version I just moved the test into a separated patch but I agree that we can implement this using the isolation test, I don't know how we should handle the follow-up patches that Joel have sent on [1], if we should incoportate on 0001 or not. I still need to review and test them. [1] https://www.postgresql.org/message-id/b6f6cbb8-1903-4ca0-ae64-01d84d1e12a3%40app.fastmail.com -- Matheus Alcantara