Thread

  1. Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

    Matheus Alcantara <matheusssilv97@gmail.com> — 2025-10-25T00:32:54Z

    Thanks for the review!
    
    On Thu Oct 23, 2025 at 8:42 PM -03, Masahiko Sawada wrote:
    > On Wed, Oct 22, 2025 at 10:25 AM Matheus Alcantara
    > <matheusssilv97@gmail.com> wrote:
    >>
    >> On Wed Oct 22, 2025 at 12:49 PM -03, Álvaro Herrera wrote:
    >> > On 2025-Oct-22, Matheus Alcantara wrote:
    >> >
    >> >> I'm wondering if the 002_aborted_tx_notifies.pl is still needed with
    >> >> this architecture being used. I think that it's not, but perhaps is a
    >> >> good test to keep it?
    >> >
    >> > I'd rather have tests than not, but I'd think it needs to be behind
    >> > PG_TEST_EXTRA because of things like
    >> >
    >> > +$node->safe_psql('postgres', 'select consume_xids(10000000);');
    >> >
    >> Attached v10 with wrapping into PG_TEST_EXTRA.
    >
    > Thank you for updating the patches!
    >
    > I've reviewed the patches and here are the comments.
    >
    > v10-0001-Prevent-VACUUM-from-truncating-XIDs-still-presen.patch:
    >
    > + *
    > + * Returns InvalidTransactionId if there are no unprocessed notifications.
    >
    > This comment is not accurate since the function returns
    > InvalidTransactionId even if all unprocessed notifications are created
    > on other databases.
    >
    Fixed
    
    > ---
    > +TransactionId
    > +GetOldestQueuedNotifyXid(void)
    > +{
    >
    > How about renaming it to something like GetOldestNotifyTransactionId()?
    >
    Yeap, sounds better, fixed.
    
    > ---
    > +   /* page_buffer must be adequately aligned, so use a union */
    > +   union
    > +   {
    > +       char        buf[QUEUE_PAGESIZE];
    > +       AsyncQueueEntry align;
    > +   }           page_buffer;
    >
    > asyncQueueReadAllNotifications() uses this union too, so how about
    > define this type as AlignedQueueEntryPage or something and use it in
    > both functions?
    >
    Good point, fixed.
    
    > ---
    > +
    > +   LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
    > +
    >
    > I don't think we need an exclusive lock here.
    >
    Fixed.
    
    > ---
    > In GetOldestQueuedNotifyXid() why do we keep holding NotifyQueueLock
    > while calculating the oldest XID in the queue?
    >
    Yeah, I don't think that it's necessary. The
    asyncQueueReadAllNotifications() for example only hold the lock when
    reading the QUEUE_BACKEND_POS and QUEUE_HEAD. I think that it's a
    similar case here, fixed.
    
    > ---
    > +
    > +       /* Process all entries on this page up to head */
    > +       while (curoffset + QUEUEALIGN(AsyncQueueEntryEmptySize) <=
    > QUEUE_PAGESIZE &&
    > +              !QUEUE_POS_EQUAL(pos, head))
    > +       {
    > +           qe = (AsyncQueueEntry *) (page_buffer.buf + curoffset);
    > (snip)
    > +           /* Advance to next entry */
    > +           if (asyncQueueAdvance(&pos, qe->length))
    > +               break;          /* advanced to next page */
    > +
    >
    > I think the check "curoffset + QUEUEALIGN(AsyncQueueEntryEmptySize) <=
    > QUEUE_PAGESIZE" is essentially the same as the one we do in
    > asyncQueueAdvance(). I think we can refactor the inner loop in
    > GetOldestQueuedNotifyXid() to something like:
    >
    > for (;;)
    > {
    >     bool reachedEndOfPage;
    >
    >     qe = (AsyncQueueEntry *) (page_buffer.buf + curoffset);
    >
    >     // check qe->xid here...
    >
    >     reachedEndOfPage = asyncQueueAdvance(&pos, qe->length);
    >
    >     if (reachedEndOfPage || QUEUE_POS_EQUAL(pos, head))
    >         break;
    > }
    >
    Fixed
    
    > ---
    > v10-0002-Add-tap-tests-for-listen-notify-vacuum-freeze.patch:
    >
    > This new test directory needs to have .gitignore.
    >
    Fixed
    
    > ---
    > +# Copyright (c) 2024-2025, PostgreSQL Global Development Group
    > +
    >
    > It's better to have a short description of this test here.
    >
    Fixed
    
    > ---
    > +use File::Path qw(mkpath);
    >
    > It seems not necessary.
    >
    Fixed
    
    > ---
    > +$node->safe_psql('postgres',
    > +   'CREATE TABLE t AS SELECT g AS a, g+2 AS b from
    > generate_series(1,100000) g;'
    >
    > Why does it need to insert many rows?
    >
    I think that this value was left when I was writing the first versions of
    this test, I don't think that it's necessary. I reduced to just 10 rows.
    
    > ---
    > +# --- Session 2, multiple notify's, and commit ---
    > +for my $i (1 .. 10)
    > +{
    > +   $node->safe_psql(
    > +       'postgres', "
    > +       BEGIN;
    > +       NOTIFY s, '$i';
    > +       COMMIT;");
    > +}
    >
    > Why does it need to send a notification 10 times?
    >
    I just wanted to test with multiple notifications, we can reduce to two
    or three, there is not specific reason to send 10 notifications.
    
    > ---
    > +# Consume enough XIDs to trigger truncation
    > +$node->safe_psql('postgres', 'select consume_xids(10000000);');
    > +
    >
    > I guess it consumes XID too much to trigger truncation. Given the one
    > clog segment file is 256kB in size, it's enough to consume 1,050,000
    > XIDs to move to the next segment file.
    >
    Fixed
    
    > ---
    > +# Get the new datfrozenxid after vacuum freeze to ensure that is advanced but
    > +# we can still get the notification status of the notification
    > +my $datafronzenxid_freeze = $node->safe_psql('postgres', "select
    > datfrozenxid from pg_database where datname = 'postgres'");
    > +ok($datafronzenxid_freeze > $datafronzenxid, 'datfrozenxid is advanced');
    >
    > I think this test passes in all cases, so it is meaningless. Instead,
    > what we need to check in terms of datfrozenxid is that its value
    > doesn't get greater than the XID we used for the notification, even
    > after vacuum freeze. I think we remember the XID used for NOTIFY and
    > check if the old/new datfrozenxid values are older than that value.
    >
    I think that was the goal of this test. I've swap the values to make it
    more clear. Please let me know if I misunderstood your point here.
    
    > ---
    > +is($notifications_count, 10, 'received all committed notifications');
    > +
    > +done_testing();
    >
    > After consuming the unconsumed notifications, let's do vacuum freeze
    > and check if datfrozenxid now can be advanced.
    >
    Good point, during this I realize that the datafrozenxid was not being
    advanced even after the queue being consumed. This was because when a
    backend listener is consuming the notifications it only update their own
    queue tail pointer and not the global shared tail pointer, so I've
    included a call to asyncQueueAdvanceTail() at the beginning of
    GetOldestNotifyTransactionId() to do this.
    
    > ---
    > I would expect to add 002_aborted_tx_notifies.pl in a separate patch
    > since it's not related to this bug fix.
    >
    On the new attached version I've moved this test to a separated patch.
    
    > ---
    > +# Test checks that listeners do not receive notifications from aborted
    > +# transaction even if notifications have been added to the listen/notify
    > +# queue. To reproduce it we use the fact that serializable conflicts
    > +# are checked after tx adds notifications to the queue.
    >
    > I wonder if we could implement this test using the isolation test
    > instead of the tap test. Is there any reason why you used a tap test
    > for that?
    >
    On this new version I just moved the test into a separated patch but I
    agree that we can implement this using the isolation test,
    
    I don't know how we should handle the follow-up patches that Joel have
    sent on [1], if we should incoportate on 0001 or not. I still need to
    review and test them.
    
    [1] https://www.postgresql.org/message-id/b6f6cbb8-1903-4ca0-ae64-01d84d1e12a3%40app.fastmail.com
    
    --
    Matheus Alcantara