LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue
Alexandra Wang <alexandra.wang.oss@gmail.com>
From: Alexandra Wang <alexandra.wang.oss@gmail.com>
To: Daniil Davydov <3danissimo@gmail.com>,
PostgreSQL Hackers <pgsql-hackers@postgresql.org>
Date: 2025-08-06T04:50:01Z
Lists: pgsql-hackers
Commits
Same data as JSON:
GET /api/v1/messages/:b64id/commits
the thread's linked commits as JSON, with link sources.
API reference →
-
Clear 'xid' in dummy async notify entries written to fill up pages
- 84f1bf4afa5e 14.21 landed
- 21a9014cf00a 15.16 landed
- 0e8eaa2181d4 16.12 landed
- d80d5f099502 17.8 landed
- 82fa6b78dba1 18.2 landed
- 0bdc777e8007 19 (unreleased) landed
-
Fix remaining race condition with CLOG truncation and LISTEN/NOTIFY
- c2e58c0711fe 14.21 landed
- 0c862646cf2a 15.16 landed
- 44e8c60be66c 16.12 landed
- c2682810ab7d 17.8 landed
- 7b069a1876e4 18.2 landed
- 797e9ea6e54b 19 (unreleased) landed
-
Fix bug where we truncated CLOG that was still needed by LISTEN/NOTIFY
- eba917d360e7 14.21 landed
- 1a469d7b5b7d 15.16 landed
- 053e1868b7ee 16.12 landed
- d02c03ddc5e3 17.8 landed
- 321ec54625fd 18.2 landed
- 8eeb4a0f7c06 19 (unreleased) landed
-
Escalate ERRORs during async notify processing to FATAL
- 7cb05dd2d198 14.21 landed
- b1da37de21d4 15.16 landed
- c1a5bde003b8 16.12 landed
- b821c92920f0 17.8 landed
- aab4a84bb070 18.2 landed
- 1b4699090eaf 19 (unreleased) landed
-
Limit the size of TID lists during parallel GIN build
- c98dffcb7c70 19 (unreleased) cited
Attachments
- 0001-Advance-tail-of-async-queue-before-updating-datfroze.patch.no-cfbot (application/octet-stream)
Hi,
I'm bringing up a bug that was reported multiple times [1][2][3] in
the bugs list here, for a broader audience.
The issue is that an ERROR like the one below occurs when trying to
register any listener in the database.
test=# listen c21;
ERROR: 58P01: could not access status of transaction 14279685
DETAIL: Could not open file "pg_xact/000D": No such file or directory.
LOCATION: SlruReportIOError, slru.c:1087
In [1], Andrei Varashen provided detailed reproduction steps. I’m
copying and pasting his example from thread [1] here, with slight
simplification.
Pre-conditions:
- Disable autovacuum to avoid its intervention.
Steps to reproduce:
1. Create a test table and notify (but not listen) to a channel
(backend 1):
create table test (id int);
insert into test values (1);
notify c1;
2. List pg_xact files so we know its starting state:
➜ ls -lah ~/pg-devel/data/pg_xact
total 16
drwx------@ 3 alex.wang staff 96B Aug 5 20:10 .
drwx------@ 26 alex.wang staff 832B Aug 5 20:12 ..
-rw-------@ 1 alex.wang staff 8.0K Aug 5 20:12 0000
3. Prepare a "test.sql" file for pgbench, and then run pgbench to
generate lots of transactions, so that pg_xact/0000 is completely
filled and leads to pg_xact/0001.
> cat test.sql
UPDATE test SET id = 1;
> pgbench -n -c 80 -j 10 -t 15000 -f ~/workspace/test.sql postgres
4. Verify that pg_xact/0001 is created:
➜ ls -lah ~/pg-devel/data/pg_xact
total 560
drwx------@ 4 alex.wang staff 128B Aug 5 20:25 .
drwx------@ 26 alex.wang staff 832B Aug 5 20:12 ..
-rw-------@ 1 alex.wang staff 256K Aug 5 20:25 0000
-rw-------@ 1 alex.wang staff 16K Aug 5 20:25 0001
5. Execute VACUUM FREEZE on every database on the server to freeze
rows and purge pg_xact/0000.
postgres=# VACUUM FREEZE;
VACUUM
postgres=# \c template1
You are now connected to database "template1" as user "postgres".
template1=# VACUUM FREEZE;
VACUUM
template1=# ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true;
ALTER DATABASE
template1=# \c template0
You are now connected to database "template0" as user "postgres".
template0=# VACUUM FREEZE;
VACUUM
postgres=# select datname, datfrozenxid from pg_database;
datname | datfrozenxid
-----------+--------------
postgres | 1200774
template0 | 1200775
template1 | 1200774
(3 rows)
6. Ensure that pg_xact/0000 is gone:
➜ ls -lah ~/pg-devel/data/pg_xact
total 80
drwx------@ 3 alex.wang staff 96B Aug 5 20:29 .
drwx------@ 26 alex.wang staff 832B Aug 5 20:12 ..
-rw-------@ 1 alex.wang staff 40K Aug 5 20:30 0001
7. Try to listen to any channel from any backend connection on the
same database:
postgres=# listen c1;
ERROR: 58P01: could not access status of transaction 773
DETAIL: Could not open file "pg_xact/0000": No such file or directory.
LOCATION: SlruReportIOError, slru.c:1087
postgres=# listen c2;
ERROR: 58P01: could not access status of transaction 773
DETAIL: Could not open file "pg_xact/0000": No such file or directory.
LOCATION: SlruReportIOError, slru.c:1087
Here's the stack trace from the master branch:
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 21.1
* frame #0: 0x0000000102c798c8
postgres`SlruReportIOError(ctl=0x000000010378e5e0, pageno=0, xid=773) at
slru.c:1084:4
frame #1: 0x0000000102c792b0
postgres`SimpleLruReadPage(ctl=0x000000010378e5e0, pageno=0, write_ok=true,
xid=773) at slru.c:603:4
frame #2: 0x0000000102c79f7c
postgres`SimpleLruReadPage_ReadOnly(ctl=0x000000010378e5e0, pageno=0,
xid=773) at slru.c:661:9
frame #3: 0x0000000102c6a6bc postgres`TransactionIdGetStatus(xid=773,
lsn=0x000000016d2975e8) at clog.c:745:11
frame #4: 0x0000000102c7e924
postgres`TransactionLogFetch(transactionId=773) at transam.c:79:14
frame #5: 0x0000000102c7e74c
postgres`TransactionIdDidCommit(transactionId=773) at transam.c:130:14
frame #6: 0x0000000102dc8b94
postgres`asyncQueueProcessPageEntries(current=0x000000016d297720,
stop=QueuePosition @ 0x000000016d297690, page_buffer="\U00000014",
snapshot=0x000000012d812398) at async.c:2069:13
frame #7: 0x0000000102dc8954 postgres`asyncQueueReadAllNotifications at
async.c:1981:18
frame #8: 0x0000000102dc6b5c postgres`Exec_ListenPreCommit at
async.c:1127:3
frame #9: 0x0000000102dc664c postgres`PreCommit_Notify at async.c:881:6
frame #10: 0x0000000102c8c77c postgres`CommitTransaction at
xact.c:2341:2
frame #11: 0x0000000102c87b2c postgres`CommitTransactionCommandInternal
at xact.c:3214:4
frame #12: 0x0000000102c87a44 postgres`CommitTransactionCommand at
xact.c:3175:10
frame #13: 0x000000010321da94 postgres`finish_xact_command at
postgres.c:2833:3
frame #14: 0x000000010321b64c
postgres`exec_simple_query(query_string="listen c1;") at postgres.c:1298:4
frame #15: 0x000000010321a714 postgres`PostgresMain(dbname="postgres",
username="alex.wang") at postgres.c:4767:7
frame #16: 0x0000000103211a14
postgres`BackendMain(startup_data=0x000000016d299e48, startup_data_len=24)
at backend_startup.c:124:2
frame #17: 0x00000001030e938c
postgres`postmaster_child_launch(child_type=B_BACKEND, child_slot=56,
startup_data=0x000000016d299e48, startup_data_len=24,
client_sock=0x000000016d299ed8) at launch_backend.c:290:3
frame #18: 0x00000001030f0d60
postgres`BackendStartup(client_sock=0x000000016d299ed8) at
postmaster.c:3587:8
frame #19: 0x00000001030eebc4 postgres`ServerLoop at postmaster.c:1702:6
frame #20: 0x00000001030ed67c postgres`PostmasterMain(argc=3,
argv=0x00006000021f14e0) at postmaster.c:1400:11
frame #21: 0x0000000102f73e40 postgres`main(argc=3,
argv=0x00006000021f14e0) at main.c:231:4
frame #22: 0x00000001940a2b98 dyld`start + 6076
Daniil Davydov has analyzed the root cause in thread [4] and I agree with
what he said:
On Thu, Jul 31, 2025 at 8:21 PM Daniil Davydov <3danissimo@gmail.com> wrote:
> We have the following logic in the notify queue :
> If there are no listeners within all databases, and we are calling
> LISTEN, then we must iterate from 'tail' to 'head' of the queue and
> check statuses of transactions (see Exec_ListenPreCommit).
> If there is a pruned-away xid in the queue, we will try to access its
> status and get an error.
>
> Because the tail of the queue is not necessarily always advanced
> forward by the listeners, we can get such error without any long lived
> transactions.
>
The fix and workarounds were discussed in [5] and [6]: In [5], Daniil
proposed a patch, which I’ve attached. The patch adds a call to
asyncQueueAdvanceTail() in vac_update_datfrozenxid(), so that VACUUM
advances the async queue tail.
Similarly, calling the built-in function pg_notification_queue_usage()
also advances the async queue tail. So when the issue occurs, calling
this function could also make the error go away. However, this doesn’t
prevent the error from happening in the first place.
My questions:
1. Is it acceptable to drop notifications from the async queue if
there are no active listeners? There might still be notifications that
haven’t been read by any previous listener.
2. If the answer to 1 is no, how can we teach VACUUM to respect the
minimum xid stored in all AsyncQueueEntries?
Best,
Alex
[1]
https://www.postgresql.org/message-id/18804-bccbbde5e77a68c2%40postgresql.org
[2]
https://www.postgresql.org/message-id/16961-25f29f95b3604a8a@postgresql.org
[3]
https://www.postgresql.org/message-id/18394-e7459245148578b2@postgresql.org
[4]
https://www.postgresql.org/message-id/CAJDiXgh3Jh2N90Fe4%3DX2qE%2BYAiZ1BSpgznhH%2BvkTHmvK3gjqxw%40mail.gmail.com
[5]
https://www.postgresql.org/message-id/CAJDiXgj1BmLKAZ%3DgOC1eETmctt7z%3Daj8MHfmd%2BnORE1P6qncsA%40mail.gmail.com
[6]
https://www.postgresql.org/message-id/CAK98qZ0uZrf1b6L4mDs%2B17M07KCNt6YMc9vD-%2BTmeWaJHRbGpA%40mail.gmail.com