64-bit wait_event and introduction of 32-bit wait_event_arg
Jakub Wartak <jakub.wartak@enterprisedb.com> — 2025-12-08T09:54:41Z
Hi all,

We were debating internally whether moving to a 64-bit wait_event would be acceptable (Robert's primary concern is that the extra info may be too limited), but I had code to demo this, so let's discuss it further.

After checking that 64-bit int math has the same performance characteristics as 32-bit math, at least on x86_64, I've converted our wait_event_info (32-bit today) to 64 bits while trying to use pg atomics, then used some bit-masking voodoo to expose the lower 32 bits as a new wait_event_arg, with some dumb demos. The idea is to encode some specific (limited, but useful!) information into the wait event variable itself, so we gain an additional 32 bits of space for details alongside the wait event, to help assess some wait-event-related problems. This probably comes without any performance impact, at least on the reasonable platforms used in production today (that is, the ones with PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY).

Intended use pattern: if I were chasing a specific wait_event-related problem, I could extract certain info straight from wait_event_arg, which is much easier than drilling into other, more advanced views (if that information is exposed at all; often it's not).

Q0) Key question: does this sound like a good idea to pursue further, or are there any blockers?

Sample demos are included in the patch. Depending on the specific wait_event, wait_event_arg could be:

1. PgSleep could show the time since it was launched (the simplest thing one can imagine; or maybe we could think about time left too?):

  pid  |  backend_type  | wait_event_type | wait_event | wait_event_arg | query
-------+----------------+-----------------+------------+----------------+----------------------------------------------------------------------------------------------------------------------------------
 78317 | client backend | Timeout         | PgSleep    |             10 | select 'imagine complex stuff here dozes of kB SQL text query,procedures, functions' as s, pg_sleep(10) as embedded_internally;

2. Passing the exact oid of the relation we are waiting on (here pid 82242 was doing "alter table p3 add..", but it is waiting for the backend that executed "lock table p3 in exclusive mode;"). We can decode wait_event_arg right into the relation (p3):

postgres=# select pid, backend_type, wait_event_type, wait_event, wait_event_arg, wait_event_arg::regclass, query from pg_stat_activity where state = 'active' and (wait_event_type, wait_event) = ('Lock', 'relation');
  pid  |  backend_type  | wait_event_type | wait_event | wait_event_arg | wait_event_arg |             query
-------+----------------+-----------------+------------+----------------+----------------+--------------------------------
 82242 | client backend | Lock            | relation   |          16467 | p3             | alter table p3 add id3 bigint;

3. IPC/SyncRep (SyncRepWaitForLSN()) could report the PID of the slowest walsender. This is useful when multiple walsenders are involved, to pinpoint where you might be slow/stuck:

  pid   | application_name | wait_event_type |  wait_event   | wait_event_arg | q
--------+------------------+-----------------+---------------+----------------+------------------------------------------
 120318 | pgbench          | IPC             | SyncRep       |         119689 | INSERT INTO child (parent_id, payload)
 120319 | pgbench          | IPC             | SyncRep       |         119689 | INSERT INTO child (parent_id, payload)
 120320 | pgbench          | IPC             | SyncRep       |         120248 | INSERT INTO child (parent_id, payload)
 120321 | pgbench          | IPC             | SyncRep       |         119689 | INSERT INTO child (parent_id, payload)
 119689 | walreceiver2     | Activity        | WalSenderMain |                | START_REPLICATION 0/DC000000 TIMELINE 1
 120248 | walreceiver      | Activity        | WalSenderMain |                | START_REPLICATION 0/E2000000 TIMELINE 1

(then you would basically query pg_stat_replication for pid = 119689, as it seems to be the slowest one here)

4. DataFile could report the fd. Yes, it can differ from backend to backend (due to the fd cache), but it's a demo. It would probably be better with oid/relationNumber, but that's not fast to do :) And although we already have the dboid, there's also the tablespace, and I don't know how we could squeeze RelFileNumber together with the tablespace in there; possibly we could just use the tablespace Oid there too.

  pid  |  backend_type  | wait_event_type |  wait_event  | wait_event_arg | query
-------+----------------+-----------------+--------------+----------------+------------------------------------------------------------
 77467 | client backend | IO              | DataFileRead |              8 | SELECT abalance FROM pgbench_accounts WHERE aid = 8657837;
 77470 | client backend | IO              | DataFileRead |             11 | SELECT abalance FROM pgbench_accounts WHERE aid = 6840630;

5. (Challenging for me) MultiXact wait events: with wait_event_arg we could report where stuff is really waiting. Right now it's a little guesswork, but with the 0002 concept:

dbmultixact=# select wait_event_type, wait_event, wait_event_arg, count(*) from pg_stat_activity where state='active' group by wait_event_type, wait_event, wait_event_arg order by 4 desc limit 5;
 wait_event_type |     wait_event      | wait_event_arg | count
-----------------+---------------------+----------------+-------
 LWLock          | BufferContent       |                |   365
 Lock            | tuple               |          16494 |    42
 LWLock          | MultiXactOffsetSLRU |          16494 |    13
 Lock            | transactionid       |                |    10
 LWLock          | MultiXactOffsetSLRU |                |     9

dbmultixact=# select pid, query, wait_event_type, wait_event,wait_event_arg from pg_stat_activity where wait_event = 'MultiXactMemberSLRU';
  pid  |                               query                                | wait_event_type |     wait_event      | wait_event_arg
-------+--------------------------------------------------------------------+-----------------+---------------------+----------------
 99864 | INSERT INTO users (loc_id, fname) VALUES (2,'Testing User-2-002'); | LWLock          | MultiXactMemberSLRU |          16494

dbmultixact=# select 16494::regclass;
 regclass
-----------
 locations

dbmultixact=# \d users
[..]
    "users_loc_id_fkey" FOREIGN KEY (loc_id) REFERENCES locations(loc_id)

End users would learn from the docs (probably some table) what exactly is stored in wait_event_arg for a given wait_event. Probably each different wait_event could be enhanced with some information.

Quick performance crosscheck of 0001 alone:

/usr/pgsql19/bin/pgbench -c 4 -P 1 -T 30 -S postgres
master:  tps = 121020.723246 (without initial connection time)
patched: tps = 121802.527000 (without initial connection time)

Q1) Because we compile without -Wconversion, I was wondering whether we need a safe/strict uint64 struct-like type that would catch errors when something like the uint64 returned by WaitEventExtensionNew() is used externally by extensions as a uint32?
(Because we do not build with -Wconversion [too verbose?], any uint64 return value will be silently cast to uint32 in extensions without any warning. That may cause hangs during tests: tests often wait for some wait event to show up, but it won't.)

Q2) 0002: Please ignore the quality of 0002. I did not want to sink more time into the MultiXact stuff, especially if the main concept gets shot down. The main problem is how to get the RelFileNumber of the relation that is facing MultiXact back into LWLockReportWaitStart(). Here I just wanted to see how much rework would be necessary (passing variables, modifying the API and so on). In short: it keeps LWLockAcquire() as a fallback to LWLockAcquireExt(.. RelFileNumber r), but sadly it still gets pretty nasty soon, and lots of stuff needs to be dumb-adjusted. I would like to point out that I'm a complete multixact/heapam noob, so this is surely a very dumb way of passing that info, in way too many places.

Another thing we could do is have some "static uint32 lwlock_relation" inside lwlock and just set it (and reset it) once from within heap*.c or similar. Then all dependent LWLock routines would OR it in (so it would be visible as wait_event_arg), and we would get the involved RelFileNumber for all operations there (at least for LWLocks).

The only con I could think of is this: once we expose something as 32 bits, if a following major release makes some internal structure/data type heavier, it could no longer be exposed that way (think of e.g. 64-bit OIDs?).

Any help, opinions, ideas and code/co-authors are more than welcome.

-J.

[1] Disassembly of a stock binary taken from PGDG on Ubuntu/Debian x86_64 (so as used by real users) shows use of the 64-bit (rax) register, e.g. in AT&T mnemonics:

    lea 0x6a4c1c(%rip), %rax   // load the address of my_wait_event_info into %rax
    mov (%rax), %rax           // dereference the pointer (result goes back into %rax)
    mov %ebx, (%rax)           // write the 32-bit value of %ebx to the address held in 64-bit %rax

Or a different but very similar example in Intel mnemonics:

    lea rax,[rip+0x865b09]
    mov rax,QWORD PTR [rax]    // notice it's already RAX and a quadword
    mov DWORD PTR [rax],0x0

[2] x86_64 Linux, operations on eax vs rax; the ratio is 1.00646 under non-ideal conditions.

    Benchmarking int32_t (32-bit) additions...
    Operations/Second (int32_t): 3.37e+07
    Benchmarking int64_t (64-bit) additions...
    Operations/Second (int64_t): 3.38e+07
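
For readers who want to see the bit layout discussed in this mail, here is a minimal standalone sketch. It is not code from the patch: the names WaitEvent64, wait_event_pack(), wait_event_info_of() and wait_event_arg_of() are hypothetical. The only things it takes from the mail are that the lower 32 bits carry wait_event_arg (so the existing 32-bit wait_event_info is assumed to sit in the upper half) and the Q1 idea of a struct-like type, which the compiler refuses to convert implicitly to uint32.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical strict wrapper (the Q1 idea): a struct cannot be implicitly
 * converted to uint32_t, so accidental truncation in an extension becomes
 * a compile error instead of a silent cast.
 */
typedef struct WaitEvent64
{
    uint64_t    value;
} WaitEvent64;

/* The wrapper must stay one 8-byte word so it can be written in one store. */
static_assert(sizeof(WaitEvent64) == sizeof(uint64_t),
              "WaitEvent64 must be exactly 8 bytes");

/*
 * Assumed layout: upper 32 bits = today's wait_event_info,
 * lower 32 bits = the new wait_event_arg.
 */
static inline WaitEvent64
wait_event_pack(uint32_t wait_event_info, uint32_t wait_event_arg)
{
    WaitEvent64 ev;

    ev.value = ((uint64_t) wait_event_info << 32) | wait_event_arg;
    return ev;
}

/* Extract the classic 32-bit wait event (class + event id). */
static inline uint32_t
wait_event_info_of(WaitEvent64 ev)
{
    return (uint32_t) (ev.value >> 32);
}

/* Extract the 32-bit argument, e.g. a relation oid or a PID. */
static inline uint32_t
wait_event_arg_of(WaitEvent64 ev)
{
    return (uint32_t) (ev.value & UINT64_C(0xFFFFFFFF));
}
```

With this layout, packing a wait event together with 16467 (the relation oid from demo 2) and calling wait_event_arg_of() gives 16467 back, while something like "uint32 x = ev;" fails to compile. Whether the real patch orders the halves this way is an assumption here.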