Re: POC: make mxidoff 64 bits

Heikki Linnakangas <hlinnaka@iki.fi>

From: Heikki Linnakangas <hlinnaka@iki.fi>
To: Maxim Orlov <orlovmg@gmail.com>
Cc: wenhui qiu <qiuwenhuifx@gmail.com>, Alexander Korotkov <aekorotkov@gmail.com>, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>, Postgres hackers <pgsql-hackers@lists.postgresql.org>
Date: 2025-11-12T13:00:02Z
Lists: pgsql-hackers

Commits

  1. Fix partial read handling in pg_upgrade's multixact conversion

  2. Increase timeout in multixid_conversion upgrade test

  3. Improve sanity checks on multixid members length

  4. Clarify comment on multixid offset wraparound check

  5. Never store 0 as the nextMXact

  6. Add runtime checks for bogus multixact offsets

  7. Widen MultiXactOffset to 64 bits

  8. Move pg_multixact SLRU page format definitions to a separate header

  9. Convert confusing macros in multixact.c to static inline functions

  10. Index SLRUs by 64-bit integers rather than by 32-bit integers

  11. Cope with possible failure of the oldest MultiXact to exist.

On 07/11/2025 18:03, Maxim Orlov wrote:
> I tried finding out how long it would take to convert a big number of
> segments. Unfortunately, I only have access to a very old machine right
> now. It took me 7 hours to generate this much data on my old
> Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz with 16 Gb of RAM.
> 
> Here are my rough measurements:
> 
> HDD
> $ sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
> $ time pg_upgrade
> ...
> real    4m59.459s
> user    0m19.974s
> sys     0m13.640s
> 
> SSD
> $ sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
> $ time pg_upgrade
> ...
> real    4m52.958s
> user    0m19.826s
> sys     0m13.624s
> 
> I aim to get access to more modern stuff and check it all out there.

Thanks. I also did some perf testing on my laptop. I wrote a little
helper function to consume multixids, and used it to create a v17 
cluster with 100 million multixids. See attached 
consume-mxids.patch.txt. I then ran pg_upgrade on that, and measured how 
long the pg_multixact conversion part of pg_upgrade took. It took about 
1.2 s on my laptop. Extrapolating from that, converting 1 billion 
multixids would take 12 s. These were very simple multixacts with just 
one member each, though; realistic multixacts with more members would 
presumably take a little longer.

In any case, I think we're in an acceptable ballpark here.

There's some very low-hanging fruit though: Profiling with 'linux-perf' 
suggested that a lot of CPU time was spent simply on the function call 
overhead of GetOldMultiXactIdSingleMember, SlruReadSwitchPage, 
RecordNewMultiXact, SlruWriteSwitchPage for each multixact. I added an 
inlined fast path to SlruReadSwitchPage and SlruWriteSwitchPage to 
eliminate the function call overhead of those in the common case that no 
page switch is needed. With that, the 100 million mxid test case I used 
went from 1.2 s to 0.9 s. We could optimize this further but I think 
this is good enough.

Some other changes since patch set v23:

- Rebased. I committed the wraparound bug fixes.

- I added an SlruFileName() helper function to slru_io.c, and support 
for reading SLRUs with long_segment_names==true. It's not needed 
currently, but it seemed like a weird omission. AllocSlruRead() actually 
left 'long_segment_names' uninitialized, which is error-prone. We
could've just documented it, but it seems just as easy to support it.

- I split out the multixact_internal.h header into a separate commit, to
make it clearer which changes are related to 64-bit offsets.

I kept all the new test cases for now. We need to decide which ones are 
worth keeping, and polish and speed up the ones we decide to keep.


I'm getting one failure from the pg_upgrade/008_mxoff test:

> [14:43:38.422](0.530s) not ok 26 - dump outputs from original and restored regression databases match
> [14:43:38.422](0.000s) #   Failed test 'dump outputs from original and restored regression databases match'
> #   at /home/heikki/git-sandbox/postgresql/src/test/perl/PostgreSQL/Test/Utils.pm line 801.
> [14:43:38.422](0.000s) #          got: '1'
> #     expected: '0'
> === diff of /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/008_mxoff/data/tmp_test_AC6A/oldnode_6_dump.sql_adjusted and /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/008_mxoff/data/tmp_test_AC6A/newnode_6_dump.sql_adjusted
> === stdout ===
> --- /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/008_mxoff/data/tmp_test_AC6A/oldnode_6_dump.sql_adjusted       2025-11-12 14:43:38.030399957 +0200
> +++ /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/008_mxoff/data/tmp_test_AC6A/newnode_6_dump.sql_adjusted       2025-11-12 14:43:38.314399819 +0200
> @@ -2,8 +2,8 @@
>  -- PostgreSQL database dump
>  --
>  \restrict test
> --- Dumped from database version 17.6
> --- Dumped by pg_dump version 17.6
> +-- Dumped from database version 19devel
> +-- Dumped by pg_dump version 19devel
>  SET statement_timeout = 0;
>  SET lock_timeout = 0;
>  SET idle_in_transaction_session_timeout = 0;=== stderr ===
> === EOF ===
> [14:43:38.425](0.004s) # >>> case #6

I ran the test with:

(rm -rf build/testrun/ build/tmp_install/; \
 olddump=/tmp/olddump-regress.sql oldinstall=/home/heikki/pgsql.17stable/ \
 meson test -C build --suite setup --suite pg_upgrade)

- Heikki