Re: POC: make mxidoff 64 bits

Heikki Linnakangas <hlinnaka@iki.fi>

From: Heikki Linnakangas <hlinnaka@iki.fi>

To: Maxim Orlov <orlovmg@gmail.com>

Cc: wenhui qiu <qiuwenhuifx@gmail.com>, Alexander Korotkov <aekorotkov@gmail.com>, Postgres hackers <pgsql-hackers@lists.postgresql.org>

Date: 2024-11-15T11:06:00Z

Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

Fix partial read handling in pg_upgrade's multixact conversion
- ac94ce8194e5 19 (unreleased) landed
Increase timeout in multixid_conversion upgrade test
- bd43940b02b2 19 (unreleased) landed
Improve sanity checks on multixid members length
- ecb553ae8211 19 (unreleased) landed
Clarify comment on multixid offset wraparound check
- 170361d7b869 14.21 landed
- b0b52b7123ae 15.16 landed
- 7d42e2367c6b 16.12 landed
- cd1a887fe9bf 17.8 landed
- 3fbad030a24d 18.2 landed
- 366dcdaf5779 19 (unreleased) landed
Never store 0 as the nextMXact
- 87a350e1f284 19 (unreleased) landed
Add runtime checks for bogus multixact offsets
- d4b7bde4183b 19 (unreleased) landed
Widen MultiXactOffset to 64 bits
- bd8d9c9bdfa0 19 (unreleased) landed
Move pg_multixact SLRU page format definitions to a separate header
- bb3b1c4f6462 19 (unreleased) landed
Convert confusing macros in multixact.c to static inline functions
- 0099b9408e8c 17.0 landed
Index SLRUs by 64-bit integers rather than by 32-bit integers
- 4ed8f0913bfd 17.0 cited
Cope with possible failure of the oldest MultiXact to exist.
- b6a3444fa635 9.4.4 cited

On 13/11/2024 17:44, Maxim Orlov wrote:
> On Tue, 12 Nov 2024 at 02:31, Heikki Linnakangas <hlinnaka@iki.fi 
> <mailto:hlinnaka@iki.fi>> wrote:
>     On a different note, I'm surprised you're rewriting member segments
>     from
>     scratch, parsing all the individual member groups and writing them out
>     again. There's no change to the members file format, except for the
>     numbering of the files, so you could just copy the files under the new
>     names without paying attention to the contents. It's not wrong to parse
>     them in detail, but I'd assume that it would be simpler not to.
> 
> Yes, at the beginning I also thought that it would be possible to get by 
> with simple copying. But in case of wraparound, we must "bypass" invalid 
> zero offset value. See, old 32 bit offsets a wrapped at 2^32, thus 0 
> values appears in multixact.c So, they must be handled. Bypass, in fact. 
> When we are switched to the 64-bit offsets, we have two options:
> 1). Bypass every ((uint32) offset == 0) value in multixact.c;
> 2). Convert members and bypass invalid value once.
> 
> The first options seem too weird for me. So, we have to repack members 
> and bypass invalid value.

Hmm, so if I understand correctly, this is related to how we determine 
the length of the members array, by looking at the next multixid's 
offset. This is explained in GetMultiXactIdMembers:

> 	/*
> 	 * Find out the offset at which we need to start reading MultiXactMembers
> 	 * and the number of members in the multixact.  We determine the latter as
> 	 * the difference between this multixact's starting offset and the next
> 	 * one's.  However, there are some corner cases to worry about:
> 	 *
> 	 * 1. This multixact may be the latest one created, in which case there is
> 	 * no next one to look at.  In this case the nextOffset value we just
> 	 * saved is the correct endpoint.
> 	 *
> 	 * 2. The next multixact may still be in process of being filled in: that
> 	 * is, another process may have done GetNewMultiXactId but not yet written
> 	 * the offset entry for that ID.  In that scenario, it is guaranteed that
> 	 * the offset entry for that multixact exists (because GetNewMultiXactId
> 	 * won't release MultiXactGenLock until it does) but contains zero
> 	 * (because we are careful to pre-zero offset pages). Because
> 	 * GetNewMultiXactId will never return zero as the starting offset for a
> 	 * multixact, when we read zero as the next multixact's offset, we know we
> 	 * have this case.  We handle this by sleeping on the condition variable
> 	 * we have just for this; the process in charge will signal the CV as soon
> 	 * as it has finished writing the multixact offset.
> 	 *
> 	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
> 	 * handle case #2, there is an ambiguity near the point of offset
> 	 * wraparound.  If we see next multixact's offset is one, is that our
> 	 * multixact's actual endpoint, or did it end at zero with a subsequent
> 	 * increment?  We handle this using the knowledge that if the zero'th
> 	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
> 	 * transaction ID so it can't be a multixact member.  Therefore, if we
> 	 * read a zero from the members array, just ignore it.
> 	 *
> 	 * This is all pretty messy, but the mess occurs only in infrequent corner
> 	 * cases, so it seems better than holding the MultiXactGenLock for a long
> 	 * time on every multixact creation.
> 	 */

With 64-bit offsets, can we assume that it never wraps around? We often 
treat 2^64 as "large enough that we'll never run out", e.g. LSNs are 
also assumed to never wrap around. I think that would be a safe 
assumption here too.

If we accept that, we don't need to worry about case 3 anymore. But if 
we upgrade wrapped-around members files by just renaming them, there 
could still be a members array where we had skipped offset 0, and 
reading that after the upgrade might get confused. We could continue to 
ignore a 0 XID in the members array like the comment says; I think that 
would be enough. But yeah, maybe it's better to bite the bullet in 
pg_upgrade and squeeze those out.

Does your upgrade test suite include case 3, where the next multixact's 
offset is 1?


Can we remove MaybeExtendOffsetSlru() now? There are a bunch of other 
comments and checks that talk about binary-upgraded values too that we 
can hopefully clean up now.


If we are to parse the member segments in detail in upgrade anyway, I'd 
be tempted to make some further changes / optimizations:

-  You could leave out all locking XID members in upgrade, because 
they're not relevant after upgrade any more (all the XIDs will be 
committed or aborted and have released the locks; we require prepared 
transactions to be completed before upgrading too). It'd be enough to 
include actual UPDATE/DELETE XIDs.

- The way we determine the length of the members array by looking at the 
next multixid's offset is a bit complicated. We could have one extra 
flag per XID in the members to indicate "this is the last member of this 
multixid". That could either to replace the current mechanism of looking 
at the next offset, or be just an additional cross-check.

- Do we still like the "group" representation, with 4 bytes of flags 
followed by 4 XIDs? I wonder if it'd be better to just store 5 bytes per 
XID unaligned.

- A more radical idea: There can be only one updating XID in one 
multixid. We could store that directly in the offsets SLRU, and keep 
only the locking XIDs in members. That way, the members SLRU would 
become less critical; it could be safely reset on crash for example 
(except for prepared transactions, which could still be holding locks, 
but it'd still be less serious). Separating correctness-critical data 
from more ephemeral state is generally a good idea.

I'm not insisting on any of these changes, just some things that might 
be worth considering if we're rewriting the SLRUs on upgrade anyway.

-- 
Heikki Linnakangas
Neon (https://neon.tech)