Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Fix partial read handling in pg_upgrade's multixact conversion

  2. Increase timeout in multixid_conversion upgrade test

  3. Improve sanity checks on multixid members length

  4. Clarify comment on multixid offset wraparound check

  5. Never store 0 as the nextMXact

  6. Add runtime checks for bogus multixact offsets

  7. Widen MultiXactOffset to 64 bits

  8. Move pg_multixact SLRU page format definitions to a separate header

  9. Convert confusing macros in multixact.c to static inline functions

  10. Index SLRUs by 64-bit integers rather than by 32-bit integers

  11. Cope with possible failure of the oldest MultiXact to exist.

  1. POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2024-04-23T08:23:41Z

    Hi!
    
    I've been trying to introduce 64-bit transaction identifications to
    Postgres for quite a while [0]. All this implies,
    of course, an enormous amount of change that will have to take place in
    various modules. Consider this, the
    patch set become too big to be committed “at once”.
    
    The obvious solutions was to try to split the patch set into smaller ones.
    But here it comes a new challenge,
    not every one of these parts, make Postgres better at the moment. Actually,
    even switching to a
    FullTransactionId in PGPROC will have disadvantage in increasing of WAL
    size [1].
    
    In fact, I believe, we're dealing with the chicken and the egg problem
    here. Not able to commit full patch set
    since it is too big to handle and not able to commit parts of it, since
    they make sense all together and do not
    help improve Postgres at the moment.
    
    But it's not that bad. Since the commit 4ed8f0913bfdb5f, added in [3], we
    are capable to use 64 bits to
    indexing SLRUs.
    
    PROPOSAL
    Make multixact offsets 64 bit.
    
    RATIONALE
    It is not very difficult to overflow 32-bit mxidoff. Since, it is created
    for every unique combination of the
    transaction for each tuple, including XIDs and respective flags. And when a
    transaction is added to a
    specific multixact, it is rewritten with a new offset. In other words, it
    is possible to overflow the offsets of
    multixacts without overflowing the multixacts themselves and/or ordinary
    transactions. I believe, there
    was something about in the hackers mail lists, but I could not find links
    now.
    
    PFA, patch. Here is a WIP version. Upgrade machinery should be added later.
    
    As always, any opinions on a subject a very welcome!
    
    [0]
    https://www.postgresql.org/message-id/flat/CACG%3DezZe1NQSCnfHOr78AtAZxJZeCvxrts0ygrxYwe%3DpyyjVWA%40mail.gmail.com
    [1]
    https://www.postgresql.org/message-id/flat/CACG%3DezY7msw%2Bjip%3Drtfvnfz051dRqz4s-diuO46v3rAoAE0T0g%40mail.gmail.com
    [3]
    https://postgr.es/m/CAJ7c6TPDOYBYrnCAeyndkBktO0WG2xSdYduTF0nxq%2BvfkmTF5Q%40mail.gmail.com
    
    -- 
    Best regards,
    Maxim Orlov.
    
  2. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2024-04-23T09:37:40Z

    On 23/04/2024 11:23, Maxim Orlov wrote:
    > PROPOSAL
    > Make multixact offsets 64 bit.
    
    +1, this is a good next step and useful regardless of 64-bit XIDs.
    
    > @@ -156,7 +148,7 @@
    >  		((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
    >  
    >  /* page in which a member is to be found */
    > -#define MXOffsetToMemberPage(xid) ((xid) / (TransactionId) MULTIXACT_MEMBERS_PER_PAGE)
    > +#define MXOffsetToMemberPage(xid) ((xid) / (MultiXactOffset) MULTIXACT_MEMBERS_PER_PAGE)
    >  #define MXOffsetToMemberSegment(xid) (MXOffsetToMemberPage(xid) / SLRU_PAGES_PER_SEGMENT)
    >  
    >  /* Location (byte offset within page) of flag word for a given member */
    
    This is really a bug fix. It didn't matter when TransactionId and 
    MultiXactOffset were both typedefs of uint32, but it was always wrong. 
    The argument name 'xid' is also misleading.
    
    I think there are some more like that, MXOffsetToFlagsBitShift for example.
    
    -- 
    Heikki Linnakangas
    Neon (https://neon.tech)
    
    
    
    
    
  3. Re: POC: make mxidoff 64 bits

    Andrey Borodin <x4mmm@yandex-team.ru> — 2024-04-23T13:02:56Z

    
    > On 23 Apr 2024, at 11:23, Maxim Orlov <orlovmg@gmail.com> wrote:
    > 
    > Make multixact offsets 64 bit.
    
    -		ereport(ERROR,
    -				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
    -				 errmsg("multixact \"members\" limit exceeded"),
    Personally, I'd be happy with this! We had some incidents where the only mitigation was vacuum settings tweaking.
    
    BTW as a side note... I see lot's of casts to (unsigned long long), can't we just cast to MultiXactOffset?
    
    
    Best regards, Andrey Borodin.
    
    
    
  4. Re: POC: make mxidoff 64 bits

    wenhui qiu <qiuwenhuifx@gmail.com> — 2024-04-23T13:03:06Z

    Hi Maxim Orlov
       Thank you so much for your tireless work on this. Increasing the WAL
    size by a few bytes should have very little impact with today's disk
    performance(Logical replication of this feature wal log is also increased a
    lot, logical replication is a milestone new feature, and the community has
    been improving the logical replication of functions),I believe removing
    troubled postgresql Transaction ID Wraparound was also a milestone  new
    feature  adding a few bytes is worth it!
    
    Best regards
    
    On Tue, 23 Apr 2024 at 17:37, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    
    > On 23/04/2024 11:23, Maxim Orlov wrote:
    > > PROPOSAL
    > > Make multixact offsets 64 bit.
    >
    > +1, this is a good next step and useful regardless of 64-bit XIDs.
    >
    > > @@ -156,7 +148,7 @@
    > >               ((uint32) ((0xFFFFFFFF % MULTIXACT_MEMBERS_PER_PAGE) + 1))
    > >
    > >  /* page in which a member is to be found */
    > > -#define MXOffsetToMemberPage(xid) ((xid) / (TransactionId)
    > MULTIXACT_MEMBERS_PER_PAGE)
    > > +#define MXOffsetToMemberPage(xid) ((xid) / (MultiXactOffset)
    > MULTIXACT_MEMBERS_PER_PAGE)
    > >  #define MXOffsetToMemberSegment(xid) (MXOffsetToMemberPage(xid) /
    > SLRU_PAGES_PER_SEGMENT)
    > >
    > >  /* Location (byte offset within page) of flag word for a given member */
    >
    > This is really a bug fix. It didn't matter when TransactionId and
    > MultiXactOffset were both typedefs of uint32, but it was always wrong.
    > The argument name 'xid' is also misleading.
    >
    > I think there are some more like that, MXOffsetToFlagsBitShift for example.
    >
    > --
    > Heikki Linnakangas
    > Neon (https://neon.tech)
    >
    >
    >
    >
    
  5. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2024-04-25T14:20:53Z

    On Tue, 23 Apr 2024 at 12:37, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    
    > This is really a bug fix. It didn't matter when TransactionId and
    > MultiXactOffset were both typedefs of uint32, but it was always wrong.
    > The argument name 'xid' is also misleading.
    >
    > I think there are some more like that, MXOffsetToFlagsBitShift for example.
    
    Yeah, I always thought so too.  I believe, this is just a copy-paste.  You
    mean, it is worth creating a separate CF
    entry for these fixes?
    
    
    On Tue, 23 Apr 2024 at 16:03, Andrey M. Borodin <x4mmm@yandex-team.ru>
    wrote:
    
    > BTW as a side note... I see lot's of casts to (unsigned long long), can't
    > we just cast to MultiXactOffset?
    >
    Actually, first versions of the 64xid patch set have such a cast to types
    TransactionID, MultiXact and so on.  But,
    after some discussions, we are switched to unsigned long long cast.
    Unfortunately, I could not find an exact link
    for that discussion.  On the other hand, such a casting is already used
    throughout the code.  So, just for the
    sake of the consistency, I would like to stay with these casts.
    
    
    On Tue, 23 Apr 2024 at 16:03, wenhui qiu <qiuwenhuifx@gmail.com> wrote:
    
    > Hi Maxim Orlov
    >    Thank you so much for your tireless work on this. Increasing the WAL
    > size by a few bytes should have very little impact with today's disk
    > performance(Logical replication of this feature wal log is also increased a
    > lot, logical replication is a milestone new feature, and the community has
    > been improving the logical replication of functions),I believe removing
    > troubled postgresql Transaction ID Wraparound was also a milestone  new
    > feature  adding a few bytes is worth it!
    >
    I'm 100% agree.  Maybe, I should return to this approach and find some
    benefits for having FXIDs in WAL.
    
  6. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2024-08-14T15:30:15Z

    Hi!
    
    Sorry for delay. I was a bit busy last month. Anyway, here is my
    proposal for making multioffsets 64 bit.
    The patch set consists of three parts:
    0001 - making user output of offsets 64-bit ready;
    0002 - making offsets 64-bit;
    0003 - provide 32 to 64 bit conversion in pg_upgarde.
    
    I'm pretty sure this is just a beginning of the conversation, so any
    opinions and reviews, as always, are very welcome!
    
    -- 
    Best regards,
    Maxim Orlov.
    
  7. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2024-09-03T13:30:10Z

    Here is rebase.  Apparently I'll have to do it often, since the
    CATALOG_VERSION_NO changed in the patch.
    
    -- 
    Best regards,
    Maxim Orlov.
    
  8. Re: POC: make mxidoff 64 bits

    Alexander Korotkov <aekorotkov@gmail.com> — 2024-09-03T13:32:46Z

    On Tue, Sep 3, 2024 at 4:30 PM Maxim Orlov <orlovmg@gmail.com> wrote:
    > Here is rebase.  Apparently I'll have to do it often, since the CATALOG_VERSION_NO changed in the patch.
    
    I don't think you need to maintain CATALOG_VERSION_NO change in your
    patch for the exact reason you have mentioned: patch will get conflict
    each time CATALOG_VERSION_NO is advanced.  It's responsibility of
    committer to advance CATALOG_VERSION_NO when needed.
    
    ------
    Regards,
    Alexander Korotkov
    Supabase
    
    
    
    
  9. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2024-09-04T08:49:32Z

    On Tue, 3 Sept 2024 at 16:32, Alexander Korotkov <aekorotkov@gmail.com>
    wrote:
    
    > I don't think you need to maintain CATALOG_VERSION_NO change in your
    > patch for the exact reason you have mentioned: patch will get conflict
    > each time CATALOG_VERSION_NO is advanced.  It's responsibility of
    > committer to advance CATALOG_VERSION_NO when needed.
    >
    
    OK, I got it. My intention here was to help to test the patch. If someone
    wants to have a
    look at the patch, he won't need to make changes in the code. In the next
    iteration, I'll
    remove CATALOG_VERSION_NO version change.
    
    -- 
    Best regards,
    Maxim Orlov.
    
  10. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2024-09-07T04:36:52Z

    Here is v3. I removed CATALOG_VERSION_NO change, so this should be done by
    the actual commiter.
    
    -- 
    Best regards,
    Maxim Orlov.
    
  11. Re: POC: make mxidoff 64 bits

    Pavel Borisov <pashkin.elfe@gmail.com> — 2024-09-12T12:09:08Z

    Hi, Maxim!
    
    Previously we accessed offsets in shared MultiXactState without locks as
    32-bit read is always atomic. But I'm not sure it's so when offset become
    64-bit.
    E.g. GetNewMultiXactId():
    
    nextOffset = MultiXactState->nextOffset;
    is outside lock.
    
    There might be other places we do the same as well.
    
    Regards,
    Pavel Borisov
    Supabase
    
  12. Re: POC: make mxidoff 64 bits

    Pavel Borisov <pashkin.elfe@gmail.com> — 2024-09-12T12:25:53Z

    On Thu, 12 Sept 2024 at 16:09, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
    
    > Hi, Maxim!
    >
    > Previously we accessed offsets in shared MultiXactState without locks as
    > 32-bit read is always atomic. But I'm not sure it's so when offset become
    > 64-bit.
    > E.g. GetNewMultiXactId():
    >
    > nextOffset = MultiXactState->nextOffset;
    > is outside lock.
    >
    > There might be other places we do the same as well.
    >
    I think the replacement of plain assignments by
    pg_atomic_read_u64/pg_atomic_write_u64 would be sufficient.
    
    (The same I think is needed for the patchset [1])
    [1]
    https://www.postgresql.org/message-id/flat/CAJ7c6TMvPz8q+nC=JoKniy7yxPzQYcCTnNFYmsDP-nnWsAOJ2g@mail.gmail.com
    
    Regards,
    Pavel Borisov
    
  13. Re: POC: make mxidoff 64 bits

    Alvaro Herrera <alvherre@alvh.no-ip.org> — 2024-09-12T13:14:08Z

    On 2024-Sep-12, Pavel Borisov wrote:
    
    > Hi, Maxim!
    > 
    > Previously we accessed offsets in shared MultiXactState without locks as
    > 32-bit read is always atomic. But I'm not sure it's so when offset become
    > 64-bit.
    > E.g. GetNewMultiXactId():
    > 
    > nextOffset = MultiXactState->nextOffset;
    > is outside lock.
    
    Good though.  But fortunately I think it's not a problem.  The one you
    say is with MultiXactGetLock held in shared mode -- and that works OK,
    as the assignment (in line 1263 at the bottom of the same routine) is
    done with exclusive lock held.
    
    -- 
    Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
    
    
    
    
  14. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2024-10-22T09:43:34Z

    On 07/09/2024 07:36, Maxim Orlov wrote:
    > Here is v3.
    
    MultiXactMemberFreezeThreshold looks quite bogus now. Now that 
    MaxMultiXactOffset==2^64-1, you cannot get anywhere near the 
    MULTIXACT_MEMBER_SAFE_THRESHOLD and MULTIXACT_MEMBER_DANGER_THRESHOLD 
    values anymore. Can we just get rid of MultiXactMemberFreezeThreshold? I 
    guess it would still be useful to trigger autovacuum if multixacts 
    members grows large though, to release the disk space, even if you can't 
    run out of members as such anymore. What should the logic for that look 
    like?
    
    I'd love to see some tests for the pg_upgrade code. Something like a 
    little perl script to generate test clusters with different wraparound 
    scenarios etc. using the old version, and a TAP test to run pg_upgrade 
    on them and verify that queries on the upgraded cluster works correctly. 
    We don't have tests like that in the repository today, and I don't know 
    if we'd want to commit these permanently either, but it would be highly 
    useful now as a one-off thing, to show that the code works.
    
    On upgrade, are there really no changes required to 
    pg_multixact/members? I imagined that the segment files would need to be 
    renamed around wraparound, so that if you previously had files like this:
    
    pg_multixact/members/FFFE
    pg_multixact/members/FFFF
    pg_multixact/members/0000
    pg_multixact/members/0001
    
    after upgrade you would need to have:
    
    pg_multixact/members/00000000FFFE
    pg_multixact/members/00000000FFFF
    pg_multixact/members/000000010000
    pg_multixact/members/000000010001
    
    
    Thanks for working on this!
    
    -- 
    Heikki Linnakangas
    Neon (https://neon.tech)
    
    
    
    
    
  15. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2024-10-22T16:33:58Z

    On Tue, 22 Oct 2024 at 12:43, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    
    > MultiXactMemberFreezeThreshold looks quite bogus now. Now that
    > MaxMultiXactOffset==2^64-1, you cannot get anywhere near the
    > MULTIXACT_MEMBER_SAFE_THRESHOLD and MULTIXACT_MEMBER_DANGER_THRESHOLD
    > values anymore. Can we just get rid of MultiXactMemberFreezeThreshold? I
    > guess it would still be useful to trigger autovacuum if multixacts
    > members grows large though, to release the disk space, even if you can't
    > run out of members as such anymore. What should the logic for that look
    > like?
    >
    Yep, you're totally correct. The MultiXactMemberFreezeThreshold call is not
    necessary any more and can be safely removed.
    I made this as a separate commit in v4. But, as you rightly say, it will be
    useful to trigger autovacuum in some cases. The obvious
    place for this machinery is in the GetNewMultiXactId. I imagine this like
    "if nextOff - oldestOff > threshold kick autovac". So, the
    question is: what kind of threshold we want here? Is it a hard coded define
    or GUC? If it is a GUC (32–bit), what values should it be?
    
    And the other issue I feel a little regretful about. We still must be
    holding MultiXactGenLock in order to track oldestOffset to do
    "nextOff - oldestOff" calculation.
    
    
    >
    > I'd love to see some tests for the pg_upgrade code. Something like a
    > little perl script to generate test clusters with different wraparound
    > scenarios etc.
    
    Agree. I'll address this as soon as I can.
    
    -- 
    Best regards,
    Maxim Orlov.
    
  16. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2024-10-23T15:55:06Z

    After a bit of thought, I've realized that to be conservative here is the
    way to go.
    
    We can reuse a maximum of existing logic. I mean, we can remove offset
    wraparound "error logic" and reuse "warning logic". But set the threshold
    for "warning logic" to a much higher value. For now, I choose 2^32-1. In
    other world, legit logic, in my view, here would be to trigger autovacuum
    if the number of offsets (i.e. difference nextOffset - oldestOffset)
    exceeds 2^32-1. PFA patch set.
    
    -- 
    Best regards,
    Maxim Orlov.
    
  17. Re: POC: make mxidoff 64 bits

    wenhui qiu <qiuwenhuifx@gmail.com> — 2024-10-25T03:38:57Z

    HI Maxim Orlov
    > After a bit of thought, I've realized that to be conservative here is the
    way to go.
    >We can reuse a maximum of existing logic. I mean, we can remove offset
    wraparound "error logic" and reuse "warning logic". But set the threshold
    for "warning >logic" to a much higher value. For now, I choose 2^32-1. In
    other world, legit logic, in my view, here would be to trigger autovacuum
    if the number of offsets (i.e. >difference nextOffset - oldestOffset)
    exceeds 2^32-1. PFA patch set.
    good point ,Couldn't agree with you more. xid64 is the solution to the
    wraparound problem,The previous error log is no longer meaningful ,But we
    might want to refine the output waring log a little(For example, checking
    the underlying reasons why age has been increasing),Though we don't have to
    worry about xid wraparound
    
    +/*
    + * Multixact members warning threshold.
    + *
    + * If difference bettween nextOffset and oldestOffset exceed this value, we
    + * trigger autovacuum in order to release the disk space if possible.
    + */
    +#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD UINT64CONST(0xFFFFFFFF)
    Can we refine this annotation a bit? for example
    +/*
    + * Multixact members warning threshold.
    + *
    + * If difference bettween nextOffset and oldestOffset exceed this value, we
    + * trigger autovacuum in order to release the disk space ,reduce table
    bloat if possible.
    + */
    +#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD UINT64CONST(0xFFFFFFFF)
    
    Thanks
    
    Maxim Orlov <orlovmg@gmail.com> 于2024年10月23日周三 23:55写道:
    
    > After a bit of thought, I've realized that to be conservative here is the
    > way to go.
    >
    > We can reuse a maximum of existing logic. I mean, we can remove offset
    > wraparound "error logic" and reuse "warning logic". But set the threshold
    > for "warning logic" to a much higher value. For now, I choose 2^32-1. In
    > other world, legit logic, in my view, here would be to trigger autovacuum
    > if the number of offsets (i.e. difference nextOffset - oldestOffset)
    > exceeds 2^32-1. PFA patch set.
    >
    > --
    > Best regards,
    > Maxim Orlov.
    >
    
  18. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2024-11-08T18:10:28Z

    On Fri, 25 Oct 2024 at 06:39, wenhui qiu <qiuwenhuifx@gmail.com> wrote:
    
    >
    > + * Multixact members warning threshold.
    > + *
    > + * If difference bettween nextOffset and oldestOffset exceed this value,
    > we
    > + * trigger autovacuum in order to release the disk space if possible.
    > + */
    > +#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD UINT64CONST(0xFFFFFFFF)
    > Can we refine this annotation a bit? for example
    >
    Thank you, fixed.
    
    Sorry for a late reply. There was a problem in upgrade with offset
    wraparound. Here is a fixed version. Test also added. I decide to use my
    old patch to set a non-standard multixacts for the old cluster, fill it
    with data and do pg_upgrade.
    
    Here is how to test. All the patches are for 14e87ffa5c543b5f3 master
    branch.
    1) Get the 14e87ffa5c543b5f3 master branch apply patches
    0001-Add-initdb-option-to-initialize-cluster-with-non-sta.patch and
    0002-TEST-lower-SLRU_PAGES_PER_SEGMENT.patch
    2) Get the 14e87ffa5c543b5f3 master branch in a separate directory and
    apply v6 patch set.
    3) Build two branches.
    4) Use ENV oldinstall to run the test: PROVE_TESTS=t/005_mxidoff.pl
    oldinstall=/home/orlov/proj/pgsql-new PG_TEST_NOCLEAN=1 make check -C
    src/bin/pg_upgrade/
    
    Maybe, I'll make a shell script to automate this steps if required.
    
    -- 
    Best regards,
    Maxim Orlov.
    
  19. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2024-11-11T23:31:40Z

    On 08/11/2024 20:10, Maxim Orlov wrote:
    > Sorry for a late reply. There was a problem in upgrade with offset 
    > wraparound. Here is a fixed version. Test also added. I decide to use my 
    > old patch to set a non-standard multixacts for the old cluster, fill it 
    > with data and do pg_upgrade.
    
    The wraparound logic is still not correct. To test, I created a cluster 
    where multixids have wrapped around, so that:
    
    $ ls -l data-old/pg_multixact/offsets/
    total 720
    -rw------- 1 heikki heikki 212992 Nov 12 01:11 0000
    -rw-r--r-- 1 heikki heikki 262144 Nov 12 00:55 FFFE
    -rw------- 1 heikki heikki 262144 Nov 12 00:56 FFFF
    
    After running pg_upgrade:
    
    $ ls -l data-new/pg_multixact/offsets/
    total 1184
    -rw------- 1 heikki heikki 155648 Nov 12 01:12 0001
    -rw------- 1 heikki heikki 262144 Nov 12 01:11 1FFFD
    -rw------- 1 heikki heikki 262144 Nov 12 01:11 1FFFE
    -rw------- 1 heikki heikki 262144 Nov 12 01:11 1FFFF
    -rw------- 1 heikki heikki 262144 Nov 12 01:11 20000
    -rw------- 1 heikki heikki 155648 Nov 12 01:11 20001
    
    That's not right. The segments 20000 and 20001 were created by the new 
    pg_upgrade conversion code from old segment '0000'. But multixids are 
    still 32-bit values, so after segment 1FFFF, you should still wrap 
    around to 0000. The new segments should be '0000' and '0001'. The 
    segment '0001' is created when postgres is started after upgrade, but 
    it's created from scratch and doesn't contain the upgraded values.
    
    When I try to select from a table after upgrade that contains 
    post-wraparound multixids:
    
    TRAP: failed Assert("offset != 0"), File: 
    "../src/backend/access/transam/multixact.c", Line: 1353, PID: 63386
    
    
    On a different note, I'm surprised you're rewriting member segments from 
    scratch, parsing all the individual member groups and writing them out 
    again. There's no change to the members file format, except for the 
    numbering of the files, so you could just copy the files under the new 
    names without paying attention to the contents. It's not wrong to parse 
    them in detail, but I'd assume that it would be simpler not to.
    
    > Here is how to test. All the patches are for 14e87ffa5c543b5f3 master 
    > branch.
    > 1) Get the 14e87ffa5c543b5f3 master branch apply patches 0001-Add- 
    > initdb-option-to-initialize-cluster-with-non-sta.patch and 0002-TEST- 
    > lower-SLRU_PAGES_PER_SEGMENT.patch
    > 2) Get the 14e87ffa5c543b5f3 master branch in a separate directory and 
    > apply v6 patch set.
    > 3) Build two branches.
    > 4) Use ENV oldinstall to run the test: PROVE_TESTS=t/005_mxidoff.pl 
    > <http://005_mxidoff.pl> oldinstall=/home/orlov/proj/pgsql-new 
    > PG_TEST_NOCLEAN=1 make check -C src/bin/pg_upgrade/
    > 
    > Maybe, I'll make a shell script to automate this steps if required.
    
    Yeah, I think we need something to automate this. I did the testing 
    manually. I used the attached python script to consume multixids faster, 
    but it's still tedious.
    
    I used pg_resetwal to quickly create a cluster that's close to multixid 
    wrapround:
    
    initdb -D data
    pg_resetwal -D data -m 4294900001,4294900000
    dd if=/dev/zero of=data/pg_multixact/offsets/FFFE bs=8192 count=32
    
    -- 
    Heikki Linnakangas
    Neon (https://neon.tech)
    
  20. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2024-11-13T15:44:44Z

    On Tue, 12 Nov 2024 at 02:31, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    
    > The wraparound logic is still not correct.
    
    Yep, my fault. I forget to reset segment counter if wraparound is happened.
    Fixed.
    
    When I try to select from a table after upgrade that contains
    > post-wraparound multixids:
    >
    > TRAP: failed Assert("offset != 0"), File:
    > "../src/backend/access/transam/multixact.c", Line: 1353, PID: 63386
    >
    The problem was in converting offset segments. The new_entry index should
    also bypass the invalid offset (0) value. Fixed.
    
    
    >
    > On a different note, I'm surprised you're rewriting member segments from
    > scratch, parsing all the individual member groups and writing them out
    > again. There's no change to the members file format, except for the
    > numbering of the files, so you could just copy the files under the new
    > names without paying attention to the contents. It's not wrong to parse
    > them in detail, but I'd assume that it would be simpler not to.
    >
    Yes, at the beginning I also thought that it would be possible to get by
    with simple copying. But in case of wraparound, we must "bypass" invalid
    zero offset value. See, old 32 bit offsets a wrapped at 2^32, thus 0 values
    appears in multixact.c So, they must be handled. Bypass, in fact. When we
    are switched to the 64-bit offsets, we have two options:
    1). Bypass every ((uint32) offset == 0) value in multixact.c;
    2). Convert members and bypass invalid value once.
    
    The first options seem too weird for me. So, we have to repack members and
    bypass invalid value.
    
    All patches are for master@38c18710b37a2d
    
    -- 
    Best regards,
    Maxim Orlov.
    
  21. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2024-11-15T08:41:46Z

    Here is the test scripts.
    The generate.sh script is used to generate data dir with multimple clusters
    in it. This script will call multixids.py in order to generate data. If you
    are not use system psql consider using LD_LIBRARY_PATH env to specify path
    to the lib directory.
    OLDBIN=/.../pgsql-new ./generate.sh
    
    Then the test.sh is used to run various upgrades.
    OLDBIN=/.../pgsql-old NEWBIN=/.../pgsql-new ./test.sh
    
    I hope that helps!
    
    -- 
    Best regards,
    Maxim Orlov.
    
  22. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2024-11-15T11:06:00Z

    On 13/11/2024 17:44, Maxim Orlov wrote:
    > On Tue, 12 Nov 2024 at 02:31, Heikki Linnakangas <hlinnaka@iki.fi 
    > <mailto:hlinnaka@iki.fi>> wrote:
    >     On a different note, I'm surprised you're rewriting member segments
    >     from
    >     scratch, parsing all the individual member groups and writing them out
    >     again. There's no change to the members file format, except for the
    >     numbering of the files, so you could just copy the files under the new
    >     names without paying attention to the contents. It's not wrong to parse
    >     them in detail, but I'd assume that it would be simpler not to.
    > 
    > Yes, at the beginning I also thought that it would be possible to get by 
    > with simple copying. But in case of wraparound, we must "bypass" invalid 
    > zero offset value. See, old 32 bit offsets a wrapped at 2^32, thus 0 
    > values appears in multixact.c So, they must be handled. Bypass, in fact. 
    > When we are switched to the 64-bit offsets, we have two options:
    > 1). Bypass every ((uint32) offset == 0) value in multixact.c;
    > 2). Convert members and bypass invalid value once.
    > 
    > The first options seem too weird for me. So, we have to repack members 
    > and bypass invalid value.
    
    Hmm, so if I understand correctly, this is related to how we determine 
    the length of the members array, by looking at the next multixid's 
    offset. This is explained in GetMultiXactIdMembers:
    
    > 	/*
    > 	 * Find out the offset at which we need to start reading MultiXactMembers
    > 	 * and the number of members in the multixact.  We determine the latter as
    > 	 * the difference between this multixact's starting offset and the next
    > 	 * one's.  However, there are some corner cases to worry about:
    > 	 *
    > 	 * 1. This multixact may be the latest one created, in which case there is
    > 	 * no next one to look at.  In this case the nextOffset value we just
    > 	 * saved is the correct endpoint.
    > 	 *
    > 	 * 2. The next multixact may still be in process of being filled in: that
    > 	 * is, another process may have done GetNewMultiXactId but not yet written
    > 	 * the offset entry for that ID.  In that scenario, it is guaranteed that
    > 	 * the offset entry for that multixact exists (because GetNewMultiXactId
    > 	 * won't release MultiXactGenLock until it does) but contains zero
    > 	 * (because we are careful to pre-zero offset pages). Because
    > 	 * GetNewMultiXactId will never return zero as the starting offset for a
    > 	 * multixact, when we read zero as the next multixact's offset, we know we
    > 	 * have this case.  We handle this by sleeping on the condition variable
    > 	 * we have just for this; the process in charge will signal the CV as soon
    > 	 * as it has finished writing the multixact offset.
    > 	 *
    > 	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
    > 	 * handle case #2, there is an ambiguity near the point of offset
    > 	 * wraparound.  If we see next multixact's offset is one, is that our
    > 	 * multixact's actual endpoint, or did it end at zero with a subsequent
    > 	 * increment?  We handle this using the knowledge that if the zero'th
    > 	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
    > 	 * transaction ID so it can't be a multixact member.  Therefore, if we
    > 	 * read a zero from the members array, just ignore it.
    > 	 *
    > 	 * This is all pretty messy, but the mess occurs only in infrequent corner
    > 	 * cases, so it seems better than holding the MultiXactGenLock for a long
    > 	 * time on every multixact creation.
    > 	 */
    
    With 64-bit offsets, can we assume that it never wraps around? We often 
    treat 2^64 as "large enough that we'll never run out", e.g. LSNs are 
    also assumed to never wrap around. I think that would be a safe 
    assumption here too.
    
    If we accept that, we don't need to worry about case 3 anymore. But if 
    we upgrade wrapped-around members files by just renaming them, there 
    could still be a members array where we had skipped offset 0, and 
    reading that after the upgrade might get confused. We could continue to 
    ignore a 0 XID in the members array like the comment says; I think that 
    would be enough. But yeah, maybe it's better to bite the bullet in 
    pg_upgrade and squeeze those out.
    
    Does your upgrade test suite include case 3, where the next multixact's 
    offset is 1?
    
    
    Can we remove MaybeExtendOffsetSlru() now? There are a bunch of other 
    comments and checks that talk about binary-upgraded values too that we 
    can hopefully clean up now.
    
    
    If we are to parse the member segments in detail in upgrade anyway, I'd 
    be tempted to make some further changes / optimizations:
    
    -  You could leave out all locking XID members in upgrade, because 
    they're not relevant after upgrade any more (all the XIDs will be 
    committed or aborted and have released the locks; we require prepared 
    transactions to be completed before upgrading too). It'd be enough to 
    include actual UPDATE/DELETE XIDs.
    
    - The way we determine the length of the members array by looking at the 
    next multixid's offset is a bit complicated. We could have one extra 
    flag per XID in the members to indicate "this is the last member of this 
    multixid". That could either to replace the current mechanism of looking 
    at the next offset, or be just an additional cross-check.
    
    - Do we still like the "group" representation, with 4 bytes of flags 
    followed by 4 XIDs? I wonder if it'd be better to just store 5 bytes per 
    XID unaligned.
    
    - A more radical idea: There can be only one updating XID in one 
    multixid. We could store that directly in the offsets SLRU, and keep 
    only the locking XIDs in members. That way, the members SLRU would 
    become less critical; it could be safely reset on crash for example 
    (except for prepared transactions, which could still be holding locks, 
    but it'd still be less serious). Separating correctness-critical data 
    from more ephemeral state is generally a good idea.
    
    I'm not insisting on any of these changes, just some things that might 
    be worth considering if we're rewriting the SLRUs on upgrade anyway.
    
    -- 
    Heikki Linnakangas
    Neon (https://neon.tech)
    
    
    
    
  23. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2024-11-15T16:19:07Z

    On Fri, 15 Nov 2024 at 14:06, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    
    > Hmm, so if I understand correctly, this is related to how we determine
    > the length of the members array, by looking at the next multixid's
    > offset. This is explained in GetMultiXactIdMembers:
    >
    Correct.
    
    
    > If we accept that, we don't need to worry about case 3 anymore. But if
    > we upgrade wrapped-around members files by just renaming them, there
    > could still be a members array where we had skipped offset 0, and
    > reading that after the upgrade might get confused. We could continue to
    > ignore a 0 XID in the members array like the comment says; I think that
    > would be enough. But yeah, maybe it's better to bite the bullet in
    > pg_upgrade and squeeze those out.
    >
    Correct. I couldn't explain this better. I'm more for the squeeze those
    out. Overwise, we're ending up in adding another hack in multixact, but one
    of the benefits from switching to 64-bits, it should make XID's logic more
    straight forward. After all, mxact juggling in pg_upgrade is one time
    inconvenience.
    
    
    >
    > Does your upgrade test suite include case 3, where the next multixact's
    > offset is 1?
    >
    Not exactly.
    
    simple
    Latest checkpoint's NextMultiXactId:  119441
    Latest checkpoint's NextMultiOffset:  5927049
    
    offset-wrap
    Latest checkpoint's NextMultiXactId:  119441
    Latest checkpoint's NextMultiOffset:  5591183
    
    multi-wrap
    Latest checkpoint's NextMultiXactId:  82006
    Latest checkpoint's NextMultiOffset:  7408811
    
    offset-multi-wrap
    Latest checkpoint's NextMultiXactId:  52146
    Latest checkpoint's NextMultiOffset:  5591183
    
    You want test case where NextMultiOffset will be 1?
    
    
    > Can we remove MaybeExtendOffsetSlru() now? There are a bunch of other
    > comments and checks that talk about binary-upgraded values too that we
    > can hopefully clean up now.
    >
    Yes, technically we can. But this is kinda unrelated to the offsets and
    will make the patch set significantly complicated, thus more complicated to
    review and less likely to be committed. Again, I'm not opposing the idea,
    I'm not sure if it is worth to do it right now.
    
    
    >
    > If we are to parse the member segments in detail in upgrade anyway, I'd
    > be tempted to make some further changes / optimizations:
    >
    > -  You could leave out all locking XID members in upgrade, because
    > they're not relevant after upgrade any more (all the XIDs will be
    > committed or aborted and have released the locks; we require prepared
    > transactions to be completed before upgrading too). It'd be enough to
    > include actual UPDATE/DELETE XIDs.
    >
    > - The way we determine the length of the members array by looking at the
    > next multixid's offset is a bit complicated. We could have one extra
    > flag per XID in the members to indicate "this is the last member of this
    > multixid". That could either to replace the current mechanism of looking
    > at the next offset, or be just an additional cross-check.
    >
    > - Do we still like the "group" representation, with 4 bytes of flags
    > followed by 4 XIDs? I wonder if it'd be better to just store 5 bytes per
    > XID unaligned.
    >
    Not really. But I would leave it for next iteration - switching multi to 64
    bit. I already have some drafts for this. In any case, we'll must do
    adjustments in pg_upgrade again. My goal is to move towards 64 XIDs, but
    with the small steps, and I plan changes in "group" representation in
    combination with switching multi to 64 bit. This seems a bit more
    appropriate in my view.
    
    As for your optimization suggestions, I like them. I don’t against them,
    but I’m afraid to disrupt the clarity of thought, especially since the
    algorithm is not the simplest.
    
    -- 
    Best regards,
    Maxim Orlov.
    
  24. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2024-11-18T13:22:38Z

    Shame on me! I've sent an erroneous patch set. Version 7 is defective. Here
    is the proper version v8 with minor refactoring in segresize.c.
    
    Also, I rename bump cat version patch into txt in order not to break cfbot.
    
    -- 
    Best regards,
    Maxim Orlov.
    
  25. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2024-11-19T17:53:49Z

    Oops! Sorry for the noise. I've must have been overworking yesterday and
    messed up the working branches. v7 was a correct set and v8 don't. Here is
    the correction with extended Perl test.
    
    The test itself is in src/bin/pg_upgrade/t/005_offset.pl It is rather heavy
    and took about 45 minutes on my i5 with 2.7 Gb data generated. Basically,
    each test here is creating a cluster and fill it with multixacts. Thus,
    dozens of segments are created using two methods. One is with prepared
    transactions, and it creates, roughly, the same amount of segments for
    members and for offsets. The other one is based on Heikki's multixids.py
    and creates more members than offsets. I've used both of these methods to
    generate as much diverse data as possible.
    
    Here is how I test this patch set:
    
       1. You need two pg clusters: the "old" one, i.e. without patch set, and
       the "new" with patch set v9 applied.
       2. Apply v9-0005-TEST-initdb-option-to-initialize-cluster-with-non.patch.txt
       to the "old" and "new" clusters. Note, this is only patch required for
       "old" cluster. This will allow you to create a cluster with non-standard
       initial multixact and multixact offset. Unfortunately, this patch was not
       did not arouse public interest since it is assumed that there is similar
       functionality to the pg_resetwal utility. But similar is not mean equal.
       See, pg_resetwal must be used after cluster init, thus, we step into some
       problems with vacuum and some SLRU segments must be filled with zeroes.
       Also, template0 datminmxid must be manually updated. So, in me view,
       using this patch is justified and very handy here.
       3. Also, apply all the "TEST" (0006 and 0007) patches to the "new"
       cluster.
       4. Build "old" and "new" pg clusters.
       5. Run the test with: PROVE_TESTS=t/005_offset.pl PG_TEST_NOCLEAN=1
       oldinstall=/home/orlov/proj/OFFSET3/pgsql-old make check -s -C
       src/bin/pg_upgrade/
       6. In my case, it took around 45 minutes and generate roughly 2.7 Gb of
       data.
    
    "TEST" patches, of course, are for the test purposes and not to be
    committed.
    
    In src/bin/pg_upgrade/t/005_offset.pl I try to consider next cases:
    
       - Basic sanity checks.
       Here I test various initial multi and offset values (including
       wraparound) and see how appropriate segments are generated.
       - pg_upgarde tests.
       Here is oldinstall ENV is for. Run pg_upgrade for old cluster with multi
       and offset values just like in previous step. i.e. with various
       combinations.
       - Self pg_upgarde.
    
    
    -- 
    Best regards,
    Maxim Orlov.
    
  26. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2024-12-18T10:21:35Z

    Thanks for working on this!
    
    On 19/11/2024 19:53, Maxim Orlov wrote:
    > The test itself is in src/bin/pg_upgrade/t/005_offset.pl 
    > <http://005_offset.pl> It is rather heavy and took about 45 minutes on 
    > my i5 with 2.7 Gb data generated. Basically, each test here is creating 
    > a cluster and fill it with multixacts. Thus, dozens of segments are 
    > created using two methods. One is with prepared transactions, and it 
    > creates, roughly, the same amount of segments for members and for 
    > offsets. The other one is based on Heikki's multixids.py and creates 
    > more members than offsets. I've used both of these methods to generate 
    > as much diverse data as possible.
    > 
    > Here is how I test this patch set:
    > 
    >  1. You need two pg clusters: the "old" one, i.e. without patch set, and
    >     the "new" with patch set v9 applied.
    >  2. Apply v9-0005-TEST-initdb-option-to-initialize-cluster-with-
    >     non.patch.txt to the "old" and "new" clusters. Note, this is only
    >     patch required for "old" cluster. This will allow you to create a
    >     cluster with non-standard initial multixact and multixact offset.
    >     Unfortunately, this patch was not did not arouse public interest
    >     since it is assumed that there is similar functionality to the
    >     pg_resetwal utility. But similar is not mean equal. See, pg_resetwal
    >     must be used after cluster init, thus, we step into some problems
    >     with vacuum and some SLRU segments must be filled with zeroes. Also,
    >     template0 datminmxidmust be manually updated. So, in me view, using
    >     this patch is justifiedand very handy here.
    >  3. Also, apply all the "TEST" (0006 and 0007) patches to the "new" cluster.
    >  4. Build "old" and "new" pg clusters.
    >  5. Run the test with: PROVE_TESTS=t/005_offset.pl
    >     <http://005_offset.pl> PG_TEST_NOCLEAN=1 oldinstall=/home/orlov/
    >     proj/OFFSET3/pgsql-old make check -s -C src/bin/pg_upgrade/
    >  6. In my case, it took around 45 minutes and generate roughly 2.7 Gb of
    >     data.
    > 
    > "TEST" patches, of course, are for the test purposes and not to be 
    > committed.
    > 
    > In src/bin/pg_upgrade/t/005_offset.pl <http://005_offset.pl> I try to 
    > consider next cases:
    > 
    >   * Basic sanity checks.
    >     Here I test various initial multi and offset values (including
    >     wraparound) and see how appropriate segments are generated.
    >   * pg_upgarde tests.
    >     Here is oldinstall ENV is for. Run pg_upgrade for old cluster with
    >     multi and offset values just like in previous step. i.e. with
    >     various combinations.
    >   * Self pg_upgarde.
    
    Attached is some more cleanup on top of patch set v9, removing more dead 
    stuff related to wraparound. I also removed the oldestOffsetKnown 
    variable and related code. It was needed to deal with clusters upgraded 
    from buggy 9.3 and 9.4 era versions, but now that pg_upgrade will 
    rewrite the SLRUs, it's no longer needed.
    
    Does the pg_upgrade code work though, if you have that buggy situation 
    where oldestOffsetKnown == false ?
    
    > 
    > 		if (!TransactionIdIsValid(*xactptr))
    > 		{
    > 			/* Corner case 3: we must be looking at unused slot zero */
    > 			Assert(offset == 0);
    > 			continue;
    > 		}
    
    After upgrade, this corner case 3 would *not* happen on offset == 0. So 
    looks like we're still missing test coverage for this upgrade corner case.
    
    I'm still felt pretty uneasy about the pg_upgrade code. It's 
    complicated, and the way it rewrites offsets and members separately and 
    page at a time is quite different from the normal codepaths in 
    multixact.c, so it's a bit hard to see if it's handling all those corner 
    cases the same way. I rewrito that so that it's easier to understand, 
    IMHO anyway. The code for reading the old format and writing the new 
    format is now more decoupled. The code for reading the old format is in 
    a separate source file, multixact_old.c. The interface to that is the 
    GetOldMultiXactIdSingleMember() that returns the updating member of a 
    given multixid, much like the GetMultiXactIdMembers() backend function. 
    The conversion routine calls that for each multixid, and write it back 
    out in the new format, one multixid at a time.
    
    The new code now "squeezes out" locking-only XIDs, keeping only the 
    updating ones. Not that important, but with this new code structure, it 
    was trivial and even easier to do than retaining all the XIDs.
    
    Now that the offsets are rewritten one by one, we don't need the 
    "special case 3" in GetMultiXactIdMembers. The upgrade process removes 
    that special case.
    
    I tried to keep my changes sepearate from your patches in the attached 
    patch series. This needs some more cleanup and squashing before 
    committing, but I think we're getting close.
    
    -- 
    Heikki Linnakangas
    Neon (https://neon.tech)
    
  27. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2024-12-27T17:09:56Z

    On Wed, 18 Dec 2024 at 13:21, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    
    >
    > Attached is some more cleanup on top of patch set v9, removing more dead
    > stuff related to wraparound. I also removed the oldestOffsetKnown
    > variable and related code. It was needed to deal with clusters upgraded
    > from buggy 9.3 and 9.4 era versions, but now that pg_upgrade will
    > rewrite the SLRUs, it's no longer needed.
    >
    Yep, multixact.c looks correct to me. As for "XXX could use
    SimpleLruTruncate()", yes, for sure.
    Actually, xl_multixact_truncate.startTruncMemb is also no longer needed, is
    it?
    
    
    > Does the pg_upgrade code work though, if you have that buggy situation
    > where oldestOffsetKnown == false ?
    >
    > >
    > >               if (!TransactionIdIsValid(*xactptr))
    > >               {
    > >                       /* Corner case 3: we must be looking at unused
    > slot zero */
    > >                       Assert(offset == 0);
    > >                       continue;
    > >               }
    >
    > After upgrade, this corner case 3 would *not* happen on offset == 0. So
    > looks like we're still missing test coverage for this upgrade corner case.
    >
    Am I understanding correctly that you want to have a test corresponding to
    the buggy 9.3 and 9.4 era versions?
    Do you think we could imitate this scenario on a current master branch like
    that:
    1) generate a couple of offsets segments for the first table;
    2) generate more segments for a second table;
    3) drop first table;
    4) stop pg cluster;
    5) remove pg_multixact/offsets/0000
    6) upgrade?
    
    PFA, v10-0016-TEST-try-to-replicate-buggy-oldest-offset.patch
    This test will fail now, for an obvious reason, but is this case a relevant
    one?
    
    -- 
    Best regards,
    Maxim Orlov.
    
  28. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-01-01T22:12:36Z

    On 27/12/2024 19:09, Maxim Orlov wrote:
    > 
    > On Wed, 18 Dec 2024 at 13:21, Heikki Linnakangas <hlinnaka@iki.fi 
    > <mailto:hlinnaka@iki.fi>> wrote:
    >     Does the pg_upgrade code work though, if you have that buggy situation
    >     where oldestOffsetKnown == false ?
    
    ...
    
    >      >
    >      >               if (!TransactionIdIsValid(*xactptr))
    >      >               {
    >      >                       /* Corner case 3: we must be looking at
    >     unused slot zero */
    >      >                       Assert(offset == 0);
    >      >                       continue;
    >      >               }
    > 
    >     After upgrade, this corner case 3 would *not* happen on offset == 0. So
    >     looks like we're still missing test coverage for this upgrade corner
    >     case.
    > 
    > Am I understanding correctly that you want to have a test corresponding 
    > to the buggy 9.3 and 9.4 era versions?
    
    No, those were two different things. I think there might be two things 
    wrong here:
    
    1. I suspect pg_upgrade might not correctly handle the situation where 
    oldestOffsetKnown==false, and
    
    2. The above assertion in "corner case 3" would not hold. It seems that 
    we don't have a test case for it, or it would've hit the assertion.
    
    
    Now that I think about it, yes, a test case for 1. would be good too. 
    But I was talking about 2.
    
    > Do you think we could imitate this scenario on a current master branch 
    > like that:
    > 1) generate a couple of offsets segments for the first table;
    > 2) generate more segments for a second table;
    > 3) drop first table;
    > 4) stop pg cluster;
    > 5) remove pg_multixact/offsets/0000
    > 6) upgrade?
    
    I don't remember off the top of my head.
    
    It might be best to just refuse the upgrade if oldestOffsetKnown==false. 
    It's a very ancient corner case. It seems reasonable to require you to 
    upgrade to a newer minor version and run VACUUM before upgrading. IIRC 
    that sets oldestOffsetKnown.
    
    -- 
    Heikki Linnakangas
    Neon (https://neon.tech)
    
    
    
    
    
  29. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2025-01-07T10:23:59Z

    On Thu, 2 Jan 2025 at 01:12, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    
    >
    > It might be best to just refuse the upgrade if oldestOffsetKnown==false.
    > It's a very ancient corner case. It seems reasonable to require you to
    > upgrade to a newer minor version and run VACUUM before upgrading. IIRC
    > that sets oldestOffsetKnown.
    >
    
    I agree.  After all, we do already have a ready-made solution in the form
    of a vacuum, do we?
    
    If I understand all this multixact_old.c machinery correctly, in case of
    oldestOffsetKnown==false
    we should fail with "could not open file" or offset will be 0 in
    GetOldMultiXactIdSingleMember.
    So, I suppose we can put an analogue of SimpleLruDoesPhysicalPageExist call
    in the beginning
    of GetOldMultiXactIdSingleMember. And if either
    SimpleLruDoesPhysicalPageExist return false
    or a corresponding offset will be 0 we have to bail out with "oldest offset
    does not exist, consider
    running vacuum before pg_upgrdade" or smth. Please, correct me if I'm wrong.
    
    -- 
    Best regards,
    Maxim Orlov.
    
  30. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2025-01-16T13:32:09Z

    Looks like there is a bit of a pause in the discussion. Here is a small
    update. Consider v12.
    No major changes, rebase to the actual master and a squash of multiple
    commits to make a
    patch set easy to reviewer.
    
    AFAICs, we are reached a consensus on a core patch for switching to 64 bits
    offsets. The
    only concern is about more comprehensive test coverage for pg_upgrade, is
    it?
    
    -- 
    Best regards,
    Maxim Orlov.
    
  31. Re: POC: make mxidoff 64 bits

    wenhui qiu <qiuwenhuifx@gmail.com> — 2025-01-21T03:35:33Z

    HI Maxim
    > Looks like there is a bit of a pause in the discussion. Here is a small
    update. Consider v12.
    > No major changes, rebase to the actual master and a squash of multiple
    commits to make a
    > patch set easy to reviewer.
    
    > AFAICs, we are reached a consensus on a core patch for switching to 64
    bits offsets. The
    > only concern is about more comprehensive test coverage for pg_upgrade, is
    it?
    Agree ,When upgrading meets extremes (oldestOffsetKnown==false.) Just
    follow the solution mentioned by Heikki Linnakangas.
    
    Thanks
    
    On Thu, Jan 16, 2025 at 9:32 PM Maxim Orlov <orlovmg@gmail.com> wrote:
    
    > Looks like there is a bit of a pause in the discussion. Here is a small
    > update. Consider v12.
    > No major changes, rebase to the actual master and a squash of multiple
    > commits to make a
    > patch set easy to reviewer.
    >
    > AFAICs, we are reached a consensus on a core patch for switching to 64
    > bits offsets. The
    > only concern is about more comprehensive test coverage for pg_upgrade, is
    > it?
    >
    > --
    > Best regards,
    > Maxim Orlov.
    >
    
  32. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2025-01-29T14:04:28Z

    Here is a v13 version with small changes to make cf bot happy.
    
    -- 
    Best regards,
    Maxim Orlov.
    
  33. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2025-03-07T11:30:17Z

    Here is a rebase, v14.
    
    -- 
    Best regards,
    Maxim Orlov.
    
  34. Re: POC: make mxidoff 64 bits

    wenhui qiu <qiuwenhuifx@gmail.com> — 2025-03-27T02:03:22Z

    HI Maxim Orlov Heikki Linnakangas
        Thank you for working on it,A few more days is a code freeze.It seems
    like things have been quiet for a while, but I believe implementing xid64
    is absolutely necessary. For instance, there’s often concern about
    performance jitter caused by frequent freezes. If xid64 is implemented, the
    freeze  process can be smoother.
    
    
    
    Thanks
    
    On Fri, Mar 7, 2025 at 7:30 PM Maxim Orlov <orlovmg@gmail.com> wrote:
    
    > Here is a rebase, v14.
    >
    > --
    > Best regards,
    > Maxim Orlov.
    >
    
  35. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-04-01T18:25:16Z

    On 07/03/2025 13:30, Maxim Orlov wrote:
    > Here is a rebase, v14.
    
    Thanks! I did some manual testing of this. I created a little helper 
    function to consume multixids, to test the autovacuum behavior, and 
    found one issue:
    
    If you consume a lot of multixid members space, by creating lots of 
    multixids with huge number of members in each, you can end up with a 
    very bloated members SLRU, and autovacuum is in no hurry to clean it up. 
    Here's what I did:
    
    1. Installed attached test module
    2. Ran "select consume_multixids(10000, 100000);" many times
    3. ran:
    
    $ du -h data/pg_multixact/members/
    26G	data/pg_multixact/members/
    
    When I run "vacuum freeze; select * from pg_database;", I can see that 
    'datminmxid' for the current database is advanced. However, autovacuum 
    is in no hurry to vacuum 'template0' and 'template1', so 
    pg_multixact/members/ does not get truncated. Eventually, when 
    autovacuum_multixact_freeze_max_age is reached, it presumably will, but 
    you will run out of disk space before that.
    
    There is this check for members size at the end of SetOffsetVacuumLimit():
    
    > 
    > 	/*
    > 	 * Do we need autovacuum?	If we're not sure, assume yes.
    > 	 */
    > 	return !oldestOffsetKnown ||
    > 		(nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
    
    And the caller (SetMultiXactIdLimit()) will in fact signal the 
    autovacuum launcher after "vacuum freeze" because of that. But 
    autovacuum launcher will look at the datminmxid / relminmxid values, see 
    that they are well within autovacuum_multixact_freeze_max_age, and do 
    nothing.
    
    This is a very extreme case, but clearly the code to signal autovacuum 
    launcher, and the freeze age cutoff that autovacuum then uses, are not 
    in sync.
    
    This patch removed MultiXactMemberFreezeThreshold(), per my suggestion, 
    but we threw this baby with the bathwater. We discussed that in this 
    thread, but didn't come up with any solution. But ISTM we still need 
    something like MultiXactMemberFreezeThreshold() to trigger autovacuum 
    freezing if the members have grown too large.
    
    -- 
    Heikki Linnakangas
    Neon (https://neon.tech)
    
    
    
    
    
  36. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-04-01T18:26:27Z

    On 01/04/2025 21:25, Heikki Linnakangas wrote:
    > On 07/03/2025 13:30, Maxim Orlov wrote:
    >> Here is a rebase, v14.
    > 
    > Thanks! I did some manual testing of this. I created a little helper 
    > function to consume multixids, to test the autovacuum behavior, and 
    > found one issue:
    
    Forgot to attach the test function I used, here it is.
    
    -- 
    Heikki Linnakangas
    Neon (https://neon.tech)
  37. Re: POC: make mxidoff 64 bits

    wenhui qiu <qiuwenhuifx@gmail.com> — 2025-04-14T09:04:40Z

    HI Heikki
        Pls Kindly help to create task in https://commitfest.postgresql.org/53/
    ,I can not found this path in
    Commitfest 2025-07
    
    
    Thanks
    
    On Wed, Apr 2, 2025 at 2:26 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    
    > On 01/04/2025 21:25, Heikki Linnakangas wrote:
    > > On 07/03/2025 13:30, Maxim Orlov wrote:
    > >> Here is a rebase, v14.
    > >
    > > Thanks! I did some manual testing of this. I created a little helper
    > > function to consume multixids, to test the autovacuum behavior, and
    > > found one issue:
    >
    > Forgot to attach the test function I used, here it is.
    >
    > --
    > Heikki Linnakangas
    > Neon (https://neon.tech)
    
  38. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2025-04-16T08:11:46Z

    I moved the topic to the next commitfest.
    
    -- 
    Best regards,
    Maxim Orlov.
    
  39. Re: POC: make mxidoff 64 bits

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-04-29T12:01:09Z

    Hi Maxim,
    
    On Wed, Apr 16, 2025 at 1:42 PM Maxim Orlov <orlovmg@gmail.com> wrote:
    
    > I moved the topic to the next commitfest.
    >
    
    I am reviewing these patches.
    
    I notice that transam/README does not mention multixact except one place in
    section "Transaction Emulation during Recovery". I expected it to document
    how pg_multixact/members and pg_multixact/offset are used and what is their
    layout. It's not the responsibility of this patchset to document it, but it
    will be good if we add a section about multixacts in transam/README. It
    will make reviews easier.
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
  40. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2025-05-27T15:18:26Z

    On Tue, 29 Apr 2025 at 15:01, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
    wrote:
    
    >
    > I notice that transam/README does not mention multixact except one place
    > in section "Transaction Emulation during Recovery". I expected it to
    > document how pg_multixact/members and pg_multixact/offset are used and what
    > is their layout. It's not the responsibility of this patchset to document
    > it, but it will be good if we add a section about multixacts in
    > transam/README. It will make reviews easier.
    >
    
    Yeah, I agree, this is a big overlook, I think. Anyone who tries to
    understand how pg_multixact works has to deal with the code.
    Certainly, we need to address this issue.
    
    But for now, here is a new rebase @ 70a13c528b6e382a381f.
    The only change is that following commits 15a79c7 and a0ed19e, we must also
    switch to PRIu64 format.
    
    -- 
    Best regards,
    Maxim Orlov.
    
  41. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2025-07-14T13:54:33Z

    Yet another rebase @ f5a987c0e5
    
    -- 
    Best regards,
    Maxim Orlov.
    
  42. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2025-09-11T08:58:13Z

    Once again, @ 8191e0c16a
    
    
    -- 
    Best regards,
    Maxim Orlov.
    
  43. Re: POC: make mxidoff 64 bits

    Alexander Korotkov <aekorotkov@gmail.com> — 2025-09-13T13:34:01Z

    Hello Maxim!
    
    On Thu, Sep 11, 2025 at 11:58 AM Maxim Orlov <orlovmg@gmail.com> wrote:
    >
    > Once again, @ 8191e0c16a
    
    Thank you for your work on this subject.  Multixact members can really
    grow much faster than multixact offsets, and avoiding wraparound just
    here might make sense.  At the same time, making multixact offsets
    64-bit is local and doesn't require changing the tuple xmin/xmax
    interpretation.
    
    I went through the patchset.  The shape does not look bad, but I have
    a concern about the size of the multixact offsets.  As I understand,
    this patchset grows multixact offsets twice; each multixact offset
    grows from 32 bits to 64 bits.  This seems quite a price for avoiding
    the Multixact members' wraparound.
    
    We can try to squeeze multixact offsets given it's ascending sequence
    each time increased by a multixact size.  But how many members can a
    multixact contain at maximum?  Looking at MultiXactIdExpand(), I get
    that we're keeping locks from in-progress transactions, and committed
    non-lock transactions (I guess the latter could be only one).  The
    number of transactions running by backends should fit MAX_BACKENDS
    (2^18 - 1), and the number of prepared transactions should also fit
    MAX_BACKENDS. So, I guess we can cap the total number of one multixact
    members to 2^24.
    
    Therefore, we can change from each 8 of 32-bit multixact offsets
    (takes 32-bytes) to one 64-bit offset + 7 of 24-bit offset increments
    (takes 29-bytes).  The actual multixact offsets can be calculated at
    the fly, overhead shouldn't be significant.  What do you think?
    
    ------
    Regards,
    Alexander Korotkov
    Supabase
    
    
    
    
  44. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2025-09-15T15:42:00Z

    On Sat, 13 Sept 2025 at 16:34, Alexander Korotkov <aekorotkov@gmail.com>
    wrote:
    
    >
    > Therefore, we can change from each 8 of 32-bit multixact offsets
    > (takes 32-bytes) to one 64-bit offset + 7 of 24-bit offset increments
    > (takes 29-bytes).  The actual multixact offsets can be calculated at
    > the fly, overhead shouldn't be significant.  What do you think?
    >
    >
    Thank you for your review; I'm pleased to hear from you again.
    
    Yes, because the maximum number of mxoff is limited by the number of
    running transactions, we may do it that way.
    However, it is a bit wired to have offsets with the 7-byte "base".
    
    I believe we may take advantage of the 64XID patch's notion of putting a
    8 byte base followed by 4 byte offsets for particular page.
    
    32kB page may contain then 2^13-2 offsets, each is maxed by 2^18+1.
    Therefore, offset from base will never overflow 2^31 and will always
    fit uint32.
    
    It appears logical to me.
    
    
    -- 
    Best regards,
    Maxim Orlov.
    
  45. Re: POC: make mxidoff 64 bits

    wenhui qiu <qiuwenhuifx@gmail.com> — 2025-09-16T12:12:17Z

    Hi Maxim
      Thanks for your continued efforts to get XID64 implemented.
    >   32kB page may contain then 2^13-2 offsets, each is maxed by 2^18+1.
    > Therefore, offset from base will never overflow 2^31 and will always
    > fit uint32.
    
    > It appears logical to me.
    Agree +1 , but I have a question: I remember the XID64 patch got split into
    a few threads. How are these threads related? The original one was seen as
    too big a change, so it was broken up after people raised concerns.
    
    Thanks
    
    On Mon, Sep 15, 2025 at 11:42 PM Maxim Orlov <orlovmg@gmail.com> wrote:
    
    >
    >
    > On Sat, 13 Sept 2025 at 16:34, Alexander Korotkov <aekorotkov@gmail.com>
    > wrote:
    >
    >>
    >> Therefore, we can change from each 8 of 32-bit multixact offsets
    >> (takes 32-bytes) to one 64-bit offset + 7 of 24-bit offset increments
    >> (takes 29-bytes).  The actual multixact offsets can be calculated at
    >> the fly, overhead shouldn't be significant.  What do you think?
    >>
    >>
    > Thank you for your review; I'm pleased to hear from you again.
    >
    > Yes, because the maximum number of mxoff is limited by the number of
    > running transactions, we may do it that way.
    > However, it is a bit wired to have offsets with the 7-byte "base".
    >
    > I believe we may take advantage of the 64XID patch's notion of putting a
    > 8 byte base followed by 4 byte offsets for particular page.
    >
    > 32kB page may contain then 2^13-2 offsets, each is maxed by 2^18+1.
    > Therefore, offset from base will never overflow 2^31 and will always
    > fit uint32.
    >
    > It appears logical to me.
    >
    >
    > --
    > Best regards,
    > Maxim Orlov.
    >
    
  46. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2025-09-18T16:37:11Z

    On Tue, 16 Sept 2025 at 15:12, wenhui qiu <qiuwenhuifx@gmail.com> wrote:
    
    >
    > Agree +1 , but I have a question: I remember the XID64 patch got split
    > into a few threads. How are these threads related? The original one was
    > seen as too big a change, so it was broken up after people raised
    > concerns.
    >
    Yeah, you're absolutely correct. This thread is a part of the overall
    work on the transition to XID64 in Postgres. As far as I remember, no
    one explicitly raised concerns, but it's clear to me that it won't be
    committed as one big patch set.
    
    Here is a WIP patch @ 8191e0c16a for the discussed above issue.
    I also had to merge several patches from the previous set, since the
    consensus is to use the PRI* format for outputting 64-bit values, and a
    separate patch that only changed the format with cast and %llu lost
    it's meaning.
    
    If this option suits everyone, the next step is to add a part related
    to the pg_upgrade.
    
    -- 
    Best regards,
    Maxim Orlov.
    
  47. Re: POC: make mxidoff 64 bits

    wenhui qiu <qiuwenhuifx@gmail.com> — 2025-09-19T09:07:15Z

    Hi Maxim
         Thank you for your feedback.I remember this path the primary
    challenges are in the upgrade section, where we're still debating how to
    address a few edge cases. Right now, this thread is missing input from core
    code contributors.
    
    
    Thanks
    
    On Fri, Sep 19, 2025 at 12:37 AM Maxim Orlov <orlovmg@gmail.com> wrote:
    
    >
    > On Tue, 16 Sept 2025 at 15:12, wenhui qiu <qiuwenhuifx@gmail.com> wrote:
    >
    >>
    >> Agree +1 , but I have a question: I remember the XID64 patch got split
    >> into a few threads. How are these threads related? The original one was
    >> seen as too big a change, so it was broken up after people raised
    >> concerns.
    >>
    > Yeah, you're absolutely correct. This thread is a part of the overall
    > work on the transition to XID64 in Postgres. As far as I remember, no
    > one explicitly raised concerns, but it's clear to me that it won't be
    > committed as one big patch set.
    >
    > Here is a WIP patch @ 8191e0c16a for the discussed above issue.
    > I also had to merge several patches from the previous set, since the
    > consensus is to use the PRI* format for outputting 64-bit values, and a
    > separate patch that only changed the format with cast and %llu lost
    > it's meaning.
    >
    > If this option suits everyone, the next step is to add a part related
    > to the pg_upgrade.
    >
    > --
    > Best regards,
    > Maxim Orlov.
    >
    
  48. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2025-10-27T15:54:35Z

    Here is a new patch set @ 10b5bb3bffaee8
    
    As previously stated, the patch set implements the concept of saving the
    "difference" between page offsets in order to save disc space.
    
    The second significant change was that I decided to modify the
    pg_upgrade portion as suggested by Heikki above.
    
    At the very least, the logic becomes much simpler and completely
    replicates what is done in the multxact.c module and provide some
    optimization.
    
    Perhaps this will be easier to comprehend and analyse than attempting
    to correctly convert bytes from one format to another.
    
    In the near future, I intend to focus on testing and implement a test
    suite.
    
    -- 
    Best regards,
    Maxim Orlov.
    
  49. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-10-28T14:17:11Z

    On 27/10/2025 17:54, Maxim Orlov wrote:
    > Here is a new patch set @ 10b5bb3bffaee8
    > 
    > As previously stated, the patch set implements the concept of saving the
    > "difference" between page offsets in order to save disc space.
    
    Hmm, is that safe? We do the assignment of multixact and offset, in the 
    GetNewMultiXactId() function, separately from updating the SLRU pages in 
    the RecordNewMultiXact() function. I believe this happen:
    
    To keep the arithmetic simple, let's assume that multixid 100 is the 
    first multixid on an offsets SLRU page. So the 'base' on the page is 
    initialized when multixid 100 is written.
    
    1. Backend A calls GetNewMultiXactId(), is assigned multixid 100, offset 
    1000
    2. Backend B calls GetNewMultiXactId(), is assigned multixid 101, offset 
    1010
    3. Backend B calls RecordNewMultiXact() and sets 'page->offsets[1] = 10'
    4. Backend A calls RecordNewMultiXact() and sets 'page->base = 1000' and 
    'page->offsets[0] = 0'
    
    If backend C looks up multixid 101 in between steps 3 and 4, it would 
    read the offset incorrectly, because 'base' isn't set yet.
    
    - Heikki
    
    
    
    
    
  50. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2025-10-30T06:13:54Z

    Unfortunately, I need to admit that I have messed a bit with v18.
    I forgot to sync the pg_upgrade portion with the rest of the patch,
    among other things. Here's a proper version with additional testing.
    
    pg_upgrade/t/007_mxoff.pl is not meant to be committed, it's just
    for test purposes. In order to run it, env var oldinstall must be set.
    
    On Tue, 28 Oct 2025 at 17:17, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    
    > On 27/10/2025 17:54, Maxim Orlov wrote:
    >
    >
    > If backend C looks up multixid 101 in between steps 3 and 4, it would
    > read the offset incorrectly, because 'base' isn't set yet.
    >
    > Hmm, maybe I miss something? We set page base on first write of any
    offset on the page, not only the first one. In other words, there
    should never be a case when we read an offset without a previously
    defined page base. Correct me if I'm wrong:
    1. Backend A assigned mxact=100, offset=1000.
    2. Backend B assigned mxact=101, offset=1010.
    3. Backend B calls RecordNewMultiXact()/MXOffsetWrite() and
        set page base=1010, offset plus 0^0x80000000 bit while
        holding lock on the page.
    4. Backend C looks up for the mxact=101 by calling MXOffsetRead()
        and should get exactly what he's looking for:
        base (1010) + offset (0) minus 0x80000000 bit.
    5. Backend A calls RecordNewMultiXact() and sets his offset using
        existing base from step 3.
    
    
    -- 
    Best regards,
    Maxim Orlov.
    
  51. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-10-30T09:10:43Z

    On 30/10/2025 08:13, Maxim Orlov wrote:
    > On Tue, 28 Oct 2025 at 17:17, Heikki Linnakangas <hlinnaka@iki.fi 
    > <mailto:hlinnaka@iki.fi>> wrote:
    > 
    >     On 27/10/2025 17:54, Maxim Orlov wrote:
    > 
    > 
    >     If backend C looks up multixid 101 in between steps 3 and 4, it would
    >     read the offset incorrectly, because 'base' isn't set yet.
    > 
    > Hmm, maybe I miss something? We set page base on first write of any
    > offset on the page, not only the first one. In other words, there
    > should never be a case when we read an offset without a previously
    > defined page base. Correct me if I'm wrong:
    > 1. Backend A assigned mxact=100, offset=1000.
    > 2. Backend B assigned mxact=101, offset=1010.
    > 3. Backend B calls RecordNewMultiXact()/MXOffsetWrite() and
    >      set page base=1010, offset plus 0^0x80000000 bit while
    >      holding lock on the page.
    > 4. Backend C looks up for the mxact=101 by calling MXOffsetRead()
    >      and should get exactly what he's looking for:
    >      base (1010) + offset (0) minus 0x80000000 bit.
    > 5. Backend A calls RecordNewMultiXact() and sets his offset using
    >      existing base from step 3.
    
    Oh I see, the 'base' is not necessarily the base offset of the first 
    multixact on the page, it's the base offset of the first multixid that 
    is written to the page. And the (short) offsets can be negative. That's 
    a frighteningly clever encoding scheme. One upshot of that is that WAL 
    redo might get construct the page with a different 'base'. I guess that 
    works, but it scares me. Could we come up with a more deterministic scheme?
    
    - Heikki
    
    
    
    
    
  52. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2025-10-30T16:17:00Z

    On Thu, 30 Oct 2025 at 12:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    
    >
    > Oh I see, the 'base' is not necessarily the base offset of the first
    > multixact on the page, it's the base offset of the first multixid that
    > is written to the page. And the (short) offsets can be negative. That's
    > a frighteningly clever encoding scheme. One upshot of that is that WAL
    > redo might get construct the page with a different 'base'. I guess that
    > works, but it scares me. Could we come up with a more deterministic scheme?
    >
    > Definitely! The most stable approach is the one we had before, which
    used actual 64-bit offsets in the SLRU. To be honest, I'm completely
    happy with it. After all, what's most important for me is to have 64-bit
    xids in Postgres, and this patch is a step towards that goal.
    
    PFA v20 returns to using actual 64-bit offsets for on-disk SLRU
    segments.
    
    Fortunately, now that I've separated reading and writing offsets into
    different functions, switching from one implementation to another is
    easy to do.
    
    Here's a quick overview of the current state of the patch:
    1) Access to the offset is placed to separate calls:
       MXOffsetWrite/MXOffsetRead.
    2) I abandoned byte juggling in pg_upgrade and moved to using logic that
       replicates the work with offsets im multixact.c
    3) As a result, the update issue came down to the correct implementation
       of functions MXOffsetWrite/MXOffsetRead.
    4) The only question that remains is the question of disk representation
       of 64-bit offsets in SLRU segments.
    
    My thoughts on point (4).
    
    Using 32-bit offsets + some kind of packing:
    Pros:
     + Reduce the total disc space used by the segments; ideally it is
       almost the same as before.
    Cons:
     - Reduces reliability (losing a part will most likely result in losing
       the entire page).
     - Complicates code, especially considering that segments may be written
       to the page in random order.
    
    Using 64-bit offsets in SLRU:
    Pros:
     + Easy to implement/transparent logic.
    Cons:
     - Increases the amount of disk space used.
    
    In terms of speed, I'm not sure which will be faster. On the one hand,
    64-bit eliminates the necessity for calculations and branching. On the
    other hand, the amount of data used will increase.
    
    I am not opposed to any of these options, as our primary goal is getting
    64-bit offsets. However, I like the approach using full 64-bit offsets
    in SLRU, because it is more clear and, should we say, robust. Yes, it
    will increase the number of segment, however this is not heap data in
    for a table. Under typical circumstances, there should not be too many
    such segments.
    
    -- 
    Best regards,
    Maxim Orlov.
    
  53. Re: POC: make mxidoff 64 bits

    Alexander Korotkov <aekorotkov@gmail.com> — 2025-11-04T15:10:31Z

    On Thu, Oct 30, 2025 at 6:17 PM Maxim Orlov <orlovmg@gmail.com> wrote:
    > On Thu, 30 Oct 2025 at 12:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    >>
    >>
    >> Oh I see, the 'base' is not necessarily the base offset of the first
    >> multixact on the page, it's the base offset of the first multixid that
    >> is written to the page. And the (short) offsets can be negative. That's
    >> a frighteningly clever encoding scheme. One upshot of that is that WAL
    >> redo might get construct the page with a different 'base'. I guess that
    >> works, but it scares me. Could we come up with a more deterministic scheme?
    >>
    > Definitely! The most stable approach is the one we had before, which
    > used actual 64-bit offsets in the SLRU. To be honest, I'm completely
    > happy with it. After all, what's most important for me is to have 64-bit
    > xids in Postgres, and this patch is a step towards that goal.
    
    Yes, but why can't we have an encoding scheme which would both be
    deterministic and provide compression?  The attached is what I meant
    in [1].  It's based on v19 and provide deterministic conversion of
    each 8 of 64-bit offsets into a chunks containing 64-bit base and 7 of
    24-bit increments.  I didn't touch pg_upgrade code though.
    
    Links.
    1. https://www.postgresql.org/message-id/CAPpHfdtPybyMYBj-x3-Z5%3D4bj_vhYk2R0nezfy%3DVjcz4QBMDgw%40mail.gmail.com
    
    ------
    Regards,
    Alexander Korotkov
    Supabase
    
  54. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-11-05T15:37:54Z

    On 30/10/2025 18:17, Maxim Orlov wrote:
    > PFA v20 returns to using actual 64-bit offsets for on-disk SLRU
    > segments.
    > 
    > Fortunately, now that I've separated reading and writing offsets into
    > different functions, switching from one implementation to another is
    > easy to do.
    > 
    > Here's a quick overview of the current state of the patch:
    > 1) Access to the offset is placed to separate calls:
    >     MXOffsetWrite/MXOffsetRead.
    > 2) I abandoned byte juggling in pg_upgrade and moved to using logic that
    >     replicates the work with offsets im multixact.c
    > 3) As a result, the update issue came down to the correct implementation
    >     of functions MXOffsetWrite/MXOffsetRead.
    > 4) The only question that remains is the question of disk representation
    >     of 64-bit offsets in SLRU segments.
    
    Here's another round of review and cleanup of this. I made a bunch of 
    small changes, but haven't found any major problems. Looking pretty good.
    
    Notable changes since v20:
    
    - Changed MULTIXACT_MEMBER_AUTOVAC_THRESHOLD to 4000000000 instead of 
    0xFFFFFFFF. The 2^32 mark doesn't have any particular meaning 
    significance and using a round number makes that more clear. We should 
    possibly expose that as a separate GUC, but that can be done in a 
    followup patch.
    
    - Removed the MXOffsetRead/Write functions again. They certainly make 
    sense if we make them more complicated with some kind of a compression 
    scheme, but I preferred to keep the code closer to 'master' for now.
    
    - Removed more remnants of offset wraparound handling. There were still 
    a few places that checked for wraparound and tried to deal with it, 
    while other places just assumed that it doesn't happen. I added a check 
    in GetNewMultiXactId() to error out if it would wrap around. It really 
    should not happen in the real world, one reason being that we would run 
    out of WAL before running out of 64-bit mxoffsets, but a sanity check 
    won't hurt just in case someone e.g. abuses pg_resetwal to get into that 
    situation.
    
    - Removed MaybeExtendOffsetSlru(). It was only used to deal with binary 
    upgrade from version 9.2 and below. Now that pg_upgrade rewrites the 
    files, it's not needed anymore.
    
    - Modified PerformMembersTruncation() to use SimpleLruTruncate()
    
    Changes in pg_upgrade:
    
    - Removed the nextMXact/nextOffset fields from MultiXactWriter. They 
    were redundant with the next_multi and next_offset local variables in 
    the caller.
    
    Remaining issues:
    
    - There's one more refactoring I'd like to do before merging this: Move 
    the definitions that are now duplicated between 
    src/bin/pg_upgrade/multixact_new.c and 
    src/backend/access/transam/multixact.c into a new header file, 
    multixact_internal.h. One complication with that is that it needs 
    SLRU_PAGES_PER_SEGMENT from access/slru.h, but slru.h cannot currently 
    be included in FRONTEND code. Perhaps we should move 
    SLRU_PAGES_PER_SEGMENT to pg_config_manual.h, or if that feels too 
    global, to a separate slru_config.h file.
    
    - I saw Alexander's proposal for a new compression scheme but didn't 
    incorporate that here. It might be a good idea, but I think we can do 
    that as a followup patch before the release, if it seems worth it. I 
    don't feel too bad about just making pg_multixact/offsets 2x larger either.
    
    - Have you done any performance testing of the pg_upgrade code? How long 
    does the conversion take if you have e.g. 1 billion multixids?
    
    - Is the !oldestOffsetKnown case in the code still reachable? I left one 
    FIXME comment about that. Needs a comment update at least.
    
    - The new pg_upgrade test fails on my system with this error in the log:
    
    > # Running: pg_dump --no-sync --restrict-key test -d port=22462 host=/tmp/5KdMvth1jk dbname='postgres' -f /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/007_mxoff/data/tmp_test_CINS/newnode_1_dump.sql
    > pg_dump: error: aborting because of server version mismatch
    > pg_dump: detail: server version: 19devel; pg_dump version: 17.6
    > could not read "/home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/007_mxoff/data/tmp_test_CINS/newnode_1_dump.sql": No such file or directory at /home/heikki/git-sandbox/postgresql/src/bin/pg_upgrade/t/007_mxoff.pl line 242.
    
    This turns out to be an issue with IPC::Run. Setting the 
    IPCRUNDEBUG=basic env variable reveals that it has a built-in command cache:
    
    > IPC::Run 0004 [#19(109223)]: ****** harnessing *****
    > IPC::Run 0004 [#19(109223)]: parsing [ pg_dump --no-sync --restrict-key test -d 'port=20999 host=/tmp/NsJKldN1Ie dbname='postgres'' -f '/home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/007_mxoff/data/tmp_test_urgw/newnode_1_dump.sql' ]
    > IPC::Run 0004 [#19(109223)]: ** starting
    > IPC::Run 0004 [#19(109223)]: 'pg_dump' found in cache: '/home/heikki/pgsql.17stable/bin/pg_dump'
    > IPC::Run 0004 [#19(111432) pg_dump]: execing /home/heikki/pgsql.17stable/bin/pg_dump --no-sync --restrict-key test -d 'port=20999 host=/tmp/NsJKldN1Ie dbname='postgres'' -f /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/007_mxoff/data/tmp_test_urgw/newnode_1_dump.sql
    > IPC::Run 0004 [#19(109223)]: ** finishing
    > pg_dump: error: aborting because of server version mismatch
    > pg_dump: detail: server version: 19devel; pg_dump version: 17.6
    
    The test calls pg_dump twice: first with the old version, then with the 
    new version. But thanks to IPC::Run's command cache, the invocation of 
    the new pg_dump version actually also calls the old version. I'm not 
    sure how to fix that, but I was able to work around it by reversing the 
    pg_dump calls so that thew new version is called first. That way we use 
    the new pg_dump against both server versions which works.
    
    - The new pg_ugprade test is very slow. I would love to include that 
    test permanently in the test suite, but it's too slow for that currently.
    
    - Heikki
    
  55. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2025-11-06T15:50:23Z

    On Wed, 5 Nov 2025 at 18:38, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    
    >
    > Remaining issues:
    >
    > - There's one more refactoring I'd like to do before merging this: Move
    > the definitions that are now duplicated between
    > src/bin/pg_upgrade/multixact_new.c and
    > src/backend/access/transam/multixact.c into a new header file,
    > multixact_internal.h. One complication with that is that it needs
    > SLRU_PAGES_PER_SEGMENT ...
    
    Done. Also put SLRU_PAGES_PER_SEGMENT in pg_config_manual.h
    In my opinion, this constant perfectly aligns the description in the
    file header. In any case, feel free to move it anywhere you like.
    
    
    - Have you done any performance testing of the pg_upgrade code? How long
    > does the conversion take if you have e.g. 1 billion multixids?
    >
    Unfortunately, not yet. I'd like to do this soon. Currently, the
    bulk of the testing time is spent generating multi-transactions.
    
    
    - Is the !oldestOffsetKnown case in the code still reachable? I left one
    > FIXME comment about that. Needs a comment update at least.
    >
    Yep, no longer needed. A separate commit has been added.
    
    
    - The new pg_upgrade test fails on my system with this error in the log:
    
    Unfortunately, I don't face this issue. I think this can be fixed by
    providing an explicit path to the utility.
    
    
    - The new pg_ugprade test is very slow. I would love to include that
    > test permanently in the test suite, but it's too slow for that currently.
    
    Yes, unfortunately. The majority of the time is spent on tests that
    produce multiple segments. These are cases numbered 4-th and higher.
    If we remove these, the testing should be relatively fast.
    
    
    I also add commit "Handle wraparound of next new multi in pg_upgrade".
    Per BUG #18863 and BUG #18865
    
    The issue is that pg_upgrade neglects to handle the wraparound of
    mxact/mxoff.
    
    We'll obviously resolve the issue with mxoff wraparound by moving to
    64-bits. And the mxact bug can be easily solved with two lines of code.
    Or five if you count indents and comments. Test also provided.
    
    This commit is totally optional. If you think it deserves to be treated
    as a different issue, feel free to discard it.
    
    -- 
    Best regards,
    Maxim Orlov.
    
  56. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2025-11-07T16:03:11Z

    I noticed one minor issue after I had already sent the
    previous letter.
    
    --- a/src/backend/access/transam/multixact.c
    +++ b/src/backend/access/transam/multixact.c
    @@ -1034,7 +1034,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset
    *offset)
        if (nextOffset + nmembers < nextOffset)
            ereport(ERROR,
                    (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
    -                "MultiXact members would wrap around"));
    +                errmsg("MultiXact members would wrap around")));
        *offset = nextOffset;
    
    
    $ $PGBINOLD/pg_controldata -D pgdata
    pg_control version number:            1800
    Catalog version number:               202510221
    ...
    Latest checkpoint's NextMultiXactId:  10000000
    Latest checkpoint's NextMultiOffset:  999995050
    Latest checkpoint's oldestXID:        748
    ...
    
    I tried finding out how long it would take to convert a big number of
    segments. Unfortunately, I only have access to a very old machine right
    now. It took me 7 hours to generate this much data on my old
    Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz with 16 Gb of RAM.
    
    Here are my rough measurements:
    
    HDD
    $ sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
    $ time pg_upgrade
    ...
    real    4m59.459s
    user    0m19.974s
    sys     0m13.640s
    
    SSD
    $ sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
    $ time pg_upgrade
    ...
    real    4m52.958s
    user    0m19.826s
    sys     0m13.624s
    
    I aim to get access to more modern stuff and check it all out there.
    
    -- 
    Best regards,
    Maxim Orlov.
    
  57. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-11-12T13:00:02Z

    On 07/11/2025 18:03, Maxim Orlov wrote:
    > I tried finding out how long it would take to convert a big number of
    > segments. Unfortunately, I only have access to a very old machine right
    > now. It took me 7 hours to generate this much data on my old
    > Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz with 16 Gb of RAM.
    > 
    > Here are my rough measurements:
    > 
    > HDD
    > $ sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
    > $ time pg_upgrade
    > ...
    > real    4m59.459s
    > user    0m19.974s
    > sys     0m13.640s
    > 
    > SSD
    > $ sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
    > $ time pg_upgrade
    > ...
    > real    4m52.958s
    > user    0m19.826s
    > sys     0m13.624s
    > 
    > I aim to get access to more modern stuff and check it all out there.
    
    Thanks, I also did some perf testing on my laptop. I wrote a little 
    helper function to consume multixids, and used it to create a v17 
    cluster with 100 million multixids. See attached 
    consume-mxids.patch.txt. I then ran pg_upgrade on that, and measured how 
    long the pg_multixact conversion part of pg_upgrade took. It took about 
    1.2 s on my laptop. Extrapolating from that, converting 1 billion 
    multixids would take 12 s. These were very simple multixacts with just 
    one member each, though; realistic multixacts with more members would 
    presumably take a little longer.
    
    In any case, I think we're in an acceptable ballpark here.
    
    There's some very low-hanging fruit though: Profiling with 'linux-perf' 
    suggested that a lot of CPU time was spent simply on the function call 
    overhead of GetOldMultiXactIdSingleMember, SlruReadSwitchPage, 
    RecordNewMultiXact, SlruWriteSwitchPage for each multixact. I added an 
    inlined fast path to SlruReadSwitchPage and SlruWriteSwitchPage to 
    eliminate the function call overhead of those in the common case that no 
    page switch is needed. With that, the 100 million mxid test case I used 
    went from 1.2 s to 0.9 s. We could optimize this further but I think 
    this is good enough.
    
    Some other changes since patch set v23:
    
    - Rebased. I committed the wraparound bug fixes.
    
    - I added an SlruFileName() helper function to slru_io.c, and support 
    for reading SLRUs with long_segment_names==true. It's not needed 
    currently, but it seemed like a weird omission. AllocSlruRead() actually 
    left 'long_segment_names' uninitialized which is error-prone. We 
    could've just documented it, but it seems just as easy to support it.
    
    - I split the multixact_internal.h header in a separate commit, to make 
    it more clear what changes are related to 64-bit offsets
    
    I kept all the new test cases for now. We need to decide which ones are 
    worth keeping, and polish and speed up the ones we decide to keep.
    
    
    I'm getting one failure from the pg_upgrade/008_mxoff test:
    
    > [14:43:38.422](0.530s) not ok 26 - dump outputs from original and restored regression databases match
    > [14:43:38.422](0.000s) #   Failed test 'dump outputs from original and restored regression databases match'
    > #   at /home/heikki/git-sandbox/postgresql/src/test/perl/PostgreSQL/Test/Utils.pm line 801.
    > [14:43:38.422](0.000s) #          got: '1'
    > #     expected: '0'
    > === diff of /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/008_mxoff/data/tmp_test_AC6A/oldnode_6_dump.sql_adjusted and /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/008_mxoff/data/tmp_test_AC6A/newnode_6_dump.sql_adjusted
    > === stdout ===
    > --- /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/008_mxoff/data/tmp_test_AC6A/oldnode_6_dump.sql_adjusted       2025-11-12 14:43:38.030399957 +0200
    > +++ /home/heikki/git-sandbox/postgresql/build/testrun/pg_upgrade/008_mxoff/data/tmp_test_AC6A/newnode_6_dump.sql_adjusted       2025-11-12 14:43:38.314399819 +0200
    > @@ -2,8 +2,8 @@
    >  -- PostgreSQL database dump
    >  --
    >  \restrict test
    > --- Dumped from database version 17.6
    > --- Dumped by pg_dump version 17.6
    > +-- Dumped from database version 19devel
    > +-- Dumped by pg_dump version 19devel
    >  SET statement_timeout = 0;
    >  SET lock_timeout = 0;
    >  SET idle_in_transaction_session_timeout = 0;=== stderr ===
    > === EOF ===
    > [14:43:38.425](0.004s) # >>> case #6
    
    I ran the test with:
    
    (rm -rf build/testrun/ build/tmp_install/; 
    olddump=/tmp/olddump-regress.sql oldinstall=/home/heikki/pgsql.17stable/ 
    meson test -C build --suite setup --suite pg_upgrade)
    
    - Heikki
    
  58. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-11-13T16:04:48Z

    I realized that this issue was still outstanding:
    
    On 01/04/2025 21:25, Heikki Linnakangas wrote:
    > Thanks! I did some manual testing of this. I created a little helper 
    > function to consume multixids, to test the autovacuum behavior, and 
    > found one issue:
    > 
    > If you consume a lot of multixid members space, by creating lots of 
    > multixids with huge number of members in each, you can end up with a 
    > very bloated members SLRU, and autovacuum is in no hurry to clean it up. 
    > Here's what I did:
    > 
    > 1. Installed attached test module
    > 2. Ran "select consume_multixids(10000, 100000);" many times
    > 3. ran:
    > 
    > $ du -h data/pg_multixact/members/
    > 26G    data/pg_multixact/members/
    > 
    > When I run "vacuum freeze; select * from pg_database;", I can see that 
    > 'datminmxid' for the current database is advanced. However, autovacuum 
    > is in no hurry to vacuum 'template0' and 'template1', so pg_multixact/ 
    > members/ does not get truncated. Eventually, when 
    > autovacuum_multixact_freeze_max_age is reached, it presumably will, but 
    > you will run out of disk space before that.
    > 
    > There is this check for members size at the end of SetOffsetVacuumLimit():
    > 
    >>
    >>     /*
    >>      * Do we need autovacuum?    If we're not sure, assume yes.
    >>      */
    >>     return !oldestOffsetKnown ||
    >>         (nextOffset - oldestOffset > MULTIXACT_MEMBER_AUTOVAC_THRESHOLD);
    > 
    > And the caller (SetMultiXactIdLimit()) will in fact signal the 
    > autovacuum launcher after "vacuum freeze" because of that. But 
    > autovacuum launcher will look at the datminmxid / relminmxid values, see 
    > that they are well within autovacuum_multixact_freeze_max_age, and do 
    > nothing.
    > 
    > This is a very extreme case, but clearly the code to signal autovacuum 
    > launcher, and the freeze age cutoff that autovacuum then uses, are not 
    > in sync.
    > 
    > This patch removed MultiXactMemberFreezeThreshold(), per my suggestion, 
    > but we threw this baby with the bathwater. We discussed that in this 
    > thread, but didn't come up with any solution. But ISTM we still need 
    > something like MultiXactMemberFreezeThreshold() to trigger autovacuum 
    > freezing if the members have grown too large.
    
    Here's a new patch version that addresses the above issue. I resurrected 
    MultiXactMemberFreezeThreshold(), using the same logic as before, just 
    using pretty arbitrary thresholds of 1 and 2 billion offsets instead of 
    the safe/danger thresholds derived from MaxMultiOffset. That gives 
    roughly the same behavior wrt. calculating effective freeze age as before.
    
    Another change is that I removed the offset-based emergency vacuum 
    triggering. With 64-bit offsets, we never need to shut down the system 
    to prevent offset wraparound, so even if the offsets SLRU grows large, 
    it's not an "emergency" the same way that wraparound is. Consuming lots 
    of disk space could be a problem, of course, but we can let autovacuum 
    deal with that at the normal pace, like it deals with bloated tables.
    
    The heuristics could surely be made better and/or more configurable, but 
    I think this good enough for now.
    
    I included these changes as a separate patch for review purposes, but it 
    ought to be squashed with the main patch before committing.
    
    - Heikki
    
  59. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2025-11-14T15:40:50Z

    On Wed, 12 Nov 2025 at 16:00, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    
    >
    > I added an
    > inlined fast path to SlruReadSwitchPage and SlruWriteSwitchPage to
    > eliminate the function call overhead of those in the common case that no
    > page switch is needed. With that, the 100 million mxid test case I used
    > went from 1.2 s to 0.9 s. We could optimize this further but I think
    > this is good enough.
    >
    I agree with you.
    
    
    - I added an SlruFileName() helper function to slru_io.c, and support
    > for reading SLRUs with long_segment_names==true. It's not needed
    > currently, but it seemed like a weird omission. AllocSlruRead() actually
    > left 'long_segment_names' uninitialized which is error-prone. We
    > could've just documented it, but it seems just as easy to support it.
    >
    Yeah, I didn't particularly like that place either. But then I decided it
    was
    overkill to do it for the sake of symmetry and would raise questions.
    It turned out much better this way.
    
    
    
    > I kept all the new test cases for now. We need to decide which ones are
    > worth keeping, and polish and speed up the ones we decide to keep.
    >
    I think of two cases here.
    A) Upgrade from "new cluster":
        * created cluster with pre 32-bit overflow mxoff
        * consume around of 2k of mxacts (1k before 32-bit overflow
          and 1k after)
        * run pg_upgrade
        * check upgraded cluster is working
        * check data invariant
    B)  Same as A), but for an "old cluster" with oldinstall env.
    
    
    On Thu, 13 Nov 2025 at 19:04, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    
    >
    > Here's a new patch version that addresses the above issue. I resurrected
    > MultiXactMemberFreezeThreshold(), using the same logic as before, just
    > using pretty arbitrary thresholds of 1 and 2 billion offsets instead of
    > the safe/danger thresholds derived from MaxMultiOffset. That gives
    > roughly the same behavior wrt. calculating effective freeze age as before.
    >
    Yes, I think it's okay for now. This reflects the existing logic well.
    I wonder what the alternative solution might be? Can we make a
    "vacuum freeze" also do pg_multixact segments truncation?
    In any case, this can be discussed later.
    
    -- 
    Best regards,
    Maxim Orlov.
    
  60. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-11-14T17:48:39Z

    On 14/11/2025 17:40, Maxim Orlov wrote:
    > On Wed, 12 Nov 2025 at 16:00, Heikki Linnakangas <hlinnaka@iki.fi> 
    >> I kept all the new test cases for now. We need to decide which
    >> ones are worth keeping, and polish and speed up the ones we decide
    >> to keep.
    
    Attached is a new patch version, with more work on the tests. The 
    pg_upgrade patch 
    (v26-0004-Add-pg_upgrade-for-64-bit-multixact-offsets.patch) now 
    includes a test case. I'm proposing to commit that test along with these 
    patches. It's a heavily-modified version of the test cases you wrote.
    
    I tested that test using old installations, all the way down to version 
    9.4. That required a bunch of changes to the test perl modules, to make 
    them work with such old versions. Without any extra changes, the test 
    works down to v11.
    
    Later patches in the patch set add more tests, labelled with the TEST: 
    prefix. Those are the tests you posted earlier, with little to no 
    modifications. I'm just carrying those around, so that I can easily run 
    them now during development. But I don't think they're adding much value 
    and I plan to leave them out of the final commit.
    
    
    > I think of two cases here.
    > A) Upgrade from "new cluster":
    >      * created cluster with pre 32-bit overflow mxoff
    >      * consume around of 2k of mxacts (1k before 32-bit overflow
    >        and 1k after)
    >      * run pg_upgrade
    >      * check upgraded cluster is working
    >      * check data invariant
    > B)  Same as A), but for an "old cluster" with oldinstall env.
    
    Makes sense.
    
    The 007_multixact_conversion.pl test in the attached patches includes 
    two test scenarios:  "basic" and "wraparound" test. In the basic 
    scenario there's no overflow or wraparound involved, but it can be run 
    without an old installation, i.e. in a "new -> new upgrade". The 
    "wraparound" scenario is the same, but the old cluster is reset with 
    pg_resetwal so that the mxoff wraps around. The "wraparound" requires a 
    pre-19 old installation, because the pg_resetwal logic requires pre-v19 
    layout.
    
    If we enhance the reset_mxoff() perl function in the test so that it 
    also works with v19, we could run the "wraparound" scenario in new->new 
    upgrades too. That would essentially the case A) that you listed above.
    
    I think it's already pretty good as it is though. I don't expect the 
    point where we cross offset 2^32 in the new version to be very 
    interesting now that we use 64-bit offsets everywhere.
    
    - Heikki
    
  61. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-11-17T16:35:46Z

    Here's yet another patch version. I spent the day reviewing this in 
    detail and doing little cleanups here and there. I squashed the commits 
    and wrote a proper commit message.
    
    One noteworthy refactoring is in pg_upgrade.c, to make it more clear (to 
    me at least) how upgrade from version 9.2 and below now works. It was 
    actually broken when I tested it. Not sure if I had broken it earlier or 
    if it never worked, but in any case it works now.
    
    I also tested upgrading a cluster from an old minor version, < 9.3.5, 
    where the control file has a bogus oldestMultiXid==1 value (see commit 
    b6a3444fa6). As expected, you get a "could not open file" error:
    
    > Performing Upgrade
    > ------------------
    > Setting locale and encoding for new cluster                   ok
    > ...
    > Deleting files from new pg_multixact/members                  ok
    > Deleting files from new pg_multixact/offsets                  ok
    > Converting pg_multixact files                                 
    > could not open file "/home/heikki/pgsql.93stable/data/pg_multixact/offsets/0000": No such file or directory
    > Failure, exiting
    
    I don't think we need to support that case. I hope there are no clusters 
    in that state still in the wild, and you can work around it by upgrading 
    to 9.3.5 or above and letting autovacuum run. But I wonder if a 
    pre-upgrade check with a better error message would still be worthwhile.
    
    
    Ashutosh, you were interested in reviewing this earlier. Would you have 
    a chance to review this now, before I commit it? Alexander, Alvaro, 
    would you have a chance to take a final look too, please?
    
    - Heikki
    
  62. Re: POC: make mxidoff 64 bits

    wenhui qiu <qiuwenhuifx@gmail.com> — 2025-11-18T01:47:45Z

    Hi Heikki
    > I don't think we need to support that case. I hope there are no clusters
    > in that state still in the wild, and you can work around it by upgrading
    > to 9.3.5 or above and letting autovacuum run. But I wonder if a
    > pre-upgrade check with a better error message would still be worthwhile.
    I think we believe it is now highly unlikely to find instances of version
    9.3; all users are advised to upgrade to the latest version first.
    
    
    
    
    Thanks
    
    On Tue, Nov 18, 2025 at 12:35 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    
    > Here's yet another patch version. I spent the day reviewing this in
    > detail and doing little cleanups here and there. I squashed the commits
    > and wrote a proper commit message.
    >
    > One noteworthy refactoring is in pg_upgrade.c, to make it more clear (to
    > me at least) how upgrade from version 9.2 and below now works. It was
    > actually broken when I tested it. Not sure if I had broken it earlier or
    > if it never worked, but in any case it works now.
    >
    > I also tested upgrading a cluster from an old minor version, < 9.3.5,
    > where the control file has a bogus oldestMultiXid==1 value (see commit
    > b6a3444fa6). As expected, you get a "could not open file" error:
    >
    > > Performing Upgrade
    > > ------------------
    > > Setting locale and encoding for new cluster                   ok
    > > ...
    > > Deleting files from new pg_multixact/members                  ok
    > > Deleting files from new pg_multixact/offsets                  ok
    > > Converting pg_multixact files
    > > could not open file
    > "/home/heikki/pgsql.93stable/data/pg_multixact/offsets/0000": No such file
    > or directory
    > > Failure, exiting
    >
    > I don't think we need to support that case. I hope there are no clusters
    > in that state still in the wild, and you can work around it by upgrading
    > to 9.3.5 or above and letting autovacuum run. But I wonder if a
    > pre-upgrade check with a better error message would still be worthwhile.
    >
    >
    > Ashutosh, you were interested in reviewing this earlier. Would you have
    > a chance to review this now, before I commit it? Alexander, Alvaro,
    > would you have a chance to take a final look too, please?
    >
    > - Heikki
    >
    
  63. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-11-19T16:20:27Z

    One more small issue: The docs for pg_resetwal contain recipes for how 
    to determine safe values to use:
    
    > -m mxid,mxid
    > --multixact-ids=mxid,mxid
    > Manually set the next and oldest multitransaction ID.
    > 
    > A safe value for the next multitransaction ID (first part) can be
    > determined by looking for the numerically largest file name in the
    > directory pg_multixact/offsets under the data directory, adding one,
    > and then multiplying by 65536 (0x10000). Conversely, a safe value
    > for the oldest multitransaction ID (second part of -m) can be
    > determined by looking for the numerically smallest file name in the
    > same directory and multiplying by 65536. The file names are in
    > hexadecimal, so the easiest way to do this is to specify the option
    > value in hexadecimal and append four zeroes.
    > 
    > -O mxoff
    > --multixact-offset=mxoff
    > 
    > Manually set the next multitransaction offset.
    > 
    > A safe value can be determined by looking for the numerically
    > largest file name in the directory pg_multixact/members under the
    > data directory, adding one, and then multiplying by 52352 (0xCC80).
    > The file names are in hexadecimal. There is no simple recipe such as
    > the ones for other options of appending zeroes.
    
    I think those recipes need to be adjusted for 64-bit offsets.
    
    - Heikki
    
    
    
    
    
  64. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2025-11-20T16:30:51Z

    On Wed, 19 Nov 2025 at 19:20, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    
    >
    > I think those recipes need to be adjusted for 64-bit offsets.
    
    Yes, we need to do it.
    
    Sorry if this is too obvious, but with 32-bit offsets, we get:
    SLRU_PAGES_PER_SEGMENT * BLKSZ / sizeof(MXOff) =
        32 * 8192 / 4 = 65,536 mxoff per segment.
    
    Now, with 64-bits offsets, we should have half as much.
    
    -- 
    Best regards,
    Maxim Orlov.
    
  65. Re: POC: make mxidoff 64 bits

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-11-21T12:15:54Z

    On Mon, Nov 17, 2025 at 10:05 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    >
    > Ashutosh, you were interested in reviewing this earlier. Would you have
    > a chance to review this now, before I commit it?
    
    0002 seems to be fine. It's just moving code from one file to another.
    However, the name multixact_internals.h seems to be misusing term
    "internal".  I would expect an "internal.h" to expose things for use
    within the module and not "externally" in other modules. But this
    isn't the first time, buf_internal.h, toast_internal.h
    bgworker_internal.h and so on have their own meaning of what
    "internal" means to them. But it might be better to use something more
    meaningful than "internal" in this case. The file seems to contain
    declarations related to how pg_multixact SLRU is structured. To that
    effect multixact_slru.h or multixact_layout.h may be more appropriate.
    
    There are two offsets that that file deals with offset within
    pg_multixact/offset, MultiXactOffset and member offset (flag offset
    and xid offset) within pg_multixact/members. It's quite easy to get
    confused between those when reading that code. For example, it's not
    clear which offset MultiXactIdToOffset* functions are about. These
    functions are calculating the page, entry (within the page) and
    segment (of page) in pg_multixact/offset where to find the
    MultiXactOffset of the first member of a given mxid. Thus returning
    offset within offset. I feel they should have been named differently
    when the code was written. But now that we are moving this code in a
    separate file, we also have an opportunity to rename it better. I
    think MXOffsetToMember* functions have better names. Using a similar
    convention we could use MultiXactIdToOffsetOffset*, but that might
    increase confusion. How about MultiXactIdToOffsetPos*? A separate .h
    file also looks like a good place to document how offsets are laid out
    and its contents and how members is laid out. The latter is somehow
    documented in terms of macros and the static functions. The first is
    not documented well, I feel. This refactoring seems to be a good
    opportunity to do that. If we do so, I think, the .h there will be
    some value in committing .h file as a separate commit.
    
    The reason why this eliminates the need for wraparound is mentioned
    somewhere in GetNewMultiXactId(), but probably it should be mentioned
    at a more prominent place and also in the commit message. I expected
    it to be in the prologue of GetNewMultiXactId(), and then a reference
    to prologue from where the comment is right now.
    
    ereport(ERROR,
    (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
    errmsg("MultiXact members would wrap around")));
    
     If a server ever reaches this point, there is no way but to create
    another cluster, if the applications requires multi-xact ids? We
    should also provide this as an errhint().
    
    if (nextOffset + nmembers < nextOffset)
    
    :). I spent a few seconds trying to understand this. But I don't know
    how to make it easy to understand.
    
    In ExtendMultiXactMember() the last comment mentions a wraparound
    /*
    * Advance to next page, taking care to properly handle the wraparound
    * case. OK if nmembers goes negative.
    */
    I thought this wraparound is about offset wraparound, which can not
    happen now. Since you have left the comment intact, it's something
    else. Is the wraparound of offset within the page? Maybe requires a
    bit more clarification?
    
    PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset
    newOldestOffset)
    {
    ... snip ...
    - segment += 1;
    - }
    + SimpleLruTruncate(MultiXactMemberCtl,
    + MXOffsetToMemberPage(newOldestOffset));
    }
    
    Most of the callers of SimpleLruTruncate() call it directly. Why do we
    want to keep this static wrapper? PerformOffsetsTruncation() has a
    comment which seems to need the wrapper. But
    PerformMembersTruncation() doesn't even have that.
    
    MultiXactState->oldestMultiXactId = newOldestMulti;
    MultiXactState->oldestMultiXactDB = newOldestMultiDB;
    + MultiXactState->oldestOffset = newOldestOffset;
    LWLockRelease(MultiXactGenLock);
    
    Is this something we are missing in the current code? I can not
    understand the connection between this change and the fact that offset
    wraparound is not possible with wider multi xact offsets. Maybe I
    missed some previous discussion.
    
    I have reviewed patch 0002 and multxact.c changes in 0003. So far I
    have only these comments. I will review the pg_upgrade.c changes next.
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  66. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-11-21T13:56:33Z

    On 21/11/2025 14:15, Ashutosh Bapat wrote:
    > On Mon, Nov 17, 2025 at 10:05 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    >>
    >> Ashutosh, you were interested in reviewing this earlier. Would you have
    >> a chance to review this now, before I commit it?
    > 
    > 0002 seems to be fine. It's just moving code from one file to another.
    > However, the name multixact_internals.h seems to be misusing term
    > "internal".  I would expect an "internal.h" to expose things for use
    > within the module and not "externally" in other modules. But this
    > isn't the first time, buf_internal.h, toast_internal.h
    > bgworker_internal.h and so on have their own meaning of what
    > "internal" means to them. But it might be better to use something more
    > meaningful than "internal" in this case. The file seems to contain
    > declarations related to how pg_multixact SLRU is structured. To that
    > effect multixact_slru.h or multixact_layout.h may be more appropriate.
    
    Yeah, I went with multixact_internal.h because of the precedence. It's 
    not great, but IMHO it's better to be consistent than invent a new 
    naming scheme.
    
    > There are two offsets that that file deals with offset within
    > pg_multixact/offset, MultiXactOffset and member offset (flag offset
    > and xid offset) within pg_multixact/members. It's quite easy to get
    > confused between those when reading that code.
    
    Agreed, those are confusing. I'll think about that a little more.
    
    > The reason why this eliminates the need for wraparound is mentioned
    > somewhere in GetNewMultiXactId(), but probably it should be mentioned
    > at a more prominent place and also in the commit message. I expected
    > it to be in the prologue of GetNewMultiXactId(), and then a reference
    > to prologue from where the comment is right now.
    > 
    > ereport(ERROR,
    > (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
    > errmsg("MultiXact members would wrap around")));
    > 
    >   If a server ever reaches this point, there is no way but to create
    > another cluster, if the applications requires multi-xact ids?
    
    Pretty much. You can also vacuum freeze everything, so that no multixids 
    are in use, and then use pg_resetwal to reset nextOffset to a lower value.
    
    That obviously sounds bad, but this is a "can't happen" situation. For 
    comparison, we don't provide any better way to recover from running out 
    of LSNs either.
    
    > We should also provide this as an errhint().
    I think no. You cannot run into this "organically" by just running the 
    system. If you do manage to hit it, it's a sign of some other trouble, 
    and I don't want to guess what that might be, or what might be the 
    appropriate way to recover.
    
    > In ExtendMultiXactMember() the last comment mentions a wraparound
    > /*
    > * Advance to next page, taking care to properly handle the wraparound
    > * case. OK if nmembers goes negative.
    > */
    > I thought this wraparound is about offset wraparound, which can not
    > happen now. Since you have left the comment intact, it's something
    > else. Is the wraparound of offset within the page? Maybe requires a
    > bit more clarification?
    
    It was indeed about offset wraparound. I'll remove it.
    
    > PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset
    > newOldestOffset)
    > {
    > ... snip ...
    > - segment += 1;
    > - }
    > + SimpleLruTruncate(MultiXactMemberCtl,
    > + MXOffsetToMemberPage(newOldestOffset));
    > }
    > 
    > Most of the callers of SimpleLruTruncate() call it directly. Why do we
    > want to keep this static wrapper? PerformOffsetsTruncation() has a
    > comment which seems to need the wrapper. But
    > PerformMembersTruncation() doesn't even have that.
    
    Hmm, yeah those wrappers are a bit vestigial now. I'm inclined to keep 
    them, because as you said, PerformOffsetsTruncation() provides a place 
    for the comment. And given that, it seems best to keep 
    PerformMembersTruncation(), for symmetry.
    
    > MultiXactState->oldestMultiXactId = newOldestMulti;
    > MultiXactState->oldestMultiXactDB = newOldestMultiDB;
    > + MultiXactState->oldestOffset = newOldestOffset;
    > LWLockRelease(MultiXactGenLock);
    > 
    > Is this something we are missing in the current code? I can not
    > understand the connection between this change and the fact that offset
    > wraparound is not possible with wider multi xact offsets. Maybe I
    > missed some previous discussion.
    
    Good question. At first I intended to extract that to a separate commit, 
    before the main patch, because it seems like a nice improvement: We have 
    just calculated 'oldestOffset', so why not update the value in shared 
    memory while we have it? But looking closer, I'm not sure if it'd be 
    sane without the 64-bit offsets. Currently, oldestOffset is only updated 
    by SetOffsetVacuumLimit(), which also updates offsetStopLimit. We could 
    get into a state where oldestOffset is set, but offsetStopLimit is not. 
    With 64-bit offsets that's no longer a concern because it removes 
    offsetStopLimit altogether.
    
    > I have reviewed patch 0002 and multxact.c changes in 0003. So far I
    > have only these comments. I will review the pg_upgrade.c changes next.
    
    Thanks for the review so far!
    
    - Heikki
    
    
    
    
  67. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-11-25T10:07:27Z

    Looking at the upgrade code, in light of the "IPC/MultixactCreation on 
    the Standby server" thread [1], I think we need to make it more 
    tolerant. It's possible that there are 0 offsets in 
    pg_multixact/offsets. That might or might not be a problem: it's OK as 
    long as those multixids don't appear in any heap table, or you might 
    actually have lost those multixids, which is bad but the damage has 
    already been done and upgrade should not get stuck on it. 
    GetOldMultiXactIdSingleMember() currently asserts that the offset is 
    never zero, but it should try to do something sensible in that case 
    instead of just failing.
    
    [1] 
    https://www.postgresql.org/message-id/172e5723-d65f-4eec-b512-14beacb326ce%40yandex.ru
    
    - Heikki
    
    
    
    
    
  68. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2025-11-26T15:23:18Z

    On Tue, 25 Nov 2025 at 13:07, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    
    >
    > GetOldMultiXactIdSingleMember() currently asserts that the offset is
    > never zero, but it should try to do something sensible in that case
    > instead of just failing.
    >
    > Correct me if I'm wrong, but we added the assertion that offsets are
    never 0, based on the idea that case #2 will never take place during an
    update. If this isn't the case, this assertion could be removed.
    The rest of the function appears to work correctly.
    
    I even think that, as an experiment, we could randomly reset some of the
    offsets to zero and nothing would happen, except that some data would
    be lost.
    
    The most sensible thing we can do is give the user a warning, right?
    Something like, "During the update, we encountered some weird offset
    that shouldn't have been there, but there's nothing we can do about it,
    just take note."
    
    -- 
    Best regards,
    Maxim Orlov.
    
  69. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-11-26T15:50:25Z

    On 26/11/2025 17:23, Maxim Orlov wrote:
    > On Tue, 25 Nov 2025 at 13:07, Heikki Linnakangas <hlinnaka@iki.fi 
    > <mailto:hlinnaka@iki.fi>> wrote:
    >> GetOldMultiXactIdSingleMember() currently asserts that the offset is
    >> never zero, but it should try to do something sensible in that case
    >> instead of just failing.
    > 
    > Correct me if I'm wrong, but we added the assertion that offsets are
    > never 0, based on the idea that case #2 will never take place during an
    > update. If this isn't the case, this assertion could be removed.
    > The rest of the function appears to work correctly.
    > 
    > I even think that, as an experiment, we could randomly reset some of the
    > offsets to zero and nothing would happen, except that some data would
    > be lost.
    
    +1
    
    > The most sensible thing we can do is give the user a warning, right?
    > Something like, "During the update, we encountered some weird offset
    > that shouldn't have been there, but there's nothing we can do about it,
    > just take note."
    
    Yep, makes sense.
    
    - Heikki
    
    
    
    
    
  70. Re: POC: make mxidoff 64 bits

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-11-28T13:05:44Z

    On Fri, Nov 21, 2025 at 7:26 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    >
    >
    > > I have reviewed patch 0002 and multxact.c changes in 0003. So far I
    > > have only these comments. I will review the pg_upgrade.c changes next.
    
    007_multixact_conversion.pl fires thousands of queries through
    BackgroundPsql which prints debug output for each of the queries. When
    running this file with oldinstall set,
    2.2M  regress_log_007_multixact_conversion (size of file)
    77874 regress_log_007_multixact_conversion (wc -l output)
    
    Since this output is also copied in testlog.txt, the effect is two-fold.
    
    Most, if not all, of this output is useless. It also makes it hard to
    find the output we are looking for. PFA patch which reduces this
    output. The patch adds a flag verbose to query_safe() and query() to
    toggle this output. With the patch the sizes are
    27K  regress_log_007_multixact_conversion
    588 regress_log_007_multixact_conversion
    
    And it makes the test faster by about a second or two on my laptop.
    Something on those lines or other is required to reduce the output
    from query_safe().
    
    Some more comments
    +++ b/src/bin/pg_upgrade/multixact_old.c
    
    We may need to introduce new _new and then _old will become _older.
    Should we rename the files to have pre19 and post19 or some similar
    suffixes which make it clear what is meant by old and new?
    
    +
    +static inline int64
    +MultiXactIdToOffsetPage(MultiXactId multi)
    
    The prologue mentions that the definitions are copy-pasted from
    multixact.c from version 18, but they share the names with functions
    in the current version. I think that's going to be a good source of
    confusion especially in a file which is a few hundred lines long. Can
    we rename them to have "Old" prefix or something similar?
    
    +
    +# Dump contents of the 'mxofftest' table, created by mxact_workload
    +sub get_dump_for_comparison
    
    This local function shares its name with a local function in
    002_pg_upgrade.pl. Better to use a separate name. Also it's not
    "dumping" data using "pg_dump", so "dump" in the name can be
    misleading.
    
    + $newnode->start;
    + my $new_dump = get_dump_for_comparison($newnode, "newnode_${tag}_dump");
    + $newnode->stop;
    
    There is no code which actually looks at the multixact offsets here to
    make sure that the conversion happened correctly. I guess the test
    relies on visibility checks for that. Anyway, we need a comment
    explaining why just comparing the contents of the table is enough to
    ensure correct conversion. Better if we can add an explicit test that
    the offsets were converted correctly. I don't have any idea of how to
    do that right now, though. Maybe use pg_get_multixact_members()
    somehow in the query to extract data out of the table?
    
    +
    + compare_files($old_dump, $new_dump,
    + 'dump outputs from original and restored regression databases match');
    
    A shared test name too :); but there is not regression database here.
    
    +
    + note ">>> case #${tag}\n"
    +   . " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n"
    +   . " newnode mxoff ${new_next_mxoff}\n";
    
    Should we check that some condition holds between finish_mxoff and
    new_next_mxoff?
    
    I will continue reviewing it further.
    
    
    --
    Best Wishes,
    Ashutosh Bapat
    
  71. Re: POC: make mxidoff 64 bits

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-11-28T13:17:24Z

    On Fri, Nov 28, 2025 at 6:35 PM Ashutosh Bapat
    <ashutosh.bapat.oss@gmail.com> wrote:
    >
    > On Fri, Nov 21, 2025 at 7:26 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    > >
    > >
    > > > I have reviewed patch 0002 and multxact.c changes in 0003. So far I
    > > > have only these comments. I will review the pg_upgrade.c changes next.
    >
    > 007_multixact_conversion.pl fires thousands of queries through
    > BackgroundPsql which prints debug output for each of the queries. When
    > running this file with oldinstall set,
    > 2.2M  regress_log_007_multixact_conversion (size of file)
    > 77874 regress_log_007_multixact_conversion (wc -l output)
    >
    > Since this output is also copied in testlog.txt, the effect is two-fold.
    >
    > Most, if not all, of this output is useless. It also makes it hard to
    > find the output we are looking for. PFA patch which reduces this
    > output. The patch adds a flag verbose to query_safe() and query() to
    > toggle this output. With the patch the sizes are
    > 27K  regress_log_007_multixact_conversion
    > 588 regress_log_007_multixact_conversion
    >
    > And it makes the test faster by about a second or two on my laptop.
    > Something on those lines or other is required to reduce the output
    > from query_safe().
    >
    > Some more comments
    > +++ b/src/bin/pg_upgrade/multixact_old.c
    >
    > We may need to introduce new _new and then _old will become _older.
    > Should we rename the files to have pre19 and post19 or some similar
    > suffixes which make it clear what is meant by old and new?
    >
    > +
    > +static inline int64
    > +MultiXactIdToOffsetPage(MultiXactId multi)
    >
    > The prologue mentions that the definitions are copy-pasted from
    > multixact.c from version 18, but they share the names with functions
    > in the current version. I think that's going to be a good source of
    > confusion especially in a file which is a few hundred lines long. Can
    > we rename them to have "Old" prefix or something similar?
    >
    > +
    > +# Dump contents of the 'mxofftest' table, created by mxact_workload
    > +sub get_dump_for_comparison
    >
    > This local function shares its name with a local function in
    > 002_pg_upgrade.pl. Better to use a separate name. Also it's not
    > "dumping" data using "pg_dump", so "dump" in the name can be
    > misleading.
    >
    > + $newnode->start;
    > + my $new_dump = get_dump_for_comparison($newnode, "newnode_${tag}_dump");
    > + $newnode->stop;
    >
    > There is no code which actually looks at the multixact offsets here to
    > make sure that the conversion happened correctly. I guess the test
    > relies on visibility checks for that. Anyway, we need a comment
    > explaining why just comparing the contents of the table is enough to
    > ensure correct conversion. Better if we can add an explicit test that
    > the offsets were converted correctly. I don't have any idea of how to
    > do that right now, though. Maybe use pg_get_multixact_members()
    > somehow in the query to extract data out of the table?
    >
    > +
    > + compare_files($old_dump, $new_dump,
    > + 'dump outputs from original and restored regression databases match');
    >
    > A shared test name too :); but there is not regression database here.
    >
    > +
    > + note ">>> case #${tag}\n"
    > +   . " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n"
    > +   . " newnode mxoff ${new_next_mxoff}\n";
    >
    > Should we check that some condition holds between finish_mxoff and
    > new_next_mxoff?
    >
    > I will continue reviewing it further.
    
    One more thing,
    An UPDATE waits for FOR SHARE query to finish, and vice versa. In my
    experiments I didn't see an UPDATE creating a multi-xact. Why do we
    have UPDATEs in the load created by the test? Am I missing something?
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  72. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2025-12-01T08:53:41Z

    On Fri, 28 Nov 2025 at 16:17, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
    wrote:
    
    > One more thing,
    > An UPDATE waits for FOR SHARE query to finish, and vice versa. In my
    > experiments I didn't see an UPDATE creating a multi-xact. Why do we
    > have UPDATEs in the load created by the test? Am I missing something?
    
    
    As far as I remember, this was done on purpose to create different
    multixact members statuses randomly.
    
    
    -- 
    Best regards,
    Maxim Orlov.
    
  73. Re: POC: make mxidoff 64 bits

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-12-01T12:35:11Z

    On Mon, Dec 1, 2025 at 2:23 PM Maxim Orlov <orlovmg@gmail.com> wrote:
    >
    >
    >
    > On Fri, 28 Nov 2025 at 16:17, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
    >>
    >> One more thing,
    >> An UPDATE waits for FOR SHARE query to finish, and vice versa. In my
    >> experiments I didn't see an UPDATE creating a multi-xact. Why do we
    >> have UPDATEs in the load created by the test? Am I missing something?
    >
    >
    > As far as I remember, this was done on purpose to create different
    > multixact members statuses randomly.
    
    In that case, better to include that in the comments.
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  74. Re: POC: make mxidoff 64 bits

    Alexander Korotkov <aekorotkov@gmail.com> — 2025-12-02T14:11:34Z

    Hi, Heikki!
    
    On Tue, Nov 25, 2025 at 12:07 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    > Looking at the upgrade code, in light of the "IPC/MultixactCreation on
    > the Standby server" thread [1], I think we need to make it more
    > tolerant. It's possible that there are 0 offsets in
    > pg_multixact/offsets. That might or might not be a problem: it's OK as
    > long as those multixids don't appear in any heap table, or you might
    > actually have lost those multixids, which is bad but the damage has
    > already been done and upgrade should not get stuck on it.
    > GetOldMultiXactIdSingleMember() currently asserts that the offset is
    > never zero, but it should try to do something sensible in that case
    > instead of just failing.
    
    Thank you for your work on this subject.  It's very much appreciated.
    
    I'd like to raise the question about compression again.  You have
    fairly criticized non-deterministic compression, but what do you think
    about deterministic one that I've proposed [1].  I understand that
    multixact offsets are subject of growth and their limit is not
    removed.  However, it's still several extra gigabytes for multixact
    offsets, which we could save.
    
    Links.
    1. https://www.postgresql.org/message-id/CAPpHfduDFLXATvBkUiOjyvZUBZXhK_pj5zjVpxvrJzkRVq%2B8Lw%40mail.gmail.com
    
    ------
    Regards,
    Alexander Korotkov
    Supabase
    
    
    
    
  75. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-12-02T14:47:29Z

    On 02/12/2025 16:11, Alexander Korotkov wrote:
    > I'd like to raise the question about compression again.  You have
    > fairly criticized non-deterministic compression, but what do you think
    > about deterministic one that I've proposed [1].  I understand that
    > multixact offsets are subject of growth and their limit is not
    > removed.  However, it's still several extra gigabytes for multixact
    > offsets, which we could save.
    
    It felt overly complicated to my taste. And decoding/encoding the whole 
    chunk on every access seems expensive. Maybe it's cheap enough that it 
    doesn't matter in practice, but some performance testing would at least 
    be in order. But I'd love to find a simpler scheme to begin with.
    
    Storing one "base" offset per page, as Maxim did in [1], feels about 
    right to me. Except for the non-deterministic nature of how it gets set 
    in that patch, and what I referred to as a "frighteningly clever 
    encoding scheme".
    
    Perhaps we could set the base offset in ExtendMultiXactOffset() already?
    
    [1] 
    https://www.postgresql.org/message-id/CACG%3DezbPUASDL1eJ%2Bc-ZkJMwRPukvp3EL0q1vSUa1h%2BfnX8y3g%40mail.gmail.com
    
    - Heikki
    
    
    
    
    
  76. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2025-12-03T09:54:23Z

    The biggest problem with compression, in my opinion, is that losing
    even one byte causes the loss of the entire compressed block in the
    worst case scenario. After all, we still don't have checksums for the
    SLRU's, which is a shame by itself.
    
    Again, I'm not against the idea of compression, but the risks need to
    be considered.
    
    As a software developer, I definitely want to implement compression and
    save a few gigabytes. However, given my previous experience using
    Postgres in real-world applications, reliability at the cost of several
    gigabytes would not have caused me any trouble. Just saying.
    
    -- 
    Best regards,
    Maxim Orlov.
    
  77. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-12-03T12:04:11Z

    On 03/12/2025 11:54, Maxim Orlov wrote:
    > The biggest problem with compression, in my opinion, is that losing
    > even one byte causes the loss of the entire compressed block in the
    > worst case scenario. After all, we still don't have checksums for the
    > SLRU's, which is a shame by itself.
    > 
    > Again, I'm not against the idea of compression, but the risks need to
    > be considered.
    
    There are plenty of such critical bytes in the system where a single bit 
    flip renders the whole block unreadable. Actually, if we had checksums 
    on SLRU pages, a single bit flip anywhere in the page would make the 
    checksum fail and render the block unreadable.
    
    If things go really bad and you need to open a hex editor and try to fix 
    the data manually, it shouldn't be too hard to deduce the correct base 
    offset from surrounding data.
    
    > As a software developer, I definitely want to implement compression and
    > save a few gigabytes. However, given my previous experience using
    > Postgres in real-world applications, reliability at the cost of several
    > gigabytes would not have caused me any trouble. Just saying.
    
    +1. If we decide to do some kind of compression here, I want it to be 
    very simple. Otherwise it's just not worth the code complexity and risk.
    
    Let's do the math of how much disk space we'd save. Let's assume the 
    worst case that every multixid consists of only one transaction ID. 
    Currently, every such multixid takes up 4 bytes in the offsets SLRU, and 
    5 bytes in the members SLRU (one flag byte and 4 bytes for the XID). So 
    that's 9 bytes. With 64-bit offsets, it becomes 13 bytes. With the 
    compression, we're back to 9 bytes again (ignoring the one base offset 
    per page). So in an extreme case that you have 1 billion multixids, with 
    only one XID per multixid, the difference is between 9 GB and 13 GB. 
    That seems acceptable.
    
    And having just one XID per multixid is a rare corner case. Much more 
    commonly, you have at at least two XIDs. With two XIDs per multixid, the 
    difference is between 14 bytes and 18 bytes.
    
    And having a billion multixids is pretty extreme. Your database is 
    likely very large too if you reach that point, and a few gigabytes won't 
    matter.
    
    One could argue that the memory needed for the SLRU cache matters more 
    than the disk space. That's perhaps true, but I think this is totally 
    acceptable from that point of view, too.
    
    - Heikki
    
    
    
    
    
  78. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-12-03T19:12:40Z

    On 01/12/2025 14:35, Ashutosh Bapat wrote:
    > On Mon, Dec 1, 2025 at 2:23 PM Maxim Orlov <orlovmg@gmail.com> wrote:
    >> On Fri, 28 Nov 2025 at 16:17, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
    >>> An UPDATE waits for FOR SHARE query to finish, and vice versa. In my
    >>> experiments I didn't see an UPDATE creating a multi-xact. Why do we
    >>> have UPDATEs in the load created by the test? Am I missing something?
    >>
    >> As far as I remember, this was done on purpose to create different
    >> multixact members statuses randomly.
    > 
    > In that case, better to include that in the comments.
    
    I think that was indeed the purpose, but the test should use FOR KEY 
    SHARE rather than FOR SHARE. Otherwise the UPDATEs don't generate multixids.
    
    - Heikki
    
    
    
    
    
  79. Re: POC: make mxidoff 64 bits

    wenhui qiu <qiuwenhuifx@gmail.com> — 2025-12-04T02:04:46Z

    Hi
    > As a software developer, I definitely want to >  implement compression and
    > save a few gigabytes. However, given my previous experience using
    > Postgres in real-world applications, reliability at the cost of several
    > gigabytes would not have caused me any trouble. Just saying.
    Agree +1, If this had been done twenty years ago, the cost might have been
    unacceptable. But with today’s hardware—especially disk random and
    sequential I/O performance improving by hundreds of thousands of times, and
    memory capacity increasing by several hundred times—it’s almost
    unimaginable that we now have single 256-GB DIMMs. So this kind of overhead
    is negligible for modern hardware.
    
    
    Thanks
    
    
    On Wed, 3 Dec 2025 at 17:54, Maxim Orlov <orlovmg@gmail.com> wrote:
    
    > The biggest problem with compression, in my opinion, is that losing
    > even one byte causes the loss of the entire compressed block in the
    > worst case scenario. After all, we still don't have checksums for the
    > SLRU's, which is a shame by itself.
    >
    > Again, I'm not against the idea of compression, but the risks need to
    > be considered.
    >
    > As a software developer, I definitely want to implement compression and
    > save a few gigabytes. However, given my previous experience using
    > Postgres in real-world applications, reliability at the cost of several
    > gigabytes would not have caused me any trouble. Just saying.
    >
    > --
    > Best regards,
    > Maxim Orlov.
    >
    
  80. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2025-12-04T10:03:23Z

    On Wed, 3 Dec 2025 at 15:04, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    
    >
    > There are plenty of such critical bytes in the system where a single bit
    > flip renders the whole block unreadable. Actually, if we had checksums
    > on SLRU pages, a single bit flip anywhere in the page would make the
    > checksum fail and render the block unreadable.
    >
    > Correct. However, my concern about the lack of checksums for SLRU wasn't
    about data loss and the impossibility of recovering it, but about the
    impossibility of detecting the error.  But it is how it is for now.
    
    -- 
    Best regards,
    Maxim Orlov.
    
  81. Re: POC: make mxidoff 64 bits

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-12-04T10:36:51Z

    On Wed, Dec 3, 2025 at 5:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    >
    > On 03/12/2025 11:54, Maxim Orlov wrote:
    > > The biggest problem with compression, in my opinion, is that losing
    > > even one byte causes the loss of the entire compressed block in the
    > > worst case scenario. After all, we still don't have checksums for the
    > > SLRU's, which is a shame by itself.
    > >
    > > Again, I'm not against the idea of compression, but the risks need to
    > > be considered.
    >
    > There are plenty of such critical bytes in the system where a single bit
    > flip renders the whole block unreadable. Actually, if we had checksums
    > on SLRU pages, a single bit flip anywhere in the page would make the
    > checksum fail and render the block unreadable.
    >
    > If things go really bad and you need to open a hex editor and try to fix
    > the data manually, it shouldn't be too hard to deduce the correct base
    > offset from surrounding data.
    >
    > > As a software developer, I definitely want to implement compression and
    > > save a few gigabytes. However, given my previous experience using
    > > Postgres in real-world applications, reliability at the cost of several
    > > gigabytes would not have caused me any trouble. Just saying.
    >
    > +1. If we decide to do some kind of compression here, I want it to be
    > very simple. Otherwise it's just not worth the code complexity and risk.
    >
    > Let's do the math of how much disk space we'd save. Let's assume the
    > worst case that every multixid consists of only one transaction ID.
    > Currently, every such multixid takes up 4 bytes in the offsets SLRU, and
    > 5 bytes in the members SLRU (one flag byte and 4 bytes for the XID). So
    > that's 9 bytes. With 64-bit offsets, it becomes 13 bytes. With the
    > compression, we're back to 9 bytes again (ignoring the one base offset
    > per page). So in an extreme case that you have 1 billion multixids, with
    > only one XID per multixid, the difference is between 9 GB and 13 GB.
    > That seems acceptable.
    >
    > And having just one XID per multixid is a rare corner case. Much more
    > commonly, you have at at least two XIDs. With two XIDs per multixid, the
    > difference is between 14 bytes and 18 bytes.
    >
    > And having a billion multixids is pretty extreme. Your database is
    > likely very large too if you reach that point, and a few gigabytes won't
    > matter.
    
    I am in favour of keeping things simpler than using a complex compression.
    
    >
    > One could argue that the memory needed for the SLRU cache matters more
    > than the disk space. That's perhaps true, but I think this is totally
    > acceptable from that point of view, too.
    
    This brings an interesting point. Since the offsets are twice large,
    SLRU will contain half the entries than earlier. Have we measured
    performance impact of this? Do we need to provide some guidance about
    increasing the SLRU size or increase the default SLRU size?
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  82. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-12-04T10:39:43Z

    On 26/11/2025 17:50, Heikki Linnakangas wrote:
    > On 26/11/2025 17:23, Maxim Orlov wrote:
    >> On Tue, 25 Nov 2025 at 13:07, Heikki Linnakangas <hlinnaka@iki.fi 
    >> <mailto:hlinnaka@iki.fi>> wrote:
    >>> GetOldMultiXactIdSingleMember() currently asserts that the offset is
    >>> never zero, but it should try to do something sensible in that case
    >>> instead of just failing.
    >>
    >> Correct me if I'm wrong, but we added the assertion that offsets are
    >> never 0, based on the idea that case #2 will never take place during an
    >> update. If this isn't the case, this assertion could be removed.
    >> The rest of the function appears to work correctly.
    >>
    >> I even think that, as an experiment, we could randomly reset some of the
    >> offsets to zero and nothing would happen, except that some data would
    >> be lost.
    > 
    > +1
    > 
    >> The most sensible thing we can do is give the user a warning, right?
    >> Something like, "During the update, we encountered some weird offset
    >> that shouldn't have been there, but there's nothing we can do about it,
    >> just take note."
    > 
    > Yep, makes sense.
    
    I read through the SLRU reading codepath, looking for all the things 
    that could go wrong (not sure I got them all):
    
    1. An SLRU file does not exist
    2. An SLRU file is too short, i.e. a page does not exist
    3. The offset in 'offsets' page is 0
    4. The offset in 'offsets' page looks invalid, i.e. it's greater than 
    nextOffset or smaller than oldestOffset.
    5. The offset is out of order compared to its neighbors
    6. The multixid has no members
    7. The multixid has an invalid (0) member
    8. A multixid has more than one updating member
    
    Some of those situations are theoretically are possible if there was a 
    crash. We don't follow the WAL-before-data rule for these SLRUs. 
    Instead, we piggyback on the WAL-before-data of the heap page that would 
    reference the multixid. In other words, we rely on the fact that if a 
    multixid write is missed or torn because of a crash, that multixid will 
    not be referenced from anywhwere and will never be read.
    
    However, that doesn't hold for pg_upgrade. pg_upgrade will try to read 
    all the multixids. So we need to make the multixact reading code 
    tolerant of the situations that could be present after a crash. I think 
    the right philosophy here is that we try to read all the old multixids, 
    and do our best to interpret them the same way that the old server 
    would. For those situations that can legitimately be present if the old 
    server crashed at some point, be silent. For cases that should not 
    happen, even if there was a crash, print a warning. For example, I think 
    an SLRU file should never be missing (1) or truncated (2). But the zero 
    offset (3), and (6) can happen.
    
    Perhaps we should check that all the files exist and have the correct 
    sizes in the pre-check stage, and abort the upgrade early if anything is 
    missing. That would be pretty cheap to check.
    
    - Heikki
    
    
    
    
    
  83. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2025-12-04T15:33:47Z

    On Thu, 4 Dec 2025 at 13:39, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    
    > However, that doesn't hold for pg_upgrade. pg_upgrade will try to read
    > all the multixids. So we need to make the multixact reading code
    > tolerant of the situations that could be present after a crash. I think
    > the right philosophy here is that we try to read all the old multixids,
    > and do our best to interpret them the same way that the old server
    > would.
    
    
    Something like attached?
    
    Now previous scheme of upgrade with the bytes joggling start
    looking not so bad. Just a funny thought that came to my mind.
    
    Perhaps we should check that all the files exist and have the correct
    
    sizes in the pre-check stage
    
    
    Not sure about it. Because SLRU does not support "holes", simply
    checking if the first and last multixacts exist will be enough. But
    we'll do it anyway in a real conversion.
    
    PFA to start a conversation.
    
    -- 
    Best regards,
    Maxim Orlov.
    
  84. Re: POC: make mxidoff 64 bits

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-12-05T13:42:57Z

    On Fri, Nov 28, 2025 at 6:35 PM Ashutosh Bapat
    <ashutosh.bapat.oss@gmail.com> wrote:
    >
    >
    > I will continue reviewing it further.
    >
    
    There is duplication of code/functionality between server and
    pg_upgrade. With it we carry all the risks that come with
    code/functionality duplication like the copies going out of sync.
    There may be a valid reason to do that but it's not documented in the
    comments. At-least both mutlixact_new.c and slru_io.c are not as well
    commented as their server counterparts. I understand that the SLRU
    code in the server deals with shared memory which is not needed in
    pg_upgrade; pg_upgrade will not need more than one buffer in memory
    and pg_upgrade code doesn't need to deal with lock and it can not deal
    with locks. That means the code required by pg_upgrade is much simpler
    than that on the server. But there's also non-trivial code which is
    required in both the cases. WIll it be possible to extract parts of
    slru.c which deal with IO into slru_io.c, make it part of the core and
    then use it in pg_upgrade as well as slru.c? Or whether it's possible
    to make SLRU use local memory? And throwing some FRONTEND magic to the
    mix, we may be able to avoid duplication. Have we tried this or
    something else to avoid duplication? Sorry, if this has been discussed
    earlier. Please point me to the relevant discussion if so.
    
    --
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  85. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-12-05T23:36:46Z

    After a little detour to the "IPC/MultixactCreation on the Standby 
    server" issue [1], I'm back to working on this. New patch version 
    attached, addressing your comments and Maxim's.
    
    On 05/12/2025 15:42, Ashutosh Bapat wrote:
    > There is duplication of code/functionality between server and
    > pg_upgrade. With it we carry all the risks that come with
    > code/functionality duplication like the copies going out of sync.
    > There may be a valid reason to do that but it's not documented in the
    > comments. At-least both mutlixact_new.c and slru_io.c are not as well
    > commented as their server counterparts. I understand that the SLRU
    > code in the server deals with shared memory which is not needed in
    > pg_upgrade; pg_upgrade will not need more than one buffer in memory
    > and pg_upgrade code doesn't need to deal with lock and it can not deal
    > with locks. That means the code required by pg_upgrade is much simpler
    > than that on the server. But there's also non-trivial code which is
    > required in both the cases. WIll it be possible to extract parts of
    > slru.c which deal with IO into slru_io.c, make it part of the core and
    > then use it in pg_upgrade as well as slru.c? Or whether it's possible
    > to make SLRU use local memory? And throwing some FRONTEND magic to the
    > mix, we may be able to avoid duplication. Have we tried this or
    > something else to avoid duplication? Sorry, if this has been discussed
    > earlier. Please point me to the relevant discussion if so.
    
    That's a fair point, but I think it's better to have some code 
    duplication in this case, than trying to write code that works for both 
    the server and for pg_upgrade. The needs are so different.
    
    > 007_multixact_conversion.pl fires thousands of queries through
    > BackgroundPsql which prints debug output for each of the queries. When
    > running this file with oldinstall set,
    > 2.2M  regress_log_007_multixact_conversion (size of file)
    > 77874 regress_log_007_multixact_conversion (wc -l output)
    > 
    > 
    > Since this output is also copied in testlog.txt, the effect is two-fold.
    > 
    > 
    > Most, if not all, of this output is useless. It also makes it hard to
    > find the output we are looking for. PFA patch which reduces this
    > output. The patch adds a flag verbose to query_safe() and query() to
    > toggle this output. With the patch the sizes are
    > 27K  regress_log_007_multixact_conversion
    > 588 regress_log_007_multixact_conversion
    > 
    > 
    > And it makes the test faster by about a second or two on my laptop.
    > Something on those lines or other is required to reduce the output
    > from query_safe().
    
    Nice! That log bloat was the reason I bundled together the "COMMIT; 
    BEGIN; SELECT ...;" steps into one statement in the loop. Your solution 
    addresses it more directly.
    
    I turned 'verbose' into a keyword parameter, for future extensibility of 
    those functions, so you now call it like "$node->query_safe("SELECT 1", 
    verbose => 0);". I also set "log_statements=none" in those connections, 
    to reduce the noise in the server log too.
    
    > Some more comments
    > +++ b/src/bin/pg_upgrade/multixact_old.c
    > 
    > 
    > We may need to introduce new _new and then _old will become _older.
    > Should we rename the files to have pre19 and post19 or some similar
    > suffixes which make it clear what is meant by old and new?
    
    +1. I renamed multixact_old.c to multixact_pre_v19.c. And 
    multixact_new.c to multixact_rewrite.c. I also moved the 
    "convert_multixact" function that drives the conversion to 
    multixact_rewrite.c. The idea is that if in the future we change the 
    format again, we will have:
    
    multixact_pre_v19.c  # for reading -v19 files
    multixact_pre_v24.c  # for reading v19-v23 files
    multixact_rewrite.c  # for writing new files
    
    Hard to predict what a possible future format might look like and how 
    we'd want to organize the code then, though. This can be changed then if 
    needed, but it makes sense now.
    
    > +static inline int64
    > +MultiXactIdToOffsetPage(MultiXactId multi)
    > 
    > 
    > The prologue mentions that the definitions are copy-pasted from
    > multixact.c from version 18, but they share the names with functions
    > in the current version. I think that's going to be a good source of
    > confusion especially in a file which is a few hundred lines long. Can
    > we rename them to have "Old" prefix or something similar?
    
    Fair. On the other hand, having the same names makes it easier to see 
    what the real differences with the server functions are. Not sure what's 
    best here..
    
    As long as we use the same names, it's important that 
    multixact_pre_v19.c doesn't #include the new definitions. I added some 
    comments on that, and also this safeguard:
    
    #define MultiXactOffset should_not_be_used
    
    That actually caught one (harmless) instance in the file where we had 
    not renamed MultiXactOffset to OldMultiXactOffset.
    
    I'm not entirely happy with the "Old" prefix here, because as you 
    pointed out, we might end up needing "older" or "oldold" in the future. 
    I couldn't come up with anything better though. "PreV19MultiXactOffset" 
    is quite a mouthful.
    
    > +# Dump contents of the 'mxofftest' table, created by mxact_workload
    > +sub get_dump_for_comparison
    > 
    > 
    > This local function shares its name with a local function in
    > 002_pg_upgrade.pl. Better to use a separate name. Also it's not
    > "dumping" data using "pg_dump", so "dump" in the name can be
    > misleading.
    
    Renamed to "get_test_table_contents"
    
    > + $newnode->start;
    > + my $new_dump = get_dump_for_comparison($newnode, "newnode_${tag}_dump");
    > + $newnode->stop;
    > 
    > 
    > There is no code which actually looks at the multixact offsets here to
    > make sure that the conversion happened correctly. I guess the test
    > relies on visibility checks for that. Anyway, we need a comment
    > explaining why just comparing the contents of the table is enough to
    > ensure correct conversion. Better if we can add an explicit test that
    > the offsets were converted correctly. I don't have any idea of how to
    > do that right now, though. Maybe use pg_get_multixact_members()
    > somehow in the query to extract data out of the table?
    
    Agreed, the verification here is quite weak. I didn't realize that 
    pg_get_multixact_members() exists! That might indeed be handy here, but 
    I'm not sure how exactly to construct the test. A direct C function like 
    test_create_multixact() in test_multixact.c would be handy here, but 
    we'd need to compile and do run that in the old cluster, which seems 
    difficult.
    
    > + compare_files($old_dump, $new_dump,
    > + 'dump outputs from original and restored regression databases match');
    > 
    > 
    > A shared test name too :); but there is not regression database here.
    
    Fixed :-)
    
    > +
    > + note ">>> case #${tag}\n"
    > +   . " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n"
    > +   . " newnode mxoff ${new_next_mxoff}\n";
    > 
    > 
    > Should we check that some condition holds between finish_mxoff and
    > new_next_mxoff?
    
    Got something in mind that we could check?
    
    
    On 04/12/2025 17:33, Maxim Orlov wrote:
    > On Thu, 4 Dec 2025 at 13:39, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
    >> However, that doesn't hold for pg_upgrade. pg_upgrade will try to read
    >> all the multixids. So we need to make the multixact reading code
    >> tolerant of the situations that could be present after a crash. I think
    >> the right philosophy here is that we try to read all the old multixids,
    >> and do our best to interpret them the same way that the old server
    >> would.
    > 
    > Something like attached?
    
    +1
    
    > Now previous scheme of upgrade with the bytes joggling start
    > looking not so bad. Just a funny thought that came to my mind.
    
    :-)
    
    >> Perhaps we should check that all the files exist and have the correct
    >> sizes in the pre-check stage
    > 
    > Not sure about it. Because SLRU does not support "holes", simply
    > checking if the first and last multixacts exist will be enough. But
    > we'll do it anyway in a real conversion.
    
    Yeah, the point would be to complain if there are holes when there 
    shouldn't be. As a sanity check.
    
    There's a reason to not do that though: if you use pg_resetwal to skip 
    over some multixids, you end up with holes. We shouldn't encourage 
    people to use pg_resetwal, but it seems reasonable to tolerate it if you 
    have done it.
    
    I worked some more on this. One notable change is that in light of the 
    "IPC/MultixactCreation on the Standby server" changes [1], we need to 
    always write the next multixid's offset, even if the next multixid 
    itself is invalid. Because otherwise the previous multixid is unreadable 
    too.
    
    The SlruReadSwitchPageSlow() function didn't handle short reads 
    properly. As a result, if an SLRU file was shorter than expected, the 
    buffer kept its old contents when switching to read the missing page. 
    Fixed that so that the missing part is read as all-zeros instead.
    
    I removed the warnings from some of the invalid multixid cases, like if 
    the offset or some of the members are zeros. Those cases can 
    legitimately happen after crash and restart, so we shouldn't complain 
    about them. If a multixid has more than one updating member, I kept that 
    as a fatal error. That really should not happen.
    
    To summarize, the behavior now is that if an old SLRU file does not 
    exist, you get an error. If an SLRU file is too short, you get warnings 
    and the missing pages are read as all-zeros, i.e. all the multixids on 
    the missing pages are considered invalid. If an individual multixid is 
    invalid, because the offset is zero, it's silently written as invalid in 
    the new file too.
    
    I'm still not 100% sure what the desired behavior for missing files is. 
    For now, I didn't include the pre-checks for the first and last files in 
    this version. You can end up with missing files if you skip over many 
    multixids with pg_resetwal. Or it could be a sign of lost data. If it's 
    lost data, would you prefer for the upgrade to fail, or continue 
    upgrading the data that you have? The conversion shouldn't make things 
    worse, if the data was already lost, but then again, if something really 
    bad has happened, all bets are off and perhaps it would be best to abort 
    and complain loudly.
    
    [1] 
    https://www.postgresql.org/message-id/172e5723-d65f-4eec-b512-14beacb326ce@yandex.ru
    
    - Heikki
    
  86. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-12-08T13:02:11Z

    On 06/12/2025 01:36, Heikki Linnakangas wrote:
    > On 05/12/2025 15:42, Ashutosh Bapat wrote:
    >> + $newnode->start;
    >> + my $new_dump = get_dump_for_comparison($newnode, "newnode_${tag} 
    >> _dump");
    >> + $newnode->stop;
    >>
    >>
    >> There is no code which actually looks at the multixact offsets here to
    >> make sure that the conversion happened correctly. I guess the test
    >> relies on visibility checks for that. Anyway, we need a comment
    >> explaining why just comparing the contents of the table is enough to
    >> ensure correct conversion. Better if we can add an explicit test that
    >> the offsets were converted correctly. I don't have any idea of how to
    >> do that right now, though. Maybe use pg_get_multixact_members()
    >> somehow in the query to extract data out of the table?
    > 
    > Agreed, the verification here is quite weak. I didn't realize that 
    > pg_get_multixact_members() exists! That might indeed be handy here, but 
    > I'm not sure how exactly to construct the test. A direct C function like 
    > test_create_multixact() in test_multixact.c would be handy here, but 
    > we'd need to compile and do run that in the old cluster, which seems 
    > difficult.
    
    I added verification of all the multixids between oldest and next 
    multixid, using pg_get_multixact_members(). The test now calls 
    pg_get_multixact_members() for all updating multixids in the range, 
    before and after the upgrade, and compares the results.
    
    The verification ignores locking-only multixids. Verifying their 
    correctness would need a little more code because they're not fully 
    preserved by the upgrade.
    
    I also expanded the test to cover multixid wraparound. It only covered 
    mxoffset wraparound previously.
    
    New patch set attached. Only test changes compared to patch set v28.
    
    - Heikki
    
  87. Re: POC: make mxidoff 64 bits

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-12-08T15:43:00Z

    On Sat, Dec 6, 2025 at 5:06 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    > On 05/12/2025 15:42, Ashutosh Bapat wrote:
    
    >
    > > 007_multixact_conversion.pl fires thousands of queries through
    > > BackgroundPsql which prints debug output for each of the queries. When
    > > running this file with oldinstall set,
    > > 2.2M  regress_log_007_multixact_conversion (size of file)
    > > 77874 regress_log_007_multixact_conversion (wc -l output)
    > >
    > >
    > > Since this output is also copied in testlog.txt, the effect is two-fold.
    > >
    > >
    > > Most, if not all, of this output is useless. It also makes it hard to
    > > find the output we are looking for. PFA patch which reduces this
    > > output. The patch adds a flag verbose to query_safe() and query() to
    > > toggle this output. With the patch the sizes are
    > > 27K  regress_log_007_multixact_conversion
    > > 588 regress_log_007_multixact_conversion
    > >
    > >
    > > And it makes the test faster by about a second or two on my laptop.
    > > Something on those lines or other is required to reduce the output
    > > from query_safe().
    >
    > Nice! That log bloat was the reason I bundled together the "COMMIT;
    > BEGIN; SELECT ...;" steps into one statement in the loop. Your solution
    > addresses it more directly.
    
    Now we can call query_safe() separately on each of those. That will be
    more readable and marginally less code.
    
    >
    > I turned 'verbose' into a keyword parameter, for future extensibility of
    > those functions, so you now call it like "$node->query_safe("SELECT 1",
    > verbose => 0);". I also set "log_statements=none" in those connections,
    > to reduce the noise in the server log too.
    
    keyword parameter is better. also +1 for log_statements.
    
    >
    > > Some more comments
    > > +++ b/src/bin/pg_upgrade/multixact_old.c
    > >
    > >
    > > We may need to introduce new _new and then _old will become _older.
    > > Should we rename the files to have pre19 and post19 or some similar
    > > suffixes which make it clear what is meant by old and new?
    >
    > +1. I renamed multixact_old.c to multixact_pre_v19.c. And
    > multixact_new.c to multixact_rewrite.c. I also moved the
    > "convert_multixact" function that drives the conversion to
    > multixact_rewrite.c. The idea is that if in the future we change the
    > format again, we will have:
    >
    > multixact_pre_v19.c  # for reading -v19 files
    > multixact_pre_v24.c  # for reading v19-v23 files
    > multixact_rewrite.c  # for writing new files
    >
    > Hard to predict what a possible future format might look like and how
    > we'd want to organize the code then, though. This can be changed then if
    > needed, but it makes sense now.
    
    +1.
    
    >
    > > +static inline int64
    > > +MultiXactIdToOffsetPage(MultiXactId multi)
    > >
    > >
    > > The prologue mentions that the definitions are copy-pasted from
    > > multixact.c from version 18, but they share the names with functions
    > > in the current version. I think that's going to be a good source of
    > > confusion especially in a file which is a few hundred lines long. Can
    > > we rename them to have "Old" prefix or something similar?
    >
    > Fair. On the other hand, having the same names makes it easier to see
    > what the real differences with the server functions are. Not sure what's
    > best here..
    >
    > As long as we use the same names, it's important that
    > multixact_pre_v19.c doesn't #include the new definitions. I added some
    > comments on that, and also this safeguard:
    >
    > #define MultiXactOffset should_not_be_used
    >
    > That actually caught one (harmless) instance in the file where we had
    > not renamed MultiXactOffset to OldMultiXactOffset.
    >
    
    That looks useful, and has proved to be useful already.
    
    > I'm not entirely happy with the "Old" prefix here, because as you
    > pointed out, we might end up needing "older" or "oldold" in the future.
    > I couldn't come up with anything better though. "PreV19MultiXactOffset"
    > is quite a mouthful.
    
    How about MultiXactOffset32?
    
    Thanks for addressing rest of the comments.
    
    >
    > > +
    > > + note ">>> case #${tag}\n"
    > > +   . " oldnode mxoff from ${start_mxoff} to ${finish_mxoff}\n"
    > > +   . " newnode mxoff ${new_next_mxoff}\n";
    > >
    > >
    > > Should we check that some condition holds between finish_mxoff and
    > > new_next_mxoff?
    >
    > Got something in mind that we could check?
    >
    
    I have always seen that finish_mxoff is very high compared to newnode
    mxoff - given that we write only one member per mxid, is newnode mxoff
    going to be always something like 4K or so? Then we can check that
    value. But I will experiment more to see if I can come up with
    something, if possible.
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  88. Re: POC: make mxidoff 64 bits

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-12-08T15:50:23Z

    On Mon, Dec 8, 2025 at 6:32 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    >
    > On 06/12/2025 01:36, Heikki Linnakangas wrote:
    > > On 05/12/2025 15:42, Ashutosh Bapat wrote:
    > >> + $newnode->start;
    > >> + my $new_dump = get_dump_for_comparison($newnode, "newnode_${tag}
    > >> _dump");
    > >> + $newnode->stop;
    > >>
    > >>
    > >> There is no code which actually looks at the multixact offsets here to
    > >> make sure that the conversion happened correctly. I guess the test
    > >> relies on visibility checks for that. Anyway, we need a comment
    > >> explaining why just comparing the contents of the table is enough to
    > >> ensure correct conversion. Better if we can add an explicit test that
    > >> the offsets were converted correctly. I don't have any idea of how to
    > >> do that right now, though. Maybe use pg_get_multixact_members()
    > >> somehow in the query to extract data out of the table?
    > >
    > > Agreed, the verification here is quite weak. I didn't realize that
    > > pg_get_multixact_members() exists! That might indeed be handy here, but
    > > I'm not sure how exactly to construct the test. A direct C function like
    > > test_create_multixact() in test_multixact.c would be handy here, but
    > > we'd need to compile and do run that in the old cluster, which seems
    > > difficult.
    >
    > I added verification of all the multixids between oldest and next
    > multixid, using pg_get_multixact_members(). The test now calls
    > pg_get_multixact_members() for all updating multixids in the range,
    > before and after the upgrade, and compares the results.
    
    I thought about adding pg_get_multixact_member in
    get_test_table_contents() itself like SELECT ctid, xmin, xmax,
    get_multixact_member(xmin), get_multixact_member(xmax) * FROM
    mxofftest; but then I realized that the UPDATE would replace mxids by
    actual transaction ids in the visible rows. So that can't be used.
    What you have done doesn't have that drawback, but it's also not
    checking whether the multixids in (invisible) rows are reachable in
    offsets and members. But probably that's too hard to do and is covered
    by visibility checks.
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  89. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-12-09T12:00:05Z

    On 08/12/2025 17:43, Ashutosh Bapat wrote:
    > On Sat, Dec 6, 2025 at 5:06 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    >> On 05/12/2025 15:42, Ashutosh Bapat wrote:
    >>> And it makes the test faster by about a second or two on my laptop.
    >>> Something on those lines or other is required to reduce the output
    >>> from query_safe().
    >>
    >> Nice! That log bloat was the reason I bundled together the "COMMIT;
    >> BEGIN; SELECT ...;" steps into one statement in the loop. Your solution
    >> addresses it more directly.
    > 
    > Now we can call query_safe() separately on each of those. That will be
    > more readable and marginally less code.
    
    Done.
    
    >> I'm not entirely happy with the "Old" prefix here, because as you
    >> pointed out, we might end up needing "older" or "oldold" in the future.
    >> I couldn't come up with anything better though. "PreV19MultiXactOffset"
    >> is quite a mouthful.
    > 
    > How about MultiXactOffset32?
    
    Ooh, I like that. It doesn't sound as nice for the other "old" prefixed 
    things though. So I changed OldMultiXactOffset to MultiXactOffset32, but 
    kept OldMultiXactReader, GetOldMultiXactIdSingleMember() et al. We can 
    live with that for now, and rename in the future if we'd need "oldold".
    
    Committed with that and some other minor cleanups. Thanks everyone! This 
    patch has been brewing for a while :-).
    
    
    There are some noncritical followups that I'd like to address, now that 
    we know that in v19 the pg_multixact files will be rewritten. That gives 
    us an opportunity to clean up some backwards-compatibility stuff. The 
    committed patch already cleaned up a bunch, but there's some more we 
    could do:
    
    1. Currently, at multixid wraparound, MultiXactState->nextMXact goes to 
    0, which is invalid. All the readers must be prepared for that, and skip 
    over the 0. That's error-prone, we've already missed that a few times. 
    Let's change things so that the code that *writes* 
    MultiXactState->nextMXact skips over the zero already.
    
    2. We currently don't persist 'oldestOffset' in the control file the 
    same way as 'oldestMultiXactId'. Instead, we look up the offset of the 
    oldestMultiXactId at startup, and keep that value in memory. Originally 
    that was because we missed the need for that and had to add the offset 
    wraparound protections in a minor release without changing the control 
    file format. But we could easily do it now.
    
    With 64-bit offsets, it's actually less critical to persist the 
    oldestOffset. Previously, if we failed to look up the oldest offset 
    because the oldest multixid was invalid, it could lead to serious 
    trouble if the offsets then wrapped around and old offsets were 
    overwritten, but that won't happen anymore. Nevertheless, it leads to 
    unnecessarily aggressive vacuuming and some messages in the log.
    
    At first I thought that the failure to look up the oldest offset should 
    no longer happen, because we don't need to support reading old 9.3 era 
    SLRUs anymore that were created before we added the offset wraparound 
    protection. But it's not so: it's still possible to have multixids with 
    invalid offsets in the 'offsets' SLRU on a crash. Such multixids won't 
    be referenced from anywhere in the heap, but I think they could later 
    become the oldest multixid, and we would fail to look up its offset. 
    Persisting the oldest offset doesn't fully fix that problem, because 
    advancing the oldest offset is done by looking up the oldest multixid's 
    offset anyway.
    
    3. I think we should turn some of the assertions in 
    GetMultiXactIdMembers() into ereports(ERROR) calls. I included those 
    changes in my patch version 29 [1], as a separate patch, but I didn't 
    commit that yet.
    
    4. Compressing the offsets, per discussion. It doesn't really seem worth 
    to me and I don't intend to work on it, but if someone wants to do it, 
    now would be the time, so that we don't need to have upgrade code to 
    deal with yet another format.
    
    - Heikki
    
    
    
    
    
  90. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-12-10T19:19:18Z

    On 09/12/2025 14:00, Heikki Linnakangas wrote:
    > 1. Currently, at multixid wraparound, MultiXactState->nextMXact goes to 
    > 0, which is invalid. All the readers must be prepared for that, and skip 
    > over the 0. That's error-prone, we've already missed that a few times. 
    > Let's change things so that the code that *writes* MultiXactState- 
    >  >nextMXact skips over the zero already.
    
    Here's a patch for that. Does anyone see a problem with this?
    
    - Heikki
    
  91. Re: POC: make mxidoff 64 bits

    Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> — 2025-12-11T03:06:11Z

    On Thu, Dec 11, 2025 at 12:49 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    >
    > On 09/12/2025 14:00, Heikki Linnakangas wrote:
    > > 1. Currently, at multixid wraparound, MultiXactState->nextMXact goes to
    > > 0, which is invalid. All the readers must be prepared for that, and skip
    > > over the 0. That's error-prone, we've already missed that a few times.
    > > Let's change things so that the code that *writes* MultiXactState-
    > >  >nextMXact skips over the zero already.
    >
    > Here's a patch for that. Does anyone see a problem with this?
    
    The patch looks fine to me. It simplifies readers without affecting
    writers much. I was expecting more explanation of why it wasn't done
    that way to start with and why is it safe to do so (now, if
    applicable). There must be a reason why we chose to make readers
    handle invalid mxid instead of writers writing one. If it's for
    performance reasons then does the new arrangement cause any
    regression? If it's for safety reasons, are we fixing one set of
    problems but introducing a new set. I was expecting commit message to
    answer those questions.
    
    -- 
    Best Wishes,
    Ashutosh Bapat
    
    
    
    
  92. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-12-11T07:58:34Z

    On 11/12/2025 05:06, Ashutosh Bapat wrote:
    > On Thu, Dec 11, 2025 at 12:49 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    >>
    >> On 09/12/2025 14:00, Heikki Linnakangas wrote:
    >>> 1. Currently, at multixid wraparound, MultiXactState->nextMXact goes to
    >>> 0, which is invalid. All the readers must be prepared for that, and skip
    >>> over the 0. That's error-prone, we've already missed that a few times.
    >>> Let's change things so that the code that *writes* MultiXactState-
    >>>   >nextMXact skips over the zero already.
    >>
    >> Here's a patch for that. Does anyone see a problem with this?
    > 
    > The patch looks fine to me. It simplifies readers without affecting
    > writers much. I was expecting more explanation of why it wasn't done
    > that way to start with and why is it safe to do so (now, if
    > applicable). There must be a reason why we chose to make readers
    > handle invalid mxid instead of writers writing one. If it's for
    > performance reasons then does the new arrangement cause any
    > regression? If it's for safety reasons, are we fixing one set of
    > problems but introducing a new set. I was expecting commit message to
    > answer those questions.
    
    That's a great question and I've been wondering about it myself. It goes 
    all the way to the initial commit where multixacts were introduced, and 
    I don't see any particular reason for it even back then. Even in the 
    very first version of multixact.c, IMO it would've been simpler to have 
    the writer handle the wraparound.
    
    Álvaro, would you happen to remember?
    
    - Heikki
    
    
    
    
    
  93. Re: POC: make mxidoff 64 bits

    Maxim Orlov <orlovmg@gmail.com> — 2025-12-11T12:47:53Z

    On Thu, 11 Dec 2025 at 10:58, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    
    >
    > That's a great question and I've been wondering about it myself. It goes
    > all the way to the initial commit where multixacts were introduced, and
    > I don't see any particular reason for it even back then. Even in the
    > very first version of multixact.c, IMO it would've been simpler to have
    > the writer handle the wraparound.
    >
    > +1 This code is quite old. I don't see any particular reason for doing
    it that way. Unfortunately, we were unable to prove the absence of
    something, namely errors, in this instance. But there were no obvious
    statements on why it should be in this manner. So, for me, it's much
    clearer to increment and handle wraparound in one place rather
    than spread it across multiple calls in the module.
    
    -- 
    Best regards,
    Maxim Orlov.
    
  94. Re: POC: make mxidoff 64 bits

    Alvaro Herrera <alvherre@alvh.no-ip.org> — 2025-12-11T20:05:24Z

    On 2025-Dec-11, Heikki Linnakangas wrote:
    
    > That's a great question and I've been wondering about it myself. It goes all
    > the way to the initial commit where multixacts were introduced, and I don't
    > see any particular reason for it even back then. Even in the very first
    > version of multixact.c, IMO it would've been simpler to have the writer
    > handle the wraparound.
    > 
    > Álvaro, would you happen to remember?
    
    Sorry, I have no recollections of the reason why it was done this way.
    
    -- 
    Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
    Voy a acabar con todos los humanos / con los humanos yo acabaré
    voy a acabar con todos (bis) / con todos los humanos acabaré ¡acabaré! (Bender)
    
    
    
    
  95. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-12-12T08:51:25Z

    On 11/12/2025 22:05, Alvaro Herrera wrote:
    > On 2025-Dec-11, Heikki Linnakangas wrote:
    > 
    >> That's a great question and I've been wondering about it myself. It goes all
    >> the way to the initial commit where multixacts were introduced, and I don't
    >> see any particular reason for it even back then. Even in the very first
    >> version of multixact.c, IMO it would've been simpler to have the writer
    >> handle the wraparound.
    >>
    >> Álvaro, would you happen to remember?
    > 
    > Sorry, I have no recollections of the reason why it was done this way.
    
    Ok, I have pushed this. Thanks!
    
    - Heikki
    
    
    
    
    
  96. Re: POC: make mxidoff 64 bits

    Tom Lane <tgl@sss.pgh.pa.us> — 2025-12-14T22:55:36Z

    Heikki Linnakangas <hlinnaka@iki.fi> writes:
    > Ok, I have pushed this. Thanks!
    
    Coverity is unhappy about this bit:
    
    /srv/coverity/git/pgsql-git/postgresql/src/bin/pg_upgrade/multixact_read_v18.c: 282             in GetOldMultiXactIdSingleMember()
    276             if (!TransactionIdIsValid(*xactptr))
    277             {
    278                 /*
    279                  * Corner case 2: we are looking at unused slot zero
    280                  */
    281                 if (offset == 0)
    >>>     CID 1676077:         Control flow issues  (DEADCODE)
    >>>     Execution cannot reach this statement: "continue;".
    282                     continue;
    283     
    284                 /*
    285                  * Otherwise this is an invalid entry that should not be
    
    It sees the earlier test for offset == 0, and evidently is assuming
    that the loop's "offset++" will not wrap around.  Now I think that
    the point of this check is exactly that "offset++" could have wrapped
    around, but the commentary is not so clear that I'm certain this is a
    false positive.  If that is the intention, what do you think of
    rephrasing this comment as "we have wrapped around to unused slot
    zero"?
    
    			regards, tom lane
    
    
    
    
  97. Re: POC: make mxidoff 64 bits

    Alexander Lakhin <exclusion@gmail.com> — 2025-12-15T08:00:00Z

    Hello Heikki,
    
    09.12.2025 14:00, Heikki Linnakangas wrote:
    > Committed with that and some other minor cleanups. Thanks everyone! This patch has been brewing for a while :-).
    
    I've spotted a couple of failures of new test 007_multixact_conversion at
    buildfarm:
    https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=urutu&dt=2025-12-09%2020%3A40%3A53
    007_multixact_conversion_basic_oldnode.log:
    ...
    2025-12-09 22:33:39.299 CET [2872679][client backend][21/2:0] LOG: statement: SET log_statement=none
         ;
    2025-12-09 22:36:39.025 CET [2871745][postmaster][:0] LOG:  received immediate shutdown request
    (180 seconds timeout)
    
    https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=canebrake&dt=2025-12-14%2023%3A53%3A48
    007_multixact_conversion_basic_oldnode.log:
    ...
    2025-12-15 01:57:01.380 CET [2178307][client backend][21/2:0] LOG: statement: SET log_statement=none
         ;
    2025-12-15 02:00:01.020 CET [2177271][postmaster][:0] LOG:  received immediate shutdown request
    (180 seconds timeout)
    
    Both occurred on JIT-enabled animals (moreover, JIT is provided by a debug
    build of LLVM), so these animals are very slow.
    
    Looking at other urutu's runs, we can see:
    https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=urutu&dt=2025-12-10%2013%3A06%3A35
    007_multixact_conversion_basic_oldnode.log::
    2025-12-10 14:37:05.254 CET [2322763][client backend][21/2:0] LOG: statement: SET log_statement=none
         ;
    2025-12-10 14:39:21.784 CET [2322610][client backend][:0] LOG: disconnection: session time: 0:02:16.878 user=bf 
    database=postgres host=[local]
    (136 seconds)
    
    https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=urutu&dt=2025-12-10%2017%3A43%3A02
    2025-12-10 19:20:30.310 CET [1785680][client backend][21/2:0] LOG: statement: SET log_statement=none
         ;
    2025-12-10 19:22:41.734 CET [1784967][client backend][:0] LOG: disconnection: session time: 0:02:11.903 user=bf 
    database=postgres host=[local]
    (133 seconds)
    
    Though major runs show timing under 80 seconds, e.g.:
    https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=urutu&dt=2025-12-14%2021%3A20%3A49
    2025-12-14 22:48:53.764 CET [3751001][client backend][21/2:0] LOG: statement: SET log_statement=none
         ;
    2025-12-14 22:49:57.223 CET [3750961][client backend][:0] LOG: disconnection: session time: 0:01:03.571 user=bf 
    database=postgres host=[local]
    (64 seconds)
    
    And a couple of other (successful) canebrake's runs:
    https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=canebrake&dt=2025-12-14%2021%3A09%3A31
    2025-12-14 22:59:48.951 CET [3761649][client backend][21/2:0] LOG: statement: SET log_statement=none
         ;
    2025-12-14 23:01:26.274 CET [3761608][client backend][:0] LOG: disconnection: session time: 0:01:37.441 user=bf 
    database=postgres host=[local]
    (98 seconds)
    
    https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=canebrake&dt=2025-12-14%2012%3A03%3A29
    2025-12-14 13:59:29.205 CET [1564673][client backend][21/2:0] LOG: statement: SET log_statement=none
         ;
    2025-12-14 14:01:15.208 CET [1564633][client backend][:0] LOG: disconnection: session time: 0:01:46.147 user=bf 
    database=postgres host=[local]
    (106 seconds)
    
    Thus, it looks like these animals can hit 180 seconds timeout with some
    external factors (concurrent load?) that make them run 2-3 times slower...
    
    Best regards,
    Alexander
    
    
    
    
  98. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-12-15T09:55:10Z

    On 15/12/2025 00:55, Tom Lane wrote:
    > Heikki Linnakangas <hlinnaka@iki.fi> writes:
    >> Ok, I have pushed this. Thanks!
    > 
    > Coverity is unhappy about this bit:
    > 
    > /srv/coverity/git/pgsql-git/postgresql/src/bin/pg_upgrade/multixact_read_v18.c: 282             in GetOldMultiXactIdSingleMember()
    > 276             if (!TransactionIdIsValid(*xactptr))
    > 277             {
    > 278                 /*
    > 279                  * Corner case 2: we are looking at unused slot zero
    > 280                  */
    > 281                 if (offset == 0)
    >>>>      CID 1676077:         Control flow issues  (DEADCODE)
    >>>>      Execution cannot reach this statement: "continue;".
    > 282                     continue;
    > 283
    > 284                 /*
    > 285                  * Otherwise this is an invalid entry that should not be
    > 
    > It sees the earlier test for offset == 0, and evidently is assuming
    > that the loop's "offset++" will not wrap around.  Now I think that
    > the point of this check is exactly that "offset++" could have wrapped
    > around, but the commentary is not so clear that I'm certain this is a
    > false positive.
    
    Correct.
    
    > If that is the intention, what do you think of rephrasing this
    > comment as "we have wrapped around to unused slot zero"?
    Ah yes, that's much better. Changed it to "offset must have wrapped 
    around to unused slot zero". This code and its comments are copied from 
    v18 GetMultiXactIdMembers(), so in order to maintain the rhyme with 
    that, I changed the comment in backbranches too. It doesn't exist in the 
    'master' version of GetMultiXactIdMembers() anymore.
    
    Coverity is also complaining about the 'length' variable being tainted, 
    because it's calculated from the values read from disk. That's bogus 
    because we trust and make assumptions of the values on disk. That said, 
    I think it would make sense to do some more sanity checking here. In 
    particular, length should never be negative. I added such sanity checks 
    to the 'master' version of the GetMultiXactIdMembers() server function 
    in commit d4b7bde418, but it would make sense to add them to the upgrade 
    code too. I'll look into that.
    
    - Heikki
    
    
    
    
    
  99. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-12-15T14:31:13Z

    On 15/12/2025 10:00, Alexander Lakhin wrote:
    > Hello Heikki,
    > 
    > 09.12.2025 14:00, Heikki Linnakangas wrote:
    >> Committed with that and some other minor cleanups. Thanks everyone! 
    >> This patch has been brewing for a while :-).
    > 
    > I've spotted a couple of failures of new test 007_multixact_conversion at
    > buildfarm:
    > ...
    > 
    > Thus, it looks like these animals can hit 180 seconds timeout with some
    > external factors (concurrent load?) that make them run 2-3 times slower...
    
    Hmm, so it's hitting the timeout while running the multixid generation 
    workload part of the test. The workload keeps 20 parallel connections 
    open, using them in a round-robin fashion, and apparently the timeout 
    spans the whole duration of each connection rather than the individual 
    queries run on them.
    
    On my laptop with jit_above_cost=0 and jit_optimize_above_cost=1000, 
    like on those buildfarm animals, the multixid generation workload runs 
    in under 10 s. I suppose it could be 20x slower on a slow, busy system. 
    So the straightforward fix is to bump up the timeout.
    
    I'm tempted to force "jit=off" here, as the multixid generation is not 
    really the thing that's being tested here. It's just used to generate 
    multixids in the cluster before the upgrade. Then again, the point of 
    forcing JIT in these buildfarm members is to test JITting as part of 
    everything else that wasn't originally written as a JIT test.
    
    - Heikki
    
    
    
    
    
  100. Re: POC: make mxidoff 64 bits

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-12-15T16:11:05Z

    On 15/12/2025 16:31, Heikki Linnakangas wrote:
    > On my laptop with jit_above_cost=0 and jit_optimize_above_cost=1000, 
    > like on those buildfarm animals, the multixid generation workload runs 
    > in under 10 s. I suppose it could be 20x slower on a slow, busy system. 
    > So the straightforward fix is to bump up the timeout.
    
    I bumped the timeout to 4 * 180 s. I also added a few progress report 
    notes to the test output, so that we get a better picture of where the 
    time goes.
    
    - Heikki
    
    
    
    
    
  101. Re: POC: make mxidoff 64 bits

    zengman <zengman@halodbtech.com> — 2025-12-30T01:49:36Z

    Hi,
    
    I'm currently looking into the `SlruReadSwitchPageSlow` function and have a question regarding the expression `&state->buf.data + bytes_read` — 
    I suspect the ampersand (&) here might be misused. Would you be able to help me verify this?
    
    ```
    	while (bytes_read < BLCKSZ)
    	{
    		ssize_t		rc;
    
    		rc = pg_pread(state->fd,
    					  &state->buf.data + bytes_read,
    					  BLCKSZ - bytes_read,
    					  offset);
    		if (rc < 0)
    		{
    			if (errno == EINTR)
    				continue;
    			pg_fatal("could not read file \"%s\": %m", state->fn);
    		}
    		if (rc == 0)
    		{
    			/* unexpected EOF */
    			pg_log(PG_WARNING, "unexpected EOF reading file \"%s\" at offset %u, reading as zeros",
    				   state->fn, (unsigned int) offset);
    			memset(&state->buf.data + bytes_read, 0, BLCKSZ - bytes_read);
    			break;
    		}
    		bytes_read += rc;
    		offset += rc;
    	}
    ```
    
    ```
    		rc = pg_pread(state->fd,
    					  &state->buf.data + bytes_read,
    					  BLCKSZ - bytes_read,
    					  offset);
    memset(&state->buf.data + bytes_read, 0, BLCKSZ - bytes_read);
    ```
    
    --
    Regards,
    Man Zeng
    www.openhalo.org