Thread

  1. Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

    Kirill Reshke <reshkekirill@gmail.com> — 2025-12-18T08:55:46Z

    On Thu, 18 Dec 2025 at 05:30, Melanie Plageman
    <melanieplageman@gmail.com> wrote:
    >
    > > in v27-0005. This patch changes code which is not exercised in
    > > tests[0]. I spent some time understanding the conditions when we
    > > entered this. There is a comment about non-finished relation
    > > extension, but I got no success trying to reproduce this. I ended up
    > > modifying code to lose PageSetAllVisible in proper places and running
    > > vacuum. Looks like everything works as expected. I will spend some
    > > more time on this, maybe I will be successful in writing an
    > > injection-point-based TAP test which hits this...
    >
    > Based on the coverage report link you provided, that code is changed
    > by v27 0007, not 0005. 0005 is about moving an assertion out of
    > lazy_scan_prune(). 0007 changes lazy_scan_new_or_empty() (the code in
    > question).
    >
    > Regarding 0007, it looks like what is uncovered (the orange bits in
    > the coverage report are uncovered, I assume) is empty pages _without_
    > PD_ALL_VISIBLE set. I don't see anywhere where PageSetAllVisible() is
    > called except vacuum and COPY FREEZE.
    
    Sure, I meant 0007.
    
    > If I was trying to guess how empty pages with PD_ALL_VISIBLE set are
    > getting vacuumed, I would think it is due to SKIP_PAGES_THRESHOLD
    > causing us to vacuum an all-frozen empty page.
    
    Yes, vacuum (disable_page_skipping);
    
    > Then the question is, why wouldn't we have coverage of the empty page
    > first being set all-visible/all-frozen? It can't be COPY FREEZE
    > because the page is empty. And it can't be vacuum, because then we
    > would have coverage. It's very mysterious.
    >
    > It would be good to have coverage for this case. I don't think you'll
    > need an injection point for the main case of "empty page not yet set
    > all-visible is vacuumed for the first time" (unless I'm
    > misunderstanding something).
    >
    > I'm not sure how you'll test the "vacuuming an empty, previously
    > uninitialized page" case described in this comment, though.
    >
    >              * It's possible that another backend has extended the heap,
    >              * initialized the page, and then failed to WAL-log the page due
    >              * to an ERROR.  Since heap extension is not WAL-logged, recovery
    >              * might try to replay our record setting the page all-visible and
    >              * find that the page isn't initialized, which will cause a PANIC.
    >              * To prevent that, check whether the page has been previously
    >              * WAL-logged, and if not, do that now.
    >
    > You'd want to force an error during relation extension and then vacuum
    > the page. I don't know if you need an injection point to force the
    > error -- depends on what kind of error, I think.
    
    I did small archeology and this "if (PageIsEmpty(page)) {   if
    (!PageIsAllVisible(page)) { .... }}" code  originates back to
    608195a3a365. Comment about not WAL-logged relation extension is from
    a6370fd9ed3d, and I don't think we need to think about this case.
    
    I am currently inclined to think that we cannot see an empty page that
    has PD_ALL_VISIBLE not-set. This is because when we make a page empty,
    we are in a critical section, and we WAL-log everything we do, so our
    changes should not be half-made. Maybe as of 608195a3a365, there was a
    case with empry-page-without-PD_ALL_VISIBLE, but I dont think this
    happens on HEAD.
    
    > So that I know for attribution, did you review 0003-0005?
    
    yes, but I did not have any valuable review points for them.
    
    
    Also, after the whole set is committed, we should then never
    experience discrepancy between  PD_ALL_VISIBLE and VM bits? Because
    they will be set in a single WAL record. The only cases when heap and
    VM disagrees on all-visibility then are corruption,
    pg_visibilitymap_truncate and old data (data before v19+ upgrade?)
    If my understanding is correct, should we add document this?
    
    -- 
    Best regards,
    Kirill Reshke