Thread
-
Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)
Kirill Reshke <reshkekirill@gmail.com> — 2025-12-18T08:55:46Z
On Thu, 18 Dec 2025 at 05:30, Melanie Plageman <melanieplageman@gmail.com> wrote: > > > in v27-0005. This patch changes code which is not exercised in > > tests[0]. I spent some time understanding the conditions when we > > entered this. There is a comment about non-finished relation > > extension, but I got no success trying to reproduce this. I ended up > > modifying code to lose PageSetAllVisible in proper places and running > > vacuum. Looks like everything works as expected. I will spend some > > more time on this, maybe I will be successful in writing an > > injection-point-based TAP test which hits this... > > Based on the coverage report link you provided, that code is changed > by v27 0007, not 0005. 0005 is about moving an assertion out of > lazy_scan_prune(). 0007 changes lazy_scan_new_or_empty() (the code in > question). > > Regarding 0007, it looks like what is uncovered (the orange bits in > the coverage report are uncovered, I assume) is empty pages _without_ > PD_ALL_VISIBLE set. I don't see anywhere where PageSetAllVisible() is > called except vacuum and COPY FREEZE. Sure, I meant 0007. > If I was trying to guess how empty pages with PD_ALL_VISIBLE set are > getting vacuumed, I would think it is due to SKIP_PAGES_THRESHOLD > causing us to vacuum an all-frozen empty page. Yes, vacuum (disable_page_skipping); > Then the question is, why wouldn't we have coverage of the empty page > first being set all-visible/all-frozen? It can't be COPY FREEZE > because the page is empty. And it can't be vacuum, because then we > would have coverage. It's very mysterious. > > It would be good to have coverage for this case. I don't think you'll > need an injection point for the main case of "empty page not yet set > all-visible is vacuumed for the first time" (unless I'm > misunderstanding something). > > I'm not sure how you'll test the "vacuuming an empty, previously > uninitialized page" case described in this comment, though. > > * It's possible that another backend has extended the heap, > * initialized the page, and then failed to WAL-log the page due > * to an ERROR. Since heap extension is not WAL-logged, recovery > * might try to replay our record setting the page all-visible and > * find that the page isn't initialized, which will cause a PANIC. > * To prevent that, check whether the page has been previously > * WAL-logged, and if not, do that now. > > You'd want to force an error during relation extension and then vacuum > the page. I don't know if you need an injection point to force the > error -- depends on what kind of error, I think. I did small archeology and this "if (PageIsEmpty(page)) { if (!PageIsAllVisible(page)) { .... }}" code originates back to 608195a3a365. Comment about not WAL-logged relation extension is from a6370fd9ed3d, and I don't think we need to think about this case. I am currently inclined to think that we cannot see an empty page that has PD_ALL_VISIBLE not-set. This is because when we make a page empty, we are in a critical section, and we WAL-log everything we do, so our changes should not be half-made. Maybe as of 608195a3a365, there was a case with empry-page-without-PD_ALL_VISIBLE, but I dont think this happens on HEAD. > So that I know for attribution, did you review 0003-0005? yes, but I did not have any valuable review points for them. Also, after the whole set is committed, we should then never experience discrepancy between PD_ALL_VISIBLE and VM bits? Because they will be set in a single WAL record. The only cases when heap and VM disagrees on all-visibility then are corruption, pg_visibilitymap_truncate and old data (data before v19+ upgrade?) If my understanding is correct, should we add document this? -- Best regards, Kirill Reshke