Thread

Re: Skip prefetch for block references that follow a FPW or WILL_INIT of the same block

SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> — 2026-05-07T07:45:54Z
Hi,

On Tue, Mar 24, 2026 at 9:18 AM SATYANARAYANA NARLAPURAM <
satyanarlapuram@gmail.com> wrote:

> Hi Hackers,
>
> While reviewing the patch in the thread [1], I noticed the following:
>
> When the WAL prefetcher encounters a block reference that carries a full
> page image (FPW) or has BKPBLOCK_WILL_INIT set, it correctly skips issuing
> a prefetch for that block: the old on-disk content is irrelevant because
> replay will overwrite or zero the page entirely. However, if a later WAL
> record within the look-ahead window references the same block without an
> FPW, the prefetcher still issues a fadvise64 syscall for it, because the
> block was never recorded in the duplicate-detection window.
>
> Fixed this by marking these blocks as recently seen in the FPW and
> WILL_INIT skip paths. The existing duplicate-check loop then naturally
> suppresses prefetch attempts for subsequent references to the same block,
> counting them under the skip_rep stat. This is particularly effective for
> workloads that produce many sequential writes to the same page (e.g., bulk
> inserts into heap-only tables), where each page's first post-checkpoint
> touch generates an FPW and subsequent inserts to the same page follow
> shortly after in WAL.
>
> To further reduce wasted prefetch calls, we could increase the window
> size by changing XLOGPREFETCHER_SEQ_WINDOW_SIZE according to the maximum
> number of blocks that can be prefetched, or maintain a hash table. I did
> not attempt this in this patch because it could hurt redo performance
> (more CPU cycles). In the worst case, the current fix may not help in
> scenarios where, for example, the table has more than four indexes.
> However, I still believe it is an improvement over the baseline. If we
> decide to spend more cycles on optimizing the window size, that can be
> done in a separate patch.
>
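
To make the idea above concrete, here is a minimal standalone sketch. It is
not the patch itself: BlockRef, Window, classify, count_prefetches and
SEQ_WINDOW_SIZE are hypothetical stand-ins for the real prefetcher state, the
relation is reduced to a single integer, and the sketch assumes the
FPW/WILL_INIT checks run before the duplicate check. The remember_skipped
flag switches between the baseline and the proposed behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for XLOGPREFETCHER_SEQ_WINDOW_SIZE. */
#define SEQ_WINDOW_SIZE 4

typedef struct BlockRef
{
	uint32_t	rel;		/* simplified relation identifier (assumption) */
	uint32_t	blkno;		/* block number within the relation */
	bool		has_fpw;	/* reference carries a full page image */
	bool		will_init;	/* reference has BKPBLOCK_WILL_INIT set */
} BlockRef;

typedef enum Action
{
	ACT_PREFETCH, ACT_SKIP_FPW, ACT_SKIP_INIT, ACT_SKIP_REP
} Action;

/* Small ring buffer of recently seen blocks. */
typedef struct Window
{
	uint32_t	rel[SEQ_WINDOW_SIZE];
	uint32_t	blkno[SEQ_WINDOW_SIZE];
	bool		valid[SEQ_WINDOW_SIZE];
	int			next;				/* slot to overwrite next */
	bool		remember_skipped;	/* false = baseline, true = proposed */
} Window;

static bool
window_contains(const Window *w, uint32_t rel, uint32_t blkno)
{
	for (int i = 0; i < SEQ_WINDOW_SIZE; i++)
		if (w->valid[i] && w->rel[i] == rel && w->blkno[i] == blkno)
			return true;
	return false;
}

static void
window_remember(Window *w, uint32_t rel, uint32_t blkno)
{
	if (window_contains(w, rel, blkno))
		return;					/* already tracked, keep its slot */
	w->rel[w->next] = rel;
	w->blkno[w->next] = blkno;
	w->valid[w->next] = true;
	w->next = (w->next + 1) % SEQ_WINDOW_SIZE;
}

/* Decide what to do with one block reference. */
Action
classify(Window *w, const BlockRef *b)
{
	assert(w != NULL && b != NULL);

	if (b->has_fpw || b->will_init)
	{
		/* Proposed change: record the block even though it is skipped. */
		if (w->remember_skipped)
			window_remember(w, b->rel, b->blkno);
		return b->has_fpw ? ACT_SKIP_FPW : ACT_SKIP_INIT;
	}

	if (window_contains(w, b->rel, b->blkno))
		return ACT_SKIP_REP;

	window_remember(w, b->rel, b->blkno);
	return ACT_PREFETCH;
}

/* Count how many references in refs[0..n-1] would trigger a prefetch. */
int
count_prefetches(const BlockRef *refs, int n, bool remember_skipped)
{
	Window		w;
	int			prefetches = 0;

	memset(&w, 0, sizeof(w));
	w.remember_skipped = remember_skipped;

	for (int i = 0; i < n; i++)
		if (classify(&w, &refs[i]) == ACT_PREFETCH)
			prefetches++;
	return prefetches;
}
```

In this sketch, a WILL_INIT reference followed by two plain references to the
same block yields one prefetch in baseline mode and none with
remember_skipped set. If four other distinct blocks are referenced in
between, the entry is evicted from the four-slot window and the later
reference is prefetched again, which is the limitation described above.
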
> Benchmarked recovery with 10 GB of WAL from an insert-only workload into a
> no-index table, replayed from an identical crash snapshot:
>
> Fast disk (NVMe)
> Baseline: redo 37.30s, system CPU 9.38s, 1,204,992 fadvise calls
> Patched: redo 25.78s, system CPU 3.39s, 122,753 fadvise calls
>
> This is roughly a 31% reduction in redo time (about 1.45× faster), with
> 90% fewer fadvise syscalls.
>
> *Prefetch Counters*
> Counter                    Baseline     Patched      Delta
> prefetch (fadvise issued)  1,204,992    122,753      −89.8%
> hit                        924,457      911,785      −1.4%
> skip_init                  1,097,536    1,097,536    0
> skip_fpw                   28           28           0
> skip_rep                   80,020,209   81,115,120   +1,094,911
>
> Slower disk (with ~2ms latency)
> Baseline: redo 188.04s, system CPU 6.87s, 1,204,992 fadvise calls
> Patched: redo 60.02s, system CPU 3.39s, 122,753 fadvise calls
>
> This is roughly a 68% reduction in redo time, a 3.1× overall speedup.
>
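
The percentages quoted above can be re-derived from the raw timings and
counters. A small sketch, using hypothetical helper names (pct_reduction,
speedup) that are not part of any PostgreSQL code:

```c
#include <assert.h>

/* Percentage by which `patched` is lower than `baseline`. */
double
pct_reduction(double baseline, double patched)
{
	assert(baseline > 0.0);
	return 100.0 * (baseline - patched) / baseline;
}

/* How many times faster the patched run is than the baseline run. */
double
speedup(double baseline, double patched)
{
	assert(patched > 0.0);
	return baseline / patched;
}
```

With the numbers above, pct_reduction(37.30, 25.78) is about 30.9,
pct_reduction(188.04, 60.02) is about 68.1, speedup(188.04, 60.02) is about
3.13, and pct_reduction(1204992, 122753) is about 89.8. The "faster" figures
are therefore reductions in redo time.
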
>
> *Configuration:*
>
> shared_buffers = '124GB'
> huge_pages = on
> wal_buffers = '512MB'
> max_wal_size = '100GB'
> checkpoint_timeout = '30min'
> full_page_writes = on
> maintenance_io_concurrency = 50
> recovery_prefetch = on
>
> *Workload:*
> CREATE TABLE test_noindex(id bigint, val1 int, val2 int, payload text);
> -- No indexes, no primary key.
>
>
> -- Then insert in batches of 1M rows until WAL reaches 10 GB:
> INSERT INTO test_noindex
> SELECT g, (g*7+13)%100000, (g*31+17)%100000, repeat(chr(65+(g%26)),60)
> FROM generate_series(1, 1000000) g;
>
>
> Thanks,
> Satya
>
> [1]
> https://www.postgresql.org/message-id/flat/CA%2B3i_M8C%2BrK9vhwBm8U%2Bys2hbDifoBb4Xnws5Wmn2f4u7iqOpA%40mail.gmail.com#8eac90e696baf6e4f58f91482af28e07
>

Rebased the patch.