Thread

  1. Re: Skip prefetch for block references that follow a FPW or WILL_INIT of the same block

    SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> — 2026-05-07T07:45:54Z

    Hi,
    
    On Tue, Mar 24, 2026 at 9:18 AM SATYANARAYANA NARLAPURAM <
    satyanarlapuram@gmail.com> wrote:
    
    > Hi Hackers,
    >
    > While reviewing the patch in the thread [1], I noticed the following:
    >
    > When the WAL prefetcher encounters a block reference that carries a full
    > page image (FPW) or has BKPBLOCK_WILL_INIT set, it correctly skips issuing
    > a prefetch for that block, because the old on-disk content is irrelevant:
    > replay will overwrite or zero the page entirely. However, if a later
    > WAL record within the look-ahead window references the same block without
    > an FPW, the prefetcher would still issue a fadvise64 syscall for it,
    > because the block was never recorded in the duplicate-detection window.
    >
    > Fixed this by marking these blocks as recently seen in the FPW and
    > WILL_INIT skip paths. The existing duplicate-check loop then naturally
    > suppresses prefetch attempts for subsequent references to the same block,
    > counting them under the skip_rep stat. This is particularly effective for
    > workloads that produce many sequential writes to the same page (e.g., bulk
    > inserts into heap-only tables), where each page's first post-checkpoint
    > touch generates an FPW and subsequent inserts to the same page follow
    > shortly after in WAL.
    >
    > To further reduce wasted prefetch calls, we could increase the window
    > size by changing XLOGPREFETCHER_SEQ_WINDOW_SIZE according to the maximum
    > number of blocks that can be prefetched, or maintain a hash table. I did
    > not attempt this in this patch because it could hurt redo performance
    > (more CPU cycles). In the worst case, the current fix may not help in
    > scenarios where the table has more than four indexes, for example.
    > However, I still believe it is an improvement over the baseline. If we
    > decide to spend more cycles on optimizing the window size, that can be
    > done in a separate patch.
    >
    > Benchmarked recovery with 10 GB of WAL from an insert-only workload into
    > a no-index table, replayed from an identical crash snapshot:
    >
    > Fast disk (NVMe)
    > Baseline: redo 37.30s, system CPU 9.38s, 1,204,992 fadvise calls
    > Patched: redo 25.78s, system CPU 3.39s, 122,753 fadvise calls
    >
    > That is about 31% less redo time and 90% fewer fadvise syscalls.
    >
    > *Prefetch Counters*
    > Counter                    Baseline     Patched      Delta
    > prefetch (fadvise issued)  1,204,992    122,753      −89.8%
    > hit                        924,457      911,785      −1.4%
    > skip_init                  1,097,536    1,097,536    0
    > skip_fpw                   28           28           0
    > skip_rep                   80,020,209   81,115,120   +1,094,911
    >
    > Slower disk (with ~2ms latency)
    > Baseline: redo 188.04s, system CPU 6.87s, 1,204,992 fadvise calls
    > Patched: redo 60.02s, system CPU 3.39s, 122,753 fadvise calls
    >
    > That is about 68% less redo time, a 3.1× overall speedup.
    >
    >
    > *Configuration:*
    >
    > shared_buffers = '124GB'
    > huge_pages = on
    > wal_buffers = '512MB'
    > max_wal_size = '100GB'
    > checkpoint_timeout = '30min'
    > full_page_writes = on
    > maintenance_io_concurrency = 50
    > recovery_prefetch = on
    >
    > *Workload:*
    > CREATE TABLE test_noindex(id bigint, val1 int, val2 int, payload text);
    > -- No indexes, no primary key.
    >
    >
    > -- Then insert in batches of 1M rows until WAL reaches 10 GB:
    > INSERT INTO test_noindex
    > SELECT g, (g*7+13)%100000, (g*31+17)%100000, repeat(chr(65+(g%26)),60)
    > FROM generate_series(1, 1000000) g;
    >
    >
    > Thanks,
    > Satya
    >
    > [1]
    > https://www.postgresql.org/message-id/flat/CA%2B3i_M8C%2BrK9vhwBm8U%2Bys2hbDifoBb4Xnws5Wmn2f4u7iqOpA%40mail.gmail.com#8eac90e696baf6e4f58f91482af28e07
    >
    
    Rebased the patch.
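
For readers skimming the thread, here is a minimal, self-contained sketch of the idea described in the quoted message. It is not the patch and not the code in xlogprefetcher.c: the names Prefetcher, BlockRef, handle_block_ref, WINDOW_SIZE and remember_overwritten are hypothetical, and the window size of 4 is an assumption chosen for illustration. The only point it tries to show is that recording a block in the recent-blocks window on the FPW/WILL_INIT skip paths lets the existing duplicate check suppress the next plain reference to that block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define WINDOW_SIZE 4            /* assumed size, for illustration only */

typedef struct BlockRef
{
    uint32_t rel;                /* hypothetical relation identifier */
    uint32_t block;              /* block number within the relation */
} BlockRef;

typedef enum { ACT_PREFETCH, ACT_SKIP_FPW, ACT_SKIP_INIT, ACT_SKIP_REP } Action;

typedef struct Prefetcher
{
    BlockRef recent[WINDOW_SIZE];   /* small ring buffer of recent blocks */
    bool     used[WINDOW_SIZE];
    int      next;                  /* next ring slot to overwrite */
    bool     remember_overwritten;  /* true models the patched behavior */
    long     prefetch, skip_fpw, skip_init, skip_rep;
} Prefetcher;

static void prefetcher_init(Prefetcher *p, bool remember_overwritten)
{
    memset(p, 0, sizeof(*p));
    p->remember_overwritten = remember_overwritten;
}

static bool recently_seen(const Prefetcher *p, BlockRef ref)
{
    for (int i = 0; i < WINDOW_SIZE; i++)
        if (p->used[i] && p->recent[i].rel == ref.rel &&
            p->recent[i].block == ref.block)
            return true;
    return false;
}

static void remember(Prefetcher *p, BlockRef ref)
{
    p->recent[p->next] = ref;
    p->used[p->next] = true;
    p->next = (p->next + 1) % WINDOW_SIZE;
}

/* Decide what to do with one block reference from a WAL record. */
static Action handle_block_ref(Prefetcher *p, BlockRef ref,
                               bool has_fpw, bool will_init)
{
    assert(p != NULL);

    if (has_fpw || will_init)
    {
        /* Replay overwrites the page, so no prefetch is needed. The
         * patched variant also records the block in the window. */
        if (p->remember_overwritten && !recently_seen(p, ref))
            remember(p, ref);
        if (has_fpw)
        {
            p->skip_fpw++;
            return ACT_SKIP_FPW;
        }
        p->skip_init++;
        return ACT_SKIP_INIT;
    }

    if (recently_seen(p, ref))
    {
        p->skip_rep++;           /* duplicate within the window */
        return ACT_SKIP_REP;
    }

    remember(p, ref);
    p->prefetch++;               /* a real prefetcher would issue I/O here */
    return ACT_PREFETCH;
}
```

In this model, with remember_overwritten set to false, a WILL_INIT reference followed by a plain reference to the same block yields ACT_SKIP_INIT and then ACT_PREFETCH. With it set to true, the second reference yields ACT_SKIP_REP, as long as fewer than WINDOW_SIZE other blocks were recorded in between.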