Thread

  1. Re: missing PG_IO_ALIGN_SIZE uses

    Thomas Munro <thomas.munro@gmail.com> — 2025-12-02T05:59:03Z

    On Mon, Dec 1, 2025 at 8:56 PM Peter Eisentraut <peter@eisentraut.org> wrote:
    > Commit faeedbcefd4 changed the alignment of WAL buffers from XLOG_BLCKSZ
    > to PG_IO_ALIGN_SIZE.
    >
    > While looking around for places to apply alignas, I think I found at
    > least two places that were forgotten, namely in BootStrapXLOG() and in
    > pg_test_fsync.c.  Patches attached for those.
    
         /* page buffer must be aligned suitably for O_DIRECT */
    -    buffer = (char *) palloc(XLOG_BLCKSZ + XLOG_BLCKSZ);
    -    page = (XLogPageHeader) TYPEALIGN(XLOG_BLCKSZ, buffer);
    +    buffer = (char *) palloc(XLOG_BLCKSZ + PG_IO_ALIGN_SIZE);
    +    page = (XLogPageHeader) TYPEALIGN(PG_IO_ALIGN_SIZE, buffer);
    
    Could save a line with palloc_aligned()?
    
    . o O { If we did that palloc_object() trick I mentioned the other day
    I suppose we could also palloc_object(PGAlignedXLogBlock)... }
    
    > I also suspect that the TYPEALIGN call in XLOGShmemInit() should take
    > PG_IO_ALIGN_SIZE into account, but it's not immediately obvious how,
    > since the comment also mentions that it wants alignment on "a full xlog
    > block size boundary".  Maybe Max(XLOG_BLCKSZ, PG_IO_ALIGN_SIZE)?
    
    I guess it was worrying about either direct I/O (to work) or VM pages
    (cf ALIGNOF_BUFFER)?  PG_IO_ALIGN_SIZE should be enough I think.
    
    > I also wonder whether the check in check_debug_io_direct() for #if
    > XLOG_BLCKSZ < PG_IO_ALIGN_SIZE would be required if we fixed all those
    > places?
    
    I don't think so, because the undersized blocks in the array can't all
    satisfy PG_IO_ALIGN_SIZE.  Also, not only the base address but also
    the length of I/Os or iovecs have to be aligned on 4KB pages to
    benefit from physical I/O direct to/from user space memory, at least
    in our handwavy portable model of this stuff*.  It might work on some
    systems though, it just didn't seem useful enough to research which
    ones...
    
    That's probably why I didn't change those places above: either
    XLOG_BLCKSZ is itself already big enough, or it isn't and there's not
    much we can do about it except block direct I/O from being turned on.
    PG_IO_ALIGN_SIZE is probably still a better choice though, even if it
    doesn't really work: if you set the block size very large, it would be
    wasteful to over-align the memory.
    
    *The real alignment requirements are way more complicated and hiding
    in many layers of kernels and drivers and protocols... Linux says it
    requires only 512 bytes these days (used to be 4KB IIRC), but first
    gen NVMe required 4KB alignment except for the first and last entries
    of its PRP for vectored I/O (huh, what does Linux do about that?),
    FreeBSD works in terms of 4KB pages at various levels, and Windows
    even uses 4KB page lists as arguments to its direct I/O-only
    ReadFileScatter()/WriteFileGather() system calls (that we aren't using
    yet, but one day).  4KB seems to be enough everywhere and also matches
    common VM page size, and also helps buffered I/O too by touching fewer
    pages.