Thread
-
Re: missing PG_IO_ALIGN_SIZE uses
Thomas Munro <thomas.munro@gmail.com> — 2025-12-02T05:59:03Z
On Mon, Dec 1, 2025 at 8:56 PM Peter Eisentraut <peter@eisentraut.org> wrote: > Commit faeedbcefd4 changed the alignment of WAL buffers from XLOG_BLCKSZ > to PG_IO_ALIGN_SIZE. > > While looking around for places to apply alignas, I think I found at > least two places that were forgotten, namely in BootStrapXLOG() and in > pg_test_fsync.c. Patches attached for those. /* page buffer must be aligned suitably for O_DIRECT */ - buffer = (char *) palloc(XLOG_BLCKSZ + XLOG_BLCKSZ); - page = (XLogPageHeader) TYPEALIGN(XLOG_BLCKSZ, buffer); + buffer = (char *) palloc(XLOG_BLCKSZ + PG_IO_ALIGN_SIZE); + page = (XLogPageHeader) TYPEALIGN(PG_IO_ALIGN_SIZE, buffer); Could save a line with palloc_aligned()? . o O { If we did that palloc_object() trick I mentioned the other day I suppose we could also palloc_object(PGAlignedXLogBlock)... } > I also suspect that the TYPEALIGN call in XLOGShmemInit() should take > PG_IO_ALIGN_SIZE into account, but it's not immediately obvious how, > since the comment also mentions that it wants alignment on "a full xlog > block size boundary". Maybe Max(XLOG_BLCKSZ, PG_IO_ALIGN_SIZE)? I guess it was worrying about either direct I/O (to work) or VM pages (cf ALIGNOF_BUFFER)? PG_IO_ALIGN_SIZE should be enough I think. > I also wonder whether the check in check_debug_io_direct() for #if > XLOG_BLCKSZ < PG_IO_ALIGN_SIZE would be required if we fixed all those > places? I don't think so, because the undersized blocks in the array can't all satisfy PG_IO_ALIGN_SIZE. Also, not only the base address but also the length of I/Os or iovecs have to be aligned on 4KB pages to benefit from physical I/O direct to/from user space memory, at least in our handwavy portable model of this stuff*. It might work on some systems though, it just didn't seem useful enough to research which ones... That's probably why I didn't change those places above: either XLOG_BLCKSZ is itself already big enough, or it isn't and there's not much we can do about it except block direct I/O from being turned on. PG_IO_ALIGN_SIZE is probably still a better choice though, even if it doesn't really work: if you set the block size very large, it would be wasteful to over-align the memory. *The real alignment requirements are way more complicated and hiding in many layers of kernels and drivers and protocols... Linux says it requires only 512 bytes these days (used to be 4KB IIRC), but first gen NVMe required 4KB alignment except for the first and last entries of its PRP for vectored I/O (huh, what does Linux do about that?), FreeBSD works in terms of 4KB pages at various levels, and Windows even uses 4KB page lists as arguments to its direct I/O-only ReadFileScatter()/WriteFileGather() system calls (that we aren't using yet, but one day). 4KB seems to be enough everywhere and also matches common VM page size, and also helps buffered I/O too by touching fewer pages.