Thread

  1. Failure in test_slru for host gokiburi (REL_16_STABLE only)

    Michael Paquier <michael@paquier.xyz> — 2026-05-18T11:41:45Z

    Hi all,
    
    gokiburi has been failing only on REL_16_STABLE for the last few days,
    in the tests of module test_slru.  First failure:
    https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gokiburi&dt=2026-05-13%2012%3A20%3A45
    
    Set of changes associated with the first failure, which seem
    completely innocent to me:
    5f12d86dd76 Wed May 13 05:43:49 2026 UTC  Add more tests for corrupted data with pglz_decompress()
    d140237dab8 Wed May 13 02:46:17 2026 UTC  Fix stale COPY progress during logical replication table sync
    
    While the buildfarm runs don't show much, I have been able to
    reproduce the failure on the buildfarm host when building with
    -DEXEC_BACKEND.  Here is a backtrace, suggesting that something is
    broken with LWLock initialization:
    2026-05-18 05:20:50.186 UTC client backend[870830] pg_regress/test_slru STATEMENT:  SELECT test_slru_page_readonly(12377);
    TRAP: failed Assert("LWLockHeldByMe(TestSLRULock)"), File: "test_slru.c", Line: 124, PID: 870830
    postgres: popo contrib_regression [local] SELECT(ExceptionalCondition+0x16c) [0xaaaaabcf4d88]
    /home/popo/lib/test_slru.so(test_slru_page_readonly+0xe4) [0xffffedf83060]
    postgres: popo contrib_regression [local] SELECT(+0x885c40) [0xaaaaab325c40] 
    postgres: popo contrib_regression [local] SELECT(ExecInterpExprStillValid+0x84) [0xaaaaab329a4c] 
    postgres: popo contrib_regression [local] SELECT(+0x9405fc) [0xaaaaab3e05fc] 
    postgres: popo contrib_regression [local] SELECT(+0x9406d4) [0xaaaaab3e06d4] 
    postgres: popo contrib_regression [local] SELECT(+0x940b34) [0xaaaaab3e0b34] 
    postgres: popo contrib_regression [local] SELECT(+0x8b7ac0) [0xaaaaab357ac0] 
    postgres: popo contrib_regression [local] SELECT(+0x89de14) [0xaaaaab33de14] 
    postgres: popo contrib_regression [local] SELECT(+0x8a46c0) [0xaaaaab3446c0] 
    postgres: popo contrib_regression [local] SELECT(standard_ExecutorRun+0x2d0) [0xaaaaab33ec68] 
    postgres: popo contrib_regression [local] SELECT(ExecutorRun+0xb8) [0xaaaaab33e970] 
    postgres: popo contrib_regression [local] SELECT(+0xe550dc) [0xaaaaab8f50dc] 
    postgres: popo contrib_regression [local] SELECT(PortalRun+0x460) [0xaaaaab8f4958] 
    postgres: popo contrib_regression [local] SELECT(+0xe43150) [0xaaaaab8e3150] 
    postgres: popo contrib_regression [local] SELECT(PostgresMain+0x15e8) [0xaaaaab8f0560] 
    postgres: popo contrib_regression [local] SELECT(postmaster_forkexec+0x0) [0xaaaaab70f644] 
    postgres: popo contrib_regression [local] SELECT(SubPostmasterMain+0x6fc) [0xaaaaab7106d8] 
    postgres: popo contrib_regression [local] SELECT(main+0x6d0) [0xaaaaab463f6c]
    /lib/aarch64-linux-gnu/libc.so.6(+0x2225c) [0xfffff725225c]
    /lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x9c) [0xfffff725233c]
    postgres: popo contrib_regression [local] SELECT(_start+0x30) [0xaaaaaad3d4b0]
    
    The server logs include the following, pointing to a broken state
    (these two should not fail):
    2026-05-18 05:20:50.184 UTC client backend[870830] pg_regress/test_slru ERROR:  lock <unassigned:0> is not held
    2026-05-18 05:20:50.184 UTC client backend[870830] pg_regress/test_slru STATEMENT:  SELECT test_slru_page_write(12345, 'Test SLRU');
    
    Note that the tests pass without -DEXEC_BACKEND.
    
    While reading through the module, I think that the LWLock
    initialization logic is borked: we end up calling
    LWLockInitialize() more times than necessary, which confuses the
    lock's internal state.  Honestly, I have no clue why the test has
    suddenly started failing, or why other buildfarm members don't
    complain.  The host was upgraded a couple of days ago to the latest
    Debian, but I also had a few clean runs in the buildfarm before this
    began showing up.  What I do know is that the attached patch makes
    the tests of the module pass for v16 on the problematic host with
    -DEXEC_BACKEND.
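
    For illustration only, here is a self-contained sketch of the generic
    "initialize once" pattern that the suspicion above points at.  It is
    not the attached patch and does not use PostgreSQL headers: MockLock,
    mock_shmem_attach(), mock_lock_initialize() and the other names are
    hypothetical stand-ins, and the "found" flag only mimics the general
    shape of attaching to an already-existing shared area.  The assumption
    is that the startup routine runs once in the first process and then
    again in every backend, as it would in an EXEC_BACKEND-like setup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mock of a lock living in shared memory (not an LWLock). */
typedef struct MockLock
{
    int tranche_id;   /* identifier assigned at initialization */
    int init_count;   /* number of times the lock was initialized */
} MockLock;

static MockLock shared_area;              /* stands in for the shared segment */
static bool     shared_area_exists = false;

/* Forget the mock shared segment, as if the server were restarted. */
static void
mock_reset(void)
{
    shared_area_exists = false;
}

/* Return the shared area and report whether it had already been set up. */
static MockLock *
mock_shmem_attach(bool *found)
{
    *found = shared_area_exists;
    if (!shared_area_exists)
    {
        memset(&shared_area, 0, sizeof(shared_area));
        shared_area_exists = true;
    }
    return &shared_area;
}

/* Mock initialization: records each call so repeated calls are visible. */
static void
mock_lock_initialize(MockLock *lock, int tranche_id)
{
    lock->tranche_id = tranche_id;
    lock->init_count++;
}

/*
 * Startup routine executed by every process.  With guard_on_found set,
 * only the process that created the area initializes the lock.
 */
static MockLock *
mock_startup(bool guard_on_found)
{
    bool      found;
    MockLock *lock = mock_shmem_attach(&found);

    assert(lock != NULL);
    if (!guard_on_found || !found)
        mock_lock_initialize(lock, 1);
    return lock;
}

/* Run the startup routine in one creating process plus nbackends others. */
static int
count_initializations(int nbackends, bool guard_on_found)
{
    MockLock *lock;
    int       i;

    mock_reset();
    lock = mock_startup(guard_on_found);
    for (i = 0; i < nbackends; i++)
        lock = mock_startup(guard_on_found);
    return lock->init_count;
}
```

    Calling count_initializations(3, true) from a small main() gives a
    single initialization, while count_initializations(3, false) gives
    one per process.  Whether the real module misbehaves in exactly this
    way is an assumption of the sketch, not something the logs above
    prove.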
    
    Comments or opinions?
    --
    Michael