Re: Instability of phycodorus in pg_upgrade tests with JIT
From: Alexander Lakhin <exclusion@gmail.com>
To: Andres Freund <andres@anarazel.de>
Cc: Tom Lane <tgl@sss.pgh.pa.us>, Michael Paquier <michael@paquier.xyz>,
Fujii Masao <masao.fujii@gmail.com>,
Postgres hackers <pgsql-hackers@lists.postgresql.org>
Date: 2025-10-22T21:00:01Z
Lists: pgsql-hackers
Hello Andres,

17.10.2025 08:21, Fujii Masao wrote:
> On Fri, Oct 17, 2025 at 8:32 AM Michael Paquier <michael@paquier.xyz> wrote:
>> On Thu, Oct 16, 2025 at 10:00:00PM +0300, Alexander Lakhin wrote:
>>> I collected all of such failures here:
>>> https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#check-pg_upgrade_fails_on_LLVM-enabled_animals_due_to_double_free_or_corruption
>>>
>>> Masao-san was going to dig into that:
>>> https://www.postgresql.org/message-id/CAHGQGwFcjccSYX+Ap8meEbCccUei-B4tmYsBFu4wMEixKi90fQ@mail.gmail.com
> I tried that briefly, but unfortunately I still have no idea what caused
> this failure or what triggered the double-free issue shown below…

I've been trying to reproduce the issue locally for several days, with
clang 3.9.0 and 4.0.1 compiled from sources with
-DCMAKE_BUILD_TYPE=Debug -DLLVM_ENABLE_ASSERTIONS=ON, running the buildfarm
client (TestUpgrade) on four different x86_64 systems (Debian, Ubuntu, but
not the latest versions), without a single failure so far.
(I've re-created the config from petalura/phycodurus: 'jit=1',
'jit_above_cost=0', 'jit_optimize_above_cost=1000'... I also tried
jit_optimize_above_cost=0...)

I tried to provoke a double free with a simple program and confirmed that
the double free is detected and the program is aborted.

So if I re-created all the conditions (based on the buildfarm logs)
correctly, then the several hundred runs I performed should have been
enough to reproduce the issue, but probably there is something specific to
those animals (petalura, phycodurus, desmoxytes, dragonet)... Maybe a buggy
libc update was installed there in September?

Meanwhile, we've got a failure at stage Check (not pg_upgradeCheck), with a
release LLVM build [1]:
2025-10-21 17:15:16.261 CEST [1489783][client backend][:0] LOG: disconnection: session time: 0:00:03.177 user=bf database=regression host=[local]
corrupted size vs. prev_size while consolidating

Thus, the initial suspicion that the issue is caused by dff7591a7 (because
the first failure [2] happened right after it) seems wrong now.

Maybe you have some insight into the possible cause of these memory errors?

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2025-10-21%2015%3A14%3A12
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2025-09-16%2011%3A09%3A07

Best regards,
Alexander