Thread

  1. Re: Instability of phycodorus in pg_upgrade tests with JIT

    Alexander Lakhin <exclusion@gmail.com> — 2025-10-22T21:00:01Z

    Hello Andres,
    
    17.10.2025 08:21, Fujii Masao wrote:
    > On Fri, Oct 17, 2025 at 8:32 AM Michael Paquier<michael@paquier.xyz> wrote:
    >> On Thu, Oct 16, 2025 at 10:00:00PM +0300, Alexander Lakhin wrote:
    >>> I collected all of such failures here:
    >>> https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#check-pg_upgrade_fails_on_LLVM-enabled_animals_due_to_double_free_or_corruption
    >>>
    >>> Masao-san was going to dig into that:
    >>> https://www.postgresql.org/message-id/CAHGQGwFcjccSYX+Ap8meEbCccUei-B4tmYsBFu4wMEixKi90fQ@mail.gmail.com
    > I tried that briefly, but unfortunately I still have no idea what caused
    > this failure or what triggered the double-free issue shown below…
    
    I've been trying to reproduce the issue locally for several days, with
    clang 3.9.0 and 4.0.1 compiled from source with -DCMAKE_BUILD_TYPE=Debug
    -DLLVM_ENABLE_ASSERTIONS=ON, running the buildfarm client (TestUpgrade) on
    four different x86_64 systems (Debian and Ubuntu, though not the latest
    versions), without a single failure so far.
    
    (I've re-created the config from petalura/phycodurus: 'jit=1',
    'jit_above_cost=0', 'jit_optimize_above_cost=1000'... I also tried
    jit_optimize_above_cost=0...)
    
    I also triggered a double free deliberately with a simple program and
    confirmed that it is detected and that the program is aborted.
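    
    (For illustration only: a minimal sketch of how such a check could look.
    This is not the program actually used above. The helper name
    double_free_child_signal and the 4096-byte size are made up for the
    example. Freeing a chunk twice is undefined behavior, so the outcome
    depends on the allocator; the sketch assumes glibc, where the second
    free() is expected to abort the child with SIGABRT.)

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): fork a child that frees the
 * same chunk twice, and return the signal that killed the child, or 0 if
 * the child was not killed by a signal.  Returns -1 if fork() or waitpid()
 * fails.
 */
int
double_free_child_signal(void)
{
    pid_t pid = fork();
    int   status;

    if (pid < 0)
        return -1;

    if (pid == 0)
    {
        /* volatile keeps the compiler from optimizing the calls away */
        void *volatile chunk = malloc(4096);

        free(chunk);
        free(chunk);    /* deliberate double free: undefined behavior */
        _exit(0);       /* reached only if the allocator did not notice */
    }

    if (waitpid(pid, &status, 0) != pid)
        return -1;

    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

    Running the double free in a child process lets the parent inspect how
    the child died instead of being killed itself. A return value of SIGABRT
    means the allocator caught the second free(); 0 would mean it went
    unnoticed.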
    
    So if I re-created all the conditions (based on the buildfarm logs)
    correctly, the several hundred runs I performed should have been enough
    to reproduce the issue, so perhaps there is something specific to those
    animals (petalura, phycodurus, desmoxytes, dragonet)... Maybe a buggy libc
    update was installed there in September?
    
    Meanwhile, we've got a failure at the Check stage (not pg_upgradeCheck),
    with a release LLVM build [1]:
    2025-10-21 17:15:16.261 CEST [1489783][client backend][:0] LOG: disconnection: session time: 0:00:03.177 user=bf database=regression host=[local]
    corrupted size vs. prev_size while consolidating
    
    Thus, the initial suspicion that the issue is caused by dff7591a7 (because
    the first failure [2] happened right after it) now seems wrong.
    
    Do you perhaps have any insight into the possible cause of these memory errors?
    
    [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2025-10-21%2015%3A14%3A12
    [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=phycodurus&dt=2025-09-16%2011%3A09%3A07
    
    Best regards,
    Alexander