Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Allow "make check"-style testing to work with musl C library.

  2. Fix compiler warnings on MSYS2

  1. Regression tests fail with musl libc because libpq.so can't be loaded

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-16T12:38:48Z

    Running the regression tests when building with musl libc fails, with 
    errors like the following:
    
    ERROR:  could not load library 
    "<builddir>/tmp_install/usr/lib/postgresql/libpqwalreceiver.so": Error 
    loading shared library libpq.so.5: No such file or directory (needed by 
    <builddir>/tmp_install/usr/lib/postgresql/libpqwalreceiver.so)
    
    This was observed in Alpine Linux [1] and nixpkgs [2] a few years ago. I 
    now looked at this a bit and this is what happens:
    
    - The temporary install location is set via LD_LIBRARY_PATH in the 
    regression tests, so that postgres can find those libs.
    
    - All tests which load an extension / shared module via dlopen() fail, 
    when the loaded library in turn depends on another library in 
    tmp_install - I think in practice it's libpq.so all the time.
    
    - LD_LIBRARY_PATH is used correctly to look for the direct dependency 
    loaded in dlopen(...), but is not taken into account anymore when trying 
    to locate libpq.so. This step only fails with musl, but works fine with 
    glibc.
    
    
    I can reproduce this with a simple Dockerfile (attached), which uses the 
    library/postgres-alpine image, moves libpq.so to a different folder and 
    points LD_LIBRARY_PATH at it. Build and run the dockerfile like this:
    
       docker build . -t pg-musl && docker run --rm pg-musl
    
    This Dockerfile can easily be adjusted to work with the debian image - 
    which shows that doing the same with glibc works just fine.
    
    
    Even though this originated in "just" the regression tests, I'm filing 
    this as a bug, because:
    - The docs explicitly mention LD_LIBRARY_PATH support to point at a 
    different /lib folder in [3].
    - This can clearly break outside the test-suite as shown with the 
    Dockerfile.
    
    
    I tried a few more things:
    - When I add an /etc/ld-musl-$(ARCH).path file and add the path to 
    libpq.so's libdir to it, libpq.so can be found.
    - When I add the path to libpq.so as an rpath to the postgres binary, 
    libpq.so can be found.
    
    Both is not surprising, but just confirms musl-ld actually works as 
    expected. It's just LD_LIBRARY_PATH that seems to not be passed on.
    
    To rule out a musl bug, I also put together a very simple test-case of 
    an executable loading liba with dlopen(), which depends on libb and then 
    constructing the same scenario with LD_LIBRARY_PATH. This works fine 
    when compiled with glibc and musl, too. Thus, I believe the problem to 
    be somewhere in how postgres loads those libraries.
    
    Best,
    
    Wolfgang
    
    [1]: 
    https://github.com/alpinelinux/aports/commit/d67ceb66a1ca9e1899071c9ef09fffba29fa0417#diff-2bd25b5172fc52319de1b09086ac0db6314d2e9fa73497979f5198f8caaec1b9
    
    [2]: 
    https://github.com/NixOS/nixpkgs/commit/09ffd722072291f00f2a54d7404eb568a15e562a
    
    [3]: https://www.postgresql.org/docs/current/install-post.html
  2. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Tom Lane <tgl@sss.pgh.pa.us> — 2024-03-16T14:24:43Z

    Wolfgang Walther <walther@technowledgy.de> writes:
    > - LD_LIBRARY_PATH is used correctly to look for the direct dependency 
    > loaded in dlopen(...), but is not taken into account anymore when trying 
    > to locate libpq.so. This step only fails with musl, but works fine with 
    > glibc.
    
    Why do you think this is our bug and not musl's?  We do not even have
    any code that knows anything about indirect library dependencies.
    
    			regards, tom lane
    
    
    
    
  3. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-16T15:55:49Z

    Tom Lane:
    > Why do you think this is our bug and not musl's?
    
    Because I tried to reproduce with musl directly with a very simple 
    example as mentioned in:
    
    > To rule out a musl bug, I also put together a very simple test-case of an executable loading liba with dlopen(), which depends on libb and then constructing the same scenario with LD_LIBRARY_PATH. This works fine when compiled with glibc and musl, too. Thus, I believe the problem to be somewhere in how postgres loads those libraries.
    
    My test case looked like the attached. To compile it with musl via 
    Dockerfile:
    
       docker build . -t musl-dlopen && docker run --rm musl-dlopen
    
    a.c/a.h is equivalent to libpqwalreceiver and b.c/b.h to libpq.
    
    This works fine with both musl and glibc.
    
    (Note: I also tried putting liba.so and libb.so in different folders, 
    adding both to LD_LIBRARY_PATH - but that worked fine as well.)
    
    Now my very simple example probably does something different than 
    postgres, so that the problem doesn't appear there. But since it seems 
    possible to do this with musl in principle, it should be possible to do 
    it differently in postgres to make it work, too.
    
    Any ideas?
    
    Best,
    Wolfgang
  4. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Andrew Dunstan <andrew@dunslane.net> — 2024-03-16T19:49:27Z

    On Sat, Mar 16, 2024 at 11:55 AM Wolfgang Walther <walther@technowledgy.de>
    wrote:
    
    > Tom Lane:
    > > Why do you think this is our bug and not musl's?
    >
    > Because I tried to reproduce with musl directly with a very simple
    > example as mentioned in:
    >
    > > To rule out a musl bug, I also put together a very simple test-case of
    > an executable loading liba with dlopen(), which depends on libb and then
    > constructing the same scenario with LD_LIBRARY_PATH. This works fine when
    > compiled with glibc and musl, too. Thus, I believe the problem to be
    > somewhere in how postgres loads those libraries.
    >
    > My test case looked like the attached. To compile it with musl via
    > Dockerfile:
    >
    >    docker build . -t musl-dlopen && docker run --rm musl-dlopen
    >
    > a.c/a.h is equivalent to libpqwalreceiver and b.c/b.h to libpq.
    >
    > This works fine with both musl and glibc.
    >
    > (Note: I also tried putting liba.so and libb.so in different folders,
    > adding both to LD_LIBRARY_PATH - but that worked fine as well.)
    >
    > Now my very simple example probably does something different than
    > postgres, so that the problem doesn't appear there. But since it seems
    > possible to do this with musl in principle, it should be possible to do
    > it differently in postgres to make it work, too.
    >
    > Any ideas?
    >
    >
    >
    
    On Alpine Linux, which uses musl libc, you have to run `make install`
    before you can run `make check`. Have you tried
    that?
    
    (Note to self: need a new Alpine buildfarm member)
    
    cheers
    
    andrew
    
  5. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-16T20:00:28Z

    Andrew Dunstan:
    > On Alpine Linux, which uses musl libc, you have to run `make install` 
    > before you can run `make check`. Have you tried that?
    
    I can see how that could work around the problem, because the library 
    would already be available in the default library path / rpath and 
    LD_LIBRARY_PATH would not be needed.
    
    However, this would only be a workaround for the specific case of 
    running the regression tests, not a solution. Using LD_LIBRARY_PATH, as 
    documented, would still not be possible.
    
    In my case, I am just using docker images with Alpine to easily 
    reproduce the problem. I am building with NixOS / nixpkgs' pkgsMusl. The 
    order of check and install phases can't be changed there, AFAICT. The 
    workaround I use right now is to temporarily patch rpath of the postgres 
    binary - this will be reset during the install phase anyway. This works, 
    but again is not a real solution.
    
    Best,
    
    Wolfgang
    
    
    
    
  6. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Tom Lane <tgl@sss.pgh.pa.us> — 2024-03-16T20:10:56Z

    Andrew Dunstan <andrew@dunslane.net> writes:
    > On Alpine Linux, which uses musl libc, you have to run `make install`
    > before you can run `make check`. Have you tried that?
    
    We have the same situation on macOS.  There, it seems to be the result
    of a "security feature" that strips DYLD_LIBRARY_PATH from the process
    environment when make executes a shell.  There's not much we can do
    about that, and I suspect there is not much we can do about musl's
    behavior either.  (I am not a fan of proposals to modify the binaries
    under test, because then you are not testing what you intend to
    install.)
    
    			regards, tom lane
    
    
    
    
  7. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Thomas Munro <thomas.munro@gmail.com> — 2024-03-16T20:19:34Z

    On Sun, Mar 17, 2024 at 4:56 AM Wolfgang Walther
    <walther@technowledgy.de> wrote:
    > Any ideas?
    
    I'd look into whether there is a difference in the rules it uses for
    deciding not to trust LD_LIBRARY_PATH, which seems to be around here
    somewhere:
    
    https://github.com/bminor/musl/blob/7ada6dde6f9dc6a2836c3d92c2f762d35fd229e0/ldso/dynlink.c#L1812
    
    I wonder if you can break into an affected program and check out the
    magic there.  FWIW on MacOS something equivalent happens at the moment
    we execute a shell, because the system shell is 'code signed' and that
    OS treats signed stuff similar to setuid binaries for this purpose
    (IIRC setting SHELL to point to a suitable unsigned shell could work
    around the problem there?)
    
    Another interesting thing that came up when I googled musl/glibc
    differences -- old but looks plausibly still true (not that I expect
    our code to be modifying that stuff in place, just something to
    check):
    
    https://www.openwall.com/lists/musl/2014/08/31/14
    
    
    
    
  8. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-16T20:21:55Z

    Tom Lane:
    > We have the same situation on macOS.  There, it seems to be the result
    > of a "security feature" that strips DYLD_LIBRARY_PATH from the process
    > environment when make executes a shell.
    
    I'm not sure whether this explanation is sufficient for the musl case, 
    because LD_LIBRARY_PATH does make a difference: The direct dependency 
    (libpqwalreceiver.so) can still be found if it's moved elsewhere and 
    LD_LIBRARY_PATH points at it. So clearly the LD_LIBRARY_PATH variable is 
    still set after make executed the shell - it's just not in effect on the 
    *indirect* dependency (libpq.so) anymore.
    
    Best,
    
    Wolfgang
    
    
    
    
  9. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-16T20:31:10Z

    Thomas Munro:
    > I'd look into whether there is a difference in the rules it uses for
    > deciding not to trust LD_LIBRARY_PATH, which seems to be around here
    > somewhere:
    > 
    > https://github.com/bminor/musl/blob/7ada6dde6f9dc6a2836c3d92c2f762d35fd229e0/ldso/dynlink.c#L1812
    
    Yeah, I have been looking at that, too. I had also experimented a bit 
    with setuid/setgid for that matter, but that didn't lead anywhere, yet. 
    I'm not 100% sure, but I think this would also not match my other 
    observation, that LD_LIBRARY_PATH does work for libpqwalreceiver (direct 
    dep), but not libpq (indirect dep).
    
    > Another interesting thing that came up when I googled musl/glibc
    > differences -- old but looks plausibly still true (not that I expect
    > our code to be modifying that stuff in place, just something to
    > check):
    > 
    > https://www.openwall.com/lists/musl/2014/08/31/14
    
    To me, this seems very much like what could happen - it matches all my 
    observations, so far. But I can't tell how likely that is, not knowing 
    much of the postgres code.
    
    Best,
    
    Wolfgang
    
    
    
    
  10. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Thomas Munro <thomas.munro@gmail.com> — 2024-03-16T20:54:52Z

    On Sun, Mar 17, 2024 at 9:19 AM Thomas Munro <thomas.munro@gmail.com> wrote:
    > Another interesting thing that came up when I googled musl/glibc
    > differences -- old but looks plausibly still true (not that I expect
    > our code to be modifying that stuff in place, just something to
    > check):
    >
    > https://www.openwall.com/lists/musl/2014/08/31/14
    
    Hmm, that does mention setproctitle, and our ps_status.c does indeed
    clobber some stuff in that region (in fact our ps_status.c is likely
    derived from the setproctitle() function from sendmail AFAICT).  But
    that's in our "backend" server processes, unlike the problems we have
    on Macs... oh but you're failing to load libpqwalreceiver.so which
    makes some sense for the backend hypothesis.  What happens if you hack
    ps_status.c to use PS_USE_NONE?
    
    
    
    
  11. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-17T09:44:03Z

    Thomas Munro:
    > Hmm, that does mention setproctitle, and our ps_status.c does indeed
    > clobber some stuff in that region (in fact our ps_status.c is likely
    > derived from the setproctitle() function from sendmail AFAICT).  But
    > that's in our "backend" server processes, unlike the problems we have
    > on Macs... oh but you're failing to load libpqwalreceiver.so which
    > makes some sense for the backend hypothesis.  What happens if you hack
    > ps_status.c to use PS_USE_NONE?
    
    Nailed it. PS_USE_NONE fixes it.
    
    Best,
    
    Wolfgang
    
    
    
    
  12. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Christophe Pettus <xof@thebuild.com> — 2024-03-17T10:33:52Z

    
    > On Mar 17, 2024, at 02:44, Wolfgang Walther <walther@technowledgy.de> wrote:
    > 
    > Nailed it. PS_USE_NONE fixes it.
    
    Given the musl (still?) does not define a preprocessor macro specific to it, is there a way of improving the test in pg_status.c to catch this case?  It seems wrong that the current test passes a case that doesn't actually work.
    
    
    
  13. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-17T13:11:19Z

    Christophe Pettus:
    > Given the musl (still?) does not define a preprocessor macro specific to it, is there a way of improving the test in pg_status.c to catch this case?  It seems wrong that the current test passes a case that doesn't actually work.
    
    The missing macro is on purpose and unlikely to change: 
    https://openwall.com/lists/musl/2013/03/29/13
    
    I also found this thread, which discusses exactly our case: 
    https://www.openwall.com/lists/musl/2022/08/17/1
    
    Some quotes from that thread:
    
    > I understand that what Postgres et al are doing is a nasty hack.
    
    And:
    
    > Applications that *really* want setproctitle type functionality can
    > presumably do something like re-exec themselves with a suitably large
    > argv[0] to give them safe space to overwrite with their preferred
    > message, rather than UB trying to relocate the environment (and auxv?
    > how? they can't tell libc they moved it) to some other location.
    
    Could that be a more portable way of doing this?
    
    Best,
    
    Wolfgang
    
    
    
    
  14. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Christophe Pettus <xof@thebuild.com> — 2024-03-17T15:44:45Z

    
    > On Mar 17, 2024, at 06:11, Wolfgang Walther <walther@technowledgy.de> wrote:
    > The missing macro is on purpose and unlikely to change: https://openwall.com/lists/musl/2013/03/29/13
    
    Indeed.
    
    > I also found this thread, which discusses exactly our case: https://www.openwall.com/lists/musl/2022/08/17/1
    
    While getting proper setproctitle functionality on musl would be great, my goal was more modest: Have it correctly set PS_USE_NONE when compiling against musl.
    
    
    
  15. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Andrew Dunstan <andrew@dunslane.net> — 2024-03-17T20:33:40Z

    On Sun, Mar 17, 2024 at 11:45 AM Christophe Pettus <xof@thebuild.com> wrote:
    
    >
    >
    > > On Mar 17, 2024, at 06:11, Wolfgang Walther <walther@technowledgy.de>
    > wrote:
    > > The missing macro is on purpose and unlikely to change:
    > https://openwall.com/lists/musl/2013/03/29/13
    >
    > Indeed.
    >
    
    That seems a little shortsighted. If other libc implementations find it
    appropriate to have similar macros why should they be different?
    
    
    > > I also found this thread, which discusses exactly our case:
    > https://www.openwall.com/lists/musl/2022/08/17/1
    >
    > While getting proper setproctitle functionality on musl would be great, my
    > goal was more modest: Have it correctly set PS_USE_NONE when compiling
    > against musl.
    >
    
    One simple thing might be for us to enclose the block in ps_status.c at
    lines 49-59 in #ifndef PS_USE_NONE/#endif. Then you could compile with
    -DPS_USE_NONE in your CPPFLAGS.
    
    cheers
    
    andrew
    
  16. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Christophe Pettus <xof@thebuild.com> — 2024-03-17T21:05:54Z

    
    > On Mar 17, 2024, at 13:33, Andrew Dunstan <andrew@dunslane.net> wrote:
    > 
    > That seems a little shortsighted. If other libc implementations find it appropriate to have similar macros why should they be different?
    
    It's a philosophical argument against checking for particular libc implementations instead of particular features.  I'm not unsympathetic to that argument, but AFAICT there's no clean way of checking for this by examining feature #defines.
    
    
    
  17. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Thomas Munro <thomas.munro@gmail.com> — 2024-03-17T23:20:26Z

    On Mon, Mar 18, 2024 at 10:06 AM Christophe Pettus <xof@thebuild.com> wrote:
    > > On Mar 17, 2024, at 13:33, Andrew Dunstan <andrew@dunslane.net> wrote:
    > > That seems a little shortsighted. If other libc implementations find it appropriate to have similar macros why should they be different?
    >
    > It's a philosophical argument against checking for particular libc implementations instead of particular features.  I'm not unsympathetic to that argument, but AFAICT there's no clean way of checking for this by examining feature #defines.
    
    I like their philosophy, and I like their FAQ.  Down with software
    monocultures, up with standards and cooperation.  But anyway...
    
    I wondered for a moment if there could be a runtime way to test if
    we'd broken stuff, but it seems hard without a way to ask the runtime
    linker for its search path to see if it has any pointers into the
    environment.  We can't, that "env_path" variable in dynlink.c is not
    accessible to us by any reasonable means.  And yeah, this whole thing
    is a nasty invasive hack that harks back to the 80s I assume, before
    many systems provided a clean way to do this (and some never did)...
    
    Hmm, I can think of one dirty hack on top of our existing dirty hack
    that might work.  I feel bad typing this out, but here goes nothing:
    In save_ps_display_args(), we compute end_of_area, stepping past
    contiguous arguments and environment variables.  But what if we
    terminated that if we saw an environment entry beginning "LD_"?  We'd
    still feel free to clobber the memory up to that point (rather than
    limiting ourselves to the argv space, another more conservative choice
    that might truncate a few PS display messages, or maybe not given the
    typical postmaster arguments, maye that'd work out OK), and we'd still
    copy the environment to somewhere new, but anything like "LD_XXX" that
    the runtime linker might have stashed a pointer to would remain valid.
    /me runs away and hides
    
    
    
    
  18. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Christophe Pettus <xof@thebuild.com> — 2024-03-18T09:33:45Z

    
    > On Mar 17, 2024, at 16:20, Thomas Munro <thomas.munro@gmail.com> wrote:
    > 
    > We'd
    > still feel free to clobber the memory up to that point (rather than
    > limiting ourselves to the argv space, another more conservative choice
    > that might truncate a few PS display messages, or maybe not given the
    > typical postmaster arguments, maye that'd work out OK), and we'd still
    > copy the environment to somewhere new, but anything like "LD_XXX" that
    > the runtime linker might have stashed a pointer to would remain valid.
    > /me runs away and hides
    
    It doesn't lack for bravery!  (And I have to just comment that the linker storing pointers into that space as a way of finding libraries... well, that doesn't get them the moral high ground for nasty hacks.)
    
    I'm comfortable with "if you are using musl, you don't get the ps messages" as a first solution, if we can find a way of detecting a libc that passes the other tests but doesn't support any of the existing hacks.
    
    
    
  19. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Thomas Munro <thomas.munro@gmail.com> — 2024-03-18T13:25:33Z

    On Mon, Mar 18, 2024 at 10:34 PM Christophe Pettus <xof@thebuild.com> wrote:
    > > On Mar 17, 2024, at 16:20, Thomas Munro <thomas.munro@gmail.com> wrote:
    > > We'd
    > > still feel free to clobber the memory up to that point (rather than
    > > limiting ourselves to the argv space, another more conservative choice
    > > that might truncate a few PS display messages, or maybe not given the
    > > typical postmaster arguments, maye that'd work out OK), and we'd still
    > > copy the environment to somewhere new, but anything like "LD_XXX" that
    > > the runtime linker might have stashed a pointer to would remain valid.
    > > /me runs away and hides
    >
    > It doesn't lack for bravery!  (And I have to just comment that the linker storing pointers into that space as a way of finding libraries... well, that doesn't get them the moral high ground for nasty hacks.)
    
    FWIW here is a blind patch if someone wants to try it out... no musl here.
    
    (Hmm, I think it's not that unreasonable on their part to assume the
    initial environment is immutable if their implementation doesn't
    mutate it, and our doing so is undeniably UB; surprising, maybe, given
    that the technique works on that other popular brand of C library on
    that kind of kernel, not to mention dozens of old Unixen of yore...
    The real solution may turn out to be the prctl() described in that
    thread, where you can tell the kernel where you're planning to move
    your argv and it can find it to show ps/top, but I checked and you
    still can't call that without special privileges, so maybe someone
    should get onto complaining to kernel hackers about that?  That thread
    is wrong about us clobbering auxv BTW, we're not animals!)
    
  20. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Tom Lane <tgl@sss.pgh.pa.us> — 2024-03-18T14:23:32Z

    Thomas Munro <thomas.munro@gmail.com> writes:
    > (Hmm, I think it's not that unreasonable on their part to assume the
    > initial environment is immutable if their implementation doesn't
    > mutate it, and our doing so is undeniably UB; surprising, maybe, given
    > that the technique works on that other popular brand of C library on
    > that kind of kernel, not to mention dozens of old Unixen of yore...
    
    Does their implementation also ignore the effects of putenv() or
    setenv() on LD_LIBRARY_PATH?  They have no moral high ground
    whatsoever if that's the case.  But if it doesn't, an alternative
    route to a solution could be to scan the original environment, strdup
    and putenv each entry to move it to freshly malloc'd space, and
    then reclaim the old environment area.
    
    			regards, tom lane
    
    
    
    
  21. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Thomas Munro <thomas.munro@gmail.com> — 2024-03-18T21:17:36Z

    On Tue, Mar 19, 2024 at 3:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Thomas Munro <thomas.munro@gmail.com> writes:
    > > (Hmm, I think it's not that unreasonable on their part to assume the
    > > initial environment is immutable if their implementation doesn't
    > > mutate it, and our doing so is undeniably UB; surprising, maybe, given
    > > that the technique works on that other popular brand of C library on
    > > that kind of kernel, not to mention dozens of old Unixen of yore...
    >
    > Does their implementation also ignore the effects of putenv() or
    > setenv() on LD_LIBRARY_PATH?  They have no moral high ground
    > whatsoever if that's the case.  But if it doesn't, an alternative
    > route to a solution could be to scan the original environment, strdup
    > and putenv each entry to move it to freshly malloc'd space, and
    > then reclaim the old environment area.
    
    Yes, the musl linker/loader ignores putenv()/setenv() changes to
    LD_LIBRARY_PATH after process start (that is, changes only effect the
    search path when injected into a new program with exec*()).  As does
    glibc, it's just that it captures by copy instead of reference
    (according to one of the links above, I didn't check the source).  So
    setenv() has no effect on dlopen() in *this* program, and using putenv
    in that way won't help.  We simply can't move the value of
    LD_LIBRARY_PATH (though my patch could be a little sneakier and steal
    all the bytes right up to the = sign to get more space for our
    message!).
    
    One way to tell if a copy has been made is to trace a program that does:
    
            getenv("LD_LIBRARY_PATH")[2] = 'X';
            dlopen("foo.so", RTLD_NOW | RTLD_GLOBAL);
    
    ... when run with LD_LIBRARY_PATH set to /asdf.  On FreeBSD I see it
    tries to open "/aXdf...", so now I know that FreeBSD also captures it
    by reference like musl.  But we don't use the clobber trick on
    FreeBSD, it has a proper setproctitle() function that knows how to
    negotiate with the kernel, so it doesn't matter.  It also ignores
    changes made with setent()/putenv(), because those create fresh
    entries but leave the initial environment strings untouched.
    
    Solaris also ignores changes made after startup (it's in the dlopen
    man page), and from a very quick look at its ld_lib_setup() I think it
    achieved that with a copy.  I believe its ancestor SunOS 4 invented
    all of these conventions (and the mmap/virtual memory concepts they
    rode in on), later nailed down to some degree in the System V ABI and
    very widely adopted, but I don't see anything in the latter that
    specifically addresses this point, eg LD_LIBRARY copy vs reference and
    interaction with dlopen() (perhaps I didn't look hard enough).  I'm
    not sure what else you can point to to make strong claims about this
    stuff, but I bet every system ignores changes after startup, it's just
    that they found two ways to achieve that.  POSIX says of dlopen that
    the "file [argument] is used in an implementation-defined manner", and
    of environ that we're welcome to swap a whole new environ, but doesn't
    seem to tell us anything about the one that is replaced (who owns it?
    is the initial one set up at execution time special? etc).  The line
    banning manipulation of the pointers environ refers to doesn't exactly
    describe what we're doing (we're manipulating the strings pointed to
    by the *previous* environ).  UB.
    
    
    
    
  22. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Thomas Munro <thomas.munro@gmail.com> — 2024-03-18T22:48:50Z

    On Tue, Mar 19, 2024 at 10:17 AM Thomas Munro <thomas.munro@gmail.com> wrote:
    > ... (though my patch could be a little sneakier and steal
    > all the bytes right up to the = sign to get more space for our
    > message!).
    
    Here's one like that.  No musl here -- does this work Wolfgang?  Do we
    think it's generous enough with space in practice that we could just
    always do this for __linux__ systems without anyone noticing (ie
    including glibc users)?  Should we be more specific about which LD_*
    variables?  Do people not doing hacking/testing ever really set those,
    eg on production servers?  This code path was once used by up to a
    dozen or so OSes but they're all dead, only Linux, Solaris and macOS
    left, and I don't have any reason to think they suffer from this
    problem and Macs don't even follow the SysV LD_ naming convention,
    hence gating on Linux.
    
  23. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Bruce Momjian <bruce@momjian.us> — 2024-03-19T23:54:49Z

    On Tue, Mar 19, 2024 at 11:48:50AM +1300, Thomas Munro wrote:
    > On Tue, Mar 19, 2024 at 10:17 AM Thomas Munro <thomas.munro@gmail.com> wrote:
    > > ... (though my patch could be a little sneakier and steal
    > > all the bytes right up to the = sign to get more space for our
    > > message!).
    > 
    > Here's one like that.  No musl here -- does this work Wolfgang?  Do we
    > think it's generous enough with space in practice that we could just
    > always do this for __linux__ systems without anyone noticing (ie
    > including glibc users)?  Should we be more specific about which LD_*
    > variables?  Do people not doing hacking/testing ever really set those,
    > eg on production servers?  This code path was once used by up to a
    > dozen or so OSes but they're all dead, only Linux, Solaris and macOS
    > left, and I don't have any reason to think they suffer from this
    > problem and Macs don't even follow the SysV LD_ naming convention,
    > hence gating on Linux.
    
    So this would truncate the process title on all Linux that have an LD_
    environment entry, even those without musl?
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        https://momjian.us
      EDB                                      https://enterprisedb.com
    
      Only you can decide what is important to you.
    
    
    
    
  24. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Thomas Munro <thomas.munro@gmail.com> — 2024-03-20T01:12:54Z

    On Wed, Mar 20, 2024 at 12:54 PM Bruce Momjian <bruce@momjian.us> wrote:
    > So this would truncate the process title on all Linux that have an LD_
    > environment entry, even those without musl?
    
    Yep.  How long is /proc/XXX/cmdline (check with wc -c /proc/...) in a
    postmaster near you?  You'd always get that much, plus as much of
    /proc/XXX/environ as we can find before you reach LD_XXX=, which on a
    typical system would, I guess, usually be never.  If it's a problem
    you could try to arrange for LD_ XXX to come later in environ[].  What
    I observe is that they seem to get copied in backwards, wrt the
    environment exported by the parent, so if you set DUMMY=XXXXXXXX just
    before starting the process it'll make sacrificial space in the right
    place (but I'm not sure where that effect is coming from so don't
    quote me).
    
    
    
    
  25. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Bruce Momjian <bruce@momjian.us> — 2024-03-20T01:27:23Z

    On Wed, Mar 20, 2024 at 02:12:54PM +1300, Thomas Munro wrote:
    > On Wed, Mar 20, 2024 at 12:54 PM Bruce Momjian <bruce@momjian.us> wrote:
    > > So this would truncate the process title on all Linux that have an LD_
    > > environment entry, even those without musl?
    > 
    > Yep.  How long is /proc/XXX/cmdline (check with wc -c /proc/...) in a
    > postmaster near you?  You'd always get that much, plus as much of
    
    	$ cat /proc/2000/cmdline |wc -c
    	30
    
    > /proc/XXX/environ as we can find before you reach LD_XXX=, which on a
    > typical system would, I guess, usually be never.  If it's a problem
    > you could try to arrange for LD_ XXX to come later in environ[].  What
    > I observe is that they seem to get copied in backwards, wrt the
    > environment exported by the parent, so if you set DUMMY=XXXXXXXX just
    > before starting the process it'll make sacrificial space in the right
    > place (but I'm not sure where that effect is coming from so don't
    > quote me).
    
    I am just cautious about changing behavior on our most common platform
    for a libc library I have never heard of.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        https://momjian.us
      EDB                                      https://enterprisedb.com
    
      Only you can decide what is important to you.
    
    
    
    
  26. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Tom Lane <tgl@sss.pgh.pa.us> — 2024-03-20T01:53:08Z

    Thomas Munro <thomas.munro@gmail.com> writes:
    > Yep.  How long is /proc/XXX/cmdline (check with wc -c /proc/...) in a
    > postmaster near you?  You'd always get that much, plus as much of
    > /proc/XXX/environ as we can find before you reach LD_XXX=, which on a
    > typical system would, I guess, usually be never.
    
    I'd be happier about this if the target pattern were more restrictive.
    Is there reason to think that musl keeps a pointer to anything besides
    LD_LIBRARY_PATH?
    
    			regards, tom lane
    
    
    
    
  27. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Thomas Munro <thomas.munro@gmail.com> — 2024-03-20T02:01:09Z

    On Wed, Mar 20, 2024 at 2:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Thomas Munro <thomas.munro@gmail.com> writes:
    > > Yep.  How long is /proc/XXX/cmdline (check with wc -c /proc/...) in a
    > > postmaster near you?  You'd always get that much, plus as much of
    > > /proc/XXX/environ as we can find before you reach LD_XXX=, which on a
    > > typical system would, I guess, usually be never.
    >
    > I'd be happier about this if the target pattern were more restrictive.
    > Is there reason to think that musl keeps a pointer to anything besides
    > LD_LIBRARY_PATH?
    
    Also LD_PRELOAD:
    
    https://github.com/bminor/musl/blob/7ada6dde6f9dc6a2836c3d92c2f762d35fd229e0/ldso/dynlink.c#L1824
    
    Yeah we could do just those two.
    
    
    
    
  28. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Tom Lane <tgl@sss.pgh.pa.us> — 2024-03-20T02:03:36Z

    Thomas Munro <thomas.munro@gmail.com> writes:
    > On Wed, Mar 20, 2024 at 2:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >> I'd be happier about this if the target pattern were more restrictive.
    >> Is there reason to think that musl keeps a pointer to anything besides
    >> LD_LIBRARY_PATH?
    
    > Also LD_PRELOAD:
    > https://github.com/bminor/musl/blob/7ada6dde6f9dc6a2836c3d92c2f762d35fd229e0/ldso/dynlink.c#L1824
    > Yeah we could do just those two.
    
    +1 for stopping only at one of those two names.
    
    			regards, tom lane
    
    
    
    
  29. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Thomas Munro <thomas.munro@gmail.com> — 2024-03-20T02:12:49Z

    On Wed, Mar 20, 2024 at 2:27 PM Bruce Momjian <bruce@momjian.us> wrote:
    > I am just cautious about changing behavior on our most common platform
    > for a libc library I have never heard of.
    
    Yeah, I hear you.  I don't have a dog in this race, I just like
    retro-computing mysteries and arguments about the meaning of
    standards...  That said I'm pretty sure no one should be running
    production PostgreSQL systems held together by LD_LIBRARY_PATH, and if
    they do, it looks like systemd/rc.d scripts set a bunch of PG_OOM_BLAH
    stuff just before the start the cluster that would provide extra chaff
    in front of LD_XXX stuff defined earlier, and then pg_ctl inserts even
    more, and even if they don't use any of that stuff and just run the
    postmaster directly with some other homegrown tooling, now we're down
    to very niche/expert scenarios where it should be acceptable to point
    to this thread that says "try setting an extra dummy variable after
    you set LD_XXX!".
    
    
    
    
  30. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Thomas Munro <thomas.munro@gmail.com> — 2024-03-20T04:39:34Z

    On Wed, Mar 20, 2024 at 3:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > +1 for stopping only at one of those two names.
    
    Here's one like that for Wolfgang to test on musl.
    
  31. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Peter Eisentraut <peter@eisentraut.org> — 2024-03-20T06:02:48Z

    On 17.03.24 11:33, Christophe Pettus wrote:
    >> On Mar 17, 2024, at 02:44, Wolfgang Walther <walther@technowledgy.de> wrote:
    >>
    >> Nailed it. PS_USE_NONE fixes it.
    > 
    > Given the musl (still?) does not define a preprocessor macro specific to it, is there a way of improving the test in pg_status.c to catch this case?  It seems wrong that the current test passes a case that doesn't actually work.
    > 
    
    We could turn it around and do
    
    #if defined(__linux__)
    #if defined(__GLIBC__) || defined(__UCLIBC__ )
    #define PS_USE_CLOBBER_ARGV
    #else
    #define PS_USE_NONE
    #endif
    #endif
    
    
    
    
    
  32. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Andrew Dunstan <andrew@dunslane.net> — 2024-03-20T07:16:14Z

    On Wed, Mar 20, 2024 at 2:03 AM Peter Eisentraut <peter@eisentraut.org>
    wrote:
    
    > On 17.03.24 11:33, Christophe Pettus wrote:
    > >> On Mar 17, 2024, at 02:44, Wolfgang Walther <walther@technowledgy.de>
    > wrote:
    > >>
    > >> Nailed it. PS_USE_NONE fixes it.
    > >
    > > Given the musl (still?) does not define a preprocessor macro specific to
    > it, is there a way of improving the test in pg_status.c to catch this
    > case?  It seems wrong that the current test passes a case that doesn't
    > actually work.
    > >
    >
    > We could turn it around and do
    >
    > #if defined(__linux__)
    > #if defined(__GLIBC__) || defined(__UCLIBC__ )
    > #define PS_USE_CLOBBER_ARGV
    > #else
    > #define PS_USE_NONE
    > #endif
    > #endif
    >
    >
    >
    >
    I like it. Neat and minimal.
    
    cheers
    
    andrew
    
  33. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-20T09:39:20Z

    Thomas Munro:
    > On Wed, Mar 20, 2024 at 3:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >> +1 for stopping only at one of those two names.
    > 
    > Here's one like that for Wolfgang to test on musl.
    
    Works fine.
    
    Peter Eisentraut:
    > We could turn it around and do
    > 
    > #if defined(__linux__)
    > #if defined(__GLIBC__) || defined(__UCLIBC__ )
    > #define PS_USE_CLOBBER_ARGV
    > #else
    > #define PS_USE_NONE
    > #endif
    > #endif
    
    This works as well.
    
    I also put together a PoC of what was mentioned in musl's mailing list: 
    Instead of clobbering environ at all, exec yourself again with padded 
    argv0. This works, too. Attached.
    
    Best,
    
    Wolfgang
  34. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Bruce Momjian <bruce@momjian.us> — 2024-03-20T13:35:51Z

    On Wed, Mar 20, 2024 at 10:39:20AM +0100, Wolfgang Walther wrote:
    > Peter Eisentraut:
    > > We could turn it around and do
    > > 
    > > #if defined(__linux__)
    > > #if defined(__GLIBC__) || defined(__UCLIBC__ )
    > > #define PS_USE_CLOBBER_ARGV
    > > #else
    > > #define PS_USE_NONE
    > > #endif
    > > #endif
    > 
    > This works as well.
    
    Yes, I prefer this.  I am worried the environ hackery will bite us
    someday and the cause will be hard to find.
    
    > I also put together a PoC of what was mentioned in musl's mailing list:
    > Instead of clobbering environ at all, exec yourself again with padded argv0.
    > This works, too. Attached.
    
    It is hard to imagine why we would add an extra exec on every Linux
    server start for this.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        https://momjian.us
      EDB                                      https://enterprisedb.com
    
      Only you can decide what is important to you.
    
    
    
    
  35. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Bruce Momjian <bruce@momjian.us> — 2024-03-20T13:37:30Z

    On Wed, Mar 20, 2024 at 09:35:51AM -0400, Bruce Momjian wrote:
    > On Wed, Mar 20, 2024 at 10:39:20AM +0100, Wolfgang Walther wrote:
    > > I also put together a PoC of what was mentioned in musl's mailing list:
    > > Instead of clobbering environ at all, exec yourself again with padded argv0.
    > > This works, too. Attached.
    > 
    > It is hard to imagine why we would add an extra exec on every Linux
    > server start for this.
    
    I guess we could conditionally exec only if we find we must, but then
    such exec cases would be rare and rarely tested.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        https://momjian.us
      EDB                                      https://enterprisedb.com
    
      Only you can decide what is important to you.
    
    
    
    
  36. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-20T14:24:58Z

    Bruce Momjian:
    > I guess we could conditionally exec only if we find we must, but then
    > such exec cases would be rare and rarely tested.
    
    I think you might be seriously underestimating how often musl is used. 
    Alpine Linux uses musl and is very widespread in the container world 
    because of smaller image size.
    
    The library/postgres docker image has been pulled about 8 billion times 
    since 2014 [1]. While we can't really tell how many of those pulled the 
    alpine variant of the image, comparing the alpine [2] and ubuntu/debian 
    [3,4] base images gives a rough estimate of >50% using alpine in general.
    
    This is certainly not rare.
    
    But yeah, buildfarm coverage for musl would be good, I agree. Maybe even 
    directly in CI?
    
    Best,
    
    Wolfgang
    
    [1]: https://hub.docker.com/v2/repositories/library/postgres
    [2]: https://hub.docker.com/v2/repositories/library/alpine
    [3]: https://hub.docker.com/v2/repositories/library/ubuntu
    [4]: https://hub.docker.com/v2/repositories/library/debian
    
    
    
    
  37. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Laurenz Albe <laurenz.albe@cybertec.at> — 2024-03-20T14:28:53Z

    On Wed, 2024-03-20 at 15:24 +0100, Wolfgang Walther wrote:
    > I think you might be seriously underestimating how often musl is used. 
    > Alpine Linux uses musl and is very widespread in the container world 
    > because of smaller image size
    
    The last time I looked, its collation support didn't work at all...
    
    Yours,
    Laurenz Albe
    
    
    
    
  38. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Tom Lane <tgl@sss.pgh.pa.us> — 2024-03-20T14:36:38Z

    Laurenz Albe <laurenz.albe@cybertec.at> writes:
    > On Wed, 2024-03-20 at 15:24 +0100, Wolfgang Walther wrote:
    >> I think you might be seriously underestimating how often musl is used. 
    >> Alpine Linux uses musl and is very widespread in the container world 
    >> because of smaller image size
    
    > The last time I looked, its collation support didn't work at all...
    
    I think the same is true of some of the BSDen, so that's not a
    large impediment to us.  But in any case, if somebody wants
    Alpine or musl to be considered a supported platform, they'd
    best step up and run a buildfarm animal.
    
    			regards, tom lane
    
    
    
    
  39. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-20T14:37:07Z

    Bruce Momjian:
    > On Wed, Mar 20, 2024 at 10:39:20AM +0100, Wolfgang Walther wrote:
    >> Peter Eisentraut:
    >>> We could turn it around and do
    >>>
    >>> #if defined(__linux__)
    >>> #if defined(__GLIBC__) || defined(__UCLIBC__ )
    >>> #define PS_USE_CLOBBER_ARGV
    >>> #else
    >>> #define PS_USE_NONE
    >>> #endif
    >>> #endif
    >>
    >> This works as well.
    > 
    > Yes, I prefer this.  I am worried the environ hackery will bite us
    > someday and the cause will be hard to find.
    
    Well, the environ hackery already bit and it sure was hard to find. But 
    this approach would still clobber environ happily... which is undefined 
    behavior. But certainly the opt-in to known-to-be-good libc variants is 
    a better approach than before.
    
    Between this and "stop clobbering at LD_LIBRARY_PATH", I prefer the 
    latter, though.
    
    >> I also put together a PoC of what was mentioned in musl's mailing list:
    >> Instead of clobbering environ at all, exec yourself again with padded argv0.
    >> This works, too. Attached.
    > 
    > It is hard to imagine why we would add an extra exec on every Linux
    > server start for this.
    
    Would this be a problem? For a running server this would happen only 
    once when the postmaster starts up, AFAICT.
    
    Best,
    
    Wolfgang
    
    
    
    
  40. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-20T14:43:05Z

    Laurenz Albe:
    > On Wed, 2024-03-20 at 15:24 +0100, Wolfgang Walther wrote:
    >> I think you might be seriously underestimating how often musl is used.
    >> Alpine Linux uses musl and is very widespread in the container world
    >> because of smaller image size
    > 
    > The last time I looked, its collation support didn't work at all...
    
    IIUC, using icu collations should work. I didn't extensively try, 
    though. But yeah - musl itself doesn't do it, knowingly so.
    
    Best,
    
    Wolfgang
    
    
    
    
  41. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-20T14:46:39Z

    Tom Lane:
    > But in any case, if somebody wants
    > Alpine or musl to be considered a supported platform, they'd
    > best step up and run a buildfarm animal.
    
    Yeah, I was already thinking about that. But I guess we'd need to first 
    make the test suite pass on musl. i.e. $subject, but there are also some 
    smaller issues after that, before the full test suite will pass.
    
    Best,
    
    Wolfgang
    
    
    
    
  42. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Peter Eisentraut <peter@eisentraut.org> — 2024-03-20T18:37:08Z

    On 20.03.24 15:37, Wolfgang Walther wrote:
    >> It is hard to imagine why we would add an extra exec on every Linux
    >> server start for this.
    > 
    > Would this be a problem? For a running server this would happen only 
    > once when the postmaster starts up, AFAICT.
    
    I wonder if it would cause issues with systemd or similar, if the PID of 
    the running process is not the one that systemd started.  If so, there 
    is probably a workaround, but it would have to be analyzed.
    
    
    
    
    
  43. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-20T19:10:01Z

    Peter Eisentraut:
    >> Would this be a problem? For a running server this would happen only 
    >> once when the postmaster starts up, AFAICT.
    > 
    > I wonder if it would cause issues with systemd or similar, if the PID of 
    > the running process is not the one that systemd started.  If so, there 
    > is probably a workaround, but it would have to be analyzed.
    
    I don't think that exec even creates a new PID. The current process is 
    replaced, so the PID stays the same.
    
    Best,
    
    Wolfgang
    
    
    
    
  44. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Bruce Momjian <bruce@momjian.us> — 2024-03-20T19:18:39Z

    On Wed, Mar 20, 2024 at 03:24:58PM +0100, Wolfgang Walther wrote:
    > Bruce Momjian:
    > > I guess we could conditionally exec only if we find we must, but then
    > > such exec cases would be rare and rarely tested.
    > 
    > I think you might be seriously underestimating how often musl is used.
    > Alpine Linux uses musl and is very widespread in the container world because
    > of smaller image size.
    > 
    > The library/postgres docker image has been pulled about 8 billion times
    > since 2014 [1]. While we can't really tell how many of those pulled the
    > alpine variant of the image, comparing the alpine [2] and ubuntu/debian
    > [3,4] base images gives a rough estimate of >50% using alpine in general.
    
    Uh, what is the current behavior of Postgres on musl?  It just fails if
    the process title is longer than argv[0] plus the environment space to
    the LD_ environment variable, and then linking fails for certain
    extensions?  If there are many downloads, why would we only be getting
    this report now?
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        https://momjian.us
      EDB                                      https://enterprisedb.com
    
      Only you can decide what is important to you.
    
    
    
    
  45. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-20T19:29:21Z

    Bruce Momjian:
    >> The library/postgres docker image has been pulled about 8 billion times
    >> since 2014 [1]. While we can't really tell how many of those pulled the
    >> alpine variant of the image, comparing the alpine [2] and ubuntu/debian
    >> [3,4] base images gives a rough estimate of >50% using alpine in general.
    > 
    > Uh, what is the current behavior of Postgres on musl?  It just fails if
    > the process title is longer than argv[0] plus the environment space to
    > the LD_ environment variable, and then linking fails for certain
    > extensions?  If there are many downloads, why would we only be getting
    > this report now?
    
    The process title works fine. It's just the way how space is cleared for 
    the process title, that is causing problems elsewhere.
    
    The thing that is broken when running postgres on alpine/musl is, to put 
    libpq in a custom path and use LD_LIBRARY_PATH to find it when loading 
    libpqwalreceiver (+ some contrib modules). Nobody does that, especially 
    not in a container environment where postgres is likely the only thing 
    running in that container, so there is no point in using any custom 
    library paths or anything - the image is built once and made to work, 
    and everybody else is just using that working image.
    
    The much more practical problem is that the test suite doesn't run, 
    because it makes use of LD_LIBRARY_PATH for that purpose. In the past, 
    the packagers for alpine only disabled the failing tests, but IIRC they 
    have given up on that and just disabled the whole test suite by now.
    
    Best,
    
    Wolfgang
    
    
    
    
  46. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Bruce Momjian <bruce@momjian.us> — 2024-03-20T19:42:16Z

    On Wed, Mar 20, 2024 at 08:29:21PM +0100, Wolfgang Walther wrote:
    > Bruce Momjian:
    > > > The library/postgres docker image has been pulled about 8 billion times
    > > > since 2014 [1]. While we can't really tell how many of those pulled the
    > > > alpine variant of the image, comparing the alpine [2] and ubuntu/debian
    > > > [3,4] base images gives a rough estimate of >50% using alpine in general.
    > > 
    > > Uh, what is the current behavior of Postgres on musl?  It just fails if
    > > the process title is longer than argv[0] plus the environment space to
    > > the LD_ environment variable, and then linking fails for certain
    > > extensions?  If there are many downloads, why would we only be getting
    > > this report now?
    > 
    > The process title works fine. It's just the way how space is cleared for the
    > process title, that is causing problems elsewhere.
    > 
    > The thing that is broken when running postgres on alpine/musl is, to put
    > libpq in a custom path and use LD_LIBRARY_PATH to find it when loading
    > libpqwalreceiver (+ some contrib modules). Nobody does that, especially not
    > in a container environment where postgres is likely the only thing running
    > in that container, so there is no point in using any custom library paths or
    > anything - the image is built once and made to work, and everybody else is
    > just using that working image.
    > 
    > The much more practical problem is that the test suite doesn't run, because
    > it makes use of LD_LIBRARY_PATH for that purpose. In the past, the packagers
    > for alpine only disabled the failing tests, but IIRC they have given up on
    > that and just disabled the whole test suite by now.
    
    Thanks, that is very clear.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        https://momjian.us
      EDB                                      https://enterprisedb.com
    
      Only you can decide what is important to you.
    
    
    
    
  47. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Thomas Munro <thomas.munro@gmail.com> — 2024-03-20T22:26:26Z

    On Thu, Mar 21, 2024 at 2:35 AM Bruce Momjian <bruce@momjian.us> wrote:
    > Yes, I prefer this.  I am worried the environ hackery will bite us
    > someday and the cause will be hard to find.
    
    Some speculation on how widespread this trick is: as mentioned, it
    seems to come from sendmail (I didn't spend the time to find a repo
    that has ancient versions, so this is someone's random snapshot repo,
    but it's referring to pretty old dead systems):
    
    https://github.com/Distrotech/sendmail/blob/master/sendmail/conf.c#L2436
    
    That was once ubiquitous, back in the day.  One of the most widespread
    envon-clobberers these days must be openssh:
    
    https://github.com/openssh/openssh-portable/blob/86bdd3853f4d32c85e295e6216a2fe0953ad93f0/openbsd-compat/setproctitle.c#L69
    
    And funnily enough, googling LD_LIBRARY_PATH and openssh brings up a
    few unsolved/unanswered questions about mysterious breakage (though I
    didn't see any that mentioned musl by name and there could be other
    explanations, *shrug*).
    
    There is also Chromium/Chrome:
    
    https://github.com/chromium/chromium/blob/main/base/process/set_process_title_linux.cc#L136
    
    That code has some interesting commentary and points to a commit in
    Linux which mentions setproctitle() and making sure it still works
    (funny because setproctitle() isn't a function in any standard
    userspace library AFAIK so I guess it just means the trick in
    general), and also mentions the failure of attempts to get an official
    way to do this negotiated between the relevant projects.
    
    Of course we have to distinguish between the basic argv[] clobbering
    trick which is barely even a trick, and the more advanced environ
    stealing trick which confuses musl.  A very widespread user of the
    basic trick would be systemd, which tries to use the prctl() if it can
    to get a much bigger window of memory to write on, but otherwise falls
    back to accepting a small one.  I guess we'd do the same if we could,
    ie if a future Linux version didn't require CAP_SYS_RESOURCES to do
    it:
    
    https://github.com/systemd/systemd/blob/8810b782a17050d7f7a870b975f09e8a690b7bea/src/basic/argv-util.c
    
    Anyway, it looks like there is plenty of will out there to keep this
    working, albeit in a weird semi-supported state whose cruftiness is
    undeniable.
    
    
    
    
  48. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-21T20:16:46Z

    Thomas Munro:
    > Of course we have to distinguish between the basic argv[] clobbering
    > trick which is barely even a trick, and the more advanced environ
    > stealing trick which confuses musl.
    
    Right. The latter not only confuses musl, but also makes 
    /proc/<pid>/environ return garbage. This is also mentioned at the bottom 
    of main.c, which has a workaround for the specific case of UBSan 
    depending on that. This is kind of funny: Because we are relying on 
    undefined behavior regarding the modification of environ, we need a 
    workaround for the "UndefinedBehaviorSanitizer" - I guess by failing 
    without this workaround, it wanted to tell us something..
    
    This happens on glibc, too.
    
    So summarizing:
    
    1. The simple approach is to use PS_USE_CLOBBER_ARGV on Linux only for 
    glibc and other known-to-be-good-and-identifiable libc variants, 
    otherwise default to PS_USE_NONE. This will not only keep the problem 
    for /proc/../environ for glibc users, but also disable ps status for 
    musl entirely. Considering that probably the biggest use-case for musl 
    is to run postgres in containers, it's quite likely to actually run more 
    than just one cluster on a single machine. In this case... ps status 
    would be especially handy to identify which cluster a process belongs to.
    
    2. The next proposal was to stop clobbering environ once LD_LIBRARY_PATH 
    / LD_PRELOAD is found to keep those intact. This will keep ps status 
    support on musl, which is good. But the /proc/.../environ problem will 
    still be there, unchanged.
    
    Both of those approaches rely on the undefined behavior of clobbering 
    environ.
    
    3. The logical consequence of this is, to stop clobbering environ and 
    use only the available argv space. However, this will quickly leave us 
    with a very small ps status buffer to work with, making the feature less 
    useful. Note, that this could happen theoretically by starting postgres 
    with the fewest arguments and environment possible, too. Not sure what 
    the minimal buffer size is that could be achieved that way. The point 
    is: The buffer size is not guaranteed at all.
    
    4. The upstream (musl) suggestion of which I sent a PoC was to "exec 
    yourself with a bigger argv". This works. I chose to pad argv0 with 
    trailing slashes. Those can safely be stripped away again, because any 
    argv0 which would come with a trailing slash to start with, would not be 
    the current executable, but a directory - so would fail exec immediately 
    anyway. This keeps /proc/.../environ intact and does not rely on 
    undefined behavior. Additionally, we get a guaranteed ps buffer size of 
    256, which is what we use on BSDs and Windows, too.
    
    I wonder why we actually fall back to PS_USE_NONE by default.. and how 
    much of that is related to the environment clobbering to start with? 
    Could we even use the exec-approach as the fallback in all other cases 
    except BSDs and Windows and get rid of PS_USE_NONE? Clobbering only argv 
    sure seems way safer to do than what we do right now.
    
    Best,
    
    Wolfgang
    
    
    
    
  49. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-21T20:30:00Z

    Here is what we could with this:
    
    > 2. The next proposal was to stop clobbering environ once LD_LIBRARY_PATH 
    > / LD_PRELOAD is found to keep those intact.
    
    We could backpatch this down to v12. This would be one step to make the 
    test suite pass on Alpine Linux with musl and ultimately allow setting 
    up a buildfarm animal for that.
    
    It does not solve the /proc/.../environ problem, but at least keeps ps 
    status working on musl as it did before, so not a regression.
    
    > 4. The upstream (musl) suggestion of which I sent a PoC was to "exec 
    > yourself with a bigger argv". 
    
    We could do this in HEAD now ...
    
    > Could we even use the exec-approach as the fallback in all other cases 
    > except BSDs and Windows and get rid of PS_USE_NONE?
    
    ... and then remove PS_USE_NONE at the beginning of the v18 cycle.
    
    This would give a bit more time for those "other systems", which were 
    previously falling back PS_USE_NONE and would then clobber argv, too.
    
    Opinions?
    
    Best,
    
    Wolfgang
    
    
    
    
  50. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Thomas Munro <thomas.munro@gmail.com> — 2024-03-21T22:42:52Z

    On Fri, Mar 22, 2024 at 9:30 AM <walther@technowledgy.de> wrote:
    > > 4. The upstream (musl) suggestion of which I sent a PoC was to "exec
    > > yourself with a bigger argv".
    >
    > We could do this in HEAD now ...
    
    Just a thought: if we want to go this way, do we need a new exec call?
     We already control the initial exec in pg_ctl.c.
    
    > > Could we even use the exec-approach as the fallback in all other cases
    > > except BSDs and Windows and get rid of PS_USE_NONE?
    >
    > ... and then remove PS_USE_NONE at the beginning of the v18 cycle.
    >
    > This would give a bit more time for those "other systems", which were
    > previously falling back PS_USE_NONE and would then clobber argv, too.
    
    RIght.  It's unspecified by POSIX whether ps shows changes to those
    strings (and there are systems that don't), but it can't hurt to do so
    anyway, and it'd be better than having a PS_USE_NONE code path that is
    untested.  I dimly recall that it turned out that PS_USE_NONE was
    actually broken for a while without anyone noticing.
    
    
    
    
  51. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Tom Lane <tgl@sss.pgh.pa.us> — 2024-03-21T23:02:47Z

    Thomas Munro <thomas.munro@gmail.com> writes:
    > Just a thought: if we want to go this way, do we need a new exec call?
    >  We already control the initial exec in pg_ctl.c.
    
    I'm resistant to assuming the postmaster is launched through pg_ctl.
    systemd, for example, might well prefer not to do that, not to
    mention all the troglodytes still using 1990s launch scripts.
    
    A question that seems worth debating in this thread is how much
    updating the process title is even worth nowadays.  It feels like
    a hangover from before we had pg_stat_activity and other monitoring
    support.  So I don't feel a huge need to support it on musl.
    The previously-suggested patch to whitelist glibc and variants,
    and otherwise fall back to PS_USE_NONE, seems like it might be
    the appropriate amount of effort.
    
    			regards, tom lane
    
    
    
    
  52. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Thomas Munro <thomas.munro@gmail.com> — 2024-03-21T23:20:11Z

    On Fri, Mar 22, 2024 at 12:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > The previously-suggested patch to whitelist glibc and variants,
    > and otherwise fall back to PS_USE_NONE, seems like it might be
    > the appropriate amount of effort.
    
    What about meeting musl halfway: clobber argv, but only clobber
    environ for the libcs known to tolerate that?  Then musl might see
    truncation at 30-60 characters or whatever it is, but that's probably
    enough to see your cluster_name and backend type/user name which is
    pretty useful information.
    
    
    
    
  53. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Andrew Dunstan <andrew@dunslane.net> — 2024-03-21T23:23:45Z

    On Thu, Mar 21, 2024 at 7:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
    
    > Thomas Munro <thomas.munro@gmail.com> writes:
    > > Just a thought: if we want to go this way, do we need a new exec call?
    > >  We already control the initial exec in pg_ctl.c.
    >
    > I'm resistant to assuming the postmaster is launched through pg_ctl.
    > systemd, for example, might well prefer not to do that, not to
    > mention all the troglodytes still using 1990s launch scripts.
    >
    > A question that seems worth debating in this thread is how much
    > updating the process title is even worth nowadays.  It feels like
    > a hangover from before we had pg_stat_activity and other monitoring
    > support.  So I don't feel a huge need to support it on musl.
    > The previously-suggested patch to whitelist glibc and variants,
    > and otherwise fall back to PS_USE_NONE, seems like it might be
    > the appropriate amount of effort.
    >
    >
    >
    
    
    +1
    
    cheers
    
    andrew
    
  54. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Tom Lane <tgl@sss.pgh.pa.us> — 2024-03-21T23:30:52Z

    Thomas Munro <thomas.munro@gmail.com> writes:
    > On Fri, Mar 22, 2024 at 12:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >> The previously-suggested patch to whitelist glibc and variants,
    >> and otherwise fall back to PS_USE_NONE, seems like it might be
    >> the appropriate amount of effort.
    
    > What about meeting musl halfway: clobber argv, but only clobber
    > environ for the libcs known to tolerate that?  Then musl might see
    > truncation at 30-60 characters or whatever it is, but that's probably
    > enough to see your cluster_name and backend type/user name which is
    > pretty useful information.
    
    No real objection here.  I do wonder about the point you (or somebody)
    made upthread that we don't have any testing of the PS_USE_NONE case;
    but that could be addressed some other way.
    
    			regards, tom lane
    
    
    
    
  55. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Bruce Momjian <bruce@momjian.us> — 2024-03-22T00:07:14Z

    On Fri, Mar 22, 2024 at 11:42:52AM +1300, Thomas Munro wrote:
    > On Fri, Mar 22, 2024 at 9:30 AM <walther@technowledgy.de> wrote:
    > > > 4. The upstream (musl) suggestion of which I sent a PoC was to "exec
    > > > yourself with a bigger argv".
    > >
    > > We could do this in HEAD now ...
    > 
    > Just a thought: if we want to go this way, do we need a new exec call?
    >  We already control the initial exec in pg_ctl.c.
    > 
    > > > Could we even use the exec-approach as the fallback in all other cases
    > > > except BSDs and Windows and get rid of PS_USE_NONE?
    > >
    > > ... and then remove PS_USE_NONE at the beginning of the v18 cycle.
    > >
    > > This would give a bit more time for those "other systems", which were
    > > previously falling back PS_USE_NONE and would then clobber argv, too.
    > 
    > RIght.  It's unspecified by POSIX whether ps shows changes to those
    > strings (and there are systems that don't), but it can't hurt to do so
    > anyway, and it'd be better than having a PS_USE_NONE code path that is
    > untested.  I dimly recall that it turned out that PS_USE_NONE was
    > actually broken for a while without anyone noticing.
    
    Actually, I was thinking the opposite.  Since the musl libc is widely
    used, it will be tested, and I don't want to disable process display
    updates for such a common platform.
    
    I suggest we use the #ifdef test to continue our existing behavior for
    the libraries we know about, like glibc, and use the LD_* process title
    truncation hack for libc's we don't recognize.
    
    Attached is a prototype patch which implements this based on previous
    patches.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        https://momjian.us
      EDB                                      https://enterprisedb.com
    
      Only you can decide what is important to you.
    
  56. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Bruce Momjian <bruce@momjian.us> — 2024-03-22T00:08:49Z

    On Fri, Mar 22, 2024 at 12:20:11PM +1300, Thomas Munro wrote:
    > On Fri, Mar 22, 2024 at 12:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > > The previously-suggested patch to whitelist glibc and variants,
    > > and otherwise fall back to PS_USE_NONE, seems like it might be
    > > the appropriate amount of effort.
    > 
    > What about meeting musl halfway: clobber argv, but only clobber
    > environ for the libcs known to tolerate that?  Then musl might see
    > truncation at 30-60 characters or whatever it is, but that's probably
    > enough to see your cluster_name and backend type/user name which is
    > pretty useful information.
    
    I just posted a prototype patch to implement this.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        https://momjian.us
      EDB                                      https://enterprisedb.com
    
      Only you can decide what is important to you.
    
    
    
    
  57. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Andres Freund <andres@anarazel.de> — 2024-03-22T00:19:48Z

    Hi,
    
    On 2024-03-21 21:16:46 +0100, Wolfgang Walther wrote:
    > Right. The latter not only confuses musl, but also makes /proc/<pid>/environ
    > return garbage. This is also mentioned at the bottom of main.c, which has a
    > workaround for the specific case of UBSan depending on that. This is kind of
    > funny: Because we are relying on undefined behavior regarding the
    > modification of environ, we need a workaround for the
    > "UndefinedBehaviorSanitizer" - I guess by failing without this workaround,
    > it wanted to tell us something..
    
    I don't think that's quite a fair description. Ubsan is basically doing
    undefined things itself, so it's turtles all the way down.
    
    
    > So summarizing:
    
    FWIW, independent of which fix we go with, I think we need a buildfarm animal
    using musl. Even better if one of the CI tasks can be made to use musl as
    well.
    
    Greetings,
    
    Andres Freund
    
    
    
    
  58. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Andrew Dunstan <andrew@dunslane.net> — 2024-03-22T06:12:26Z

    Sent from my iPad
    
    > On Mar 22, 2024, at 10:49 AM, Andres Freund <andres@anarazel.de> wrote:
    > 
    > Hi,
    > 
    >> On 2024-03-21 21:16:46 +0100, Wolfgang Walther wrote:
    >> Right. The latter not only confuses musl, but also makes /proc/<pid>/environ
    >> return garbage. This is also mentioned at the bottom of main.c, which has a
    >> workaround for the specific case of UBSan depending on that. This is kind of
    >> funny: Because we are relying on undefined behavior regarding the
    >> modification of environ, we need a workaround for the
    >> "UndefinedBehaviorSanitizer" - I guess by failing without this workaround,
    >> it wanted to tell us something..
    > 
    > I don't think that's quite a fair description. Ubsan is basically doing
    > undefined things itself, so it's turtles all the way down.
    > 
    > 
    >> So summarizing:
    > 
    > FWIW, independent of which fix we go with, I think we need a buildfarm animal
    > using musl. Even better if one of the CI tasks can be made to use musl as
    > well.
    
    
    We had one till 3 months ago. It’s on my list to recreate.
    
    Cheers
    
    Andrew
    
    
    
    
  59. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Tom Lane <tgl@sss.pgh.pa.us> — 2024-03-22T06:15:27Z

    Andrew Dunstan <andrew@dunslane.net> writes:
    > On Mar 22, 2024, at 10:49 AM, Andres Freund <andres@anarazel.de> wrote:
    >> FWIW, independent of which fix we go with, I think we need a buildfarm animal
    >> using musl. Even better if one of the CI tasks can be made to use musl as
    >> well.
    
    > We had one till 3 months ago. It’s on my list to recreate.
    
    How was it passing?  The issue discussed in this thread has surely
    been there for a long time, and Wolfgang mentioned that he sees
    others.
    
    			regards, tom lane
    
    
    
    
  60. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Andrew Dunstan <andrew@dunslane.net> — 2024-03-22T07:02:19Z

    
    > On Mar 22, 2024, at 4:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > 
    > Andrew Dunstan <andrew@dunslane.net> writes:age
    >>> On Mar 22, 2024, at 10:49 AM, Andres Freund <andres@anarazel.de> wrote:
    >>> FWIW, independent of which fix we go with, I think we need a buildfarm animal
    >>> using musl. Even better if one of the CI tasks can be made to use musl as
    >>> well.
    > 
    >> We had one till 3 months ago. It’s on my list to recreate.
    > 
    > How was it passing?  The issue discussed in this thread has surely
    > been there for a long time, and Wolfgang mentioned that he sees
    > others.
    > 
    >  
    
    
    The buildfarm client has a switch that delays running regression tests until after the install stages.
    
    Cheers 
    
    Andrew
    
    
    
  61. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-22T07:55:52Z

    Andres Freund:
    > FWIW, independent of which fix we go with, I think we need a buildfarm animal
    > using musl. Even better if one of the CI tasks can be made to use musl as
    > well.
    
    I am already working with Andrew to set up a buildfarm animal to run 
    Alpine Linux/musl. I can look into the CI task as well. Are you 
    suggesting to change an existing task to run with Alpine/musl or to add 
    a new task for it? It would be docker image based for sure.
    
    Best,
    
    Wolfgang
    
    
    
    
  62. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-22T08:02:04Z

    Andrew Dunstan:
    >> On Mar 22, 2024, at 4:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >> How was it passing?  The issue discussed in this thread has surely
    >> been there for a long time, and Wolfgang mentioned that he sees
    >> others.
    > 
    > The buildfarm client has a switch that delays running regression tests until after the install stages.
    
    Hm. So while that switch makes the animal pass the build, it did hide 
    exactly this problem. Not sure whether this switch should be used at 
    all, then. Was this switch only implemented for the specific case of 
    Alpine/musl or is there a different reason for it, as well?
    
    The other issues I had been seeing were during make check-world, but not 
    make check. Those were things around setlocale() / /bin/locale, IIRC. 
    Not sure whether all of the tests are run by the buildfarm?
    
    Best,
    
    Wolfgang
    
    
    
    
    
  63. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-22T08:12:47Z

    Tom Lane:
    > Thomas Munro <thomas.munro@gmail.com> writes:
    >> Just a thought: if we want to go this way, do we need a new exec call?
    >>   We already control the initial exec in pg_ctl.c.
    > 
    > I'm resistant to assuming the postmaster is launched through pg_ctl.
    > systemd, for example, might well prefer not to do that, not to
    > mention all the troglodytes still using 1990s launch scripts.
    
    Right, the systemd example in the docs is not using pg_ctl.
    
    But, it should be easily possible to have:
    - pg_ctl call postgres with a padded argv0
    - postgres call itself with padding, if it wasn't called with that already
    
    This way, there would be no additional exec call when started through 
    pg_ctl, but one more call when started directly.
    
    Best,
    
    Wolfgang
    
    
    
    
  64. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-22T08:33:38Z

    Bruce Momjian:
    > I suggest we use the #ifdef test to continue our existing behavior for
    > the libraries we know about, like glibc, and use the LD_* process title
    > truncation hack for libc's we don't recognize.
    > 
    > Attached is a prototype patch which implements this based on previous
    > patches.
    
    The condition to check for linux/glibc in your patch is slightly off:
    
       #if ! defined(__linux__) || (! defined(__GLIBC__) && 
    defined(__UCLIBC__ ))
    
    should be
    
       #if defined(__linux__) && ! (defined(__GLIBC__) || defined(__UCLIBC__ ))
    
    With the latter, it passes tests with musl.
    
    Best,
    
    Wolfgang
    
    
    
    
  65. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-22T09:41:39Z

    Wolfgang Walther:
    > The other issues I had been seeing were during make check-world, but not 
    > make check. Those were things around setlocale() / /bin/locale, IIRC. 
    > Not sure whether all of the tests are run by the buildfarm?
    
    Ah, those other tests only fail when building --with-icu, but Andrew's 
    animal didn't do that.
    
    Best,
    
    Wolfgang
    
    
    
    
  66. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Andrew Dunstan <andrew@dunslane.net> — 2024-03-22T12:10:27Z

    On Fri, Mar 22, 2024 at 4:02 AM Wolfgang Walther <walther@technowledgy.de>
    wrote:
    
    > Andrew Dunstan:
    > >> On Mar 22, 2024, at 4:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > >> How was it passing?  The issue discussed in this thread has surely
    > >> been there for a long time, and Wolfgang mentioned that he sees
    > >> others.
    > >
    > > The buildfarm client has a switch that delays running regression tests
    > until after the install stages.
    >
    > Hm. So while that switch makes the animal pass the build, it did hide
    > exactly this problem. Not sure whether this switch should be used at
    > all, then. Was this switch only implemented for the specific case of
    > Alpine/musl or is there a different reason for it, as well?
    >
    
    Alpine was the main motivation, but it's also probably useful on Macs with
    SIP enabled.
    
    ISTR raising the Alpine issue back then (2018) but I can't find a reference
    now.
    
    cheers
    
    andrew
    
  67. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Bruce Momjian <bruce@momjian.us> — 2024-03-22T13:36:19Z

    On Fri, Mar 22, 2024 at 09:33:38AM +0100, walther@technowledgy.de wrote:
    > Bruce Momjian:
    > > I suggest we use the #ifdef test to continue our existing behavior for
    > > the libraries we know about, like glibc, and use the LD_* process title
    > > truncation hack for libc's we don't recognize.
    > > 
    > > Attached is a prototype patch which implements this based on previous
    > > patches.
    > 
    > The condition to check for linux/glibc in your patch is slightly off:
    > 
    >   #if ! defined(__linux__) || (! defined(__GLIBC__) && defined(__UCLIBC__ ))
    > 
    > should be
    > 
    >   #if defined(__linux__) && ! (defined(__GLIBC__) || defined(__UCLIBC__ ))
    > 
    > With the latter, it passes tests with musl.
    
    Yes, my logic was wrong. Not sure what I was thinking, frankly.
    
    I am not a big fan of negating a complex conditional, but would rather
    pass the negation into the conditional, new patch attached.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        https://momjian.us
      EDB                                      https://enterprisedb.com
    
      Only you can decide what is important to you.
    
  68. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Bruce Momjian <bruce@momjian.us> — 2024-03-22T19:44:28Z

    On Fri, Mar 22, 2024 at 09:36:19AM -0400, Bruce Momjian wrote:
    > On Fri, Mar 22, 2024 at 09:33:38AM +0100, walther@technowledgy.de wrote:
    > > Bruce Momjian:
    > > > I suggest we use the #ifdef test to continue our existing behavior for
    > > > the libraries we know about, like glibc, and use the LD_* process title
    > > > truncation hack for libc's we don't recognize.
    > > > 
    > > > Attached is a prototype patch which implements this based on previous
    > > > patches.
    > > 
    > > The condition to check for linux/glibc in your patch is slightly off:
    > > 
    > >   #if ! defined(__linux__) || (! defined(__GLIBC__) && defined(__UCLIBC__ ))
    > > 
    > > should be
    > > 
    > >   #if defined(__linux__) && ! (defined(__GLIBC__) || defined(__UCLIBC__ ))
    > > 
    > > With the latter, it passes tests with musl.
    > 
    > Yes, my logic was wrong. Not sure what I was thinking, frankly.
    > 
    > I am not a big fan of negating a complex conditional, but would rather
    > pass the negation into the conditional, new patch attached.
    
    With no one "hoping this patch dies in a fire"*, I have updated it with
    more details, which I now think is committable to master.  Is this
    something to backpatch?  Seems too rare a bug to me.
    
    * Robert Haas, https://www.postgresql.org/message-id/CA%2BTgmoYsyrCNmg%2BYh6rgP7K8r-bYPjCeF1tPxENRFwD4VZAZvw%40mail.gmail.com
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        https://momjian.us
      EDB                                      https://enterprisedb.com
    
      Only you can decide what is important to you.
    
  69. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Thomas Munro <thomas.munro@gmail.com> — 2024-03-25T19:20:47Z

    On Thu, Mar 21, 2024 at 11:26 AM Thomas Munro <thomas.munro@gmail.com> wrote:
    > Some speculation on how widespread this trick is: as mentioned, it
    
    Just one more, which I just ran into by accident while looking for
    something else, which I'm posting just in case this thread ever gets
    used to try to convince musl hackers to change their end to support
    it:
    
    https://github.com/openzfs/zfs/blob/master/lib/libzutil/os/linux/zutil_setproctitle.c#L158
    
    
    
    
  70. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-25T21:58:30Z

    Bruce Momjian:
    > With no one "hoping this patch dies in a fire"*, I have updated it with
    > more details, which I now think is committable to master.  Is this
    > something to backpatch?  Seems too rare a bug to me.
    
    I would like to see this backpatched to be able to run the regression 
    tests easily on all stable branches.
    
    I have taken your's/Thomas' patch and extended it with a few more taking 
    in many of the ideas in this thread:
    
    0001 Don't clobber LD_*
    This is the patch you posted. This applies cleanly all the way down to 
    v12. This fixes the bug and allows running most of the tests with musl - 
    yeah!
    I also confirmed, that this will not create practical problems with 
    library/postgres docker image, where this is likely used the most. While 
    "postgres" is called by default without any arguments here, plenty of 
    environment variables are passed. The docker image does use LD_PRELOAD 
    to trick initdb, but that's not set when running the postmaster, so not 
    a problem here.
    This use-case also shows why the proposed patch to still partially 
    clobber environ at this stage is better than to not clobber environ at 
    all - in this case, the docker image would essentially have no ps status 
    at all by default.-
    
    0002 Allow setting PS_USE_NONE via CPPFLAGS
    This was proposed by Andrew and applies cleanly down to v12. Thus, it 
    could be backpatched, too. First and foremost this would allow setting a 
    buildfarm animal to use this flag to make sure this code path is 
    actually build/tested at all. This is something that Thomas and Tom 
    hinted at.
    
    0003 Don't ever clobber environ again
    This is the approach I had previously posted as a PoC. This would not be 
    backpatched, but I suggested this could go into v17 now. This avoids the 
    undefined behavior and sets the table to eventually set ps status via 
    argv by default and remove PS_USE_NONE later.
    Compared to the PoC patch, I decided not to pad argv[0], because this 
    will break the ps status display for the postmaster itself. Instead exec 
    is called with an additional argument consisting of exactly 255 spaces.
    I also tried avoiding the additional exec-call if postgres was called 
    via pg_ctl, as suggested by Peter. This quickly turned out to be more 
    invasive than I would have liked this to be, though.
    The current approach works very well, the environment doesn't need to be 
    copied anymore and the workaround for /proc/<pid>/environ in main.c can 
    go away, too.
    
    0004 Default to PS_USE_CLOBBER_ARGV
    This changes the default to display ps status on all other systems, too. 
    This could potentially go in now as well, or be delayed to the beginning 
    of the v18 cycle. In the unlikely event that this breaks something on a 
    platform not considered here and we get a bug report, we can easily 
    advise to compile with CPPFLAGS=-DPS_USE_NONE, which is still there at 
    this stage.
    
    0005 Remove PS_USE_NONE
    However, if no reports come in and no problems are detected with 0004, 
    then this can be entirely removed. This for "later", whenever that is.
    
    Best,
    
    Wolfgang
  71. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Bruce Momjian <bruce@momjian.us> — 2024-03-25T22:35:06Z

    On Mon, Mar 25, 2024 at 10:58:30PM +0100, walther@technowledgy.de wrote:
    > Bruce Momjian:
    > > With no one "hoping this patch dies in a fire"*, I have updated it with
    > > more details, which I now think is committable to master.  Is this
    > > something to backpatch?  Seems too rare a bug to me.
    > 
    > I would like to see this backpatched to be able to run the regression tests
    > easily on all stable branches.
    
    You want to risk destabilizing Postgres by backpatching something this
    complex so the regression tests can be run on all stable branches?  I
    think you are overestimating our desire to take on risk.
    
    Also, in my patch, the parentheses here:
    
    	#if defined(__linux__) && (! defined(__GLIBC__) && ! defined(__UCLIBC__))
    
    are unnecessary so they should be removed:
    
    	#if defined(__linux__) && ! defined(__GLIBC__) && ! defined(__UCLIBC__)
    
    I am only willing to apply my patch, and only to master.  Other
    committers might be more willing.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        https://momjian.us
      EDB                                      https://enterprisedb.com
    
      Only you can decide what is important to you.
    
    
    
    
  72. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Peter Eisentraut <peter@eisentraut.org> — 2024-03-25T22:46:09Z

    On 22.03.24 20:44, Bruce Momjian wrote:
    > +				 * linking (dlopen) might fail.  Here, we truncate the update
    > +				 * of the process title when either of two important dynamic
    > +				 * linking environment variables are set.  Musl does not
    > +				 * define any compiler symbols, so we have to do this for
    > +				 * any Linux libc we don't know is safe.
    > +				 */
    > +				if (strstr(environ[i], "LD_LIBRARY_PATH=") == environ[i] ||
    > +					strstr(environ[i], "LD_PRELOAD=") == environ[i])
    
    What determines which variables require this treatment?
    
    
    
    
    
  73. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Bruce Momjian <bruce@momjian.us> — 2024-03-25T22:51:09Z

    On Mon, Mar 25, 2024 at 11:46:09PM +0100, Peter Eisentraut wrote:
    > On 22.03.24 20:44, Bruce Momjian wrote:
    > > +				 * linking (dlopen) might fail.  Here, we truncate the update
    > > +				 * of the process title when either of two important dynamic
    > > +				 * linking environment variables are set.  Musl does not
    > > +				 * define any compiler symbols, so we have to do this for
    > > +				 * any Linux libc we don't know is safe.
    > > +				 */
    > > +				if (strstr(environ[i], "LD_LIBRARY_PATH=") == environ[i] ||
    > > +					strstr(environ[i], "LD_PRELOAD=") == environ[i])
    > 
    > What determines which variables require this treatment?
    
    Thomas Munro came up with that part of the patch.  I just combined his
    patch with the macro test.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        https://momjian.us
      EDB                                      https://enterprisedb.com
    
      Only you can decide what is important to you.
    
    
    
    
  74. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Thomas Munro <thomas.munro@gmail.com> — 2024-03-25T22:54:59Z

    On Tue, Mar 26, 2024 at 11:46 AM Peter Eisentraut <peter@eisentraut.org> wrote:
    > What determines which variables require this treatment?
    
    That came from me peeking at their code:
    
    https://www.postgresql.org/message-id/CA%2BhUKGKNK5V8XwwJJZm36s3EUy8V51xu4XiE8%3D26n%3DWq3OGd4A%40mail.gmail.com
    
    I had originally proposed to avoid anything beginning "LD_" but Tom
    suggested being more specific.  I doubt LD_PRELOAD can really hurt you
    though (the linker probably only needs the value at the start by
    definition, not at later dlopen() time (?)).  I dunno.  If you're
    asking if there is any standard or whatever supplying these names, the
    System V or at least ELF standards talk about LD_LIBRARY_PATH (though
    those standards don't know/care what happens after userspace takes
    over control of the environment concept, they just talk about how the
    world is created when you exec a process, so they AFAICS they don't
    address this clobbering stuff, and AFAIK other LD_XXX stuff is
    probably implementation specific).
    
    
    
    
  75. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Tom Lane <tgl@sss.pgh.pa.us> — 2024-03-25T23:14:25Z

    Bruce Momjian <bruce@momjian.us> writes:
    > On Mon, Mar 25, 2024 at 10:58:30PM +0100, walther@technowledgy.de wrote:
    >> I would like to see this backpatched to be able to run the regression tests
    >> easily on all stable branches.
    
    > You want to risk destabilizing Postgres by backpatching something this
    > complex so the regression tests can be run on all stable branches?  I
    > think you are overestimating our desire to take on risk.
    
    If we want a buildfarm animal testing this platform, we kind of need
    to support it on all branches.  Having said that, I agree with you
    that we are looking for a minimalist fix not a maximalist one.
    I think the 0001 patch is about right, but the rest seem to be solving
    problems we don't have.
    
    			regards, tom lane
    
    
    
    
  76. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Tom Lane <tgl@sss.pgh.pa.us> — 2024-03-25T23:15:52Z

    Thomas Munro <thomas.munro@gmail.com> writes:
    > On Tue, Mar 26, 2024 at 11:46 AM Peter Eisentraut <peter@eisentraut.org> wrote:
    >> What determines which variables require this treatment?
    
    > I had originally proposed to avoid anything beginning "LD_" but Tom
    > suggested being more specific.  I doubt LD_PRELOAD can really hurt you
    > though (the linker probably only needs the value at the start by
    > definition, not at later dlopen() time (?)).
    
    Oh, good point.  So we could simplify the patch by only looking for
    LD_LIBRARY_PATH.
    
    			regards, tom lane
    
    
    
    
  77. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Bruce Momjian <bruce@momjian.us> — 2024-03-25T23:21:48Z

    On Mon, Mar 25, 2024 at 07:14:25PM -0400, Tom Lane wrote:
    > Bruce Momjian <bruce@momjian.us> writes:
    > > On Mon, Mar 25, 2024 at 10:58:30PM +0100, walther@technowledgy.de wrote:
    > >> I would like to see this backpatched to be able to run the regression tests
    > >> easily on all stable branches.
    > 
    > > You want to risk destabilizing Postgres by backpatching something this
    > > complex so the regression tests can be run on all stable branches?  I
    > > think you are overestimating our desire to take on risk.
    > 
    > If we want a buildfarm animal testing this platform, we kind of need
    > to support it on all branches.  Having said that, I agree with you
    > that we are looking for a minimalist fix not a maximalist one.
    > I think the 0001 patch is about right, but the rest seem to be solving
    > problems we don't have.
    
    I could support the minimalist patch applied to all branches.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        https://momjian.us
      EDB                                      https://enterprisedb.com
    
      Only you can decide what is important to you.
    
    
    
    
  78. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Tom Lane <tgl@sss.pgh.pa.us> — 2024-03-25T23:43:08Z

    I wrote:
    > Thomas Munro <thomas.munro@gmail.com> writes:
    >> I had originally proposed to avoid anything beginning "LD_" but Tom
    >> suggested being more specific.  I doubt LD_PRELOAD can really hurt you
    >> though (the linker probably only needs the value at the start by
    >> definition, not at later dlopen() time (?)).
    
    > Oh, good point.  So we could simplify the patch by only looking for
    > LD_LIBRARY_PATH.
    
    I looked at the musl source code you identified and confirmed that
    only the LD_LIBRARY_PATH string is remembered in a static variable;
    LD_PRELOAD is only accessed locally in that initialization function.
    So we only need to do the attached.  (I failed to resist the
    temptation to rewrite the comments.)
    
    			regards, tom lane
    
    
  79. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Thomas Munro <thomas.munro@gmail.com> — 2024-03-25T23:49:55Z

    On Tue, Mar 26, 2024 at 12:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > I looked at the musl source code you identified and confirmed that
    > only the LD_LIBRARY_PATH string is remembered in a static variable;
    > LD_PRELOAD is only accessed locally in that initialization function.
    > So we only need to do the attached.  (I failed to resist the
    > temptation to rewrite the comments.)
    
    LGTM.
    
    
    
    
  80. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Andres Freund <andres@anarazel.de> — 2024-03-26T00:14:47Z

    Hi,
    
    On 2024-03-22 08:55:52 +0100, Wolfgang Walther wrote:
    > Andres Freund:
    > > FWIW, independent of which fix we go with, I think we need a buildfarm animal
    > > using musl. Even better if one of the CI tasks can be made to use musl as
    > > well.
    >
    > I am already working with Andrew to set up a buildfarm animal to run Alpine
    > Linux/musl. I can look into the CI task as well. Are you suggesting to
    > change an existing task to run with Alpine/musl or to add a new task for it?
    > It would be docker image based for sure.
    
    I'd rather adapt one of the existing tasks, to avoid increasing CI costs
    unduly.
    
    The way we currently run CI for testing of not-yet-merged patches runs
    all tasks other than macos as full VMs, that turned out to be faster &
    cheaper.
    
    FWIW, except for one small issue, building postgres against musl works on
    debian and the tests pass if I install first.
    
    
    The small problem mentioned above is that on debian linux/fs.h isn't available
    when building with musl, which in turn causes src/bin/pg_upgrade/file.c to
    fail to compile.  I assume that's not the case on "fully musl" distro?
    
    Greetings,
    
    Andres Freund
    
    
    
    
  81. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Bruce Momjian <bruce@momjian.us> — 2024-03-26T00:53:05Z

    On Tue, Mar 26, 2024 at 12:49:55PM +1300, Thomas Munro wrote:
    > On Tue, Mar 26, 2024 at 12:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > > I looked at the musl source code you identified and confirmed that
    > > only the LD_LIBRARY_PATH string is remembered in a static variable;
    > > LD_PRELOAD is only accessed locally in that initialization function.
    > > So we only need to do the attached.  (I failed to resist the
    > > temptation to rewrite the comments.)
    > 
    > LGTM.
    
    +1
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        https://momjian.us
      EDB                                      https://enterprisedb.com
    
      Only you can decide what is important to you.
    
    
    
    
  82. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Peter Eisentraut <peter@eisentraut.org> — 2024-03-26T07:19:09Z

    On 26.03.24 00:43, Tom Lane wrote:
    > I wrote:
    >> Thomas Munro <thomas.munro@gmail.com> writes:
    >>> I had originally proposed to avoid anything beginning "LD_" but Tom
    >>> suggested being more specific.  I doubt LD_PRELOAD can really hurt you
    >>> though (the linker probably only needs the value at the start by
    >>> definition, not at later dlopen() time (?)).
    > 
    >> Oh, good point.  So we could simplify the patch by only looking for
    >> LD_LIBRARY_PATH.
    > 
    > I looked at the musl source code you identified and confirmed that
    > only the LD_LIBRARY_PATH string is remembered in a static variable;
    > LD_PRELOAD is only accessed locally in that initialization function.
    > So we only need to do the attached.  (I failed to resist the
    > temptation to rewrite the comments.)
    
    Yeah, I was more looking for a comment for posterity for *why* we need 
    to preserve this variable in particular.  The updated comment looks 
    reasonable.
    
    
    
    
    
  83. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-26T07:36:40Z

    Bruce Momjian:
    > You want to risk destabilizing Postgres by backpatching something this
    > complex so the regression tests can be run on all stable branches?  I
    > think you are overestimating our desire to take on risk.
    
    I specifically wrote about backpatching the first (and maybe second) 
    patch. None of that is risky.
    
    Patches 3-5 were not meant for backpatching at all.
    
    Best,
    
    Wolfgang
    
    
    
    
  84. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-26T07:43:54Z

    Tom Lane:
    > If we want a buildfarm animal testing this platform, we kind of need
    > to support it on all branches.  Having said that, I agree with you
    > that we are looking for a minimalist fix not a maximalist one.
    > I think the 0001 patch is about right, but the rest seem to be solving
    > problems we don't have.
    
    The second patch potentially solves the problem of PS_USE_NONE not being 
    tested. Of course you could also set up a buildfarm animal on some other 
    platform, which is sure to fall through to PS_USE_NONE, but that seems 
    to have not worked in the past:
    
    Thomas Munro:
    > I dimly recall that it turned out that PS_USE_NONE was
    > actually broken for a while without anyone noticing
    Best,
    
    Wolfgang
    
    
    
    
  85. Building with musl in CI and the build farm

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-26T08:22:28Z

    The need to do $subject came up in [1]. Moving this to a separate 
    discussion on -hackers, because there are more issues to solve than just 
    the LD_LIBRARY_PATH problem.
    
    Andres Freund:
    > FWIW, except for one small issue, building postgres against musl works on
    > debian and the tests pass if I install first.
    > 
    > 
    > The small problem mentioned above is that on debian linux/fs.h isn't available
    > when building with musl, which in turn causes src/bin/pg_upgrade/file.c to
    > fail to compile.  I assume that's not the case on "fully musl" distro?
    
    Correct, I have not seen this before on Alpine.
    
    Here is my progress setting up a buildfarm animal to run on Alpine Linux 
    and the issues I found, so far:
    
    The animal runs in a docker container via GitHub Actions in [2]. Right 
    now it's still running with --test, until I get the credentials to 
    activate it.
    
    I tried to enable everything (except systemd, because Alpine doesn't 
    have it) and run all tests. The LDAP tests are failing right now, but 
    that is likely something that I need to fix in the Dockerfile - it's 
    failing to start the slapd, IIRC. There are other issues, though - all 
    of them have open pull requests in that repo [3].
    
    I also had to skip the recovery check. Andrew mentioned that he had to 
    do that, too, when he was still running his animal on Alpine. Not sure 
    what this is about, yet.
    
    Building --with-icu fails two tests. One of them (001_initdb) is fixed 
    by having the "locale" command in your PATH, which is not the case on 
    Alpine by default. I assume this will not break on your debian/musl 
    build, Andres - but it will also probably not return any sane values, 
    because it will run glibc's locale command.
    I haven't looked into that in detail, yet, but I think the other test 
    (icu/010_database) fails because it expects that setlocale(LC_COLLATE, 
    <illegal_value>) throws an error. I think it doesn't do that on musl, 
    because LC_COLLATE is not implemented.
    Those failing tests are not "just failing", but probably mean that we 
    need to do something about how we deal with locale/setlocale on musl.
    
    The last failure is about building --with-nls. This fails with something 
    like:
    
    ld: src/port/strerror.c:72:(.text+0x2d8): undefined reference to 
    `libintl_gettext'
    
    Of course, gettext-dev is installed, libintl.so is available in /usr/lib 
    and it also contains the symbol. So not sure what's happening here.
    
    Andres, did you build --with-icu and/or --with-nls on debian/musl? Did 
    you run the recovery tests?
    
    Best,
    
    Wolfgang
    
    [1]: 
    https://postgr.es/m/fddd1cd6-dc16-40a2-9eb5-d7fef2101488%40technowledgy.de
    [2]: 
    https://github.com/technowledgy/postgresql-buildfarm-alpine/actions/workflows/run.yaml
    [3]: https://github.com/technowledgy/postgresql-buildfarm-alpine/pulls
    
    
    
    
  86. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Tom Lane <tgl@sss.pgh.pa.us> — 2024-03-26T14:35:43Z

    walther@technowledgy.de writes:
    > The second patch potentially solves the problem of PS_USE_NONE not being 
    > tested. Of course you could also set up a buildfarm animal on some other 
    > platform, which is sure to fall through to PS_USE_NONE, but that seems 
    > to have not worked in the past:
    
    > Thomas Munro:
    >> I dimly recall that it turned out that PS_USE_NONE was
    >> actually broken for a while without anyone noticing
    
    I think what Thomas is recollecting is this:
    
    Author: Tom Lane <tgl@sss.pgh.pa.us>
    Branch: master Release: REL_15_BR [0fb6954aa] 2022-03-27 12:57:46 -0400
    Branch: REL_14_STABLE Release: REL_14_3 [3f7a59c59] 2022-03-27 12:57:52 -0400
    Branch: REL_13_STABLE Release: REL_13_7 [9016a2a3d] 2022-03-27 12:57:57 -0400
    
        Fix breakage of get_ps_display() in the PS_USE_NONE case.
        
        Commit 8c6d30f21 caused this function to fail to set *displen
        in the PS_USE_NONE code path.  If the variable's previous value
        had been negative, that'd lead to a memory clobber at some call
        sites.  We'd managed not to notice due to very thin test coverage
        of such configurations, but this appears to explain buildfarm member
        lorikeet's recent struggles.
        
        Credit to Andrew Dunstan for spotting the problem.  Back-patch
        to v13 where the bug was introduced.
        
        Discussion: https://postgr.es/m/136102.1648320427@sss.pgh.pa.us
    
    
    The problem wasn't lack of coverage, it was that the failure was
    intermittent and erratic enough to be very hard to diagnose.
    I think that's more bad luck than anything else.
    
    			regards, tom lane
    
    
    
    
  87. Re: Regression tests fail with musl libc because libpq.so can't be loaded

    Tom Lane <tgl@sss.pgh.pa.us> — 2024-03-26T15:45:55Z

    Peter Eisentraut <peter@eisentraut.org> writes:
    > On 26.03.24 00:43, Tom Lane wrote:
    >> I looked at the musl source code you identified and confirmed that
    >> only the LD_LIBRARY_PATH string is remembered in a static variable;
    >> LD_PRELOAD is only accessed locally in that initialization function.
    >> So we only need to do the attached.  (I failed to resist the
    >> temptation to rewrite the comments.)
    
    > Yeah, I was more looking for a comment for posterity for *why* we need 
    > to preserve this variable in particular.  The updated comment looks 
    > reasonable.
    
    OK, pushed after a bit more comment-fiddling.
    
    			regards, tom lane
    
    
    
    
  88. Re: Building with musl in CI and the build farm

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-30T14:05:19Z

    Here's an update on the progress to run musl (Alpine Linux) in the 
    buildfarm.
    
    Wolfgang Walther:
    > The animal runs in a docker container via GitHub Actions in [2]. Right 
    > now it's still running with --test, until I get the credentials to 
    > activate it.
    
    The animals have been activated and are reporting now. Thanks, Andrew!
    
    
    > I tried to enable everything (except systemd, because Alpine doesn't 
    > have it) and run all tests. The LDAP tests are failing right now, but 
    > that is likely something that I need to fix in the Dockerfile - it's 
    > failing to start the slapd, IIRC. There are other issues, though - all 
    > of them have open pull requests in that repo [3].
    
    ldap tests are enabled, just a missing package.
    
    
    > I also had to skip the recovery check. Andrew mentioned that he had to 
    > do that, too, when he was still running his animal on Alpine. Not sure 
    > what this is about, yet.
    
    This was about a missing init process in the docker image. Without an 
    init process reaping zombie processes, the recovery tests end up with 
    some supposed-to-be-terminated backends still running and can't start 
    them up again. Fixed by adding a minimal init process with "tinit".
    
    
    > Building --with-icu fails two tests. One of them (001_initdb) is fixed 
    > by having the "locale" command in your PATH, which is not the case on 
    > Alpine by default. I assume this will not break on your debian/musl 
    > build, Andres - but it will also probably not return any sane values, 
    > because it will run glibc's locale command.
    > I haven't looked into that in detail, yet, but I think the other test 
    > (icu/010_database) fails because it expects that setlocale(LC_COLLATE, 
    > <illegal_value>) throws an error. I think it doesn't do that on musl, 
    > because LC_COLLATE is not implemented.
    > Those failing tests are not "just failing", but probably mean that we 
    > need to do something about how we deal with locale/setlocale on musl.
    
    I still need to look into this in depth.
    
    
    > The last failure is about building --with-nls. This fails with something 
    > like:
    > 
    > ld: src/port/strerror.c:72:(.text+0x2d8): undefined reference to 
    > `libintl_gettext'
    > 
    > Of course, gettext-dev is installed, libintl.so is available in /usr/lib 
    > and it also contains the symbol. So not sure what's happening here.
    
    This is an Alpine Linux packaging issue. Theoretically, it could be made 
    to work by introducing some configure/meson flag like "--with-gettext" 
    or so, to prefer gettext's libintl over the libc-builtin. However, NixOS 
    / nixpkgs with its pkgsMusl overlay manages to solve this issue just 
    fine, builds with --enable-nls and gettext work. Thus, I conclude this 
    is best solved upstream in Alpine Linux.
    
    TLDR: The only real issue which is still open from PostgreSQL's side is 
    around locales and ICU - certainly the pain point in musl. Will look 
    into it further.
    
    Best,
    
    Wolfgang
    
    
    
    
  89. Re: Building with musl in CI and the build farm

    Wolfgang Walther <walther@technowledgy.de> — 2024-03-31T13:34:23Z

    About building one of the CI tasks with musl:
    
    Andres Freund:
    > I'd rather adapt one of the existing tasks, to avoid increasing CI costs unduly.
    
    I looked into this and I think the only task that could be changed is 
    the SanityCheck. This is because this builds without any additional 
    features enabled. I guess that makes sense, because otherwise those 
    dependencies would first have to be built with musl-gcc as well.
    
    
    > FWIW, except for one small issue, building postgres against musl works on debian and the tests pass if I install first.
    
    After the fix for LD_LIBRARY_PATH this now works as expected without 
    installing first. I confirmed it works on debian with CC=musl-gcc.
    
    
    > The small problem mentioned above is that on debian linux/fs.h isn't available
    > when building with musl, which in turn causes src/bin/pg_upgrade/file.c to
    > fail to compile.
    
    According to [1], this can be worked around by linking some folders:
    
    ln -s /usr/include/linux /usr/include/x86_64-linux-musl/
    ln -s /usr/include/asm-generic /usr/include/x86_64-linux-musl/
    ln -s /usr/include/x86_64-linux-gnu/asm /usr/include/x86_64-linux-musl/
    
    Please find a patch to use musl-gcc in SanityCheck attached. Logs from 
    the CI run are in [2]. It has this in the configure phase:
    
    [13:19:52.712] Using 'CC' from environment with value: 'ccache musl-gcc'
    [13:19:52.712] C compiler for the host machine: ccache musl-gcc (gcc 
    10.2.1 "cc (Debian 10.2.1-6) 10.2.1 20210110")
    [13:19:52.712] C linker for the host machine: musl-gcc ld.bfd 2.35.2
    [13:19:52.712] Using 'CC' from environment with value: 'ccache musl-gcc'
    
    So meson picks up musl-gcc properly. I also double checked that without 
    the links above, the build does indeed fail with the linux/fs.h error.
    
    I assume the installation of musl-tools should be done in the 
    pg-vm-images repo instead of the additional script here?
    
    Best,
    
    Wolfgang
    
    [1]: 
    https://debian-bugs-dist.debian.narkive.com/VlFkLigg/bug-789789-musl-fails-to-compile-stuff-that-depends-on-kernel-headers
    [2]: https://cirrus-ci.com/task/5741892590108672
  90. Re: Building with musl in CI and the build farm

    Thomas Munro <thomas.munro@gmail.com> — 2024-04-03T00:00:01Z

    On Wed, Mar 27, 2024 at 11:27 AM Wolfgang Walther
    <walther@technowledgy.de> wrote:
    > The animal runs in a docker container via GitHub Actions in [2].
    
    Great idea :-)
    
    
    
    
  91. Re: Building with musl in CI and the build farm

    Peter Eisentraut <peter@eisentraut.org> — 2024-04-04T13:56:25Z

    On 31.03.24 15:34, walther@technowledgy.de wrote:
    >> I'd rather adapt one of the existing tasks, to avoid increasing CI 
    >> costs unduly.
    > 
    > I looked into this and I think the only task that could be changed is 
    > the SanityCheck.
    
    I think SanityCheck should run a simple, "average" environment, like the 
    current Debian one.  Otherwise, niche problems with musl or multi-arch 
    or whatever will throw off the entire build pipeline.
    
    
    
    
  92. Re: Building with musl in CI and the build farm

    Wolfgang Walther <walther@technowledgy.de> — 2024-04-04T14:11:56Z

    Peter Eisentraut:
    > On 31.03.24 15:34, walther@technowledgy.de wrote:
    >>> I'd rather adapt one of the existing tasks, to avoid increasing CI 
    >>> costs unduly.
    >>
    >> I looked into this and I think the only task that could be changed is 
    >> the SanityCheck.
    > 
    > I think SanityCheck should run a simple, "average" environment, like the 
    > current Debian one.  Otherwise, niche problems with musl or multi-arch 
    > or whatever will throw off the entire build pipeline.
    
    All the errors/problems I have seen so far, while setting up the 
    buildfarm animal on Alpine Linux, have been way beyond what SanityCheck 
    does. Problems only appeared in the tests suites, of which sanity check 
    only runs *very* basic ones. I don't have much experience with the 
    "cross" setup, that "musl on debian" essentially is, though.
    
    All those things are certainly out of scope for CI - they are tested in 
    the build farm instead.
    
    I do agree: SanityCheck doesn't feel like the right place to put this. 
    But on the other side.. if it really fails to *build* with musl, then it 
    shouldn't make a difference whether you will be notified about that 
    immediately or later in the CI pipeline. It certainly needs the fewest 
    additional resources to put it there.
    
    I'm not sure what Andres meant with "adopting one of the existing 
    tasks". It could fit as another step into the "Linux - Debian Bullseye - 
    Autoconf" task, too. A bit similar to how the meson task build for 32 
    and 64bit. This would still not be an entirely new task like I proposed 
    initially (to run in Docker).
    
    Best,
    
    Wolfgang
    
    
    
    
  93. Re: Building with musl in CI and the build farm

    Tom Lane <tgl@sss.pgh.pa.us> — 2024-04-04T14:36:41Z

    Wolfgang Walther <walther@technowledgy.de> writes:
    > Peter Eisentraut:
    >> I think SanityCheck should run a simple, "average" environment, like the 
    >> current Debian one.  Otherwise, niche problems with musl or multi-arch 
    >> or whatever will throw off the entire build pipeline.
    
    > I do agree: SanityCheck doesn't feel like the right place to put this. 
    > But on the other side.. if it really fails to *build* with musl, then it 
    > shouldn't make a difference whether you will be notified about that 
    > immediately or later in the CI pipeline. It certainly needs the fewest 
    > additional resources to put it there.
    
    That is not the concern here.  What I think Peter is worried about,
    and certainly what I'm worried about, is that a breakage in
    SanityCheck comprehensively breaks all CI testing for all Postgres
    developers.  One buildfarm member that's failing does not halt
    progress altogether, so it's not even in the same ballpark of
    being as critical.  So I agree with Peter that SanityCheck had
    better use a very common, vanilla environment.
    
    To be blunt, I do not think we need to test musl in the CI pipeline.
    I see it as one of the niche platforms that the buildfarm exists
    to test.
    
    			regards, tom lane
    
    
    
    
  94. Re: Building with musl in CI and the build farm

    Wolfgang Walther <walther@technowledgy.de> — 2024-04-04T15:01:32Z

    Tom Lane:
    > That is not the concern here.  What I think Peter is worried about,
    > and certainly what I'm worried about, is that a breakage in
    > SanityCheck comprehensively breaks all CI testing for all Postgres
    > developers.
    
    You'd have to commit a failing patch first to break CI for all other 
    developers. If you're only going to commit patches that pass those CI 
    tasks, then this is not going to happen. Then it only becomes a question 
    of how much feedback *you* get from a single CI run of your own patch.
    
    > To be blunt, I do not think we need to test musl in the CI pipeline.
    > I see it as one of the niche platforms that the buildfarm exists
    > to test.
    
    I don't really have an opinion on this. I'm fine with having musl in the 
    buildfarm only. I don't expect the core build itself to fail with musl 
    anyway, this has been working fine for years. Andres asked for it to be 
    added to CI, so maybe he sees more value on top of just "building with 
    musl"?
    
    Best,
    
    Wolfgang
    
    
    
    
  95. Re: Building with musl in CI and the build farm

    Tom Lane <tgl@sss.pgh.pa.us> — 2024-04-04T15:19:11Z

    Wolfgang Walther <walther@technowledgy.de> writes:
    >> That is not the concern here.  What I think Peter is worried about,
    >> and certainly what I'm worried about, is that a breakage in
    >> SanityCheck comprehensively breaks all CI testing for all Postgres
    >> developers.
    
    > You'd have to commit a failing patch first to break CI for all other 
    > developers.
    
    No, what I'm more worried about is some change in the environment
    causing the build to start failing.  When that happens, it'd better
    be an environment that many of us are familiar with and can test/fix.
    
    			regards, tom lane
    
    
    
    
  96. Re: Building with musl in CI and the build farm

    Wolfgang Walther <walther@technowledgy.de> — 2024-04-04T19:09:30Z

    Tom Lane:
    >> You'd have to commit a failing patch first to break CI for all other
    >> developers.
    > 
    > No, what I'm more worried about is some change in the environment
    > causing the build to start failing.  When that happens, it'd better
    > be an environment that many of us are familiar with and can test/fix.
    
    The way I understand how this work is, that the images for the VMs in 
    which those CI tasks run, are not just dynamically updated - but are 
    actually tested before they are used in CI. So the environment doesn't 
    just change suddenly.
    
    See e.g. [1] for a pull request to the repo containing those images to 
    update the linux debian image from bullseye to bookworm. This is exactly 
    the image we're talking about. Before this image is used in postgres CI, 
    it's tested and confirmed that it actually works there. If one of the 
    jobs was using musl - that would be tested as well. So this job would 
    not just suddenly start failing for everybody.
    
    I do see the "familiarity" argument for the SanityCheck task, but for a 
    different reason: Even though it's unlikely for this job to fail for 
    musl specific reasons - if you're not familiar with musl and can't 
    easily test it locally, you might not be able to tell immediately 
    whether it's musl specific or not. If musl was run in one of the later 
    jobs, it's much different: You see all tests failing - alright, not musl 
    specific. You see only the musl test failing - yeah, musl problem. This 
    should give developers much more confidence looking at the results.
    
    Best,
    
    Wolfgang
    
    [1]: https://github.com/anarazel/pg-vm-images/pull/91