Thread

  1. Re: GNU/Hurd portability patches

    Alexander Lakhin <exclusion@gmail.com> — 2025-11-10T21:00:01Z

    10.11.2025 22:03, Thomas Munro wrote:
    > On Tue, Nov 11, 2025 at 8:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
    >> With this modification:
    >> @@ -137,7 +140,7 @@ pqsignal(int signo, pqsigfunc func)
    >>
    >>   #if !(defined(WIN32) && defined(FRONTEND))
    >>          act.sa_handler = func;
    >> -       sigemptyset(&act.sa_mask);
    >> +       sigfillset(&act.sa_mask);
    >>          act.sa_flags = SA_RESTART;
    >>
    >> I got 100 iterations passed (12 of them hung) without that Assert
    >> being triggered.
    > Interesting.  Perhaps a minimal program that installs a handler
    > assert(signo < 32) for both SIGUSR1 and SIGUSR2 might fail too, if
    > another program loops calling kill(the_other_one, rand() % 2 == 0 ?
    > SIGUSR1 : SIGUSR2), to support a bug report?
    
    Yeah, thank you for the idea! I will try it in the coming days.
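
A minimal sketch of the kind of reproducer suggested above, not a program from this thread: a child installs one handler for SIGUSR1 and SIGUSR2 with the same sigaction() settings as the quoted pqsignal() hunk (empty sa_mask, SA_RESTART), and the parent floods it with randomly chosen signals. The names run_signal_storm, install_handler and check_signo are hypothetical. The handler calls abort() rather than assert() because assert() is not async-signal-safe, and it checks for the two expected signals instead of signo < 32, which is a slightly stricter variant of the proposed check.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Abort if we are handed a signal number that was never registered. */
static void
check_signo(int signo)
{
    if (signo != SIGUSR1 && signo != SIGUSR2)
        abort();
}

/* Same settings as the unmodified pqsignal() hunk: empty mask, SA_RESTART. */
static int
install_handler(int signo)
{
    struct sigaction act;

    memset(&act, 0, sizeof(act));
    act.sa_handler = check_signo;
    sigemptyset(&act.sa_mask);
    act.sa_flags = SA_RESTART;
    return sigaction(signo, &act, NULL);
}

/*
 * Returns 0 if the child survived the storm, 1 if its handler aborted
 * (bogus signal number seen), -1 on a setup error.
 */
int
run_signal_storm(int iterations)
{
    int     ready[2];
    int     status;
    char    byte = 0;
    pid_t   child;

    if (pipe(ready) != 0)
        return -1;
    child = fork();
    if (child < 0)
        return -1;

    if (child == 0)
    {
        /* Child: install handlers, report readiness, then just wait. */
        close(ready[0]);
        if (install_handler(SIGUSR1) != 0 || install_handler(SIGUSR2) != 0)
            _exit(2);
        if (write(ready[1], &byte, 1) != 1)
            _exit(2);
        for (;;)
            pause();
    }

    /* Parent: wait until the handlers are in place before signalling. */
    close(ready[1]);
    if (read(ready[0], &byte, 1) != 1)
    {
        close(ready[0]);
        kill(child, SIGKILL);
        waitpid(child, &status, 0);
        return -1;
    }
    close(ready[0]);

    for (int i = 0; i < iterations; i++)
        kill(child, rand() % 2 == 0 ? SIGUSR1 : SIGUSR2);

    /* SIGKILL has no effect if the child already died from abort(). */
    kill(child, SIGKILL);
    if (waitpid(child, &status, 0) != child)
        return -1;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
        return 0;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT)
        return 1;
    return -1;
}
```

Calling run_signal_storm() from main() with a large iteration count, and treating a return value of 1 as a reproduction, would be one way to drive it. On a system whose signal delivery behaves correctly it should return 0.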
    
    >> [lots of weird errors in a wide range of code]
    > I can't make much sense of these failures, but are you saying that
    > these only happen without that sigfillset(&act.sa_mask) change, that
    > is, when the signal implementation is misbehaving?  If so, I wonder if
    > the same bug in their signal handling might just be corrupting the
    > user stack sometimes even when the signal number assertion doesn't
    > trip.
    
    No, I think those failures are unrelated: I hit them only because I
    ran `make check` many times, and some of them definitely occurred with
    the unmodified code. Now that I have a script that handles OS hangs
    and restores the VM's disk automatically, I can run tests for hours and
    look for one failure or another if that would be helpful.
    
    >> On the assumption that this isn't a general bug, but just a timing issue
    >> (planning 'SELECT 1' isn't complicated), I see two possibilities:
    >>
    >> 1. Ignore the plan times, and replace SELECT 1 with SELECT
    >> pg_sleep(1e-6), similar to e849bd551. I guess this would reduce test
    >> coverage so likely not be great?
    >>
    >> 2. Make the query a bit more complicated so that the plan time is likely
    >> to be non-negligible. I actually had to go quite a way to make it pretty
    >> failsafe, the attached made it fail less than 5 times out of 50000
    >> iterations, not sure whether that is acceptable or still considered
    >> flaky?
    > Wait, we have tests that fail if the clock doesn't advance?  Isn't
    > that just bogus?
    
    Yes, we do; this was discussed upthread (and one test was hardened).
    
    >> What concerns me is that there is also subscription.sql and maybe could
    >> be other test(s) that expect at least 1000ns (far from infinite) timer
    >> resolution. Probably it would make sense to define which timer resolution
    >> we consider acceptable for tests and then to check if Hurd can provide it.
    > Ah, I see, so that one is checking if the last reset time advanced to
    > check that something happened.  That also has the theoretical problem
    > that CLOCK_REALTIME can go backwards sometimes, due to ntpd
    > adjustments or whatever.  In the absence of a "reset_counter" column,
    > perhaps we could consider a kludge like x->reset_time =
    > Max(x->reset_time + 1ns, now), just to make sure the value always goes
    > up on reset, without having any noticeable effect on normal systems...
    
    AFAICS, those test cases use pg_clock_gettime_ns() with CLOCK_MONOTONIC
    (when it is defined, and it is indeed defined on Hurd), so it should not
    matter in this particular case.
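
For reference, the kludge quoted above, x->reset_time = Max(x->reset_time + 1ns, now), could be sketched roughly as follows. This is only an illustration under assumptions: StatsEntry and record_reset are hypothetical names, and the nanosecond unit is taken from the quoted expression rather than from any actual PostgreSQL struct. With a monotonic clock the "goes backwards" case should not arise, but the same clamp also covers a coarse clock that returns the same value twice.

```c
#include <stdint.h>

/* Hypothetical stand-in for a stats entry; the real struct is not shown here. */
typedef struct StatsEntry
{
    int64_t reset_time_ns;  /* time of the last reset, in ns (assumed unit) */
} StatsEntry;

/*
 * Record a reset at time now_ns. The stored value always advances by at
 * least 1 ns, even if the clock stood still or stepped backwards.
 */
void
record_reset(StatsEntry *entry, int64_t now_ns)
{
    int64_t floor_ns = entry->reset_time_ns + 1;

    entry->reset_time_ns = (now_ns > floor_ns) ? now_ns : floor_ns;
}
```

On a system where the clock advances between resets, the clamp never kicks in and the stored value is simply the current time.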
    
    Best regards,
    Alexander