Thread

Re: GNU/Hurd portability patches

Alexander Lakhin <exclusion@gmail.com> — 2025-11-10T21:00:01Z
10.11.2025 22:03, Thomas Munro wrote:
> On Tue, Nov 11, 2025 at 8:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>> With this modification:
>> @@ -137,7 +140,7 @@ pqsignal(int signo, pqsigfunc func)
>>
>>   #if !(defined(WIN32) && defined(FRONTEND))
>>          act.sa_handler = func;
>> -       sigemptyset(&act.sa_mask);
>> +       sigfillset(&act.sa_mask);
>>          act.sa_flags = SA_RESTART;
>>
>> I got 100 iterations to pass (12 of them hung) without that Assert
>> being triggered.
> Interesting.  Perhaps a minimal program that installs a handler doing
> assert(signo < 32) for both SIGUSR1 and SIGUSR2 might fail too, if
> another program loops calling kill(the_other_one, rand() % 2 == 0 ?
> SIGUSR1 : SIGUSR2), to support a bug report?

Yeah, thank you for the idea! I will try it in the coming days.
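
A minimal sketch of what such a reproducer might look like (not yet run
on Hurd). The names run_signal_stress() and on_signal() are made up for
illustration, and a forked child stands in for the "other program" as a
simplification. Instead of asserting inside the handler, it records any
signal number other than SIGUSR1/SIGUSR2 and reports it afterwards;
block_all switches between sigemptyset() and sigfillset() as in the diff
above:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Set by the handler if it is ever called with an unexpected number. */
static volatile sig_atomic_t bad_signo = 0;

static void
on_signal(int signo)
{
	if (signo != SIGUSR1 && signo != SIGUSR2)
		bad_signo = signo;
}

/*
 * Hypothetical helper: installs on_signal() for SIGUSR1 and SIGUSR2, then
 * lets a forked child send nsignals randomly chosen signals to the parent.
 * Returns 0 if every delivery carried a sane signal number, the offending
 * number otherwise, or -1 on a setup failure.
 */
static int
run_signal_stress(int nsignals, int block_all)
{
	struct sigaction act = {0};
	pid_t		parent = getpid();
	pid_t		child;
	int			status;

	assert(nsignals > 0);

	act.sa_handler = on_signal;
	if (block_all)
		sigfillset(&act.sa_mask);	/* the modified behaviour */
	else
		sigemptyset(&act.sa_mask);	/* the unmodified behaviour */
	act.sa_flags = SA_RESTART;
	if (sigaction(SIGUSR1, &act, NULL) != 0 ||
		sigaction(SIGUSR2, &act, NULL) != 0)
		return -1;

	child = fork();
	if (child < 0)
		return -1;
	if (child == 0)
	{
		for (int i = 0; i < nsignals; i++)
			kill(parent, rand() % 2 == 0 ? SIGUSR1 : SIGUSR2);
		_exit(0);
	}

	/* Retry if the wait is interrupted by one of the incoming signals. */
	while (waitpid(child, &status, 0) < 0)
	{
		if (errno != EINTR)
			return -1;
	}
	return bad_signo;
}
```

Calling run_signal_stress(100000, 0) in a loop from main() and printing
any non-zero result would be one way to drive it; on a system with a
correct signal implementation it should always return 0.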

>> [lots of weird errors in a wide range of code]
> I can't make much sense of these failures, but are you saying that
> these only happen without that sigfillset(&act.sa_mask) change, that
> is, when the signal implementation is misbehaving?  If so, I wonder if
> the same bug in their signal handling might just be corrupting the
> user stack sometimes even when the signal number assertion doesn't
> trip.

No, I think those failures are unrelated. I hit them simply because I
ran `make check` many times, and some of them definitely occurred with
the unmodified code. Now that I have a script that handles OS hangs and
restores the VM's disk automatically, I can run the tests for hours and
look for one failure or another, if that would be helpful.

>> On the assumption that this isn't a general bug, but just a timing issue
>> (planning 'SELECT 1' isn't complicated), I see two possibilities:
>>
>> 1. Ignore the plan times, and replace SELECT 1 with SELECT
>> pg_sleep(1e-6), similar to e849bd551. I guess this would reduce test
>> coverage, so it is likely not great?
>>
>> 2. Make the query a bit more complicated so that the plan time is likely
>> to be non-negligible. I actually had to go quite a way to make it pretty
>> failsafe; the attached made it fail less than 5 times out of 50000
>> iterations. Not sure whether that is acceptable or still considered
>> flaky?
> Wait, we have tests that fail if the clock doesn't advance?  Isn't
> that just bogus?

Yeah, we do; this was discussed (and one test was hardened) upthread.

>> What concerns me is that there is also subscription.sql, and there may
>> be other test(s) that expect at least 1000ns (far from infinite) timer
>> resolution. Probably it would make sense to define which timer resolution
>> we consider acceptable for tests and then to check if Hurd can provide it.
> Ah, I see, so that one is checking if the last reset time advanced to
> check that something happened.  That also has the theoretical problem
> that CLOCK_REALTIME can go backwards sometimes, due to ntpd
> adjustments or whatever.  In the absence of a "reset_counter" column,
> perhaps we could consider a kludge like x->reset_time =
> Max(x->reset_time + 1ns, now), just to make sure the value always goes
> up on reset, without having any noticeable effect on normal systems...
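
For reference, the kludge quoted above could be sketched roughly as
follows. This is only an illustration: ts_ns and next_reset_time() are
hypothetical names, and the timestamp is assumed to be an integer count
of nanoseconds as in the quoted formula, which is not necessarily how
the real reset_time field is stored:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical timestamp type: integer nanoseconds since some epoch. */
typedef int64_t ts_ns;

/*
 * Returns the value to store as the new reset time: the current time if
 * the clock has advanced past the previous reset time, otherwise the
 * previous reset time plus one unit, so that the stored value strictly
 * increases on every reset even if the clock stalls or steps backwards.
 */
static ts_ns
next_reset_time(ts_ns prev_reset_time, ts_ns now)
{
	ts_ns		bumped;

	assert(prev_reset_time < INT64_MAX);	/* avoid signed overflow */
	bumped = prev_reset_time + 1;
	return bumped > now ? bumped : now;
}
```

With a normally advancing clock, now is always the larger value, so the
bump only takes effect when two resets land on the same clock reading or
the clock goes backwards.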

AFAICS, those test cases use pg_clock_gettime_ns() with CLOCK_MONOTONIC
(if defined, and it is indeed defined on Hurd), so that should not
matter in this particular case.
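
To check what resolution a platform actually delivers, something along
these lines might be enough. It is a rough sketch, not something I have
run on Hurd; monotonic_ns() and min_observed_step_ns() are made-up
helper names. It reads CLOCK_MONOTONIC in a tight loop and reports the
smallest forward step seen between consecutive readings:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Current CLOCK_MONOTONIC reading in nanoseconds. */
static int64_t
monotonic_ns(void)
{
	struct timespec ts;
	int			rc = clock_gettime(CLOCK_MONOTONIC, &ts);

	assert(rc == 0);
	(void) rc;					/* silence unused warning under NDEBUG */
	return (int64_t) ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/*
 * Spins until the clock has been seen to move forward `ticks` times and
 * returns the smallest such step in nanoseconds.  Note that this loops
 * forever if the clock never advances.
 */
static int64_t
min_observed_step_ns(int ticks)
{
	int64_t		min_step = INT64_MAX;
	int64_t		prev = monotonic_ns();
	int			seen = 0;

	assert(ticks > 0);
	while (seen < ticks)
	{
		int64_t		cur = monotonic_ns();

		if (cur > prev)
		{
			if (cur - prev < min_step)
				min_step = cur - prev;
			seen++;
		}
		prev = cur;
	}
	return min_step;
}
```

If min_observed_step_ns(1000) comes out well above 1000 on a given
system, tests that expect the clock to advance within a microsecond
would be expected to fail there from time to time.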

Best regards,
Alexander