Thread
-
Re: BUG #19340: Wrong result from CORR() function
Tom Lane <tgl@sss.pgh.pa.us> — 2025-12-03T16:33:36Z
Dean Rasheed <dean.a.rasheed@gmail.com> writes: > On Wed, 3 Dec 2025 at 01:27, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Poking further at this, I found that my v2 patch fails that principle >> in one case: >> >> regression=# SELECT corr( 0.1 , 'nan' ) FROM generate_series(1,1000) g; >> corr >> ------ >> >> (1 row) >> >> We see that Y is constant and therefore return NULL, despite the >> other NaN input. > I think that would be more readable as 2 separate "if" statements, Actually, in the light of morning I think this behavior is probably fine. This example treads on two different edge-cases, one where we're supposed to return NULL and one where we're supposed to return NaN, and it's not that open-and-shut which rule should win. A counterexample here is float8_covar_samp: that returns NULL if there are less than two input values, and will still do so if there is one input that is NaN. Should we change that? I don't think so. The fact that HEAD returns NULL for this corr() case as long as there is not enough input to create a roundoff problem also suggests to me that that's the behavior to stick with. So I'm now thinking that the NULL-output rule should win in cases where they are both applicable. However, I don't know if we want to take that so far as to say that constant-NaN input should produce a NULL; the argument that NaN isn't really a constant still has some force in my mind. What do you think? > Another case to consider is this: > SELECT corr(1.3 + x*1e-16, 1.3 + x*1e-16) FROM generate_series(1, 3) x; > Here Sxx and Syy end up being zero, so the current code returns NULL, > but with the patch it returns NaN (because Sxy is also zero). It's not > quite obvious what to do in this case (the correct answer is 1, but > there' no way of reliably computing that with double precision > arithmetic). My first thought is that we should probably try to limit > NaN results to NaN inputs, and so it would be better to continue to > return NULL for cases like this where either Sxx or Syy are zero, even > though the inputs weren't quite constant. Good example. I had wondered whether to retain the final test for zero Sxx/Syy, and this shows that we do need that (along with a comment explaining that roundoff error could produce zeroes even though we know the inputs weren't constant). > Then there's this: > SELECT corr(1e-100 + x*1e-105, 1e-100 + x*1e-105) FROM generate_series(1, 3) x; > Here Sxx, Syy, and Sxy are all non-zero (roughly 2e-210), but the > product Sxx * Syy underflows to zero. So in both HEAD and with the > patch, this returns Infinity. That's not good, given that the > correlation coefficient is supposed to lie in the range [-1,1]. > The correct answer in this case is also 1, which could be achieved by > taking the square roots of Sxx and Syy separately, before multiplying, > but it might also be sensible to clamp the result to the range [-1,1]. I like the idea of taking the square roots separately. I believe sqrt() is a hardware operation on pretty much any machine people still care about, and we're doing this part only once per aggregation. So the cost seems fairly minimal. regards, tom lane