Thread

Re: BUG #19340: Wrong result from CORR() function

Tom Lane <tgl@sss.pgh.pa.us> — 2025-12-03T16:33:36Z
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On Wed, 3 Dec 2025 at 01:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Poking further at this, I found that my v2 patch fails that principle
>> in one case:
>> 
>> regression=# SELECT corr( 0.1 , 'nan' ) FROM generate_series(1,1000) g;
>> corr
>> ------
>> 
>> (1 row)
>> 
>> We see that Y is constant and therefore return NULL, despite the
>> other NaN input.

> I think that would be more readable as 2 separate "if" statements,

Actually, in the light of morning I think this behavior is probably
fine.  This example treads on two different edge-cases, one where
we're supposed to return NULL and one where we're supposed to return
NaN, and it's not that open-and-shut which rule should win.  A
counterexample here is float8_covar_samp: that returns NULL if there
are less than two input values, and will still do so if there is one
input that is NaN.  Should we change that?  I don't think so.
The fact that HEAD returns NULL for this corr() case as long as
there is not enough input to create a roundoff problem also suggests
to me that that's the behavior to stick with.

So I'm now thinking that the NULL-output rule should win in cases
where they are both applicable.  However, I don't know if we want
to take that so far as to say that constant-NaN input should
produce a NULL; the argument that NaN isn't really a constant
still has some force in my mind.  What do you think?

> Another case to consider is this:

> SELECT corr(1.3 + x*1e-16, 1.3 + x*1e-16) FROM generate_series(1, 3) x;

> Here Sxx and Syy end up being zero, so the current code returns NULL,
> but with the patch it returns NaN (because Sxy is also zero). It's not
> quite obvious what to do in this case (the correct answer is 1, but
> there' no way of reliably computing that with double precision
> arithmetic). My first thought is that we should probably try to limit
> NaN results to NaN inputs, and so it would be better to continue to
> return NULL for cases like this where either Sxx or Syy are zero, even
> though the inputs weren't quite constant.

Good example.  I had wondered whether to retain the final test for zero
Sxx/Syy, and this shows that we do need that (along with a comment
explaining that roundoff error could produce zeroes even though we
know the inputs weren't constant).

> Then there's this:

> SELECT corr(1e-100 + x*1e-105, 1e-100 + x*1e-105) FROM generate_series(1, 3) x;

> Here Sxx, Syy, and Sxy are all non-zero (roughly 2e-210), but the
> product Sxx * Syy underflows to zero. So in both HEAD and with the
> patch, this returns Infinity. That's not good, given that the
> correlation coefficient is supposed to lie in the range [-1,1].

> The correct answer in this case is also 1, which could be achieved by
> taking the square roots of Sxx and Syy separately, before multiplying,
> but it might also be sensible to clamp the result to the range [-1,1].

I like the idea of taking the square roots separately.  I believe
sqrt() is a hardware operation on pretty much any machine people still
care about, and we're doing this part only once per aggregation.
So the cost seems fairly minimal.

			regards, tom lane