Thread

Re: BUG #19340: Wrong result from CORR() function

Dean Rasheed <dean.a.rasheed@gmail.com> — 2025-12-03T11:38:50Z
On Wed, 3 Dec 2025 at 01:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Poking further at this, I found that my v2 patch fails that principle
> in one case:
>
> regression=# SELECT corr( 0.1 , 'nan' ) FROM generate_series(1,1000) g;
>  corr
> ------
>
> (1 row)
>
> We see that Y is constant and therefore return NULL, despite the
> other NaN input.
>
> I think we can fix that along these lines:
>
> @@ -3776,8 +3776,12 @@ float8_corr(PG_FUNCTION_ARGS)
>         if (N < 1.0)
>                 PG_RETURN_NULL();
>
> -       /* per spec, return NULL for horizontal and vertical lines */
> -       if (!isnan(commonX) || !isnan(commonY))
> +       /*
> +        * per spec, return NULL for horizontal and vertical lines; but not if the
> +        * result would otherwise be NaN
> +        */
> +       if ((!isnan(commonX) || !isnan(commonY)) &&
> +               (!isnan(Sxx) && !isnan(Syy)))
>                 PG_RETURN_NULL();
>
>         /* at this point, Sxx and Syy cannot be zero or negative */

I think that would be more readable as 2 separate "if" statements,
something like:

    /* if any input is NaN, return NaN */
    if (isnan(Sxx) || isnan(Syy))
        PG_RETURN_FLOAT8(get_float8_nan());

    /* per spec, return NULL for horizontal and vertical lines */
    if (!isnan(commonX) || !isnan(commonY))
        PG_RETURN_NULL();

so that we're more explicit about the NaN-handling semantics.

Also, given that any NaN input results in NaN output, regardless of
whether or not the inputs are all the same, the values for commonX and
commonY don't matter if any input is NaN. I think that allows the
accumulator function's tests to be simplified (no need to test if new
values are NaN).


Another case to consider is this:

SELECT corr(1.3 + x*1e-16, 1.3 + x*1e-16) FROM generate_series(1, 3) x;

Here Sxx and Syy end up being zero, so the current code returns NULL,
but with the patch it returns NaN (because Sxy is also zero). It's not
quite obvious what to do in this case (the correct answer is 1, but
there' no way of reliably computing that with double precision
arithmetic). My first thought is that we should probably try to limit
NaN results to NaN inputs, and so it would be better to continue to
return NULL for cases like this where either Sxx or Syy are zero, even
though the inputs weren't quite constant.


Then there's this:

SELECT corr(1e-100 + x*1e-105, 1e-100 + x*1e-105) FROM generate_series(1, 3) x;

Here Sxx, Syy, and Sxy are all non-zero (roughly 2e-210), but the
product Sxx * Syy underflows to zero. So in both HEAD and with the
patch, this returns Infinity. That's not good, given that the
correlation coefficient is supposed to lie in the range [-1,1].

The correct answer in this case is also 1, which could be achieved by
taking the square roots of Sxx and Syy separately, before multiplying,
but it might also be sensible to clamp the result to the range [-1,1].

Regard,
Dean