Thread

  1. Re: BUG #19340: Wrong result from CORR() function

    Dean Rasheed <dean.a.rasheed@gmail.com> — 2025-12-03T11:38:50Z

    On Wed, 3 Dec 2025 at 01:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >
    > Poking further at this, I found that my v2 patch fails that principle
    > in one case:
    >
    > regression=# SELECT corr( 0.1 , 'nan' ) FROM generate_series(1,1000) g;
    >  corr
    > ------
    >
    > (1 row)
    >
    > We see that Y is constant and therefore return NULL, despite the
    > other NaN input.
    >
    > I think we can fix that along these lines:
    >
    > @@ -3776,8 +3776,12 @@ float8_corr(PG_FUNCTION_ARGS)
    >         if (N < 1.0)
    >                 PG_RETURN_NULL();
    >
    > -       /* per spec, return NULL for horizontal and vertical lines */
    > -       if (!isnan(commonX) || !isnan(commonY))
    > +       /*
    > +        * per spec, return NULL for horizontal and vertical lines; but not if the
    > +        * result would otherwise be NaN
    > +        */
    > +       if ((!isnan(commonX) || !isnan(commonY)) &&
    > +               (!isnan(Sxx) && !isnan(Syy)))
    >                 PG_RETURN_NULL();
    >
    >         /* at this point, Sxx and Syy cannot be zero or negative */
    
    I think that would be more readable as 2 separate "if" statements,
    something like:
    
        /* if any input is NaN, return NaN */
        if (isnan(Sxx) || isnan(Syy))
            PG_RETURN_FLOAT8(get_float8_nan());
    
        /* per spec, return NULL for horizontal and vertical lines */
        if (!isnan(commonX) || !isnan(commonY))
            PG_RETURN_NULL();
    
    so that we're more explicit about the NaN-handling semantics.
    
    Also, given that any NaN input results in NaN output, regardless of
    whether or not the inputs are all the same, the values for commonX and
    commonY don't matter if any input is NaN. I think that allows the
    accumulator function's tests to be simplified (no need to test if new
    values are NaN).
    
    
    Another case to consider is this:
    
    SELECT corr(1.3 + x*1e-16, 1.3 + x*1e-16) FROM generate_series(1, 3) x;
    
    Here Sxx and Syy end up being zero, so the current code returns NULL,
    but with the patch it returns NaN (because Sxy is also zero). It's not
    quite obvious what to do in this case (the correct answer is 1, but
    there' no way of reliably computing that with double precision
    arithmetic). My first thought is that we should probably try to limit
    NaN results to NaN inputs, and so it would be better to continue to
    return NULL for cases like this where either Sxx or Syy are zero, even
    though the inputs weren't quite constant.
    
    
    Then there's this:
    
    SELECT corr(1e-100 + x*1e-105, 1e-100 + x*1e-105) FROM generate_series(1, 3) x;
    
    Here Sxx, Syy, and Sxy are all non-zero (roughly 2e-210), but the
    product Sxx * Syy underflows to zero. So in both HEAD and with the
    patch, this returns Infinity. That's not good, given that the
    correlation coefficient is supposed to lie in the range [-1,1].
    
    The correct answer in this case is also 1, which could be achieved by
    taking the square roots of Sxx and Syy separately, before multiplying,
    but it might also be sensible to clamp the result to the range [-1,1].
    
    Regard,
    Dean