Thread

  1. Re: BUG #19340: Wrong result from CORR() function

    Tom Lane <tgl@sss.pgh.pa.us> — 2025-12-03T16:33:36Z

    Dean Rasheed <dean.a.rasheed@gmail.com> writes:
    > On Wed, 3 Dec 2025 at 01:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >> Poking further at this, I found that my v2 patch fails that principle
    >> in one case:
    >> 
    >> regression=# SELECT corr( 0.1 , 'nan' ) FROM generate_series(1,1000) g;
    >> corr
    >> ------
    >> 
    >> (1 row)
    >> 
    >> We see that Y is constant and therefore return NULL, despite the
    >> other NaN input.
    
    > I think that would be more readable as 2 separate "if" statements,
    
    Actually, in the light of morning I think this behavior is probably
    fine.  This example treads on two different edge-cases, one where
    we're supposed to return NULL and one where we're supposed to return
    NaN, and it's not that open-and-shut which rule should win.  A
    counterexample here is float8_covar_samp: that returns NULL if there
    are less than two input values, and will still do so if there is one
    input that is NaN.  Should we change that?  I don't think so.
    The fact that HEAD returns NULL for this corr() case as long as
    there is not enough input to create a roundoff problem also suggests
    to me that that's the behavior to stick with.
    
    So I'm now thinking that the NULL-output rule should win in cases
    where they are both applicable.  However, I don't know if we want
    to take that so far as to say that constant-NaN input should
    produce a NULL; the argument that NaN isn't really a constant
    still has some force in my mind.  What do you think?
    
    > Another case to consider is this:
    
    > SELECT corr(1.3 + x*1e-16, 1.3 + x*1e-16) FROM generate_series(1, 3) x;
    
    > Here Sxx and Syy end up being zero, so the current code returns NULL,
    > but with the patch it returns NaN (because Sxy is also zero). It's not
    > quite obvious what to do in this case (the correct answer is 1, but
    > there' no way of reliably computing that with double precision
    > arithmetic). My first thought is that we should probably try to limit
    > NaN results to NaN inputs, and so it would be better to continue to
    > return NULL for cases like this where either Sxx or Syy are zero, even
    > though the inputs weren't quite constant.
    
    Good example.  I had wondered whether to retain the final test for zero
    Sxx/Syy, and this shows that we do need that (along with a comment
    explaining that roundoff error could produce zeroes even though we
    know the inputs weren't constant).
    
    > Then there's this:
    
    > SELECT corr(1e-100 + x*1e-105, 1e-100 + x*1e-105) FROM generate_series(1, 3) x;
    
    > Here Sxx, Syy, and Sxy are all non-zero (roughly 2e-210), but the
    > product Sxx * Syy underflows to zero. So in both HEAD and with the
    > patch, this returns Infinity. That's not good, given that the
    > correlation coefficient is supposed to lie in the range [-1,1].
    
    > The correct answer in this case is also 1, which could be achieved by
    > taking the square roots of Sxx and Syy separately, before multiplying,
    > but it might also be sensible to clamp the result to the range [-1,1].
    
    I like the idea of taking the square roots separately.  I believe
    sqrt() is a hardware operation on pretty much any machine people still
    care about, and we're doing this part only once per aggregation.
    So the cost seems fairly minimal.
    
    			regards, tom lane