Thread
-
Re: BUG #19340: Wrong result from CORR() function
Dean Rasheed <dean.a.rasheed@gmail.com> — 2025-12-03T11:38:50Z
On Wed, 3 Dec 2025 at 01:27, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Poking further at this, I found that my v2 patch fails that principle > in one case: > > regression=# SELECT corr( 0.1 , 'nan' ) FROM generate_series(1,1000) g; > corr > ------ > > (1 row) > > We see that Y is constant and therefore return NULL, despite the > other NaN input. > > I think we can fix that along these lines: > > @@ -3776,8 +3776,12 @@ float8_corr(PG_FUNCTION_ARGS) > if (N < 1.0) > PG_RETURN_NULL(); > > - /* per spec, return NULL for horizontal and vertical lines */ > - if (!isnan(commonX) || !isnan(commonY)) > + /* > + * per spec, return NULL for horizontal and vertical lines; but not if the > + * result would otherwise be NaN > + */ > + if ((!isnan(commonX) || !isnan(commonY)) && > + (!isnan(Sxx) && !isnan(Syy))) > PG_RETURN_NULL(); > > /* at this point, Sxx and Syy cannot be zero or negative */ I think that would be more readable as 2 separate "if" statements, something like: /* if any input is NaN, return NaN */ if (isnan(Sxx) || isnan(Syy)) PG_RETURN_FLOAT8(get_float8_nan()); /* per spec, return NULL for horizontal and vertical lines */ if (!isnan(commonX) || !isnan(commonY)) PG_RETURN_NULL(); so that we're more explicit about the NaN-handling semantics. Also, given that any NaN input results in NaN output, regardless of whether or not the inputs are all the same, the values for commonX and commonY don't matter if any input is NaN. I think that allows the accumulator function's tests to be simplified (no need to test if new values are NaN). Another case to consider is this: SELECT corr(1.3 + x*1e-16, 1.3 + x*1e-16) FROM generate_series(1, 3) x; Here Sxx and Syy end up being zero, so the current code returns NULL, but with the patch it returns NaN (because Sxy is also zero). It's not quite obvious what to do in this case (the correct answer is 1, but there' no way of reliably computing that with double precision arithmetic). My first thought is that we should probably try to limit NaN results to NaN inputs, and so it would be better to continue to return NULL for cases like this where either Sxx or Syy are zero, even though the inputs weren't quite constant. Then there's this: SELECT corr(1e-100 + x*1e-105, 1e-100 + x*1e-105) FROM generate_series(1, 3) x; Here Sxx, Syy, and Sxy are all non-zero (roughly 2e-210), but the product Sxx * Syy underflows to zero. So in both HEAD and with the patch, this returns Infinity. That's not good, given that the correlation coefficient is supposed to lie in the range [-1,1]. The correct answer in this case is also 1, which could be achieved by taking the square roots of Sxx and Syy separately, before multiplying, but it might also be sensible to clamp the result to the range [-1,1]. Regard, Dean