Thread

Re: BUG #19340: Wrong result from CORR() function

Dean Rasheed <dean.a.rasheed@gmail.com> — 2025-12-05T10:17:15Z
On Wed, 3 Dec 2025 at 22:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Poking at this, I soon found a test case where even with the separate
> sqrt() calls we'd produce a result slightly outside [-1, 1] (running
> this test over more values of x is sufficient).  So now I think we
> should do both the separate sqrt and the clamp.
>

I'm starting to have doubts about having 2 sqrt() calls. The problem
is that it seems to produce a noticeable reduction in accuracy in
quite a few cases. This is especially noticeable with fully-correlated
data. For example:

SELECT n, (SELECT corr(x, x) FROM generate_series(1, n) x)
  FROM generate_series(1, 10) g(n);

 n  |        corr
----+--------------------
  1 |
  2 | 0.9999999999999998
  3 | 0.9999999999999998
  4 | 0.9999999999999998
  5 | 0.9999999999999998
  6 |                  1
  7 | 0.9999999999999999
  8 |                  1
  9 | 0.9999999999999999
 10 |                  1
(10 rows)

Now I'm not sure that the current code can be expected to get cases
like this exactly right 100% of the time, but it's pretty close. For
example, if I do this:

WITH t1 AS (
  SELECT n, random() * 1000 AS r FROM generate_series(1, 1000000) n
), t2 AS (
  SELECT corr(r, r) FROM t1 GROUP BY n % 10000
)
SELECT count(*), count(*) FILTER (WHERE corr != 1) FROM t2;

on HEAD it produced corr = 1 every time I ran it, whereas the patch
gives rounding errors roughly 25% of the time, which seems likely to
be noticed.

Perhaps we should only use 2 sqrt()'s if the product Sxx * Syy overflows.

Regards,
Dean