Re: BUG #19340: Wrong result from CORR() function
Dean Rasheed <dean.a.rasheed@gmail.com>
From: Dean Rasheed <dean.a.rasheed@gmail.com>
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: Oleg Ivanov <o15611@gmail.com>, Laurenz Albe <laurenz.albe@cybertec.at>, pgsql-bugs@lists.postgresql.org
Date: 2025-12-05T10:17:15Z
Lists: pgsql-bugs
On Wed, 3 Dec 2025 at 22:52, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Poking at this, I soon found a test case where even with the separate > sqrt() calls we'd produce a result slightly outside [-1, 1] (running > this test over more values of x is sufficient). So now I think we > should do both the separate sqrt and the clamp. > I'm starting to have doubts about having 2 sqrt() calls. The problem is that it seems to produce a noticeable reduction in accuracy in quite a few cases. This is especially noticeable with fully-correlated data. For example: SELECT n, (SELECT corr(x, x) FROM generate_series(1, n) x) FROM generate_series(1, 10) g(n); n | corr ----+-------------------- 1 | 2 | 0.9999999999999998 3 | 0.9999999999999998 4 | 0.9999999999999998 5 | 0.9999999999999998 6 | 1 7 | 0.9999999999999999 8 | 1 9 | 0.9999999999999999 10 | 1 (10 rows) Now I'm not sure that the current code can be expected to get cases like this exactly right 100% of the time, but it's pretty close. For example, if I do this: WITH t1 AS ( SELECT n, random() * 1000 AS r FROM generate_series(1, 1000000) n ), t2 AS ( SELECT corr(r, r) FROM t1 GROUP BY n % 10000 ) SELECT count(*), count(*) FILTER (WHERE corr != 1) FROM t2; on HEAD it produced corr = 1 every time I ran it, whereas the patch gives rounding errors roughly 25% of the time, which seems likely to be noticed. Perhaps we should only use 2 sqrt()'s if the product Sxx * Syy overflows. Regards, Dean