Thread

  1. Re: BUG #19340: Wrong result from CORR() function

    Dean Rasheed <dean.a.rasheed@gmail.com> — 2025-12-05T10:17:15Z

    On Wed, 3 Dec 2025 at 22:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >
    > Poking at this, I soon found a test case where even with the separate
    > sqrt() calls we'd produce a result slightly outside [-1, 1] (running
    > this test over more values of x is sufficient).  So now I think we
    > should do both the separate sqrt and the clamp.
    >
    
    I'm starting to have doubts about having 2 sqrt() calls. The problem
    is that it seems to produce a noticeable reduction in accuracy in
    quite a few cases. This is especially noticeable with fully-correlated
    data. For example:
    
    SELECT n, (SELECT corr(x, x) FROM generate_series(1, n) x)
      FROM generate_series(1, 10) g(n);
    
     n  |        corr
    ----+--------------------
      1 |
      2 | 0.9999999999999998
      3 | 0.9999999999999998
      4 | 0.9999999999999998
      5 | 0.9999999999999998
      6 |                  1
      7 | 0.9999999999999999
      8 |                  1
      9 | 0.9999999999999999
     10 |                  1
    (10 rows)
    
    Now I'm not sure that the current code can be expected to get cases
    like this exactly right 100% of the time, but it's pretty close. For
    example, if I do this:
    
    WITH t1 AS (
      SELECT n, random() * 1000 AS r FROM generate_series(1, 1000000) n
    ), t2 AS (
      SELECT corr(r, r) FROM t1 GROUP BY n % 10000
    )
    SELECT count(*), count(*) FILTER (WHERE corr != 1) FROM t2;
    
    on HEAD it produced corr = 1 every time I ran it, whereas the patch
    gives rounding errors roughly 25% of the time, which seems likely to
    be noticed.
    
    Perhaps we should only use 2 sqrt()'s if the product Sxx * Syy overflows.
    
    Regards,
    Dean