Thread

  1. Re: BUG #19340: Wrong result from CORR() function

    Dean Rasheed <dean.a.rasheed@gmail.com> — 2025-12-05T09:23:59Z

    On Wed, 3 Dec 2025 at 22:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >
    > Attached is a fleshed-out patch proposal that fixes the related
    > aggregates and adds test cases.
    >
    
    Looking at float8_regr_accum(), I think it would be preferable to
    arrange for it to leave Sxx, Syy, and Sxy zero until distinct X and Y
    values are seen. I.e., something like this:
    
        if (newvalX != commonX || isnan(newvalX))
            commonX = get_float8_nan();
        if (newvalY != commonY || isnan(newvalY))
            commonY = get_float8_nan();
    
        if (isnan(commonX) || isnan(commonY))
        {
            tmpX = newvalX * N - Sx;
            tmpY = newvalY * N - Sy;
            scale = 1.0 / (N * transvalues[0]);
            if (isnan(commonX))
                Sxx += tmpX * tmpX * scale;
            if (isnan(commonY))
                Syy += tmpY * tmpY * scale;
            if (isnan(commonX) && isnan(commonY))
                Sxy += tmpX * tmpY * scale;
    
            ... Overflow check ...
        }
    
    This would mean that float8_corr(), float8_regr_r2(),
    float8_regr_slope(), and float8_regr_intercept() would not need to
    look at commonX or commonY, and could simply rely on Sxx == 0 or Syy
    == 0 to detect horizontal and vertical lines.
    
    Aside from making the code simpler, this would guarantee that the
    aggregate functions regr_sxx() and regr_syy() would return exactly
    zero for all-constant X and Y inputs respectively, and that
    regr_sxy(), covar_pop(), and covar_samp() would return exactly zero if
    either the X or the Y inputs were all constant.
    
    Something else that occurred to me was that float8_regr_avgx() and
    float8_regr_avgy() might as well make use of commonX and commonY,
    since we're calculating them, so they would return exact averages if
    all the X or Y values were the same, rather than results with possible
    rounding errors.
    
    I also wonder if it would be worth doing something similar for the
    single-variable aggregates so that var_pop(), var_samp(),
    stddev_pop(), and stddev_samp() would all return exactly zero, and
    avg() would return the exact common value, if all the inputs were
    constant.
    
    Regards,
    Dean