Thread

  1. Fixing some ancient errors in hash join costing

    Tom Lane <tgl@sss.pgh.pa.us> — 2025-12-27T21:31:37Z

    I spent some time digging into bug #19363 [1], and figured out the
    reason why the planner is failing to reject a horribly bad plan.
    Even without any stats, it should be able to figure out that building
    a hash table estimated at 10 billion rows is less good than building
    one estimated at 1000 rows ... but it didn't.  The cause is
    
    (1) estimate_hash_bucket_stats is defined to return zero to *mcv_freq
    if it cannot obtain a value for the frequency of the most common
    value.
    
    (2) final_cost_hashjoin neglected this specification, and would
    blindly adopt zero for the innermcvfreq.
    
    (3) This results in calculating zero for the largest hash bucket size,
    and thus the code that intends to disable hashjoin when that bucket
    size exceeds get_hash_memory_limit() is turned into a no-op.
    
    This is the exact opposite of what we want, I think.  The intent in
    the planner has always been to avoid hashing unless we are pretty
    confident that the inner relation's hash key is well-distributed.
    Turning off the disable-for-giant-hash-table check when we have
    no stats is the polar opposite of sane.
    
    So I said to myself "this is a one-liner fix, we just have to
    disregard any mcv_freq reported as zero".  And that did fix the
    case shown in the bug report, but it also broke a bunch of
    regression test cases.  Upon closer investigation, there is also
    an old oversight within estimate_hash_bucket_stats itself.  It
    returns zero for mcv_freq if there's no STATISTIC_KIND_MCV entry,
    but that neglects the fact that ANALYZE does not make an MCV entry
    if it doesn't find any data values that are noticeably more common
    than any others.  So the correct behavior really should be to
    assume the column is unique and set the mcv_freq to 1 / rows.
    In the attached draft patch I made it do this if there's no MCV
    stats entry but there is a STATISTIC_KIND_HISTOGRAM entry.
    If there's neither, we are probably dealing with a weird datatype
    that doesn't have meaningful scalar stats, so I'm hesitant to
    just apply the 1 / rows rule blindly.
    
    Even after that, there were more changes in regression test outputs
    than I'd expected.  Poking into it further, the first diff in join.out
    is precisely a case like the bug report's, where we have no stats
    about a potentially large self-join.  The repaired code decides that a
    hash join is too risky, so it disables it and we select a merge join
    instead, as desired.  However, none of the other places that visibly
    change plans are triggering that disable logic.  Instead what is
    happening is that the improved mcv_freq estimate is getting used
    within estimate_hash_bucket_stats to refine its bucket-size result:
    
        /*
         * Adjust estimated bucketsize upward to account for skewed distribution.
         */
        if (avgfreq > 0.0 && *mcv_freq > avgfreq)
            estfract *= *mcv_freq / avgfreq;
    
    This code does nothing if *mcv_freq is zero, but if we have a
    more-than-zero estimate it can increase the bucket size fraction,
    and that is indeed happening in the other places in the core
    regression tests where we see plans change.  AFAICT these changes
    are all perfectly sane and not anything to worry about.
    
    I also had to stick an additional ANALYZE step into join_hash.sql
    to keep plans from changing there.
    
    I remain a bit confused by the change in postgres_fdw.out though.
    It's deciding to push an ORDER BY down to the remote side when
    it didn't before, which seems like an improvement; but I fail to
    see how a marginal change in hash join costing would lead to that.
    Perhaps that is worth looking into more closely.
    
    			regards, tom lane
    
    [1] https://www.postgresql.org/message-id/19363-8dd32fc7600a1153%40postgresql.org