Thread

Fixing some ancient errors in hash join costing

Tom Lane <tgl@sss.pgh.pa.us> — 2025-12-27T21:31:37Z
I spent some time digging into bug #19363 [1], and figured out the
reason why the planner is failing to reject a horribly bad plan.
Even without any stats, it should be able to figure out that building
a hash table estimated at 10 billion rows is less good than building
one estimated at 1000 rows ... but it didn't.  The cause is

(1) estimate_hash_bucket_stats is defined to return zero to *mcv_freq
if it cannot obtain a value for the frequency of the most common
value.

(2) final_cost_hashjoin neglected this specification, and would
blindly adopt zero for the innermcvfreq.

(3) This results in calculating zero for the largest hash bucket size,
and thus the code that intends to disable hashjoin when that bucket
size exceeds get_hash_memory_limit() is turned into a no-op.

This is the exact opposite of what we want, I think.  The intent in
the planner has always been to avoid hashing unless we are pretty
confident that the inner relation's hash key is well-distributed.
Turning off the disable-for-giant-hash-table check when we have
no stats is the polar opposite of sane.

So I said to myself "this is a one-liner fix, we just have to
disregard any mcv_freq reported as zero".  And that did fix the
case shown in the bug report, but it also broke a bunch of
regression test cases.  Upon closer investigation, there is also
an old oversight within estimate_hash_bucket_stats itself.  It
returns zero for mcv_freq if there's no STATISTIC_KIND_MCV entry,
but that neglects the fact that ANALYZE does not make an MCV entry
if it doesn't find any data values that are noticeably more common
than any others.  So the correct behavior really should be to
assume the column is unique and set the mcv_freq to 1 / rows.
In the attached draft patch I made it do this if there's no MCV
stats entry but there is a STATISTIC_KIND_HISTOGRAM entry.
If there's neither, we are probably dealing with a weird datatype
that doesn't have meaningful scalar stats, so I'm hesitant to
just apply the 1 / rows rule blindly.

Even after that, there were more changes in regression test outputs
than I'd expected.  Poking into it further, the first diff in join.out
is precisely a case like the bug report's, where we have no stats
about a potentially large self-join.  The repaired code decides that a
hash join is too risky, so it disables it and we select a merge join
instead, as desired.  However, none of the other places that visibly
change plans are triggering that disable logic.  Instead what is
happening is that the improved mcv_freq estimate is getting used
within estimate_hash_bucket_stats to refine its bucket-size result:

    /*
     * Adjust estimated bucketsize upward to account for skewed distribution.
     */
    if (avgfreq > 0.0 && *mcv_freq > avgfreq)
        estfract *= *mcv_freq / avgfreq;

This code does nothing if *mcv_freq is zero, but if we have a
more-than-zero estimate it can increase the bucket size fraction,
and that is indeed happening in the other places in the core
regression tests where we see plans change.  AFAICT these changes
are all perfectly sane and not anything to worry about.

I also had to stick an additional ANALYZE step into join_hash.sql
to keep plans from changing there.

I remain a bit confused by the change in postgres_fdw.out though.
It's deciding to push an ORDER BY down to the remote side when
it didn't before, which seems like an improvement; but I fail to
see how a marginal change in hash join costing would lead to that.
Perhaps that is worth looking into more closely.

			regards, tom lane

[1] https://www.postgresql.org/message-id/19363-8dd32fc7600a1153%40postgresql.org