Thread

  1. Re: hashjoins vs. Bloom filters (yet again)

    Andrew Dunstan <andrew@dunslane.net> — 2026-05-30T17:12:39Z

    On 2026-05-29 Fr 8:55 PM, Tomas Vondra wrote:
    > Hi,
    >
    > A random discussion at pgconf.dev made me revisit one of my ancient
    > patches, attempting to use Bloom filters to hash joins. I did work on
    > that twice in the past - first in 2015/6 [1], then in 2018 [2]. So let
    > me briefly revisit that, before I get to the new patch.
    >
    >
    > old patches
    > -----------
    >
    > Those old patches tried to do a fairly small thing during a hash join,
    > and that's building a Bloom filter on the inner relation (the one that
    > gets hashed), and then use that filter before probing the hash table.
    >
    > The benefits come from Bloom filters being (fairly) cheap, and a
    > negative answer (hash is not in the filter) may allows us to skip a much
    > more expensive operation.
    >
    > The old threads patches focused especially at two hash join cases:
    >
    > (a) A very selective join, i.e. a significant fraction of outer tuples
    > does not have a match in the hash table.
    >
    > (b) A selective hash join forced to do batching because the hash table
    > is too large, and thus forced to spill outer tuples to temporary files.
    >
    > For (a), the benefit comes from Bloom filters being much cheaper to
    > probe than a hash table. The exact cost depends on the implementation,
    > sizes, etc. We're in the ballpark of 50 vs. 500 cycles, maybe. But if
    > the filter discards 90% of tuples, it can be a big win.
    >
    > For (b), the filter (for all the batches at once) allows us to discard
    > some of the outer tuples without writing them to temporary files. Which
    > is way more expensive than probing a hash table.
    >
    > The patches got stuck mostly because deciding if it makes sense to
    > build/use the Bloom filter is somewhat hard. For cases where 100% of the
    > tuples have a match it's pointless - it's just pure cost, no benefit.
    > The regressions are relatively small, though (<10%).
    >
    > For (b) it's much less sensitive to this kind of issues, of course. The
    > cost of writing outer tuples to temporary files is much higher than
    > building/probing a Bloom filter.
    >
    > Clearly, a filter that discards 99% of tuples is great. And a filter
    > that keeps 99% of tuples is not great. But where exactly are the
    > thresholds is not quite clear.
    >
    > There's also a related question of sizing the filter. Bloom filters are
    > usually sized by specifying the number of distinct values and the
    > desired false positive rate. And we could try doing that - pick a
    > standard false positive rate (e.g. the built-in bloom_filter aims for
    > 1-2%), estimate the ndistinct, and get the size of the Bloom filter.
    >
    > However, chances are the filter is too big. We can't get work_mem, the
    > join is already using that for the hash table etc. We can maybe use a
    > fraction of it, and that may not be enough to fit the "perfect" filter.
    > We could bail out and not use any Bloom filter at all, but that seems a
    > bit silly. Maybe we can't fit the 2% filter, but 5% of 10% would be OK?
    >
    > Surely if the join selectivity is 1% (i.e. it discards 99% tuples), then
    > using a "worse" Bloom filter with 10% false positives would be a win?
    > It'd still discard ~89% of tuples.
    >
    > Yet another angle leading to this kind of questions is inaccurate
    > ndistinct estimates (and we all know those estimates can be quite
    > unreliable). Let's say we size the filter for 1M distinct values (and it
    > just about fits into the memory budget), but then during execution we
    > find there are 2M distinct values. Well, now we may have ~10% false
    > positive rate. Or maybe we got 5M, and it's 30%. Or 10M / 50%.
    >
    > At some point the filter stops being worth it, and we should either not
    > build it, or we should stop probing it. But when is that?
    >
    > I think we'd need some sort of cost model to make judgments about this.
    >
    > Anyway, this was just me summarizing the old threads, and what I think
    > got them stuck. Most of these questions are still open, although I think
    > we may be able to solve them better than we could ~10 years ago. We have
    > extended stats, we know about FK constraints during planning, ...
    >
    >
    > new patch
    > ---------
    >
    > Now let's talk about the new experimental/PoC patch that came from the
    > pgconf.dev discussions. It doesn't really solve the issues I just went
    > through, it's more of an attempt to take it one step further.
    >
    > One of the things mentioned in the 2018 thread was the possibility to
    > push the filter much deeper, instead of using it just in the hash join
    > node itself. It was merely discussed, but there was no code written, or
    > anything like that. But it's the thing I decided to take a stab at after
    > getting back from Vancouver.
    >
    > Consider a starjoin query
    >
    >    SELECT + FROM f JOIN d1 (f.id1 = d1.id)
    >                    JOIN d2 (f.id2 = d2.id)
    >                    JOIN d2 (f.id3 = d3.id)
    >     WHERE d1.x = 1
    >       AND d2.y = 2
    >       AND d3.z = 3;
    >
    > which will be planned using a left-deep plan like this one:
    >
    >          HJ
    >        /    \
    >      D3     HJ
    >           /    \
    >          D2    HJ
    >              /    \
    >             D1     F
    >
    > With hashes on "D" tables, and a scan on "F". With the "old" patches,
    > each HJ node would use a Bloom filter internally. But there's an
    > interesting opportunity to "push down" the filters to the scan on "F",
    > and evaluate them right there, a bit as if the scan had a local qual.
    >
    > The attached patch implements a PoC of this, and it's pretty effective.
    >
    > Of course, it depends on the selectivity of the joins (and thus how many
    > tuples get discarded by the filters). But because it moves all the
    > "cheap" filter probes *before* probing any of the hash tables, it has a
    > multiplication effect for the benefits.
    >
    > Yes, it still has most of the open issues discussed earlier, and those
    > will need to be addressed. But this "multiplication" may also make it
    > somewhat less sensitive to the regressions.
    >
    > In the example above, if each of the 3 joins has 20% selectivity (i.e.
    > 20% tuples go through), then the total selectivity is ~1%. So the "F"
    > scan produces only 1/100 of tuples. Maybe we got one of the joins wrong,
    > and it does not eliminate any tuples? That still means the overall
    > selectivity is only ~4%.
    >
    > Of course, this only works for larger joins, and maybe the joins are
    > correlated in some weird way, etc. Also, what does 4% selectivity mean
    > for the overall query duration?
    >
    > Attached is a PDF with results from a simple benchmark using joins like
    > the one above - fact + 1-3 dimensions. The scripts (in the .tgz) set a
    > couple GUCs to eliminate variations in the plan. The dimension joins are
    > independent and match a variable fraction of the fact (1% - 100%).
    >
    > The columns are for three branches - master, and "patched" with the
    > push-down disabled and enabled, for joins with 1-3 dimensions.
    >
    > The last two column groups are comparing the "patched" results to
    > master. With "off" there's no difference (other than random noise), just
    > as expected. But with the push-down enabled, there are fairly
    > significant speedups (up to ~3x). Of course, this is just a benchmark,
    > practical queries may do other stuff, making the gains smaller. OTOH, it
    > may also be much better, if there are expensive nodes in between.
    >
    >
    > The PoC patch is not very big or complex. 280KB seems like a lot, but
    > like 99% of that is changes in test output, because the patch adds some
    > info about the Bloom filters to EXPLAIN. The actual .c changes are only
    > ~1000 lines, and a half of that is comments.
    >
    > The most interesting stuff happens in create_hashjoin_plan(), where we
    > attempt to push-down the filter to a scan in the outer subtree. If that
    > succeeds, then ExecInitHashJoin initializes the filter so that the scan
    > can find it, and Hash builds the filter along with the hash table. And
    > then the scan nodes probe the pushed-down filter in ExecScanExtended().
    >
    > There's bunch of boilerplate so that setrefs does the right thing with
    > expressions, etc. But it's a couple lines here and there. I'm actually
    > surprised how little code this is.
    >
    > There's one detail I haven't mentioned yet - there's a simple adaptive
    > behavior, to deal with filters that are not selective enough. Per some
    > initial tests there's little benefit when the filter keeps >75% tuples,
    > and for >90% there were measurable regressions (~50%). This was very
    > consistent for different data types, etc.
    >
    > So the patch tracks number of matching tuples per 1000 probes, when it
    > exceeds 90% it switches to sampling. Only 1% of tuples gets probed in
    > the filter, and if the fraction drops <80%, all the tuples get probed
    > again. This is very simple, needs more thought. But for the purpose of
    > the testing it worked quite well. There still is a small regression
    > (~3%), which I assume is due to building the filter.
    >
    >
    > Aside from the issues with deciding if to use a filter at all, sizing
    > it, etc. - which are still valid (even with the adaptive thing), and
    > need to be solved, there's one more annoying issue specific to this new
    > push-down stuff.
    >
    >
    > Earlier, I mentioned the push-down happens in create_hashjoin_plan().
    > Which means it happens *after* planning and costing. There are reasons
    > for that, but it has some unfortunate & annoying consequences.
    >
    > Ideally, we'd know about the filters when constructing the scan nodes,
    > so we'd have a chance to estimate how many tuples will be eliminated by
    > probing the filters (which is about the same thing as estimating the
    > join sizes). But we can't do that, because our planner works bottom-up.
    > When constructing the scan nodes we know which tables we'll join with,
    > but we have no idea which of the join algorithms we'll pick.
    >
    > We'll consider all three join types, and the scan node has no say which
    > of those will win. But the Bloom filter push-down is specific to hash
    > joins. So what should the scan node do? Either it can assume it's under
    > hash join (and set rows/cost as if there's a Bloom filter), or it can
    > set costs in a join-agnostic way (like now).
    >
    > The only "correct" way I can think of dealing with this in the bottom-up
    > world is having two sets of paths - one set for a hash join, one set for
    > other joins. But that's not just for scans. We'd need that for all
    > paths, and for different combinations of joins. For the query with 3
    > joins, we'd end up with 2^3 combinations. That seems not great.
    >
    >
    > So I tend to see this as an opportunistic optimization. We do the
    > planning assuming there's no Bloom filter push-down, and then after the
    > fact we see if there's an opportunity after all. Which means we may not
    > pick a plan with hash joins, not realizing it might be made faster.
    >
    > But in my mind that's somewhat acceptable / defensible.
    >
    > The bigger issue for me is that it may make the EXPLAIN ANALYZE output
    > way harder to understand. The estimated "rows" are calculated before the
    > filter push-down happens, while the actual "rows" are with the filter
    > probing, of course. But it seems pretty easy to get confused by this,
    > and think it's just an incorrect estimate.
    >
    >
    > summary
    > -------
    >
    > I like the idea of pushing filters down to the scan nodes (or perhaps
    > even to some other intermediate nodes). But maybe it's too incompatible
    > with our bottom-up planning, and the issues with costing and/or EXPLAIN
    > output may be impossible to solve. I wonder what others think.
    >
    >
    > Now that I revisited the older threads, I think it probably makes sense
    > with using Bloom filters in the hash join, at least in the two cases
    > mentioned in the first section. It doesn't have the issues with
    > bottom-up planning/costing, because it happens in the hash join. And the
    > issues with that (deciding what fractions are OK, sizing the filter,
    > ...) apply to both that simpler case, and to the push-down.
    
    
    
    Hi, Tomas
    
    This is terrific and very timely from my POV.
    
    I've been experimenting with a table AM (implemented as a
    CustomScan scan provider), and bloom-filter pushdown from a hashjoin is one
    of the bigger wins available to it: a fact-table scan joined to a filtered
    dimension can use the filter to skip whole row groups and avoid
    decompressing columns entirely, rather than just rejecting a tuple after
    it's been produced. I'd hacked up a private version of this via a new
    table-AM callback (the hashjoin walks the outer subtree, builds a filter
    from the build side, and hands it to the AM's scan descriptor). Having now
    read your PoC, I think your framework is the better foundation, and I'd
    rather build on it than carry a parallel mechanism. But two things stand in
    the way of a storage-level consumer using it, and I think both are 
    relatively
    small.
    
    1) A CustomScan can't currently be a recipient.
    
    find_bloom_filter_recipient() only recognizes the stock scan tags, and the
    probe itself lives in ExecScanExtended(), which a CustomScan never calls
    (it dispatches to the provider's ExecCustomScan). The second part is
    actually a feature, not a bug: if a CustomScan provider does its own
    probing, it can choose the granularity -- per dictionary entry, per row
    group, or per row -- instead of being locked into the per-row,
    post-materialization probe that the stock nodes get. So all that's needed
    on your side is to let the planner attach a filter to a base-relation
    CustomScan; the provider takes care of consuming it.
    
    Concretely, that's adding T_CustomScan to the scan-leaf case in
    find_bloom_filter_recipient() (CustomScan embeds Scan first, so the
    scanrelid test is identical; non-leaf custom nodes have scanrelid == 0 and
    fall through to NULL), plus the matching fix_scan_bloom_filters() call in
    set_customscan_references(). The provider then calls ExecInitBloomFilters()
    in BeginCustomScan and ExecBloomFilters() (or a coarser-grained variant)
    inside its scan loop. Everything else -- producer registration, the
    es_bloom_producers lookup, the adaptive sampling, EXPLAIN -- is reused
    unchanged.
    
    2) The combined-hash filter can't be tested against a single column.
    
    You build one filter keyed on hash32() of all the join keys combined. For a
    single-key join that's ideal, and a column store can use it directly: hash
    each distinct dictionary value once per row group and skip groups whose
    values are all absent. For a multi-column join, though, the combined hash
    mixes the keys, so it can only ever be tested per-row (with all key columns
    in hand) -- it can't be checked against any one column's dictionary. The
    per-row probe is still useful, but the row-group/dictionary skipping, which
    is where most of the storage win comes from, isn't available.
    
    The obvious thought is to key a filter per column instead. But I don't
    think that should *replace* the combined filter, because per-column filters
    are strictly less selective on multi-column joins: they only test whether
    each column's value appears *somewhere* in the build side, not whether the
    combination does. With build pairs {(1,10),(2,20)}, an outer (1,20) passes
    both per-column filters even though it matches no build row, whereas the
    combined filter rejects it. So for the row-level probe -- and especially
    for plain heap -- the combined filter is the better one, and replacing it
    would be a regression.
    
    What I think would actually help is to let the framework *optionally* emit
    per-column filters in addition to the combined one, when a recipient
    signals it can use them. The combined filter stays the default and does the
    precise per-row rejection (unchanged for heap, and usable per-row by a
    column store too); the per-column filters are extra, built only on demand,
    and let a storage consumer cheaply eliminate whole row groups before the
    combined filter does the exact work. The cost is the build CPU and memory
    for the extra filters -- but only for consumers that ask, so your design is
    untouched when nobody does. For a single-key join the two filters 
    coincide, so
    there'd be no reason to build both.
    
    
    I'd be happy to work on patches for these.
    
    cheers
    
    andrew
    
    --
    Andrew Dunstan
    EDB:https://www.enterprisedb.com