Thread

Re: hashjoins vs. Bloom filters (yet again)

Andrew Dunstan <andrew@dunslane.net> — 2026-05-30T17:12:39Z
On 2026-05-29 Fr 8:55 PM, Tomas Vondra wrote:
> Hi,
>
> A random discussion at pgconf.dev made me revisit one of my ancient
> patches, attempting to use Bloom filters to hash joins. I did work on
> that twice in the past - first in 2015/6 [1], then in 2018 [2]. So let
> me briefly revisit that, before I get to the new patch.
>
>
> old patches
> -----------
>
> Those old patches tried to do a fairly small thing during a hash join,
> and that's building a Bloom filter on the inner relation (the one that
> gets hashed), and then use that filter before probing the hash table.
>
> The benefits come from Bloom filters being (fairly) cheap, and a
> negative answer (hash is not in the filter) may allows us to skip a much
> more expensive operation.
>
> The old threads patches focused especially at two hash join cases:
>
> (a) A very selective join, i.e. a significant fraction of outer tuples
> does not have a match in the hash table.
>
> (b) A selective hash join forced to do batching because the hash table
> is too large, and thus forced to spill outer tuples to temporary files.
>
> For (a), the benefit comes from Bloom filters being much cheaper to
> probe than a hash table. The exact cost depends on the implementation,
> sizes, etc. We're in the ballpark of 50 vs. 500 cycles, maybe. But if
> the filter discards 90% of tuples, it can be a big win.
>
> For (b), the filter (for all the batches at once) allows us to discard
> some of the outer tuples without writing them to temporary files. Which
> is way more expensive than probing a hash table.
>
> The patches got stuck mostly because deciding if it makes sense to
> build/use the Bloom filter is somewhat hard. For cases where 100% of the
> tuples have a match it's pointless - it's just pure cost, no benefit.
> The regressions are relatively small, though (<10%).
>
> For (b) it's much less sensitive to this kind of issues, of course. The
> cost of writing outer tuples to temporary files is much higher than
> building/probing a Bloom filter.
>
> Clearly, a filter that discards 99% of tuples is great. And a filter
> that keeps 99% of tuples is not great. But where exactly are the
> thresholds is not quite clear.
>
> There's also a related question of sizing the filter. Bloom filters are
> usually sized by specifying the number of distinct values and the
> desired false positive rate. And we could try doing that - pick a
> standard false positive rate (e.g. the built-in bloom_filter aims for
> 1-2%), estimate the ndistinct, and get the size of the Bloom filter.
>
> However, chances are the filter is too big. We can't get work_mem, the
> join is already using that for the hash table etc. We can maybe use a
> fraction of it, and that may not be enough to fit the "perfect" filter.
> We could bail out and not use any Bloom filter at all, but that seems a
> bit silly. Maybe we can't fit the 2% filter, but 5% of 10% would be OK?
>
> Surely if the join selectivity is 1% (i.e. it discards 99% tuples), then
> using a "worse" Bloom filter with 10% false positives would be a win?
> It'd still discard ~89% of tuples.
>
> Yet another angle leading to this kind of questions is inaccurate
> ndistinct estimates (and we all know those estimates can be quite
> unreliable). Let's say we size the filter for 1M distinct values (and it
> just about fits into the memory budget), but then during execution we
> find there are 2M distinct values. Well, now we may have ~10% false
> positive rate. Or maybe we got 5M, and it's 30%. Or 10M / 50%.
>
> At some point the filter stops being worth it, and we should either not
> build it, or we should stop probing it. But when is that?
>
> I think we'd need some sort of cost model to make judgments about this.
>
> Anyway, this was just me summarizing the old threads, and what I think
> got them stuck. Most of these questions are still open, although I think
> we may be able to solve them better than we could ~10 years ago. We have
> extended stats, we know about FK constraints during planning, ...
>
>
> new patch
> ---------
>
> Now let's talk about the new experimental/PoC patch that came from the
> pgconf.dev discussions. It doesn't really solve the issues I just went
> through, it's more of an attempt to take it one step further.
>
> One of the things mentioned in the 2018 thread was the possibility to
> push the filter much deeper, instead of using it just in the hash join
> node itself. It was merely discussed, but there was no code written, or
> anything like that. But it's the thing I decided to take a stab at after
> getting back from Vancouver.
>
> Consider a starjoin query
>
>    SELECT + FROM f JOIN d1 (f.id1 = d1.id)
>                    JOIN d2 (f.id2 = d2.id)
>                    JOIN d2 (f.id3 = d3.id)
>     WHERE d1.x = 1
>       AND d2.y = 2
>       AND d3.z = 3;
>
> which will be planned using a left-deep plan like this one:
>
>          HJ
>        /    \
>      D3     HJ
>           /    \
>          D2    HJ
>              /    \
>             D1     F
>
> With hashes on "D" tables, and a scan on "F". With the "old" patches,
> each HJ node would use a Bloom filter internally. But there's an
> interesting opportunity to "push down" the filters to the scan on "F",
> and evaluate them right there, a bit as if the scan had a local qual.
>
> The attached patch implements a PoC of this, and it's pretty effective.
>
> Of course, it depends on the selectivity of the joins (and thus how many
> tuples get discarded by the filters). But because it moves all the
> "cheap" filter probes *before* probing any of the hash tables, it has a
> multiplication effect for the benefits.
>
> Yes, it still has most of the open issues discussed earlier, and those
> will need to be addressed. But this "multiplication" may also make it
> somewhat less sensitive to the regressions.
>
> In the example above, if each of the 3 joins has 20% selectivity (i.e.
> 20% tuples go through), then the total selectivity is ~1%. So the "F"
> scan produces only 1/100 of tuples. Maybe we got one of the joins wrong,
> and it does not eliminate any tuples? That still means the overall
> selectivity is only ~4%.
>
> Of course, this only works for larger joins, and maybe the joins are
> correlated in some weird way, etc. Also, what does 4% selectivity mean
> for the overall query duration?
>
> Attached is a PDF with results from a simple benchmark using joins like
> the one above - fact + 1-3 dimensions. The scripts (in the .tgz) set a
> couple GUCs to eliminate variations in the plan. The dimension joins are
> independent and match a variable fraction of the fact (1% - 100%).
>
> The columns are for three branches - master, and "patched" with the
> push-down disabled and enabled, for joins with 1-3 dimensions.
>
> The last two column groups are comparing the "patched" results to
> master. With "off" there's no difference (other than random noise), just
> as expected. But with the push-down enabled, there are fairly
> significant speedups (up to ~3x). Of course, this is just a benchmark,
> practical queries may do other stuff, making the gains smaller. OTOH, it
> may also be much better, if there are expensive nodes in between.
>
>
> The PoC patch is not very big or complex. 280KB seems like a lot, but
> like 99% of that is changes in test output, because the patch adds some
> info about the Bloom filters to EXPLAIN. The actual .c changes are only
> ~1000 lines, and a half of that is comments.
>
> The most interesting stuff happens in create_hashjoin_plan(), where we
> attempt to push-down the filter to a scan in the outer subtree. If that
> succeeds, then ExecInitHashJoin initializes the filter so that the scan
> can find it, and Hash builds the filter along with the hash table. And
> then the scan nodes probe the pushed-down filter in ExecScanExtended().
>
> There's bunch of boilerplate so that setrefs does the right thing with
> expressions, etc. But it's a couple lines here and there. I'm actually
> surprised how little code this is.
>
> There's one detail I haven't mentioned yet - there's a simple adaptive
> behavior, to deal with filters that are not selective enough. Per some
> initial tests there's little benefit when the filter keeps >75% tuples,
> and for >90% there were measurable regressions (~50%). This was very
> consistent for different data types, etc.
>
> So the patch tracks number of matching tuples per 1000 probes, when it
> exceeds 90% it switches to sampling. Only 1% of tuples gets probed in
> the filter, and if the fraction drops <80%, all the tuples get probed
> again. This is very simple, needs more thought. But for the purpose of
> the testing it worked quite well. There still is a small regression
> (~3%), which I assume is due to building the filter.
>
>
> Aside from the issues with deciding if to use a filter at all, sizing
> it, etc. - which are still valid (even with the adaptive thing), and
> need to be solved, there's one more annoying issue specific to this new
> push-down stuff.
>
>
> Earlier, I mentioned the push-down happens in create_hashjoin_plan().
> Which means it happens *after* planning and costing. There are reasons
> for that, but it has some unfortunate & annoying consequences.
>
> Ideally, we'd know about the filters when constructing the scan nodes,
> so we'd have a chance to estimate how many tuples will be eliminated by
> probing the filters (which is about the same thing as estimating the
> join sizes). But we can't do that, because our planner works bottom-up.
> When constructing the scan nodes we know which tables we'll join with,
> but we have no idea which of the join algorithms we'll pick.
>
> We'll consider all three join types, and the scan node has no say which
> of those will win. But the Bloom filter push-down is specific to hash
> joins. So what should the scan node do? Either it can assume it's under
> hash join (and set rows/cost as if there's a Bloom filter), or it can
> set costs in a join-agnostic way (like now).
>
> The only "correct" way I can think of dealing with this in the bottom-up
> world is having two sets of paths - one set for a hash join, one set for
> other joins. But that's not just for scans. We'd need that for all
> paths, and for different combinations of joins. For the query with 3
> joins, we'd end up with 2^3 combinations. That seems not great.
>
>
> So I tend to see this as an opportunistic optimization. We do the
> planning assuming there's no Bloom filter push-down, and then after the
> fact we see if there's an opportunity after all. Which means we may not
> pick a plan with hash joins, not realizing it might be made faster.
>
> But in my mind that's somewhat acceptable / defensible.
>
> The bigger issue for me is that it may make the EXPLAIN ANALYZE output
> way harder to understand. The estimated "rows" are calculated before the
> filter push-down happens, while the actual "rows" are with the filter
> probing, of course. But it seems pretty easy to get confused by this,
> and think it's just an incorrect estimate.
>
>
> summary
> -------
>
> I like the idea of pushing filters down to the scan nodes (or perhaps
> even to some other intermediate nodes). But maybe it's too incompatible
> with our bottom-up planning, and the issues with costing and/or EXPLAIN
> output may be impossible to solve. I wonder what others think.
>
>
> Now that I revisited the older threads, I think it probably makes sense
> with using Bloom filters in the hash join, at least in the two cases
> mentioned in the first section. It doesn't have the issues with
> bottom-up planning/costing, because it happens in the hash join. And the
> issues with that (deciding what fractions are OK, sizing the filter,
> ...) apply to both that simpler case, and to the push-down.



Hi, Tomas

This is terrific and very timely from my POV.

I've been experimenting with a table AM (implemented as a
CustomScan scan provider), and bloom-filter pushdown from a hashjoin is one
of the bigger wins available to it: a fact-table scan joined to a filtered
dimension can use the filter to skip whole row groups and avoid
decompressing columns entirely, rather than just rejecting a tuple after
it's been produced. I'd hacked up a private version of this via a new
table-AM callback (the hashjoin walks the outer subtree, builds a filter
from the build side, and hands it to the AM's scan descriptor). Having now
read your PoC, I think your framework is the better foundation, and I'd
rather build on it than carry a parallel mechanism. But two things stand in
the way of a storage-level consumer using it, and I think both are 
relatively
small.

1) A CustomScan can't currently be a recipient.

find_bloom_filter_recipient() only recognizes the stock scan tags, and the
probe itself lives in ExecScanExtended(), which a CustomScan never calls
(it dispatches to the provider's ExecCustomScan). The second part is
actually a feature, not a bug: if a CustomScan provider does its own
probing, it can choose the granularity -- per dictionary entry, per row
group, or per row -- instead of being locked into the per-row,
post-materialization probe that the stock nodes get. So all that's needed
on your side is to let the planner attach a filter to a base-relation
CustomScan; the provider takes care of consuming it.

Concretely, that's adding T_CustomScan to the scan-leaf case in
find_bloom_filter_recipient() (CustomScan embeds Scan first, so the
scanrelid test is identical; non-leaf custom nodes have scanrelid == 0 and
fall through to NULL), plus the matching fix_scan_bloom_filters() call in
set_customscan_references(). The provider then calls ExecInitBloomFilters()
in BeginCustomScan and ExecBloomFilters() (or a coarser-grained variant)
inside its scan loop. Everything else -- producer registration, the
es_bloom_producers lookup, the adaptive sampling, EXPLAIN -- is reused
unchanged.

2) The combined-hash filter can't be tested against a single column.

You build one filter keyed on hash32() of all the join keys combined. For a
single-key join that's ideal, and a column store can use it directly: hash
each distinct dictionary value once per row group and skip groups whose
values are all absent. For a multi-column join, though, the combined hash
mixes the keys, so it can only ever be tested per-row (with all key columns
in hand) -- it can't be checked against any one column's dictionary. The
per-row probe is still useful, but the row-group/dictionary skipping, which
is where most of the storage win comes from, isn't available.

The obvious thought is to key a filter per column instead. But I don't
think that should *replace* the combined filter, because per-column filters
are strictly less selective on multi-column joins: they only test whether
each column's value appears *somewhere* in the build side, not whether the
combination does. With build pairs {(1,10),(2,20)}, an outer (1,20) passes
both per-column filters even though it matches no build row, whereas the
combined filter rejects it. So for the row-level probe -- and especially
for plain heap -- the combined filter is the better one, and replacing it
would be a regression.

What I think would actually help is to let the framework *optionally* emit
per-column filters in addition to the combined one, when a recipient
signals it can use them. The combined filter stays the default and does the
precise per-row rejection (unchanged for heap, and usable per-row by a
column store too); the per-column filters are extra, built only on demand,
and let a storage consumer cheaply eliminate whole row groups before the
combined filter does the exact work. The cost is the build CPU and memory
for the extra filters -- but only for consumers that ask, so your design is
untouched when nobody does. For a single-key join the two filters 
coincide, so
there'd be no reason to build both.


I'd be happy to work on patches for these.

cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com