Thread

Re: hashjoins vs. Bloom filters (yet again)

Andrew Dunstan <andrew@dunslane.net> — 2026-05-31T15:03:47Z
On 2026-05-30 Sa 2:14 PM, Tomas Vondra wrote:
>
>
> Hi, Tomas
>
> This is terrific and very timely from my POV.
>
> I've been experimenting with a table AM (implemented as a
> CustomScan scan provider), and bloom-filter pushdown from a hashjoin is one
> of the bigger wins available to it: a fact-table scan joined to a filtered
> dimension can use the filter to skip whole row groups and avoid
> decompressing columns entirely, rather than just rejecting a tuple after
> it's been produced. I'd hacked up a private version of this via a new
> table-AM callback (the hashjoin walks the outer subtree, builds a filter
> from the build side, and hands it to the AM's scan descriptor). Having now
> read your PoC, I think your framework is the better foundation, and I'd
> rather build on it than carry a parallel mechanism. But two things stand in
> the way of a storage-level consumer using it, and I think both are
> relatively
> small.
>
> OK, good to hear. I was actually thinking about that use case too, i.e.
> making it possible for the scan to do something smart with the filter
> (like even pushing it even further down, to "storage"). Or maybe the
> ForeignScan could push it to the remote side, so that it's actually
> filtered there.
>
> I didn't mention that my message, and there are some difficulties:
>
> 1) We only build the hash (and bloom) with a delay, after the scan
> already produces some tuples. That complicates the pushdown, whiich may
> need to happen when starting the scan. Presumably, we'd need to allow
> disabling this optimization, optionally.
>
> 2) We'd need some sort of "portable" Bloom filter, with serialization
> and deserialization, etc.
>
> Both of these seem rather solvable.
>
>> 1) A CustomScan can't currently be a recipient.
>>
>> find_bloom_filter_recipient() only recognizes the stock scan tags, and the
>> probe itself lives in ExecScanExtended(), which a CustomScan never calls
>> (it dispatches to the provider's ExecCustomScan). The second part is
>> actually a feature, not a bug: if a CustomScan provider does its own
>> probing, it can choose the granularity -- per dictionary entry, per row
>> group, or per row -- instead of being locked into the per-row,
>> post-materialization probe that the stock nodes get. So all that's needed
>> on your side is to let the planner attach a filter to a base-relation
>> CustomScan; the provider takes care of consuming it.
>>
>> Concretely, that's adding T_CustomScan to the scan-leaf case in
>> find_bloom_filter_recipient() (CustomScan embeds Scan first, so the
>> scanrelid test is identical; non-leaf custom nodes have scanrelid == 0 and
>> fall through to NULL), plus the matching fix_scan_bloom_filters() call in
>> set_customscan_references(). The provider then calls ExecInitBloomFilters()
>> in BeginCustomScan and ExecBloomFilters() (or a coarser-grained variant)
>> inside its scan loop. Everything else -- producer registration, the
>> es_bloom_producers lookup, the adaptive sampling, EXPLAIN -- is reused
>> unchanged.
>>
> Yes, that should work and it's a mostly mechanical change.
>
> Maybe we'd want some sort of opt-in, so that the CustomScan can indicate
> it can handle Bloom filters. Like, setting
> CUSTOMPATH_SUPPORT_BLOOM_FILTERS to flags.
>
>> 2) The combined-hash filter can't be tested against a single column.
>>
>> You build one filter keyed on hash32() of all the join keys combined. For a
>> single-key join that's ideal, and a column store can use it directly: hash
>> each distinct dictionary value once per row group and skip groups whose
>> values are all absent. For a multi-column join, though, the combined hash
>> mixes the keys, so it can only ever be tested per-row (with all key columns
>> in hand) -- it can't be checked against any one column's dictionary. The
>> per-row probe is still useful, but the row-group/dictionary skipping, which
>> is where most of the storage win comes from, isn't available.
>>
>> The obvious thought is to key a filter per column instead. But I don't
>> think that should *replace* the combined filter, because per-column filters
>> are strictly less selective on multi-column joins: they only test whether
>> each column's value appears *somewhere* in the build side, not whether the
>> combination does. With build pairs {(1,10),(2,20)}, an outer (1,20) passes
>> both per-column filters even though it matches no build row, whereas the
>> combined filter rejects it. So for the row-level probe -- and especially
>> for plain heap -- the combined filter is the better one, and replacing it
>> would be a regression.
>>
>> What I think would actually help is to let the framework *optionally* emit
>> per-column filters in addition to the combined one, when a recipient
>> signals it can use them. The combined filter stays the default and does the
>> precise per-row rejection (unchanged for heap, and usable per-row by a
>> column store too); the per-column filters are extra, built only on demand,
>> and let a storage consumer cheaply eliminate whole row groups before the
>> combined filter does the exact work. The cost is the build CPU and memory
>> for the extra filters -- but only for consumers that ask, so your design is
>> untouched when nobody does. For a single-key join the two filters
>> coincide, so
>> there'd be no reason to build both.
>>
> I think I speculated about this (having per-key filters) in some of the
> comments in the patch, although the use case was different. I haven't
> thought about TAM, but about different joins where the join keys come
> from both sides. Consider a join like
>
>          HJ
>        /    \
>       A     HJ
>           /    \
>          B      C
>
> where A-(BC) is on (A.x = B.x AND A.y = C.y), so the complete filter
> can't be pushed to either side. But we could:
>
> (1) Push the filter on top of the BC join (which in this example is not
> really a push-down).
>
> (2) Build filters on (x) and (y) separately, and push-down these.
>
> Or we could do both, really.
>
> I suppose a variation of (2) would work for your use case too, except
> we'd push all three filters (x,y), (x) and (y) to the same scan.
>
> I guess this could also be opt-in, enabled by some CUSTOMPATH_ flag.
>
> The question is how efficient can the smaller filters be. The complete
> filter can be very selective, while the per-key filters are terrible.
>
>> I'd be happy to work on patches for these.
>>
> Great. It's and interesting experiment / area to explore.


Here are 3 patches (developed using Claude) that sit on top of your POC.

Patch 1 enables the pushdown filters for custom scans. As you say it's 
fairly mechanical and is enabled by a CUSTOMPATH_SUPPORT_BLOOM_FILTERS 
path flag.

Patch 2 provides for building per-key filters in addition to the 
multi-key filter if that flag is set. There may be other cases that 
would want it, but this would suit my immediate use case.

Patch 3 provides for eager creation of the filter(s) in such cases, 
disabling the optimization you mentioned in point 1 above.


>
> FWIW I think the main difficulty for this PoC is going to be the
> planning/costing stuff, and the impact on EXPLAIN.
>
>


I haven't dealt with that or other issues you raise, but I think this is 
enough for me to begin testing. I have adapted my TAM to it and verified 
that it acts as expected. I will start doing some benchmarks.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com