Thread

  1. [RFC] Secondary hash to split stuck hash-join batches

    Ben Mejia <benjamin.arthur.mejia@gmail.com> — 2026-05-31T19:41:13Z

    Hi,
    
    Tomas clearly describes the underlying problem in [1]: stuck batches cannot
    be split by batch doubling. This occurs when the tuples' hash values all
    map to the same batch, similar to all items ending up in a single hash
    bucket.
    
    The HashJoin executor currently faces two unappealing fallback options:
    
      a) Continually doubling nBatch in a futile attempt to split the
    unsplittable batch.
    
      b) Inflating spaceAllowed, which silently exceeds work_mem.
    
    While PG18 caps the first fallback to prevent runaway batch explosions, it
    attempts to manage the situation by incrementally increasing memory usage.
    Consequently, work_mem is routinely violated by HashJoin, particularly when
    encountering these unsplittable batches.
    
    I have been experimenting with a rehash-based approach to reduce the memory
    used by hash joins and would appreciate feedback on the direction before
    polishing it for submission.
    
    Sketch of Rehash Approach
    
     - When ExecHashIncreaseNumBatches detects a stuck batch, call the new
    function, ExecHashRehashBatch to attempt a rehash.
    
     - Rehash attempts to reassign each tuple in the stuck batch to a sub-batch
    using a secondary hash function.
    
     - Sub-batches are handled in the same way as normal batches, and by the
    same code.
    
     - Sub-batches are split, like normal batches until the sub-batch fits into
    spaceAllowed
    
     - If the rehash cannot split the batch, it fails and the current
    inflate-and-gut-it-out path is taken.
    
    In my prototype, I use a GUC, enable_hashjoin_rehash to turn on the rehash
    logic; otherwise the code works the same as before.
    
    Additional Heuristics
    
    Having the rehash available as a tool to control memory usage, I also
    explored further modifications:
    
     - Using planner estimates to detect skew vs. large tuples:
       This can avoid doing a batch split and immediately attempt a rehash.
    
     - Avoiding PG18's memory inflation:
       Based on skew detection above, avoids inflating spaceAllowed and
    attempts a rehash.
    
     - Diminishing returns detection:
       If batch splitting does less than a split-threshold attempt a rehash.
    
    
    Some Preliminary Results
    
    For the table:
    
    CREATE TABLE t_stuck_4m (id INT, pad TEXT);
    
    INSERT INTO t_stuck_4m
      SELECT a, repeat(md5(a::text), 5)
      FROM generate_series(1, 1000000000) s(a)
      WHERE hashint4(a)::bit(32) & x'FFFFE000'::bit(32) = 0::bit(32);
    
    INSERT INTO t_stuck_4m SELECT id, pad FROM t_stuck_4m;  -- 2x
    INSERT INTO t_stuck_4m SELECT id, pad FROM t_stuck_4m;  -- 4x
    INSERT INTO t_stuck_4m SELECT id, pad FROM t_stuck_4m;  -- 8x
    INSERT INTO t_stuck_4m SELECT id, pad FROM t_stuck_4m;  -- 16x
    
    SET work_mem = '4MB';
    SET enable_hashjoin_rehash = off;
    
    EXPLAIN (ANALYZE, COSTS off, TIMING off, BUFFERS off) SELECT
         * FROM t_stuck_4m a JOIN t_stuck_4m b USING (id) LIMIT 1;
                                                 QUERY PLAN
    
    --------------------------------------------------------------------------------------------------
       Limit (actual rows=1.00 loops=1)
         ->  Hash Join (actual rows=1.00 loops=1)
               Hash Cond: (a.id = b.id)
               ->  Seq Scan on t_stuck_4m a (actual rows=1.00 loops=1)
               ->  Hash (actual rows=29248.00 loops=1)
                     Buckets: 32768 (originally 32768)  Batches: 4 (originally
    2)  Memory Usage: 5974kB
                     ->  Seq Scan on t_stuck_4m b (actual rows=29248.00 loops=1)
       Planning Time: 0.243 ms
       Execution Time: 10.288 ms
    
    SET enable_hashjoin_rehash = on;
      EXPLAIN (ANALYZE, COSTS off, TIMING off, BUFFERS off) SELECT
         * FROM t_stuck_4m a JOIN t_stuck_4m b USING (id) LIMIT 1;
                                                  QUERY PLAN
    
    --------------------------------------------------------------------------------------------------
       Limit (actual rows=1.00 loops=1)
         ->  Hash Join (actual rows=1.00 loops=1)
               Hash Cond: (a.id = b.id)
               ->  Seq Scan on t_stuck_4m a (actual rows=4.00 loops=1)
               ->  Hash (actual rows=29248.00 loops=1)
                     Buckets: 32768 (originally 32768)  Batches: 4 (originally
    2)  Memory Usage: 3083kB
                     Rehash Repartitions: 1 (max sub-batches: 2)
                     ->  Seq Scan on t_stuck_4m b (actual rows=29248.00 loops=1)
       Planning Time: 0.110 ms
       Execution Time: 10.071 ms
    
    So the rehash drops memory usage from 5974 kB -> 3083 kB, under the 4MB
    limit.
    
    My observations:
    
    1) The runtime when rehashing is used is nearly the same.
    
    2) This execution split the one unsplittable batch into 2, reducing
    total memory used.
    
    I have more data and the patch to share if people are interested.
    
    Best regards,
    Ben Mejia
    
      [1]
    https://www.postgresql.org/message-id/7bed6c08-72a0-4ab9-a79c-e01fcdd0940f@vondra.me