Thread

  1. Re: Batching in executor

    amit <amitlangote09@gmail.com> — 2025-12-20T14:12:03Z

    On Fri, Dec 5, 2025 at 12:54 AM Amit Langote <amitlangote09@gmail.com>
    wrote:
    > On Wed, Oct 29, 2025 at 3:37 PM Amit Langote <amitlangote09@gmail.com>
    wrote:
    > > On Tue, Oct 28, 2025 at 10:40 PM Amit Langote <amitlangote09@gmail.com>
    wrote:
    > > > That would be nice to see if you have the time, but maybe after I post
    > > > a new version.
    > >
    > > I’ve created a CF entry marked WoA for this in the next CF under the
    > > title “Batching in executor, part 1: add batch variant of table AM
    > > scan API.” The idea is to track this piece separately so that later
    > > parts can have their own entries and we don’t end up with a single
    > > long-lived entry that never gets marked done. :-)
    >
    > I intend to continue working on this, so have just moved it into the
    > next fest.  I will post a new patch version next week that addresses
    > Daniil's comments and implements a few other things I mentioned I will
    > in my reply to Tomas on Oct 28; sorry for the delay.
    
    Before I go on vacation for a couple of weeks, here's an updated patch
    set.  I am only including the patches that add TAM interface, add
    TupleBatch executor wrapper for TAM batches, and use it in SeqScan as I had
    posted before.  There is a new patch to add a BATCHES option to EXPLAIN.  I
    renamed the testing GUC to executor_batch_rows (integer) from the boolean
    executor_batching.  EXPLAIN (BATCHES) example:
    
    +-- Basic batch stats output
    +select explain_filter('explain (analyze, batches, buffers off, costs off)
    select * from batch_test');
    +                         explain_filter
    +----------------------------------------------------------------
    + Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
    +   Batches: N  Avg Rows: N.N  Max: N  Min: N
    + Planning Time: N.N ms
    + Execution Time: N.N ms
    +(4 rows)
    
    What I have not included in this set are the patches that add
    ExecProcNodeBatch() so that TupleBatch can be passed from one plan node to
    another (parent), ExprEvalOps (EEOPs) for batched expression evaluation
    (qual and aggregate transition).  I would like to focus on the patches that
    allow reading batches from TAM into Scan nodes (only SeqScan for now).
    
    After I'm back from vacation, I will post patches for batched qual
    evaluation in SeqScan filter quals (once bugs are fixed and polished).
    Batching in Agg node can wait for now.
    
    In the meantime, what I would like to have someone's thoughts on:
    
    * the shape of the TAM APIs -- should I add a TAMBatch or something that is
    created, populated, and destroyed by the TAM instead of the current void
    pointer and TupleBatchOps that are initialized in the executor like this
    (excerpt from 0002):
    
    +    /* Lazily create the AM batch payload. */
    +    if (node->ss.ps.ps_Batch->am_payload == NULL)
    +    {
    +        const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY =
    scandesc->rs_rd->rd_tableam;
    +
    +        Assert(tam && tam->scan_begin_batch);
    +        node->ss.ps.ps_Batch->am_payload =
    +            table_scan_begin_batch(scandesc,
    node->ss.ps.ps_Batch->maxslots);
    +        node->ss.ps.ps_Batch->ops =
    table_batch_callbacks(node->ss.ss_currentRelation);
    +    }
    
    * the shape of TupleBatch itself -- its contents and operations defined in
    execBatch.c/h.
    
    * any other thoughts you might have on the project, patches.
    
    Benchmark:
    
    Scripts attached if you want to try them.
    
    (Negative % = faster than master)
    
    SELECT * FROM table LIMIT 1 OFFSET N:
    Rows      Master    batch=0   vs master   batch=64   vs master
    --------------------------------------------------------------
    1M          11ms       11ms        -0%        8ms       -23%
    2M          23ms       22ms        -1%       18ms       -23%
    3M          36ms       34ms        -5%       27ms       -25%
    4M          51ms       50ms        -2%       38ms       -26%
    5M          64ms       64ms        -1%       48ms       -26%
    10M        147ms      145ms        -1%      114ms       -22%
    
    SELECT * FROM WHERE a > 0 LIMIT 1 OFFSET N:
    Rows      Master    batch=0   vs master   batch=64   vs master
    --------------------------------------------------------------
    1M          31ms       31ms        +0%       16ms       -48%
    2M          64ms       64ms        -0%       34ms       -47%
    3M          67ms       66ms        -1%       50ms       -25%
    4M          91ms       90ms        -1%       71ms       -22%
    5M         119ms      113ms        -5%       88ms       -26%
    10M        262ms      261ms        -0%      205ms       -21%
    
    SELECT * FROM table WHERE o > 0 LIMIT 1 OFFSET N (last column -
    deform-heavy):
    Rows      Master    batch=0   vs master   batch=64   vs master
    --------------------------------------------------------------
    1M          38ms       37ms        -2%       38ms        +0%
    2M          79ms       75ms        -6%       77ms        -4%
    3M         182ms      186ms        +2%      160ms       -12%
    4M         250ms      252ms        +1%      219ms       -12%
    5M         314ms      316ms        +1%      273ms       -13%
    10M        647ms      651ms        +1%      604ms        -7%
    
    The smaller improvement with WHERE o > 0 is expected since accessing the
    last column requires deforming most of the tuple, which dominates the
    execution time. Future work on batched tuple deformation could help here.
    
    Note on regressions with executor_batch_rows = 0 vs master:
    
    I am not seeing the regressions with batch_rows=0 vs master as I did
    before.  I think some of it might have to do with my removing some stray
    fields from HeapScanData that were accidentally left there in the earlier
    patches.  Also, the regressions I was observing earlier seemed more to have
    to do with using gcc to compile master tree and clang to compile patched
    tree, which resulted in code layout changes that seemed to cause patched
    binary to regress.  Would be nice if these numbers can be verified by
    others.
    
    -- 
    Thanks, Amit Langote