Thread

  1. More speedups for tuple deformation

    David Rowley <dgrowleyml@gmail.com> — 2025-12-28T09:04:25Z

    Around this time last year I worked on a series of patches for v18 to
    speed up tuple deformation.  That involved about 5 separate patches,
    the main 3 of which were 5983a4cff (CompactAttribute), db448ce5a
    (faster offset aligning), and 58a359e58 (inline various deforming
    loops). The latter of those 3 changed slot_deform_heap_tuple() to add
    dedicated deforming loops for !slow mode and for tuples that don't
    have the HEAP_HASNULL bit set.
    
    When I was working on that, I wondered if it might be better to
    precalculate the attcacheoff rather than doing it in the deforming
    loop. I've finally written some code to do this, and I'm now ready to
    share some results.
    
    0001:
    
    This introduces a function named TupleDescFinalize(), which must be
    called after a TupleDesc has been created or changed. This function
    pre-calculates the attcacheoff for all fixed-width attributes and
    records the attnum of the first attribute without a cached offset (the
    first varlena or cstring attribute).  This allows the code in the
    deforming loops which was setting CompactAttribute's attcacheoff to be
    removed and allows a dedicated loop to process all attributes with an
    attcacheoff before falling through to the loop that handles
    non-attcacheoff attributes, which has to calculate the offset and
    alignment manually. If the tuple has a NULL value before the last
    attribute with a cached offset, then we can only use the attcacheoff
    until the NULL attribute.
    
    The expectation here is that knowing the offset beforehand is faster
    than calculating it each time. Calculating the offset requires first
    aligning the current offset according to the attributes attalign
    value, then once we've called fetch_att() to get the Datum value, we
    need to add the length of the attribute to skip forward to the next
    attribute. There's not much opportunity for instruction-level
    parallelism there due to the dependency on the previous calculation.
    
    The primary optimisation in 0001 is that it adds a dedicated tight
    loop to deform as many attributes with a cache offset as possible
    before breaking out that loop to deform any remaining attributes
    without using any cached offset.
    
    0002:
    
    After thinking about 0001 for a while, I wondered if we could do
    better than resorting to having to check att_isnull() for every
    attribute after we find the first NULL. What if the tuple has a NULL
    quite early on, then no NULLs after that. It would be good if we
    looked ahead in the tuple's NULL bitmap to identify exactly if and
    when the next NULL attribute occurs and loop without checking
    att_isnull() until that attribute.
    
    Effectively, what I came up with was something like:
    
    for (;;)
    {
        for(; attnum < nextNullAttr; attnum++)
        {
             // do fetch_att() without checking for NULLs
        }
        if (attnum >= natts)
            break;
        for(; attnum < nextNullSeqEnd; attnum++)
            isnull[attnum] = true;
    
        next_null_until(bp, attnum, natts, &nextNullAttr, &nextNullSeqEnd);
    }
    
    The next_null_until() function looks at the NULL bitmap starting at
    attnum and sets nextNullAttr to the next NULL and nextNullSeqEnd to
    the first attribute to the first non-NULL after the NULL. If there are
    no more NULLs, then nextNullAttr is set to natts, which allows the
    outer loop to complete.
    
    Test #5 seems to do well with this code, but I wasn't impressed with
    most of the other results. I'd have expected test #3 to improve with
    this change, but it didn't.
    
    0003:
    
    In 0002 I added a dedicated loop that handles tuples without
    HEAP_HASNULL. To see if it would make the performance any better I
    made 0003, which gets rid of that dedicated loop. I hoped that
    shrinking the code down might help performance. It didn't quite have
    the effect I hoped for.
    
    In each version, I experimented with having a dedicated deforming loop
    which can only handle attbyval == true columns.  If we know there are
    no byref attributes, then fetch_att() can be inlined without the
    branch that handles pointer types. That removes some branching
    overhead and makes for a tighter loop with fewer instructions. When
    this optimisation doesn't apply, there's a bit of extra overhead of
    having to check for "attnum < firstByRefAttr".
    
    Benchmarking:
    
    To get an idea if doing this is a win performance-wise, I designed a
    benchmark with various numbers of columns and various combinations of
    fixed vs varlena types along with NULLs and no NULLs. There are 8
    tests in total. For each of those 8 tests, I ran it with between 0
    and 40 extra INT NOT NULL columns.
    
    The tests are:
    
    1. first col int not null, last col int not null
    2. first col text not null, last col int not null
    3. first col int null, last col int not null
    4. first col text null, last col int not null
    5. first col int not null, last col int null
    6. first col text not null, last col int null
    7. first col int null, last col int null
    8. first col text null, last col int null
    
    So, for example #1 would look like:
    
    CREATE TABLE t1 (
      c INT NOT NULL DEFAULT 0,
     <extra 0-40 columns here>
     a INT NOT NULL,
     b INT NOT NULL DEFAULT 0
    );
    
    and #8 would be:
    
    CREATE TABLE t1 (
      c TEXT DEFAULT NULL,
     <extra 0-40 columns here>
     a INT NOT NULL,
     b INT DEFAULT NULL
    );
    
    For each of the 8 tests, I ran with 0, 10, 20, 30 and 40 extra
    columns, so 40 tests in total (8 tests * 5 for each variation of extra
    columns).
    
    Another benefit of 0001, besides using the fixed attcacheoff is that
    since we know where the first NULL attribute is, we can keep deforming
    without calling att_isnull() until we get to the first NULL attribute.
    Currently in master, if the tuple has the HEAP_HASNULL bit set, then
    the deforming code will call att_isnull for every attribute in the
    tuple. Test #5 should highlight this (you may notice the orange bar in
    the attached graphs is commonly the test with the biggest speedup)
    
    Now, not every query is bottlenecked on tuple deforming, so to try to
    maximise the amount of tuple deforming that occurs relative to other
    work, the query I ran was: SELECT sum(a) FROM t1;  since the "a"
    column is almost last, all prior attributes need to be deformed before
    "a" can be.
    
    I've tried to make the benchmark represent a large variety of
    scenarios to see if there are any performance regressions. I've
    benchmarked each patch with and without OPTIMIZE_BYVAL defined (the
    additional byval-only attribute deformer loop). I tried with gcc and
    with clang on my Zen 2 machine and also an Apple M2. Please see the
    attached graphs which show the results of the SUM(a) query on a table
    with 1 million rows.
    
    Analysing the results, it's not really that clear which patch is best.
    Which version works fastest seems to depend on the hardware. The AMD
    Zen 2 machine with gcc does really well with 0001+OPTIMIZE_BYVAL as it
    comes out an average of 21% faster and some tests are  more than 44%
    faster than master, and there are no performance regressions. With
    clang on the same Zen2 machine the performance isn't the same. There
    are a few regressions with the 0 extra column tests.  On the Apple M2
    tests #1 and #5 improve massively. The other  tests don't improve
    nearly as much and with certain patches a few regress slightly.
    
    Please see the attached gifs which show 6 graphs each. Top is the
    results of 0001, the middle row is 0001+0002 and the bottom row
    0001+0002+0003. The left column is without OPTIMIZE_BYVAL and the
    right column is with. The percentage shown is the query time speedup
    the patched version gives over master.
    
    Things still to do:
    
    * More benchmarking is needed. I've not yet completed the benchmarks
    on my Zen4 machine.  No Intel hardware has been tested at all. I don't
    really have any good Intel hardware to test with. Maybe someone else
    would like to help? Script is attached.
    * I've not looked at the JIT deforming code. At the moment the code
    won't even compile with LLVM enabled because I've removed the
    TTS_FLAG_SLOW flag. It's possible I'll have to adjust the JIT
    deforming code or consider keeping TTS_FLAG_SLOW.
    
    I'll add this patch to the January commitfest.
    
    David