Thread
-
More speedups for tuple deformation
David Rowley <dgrowleyml@gmail.com> — 2025-12-28T09:04:25Z
Around this time last year I worked on a series of patches for v18 to speed up tuple deformation. That involved about 5 separate patches, the main 3 of which were 5983a4cff (CompactAttribute), db448ce5a (faster offset aligning), and 58a359e58 (inline various deforming loops). The latter of those 3 changed slot_deform_heap_tuple() to add dedicated deforming loops for !slow mode and for tuples that don't have the HEAP_HASNULL bit set. When I was working on that, I wondered if it might be better to precalculate the attcacheoff rather than doing it in the deforming loop. I've finally written some code to do this, and I'm now ready to share some results. 0001: This introduces a function named TupleDescFinalize(), which must be called after a TupleDesc has been created or changed. This function pre-calculates the attcacheoff for all fixed-width attributes and records the attnum of the first attribute without a cached offset (the first varlena or cstring attribute). This allows the code in the deforming loops which was setting CompactAttribute's attcacheoff to be removed and allows a dedicated loop to process all attributes with an attcacheoff before falling through to the loop that handles non-attcacheoff attributes, which has to calculate the offset and alignment manually. If the tuple has a NULL value before the last attribute with a cached offset, then we can only use the attcacheoff until the NULL attribute. The expectation here is that knowing the offset beforehand is faster than calculating it each time. Calculating the offset requires first aligning the current offset according to the attributes attalign value, then once we've called fetch_att() to get the Datum value, we need to add the length of the attribute to skip forward to the next attribute. There's not much opportunity for instruction-level parallelism there due to the dependency on the previous calculation. The primary optimisation in 0001 is that it adds a dedicated tight loop to deform as many attributes with a cache offset as possible before breaking out that loop to deform any remaining attributes without using any cached offset. 0002: After thinking about 0001 for a while, I wondered if we could do better than resorting to having to check att_isnull() for every attribute after we find the first NULL. What if the tuple has a NULL quite early on, then no NULLs after that. It would be good if we looked ahead in the tuple's NULL bitmap to identify exactly if and when the next NULL attribute occurs and loop without checking att_isnull() until that attribute. Effectively, what I came up with was something like: for (;;) { for(; attnum < nextNullAttr; attnum++) { // do fetch_att() without checking for NULLs } if (attnum >= natts) break; for(; attnum < nextNullSeqEnd; attnum++) isnull[attnum] = true; next_null_until(bp, attnum, natts, &nextNullAttr, &nextNullSeqEnd); } The next_null_until() function looks at the NULL bitmap starting at attnum and sets nextNullAttr to the next NULL and nextNullSeqEnd to the first attribute to the first non-NULL after the NULL. If there are no more NULLs, then nextNullAttr is set to natts, which allows the outer loop to complete. Test #5 seems to do well with this code, but I wasn't impressed with most of the other results. I'd have expected test #3 to improve with this change, but it didn't. 0003: In 0002 I added a dedicated loop that handles tuples without HEAP_HASNULL. To see if it would make the performance any better I made 0003, which gets rid of that dedicated loop. I hoped that shrinking the code down might help performance. It didn't quite have the effect I hoped for. In each version, I experimented with having a dedicated deforming loop which can only handle attbyval == true columns. If we know there are no byref attributes, then fetch_att() can be inlined without the branch that handles pointer types. That removes some branching overhead and makes for a tighter loop with fewer instructions. When this optimisation doesn't apply, there's a bit of extra overhead of having to check for "attnum < firstByRefAttr". Benchmarking: To get an idea if doing this is a win performance-wise, I designed a benchmark with various numbers of columns and various combinations of fixed vs varlena types along with NULLs and no NULLs. There are 8 tests in total. For each of those 8 tests, I ran it with between 0 and 40 extra INT NOT NULL columns. The tests are: 1. first col int not null, last col int not null 2. first col text not null, last col int not null 3. first col int null, last col int not null 4. first col text null, last col int not null 5. first col int not null, last col int null 6. first col text not null, last col int null 7. first col int null, last col int null 8. first col text null, last col int null So, for example #1 would look like: CREATE TABLE t1 ( c INT NOT NULL DEFAULT 0, <extra 0-40 columns here> a INT NOT NULL, b INT NOT NULL DEFAULT 0 ); and #8 would be: CREATE TABLE t1 ( c TEXT DEFAULT NULL, <extra 0-40 columns here> a INT NOT NULL, b INT DEFAULT NULL ); For each of the 8 tests, I ran with 0, 10, 20, 30 and 40 extra columns, so 40 tests in total (8 tests * 5 for each variation of extra columns). Another benefit of 0001, besides using the fixed attcacheoff is that since we know where the first NULL attribute is, we can keep deforming without calling att_isnull() until we get to the first NULL attribute. Currently in master, if the tuple has the HEAP_HASNULL bit set, then the deforming code will call att_isnull for every attribute in the tuple. Test #5 should highlight this (you may notice the orange bar in the attached graphs is commonly the test with the biggest speedup) Now, not every query is bottlenecked on tuple deforming, so to try to maximise the amount of tuple deforming that occurs relative to other work, the query I ran was: SELECT sum(a) FROM t1; since the "a" column is almost last, all prior attributes need to be deformed before "a" can be. I've tried to make the benchmark represent a large variety of scenarios to see if there are any performance regressions. I've benchmarked each patch with and without OPTIMIZE_BYVAL defined (the additional byval-only attribute deformer loop). I tried with gcc and with clang on my Zen 2 machine and also an Apple M2. Please see the attached graphs which show the results of the SUM(a) query on a table with 1 million rows. Analysing the results, it's not really that clear which patch is best. Which version works fastest seems to depend on the hardware. The AMD Zen 2 machine with gcc does really well with 0001+OPTIMIZE_BYVAL as it comes out an average of 21% faster and some tests are more than 44% faster than master, and there are no performance regressions. With clang on the same Zen2 machine the performance isn't the same. There are a few regressions with the 0 extra column tests. On the Apple M2 tests #1 and #5 improve massively. The other tests don't improve nearly as much and with certain patches a few regress slightly. Please see the attached gifs which show 6 graphs each. Top is the results of 0001, the middle row is 0001+0002 and the bottom row 0001+0002+0003. The left column is without OPTIMIZE_BYVAL and the right column is with. The percentage shown is the query time speedup the patched version gives over master. Things still to do: * More benchmarking is needed. I've not yet completed the benchmarks on my Zen4 machine. No Intel hardware has been tested at all. I don't really have any good Intel hardware to test with. Maybe someone else would like to help? Script is attached. * I've not looked at the JIT deforming code. At the moment the code won't even compile with LLVM enabled because I've removed the TTS_FLAG_SLOW flag. It's possible I'll have to adjust the JIT deforming code or consider keeping TTS_FLAG_SLOW. I'll add this patch to the January commitfest. David