Thread
-
Re: tuple radix sort
Chengpeng Yan <chengpeng_yan@outlook.com> — 2025-11-23T05:33:44Z
> On Nov 12, 2025, at 21:57, John Naylor <johncnaylorls@gmail.com> wrote: > > I decided I wasn't quite comfortable with the full normalized datum > sharing space in SortTuple with isnull1. There's too much of a > cognitive burden involved in deciding when we do or don't need to > reset isnull1, and there's a non-zero risk of difficult-to-detect > bugs. For v4 I've instead used one byte of padding space in SortTuple > to store only the byte used for the current pass. That means we must > compute the normalized datum on every pass. That's actually better > than it sounds, since that one byte can now be used directly during > the "deal" step, rather than having to extract the byte from the > normalized datum by shifting and masking. That extraction step might > add significant cycles in cases where a pass requires multiple > iterations through the "deal" loop. It doesn't seem to make much > difference in practice, performance-wise, even with the following > pessimization: > > I had to scrap the qsort specialization on the normalized datum for > small sorts, since it's no longer stored. It could still be worth it > to compute the "next byte of the normalized datum" and perform a qsort > on that (falling back to the comparator function in the usual way), > but I haven't felt the need to resort to that yet. For v4, I just > divert to qsort_tuple in non-assert builds, with a threshold of 40. > […] > > I made an attempt at clean-up, but it's still under-commented. The > common prefix detection has moved to a separate patch (v4-0004). > > I've been forcing all eligible sorts to use radix sort in assert > builds, even when small enough that qsort would be faster. Since both > qsort and in-place radix sort are unstable, it's expected that some > regression tests need adjustment (v4-0002). One thing surprised me, > however: The pg_upgrade TAP test that runs regression tests on the old > cluster showed additional failures that I can't explain. I haven't > seen this before, but it's possible I never ran TAP tests when testing > new sort algorithms previously. This doesn't happen if you change the > current insertion sort threshold, so I haven't been able to reproduce > it aside from this patch. For that reason I can't completely rule out > an actual bug, although I actually have more confidence in the > verification of correct sort order in v4, since isnull1 now never > changes, just as in master. I found that changing some tests to have > additional sort keys seems to fix it (v4-0003). I did this in a rather > quick and haphazard fashion. There's probably a longer conversation to > be had about making test output more deterministic while still > covering the intended executor paths. > > Aside from that, this seems like a good place to settle down, so I'm > going to create a CF entry for this. I'll start more rigorous > performance testing in the near future. > Hi John, I have reviewed the v4 patch set. I applied the patches and ran "make check" on macOS 14.2.1 (M1), and all regression tests passed. Overall, the implementation looks solid and the feature is a great addition. I have a few suggestions regarding code optimization and one discussion point regarding the data structure design. Minor Comments / Optimizations: 1. Optimization in radix_sort_tuple (v4-0001) In radix_sort_tuple, there is a check: ``` if (part.offset == part.next_offset) ``` Since "part" is a local copy of the struct, this check might not reflect the latest state updated inside the loop. It might be slightly more efficient to check the array directly: ``` if (partitions[idx].offset == partitions[idx].next_offset) ``` This allows us to detect if a partition has been fully consumed/sorted within the current pass, potentially saving an iteration of the "while (num_remaining > 1)" loop. 1. Branchless calculation in Common Prefix (v4-0004) In sort_byvalue_datum, when calculating the common bits: ``` if (this_common_bits > common_upper_bits) common_upper_bits = this_common_bits; ``` Since we are looking for the leftmost bit difference, we could accumulate the differences using bitwise OR. This avoids a conditional branch inside the loop: ``` common_upper_bits |= this_common_bits; ``` 3. Short-circuit for identical keys (v4-0004) When calculating common_prefix, if common_upper_bits is 0, it implies that all non-null keys are identical (for the bits we care about). In this case, we might be able to skip the radix sort entirely or handle it as a single partition. Currently, the code handles it by passing "common_upper_bits | 1" to pg_leftmost_one_pos64, which is safe but perhaps not the most optimal path for identical keys. ----- Discussion: SortTuple Layout and normalize_datum I have a concern regarding the addition of "uint8 current_byte" to the SortTuple struct. 1. Struct Size and Abstraction: SortTuple is the generic atom for the entire sorting module. Adding a field specific to integer Radix Sort feels like a bit of a leaky abstraction. While it might fit into padding on some 64-bit platforms (keeping the size at 32 bytes), relying on padding is fragile. If the struct size increases, it reduces the number of tuples work_mem can hold, affecting all sort operations, not just integers. 2. Cache Invalidation (Read vs. Write): Currently, radix_sort_tuple writes to tup->current_byte in every pass. This turns what could be a read-only access to the tuple array into a read-write access, potentially dirtying cache lines and causing unnecessary memory traffic. Proposal: On-the-fly Normalization We can avoid storing current_byte by calculating the relevant byte on the fly. While this means re-calculating the byte extraction, we can optimize normalize_datum. The normalization logic (mapping signed integers to unsigned space) essentially flips the sign bit. Instead of doing a full 64-bit normalization for every byte extraction, we can apply this transformation only when extracting the most significant byte (MSB). The logic could look like this: ``` /* * Extract the byte at 'level' from 'key'. * level 0 denotes the Most Significant Byte. */ static inline uint8_t extract_raw_byte(Datum key, int level) { return (key >> (((SIZEOF_DATUM - 1) - level) * 8)) & 0xFF; } /* * Extract the "logically normalized" byte at 'level'. * * This effectively replaces the need to call normalize_datum() on * the full key. * - For non-MSB levels: return the raw byte. * - For the MSB level: flip the sign bit if the comparator * requires it. * - If sorting DESC: invert the result. */ static inline uint8_t radix_extract_byte(Datum orig, int level, SortSupport ssup) { uint8_t byte; /* 1. Extract raw byte first */ byte = extract_raw_byte(orig, level); /* * 2. Apply normalization only to the Most Significant Byte. * * Mathematically, normalizing a signed integer for unsigned * comparison is equivalent to flipping the sign bit (adding * 2^(Width-1)). * In the byte domain, this means XORing the MSB with 0x80. */ if (level == 0) { if (ssup->comparator == ssup_datum_signed_cmp || ssup->comparator == ssup_datum_int32_cmp) { byte ^= 0x80; } else { Assert(ssup->comparator == ssup_datum_unsigned_cmp); } } /* * 3. Handle Reverse Sort * Instead of inverting the whole Datum, we just invert the * current byte. */ if (ssup->ssup_reverse) byte = ~byte; return byte; } ``` Benefits: 1. Keeps SortTuple minimal: No risk of struct bloat. 2. Read-only passes: The tuple array remains clean in cache during the counting phase. 3. Reduced instruction count: We avoid the overhead of full normalize_datum calls for lower bytes. I'm happy to help verify this approach if you think it's worth pursuing. Best regards, Chengpeng Yan