Thread

  1. Re: tuple radix sort

    Chengpeng Yan <chengpeng_yan@outlook.com> — 2025-11-23T05:33:44Z

    > On Nov 12, 2025, at 21:57, John Naylor <johncnaylorls@gmail.com> wrote:
    > 
    > I decided I wasn't quite comfortable with the full normalized datum
    > sharing space in SortTuple with isnull1. There's too much of a
    > cognitive burden involved in deciding when we do or don't need to
    > reset isnull1, and there's a non-zero risk of difficult-to-detect
    > bugs. For v4 I've instead used one byte of padding space in SortTuple
    > to store only the byte used for the current pass. That means we must
    > compute the normalized datum on every pass. That's actually better
    > than it sounds, since that one byte can now be used directly during
    > the "deal" step, rather than having to extract the byte from the
    > normalized datum by shifting and masking. That extraction step might
    > add significant cycles in cases where a pass requires multiple
    > iterations through the "deal" loop. It doesn't seem to make much
    > difference in practice, performance-wise, even with the following
    > pessimization:
    > 
    > I had to scrap the qsort specialization on the normalized datum for
    > small sorts, since it's no longer stored. It could still be worth it
    > to compute the "next byte of the normalized datum" and perform a qsort
    > on that (falling back to the comparator function in the usual way),
    > but I haven't felt the need to resort to that yet. For v4, I just
    > divert to qsort_tuple in non-assert builds, with a threshold of 40.
    > […]
    > 
    > I made an attempt at clean-up, but it's still under-commented. The
    > common prefix detection has moved to a separate patch (v4-0004).
    > 
    > I've been forcing all eligible sorts to use radix sort in assert
    > builds, even when small enough that qsort would be faster. Since both
    > qsort and in-place radix sort are unstable, it's expected that some
    > regression tests need adjustment (v4-0002). One thing surprised me,
    > however: The pg_upgrade TAP test that runs regression tests on the old
    > cluster showed additional failures that I can't explain. I haven't
    > seen this before, but it's possible I never ran TAP tests when testing
    > new sort algorithms previously. This doesn't happen if you change the
    > current insertion sort threshold, so I haven't been able to reproduce
    > it aside from this patch. For that reason I can't completely rule out
    > an actual bug, although I actually have more confidence in the
    > verification of correct sort order in v4, since isnull1 now never
    > changes, just as in master. I found that changing some tests to have
    > additional sort keys seems to fix it (v4-0003). I did this in a rather
    > quick and haphazard fashion. There's probably a longer conversation to
    > be had about making test output more deterministic while still
    > covering the intended executor paths.
    > 
    > Aside from that, this seems like a good place to settle down, so I'm
    > going to create a CF entry for this. I'll start more rigorous
    > performance testing in the near future.
    > 
    
    Hi John,
    
    I have reviewed the v4 patch set. I applied the patches and ran
    "make check" on macOS 14.2.1 (M1), and all regression tests passed.
    
    Overall, the implementation looks solid and the feature is a great
    addition. I have a few suggestions regarding code optimization and one
    discussion point regarding the data structure design.
    
    Minor Comments / Optimizations:
    
    1. Optimization in radix_sort_tuple (v4-0001)
    
    In radix_sort_tuple, there is a check:
    
    ```
    if (part.offset == part.next_offset)
    ```
    
    Since "part" is a local copy of the struct, this check might not
    reflect the latest state updated inside the loop. It might be slightly
    more efficient to check the array directly:
    
    ```
    if (partitions[idx].offset == partitions[idx].next_offset)
    ```
    
    This allows us to detect if a partition has been fully consumed/sorted
    within the current pass, potentially saving an iteration of the
    "while (num_remaining > 1)" loop.
    
    1. Branchless calculation in Common Prefix (v4-0004)
    
    In sort_byvalue_datum, when calculating the common bits:
    
    ```
    if (this_common_bits > common_upper_bits)
    common_upper_bits = this_common_bits;
    ```
    
    Since we are looking for the leftmost bit difference, we could
    accumulate the differences using bitwise OR. This avoids a conditional
    branch inside the loop:
    
    ```
    common_upper_bits |= this_common_bits;
    ```
    
    3. Short-circuit for identical keys (v4-0004)
    
    When calculating common_prefix, if common_upper_bits is 0, it implies
    that all non-null keys are identical (for the bits we care about). In
    this case, we might be able to skip the radix sort entirely or handle
    it as a single partition. Currently, the code handles it by passing
    "common_upper_bits | 1" to pg_leftmost_one_pos64, which is safe but
    perhaps not the most optimal path for identical keys.
    
    -----
    
    Discussion: SortTuple Layout and normalize_datum
    
    I have a concern regarding the addition of "uint8 current_byte" to the
    SortTuple struct.
    
    1. Struct Size and Abstraction:
    
    SortTuple is the generic atom for the entire sorting module. Adding a
    field specific to integer Radix Sort feels like a bit of a leaky
    abstraction. While it might fit into padding on some 64-bit platforms
    (keeping the size at 32 bytes), relying on padding is fragile. If the
    struct size increases, it reduces the number of tuples work_mem can
    hold, affecting all sort operations, not just integers.
    
    2. Cache Invalidation (Read vs. Write):
    
    Currently, radix_sort_tuple writes to tup->current_byte in every pass.
    This turns what could be a read-only access to the tuple array into a
    read-write access, potentially dirtying cache lines and causing
    unnecessary memory traffic.
    
    Proposal: On-the-fly Normalization
    
    We can avoid storing current_byte by calculating the relevant byte on
    the fly. While this means re-calculating the byte extraction, we can
    optimize normalize_datum.
    
    The normalization logic (mapping signed integers to unsigned space)
    essentially flips the sign bit. Instead of doing a full 64-bit
    normalization for every byte extraction, we can apply this
    transformation only when extracting the most significant byte (MSB).
    
    The logic could look like this:
    
    ```
    /*
     * Extract the byte at 'level' from 'key'.
     * level 0 denotes the Most Significant Byte.
     */
    static inline uint8_t
    extract_raw_byte(Datum key, int level)
    {
        return (key >> (((SIZEOF_DATUM - 1) - level) * 8)) & 0xFF;
    }
    
    /*
     * Extract the "logically normalized" byte at 'level'.
     *
     * This effectively replaces the need to call normalize_datum() on
     * the full key.
     * - For non-MSB levels: return the raw byte.
     * - For the MSB level: flip the sign bit if the comparator
     * requires it.
     * - If sorting DESC: invert the result.
     */
    static inline uint8_t
    radix_extract_byte(Datum orig, int level, SortSupport ssup)
    {
        uint8_t byte;
    
        /* 1. Extract raw byte first */
        byte = extract_raw_byte(orig, level);
    
        /*
         * 2. Apply normalization only to the Most Significant Byte.
         *
         * Mathematically, normalizing a signed integer for unsigned
         * comparison is equivalent to flipping the sign bit (adding
         * 2^(Width-1)).
         * In the byte domain, this means XORing the MSB with 0x80.
         */
        if (level == 0)
        {
            if (ssup->comparator == ssup_datum_signed_cmp ||
                ssup->comparator == ssup_datum_int32_cmp)
            {
                byte ^= 0x80;
            }
            else
            {
                Assert(ssup->comparator == ssup_datum_unsigned_cmp);
            }
        }
    
        /*
         * 3. Handle Reverse Sort
         * Instead of inverting the whole Datum, we just invert the
         * current byte.
         */
        if (ssup->ssup_reverse)
            byte = ~byte;
    
        return byte;
    }
    ```
    
    Benefits:
    
    1. Keeps SortTuple minimal: No risk of struct bloat.
    2. Read-only passes: The tuple array remains clean in cache during the
    counting phase.
    3. Reduced instruction count: We avoid the overhead of full
    normalize_datum calls for lower bytes.
    
    I'm happy to help verify this approach if you think it's worth
    pursuing.
    
    Best regards,
    Chengpeng Yan