Thread

Re: tuple radix sort

Chengpeng Yan <chengpeng_yan@outlook.com> — 2025-11-23T05:33:44Z
> On Nov 12, 2025, at 21:57, John Naylor <johncnaylorls@gmail.com> wrote:
> 
> I decided I wasn't quite comfortable with the full normalized datum
> sharing space in SortTuple with isnull1. There's too much of a
> cognitive burden involved in deciding when we do or don't need to
> reset isnull1, and there's a non-zero risk of difficult-to-detect
> bugs. For v4 I've instead used one byte of padding space in SortTuple
> to store only the byte used for the current pass. That means we must
> compute the normalized datum on every pass. That's actually better
> than it sounds, since that one byte can now be used directly during
> the "deal" step, rather than having to extract the byte from the
> normalized datum by shifting and masking. That extraction step might
> add significant cycles in cases where a pass requires multiple
> iterations through the "deal" loop. It doesn't seem to make much
> difference in practice, performance-wise, even with the following
> pessimization:
> 
> I had to scrap the qsort specialization on the normalized datum for
> small sorts, since it's no longer stored. It could still be worth it
> to compute the "next byte of the normalized datum" and perform a qsort
> on that (falling back to the comparator function in the usual way),
> but I haven't felt the need to resort to that yet. For v4, I just
> divert to qsort_tuple in non-assert builds, with a threshold of 40.
> […]
> 
> I made an attempt at clean-up, but it's still under-commented. The
> common prefix detection has moved to a separate patch (v4-0004).
> 
> I've been forcing all eligible sorts to use radix sort in assert
> builds, even when small enough that qsort would be faster. Since both
> qsort and in-place radix sort are unstable, it's expected that some
> regression tests need adjustment (v4-0002). One thing surprised me,
> however: The pg_upgrade TAP test that runs regression tests on the old
> cluster showed additional failures that I can't explain. I haven't
> seen this before, but it's possible I never ran TAP tests when testing
> new sort algorithms previously. This doesn't happen if you change the
> current insertion sort threshold, so I haven't been able to reproduce
> it aside from this patch. For that reason I can't completely rule out
> an actual bug, although I actually have more confidence in the
> verification of correct sort order in v4, since isnull1 now never
> changes, just as in master. I found that changing some tests to have
> additional sort keys seems to fix it (v4-0003). I did this in a rather
> quick and haphazard fashion. There's probably a longer conversation to
> be had about making test output more deterministic while still
> covering the intended executor paths.
> 
> Aside from that, this seems like a good place to settle down, so I'm
> going to create a CF entry for this. I'll start more rigorous
> performance testing in the near future.
> 

Hi John,

I have reviewed the v4 patch set. I applied the patches and ran
"make check" on macOS 14.2.1 (M1), and all regression tests passed.

Overall, the implementation looks solid and the feature is a great
addition. I have a few suggestions regarding code optimization and one
discussion point regarding the data structure design.

Minor Comments / Optimizations:

1. Optimization in radix_sort_tuple (v4-0001)

In radix_sort_tuple, there is a check:

```
if (part.offset == part.next_offset)
```

Since "part" is a local copy of the struct, this check might not
reflect the latest state updated inside the loop. It might be slightly
more efficient to check the array directly:

```
if (partitions[idx].offset == partitions[idx].next_offset)
```

This allows us to detect if a partition has been fully consumed/sorted
within the current pass, potentially saving an iteration of the
"while (num_remaining > 1)" loop.

1. Branchless calculation in Common Prefix (v4-0004)

In sort_byvalue_datum, when calculating the common bits:

```
if (this_common_bits > common_upper_bits)
common_upper_bits = this_common_bits;
```

Since we are looking for the leftmost bit difference, we could
accumulate the differences using bitwise OR. This avoids a conditional
branch inside the loop:

```
common_upper_bits |= this_common_bits;
```

3. Short-circuit for identical keys (v4-0004)

When calculating common_prefix, if common_upper_bits is 0, it implies
that all non-null keys are identical (for the bits we care about). In
this case, we might be able to skip the radix sort entirely or handle
it as a single partition. Currently, the code handles it by passing
"common_upper_bits | 1" to pg_leftmost_one_pos64, which is safe but
perhaps not the most optimal path for identical keys.

-----

Discussion: SortTuple Layout and normalize_datum

I have a concern regarding the addition of "uint8 current_byte" to the
SortTuple struct.

1. Struct Size and Abstraction:

SortTuple is the generic atom for the entire sorting module. Adding a
field specific to integer Radix Sort feels like a bit of a leaky
abstraction. While it might fit into padding on some 64-bit platforms
(keeping the size at 32 bytes), relying on padding is fragile. If the
struct size increases, it reduces the number of tuples work_mem can
hold, affecting all sort operations, not just integers.

2. Cache Invalidation (Read vs. Write):

Currently, radix_sort_tuple writes to tup->current_byte in every pass.
This turns what could be a read-only access to the tuple array into a
read-write access, potentially dirtying cache lines and causing
unnecessary memory traffic.

Proposal: On-the-fly Normalization

We can avoid storing current_byte by calculating the relevant byte on
the fly. While this means re-calculating the byte extraction, we can
optimize normalize_datum.

The normalization logic (mapping signed integers to unsigned space)
essentially flips the sign bit. Instead of doing a full 64-bit
normalization for every byte extraction, we can apply this
transformation only when extracting the most significant byte (MSB).

The logic could look like this:

```
/*
 * Extract the byte at 'level' from 'key'.
 * level 0 denotes the Most Significant Byte.
 */
static inline uint8_t
extract_raw_byte(Datum key, int level)
{
    return (key >> (((SIZEOF_DATUM - 1) - level) * 8)) & 0xFF;
}

/*
 * Extract the "logically normalized" byte at 'level'.
 *
 * This effectively replaces the need to call normalize_datum() on
 * the full key.
 * - For non-MSB levels: return the raw byte.
 * - For the MSB level: flip the sign bit if the comparator
 * requires it.
 * - If sorting DESC: invert the result.
 */
static inline uint8_t
radix_extract_byte(Datum orig, int level, SortSupport ssup)
{
    uint8_t byte;

    /* 1. Extract raw byte first */
    byte = extract_raw_byte(orig, level);

    /*
     * 2. Apply normalization only to the Most Significant Byte.
     *
     * Mathematically, normalizing a signed integer for unsigned
     * comparison is equivalent to flipping the sign bit (adding
     * 2^(Width-1)).
     * In the byte domain, this means XORing the MSB with 0x80.
     */
    if (level == 0)
    {
        if (ssup->comparator == ssup_datum_signed_cmp ||
            ssup->comparator == ssup_datum_int32_cmp)
        {
            byte ^= 0x80;
        }
        else
        {
            Assert(ssup->comparator == ssup_datum_unsigned_cmp);
        }
    }

    /*
     * 3. Handle Reverse Sort
     * Instead of inverting the whole Datum, we just invert the
     * current byte.
     */
    if (ssup->ssup_reverse)
        byte = ~byte;

    return byte;
}
```

Benefits:

1. Keeps SortTuple minimal: No risk of struct bloat.
2. Read-only passes: The tuple array remains clean in cache during the
counting phase.
3. Reduced instruction count: We avoid the overhead of full
normalize_datum calls for lower bytes.

I'm happy to help verify this approach if you think it's worth
pursuing.

Best regards,
Chengpeng Yan