Thread

  1. Re: Optimize LISTEN/NOTIFY

    Joel Jacobson <joel@compiler.org> — 2025-11-15T21:53:52Z

    On Fri, Nov 14, 2025, at 17:01, Joel Jacobson wrote:
    > On Thu, Nov 13, 2025, at 08:13, Joel Jacobson wrote:
    >> Attached, please find a new version rebased on top of the bug fix
    >> patches that just got committed in 0bdc777, 797e9ea, 8eeb4a0, and
    >> 1b46990.
    >
    > To help reviewers, here is a new write-up of the patch:
    > [...write-up...]
    
    While reviewing all the comments in async.c to make sure they match the
    patch's code, I noticed a few discrepancies. One of them was the comment
    above QUEUE_CLEANUP_DELAY, which again made me think about how master
    currently uses that value as the threshold for when to "wake laggers".
    
    In this patch, QUEUE_CLEANUP_DELAY is no longer used for that
    purpose; it now only determines how often we try to advance the tail
    pointer.
    
    I realize someone who is familiar with the current code in master,
    might ask the following question:
    
        Why not do direct advancement, but just use the old "wake laggers"
        logic for listeners that lag behind more than QUEUE_CLEANUP_DELAY?
    
    On the surface it might look like a plausible alternative,
    since that's what master currently does (but for other databases).
    
    I was curious to see how such alternative approach would affect
    performance, so I changed SignalBackends and ran some benchmarks.
    
    Below is the v27 logic:
    
    ```
    if (QUEUE_BACKEND_IS_ADVANCING(i) ?
        QUEUE_POS_PRECEDES(QUEUE_BACKEND_ADVANCING_POS(i), queueHeadAfterWrite) :
        QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite))
    {
        Assert(pid != InvalidPid);
        QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
        pids[count] = pid;
        procnos[count] = i;
        count++;
    }
    else if (!QUEUE_BACKEND_IS_ADVANCING(i) &&
              QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite))
    {
        Assert(!QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite));
        QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
    }
    ```
    
    And here is the modified "wake laggers" version I tested:
    
    ```
    if (!QUEUE_BACKEND_IS_ADVANCING(i) &&
        !QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite) &&
         QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite))
    {
        QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
    }
    else if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
                                QUEUE_POS_PAGE(pos)) >= QUEUE_CLEANUP_DELAY)
    {
        QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
        pids[count] = pid;
        procnos[count] = i;
        count++;
    }
    ```
    
    This preserves direct advancement under the same conditions as the patch,
    but only sends signals to backends that are "laggers" by
    the QUEUE_CLEANUP_DELAY threshold, similar to master's behavior.
    
    It turns out the "wake laggers" approach is significantly slower:
    
    "wake laggers":
    
    ./pg_async_notify_test --listeners 1 --notifiers 1 --channels 1000 --sleep 0.01 --sleep-exp 1.01
    20 s: 80871 sent (4650/s), 80871 received (4650/s)
    Notification Latency Distribution:
     0.00-0.01ms                0 (0.0%) avg: 0.000ms
     0.01-0.10ms     #          380 (0.5%) avg: 0.085ms
     0.10-1.00ms     #          2765 (3.4%) avg: 0.289ms
     1.00-10.00ms    #          4947 (6.1%) avg: 5.870ms
     10.00-100.00ms  #######    57467 (71.1%) avg: 52.101ms
    >100.00ms       #          15312 (18.9%) avg: 119.847ms
    
    v27:
    
    % ./pg_async_notify_test --listeners 1 --notifiers 1 --channels 1000 --sleep 0.01 --sleep-exp 1.01
    20 s: 229866 sent (13985/s), 229865 received (13984/s)
    Notification Latency Distribution:
     0.00-0.01ms                0 (0.0%) avg: 0.000ms
     0.01-0.10ms     #          18351 (8.0%) avg: 0.072ms
     0.10-1.00ms     #          32899 (14.3%) avg: 0.315ms
     1.00-10.00ms    ####       103273 (44.9%) avg: 5.055ms
     10.00-100.00ms  ###        75342 (32.8%) avg: 18.197ms
    >100.00ms                  0 (0.0%) avg: 0.000ms
    
    I reran the experiments a three times with similar results.
    Also tested other test permutaions, that also showed "wake laggers" was worse.
    
    At first glance, both approaches ultimately signal all backends
    interested in our notifies, so it may seem surprising that latency
    differs this much. The key point is what happens to non-interested
    backends:
    
    A backend that is *not* listening to our channels may have stopped at a
    position before the old queue head, because it woke up and read the head
    before all notifies for a previous commit were written. Such backend
    might actually be interested in the notifies that lie in between its
    current position and the old queue head, and it could therefore be
    urgent to wake it up to make it process the queue and delivery the
    notifies.
    
    In v27, such a backend gets signaled on the next NOTIFY, when it notice
    it is stationary at a pos behind the queueHeadBeforeWrite.
    
    With "wake laggers", however, it receives no signal until it becomes a
    "lagger" by QUEUE_CLEANUP_DELAY pages, which can be far later. This
    risks delaying its processing and the delivery of notifications it is
    interested in.
    
    In master this is fine because "wake laggers" is only applied to
    backends in other databases that we know are not interested in our
    notifications.
    
    In conclusion, the "wake laggers" mechanism seems inherently
    incompatible with the direct advancement mechanism, and I therefore
    think the current approach in v27 is sound.
    
    I just wanted to share this reasoning, not because anyone has raised any
    concerns about this, but because I hadn't fully internalized this myself,
    and thought the reasoning could possibly be helpful to others.
    
    The attached v28 is the same as v27, except some comments have been
    fixed to accurately reflect the code.
    
    /Joel