Thread

  1. sinval synchronization considered harmful

    Robert Haas <robertmhaas@gmail.com> — 2011-07-21T01:46:33Z

    For the last week or so, in between various other tasks, I've been
    trying to understand why performance drops off when you run pgbench -n
    -S -c $CLIENTS -j $CLIENTS -T $A_FEW_MINUTES at very high client
    counts.  The answer, in a word, is SIGetDataEntries().  I believe we
    need to bite the bullet and rewrite this using a lock-free algorithm,
    using memory barriers on processors with weak memory ordering.
    Perhaps there is another way to do it, but nothing I've tried has
    really worked so far, and I've tried quite a few things.  Here's the
    data.
    
    On unpatched master, performance scales pretty much linearly out to 32
    clients.  As you add more clients, it drops off:
    
    [1 client]
    tps = 4459.203159 (including connections establishing)
    tps = 4488.456559 (including connections establishing)
    tps = 4449.570530 (including connections establishing)
    [8 clients]
    tps = 33524.668762 (including connections establishing)
    tps = 33501.833907 (including connections establishing)
    tps = 33382.380908 (including connections establishing)
    [32 clients]
    tps = 178394.863906 (including connections establishing)
    tps = 178457.706972 (including connections establishing)
    tps = 179365.310179 (including connections establishing)
    [80 clients]
    tps = 132518.586371 (including connections establishing)
    tps = 130968.749747 (including connections establishing)
    tps = 132574.338942 (including connections establishing)
    
    With the lazy vxid locks patch, all of those numbers get better by a
    few percentage points (the more clients, the more points) except at 80
    clients, where the performance is sometimes better and sometimes
    worse.  These are 5-minute runs:
    
    [80 clients, with lazy vxid locks]
    tps = 119215.958372 (including connections establishing)
    tps = 113056.859871 (including connections establishing)
    tps = 160562.770998 (including connections establishing)
    
    I can't explain in detail why there is so much variation between the
    5-minute runs, or why some runs perform worse than without the lazy
    vxid locks, but I can say that apply the first of the two attached
    patches (sinval-per-backend-mutex.patch) fixes it.  The patch gets rid
    of SInvalReadLock and instead gives each backend its own spinlock.
    SICleanupQueue() must therefore get somewhat fuzzy answers, but it
    shouldn't matter.  The only real problem is if you need to do the
    super-duper-cleanup where you subtract a constant from all the values
    in unison.  I just got rid of that completely, by widening the
    counters to 64 bits.  Assuming I haven't botched the implementation,
    this is clearly a win.  There is still some performance drop-off going
    from 32 clients to 80 clients, but it is much less.
    
    [80 clients, with lazy vxid locks and sinval-per-backend-mutex]
    tps = 167392.042393 (including connections establishing)
    tps = 171336.145020 (including connections establishing)
    tps = 170500.529303 (including connections establishing)
    
    Profiling this combination of patches reveals that there is still some
    pretty ugly spinlock contention on sinval's msgNumLock.  And it occurs
    to me that on x86, we really don't need this lock ... or
    SInvalReadLock ... or a per-backend mutex.  The whole of
    SIGetDataEntries() can pretty easily be made lock-free.  The only real
    changes that seem to be are needed are (1) to use a 64-bit counter, so
    you never need to decrement and (2) to recheck resetState after
    reading the entries from the queue, to see if we got reset while we
    were reading those entries.  Since x86 guarantees that writes will
    become visible in the order they are executed, we only need to make
    sure that the compiler doesn't rearrange things.  As long as we first
    read the maxMsgNum and then read the messages, we can't read garbage.
    As long as we read the messages before we check resetState, we will be
    sure to notice if we got reset before we read all the messages (which
    is the only way that we can have read garbage messages).
    
    Now this implementation (sinval-unlocked.patch) is obviously not a
    serious proposal as it stands, because on processors with weak memory
    ordering it will presumably blow up all over the place.  But here are
    the results on x86:
    
    [80 clients, with lazy vxid locks and sinval-unlocked]
    tps = 203256.701227 (including connections establishing)
    tps = 190637.957571 (including connections establishing)
    tps = 190228.617178 (including connections establishing)
    
    Now, that's actually *faster* then the above 32-core numbers.  Turns
    out, with this approach, there's essentially no degradation between 32
    clients and 80 clients.  It appears that even one spinlock acquisition
    in SIGetDataEntries() is too many.  Currently, we have *three*: one to
    get SInvalReadLock, one to get msgnumLock, and one to release
    SInvalReadLock.
    
    For contrast, on a two-core MacBook Pro, I can't measure any
    difference at all between lazy-vxid,
    lazy-vxid+sinval-per-backend-mutex, and lazy-vxid+sinval-unlocked.
    The last might even be a touch slower, although there's enough noise
    that it's hard to tell for sure, and the effect is very small if there
    is one.  But on the 32-core machine, the differences are dramatic.
    
    Thoughts?  Comments?  Ideas?
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company