Thread

  1. write scalability

    Robert Haas <robertmhaas@gmail.com> — 2011-07-25T20:07:14Z

    I've long harbored a suspicion, based on some testing I did on my home
    machine, that WALInsertLock is a big performance bottleneck.  But I
    just did some benchmarking that doesn't entirely support that
    contention.  This is on Nate Boley's 32-core machine, with the
    following settings:
    
    max_connections = 100
    shared_buffers = 8GB
    synchronous_commit = off
    checkpoint_segments = 100
    checkpoint_timeout = 15min
    checkpoint_completion_target = 0.9
    
    I did 5-minute pgbench runs with unlogged tables and with permanent
    tables, restarting the database server and reinitializing the tables
    between each run.  The number at the beginning of each line is the
    number of clients, while the p/u indicates which type of tables were
    used:
    
    01p tps = 628.626815 (including connections establishing)
    01p tps = 635.394288 (including connections establishing)
    01p tps = 634.972789 (including connections establishing)
    08p tps = 3342.787325 (including connections establishing)
    08p tps = 3883.876813 (including connections establishing)
    08p tps = 3941.253567 (including connections establishing)
    32p tps = 5597.472192 (including connections establishing)
    32p tps = 5738.139895 (including connections establishing)
    32p tps = 5794.490934 (including connections establishing)
    80p tps = 4499.685286 (including connections establishing)
    80p tps = 4917.060441 (including connections establishing)
    80p tps = 5050.931933 (including connections establishing)
    
    01u tps = 672.469142 (including connections establishing)
    01u tps = 671.256686 (including connections establishing)
    01u tps = 670.421003 (including connections establishing)
    08u tps = 4087.749529 (including connections establishing)
    08u tps = 3797.750851 (including connections establishing)
    08u tps = 4181.393560 (including connections establishing)
    32u tps = 8956.346905 (including connections establishing)
    32u tps = 8898.442517 (including connections establishing)
    32u tps = 8971.591569 (including connections establishing)
    80u tps = 7287.550952 (including connections establishing)
    80u tps = 7266.816989 (including connections establishing)
    80u tps = 7255.968109 (including connections establishing)
    
    The speed-up from using unlogged tables was not as large as I
    expected.  Turning off synchronous_commit here removes commit rate as
    the bottleneck, and I think perhaps the main benefit of unlogged
    tables in that case is the avoidance of I/O, and apparently this
    machine has enough I/O bandwidth, and just enough memory in general,
    that that's not an issue.
    
    With either type of tables, the 8 client results are about 6.1 times
    the single core results - not great, but not terrible, either.  With
    32 clients, there is some improvement: 13.4x vs. 9.1x, but even 13.4x
    is a long way from linear.  vmstat reveals that CPU usage is
    substantially less than 100%.  After some investigation, I found that
    using unlogged tables wasn't actually getting rid of all the
    write-ahead logging - the commit records were still being issued.  So
    I hacked up RecordTransactionCommit() not to emit transaction commit
    records in that case.  That doesn't actually completely eliminate the
    WAL activity, because it still emits records for zeroing new SLRU
    pages for CLOG, but it takes a big chunk out of it.  The upshot is
    that this improved both raw performance and scalability, but not
    dramatically.  Unlogged table results, with this change:
    
    01h tps = 708.189337 (including connections establishing)
    01h tps = 704.030175 (including connections establishing)
    01h tps = 701.644199 (including connections establishing)
    08h tps = 5196.615955 (including connections establishing)
    08h tps = 5126.162200 (including connections establishing)
    08h tps = 5067.568727 (including connections establishing)
    32h tps = 10661.275636 (including connections establishing)
    32h tps = 10621.085226 (including connections establishing)
    32h tps = 10575.267197 (including connections establishing)
    80h tps = 7557.965666 (including connections establishing)
    80h tps = 7545.697547 (including connections establishing)
    80h tps = 7556.379921 (including connections establishing)
    
    Now the 32-client numbers have risen to 15.1x the single-client
    numbers, but that's still not great.
    
    What does this mean?  Well, XLogInsert does strikes me as an awfully
    complex piece of code to be running with a hot LWLock held in
    exclusive mode.  But even so, I think this certainly means that
    WALInsertLock, at least on this workload, is not the whole problem...
    in this test, I'm not only eliminating the overhead of inserting the
    WAL, but also the overhead of writing it, flushing it, and generating
    it.   So there is something, other than WAL insertion, which is taking
    a big bite out of performance here.
    
    As to what that something might be, I reran this last test with
    LWLOCK_STATS enabled and here are the top things that are blocking:
    
    lwlock 310: shacq 96846 exacq 108433 blk 16275
    lwlock 3: shacq 64 exacq 2628381 blk 36027
    lwlock 7: shacq 0 exacq 2628615 blk 85220
    lwlock 11: shacq 84913908 exacq 4539551 blk 2119423
    lwlock 4: shacq 28667307 exacq 2628524 blk 3356651
    
    During this 5-minute test run, an LWLock acquisition blocked 6180335
    times.  As you can see from the above results, ProcArrayLock accounts
    for 54% of that blocking, and CLogControlLock accounts for another
    34%.  lwlock 7 is WALInsertLock, which manages to account for more
    than 1% of the blocking despite the fact that WAL has been largely
    eliminated in this test; lwlock 3 is XidGenLock.  I dunno what LWLock
    310 is but it doesn't matter.  After that there is a long tail of
    other stuff with gradually decreasing amounts of blocking.  Experience
    with the read scalability stuff has taught me also to look at which
    LWLocks have the most shared acquisitions, as that can cause spinlock
    and cache line contention.  The top few culprits are:
    
    lwlock 504: shacq 5315030 exacq 0 blk 0
    lwlock 45: shacq 5967317 exacq 13284 blk 1722
    lwlock 39: shacq 8219988 exacq 13342 blk 2291
    lwlock 5: shacq 26306020 exacq 0 blk 0
    lwlock 4: shacq 28667307 exacq 2628524 blk 3356651
    lwlock 11: shacq 84913908 exacq 4539551 blk 2119423
    
    In all, there were 238777533 shared LWLock acquisitions during this
    test: 35% CLogControlLock, 12% ProcArrayLock, 11% SInvalReadLock (soon
    to be dealt with, as discussed elsewhere on-list), and then it gets
    down into the lock manager locks and a few others.
    
    I'm mulling over what to do about all of this, but thought I'd share
    the raw numbers first, in case anyone's interested.
    
    Thanks,
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company