Thread

write scalability

Robert Haas <robertmhaas@gmail.com> — 2011-07-25T20:07:14Z
I've long harbored a suspicion, based on some testing I did on my home
machine, that WALInsertLock is a big performance bottleneck.  But I
just did some benchmarking that doesn't entirely support that
contention.  This is on Nate Boley's 32-core machine, with the
following settings:

max_connections = 100
shared_buffers = 8GB
synchronous_commit = off
checkpoint_segments = 100
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9

I did 5-minute pgbench runs with unlogged tables and with permanent
tables, restarting the database server and reinitializing the tables
between each run.  The number at the beginning of each line is the
number of clients, while the p/u indicates which type of tables were
used:

01p tps = 628.626815 (including connections establishing)
01p tps = 635.394288 (including connections establishing)
01p tps = 634.972789 (including connections establishing)
08p tps = 3342.787325 (including connections establishing)
08p tps = 3883.876813 (including connections establishing)
08p tps = 3941.253567 (including connections establishing)
32p tps = 5597.472192 (including connections establishing)
32p tps = 5738.139895 (including connections establishing)
32p tps = 5794.490934 (including connections establishing)
80p tps = 4499.685286 (including connections establishing)
80p tps = 4917.060441 (including connections establishing)
80p tps = 5050.931933 (including connections establishing)

01u tps = 672.469142 (including connections establishing)
01u tps = 671.256686 (including connections establishing)
01u tps = 670.421003 (including connections establishing)
08u tps = 4087.749529 (including connections establishing)
08u tps = 3797.750851 (including connections establishing)
08u tps = 4181.393560 (including connections establishing)
32u tps = 8956.346905 (including connections establishing)
32u tps = 8898.442517 (including connections establishing)
32u tps = 8971.591569 (including connections establishing)
80u tps = 7287.550952 (including connections establishing)
80u tps = 7266.816989 (including connections establishing)
80u tps = 7255.968109 (including connections establishing)

The speed-up from using unlogged tables was not as large as I
expected.  Turning off synchronous_commit here removes commit rate as
the bottleneck, and I think perhaps the main benefit of unlogged
tables in that case is the avoidance of I/O, and apparently this
machine has enough I/O bandwidth, and just enough memory in general,
that that's not an issue.

With either type of tables, the 8 client results are about 6.1 times
the single core results - not great, but not terrible, either.  With
32 clients, there is some improvement: 13.4x vs. 9.1x, but even 13.4x
is a long way from linear.  vmstat reveals that CPU usage is
substantially less than 100%.  After some investigation, I found that
using unlogged tables wasn't actually getting rid of all the
write-ahead logging - the commit records were still being issued.  So
I hacked up RecordTransactionCommit() not to emit transaction commit
records in that case.  That doesn't actually completely eliminate the
WAL activity, because it still emits records for zeroing new SLRU
pages for CLOG, but it takes a big chunk out of it.  The upshot is
that this improved both raw performance and scalability, but not
dramatically.  Unlogged table results, with this change:

01h tps = 708.189337 (including connections establishing)
01h tps = 704.030175 (including connections establishing)
01h tps = 701.644199 (including connections establishing)
08h tps = 5196.615955 (including connections establishing)
08h tps = 5126.162200 (including connections establishing)
08h tps = 5067.568727 (including connections establishing)
32h tps = 10661.275636 (including connections establishing)
32h tps = 10621.085226 (including connections establishing)
32h tps = 10575.267197 (including connections establishing)
80h tps = 7557.965666 (including connections establishing)
80h tps = 7545.697547 (including connections establishing)
80h tps = 7556.379921 (including connections establishing)

Now the 32-client numbers have risen to 15.1x the single-client
numbers, but that's still not great.

What does this mean?  Well, XLogInsert does strikes me as an awfully
complex piece of code to be running with a hot LWLock held in
exclusive mode.  But even so, I think this certainly means that
WALInsertLock, at least on this workload, is not the whole problem...
in this test, I'm not only eliminating the overhead of inserting the
WAL, but also the overhead of writing it, flushing it, and generating
it.   So there is something, other than WAL insertion, which is taking
a big bite out of performance here.

As to what that something might be, I reran this last test with
LWLOCK_STATS enabled and here are the top things that are blocking:

lwlock 310: shacq 96846 exacq 108433 blk 16275
lwlock 3: shacq 64 exacq 2628381 blk 36027
lwlock 7: shacq 0 exacq 2628615 blk 85220
lwlock 11: shacq 84913908 exacq 4539551 blk 2119423
lwlock 4: shacq 28667307 exacq 2628524 blk 3356651

During this 5-minute test run, an LWLock acquisition blocked 6180335
times.  As you can see from the above results, ProcArrayLock accounts
for 54% of that blocking, and CLogControlLock accounts for another
34%.  lwlock 7 is WALInsertLock, which manages to account for more
than 1% of the blocking despite the fact that WAL has been largely
eliminated in this test; lwlock 3 is XidGenLock.  I dunno what LWLock
310 is but it doesn't matter.  After that there is a long tail of
other stuff with gradually decreasing amounts of blocking.  Experience
with the read scalability stuff has taught me also to look at which
LWLocks have the most shared acquisitions, as that can cause spinlock
and cache line contention.  The top few culprits are:

lwlock 504: shacq 5315030 exacq 0 blk 0
lwlock 45: shacq 5967317 exacq 13284 blk 1722
lwlock 39: shacq 8219988 exacq 13342 blk 2291
lwlock 5: shacq 26306020 exacq 0 blk 0
lwlock 4: shacq 28667307 exacq 2628524 blk 3356651
lwlock 11: shacq 84913908 exacq 4539551 blk 2119423

In all, there were 238777533 shared LWLock acquisitions during this
test: 35% CLogControlLock, 12% ProcArrayLock, 11% SInvalReadLock (soon
to be dealt with, as discussed elsewhere on-list), and then it gets
down into the lock manager locks and a few others.

I'm mulling over what to do about all of this, but thought I'd share
the raw numbers first, in case anyone's interested.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company