Displaying accumulated autovacuum cost

Greg Smith <greg@2ndquadrant.com>
From: Greg Smith <greg@2ndQuadrant.com>
To: PostgreSQL-development <pgsql-hackers@postgresql.org>
Date: 2011-08-17T21:04:33Z
Lists: pgsql-hackers
Attachments

pgbench-vacuum-stats.csv (text/csv)
vacuum_stats_v2.patch (text/x-patch)
Attached is a patch that tracks and displays the accumulated cost when 
autovacuum is running.  Code by Noah Misch and myself.  I hope this idea 
will bring a formal process to vacuum tuning, which is currently too 
hard to do.  I was about to add "without..." to that, but I then 
realized it needs no disclaimer; it's just too hard, period.  Vacuum 
issues are enemy #1 at all the terabyte scale customer sites I've been 
fighting with lately.

The patch updates the command string just before the workers sleep to 
show how much work they've done so far.  And at the end, it adds a few 
new lines to the information written to the logs, when the autovacuum is 
notable enough to be logged at all.  The overhead it adds is at most a 
few integer operations per buffer processed and a slower title string 
update once per sleep.  It's trivial compared to both the vacuum itself, 
and to the instrumentation's value to sites with vacuum issues.

To demonstrate the patch in action, here's a test case using a 6.4GB 
pgbench_accounts table:

$ createdb pgbench
$ pgbench -i -s 500 pgbench
$ psql -d pgbench -c "select pg_relation_size('pgbench_accounts');"
  pg_relation_size
------------------
        6714761216
$ psql -d pgbench -c "select relname,relpages from pg_class where 
relname='pgbench_accounts';"
      relname      | relpages
------------------+----------
  pgbench_accounts |   819673
$psql -d pgbench -c "delete from pgbench_accounts where aid<20000000"

You can see the new information in the command string with ps and grep:

$ while [ 1 ] ; do (ps -eaf | grep "[a]utovacuum worker" && sleep 60) ; done
gsmith    2687 17718  0 15:44 ?        00:00:00 postgres: autovacuum 
worker process   h=19 m=14196 d=14185
...
gsmith    2687 17718  0 15:44 ?        00:00:09 postgres: autovacuum 
worker process   h=182701 m=301515 d=321345
...
gsmith    2687 17718  1 15:44 ?        00:00:23 postgres: autovacuum 
worker process   h=740359 m=679987 d=617559
...

That's accumulated hit/miss/dirty counts, the raw numbers.  When the 
autovacuum is finished, those totals appear as a new line in the log entry:

LOG:  automatic vacuum of table "pgbench.public.pgbench_accounts": index 
scans: 1
     pages: 0 removed, 819673 remain
     tuples: 19999999 removed, 30000022 remain
     buffer usage: 809537 hits, 749340 misses, 686660 dirtied
     system usage: CPU 5.70s/19.73u sec elapsed 2211.60 sec

To check if this makes sense, we need the relevant parameters at the 
time, which were the defaults (I only tweaked some basic config bits 
here, including shared_buffers=400MB so a bit more was cached):

vacuum_cost_page_hit = 1       # 0-10000 credits
vacuum_cost_page_miss = 10     # 0-10000 credits
vacuum_cost_page_dirty = 20        # 0-10000 credits
vacuum_cost_limit = 200        # 1-10000 credits
autovacuum_vacuum_cost_delay = 20ms

Every 20ms equals 50 times/second.  That means the cost accumulation 
should be 200 * 50 = 10000 / second, or 600K/minute.  That's how fast 
the cost should be increasing here.  Given a runtime of 2211.60 seconds, 
that's a total estimated cost of 2209.15 * 10000 = 22,091,500.  Now we 
check that against the totals printed at the end of the vacuum:

1 * 809537 hits=809,537
10 * 749340 misses=7,493,400
20 * 686607 dirtied=13,732,140

And that gives a directly computed total of 22,035,077.  Close enough to 
show this is working as expected.  And how I computed all that should 
give you an idea how you might use these numbers to extract other useful 
statistics, if you'd like to tune the balance of various cost_page_* 
parameters as one example.  I have no idea how anyone could ever set 
those relative to one another without this data, it would take epic 
guessing skills.

What else can do you do with this data?

-Figure out if the VACUUM is still making progress when it appears stuck
-Estimate how long it will take to finish, based on current progress and 
whatever total cost was logged last time VACUUM ran against this relation.
-Compute approximate hit rate on the read side.  OS caching issues and 
the ring buffer are obviously a problem with that, this isn't too valuable.
-Can see the cost split when multiple vacuums are running.  This problem 
is why sites can't just use "total time to vacuum" as a useful proxy to 
estimate how long one will take to run.
-Easy to track the read/write ratio
-Directly measure the write rate

That last one is I think the part people are most perplexed by right 
now, and this makes it trivial.  How do you turn all these cost figures 
into real-world read/write rates?  It's been hard to do.

Now, you can take a bunch of samples of the data at 1 minute intervals, 
like my little "ps | grep" example above does.  The delta in the 
"dirty=" column is how much was written per minute, in units of 8K 
(usually) buffers.  Multiply that by 8192/(60*1024*1024), and you get 
MB/s out of there.  I collected that data for a cleanup run of the 
pgbench_accounts damage done above, CSV file with all the statistics is 
attached.

I also collected OS level stats from Linux about the actual read/write 
rate of the process, converted into "Write Mbps" (those are actually in 
MB/s, sloppy capitalization is via OpenOffice "autocorrect").  Those 
numbers are close enough to make me confident the dirty buffer totals 
tracked here do turn into useful MB/s values.  Sample of the most 
interesting part:

Cost Delta    Dirty Mbps    Write Mbps
589,890    2.56    2.73
591,151    2.57    2.73
589,035    2.56    2.72
593,775    3.14    0.20
599,420    2.05    0.00
598,503    2.05    0.00
599,421    2.05    0.00
574,046    0.60    0.01
574,779    0.64    0.67
609,140    2.56    2.68
612,397    2.57    2.69
611,744    2.57    2.69
610,008    2.56    2.68

This shows the expected 600K/minute cost accumulation.  And using the 
dirty= numbers to compute MB/s of write speed closely matches the total 
write speed of this process.  Some of the difference might be I/O to 
other things besides the main table here, some of it is just because OS 
write caching will influence the write rate.  In the spots where the OS 
value and what's derived from the dirty rate diverge most, it appears to 
be because vacuum is filling Linux's write cache.  Actual writes 
accumulated against the process them block for a while.  It's a small 
difference most of the time.

I'd be willing to accept a "Dirty MB/s" figure computed this way as 
accurate enough for most purposes.  And this patch lets you get that 
data, currently unavailable without poking into the OS statistics (if at 
all), just by doing a little log file and/or command string scraping.  
Total at the end or real-time monitoring, based on how much work you 
want to put into it.  For a busy site where one or more autovacuum 
processes are running most of the time, being able to monitor the vacuum 
portion of the I/O this way will be a huge improvement over the current 
state of things.  I already have a stack of tools built on top of this 
data I'm working on, and they're making it much easier to come up with 
an iterative tuning process for autovacuum.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us