Thread

  1. Displaying accumulated autovacuum cost

    Greg Smith <greg@2ndquadrant.com> — 2011-08-17T21:04:33Z

    Attached is a patch that tracks and displays the accumulated cost when 
    autovacuum is running.  Code by Noah Misch and myself.  I hope this idea 
    will bring a formal process to vacuum tuning, which is currently too 
    hard to do.  I was about to add "without..." to that, but I then 
    realized it needs no disclaimer; it's just too hard, period.  Vacuum 
    issues are enemy #1 at all the terabyte scale customer sites I've been 
    fighting with lately.
    
    The patch updates the command string just before the workers sleep to 
    show how much work they've done so far.  And at the end, it adds a few 
    new lines to the information written to the logs, when the autovacuum is 
    notable enough to be logged at all.  The overhead it adds is at most a 
    few integer operations per buffer processed and a slower title string 
    update once per sleep.  It's trivial compared to both the vacuum itself, 
    and to the instrumentation's value to sites with vacuum issues.
    
    To demonstrate the patch in action, here's a test case using a 6.4GB 
    pgbench_accounts table:
    
    $ createdb pgbench
    $ pgbench -i -s 500 pgbench
    $ psql -d pgbench -c "select pg_relation_size('pgbench_accounts');"
      pg_relation_size
    ------------------
            6714761216
    $ psql -d pgbench -c "select relname,relpages from pg_class where 
    relname='pgbench_accounts';"
          relname      | relpages
    ------------------+----------
      pgbench_accounts |   819673
    $psql -d pgbench -c "delete from pgbench_accounts where aid<20000000"
    
    You can see the new information in the command string with ps and grep:
    
    $ while [ 1 ] ; do (ps -eaf | grep "[a]utovacuum worker" && sleep 60) ; done
    gsmith    2687 17718  0 15:44 ?        00:00:00 postgres: autovacuum 
    worker process   h=19 m=14196 d=14185
    ...
    gsmith    2687 17718  0 15:44 ?        00:00:09 postgres: autovacuum 
    worker process   h=182701 m=301515 d=321345
    ...
    gsmith    2687 17718  1 15:44 ?        00:00:23 postgres: autovacuum 
    worker process   h=740359 m=679987 d=617559
    ...
    
    That's accumulated hit/miss/dirty counts, the raw numbers.  When the 
    autovacuum is finished, those totals appear as a new line in the log entry:
    
    LOG:  automatic vacuum of table "pgbench.public.pgbench_accounts": index 
    scans: 1
         pages: 0 removed, 819673 remain
         tuples: 19999999 removed, 30000022 remain
         buffer usage: 809537 hits, 749340 misses, 686660 dirtied
         system usage: CPU 5.70s/19.73u sec elapsed 2211.60 sec
    
    To check if this makes sense, we need the relevant parameters at the 
    time, which were the defaults (I only tweaked some basic config bits 
    here, including shared_buffers=400MB so a bit more was cached):
    
    vacuum_cost_page_hit = 1       # 0-10000 credits
    vacuum_cost_page_miss = 10     # 0-10000 credits
    vacuum_cost_page_dirty = 20        # 0-10000 credits
    vacuum_cost_limit = 200        # 1-10000 credits
    autovacuum_vacuum_cost_delay = 20ms
    
    Every 20ms equals 50 times/second.  That means the cost accumulation 
    should be 200 * 50 = 10000 / second, or 600K/minute.  That's how fast 
    the cost should be increasing here.  Given a runtime of 2211.60 seconds, 
    that's a total estimated cost of 2209.15 * 10000 = 22,091,500.  Now we 
    check that against the totals printed at the end of the vacuum:
    
    1 * 809537 hits=809,537
    10 * 749340 misses=7,493,400
    20 * 686607 dirtied=13,732,140
    
    And that gives a directly computed total of 22,035,077.  Close enough to 
    show this is working as expected.  And how I computed all that should 
    give you an idea how you might use these numbers to extract other useful 
    statistics, if you'd like to tune the balance of various cost_page_* 
    parameters as one example.  I have no idea how anyone could ever set 
    those relative to one another without this data, it would take epic 
    guessing skills.
    
    What else can do you do with this data?
    
    -Figure out if the VACUUM is still making progress when it appears stuck
    -Estimate how long it will take to finish, based on current progress and 
    whatever total cost was logged last time VACUUM ran against this relation.
    -Compute approximate hit rate on the read side.  OS caching issues and 
    the ring buffer are obviously a problem with that, this isn't too valuable.
    -Can see the cost split when multiple vacuums are running.  This problem 
    is why sites can't just use "total time to vacuum" as a useful proxy to 
    estimate how long one will take to run.
    -Easy to track the read/write ratio
    -Directly measure the write rate
    
    That last one is I think the part people are most perplexed by right 
    now, and this makes it trivial.  How do you turn all these cost figures 
    into real-world read/write rates?  It's been hard to do.
    
    Now, you can take a bunch of samples of the data at 1 minute intervals, 
    like my little "ps | grep" example above does.  The delta in the 
    "dirty=" column is how much was written per minute, in units of 8K 
    (usually) buffers.  Multiply that by 8192/(60*1024*1024), and you get 
    MB/s out of there.  I collected that data for a cleanup run of the 
    pgbench_accounts damage done above, CSV file with all the statistics is 
    attached.
    
    I also collected OS level stats from Linux about the actual read/write 
    rate of the process, converted into "Write Mbps" (those are actually in 
    MB/s, sloppy capitalization is via OpenOffice "autocorrect").  Those 
    numbers are close enough to make me confident the dirty buffer totals 
    tracked here do turn into useful MB/s values.  Sample of the most 
    interesting part:
    
    Cost Delta    Dirty Mbps    Write Mbps
    589,890    2.56    2.73
    591,151    2.57    2.73
    589,035    2.56    2.72
    593,775    3.14    0.20
    599,420    2.05    0.00
    598,503    2.05    0.00
    599,421    2.05    0.00
    574,046    0.60    0.01
    574,779    0.64    0.67
    609,140    2.56    2.68
    612,397    2.57    2.69
    611,744    2.57    2.69
    610,008    2.56    2.68
    
    This shows the expected 600K/minute cost accumulation.  And using the 
    dirty= numbers to compute MB/s of write speed closely matches the total 
    write speed of this process.  Some of the difference might be I/O to 
    other things besides the main table here, some of it is just because OS 
    write caching will influence the write rate.  In the spots where the OS 
    value and what's derived from the dirty rate diverge most, it appears to 
    be because vacuum is filling Linux's write cache.  Actual writes 
    accumulated against the process them block for a while.  It's a small 
    difference most of the time.
    
    I'd be willing to accept a "Dirty MB/s" figure computed this way as 
    accurate enough for most purposes.  And this patch lets you get that 
    data, currently unavailable without poking into the OS statistics (if at 
    all), just by doing a little log file and/or command string scraping.  
    Total at the end or real-time monitoring, based on how much work you 
    want to put into it.  For a busy site where one or more autovacuum 
    processes are running most of the time, being able to monitor the vacuum 
    portion of the I/O this way will be a huge improvement over the current 
    state of things.  I already have a stack of tools built on top of this 
    data I'm working on, and they're making it much easier to come up with 
    an iterative tuning process for autovacuum.
    
    -- 
    Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
    PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us