Thread

  1. index-only scans

    Robert Haas <robertmhaas@gmail.com> — 2011-08-11T20:06:08Z

    Please find attached a patch implementing a basic version of
    index-only scans.  This patch is the work of my colleague Ibrar Ahmed
    and myself, and also incorporates some code from previous patches
    posted by Heikki Linnakanagas.
    
    I'm able to demonstrate a significant performance benefit from this
    patch, even in a situation where it doesn't save any disk I/O, due to
    reduced thrashing of shared_buffers.  Configuration settings:
    
    max_connections = 100
    shared_buffers = 400MB
    maintenance_work_mem = 256MB
    synchronous_commit = off
    checkpoint_segments = 100
    checkpoint_timeout = 30min
    checkpoint_completion_target = 0.9
    checkpoint_warning = 90s
    seq_page_cost = 1.0
    random_page_cost = 1.0
    effective_cache_size = 3GB
    
    Test setup:
    
    pgbench -i -s 50
    create table sample_data as select (random()*5000000)::int as aid,
    repeat('a', 1000) as filler from generate_series(1,100000);
    
    Test queries:
    
    select sum(aid) from sample_data a1 where exists (select * from
    pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);
    select sum(aid) from sample_data a1 where exists (select * from
    pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);
    
    On my laptop, the first query executes in about 555 ms, while the
    second one takes about 1125 ms.  Inspection via pg_buffercache reveals
    that the second one thrashes shared_buffers something fierce, while
    the first one does not.  You can actually get the time for the first
    query down to about 450 ms if you can persuade PostgreSQL to cache the
    entire sample_data table - which is difficult, due the
    BufferAccessStrategy stuff - and as soon as you run the second query,
    it blows the table out of cache, so practically speaking you're not
    going to get that faster time very often.  I expect that you could get
    an even larger benefit from this type of query if you could avoid
    actual disk I/O, rather than just buffer cache thrashing, but I
    haven't come up with a suitable test cases for that yet (volunteers?).
    
    There are several things about this patch that probably need further
    thought and work, or at least some discussion.
    
    1. The way that nodeIndexscan.c builds up the faux heap tuple is
    perhaps susceptible to improvement.  I thought about building a
    virtual tuple, but then what do I do with an OID column, if I have
    one?  Or maybe this should be done some other way altogether.
    
    2. Suppose we scan one tuple on a not-all-visible page followed by 99
    tuples on all-visible pages.  The code as written will hold the pin on
    the first heap page for the entire scan.  As soon as we hit the end of
    the scan or another tuple where we have to actually visit the page,
    the old pin will be released, but until then we hold onto it.  This
    isn't totally stupid: the next tuple that's on a not-all-visible page
    could easily be on the same not-all-visible page we already have
    pinned.  And in 99% cases holding the pin for slightly longer is
    probably completely harmless.  On the flip side, it could increase the
    chances of interfering with VACUUM.  Then again, a non-index-only scan
    would still interfere with the same VACUUM, so maybe we don't care.
    
    3. The code in create_index_path() builds up a bitmapset of heap
    attributes that get used for any purpose anywhere in the query, and
    hangs it on the RelOptInfo so it doesn't need to be rebuilt for every
    index under consideration.  However, if it were somehow possible to
    have the rel involved without using any attributes at all, we'd
    rebuild the cache over and over, since it would never become non-NULL.
     I don't think that can happen right now, but future planner changes
    might make it possible.
    
    4. There are a couple of cases that use index-only scans even though
    the EXPLAIN output sort of makes it look like they shouldn't.  For
    example, in the above queries, an index-only scan is chosen even
    though the query does "SELECT *" from the table being scanned.  Even
    though the EXPLAIN (VERBOSE) output makes it look otherwise, it seems
    that the target list of an EXISTS query is in fact discarded, e.g.:
    
    create or replace function error() returns int as $$begin select 1/0;
    end$$ language plpgsql;
    select * from pgbench_accounts a where exists (select error() from
    pgbench_branches b where b.bid = a.aid); -- doesn't error out!
    
    Along similar lines, COUNT(*) does not preclude an index-only scan,
    because the * is apparently just window dressing.  You'll still get
    just a seq-scan unless you have an indexable qual in the query
    somewhere, because...
    
    5. We haven't made any planner changes at all, not even for cost
    estimation.  It is not clear to me what the right way to do cost
    estimation here is.  It seems that it would be useful to know what
    percentage of the table is all-visible at planning time, but even if
    we did, there could (and probably often will) be correlation between
    the pages the query is interested in and which visibility map bits are
    set. So I'm concerned about being overly optimistic about the cost
    savings.  Also, there's the problem of figuring out how to keep the
    percent-of-table-that-has-the-vm-bits-set statistic up to date.  Maybe
    that's a job for ANALYZE but I haven't thought about it much yet.
    
    6. I'm sure there are probably some statements in the documentation
    that need updating, but I haven't tracked 'em down yet.
    
    Comments, testing, review appreciated...
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company