Thread

  1. Re: index-only scans

    Oleg Bartunov <oleg@sai.msu.su> — 2011-08-12T13:09:49Z

    Robert,
    
    I imagine we store positional information in gin index and return
    tuples in relevant order - instant full-text search ! 
    Great work, guys !
    
    Oleg
    
    On Thu, 11 Aug 2011, Robert Haas wrote:
    
    > Please find attached a patch implementing a basic version of
    > index-only scans.  This patch is the work of my colleague Ibrar Ahmed
    > and myself, and also incorporates some code from previous patches
    > posted by Heikki Linnakanagas.
    >
    > I'm able to demonstrate a significant performance benefit from this
    > patch, even in a situation where it doesn't save any disk I/O, due to
    > reduced thrashing of shared_buffers.  Configuration settings:
    >
    > max_connections = 100
    > shared_buffers = 400MB
    > maintenance_work_mem = 256MB
    > synchronous_commit = off
    > checkpoint_segments = 100
    > checkpoint_timeout = 30min
    > checkpoint_completion_target = 0.9
    > checkpoint_warning = 90s
    > seq_page_cost = 1.0
    > random_page_cost = 1.0
    > effective_cache_size = 3GB
    >
    > Test setup:
    >
    > pgbench -i -s 50
    > create table sample_data as select (random()*5000000)::int as aid,
    > repeat('a', 1000) as filler from generate_series(1,100000);
    >
    > Test queries:
    >
    > select sum(aid) from sample_data a1 where exists (select * from
    > pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);
    > select sum(aid) from sample_data a1 where exists (select * from
    > pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);
    >
    > On my laptop, the first query executes in about 555 ms, while the
    > second one takes about 1125 ms.  Inspection via pg_buffercache reveals
    > that the second one thrashes shared_buffers something fierce, while
    > the first one does not.  You can actually get the time for the first
    > query down to about 450 ms if you can persuade PostgreSQL to cache the
    > entire sample_data table - which is difficult, due the
    > BufferAccessStrategy stuff - and as soon as you run the second query,
    > it blows the table out of cache, so practically speaking you're not
    > going to get that faster time very often.  I expect that you could get
    > an even larger benefit from this type of query if you could avoid
    > actual disk I/O, rather than just buffer cache thrashing, but I
    > haven't come up with a suitable test cases for that yet (volunteers?).
    >
    > There are several things about this patch that probably need further
    > thought and work, or at least some discussion.
    >
    > 1. The way that nodeIndexscan.c builds up the faux heap tuple is
    > perhaps susceptible to improvement.  I thought about building a
    > virtual tuple, but then what do I do with an OID column, if I have
    > one?  Or maybe this should be done some other way altogether.
    >
    > 2. Suppose we scan one tuple on a not-all-visible page followed by 99
    > tuples on all-visible pages.  The code as written will hold the pin on
    > the first heap page for the entire scan.  As soon as we hit the end of
    > the scan or another tuple where we have to actually visit the page,
    > the old pin will be released, but until then we hold onto it.  This
    > isn't totally stupid: the next tuple that's on a not-all-visible page
    > could easily be on the same not-all-visible page we already have
    > pinned.  And in 99% cases holding the pin for slightly longer is
    > probably completely harmless.  On the flip side, it could increase the
    > chances of interfering with VACUUM.  Then again, a non-index-only scan
    > would still interfere with the same VACUUM, so maybe we don't care.
    >
    > 3. The code in create_index_path() builds up a bitmapset of heap
    > attributes that get used for any purpose anywhere in the query, and
    > hangs it on the RelOptInfo so it doesn't need to be rebuilt for every
    > index under consideration.  However, if it were somehow possible to
    > have the rel involved without using any attributes at all, we'd
    > rebuild the cache over and over, since it would never become non-NULL.
    > I don't think that can happen right now, but future planner changes
    > might make it possible.
    >
    > 4. There are a couple of cases that use index-only scans even though
    > the EXPLAIN output sort of makes it look like they shouldn't.  For
    > example, in the above queries, an index-only scan is chosen even
    > though the query does "SELECT *" from the table being scanned.  Even
    > though the EXPLAIN (VERBOSE) output makes it look otherwise, it seems
    > that the target list of an EXISTS query is in fact discarded, e.g.:
    >
    > create or replace function error() returns int as $$begin select 1/0;
    > end$$ language plpgsql;
    > select * from pgbench_accounts a where exists (select error() from
    > pgbench_branches b where b.bid = a.aid); -- doesn't error out!
    >
    > Along similar lines, COUNT(*) does not preclude an index-only scan,
    > because the * is apparently just window dressing.  You'll still get
    > just a seq-scan unless you have an indexable qual in the query
    > somewhere, because...
    >
    > 5. We haven't made any planner changes at all, not even for cost
    > estimation.  It is not clear to me what the right way to do cost
    > estimation here is.  It seems that it would be useful to know what
    > percentage of the table is all-visible at planning time, but even if
    > we did, there could (and probably often will) be correlation between
    > the pages the query is interested in and which visibility map bits are
    > set. So I'm concerned about being overly optimistic about the cost
    > savings.  Also, there's the problem of figuring out how to keep the
    > percent-of-table-that-has-the-vm-bits-set statistic up to date.  Maybe
    > that's a job for ANALYZE but I haven't thought about it much yet.
    >
    > 6. I'm sure there are probably some statements in the documentation
    > that need updating, but I haven't tracked 'em down yet.
    >
    > Comments, testing, review appreciated...
    >
    >
    
     	Regards,
     		Oleg
    _____________________________________________________________
    Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
    Sternberg Astronomical Institute, Moscow University, Russia
    Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
    phone: +007(495)939-16-83, +007(495)939-23-83