Thread

Re: index-only scans

Oleg Bartunov <oleg@sai.msu.su> — 2011-08-12T13:09:49Z
Robert,

I imagine we store positional information in gin index and return
tuples in relevant order - instant full-text search ! 
Great work, guys !

Oleg

On Thu, 11 Aug 2011, Robert Haas wrote:

> Please find attached a patch implementing a basic version of
> index-only scans.  This patch is the work of my colleague Ibrar Ahmed
> and myself, and also incorporates some code from previous patches
> posted by Heikki Linnakanagas.
>
> I'm able to demonstrate a significant performance benefit from this
> patch, even in a situation where it doesn't save any disk I/O, due to
> reduced thrashing of shared_buffers.  Configuration settings:
>
> max_connections = 100
> shared_buffers = 400MB
> maintenance_work_mem = 256MB
> synchronous_commit = off
> checkpoint_segments = 100
> checkpoint_timeout = 30min
> checkpoint_completion_target = 0.9
> checkpoint_warning = 90s
> seq_page_cost = 1.0
> random_page_cost = 1.0
> effective_cache_size = 3GB
>
> Test setup:
>
> pgbench -i -s 50
> create table sample_data as select (random()*5000000)::int as aid,
> repeat('a', 1000) as filler from generate_series(1,100000);
>
> Test queries:
>
> select sum(aid) from sample_data a1 where exists (select * from
> pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);
> select sum(aid) from sample_data a1 where exists (select * from
> pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);
>
> On my laptop, the first query executes in about 555 ms, while the
> second one takes about 1125 ms.  Inspection via pg_buffercache reveals
> that the second one thrashes shared_buffers something fierce, while
> the first one does not.  You can actually get the time for the first
> query down to about 450 ms if you can persuade PostgreSQL to cache the
> entire sample_data table - which is difficult, due the
> BufferAccessStrategy stuff - and as soon as you run the second query,
> it blows the table out of cache, so practically speaking you're not
> going to get that faster time very often.  I expect that you could get
> an even larger benefit from this type of query if you could avoid
> actual disk I/O, rather than just buffer cache thrashing, but I
> haven't come up with a suitable test cases for that yet (volunteers?).
>
> There are several things about this patch that probably need further
> thought and work, or at least some discussion.
>
> 1. The way that nodeIndexscan.c builds up the faux heap tuple is
> perhaps susceptible to improvement.  I thought about building a
> virtual tuple, but then what do I do with an OID column, if I have
> one?  Or maybe this should be done some other way altogether.
>
> 2. Suppose we scan one tuple on a not-all-visible page followed by 99
> tuples on all-visible pages.  The code as written will hold the pin on
> the first heap page for the entire scan.  As soon as we hit the end of
> the scan or another tuple where we have to actually visit the page,
> the old pin will be released, but until then we hold onto it.  This
> isn't totally stupid: the next tuple that's on a not-all-visible page
> could easily be on the same not-all-visible page we already have
> pinned.  And in 99% cases holding the pin for slightly longer is
> probably completely harmless.  On the flip side, it could increase the
> chances of interfering with VACUUM.  Then again, a non-index-only scan
> would still interfere with the same VACUUM, so maybe we don't care.
>
> 3. The code in create_index_path() builds up a bitmapset of heap
> attributes that get used for any purpose anywhere in the query, and
> hangs it on the RelOptInfo so it doesn't need to be rebuilt for every
> index under consideration.  However, if it were somehow possible to
> have the rel involved without using any attributes at all, we'd
> rebuild the cache over and over, since it would never become non-NULL.
> I don't think that can happen right now, but future planner changes
> might make it possible.
>
> 4. There are a couple of cases that use index-only scans even though
> the EXPLAIN output sort of makes it look like they shouldn't.  For
> example, in the above queries, an index-only scan is chosen even
> though the query does "SELECT *" from the table being scanned.  Even
> though the EXPLAIN (VERBOSE) output makes it look otherwise, it seems
> that the target list of an EXISTS query is in fact discarded, e.g.:
>
> create or replace function error() returns int as $$begin select 1/0;
> end$$ language plpgsql;
> select * from pgbench_accounts a where exists (select error() from
> pgbench_branches b where b.bid = a.aid); -- doesn't error out!
>
> Along similar lines, COUNT(*) does not preclude an index-only scan,
> because the * is apparently just window dressing.  You'll still get
> just a seq-scan unless you have an indexable qual in the query
> somewhere, because...
>
> 5. We haven't made any planner changes at all, not even for cost
> estimation.  It is not clear to me what the right way to do cost
> estimation here is.  It seems that it would be useful to know what
> percentage of the table is all-visible at planning time, but even if
> we did, there could (and probably often will) be correlation between
> the pages the query is interested in and which visibility map bits are
> set. So I'm concerned about being overly optimistic about the cost
> savings.  Also, there's the problem of figuring out how to keep the
> percent-of-table-that-has-the-vm-bits-set statistic up to date.  Maybe
> that's a job for ANALYZE but I haven't thought about it much yet.
>
> 6. I'm sure there are probably some statements in the documentation
> that need updating, but I haven't tracked 'em down yet.
>
> Comments, testing, review appreciated...
>
>

 	Regards,
 		Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83