Thread

index-only scans

Robert Haas <robertmhaas@gmail.com> — 2011-08-11T20:06:08Z
Please find attached a patch implementing a basic version of
index-only scans.  This patch is the work of my colleague Ibrar Ahmed
and myself, and also incorporates some code from previous patches
posted by Heikki Linnakanagas.

I'm able to demonstrate a significant performance benefit from this
patch, even in a situation where it doesn't save any disk I/O, due to
reduced thrashing of shared_buffers.  Configuration settings:

max_connections = 100
shared_buffers = 400MB
maintenance_work_mem = 256MB
synchronous_commit = off
checkpoint_segments = 100
checkpoint_timeout = 30min
checkpoint_completion_target = 0.9
checkpoint_warning = 90s
seq_page_cost = 1.0
random_page_cost = 1.0
effective_cache_size = 3GB

Test setup:

pgbench -i -s 50
create table sample_data as select (random()*5000000)::int as aid,
repeat('a', 1000) as filler from generate_series(1,100000);

Test queries:

select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);
select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);

On my laptop, the first query executes in about 555 ms, while the
second one takes about 1125 ms.  Inspection via pg_buffercache reveals
that the second one thrashes shared_buffers something fierce, while
the first one does not.  You can actually get the time for the first
query down to about 450 ms if you can persuade PostgreSQL to cache the
entire sample_data table - which is difficult, due the
BufferAccessStrategy stuff - and as soon as you run the second query,
it blows the table out of cache, so practically speaking you're not
going to get that faster time very often.  I expect that you could get
an even larger benefit from this type of query if you could avoid
actual disk I/O, rather than just buffer cache thrashing, but I
haven't come up with a suitable test cases for that yet (volunteers?).

There are several things about this patch that probably need further
thought and work, or at least some discussion.

1. The way that nodeIndexscan.c builds up the faux heap tuple is
perhaps susceptible to improvement.  I thought about building a
virtual tuple, but then what do I do with an OID column, if I have
one?  Or maybe this should be done some other way altogether.

2. Suppose we scan one tuple on a not-all-visible page followed by 99
tuples on all-visible pages.  The code as written will hold the pin on
the first heap page for the entire scan.  As soon as we hit the end of
the scan or another tuple where we have to actually visit the page,
the old pin will be released, but until then we hold onto it.  This
isn't totally stupid: the next tuple that's on a not-all-visible page
could easily be on the same not-all-visible page we already have
pinned.  And in 99% cases holding the pin for slightly longer is
probably completely harmless.  On the flip side, it could increase the
chances of interfering with VACUUM.  Then again, a non-index-only scan
would still interfere with the same VACUUM, so maybe we don't care.

3. The code in create_index_path() builds up a bitmapset of heap
attributes that get used for any purpose anywhere in the query, and
hangs it on the RelOptInfo so it doesn't need to be rebuilt for every
index under consideration.  However, if it were somehow possible to
have the rel involved without using any attributes at all, we'd
rebuild the cache over and over, since it would never become non-NULL.
 I don't think that can happen right now, but future planner changes
might make it possible.

4. There are a couple of cases that use index-only scans even though
the EXPLAIN output sort of makes it look like they shouldn't.  For
example, in the above queries, an index-only scan is chosen even
though the query does "SELECT *" from the table being scanned.  Even
though the EXPLAIN (VERBOSE) output makes it look otherwise, it seems
that the target list of an EXISTS query is in fact discarded, e.g.:

create or replace function error() returns int as $$begin select 1/0;
end$$ language plpgsql;
select * from pgbench_accounts a where exists (select error() from
pgbench_branches b where b.bid = a.aid); -- doesn't error out!

Along similar lines, COUNT(*) does not preclude an index-only scan,
because the * is apparently just window dressing.  You'll still get
just a seq-scan unless you have an indexable qual in the query
somewhere, because...

5. We haven't made any planner changes at all, not even for cost
estimation.  It is not clear to me what the right way to do cost
estimation here is.  It seems that it would be useful to know what
percentage of the table is all-visible at planning time, but even if
we did, there could (and probably often will) be correlation between
the pages the query is interested in and which visibility map bits are
set. So I'm concerned about being overly optimistic about the cost
savings.  Also, there's the problem of figuring out how to keep the
percent-of-table-that-has-the-vm-bits-set statistic up to date.  Maybe
that's a job for ANALYZE but I haven't thought about it much yet.

6. I'm sure there are probably some statements in the documentation
that need updating, but I haven't tracked 'em down yet.

Comments, testing, review appreciated...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company