Thread

Re: again on index usage

Zeugswetter Andreas ADI SD <zeugswettera@spardat.at> — 2002-01-11T11:46:42Z

> This topic seems to come up a lot.  Is there something we are missing in
> the FAQ?

Most of the reports we seem to see are the usual, "but the seq scan is actually 
faster" case. Daniel actually has a case where the optimizer chooses a bad plan.

The difficulty seems to be, that the optimizer cooses a correct plan for an idle
system, but with his workload the index path would be far better (2 vs 4 Minutes).

This is one of the main problems of the current optimizer which imho rather 
aggressively chooses seq scans over index scans. During high load this does 
not pay off. My preference would actually be a way to make the optimizer
choose a plan that causes minimal workload, and not shortest runtime 
(which will obviously only be fast with low overall workload)
The reasoning behind this is, that during low workload your response times
will be good enough with a "bad" plan, but during high workload your response 
times will be best with a plan that produces the least additional workload.

Andreas

Re: again on index usage

Tom Lane <tgl@sss.pgh.pa.us> — 2002-01-11T16:34:07Z

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
> My preference would actually be a way to make the optimizer
> choose a plan that causes minimal workload, and not shortest runtime 

?? I am not sure that I see the difference.

What I think you are saying is that when there's lots of competing work,
seqscans have less advantage over indexscans because the
sequential-access locality advantage is lost when the disk drive has to
go off and service some other request.  If that's the mechanism, I think
that the appropriate answer is just to reduce random_page_cost.  It's
true that the current default of 4.0 was chosen using measurements on
otherwise-unloaded systems.  If you assume that the system is (a) too
busy to do read-aheads for you, and (b) has to move the disk arm to
service other requests between each of your requests, then it's not
clear that sequential reads have any performance advantage at all :-(.
I don't think I'd go as far as to lower random_page_cost to 1.0, but
certainly there's a case for using an intermediate value.

			regards, tom lane

Re: again on index usage

Don Baccus <dhogaza@pacifier.com> — 2002-01-11T16:41:11Z

Zeugswetter Andreas SB SD wrote:

> This is one of the main problems of the current optimizer which imho rather 
> aggressively chooses seq scans over index scans. During high load this does 
> not pay off.

Bingo ... dragging huge tables through the buffer cache via a sequential 
scan guarantees that a) the next query sequentially scanning the same 
table will have to read every block again (if the table's longer than 
available PG and OS cache) b) on a high-concurrency system other queries 
end up doing extra I/O, too.

Oracle partially mitigates the second effect by refusing to trash its 
entire buffer cache on any given sequential scan.  Or so I've been told 
by people who know Oracle well.  A repeat of the sequential scan will 
still have to reread the entire table but that's true anyway if the 
table's at least one block longer than available cache.

Of course, Oracle picks sequential scans in horribly and obviously wrong 
cases as well.  On one project over the summer I had a query Oracle 
refused to use an available index on until I told it to do so explictly, 
and when I did it sped up by a factor of about 100.

All optimizers will fail miserably for certain queries and datasets.

-- 
Don Baccus
Portland, OR
http://donb.photo.net, http://birdnotes.net, http://openacs.org

Re: again on index usage

Bruce Momjian <pgman@candle.pha.pa.us> — 2002-01-11T16:42:43Z

Don Baccus wrote:
> Zeugswetter Andreas SB SD wrote:
> 
> 
> > This is one of the main problems of the current optimizer which imho rather 
> > aggressively chooses seq scans over index scans. During high load this does 
> > not pay off.
> 
> 
> Bingo ... dragging huge tables through the buffer cache via a sequential 
> scan guarantees that a) the next query sequentially scanning the same 
> table will have to read every block again (if the table's longer than 
> available PG and OS cache) b) on a high-concurrency system other queries 
> end up doing extra I/O, too.
> 
> Oracle partially mitigates the second effect by refusing to trash its 
> entire buffer cache on any given sequential scan.  Or so I've been told 
> by people who know Oracle well.  A repeat of the sequential scan will 
> still have to reread the entire table but that's true anyway if the 
> table's at least one block longer than available cache.

That is on our TODO list, at least.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: again on index usage

Don Baccus <dhogaza@pacifier.com> — 2002-01-11T17:01:46Z

Bruce Momjian wrote:

>>Oracle partially mitigates the second effect by refusing to trash its 
>>entire buffer cache on any given sequential scan.  Or so I've been told 
>>by people who know Oracle well.  A repeat of the sequential scan will 
>>still have to reread the entire table but that's true anyway if the 
>>table's at least one block longer than available cache.
>>
> 
> That is on our TODO list, at least.

I didn't realize this, it's good news.  (I don't follow PG development 
closely these days).

BTW overall I think the cost-estimating portion of the PG optimizer does 
about as well as Oracle's.   Oracle is a lot smarter about doing 
transformations of certain types of queries (turning "scalar in (select 
...)" into something akin to an "exists") but of course this has nothing 
to do with estimating the cost of index vs. sequential scans.

-- 
Don Baccus
Portland, OR
http://donb.photo.net, http://birdnotes.net, http://openacs.org

Re: again on index usage

Ross Reedstrom <reedstrm@rice.edu> — 2002-01-11T17:22:09Z

On Fri, Jan 11, 2002 at 11:42:43AM -0500, Bruce Momjian wrote:
> Don Baccus wrote:
> > Zeugswetter Andreas SB SD wrote:
> > 
> > 
> > > This is one of the main problems of the current optimizer which imho rather 
> > > aggressively chooses seq scans over index scans. During high load this does 
> > > not pay off.
> > 
> > 
> > Bingo ... dragging huge tables through the buffer cache via a sequential 
> > scan guarantees that a) the next query sequentially scanning the same 
> > table will have to read every block again (if the table's longer than 
> > available PG and OS cache) b) on a high-concurrency system other queries 
> > end up doing extra I/O, too.
> > 
> > Oracle partially mitigates the second effect by refusing to trash its 
> > entire buffer cache on any given sequential scan.  Or so I've been told 
> > by people who know Oracle well.  A repeat of the sequential scan will 
> > still have to reread the entire table but that's true anyway if the 
> > table's at least one block longer than available cache.
> 
> That is on our TODO list, at least.
> 

Hmm, on Linux this sort of behavior (skip the pg buffers for sequential
scans) would have interesting load senstive behavior: since Linux uses
all not-otherwise in use RAM as buffer cache, if you've got a low-load
system, even very large tables will be cached. Once other processes start
competing for RAM, your buffers go away. Bruce, what does xBSD do?

I like it, since anything that needs to be sensitive to system wide
information, like the total load on the machine, should be handled by
the system, not a particular app.

Ross

Re: again on index usage

Don Baccus <dhogaza@pacifier.com> — 2002-01-11T18:23:27Z

Ross J. Reedstrom wrote:

> Hmm, on Linux this sort of behavior (skip the pg buffers for sequential
> scans) would have interesting load senstive behavior: since Linux uses
> all not-otherwise in use RAM as buffer cache, if you've got a low-load
> system, even very large tables will be cached. Once other processes start
> competing for RAM, your buffers go away. Bruce, what does xBSD do?

For people who run dedicated database services simply not using pg 
buffers for sequential scans is probably too simplistic.  Assuming one 
allocates a very large pg buffer space, as I tend to do.  Remember that 
shuffling data between a pg buffer and OS cache buffer takes cycles, and 
modern machines tend to be starved for memory bandwidth - perhaps 
another reason why Oracle requested and got writes that bypass the OS 
cache entirely?  This bypasses the byte-shuffling.

Of course, Oracle's preferred approach is to have you set up your OS 
environment so that Oracle pretty much eats the machine.  They tell you 
to set SHMAX to 4GB in the installation docs, for instance, then the 
installer by default will configure Oracle to use about 1/3 of your 
available RAM for its buffer cache.  Books on tuning generally tell you 
that's far too low.

Anyway, I've been told that Oracle's approach is more along the lines of 
"don't cache sequential scans that eat up more than N% of our own cache 
space".

Then shorter tables still get fully cached when sequentially scanned, 
while humongous tables don't wipe out the cache causing dirty pages to 
be flushed to the platter and other concurrent processes to do file I/O 
reads because everything but the huge table's disappeared.

Someone in an earlier post mentioned "thrashing" and that's what 
dragging a table bigger than cache causes on busy systems.

> 
> I like it, since anything that needs to be sensitive to system wide
> information, like the total load on the machine, should be handled by
> the system, not a particular app.
> 
> Ross
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
> 
> http://archives.postgresql.org
> 
> .
> 
> 

-- 
Don Baccus
Portland, OR
http://donb.photo.net, http://birdnotes.net, http://openacs.org

Re: again on index usage

Bruce Momjian <pgman@candle.pha.pa.us> — 2002-01-11T18:30:53Z

> > That is on our TODO list, at least.
> > 
> 
> Hmm, on Linux this sort of behavior (skip the pg buffers for sequential
> scans) would have interesting load senstive behavior: since Linux uses
> all not-otherwise in use RAM as buffer cache, if you've got a low-load
> system, even very large tables will be cached. Once other processes start
> competing for RAM, your buffers go away. Bruce, what does xBSD do?

FreeBSD does, NetBSD will soon, not sure about the others.  I believe
NetBSD will be tunable because page cache vs. i/o cache is not always
best done with a FIFO setup.

Also, when we pull from the kernel cache we have to read into our shared
buffers;  much faster than disk i/o but slower than if they were already
in the cache.  For me the idea of doing non-cached sequential scans came
from a Solaris internals book I was reading.  I think it will be
possible to mark large sequential scan buffer i/o with lower priority
caching that may help performance.  However, if others are also doing
sequential scans of the same table _only_, our normal caching may be
best.  As you can see, this gets quite complicated and I am doubtful
there will be a general solution to this problem --- more of a feedback
loop may be the best bet --- determine which sequential scans are wiping
the cache with little other purpose and start not caching them.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: again on index usage

mlw <markw@mohawksoft.com> — 2002-01-11T18:58:09Z

Tom Lane wrote:
> What I think you are saying is that when there's lots of competing work,
> seqscans have less advantage over indexscans because the
> sequential-access locality advantage is lost when the disk drive has to
> go off and service some other request.  If that's the mechanism, I think
> that the appropriate answer is just to reduce random_page_cost.  It's
> true that the current default of 4.0 was chosen using measurements on
> otherwise-unloaded systems.  If you assume that the system is (a) too
> busy to do read-aheads for you, and (b) has to move the disk arm to
> service other requests between each of your requests, then it's not
> clear that sequential reads have any performance advantage at all :-(.
> I don't think I'd go as far as to lower random_page_cost to 1.0, but
> certainly there's a case for using an intermediate value.

It is even more interesting than simple sequential vs random scan
thinking. Depending on the maker of the drive, even an unloaded system
will move the head randomly. Modern drives have almost no resemblance to
their predecessors. Sectors are mapped however the OEM sees fit. A
numerically sequential read from a hard disk may have the drive heads
bouncing all over the disk because the internal configuration of the
disk has almost nothing to do with the external representation.

Think about a RAID device. What does a sequential scan mean to a RAID
system? Very little depending on how the image is constructed. Storage
devices are now black boxes. The only predictable advantage a
"sequential scan" can have on a modern computer is OS level caching.

Re: again on index usage

Tom Lane <tgl@sss.pgh.pa.us> — 2002-01-11T19:09:21Z

mlw <markw@mohawksoft.com> writes:
> ... Storage
> devices are now black boxes. The only predictable advantage a
> "sequential scan" can have on a modern computer is OS level caching.

You mean read-ahead.  True enough, but that "only advantage" is very
significant.  The 4.0 number did not come out of the air, it came
from actual measurements.

I think the real point in this thread is that measurements on an idle
system might not extrapolate very well to measurements on a heavily
loaded system.  I can see the point, but I don't really have time to
investigate it right now.  I'd be willing to reduce the default value of
random_page_cost to something around 2, if someone can come up with
experimental evidence justifying it ...

			regards, tom lane

Re: again on index usage

Don Baccus <dhogaza@pacifier.com> — 2002-01-11T19:42:38Z

Bruce Momjian wrote:

> Also, when we pull from the kernel cache we have to read into our shared
> buffers;  much faster than disk i/o but slower than if they were already
> in the cache.

Yes ... this is one reason why folks like Oracle want to be able to 
bypass the kernel cache.

>  For me the idea of doing non-cached sequential scans came
> from a Solaris internals book I was reading.  I think it will be
> possible to mark large sequential scan buffer i/o with lower priority
> caching that may help performance.  However, if others are also doing
> sequential scans of the same table _only_, our normal caching may be
> best.  As you can see, this gets quite complicated and I am doubtful
> there will be a general solution to this problem --- more of a feedback
> loop may be the best bet --- determine which sequential scans are wiping
> the cache with little other purpose and start not caching them.

It would be interesting to learn more about the actual hueristic Oracle 
uses (straight percents of the buffer cache?  Something based on 
concurrency?  I have no idea).  The Oracle folks have got tons and tons 
of data on real-world big, busy db installations to draw from when they 
investigate such things.

-- 
Don Baccus
Portland, OR
http://donb.photo.net, http://birdnotes.net, http://openacs.org

Re: again on index usage

Hannu Krosing <hannu@tm.ee> — 2002-01-12T15:08:24Z

Don Baccus wrote:
> 
> Zeugswetter Andreas SB SD wrote:
> 
> > This is one of the main problems of the current optimizer which imho rather
> > aggressively chooses seq scans over index scans. During high load this does
> > not pay off.
> 
> Bingo ... dragging huge tables through the buffer cache via a sequential
> scan guarantees that a) the next query sequentially scanning the same
> table will have to read every block again (if the table's longer than
> available PG and OS cache) b) on a high-concurrency system other queries
> end up doing extra I/O, too.
> 
> Oracle partially mitigates the second effect by refusing to trash its
> entire buffer cache on any given sequential scan.  Or so I've been told
> by people who know Oracle well.  A repeat of the sequential scan will
> still have to reread the entire table but that's true anyway if the
> table's at least one block longer than available cache.

One radical way to get better-than-average cache behaviour in such 
pathologigal casescases would be to discard a _random_ page instead of 
LRU page (perhaps tuned to not not select from 1/N of pages on that are
MRU)

-------------
Hannu

Re: again on index usage

Don Baccus <dhogaza@pacifier.com> — 2002-01-12T15:44:29Z

Hannu Krosing wrote:

> One radical way to get better-than-average cache behaviour in such 
> pathologigal casescases would be to discard a _random_ page instead of 
> LRU page (perhaps tuned to not not select from 1/N of pages on that are
> MRU)

Yep, that's one of the ways to improve performance when the same table's 
being scanned sequentially multiple times, or where different queries 
sometimes scan it sequentially, other times by index.  MRU would help if 
you're constantly doing sequential scans.

So would flipping the scan order depending on what's in the cache :)

But none of these would mitigate the effects on other concurrent queries 
that don't query the large table at all.

-- 
Don Baccus
Portland, OR
http://donb.photo.net, http://birdnotes.net, http://openacs.org