Thread
-
Re: again on index usage
Zeugswetter Andreas ADI SD <zeugswettera@spardat.at> — 2002-01-11T11:46:42Z
> This topic seems to come up a lot. Is there something we are missing in > the FAQ? Most of the reports we seem to see are the usual, "but the seq scan is actually faster" case. Daniel actually has a case where the optimizer chooses a bad plan. The difficulty seems to be, that the optimizer cooses a correct plan for an idle system, but with his workload the index path would be far better (2 vs 4 Minutes). This is one of the main problems of the current optimizer which imho rather aggressively chooses seq scans over index scans. During high load this does not pay off. My preference would actually be a way to make the optimizer choose a plan that causes minimal workload, and not shortest runtime (which will obviously only be fast with low overall workload) The reasoning behind this is, that during low workload your response times will be good enough with a "bad" plan, but during high workload your response times will be best with a plan that produces the least additional workload. Andreas
-
Re: again on index usage
Tom Lane <tgl@sss.pgh.pa.us> — 2002-01-11T16:34:07Z
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: > My preference would actually be a way to make the optimizer > choose a plan that causes minimal workload, and not shortest runtime ?? I am not sure that I see the difference. What I think you are saying is that when there's lots of competing work, seqscans have less advantage over indexscans because the sequential-access locality advantage is lost when the disk drive has to go off and service some other request. If that's the mechanism, I think that the appropriate answer is just to reduce random_page_cost. It's true that the current default of 4.0 was chosen using measurements on otherwise-unloaded systems. If you assume that the system is (a) too busy to do read-aheads for you, and (b) has to move the disk arm to service other requests between each of your requests, then it's not clear that sequential reads have any performance advantage at all :-(. I don't think I'd go as far as to lower random_page_cost to 1.0, but certainly there's a case for using an intermediate value. regards, tom lane
-
Re: again on index usage
Don Baccus <dhogaza@pacifier.com> — 2002-01-11T16:41:11Z
Zeugswetter Andreas SB SD wrote: > This is one of the main problems of the current optimizer which imho rather > aggressively chooses seq scans over index scans. During high load this does > not pay off. Bingo ... dragging huge tables through the buffer cache via a sequential scan guarantees that a) the next query sequentially scanning the same table will have to read every block again (if the table's longer than available PG and OS cache) b) on a high-concurrency system other queries end up doing extra I/O, too. Oracle partially mitigates the second effect by refusing to trash its entire buffer cache on any given sequential scan. Or so I've been told by people who know Oracle well. A repeat of the sequential scan will still have to reread the entire table but that's true anyway if the table's at least one block longer than available cache. Of course, Oracle picks sequential scans in horribly and obviously wrong cases as well. On one project over the summer I had a query Oracle refused to use an available index on until I told it to do so explictly, and when I did it sped up by a factor of about 100. All optimizers will fail miserably for certain queries and datasets. -- Don Baccus Portland, OR http://donb.photo.net, http://birdnotes.net, http://openacs.org
-
Re: again on index usage
Bruce Momjian <pgman@candle.pha.pa.us> — 2002-01-11T16:42:43Z
Don Baccus wrote: > Zeugswetter Andreas SB SD wrote: > > > > This is one of the main problems of the current optimizer which imho rather > > aggressively chooses seq scans over index scans. During high load this does > > not pay off. > > > Bingo ... dragging huge tables through the buffer cache via a sequential > scan guarantees that a) the next query sequentially scanning the same > table will have to read every block again (if the table's longer than > available PG and OS cache) b) on a high-concurrency system other queries > end up doing extra I/O, too. > > Oracle partially mitigates the second effect by refusing to trash its > entire buffer cache on any given sequential scan. Or so I've been told > by people who know Oracle well. A repeat of the sequential scan will > still have to reread the entire table but that's true anyway if the > table's at least one block longer than available cache. That is on our TODO list, at least. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
-
Re: again on index usage
Don Baccus <dhogaza@pacifier.com> — 2002-01-11T17:01:46Z
Bruce Momjian wrote: >>Oracle partially mitigates the second effect by refusing to trash its >>entire buffer cache on any given sequential scan. Or so I've been told >>by people who know Oracle well. A repeat of the sequential scan will >>still have to reread the entire table but that's true anyway if the >>table's at least one block longer than available cache. >> > > That is on our TODO list, at least. I didn't realize this, it's good news. (I don't follow PG development closely these days). BTW overall I think the cost-estimating portion of the PG optimizer does about as well as Oracle's. Oracle is a lot smarter about doing transformations of certain types of queries (turning "scalar in (select ...)" into something akin to an "exists") but of course this has nothing to do with estimating the cost of index vs. sequential scans. -- Don Baccus Portland, OR http://donb.photo.net, http://birdnotes.net, http://openacs.org
-
Re: again on index usage
Ross Reedstrom <reedstrm@rice.edu> — 2002-01-11T17:22:09Z
On Fri, Jan 11, 2002 at 11:42:43AM -0500, Bruce Momjian wrote: > Don Baccus wrote: > > Zeugswetter Andreas SB SD wrote: > > > > > > > This is one of the main problems of the current optimizer which imho rather > > > aggressively chooses seq scans over index scans. During high load this does > > > not pay off. > > > > > > Bingo ... dragging huge tables through the buffer cache via a sequential > > scan guarantees that a) the next query sequentially scanning the same > > table will have to read every block again (if the table's longer than > > available PG and OS cache) b) on a high-concurrency system other queries > > end up doing extra I/O, too. > > > > Oracle partially mitigates the second effect by refusing to trash its > > entire buffer cache on any given sequential scan. Or so I've been told > > by people who know Oracle well. A repeat of the sequential scan will > > still have to reread the entire table but that's true anyway if the > > table's at least one block longer than available cache. > > That is on our TODO list, at least. > Hmm, on Linux this sort of behavior (skip the pg buffers for sequential scans) would have interesting load senstive behavior: since Linux uses all not-otherwise in use RAM as buffer cache, if you've got a low-load system, even very large tables will be cached. Once other processes start competing for RAM, your buffers go away. Bruce, what does xBSD do? I like it, since anything that needs to be sensitive to system wide information, like the total load on the machine, should be handled by the system, not a particular app. Ross
-
Re: again on index usage
Don Baccus <dhogaza@pacifier.com> — 2002-01-11T18:23:27Z
Ross J. Reedstrom wrote: > Hmm, on Linux this sort of behavior (skip the pg buffers for sequential > scans) would have interesting load senstive behavior: since Linux uses > all not-otherwise in use RAM as buffer cache, if you've got a low-load > system, even very large tables will be cached. Once other processes start > competing for RAM, your buffers go away. Bruce, what does xBSD do? For people who run dedicated database services simply not using pg buffers for sequential scans is probably too simplistic. Assuming one allocates a very large pg buffer space, as I tend to do. Remember that shuffling data between a pg buffer and OS cache buffer takes cycles, and modern machines tend to be starved for memory bandwidth - perhaps another reason why Oracle requested and got writes that bypass the OS cache entirely? This bypasses the byte-shuffling. Of course, Oracle's preferred approach is to have you set up your OS environment so that Oracle pretty much eats the machine. They tell you to set SHMAX to 4GB in the installation docs, for instance, then the installer by default will configure Oracle to use about 1/3 of your available RAM for its buffer cache. Books on tuning generally tell you that's far too low. Anyway, I've been told that Oracle's approach is more along the lines of "don't cache sequential scans that eat up more than N% of our own cache space". Then shorter tables still get fully cached when sequentially scanned, while humongous tables don't wipe out the cache causing dirty pages to be flushed to the platter and other concurrent processes to do file I/O reads because everything but the huge table's disappeared. Someone in an earlier post mentioned "thrashing" and that's what dragging a table bigger than cache causes on busy systems. > > I like it, since anything that needs to be sensitive to system wide > information, like the total load on the machine, should be handled by > the system, not a particular app. > > Ross > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org > > . > > -- Don Baccus Portland, OR http://donb.photo.net, http://birdnotes.net, http://openacs.org
-
Re: again on index usage
Bruce Momjian <pgman@candle.pha.pa.us> — 2002-01-11T18:30:53Z
> > That is on our TODO list, at least. > > > > Hmm, on Linux this sort of behavior (skip the pg buffers for sequential > scans) would have interesting load senstive behavior: since Linux uses > all not-otherwise in use RAM as buffer cache, if you've got a low-load > system, even very large tables will be cached. Once other processes start > competing for RAM, your buffers go away. Bruce, what does xBSD do? FreeBSD does, NetBSD will soon, not sure about the others. I believe NetBSD will be tunable because page cache vs. i/o cache is not always best done with a FIFO setup. Also, when we pull from the kernel cache we have to read into our shared buffers; much faster than disk i/o but slower than if they were already in the cache. For me the idea of doing non-cached sequential scans came from a Solaris internals book I was reading. I think it will be possible to mark large sequential scan buffer i/o with lower priority caching that may help performance. However, if others are also doing sequential scans of the same table _only_, our normal caching may be best. As you can see, this gets quite complicated and I am doubtful there will be a general solution to this problem --- more of a feedback loop may be the best bet --- determine which sequential scans are wiping the cache with little other purpose and start not caching them. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
-
Re: again on index usage
mlw <markw@mohawksoft.com> — 2002-01-11T18:58:09Z
Tom Lane wrote: > What I think you are saying is that when there's lots of competing work, > seqscans have less advantage over indexscans because the > sequential-access locality advantage is lost when the disk drive has to > go off and service some other request. If that's the mechanism, I think > that the appropriate answer is just to reduce random_page_cost. It's > true that the current default of 4.0 was chosen using measurements on > otherwise-unloaded systems. If you assume that the system is (a) too > busy to do read-aheads for you, and (b) has to move the disk arm to > service other requests between each of your requests, then it's not > clear that sequential reads have any performance advantage at all :-(. > I don't think I'd go as far as to lower random_page_cost to 1.0, but > certainly there's a case for using an intermediate value. It is even more interesting than simple sequential vs random scan thinking. Depending on the maker of the drive, even an unloaded system will move the head randomly. Modern drives have almost no resemblance to their predecessors. Sectors are mapped however the OEM sees fit. A numerically sequential read from a hard disk may have the drive heads bouncing all over the disk because the internal configuration of the disk has almost nothing to do with the external representation. Think about a RAID device. What does a sequential scan mean to a RAID system? Very little depending on how the image is constructed. Storage devices are now black boxes. The only predictable advantage a "sequential scan" can have on a modern computer is OS level caching.
-
Re: again on index usage
Tom Lane <tgl@sss.pgh.pa.us> — 2002-01-11T19:09:21Z
mlw <markw@mohawksoft.com> writes: > ... Storage > devices are now black boxes. The only predictable advantage a > "sequential scan" can have on a modern computer is OS level caching. You mean read-ahead. True enough, but that "only advantage" is very significant. The 4.0 number did not come out of the air, it came from actual measurements. I think the real point in this thread is that measurements on an idle system might not extrapolate very well to measurements on a heavily loaded system. I can see the point, but I don't really have time to investigate it right now. I'd be willing to reduce the default value of random_page_cost to something around 2, if someone can come up with experimental evidence justifying it ... regards, tom lane
-
Re: again on index usage
Don Baccus <dhogaza@pacifier.com> — 2002-01-11T19:42:38Z
Bruce Momjian wrote: > Also, when we pull from the kernel cache we have to read into our shared > buffers; much faster than disk i/o but slower than if they were already > in the cache. Yes ... this is one reason why folks like Oracle want to be able to bypass the kernel cache. > For me the idea of doing non-cached sequential scans came > from a Solaris internals book I was reading. I think it will be > possible to mark large sequential scan buffer i/o with lower priority > caching that may help performance. However, if others are also doing > sequential scans of the same table _only_, our normal caching may be > best. As you can see, this gets quite complicated and I am doubtful > there will be a general solution to this problem --- more of a feedback > loop may be the best bet --- determine which sequential scans are wiping > the cache with little other purpose and start not caching them. It would be interesting to learn more about the actual hueristic Oracle uses (straight percents of the buffer cache? Something based on concurrency? I have no idea). The Oracle folks have got tons and tons of data on real-world big, busy db installations to draw from when they investigate such things. -- Don Baccus Portland, OR http://donb.photo.net, http://birdnotes.net, http://openacs.org
-
Re: again on index usage
Hannu Krosing <hannu@tm.ee> — 2002-01-12T15:08:24Z
Don Baccus wrote: > > Zeugswetter Andreas SB SD wrote: > > > This is one of the main problems of the current optimizer which imho rather > > aggressively chooses seq scans over index scans. During high load this does > > not pay off. > > Bingo ... dragging huge tables through the buffer cache via a sequential > scan guarantees that a) the next query sequentially scanning the same > table will have to read every block again (if the table's longer than > available PG and OS cache) b) on a high-concurrency system other queries > end up doing extra I/O, too. > > Oracle partially mitigates the second effect by refusing to trash its > entire buffer cache on any given sequential scan. Or so I've been told > by people who know Oracle well. A repeat of the sequential scan will > still have to reread the entire table but that's true anyway if the > table's at least one block longer than available cache. One radical way to get better-than-average cache behaviour in such pathologigal casescases would be to discard a _random_ page instead of LRU page (perhaps tuned to not not select from 1/N of pages on that are MRU) ------------- Hannu
-
Re: again on index usage
Don Baccus <dhogaza@pacifier.com> — 2002-01-12T15:44:29Z
Hannu Krosing wrote: > One radical way to get better-than-average cache behaviour in such > pathologigal casescases would be to discard a _random_ page instead of > LRU page (perhaps tuned to not not select from 1/N of pages on that are > MRU) Yep, that's one of the ways to improve performance when the same table's being scanned sequentially multiple times, or where different queries sometimes scan it sequentially, other times by index. MRU would help if you're constantly doing sequential scans. So would flipping the scan order depending on what's in the cache :) But none of these would mitigate the effects on other concurrent queries that don't query the large table at all. -- Don Baccus Portland, OR http://donb.photo.net, http://birdnotes.net, http://openacs.org