Thread

  1. Re: again on index usage

    Zeugswetter Andreas ADI SD <zeugswettera@spardat.at> — 2002-01-11T11:46:42Z

    > This topic seems to come up a lot.  Is there something we are missing in
    > the FAQ?
    
    Most of the reports we seem to see are the usual, "but the seq scan is actually 
    faster" case. Daniel actually has a case where the optimizer chooses a bad plan.
    
    The difficulty seems to be, that the optimizer cooses a correct plan for an idle
    system, but with his workload the index path would be far better (2 vs 4 Minutes).
    
    This is one of the main problems of the current optimizer which imho rather 
    aggressively chooses seq scans over index scans. During high load this does 
    not pay off. My preference would actually be a way to make the optimizer
    choose a plan that causes minimal workload, and not shortest runtime 
    (which will obviously only be fast with low overall workload)
    The reasoning behind this is, that during low workload your response times
    will be good enough with a "bad" plan, but during high workload your response 
    times will be best with a plan that produces the least additional workload.
    
    Andreas
    
    
  2. Re: again on index usage

    Tom Lane <tgl@sss.pgh.pa.us> — 2002-01-11T16:34:07Z

    "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
    > My preference would actually be a way to make the optimizer
    > choose a plan that causes minimal workload, and not shortest runtime 
    
    ?? I am not sure that I see the difference.
    
    What I think you are saying is that when there's lots of competing work,
    seqscans have less advantage over indexscans because the
    sequential-access locality advantage is lost when the disk drive has to
    go off and service some other request.  If that's the mechanism, I think
    that the appropriate answer is just to reduce random_page_cost.  It's
    true that the current default of 4.0 was chosen using measurements on
    otherwise-unloaded systems.  If you assume that the system is (a) too
    busy to do read-aheads for you, and (b) has to move the disk arm to
    service other requests between each of your requests, then it's not
    clear that sequential reads have any performance advantage at all :-(.
    I don't think I'd go as far as to lower random_page_cost to 1.0, but
    certainly there's a case for using an intermediate value.
    
    			regards, tom lane
    
    
  3. Re: again on index usage

    Don Baccus <dhogaza@pacifier.com> — 2002-01-11T16:41:11Z

    Zeugswetter Andreas SB SD wrote:
    
    
    > This is one of the main problems of the current optimizer which imho rather 
    > aggressively chooses seq scans over index scans. During high load this does 
    > not pay off.
    
    
    Bingo ... dragging huge tables through the buffer cache via a sequential 
    scan guarantees that a) the next query sequentially scanning the same 
    table will have to read every block again (if the table's longer than 
    available PG and OS cache) b) on a high-concurrency system other queries 
    end up doing extra I/O, too.
    
    Oracle partially mitigates the second effect by refusing to trash its 
    entire buffer cache on any given sequential scan.  Or so I've been told 
    by people who know Oracle well.  A repeat of the sequential scan will 
    still have to reread the entire table but that's true anyway if the 
    table's at least one block longer than available cache.
    
    Of course, Oracle picks sequential scans in horribly and obviously wrong 
    cases as well.  On one project over the summer I had a query Oracle 
    refused to use an available index on until I told it to do so explictly, 
    and when I did it sped up by a factor of about 100.
    
    All optimizers will fail miserably for certain queries and datasets.
    
    -- 
    Don Baccus
    Portland, OR
    http://donb.photo.net, http://birdnotes.net, http://openacs.org
    
    
    
  4. Re: again on index usage

    Bruce Momjian <pgman@candle.pha.pa.us> — 2002-01-11T16:42:43Z

    Don Baccus wrote:
    > Zeugswetter Andreas SB SD wrote:
    > 
    > 
    > > This is one of the main problems of the current optimizer which imho rather 
    > > aggressively chooses seq scans over index scans. During high load this does 
    > > not pay off.
    > 
    > 
    > Bingo ... dragging huge tables through the buffer cache via a sequential 
    > scan guarantees that a) the next query sequentially scanning the same 
    > table will have to read every block again (if the table's longer than 
    > available PG and OS cache) b) on a high-concurrency system other queries 
    > end up doing extra I/O, too.
    > 
    > Oracle partially mitigates the second effect by refusing to trash its 
    > entire buffer cache on any given sequential scan.  Or so I've been told 
    > by people who know Oracle well.  A repeat of the sequential scan will 
    > still have to reread the entire table but that's true anyway if the 
    > table's at least one block longer than available cache.
    
    That is on our TODO list, at least.
    
    -- 
      Bruce Momjian                        |  http://candle.pha.pa.us
      pgman@candle.pha.pa.us               |  (610) 853-3000
      +  If your life is a hard drive,     |  830 Blythe Avenue
      +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
    
    
  5. Re: again on index usage

    Don Baccus <dhogaza@pacifier.com> — 2002-01-11T17:01:46Z

    Bruce Momjian wrote:
    
    
    >>Oracle partially mitigates the second effect by refusing to trash its 
    >>entire buffer cache on any given sequential scan.  Or so I've been told 
    >>by people who know Oracle well.  A repeat of the sequential scan will 
    >>still have to reread the entire table but that's true anyway if the 
    >>table's at least one block longer than available cache.
    >>
    > 
    > That is on our TODO list, at least.
    
    
    I didn't realize this, it's good news.  (I don't follow PG development 
    closely these days).
    
    BTW overall I think the cost-estimating portion of the PG optimizer does 
    about as well as Oracle's.   Oracle is a lot smarter about doing 
    transformations of certain types of queries (turning "scalar in (select 
    ...)" into something akin to an "exists") but of course this has nothing 
    to do with estimating the cost of index vs. sequential scans.
    
    
    -- 
    Don Baccus
    Portland, OR
    http://donb.photo.net, http://birdnotes.net, http://openacs.org
    
    
    
  6. Re: again on index usage

    Ross Reedstrom <reedstrm@rice.edu> — 2002-01-11T17:22:09Z

    On Fri, Jan 11, 2002 at 11:42:43AM -0500, Bruce Momjian wrote:
    > Don Baccus wrote:
    > > Zeugswetter Andreas SB SD wrote:
    > > 
    > > 
    > > > This is one of the main problems of the current optimizer which imho rather 
    > > > aggressively chooses seq scans over index scans. During high load this does 
    > > > not pay off.
    > > 
    > > 
    > > Bingo ... dragging huge tables through the buffer cache via a sequential 
    > > scan guarantees that a) the next query sequentially scanning the same 
    > > table will have to read every block again (if the table's longer than 
    > > available PG and OS cache) b) on a high-concurrency system other queries 
    > > end up doing extra I/O, too.
    > > 
    > > Oracle partially mitigates the second effect by refusing to trash its 
    > > entire buffer cache on any given sequential scan.  Or so I've been told 
    > > by people who know Oracle well.  A repeat of the sequential scan will 
    > > still have to reread the entire table but that's true anyway if the 
    > > table's at least one block longer than available cache.
    > 
    > That is on our TODO list, at least.
    > 
    
    Hmm, on Linux this sort of behavior (skip the pg buffers for sequential
    scans) would have interesting load senstive behavior: since Linux uses
    all not-otherwise in use RAM as buffer cache, if you've got a low-load
    system, even very large tables will be cached. Once other processes start
    competing for RAM, your buffers go away. Bruce, what does xBSD do?
    
    I like it, since anything that needs to be sensitive to system wide
    information, like the total load on the machine, should be handled by
    the system, not a particular app.
    
    Ross
    
    
  7. Re: again on index usage

    Don Baccus <dhogaza@pacifier.com> — 2002-01-11T18:23:27Z

    Ross J. Reedstrom wrote:
    
    
    > Hmm, on Linux this sort of behavior (skip the pg buffers for sequential
    > scans) would have interesting load senstive behavior: since Linux uses
    > all not-otherwise in use RAM as buffer cache, if you've got a low-load
    > system, even very large tables will be cached. Once other processes start
    > competing for RAM, your buffers go away. Bruce, what does xBSD do?
    
    
    For people who run dedicated database services simply not using pg 
    buffers for sequential scans is probably too simplistic.  Assuming one 
    allocates a very large pg buffer space, as I tend to do.  Remember that 
    shuffling data between a pg buffer and OS cache buffer takes cycles, and 
    modern machines tend to be starved for memory bandwidth - perhaps 
    another reason why Oracle requested and got writes that bypass the OS 
    cache entirely?  This bypasses the byte-shuffling.
    
    Of course, Oracle's preferred approach is to have you set up your OS 
    environment so that Oracle pretty much eats the machine.  They tell you 
    to set SHMAX to 4GB in the installation docs, for instance, then the 
    installer by default will configure Oracle to use about 1/3 of your 
    available RAM for its buffer cache.  Books on tuning generally tell you 
    that's far too low.
    
    Anyway, I've been told that Oracle's approach is more along the lines of 
    "don't cache sequential scans that eat up more than N% of our own cache 
    space".
    
    Then shorter tables still get fully cached when sequentially scanned, 
    while humongous tables don't wipe out the cache causing dirty pages to 
    be flushed to the platter and other concurrent processes to do file I/O 
    reads because everything but the huge table's disappeared.
    
    Someone in an earlier post mentioned "thrashing" and that's what 
    dragging a table bigger than cache causes on busy systems.
    
    
    > 
    > I like it, since anything that needs to be sensitive to system wide
    > information, like the total load on the machine, should be handled by
    > the system, not a particular app.
    > 
    > Ross
    > 
    > ---------------------------(end of broadcast)---------------------------
    > TIP 6: Have you searched our list archives?
    > 
    > http://archives.postgresql.org
    > 
    > .
    > 
    > 
    
    
    
    -- 
    Don Baccus
    Portland, OR
    http://donb.photo.net, http://birdnotes.net, http://openacs.org
    
    
    
  8. Re: again on index usage

    Bruce Momjian <pgman@candle.pha.pa.us> — 2002-01-11T18:30:53Z

    > > That is on our TODO list, at least.
    > > 
    > 
    > Hmm, on Linux this sort of behavior (skip the pg buffers for sequential
    > scans) would have interesting load senstive behavior: since Linux uses
    > all not-otherwise in use RAM as buffer cache, if you've got a low-load
    > system, even very large tables will be cached. Once other processes start
    > competing for RAM, your buffers go away. Bruce, what does xBSD do?
    
    FreeBSD does, NetBSD will soon, not sure about the others.  I believe
    NetBSD will be tunable because page cache vs. i/o cache is not always
    best done with a FIFO setup.
    
    Also, when we pull from the kernel cache we have to read into our shared
    buffers;  much faster than disk i/o but slower than if they were already
    in the cache.  For me the idea of doing non-cached sequential scans came
    from a Solaris internals book I was reading.  I think it will be
    possible to mark large sequential scan buffer i/o with lower priority
    caching that may help performance.  However, if others are also doing
    sequential scans of the same table _only_, our normal caching may be
    best.  As you can see, this gets quite complicated and I am doubtful
    there will be a general solution to this problem --- more of a feedback
    loop may be the best bet --- determine which sequential scans are wiping
    the cache with little other purpose and start not caching them.
    
    -- 
      Bruce Momjian                        |  http://candle.pha.pa.us
      pgman@candle.pha.pa.us               |  (610) 853-3000
      +  If your life is a hard drive,     |  830 Blythe Avenue
      +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
    
    
  9. Re: again on index usage

    mlw <markw@mohawksoft.com> — 2002-01-11T18:58:09Z

    Tom Lane wrote:
    > What I think you are saying is that when there's lots of competing work,
    > seqscans have less advantage over indexscans because the
    > sequential-access locality advantage is lost when the disk drive has to
    > go off and service some other request.  If that's the mechanism, I think
    > that the appropriate answer is just to reduce random_page_cost.  It's
    > true that the current default of 4.0 was chosen using measurements on
    > otherwise-unloaded systems.  If you assume that the system is (a) too
    > busy to do read-aheads for you, and (b) has to move the disk arm to
    > service other requests between each of your requests, then it's not
    > clear that sequential reads have any performance advantage at all :-(.
    > I don't think I'd go as far as to lower random_page_cost to 1.0, but
    > certainly there's a case for using an intermediate value.
    
    It is even more interesting than simple sequential vs random scan
    thinking. Depending on the maker of the drive, even an unloaded system
    will move the head randomly. Modern drives have almost no resemblance to
    their predecessors. Sectors are mapped however the OEM sees fit. A
    numerically sequential read from a hard disk may have the drive heads
    bouncing all over the disk because the internal configuration of the
    disk has almost nothing to do with the external representation.
    
    Think about a RAID device. What does a sequential scan mean to a RAID
    system? Very little depending on how the image is constructed. Storage
    devices are now black boxes. The only predictable advantage a
    "sequential scan" can have on a modern computer is OS level caching.
    
    
  10. Re: again on index usage

    Tom Lane <tgl@sss.pgh.pa.us> — 2002-01-11T19:09:21Z

    mlw <markw@mohawksoft.com> writes:
    > ... Storage
    > devices are now black boxes. The only predictable advantage a
    > "sequential scan" can have on a modern computer is OS level caching.
    
    You mean read-ahead.  True enough, but that "only advantage" is very
    significant.  The 4.0 number did not come out of the air, it came
    from actual measurements.
    
    I think the real point in this thread is that measurements on an idle
    system might not extrapolate very well to measurements on a heavily
    loaded system.  I can see the point, but I don't really have time to
    investigate it right now.  I'd be willing to reduce the default value of
    random_page_cost to something around 2, if someone can come up with
    experimental evidence justifying it ...
    
    			regards, tom lane
    
    
  11. Re: again on index usage

    Don Baccus <dhogaza@pacifier.com> — 2002-01-11T19:42:38Z

    Bruce Momjian wrote:
    
    
    > Also, when we pull from the kernel cache we have to read into our shared
    > buffers;  much faster than disk i/o but slower than if they were already
    > in the cache.
    
    
    Yes ... this is one reason why folks like Oracle want to be able to 
    bypass the kernel cache.
    
    >  For me the idea of doing non-cached sequential scans came
    > from a Solaris internals book I was reading.  I think it will be
    > possible to mark large sequential scan buffer i/o with lower priority
    > caching that may help performance.  However, if others are also doing
    > sequential scans of the same table _only_, our normal caching may be
    > best.  As you can see, this gets quite complicated and I am doubtful
    > there will be a general solution to this problem --- more of a feedback
    > loop may be the best bet --- determine which sequential scans are wiping
    > the cache with little other purpose and start not caching them.
    
    
    It would be interesting to learn more about the actual hueristic Oracle 
    uses (straight percents of the buffer cache?  Something based on 
    concurrency?  I have no idea).  The Oracle folks have got tons and tons 
    of data on real-world big, busy db installations to draw from when they 
    investigate such things.
    
    
    
    -- 
    Don Baccus
    Portland, OR
    http://donb.photo.net, http://birdnotes.net, http://openacs.org
    
    
    
  12. Re: again on index usage

    Hannu Krosing <hannu@tm.ee> — 2002-01-12T15:08:24Z

    Don Baccus wrote:
    > 
    > Zeugswetter Andreas SB SD wrote:
    > 
    > > This is one of the main problems of the current optimizer which imho rather
    > > aggressively chooses seq scans over index scans. During high load this does
    > > not pay off.
    > 
    > Bingo ... dragging huge tables through the buffer cache via a sequential
    > scan guarantees that a) the next query sequentially scanning the same
    > table will have to read every block again (if the table's longer than
    > available PG and OS cache) b) on a high-concurrency system other queries
    > end up doing extra I/O, too.
    > 
    > Oracle partially mitigates the second effect by refusing to trash its
    > entire buffer cache on any given sequential scan.  Or so I've been told
    > by people who know Oracle well.  A repeat of the sequential scan will
    > still have to reread the entire table but that's true anyway if the
    > table's at least one block longer than available cache.
    
    One radical way to get better-than-average cache behaviour in such 
    pathologigal casescases would be to discard a _random_ page instead of 
    LRU page (perhaps tuned to not not select from 1/N of pages on that are
    MRU)
    
    -------------
    Hannu
    
    
  13. Re: again on index usage

    Don Baccus <dhogaza@pacifier.com> — 2002-01-12T15:44:29Z

    Hannu Krosing wrote:
    
    
    > One radical way to get better-than-average cache behaviour in such 
    > pathologigal casescases would be to discard a _random_ page instead of 
    > LRU page (perhaps tuned to not not select from 1/N of pages on that are
    > MRU)
    
    
    Yep, that's one of the ways to improve performance when the same table's 
    being scanned sequentially multiple times, or where different queries 
    sometimes scan it sequentially, other times by index.  MRU would help if 
    you're constantly doing sequential scans.
    
    So would flipping the scan order depending on what's in the cache :)
    
    But none of these would mitigate the effects on other concurrent queries 
    that don't query the large table at all.
    
    
    
    -- 
    Don Baccus
    Portland, OR
    http://donb.photo.net, http://birdnotes.net, http://openacs.org