Thread

  1. cache estimates, cache access cost

    Cédric Villemain <cedric.villemain.debian@gmail.com> — 2011-05-15T21:38:57Z

    Hello
    
    Cache estimation and cache access cost are currently not accounted
    for explicitly: they have a cost associated with them but no
    constants (other than effective_cache_size, which has very limited
    usage).
    
    Every I/O cost is built from a derivation of seq_page_cost,
    random_page_cost and the number of pages. Formulas are used in some
    places to raise or lower the cost, to take caching and data
    alignment into account.
    
    There are:
    
     * estimates of the pages we will find in the PostgreSQL buffer cache
     * estimates of the pages we will find in the operating system buffer cache
    
    and they can be computed for:
    
     * the first access
     * repeated accesses
    
    We currently make no distinction between the two cache areas (there
    are more cache areas, but we don't care about them here), and we
    'prefer' to estimate repeated accesses rather than the first one.
    
    There is also a point about the cost estimates themselves: they are
    abrupt. For example, once a sort goes over work_mem, its cost jumps
    because page accesses are then accounted for.
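
    A minimal sketch of that jump, assuming a toy cost model loosely
    inspired by the planner's sort costing. It is not the actual
    costsize.c code: sort_cost, merge_order and the 75/25 split between
    sequential and random page accesses are illustrative assumptions.

```python
import math

# cpu_operator_cost, seq_page_cost and random_page_cost use PostgreSQL's
# stock defaults; everything else in this sketch is an assumption.
CPU_OPERATOR_COST = 0.0025
SEQ_PAGE_COST = 1.0
RANDOM_PAGE_COST = 4.0
BLOCK_SIZE = 8192


def sort_cost(tuples, tuple_width, work_mem_bytes, merge_order=6):
    """Toy sort cost: CPU comparisons, plus page I/O once the input
    no longer fits in work_mem (hypothetical simplification)."""
    input_bytes = tuples * tuple_width
    # Comparison cost, charged whether or not the sort spills.
    cost = 2.0 * CPU_OPERATOR_COST * tuples * math.log2(max(tuples, 2))
    if input_bytes > work_mem_bytes:
        pages = math.ceil(input_bytes / BLOCK_SIZE)
        runs = input_bytes / work_mem_bytes
        passes = max(1, math.ceil(math.log(runs, merge_order)))
        # Each merge pass writes and re-reads every page; every one of
        # those accesses is charged at full disk price, cached or not.
        per_page = 0.75 * SEQ_PAGE_COST + 0.25 * RANDOM_PAGE_COST
        cost += 2.0 * pages * passes * per_page
    return cost
```

    Comparing the cost of two inputs that differ by a single tuple, one
    just under work_mem and one just over, shows the discontinuity: the
    CPU term barely moves while the page-access term appears all at once.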
    
    The current cost estimates are already very good: most of our
    queries run well without those famous 'HINT's, and the planner
    provides the best plan in most cases.
    
    But I believe that we now need more tools to improve the cost
    estimation even further.
    I would like to propose some ideas, not all of them mine: the topic
    has been in the air for a long time, and probably everything has
    already been said (at least around a beer or a pepsi).
    
    Adding a new GUC "cache_page_cost":
    - makes it possible to cost a page access when the page is estimated
    to be in cache
    - makes it possible to cost a sort that exceeds work_mem but should
    not hit disk
    - lets random_page_cost be used for what it is meant for.
    (I was tempted by a GUC "write_page_cost", but I am unsure about that
    one at this stage)
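
    One way such a GUC could enter a page-cost formula is sketched
    below, assuming a simple linear blend between a cached and an
    uncached price. The function name and the 0.01 default for
    cache_page_cost are assumptions; the thread does not fix a value or
    a formula.

```python
def page_fetch_cost(pages, cached_fraction, uncached_page_cost,
                    cache_page_cost=0.01):
    """Blend a cheap cached price with the usual disk price.

    cached_fraction is the estimated share of the relation already in
    cache (0.0 to 1.0). uncached_page_cost would be seq_page_cost or
    random_page_cost depending on the access pattern. The default for
    cache_page_cost is an illustrative assumption.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached_pages = pages * cached_fraction
    uncached_pages = pages - cached_pages
    return cached_pages * cache_page_cost + uncached_pages * uncached_page_cost
```

    With cached_fraction at 0 this degenerates to today's behaviour
    (pages times the disk price), which is why random_page_cost could go
    back to describing the hardware only.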
    
    Adding 2 columns to pg_class, "oscache_percent" and "pgcache_percent"
    (or similar names): they store stats about the percentage of a
    relation in each cache.
    - The intended usage is to estimate the cost of the first access to
    pages, then use the Mackert and Lohman formula for subsequent
    accesses. The latter only provides a way to estimate the cost of
    re-reading.
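
    For reference, a sketch of the Mackert and Lohman approximation in
    the form usually quoted for index scans, where T is the number of
    pages in the table, b the number of buffer pages available and Ns
    the number of tuples fetched. The function name is hypothetical, and
    rounding and the apportioning of cache between relations are left
    out.

```python
def mackert_lohman_pages_fetched(tuples_fetched, table_pages, cache_pages):
    """Estimated number of page fetches (including re-reads) needed to
    fetch Ns tuples from a table of T pages with b buffer pages."""
    T = max(float(table_pages), 1.0)
    b = max(float(cache_pages), 1.0)
    Ns = float(tuples_fetched)

    if T <= b:
        # The whole table fits in cache: never more than T fetches.
        return min(2.0 * T * Ns / (2.0 * T + Ns), T)

    # Table is larger than the cache.
    lim = 2.0 * T * b / (2.0 * T - b)
    if Ns <= lim:
        # Few enough tuples that the cache has not filled up yet.
        return 2.0 * T * Ns / (2.0 * T + Ns)
    # Cache is full: each extra tuple misses with probability (T - b) / T.
    return b + (Ns - lim) * (T - b) / T
```

    The formula only describes repeated access during one scan, assuming
    a cold start, which is why a separate estimate of what is already
    cached at first access is proposed here.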
    
    It is hard to advocate for this with a concrete expected performance
    gain, other than: we will have more options for more precise planner
    decisions, and we may reduce the number of reports of bad planning.
    (Improving cache estimation is also on the todo list.)
    
    --
    
    I've already hacked the core a bit for that and added the 2 new
    columns, with hooks to update them. ANALYZE OSCACHE updates one of
    them, and a plugin can be used to provide the estimate (so how it is
    filled is not important; most OSes have ways to estimate it
    accurately, if anyone wonders).
    It is provided as-is as a proof of concept, probably not clean enough
    to go to a commitfest, and not expected to go there before some
    consensus is reached.
    http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache
    
    -- 
    
    Hacking costsize is ... dangerous, I would say. Breaking something
    which works already so well is easy. Changing only one cost function
    is not enough to keep a good balance....
    Performance farm should help here ... and the full cycle for 9.2 too.
    
    Comments ?
    -- 
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
    
    
  2. Re: cache estimates, cache access cost

    Greg Smith <greg@2ndquadrant.com> — 2011-05-16T03:52:56Z

    Cédric Villemain wrote:
    > http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache
    >   
    
    This rebases easily to make Cedric's changes move to the end; I just 
    pushed a version with that change to 
    https://github.com/greg2ndQuadrant/postgres/tree/analyze_cache if anyone 
    wants a cleaner one to browse.  I've attached a patch too if that's more 
    your thing.
    
    I'd recommend not getting too stuck on the particular hook Cédric has 
    added here to compute the cache estimate, which uses mmap and mincore to 
    figure it out.  It's possible to compute similar numbers, albeit less 
    accurate, using an approach similar to how pg_buffercache inspects 
    things.  And I even once wrote a background writer extension that 
    collected this sort of data as it was running the LRU scan anyway.  
    Discussions of this idea seem to focus on how the "what's in the cache?" 
    data is collected, which as far as I'm concerned is the least important 
    part.  There are multiple options, some work better than others, and 
    there's no reason that can't be swapped out later.  The more important 
    question is how to store the data collected and then use it for 
    optimizing queries.
    
    -- 
    Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
    PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
    
    
    
  3. Re: cache estimates, cache access cost

    Robert Haas <robertmhaas@gmail.com> — 2011-05-17T19:14:28Z

    On Sun, May 15, 2011 at 11:52 PM, Greg Smith <greg@2ndquadrant.com> wrote:
    > Cédric Villemain wrote:
    >>
    >> http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache
    >
    > This rebases easily to make Cedric's changes move to the end; I just pushed
    > a version with that change to
    > https://github.com/greg2ndQuadrant/postgres/tree/analyze_cache if anyone
    > wants a cleaner one to browse.  I've attached a patch too if that's more
    > your thing.
    
    Thank you.  I don't much like sucking in other people's git repos - it
    tends to take a lot longer than just opening a patch file, and if I
    add the repo as a remote then my git repo ends up bloated.  :-(
    
    > The more important question is how to store the data collected and
    > then use it for optimizing queries.
    
    Agreed, but unless I'm missing something, this patch does nothing
    about that.  I think the first step needs to be to update all the
    formulas that are based on random_page_cost and seq_page_cost to
    properly take cache_page_cost into account - and in some cases it may
    be a bit debatable what the right mathematics are.
    
    For what it's worth, I don't believe for a minute that an analyze
    process that may only run on a given table every six months has a
    chance of producing useful statistics about the likelihood that a
    table will be cached.  The buffer cache can turn over completely in
    under a minute, and a minute is a lot less than a month.  Now, if we
    measured this information periodically for a long period of time and
    averaged it, that might be a believable basis for setting an optimizer
    parameter.  But I think we should take the approach recently discussed
    on performance: allow it to be manually set by the administrator on a
    per-relation basis, with some reasonable default (maybe based on the
    size of the relation relative to effective_cache_size) if the
    administrator doesn't intervene.  I don't want to be excessively
    negative about the approach of examining the actual behavior of the
    system and using that to guide system behavior - indeed, I think there
    are quite a few places where we would do well to incorporate that
    approach to a greater degree than we do currently.  But I think that
    it's going to take a lot of research, and a lot of work, and a lot of
    performance testing, to convince ourselves that we've come up with an
    appropriate feedback mechanism that will actually deliver better
    performance across a large variety of workloads.  It would be much
    better, IMHO, to *first* get a cached_page_cost parameter added, even
    if the mechanism by which caching percentages are set is initially
    quite crude - that will give us a clear-cut benefit that people can
    begin enjoying immediately.
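
    A minimal sketch of the "crude" mechanism suggested above: an
    administrator-supplied per-relation value, with a default derived
    from the relation size relative to effective_cache_size. The ratio
    used for the default is only one possible reading of that
    suggestion, and all names are hypothetical.

```python
def default_cached_fraction(rel_pages, effective_cache_size_pages,
                            override=None):
    """Return the assumed cached fraction of a relation (0.0 to 1.0).

    override stands for a value set manually by the administrator; when
    it is None, fall back to a size-based heuristic (an assumption).
    """
    if override is not None:
        if not 0.0 <= override <= 1.0:
            raise ValueError("override must be between 0 and 1")
        return float(override)
    if rel_pages <= 0:
        # Empty relation: nothing to read, treat as fully cached.
        return 1.0
    # Small relations are assumed fully cached, larger ones in
    # proportion to how much of them could fit in the cache.
    return min(1.0, effective_cache_size_pages / rel_pages)
```

    Because the result depends only on settings and relation size, plans
    would change only when the administrator changes something.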
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  4. Re: cache estimates, cache access cost

    Cédric Villemain <cedric.villemain.debian@gmail.com> — 2011-05-17T22:11:52Z

    2011/5/17 Robert Haas <robertmhaas@gmail.com>:
    > On Sun, May 15, 2011 at 11:52 PM, Greg Smith <greg@2ndquadrant.com> wrote:
    >> Cédric Villemain wrote:
    >>>
    >>> http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache
    >>
    >> This rebases easily to make Cedric's changes move to the end; I just pushed
    >> a version with that change to
    >> https://github.com/greg2ndQuadrant/postgres/tree/analyze_cache if anyone
    >> wants a cleaner one to browse.  I've attached a patch too if that's more
    >> your thing.
    >
    > Thank you.  I don't much like sucking in other people's git repos - it
    > tends to take a lot longer than just opening a patch file, and if I
    > add the repo as a remote then my git repo ends up bloated.  :-(
    >
    >> The more important question is how to store the data collected and
    >> then use it for optimizing queries.
    >
    > Agreed, but unless I'm missing something, this patch does nothing
    > about that.  I think the first step needs to be to update all the
    > formulas that are based on random_page_cost and seq_page_cost to
    > properly take cache_page_cost into account - and in some cases it may
    > be a bit debatable what the right mathematics are.
    
    Yes, I provide the branch only in case someone wants to hack on
    costsize, and to settle the problem of getting the stats.
    
    >
    > For what it's worth, I don't believe for a minute that an analyze
    > process that may only run on a given table every six months has a
    > chance of producing useful statistics about the likelihood that a
    > table will be cached.  The buffer cache can turn over completely in
    > under a minute, and a minute is a lot less than a month.  Now, if we
    > measured this information periodically for a long period of time and
    > averaged it, that might be a believable basis for setting an optimizer
    
    The point is to get the ratio in cache, not the distribution of the
    data in cache (pgfincore also allows you to see this information).
    I don't see how a stable system (a server in production) can have its
    ratio moving up and down so fast without a known pattern.
    Maybe it is a data warehouse, where data moves a lot; then just
    update your per-relation stats before starting your queries, as
    suggested in other threads. Maybe it is just a matter of how often
    the stats are updated, or of an explicit request, as we *usually do*
    (ANALYZE foo;), to handle those situations.
    
    > parameter.  But I think we should take the approach recently discussed
    > on performance: allow it to be manually set by the administrator on a
    > per-relation basis, with some reasonable default (maybe based on the
    > size of the relation relative to effective_cache_size) if the
    > administrator doesn't intervene.  I don't want to be excessively
    > negative about the approach of examining the actual behavior of the
    > system and using that to guide system behavior - indeed, I think there
    > are quite a few places where we would do well to incorporate that
    > approach to a greater degree than we do currently.  But I think that
    > it's going to take a lot of research, and a lot of work, and a lot of
    > performance testing, to convince ourselves that we've come up with an
    > appropriate feedback mechanism that will actually deliver better
    > performance across a large variety of workloads.  It would be much
    > better, IMHO, to *first* get a cached_page_cost parameter added, even
    > if the mechanism by which caching percentages are set is initially
    > quite crude - that will give us a clear-cut benefit that people can
    > begin enjoying immediately.
    
    The plugin I provided is just there to allow a first analysis of how
    the OS cache size moves. You can either use pgfincore to monitor that
    per table, or use the patch and monitor the values of the *cache
    columns.
    
    I took the hook approach because it lets you do what you want :)
    You can set up a hook where you set the values you want to see; it
    makes it possible, for example, to fix cold-start values, or
    permanent values set by the DBA, or ... do what you want here.
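
    The idea of a pluggable hook can be sketched as follows. This is an
    illustration of the pattern only: the hook in the patch is a C-level
    hook inside the server, and every name below is hypothetical.

```python
from typing import Callable, Dict, Optional

# Hypothetical hook type: maps a relation name to a cached fraction in
# [0, 1], or None when the provider has no opinion on that relation.
CacheEstimateHook = Callable[[str], Optional[float]]

_cache_estimate_hook: Optional[CacheEstimateHook] = None


def set_cache_estimate_hook(hook: Optional[CacheEstimateHook]) -> None:
    """Install (or, with None, remove) the provider of cache estimates."""
    global _cache_estimate_hook
    _cache_estimate_hook = hook


def estimate_cached_fraction(relname: str, fallback: float = 0.0) -> float:
    """Ask the installed hook, falling back to a fixed value otherwise."""
    if _cache_estimate_hook is not None:
        value = _cache_estimate_hook(relname)
        if value is not None:
            return value
    return fallback


def make_fixed_hook(values: Dict[str, float]) -> CacheEstimateHook:
    """Provider returning permanent, DBA-chosen values per relation."""
    return lambda relname: values.get(relname)
```

    A provider built with make_fixed_hook({"orders": 0.95}) covers the
    "the DBA said so" case; a provider that inspects the OS cache could
    be swapped in later without touching the code that consumes the
    estimate.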
    
    The question is: do we need more parameters to increase the value of
    our planner?
    1/ cache_page_cost
    2/ cache information, arbitrarily set or not.
    
    Starting with 1/ is OK for me; I would prefer to try both at once, if
    possible, to avoid the pain of hacking costsize.c twice.
    
    Several items are to be discussed after that: formulas to handle
    'small' tables, use of the data distribution (this one touches on an
    old topic about auto-partitioning, while we are here), cold state,
    hot state, ...
    
    PS: there is a very good blocker for the pg_class changes: what
    happens on a standby? Maybe it just opens the door to a way of
    unlocking that, or to finding another option to get the information
    per table but distinct per server. (Or we don't care, at least for a
    first implementation, as with other parameters.)
    -- 
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
    
    
  5. Re: cache estimates, cache access cost

    Robert Haas <robertmhaas@gmail.com> — 2011-05-19T12:15:04Z

    On Tue, May 17, 2011 at 6:11 PM, Cédric Villemain
    <cedric.villemain.debian@gmail.com> wrote:
    > The point is to get ratio in cache, not the distribution of the data
    > in cache (pgfincore also allows you to see this information).
    > I don't see how a stable (a server in production) system can have its
    > ratio moving up and down so fast without known pattern.
    
    Really?  It doesn't seem that hard to me.  For example, your nightly
    reports might use a different set of tables than are active during the
    day....
    
    > PS: there is very good blocker for the pg_class changes : what happens
    > in a standby ? Maybe it just opens the door on how to unlock that or
    > find another option to get the information per table but distinct per
    > server. (or we don't care, at least for a first implementation, like
    > for other parameters)
    
    That's a good point, too.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  6. Re: cache estimates, cache access cost

    Cédric Villemain <cedric.villemain.debian@gmail.com> — 2011-05-19T12:19:23Z

    2011/5/19 Robert Haas <robertmhaas@gmail.com>:
    > On Tue, May 17, 2011 at 6:11 PM, Cédric Villemain
    > <cedric.villemain.debian@gmail.com> wrote:
    >> The point is to get ratio in cache, not the distribution of the data
    >> in cache (pgfincore also allows you to see this information).
    >> I don't see how a stable (a server in production) system can have its
    >> ratio moving up and down so fast without known pattern.
    >
    > Really?  It doesn't seem that hard to me.  For example, your nightly
    > reports might use a different set of tables than are active during the
    > day....
    
    Yes, this is a known pattern; I believe we can work with it.
    
    >
    >> PS: there is very good blocker for the pg_class changes : what happens
    >> in a standby ? Maybe it just opens the door on how to unlock that or
    >> find another option to get the information per table but distinct per
    >> server. (or we don't care, at least for a first implementation, like
    >> for other parameters)
    >
    > That's a good point, too.
    >
    > --
    > Robert Haas
    > EnterpriseDB: http://www.enterprisedb.com
    > The Enterprise PostgreSQL Company
    >
    
    
    
    -- 
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
    
    
  7. Re: cache estimates, cache access cost

    Robert Haas <robertmhaas@gmail.com> — 2011-05-19T12:47:08Z

    On Thu, May 19, 2011 at 8:19 AM, Cédric Villemain
    <cedric.villemain.debian@gmail.com> wrote:
    > 2011/5/19 Robert Haas <robertmhaas@gmail.com>:
    >> On Tue, May 17, 2011 at 6:11 PM, Cédric Villemain
    >> <cedric.villemain.debian@gmail.com> wrote:
    >>> The point is to get ratio in cache, not the distribution of the data
    >>> in cache (pgfincore also allows you to see this information).
    >>> I don't see how a stable (a server in production) system can have its
    >>> ratio moving up and down so fast without known pattern.
    >>
    >> Really?  It doesn't seem that hard to me.  For example, your nightly
    >> reports might use a different set of tables than are active during the
    >> day....
    >
    > yes, this is known pattern, I believe we can work with it.
    
    I guess the case where I agree that this would be relatively static is
    on something like a busy OLTP system.  If different users access
    different portions of the main tables, which parts of each relation
    are hot might move around, but overall the percentage of that relation
    in cache probably won't move around a ton, except perhaps just after
    running a one-off reporting query, or when the system is first
    starting up.
    
    But that's not everybody's workload.  Imagine a system that is
    relatively lightly used.  Every once in a while someone comes along
    and runs a big reporting query.  Well, the contents of the buffer
    caches might vary considerably depending on *which* big reporting
    queries ran most recently.
    
    Also, even if we knew what was going to be in cache at the start of
    the query, the execution of the query might change things greatly as
    it runs.  For example, imagine a join between some table and itself.
    If we estimate that none of the data is in cache, we will almost
    certainly be wrong, because it's likely both sides of the join are
    going to access some of the same pages.  Exactly how many depends on
    the details of the join condition and whether we choose to implement
    it by merging, sorting, or hashing.  But it's likely going to be more
    than zero.  This problem can also arise in other contexts - for
    example, if a query accesses a bunch of large tables, the tables that
    are accessed later in the computation might be less cached than the
    ones accessed earlier in the computation, because the earlier accesses
    pushed parts of the tables accessed later out of cache.  Or, if a
    query requires a large sort, and the value of work_mem is very high
    (say 1GB), the sort might evict data from cache.  Now maybe none of
    this matters a bit in practice, but it's something to think about.
    
    There was an interesting report on a problem along these lines from
    Kevin Grittner a while back.  He found he needed to set seq_page_cost
    and random_page_cost differently for the database user that ran the
    nightly reports, precisely because the degree of caching was very
    different than it was for the daily activity, and he got bad plans
    otherwise.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  8. Re: cache estimates, cache access cost

    Cédric Villemain <cedric.villemain.debian@gmail.com> — 2011-05-19T22:16:48Z

    2011/5/19 Robert Haas <robertmhaas@gmail.com>:
    > On Thu, May 19, 2011 at 8:19 AM, Cédric Villemain
    > <cedric.villemain.debian@gmail.com> wrote:
    >> 2011/5/19 Robert Haas <robertmhaas@gmail.com>:
    >>> On Tue, May 17, 2011 at 6:11 PM, Cédric Villemain
    >>> <cedric.villemain.debian@gmail.com> wrote:
    >>>> The point is to get ratio in cache, not the distribution of the data
    >>>> in cache (pgfincore also allows you to see this information).
    >>>> I don't see how a stable (a server in production) system can have its
    >>>> ratio moving up and down so fast without known pattern.
    >>>
    >>> Really?  It doesn't seem that hard to me.  For example, your nightly
    >>> reports might use a different set of tables than are active during the
    >>> day....
    >>
    >> yes, this is known pattern, I believe we can work with it.
    >
    > I guess the case where I agree that this would be relatively static is
    > on something like a busy OLTP system.  If different users access
    > different portions of the main tables, which parts of each relation
    > are hot might move around, but overall the percentage of that relation
    > in cache probably won't move around a ton, except perhaps just after
    > running a one-off reporting query, or when the system is first
    > starting up.
    
    yes.
    
    >
    > But that's not everybody's workload.  Imagine a system that is
    > relatively lightly used.  Every once in a while someone comes along
    > and runs a big reporting query.  Well, the contents of the buffer
    > caches might vary considerably depending on *which* big reporting
    > queries ran most recently.
    
    Yes, I agree. This scenario is for the case where oscache_percent and
    pgcache_percent are subject to change, I guess. We can define 1/
    whether the values can or need to be changed and 2/ when to update
    the values. For 2/, database usage may help to trigger an ANALYZE
    when required. But to be honest I'd like to hear more about the
    strategy suggested by Greg here.
    
    Those scenarios are good to keep in mind to build good indicators,
    both for the plugin that does the ANALYZE and to solve 2/.
    
    >
    > Also, even if we knew what was going to be in cache at the start of
    > the query, the execution of the query might change things greatly as
    > it runs.  For example, imagine a join between some table and itself.
    > If we estimate that none of the data is in cache, we will almost
    > certainly be wrong, because it's likely both sides of the join are
    > going to access some of the same pages.  Exactly how many depends on
    > the details of the join condition and whether we choose to implement
    > it by merging, sorting, or hashing.  But it's likely going to be more
    > than zero.  This problem can also arise in other contexts - for
    > example, if a query accesses a bunch of large tables, the tables that
    > are accessed later in the computation might be less cached than the
    > ones accessed earlier in the computation, because the earlier accesses
    > pushed parts of the tables accessed later out of cache.
    
    Yes, I believe the Mackert and Lohman formula has been good so far,
    and I never suggested removing it.
    It will need some rewriting to work with the new GUC and the new
    pg_class columns, but the code is already in place for that.
    
    > Or, if a
    > query requires a large sort, and the value of work_mem is very high
    > (say 1GB), the sort might evict data from cache.  Now maybe none of
    > this matters a bit in practice, but it's something to think about.
    
    Yes I agree again.
    
    >
    > There was an interesting report on a problem along these lines from
    > Kevin Grittner a while back.  He found he needed to set seq_page_cost
    > and random_page_cost differently for the database user that ran the
    > nightly reports, precisely because the degree of caching was very
    > different than it was for the daily activity, and he got bad plans
    > otherwise.
    
    This is in fact a very interesting use case.  I believe the same
    strategy can be applied by updating cache_page_cost and pg_class.
    But I would really like this work to close that use case:
    seq_page_cost, random_page_cost and cache_page_cost should not need
    to be changed; they should be more 'hardware dependent'. What will
    need to change in such a case is the frequency of ANALYZE CACHE (or
    the arbitrarily set values). That should allow the planner and the
    costsize functions to have accurate values and provide the best plan
    (again, the cache estimation for the running query itself remains in
    the hands of Mackert and Lohman).
    OK, maybe the user will have to write some ANALYZE CACHE; between
    some queries in his scenarios.
    
    Maybe a good scenario to add to the performance farm? (Like others,
    but this one has the great merit of being a production case.)
    
    I'll write those scenarios up in a wiki page so they can be used to
    review corner cases and possible issues (not now, it is late here).
    
    >
    > --
    > Robert Haas
    > EnterpriseDB: http://www.enterprisedb.com
    > The Enterprise PostgreSQL Company
    >
    
    
    
    -- 
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
    
    
  9. [WIP] cache estimates, cache access cost

    Cédric Villemain <cedric.villemain.debian@gmail.com> — 2011-06-14T14:29:36Z

    2011/5/16 Greg Smith <greg@2ndquadrant.com>:
    > Cédric Villemain wrote:
    >>
    >>
    >> http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache
    >>
    >
    > This rebases easily to make Cedric's changes move to the end; I just pushed
    > a version with that change to
    > https://github.com/greg2ndQuadrant/postgres/tree/analyze_cache if anyone
    > wants a cleaner one to browse.  I've attached a patch too if that's more
    > your thing.
    >
    > I'd recommend not getting too stuck on the particular hook Cédric has added
    > here to compute the cache estimate, which uses mmap and mincore to figure it
    > out.  It's possible to compute similar numbers, albeit less accurate, using
    > an approach similar to how pg_buffercache inspects things.  And I even once
    > wrote a background writer extension that collected this sort of data as it
    > was running the LRU scan anyway.  Discussions of this idea seem to focus on
    > how the "what's in the cache?" data is collected, which as far as I'm
    > concerned is the least important part.  There are multiple options, some
    > work better than others, and there's no reason that can't be swapped out
    > later.  The more important question is how to store the data collected and
    > then use it for optimizing queries.
    
    Attached are updated patches without the plugin itself. I've also
    added the cache_page_cost GUC; this one is not per-tablespace, unlike
    the other *_page_cost settings.
    
    There are 6 patches:
    
    0001-Add-reloscache-column-to-pg_class.patch
    0002-Add-a-function-to-update-the-new-pg_class-cols.patch
    0003-Add-ANALYZE-OSCACHE-VERBOSE-relation.patch
    0004-Add-a-Hook-to-handle-OSCache-stats.patch
    0005-Add-reloscache-to-Index-Rel-OptInfo.patch
    0006-Add-cache_page_cost-GUC.patch
    
    I have some comments on my own code:
    
    * I am not sure of the best datatype to use for 'reloscache'.
    * I didn't include the catalog version number change in the patch
    itself.
    * oscache_update_relstats() is very similar to vac_update_relstats();
    it may be better to merge them, but reloscache should not be updated
    at the same time as the other stats.
    * There is probably too much work done in do_oscache_analyze_rel()
    because I kept vac_open_indexes() (not a big drama atm).
    * I don't know much about how gram.y works, so I am not sure my
    changes cover all cases.
    * No tests; similar columns and GUCs do not have tests either, but a
    test for ANALYZE OSCACHE is missing.
    
    -- 
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
    
  10. Re: [WIP] cache estimates, cache access cost

    Robert Haas <robertmhaas@gmail.com> — 2011-06-14T15:04:56Z

    On Tue, Jun 14, 2011 at 10:29 AM, Cédric Villemain
    <cedric.villemain.debian@gmail.com> wrote:
    > 0001-Add-reloscache-column-to-pg_class.patch
    > 0002-Add-a-function-to-update-the-new-pg_class-cols.patch
    > 0003-Add-ANALYZE-OSCACHE-VERBOSE-relation.patch
    > 0004-Add-a-Hook-to-handle-OSCache-stats.patch
    > 0005-Add-reloscache-to-Index-Rel-OptInfo.patch
    > 0006-Add-cache_page_cost-GUC.patch
    
    It seems to me that posting updated versions of this patch gets us no
    closer to addressing the concerns I (and Tom, on other threads)
    expressed about this idea previously.  Specifically:
    
    1. ANALYZE happens far too infrequently to believe that any data taken
    at ANALYZE time will still be relevant at execution time.
    2. Using data gathered by ANALYZE will make plans less stable, and our
    users complain not infrequently about the plan instability we already
    have, therefore we should not add more.
    3. Even if the data were accurate and did not cause plan instability,
    we have no evidence that using it will improve real-world performance.
    
    Now, it's possible that you or someone else could provide some
    experimental evidence refuting these points.  But right now there
    isn't any, and until there is, -1 from me on applying any of this.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  11. Re: [WIP] cache estimates, cache access cost

    Cédric Villemain <cedric.villemain.debian@gmail.com> — 2011-06-14T16:06:58Z

    2011/6/14 Robert Haas <robertmhaas@gmail.com>:
    > On Tue, Jun 14, 2011 at 10:29 AM, Cédric Villemain
    > <cedric.villemain.debian@gmail.com> wrote:
    >> 0001-Add-reloscache-column-to-pg_class.patch
    >> 0002-Add-a-function-to-update-the-new-pg_class-cols.patch
    >> 0003-Add-ANALYZE-OSCACHE-VERBOSE-relation.patch
    >> 0004-Add-a-Hook-to-handle-OSCache-stats.patch
    >> 0005-Add-reloscache-to-Index-Rel-OptInfo.patch
    >> 0006-Add-cache_page_cost-GUC.patch
    >
    > It seems to me that posting updated versions of this patch gets us no
    > closer to addressing the concerns I (and Tom, on other threads)
    > expressed about this idea previously.  Specifically:
    >
    > 1. ANALYZE happens far too infrequently to believe that any data taken
    > at ANALYZE time will still be relevant at execution time.
    
    ANALYZE happens when people execute it; otherwise it is auto-analyze,
    and I am not providing an auto-analyze-oscache.
    ANALYZE OSCACHE is just a very simple wrapper to update pg_class. The
    frequency is not important here, I believe.
    
    > 2. Using data gathered by ANALYZE will make plans less stable, and our
    > users complain not infrequently about the plan instability we already
    > have, therefore we should not add more.
    
    Again, it is hard to do an UPDATE pg_class SET reloscache, so I used
    the ANALYZE logic.
    I have also taken into account the fact that someone may want to SET
    the values, as was also suggested, so my patches make it possible to
    say: 'this table is 95% in cache, the DBA said' (it is stable, not
    based on OS stats).
    
    This case has been suggested several times and is covered by my patch.
    
    > 3. Even if the data were accurate and did not cause plan instability, we
    > have no evidence that using it will improve real-world performance.
    
    I have not finished my work on cost estimation; I believe it will
    take some time and can be done in another commitfest. At the moment
    my patches do not change any planner decision; they just offer the
    tools I need to hack on the cost estimates.
    
    >
    > Now, it's possible that you or someone else could provide some
    > experimental evidence refuting these points.  But right now there
    > isn't any, and until there is, -1 from me on applying any of this.
    
    I was trying to split the patch by group of features to reduce
    its size. The work is in progress.
    
    
    
    
    -- 
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
    
    
  12. Re: [WIP] cache estimates, cache access cost

    Greg Smith <greg@2ndquadrant.com> — 2011-06-14T17:10:24Z

    On 06/14/2011 11:04 AM, Robert Haas wrote:
    > Even if the data were accurate and did not cause plan stability, we
    > have no evidence that using it will improve real-world performance.
    >    
    
    That's the dependency Cédric has provided us a way to finally make 
    progress on.  Everyone says there's no evidence that this whole approach 
    will improve performance.  But we can't collect such data, to prove or 
    disprove it helps, without a proof of concept patch that implements 
    *something*.  You may not like the particular way the data is collected 
    here, but it's a working implementation that may be useful for some 
    people.  I'll take "data collected at ANALYZE time" as a completely 
    reasonable way to populate the new structures with realistic enough test 
    data to use initially.
    
    Surely at least one other way to populate the statistics, and possibly 
    multiple other ways that the user selects, will be needed eventually.  I 
    commented a while ago on this thread:  every one of these discussions 
    always gets dragged into the details of how the cache statistics data 
    will be collected and rejects whatever is suggested as not good enough.  
    Until that stops, no progress will ever get made on the higher level 
    details.  By its nature, developing toward integrating cached 
    percentages is going to lurch forward on both "collecting the cache 
    data" and "using the cache knowledge in queries" fronts almost 
    independently.  This is not a commit candidate; it's the first useful 
    proof of concept step for something we keep talking about but never 
    really doing.
    
    -- 
    Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
    PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
    
    
    
    
  13. Re: [WIP] cache estimates, cache access cost

    Robert Haas <robertmhaas@gmail.com> — 2011-06-14T17:16:46Z

    On Tue, Jun 14, 2011 at 1:10 PM, Greg Smith <greg@2ndquadrant.com> wrote:
    > On 06/14/2011 11:04 AM, Robert Haas wrote:
    >> Even if the data were accurate and did not cause plan stability, we
    >> have no evidence that using it will improve real-world performance.
    >
    > That's the dependency Cédric has provided us a way to finally make progress
    > on.  Everyone says there's no evidence that this whole approach will improve
    > performance.  But we can't collect such data, to prove or disprove it helps,
    > without a proof of concept patch that implements *something*.  You may not
    > like the particular way the data is collected here, but it's a working
    > implementation that may be useful for some people.  I'll take "data
    > collected at ANALYZE time" as a completely reasonable way to populate the
    > new structures with realistic enough test data to use initially.
    
    But there's no reason that code (which may or may not eventually prove
    useful) has to be incorporated into the main tree.  We don't commit
    code so people can go benchmark it; we ask for the benchmarking to be
    done first, and then if the results are favorable, we commit the code.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  14. Re: [WIP] cache estimates, cache access cost

    Robert Haas <robertmhaas@gmail.com> — 2011-06-14T17:36:26Z

    On Tue, Jun 14, 2011 at 12:06 PM, Cédric Villemain
    <cedric.villemain.debian@gmail.com> wrote:
    >> 1. ANALYZE happens far too infrequently to believe that any data taken
    >> at ANALYZE time will still be relevant at execution time.
    >
    > ANALYZE happens when people execute it, else it is auto-analyze and I
    > am not providing auto-analyze-oscache.
    > ANALYZE OSCACHE is just a very simple wrapper to update pg_class. The
    > frequency is not important here, I believe.
    
    Well, I'm not saying you have to have all the answers to post a WIP
    patch, certainly.  But in terms of getting something committable, it
    seems like we need to have at least an outline of what the long-term
    plan is.  If ANALYZE OSCACHE is an infrequent operation, then the data
    isn't going to be a reliable guide to what will happen at execution
    time...
    
    >> 2. Using data gathered by ANALYZE will make plans less stable, and our
    >> users complain not infrequently about the plan instability we already
    >> have, therefore we should not add more.
    
    ...and if it is a frequent operation then it's going to result in
    unstable plans (and maybe pg_class bloat).  There's a fundamental
    tension here that I don't think you can just wave your hands at.
    
    > I was trying to split the patch size by group of features to reduce
    > its size. The work is in progress.
    
    Totally reasonable, but I can't see committing any of it without some
    evidence that there's light at the end of the tunnel.  No performance
    tests *whatsoever* have been done.  We can debate the exact amount of
    evidence that should be required to prove that something is useful
    from a performance perspective, but we at least need some.  I'm
    beating on this point because I believe that the whole idea of trying
    to feed this information back into the planner is going to turn out to
    be something that we don't want to do.  I think it's going to turn out
    to have downsides that are far larger than the upsides.  I am
    completely willing to be proven wrong, but right now I think this
    will make things worse and you think it will make things better and I
    don't see any way to bridge that gap without doing some measurements.
    
    For example, if you run this patch on a system and subject that system
    to a relatively even workload, how much do the numbers bounce around
    between runs?  What if you vary the workload, so that you blast it
    with OLTP traffic at some times and then run reporting queries at
    other times?  Or different tables become hot at different times?
    
    Once you've written code to make the planner do something with the
    caching % values, then you can start to explore other questions.  Can
    you generate plan instability, especially on complex queries, which
    are more prone to change quickly based on small changes in the cost
    estimates?  Can you demonstrate a workload where bad performance is
    inevitable with the current code, but with your code, the system
    becomes self-tuning and ends up with good performance?  What happens
    if you have a large cold table with a small hot end where all activity
    is concentrated?
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  15. Re: [WIP] cache estimates, cache access cost

    Cédric Villemain <cedric.villemain.debian@gmail.com> — 2011-06-14T19:11:45Z

    2011/6/14 Robert Haas <robertmhaas@gmail.com>:
    > On Tue, Jun 14, 2011 at 12:06 PM, Cédric Villemain
    > <cedric.villemain.debian@gmail.com> wrote:
    >>> 1. ANALYZE happens far too infrequently to believe that any data taken
    >>> at ANALYZE time will still be relevant at execution time.
    >>
    >> ANALYZE happens when people execute it, else it is auto-analyze and I
    >> am not providing auto-analyze-oscache.
    >> ANALYZE OSCACHE is just a very simple wrapper to update pg_class. The
    >> frequency is not important here, I believe.
    >
    > Well, I'm not saying you have to have all the answers to post a WIP
    > patch, certainly.  But in terms of getting something committable, it
    > seems like we need to have at least an outline of what the long-term
    > plan is.  If ANALYZE OSCACHE is an infrequent operation, then the data
    > isn't going to be a reliable guide to what will happen at execution
    > time...
    
    Ok.
    
    >
    >>> 2. Using data gathered by ANALYZE will make plans less stable, and our
    >>> users complain not infrequently about the plan instability we already
    >>> have, therefore we should not add more.
    >
    > ...and if it is a frequent operation then it's going to result in
    > unstable plans (and maybe pg_class bloat).  There's a fundamental
    > tension here that I don't think you can just wave your hands at.
    
    I don't want to hide that point, which is simply correct.
    The idea is not to have something that needs to be updated too often,
    but this tension does need to be taken into account.
    
    >
    >> I was trying to split the patch size by group of features to reduce
    >> its size. The work is in progress.
    >
    > Totally reasonable, but I can't see committing any of it without some
    > evidence that there's light at the end of the tunnel.  No performance
    > tests *whatsoever* have been done.  We can debate the exact amount of
    > evidence that should be required to prove that something is useful
    > from a performance perspective, but we at least need some.  I'm
    > beating on this point because I believe that the whole idea of trying
    > to feed this information back into the planner is going to turn out to
    > be something that we don't want to do.  I think it's going to turn out
    > to have downsides that are far larger than the upsides.
    
    It is possible, yes.
    I try to make changes in such a way that, if reloscache is left at its
    default value, the planner keeps the same behavior as in the past.
    
    > I am
    > completely willing to be proven wrong, but right now I think this
    > will make things worse and you think it will make things better and I
    > don't see any way to bridge that gap without doing some measurements.
    
    correct.
    
    >
    > For example, if you run this patch on a system and subject that system
    > to a relatively even workload, how much do the numbers bounce around
    > between runs?  What if you vary the workload, so that you blast it
    > with OLTP traffic at some times and then run reporting queries at
    > other times?  Or different tables become hot at different times?
    
    This is all true; it is *already* true today.
    See the thread about random_page_cost vs index_page_cost, where the
    good option was to change the parameters at certain times of the day
    (IIRC the use case).
    
    I mean that I agree those benchmarks need to be done; hopefully I can
    fix some use cases while breaking others as little as possible, or not
    at all, or ...
    
    >
    > Once you've written code to make the planner do something with the
    > caching % values, then you can start to explore other questions.  Can
    > you generate plan instability, especially on complex queries, which
    > are more prone to change quickly based on small changes in the cost
    > estimates?  Can you demonstrate a workload where bad performance is
    > inevitable with the current code, but with your code, the system
    
    My next step is the cost estimation changes. I already have some very
    small use cases where the minimal changes I have made so far are
    interesting, but that is not enough to present as evidence.
    
    > becomes self-tuning and ends up with good performance?  What happens
    > if you have a large cold table with a small hot end where all activity
    > is concentrated?
    
    We are at step 3 here :-) I already have some ideas to handle those
    situations, but they are not yet polished.
    
    The current idea is to be conservative, as PostgreSQL usually is, for example:
    
    	/*
    	 * disk and cache costs
    	 * This assumes no knowledge of the data distribution or query usage,
    	 * even though a large table may have a hot 10% that is the only part
    	 * requested, or we may select only (c)old data so that the cache is
    	 * useless.
    	 * We keep the original strategy of not guessing too much and simply
    	 * weight the cost globally.
    	 */
    	run_cost += baserel->pages * ( spc_seq_page_cost * (1 - baserel->oscache)
    						     + cache_page_cost   * baserel->oscache);
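    
    A minimal stand-alone sketch of the weighting above, for trying numbers
    outside the planner. The function name and any sample costs are
    assumptions for illustration, not code from the patches; only the
    formula comes from the snippet above.
    
    ```c
    #include <assert.h>
    
    /*
     * Hypothetical stand-alone version of the blended sequential I/O cost.
     * oscache is the fraction of the relation assumed cached, in [0, 1].
     */
    static double
    blended_seq_io_cost(double pages, double seq_page_cost,
                        double cache_page_cost, double oscache)
    {
        assert(oscache >= 0.0 && oscache <= 1.0);
    
        /* uncached pages pay seq_page_cost, cached pages pay cache_page_cost */
        return pages * (seq_page_cost * (1.0 - oscache)
                        + cache_page_cost * oscache);
    }
    ```
    
    With oscache = 0 the result is pages * seq_page_cost, so a relation
    with no cache information would be costed exactly as it is without the
    extra term.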
    
    
    
    
    
    -- 
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
    
    
  16. Re: [WIP] cache estimates, cache access cost

    Alvaro Herrera <alvherre@commandprompt.com> — 2011-06-14T20:31:32Z

    Excerpts from Cédric Villemain's message of mar jun 14 10:29:36 -0400 2011:
    
    > Attached are updated patches without the plugin itself. I've also
    > added the cache_page_cost GUC, this one is not per tablespace, like
    > others page_cost.
    > 
    > There are 6 patches:
    > 
    > 0001-Add-reloscache-column-to-pg_class.patch
    
    Hmm, do you really need this to be a new column?  Would it work to have
    it be a reloption?
    
    -- 
    Álvaro Herrera <alvherre@commandprompt.com>
    The PostgreSQL Company - Command Prompt, Inc.
    PostgreSQL Replication, Consulting, Custom Development, 24x7 support
    
    
  17. Re: [WIP] cache estimates, cache access cost

    Cédric Villemain <cedric.villemain.debian@gmail.com> — 2011-06-14T21:10:20Z

    2011/6/14 Alvaro Herrera <alvherre@commandprompt.com>:
    > Excerpts from Cédric Villemain's message of mar jun 14 10:29:36 -0400 2011:
    >
    >> Attached are updated patches without the plugin itself. I've also
    >> added the cache_page_cost GUC, this one is not per tablespace, like
    >> others page_cost.
    >>
    >> There are 6 patches:
    >>
    >> 0001-Add-reloscache-column-to-pg_class.patch
    >
    > Hmm, do you really need this to be a new column?  Would it work to have
    > it be a reloption?
    
    If we can have ALTER TABLE running under a heavy workload, why not.
    I am a bit scared by the effect of such a reloption: it focuses on a
    HINT-oriented strategy, whereas I would like to allow a dynamic
    strategy from the server. That work is not done and may not work out,
    so a reloption is good at least as a backup (and is closer to the idea
    suggested by Tom and others).
    
    
    
    
    -- 
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
    
    
  18. Re: [WIP] cache estimates, cache access cost

    Alvaro Herrera <alvherre@commandprompt.com> — 2011-06-14T21:17:25Z

    Excerpts from Cédric Villemain's message of mar jun 14 17:10:20 -0400 2011:
    
    > If we can have ALTER TABLE running on heavy workload, why not.
    > I am bit scared by the effect of such reloption, it focus on HINT
    > oriented strategy when I would like to allow a dynamic strategy from
    > the server. This work is not done and may not work, so a reloption is
    > good at least as a backup  (and is more in the idea suggested by Tom
    > and others)
    
    Hmm, sounds like yet another use case for pg_class_nt.  Why do these
    keep popping up?
    
    -- 
    Álvaro Herrera <alvherre@commandprompt.com>
    The PostgreSQL Company - Command Prompt, Inc.
    PostgreSQL Replication, Consulting, Custom Development, 24x7 support
    
    
  19. Re: [WIP] cache estimates, cache access cost

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-06-14T22:02:25Z

    Alvaro Herrera <alvherre@commandprompt.com> writes:
    > Excerpts from Cédric Villemain's message of mar jun 14 10:29:36 -0400 2011:
    >> 0001-Add-reloscache-column-to-pg_class.patch
    
    > Hmm, do you really need this to be a new column?  Would it work to have
    > it be a reloption?
    
    If it's to be updated in the same way as ANALYZE updates reltuples and
    relpages (ie, an in-place non-transactional update), I think it'll have
    to be a real column.
    
    			regards, tom lane
    
    
  20. Re: [WIP] cache estimates, cache access cost

    Greg Smith <greg@2ndquadrant.com> — 2011-06-14T22:17:52Z

    On 06/14/2011 01:16 PM, Robert Haas wrote:
    > But there's no reason that code (which may or may not eventually prove
    > useful) has to be incorporated into the main tree.  We don't commit
    > code so people can go benchmark it; we ask for the benchmarking to be
    > done first, and then if the results are favorable, we commit the code.
    >    
    
    Who said anything about this being a commit candidate?  The "WIP" in the 
    subject says it's not intended to be.  The community asks people to 
    submit design ideas early so that ideas around them can be explored 
    publicly.  One of the things that needs to be explored, and that could 
    use some community feedback, is exactly how this should be benchmarked 
    in the first place.  This topic--planning based on cached 
    percentage--keeps coming up, but hasn't gone very far as an abstract 
    discussion.  Having a patch to test lets it turn to a concrete one.
    
    Note that I already listed myself as the reviewer  here, so it's not 
    even like this is asking explicitly for a community volunteer to help.  
    Would you like us to research this privately and then dump a giant patch 
    that is commit candidate quality on everyone six months from now, 
    without anyone else getting input to the process, or would you like the 
    work to happen here?  I recommended Cédric not ever bother soliciting 
    ideas early, because I didn't want to get into this sort of debate.  I 
    avoid sending anything here unless I already have a strong idea about 
    the solution, because it's hard to keep criticism at bay even with 
    that.  He was more optimistic about working within the community 
    contribution guidelines and decided to send this over early instead.  If 
    you feel this is too rough to even discuss, I'll mark it returned with 
    feedback and we'll go develop this ourselves.
    
    -- 
    Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
    PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
    
    
    
    
  21. Re: [WIP] cache estimates, cache access cost

    Bruce Momjian <bruce@momjian.us> — 2011-06-14T22:45:26Z

    Greg Smith wrote:
    > On 06/14/2011 01:16 PM, Robert Haas wrote:
    > > But there's no reason that code (which may or may not eventually prove
    > > useful) has to be incorporated into the main tree.  We don't commit
    > > code so people can go benchmark it; we ask for the benchmarking to be
    > > done first, and then if the results are favorable, we commit the code.
    > >    
    > 
    > Who said anything about this being a commit candidate?  The "WIP" in the 
    > subject says it's not intended to be.  The community asks people to 
    > submit design ideas early so that ideas around them can be explored 
    > publicly.  One of the things that needs to be explored, and that could 
    > use some community feedback, is exactly how this should be benchmarked 
    > in the first place.  This topic--planning based on cached 
    > percentage--keeps coming up, but hasn't gone very far as an abstract 
    > discussion.  Having a patch to test lets it turn to a concrete one.
    > 
    > Note that I already listed myself as the reviewer  here, so it's not 
    > even like this is asking explicitly for a community volunteer to help.  
    > Would you like us to research this privately and then dump a giant patch 
    > that is commit candidate quality on everyone six months from now, 
    > without anyone else getting input to the process, or would you like the 
    > work to happen here?  I recommended Cédric not ever bother soliciting 
    > ideas early, because I didn't want to get into this sort of debate.  I 
    > avoid sending anything here unless I already have a strong idea about 
    > the solution, because it's hard to keep criticism at bay even with 
    > that.  He was more optimistic about working within the community 
    > contribution guidelines and decided to send this over early instead.  If 
    > you feel this is too rough to even discuss, I'll mark it returned with 
    > feedback and we'll go develop this ourselves.
    
    I would like to see us continue researching in this direction.  I think
    perhaps the background writer would be ideal for collecting this
    information because it scans the buffer cache already, and frequently.
    (Yes, I know it can't access databases.)
    
    I think random_page_cost is a dead-end --- it will never be possible for
    it to produce the right value for us.  Its value is tied up in caching,
    e.g. the default 4 is not the right value for a physical drive (it
    should be much higher), but kernel and shared buffer caching require it
    to be a hybrid number that isn't really realistic.  And once we have
    caching in that number, it is not going to be even caching for all
    tables, obviously.  Hence, there is no way for random_page_cost to be
    improved and we have to start thinking about alternatives.
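    
    To make the "hybrid number" point concrete, here is a small sketch that
    back-solves the cached fraction implied by a given random_page_cost.
    The linear blending model, the function name, and the costs used in the
    note below (40 for an uncached random read, 0.1 for a cached one) are
    assumptions for illustration, not measurements.
    
    ```c
    #include <assert.h>
    
    /*
     * If the observed cost is a blend
     *     observed = disk_cost * (1 - h) + cache_cost * h
     * then the implied cached fraction is
     *     h = (disk_cost - observed) / (disk_cost - cache_cost).
     */
    static double
    implied_cached_fraction(double observed_cost, double disk_cost,
                            double cache_cost)
    {
        assert(disk_cost > cache_cost);
        assert(observed_cost >= cache_cost && observed_cost <= disk_cost);
    
        return (disk_cost - observed_cost) / (disk_cost - cache_cost);
    }
    ```
    
    Under those made-up costs, implied_cached_fraction(4.0, 40.0, 0.1) is
    about 0.90: a single setting of 4 would only match a table that is
    roughly 90% cached, and would be off for anything hotter or colder.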
    
    Basically, random_page_cost is a terrible setting and we have to admit
    that and move forward.  I realize the concerns about unstable plans, and
    we might need to give users the option of stable plans with a fixed
    random_page_cost, but at this point we don't even have enough data to
    know we need that.  What we do know is that random_page_cost is
    inadequate.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        http://momjian.us
      EnterpriseDB                             http://enterprisedb.com
    
      + It's impossible for everything to be true. +
    
    
  22. Re: [WIP] cache estimates, cache access cost

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-06-14T23:08:09Z

    Greg Smith <greg@2ndQuadrant.com> writes:
    > On 06/14/2011 01:16 PM, Robert Haas wrote:
    >> But there's no reason that code (which may or may not eventually prove
    >> useful) has to be incorporated into the main tree.  We don't commit
    >> code so people can go benchmark it; we ask for the benchmarking to be
    >> done first, and then if the results are favorable, we commit the code.
    
    > Who said anything about this being a commit candidate?  The "WIP" in the 
    > subject says it's not intended to be.  The community asks people to 
    > submit design ideas early so that ideas around them can be explored 
    > publicly.  One of the things that needs to be explored, and that could 
    > use some community feedback, is exactly how this should be benchmarked 
    > in the first place.  This topic--planning based on cached 
    > percentage--keeps coming up, but hasn't gone very far as an abstract 
    > discussion.  Having a patch to test lets it turn to a concrete one.
    
    Yeah, it *can't* go very far as an abstract discussion ... we need some
    realistic testing to decide whether this is a good idea, and you can't
    get that without code.
    
    I think the real underlying issue here is that we have this CommitFest
    process that is focused on getting committable or nearly-committable
    code into the tree, and it just doesn't fit well for experimental code.
    I concur with Robert's desire to not push experimental code into the
    main repository, but we need to have *some* way of working with it.
    Maybe a separate repo where experimental branches could hang out would
    be helpful?
    
    (Another way of phrasing my point is that "WIP" is not conveying the
    true status of this patch.  Maybe "Experimental" would be an appropriate
    label.)
    
    			regards, tom lane
    
    
  23. Re: [WIP] cache estimates, cache access cost

    Greg Smith <greg@2ndquadrant.com> — 2011-06-15T00:01:44Z

    On 06/14/2011 07:08 PM, Tom Lane wrote:
    > I concur with Robert's desire to not push experimental code into the
    > main repository, but we need to have *some* way of working with it.
    > Maybe a separate repo where experimental branches could hang out would
    > be helpful?
    >    
    
    Well, this one is sitting around in branches at both git.postgresql.org 
    and github so far, both being updated periodically.  Maybe there's some 
    value around an official experimental repository too, but I thought that 
    was the idea of individual people having their own directories on 
    git.postgresql.org.  Do we need something fancier than that?  It would be 
    nice, but seems little return on investment to improve that, relative to 
    what you can do easily enough now.
    
    The idea David Fetter has been advocating of having a "bit rot" farm to 
    help detect when the experimental branches drift too far out of date 
    tries to make that concept really formal.  I like that idea, too, but 
    find it hard to marshal enough resources to do something about it.  The 
    current status quo isn't that terrible; noticing bit rot when it's 
    relevant isn't that hard to do.
    
    -- 
    Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
    PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
    
    
    
    
  24. Re: [WIP] cache estimates, cache access cost

    Robert Haas <robertmhaas@gmail.com> — 2011-06-15T03:43:31Z

    On Tue, Jun 14, 2011 at 6:17 PM, Greg Smith <greg@2ndquadrant.com> wrote:
    > Who said anything about this being a commit candidate?  The "WIP" in the
    > subject says it's not intended to be.  The community asks people to submit
    > design ideas early so that ideas around them can be explored publicly.  One
    > of the things that needs to be explored, and that could use some community
    > feedback, is exactly how this should be benchmarked in the first place.
    >  This topic--planning based on cached percentage--keeps coming up, but
    > hasn't gone very far as an abstract discussion.  Having a patch to test lets
    > it turn to a concrete one.
    >
    > Note that I already listed myself as the reviewer  here, so it's not even
    > like this is asking explicitly for a community volunteer to help.  Would you
    > like us to research this privately and then dump a giant patch that is
    > commit candidate quality on everyone six months from now, without anyone
    > else getting input to the process, or would you like the work to happen
    > here?  I recommended Cédric not ever bother soliciting ideas early, because
    > I didn't want to get into this sort of debate.  I avoid sending anything
    > here unless I already have a strong idea about the solution, because it's
    > hard to keep criticism at bay even with that.  He was more optimistic about
    > working within the community contribution guidelines and decided to send
    > this over early instead.  If you feel this is too rough to even discuss,
    > I'll mark it returned with feedback and we'll go develop this ourselves.
    
    My usual trope on this subject is that WIP patches tend to elicit
    helpful feedback if and only if the patch author is clear about what
    sort of feedback they are seeking.  I'm interested in this topic, so,
    I'm willing to put some effort into it; but, as I've said before, I
    think this patch is coming from the wrong end, so in the absence of
    any specific guidance on what sort of input would be useful, that's
    the feedback you're getting.  Feel free to clarify what would be more
    helpful.  :-)
    
    Incidentally, I have done a bit of math around how to rejigger the
    costing formulas to take cached_page_cost and caching_percentage into
    account, which I think is the most interesting place to start this
    work.  If it's helpful, I can write it up in a more organized way and
    post that; it likely wouldn't be that much work to incorporate it into
    what Cedric has here already.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  25. Re: [WIP] cache estimates, cache access cost

    Greg Stark <stark@mit.edu> — 2011-06-19T13:38:21Z

    On Tue, Jun 14, 2011 at 4:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
    > 1. ANALYZE happens far too infrequently to believe that any data taken
    > at ANALYZE time will still be relevant at execution time.
    > 2. Using data gathered by ANALYZE will make plans less stable, and our
    > users complain not infrequently about the plan instability we already
    > have, therefore we should not add more.
    > 3. Even if the data were accurate and did not cause plan stability, we
    > have no evidence that using it will improve real-world performance.
    
    I feel like this is all baseless FUD. ANALYZE isn't perfect but it's
    our interface for telling postgres to gather stats and we generally
    agree that having stats and modelling the system behaviour as
    accurately as practical is the right direction so we need a specific
    reason why this stat and this bit of modeling is a bad idea before we
    dismiss it.
    
    I think the kernel of truth in these concerns is simply that
    everything else ANALYZE looks at mutates only on DML. If you load the
    same data into two databases and run ANALYZE you'll get (modulo random
    sampling) the same stats. And if you never modify it and analyze it
    again a week later you'll get the same stats again. So autovacuum can
    guess when to run analyze based on the number of DML operations, it
    can run it without regard to how busy the system is, and it can hold
    off on running it if the data hasn't changed.
    
    In the case of the filesystem buffer cache the cached percentage will
    vary over time regardless of whether the data changes. Plain select
    queries will change it, even other activity outside the database will
    change it. There are a bunch of strategies for mitigating this
    problem: we might want to look at the cache situation more frequently,
    discount the results we see more aggressively, and possibly
    maintain a kind of running average over time.
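    
    The running-average idea could be sketched as an exponentially weighted
    average of the sampled cached fraction. The function name and the alpha
    smoothing weight are hypothetical; this is one possible reading of
    "running average", not something taken from the patches.
    
    ```c
    #include <assert.h>
    
    /*
     * Exponentially weighted running average of a cached fraction.
     * alpha in (0, 1] is the weight given to the newest sample
     * (hypothetical tuning knob).
     */
    static double
    smooth_cached_fraction(double previous, double sample, double alpha)
    {
        assert(alpha > 0.0 && alpha <= 1.0);
        assert(sample >= 0.0 && sample <= 1.0);
    
        /* move the old estimate a fraction alpha of the way to the sample */
        return previous + alpha * (sample - previous);
    }
    ```
    
    A small alpha discounts each new sample heavily, so one cache-flushing
    reporting query moves the estimate only a little; a large alpha tracks
    the cache closely at the price of less stable inputs to the planner.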
    
    There's another problem which I haven't seen mentioned. Because the
    access method will affect the cache there's the possibility of
    feedback loops. e.g. A freshly loaded system prefers sequential scans
    for a given table because without the cache the seeks of random reads
    are too expensive... causing it to never load that table into cache...
    causing that table to never be cached and never switch to an index
    method. It's possible there are mitigation strategies for this as well
    such as keeping a running average over time and discounting the
    estimates with some heuristic values.
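    
    The feedback loop can be shown with a toy model. Everything here is
    an assumption made for illustration: the page counts, the cost
    constants, the rule that only index access warms the cache, and the
    use of a minimum assumed cached fraction as the "heuristic value".
    
    ```c
    #include <assert.h>
    
    /*
     * Toy model of the feedback loop: returns how many of `rounds`
     * planning rounds pick the index scan.  floor_cached is a heuristic
     * lower bound on the cached fraction the planner is willing to assume.
     */
    static int
    count_index_plans(int rounds, double cached, double floor_cached)
    {
        const double table_pages = 1000.0;  /* pages read by a seq scan */
        const double index_pages = 300.0;   /* pages touched via the index */
        const double seq_page_cost = 1.0;
        const double random_page_cost = 4.0;
        const double cache_page_cost = 0.1;
        int         index_plans = 0;
        int         i;
    
        for (i = 0; i < rounds; i++)
        {
            /* never assume less than the heuristic floor */
            double      c = cached > floor_cached ? cached : floor_cached;
            double      seq_cost = table_pages * seq_page_cost;
            double      idx_cost = index_pages *
                (random_page_cost * (1.0 - c) + cache_page_cost * c);
    
            if (idx_cost < seq_cost)
            {
                index_plans++;
                /* in this toy, only index access warms the cache */
                cached += 0.5 * (1.0 - cached);
            }
        }
        return index_plans;
    }
    ```
    
    Starting cold, count_index_plans(10, 0.0, 0.0) returns 0: the index
    scan is never tried, so the table never gets cached. With a floor of
    0.2 the first index scan gets through, the cache warms up, and all 10
    rounds pick the index.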
    
    
    
    
    
    
    
    -- 
    greg
    
    
  26. Re: [WIP] cache estimates, cache access cost

    Cédric Villemain <cedric.villemain.debian@gmail.com> — 2011-06-19T14:26:03Z

    2011/6/19 Greg Stark <stark@mit.edu>:
    > On Tue, Jun 14, 2011 at 4:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
    >> 1. ANALYZE happens far too infrequently to believe that any data taken
    >> at ANALYZE time will still be relevant at execution time.
    >> 2. Using data gathered by ANALYZE will make plans less stable, and our
    >> users complain not infrequently about the plan instability we already
    >> have, therefore we should not add more.
    >> 3. Even if the data were accurate and did not cause plan stability, we
    >> have no evidence that using it will improve real-world performance.
    >
    > I feel like this is all baseless FUD. ANALYZE isn't perfect but it's
    > our interface for telling postgres to gather stats and we generally
    > agree that having stats and modelling the system behaviour as
    > accurately as practical is the right direction so we need a specific
    > reason why this stat and this bit of modeling is a bad idea before we
    > dismiss it.
    >
    > I think the kernel of truth in these concerns is simply that
    > everything else ANALYZE looks at mutates only on DML. If you load the
    > same data into two databases and run ANALYZE you'll get (modulo random
    > sampling) the same stats. And if you never modify it and analyze it
    > again a week later you'll get the same stats again. So autovacuum can
    > guess when to run analyze based on the number of DML operations, it
    > can run it without regard to how busy the system is, and it can hold
    > off on running it if the data hasn't changed.
    >
    > In the case of the filesystem buffer cache the cached percentage will
    > vary over time regardless of whether the data changes. Plain select
    > queries will change it, even other activity outside the database will
    > change it. There are a bunch of strategies for mitigating this
    > problem: we might want to look at the cache situation more frequently,
    > discount the results we see since more aggressively, and possibly
    > maintain a kind of running average over time.
    
    Yes.
    
    >
    > There's another problem which I haven't seen mentioned. Because the
    > access method will affect the cache there's the possibility of
    > feedback loops. e.g. A freshly loaded system prefers sequential scans
    > for a given table because without the cache the seeks of random reads
    > are too expensive... causing it to never load that table into cache...
    > causing that table to never be cached and never switch to an index
    > method. It's possible there are mitigation strategies for this as well
    
    Yeah, that's one of the problems to solve. So far I've tried to keep
    the planner behaving as it does today when rel_oscache == 0, so that
    a fresh server gets the same plans as a server without rel_oscache.
    
    Those points are to be solved in the cost estimates (and selfunc).
    For this case, there is a balance between page filtering cost and
    index access cost. *And* once the table is in cache, the index costs
    less and can be the better choice because it needs less filtering
    (fewer rows, fewer pages, less work). There is also a possible issue
    here (if using the index evicts the table from cache), but I am not
    too worried about that right now.
    
    > such as keeping a running average over time and discounting the
    > estimates with some heuristic values.
    
    Yes, definitely something to think about. My biggest fear here is
    shared servers (where services compete for the OS cache, defeating
    the kernel's cache strategies).
    
    --
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
    
    
  27. Re: [WIP] cache estimates, cache access cost

    Robert Haas <robertmhaas@gmail.com> — 2011-06-19T18:52:12Z

    On Sun, Jun 19, 2011 at 9:38 AM, Greg Stark <stark@mit.edu> wrote:
    > On Tue, Jun 14, 2011 at 4:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
    >> 1. ANALYZE happens far too infrequently to believe that any data taken
    >> at ANALYZE time will still be relevant at execution time.
    >> 2. Using data gathered by ANALYZE will make plans less stable, and our
    >> users complain not infrequently about the plan instability we already
    >> have, therefore we should not add more.
    >> 3. Even if the data were accurate and did not cause plan stability, we
    >> have no evidence that using it will improve real-world performance.
    >
    > I feel like this is all baseless FUD. ANALYZE isn't perfect but it's
    > our interface for telling postgres to gather stats and we generally
    > agree that having stats and modelling the system behaviour as
    > accurately as practical is the right direction so we need a specific
    > reason why this stat and this bit of modeling is a bad idea before we
    > dismiss it.
    >
    > I think the kernel of truth in these concerns is simply that
    > everything else ANALYZE looks at mutates only on DML. If you load the
    > same data into two databases and run ANALYZE you'll get (modulo random
    > sampling) the same stats. And if you never modify it and analyze it
    > again a week later you'll get the same stats again. So autovacuum can
    > guess when to run analyze based on the number of DML operations, it
    > can run it without regard to how busy the system is, and it can hold
    > off on running it if the data hasn't changed.
    >
    > In the case of the filesystem buffer cache the cached percentage will
    > vary over time regardless of whether the data changes. Plain select
    > queries will change it, even other activity outside the database will
    > change it. There are a bunch of strategies for mitigating this
    > problem: we might want to look at the cache situation more frequently,
    > discount the results we see since more aggressively, and possibly
    > maintain a kind of running average over time.
    >
    > There's another problem which I haven't seen mentioned. Because the
    > access method will affect the cache there's the possibility of
    > feedback loops. e.g. A freshly loaded system prefers sequential scans
    > for a given table because without the cache the seeks of random reads
    > are too expensive... causing it to never load that table into cache...
    > causing that table to never be cached and never switch to an index
    > method. It's possible there are mitigation strategies for this as well
    > such as keeping a running average over time and discounting the
    > estimates with some heuristic values.
    
    *scratches head*
    
    Well, yeah.  I completely agree with you that these are the things we
    need to worry about.  Maybe I did a bad job explaining myself, because
    ISTM you said my concerns were FUD and then went on to restate them in
    different words.
    
    I'm not bent out of shape about using ANALYZE to try to gather the
    information.  That's probably a reasonable approach if it turns out we
    actually need to do it at all.  I am not sure we do.  What I've argued
    for in the past is that we start by estimating the percentage of the
    relation that will be cached based on its size relative to
    effective_cache_size, and allow the administrator to override the
    percentage on a per-relation basis if it turns out to be wrong.  That
    would avoid all of these concerns and allow us to focus on the issue
    of how the caching percentages impact the choice of plan, and whether
    the plans that pop out are in fact better when you provide information
    on caching as input.  If we have that facility in core, then people
    can write scripts or plug-in modules to do ALTER TABLE .. SET
    (caching_percentage = XYZ) every hour or so based on the sorts of
    statistics that Cedric is gathering here, and users will be able to
    experiment with a variety of algorithms and determine which ones work
    the best.
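    
    A rough sketch of that proposal, under stated assumptions. The thread
    only says the estimate is "based on its size relative to
    effective_cache_size"; the simple clamped ratio below is one possible
    reading, not an agreed formula. RelCacheSetting,
    estimate_cached_fraction and blended_page_cost are hypothetical names,
    and cache_page_cost refers to the GUC proposed at the start of this
    thread, which does not exist in PostgreSQL.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-relation setting, e.g. filled from a reloption. */
typedef struct RelCacheSetting
{
    bool   has_override;   /* true if the administrator set a value */
    double override_frac;  /* administrator-supplied fraction, 0..1 */
} RelCacheSetting;

static double
clamp01(double x)
{
    if (x < 0.0)
        return 0.0;
    if (x > 1.0)
        return 1.0;
    return x;
}

/*
 * Guess the cached fraction of a relation. An explicit override wins;
 * otherwise assume the cache holds as much of the relation as fits.
 * Both sizes are in pages.
 */
double
estimate_cached_fraction(double rel_pages, double effective_cache_pages,
                         const RelCacheSetting *setting)
{
    if (setting != NULL && setting->has_override)
        return clamp01(setting->override_frac);
    if (rel_pages <= 0.0)
        return 1.0;             /* empty relation: nothing to read */
    return clamp01(effective_cache_pages / rel_pages);
}

/* Weight the per-page cost by how much of the relation is cached. */
double
blended_page_cost(double cached_frac, double cache_page_cost,
                  double disk_page_cost)
{
    return cached_frac * cache_page_cost
        + (1.0 - cached_frac) * disk_page_cost;
}
```

    For a 200-page relation and a 100-page cache this gives a cached
    fraction of 0.5; with a cached-page cost of 0 and a disk-page cost of
    4.0 the blended per-page cost is 2.0. An external script could then
    replace the guess by setting the override.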
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  28. Re: [WIP] cache estimates, cache access cost

    Cédric Villemain <cedric.villemain.debian@gmail.com> — 2011-06-19T19:32:13Z

    2011/6/19 Robert Haas <robertmhaas@gmail.com>:
    > On Sun, Jun 19, 2011 at 9:38 AM, Greg Stark <stark@mit.edu> wrote:
    >> On Tue, Jun 14, 2011 at 4:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
    >>> 1. ANALYZE happens far too infrequently to believe that any data taken
    >>> at ANALYZE time will still be relevant at execution time.
    >>> 2. Using data gathered by ANALYZE will make plans less stable, and our
    >>> users complain not infrequently about the plan instability we already
    >>> have, therefore we should not add more.
    >>> 3. Even if the data were accurate and did not cause plan stability, we
    >>> have no evidence that using it will improve real-world performance.
    >>
    >> I feel like this is all baseless FUD. ANALYZE isn't perfect but it's
    >> our interface for telling postgres to gather stats and we generally
    >> agree that having stats and modelling the system behaviour as
    >> accurately as practical is the right direction so we need a specific
    >> reason why this stat and this bit of modeling is a bad idea before we
    >> dismiss it.
    >>
    >> I think the kernel of truth in these concerns is simply that
    >> everything else ANALYZE looks at mutates only on DML. If you load the
    >> same data into two databases and run ANALYZE you'll get (modulo random
    >> sampling) the same stats. And if you never modify it and analyze it
    >> again a week later you'll get the same stats again. So autovacuum can
    >> guess when to run analyze based on the number of DML operations, it
    >> can run it without regard to how busy the system is, and it can hold
    >> off on running it if the data hasn't changed.
    >>
    >> In the case of the filesystem buffer cache the cached percentage will
    >> vary over time regardless of whether the data changes. Plain select
    >> queries will change it, even other activity outside the database will
    >> change it. There are a bunch of strategies for mitigating this
    >> problem: we might want to look at the cache situation more frequently,
    >> discount the results we see since more aggressively, and possibly
    >> maintain a kind of running average over time.
    >>
    >> There's another problem which I haven't seen mentioned. Because the
    >> access method will affect the cache there's the possibility of
    >> feedback loops. e.g. A freshly loaded system prefers sequential scans
    >> for a given table because without the cache the seeks of random reads
    >> are too expensive... causing it to never load that table into cache...
    >> causing that table to never be cached and never switch to an index
    >> method. It's possible there are mitigation strategies for this as well
    >> such as keeping a running average over time and discounting the
    >> estimates with some heuristic values.
    >
    > *scratches head*
    >
    > Well, yeah.  I completely agree with you that these are the things we
    > need to worry about.  Maybe I did a bad job explaining myself, because
    > ISTM you said my concerns were FUD and then went on to restate them in
    > different words.
    >
    > I'm not bent out of shape about using ANALYZE to try to gather the
    > information.  That's probably a reasonable approach if it turns out we
    > actually need to do it at all.  I am not sure we do.  What I've argued
    > for in the past is that we start by estimating the percentage of the
    > relation that will be cached based on its size relative to
    > effective_cache_size, and allow the administrator to override the
    > percentage on a per-relation basis if it turns out to be wrong.  That
    > would avoid all of these concerns and allow us to focus on the issue
    > of how the caching percentages impact the choice of plan, and whether
    > the plans that pop out are in fact better when you provide information
    > on caching as input.  If we have that facility in core, then people
    > can write scripts or plug-in modules to do ALTER TABLE .. SET
    > (caching_percentage = XYZ) every hour or so based on the sorts of
    > statistics that Cedric is gathering here, and users will be able to
    > experiment with a variety of algorithms and determine which ones work
    > the best.
    
    Robert, I am very surprised.
    My patch does offer that.
    
    1st, I used ANALYZE because it is the way I found to update pg_class.
    You are suggesting ALTER TABLE instead; that is fine, but give me that
    lock-free :) Otherwise we have, ahem, Alvaro's pg_class_ng (I find this
    one interesting because it will be a lot easier to have different
    values on a standby server if we find a way to make pg_class_ng
    'updatable' per server).
    So, as long as the value can be changed without problems, I don't care
    where it resides.
    
    2nd, I provided the patches in the last CF precisely so that we can
    move on to the exciting part: the cost-estimate changes. (After all,
    we can work on the cost estimates, and if we later find a way to use
    ALTER TABLE or pg_class_ng, just do that instead of the ANALYZE magic.)
    
    3rd, you can right now write a plugin to set the value of rel_oscache
    (exactly like the one you'd write for an ALTER TABLE SET reloscache...)
    
    float4
    RelationGetRelationOSCacheInFork(Relation relation, ForkNumber forkNum)
    {
        float4 percent = 0;

        /* if a plugin is present, let it manage things */
        if (OSCache_hook)
            percent = (*OSCache_hook) (relation, forkNum);
        return percent;
    }
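    
    To make the plugin idea concrete, here is a self-contained sketch of
    the hook pattern. The typedefs are stand-ins so the example compiles
    without PostgreSQL headers; oscache_hook_type, fixed_oscache and
    install_fixed_oscache are hypothetical names, and the returned 0.5 is
    an arbitrary placeholder rather than a value the patch prescribes.

```c
#include <stddef.h>

/* Stand-ins for PostgreSQL types, so the sketch compiles on its own. */
typedef float float4;
typedef int ForkNumber;
typedef struct RelationData { unsigned int rd_id; } *Relation;

/* Hypothetical hook type matching the call in the function below. */
typedef float4 (*oscache_hook_type) (Relation relation, ForkNumber forkNum);

oscache_hook_type OSCache_hook = NULL;

float4
RelationGetRelationOSCacheInFork(Relation relation, ForkNumber forkNum)
{
    float4 percent = 0;

    /* if a plugin is present, let it manage things */
    if (OSCache_hook)
        percent = (*OSCache_hook) (relation, forkNum);
    return percent;
}

/* Example plugin callback: report the same value for every relation. */
static float4
fixed_oscache(Relation relation, ForkNumber forkNum)
{
    (void) relation;
    (void) forkNum;
    return 0.5f;
}

/* What a plugin's load-time initialization might do. */
void
install_fixed_oscache(void)
{
    OSCache_hook = fixed_oscache;
}
```

    Without a plugin the function returns 0, which matches the intent that
    planning is unchanged when rel_oscache == 0; after
    install_fixed_oscache() it returns whatever the callback reports.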
    
    Looks like the main fear comes from my use of the word ANALYZE...
    
    PS: ANALYZE OSCACHE does *not* run with ANALYZE; those are distinct
    operations. (ANALYZE won't do the job of ANALYZE OSCACHE. We can
    discuss the grammar; maybe ANALYZE ([OSCACHE], [DATA], ...) would be
    better.)
    -- 
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
    
    
  29. Re: [WIP] cache estimates, cache access cost

    Greg Smith <greg@2ndquadrant.com> — 2011-06-19T21:32:42Z

    On 06/19/2011 09:38 AM, Greg Stark wrote:
    > There's another problem which I haven't seen mentioned. Because the
    > access method will affect the cache there's the possibility of
    > feedback loops. e.g. A freshly loaded system prefers sequential scans
    > for a given table because without the cache the seeks of random reads
    > are too expensive...
    
    Not sure if it's been mentioned in this thread yet, but the feedback
    issue has popped up in regard to this area plenty of times.  I think
    everyone who's producing regular input into this is aware of it, even if 
    it's not mentioned regularly.  I'm not too concerned about the specific 
    case you warned about because I don't see how sequential scan vs. index 
    costing will be any different on a fresh system than it is now.  But 
    there are plenty of cases like it to be mapped out here, and many are 
    not solvable--they're just something that needs to be documented as a risk.
    
    -- 
    Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
    PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
    
    
  30. Re: [WIP] cache estimates, cache access cost

    Robert Haas <robertmhaas@gmail.com> — 2011-06-20T00:49:19Z

    On Sun, Jun 19, 2011 at 3:32 PM, Cédric Villemain
    <cedric.villemain.debian@gmail.com> wrote:
    > 2nd, I provided the patches on the last CF, exactly to allow to go to
    > the exciting part: the cost-estimates changes. (after all, we can work
    > on the cost estimate, and if later we find a way to use ALTER TABLE or
    > pg_class_ng, just do it instead of via the ANALYZE magic)
    
    We're talking past each other here, somehow.  The cost-estimating part
    does not require this patch in order to do something useful, but this
    patch, AFAICT, absolutely does require the cost-estimating part to do
    something useful.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company