Thread

cache estimates, cache access cost

Cédric Villemain <cedric.villemain.debian@gmail.com> — 2011-05-15T21:38:57Z

Hello

cache estimation and cache access cost are currently not accounted
explicitly: they have a cost associated with but no constants (other
than effective_cache_size but it has a very limited usage).

Every IO cost is build with a derivation of the seq_page_cost,
random_page_cost and the number of pages. Formulas are used in some
places to make the cost more or less, to take into account caching and
data alignment.

There are:

* estimation of page we will find in the postgresql buffer cache
* estimation of page we will find in the operating system cache buffer cache

and they can be compute for :

* first access
* several access

We currently don't make distinction between both cache areas (there is
more cache areas but we don't care here) and we 'prefer' estimate
several access instead of the first one.

There is also a point related to cost estimation, they are strong: for
example once a sort goes over work_mem, its cost jumped because page
access are accounted.

The current cost estimations are already very good, most of our
queries run well without those famous 'HINT' and the planner provide
the best plan in most cases.

But I believe that now we need more tools to improve even more the
cost estimation.
I would like to propose some ideas, not my ideas in all cases, the
topic is in the air since a long time and probably that everything has
already being said (at least around a beer or a pepsi)

Adding a new GUC "cache_page_cost":
- allows to cost the page access when it is estimated in cache
- allows to cost a sort exceeding work_mem but which should not hit disk
- allows to use random_page_cost for what it should be.
(I was tempted by a GUC "write_page_cost" but I am unsure for this one
at this stage)

Adding 2 columns to pg_class "oscache_percent" and "pgcache_percent"
(or similar names): they allow to store stats about the percentage of
a relation in each cache.
- Usage should be to estimate cost of first access to pages then use
the Mackert and Lohman formula on next access. The later only provide
a way to estimate cost of re-reading.

It is hard to advocate here with real expected performance gain other
than: we will have more options for more precise planner decision and
we may reduce the number of report for bad planning. (it is also in
the todolist to improve cache estimation)

I've already hack a bit the core for that and added the 2 new columns
with hooks to update them. ANALYZE OSCACHE update one of them and a
plugin can be used to provide the estimate (so how it's filled is not
important, most OSes have solutions to estimate it accurately if
someone wonder)
It is as-is for POC, probably not clean enough to go to commit festand
not expected to go there before some consensus are done.
http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache

Hacking costsize is ... dangerous, I would say. Breaking something
which works already so well is easy. Changing only one cost function
is not enough to keep a good balance....
Performance farm should help here ... and the full cycle for 9.2 too.

Comments ?
--
Cédric Villemain 2ndQuadrant
http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support

Re: cache estimates, cache access cost

Greg Smith <greg@2ndquadrant.com> — 2011-05-16T03:52:56Z

Cédric Villemain wrote:
> http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache
>   

This rebases easily to make Cedric's changes move to the end; I just 
pushed a version with that change to 
https://github.com/greg2ndQuadrant/postgres/tree/analyze_cache if anyone 
wants a cleaner one to browse.  I've attached a patch too if that's more 
your thing.

I'd recommend not getting too stuck on the particular hook Cédric has 
added here to compute the cache estimate, which uses mmap and mincore to 
figure it out.  It's possible to compute similar numbers, albeit less 
accurate, using an approach similar to how pg_buffercache inspects 
things.  And I even once wrote a background writer extension that 
collected this sort of data as it was running the LRU scan anyway.  
Discussions of this idea seem to focus on how the "what's in the cache?" 
data is collected, which as far as I'm concerned is the least important 
part.  There are multiple options, some work better than others, and 
there's no reason that can't be swapped out later.  The more important 
question is how to store the data collected and then use it for 
optimizing queries.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us

Re: cache estimates, cache access cost

Robert Haas <robertmhaas@gmail.com> — 2011-05-17T19:14:28Z

On Sun, May 15, 2011 at 11:52 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Cédric Villemain wrote:
>>
>> http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache
>
> This rebases easily to make Cedric's changes move to the end; I just pushed
> a version with that change to
> https://github.com/greg2ndQuadrant/postgres/tree/analyze_cache if anyone
> wants a cleaner one to browse.  I've attached a patch too if that's more
> your thing.

Thank you.  I don't much like sucking in other people's git repos - it
tends to take a lot longer than just opening a patch file, and if I
add the repo as a remote then my git repo ends up bloated.  :-(

> The more important question is how to store the data collected and
> then use it for optimizing queries.

Agreed, but unless I'm missing something, this patch does nothing
about that.  I think the first step needs to be to update all the
formulas that are based on random_page_cost and seq_page_cost to
properly take cache_page_cost into account - and in some cases it may
be a bit debatable what the right mathematics are.

For what it's worth, I don't believe for a minute that an analyze
process that may run only run on a given table every six months has a
chance of producing useful statistics about the likelihood that a
table will be cached.  The buffer cache can turn over completely in
under a minute, and a minute is a lot less than a month.  Now, if we
measured this information periodically for a long period of time and
averaged it, that might be a believable basis for setting an optimizer
parameter.  But I think we should take the approach recently discussed
on performance: allow it to be manually set by the administrator on a
per-relation basis, with some reasonable default (maybe based on the
size of the relation relative to effective_cache_size) if the
administrator doesn't intervene.  I don't want to be excessively
negative about the approach of examining the actual behavior of the
system and using that to guide system behavior - indeed, I think there
are quite a few places where we would do well to incorporate that
approach to a greater degree than we do currently.  But I think that
it's going to take a lot of research, and a lot of work, and a lot of
performance testing, to convince ourselves that we've come up with an
appropriate feedback mechanism that will actually deliver better
performance across a large variety of workloads.  It would be much
better, IMHO, to *first* get a cached_page_cost parameter added, even
if the mechanism by which caching percentages are set is initially
quite crude - that will give us a clear-cut benefit that people can
begin enjoying immediately.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: cache estimates, cache access cost

Cédric Villemain <cedric.villemain.debian@gmail.com> — 2011-05-17T22:11:52Z

2011/5/17 Robert Haas <robertmhaas@gmail.com>:
> On Sun, May 15, 2011 at 11:52 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> Cédric Villemain wrote:
>>>
>>> http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache
>>
>> This rebases easily to make Cedric's changes move to the end; I just pushed
>> a version with that change to
>> https://github.com/greg2ndQuadrant/postgres/tree/analyze_cache if anyone
>> wants a cleaner one to browse.  I've attached a patch too if that's more
>> your thing.
>
> Thank you.  I don't much like sucking in other people's git repos - it
> tends to take a lot longer than just opening a patch file, and if I
> add the repo as a remote then my git repo ends up bloated.  :-(
>
>> The more important question is how to store the data collected and
>> then use it for optimizing queries.
>
> Agreed, but unless I'm missing something, this patch does nothing
> about that.  I think the first step needs to be to update all the
> formulas that are based on random_page_cost and seq_page_cost to
> properly take cache_page_cost into account - and in some cases it may
> be a bit debatable what the right mathematics are.

Yes, I provide the branch only in case someone want to hack the
costsize and to close the problem of getting stats.

>
> For what it's worth, I don't believe for a minute that an analyze
> process that may run only run on a given table every six months has a
> chance of producing useful statistics about the likelihood that a
> table will be cached.  The buffer cache can turn over completely in
> under a minute, and a minute is a lot less than a month.  Now, if we
> measured this information periodically for a long period of time and
> averaged it, that might be a believable basis for setting an optimizer

The point is to get ratio in cache, not the distribution of the data
in cache (pgfincore also allows you to see this information).
I don't see how a stable (a server in production) system can have its
ratio moving up and down so fast without known pattern.
Maybe it is datawarehouse, so data move a lot, then just update your
per-relation stats before starting your queries as suggested in other
threads. Maybe it is just a matter of frequency of stats update or
explicit request like we *use to do* (ANALYZE foo;) to handle those
situations.

> parameter.  But I think we should take the approach recently discussed
> on performance: allow it to be manually set by the administrator on a
> per-relation basis, with some reasonable default (maybe based on the
> size of the relation relative to effective_cache_size) if the
> administrator doesn't intervene.  I don't want to be excessively
> negative about the approach of examining the actual behavior of the
> system and using that to guide system behavior - indeed, I think there
> are quite a few places where we would do well to incorporate that
> approach to a greater degree than we do currently.  But I think that
> it's going to take a lot of research, and a lot of work, and a lot of
> performance testing, to convince ourselves that we've come up with an
> appropriate feedback mechanism that will actually deliver better
> performance across a large variety of workloads.  It would be much
> better, IMHO, to *first* get a cached_page_cost parameter added, even
> if the mechanism by which caching percentages are set is initially
> quite crude - that will give us a clear-cut benefit that people can
> begin enjoying immediately.

The plugin I provided is just to be able to do first analysis on how
the os cache size move. You can either use pgfincore to monitor that
per table or use the patch and monitor columns values for *cache.

I took the Hooks approach because it allows to do what you want :)
You can set up a hook where you set the values you want to see, it
allows for example to fix cold start values, or permanent values set
by DBA or ... do what you want here.

The topic is do we need more parameters to increase the value of our planner ?
1/ cache_page_cost
2/ cache information, arbitrary set or not.

Starting with 1/ is ok for me, I prefer to try both at once if
possible to remove the pain to hack twice costsize.c

Several items are to be discussed after that: formulas to handle
'small' tables, data distribution usage (this one hit an old topic
about auto-partitionning  as we are here), cold state, hot state, ...

PS: there is very good blocker for the pg_class changes : what happens
in a standby ? Maybe it just opens the door on how to unlock that or
find another option to get the information per table but distinct per
server. (or we don't care, at least for a first implementation, like
for other parameters)
-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Re: cache estimates, cache access cost

Robert Haas <robertmhaas@gmail.com> — 2011-05-19T12:15:04Z

On Tue, May 17, 2011 at 6:11 PM, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:
> The point is to get ratio in cache, not the distribution of the data
> in cache (pgfincore also allows you to see this information).
> I don't see how a stable (a server in production) system can have its
> ratio moving up and down so fast without known pattern.

Really?  It doesn't seem that hard to me.  For example, your nightly
reports might use a different set of tables than are active during the
day....

> PS: there is very good blocker for the pg_class changes : what happens
> in a standby ? Maybe it just opens the door on how to unlock that or
> find another option to get the information per table but distinct per
> server. (or we don't care, at least for a first implementation, like
> for other parameters)

That's a good point, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: cache estimates, cache access cost

Cédric Villemain <cedric.villemain.debian@gmail.com> — 2011-05-19T12:19:23Z

2011/5/19 Robert Haas <robertmhaas@gmail.com>:
> On Tue, May 17, 2011 at 6:11 PM, Cédric Villemain
> <cedric.villemain.debian@gmail.com> wrote:
>> The point is to get ratio in cache, not the distribution of the data
>> in cache (pgfincore also allows you to see this information).
>> I don't see how a stable (a server in production) system can have its
>> ratio moving up and down so fast without known pattern.
>
> Really?  It doesn't seem that hard to me.  For example, your nightly
> reports might use a different set of tables than are active during the
> day....

yes, this is known pattern, I believe we can work with it.

>
>> PS: there is very good blocker for the pg_class changes : what happens
>> in a standby ? Maybe it just opens the door on how to unlock that or
>> find another option to get the information per table but distinct per
>> server. (or we don't care, at least for a first implementation, like
>> for other parameters)
>
> That's a good point, too.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>



-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Re: cache estimates, cache access cost

Robert Haas <robertmhaas@gmail.com> — 2011-05-19T12:47:08Z

On Thu, May 19, 2011 at 8:19 AM, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:
> 2011/5/19 Robert Haas <robertmhaas@gmail.com>:
>> On Tue, May 17, 2011 at 6:11 PM, Cédric Villemain
>> <cedric.villemain.debian@gmail.com> wrote:
>>> The point is to get ratio in cache, not the distribution of the data
>>> in cache (pgfincore also allows you to see this information).
>>> I don't see how a stable (a server in production) system can have its
>>> ratio moving up and down so fast without known pattern.
>>
>> Really?  It doesn't seem that hard to me.  For example, your nightly
>> reports might use a different set of tables than are active during the
>> day....
>
> yes, this is known pattern, I believe we can work with it.

I guess the case where I agree that this would be relatively static is
on something like a busy OLTP system.  If different users access
different portions of the main tables, which parts of each relation
are hot might move around, but overall the percentage of that relation
in cache probably won't move around a ton, except perhaps just after
running a one-off reporting query, or when the system is first
starting up.

But that's not everybody's workload.  Imagine a system that is
relatively lightly used.  Every once in a while someone comes along
and runs a big reporting query.  Well, the contents of the buffer
caches are might vary considerably depending on *which* big reporting
queries ran most recently.

Also, even if we knew what was going to be in cache at the start of
the query, the execution of the query might change things greatly as
it runs.  For example, imagine a join between some table and itself.
If we estimate that none of the data is i cache, we will almost
certainly be wrong, because it's likely both sides of the join are
going to access some of the same pages.  Exactly how many depends on
the details of the join condition and whether we choose to implement
it by merging, sorting, or hashing.  But it's likely going to be more
than zero.  This problem can also arise in other contexts - for
example, if a query accesses a bunch of large tables, the tables that
are accessed later in the computation might be less cached than the
ones accessed earlier in the computation, because the earlier accesses
pushed parts of the tables accessed later out of cache.  Or, if a
query requires a large sort, and the value of work_mem is very high
(say 1GB), the sort might evict data from cache.  Now maybe none of
this matters a bit in practice, but it's something to think about.

There was an interesting report on a problem along these lines from
Kevin Grittner a while back.  He found he needed to set seq_page_cost
and random_page_cost differently for the database user that ran the
nightly reports, precisely because the degree of caching was very
different than it was for the daily activity, and he got bad plans
otherwise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: cache estimates, cache access cost

Cédric Villemain <cedric.villemain.debian@gmail.com> — 2011-05-19T22:16:48Z

2011/5/19 Robert Haas <robertmhaas@gmail.com>:
> On Thu, May 19, 2011 at 8:19 AM, Cédric Villemain
> <cedric.villemain.debian@gmail.com> wrote:
>> 2011/5/19 Robert Haas <robertmhaas@gmail.com>:
>>> On Tue, May 17, 2011 at 6:11 PM, Cédric Villemain
>>> <cedric.villemain.debian@gmail.com> wrote:
>>>> The point is to get ratio in cache, not the distribution of the data
>>>> in cache (pgfincore also allows you to see this information).
>>>> I don't see how a stable (a server in production) system can have its
>>>> ratio moving up and down so fast without known pattern.
>>>
>>> Really?  It doesn't seem that hard to me.  For example, your nightly
>>> reports might use a different set of tables than are active during the
>>> day....
>>
>> yes, this is known pattern, I believe we can work with it.
>
> I guess the case where I agree that this would be relatively static is
> on something like a busy OLTP system.  If different users access
> different portions of the main tables, which parts of each relation
> are hot might move around, but overall the percentage of that relation
> in cache probably won't move around a ton, except perhaps just after
> running a one-off reporting query, or when the system is first
> starting up.

yes.

>
> But that's not everybody's workload.  Imagine a system that is
> relatively lightly used.  Every once in a while someone comes along
> and runs a big reporting query.  Well, the contents of the buffer
> caches are might vary considerably depending on *which* big reporting
> queries ran most recently.

Yes, I agree. This scenario is for the case where oscache_percent and
pgcache_percent are subject to change I guess. We can defined 1/ if
the values can/need to be change 2/ when update the values. For 2/ the
database usage may help to trigger an ANALYZE when required. But to be
honest I'd like to hear more of the strategy suggested by Greg here.

Those scenari are good keep in mind to build good indicators for both
the plugin to do the ANALYZE and to solve 2/

>
> Also, even if we knew what was going to be in cache at the start of
> the query, the execution of the query might change things greatly as
> it runs.  For example, imagine a join between some table and itself.
> If we estimate that none of the data is i cache, we will almost
> certainly be wrong, because it's likely both sides of the join are
> going to access some of the same pages.  Exactly how many depends on
> the details of the join condition and whether we choose to implement
> it by merging, sorting, or hashing.  But it's likely going to be more
> than zero.  This problem can also arise in other contexts - for
> example, if a query accesses a bunch of large tables, the tables that
> are accessed later in the computation might be less cached than the
> ones accessed earlier in the computation, because the earlier accesses
> pushed parts of the tables accessed later out of cache.

Yes I believe the Mackert and Lohman formula has been good so far and
I didn't suggest at any moment to remove it.
It will need some rewrite to handle it with the new GUC and new
pg_class columns but the code is already in the place for that.

> Or, if a
> query requires a large sort, and the value of work_mem is very high
> (say 1GB), the sort might evict data from cache.  Now maybe none of
> this matters a bit in practice, but it's something to think about.

Yes I agree again.

>
> There was an interesting report on a problem along these lines from
> Kevin Grittner a while back.  He found he needed to set seq_page_cost
> and random_page_cost differently for the database user that ran the
> nightly reports, precisely because the degree of caching was very
> different than it was for the daily activity, and he got bad plans
> otherwise.

this is in fact a very interesting use case.  I believe the same
strategy can be applied and update cache_page_cost and pg_class.
But I really like if it closes this use case: seq_page_cost,
random_page_cost and cache_page_cost must not need to be changed, they
should be more 'hardware dependent'. What will need to be changed is
in fact the frequency of ANALYZE CACHE in such case (or arbitrary set
values). It should allow the planner and costsize functions to have
accurate values and provide the best plan (again, the cache estimation
coming from the running query remain in the hands of the Mackert and
Lohman).
OK, maybe the user will have to write some ANALYZE CACHE; between some
queries in his scenarios.

Maybe a good scenario to add to the performance farm ? (as others but
this one has the very good value to be a production case)

I'll write those scenarios in a wiki page so it can be used to review
corner cases and possible issues (not now, it is late here).

>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>

-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

[WIP] cache estimates, cache access cost

Cédric Villemain <cedric.villemain.debian@gmail.com> — 2011-06-14T14:29:36Z

2011/5/16 Greg Smith <greg@2ndquadrant.com>:
> Cédric Villemain wrote:
>>
>>
>> http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache
>>
>
> This rebases easily to make Cedric's changes move to the end; I just pushed
> a version with that change to
> https://github.com/greg2ndQuadrant/postgres/tree/analyze_cache if anyone
> wants a cleaner one to browse.  I've attached a patch too if that's more
> your thing.
>
> I'd recommend not getting too stuck on the particular hook Cédric has added
> here to compute the cache estimate, which uses mmap and mincore to figure it
> out.  It's possible to compute similar numbers, albeit less accurate, using
> an approach similar to how pg_buffercache inspects things.  And I even once
> wrote a background writer extension that collected this sort of data as it
> was running the LRU scan anyway.  Discussions of this idea seem to focus on
> how the "what's in the cache?" data is collected, which as far as I'm
> concerned is the least important part.  There are multiple options, some
> work better than others, and there's no reason that can't be swapped out
> later.  The more important question is how to store the data collected and
> then use it for optimizing queries.

Attached are updated patches without the plugin itself. I've also
added the cache_page_cost GUC, this one is not per tablespace, like
others page_cost.

There are 6 patches:

0001-Add-reloscache-column-to-pg_class.patch
0002-Add-a-function-to-update-the-new-pg_class-cols.patch
0003-Add-ANALYZE-OSCACHE-VERBOSE-relation.patch
0004-Add-a-Hook-to-handle-OSCache-stats.patch
0005-Add-reloscache-to-Index-Rel-OptInfo.patch
0006-Add-cache_page_cost-GUC.patch

I have some comments on my own code:

* I am not sure of the best datatype to use for 'reloscache'
* I didn't include the catalog number change in the patch itself.
* oscache_update_relstats() is very similar to vac_update_relstats(),
maybe better to merge them but reloscache should not be updated at the
same time than other stats.
* There is probably too much work done in do_oscache_analyze_rel()
because I kept vac_open_indexes() (not a big drama atm)
* I don't know so much how gram.y works, so I am not sure my changes
cover all cases.
* No tests; similar columns and GUC does not have test either, but it
lacks a test for ANALYZE OSCACHE

-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Re: [WIP] cache estimates, cache access cost

Robert Haas <robertmhaas@gmail.com> — 2011-06-14T15:04:56Z

On Tue, Jun 14, 2011 at 10:29 AM, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:
> 0001-Add-reloscache-column-to-pg_class.patch
> 0002-Add-a-function-to-update-the-new-pg_class-cols.patch
> 0003-Add-ANALYZE-OSCACHE-VERBOSE-relation.patch
> 0004-Add-a-Hook-to-handle-OSCache-stats.patch
> 0005-Add-reloscache-to-Index-Rel-OptInfo.patch
> 0006-Add-cache_page_cost-GUC.patch

It seems to me that posting updated versions of this patch gets us no
closer to addressing the concerns I (and Tom, on other threads)
expressed about this idea previously.  Specifically:

1. ANALYZE happens far too infrequently to believe that any data taken
at ANALYZE time will still be relevant at execution time.
2. Using data gathered by ANALYZE will make plans less stable, and our
users complain not infrequently about the plan instability we already
have, therefore we should not add more.
3. Even if the data were accurate and did not cause plan stability, we
have no evidence that using it will improve real-world performance.

Now, it's possible that you or someone else could provide some
experimental evidence refuting these points.  But right now there
isn't any, and until there is, -1 from me on applying any of this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [WIP] cache estimates, cache access cost

Cédric Villemain <cedric.villemain.debian@gmail.com> — 2011-06-14T16:06:58Z

2011/6/14 Robert Haas <robertmhaas@gmail.com>:
> On Tue, Jun 14, 2011 at 10:29 AM, Cédric Villemain
> <cedric.villemain.debian@gmail.com> wrote:
>> 0001-Add-reloscache-column-to-pg_class.patch
>> 0002-Add-a-function-to-update-the-new-pg_class-cols.patch
>> 0003-Add-ANALYZE-OSCACHE-VERBOSE-relation.patch
>> 0004-Add-a-Hook-to-handle-OSCache-stats.patch
>> 0005-Add-reloscache-to-Index-Rel-OptInfo.patch
>> 0006-Add-cache_page_cost-GUC.patch
>
> It seems to me that posting updated versions of this patch gets us no
> closer to addressing the concerns I (and Tom, on other threads)
> expressed about this idea previously.  Specifically:
>
> 1. ANALYZE happens far too infrequently to believe that any data taken
> at ANALYZE time will still be relevant at execution time.

ANALYZE happens when people execute it, else it is auto-analyze and I
am not providing auto-analyze-oscache.
ANALYZE OSCACHE is just a very simple wrapper to update pg_class. The
frequency is not important here, I believe.

> 2. Using data gathered by ANALYZE will make plans less stable, and our
> users complain not infrequently about the plan instability we already
> have, therefore we should not add more.

Again, it is hard to do a UPDATE pg_class SET reloscache, so I used
ANALYZE logic.
Also I have taken into account the fact that someone may want to SET
the values like it was also suggested, so my patches allow to do :
'this table is 95% in cache, the DBA said' (it is stable, not based on
OS stats).

This case has been suggested several times and is covered by my patch.

> 3. Even if the data were accurate and did not cause plan stability, we
> have no evidence that using it will improve real-world performance.

I have not finish my work on cost estimation and I believe this work
will take some time and can be done in another commitfest. At the
moment my patches do not change anything on the dcision of the
planner, just offers the tools I need to hack cost estimates.

>
> Now, it's possible that you or someone else could provide some
> experimental evidence refuting these points.  But right now there
> isn't any, and until there is, -1 from me on applying any of this.

I was trying to split the patch size by group of features to reduce
its size. The work is in progress.

>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>

-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Re: [WIP] cache estimates, cache access cost

Greg Smith <greg@2ndquadrant.com> — 2011-06-14T17:10:24Z

On 06/14/2011 11:04 AM, Robert Haas wrote:
> Even if the data were accurate and did not cause plan stability, we
> have no evidence that using it will improve real-world performance.
>    

That's the dependency Cédric has provided us a way to finally make 
progress on.  Everyone says there's no evidence that this whole approach 
will improve performance.  But we can't collect such data, to prove or 
disprove it helps, without a proof of concept patch that implements 
*something*.  You may not like the particular way the data is collected 
here, but it's a working implementation that may be useful for some 
people.  I'll take "data collected at ANALYZE time" as a completely 
reasonable way to populate the new structures with realistic enough test 
data to use initially.

Surely at least one other way to populate the statistics, and possibly 
multiple other ways that the user selects, will be needed eventually.  I 
commented a while ago on this thread:  every one of these discussions 
always gets dragged into the details of how the cache statistics data 
will be collected and rejects whatever is suggested as not good enough.  
Until that stops, no progress will ever get made on the higher level 
details.  By its nature, developing toward integrating cached 
percentages is going to lurch forward on both "collecting the cache 
data" and "using the cache knowledge in queries" fronts almost 
independently.  This is not a commit candidate; it's the first useful 
proof of concept step for something we keep talking about but never 
really doing.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us

Re: [WIP] cache estimates, cache access cost

Robert Haas <robertmhaas@gmail.com> — 2011-06-14T17:16:46Z

On Tue, Jun 14, 2011 at 1:10 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 06/14/2011 11:04 AM, Robert Haas wrote:
>> Even if the data were accurate and did not cause plan stability, we
>> have no evidence that using it will improve real-world performance.
>
> That's the dependency Cédric has provided us a way to finally make progress
> on.  Everyone says there's no evidence that this whole approach will improve
> performance.  But we can't collect such data, to prove or disprove it helps,
> without a proof of concept patch that implements *something*.  You may not
> like the particular way the data is collected here, but it's a working
> implementation that may be useful for some people.  I'll take "data
> collected at ANALYZE time" as a completely reasonable way to populate the
> new structures with realistic enough test data to use initially.

But there's no reason that code (which may or may not eventually prove
useful) has to be incorporated into the main tree.  We don't commit
code so people can go benchmark it; we ask for the benchmarking to be
done first, and then if the results are favorable, we commit the code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [WIP] cache estimates, cache access cost

Robert Haas <robertmhaas@gmail.com> — 2011-06-14T17:36:26Z

On Tue, Jun 14, 2011 at 12:06 PM, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:
>> 1. ANALYZE happens far too infrequently to believe that any data taken
>> at ANALYZE time will still be relevant at execution time.
>
> ANALYZE happens when people execute it, else it is auto-analyze and I
> am not providing auto-analyze-oscache.
> ANALYZE OSCACHE is just a very simple wrapper to update pg_class. The
> frequency is not important here, I believe.

Well, I'm not saying you have to have all the answers to post a WIP
patch, certainly.  But in terms of getting something committable, it
seems like we need to have at least an outline of what the long-term
plan is.  If ANALYZE OSCACHE is an infrequent operation, then the data
isn't going to be a reliable guide to what will happen at execution
time...

>> 2. Using data gathered by ANALYZE will make plans less stable, and our
>> users complain not infrequently about the plan instability we already
>> have, therefore we should not add more.

...and if it is a frequent operation then it's going to result in
unstable plans (and maybe pg_class bloat).  There's a fundamental
tension here that I don't think you can just wave your hands at.

> I was trying to split the patch size by group of features to reduce
> its size. The work is in progress.

Totally reasonable, but I can't see committing any of it without some
evidence that there's light at the end of the tunnel.  No performance
tests *whatsoever* have been done.  We can debate the exact amount of
evidence that should be required to prove that something is useful
from a performance perspective, but we at least need some.  I'm
beating on this point because I believe that the whole idea of trying
to feed this information back into the planner is going to turn out to
be something that we don't want to do.  I think it's going to turn out
to have downsides that are far larger than the upsides.  I am
completely willing to be be proven wrong, but right now I think this
will make things worse and you think it will make things better and I
don't see any way to bridge that gap without doing some measurements.

For example, if you run this patch on a system and subject that system
to a relatively even workload, how much do the numbers bounce around
between runs?  What if you vary the workload, so that you blast it
with OLTP traffic at some times and then run reporting queries at
other times?  Or different tables become hot at different times?

Once you've written code to make the planner do something with the
caching % values, then you can start to explore other questions.  Can
you generate plan instability, especially on complex queries, which
are more prone to change quickly based on small changes in the cost
estimates?  Can you demonstrate a workload where bad performance is
inevitable with the current code, but with your code, the system
becomes self-tuning and ends up with good performance?  What happens
if you have a large cold table with a small hot end where all activity
is concentrated?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [WIP] cache estimates, cache access cost

Cédric Villemain <cedric.villemain.debian@gmail.com> — 2011-06-14T19:11:45Z

2011/6/14 Robert Haas <robertmhaas@gmail.com>:
> On Tue, Jun 14, 2011 at 12:06 PM, Cédric Villemain
> <cedric.villemain.debian@gmail.com> wrote:
>>> 1. ANALYZE happens far too infrequently to believe that any data taken
>>> at ANALYZE time will still be relevant at execution time.
>>
>> ANALYZE happens when people execute it, else it is auto-analyze and I
>> am not providing auto-analyze-oscache.
>> ANALYZE OSCACHE is just a very simple wrapper to update pg_class. The
>> frequency is not important here, I believe.
>
> Well, I'm not saying you have to have all the answers to post a WIP
> patch, certainly.  But in terms of getting something committable, it
> seems like we need to have at least an outline of what the long-term
> plan is.  If ANALYZE OSCACHE is an infrequent operation, then the data
> isn't going to be a reliable guide to what will happen at execution
> time...

Ok.

>
>>> 2. Using data gathered by ANALYZE will make plans less stable, and our
>>> users complain not infrequently about the plan instability we already
>>> have, therefore we should not add more.
>
> ...and if it is a frequent operation then it's going to result in
> unstable plans (and maybe pg_class bloat).  There's a fundamental
> tension here that I don't think you can just wave your hands at.

I don't want to hide that point, which is just correct.
The idea is not to have something (which need to be) updated too much
but it needs to be taken into account.

>
>> I was trying to split the patch size by group of features to reduce
>> its size. The work is in progress.
>
> Totally reasonable, but I can't see committing any of it without some
> evidence that there's light at the end of the tunnel.  No performance
> tests *whatsoever* have been done.  We can debate the exact amount of
> evidence that should be required to prove that something is useful
> from a performance perspective, but we at least need some.  I'm
> beating on this point because I believe that the whole idea of trying
> to feed this information back into the planner is going to turn out to
> be something that we don't want to do.  I think it's going to turn out
> to have downsides that are far larger than the upsides.

it is possible, yes.
I try to do changes in a way that if the reloscache values is the one
by default then the planner keep the same behavior than in the past.

> I am
> completely willing to be be proven wrong, but right now I think this
> will make things worse and you think it will make things better and I
> don't see any way to bridge that gap without doing some measurements.

correct.

>
> For example, if you run this patch on a system and subject that system
> to a relatively even workload, how much do the numbers bounce around
> between runs?  What if you vary the workload, so that you blast it
> with OLTP traffic at some times and then run reporting queries at
> other times?  Or different tables become hot at different times?

This is all true, this is *already* true.
Like the thread about random_page_cost vs index_page_cost where the
good option is to change the parameters at certain moment in the day
(IIRC the use case).

I mean that I agree that those benchs need to be done, hopefully I can
fix some usecases, while not breaking others too much or not at all,
or ...

>
> Once you've written code to make the planner do something with the
> caching % values, then you can start to explore other questions.  Can
> you generate plan instability, especially on complex queries, which
> are more prone to change quickly based on small changes in the cost
> estimates?  Can you demonstrate a workload where bad performance is
> inevitable with the current code, but with your code, the system

My next step is cost estimation changes. I have already some very
small usecases where the minimum changes I did so far are interesting
but it is not enought to come with that as evidences.

> becomes self-tuning and ends up with good performance?  What happens
> if you have a large cold table with a small hot end where all activity
> is concentrated?

We are at step 3 here :-) I have already some ideas to handle those
situations but not yet polished.

The current idea is to be conservative, like PostgreSQL used to be, for example:

	/*
	 * disk and cache costs
	 * this assumes an agnostic knowledge of the data repartition and query
	 * usage despite large tables may have a hot part of 10% which is the only
	 * requested part or that we select only (c)old data so the cache useless.
	 * We keep the original strategy to not guess too much and just ponderate
	 * the cost globaly.
	 */
	run_cost += baserel->pages * ( spc_seq_page_cost * (1 - baserel->oscache)
						     + cache_page_cost   * baserel->oscache);


>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>



-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Re: [WIP] cache estimates, cache access cost

Alvaro Herrera <alvherre@commandprompt.com> — 2011-06-14T20:31:32Z

Excerpts from Cédric Villemain's message of mar jun 14 10:29:36 -0400 2011:

> Attached are updated patches without the plugin itself. I've also
> added the cache_page_cost GUC, this one is not per tablespace, like
> others page_cost.
> 
> There are 6 patches:
> 
> 0001-Add-reloscache-column-to-pg_class.patch

Hmm, do you really need this to be a new column?  Would it work to have
it be a reloption?

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: [WIP] cache estimates, cache access cost

Cédric Villemain <cedric.villemain.debian@gmail.com> — 2011-06-14T21:10:20Z

2011/6/14 Alvaro Herrera <alvherre@commandprompt.com>:
> Excerpts from Cédric Villemain's message of mar jun 14 10:29:36 -0400 2011:
>
>> Attached are updated patches without the plugin itself. I've also
>> added the cache_page_cost GUC, this one is not per tablespace, like
>> others page_cost.
>>
>> There are 6 patches:
>>
>> 0001-Add-reloscache-column-to-pg_class.patch
>
> Hmm, do you really need this to be a new column?  Would it work to have
> it be a reloption?

If we can have ALTER TABLE running on heavy workload, why not.
I am bit scared by the effect of such reloption, it focus on HINT
oriented strategy when I would like to allow a dynamic strategy from
the server. This work is not done and may not work, so a reloption is
good at least as a backup  (and is more in the idea suggested by Tom
and others)

>
> --
> Álvaro Herrera <alvherre@commandprompt.com>
> The PostgreSQL Company - Command Prompt, Inc.
> PostgreSQL Replication, Consulting, Custom Development, 24x7 support
>



-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Re: [WIP] cache estimates, cache access cost

Alvaro Herrera <alvherre@commandprompt.com> — 2011-06-14T21:17:25Z

Excerpts from Cédric Villemain's message of mar jun 14 17:10:20 -0400 2011:

> If we can have ALTER TABLE running on heavy workload, why not.
> I am bit scared by the effect of such reloption, it focus on HINT
> oriented strategy when I would like to allow a dynamic strategy from
> the server. This work is not done and may not work, so a reloption is
> good at least as a backup  (and is more in the idea suggested by Tom
> and others)

Hmm, sounds like yet another use case for pg_class_nt.  Why do these
keep popping up?

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: [WIP] cache estimates, cache access cost

Tom Lane <tgl@sss.pgh.pa.us> — 2011-06-14T22:02:25Z

Alvaro Herrera <alvherre@commandprompt.com> writes:
> Excerpts from Cdric Villemain's message of mar jun 14 10:29:36 -0400 2011:
>> 0001-Add-reloscache-column-to-pg_class.patch

> Hmm, do you really need this to be a new column?  Would it work to have
> it be a reloption?

If it's to be updated in the same way as ANALYZE updates reltuples and
relpages (ie, an in-place non-transactional update), I think it'll have
to be a real column.

			regards, tom lane

Re: [WIP] cache estimates, cache access cost

Greg Smith <greg@2ndquadrant.com> — 2011-06-14T22:17:52Z

On 06/14/2011 01:16 PM, Robert Haas wrote:
> But there's no reason that code (which may or may not eventually prove
> useful) has to be incorporated into the main tree.  We don't commit
> code so people can go benchmark it; we ask for the benchmarking to be
> done first, and then if the results are favorable, we commit the code.
>    

Who said anything about this being a commit candidate?  The "WIP" in the 
subject says it's not intended to be.  The community asks people to 
submit design ideas early so that ideas around them can be explored 
publicly.  One of the things that needs to be explored, and that could 
use some community feedback, is exactly how this should be benchmarked 
in the first place.  This topic--planning based on cached 
percentage--keeps coming up, but hasn't gone very far as an abstract 
discussion.  Having a patch to test lets it turn to a concrete one.

Note that I already listed myself as the reviewer  here, so it's not 
even like this is asking explicitly for a community volunteer to help.  
Would you like us to research this privately and then dump a giant patch 
that is commit candidate quality on everyone six months from now, 
without anyone else getting input to the process, or would you like the 
work to happen here?  I recommended Cédric not ever bother soliciting 
ideas early, because I didn't want to get into this sort of debate.  I 
avoid sending anything here unless I already have a strong idea about 
the solution, because it's hard to keep criticism at bay even with 
that.  He was more optimistic about working within the community 
contribution guidelines and decided to send this over early instead.  If 
you feel this is too rough to even discuss, I'll mark it returned with 
feedback and we'll go develop this ourselves.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us

Re: [WIP] cache estimates, cache access cost

Bruce Momjian <bruce@momjian.us> — 2011-06-14T22:45:26Z

Greg Smith wrote:
> On 06/14/2011 01:16 PM, Robert Haas wrote:
> > But there's no reason that code (which may or may not eventually prove
> > useful) has to be incorporated into the main tree.  We don't commit
> > code so people can go benchmark it; we ask for the benchmarking to be
> > done first, and then if the results are favorable, we commit the code.
> >    
> 
> Who said anything about this being a commit candidate?  The "WIP" in the 
> subject says it's not intended to be.  The community asks people to 
> submit design ideas early so that ideas around them can be explored 
> publicly.  One of the things that needs to be explored, and that could 
> use some community feedback, is exactly how this should be benchmarked 
> in the first place.  This topic--planning based on cached 
> percentage--keeps coming up, but hasn't gone very far as an abstract 
> discussion.  Having a patch to test lets it turn to a concrete one.
> 
> Note that I already listed myself as the reviewer  here, so it's not 
> even like this is asking explicitly for a community volunteer to help.  
> Would you like us to research this privately and then dump a giant patch 
> that is commit candidate quality on everyone six months from now, 
> without anyone else getting input to the process, or would you like the 
> work to happen here?  I recommended C?dric not ever bother soliciting 
> ideas early, because I didn't want to get into this sort of debate.  I 
> avoid sending anything here unless I already have a strong idea about 
> the solution, because it's hard to keep criticism at bay even with 
> that.  He was more optimistic about working within the community 
> contribution guidelines and decided to send this over early instead.  If 
> you feel this is too rough to even discuss, I'll mark it returned with 
> feedback and we'll go develop this ourselves.

I would like to see us continue researching in this direction.  I think
perhaps the background writer would be ideal for collecting this
information because it scans the buffer cache already, and frequently.
(Yes, I know it can't access databases.)

I think random_page_cost is a dead-end --- it will never be possible for
it to produce the right value for us.  Its value is tied up in caching,
e.g. the default 4 is not the right value for a physical drive (it
should be much higher), but kernel and shared buffer caching require it
to be a hybrid number that isn't really realistic.  And once we have
caching in that number, it is not going to be even caching for all
tables, obviously.  Hence, there is no way for random_page_cost to be
improved and we have to start thinking about alternatives.

Basically, random_page_cost is a terrible setting and we have to admit
that and move forward.  I realize the concerns about unstable plans, and
we might need to give users the option of stable plans with a fixed
random_page_cost, but at this point we don't even have enough data to
know we need that.  What we do know is that random_page_cost is
inadequate.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: [WIP] cache estimates, cache access cost

Tom Lane <tgl@sss.pgh.pa.us> — 2011-06-14T23:08:09Z

Greg Smith <greg@2ndQuadrant.com> writes:
> On 06/14/2011 01:16 PM, Robert Haas wrote:
>> But there's no reason that code (which may or may not eventually prove
>> useful) has to be incorporated into the main tree.  We don't commit
>> code so people can go benchmark it; we ask for the benchmarking to be
>> done first, and then if the results are favorable, we commit the code.

> Who said anything about this being a commit candidate?  The "WIP" in the 
> subject says it's not intended to be.  The community asks people to 
> submit design ideas early so that ideas around them can be explored 
> publicly.  One of the things that needs to be explored, and that could 
> use some community feedback, is exactly how this should be benchmarked 
> in the first place.  This topic--planning based on cached 
> percentage--keeps coming up, but hasn't gone very far as an abstract 
> discussion.  Having a patch to test lets it turn to a concrete one.

Yeah, it *can't* go very far as an abstract discussion ... we need some
realistic testing to decide whether this is a good idea, and you can't
get that without code.

I think the real underlying issue here is that we have this CommitFest
process that is focused on getting committable or nearly-committable
code into the tree, and it just doesn't fit well for experimental code.
I concur with Robert's desire to not push experimental code into the
main repository, but we need to have *some* way of working with it.
Maybe a separate repo where experimental branches could hang out would
be helpful?

(Another way of phrasing my point is that "WIP" is not conveying the
true status of this patch.  Maybe "Experimental" would be an appropriate
label.)

			regards, tom lane

Re: [WIP] cache estimates, cache access cost

Greg Smith <greg@2ndquadrant.com> — 2011-06-15T00:01:44Z

On 06/14/2011 07:08 PM, Tom Lane wrote:
> I concur with Robert's desire to not push experimental code into the
> main repository, but we need to have *some* way of working with it.
> Maybe a separate repo where experimental branches could hang out would
> be helpful?
>    

Well, this one is sitting around in branches at both git.postgresql.org 
and github so far, both being updated periodically.  Maybe there's some 
value around an official experimental repository too, but I thought that 
was the idea of individual people having their own directories on 
git.postgres.org.  Do we need something fancier than that?  It would be 
nice, but seems little return on investment to improve that, relative to 
what you can do easily enough now.

The idea David Fetter has been advocating of having a "bit rot" farm to 
help detect when the experimental branches drift too far out of date 
tries to make that concept really formal.  I like that idea, too, but 
find it hard to marshal enough resources to do something about it.  The 
current status quo isn't that terrible; noticing bit rot when it's 
relevant isn't that hard to do.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us

Re: [WIP] cache estimates, cache access cost

Robert Haas <robertmhaas@gmail.com> — 2011-06-15T03:43:31Z

On Tue, Jun 14, 2011 at 6:17 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Who said anything about this being a commit candidate?  The "WIP" in the
> subject says it's not intended to be.  The community asks people to submit
> design ideas early so that ideas around them can be explored publicly.  One
> of the things that needs to be explored, and that could use some community
> feedback, is exactly how this should be benchmarked in the first place.
>  This topic--planning based on cached percentage--keeps coming up, but
> hasn't gone very far as an abstract discussion.  Having a patch to test lets
> it turn to a concrete one.
>
> Note that I already listed myself as the reviewer  here, so it's not even
> like this is asking explicitly for a community volunteer to help.  Would you
> like us to research this privately and then dump a giant patch that is
> commit candidate quality on everyone six months from now, without anyone
> else getting input to the process, or would you like the work to happen
> here?  I recommended Cédric not ever bother soliciting ideas early, because
> I didn't want to get into this sort of debate.  I avoid sending anything
> here unless I already have a strong idea about the solution, because it's
> hard to keep criticism at bay even with that.  He was more optimistic about
> working within the community contribution guidelines and decided to send
> this over early instead.  If you feel this is too rough to even discuss,
> I'll mark it returned with feedback and we'll go develop this ourselves.

My usual trope on this subject is that WIP patches tend to elicit
helpful feedback if and only if the patch author is clear about what
sort of feedback they are seeking.  I'm interested in this topic, so,
I'm willing to put some effort into it; but, as I've said before, I
think this patch is coming from the wrong end, so in the absence of
any specific guidance on what sort of input would be useful, that's
the feedback you're getting.  Feel free to clarify what would be more
helpful.  :-)

Incidentally, I have done a bit of math around how to rejigger the
costing formulas to take cached_page_cost and caching_percentage into
account, which I think is the most interesting end place to start this
work.  If it's helpful, I can write it up in a more organized way and
post that; it likely wouldn't be that much work to incorporate it into
what Cedric has here already.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [WIP] cache estimates, cache access cost

Greg Stark <stark@mit.edu> — 2011-06-19T13:38:21Z

On Tue, Jun 14, 2011 at 4:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> 1. ANALYZE happens far too infrequently to believe that any data taken
> at ANALYZE time will still be relevant at execution time.
> 2. Using data gathered by ANALYZE will make plans less stable, and our
> users complain not infrequently about the plan instability we already
> have, therefore we should not add more.
> 3. Even if the data were accurate and did not cause plan stability, we
> have no evidence that using it will improve real-world performance.

I feel like this is all baseless FUD. ANALYZE isn't perfect but it's
our interface for telling postgres to gather stats and we generally
agree that having stats and modelling the system behaviour as
accurately as practical is the right direction so we need a specific
reason why this stat and this bit of modeling is a bad idea before we
dismiss it.

I think the kernel of truth in these concerns is simply that
everything else ANALYZE looks at mutates only on DML. If you load the
same data into two databases and run ANALYZE you'll get (modulo random
sampling) the same stats. And if you never modify it and analyze it
again a week later you'll get the same stats again. So autovacuum can
guess when to run analyze based on the number of DML operations, it
can run it without regard to how busy the system is, and it can hold
off on running it if the data hasn't changed.

In the case of the filesystem buffer cache the cached percentage will
vary over time regardless of whether the data changes. Plain select
queries will change it, even other activity outside the database will
change it. There are a bunch of strategies for mitigating this
problem: we might want to look at the cache situation more frequently,
discount the results we see since more aggressively, and possibly
maintain a kind of running average over time.

There's another problem which I haven't seen mentioned. Because the
access method will affect the cache there's the possibility of
feedback loops. e.g. A freshly loaded system prefers sequential scans
for a given table because without the cache the seeks of random reads
are too expensive... causing it to never load that table into cache...
causing that table to never be cached and never switch to an index
method. It's possible there are mitigation strategies for this as well
such as keeping a running average over time and discounting the
estimates with some heuristic values.

-- 
greg

Re: [WIP] cache estimates, cache access cost

Cédric Villemain <cedric.villemain.debian@gmail.com> — 2011-06-19T14:26:03Z

2011/6/19 Greg Stark <stark@mit.edu>:
> On Tue, Jun 14, 2011 at 4:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> 1. ANALYZE happens far too infrequently to believe that any data taken
>> at ANALYZE time will still be relevant at execution time.
>> 2. Using data gathered by ANALYZE will make plans less stable, and our
>> users complain not infrequently about the plan instability we already
>> have, therefore we should not add more.
>> 3. Even if the data were accurate and did not cause plan stability, we
>> have no evidence that using it will improve real-world performance.
>
> I feel like this is all baseless FUD. ANALYZE isn't perfect but it's
> our interface for telling postgres to gather stats and we generally
> agree that having stats and modelling the system behaviour as
> accurately as practical is the right direction so we need a specific
> reason why this stat and this bit of modeling is a bad idea before we
> dismiss it.
>
> I think the kernel of truth in these concerns is simply that
> everything else ANALYZE looks at mutates only on DML. If you load the
> same data into two databases and run ANALYZE you'll get (modulo random
> sampling) the same stats. And if you never modify it and analyze it
> again a week later you'll get the same stats again. So autovacuum can
> guess when to run analyze based on the number of DML operations, it
> can run it without regard to how busy the system is, and it can hold
> off on running it if the data hasn't changed.
>
> In the case of the filesystem buffer cache the cached percentage will
> vary over time regardless of whether the data changes. Plain select
> queries will change it, even other activity outside the database will
> change it. There are a bunch of strategies for mitigating this
> problem: we might want to look at the cache situation more frequently,
> discount the results we see since more aggressively, and possibly
> maintain a kind of running average over time.

Yes.

>
> There's another problem which I haven't seen mentioned. Because the
> access method will affect the cache there's the possibility of
> feedback loops. e.g. A freshly loaded system prefers sequential scans
> for a given table because without the cache the seeks of random reads
> are too expensive... causing it to never load that table into cache...
> causing that table to never be cached and never switch to an index
> method. It's possible there are mitigation strategies for this as well

Yeah, that's one of the problem to solve. So far I've tried to keep a
planner which behave as currently when the rel_oscache == 0. So that
fresh server will have the same planning than a server without
rel_oscache.

Those points are to be solved in costestimates (and selfunc). For this
case, there is a balance between page filtering cost and index access
cost. *And* once  the table is in cache, the index cost less and can
be better because it need less filtering (less rows, less pages, less
work). there is also a possible issue here (if using the index remove
the table from cache) but I am not too much afraid of that right now.

> such as keeping a running average over time and discounting the
> estimates with some heuristic values.

yes, definitively something to think about. My biggest fear here is
for shared servers (when there is competition between services to use
the OS cache, shooting down kernel cache strategies).

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Re: [WIP] cache estimates, cache access cost

Robert Haas <robertmhaas@gmail.com> — 2011-06-19T18:52:12Z

On Sun, Jun 19, 2011 at 9:38 AM, Greg Stark <stark@mit.edu> wrote:
> On Tue, Jun 14, 2011 at 4:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> 1. ANALYZE happens far too infrequently to believe that any data taken
>> at ANALYZE time will still be relevant at execution time.
>> 2. Using data gathered by ANALYZE will make plans less stable, and our
>> users complain not infrequently about the plan instability we already
>> have, therefore we should not add more.
>> 3. Even if the data were accurate and did not cause plan stability, we
>> have no evidence that using it will improve real-world performance.
>
> I feel like this is all baseless FUD. ANALYZE isn't perfect but it's
> our interface for telling postgres to gather stats and we generally
> agree that having stats and modelling the system behaviour as
> accurately as practical is the right direction so we need a specific
> reason why this stat and this bit of modeling is a bad idea before we
> dismiss it.
>
> I think the kernel of truth in these concerns is simply that
> everything else ANALYZE looks at mutates only on DML. If you load the
> same data into two databases and run ANALYZE you'll get (modulo random
> sampling) the same stats. And if you never modify it and analyze it
> again a week later you'll get the same stats again. So autovacuum can
> guess when to run analyze based on the number of DML operations, it
> can run it without regard to how busy the system is, and it can hold
> off on running it if the data hasn't changed.
>
> In the case of the filesystem buffer cache the cached percentage will
> vary over time regardless of whether the data changes. Plain select
> queries will change it, even other activity outside the database will
> change it. There are a bunch of strategies for mitigating this
> problem: we might want to look at the cache situation more frequently,
> discount the results we see since more aggressively, and possibly
> maintain a kind of running average over time.
>
> There's another problem which I haven't seen mentioned. Because the
> access method will affect the cache there's the possibility of
> feedback loops. e.g. A freshly loaded system prefers sequential scans
> for a given table because without the cache the seeks of random reads
> are too expensive... causing it to never load that table into cache...
> causing that table to never be cached and never switch to an index
> method. It's possible there are mitigation strategies for this as well
> such as keeping a running average over time and discounting the
> estimates with some heuristic values.

*scratches head*

Well, yeah.  I completely agree with you that these are the things we
need to worry about.  Maybe I did a bad job explaining myself, because
ISTM you said my concerns were FUD and then went on to restate them in
different words.

I'm not bent out of shape about using ANALYZE to try to gather the
information.  That's probably a reasonable approach if it turns out we
actually need to do it at all.  I am not sure we do.  What I've argued
for in the past is that we start by estimating the percentage of the
relation that will be cached based on its size relative to
effective_cache_size, and allow the administrator to override the
percentage on a per-relation basis if it turns out to be wrong.  That
would avoid all of these concerns and allow us to focus on the issue
of how the caching percentages impact the choice of plan, and whether
the plans that pop out are in fact better when you provide information
on caching as input.  If we have that facility in core, then people
can write scripts or plug-in modules to do ALTER TABLE .. SET
(caching_percentage = XYZ) every hour or so based on the sorts of
statistics that Cedric is gathering here, and users will be able to
experiment with a variety of algorithms and determine which ones work
the best.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [WIP] cache estimates, cache access cost

Cédric Villemain <cedric.villemain.debian@gmail.com> — 2011-06-19T19:32:13Z

2011/6/19 Robert Haas <robertmhaas@gmail.com>:
> On Sun, Jun 19, 2011 at 9:38 AM, Greg Stark <stark@mit.edu> wrote:
>> On Tue, Jun 14, 2011 at 4:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> 1. ANALYZE happens far too infrequently to believe that any data taken
>>> at ANALYZE time will still be relevant at execution time.
>>> 2. Using data gathered by ANALYZE will make plans less stable, and our
>>> users complain not infrequently about the plan instability we already
>>> have, therefore we should not add more.
>>> 3. Even if the data were accurate and did not cause plan stability, we
>>> have no evidence that using it will improve real-world performance.
>>
>> I feel like this is all baseless FUD. ANALYZE isn't perfect but it's
>> our interface for telling postgres to gather stats and we generally
>> agree that having stats and modelling the system behaviour as
>> accurately as practical is the right direction so we need a specific
>> reason why this stat and this bit of modeling is a bad idea before we
>> dismiss it.
>>
>> I think the kernel of truth in these concerns is simply that
>> everything else ANALYZE looks at mutates only on DML. If you load the
>> same data into two databases and run ANALYZE you'll get (modulo random
>> sampling) the same stats. And if you never modify it and analyze it
>> again a week later you'll get the same stats again. So autovacuum can
>> guess when to run analyze based on the number of DML operations, it
>> can run it without regard to how busy the system is, and it can hold
>> off on running it if the data hasn't changed.
>>
>> In the case of the filesystem buffer cache the cached percentage will
>> vary over time regardless of whether the data changes. Plain select
>> queries will change it, even other activity outside the database will
>> change it. There are a bunch of strategies for mitigating this
>> problem: we might want to look at the cache situation more frequently,
>> discount the results we see since more aggressively, and possibly
>> maintain a kind of running average over time.
>>
>> There's another problem which I haven't seen mentioned. Because the
>> access method will affect the cache there's the possibility of
>> feedback loops. e.g. A freshly loaded system prefers sequential scans
>> for a given table because without the cache the seeks of random reads
>> are too expensive... causing it to never load that table into cache...
>> causing that table to never be cached and never switch to an index
>> method. It's possible there are mitigation strategies for this as well
>> such as keeping a running average over time and discounting the
>> estimates with some heuristic values.
>
> *scratches head*
>
> Well, yeah.  I completely agree with you that these are the things we
> need to worry about.  Maybe I did a bad job explaining myself, because
> ISTM you said my concerns were FUD and then went on to restate them in
> different words.
>
> I'm not bent out of shape about using ANALYZE to try to gather the
> information.  That's probably a reasonable approach if it turns out we
> actually need to do it at all.  I am not sure we do.  What I've argued
> for in the past is that we start by estimating the percentage of the
> relation that will be cached based on its size relative to
> effective_cache_size, and allow the administrator to override the
> percentage on a per-relation basis if it turns out to be wrong.  That
> would avoid all of these concerns and allow us to focus on the issue
> of how the caching percentages impact the choice of plan, and whether
> the plans that pop out are in fact better when you provide information
> on caching as input.  If we have that facility in core, then people
> can write scripts or plug-in modules to do ALTER TABLE .. SET
> (caching_percentage = XYZ) every hour or so based on the sorts of
> statistics that Cedric is gathering here, and users will be able to
> experiment with a variety of algorithms and determine which ones work
> the best.

Robert, I am very surprised.
My patch does offer that.

1st, I used ANALYZE because it is the way to update pg_class I found.
You are suggesting ALTER TABLE instead, that is fine, but give me that
lock-free :) else we have the ahem.. Alvaro's pg_class_ng (I find this
one interesting because it will be lot easier to have different values
on standby server if we find a way to have pg_class_ng 'updatable' per
server)
So, as long as the value can be change without problem, I don't care
where it resides.

2nd, I provided the patches on the last CF, exactly to allow to go to
the exciting part: the cost-estimates changes. (after all, we can work
on the cost estimate, and if later we find a way to use ALTER TABLE or
pg_class_ng, just do it instead of via the ANALYZE magic)

3nd, you can right now write a plugin to set the value of rel_oscache
(exactly like the one you'll do for a ALTER TABLE SET reloscache...)

RelationGetRelationOSCacheInFork(Relation relation, ForkNumber forkNum)
{
       float4 percent = 0;
       /* if a plugin is present, let it manage things */
       if (OSCache_hook)
               percent = (*OSCache_hook) (relation, forkNum);
       return percent;
 }

Looks like the main fear is because I used the ANALYZE word...

PS: ANALYZE OSCACHE does *not* run with ANALYZE, those are distinct
operations. (ANALYZE won't do the job of ANALYZE OSCACHE, we can
discuss the grammar, maybe a ANALYZE ([OSCACHE], [DATA], ...) will be
better ).
-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Re: [WIP] cache estimates, cache access cost

Greg Smith <greg@2ndquadrant.com> — 2011-06-19T21:32:42Z

On 06/19/2011 09:38 AM, Greg Stark wrote:
> There's another problem which I haven't seen mentioned. Because the
> access method will affect the cache there's the possibility of
> feedback loops. e.g. A freshly loaded system prefers sequential scans
> for a given table because without the cache the seeks of random reads
> are too expensive...

Not sure if it's been mentioned in this thread yet, but he feedback 
issue has popped up in regards to this area plenty of times.  I think 
everyone who's producing regular input into this is aware of it, even if 
it's not mentioned regularly.  I'm not too concerned about the specific 
case you warned about because I don't see how sequential scan vs. index 
costing will be any different on a fresh system than it is now.  But 
there are plenty of cases like it to be mapped out here, and many are 
not solvable--they're just something that needs to be documented as a risk.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us

Re: [WIP] cache estimates, cache access cost

Robert Haas <robertmhaas@gmail.com> — 2011-06-20T00:49:19Z

On Sun, Jun 19, 2011 at 3:32 PM, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:
> 2nd, I provided the patches on the last CF, exactly to allow to go to
> the exciting part: the cost-estimates changes. (after all, we can work
> on the cost estimate, and if later we find a way to use ALTER TABLE or
> pg_class_ng, just do it instead of via the ANALYZE magic)

We're talking past each other here, somehow.  The cost-estimating part
does not require this patch in order to something useful, but this
patch, AFAICT, absolutely does require the cost-estimating part to do
something useful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company