Thread

  1. Changing the default random_page_cost value

    Greg Sabino Mullane <htamfids@gmail.com> — 2024-09-27T14:07:14Z

    tl;dr let's assume SSDs are popular and HDDs are the exception and flip our
    default
    
    As I write this email, it's the year 2024. I think it is time we lower our
    "default" setting of random_page_cost (as set in postgresql.conf.sample and
    the docs). Even a decade ago, the current default of 4 was considered
    fairly conservative and often lowered. The git logs shows that this value
    was last touched in 2006, during the age of spinning metal. We are now in a
    new era, the age of SSDs, and thus we should lower this default value to
    reflect the fact that the vast majority of people using Postgres these days
    are doing so on solid state drives. We tend to stay ultra-conservative in
    all of our settings, but we also need to recognize when there has been a
    major shift in the underlying hardware - and calculations that our defaults
    are based on.
    
    Granted, there are other factors involved, and yes, perhaps we should tweak
    some of the similar settings as well, but ranom_page_cost is the one
    setting most out of sync with today's hardware realities. So I'll be brave
    and throw a number out there: 1.2. And change our docs to say wordage like
    "if you are using an older hard disk drive technology, you may want to try
    raising rpc" to replace our fairly-hidden note about SSDs buried in the
    last sentence - of the fourth paragraph - of the rpc docs.
    
    Real data about performance on today's SSDs are welcome, and/or some way to
    generate a more accurate default.
    
    Cheers,
    Greg
    
  2. Re: Changing the default random_page_cost value

    Roberto Mello <roberto.mello@gmail.com> — 2024-09-27T14:26:38Z

    On Fri, Sep 27, 2024 at 8:07 AM Greg Sabino Mullane <htamfids@gmail.com>
    wrote:
    
    > tl;dr let's assume SSDs are popular and HDDs are the exception and flip
    > our default
    >
    
    <snip>
    
    
    > Granted, there are other factors involved, and yes, perhaps we should
    > tweak some of the similar settings as well, but ranom_page_cost is the one
    > setting most out of sync with today's hardware realities. So I'll be brave
    > and throw a number out there: 1.2. And change our docs to say wordage like
    > "if you are using an older hard disk drive technology, you may want to try
    > raising rpc" to replace our fairly-hidden note about SSDs buried in the
    > last sentence - of the fourth paragraph - of the rpc docs.
    >
    
    +1
    
    I suggest a slightly nicer comment in the default conf file, like "For
    spinning hard drives, raise this to at least 3 and test"
    
    Roberto
    
  3. Re: Changing the default random_page_cost value

    Laurenz Albe <laurenz.albe@cybertec.at> — 2024-09-27T15:35:51Z

    On Fri, 2024-09-27 at 10:07 -0400, Greg Sabino Mullane wrote:
    > So I'll be brave and throw a number out there: 1.2.
    
    +1
    
    Laurenz Albe
    
    
    
    
  4. Re: Changing the default random_page_cost value

    Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> — 2024-09-27T16:03:33Z

    Greg Sabino Mullane <htamfids@gmail.com> writes:
    
    > So I'll be brave and throw a number out there: 1.2. And change our
    > docs to say wordage like "if you are using an older hard disk drive
    > technology, you may want to try raising rpc" to replace our
    > fairly-hidden note about SSDs buried in the last sentence - of the
    > fourth paragraph - of the rpc docs.
    
    It might also be worth mentioning cloudy block storage (e.g. AWS' EBS),
    which is typically backed by SSDs, but has extra network latency.
    
    - ilmari
    
    
    
    
  5. Re: Changing the default random_page_cost value

    Greg Sabino Mullane <htamfids@gmail.com> — 2024-09-30T14:05:29Z

    On Fri, Sep 27, 2024 at 12:03 PM Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
    wrote:
    
    > It might also be worth mentioning cloudy block storage (e.g. AWS' EBS),
    > which is typically backed by SSDs, but has extra network latency.
    >
    
    That seems a little too in the weeds for me, but wording suggestions are
    welcome. To get things moving forward, I made a doc patch which changes a
    few things, namely:
    
    * Mentions the distinction between ssd and hdd right up front.
    * Moves the tablespace talk to the very end, as tablespace use is getting
    rarer (again, thanks in part to ssds)
    * Mentions the capability to set per-database and per-role since we mention
    per-tablespace.
    * Removes a lot of the talk of caches and justifications for the 4.0
    setting. While those are interesting, I've been tuning this parameter for
    many years and never really cared about the "90% cache rate". The proof is
    in the pudding: rpc is the canonical "try it and see" parameter. Tweak.
    Test. Repeat.
    
    Cheers,
    Greg
    
  6. Re: Changing the default random_page_cost value

    Bruce Momjian <bruce@momjian.us> — 2024-10-14T21:15:18Z

    On Mon, Sep 30, 2024 at 10:05:29AM -0400, Greg Sabino Mullane wrote:
    > On Fri, Sep 27, 2024 at 12:03 PM Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
    > wrote:
    > 
    >     It might also be worth mentioning cloudy block storage (e.g. AWS' EBS),
    >     which is typically backed by SSDs, but has extra network latency.
    > 
    > 
    > That seems a little too in the weeds for me, but wording suggestions are
    > welcome. To get things moving forward, I made a doc patch which changes a few
    > things, namely:
    > 
    > * Mentions the distinction between ssd and hdd right up front.
    > * Moves the tablespace talk to the very end, as tablespace use is getting rarer
    > (again, thanks in part to ssds)
    > * Mentions the capability to set per-database and per-role since we mention
    > per-tablespace.
    > * Removes a lot of the talk of caches and justifications for the 4.0 setting.
    > While those are interesting, I've been tuning this parameter for many years and
    > never really cared about the "90% cache rate". The proof is in the pudding: rpc
    > is the canonical "try it and see" parameter. Tweak. Test. Repeat.
    
    I am not a fan of this patch.  I don't see why _removing_ the magnetic
    part helps because you then have no logic for any 1.2 was chosen.  I
    would put the magnetic in a separate paragraph perhaps, and recommend
    4.0 for it.  Also, per-tablespace makes sense because of physical media
    differences, but what purpose would per-database and per-role serve? 
    Also, per-tablespace is not a connection-activated item like the other
    two.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        https://momjian.us
      EDB                                      https://enterprisedb.com
    
      When a patient asks the doctor, "Am I going to die?", he means 
      "Am I going to die soon?"
    
    
    
    
  7. Re: Changing the default random_page_cost value

    David Rowley <dgrowleyml@gmail.com> — 2024-10-15T02:20:31Z

    On Tue, 15 Oct 2024 at 10:15, Bruce Momjian <bruce@momjian.us> wrote:
    > I am not a fan of this patch.  I don't see why _removing_ the magnetic
    > part helps because you then have no logic for any 1.2 was chosen.  I
    > would put the magnetic in a separate paragraph perhaps, and recommend
    > 4.0 for it.  Also, per-tablespace makes sense because of physical media
    > differences, but what purpose would per-database and per-role serve?
    > Also, per-tablespace is not a connection-activated item like the other
    > two.
    
    Yeah, I think any effort to change the default value for this setting
    would require some analysis to prove that the newly proposed default
    is a more suitable setting than the current default. I mean, why 1.2
    and not 1.1 or 1.3? Where's the evidence that 1.2 is the best value
    for this?
    
    I don't think just providing evidence that random read times are
    closer to sequential read times on SSDs are closer than they are with
    HDDs is going to be enough. What we want to know is if the planner
    costs become more closely related to the execution time or not. From
    my experience, random_page_cost really only has a loose grasp on
    reality, so you might find that it's hard to prove this with any
    degree of confidence (just have a look at how inconsiderate
    index_pages_fetched() is to other queries running on the database, for
    example).
    
    I suggest first identifying all the locations that use
    random_page_cost then coming up with some test cases that run queries
    that exercise those locations in a way that does things like vary the
    actual selectivity of some value to have the planner switch plans then
    try varying the random_page_cost to show that the switchover point is
    more correct with the new value than it is with the old value. It
    would be nice to have this as a script so that other people could
    easily run it on their hardware to ensure that random_page_cost we
    choose as the new default is representative of the average hardware.
    You'll likely need to do this with varying index sizes. I imagine to
    properly test this so that we'd have any useful degree of confidence
    that the new value is better than the old one would likely require a
    benchmark that runs for several hours. At the upper end, you'd likely
    want the data sizes to exceed the size of RAM. Another dimension that
    the tests should likely explore is varying data locality.
    
    David
    
    
    
    
  8. Re: Changing the default random_page_cost value

    Tom Lane <tgl@sss.pgh.pa.us> — 2024-10-15T02:38:01Z

    David Rowley <dgrowleyml@gmail.com> writes:
    > Yeah, I think any effort to change the default value for this setting
    > would require some analysis to prove that the newly proposed default
    > is a more suitable setting than the current default. I mean, why 1.2
    > and not 1.1 or 1.3? Where's the evidence that 1.2 is the best value
    > for this?
    
    Yeah, that's been my main concern about this proposal too.
    
    I recall that when we settled on 4.0 as a good number for
    spinning-rust drives, it came out of some experimentation that
    I'd done that involved multiple-day-long tests.  I don't recall any
    more details than that sadly, but perhaps trawling the mailing list
    archives would yield useful info.  It looks like the 4.0 value came
    in with b1577a7c7 of 2000-02-15, so late 1999/early 2000 would be the
    time frame to look in.
    
    			regards, tom lane
    
    
    
    
  9. Re: Changing the default random_page_cost value

    Tom Lane <tgl@sss.pgh.pa.us> — 2024-10-15T03:03:59Z

    I wrote:
    > I recall that when we settled on 4.0 as a good number for
    > spinning-rust drives, it came out of some experimentation that
    > I'd done that involved multiple-day-long tests.  I don't recall any
    > more details than that sadly, but perhaps trawling the mailing list
    > archives would yield useful info.  It looks like the 4.0 value came
    > in with b1577a7c7 of 2000-02-15, so late 1999/early 2000 would be the
    > time frame to look in.
    
    I tried asking https://www.postgresql.org/search/ about
    random_page_cost, and got nothing except search engine timeouts :-(.
    However, some digging in my own local archives yielded
    
    https://www.postgresql.org/message-id/flat/25387.948414692%40sss.pgh.pa.us
    
    https://www.postgresql.org/message-id/flat/14601.949786166%40sss.pgh.pa.us
    
    That confirms my recollection of multiple-day test runs, but doesn't
    offer much more useful detail than that :-(.  What I think I did
    though was to create some large tables (much bigger than the RAM on
    the machine I had) and actually measure the runtime of seq scans
    versus full-table index scans, repeating plenty 'o times to try to
    average out the noise.  There was some talk in those threads of
    reducing that to a publishable script, but it was never followed up
    on.
    
    			regards, tom lane
    
    
    
    
  10. Re: Changing the default random_page_cost value

    Greg Sabino Mullane <htamfids@gmail.com> — 2024-10-25T00:01:11Z

    On Mon, Oct 14, 2024 at 5:15 PM Bruce Momjian <bruce@momjian.us> wrote:
    
    > I am not a fan of this patch.  I don't see why _removing_ the magnetic
    > part helps because you then have no logic for any 1.2 was chosen.
    
    
    Okay, but we have no documented logic on why 4.0 was chosen either. :)
    
    I would put the magnetic in a separate paragraph perhaps, and recommend
    > 4.0 for it.
    
    
    Sounds doable. Even in the pre-SSD age I recall lowering this as a fairly
    standard practice, but I'm fine with a recommendation of 4. Partly because
    I doubt anyone will use it much.
    
     Also, per-tablespace makes sense because of physical media
    > differences, but what purpose would per-database and per-role serve?
    > Also, per-tablespace is not a connection-activated item like the other
    > two.
    >
    
    Good point, I withdraw that part.
    
    Cheers,
    Greg
    
  11. Re: Changing the default random_page_cost value

    Greg Sabino Mullane <htamfids@gmail.com> — 2024-10-25T00:13:28Z

    On Mon, Oct 14, 2024 at 10:20 PM David Rowley <dgrowleyml@gmail.com> wrote:
    
    > Yeah, I think any effort to change the default value for this setting
    > would require some analysis to prove that the newly proposed default
    > is a more suitable setting than the current default. I mean, why 1.2 and
    > not 1.1 or 1.3? Where's the evidence that 1.2 is the best value
    > for this?
    >
    
    As I said, I was just throwing that 1.2 number out there. It felt right,
    although perhaps a tad high (which seems right as we keep things very
    conservative). I agree we should make a best effort to have an accurate,
    defendable default. We all know (I hope) that 4.0 is wrong for SSDs.
    
    
    > I don't think just providing evidence that random read times are closer to
    > sequential read times on SSDs are closer than they are with
    > HDDs is going to be enough.
    
    ...
    
    > It would be nice to have this as a script so that other people could
    > easily run it on their hardware to ensure that random_page_cost we
    > choose as the new default is representative of the average hardware.
    
    
    Heh, this is starting to feel like belling the cat (see
    https://fablesofaesop.com/belling-the-cat.html)
    
    Remember this is still just a default, and we should encourage people to
    tweak it themselves based on their own workloads. I just want people to
    start in the right neighborhood. I'll see about working on some more
    research / generating a script, but help from others is more than welcome.
    
    Cheers,
    Greg
    
  12. Re: Changing the default random_page_cost value

    David Rowley <dgrowleyml@gmail.com> — 2024-10-25T01:29:14Z

    On Fri, 25 Oct 2024 at 13:14, Greg Sabino Mullane <htamfids@gmail.com> wrote:
    >
    > On Mon, Oct 14, 2024 at 10:20 PM David Rowley <dgrowleyml@gmail.com> wrote:
    >>
    >> Yeah, I think any effort to change the default value for this setting would require some analysis to prove that the newly proposed default
    >> is a more suitable setting than the current default. I mean, why 1.2 and not 1.1 or 1.3? Where's the evidence that 1.2 is the best value
    >> for this?
    >
    > As I said, I was just throwing that 1.2 number out there. It felt right, although perhaps a tad high (which seems right as we keep things very conservative). I agree we should make a best effort to have an accurate, defendable default. We all know (I hope) that 4.0 is wrong for SSDs.
    
    I don't think we're going to find the correct new value for this
    setting by throwing randomly chosen numbers at each other on an email
    thread. Unfortunately, someone is going to have to do some work to
    figure out what the number should be, and then hopefully someone else
    can verify that work to check that person is correct.
    
    I'm not trying to be smart or funny here, but I just am failing to
    comprehend why you think you offering a number without any information
    about how you selected that number to set as the new default
    random_page_cost would be acceptable.  Are you expecting someone else
    to go and do the work to prove that your selected number is the
    correct one? It's been 4 weeks since your first email and nobody has
    done that yet, so maybe you might need to consider other ways to
    achieve your goal.
    
    >> I don't think just providing evidence that random read times are closer to sequential read times on SSDs are closer than they are with
    >> HDDs is going to be enough.
    >
    > ...
    >>
    >> It would be nice to have this as a script so that other people could easily run it on their hardware to ensure that random_page_cost we
    >> choose as the new default is representative of the average hardware.
    >
    >
    > Heh, this is starting to feel like belling the cat (see https://fablesofaesop.com/belling-the-cat.html)
    
    I don't see the similarity. Changing the default random_page_cost
    requires analysis to find what the new default should be. The
    execution of the actual change in default is dead simple.  With
    belling the cat, it seems like the execution is the hard part and
    nobody is debating the idea itself.
    
    > Remember this is still just a default, and we should encourage people to tweak it themselves based on their own workloads. I just want people to start in the right neighborhood. I'll see about working on some more research / generating a script, but help from others is more than welcome.
    
    You might be mistakenly thinking that the best random_page_cost is an
    exact ratio of how much slower a random seek and read is from a
    sequential read. There are unfortunately many other factors to
    consider. The correct setting is going to be the one where the chosen
    plan uses the scan method that's the fastest and knowing the answer to
    that is going to take some benchmarks on PostgreSQL. Our cost model
    simply just isn't perfect enough for you to assume that I/O is the
    only factor that changes between an Index Scan and a Seq Scan.
    
    I'd say it's not overly difficult to come up with test cases that go
    to prove the value you select is "correct". I've done this before for
    CPU-related costs. I think with I/O the main difference will be that
    your tests should be much larger, and doing that will mean getting the
    results takes much more time. Here's a link to some analysis I did to
    help solve a problem relating to partition-wise aggregates [1]. Maybe
    you can use a similar method to determine random_page_cost.
    
    David
    
    [1] https://www.postgresql.org/message-id/CAKJS1f9UXdk6ZYyqbJnjFO9a9hyHKGW7B%3DZRh-rxy9qxfPA5Gw%40mail.gmail.com
    
    
    
    
  13. Re: Changing the default random_page_cost value

    wenhui qiu <qiuwenhuifx@gmail.com> — 2024-10-25T03:56:57Z

    HI Greg Sabino Mullane
        Another thing is that you simply change the configuration template is
    not effective,
    
    need to modify the DEFAULT_RANDOM_PAGE_COST values
    {
    {"random_page_cost", PGC_USERSET, QUERY_TUNING_COST,
    gettext_noop("Sets the planner's estimate of the cost of a "
    "nonsequentially fetched disk page."),
    NULL,
    GUC_EXPLAIN
    },
    &random_page_cost,
    DEFAULT_RANDOM_PAGE_COST, 0, DBL_MAX,
    NULL, NULL, NULL
    },
    
    src/include/optimizer/cost.h
    /* defaults for costsize.c's Cost parameters */
    /* NB: cost-estimation code should use the variables, not these constants!
    */
    /* If you change these, update backend/utils/misc/postgresql.conf.sample */
    #define DEFAULT_SEQ_PAGE_COST 1.0
    #define DEFAULT_RANDOM_PAGE_COST 4.0
    #define DEFAULT_CPU_TUPLE_COST 0.01
    #define DEFAULT_CPU_INDEX_TUPLE_COST 0.005
    #define DEFAULT_CPU_OPERATOR_COST 0.0025
    #define DEFAULT_PARALLEL_TUPLE_COST 0.1
    #define DEFAULT_PARALLEL_SETUP_COST 1000.0
    
    Thanks
    
    >
    >
    >
    >
    
  14. Re: Changing the default random_page_cost value

    Bruce Momjian <bruce@momjian.us> — 2024-10-31T17:43:56Z

    On Thu, Oct 24, 2024 at 08:01:11PM -0400, Greg Sabino Mullane wrote:
    > On Mon, Oct 14, 2024 at 5:15 PM Bruce Momjian <bruce@momjian.us> wrote:
    > 
    >     I am not a fan of this patch.  I don't see why _removing_ the magnetic
    >     part helps because you then have no logic for any 1.2 was chosen.
    > 
    > 
    > Okay, but we have no documented logic on why 4.0 was chosen either. :)
    
    Uh, we do, and it is in the docs:
    
            Random access to mechanical disk storage is normally much more expensive
            than four times sequential access.  However, a lower default is used
            (4.0) because the majority of random accesses to disk, such as indexed
            reads, are assumed to be in cache.  The default value can be thought of
            as modeling random access as 40 times slower than sequential, while
            expecting 90% of random reads to be cached.
    
    You surely saw this when you created the patch and removed the text.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        https://momjian.us
      EDB                                      https://enterprisedb.com
    
      When a patient asks the doctor, "Am I going to die?", he means 
      "Am I going to die soon?"
    
    
    
    
  15. Re: Changing the default random_page_cost value

    Tom Lane <tgl@sss.pgh.pa.us> — 2024-10-31T17:53:16Z

    Bruce Momjian <bruce@momjian.us> writes:
    > On Thu, Oct 24, 2024 at 08:01:11PM -0400, Greg Sabino Mullane wrote:
    >> Okay, but we have no documented logic on why 4.0 was chosen either. :)
    
    > Uh, we do, and it is in the docs:
    
    >         Random access to mechanical disk storage is normally much more expensive
    >         than four times sequential access.  However, a lower default is used
    >         (4.0) because the majority of random accesses to disk, such as indexed
    >         reads, are assumed to be in cache.  The default value can be thought of
    >         as modeling random access as 40 times slower than sequential, while
    >         expecting 90% of random reads to be cached.
    
    Meh.  Reality is that that is somebody's long-after-the-fact apologia
    for a number that was obtained by experimentation.
    
    			regards, tom lane
    
    
    
    
  16. Re: Changing the default random_page_cost value

    Greg Sabino Mullane <htamfids@gmail.com> — 2024-11-01T13:54:41Z

    On Thu, Oct 31, 2024 at 1:43 PM Bruce Momjian <bruce@momjian.us> wrote:
    
    > >     I am not a fan of this patch.  I don't see why _removing_ the
    > magnetic
    > >     part helps because you then have no logic for any 1.2 was chosen.
    > >
    > > Okay, but we have no documented logic on why 4.0 was chosen either. :)
    >
    > Uh, we do, and it is in the docs:
    >
    >         Random access to mechanical disk storage is normally much more
    > expensive
    >         than four times sequential access.  However, a lower default is
    > used
    >         (4.0) because the majority of random accesses to disk, such as
    > indexed
    >         reads, are assumed to be in cache.  The default value can be
    > thought of
    >         as modeling random access as 40 times slower than sequential, while
    >         expecting 90% of random reads to be cached.
    >
    > You surely saw this when you created the patch and removed the text.
    >
    
    Yes, I did, but there is no documentation to support those numbers. Is it
    really 40 times slower? Which hard disks, in what year? Where did the 90%
    come from, and is that really an accurate average for production systems
    with multiple caching layers? I know Tom ran actual tests many years ago,
    but at the end of the day, all of these numbers will vary wildly depending
    on a host of factors, so we have to make some educated guesses.
    
    I guess my point was that my 1.2 (based on many years of tuning many client
    systems) seems no less imprecise than 4.0 (based on actual tests many years
    ago, with hardware that has changed a lot), for a default value that people
    should tune for their specific system anyway.
    
    Maybe these new tests people are asking for in this thread to determine a
    better default rpc for SSDs can also help us determine a better rpc for HDs
    as well.
    
    Cheers,
    Greg