Thread

  1. Update on my 6.4.2 progress

    Wayne Piekarski <wayne@senet.com.au> — 1999-06-19T07:39:08Z

    Hi,
    
    Just thought I'd drop an email again - I was the one having neverending
    trouble with 6.4.2 jamming with backends waiting, and other types of
    problems.
    
    Although I'm still using 6.4.2, I hope this will still be helpful for
    the developers in case it impacts on things in 6.5.
    
    We installed Tom Lanes shared memory patches, which I emailed about
    earlier, and they helped a bit, but unfortunately, we still get backends
    stuck waiting even today....
    
    The interesting thing is, we went and put in another 128 mb of ram (from
    256 to 384 now) and recompiled the kernel with more semaphores and shared
    memory, and the improvement was incredible! Before, we would get semget
    failures every so often when we had about 50 backends going, causing the
    whole thing to fall over, but now we get
    "fmgr_info: function 111257088: cache lookup failed"
    after 64 backends (which is what we compiled postgres for) which I
    assume isn't so fatal and the whole system keeps running.
    
    For three days after our little upgrade, the whole thing ran smoothly,
    then we ran into the problem of the stuck waiting backends. We thought the
    problem was gone but it was still there. So what would happen is a backend
    would get stuck, cause others to get stuck, and the postgres' would just
    build up until it hit 64, then we'd have to kill them off and would be ok
    again. At least now the number of problems have decreased slightly.
    
    One interesting message we got during this problem was:
    NOTICE:  LockRelease: locktable lookup failed, no lock 
    
    It seems as though the backends are waiting for a lock that got deleted
    accidentally, although I have no idea how the code works so can't offer
    any advice where.
    
    Lately though, the problems are happening with higher frequency, and every
    so often we still get the BTP_CHAIN problems with tables (which I sent
    another email about fixing) so I need to fix this.
    
    
    One thing I was disappointed with was after adding an extra 128 mb of ram,
    I was hoping that this would be used for disk caching, but when performing
    repeated select queries on tables, where I did something like:
    
    select sum(some_value) from some_table;
    
    The result took the same amount of time to run each time, and was not
    cached at all (the table was about 100 mb) and when doing the query, our
    raid controller would just light up which I wanted to avoid. After seeing
    this, I read posts on the hackers list where people were talking about
    fsync'ing the pg_log to note down whether things had been commited or not.
    
    The table I was testing was totally read only, no modifications being
    made, however, another table gets almost continuous changes 24 hours per
    day, more than 1 per second, so would this be causing the machine to
    continuously flush pg_log to disk and cause my read-only tables to still
    not be cached?
    
    I guess my next question is, can i comment out the fsync call? <grin> With
    the disks performing more efficient updates, the whole thing would run
    faster and run less risks of crashing. Currently, the performance can be
    quite bad sometimes when the machine is doing lots of disk activity,
    because even the simplest read only queries block because they aren't
    cached.
    
    Would moving pg_log to a 2nd disk make a difference? Are there other
    important files like pg_log which should go onto separate disks as well? I
    have no problem with multiple disks, but it was only recently that I
    discovered this fsyncing thing on pg_log. Is pg_log more speed and fsync
    critical than the actual data itself? I have two raid controllers, a slow
    and a fast one, and I want to move pg_log to one of them, but not sure
    which one.
    
    
    So in summary, I've learned that if you are having troubles, put in more
    memory, (even if you have some free) and increase your kernels internal
    sizes for semaphores and shared memory values to really large values, even
    when postgres isn't complaining. It makes a difference for some reason
    and everything was a lot happier.
    
    BTP_CHAIN and the backends waiting problem are still occuring, although I
    cannot build a test case for either of them, they are very much problems
    which occur accidentally and at random times.
    
    
    thanks again,
    Wayne
    
    ------------------------------------------------------------------------------
    Wayne Piekarski                               Tel:     (08) 8221 5221
    Research & Development Manager                Fax:     (08) 8221 5220
    SE Network Access Pty Ltd                     Mob:     0407 395 889
    222 Grote Street                              Email:   wayne@senet.com.au
    Adelaide SA 5000                              WWW:     http://www.senet.com.au
    
    
  2. Re: [HACKERS] Update on my 6.4.2 progress

    Hannu Krosing <hannu@trust.ee> — 1999-06-19T10:17:35Z

    Wayne Piekarski wrote:
    > 
    > I guess my next question is, can i comment out the fsync call?
    
    if you ar confident in your os and hardware, you can 
    pass the -F flag to backend and no fsyncs are done.
    
    (add -o '-F' to postmaster startup line)
    
    I think it is in some faq too.
    
    --------------
    Hannu
    
    
  3. Re: [HACKERS] Update on my 6.4.2 progress

    Bruce Momjian <maillist@candle.pha.pa.us> — 1999-06-19T12:03:30Z

    > The interesting thing is, we went and put in another 128 mb of ram (from
    > 256 to 384 now) and recompiled the kernel with more semaphores and shared
    > memory, and the improvement was incredible! Before, we would get semget
    > failures every so often when we had about 50 backends going, causing the
    > whole thing to fall over, but now we get
    > "fmgr_info: function 111257088: cache lookup failed"
    > after 64 backends (which is what we compiled postgres for) which I
    > assume isn't so fatal and the whole system keeps running.
    
    The 6.4.2 code would not allocate all shared memory/semaphores at
    startup, and only fail when you go to a large number of backends.  6.5
    fixes this by allocating it all on startup.
    
    -- 
      Bruce Momjian                        |  http://www.op.net/~candle
      maillist@candle.pha.pa.us            |  (610) 853-3000
      +  If your life is a hard drive,     |  830 Blythe Avenue
      +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
    
    
  4. Re: [HACKERS] Update on my 6.4.2 progress

    Wayne Piekarski <wayne@senet.com.au> — 1999-06-20T12:18:44Z

    > Wayne Piekarski wrote:
    > > 
    > > I guess my next question is, can i comment out the fsync call?
    > 
    > if you ar confident in your os and hardware, you can 
    > pass the -F flag to backend and no fsyncs are done.
    > 
    > (add -o '-F' to postmaster startup line)
    > 
    > I think it is in some faq too.
    
    I already have the -o -F switch in the startup file (which I believe is
    working) but I'm under the impression from what I read that there are two
    fsync's - one you can switch off, and one which is fixed into the code
    and possibly can't be removed?
    
    Regards,
    Wayne
    
    ------------------------------------------------------------------------------
    Wayne Piekarski                               Tel:     (08) 8221 5221
    Research & Development Manager                Fax:     (08) 8221 5220
    SE Network Access Pty Ltd                     Mob:     0407 395 889
    222 Grote Street                              Email:   wayne@senet.com.au
    Adelaide SA 5000                              WWW:     http://www.senet.com.au
    
    
  5. Re: [HACKERS] Update on my 6.4.2 progress

    Wayne Piekarski <wayne@senet.com.au> — 1999-06-21T08:49:21Z

    > > The interesting thing is, we went and put in another 128 mb of ram (from
    > > 256 to 384 now) and recompiled the kernel with more semaphores and shared
    > > memory, and the improvement was incredible! Before, we would get semget
    > > failures every so often when we had about 50 backends going, causing the
    > > whole thing to fall over, but now we get
    > > "fmgr_info: function 111257088: cache lookup failed"
    > > after 64 backends (which is what we compiled postgres for) which I
    > > assume isn't so fatal and the whole system keeps running.
    > 
    > The 6.4.2 code would not allocate all shared memory/semaphores at
    > startup, and only fail when you go to a large number of backends.  6.5
    > fixes this by allocating it all on startup.
    
    Ok, thats cool ... One question though: is the cache lookup failed message
    really bad or is it a cryptic way of saying that the connection is refused
    but everything else is cool? I have no problem with the fact that the
    connection failed, but does it cause corruption or postgres to fall over
    later on? Ie, if you get a semget failure, shortly after the whole thing
    will die, possibly causing data corruption or something. Would these kind
    of errors cause BTP_CHAIN errors, or is that totally unrelated?
    
    As another general question, if I randomly kill postgres backends during
    the middle of transactions, is there a possibility for corruption, or is
    it safe due to the way transactions are commited, etc. I've always been
    very nervous when it comes to killing backends, as I was worried something
    might go wrong, leaving something out of sync.
    
    thanks,
    Wayne
    
    ------------------------------------------------------------------------------
    Wayne Piekarski                               Tel:     (08) 8221 5221
    Research & Development Manager                Fax:     (08) 8221 5220
    SE Network Access Pty Ltd                     Mob:     0407 395 889
    222 Grote Street                              Email:   wayne@senet.com.au
    Adelaide SA 5000                              WWW:     http://www.senet.com.au
    
    
  6. Re: [HACKERS] Update on my 6.4.2 progress

    Wayne Piekarski <wayne@senet.com.au> — 1999-06-21T09:11:42Z

    > > > I guess my next question is, can i comment out the fsync call?
    > > 
    > > if you ar confident in your os and hardware, you can 
    > > pass the -F flag to backend and no fsyncs are done.
    > > 
    > > (add -o '-F' to postmaster startup line)
    > > 
    > > I think it is in some faq too.
    > 
    > I already have the -o -F switch in the startup file (which I believe is
    > working) but I'm under the impression from what I read that there are two
    > fsync's - one you can switch off, and one which is fixed into the code
    > and possibly can't be removed?
    
    Eeeep! When I wrote the above, I was mistaken. My config file did not have
    -o -F, which was why the fsync's were occuring. Sorry for messing you
    around here .... 
    
    What I was concerned about was the lack of caching and thrashing, but I
    guess I can solve that with no fsync.
    
    thanks,
    Wayne
    
    ------------------------------------------------------------------------------
    Wayne Piekarski                               Tel:     (08) 8221 5221
    Research & Development Manager                Fax:     (08) 8221 5220
    SE Network Access Pty Ltd                     Mob:     0407 395 889
    222 Grote Street                              Email:   wayne@senet.com.au
    Adelaide SA 5000                              WWW:     http://www.senet.com.au