Thread

  1. Re: [HACKERS] Update on my 6.4.2 progress

    Tom Lane <tgl@sss.pgh.pa.us> — 1999-06-21T13:43:27Z

    Wayne Piekarski <wayne@senet.com.au> writes:
    >>>> whole thing to fall over, but now we get
    >>>> "fmgr_info: function 111257088: cache lookup failed"
    >>>> after 64 backends (which is what we compiled postgres for) which I
    >>>> assume isn't so fatal and the whole system keeps running.
    
    > ... One question though: is the cache lookup failed message
    > really bad or is it a cryptic way of saying that the connection is refused
    > but everything else is cool?
    
    I'd put it in the "really bad" category, mainly because I don't see the
    cause-and-effect chain.  It is *not* anything to do with connection
    validation, that's for sure.  My guess is that the additional backend
    has connected and is trying to make queries, and that queries are now
    failing for some resource-exhaustion kind of reason.  But I don't know
    why that would tend to show up as an fmgr_info failure before anything
    else.  Do you use user-defined functions especially heavily in this
    database?  For that matter, does the OID reported by fmgr_info actually
    correspond to any row of pg_proc?
    
    > As another general question, if I randomly kill postgres backends during
    > the middle of transactions, is there a possibility for corruption, or is
    > it safe due to the way transactions are commited, etc.
    
    I'd regard it as very risky --- if that backend is in the middle of
    modifying shared memory, you could leave shared memory datastructures
    and/or disk blocks in inconsistent states.  You could probably get away
    with it for a backend that was blocked waiting for a lock.
    
    			regards, tom lane
    
    
  2. Re: [HACKERS] Update on my 6.4.2 progress

    Wayne Piekarski <wayne@senet.com.au> — 1999-07-03T07:51:39Z

    Hi,
    
    > Wayne Piekarski <wayne@senet.com.au> writes:
    > >>>> whole thing to fall over, but now we get
    > >>>> "fmgr_info: function 111257088: cache lookup failed"
    > >>>> after 64 backends (which is what we compiled postgres for) which I
    > >>>> assume isn't so fatal and the whole system keeps running.
    >
    > failing for some resource-exhaustion kind of reason.  But I don't know
    > why that would tend to show up as an fmgr_info failure before anything
    > else.  Do you use user-defined functions especially heavily in this
    > database?  For that matter, does the OID reported by fmgr_info actually
    > correspond to any row of pg_proc?
    
    I had a look, and there is no entry in pg_proc for any oid like the above
    mentioned. One thing that is very interesting is that we use a ton of user
    defined function (in C, plpgsql, and SQL) like you asked and that we
    also had this problem a while back:
    
    At midnight, we have a process called the vacuum manager, which drops the
    indices on a table, vacuum's it, and then recreates the indices. During
    this time, we suspend the processes which could possibly do work, so they
    sit there waiting for this lock file on disk to disappear, then they
    resume their work when the vacuum manager is finished.
    
    The interesting part is, when this one process would resume, it would die
    inside a plpgsql function. It would crash the backend with a message like:
    ExecOpenR: relation == NULL, heap_open failed". I put some extra code to
    find the oid value, but the oid didn't exist in pg_proc. I think somewhere
    internally postgres had stored the oid of an index, and then barfed when
    it tried to use that index later on. 
    
    To avoid backends crashing, we reconnected when the lock file was removed,
    and this fixed the problem up. However, I don't know why this happened at
    all, it was really bizarre. The stranger part was that the query that died
    would always be in a plpgsql function, why is that? My next question is,
    are user defined function bad in general, could they cause locking
    problems, crashing, etc, which might explain some of the massive problems
    I'm having [Still got problems with BTP_CHAIN and backends waiting - 6.4.2]
    
    > > As another general question, if I randomly kill postgres backends during
    > > the middle of transactions, is there a possibility for corruption, or is
    > > it safe due to the way transactions are commited, etc.
    > 
    > I'd regard it as very risky --- if that backend is in the middle of
    > modifying shared memory, you could leave shared memory datastructures
    > and/or disk blocks in inconsistent states.  You could probably get away
    > with it for a backend that was blocked waiting for a lock.
    
    Well, technically when a backend crashes, it kills all the other backends
    as well so this should avoid the shared memory corruption problems right?
    
    ****
    
    Also, I'm still having troubles with this BTP_CHAIN stuff ... I think I've
    worked out how to reproduce it, but not enough to write a script for it.
    
    Basically, if I have lots of writers and readers doing small work and then
    someone comes along with a huge read or write (ie, join against a big
    table and it takes ages) then all of a sudden queries will try to do an
    update and I get the BTP_CHAIN problem.
    
    Apart from reloading the table, is there any way I can fix up the
    BTP_CHAIN problem an easier way? It takes ages to reload a 100 mb table :(
    Vacuum fails with blowawayrelationbuffers = -2 (As re my previous email)
    
    This BTP_CHAIN stuff is really bad, I can't make this stuff work reliably
    and it causes n-million problems for the people who need to use the dbms
    and the table is dead.
    
    ****
    
    
    thanks,
    Wayne
    
    ------------------------------------------------------------------------------
    Wayne Piekarski                               Tel:     (08) 8221 5221
    Research & Development Manager                Fax:     (08) 8221 5220
    SE Network Access Pty Ltd                     Mob:     0407 395 889
    222 Grote Street                              Email:   wayne@senet.com.au
    Adelaide SA 5000                              WWW:     http://www.senet.com.au