Thread

Re: [HACKERS] Update on my 6.4.2 progress

Tom Lane <tgl@sss.pgh.pa.us> — 1999-06-21T13:43:27Z

Wayne Piekarski <wayne@senet.com.au> writes:
>>>> whole thing to fall over, but now we get
>>>> "fmgr_info: function 111257088: cache lookup failed"
>>>> after 64 backends (which is what we compiled postgres for) which I
>>>> assume isn't so fatal and the whole system keeps running.

> ... One question though: is the cache lookup failed message
> really bad or is it a cryptic way of saying that the connection is refused
> but everything else is cool?

I'd put it in the "really bad" category, mainly because I don't see the
cause-and-effect chain.  It is *not* anything to do with connection
validation, that's for sure.  My guess is that the additional backend
has connected and is trying to make queries, and that queries are now
failing for some resource-exhaustion kind of reason.  But I don't know
why that would tend to show up as an fmgr_info failure before anything
else.  Do you use user-defined functions especially heavily in this
database?  For that matter, does the OID reported by fmgr_info actually
correspond to any row of pg_proc?

> As another general question, if I randomly kill postgres backends during
> the middle of transactions, is there a possibility for corruption, or is
> it safe due to the way transactions are commited, etc.

I'd regard it as very risky --- if that backend is in the middle of
modifying shared memory, you could leave shared memory datastructures
and/or disk blocks in inconsistent states.  You could probably get away
with it for a backend that was blocked waiting for a lock.

			regards, tom lane

Re: [HACKERS] Update on my 6.4.2 progress

Wayne Piekarski <wayne@senet.com.au> — 1999-07-03T07:51:39Z

Hi,

> Wayne Piekarski <wayne@senet.com.au> writes:
> >>>> whole thing to fall over, but now we get
> >>>> "fmgr_info: function 111257088: cache lookup failed"
> >>>> after 64 backends (which is what we compiled postgres for) which I
> >>>> assume isn't so fatal and the whole system keeps running.
>
> failing for some resource-exhaustion kind of reason.  But I don't know
> why that would tend to show up as an fmgr_info failure before anything
> else.  Do you use user-defined functions especially heavily in this
> database?  For that matter, does the OID reported by fmgr_info actually
> correspond to any row of pg_proc?

I had a look, and there is no entry in pg_proc for any oid like the above
mentioned. One thing that is very interesting is that we use a ton of user
defined function (in C, plpgsql, and SQL) like you asked and that we
also had this problem a while back:

At midnight, we have a process called the vacuum manager, which drops the
indices on a table, vacuum's it, and then recreates the indices. During
this time, we suspend the processes which could possibly do work, so they
sit there waiting for this lock file on disk to disappear, then they
resume their work when the vacuum manager is finished.

The interesting part is, when this one process would resume, it would die
inside a plpgsql function. It would crash the backend with a message like:
ExecOpenR: relation == NULL, heap_open failed". I put some extra code to
find the oid value, but the oid didn't exist in pg_proc. I think somewhere
internally postgres had stored the oid of an index, and then barfed when
it tried to use that index later on. 

To avoid backends crashing, we reconnected when the lock file was removed,
and this fixed the problem up. However, I don't know why this happened at
all, it was really bizarre. The stranger part was that the query that died
would always be in a plpgsql function, why is that? My next question is,
are user defined function bad in general, could they cause locking
problems, crashing, etc, which might explain some of the massive problems
I'm having [Still got problems with BTP_CHAIN and backends waiting - 6.4.2]

> > As another general question, if I randomly kill postgres backends during
> > the middle of transactions, is there a possibility for corruption, or is
> > it safe due to the way transactions are commited, etc.
> 
> I'd regard it as very risky --- if that backend is in the middle of
> modifying shared memory, you could leave shared memory datastructures
> and/or disk blocks in inconsistent states.  You could probably get away
> with it for a backend that was blocked waiting for a lock.

Well, technically when a backend crashes, it kills all the other backends
as well so this should avoid the shared memory corruption problems right?

****

Also, I'm still having troubles with this BTP_CHAIN stuff ... I think I've
worked out how to reproduce it, but not enough to write a script for it.

Basically, if I have lots of writers and readers doing small work and then
someone comes along with a huge read or write (ie, join against a big
table and it takes ages) then all of a sudden queries will try to do an
update and I get the BTP_CHAIN problem.

Apart from reloading the table, is there any way I can fix up the
BTP_CHAIN problem an easier way? It takes ages to reload a 100 mb table :(
Vacuum fails with blowawayrelationbuffers = -2 (As re my previous email)

This BTP_CHAIN stuff is really bad, I can't make this stuff work reliably
and it causes n-million problems for the people who need to use the dbms
and the table is dead.

****

thanks,
Wayne

------------------------------------------------------------------------------
Wayne Piekarski                               Tel:     (08) 8221 5221
Research & Development Manager                Fax:     (08) 8221 5220
SE Network Access Pty Ltd                     Mob:     0407 395 889
222 Grote Street                              Email:   wayne@senet.com.au
Adelaide SA 5000                              WWW:     http://www.senet.com.au