Thread

OperationalError: FATAL: lock AccessShareLock on object 0/1260/0 is already

daveg <daveg@sonic.net> — 2011-08-12T23:19:37Z

This seems to be bug month for my client. Now there are seeing periods
where all new connections fail immediately with the error:

   FATAL:  lock AccessShareLock on object 0/1260/0 is already held 

This happens on postgresql 8.4.7 on a large (512GB, 32 core) system that has
been up for months. It started happening sporadicly a few days ago. It will
do this for a period of several minutes to an hour and then go back to
normal for hours or days.

One complete failing session out of several hundred around that time:
-----------------
2011-08-09 00:01:04.446  8823  [unknown]  [unknown]  LOG:  connection received: host=op05.xxx port=34067
2011-08-09 00:01:04.446  8823  c77  apps  LOG:  connection authorized: user=apps database=c77
2011-08-09 00:01:04.449  8823  c77  apps  FATAL:  lock AccessShareLock on object 0/1260/0 is already held
------------------
 
What can I do to help track this down?

-dg


--
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.

Re: OperationalError: FATAL: lock AccessShareLock on object 0/1260/0 is already

Robert Haas <robertmhaas@gmail.com> — 2011-08-14T04:16:39Z

On Fri, Aug 12, 2011 at 7:19 PM, daveg <daveg@sonic.net> wrote:
> This seems to be bug month for my client. Now there are seeing periods
> where all new connections fail immediately with the error:
>
>   FATAL:  lock AccessShareLock on object 0/1260/0 is already held
>
> This happens on postgresql 8.4.7 on a large (512GB, 32 core) system that has
> been up for months. It started happening sporadicly a few days ago. It will
> do this for a period of several minutes to an hour and then go back to
> normal for hours or days.
>
> One complete failing session out of several hundred around that time:
> -----------------
> 2011-08-09 00:01:04.446  8823  [unknown]  [unknown]  LOG:  connection received: host=op05.xxx port=34067
> 2011-08-09 00:01:04.446  8823  c77  apps  LOG:  connection authorized: user=apps database=c77
> 2011-08-09 00:01:04.449  8823  c77  apps  FATAL:  lock AccessShareLock on object 0/1260/0 is already held
> ------------------
>
> What can I do to help track this down?

I've seen that error (though not that exact fact pattern) caused by
bad RAM.  It's unclear to me what else could cause it.

In terms of debugging, it seems like it might be sensible to start by
injecting some debugging code that dumps out the contents of the LOCK
and LOCALLOCK structures at the point the error occurs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: OperationalError: FATAL: lock AccessShareLock on object 0/1260/0 is already

daveg <daveg@sonic.net> — 2011-08-14T05:44:44Z

On Sun, Aug 14, 2011 at 12:16:39AM -0400, Robert Haas wrote:
> On Fri, Aug 12, 2011 at 7:19 PM, daveg <daveg@sonic.net> wrote:
> > This seems to be bug month for my client. Now there are seeing periods
> > where all new connections fail immediately with the error:
> >
> >   FATAL:  lock AccessShareLock on object 0/1260/0 is already held
> >
> > This happens on postgresql 8.4.7 on a large (512GB, 32 core) system that has
> > been up for months. It started happening sporadicly a few days ago. It will
> > do this for a period of several minutes to an hour and then go back to
> > normal for hours or days.
> >
> > One complete failing session out of several hundred around that time:
> > -----------------
> > 2011-08-09 00:01:04.446  8823  [unknown]  [unknown]  LOG:  connection received: host=op05.xxx port=34067
> > 2011-08-09 00:01:04.446  8823  c77  apps  LOG:  connection authorized: user=apps database=c77
> > 2011-08-09 00:01:04.449  8823  c77  apps  FATAL:  lock AccessShareLock on object 0/1260/0 is already held
> > ------------------
> >
> > What can I do to help track this down?
> 
> I've seen that error (though not that exact fact pattern) caused by
> bad RAM.  It's unclear to me what else could cause it.

I'll look into that. I think it is only happening on one host, so that might
make sense. On the other hand, these are pretty fancy hosts all ECC and that
so I'd hope they would have squeaked about bad ram.
 
> In terms of debugging, it seems like it might be sensible to start by
> injecting some debugging code that dumps out the contents of the LOCK
> and LOCALLOCK structures at the point the error occurs.

Hmm, we will update to 9.0 next week on these hosts, so I'll try to hold off
on this part at least until then.

-dg
 
-- 
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.

Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

daveg <daveg@sonic.net> — 2011-08-22T07:31:31Z

On Fri, Aug 12, 2011 at 04:19:37PM -0700, daveg wrote:
> 
> This seems to be bug month for my client. Now there are seeing periods
> where all new connections fail immediately with the error:
> 
>    FATAL:  lock AccessShareLock on object 0/1260/0 is already held 
> 
> This happens on postgresql 8.4.7 on a large (512GB, 32 core) system that has
> been up for months. It started happening sporadicly a few days ago. It will
> do this for a period of several minutes to an hour and then go back to
> normal for hours or days.
> 
> One complete failing session out of several hundred around that time:
> -----------------
> 2011-08-09 00:01:04.446  8823  [unknown]  [unknown]  LOG:  connection received: host=op05.xxx port=34067
> 2011-08-09 00:01:04.446  8823  c77  apps  LOG:  connection authorized: user=apps database=c77
> 2011-08-09 00:01:04.449  8823  c77  apps  FATAL:  lock AccessShareLock on object 0/1260/0 is already held
> ------------------

This is to add additional information to the original report:

For a while this was happening on many different databases in one postgresql
8.4.7 instance on a single large host ('U2' 512GB 64cpu) running RH 5.
That has been quiet for several days and the newest batches of errors have
happened on only on a single database 'c23', in a postgresql 9.0.4 instance
on a smaller host ('A', 64GB 8cpu) running SuSE 10.2.

No memory errors or other misbehaviour have been seen on either of these
hosts in recent months.

The original error was:

  lock AccessShareLock on object 0/1260/0 is already held

which is for pg_database. The recent errors are:

  lock AccessShareLock on object 16403/2615/0 is already held

which is for pg_namespace in database c23.

All of the orginal and most of the recent batchs of errors were immediately
after connecting to a database and being authorized, that is, before any
statements were attempted. However, some of the most recent are on the first
"query" statement. That is after logging in and doing things like "set
transaction ... " the first select would hit this error.

It seems to come in clusters, sometimes, which suggests something shared
by multiple processes. For example, here are the times for the errors
on c23 in the afternoon of August 20:

20 07:14:12.722

20 16:05:07.798
20 16:05:07.808

20 16:05:10.519

20 16:07:07.726
20 16:07:08.722
20 16:07:09.734
20 16:07:10.656

20 16:07:25.436

20 16:22:23.983
20 16:22:24.014
20 16:22:24.335
20 16:22:24.409
20 16:22:24.477
20 16:22:24.499
20 16:22:24.516

20 16:30:58.210

20 16:31:15.261
20 16:31:15.296
20 16:31:15.324
20 16:31:15.348

20 18:06:16.515

20 18:06:49.198
20 18:06:49.204

20 18:06:51.444

20 21:03:05.940

So far I've got:

  - affects system tables
  - happens very soon after process startup
  - in 8.4.7 and 9.0.4
  - not likely to be hardware or OS related
  - happens in clusters for period of a few second to many minutes

I'll work on printing the LOCK and LOCALLOCK when it happens, but it's
hard to get downtime to pick up new builds. Any other ideas on getting to
the bottom of this?

Thanks

-dg

-- 
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.

Re: OperationalError: FATAL: lock AccessShareLock on object 0/1260/0 is already

daveg <daveg@sonic.net> — 2011-08-23T07:24:47Z

On Sun, Aug 14, 2011 at 12:16:39AM -0400, Robert Haas wrote:
> On Fri, Aug 12, 2011 at 7:19 PM, daveg <daveg@sonic.net> wrote:
> > This seems to be bug month for my client. Now there are seeing periods
> > where all new connections fail immediately with the error:
> >
> >   FATAL:  lock AccessShareLock on object 0/1260/0 is already held
...
> > What can I do to help track this down?
> 
> I've seen that error (though not that exact fact pattern) caused by
> bad RAM.  It's unclear to me what else could cause it.
> 
> In terms of debugging, it seems like it might be sensible to start by
> injecting some debugging code that dumps out the contents of the LOCK
> and LOCALLOCK structures at the point the error occurs.

I've made up the attached patch to print this, please suggest any additions.
I'll deploy this on a couple of the production hosts that have had the
issue this evening, but there is no telling when or if it will strike next.

-dg

-- 
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.

Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

Robert Haas <robertmhaas@gmail.com> — 2011-08-23T16:15:23Z

On Mon, Aug 22, 2011 at 3:31 AM, daveg <daveg@sonic.net> wrote:
> So far I've got:
>
>  - affects system tables
>  - happens very soon after process startup
>  - in 8.4.7 and 9.0.4
>  - not likely to be hardware or OS related
>  - happens in clusters for period of a few second to many minutes
>
> I'll work on printing the LOCK and LOCALLOCK when it happens, but it's
> hard to get downtime to pick up new builds. Any other ideas on getting to
> the bottom of this?

I've been thinking this one over, and doing a little testing. I'm
still stumped, but I have a few thoughts.  What that error message is
really saying is that the LOCALLOCK bookkeeping doesn't match the
PROCLOCK bookkeeping; it doesn't tell us which one is to blame.

My first thought was that there might be some situation where
LockAcquireExtended() gets an interrupt between the time it does the
LOCALLOCK lookup and the time it acquires the partition lock.  If the
interrupt handler were to acquire (but not releases) a lock in the
meantime, then we'd get confused.  However, I can't see how that's
possible.  I inserted some debugging code to fail an assertion if
CHECK_FOR_INTERRUPTS() gets invoked in between those two points or if
ImmediateInterruptOK is set on entering the function, and the system
still passes regression tests.

My second thought is that perhaps a process is occasionally managing
to exit without fully cleaning up the associated PROCLOCK entry.  At
first glance, it appears that this would explain the observed
symptoms.  A new backend gets the PGPROC belonging to the guy who
didn't clean up after himself, hits the error, and disconnects,
sticking himself right back on to the head of the SHM_QUEUE where the
next connection will inherit the same PGPROC and hit the same problem.
 But it's not clear to me what could cause the system to get into this
state in the first place, or how it would eventually right itself.

It might be worth kludging up your system to add a test to
InitProcess() to verify that all of the myProcLocks SHM_QUEUEs are
either NULL or empty, along the lines of the attached patch (which
assumes that assertions are enabled; otherwise, put in an elog() of
some sort).  Actually, I wonder if we shouldn't move all the
SHMQueueInit() calls for myProcLocks to InitProcGlobal() rather than
doing it over again every time someone calls InitProcess().  Besides
being a waste of cycles, it's probably less robust this way.   If
there somehow are leftovers in one of those queues, the next
successful call to LockReleaseAll() ought to clean up the mess, but of
course there's no chance of that working if we've nuked the queue
pointers.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

daveg <daveg@sonic.net> — 2011-09-07T09:16:34Z

On Tue, Aug 23, 2011 at 12:15:23PM -0400, Robert Haas wrote:
> On Mon, Aug 22, 2011 at 3:31 AM, daveg <daveg@sonic.net> wrote:
> > So far I've got:
> >
> >  - affects system tables
> >  - happens very soon after process startup
> >  - in 8.4.7 and 9.0.4
> >  - not likely to be hardware or OS related
> >  - happens in clusters for period of a few second to many minutes
> >
> > I'll work on printing the LOCK and LOCALLOCK when it happens, but it's
> > hard to get downtime to pick up new builds. Any other ideas on getting to
> > the bottom of this?
> 
> I've been thinking this one over, and doing a little testing. I'm
> still stumped, but I have a few thoughts.  What that error message is
> really saying is that the LOCALLOCK bookkeeping doesn't match the
> PROCLOCK bookkeeping; it doesn't tell us which one is to blame.
... 
> My second thought is that perhaps a process is occasionally managing
> to exit without fully cleaning up the associated PROCLOCK entry.  At
> first glance, it appears that this would explain the observed
> symptoms.  A new backend gets the PGPROC belonging to the guy who
> didn't clean up after himself, hits the error, and disconnects,
> sticking himself right back on to the head of the SHM_QUEUE where the
> next connection will inherit the same PGPROC and hit the same problem.
>  But it's not clear to me what could cause the system to get into this
> state in the first place, or how it would eventually right itself.
> 
> It might be worth kludging up your system to add a test to
> InitProcess() to verify that all of the myProcLocks SHM_QUEUEs are
> either NULL or empty, along the lines of the attached patch (which
> assumes that assertions are enabled; otherwise, put in an elog() of
> some sort).  Actually, I wonder if we shouldn't move all the
> SHMQueueInit() calls for myProcLocks to InitProcGlobal() rather than
> doing it over again every time someone calls InitProcess().  Besides
> being a waste of cycles, it's probably less robust this way.   If
> there somehow are leftovers in one of those queues, the next
> successful call to LockReleaseAll() ought to clean up the mess, but of
> course there's no chance of that working if we've nuked the queue
> pointers.

I did this in the elog flavor as we don't build production images with asserts.
It has been running on all hosts for a few days. Today it hit the extra
checks in initproc.

00:02:32.782  8626  [unknown] [unknown]  LOG:  connection received: host=bk0 port=42700
00:02:32.783  8627  [unknown] [unknown]  LOG:  connection received: host=op2 port=45876
00:02:32.783  8627  d61 apps  FATAL:  Initprocess myProclocks[4] not empty: queue 0x2ae6b4b895f8 (prev 0x2ae6b4a2b558, next 0x2ae6b4a2b558) 
00:02:32.783  8626  d35 postgres  LOG:  connection authorized: user=postgres database=c35
00:02:32.783  21535  LOG:  server process (PID 8627) exited with exit code 1
00:02:32.783  21535  LOG:  terminating any other active server processes
00:02:32.783  8626  c35 postgres  WARNING:  terminating connection because of crash of another server process

The patch that produced this is attached. If you can think of anything I
can add to this to help I'd be happy to do so. Also, can I clean this up
and continue somehow? Maybe clear the queue instead having to have a restart?
Or is there a way to just pause this proc here, maybe mark it not to be used
and exit, or just to sleep forever so I can debug later?

Thanks

-dg

-- 
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.

Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

Robert Haas <robertmhaas@gmail.com> — 2011-09-07T13:46:19Z

On Wed, Sep 7, 2011 at 5:16 AM, daveg <daveg@sonic.net> wrote:
> On Tue, Aug 23, 2011 at 12:15:23PM -0400, Robert Haas wrote:
>> On Mon, Aug 22, 2011 at 3:31 AM, daveg <daveg@sonic.net> wrote:
>> > So far I've got:
>> >
>> >  - affects system tables
>> >  - happens very soon after process startup
>> >  - in 8.4.7 and 9.0.4
>> >  - not likely to be hardware or OS related
>> >  - happens in clusters for period of a few second to many minutes
>> >
>> > I'll work on printing the LOCK and LOCALLOCK when it happens, but it's
>> > hard to get downtime to pick up new builds. Any other ideas on getting to
>> > the bottom of this?
>>
>> I've been thinking this one over, and doing a little testing. I'm
>> still stumped, but I have a few thoughts.  What that error message is
>> really saying is that the LOCALLOCK bookkeeping doesn't match the
>> PROCLOCK bookkeeping; it doesn't tell us which one is to blame.
> ...
>> My second thought is that perhaps a process is occasionally managing
>> to exit without fully cleaning up the associated PROCLOCK entry.  At
>> first glance, it appears that this would explain the observed
>> symptoms.  A new backend gets the PGPROC belonging to the guy who
>> didn't clean up after himself, hits the error, and disconnects,
>> sticking himself right back on to the head of the SHM_QUEUE where the
>> next connection will inherit the same PGPROC and hit the same problem.
>>  But it's not clear to me what could cause the system to get into this
>> state in the first place, or how it would eventually right itself.
>>
>> It might be worth kludging up your system to add a test to
>> InitProcess() to verify that all of the myProcLocks SHM_QUEUEs are
>> either NULL or empty, along the lines of the attached patch (which
>> assumes that assertions are enabled; otherwise, put in an elog() of
>> some sort).  Actually, I wonder if we shouldn't move all the
>> SHMQueueInit() calls for myProcLocks to InitProcGlobal() rather than
>> doing it over again every time someone calls InitProcess().  Besides
>> being a waste of cycles, it's probably less robust this way.   If
>> there somehow are leftovers in one of those queues, the next
>> successful call to LockReleaseAll() ought to clean up the mess, but of
>> course there's no chance of that working if we've nuked the queue
>> pointers.
>
> I did this in the elog flavor as we don't build production images with asserts.
> It has been running on all hosts for a few days. Today it hit the extra
> checks in initproc.
>
> 00:02:32.782  8626  [unknown] [unknown]  LOG:  connection received: host=bk0 port=42700
> 00:02:32.783  8627  [unknown] [unknown]  LOG:  connection received: host=op2 port=45876
> 00:02:32.783  8627  d61 apps  FATAL:  Initprocess myProclocks[4] not empty: queue 0x2ae6b4b895f8 (prev 0x2ae6b4a2b558, next 0x2ae6b4a2b558)
> 00:02:32.783  8626  d35 postgres  LOG:  connection authorized: user=postgres database=c35
> 00:02:32.783  21535  LOG:  server process (PID 8627) exited with exit code 1
> 00:02:32.783  21535  LOG:  terminating any other active server processes
> 00:02:32.783  8626  c35 postgres  WARNING:  terminating connection because of crash of another server process
>
> The patch that produced this is attached. If you can think of anything I
> can add to this to help I'd be happy to do so. Also, can I clean this up
> and continue somehow? Maybe clear the queue instead having to have a restart?
> Or is there a way to just pause this proc here, maybe mark it not to be used
> and exit, or just to sleep forever so I can debug later?

I think what we really need to know here is: when the debugging code
you injected here fires, what happened to the previous owner of that
PGPROC that caused it to not clean up its state properly?  The PGPROC
that 8627 inherited in the above example is obviously messed up - but
what did the last guy do that messed it up?  It would be nice to log
the address of the PGPROC in every log message here - then you could
go back and look to see whether the last guy terminated in some
unusual way.  In the meantime, could you could look back a couple of
minutes from the time of the above occurrence and see if there are any
FATAL errors, or usual ERRORs?

After spending some time staring at the code, I do have one idea as to
what might be going on here.  When a backend is terminated,
ShutdownPostgres() calls AbortOutOfAnyTransaction() and then
LockReleaseAll(USER_LOCKMETHOD, true).  The second call releases all
user locks, and the first one (or so we suppose) releases all normal
locks as part of aborting the transaction.  But what if there's no
transaction in progress?  In that case, AbortOutOfAnyTransaction()
won't do anything - which is fine as far as transaction-level locks
go, because we shouldn't be holding any of those anyway if there's no
transaction in progress.  However, if we hold a session-level lock at
that point, then we'd orphan it.  We don't make much use of session
locks.  As far as I can see, they are used by (1) VACUUM, (2) CREATE
INDEX CONCURRENTLY, (3) ALTER DATABASE .. SET TABLESPACE, and (4) on
standby servers, redo of DROP DATABASE actions.  Any chance one of
those died or was killed off around the time this happened?

One thing that might be worth doing in the interest of debugging is
adding code just like what you've already added to ShutdownPostgres(),
*after* calling AbortOutOfAnyTransaction() and LockReleaseAll().  That
might enable us to trap the problem at the time it happens, rather
than waiting until the next backend gets that PGPROC.  Of course that
still might not answer the question of how it got like that, but at
least we'd be getting closer to the time of the actual event.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-07T14:20:24Z

Robert Haas <robertmhaas@gmail.com> writes:
> After spending some time staring at the code, I do have one idea as to
> what might be going on here.  When a backend is terminated,
> ShutdownPostgres() calls AbortOutOfAnyTransaction() and then
> LockReleaseAll(USER_LOCKMETHOD, true).  The second call releases all
> user locks, and the first one (or so we suppose) releases all normal
> locks as part of aborting the transaction.  But what if there's no
> transaction in progress?  In that case, AbortOutOfAnyTransaction()
> won't do anything - which is fine as far as transaction-level locks
> go, because we shouldn't be holding any of those anyway if there's no
> transaction in progress.  However, if we hold a session-level lock at
> that point, then we'd orphan it.  We don't make much use of session
> locks.  As far as I can see, they are used by (1) VACUUM, (2) CREATE
> INDEX CONCURRENTLY, (3) ALTER DATABASE .. SET TABLESPACE, and (4) on
> standby servers, redo of DROP DATABASE actions.  Any chance one of
> those died or was killed off around the time this happened?

I don't believe this theory at all, because if that were the issue,
we'd have heard about it long since.  The correct theory has to involve
a very-little-used triggering condition.  At the moment I'm wondering
about advisory (userspace) locks ... Dave, do your apps use any of those?

			regards, tom lane

Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

daveg <daveg@sonic.net> — 2011-09-07T20:22:00Z

On Wed, Sep 07, 2011 at 10:20:24AM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > After spending some time staring at the code, I do have one idea as to
> > what might be going on here.  When a backend is terminated,
> > ShutdownPostgres() calls AbortOutOfAnyTransaction() and then
> > LockReleaseAll(USER_LOCKMETHOD, true).  The second call releases all
> > user locks, and the first one (or so we suppose) releases all normal
> > locks as part of aborting the transaction.  But what if there's no
> > transaction in progress?  In that case, AbortOutOfAnyTransaction()
> > won't do anything - which is fine as far as transaction-level locks
> > go, because we shouldn't be holding any of those anyway if there's no
> > transaction in progress.  However, if we hold a session-level lock at
> > that point, then we'd orphan it.  We don't make much use of session
> > locks.  As far as I can see, they are used by (1) VACUUM, (2) CREATE
> > INDEX CONCURRENTLY, (3) ALTER DATABASE .. SET TABLESPACE, and (4) on
> > standby servers, redo of DROP DATABASE actions.  Any chance one of
> > those died or was killed off around the time this happened?
> 
> I don't believe this theory at all, because if that were the issue,
> we'd have heard about it long since.  The correct theory has to involve
> a very-little-used triggering condition.  At the moment I'm wondering
> about advisory (userspace) locks ... Dave, do your apps use any of those?

Yes, we make extensive use of advisory locks. That was my thought too when
Robert mentioned session level locks.

I'm happy to add any additional instrumentation, but my client would be
happier to actually run it if there was a way to recover from this without
an unplanned outage. Is there something I can do when the patch detects the
problem to be able to continue without a restart? Is is save to just reset
the proclock queue? I don't think they would mind leaking locks, for instance,
and a later planned restart to clear it up as much as they mind unscheduled
downtime.

Thank

-dg

-- 
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.

Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

Robert Haas <robertmhaas@gmail.com> — 2011-09-07T20:28:41Z

On Wed, Sep 7, 2011 at 4:22 PM, daveg <daveg@sonic.net> wrote:
> Yes, we make extensive use of advisory locks. That was my thought too when
> Robert mentioned session level locks.
>
> I'm happy to add any additional instrumentation, but my client would be
> happier to actually run it if there was a way to recover from this without
> an unplanned outage. Is there something I can do when the patch detects the
> problem to be able to continue without a restart?

Well, basically, you want to do the same thing that LockRelease()
would do - remove the PROCLOCK from the SHM_QUEUE and delete it from
the hash table, adjusting the counts in the LOCK object as
appropriate.  If you just ignore the failure, you'll get the "blah
blah blah is already held" messages you were having before.

Tom's right to be skeptical of my theory, because it would require a
CHECK_FOR_INTERRUPTS() outside of a transaction block in one of the
pathways that use session-level locks, and I can't find one.

OTOH, I'm skeptical of the theory that this involves userlocks,
because this whole thread started because of a complaint about lock
0/1260/0 already being held.  That ain't no userlock.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-07T20:55:24Z

Robert Haas <robertmhaas@gmail.com> writes:
> Tom's right to be skeptical of my theory, because it would require a
> CHECK_FOR_INTERRUPTS() outside of a transaction block in one of the
> pathways that use session-level locks, and I can't find one.

More to the point: session-level locks are released on error.  The only
way to get out of a transaction while still holding one is for the
VACUUM-or-whichever-command code to deliberately commit and exit while
still holding it.  An error exit path would release the lock.

> OTOH, I'm skeptical of the theory that this involves userlocks,
> because this whole thread started because of a complaint about lock
> 0/1260/0 already being held.  That ain't no userlock.

Yeah, and for that matter it seems to let VACUUM off the hook too.
If we assume that the reported object ID is non-corrupt (and if it's
always the same, that seems like the way to bet) then this is a lock
on pg_authid.

Hmmm ... could the pathway involve an error exit from client
authentication?  We're still finding bugs in the 9.0 rewrite of
auth-time database access.

			regards, tom lane

Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

Robert Haas <robertmhaas@gmail.com> — 2011-09-07T22:17:30Z

On Wed, Sep 7, 2011 at 4:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, and for that matter it seems to let VACUUM off the hook too.
> If we assume that the reported object ID is non-corrupt (and if it's
> always the same, that seems like the way to bet) then this is a lock
> on pg_authid.
>
> Hmmm ... could the pathway involve an error exit from client
> authentication?  We're still finding bugs in the 9.0 rewrite of
> auth-time database access.

Well, according to Dave's report upthread, it's not only this one relation:

DG> The recent errors are:
DG> lock AccessShareLock on object 16403/2615/0 is already held
DG> which is for pg_namespace in database c23.

I thought about an error exit from client authentication, and that's a
somewhat appealing explanation, but I can't quite see why we wouldn't
clean up there the same as anywhere else.  The whole mechanism feels a
bit rickety to me - we don't actually release locks; we just abort the
transaction and *assume* that will cause locks to get released.  And
it should.  But there's a bit more action-at-a-distance there than I'd
ideally like.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

daveg <daveg@sonic.net> — 2011-09-07T22:23:26Z

On Wed, Sep 07, 2011 at 04:55:24PM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > Tom's right to be skeptical of my theory, because it would require a
> > CHECK_FOR_INTERRUPTS() outside of a transaction block in one of the
> > pathways that use session-level locks, and I can't find one.
> 
> More to the point: session-level locks are released on error.  The only
> way to get out of a transaction while still holding one is for the
> VACUUM-or-whichever-command code to deliberately commit and exit while
> still holding it.  An error exit path would release the lock.
> 
> > OTOH, I'm skeptical of the theory that this involves userlocks,
> > because this whole thread started because of a complaint about lock
> > 0/1260/0 already being held.  That ain't no userlock.
> 
> Yeah, and for that matter it seems to let VACUUM off the hook too.
> If we assume that the reported object ID is non-corrupt (and if it's
> always the same, that seems like the way to bet) then this is a lock
> on pg_authid.
> 
> Hmmm ... could the pathway involve an error exit from client
> authentication?  We're still finding bugs in the 9.0 rewrite of
> auth-time database access.

It does not seem restricted to pg_authid:

2011-08-24 18:35:57.445 24987  c23  apps  ERROR:  lock AccessShareLock on object 16403/2615/0 

And I think I've seen it on other tables too.

-dg

-- 
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.

Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-07T22:25:23Z

Robert Haas <robertmhaas@gmail.com> writes:
> I thought about an error exit from client authentication, and that's a
> somewhat appealing explanation, but I can't quite see why we wouldn't
> clean up there the same as anywhere else.  The whole mechanism feels a
> bit rickety to me - we don't actually release locks; we just abort the
> transaction and *assume* that will cause locks to get released.

Well, transaction abort will call LockReleaseAll, which is carefully
coded to clean up the proclock lists regardless of what is in the
locallocks table, so I'm not sure why you find that any more rickety
than anything else.  But maybe it'd be interesting for Dave to stick a
LockReleaseAll call into ProcKill() and see if that makes things better.
(Dave: test that before you put it in production, I'm not totally sure
it's safe.)

			regards, tom lane

Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-07T22:35:08Z

daveg <daveg@sonic.net> writes:
> It does not seem restricted to pg_authid:
> 2011-08-24 18:35:57.445 24987  c23  apps  ERROR:  lock AccessShareLock on object 16403/2615/0 
> And I think I've seen it on other tables too.

Hmm.  2615 = pg_namespace, which most likely is the first catalog
accessed by just about any SQL command that's going to access tables at
all, so I suspect that this is mostly just a "the first access failed"
thing and not something peculiar to pg_namespace.  But we still don't
have a clue why the locks are not getting released by the previous
owner of the PGPROC slot.  Have you trawled your logs to see if there's
any sign of any distress at all, shortly before the problem starts to
happen?

			regards, tom lane

Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

daveg <daveg@sonic.net> — 2011-09-07T23:01:10Z

On Wed, Sep 07, 2011 at 06:35:08PM -0400, Tom Lane wrote:
> daveg <daveg@sonic.net> writes:
> > It does not seem restricted to pg_authid:
> > 2011-08-24 18:35:57.445 24987  c23  apps  ERROR:  lock AccessShareLock on object 16403/2615/0 
> > And I think I've seen it on other tables too.
> 
> Hmm.  2615 = pg_namespace, which most likely is the first catalog
> accessed by just about any SQL command that's going to access tables at
> all, so I suspect that this is mostly just a "the first access failed"
> thing and not something peculiar to pg_namespace.  But we still don't
> have a clue why the locks are not getting released by the previous
> owner of the PGPROC slot.  Have you trawled your logs to see if there's
> any sign of any distress at all, shortly before the problem starts to
> happen?

Will do, but its a pretty big haystack. Sure wish I knew what the needle
looked like. ;-)

-dg
 
-- 
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.

Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

daveg <daveg@sonic.net> — 2011-09-07T23:05:21Z

On Wed, Sep 07, 2011 at 06:25:23PM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > I thought about an error exit from client authentication, and that's a
> > somewhat appealing explanation, but I can't quite see why we wouldn't
> > clean up there the same as anywhere else.  The whole mechanism feels a
> > bit rickety to me - we don't actually release locks; we just abort the
> > transaction and *assume* that will cause locks to get released.
> 
> Well, transaction abort will call LockReleaseAll, which is carefully
> coded to clean up the proclock lists regardless of what is in the
> locallocks table, so I'm not sure why you find that any more rickety
> than anything else.  But maybe it'd be interesting for Dave to stick a
> LockReleaseAll call into ProcKill() and see if that makes things better.
> (Dave: test that before you put it in production, I'm not totally sure
> it's safe.)

Re safety, what is the worst case here? 

Also, this is very intermittant, we have seen it only in recent months
on both 8.4.7 and 9.0.4 after years of no problems. Lately we see it what
feels like a few times a month. Possibly some new application behaviour
is provoking it, but I have no guesses as to what.

-dg

-- 
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.

Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-07T23:16:12Z

daveg <daveg@sonic.net> writes:
> On Wed, Sep 07, 2011 at 06:25:23PM -0400, Tom Lane wrote:
>> ...  But maybe it'd be interesting for Dave to stick a
>> LockReleaseAll call into ProcKill() and see if that makes things better.
>> (Dave: test that before you put it in production, I'm not totally sure
>> it's safe.)

> Re safety, what is the worst case here? 

I think a failure would be pretty obvious --- if it gets through
regression tests it's probably fine.

			regards, tom lane

Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-07T23:39:15Z

daveg <daveg@sonic.net> writes:
> Also, this is very intermittant, we have seen it only in recent months
> on both 8.4.7 and 9.0.4 after years of no problems. Lately we see it what
> feels like a few times a month. Possibly some new application behaviour
> is provoking it, but I have no guesses as to what.

BTW ... what were the last versions you were running on which you had
*not* seen the problem?  (Just wondering about the possibility that we
back-patched some "fix" that broke things.  It would be good to have
a version range before trawling the commit logs.)

			regards, tom lane

Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

daveg <daveg@sonic.net> — 2011-09-08T00:55:59Z

On Wed, Sep 07, 2011 at 07:39:15PM -0400, Tom Lane wrote:
> daveg <daveg@sonic.net> writes:
> > Also, this is very intermittant, we have seen it only in recent months
> > on both 8.4.7 and 9.0.4 after years of no problems. Lately we see it what
> > feels like a few times a month. Possibly some new application behaviour
> > is provoking it, but I have no guesses as to what.
> 
> BTW ... what were the last versions you were running on which you had
> *not* seen the problem?  (Just wondering about the possibility that we
> back-patched some "fix" that broke things.  It would be good to have
> a version range before trawling the commit logs.)

The first version we saw it on was 8.4.7.

-dg
 
-- 
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.

Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-08T01:02:04Z

daveg <daveg@sonic.net> writes:
> On Wed, Sep 07, 2011 at 07:39:15PM -0400, Tom Lane wrote:
>> BTW ... what were the last versions you were running on which you had
>> *not* seen the problem?  (Just wondering about the possibility that we
>> back-patched some "fix" that broke things.  It would be good to have
>> a version range before trawling the commit logs.)

> The first version we saw it on was 8.4.7.

Yeah, you said that.  I was wondering what you'd last run before 8.4.7.

			regards, tom lane

Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

Robert Haas <robertmhaas@gmail.com> — 2011-09-08T01:22:40Z

On Wed, Sep 7, 2011 at 6:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I thought about an error exit from client authentication, and that's a
>> somewhat appealing explanation, but I can't quite see why we wouldn't
>> clean up there the same as anywhere else.  The whole mechanism feels a
>> bit rickety to me - we don't actually release locks; we just abort the
>> transaction and *assume* that will cause locks to get released.
>
> Well, transaction abort will call LockReleaseAll, which is carefully
> coded to clean up the proclock lists regardless of what is in the
> locallocks table, so I'm not sure why you find that any more rickety
> than anything else.

Well, it's very clear that you CAN orphan locks if a backend holding a
session lock ever does CHECK_FOR_INTERRUPTS() outside of a
transaction.  Try the attached patch.

rhaas=# vacuum full foo;
FATAL:  terminating connection due to administrator command
FATAL:  terminating connection due to administrator command
The connection to the server was lost. Attempting reset: Succeeded.
rhaas=# vacuum full foo;
ERROR:  lock AccessExclusiveLock on object 16384/1431013/0 is already held

Now, I don't see any evidence of a live bug here (and on further
thought it can't be Dave's bug because he is orphaning
AccessShareLocks, not AccessExclusiveLocks), but I find this pretty
convincing as a demonstration of ricketiness.  It is certainly not
obvious on its face that a misplaced CHECK_FOR_INTERRUPTS() can result
in a backend exiting without cleaning up its locks, and I'd argue its
a bad idea to leave it that way even if there's no user-visible
problem there today.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

daveg <daveg@sonic.net> — 2011-09-08T06:35:29Z

On Wed, Sep 07, 2011 at 09:02:04PM -0400, Tom Lane wrote:
> daveg <daveg@sonic.net> writes:
> > On Wed, Sep 07, 2011 at 07:39:15PM -0400, Tom Lane wrote:
> >> BTW ... what were the last versions you were running on which you had
> >> *not* seen the problem?  (Just wondering about the possibility that we
> >> back-patched some "fix" that broke things.  It would be good to have
> >> a version range before trawling the commit logs.)
> 
> > The first version we saw it on was 8.4.7.
> 
> Yeah, you said that.  I was wondering what you'd last run before 8.4.7.

Sorry, misunderstood. We were previously running 8.4.4, but have been on 8.4.7
since shortly after it was released. Prior to that we have had all the major
and most of the minor releases since 7.1.
 
-dg

-- 
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.

Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-08T15:31:51Z

daveg <daveg@sonic.net> writes:
> On Wed, Sep 07, 2011 at 09:02:04PM -0400, Tom Lane wrote:
>> daveg <daveg@sonic.net> writes:
>>> The first version we saw it on was 8.4.7.

>> Yeah, you said that.  I was wondering what you'd last run before 8.4.7.

> Sorry, misunderstood. We were previously running 8.4.4, but have been on 8.4.7
> since shortly after it was released. Prior to that we have had all the major
> and most of the minor releases since 7.1.

Well, I groveled through the commit logs from 8.4.4 to 8.4.7, and
I can't find anything that looks like it could possibly be related.
So at this point I'm inclined to think that the bug is older than
that, but your usage patterns changed so that you started to tickle it.

Can you think of any changes in your usage that might date to around
that time, and would somehow be connected to backend startup/shutdown?

			regards, tom lane