Thread

  1. OperationalError: FATAL: lock AccessShareLock on object 0/1260/0 is already

    daveg <daveg@sonic.net> — 2011-08-12T23:19:37Z

    This seems to be bug month for my client. Now there are seeing periods
    where all new connections fail immediately with the error:
    
       FATAL:  lock AccessShareLock on object 0/1260/0 is already held 
    
    This happens on postgresql 8.4.7 on a large (512GB, 32 core) system that has
    been up for months. It started happening sporadicly a few days ago. It will
    do this for a period of several minutes to an hour and then go back to
    normal for hours or days.
    
    One complete failing session out of several hundred around that time:
    -----------------
    2011-08-09 00:01:04.446  8823  [unknown]  [unknown]  LOG:  connection received: host=op05.xxx port=34067
    2011-08-09 00:01:04.446  8823  c77  apps  LOG:  connection authorized: user=apps database=c77
    2011-08-09 00:01:04.449  8823  c77  apps  FATAL:  lock AccessShareLock on object 0/1260/0 is already held
    ------------------
     
    What can I do to help track this down?
    
    -dg
    
    
    --
    David Gould       daveg@sonic.net      510 536 1443    510 282 0869
    If simplicity worked, the world would be overrun with insects.
    
    
  2. Re: OperationalError: FATAL: lock AccessShareLock on object 0/1260/0 is already

    Robert Haas <robertmhaas@gmail.com> — 2011-08-14T04:16:39Z

    On Fri, Aug 12, 2011 at 7:19 PM, daveg <daveg@sonic.net> wrote:
    > This seems to be bug month for my client. Now there are seeing periods
    > where all new connections fail immediately with the error:
    >
    >   FATAL:  lock AccessShareLock on object 0/1260/0 is already held
    >
    > This happens on postgresql 8.4.7 on a large (512GB, 32 core) system that has
    > been up for months. It started happening sporadicly a few days ago. It will
    > do this for a period of several minutes to an hour and then go back to
    > normal for hours or days.
    >
    > One complete failing session out of several hundred around that time:
    > -----------------
    > 2011-08-09 00:01:04.446  8823  [unknown]  [unknown]  LOG:  connection received: host=op05.xxx port=34067
    > 2011-08-09 00:01:04.446  8823  c77  apps  LOG:  connection authorized: user=apps database=c77
    > 2011-08-09 00:01:04.449  8823  c77  apps  FATAL:  lock AccessShareLock on object 0/1260/0 is already held
    > ------------------
    >
    > What can I do to help track this down?
    
    I've seen that error (though not that exact fact pattern) caused by
    bad RAM.  It's unclear to me what else could cause it.
    
    In terms of debugging, it seems like it might be sensible to start by
    injecting some debugging code that dumps out the contents of the LOCK
    and LOCALLOCK structures at the point the error occurs.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  3. Re: OperationalError: FATAL: lock AccessShareLock on object 0/1260/0 is already

    daveg <daveg@sonic.net> — 2011-08-14T05:44:44Z

    On Sun, Aug 14, 2011 at 12:16:39AM -0400, Robert Haas wrote:
    > On Fri, Aug 12, 2011 at 7:19 PM, daveg <daveg@sonic.net> wrote:
    > > This seems to be bug month for my client. Now there are seeing periods
    > > where all new connections fail immediately with the error:
    > >
    > >   FATAL:  lock AccessShareLock on object 0/1260/0 is already held
    > >
    > > This happens on postgresql 8.4.7 on a large (512GB, 32 core) system that has
    > > been up for months. It started happening sporadicly a few days ago. It will
    > > do this for a period of several minutes to an hour and then go back to
    > > normal for hours or days.
    > >
    > > One complete failing session out of several hundred around that time:
    > > -----------------
    > > 2011-08-09 00:01:04.446  8823  [unknown]  [unknown]  LOG:  connection received: host=op05.xxx port=34067
    > > 2011-08-09 00:01:04.446  8823  c77  apps  LOG:  connection authorized: user=apps database=c77
    > > 2011-08-09 00:01:04.449  8823  c77  apps  FATAL:  lock AccessShareLock on object 0/1260/0 is already held
    > > ------------------
    > >
    > > What can I do to help track this down?
    > 
    > I've seen that error (though not that exact fact pattern) caused by
    > bad RAM.  It's unclear to me what else could cause it.
    
    I'll look into that. I think it is only happening on one host, so that might
    make sense. On the other hand, these are pretty fancy hosts all ECC and that
    so I'd hope they would have squeaked about bad ram.
     
    > In terms of debugging, it seems like it might be sensible to start by
    > injecting some debugging code that dumps out the contents of the LOCK
    > and LOCALLOCK structures at the point the error occurs.
    
    Hmm, we will update to 9.0 next week on these hosts, so I'll try to hold off
    on this part at least until then.
    
    -dg
     
    -- 
    David Gould       daveg@sonic.net      510 536 1443    510 282 0869
    If simplicity worked, the world would be overrun with insects.
    
    
  4. Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

    daveg <daveg@sonic.net> — 2011-08-22T07:31:31Z

    On Fri, Aug 12, 2011 at 04:19:37PM -0700, daveg wrote:
    > 
    > This seems to be bug month for my client. Now there are seeing periods
    > where all new connections fail immediately with the error:
    > 
    >    FATAL:  lock AccessShareLock on object 0/1260/0 is already held 
    > 
    > This happens on postgresql 8.4.7 on a large (512GB, 32 core) system that has
    > been up for months. It started happening sporadicly a few days ago. It will
    > do this for a period of several minutes to an hour and then go back to
    > normal for hours or days.
    > 
    > One complete failing session out of several hundred around that time:
    > -----------------
    > 2011-08-09 00:01:04.446  8823  [unknown]  [unknown]  LOG:  connection received: host=op05.xxx port=34067
    > 2011-08-09 00:01:04.446  8823  c77  apps  LOG:  connection authorized: user=apps database=c77
    > 2011-08-09 00:01:04.449  8823  c77  apps  FATAL:  lock AccessShareLock on object 0/1260/0 is already held
    > ------------------
    
    This is to add additional information to the original report:
    
    For a while this was happening on many different databases in one postgresql
    8.4.7 instance on a single large host ('U2' 512GB 64cpu) running RH 5.
    That has been quiet for several days and the newest batches of errors have
    happened on only on a single database 'c23', in a postgresql 9.0.4 instance
    on a smaller host ('A', 64GB 8cpu) running SuSE 10.2.
    
    No memory errors or other misbehaviour have been seen on either of these
    hosts in recent months.
    
    The original error was:
    
      lock AccessShareLock on object 0/1260/0 is already held
    
    which is for pg_database. The recent errors are:
    
      lock AccessShareLock on object 16403/2615/0 is already held
    
    which is for pg_namespace in database c23.
    
    All of the orginal and most of the recent batchs of errors were immediately
    after connecting to a database and being authorized, that is, before any
    statements were attempted. However, some of the most recent are on the first
    "query" statement. That is after logging in and doing things like "set
    transaction ... " the first select would hit this error.
    
    It seems to come in clusters, sometimes, which suggests something shared
    by multiple processes. For example, here are the times for the errors
    on c23 in the afternoon of August 20:
    
    20 07:14:12.722
    
    20 16:05:07.798
    20 16:05:07.808
    
    20 16:05:10.519
    
    20 16:07:07.726
    20 16:07:08.722
    20 16:07:09.734
    20 16:07:10.656
    
    20 16:07:25.436
    
    20 16:22:23.983
    20 16:22:24.014
    20 16:22:24.335
    20 16:22:24.409
    20 16:22:24.477
    20 16:22:24.499
    20 16:22:24.516
    
    20 16:30:58.210
    
    20 16:31:15.261
    20 16:31:15.296
    20 16:31:15.324
    20 16:31:15.348
    
    20 18:06:16.515
    
    20 18:06:49.198
    20 18:06:49.204
    
    20 18:06:51.444
    
    20 21:03:05.940
    
    So far I've got:
    
      - affects system tables
      - happens very soon after process startup
      - in 8.4.7 and 9.0.4
      - not likely to be hardware or OS related
      - happens in clusters for period of a few second to many minutes
    
    I'll work on printing the LOCK and LOCALLOCK when it happens, but it's
    hard to get downtime to pick up new builds. Any other ideas on getting to
    the bottom of this?
    
    Thanks
    
    -dg
    
    -- 
    David Gould       daveg@sonic.net      510 536 1443    510 282 0869
    If simplicity worked, the world would be overrun with insects.
    
    
  5. Re: OperationalError: FATAL: lock AccessShareLock on object 0/1260/0 is already

    daveg <daveg@sonic.net> — 2011-08-23T07:24:47Z

    On Sun, Aug 14, 2011 at 12:16:39AM -0400, Robert Haas wrote:
    > On Fri, Aug 12, 2011 at 7:19 PM, daveg <daveg@sonic.net> wrote:
    > > This seems to be bug month for my client. Now there are seeing periods
    > > where all new connections fail immediately with the error:
    > >
    > >   FATAL:  lock AccessShareLock on object 0/1260/0 is already held
    ...
    > > What can I do to help track this down?
    > 
    > I've seen that error (though not that exact fact pattern) caused by
    > bad RAM.  It's unclear to me what else could cause it.
    > 
    > In terms of debugging, it seems like it might be sensible to start by
    > injecting some debugging code that dumps out the contents of the LOCK
    > and LOCALLOCK structures at the point the error occurs.
    
    I've made up the attached patch to print this, please suggest any additions.
    I'll deploy this on a couple of the production hosts that have had the
    issue this evening, but there is no telling when or if it will strike next.
    
    -dg
    
    -- 
    David Gould       daveg@sonic.net      510 536 1443    510 282 0869
    If simplicity worked, the world would be overrun with insects.
    
  6. Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

    Robert Haas <robertmhaas@gmail.com> — 2011-08-23T16:15:23Z

    On Mon, Aug 22, 2011 at 3:31 AM, daveg <daveg@sonic.net> wrote:
    > So far I've got:
    >
    >  - affects system tables
    >  - happens very soon after process startup
    >  - in 8.4.7 and 9.0.4
    >  - not likely to be hardware or OS related
    >  - happens in clusters for period of a few second to many minutes
    >
    > I'll work on printing the LOCK and LOCALLOCK when it happens, but it's
    > hard to get downtime to pick up new builds. Any other ideas on getting to
    > the bottom of this?
    
    I've been thinking this one over, and doing a little testing. I'm
    still stumped, but I have a few thoughts.  What that error message is
    really saying is that the LOCALLOCK bookkeeping doesn't match the
    PROCLOCK bookkeeping; it doesn't tell us which one is to blame.
    
    My first thought was that there might be some situation where
    LockAcquireExtended() gets an interrupt between the time it does the
    LOCALLOCK lookup and the time it acquires the partition lock.  If the
    interrupt handler were to acquire (but not releases) a lock in the
    meantime, then we'd get confused.  However, I can't see how that's
    possible.  I inserted some debugging code to fail an assertion if
    CHECK_FOR_INTERRUPTS() gets invoked in between those two points or if
    ImmediateInterruptOK is set on entering the function, and the system
    still passes regression tests.
    
    My second thought is that perhaps a process is occasionally managing
    to exit without fully cleaning up the associated PROCLOCK entry.  At
    first glance, it appears that this would explain the observed
    symptoms.  A new backend gets the PGPROC belonging to the guy who
    didn't clean up after himself, hits the error, and disconnects,
    sticking himself right back on to the head of the SHM_QUEUE where the
    next connection will inherit the same PGPROC and hit the same problem.
     But it's not clear to me what could cause the system to get into this
    state in the first place, or how it would eventually right itself.
    
    It might be worth kludging up your system to add a test to
    InitProcess() to verify that all of the myProcLocks SHM_QUEUEs are
    either NULL or empty, along the lines of the attached patch (which
    assumes that assertions are enabled; otherwise, put in an elog() of
    some sort).  Actually, I wonder if we shouldn't move all the
    SHMQueueInit() calls for myProcLocks to InitProcGlobal() rather than
    doing it over again every time someone calls InitProcess().  Besides
    being a waste of cycles, it's probably less robust this way.   If
    there somehow are leftovers in one of those queues, the next
    successful call to LockReleaseAll() ought to clean up the mess, but of
    course there's no chance of that working if we've nuked the queue
    pointers.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
  7. Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

    daveg <daveg@sonic.net> — 2011-09-07T09:16:34Z

    On Tue, Aug 23, 2011 at 12:15:23PM -0400, Robert Haas wrote:
    > On Mon, Aug 22, 2011 at 3:31 AM, daveg <daveg@sonic.net> wrote:
    > > So far I've got:
    > >
    > >  - affects system tables
    > >  - happens very soon after process startup
    > >  - in 8.4.7 and 9.0.4
    > >  - not likely to be hardware or OS related
    > >  - happens in clusters for period of a few second to many minutes
    > >
    > > I'll work on printing the LOCK and LOCALLOCK when it happens, but it's
    > > hard to get downtime to pick up new builds. Any other ideas on getting to
    > > the bottom of this?
    > 
    > I've been thinking this one over, and doing a little testing. I'm
    > still stumped, but I have a few thoughts.  What that error message is
    > really saying is that the LOCALLOCK bookkeeping doesn't match the
    > PROCLOCK bookkeeping; it doesn't tell us which one is to blame.
    ... 
    > My second thought is that perhaps a process is occasionally managing
    > to exit without fully cleaning up the associated PROCLOCK entry.  At
    > first glance, it appears that this would explain the observed
    > symptoms.  A new backend gets the PGPROC belonging to the guy who
    > didn't clean up after himself, hits the error, and disconnects,
    > sticking himself right back on to the head of the SHM_QUEUE where the
    > next connection will inherit the same PGPROC and hit the same problem.
    >  But it's not clear to me what could cause the system to get into this
    > state in the first place, or how it would eventually right itself.
    > 
    > It might be worth kludging up your system to add a test to
    > InitProcess() to verify that all of the myProcLocks SHM_QUEUEs are
    > either NULL or empty, along the lines of the attached patch (which
    > assumes that assertions are enabled; otherwise, put in an elog() of
    > some sort).  Actually, I wonder if we shouldn't move all the
    > SHMQueueInit() calls for myProcLocks to InitProcGlobal() rather than
    > doing it over again every time someone calls InitProcess().  Besides
    > being a waste of cycles, it's probably less robust this way.   If
    > there somehow are leftovers in one of those queues, the next
    > successful call to LockReleaseAll() ought to clean up the mess, but of
    > course there's no chance of that working if we've nuked the queue
    > pointers.
    
    I did this in the elog flavor as we don't build production images with asserts.
    It has been running on all hosts for a few days. Today it hit the extra
    checks in initproc.
    
    00:02:32.782  8626  [unknown] [unknown]  LOG:  connection received: host=bk0 port=42700
    00:02:32.783  8627  [unknown] [unknown]  LOG:  connection received: host=op2 port=45876
    00:02:32.783  8627  d61 apps  FATAL:  Initprocess myProclocks[4] not empty: queue 0x2ae6b4b895f8 (prev 0x2ae6b4a2b558, next 0x2ae6b4a2b558) 
    00:02:32.783  8626  d35 postgres  LOG:  connection authorized: user=postgres database=c35
    00:02:32.783  21535  LOG:  server process (PID 8627) exited with exit code 1
    00:02:32.783  21535  LOG:  terminating any other active server processes
    00:02:32.783  8626  c35 postgres  WARNING:  terminating connection because of crash of another server process
    
    The patch that produced this is attached. If you can think of anything I
    can add to this to help I'd be happy to do so. Also, can I clean this up
    and continue somehow? Maybe clear the queue instead having to have a restart?
    Or is there a way to just pause this proc here, maybe mark it not to be used
    and exit, or just to sleep forever so I can debug later?
    
    Thanks
    
    -dg
    
    -- 
    David Gould       daveg@sonic.net      510 536 1443    510 282 0869
    If simplicity worked, the world would be overrun with insects.
    
  8. Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

    Robert Haas <robertmhaas@gmail.com> — 2011-09-07T13:46:19Z

    On Wed, Sep 7, 2011 at 5:16 AM, daveg <daveg@sonic.net> wrote:
    > On Tue, Aug 23, 2011 at 12:15:23PM -0400, Robert Haas wrote:
    >> On Mon, Aug 22, 2011 at 3:31 AM, daveg <daveg@sonic.net> wrote:
    >> > So far I've got:
    >> >
    >> >  - affects system tables
    >> >  - happens very soon after process startup
    >> >  - in 8.4.7 and 9.0.4
    >> >  - not likely to be hardware or OS related
    >> >  - happens in clusters for period of a few second to many minutes
    >> >
    >> > I'll work on printing the LOCK and LOCALLOCK when it happens, but it's
    >> > hard to get downtime to pick up new builds. Any other ideas on getting to
    >> > the bottom of this?
    >>
    >> I've been thinking this one over, and doing a little testing. I'm
    >> still stumped, but I have a few thoughts.  What that error message is
    >> really saying is that the LOCALLOCK bookkeeping doesn't match the
    >> PROCLOCK bookkeeping; it doesn't tell us which one is to blame.
    > ...
    >> My second thought is that perhaps a process is occasionally managing
    >> to exit without fully cleaning up the associated PROCLOCK entry.  At
    >> first glance, it appears that this would explain the observed
    >> symptoms.  A new backend gets the PGPROC belonging to the guy who
    >> didn't clean up after himself, hits the error, and disconnects,
    >> sticking himself right back on to the head of the SHM_QUEUE where the
    >> next connection will inherit the same PGPROC and hit the same problem.
    >>  But it's not clear to me what could cause the system to get into this
    >> state in the first place, or how it would eventually right itself.
    >>
    >> It might be worth kludging up your system to add a test to
    >> InitProcess() to verify that all of the myProcLocks SHM_QUEUEs are
    >> either NULL or empty, along the lines of the attached patch (which
    >> assumes that assertions are enabled; otherwise, put in an elog() of
    >> some sort).  Actually, I wonder if we shouldn't move all the
    >> SHMQueueInit() calls for myProcLocks to InitProcGlobal() rather than
    >> doing it over again every time someone calls InitProcess().  Besides
    >> being a waste of cycles, it's probably less robust this way.   If
    >> there somehow are leftovers in one of those queues, the next
    >> successful call to LockReleaseAll() ought to clean up the mess, but of
    >> course there's no chance of that working if we've nuked the queue
    >> pointers.
    >
    > I did this in the elog flavor as we don't build production images with asserts.
    > It has been running on all hosts for a few days. Today it hit the extra
    > checks in initproc.
    >
    > 00:02:32.782  8626  [unknown] [unknown]  LOG:  connection received: host=bk0 port=42700
    > 00:02:32.783  8627  [unknown] [unknown]  LOG:  connection received: host=op2 port=45876
    > 00:02:32.783  8627  d61 apps  FATAL:  Initprocess myProclocks[4] not empty: queue 0x2ae6b4b895f8 (prev 0x2ae6b4a2b558, next 0x2ae6b4a2b558)
    > 00:02:32.783  8626  d35 postgres  LOG:  connection authorized: user=postgres database=c35
    > 00:02:32.783  21535  LOG:  server process (PID 8627) exited with exit code 1
    > 00:02:32.783  21535  LOG:  terminating any other active server processes
    > 00:02:32.783  8626  c35 postgres  WARNING:  terminating connection because of crash of another server process
    >
    > The patch that produced this is attached. If you can think of anything I
    > can add to this to help I'd be happy to do so. Also, can I clean this up
    > and continue somehow? Maybe clear the queue instead having to have a restart?
    > Or is there a way to just pause this proc here, maybe mark it not to be used
    > and exit, or just to sleep forever so I can debug later?
    
    I think what we really need to know here is: when the debugging code
    you injected here fires, what happened to the previous owner of that
    PGPROC that caused it to not clean up its state properly?  The PGPROC
    that 8627 inherited in the above example is obviously messed up - but
    what did the last guy do that messed it up?  It would be nice to log
    the address of the PGPROC in every log message here - then you could
    go back and look to see whether the last guy terminated in some
    unusual way.  In the meantime, could you could look back a couple of
    minutes from the time of the above occurrence and see if there are any
    FATAL errors, or usual ERRORs?
    
    After spending some time staring at the code, I do have one idea as to
    what might be going on here.  When a backend is terminated,
    ShutdownPostgres() calls AbortOutOfAnyTransaction() and then
    LockReleaseAll(USER_LOCKMETHOD, true).  The second call releases all
    user locks, and the first one (or so we suppose) releases all normal
    locks as part of aborting the transaction.  But what if there's no
    transaction in progress?  In that case, AbortOutOfAnyTransaction()
    won't do anything - which is fine as far as transaction-level locks
    go, because we shouldn't be holding any of those anyway if there's no
    transaction in progress.  However, if we hold a session-level lock at
    that point, then we'd orphan it.  We don't make much use of session
    locks.  As far as I can see, they are used by (1) VACUUM, (2) CREATE
    INDEX CONCURRENTLY, (3) ALTER DATABASE .. SET TABLESPACE, and (4) on
    standby servers, redo of DROP DATABASE actions.  Any chance one of
    those died or was killed off around the time this happened?
    
    One thing that might be worth doing in the interest of debugging is
    adding code just like what you've already added to ShutdownPostgres(),
    *after* calling AbortOutOfAnyTransaction() and LockReleaseAll().  That
    might enable us to trap the problem at the time it happens, rather
    than waiting until the next backend gets that PGPROC.  Of course that
    still might not answer the question of how it got like that, but at
    least we'd be getting closer to the time of the actual event.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  9. Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-07T14:20:24Z

    Robert Haas <robertmhaas@gmail.com> writes:
    > After spending some time staring at the code, I do have one idea as to
    > what might be going on here.  When a backend is terminated,
    > ShutdownPostgres() calls AbortOutOfAnyTransaction() and then
    > LockReleaseAll(USER_LOCKMETHOD, true).  The second call releases all
    > user locks, and the first one (or so we suppose) releases all normal
    > locks as part of aborting the transaction.  But what if there's no
    > transaction in progress?  In that case, AbortOutOfAnyTransaction()
    > won't do anything - which is fine as far as transaction-level locks
    > go, because we shouldn't be holding any of those anyway if there's no
    > transaction in progress.  However, if we hold a session-level lock at
    > that point, then we'd orphan it.  We don't make much use of session
    > locks.  As far as I can see, they are used by (1) VACUUM, (2) CREATE
    > INDEX CONCURRENTLY, (3) ALTER DATABASE .. SET TABLESPACE, and (4) on
    > standby servers, redo of DROP DATABASE actions.  Any chance one of
    > those died or was killed off around the time this happened?
    
    I don't believe this theory at all, because if that were the issue,
    we'd have heard about it long since.  The correct theory has to involve
    a very-little-used triggering condition.  At the moment I'm wondering
    about advisory (userspace) locks ... Dave, do your apps use any of those?
    
    			regards, tom lane
    
    
  10. Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

    daveg <daveg@sonic.net> — 2011-09-07T20:22:00Z

    On Wed, Sep 07, 2011 at 10:20:24AM -0400, Tom Lane wrote:
    > Robert Haas <robertmhaas@gmail.com> writes:
    > > After spending some time staring at the code, I do have one idea as to
    > > what might be going on here.  When a backend is terminated,
    > > ShutdownPostgres() calls AbortOutOfAnyTransaction() and then
    > > LockReleaseAll(USER_LOCKMETHOD, true).  The second call releases all
    > > user locks, and the first one (or so we suppose) releases all normal
    > > locks as part of aborting the transaction.  But what if there's no
    > > transaction in progress?  In that case, AbortOutOfAnyTransaction()
    > > won't do anything - which is fine as far as transaction-level locks
    > > go, because we shouldn't be holding any of those anyway if there's no
    > > transaction in progress.  However, if we hold a session-level lock at
    > > that point, then we'd orphan it.  We don't make much use of session
    > > locks.  As far as I can see, they are used by (1) VACUUM, (2) CREATE
    > > INDEX CONCURRENTLY, (3) ALTER DATABASE .. SET TABLESPACE, and (4) on
    > > standby servers, redo of DROP DATABASE actions.  Any chance one of
    > > those died or was killed off around the time this happened?
    > 
    > I don't believe this theory at all, because if that were the issue,
    > we'd have heard about it long since.  The correct theory has to involve
    > a very-little-used triggering condition.  At the moment I'm wondering
    > about advisory (userspace) locks ... Dave, do your apps use any of those?
    
    Yes, we make extensive use of advisory locks. That was my thought too when
    Robert mentioned session level locks.
    
    I'm happy to add any additional instrumentation, but my client would be
    happier to actually run it if there was a way to recover from this without
    an unplanned outage. Is there something I can do when the patch detects the
    problem to be able to continue without a restart? Is is save to just reset
    the proclock queue? I don't think they would mind leaking locks, for instance,
    and a later planned restart to clear it up as much as they mind unscheduled
    downtime.
    
    Thank
    
    -dg
    
    -- 
    David Gould       daveg@sonic.net      510 536 1443    510 282 0869
    If simplicity worked, the world would be overrun with insects.
    
    
  11. Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

    Robert Haas <robertmhaas@gmail.com> — 2011-09-07T20:28:41Z

    On Wed, Sep 7, 2011 at 4:22 PM, daveg <daveg@sonic.net> wrote:
    > Yes, we make extensive use of advisory locks. That was my thought too when
    > Robert mentioned session level locks.
    >
    > I'm happy to add any additional instrumentation, but my client would be
    > happier to actually run it if there was a way to recover from this without
    > an unplanned outage. Is there something I can do when the patch detects the
    > problem to be able to continue without a restart?
    
    Well, basically, you want to do the same thing that LockRelease()
    would do - remove the PROCLOCK from the SHM_QUEUE and delete it from
    the hash table, adjusting the counts in the LOCK object as
    appropriate.  If you just ignore the failure, you'll get the "blah
    blah blah is already held" messages you were having before.
    
    Tom's right to be skeptical of my theory, because it would require a
    CHECK_FOR_INTERRUPTS() outside of a transaction block in one of the
    pathways that use session-level locks, and I can't find one.
    
    OTOH, I'm skeptical of the theory that this involves userlocks,
    because this whole thread started because of a complaint about lock
    0/1260/0 already being held.  That ain't no userlock.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  12. Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-07T20:55:24Z

    Robert Haas <robertmhaas@gmail.com> writes:
    > Tom's right to be skeptical of my theory, because it would require a
    > CHECK_FOR_INTERRUPTS() outside of a transaction block in one of the
    > pathways that use session-level locks, and I can't find one.
    
    More to the point: session-level locks are released on error.  The only
    way to get out of a transaction while still holding one is for the
    VACUUM-or-whichever-command code to deliberately commit and exit while
    still holding it.  An error exit path would release the lock.
    
    > OTOH, I'm skeptical of the theory that this involves userlocks,
    > because this whole thread started because of a complaint about lock
    > 0/1260/0 already being held.  That ain't no userlock.
    
    Yeah, and for that matter it seems to let VACUUM off the hook too.
    If we assume that the reported object ID is non-corrupt (and if it's
    always the same, that seems like the way to bet) then this is a lock
    on pg_authid.
    
    Hmmm ... could the pathway involve an error exit from client
    authentication?  We're still finding bugs in the 9.0 rewrite of
    auth-time database access.
    
    			regards, tom lane
    
    
  13. Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

    Robert Haas <robertmhaas@gmail.com> — 2011-09-07T22:17:30Z

    On Wed, Sep 7, 2011 at 4:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Yeah, and for that matter it seems to let VACUUM off the hook too.
    > If we assume that the reported object ID is non-corrupt (and if it's
    > always the same, that seems like the way to bet) then this is a lock
    > on pg_authid.
    >
    > Hmmm ... could the pathway involve an error exit from client
    > authentication?  We're still finding bugs in the 9.0 rewrite of
    > auth-time database access.
    
    Well, according to Dave's report upthread, it's not only this one relation:
    
    DG> The recent errors are:
    DG> lock AccessShareLock on object 16403/2615/0 is already held
    DG> which is for pg_namespace in database c23.
    
    I thought about an error exit from client authentication, and that's a
    somewhat appealing explanation, but I can't quite see why we wouldn't
    clean up there the same as anywhere else.  The whole mechanism feels a
    bit rickety to me - we don't actually release locks; we just abort the
    transaction and *assume* that will cause locks to get released.  And
    it should.  But there's a bit more action-at-a-distance there than I'd
    ideally like.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  14. Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

    daveg <daveg@sonic.net> — 2011-09-07T22:23:26Z

    On Wed, Sep 07, 2011 at 04:55:24PM -0400, Tom Lane wrote:
    > Robert Haas <robertmhaas@gmail.com> writes:
    > > Tom's right to be skeptical of my theory, because it would require a
    > > CHECK_FOR_INTERRUPTS() outside of a transaction block in one of the
    > > pathways that use session-level locks, and I can't find one.
    > 
    > More to the point: session-level locks are released on error.  The only
    > way to get out of a transaction while still holding one is for the
    > VACUUM-or-whichever-command code to deliberately commit and exit while
    > still holding it.  An error exit path would release the lock.
    > 
    > > OTOH, I'm skeptical of the theory that this involves userlocks,
    > > because this whole thread started because of a complaint about lock
    > > 0/1260/0 already being held.  That ain't no userlock.
    > 
    > Yeah, and for that matter it seems to let VACUUM off the hook too.
    > If we assume that the reported object ID is non-corrupt (and if it's
    > always the same, that seems like the way to bet) then this is a lock
    > on pg_authid.
    > 
    > Hmmm ... could the pathway involve an error exit from client
    > authentication?  We're still finding bugs in the 9.0 rewrite of
    > auth-time database access.
    
    It does not seem restricted to pg_authid:
    
    2011-08-24 18:35:57.445 24987  c23  apps  ERROR:  lock AccessShareLock on object 16403/2615/0 
    
    And I think I've seen it on other tables too.
    
    -dg
    
    -- 
    David Gould       daveg@sonic.net      510 536 1443    510 282 0869
    If simplicity worked, the world would be overrun with insects.
    
    
  15. Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-07T22:25:23Z

    Robert Haas <robertmhaas@gmail.com> writes:
    > I thought about an error exit from client authentication, and that's a
    > somewhat appealing explanation, but I can't quite see why we wouldn't
    > clean up there the same as anywhere else.  The whole mechanism feels a
    > bit rickety to me - we don't actually release locks; we just abort the
    > transaction and *assume* that will cause locks to get released.
    
    Well, transaction abort will call LockReleaseAll, which is carefully
    coded to clean up the proclock lists regardless of what is in the
    locallocks table, so I'm not sure why you find that any more rickety
    than anything else.  But maybe it'd be interesting for Dave to stick a
    LockReleaseAll call into ProcKill() and see if that makes things better.
    (Dave: test that before you put it in production, I'm not totally sure
    it's safe.)
    
    			regards, tom lane
    
    
  16. Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-07T22:35:08Z

    daveg <daveg@sonic.net> writes:
    > It does not seem restricted to pg_authid:
    > 2011-08-24 18:35:57.445 24987  c23  apps  ERROR:  lock AccessShareLock on object 16403/2615/0 
    > And I think I've seen it on other tables too.
    
    Hmm.  2615 = pg_namespace, which most likely is the first catalog
    accessed by just about any SQL command that's going to access tables at
    all, so I suspect that this is mostly just a "the first access failed"
    thing and not something peculiar to pg_namespace.  But we still don't
    have a clue why the locks are not getting released by the previous
    owner of the PGPROC slot.  Have you trawled your logs to see if there's
    any sign of any distress at all, shortly before the problem starts to
    happen?
    
    			regards, tom lane
    
    
  17. Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

    daveg <daveg@sonic.net> — 2011-09-07T23:01:10Z

    On Wed, Sep 07, 2011 at 06:35:08PM -0400, Tom Lane wrote:
    > daveg <daveg@sonic.net> writes:
    > > It does not seem restricted to pg_authid:
    > > 2011-08-24 18:35:57.445 24987  c23  apps  ERROR:  lock AccessShareLock on object 16403/2615/0 
    > > And I think I've seen it on other tables too.
    > 
    > Hmm.  2615 = pg_namespace, which most likely is the first catalog
    > accessed by just about any SQL command that's going to access tables at
    > all, so I suspect that this is mostly just a "the first access failed"
    > thing and not something peculiar to pg_namespace.  But we still don't
    > have a clue why the locks are not getting released by the previous
    > owner of the PGPROC slot.  Have you trawled your logs to see if there's
    > any sign of any distress at all, shortly before the problem starts to
    > happen?
    
    Will do, but its a pretty big haystack. Sure wish I knew what the needle
    looked like. ;-)
    
    -dg
     
    -- 
    David Gould       daveg@sonic.net      510 536 1443    510 282 0869
    If simplicity worked, the world would be overrun with insects.
    
    
  18. Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

    daveg <daveg@sonic.net> — 2011-09-07T23:05:21Z

    On Wed, Sep 07, 2011 at 06:25:23PM -0400, Tom Lane wrote:
    > Robert Haas <robertmhaas@gmail.com> writes:
    > > I thought about an error exit from client authentication, and that's a
    > > somewhat appealing explanation, but I can't quite see why we wouldn't
    > > clean up there the same as anywhere else.  The whole mechanism feels a
    > > bit rickety to me - we don't actually release locks; we just abort the
    > > transaction and *assume* that will cause locks to get released.
    > 
    > Well, transaction abort will call LockReleaseAll, which is carefully
    > coded to clean up the proclock lists regardless of what is in the
    > locallocks table, so I'm not sure why you find that any more rickety
    > than anything else.  But maybe it'd be interesting for Dave to stick a
    > LockReleaseAll call into ProcKill() and see if that makes things better.
    > (Dave: test that before you put it in production, I'm not totally sure
    > it's safe.)
    
    Re safety, what is the worst case here? 
    
    Also, this is very intermittant, we have seen it only in recent months
    on both 8.4.7 and 9.0.4 after years of no problems. Lately we see it what
    feels like a few times a month. Possibly some new application behaviour
    is provoking it, but I have no guesses as to what.
    
    -dg
    
    -- 
    David Gould       daveg@sonic.net      510 536 1443    510 282 0869
    If simplicity worked, the world would be overrun with insects.
    
    
  19. Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-07T23:16:12Z

    daveg <daveg@sonic.net> writes:
    > On Wed, Sep 07, 2011 at 06:25:23PM -0400, Tom Lane wrote:
    >> ...  But maybe it'd be interesting for Dave to stick a
    >> LockReleaseAll call into ProcKill() and see if that makes things better.
    >> (Dave: test that before you put it in production, I'm not totally sure
    >> it's safe.)
    
    > Re safety, what is the worst case here? 
    
    I think a failure would be pretty obvious --- if it gets through
    regression tests it's probably fine.
    
    			regards, tom lane
    
    
  20. Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-07T23:39:15Z

    daveg <daveg@sonic.net> writes:
    > Also, this is very intermittant, we have seen it only in recent months
    > on both 8.4.7 and 9.0.4 after years of no problems. Lately we see it what
    > feels like a few times a month. Possibly some new application behaviour
    > is provoking it, but I have no guesses as to what.
    
    BTW ... what were the last versions you were running on which you had
    *not* seen the problem?  (Just wondering about the possibility that we
    back-patched some "fix" that broke things.  It would be good to have
    a version range before trawling the commit logs.)
    
    			regards, tom lane
    
    
  21. Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

    daveg <daveg@sonic.net> — 2011-09-08T00:55:59Z

    On Wed, Sep 07, 2011 at 07:39:15PM -0400, Tom Lane wrote:
    > daveg <daveg@sonic.net> writes:
    > > Also, this is very intermittant, we have seen it only in recent months
    > > on both 8.4.7 and 9.0.4 after years of no problems. Lately we see it what
    > > feels like a few times a month. Possibly some new application behaviour
    > > is provoking it, but I have no guesses as to what.
    > 
    > BTW ... what were the last versions you were running on which you had
    > *not* seen the problem?  (Just wondering about the possibility that we
    > back-patched some "fix" that broke things.  It would be good to have
    > a version range before trawling the commit logs.)
    
    The first version we saw it on was 8.4.7.
    
    -dg
     
    -- 
    David Gould       daveg@sonic.net      510 536 1443    510 282 0869
    If simplicity worked, the world would be overrun with insects.
    
    
  22. Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-08T01:02:04Z

    daveg <daveg@sonic.net> writes:
    > On Wed, Sep 07, 2011 at 07:39:15PM -0400, Tom Lane wrote:
    >> BTW ... what were the last versions you were running on which you had
    >> *not* seen the problem?  (Just wondering about the possibility that we
    >> back-patched some "fix" that broke things.  It would be good to have
    >> a version range before trawling the commit logs.)
    
    > The first version we saw it on was 8.4.7.
    
    Yeah, you said that.  I was wondering what you'd last run before 8.4.7.
    
    			regards, tom lane
    
    
  23. Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

    Robert Haas <robertmhaas@gmail.com> — 2011-09-08T01:22:40Z

    On Wed, Sep 7, 2011 at 6:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Robert Haas <robertmhaas@gmail.com> writes:
    >> I thought about an error exit from client authentication, and that's a
    >> somewhat appealing explanation, but I can't quite see why we wouldn't
    >> clean up there the same as anywhere else.  The whole mechanism feels a
    >> bit rickety to me - we don't actually release locks; we just abort the
    >> transaction and *assume* that will cause locks to get released.
    >
    > Well, transaction abort will call LockReleaseAll, which is carefully
    > coded to clean up the proclock lists regardless of what is in the
    > locallocks table, so I'm not sure why you find that any more rickety
    > than anything else.
    
    Well, it's very clear that you CAN orphan locks if a backend holding a
    session lock ever does CHECK_FOR_INTERRUPTS() outside of a
    transaction.  Try the attached patch.
    
    rhaas=# vacuum full foo;
    FATAL:  terminating connection due to administrator command
    FATAL:  terminating connection due to administrator command
    The connection to the server was lost. Attempting reset: Succeeded.
    rhaas=# vacuum full foo;
    ERROR:  lock AccessExclusiveLock on object 16384/1431013/0 is already held
    
    Now, I don't see any evidence of a live bug here (and on further
    thought it can't be Dave's bug because he is orphaning
    AccessShareLocks, not AccessExclusiveLocks), but I find this pretty
    convincing as a demonstration of ricketiness.  It is certainly not
    obvious on its face that a misplaced CHECK_FOR_INTERRUPTS() can result
    in a backend exiting without cleaning up its locks, and I'd argue its
    a bad idea to leave it that way even if there's no user-visible
    problem there today.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
  24. Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

    daveg <daveg@sonic.net> — 2011-09-08T06:35:29Z

    On Wed, Sep 07, 2011 at 09:02:04PM -0400, Tom Lane wrote:
    > daveg <daveg@sonic.net> writes:
    > > On Wed, Sep 07, 2011 at 07:39:15PM -0400, Tom Lane wrote:
    > >> BTW ... what were the last versions you were running on which you had
    > >> *not* seen the problem?  (Just wondering about the possibility that we
    > >> back-patched some "fix" that broke things.  It would be good to have
    > >> a version range before trawling the commit logs.)
    > 
    > > The first version we saw it on was 8.4.7.
    > 
    > Yeah, you said that.  I was wondering what you'd last run before 8.4.7.
    
    Sorry, misunderstood. We were previously running 8.4.4, but have been on 8.4.7
    since shortly after it was released. Prior to that we have had all the major
    and most of the minor releases since 7.1.
     
    -dg
    
    -- 
    David Gould       daveg@sonic.net      510 536 1443    510 282 0869
    If simplicity worked, the world would be overrun with insects.
    
    
  25. Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-08T15:31:51Z

    daveg <daveg@sonic.net> writes:
    > On Wed, Sep 07, 2011 at 09:02:04PM -0400, Tom Lane wrote:
    >> daveg <daveg@sonic.net> writes:
    >>> The first version we saw it on was 8.4.7.
    
    >> Yeah, you said that.  I was wondering what you'd last run before 8.4.7.
    
    > Sorry, misunderstood. We were previously running 8.4.4, but have been on 8.4.7
    > since shortly after it was released. Prior to that we have had all the major
    > and most of the minor releases since 7.1.
    
    Well, I groveled through the commit logs from 8.4.4 to 8.4.7, and
    I can't find anything that looks like it could possibly be related.
    So at this point I'm inclined to think that the bug is older than
    that, but your usage patterns changed so that you started to tickle it.
    
    Can you think of any changes in your usage that might date to around
    that time, and would somehow be connected to backend startup/shutdown?
    
    			regards, tom lane