Thread

  1. Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

    daveg <daveg@sonic.net> — 2011-09-07T09:16:34Z

    On Tue, Aug 23, 2011 at 12:15:23PM -0400, Robert Haas wrote:
    > On Mon, Aug 22, 2011 at 3:31 AM, daveg <daveg@sonic.net> wrote:
    > > So far I've got:
    > >
    > >  - affects system tables
    > >  - happens very soon after process startup
    > >  - in 8.4.7 and 9.0.4
    > >  - not likely to be hardware or OS related
    > >  - happens in clusters for period of a few second to many minutes
    > >
    > > I'll work on printing the LOCK and LOCALLOCK when it happens, but it's
    > > hard to get downtime to pick up new builds. Any other ideas on getting to
    > > the bottom of this?
    > 
    > I've been thinking this one over, and doing a little testing. I'm
    > still stumped, but I have a few thoughts.  What that error message is
    > really saying is that the LOCALLOCK bookkeeping doesn't match the
    > PROCLOCK bookkeeping; it doesn't tell us which one is to blame.
    ... 
    > My second thought is that perhaps a process is occasionally managing
    > to exit without fully cleaning up the associated PROCLOCK entry.  At
    > first glance, it appears that this would explain the observed
    > symptoms.  A new backend gets the PGPROC belonging to the guy who
    > didn't clean up after himself, hits the error, and disconnects,
    > sticking himself right back on to the head of the SHM_QUEUE where the
    > next connection will inherit the same PGPROC and hit the same problem.
    >  But it's not clear to me what could cause the system to get into this
    > state in the first place, or how it would eventually right itself.
    > 
    > It might be worth kludging up your system to add a test to
    > InitProcess() to verify that all of the myProcLocks SHM_QUEUEs are
    > either NULL or empty, along the lines of the attached patch (which
    > assumes that assertions are enabled; otherwise, put in an elog() of
    > some sort).  Actually, I wonder if we shouldn't move all the
    > SHMQueueInit() calls for myProcLocks to InitProcGlobal() rather than
    > doing it over again every time someone calls InitProcess().  Besides
    > being a waste of cycles, it's probably less robust this way.   If
    > there somehow are leftovers in one of those queues, the next
    > successful call to LockReleaseAll() ought to clean up the mess, but of
    > course there's no chance of that working if we've nuked the queue
    > pointers.
    
    I did this in the elog flavor as we don't build production images with asserts.
    It has been running on all hosts for a few days. Today it hit the extra
    checks in initproc.
    
    00:02:32.782  8626  [unknown] [unknown]  LOG:  connection received: host=bk0 port=42700
    00:02:32.783  8627  [unknown] [unknown]  LOG:  connection received: host=op2 port=45876
    00:02:32.783  8627  d61 apps  FATAL:  Initprocess myProclocks[4] not empty: queue 0x2ae6b4b895f8 (prev 0x2ae6b4a2b558, next 0x2ae6b4a2b558) 
    00:02:32.783  8626  d35 postgres  LOG:  connection authorized: user=postgres database=c35
    00:02:32.783  21535  LOG:  server process (PID 8627) exited with exit code 1
    00:02:32.783  21535  LOG:  terminating any other active server processes
    00:02:32.783  8626  c35 postgres  WARNING:  terminating connection because of crash of another server process
    
    The patch that produced this is attached. If you can think of anything I
    can add to this to help I'd be happy to do so. Also, can I clean this up
    and continue somehow? Maybe clear the queue instead having to have a restart?
    Or is there a way to just pause this proc here, maybe mark it not to be used
    and exit, or just to sleep forever so I can debug later?
    
    Thanks
    
    -dg
    
    -- 
    David Gould       daveg@sonic.net      510 536 1443    510 282 0869
    If simplicity worked, the world would be overrun with insects.