Thread

  1. Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

    Robert Haas <robertmhaas@gmail.com> — 2011-08-23T16:15:23Z

    On Mon, Aug 22, 2011 at 3:31 AM, daveg <daveg@sonic.net> wrote:
    > So far I've got:
    >
    >  - affects system tables
    >  - happens very soon after process startup
    >  - in 8.4.7 and 9.0.4
    >  - not likely to be hardware or OS related
    >  - happens in clusters for period of a few second to many minutes
    >
    > I'll work on printing the LOCK and LOCALLOCK when it happens, but it's
    > hard to get downtime to pick up new builds. Any other ideas on getting to
    > the bottom of this?
    
    I've been thinking this one over, and doing a little testing. I'm
    still stumped, but I have a few thoughts.  What that error message is
    really saying is that the LOCALLOCK bookkeeping doesn't match the
    PROCLOCK bookkeeping; it doesn't tell us which one is to blame.
    
    My first thought was that there might be some situation where
    LockAcquireExtended() gets an interrupt between the time it does the
    LOCALLOCK lookup and the time it acquires the partition lock.  If the
    interrupt handler were to acquire (but not releases) a lock in the
    meantime, then we'd get confused.  However, I can't see how that's
    possible.  I inserted some debugging code to fail an assertion if
    CHECK_FOR_INTERRUPTS() gets invoked in between those two points or if
    ImmediateInterruptOK is set on entering the function, and the system
    still passes regression tests.
    
    My second thought is that perhaps a process is occasionally managing
    to exit without fully cleaning up the associated PROCLOCK entry.  At
    first glance, it appears that this would explain the observed
    symptoms.  A new backend gets the PGPROC belonging to the guy who
    didn't clean up after himself, hits the error, and disconnects,
    sticking himself right back on to the head of the SHM_QUEUE where the
    next connection will inherit the same PGPROC and hit the same problem.
     But it's not clear to me what could cause the system to get into this
    state in the first place, or how it would eventually right itself.
    
    It might be worth kludging up your system to add a test to
    InitProcess() to verify that all of the myProcLocks SHM_QUEUEs are
    either NULL or empty, along the lines of the attached patch (which
    assumes that assertions are enabled; otherwise, put in an elog() of
    some sort).  Actually, I wonder if we shouldn't move all the
    SHMQueueInit() calls for myProcLocks to InitProcGlobal() rather than
    doing it over again every time someone calls InitProcess().  Besides
    being a waste of cycles, it's probably less robust this way.   If
    there somehow are leftovers in one of those queues, the next
    successful call to LockReleaseAll() ought to clean up the mess, but of
    course there's no chance of that working if we've nuked the queue
    pointers.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company