Thread

  1. Re: [HACKERS] Backends waiting, spinlocks, shared mem patches

    Tom Lane <tgl@sss.pgh.pa.us> — 1999-05-31T14:56:48Z

    Wayne Piekarski <wayne@senet.com.au> writes:
    > Sorry this has taken me so long to get back to you.
    
    Thanks for reporting back, Wayne.
    
    > One thing we did notice is that when we tried to open more than say 50
    > backends, we would get the following:
    > InitPostgres
    > IpcSemaphoreCreate: semget failed (No space left on device) key=5432017,
    > num=16, permission=600
    > proc_exit(3) [#0]
    > Shortly after, we would get:
    > FATAL: s_lock(18001065) at spin.c:125, stuck spinlock. Aborting.
    
    Yes, 6.4.* does not cope gracefully at all with running out of kernel
    semaphores.  This is "fixed" in 6.5 by the brute-force approach of
    grabbing all the semaphores we could want at postmaster startup, rather
    than trying to allocate them on-the-fly during backend startup.  Either
    way, you want your kernel to be able to provide one semaphore per
    potential backend.
    
    > We tried the same massive number of connections test with 6.5 and it
    > refuses to accept the connection after a while, which is good. I'm reading
    > through archives about MaxBackendId now, so I'm going to play with that.
    
    In 6.5 you just need to set the postmaster's -N switch.
    
    > We have also been doing some testing with the latest 6.5 from the other
    > day, to check that certain problems we've bumped into have been fixed. We
    > can't run it live, but we'll try to run our testing programs on it as a
    > best approximation to help flush out any bugs that might be left.
    
    OK, please let us know ASAP if you spot problems... we are shooting for
    formal 6.5 release one week from today...
    
    			regards, tom lane
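
As a rough illustration of the "one semaphore per potential backend" guidance above, the arithmetic can be sketched in C. This is only a sketch under stated assumptions, not PostgreSQL code: the function names are hypothetical, the set size of 16 is taken from the "num=16" in the semget error quoted above, and the two limits are the usual System V ones (SEMMNI for the number of semaphore sets, SEMMNS for the total number of semaphores). semget() reports ENOSPC ("No space left on device") when creating a set would exceed either limit.

```c
/*
 * Hypothetical helpers (not from the PostgreSQL source tree).
 * Assumption: one semaphore per potential backend, allocated in
 * fixed-size sets (16 per set, as in the "num=16" error above).
 */

/* Number of semaphore sets needed to cover max_backends backends. */
int sem_sets_needed(int max_backends, int sems_per_set)
{
    if (max_backends <= 0 || sems_per_set <= 0)
        return 0;
    /* Round up: a partly used set still has to be created whole. */
    return (max_backends + sems_per_set - 1) / sems_per_set;
}

/* Total semaphores reserved, counting unused slots in the last set. */
int sems_reserved(int max_backends, int sems_per_set)
{
    return sem_sets_needed(max_backends, sems_per_set) * sems_per_set;
}

/*
 * Returns 1 if the given kernel limits could satisfy the request,
 * 0 otherwise.  semmni = maximum number of sets, semmns = maximum
 * number of semaphores system-wide.  This ignores semaphores held
 * by other programs, so a result of 1 is necessary, not sufficient.
 */
int kernel_limits_suffice(int max_backends, int sems_per_set,
                          int semmni, int semmns)
{
    return sem_sets_needed(max_backends, sems_per_set) <= semmni
        && sems_reserved(max_backends, sems_per_set) <= semmns;
}
```

For example, under these assumptions 50 backends need 4 sets and 64 semaphores, so a kernel with SEMMNS set to 60 would refuse the last set even though only 50 semaphores are actually in use.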
    
    
  2. Re: [HACKERS] Backends waiting, spinlocks, shared mem patches

    Wayne Piekarski <wayne@senet.com.au> — 1999-06-03T05:41:06Z

    Hi,
    
    > Yes, 6.4.* does not cope gracefully at all with running out of kernel
    > semaphores.  This is "fixed" in 6.5 by the brute-force approach of
    > grabbing all the semaphores we could want at postmaster startup, rather
    > than trying to allocate them on-the-fly during backend startup.  Either
    > way, you want your kernel to be able to provide one semaphore per
    > potential backend.
    
    Right now, every so often we have a problem where all of a sudden the
    backends will just start piling up, we exceed 50-60 backends, and then the
    thing fails. The weird part is that sometimes it happens during times of
    the day which are very quiet and I wouldn't expect there to be that many
    tasks being done. I'm thinking something is getting jammed up in Postgres
    and then this occurs [more about this later]. We get the spinlock fail
    message and then we just restart, so it does "recover" in a way, although
    it would be better if it didn't die. At least I understand what is
    happening here ...
    
    > > We have also been doing some testing with the latest 6.5 from the other
    > > day, to check that certain problems we've bumped into have been fixed. We
    > > can't run it live, but we'll try to run our testing programs on it as a
    > > best approximation to help flush out any bugs that might be left.
    > 
    > OK, please let us know ASAP if you spot problems... we are shooting for
    > formal 6.5 release one week from today...
    
    OK, well, for the past two days or so we've still had the backends waiting
    problem like before, even though we installed the 6.4.2 shared memory
    patches (i.e., lots of backends waiting for nothing to happen - some kind
    of lock is getting left around by a backend). It has been running better
    than it was before, but we still get one problem or two per day, which
    isn't very good. This time, when we kill all the waiting backends, new
    backends will still jam anyway, so we kill and restart the whole thing.
    The problem appears to have changed from what it was before, where we
    could selectively kill off backends and eventually it would start working
    again.
    
    Unfortunately, this is not the kind of thing I can reproduce with a
    testing program, and so I can't try it against 6.5 - but it still exists
    in 6.4.2, so unless someone has made more changes related to this area,
    there might be a chance it is still in 6.5 - although the locking code has
    been changed a lot, so maybe not?
    
    Is there anything I can do, like enable some extra debugging code,
    #define, (I've tried turning on a few of the locking defines but they
    waiting for, so I or someone else can have a look and see if the problem
    can be spotted? I can get it to happen once or twice per day, but I can
    only test against 6.4.2 and it can't adversely affect the performance. 
    
    One thing I thought is this problem could still be related to the
    spinlock/semget problem, i.e., too many backends start up, something fails
    and dies off, but leaves a semaphore lying around, and so from then
    onwards, all the backends are waiting for this semaphore to go when it is
    still hanging around, causing problems ... The postmaster code fails to
    detect the stuck spinlock and so it looks like a different problem? Hope
    that made sense?
    
    thanks,
    Wayne
    
    ------------------------------------------------------------------------------
    Wayne Piekarski                               Tel:     (08) 8221 5221
    Research & Development Manager                Fax:     (08) 8221 5220
    SE Network Access Pty Ltd                     Mob:     0407 395 889
    222 Grote Street                              Email:   wayne@senet.com.au
    Adelaide SA 5000                              WWW:     http://www.senet.com.au