Thread

  1. Strong feeling of something ugly lurking deeply within 7.0 ;-)

    Christof Petig <christof.petig@wtal.de> — 2000-10-02T21:27:41Z

    The severity of this bug heavily depends on your lack of buggy programs.
    
    Short description:
    Long-standing open transactions combined with high-traffic updates and
    some regular vacuums eventually corrupt memory.
    
    Long description:
    Due to a design flaw within our ecpg programs (I don't recommend
    designing for autocommit off!) some transactions stayed open for several
    days. A process data collection system generates a lot of status change
    updates (3MB a day) to about 110 rows in a table at the same time.
    After every 1024 updates I vacuum the high-traffic table, which should
    shrink it to 16kB. First I noticed that vacuum did not free old tuples.
    This put me on the track of the real cause.
    
    For the past three weeks (with more buggy long-standing transactions) I
    have seen one major crash of the program system per week. For months I
    have seen some strange NOTICEs which went away after another vacuum.
    And this morning I found a 'possible memory corruption, killing other
    backends' message.
    
    The situation got better and better during the 7.0 development cycle (I
    started with a pre-beta version this January and reported some
    concurrent vacuum oddities at that time). And it got worse the more
    interactive programs we added.
    But until now I had not spotted the particular ingredient that causes
    the pain: long-standing transactions.
    
    It's not very bad. This seems to happen only under rare conditions.
    Until this week I thought of it as a minor oddity - a temporary nuisance.
    
    And: this is the current stable CVS tree, running on a 233MHz Pentium II,
    Linux 2.2.14(?)
    
    Sample Code:
        update bn_actual set meter=meter+1 where machine= ?; // repeat every second
    combined with
        begin transaction; // hold
        select something;
    and
        vacuum analyze; // once a day
    and
        vacuum bn_actual; // every 1024 updates
    
    and some others.
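
    To make the interleaving above easier to picture, here is a rough
    Python sketch that only generates the statement sequence and does not
    talk to a database. The function names and the simplification to one
    update per second are assumptions for illustration; the real system
    updates about 110 rows and runs several programs.

```python
from typing import Iterator

UPDATES_PER_VACUUM = 1024        # from the post: vacuum bn_actual every 1024 updates
SECONDS_PER_DAY = 24 * 60 * 60   # from the post: vacuum analyze once a day


def writer_statements(seconds: int) -> Iterator[str]:
    """Yield the statements of the writer session, assuming one update per second."""
    for tick in range(1, seconds + 1):
        yield "update bn_actual set meter=meter+1 where machine= ?"
        if tick % UPDATES_PER_VACUUM == 0:
            yield "vacuum bn_actual"
        if tick % SECONDS_PER_DAY == 0:
            yield "vacuum analyze"


def holder_statements() -> Iterator[str]:
    """Yield the statements of the long-lived session, which never commits."""
    yield "begin transaction"
    yield "select something"
```

    The holder session issues its two statements once and then sits idle
    while the writer session keeps going, so every vacuum in the writer's
    stream runs while the holder's transaction is still open.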
    
    PS: Of course I'm currently fixing the long transactions problem. I'll
    tell you once the system runs 4 weeks again without any strange
    occurrence.
    PPS: Yes, I'm following the hackers list.
    P3S: No, I don't believe in a hardware bug.
    
    
  2. Re: Strong feeling of something ugly lurking deeply within 7.0 ;-)

    Tom Lane <tgl@sss.pgh.pa.us> — 2000-10-03T04:51:07Z

    I think the cause here is probably a known problem.  The vacuums in
    parallel with the long-running transactions would result in periodic
    sinval message queue overflows, with resultant flushes of syscache
    entries in all active backends.  We know that there are places where
    syscache entry pointers are used longer than is safe --- i.e., it's
    possible for an entry to get flushed while some routine still has
    a pointer to it.  Finding all these places, or, better, redesigning the
    syscache mechanism to eliminate the issue completely, has been on the
    todo list for a while.
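
    The hazard described above can be sketched with a toy cache. This is
    a Python model for illustration only, not the backend's actual
    syscache code: ToyCache, read() and the "valid" flag are hypothetical
    names, and the flag stands in for the entry's memory being released.

```python
class ToyCache:
    """Toy stand-in for a catalog cache; not PostgreSQL's syscache API."""

    def __init__(self, loader):
        self._loader = loader   # fetches a fresh value on a cache miss
        self._entries = {}

    def lookup(self, key):
        """Return the cached entry for key, loading it on a miss."""
        entry = self._entries.get(key)
        if entry is None:
            entry = {"value": self._loader(key), "valid": True}
            self._entries[key] = entry
        return entry

    def flush(self):
        """Simulate the flush that follows an invalidation-queue overflow."""
        for entry in self._entries.values():
            entry["valid"] = False   # models the entry's memory being freed
        self._entries.clear()


def read(entry):
    """Use an entry; fail loudly if it was flushed while we held it."""
    if not entry["valid"]:
        raise RuntimeError("used a cache entry after it was flushed")
    return entry["value"]
```

    A caller that keeps the result of lookup() across a flush() and then
    calls read() on it hits the error, which is the toy analogue of the
    dangling pointer. A caller that copies out what it needs with read()
    before anything can trigger a flush, or looks the entry up again
    afterwards, does not.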
    
    In the short term I'd recommend that you avoid vacuuming system tables
    while there are other open transactions; that should reduce the
    incidence of overflows to a livable level.
    
    			regards, tom lane