Thread

  1. could not truncate directory "pg_serial": apparent wraparound

    Kevin Grittner <kevin.grittner@wicourts.gov> — 2011-06-08T00:28:36Z

    We had a report of the subject message during testing a while back
    and attempted to address the issue.  It can result in a LOG level
    message and the accumulation of files in the pg_serial SLRU
    subdirectory.  We haven't seen a recurrence, until I hit it during
    testing of the just-posted patch for SSI DDL.  I re-read the code
    and believe that the attached is the correct fix.
     
    -Kevin
    
    
  2. Re: could not truncate directory "pg_serial": apparent wraparound

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-06-08T19:30:31Z

    On 08.06.2011 03:28, Kevin Grittner wrote:
    > We had a report of the subject message during testing a while back
    > and attempted to address the issue.  It can result in a LOG level
    > message and the accumulation of files in the pg_serial SLRU
    > subdirectory.  We haven't seen a recurrence, until I hit it during
    > testing of the just-posted patch for SSI DDL.  I re-read the code
    > and believe that the attached is the correct fix.
    
    Doesn't this patch bring back the issue mentioned in the comment: the 
    slru page might not get used again until we wrap-around?
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  3. Re: could not truncate directory "pg_serial": apparent wraparound

    Kevin Grittner <kevin.grittner@wicourts.gov> — 2011-06-08T19:40:40Z

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
    > On 08.06.2011 03:28, Kevin Grittner wrote:
    >> We had a report of the subject message during testing a while
    >> back and attempted to address the issue.  It can result in a LOG
    >< level message and the accumulation of files in the pg_serial SLRU
    >> subdirectory.  We haven't seen a recurrence, until I hit it
    >> during testing of the just-posted patch for SSI DDL.  I re-read
    >> the code and believe that the attached is the correct fix.
    > 
    > Doesn't this patch bring back the issue mentioned in the comment:
    > the slru page might not get used again until we wrap-around?
     
    In the previous attempt I thought it was looking at the allocated
    files to assess that it was approaching wraparound.  In looking at
    the SLRU code last night, it seemed that it was really using the
    "last zeroed" page as the comparison point, and wanted the
    truncation point to precede that.  Given that last night didn't seem
    to be my sharpest hour, it would be wise to confirm this.  This code
    is pretty hard to test, since it only comes into play on overflow of
    the predicate locking shared memory structures, so desk-checking is
    important.
     
    So, to directly address your question, if we don't overflow again
    and go to the SLRU summary data, one file could sit in the pg_serial
    directory indefinitely; but we should avoid the LOG message and the
    accumulation of large numbers of such files -- if I'm reading the
    SLRU code correctly....
     
    -Kevin
    
    
  4. Re: could not truncate directory "pg_serial": apparent wraparound

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-06-09T08:56:41Z

    While testing this, I noticed another serious bug in the OldSerXidSLRU 
    handling: we never set the dirty-flag on any page. I believe the reason 
    we haven't bumped into this in testing before is that when a new page is 
    initialized, it's marked as dirty, so everything goes smoothly when we 
    modify recently-zeroed pages. But if a page falls out of the cache, and 
    is later read back in and modified, the modifications are lost.
    
    The comments in SLRU could be more explicit about this. It was 
    coincidental that I started to wonder where the pages are marked as 
    dirty, I somehow thought the SLRU functions do that for you.
    
    Fortunately the fix is very simple, we just need to set the page_dirty 
    flag whenever we modify an slru page. But clearly this slru stuff needs 
    more testing. It's pretty hard to write good repeatable test cases for 
    these things, though.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  5. Re: could not truncate directory "pg_serial": apparent wraparound

    Alvaro Herrera <alvherre@commandprompt.com> — 2011-06-09T14:40:24Z

    Excerpts from Heikki Linnakangas's message of jue jun 09 04:56:41 -0400 2011:
    
    > Fortunately the fix is very simple, we just need to set the page_dirty 
    > flag whenever we modify an slru page. But clearly this slru stuff needs 
    > more testing. It's pretty hard to write good repeatable test cases for 
    > these things, though.
    
    Maybe reduce the number of SLRU buffers used?
    
    -- 
    Álvaro Herrera <alvherre@commandprompt.com>
    The PostgreSQL Company - Command Prompt, Inc.
    PostgreSQL Replication, Consulting, Custom Development, 24x7 support
    
    
  6. Re: could not truncate directory "pg_serial": apparent wraparound

    Kevin Grittner <kevin.grittner@wicourts.gov> — 2011-06-09T15:09:23Z

    Alvaro Herrera <alvherre@commandprompt.com> wrote:
    > Excerpts from Heikki Linnakangas's message of jue jun 09 04:56:41
    -0400 2011:
    > 
    >> Fortunately the fix is very simple, we just need to set the
    >> page_dirty flag whenever we modify an slru page. But clearly this
    >> slru stuff needs more testing. It's pretty hard to write good
    >> repeatable test cases for these things, though.
    > 
    > Maybe reduce the number of SLRU buffers used?
     
    That's a thought.  I'll see about getting a build with
    TEST_OLDSERXID defined and just one or two SLRU buffers for this,
    and see if anything pops out of that.
     
    .... after I finish reviewing Dan's patch from last night.
     
    Thanks,
     
    -Kevin
    
    
  7. Re: could not truncate directory "pg_serial": apparent wraparound

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-06-09T17:42:05Z

    On 08.06.2011 22:40, Kevin Grittner wrote:
    > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  wrote:
    >> On 08.06.2011 03:28, Kevin Grittner wrote:
    >>> We had a report of the subject message during testing a while
    >>> back and attempted to address the issue.  It can result in a LOG
    >> <  level message and the accumulation of files in the pg_serial SLRU
    >>> subdirectory.  We haven't seen a recurrence, until I hit it
    >>> during testing of the just-posted patch for SSI DDL.  I re-read
    >>> the code and believe that the attached is the correct fix.
    >>
    >> Doesn't this patch bring back the issue mentioned in the comment:
    >> the slru page might not get used again until we wrap-around?
    >
    > In the previous attempt I thought it was looking at the allocated
    > files to assess that it was approaching wraparound.  In looking at
    > the SLRU code last night, it seemed that it was really using the
    > "last zeroed" page as the comparison point, and wanted the
    > truncation point to precede that.
    
    I've committed your patch for now. It does in fact bring back the 
    original problem the clever but broken code was trying to address: if a 
    pg_serial is not needed for a very long time, the last segment that we 
    leave behind will eventually appear to be new again, and won't be 
    cleaned up until a lot more XIDs pass.
    
    > So, to directly address your question, if we don't overflow again
    > and go to the SLRU summary data, one file could sit in the pg_serial
    > directory indefinitely; but we should avoid the LOG message and the
    > accumulation of large numbers of such files -- if I'm reading the
    > SLRU code correctly....
    
    Yeah, as far as I can see it's harmless except for the small waste of 
    disk space. We probably want to revisit this later, maybe still for 9.1 
    or maybe not, but for now I just put in a comment about it.
    
    I also fixed the broken warning logic. Please double-check that too when 
    you get a chance.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  8. Re: could not truncate directory "pg_serial": apparent wraparound

    Kevin Grittner <kevin.grittner@wicourts.gov> — 2011-06-09T18:15:57Z

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
     
    > I've committed your patch for now.
     
    Thanks!
     
    I don't see it yet on the public git repo, nor on the -commiters
    list.  I'll keep an eye out for it.
     
    > as far as I can see it's harmless except for the small waste of 
    > disk space. We probably want to revisit this later, maybe still
    > for 9.1 or maybe not, but for now I just put in a comment about
    > it.
     
    I'm inclined to think it's not worth the trouble to revisit it.  At
    most it would be one segment.  In fact, in line with your desire to
    flush the SLRU pages so they're useful for debugging, we could call
    this a feature -- it provides a quick way to check the date and time
    of the last need for SLRU summarization.  I can see that being a
    useful piece of information for forensics.
     
    -Kevin
    
    
  9. Re: could not truncate directory "pg_serial": apparent wraparound

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-06-09T18:40:25Z

    On 09.06.2011 21:15, Kevin Grittner wrote:
    > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  wrote:
    >
    >> I've committed your patch for now.
    >
    > Thanks!
    >
    > I don't see it yet on the public git repo, nor on the -commiters
    > list.  I'll keep an eye out for it.
    
    Oh, rats! Forgot to push.. Will do so now.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  10. Re: could not truncate directory "pg_serial": apparent wraparound

    Kevin Grittner <kevin.grittner@wicourts.gov> — 2011-06-09T19:32:29Z

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
     
    > I also fixed the broken warning logic. Please double-check that
    > too when you get a chance.
     
    As usual, I like your code better than mine.  I don't think my code
    was broken in the sense that it would generate different results
    than yours, but it was too clever by half, and depended on knowledge
    that a TransactionId is 32 bits.  Yours is more tolerant of changes
    to implementation details, and much easier to read and understand.
     
    While I'm *pretty sure* my code for this worked, I am *positive*
    that yours does.  :-)
     
    Thanks again.
     
    -Kevin