Thread

  1. Patch to improve reliability of postgresql on linux nfs

    George Barnett <gbarnett@atlassian.com> — 2011-09-09T00:04:58Z

    Hi Hackers,
    
  2. Re: Patch to improve reliability of postgresql on linux nfs

    Josh Berkus <josh@agliodbs.com> — 2011-09-09T00:16:59Z

    George,
    
    I'm quoting you here because in the version of your email which got
    posted to the list your whole explanation got put below the patch text,
    making it hard to find the justification for the patch.  Follows:
    
    > I run a number of postgresql installations on NFS and on the whole I find this to be very reliable.  I have however run into a few issues when there is concurrent writes from multiple processes.
    > 
    > I see errors such as the following:
    > 
    > 2011-07-31 22:13:35 EST postgres postgres [local] LOG:  connection authorized: user=postgres database=postgres
    > 2011-07-31 22:13:35 EST    ERROR:  could not write block 1 of relation global/2671: wrote only 4096 of 8192 bytes
    > 2011-07-31 22:13:35 EST    HINT:  Check free disk space.
    > 2011-07-31 22:13:35 EST    CONTEXT:  writing block 1 of relation global/2671
    > 2011-07-31 22:13:35 EST [unknown] [unknown]  LOG:  connection received: host=[local]
    > 
    > I have also seen similar errors coming out of the WAL writer, however they occur at the level PANIC, which is a little more drastic.
    > 
    > After spending some time with debug logging turned on and even more time staring at strace, I believe this occurs when one process was writing to a data file and it received a SIGINT from another process, eg:
    > (These logs are from another similar run)
    > 
    > [pid  1804] <... fsync resumed> )       = 0
    > [pid 10198] kill(1804, SIGINT <unfinished ...>
    > [pid  1804] lseek(3, 4915200, SEEK_SET) = 4915200
    > [pid  1804] write(3, "c\320\1\0\1\0\0\0\0\0\0\0\0\0K\2\6\1\0\0\0\0\373B\0\0\0\0\2\0m\0"..., 32768 <unfinished ...>
    > [pid 10198] <... kill resumed> )        = 0
    > [pid  1804] <... write resumed> )       = 4096
    > [pid  1804] --- SIGINT (Interrupt) @ 0 (0) ---
    > [pid  1804] rt_sigreturn(0x2)           = 4096
    > [pid  1804] write(2, "\0\0\373\0\f\7\0\0t2011-08-30 20:29:52.999"..., 260) = 260
    > [pid  1804] rt_sigprocmask(SIG_UNBLOCK, [ABRT],  <unfinished ...>
    > [pid  1802] <... select resumed> )      = 1 (in [5], left {0, 999000})
    > [pid  1804] <... rt_sigprocmask resumed> NULL, 8) = 0
    > [pid  1804] tgkill(1804, 1804, SIGABRT) = 0
    > [pid  1802] read(5,  <unfinished ...>
    > [pid  1804] --- SIGABRT (Aborted) @ 0 (0) ---
    > Process 1804 detached
    > 
    > After finding this, I came up with the following test case which easily replicated our issue:
    > 
    > #!/bin/bash
    > 
    > name=$1
    > number=1
    > while true; do 
    >   /usr/bin/psql -c "CREATE USER \"$name$number\" WITH NOSUPERUSER INHERIT NOCREATEROLE NOCREATEDB LOGIN PASSWORD 'pass';"
    >   /usr/bin/createdb -E UNICODE -O $name$number $name$number
    >   if `grep -q PANIC /data/postgresql/data/pg_log/*`; then
    >     exit
    >   fi
    >   let number=$number+1
    > done
    > 
    > When I run a single copy of this script, I have no issues, however when I start up a few more copies to simultaneously hit the DB, it crashes quiet quickly - usually within 20 or 30 seconds.
    > 
    > After looking through the code I found that when postgres calls write() it doesn't retry.  In order to address the issue with the PANIC in the WAL writer I set the sync method to o_sync which solved the issue in that part of the code, however I was still seeing failures in other areas of the code (such as the FileWrite function).  Following this, I spoke to an NFS guru who pointed out that writes under linux are not guaranteed to complete unless you open up O_SYNC or similar on the file handle.  I had a look in the libc docs and found this:
    > 
    > http://www.gnu.org/s/libc/manual/html_node/I_002fO-Primitives.html
    > 
    > ----
    > The write function writes up to size bytes from buffer to the file with descriptor filedes. The data in buffer is not necessarily a character string and a null character is output like any other character.
    > 
    > The return value is the number of bytes actually written. This may be size, but can always be smaller. Your program should always call write in a loop, iterating until all the data is written.
    > ----
    > 
    > After finding this, I checked a number of other pieces of software that we see no issues with on NFS (such as the JVM) for their usage of write().  I confirmed they write in a while loop and set about patching the postgres source.
    > 
    > I have made this patch against 8.4.8 and confirmed that it fixes the issue we see on our systems.  I have also checked that make check still passes. 
    > 
    > As my C is terrible, I would welcome any comments on the implementation of this patch.
    > 
    > Best regards,
    > 
    > George
    
    
    -- 
    Josh Berkus
    PostgreSQL Experts Inc.
    http://pgexperts.com
    
    
  3. Re: Patch to improve reliability of postgresql on linux nfs

    Thom Brown <thom@linux.com> — 2011-09-09T08:38:58Z

    On 9 September 2011 01:04, George Barnett <gbarnett@atlassian.com> wrote:
    > After looking through the code I found that when postgres calls write() it doesn't retry.  In order to address the issue with the PANIC in the WAL writer I set the sync method to o_sync which solved the issue in that part of the code, however I was still seeing failures in other areas of the code (such as the FileWrite function).  Following this, I spoke to an NFS guru who pointed out that writes under linux are not guaranteed to complete unless you open up O_SYNC or similar on the file handle.
    
    Have you run the test with varying wal_sync_method values?  On Linux
    the default is fdatasync because historically Linux hasn't supported
    O_DSYNC (a wal_sync_method value of open_datasync).  But I believe as
    of Kernel 2.6.33 it does support it.  Have you tried modifying this
    parameter in your tests?  Are you even using Linux? (you haven't
    specified)
    
    -- 
    Thom Brown
    Twitter: @darkixion
    IRC (freenode): dark_ixion
    Registered Linux user: #516935
    
    EnterpriseDB UK: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  4. Re: Patch to improve reliability of postgresql on linux nfs

    Peter Eisentraut <peter_e@gmx.net> — 2011-09-09T09:43:07Z

    On fre, 2011-09-09 at 10:04 +1000, George Barnett wrote:
    > After looking through the code I found that when postgres calls
    > write() it doesn't retry.  In order to address the issue with the
    > PANIC in the WAL writer I set the sync method to o_sync which solved
    > the issue in that part of the code, however I was still seeing
    > failures in other areas of the code (such as the FileWrite function).
    > Following this, I spoke to an NFS guru who pointed out that writes
    > under linux are not guaranteed to complete unless you open up O_SYNC
    > or similar on the file handle.
    
    I've had this problem many years ago.  I recall that changing the mount
    options for NFS also fixed it.  Could you post what mount options you
    are using.
    
    (We eventually moved away from NFS at that time, so I didn't pursue it
    further, but my analysis back then matched yours.)
    
    
    
    
  5. Re: Patch to improve reliability of postgresql on linux nfs

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-09T14:27:22Z

    George Barnett <gbarnett@atlassian.com> writes:
    > [ patch to retry writes on NFS ]
    
    I'm having a hard time getting excited about this idea, because IMO
    NFS is insufficiently reliable to run a database on, and no patch like
    this can fix that.  However, some concrete points:
    
    1. If writes need to be retried, why not reads?  (No, that's not an
    invitation to expand the scope of the patch; it's a question about NFS
    implementation.)
    
    2. What is the rationale for supposing that a retry a nanosecond later
    will help?  If it will help, why didn't the kernel just do that?
    
    3. What about EINTR?  If you believe that this is necessary, then the
    kernel logically has to return EINTR when it would like you to retry but
    it hasn't been able to write any bytes yet.  If you get a zero return
    you have to assume that means out-of-disk-space.
    
    4. As coded, the patch behaves incorrectly if you get a zero return on a
    retry.  If we were going to do this, I think we'd need to absorb the
    errno-munging currently done by callers into the writeAll function.
    
    On the whole I think you'd be better off lobbying your NFS implementors
    to provide something closer to the behavior of every other filesystem on
    the planet.  Or checking to see if you need to adjust your NFS
    configuration, as the other responders mentioned.
    
    			regards, tom lane
    
    
  6. Re: Patch to improve reliability of postgresql on linux nfs

    Bernd Helmle <mailings@oopsware.de> — 2011-09-09T15:30:02Z

    
    --On 9. September 2011 10:27:22 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:
    
    > On the whole I think you'd be better off lobbying your NFS implementors
    > to provide something closer to the behavior of every other filesystem on
    > the planet.  Or checking to see if you need to adjust your NFS
    > configuration, as the other responders mentioned.
    
    You really need at least mount options 'hard' _and_ 'nointr' on NFS mounts, 
    otherwise you are out of luck. Oracle and DB2 guys recommend those settings and 
    without them any millisecond of network glitch could disturb things 
    unreasonably.
    
    -- 
    Thanks
    
    	Bernd
    
    
  7. Re: Patch to improve reliability of postgresql on linux nfs

    Noah Misch <noah@leadboat.com> — 2011-09-09T17:44:05Z

    On Fri, Sep 09, 2011 at 10:27:22AM -0400, Tom Lane wrote:
    > George Barnett <gbarnett@atlassian.com> writes:
    > > [ patch to retry writes on NFS ]
    > 
    > I'm having a hard time getting excited about this idea, because IMO
    > NFS is insufficiently reliable to run a database on, and no patch like
    > this can fix that.  However, some concrete points:
    > 
    > 1. If writes need to be retried, why not reads?  (No, that's not an
    > invitation to expand the scope of the patch; it's a question about NFS
    > implementation.)
    
    To support all POSIX:2008-conforming read() implementations, we must indeed
    retry reads.  I suppose the OP never encountered this in practice, though.
    
    > 2. What is the rationale for supposing that a retry a nanosecond later
    > will help?  If it will help, why didn't the kernel just do that?
    
    POSIX:2008 unconditionally permits[1] partial writes of requests exceeding 512
    bytes (_POSIX_PIPE_BUF).  We shouldn't complain when a kernel provides a
    conforming write(), even if it appears that the kernel achieved little by
    using some freedom afforded it.
    
    > 3. What about EINTR?  If you believe that this is necessary, then the
    > kernel logically has to return EINTR when it would like you to retry but
    > it hasn't been able to write any bytes yet.  If you get a zero return
    > you have to assume that means out-of-disk-space.
    
    Assuming conforming SA_RESTART for write()/read(), this will not happen.  The
    call will restart and resume blocking until it writes something.
    
    nm
    
    [1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html
    
    
  8. Re: Patch to improve reliability of postgresql on linux nfs

    Kevin Grittner <kevin.grittner@wicourts.gov> — 2011-09-09T18:38:08Z

    Noah Misch <noah@leadboat.com> wrote:
     
    > We shouldn't complain when a kernel provides a conforming write(),
    > even if it appears that the kernel achieved little by using some
    > freedom afforded it.
     
    I remember we had some compiler warnings in the logging area because
    we were making the assumption that no implementation would ever use
    that freedom.  I suggested we replace the simple, unchecked calls to
    write with a function which did what we expected through an API
    conforming loop:
     
    http://archives.postgresql.org/pgsql-hackers/2011-02/msg01719.php
     
    The response was that we could ignore the documented API because we
    had "never seen nor heard of it being true for writes to disk
    files".  I'm still uncomfortable with that.  Where I have seen
    people code to implementation details rather than the documented
    API, it has often not turned out well in the long run.
     
    I'd still be willing to put together a patch for that if people buy
    into it.
     
    -Kevin
    
    
  9. Re: Patch to improve reliability of postgresql on linux nfs

    Bruce Momjian <bruce@momjian.us> — 2011-09-09T23:06:49Z

    Tom Lane wrote:
    > George Barnett <gbarnett@atlassian.com> writes:
    > > [ patch to retry writes on NFS ]
    > 
    > I'm having a hard time getting excited about this idea, because IMO
    > NFS is insufficiently reliable to run a database on, and no patch like
    > this can fix that.  However, some concrete points:
    > 
    > 1. If writes need to be retried, why not reads?  (No, that's not an
    > invitation to expand the scope of the patch; it's a question about NFS
    > implementation.)
    > 
    > 2. What is the rationale for supposing that a retry a nanosecond later
    > will help?  If it will help, why didn't the kernel just do that?
    > 
    > 3. What about EINTR?  If you believe that this is necessary, then the
    > kernel logically has to return EINTR when it would like you to retry but
    > it hasn't been able to write any bytes yet.  If you get a zero return
    > you have to assume that means out-of-disk-space.
    > 
    > 4. As coded, the patch behaves incorrectly if you get a zero return on a
    > retry.  If we were going to do this, I think we'd need to absorb the
    > errno-munging currently done by callers into the writeAll function.
    > 
    > On the whole I think you'd be better off lobbying your NFS implementors
    > to provide something closer to the behavior of every other filesystem on
    > the planet.  Or checking to see if you need to adjust your NFS
    > configuration, as the other responders mentioned.
    
    Can our NFS documentation be improved (section 17.2.1)?
    
    	http://www.postgresql.org/docs/9.1/static/creating-cluster.html
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        http://momjian.us
      EnterpriseDB                             http://enterprisedb.com
    
      + It's impossible for everything to be true. +
    
    
  10. Re: Patch to improve reliability of postgresql on linux nfs

    George Barnett <gbarnett@atlassian.com> — 2011-09-12T04:30:17Z

    On 10/09/2011, at 1:30 AM, Bernd Helmle wrote:
    
    > --On 9. September 2011 10:27:22 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > 
    >> On the whole I think you'd be better off lobbying your NFS implementors
    >> to provide something closer to the behavior of every other filesystem on
    >> the planet.  Or checking to see if you need to adjust your NFS
    >> configuration, as the other responders mentioned.
    > 
    > You really need at least mount options 'hard' _and_ 'nointr' on NFS mounts, otherwise you are out of luck. Oracle and DB2 guys recommend those settings and without them any millisecond of network glitch could disturb things unreasonably.
    
    Hi,
    
    My mount options include hard and intr.
    
    George
    
  11. Re: Patch to improve reliability of postgresql on linux nfs

    Florian G. Pflug <fgp@phlo.org> — 2011-09-12T05:59:32Z

    On Sep12, 2011, at 06:30 , George Barnett wrote:
    > On 10/09/2011, at 1:30 AM, Bernd Helmle wrote:
    > 
    >> --On 9. September 2011 10:27:22 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >> 
    >>> On the whole I think you'd be better off lobbying your NFS implementors
    >>> to provide something closer to the behavior of every other filesystem on
    >>> the planet.  Or checking to see if you need to adjust your NFS
    >>> configuration, as the other responders mentioned.
    >> 
    >> You really need at least mount options 'hard' _and_ 'nointr' on NFS mounts, otherwise you are out of luck. Oracle and DB2 guys recommend those settings and without them any millisecond of network glitch could disturb things unreasonably.
    > 
    > My mount options include hard and intr.
    
    If you really meant to say "intr" there (and not "nointr") then that probably explains the partial writes.
    
    Still, I agree with Noah and Kevin that we ought to deal more gracefully with this, i.e. resubmit after a partial read() or write(). AFAICS there's nothing to be gained by not doing that, and the increase in code complexity should be negligible. If we do that, however, I believe we might as well handle EINTR correctly, even if SA_RESTART should prevent us from ever seeing that.
    
    best regards,
    Florian Pflug
    
    
    
  12. Re: Patch to improve reliability of postgresql on linux nfs

    George Barnett <gbarnett@atlassian.com> — 2011-09-12T06:46:53Z

    On 12/09/2011, at 3:59 PM, Florian Pflug wrote:
    
    > If you really meant to say "intr" there (and not "nointr") then that probably explains the partial writes.
    > 
    > Still, I agree with Noah and Kevin that we ought to deal more gracefully with this, i.e. resubmit after a partial read() or write(). AFAICS there's nothing to be gained by not doing that, and the increase in code complexity should be negligible. If we do that, however, I believe we might as well handle EINTR correctly, even if SA_RESTART should prevent us from ever seeing that.
    
    
    Hi Florian,
    
    You are indeed correct.  Setting nointr also resolves my issue.  I could swear I checked this, but obviously not.
    
    It does still concern me that pgsql did not deal with this as gracefully as other software.  I hope the list will consider a patch to resolve that.
    
    Thanks in advance,
    
    George
    
  13. Re: Patch to improve reliability of postgresql on linux nfs

    Peter Eisentraut <peter_e@gmx.net> — 2011-09-12T12:54:39Z

    On mån, 2011-09-12 at 16:46 +1000, George Barnett wrote:
    > On 12/09/2011, at 3:59 PM, Florian Pflug wrote:
    > 
    > > If you really meant to say "intr" there (and not "nointr") then that probably explains the partial writes.
    > > 
    > > Still, I agree with Noah and Kevin that we ought to deal more gracefully with this, i.e. resubmit after a partial read() or write(). AFAICS there's nothing to be gained by not doing that, and the increase in code complexity should be negligible. If we do that, however, I believe we might as well handle EINTR correctly, even if SA_RESTART should prevent us from ever seeing that.
    > 
    > 
    > Hi Florian,
    > 
    > You are indeed correct.  Setting nointr also resolves my issue.  I could swear I checked this, but obviously not.
    > 
    > It does still concern me that pgsql did not deal with this as gracefully as other software.  I hope the list will consider a patch to resolve that.
    
    We have signal handling configured so that system calls are not
    interrupted.  So there is ordinarily no reason to do anything more
    graceful.  The problem is that NFS is in this case not observing that
    setting.  It's debatable whether it's worth supporting that; just saying
    that the code is correct as it stands.
    
    
    
    
  14. Re: Patch to improve reliability of postgresql on linux nfs

    Kenneth Marshall <ktm@rice.edu> — 2011-09-12T12:54:52Z

    On Mon, Sep 12, 2011 at 04:46:53PM +1000, George Barnett wrote:
    > On 12/09/2011, at 3:59 PM, Florian Pflug wrote:
    > 
    > > If you really meant to say "intr" there (and not "nointr") then that probably explains the partial writes.
    > > 
    > > Still, I agree with Noah and Kevin that we ought to deal more gracefully with this, i.e. resubmit after a partial read() or write(). AFAICS there's nothing to be gained by not doing that, and the increase in code complexity should be negligible. If we do that, however, I believe we might as well handle EINTR correctly, even if SA_RESTART should prevent us from ever seeing that.
    > 
    > 
    > Hi Florian,
    > 
    > You are indeed correct.  Setting nointr also resolves my issue.  I could swear I checked this, but obviously not.
    > 
    > It does still concern me that pgsql did not deal with this as gracefully as other software.  I hope the list will consider a patch to resolve that.
    > 
    > Thanks in advance,
    > 
    > George
    
    Hi George,
    
    Many, many, many other software packages expect I/O usage to be the same on
    an NFS volume and a local disk volume, including Oracle. Coding every application,
    or more likely mis-coding, to handle this gives every application another chance
    to get it wrong. If the OS does this, when it gets it right, all of the apps get
    it right. I think you should be surprised when other software actually deals with
    broken I/O semantics gracefully rather than concerned when one of a pantheon of
    programs does not. My two cents.
    
    Regards,
    Ken
    
    
  15. Re: Patch to improve reliability of postgresql on linux nfs

    Florian G. Pflug <fgp@phlo.org> — 2011-09-12T13:27:00Z

    On Sep12, 2011, at 14:54 , Peter Eisentraut wrote:
    > On mån, 2011-09-12 at 16:46 +1000, George Barnett wrote:
    >> On 12/09/2011, at 3:59 PM, Florian Pflug wrote:
    >>> Still, I agree with Noah and Kevin that we ought to deal more gracefully with this, i.e. resubmit after a partial read() or write(). AFAICS there's nothing to be gained by not doing that, and the increase in code complexity should be negligible. If we do that, however, I believe we might as well handle EINTR correctly, even if SA_RESTART should prevent us from ever seeing that.
    >> 
    >> It does still concern me that pgsql did not deal with this as gracefully as other software.  I hope the list will consider a patch to resolve that.
    > 
    > We have signal handling configured so that system calls are not
    > interrupted.  So there is ordinarily no reason to do anything more
    > graceful.  The problem is that NFS is in this case not observing that
    > setting.  It's debatable whether it's worth supporting that; just saying
    > that the code is correct as it stands.
    
    SA_RESTART doesn't protect against partial reads/writes due to signal delivery,
    it only removes the need to check for EINTR. In other words, it retries until
    at least one byte has been written, not until all bytes have been written.
    
    The GNU LibC documentation has this to say on the subject
    
      "There is one situation where resumption never happens no matter which
       choice you make: when a data-transfer function such as read or write is
       interrupted by a signal after transferring part of the data. In this case,
       the function returns the number of bytes already transferred, indicating
       partial success."[1]
    
    While it's true that reads and writes are by tradition non-interruptible, I
    personally wouldn't bet that they'll stay that way forever. It all depends on
    whether the timeouts involved in the communication with a disk are short enough
    to mask the difference - once they get too long (or even infinite like in the
    case of "hard" NFS mounts) you pay for non-interruptible primitives with
    un-killable stuck processes. Since the current trend is to move storage further
    away from processing, and to put non-deterministic networks like ethernet between
    the two, I expect situations in which read/write primitives are interruptible
    to increase, not decrease.
    
    And BTW, the GNU LibC documentations doesn't seem to mention anything about
    local reads and writes being non-interruptible. In fact, it even says
    
      "A signal can arrive and be handled while an I/O primitive such as open or read
       is waiting for an I/O device. If the signal handler returns, the system faces
       the question: what should happen next?"[1]
    
    Had the GNU people faith in local read and writes being non-interruptible, they'd
    probably have said "network device" or "remove device", not "I/O device".
    
    best regards,
    Florian Pflug
    
    [1] http://www.gnu.org/s/hello/manual/libc/Interrupted-Primitives.html#Interrupted-Primitives
    
  16. Re: Patch to improve reliability of postgresql on linux nfs

    Florian G. Pflug <fgp@phlo.org> — 2011-09-12T13:39:02Z

    On Sep12, 2011, at 14:54 , ktm@rice.edu wrote:
    > Many, many, many other software packages expect I/O usage to be the same on
    > an NFS volume and a local disk volume, including Oracle. Coding every application,
    > or more likely mis-coding, to handle this gives every application another chance
    > to get it wrong. If the OS does this, when it gets it right, all of the apps get
    > it right. I think you should be surprised when other software actually deals with
    > broken I/O semantics gracefully rather than concerned when one of a pantheon of
    > programs does not. My two cents.
    
    I don't buy that. People seem to be perfectly able to code correct networking
    applications (correct from a read/write API POV at least), yet those applications
    need to deal with partial reads and writes too.
    
    Really, it's not *that* hard to put a retry loop around "read" and "write".
    
    Also, non-interruptible IO primitives are by no means "right". At best, they're
    a compromise between complexity and functionality for I/O devices with rather
    short (and bounded) communication timeouts - because in that case, processes are
    only blocked un-interruptibly for a short while.
    
    best regards,
    Florian Pflug
    
    
    
  17. Re: Patch to improve reliability of postgresql on linux nfs

    Robert Haas <robertmhaas@gmail.com> — 2011-09-12T14:37:35Z

    On Mon, Sep 12, 2011 at 9:39 AM, Florian Pflug <fgp@phlo.org> wrote:
    > Really, it's not *that* hard to put a retry loop around "read" and "write".
    
    +1.  I don't see what we're gaining by digging in our heels on this
    one.  Retry loops around read() and write() are pretty routine, and I
    wouldn't like to bet this is the only case where not having them could
    cause an unnecessary failure.
    
    Now, that having been said, I *really* think we could use some better
    documentation on which mount options we believe to be safe, and not
    just for NFS.  Right now we have practically nothing.
    
    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  18. Re: Patch to improve reliability of postgresql on linux nfs

    Florian G. Pflug <fgp@phlo.org> — 2011-09-13T11:07:42Z

    [CC'ing to the list again - I assume you omitted pgsql-hackers from the
    recipient list by accident]
    
    On Sep13, 2011, at 03:00 , George Barnett wrote:
    > On 12/09/2011, at 11:39 PM, Florian Pflug wrote:
    >> Also, non-interruptible IO primitives are by no means "right". At best, they're
    >> a compromise between complexity and functionality for I/O devices with rather
    >> short (and bounded) communication timeouts - because in that case, processes are
    >> only blocked un-interruptibly for a short while.
    > 
    > Just to expand on that - I'm now in the situation where I can run my nfs mounts
    > 'nointr' and postgres will work, but that means if I lose a storage unit I have
    > a number of stuck processes, effectively meaning I need to reboot all my frontend
    > servers before I can fail over to backup nfs stores.
    > 
    > However, if I run the mounts with intr, then if a storage unit fails, I can fail
    > over to a backup node (taking a minor loss of data hit I'm willing to accept) but
    > postgres breaks under a moderate insert load.
    > 
    > With the patch I supplied though, I'm able to have most of my cake and eat it.
    > 
    > I'd be very interested in moving this forward - is there something I can change
    > in the patch to make it more acceptable for a merge?
    
    Here are a few comments
    
    Tom already remarked that if we do that for write()s, we ought to do it for read()s
    also which I agree with. All other primitives like lseek, close, ... should be taken
    care of by SA_RESTART, but I'd be a good idea to verify that.
    
    Also, I don't think that POSIX mandates that errno be reset to 0 if a function returns
    successfully, making that "returnCode == 0 && errno == 0" check pretty dubious. I'm not
    sure of this was what Tom was getting at with his remark about the ENOSPC handling being
    wrong in the retry case.
    
    And I also think that if we do this, we might as well handle EINTR correctly, even if
    our use of SA_RESTART should prevent us from ever seeing that. The rules surrounding
    EINTR and SA_RESTART for read/write are quite subtle...
    
    If we retry, shouldn't be do CHECK_FOR_INTERRUPTS? Otherwise, processes waiting for
    a vanished NFS server would be killable only with SIGKILL, not SIGTERM or SIGINT.
    But I'm not sure if it's safe to put that into a generic function like pg_write_nointr.
    
    Finally, WriteAll() seems like a poor name for that function. How about pg_write_nointr()?
    
    Here's my suggested implementation for pg_write_nointr. pg_read_nointr should be similar
    (but obviously without the ENOSPC handling)
    
    int pg_write_nointr(int fd, const void *bytes, Size amount)
    {
      int written = 0;
    
      while (amount > 0)
      {
        int ret;
    
        ret = write(fd, bytes, amount);
        if ((ret < 0) && (errno == EINTR))
        {
          /* interrupted by signal before first byte was written. Retry */
          /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
          CHECK_FOR_INTERRUPTS();
          continue;
        }
        else if (ret < 0)
        {
          /* error occurred. Abort */
          return -1;
        }
        else if (ret == 0)
        {
          /* out of disk space. Abort */
          return written;
        }
    
        /* made progress */
    
        /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
        CHECK_FOR_INTERRUPTS();
    
        written += ret;
        amount -= ret;
        bytes = (const char *) bytes + ret;
      }
    }
    
    best regards,
    Florian Pflug
    
    
    
    
  19. Re: Patch to improve reliability of postgresql on linux nfs

    Florian G. Pflug <fgp@phlo.org> — 2011-09-13T11:30:34Z

    On Sep13, 2011, at 13:07 , Florian Pflug wrote:
    > Here's my suggested implementation for pg_write_nointr. pg_read_nointr should be similar
    > (but obviously without the ENOSPC handling)
    > 
    > <wrong pg_write_nointr implementation snipped>
    
    Sorry for the self-reply. I realized only after hitting send that I
    got the ENOSPC handling wrong again - we probably ought to check for
    ENOSPC as well as ret == 0. Also, it seems preferable to return the
    number of bytes actually written instead of -1 if we hit an error during
    retry.
    
    With this version, any return value other than <amount> signals an
    error, the number of actually written bytes is reported even in the
    case of an error (to the best of pg_write_nointr's knowledge), and
    errno always indicates the kind of error.
    
    int pg_write_nointr(int fd, const void *bytes, Size amount)
    {
     int written = 0;
    
     while (amount > 0)
     {
       int ret;
    
       ret = write(fd, bytes, amount);
    
       if ((ret < 0) && (errno == EINTR))
       {
         /* interrupted by signal before first byte was written. Retry */
    
         /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
         CHECK_FOR_INTERRUPTS();
    
         continue;
       }
       else if (ret < 1)
       {
         /* error occurred. Abort */
    
         if (ret == 0)
           /* out of disk space */
           errno = ENOSPC;
    
         if (written == 0)
           return -1;
         else
           return written;
       }
    
       /* made progress */
       written += ret;
       amount -= ret;
       bytes = (const char *) bytes + ret;
       
       /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
       CHECK_FOR_INTERRUPTS();
     }
    }
    
    best regards,
    Florian Pflug
    
    
    
  20. Re: Patch to improve reliability of postgresql on linux nfs

    Kenneth Marshall <ktm@rice.edu> — 2011-09-13T12:58:22Z

    On Tue, Sep 13, 2011 at 01:30:34PM +0200, Florian Pflug wrote:
    > On Sep13, 2011, at 13:07 , Florian Pflug wrote:
    > > Here's my suggested implementation for pg_write_nointr. pg_read_nointr should be similar
    > > (but obviously without the ENOSPC handling)
    > > 
    > > <wrong pg_write_nointr implementation snipped>
    > 
    > Sorry for the self-reply. I realized only after hitting send that I
    > got the ENOSPC handling wrong again - we probably ought to check for
    > ENOSPC as well as ret == 0. Also, it seems preferable to return the
    > number of bytes actually written instead of -1 if we hit an error during
    > retry.
    > 
    > With this version, any return value other than <amount> signals an
    > error, the number of actually written bytes is reported even in the
    > case of an error (to the best of pg_write_nointr's knowledge), and
    > errno always indicates the kind of error.
    > 
    > int pg_write_nointr(int fd, const void *bytes, Size amount)
    > {
    >  int written = 0;
    > 
    >  while (amount > 0)
    >  {
    >    int ret;
    > 
    >    ret = write(fd, bytes, amount);
    > 
    >    if ((ret < 0) && (errno == EINTR))
    >    {
    >      /* interrupted by signal before first byte was written. Retry */
    > 
    >      /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
    >      CHECK_FOR_INTERRUPTS();
    > 
    >      continue;
    >    }
    >    else if (ret < 1)
    >    {
    >      /* error occurred. Abort */
    > 
    >      if (ret == 0)
    >        /* out of disk space */
    >        errno = ENOSPC;
    > 
    >      if (written == 0)
    >        return -1;
    >      else
    >        return written;
    >    }
    > 
    >    /* made progress */
    >    written += ret;
    >    amount -= ret;
    >    bytes = (const char *) bytes + ret;
    >    
    >    /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
    >    CHECK_FOR_INTERRUPTS();
    >  }
    > }
    > 
    > best regards,
    > Florian Pflug
    > 
    
    It will be interesting to see if there are any performance ramifications to
    this new write function.
    
    Regards,
    Ken
    
    
  21. Re: Patch to improve reliability of postgresql on linux nfs

    Florian G. Pflug <fgp@phlo.org> — 2011-09-13T13:02:57Z

    On Sep13, 2011, at 14:58 , ktm@rice.edu wrote:
    > It will be interesting to see if there are any performance ramifications to
    > this new write function.
    
    What would those be? For non-interruptible reads and writes, the overhead
    comes down to an additional function call (if we don't make pg_write_nointr
    inlined) and a few conditional jumps (which branch prediction should be
    able to take care of). These are bound to disappear in the noise compared
    to the cost of the actual syscall.
    
    best regards,
    Florian Pflug
    
    
    
  22. Re: Patch to improve reliability of postgresql on linux nfs

    Aidan Van Dyk <aidan@highrise.ca> — 2011-09-13T13:05:38Z

    On Tue, Sep 13, 2011 at 7:30 AM, Florian Pflug <fgp@phlo.org> wrote:
    
    >
    > Sorry for the self-reply. I realized only after hitting send that I
    > got the ENOSPC handling wrong again - we probably ought to check for
    > ENOSPC as well as ret == 0. Also, it seems preferable to return the
    > number of bytes actually written instead of -1 if we hit an error during
    > retry.
    >
    > With this version, any return value other than <amount> signals an
    > error, the number of actually written bytes is reported even in the
    > case of an error (to the best of pg_write_nointr's knowledge), and
    > errno always indicates the kind of error.
    
    Personally, I'ld think that's ripe for bugs.   If the contract is that
    ret != amount is the "error" case, then don't return -1 for an error
    *sometimes*.
    
    If you sometimes return -1 for an error, even though ret != amount is
    the *real* test, I'm going to guess there will be lots of chance for
    code to do:
      if (pg_write_no_intr(...) < 0)
       ...
    
    which will only catch some of the errors, and happily continue with the rest...
    
    a.
    
    -- 
    Aidan Van Dyk                                             Create like a god,
    aidan@highrise.ca                                       command like a king,
    http://www.highrise.ca/                                   work like a slave.
    
    
  23. Re: Patch to improve reliability of postgresql on linux nfs

    Kenneth Marshall <ktm@rice.edu> — 2011-09-13T13:22:42Z

    On Tue, Sep 13, 2011 at 03:02:57PM +0200, Florian Pflug wrote:
    > On Sep13, 2011, at 14:58 , ktm@rice.edu wrote:
    > > It will be interesting to see if there are any performance ramifications to
    > > this new write function.
    > 
    > What would those be? For non-interruptible reads and writes, the overhead
    > comes down to an additional function call (if we don't make pg_write_nointr
    > inlined) and a few conditional jumps (which branch prediction should be
    > able to take care of). These are bound to disappear in the noise compared
    > to the cost of the actual syscall.
    > 
    > best regards,
    > Florian Pflug
    > 
    That would be my expectation too. It is just always nice to benchmark changes,
    just in case. I have had similar simple changes blow out a cache and have a
    much greater impact on performance than might be expected from inspection. :)
    
    Regards,
    Ken
    
    
  24. Re: Patch to improve reliability of postgresql on linux nfs

    Florian G. Pflug <fgp@phlo.org> — 2011-09-13T14:14:23Z

    On Sep13, 2011, at 15:05 , Aidan Van Dyk wrote:
    > On Tue, Sep 13, 2011 at 7:30 AM, Florian Pflug <fgp@phlo.org> wrote:
    >> Sorry for the self-reply. I realized only after hitting send that I
    >> got the ENOSPC handling wrong again - we probably ought to check for
    >> ENOSPC as well as ret == 0. Also, it seems preferable to return the
    >> number of bytes actually written instead of -1 if we hit an error during
    >> retry.
    >> 
    >> With this version, any return value other than <amount> signals an
    >> error, the number of actually written bytes is reported even in the
    >> case of an error (to the best of pg_write_nointr's knowledge), and
    >> errno always indicates the kind of error.
    > 
    > Personally, I'ld think that's ripe for bugs.   If the contract is that
    > ret != amount is the "error" case, then don't return -1 for an error
    > *sometimes*.
    
    Hm, but isn't that how write() works also? AFAIK (non-interruptible) write()
    will return the number of bytes written, which may be less than the requested
    number if there's not enough free space, or -1 in case of an error like
    an invalid fd being passed.
    
    > If you sometimes return -1 for an error, even though ret != amount is
    > the *real* test, I'm going to guess there will be lots of chance for
    > code to do:
    >  if (pg_write_no_intr(...) < 0)
    >   ...
    > 
    > which will only catch some of the errors, and happily continue with the rest...
    
    Yeah, but that's equally wrong for plain write(), so I'm not sure I share
    your concern there. Also, I'm not sure how to improve that. We could always
    return -1 in case of an error, and "amount" in case of success, but that makes
    it impossible to determine how many bytes where actually written (and also feel
    awkward). Or we could return 0 instead of -1 if there was an error and zero
    bytes where written. But that feels awkward also...
    
    One additional possibility would be to make the signature
    
      boolean pg_write_nointr(int fd, const void *bytes, int len, int *written)
    
    and simply return true on success and false on error. Callers who're interested
    in the number of bytes actually written (in the case of an error) would need to
    pass some non-NULL pointer for <written>, while all others would just pass NULL.
    
    Thoughts?
    
    best regards,
    Florian Pflug
    
    
    
    
  25. Re: Patch to improve reliability of postgresql on linux nfs

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-13T14:25:15Z

    Florian Pflug <fgp@phlo.org> writes:
    > On Sep13, 2011, at 15:05 , Aidan Van Dyk wrote:
    >> Personally, I'ld think that's ripe for bugs.   If the contract is that
    >> ret != amount is the "error" case, then don't return -1 for an error
    >> *sometimes*.
    
    > Hm, but isn't that how write() works also?
    
    Yeah.  It's not possible to maintain the same error-reporting contract
    that bare write() has got, unless you're willing to forget about actual
    errors reported by a non-first write attempt.  Which might not be
    totally unreasonable, because presumably something similar is going on
    under the hood within write() itself.  Most of the errors one might
    think are worth reporting would have had to occur on the first write
    attempt anyway.
    
    But if you do want to report such errors, I think you have to push the
    error reporting logic into the subroutine, which seems a bit messy since
    there's quite a variety of error message phrasings out there, all of
    which require information that write() itself does not have.  Also, we
    do *not* want e.g. gettext() to be invoked unless an error actually
    occurs and has to be reported.
    
    			regards, tom lane
    
    
  26. Re: Patch to improve reliability of postgresql on linux nfs

    Aidan Van Dyk <aidan@highrise.ca> — 2011-09-13T14:32:01Z

    On Tue, Sep 13, 2011 at 10:14 AM, Florian Pflug <fgp@phlo.org> wrote:
    
    >> Personally, I'ld think that's ripe for bugs.   If the contract is that
    >> ret != amount is the "error" case, then don't return -1 for an error
    >> *sometimes*.
    >
    > Hm, but isn't that how write() works also? AFAIK (non-interruptible) write()
    > will return the number of bytes written, which may be less than the requested
    > number if there's not enough free space, or -1 in case of an error like
    > an invalid fd being passed.
    
    Looking through the code, it appears as if all the write calls I've
    seen are checking ret != amount, so it's probably not as big a deal
    for PG as I fear...
    
    But the subtle change in semantics (from system write ret != amount
    not necessarily a real error, hence no errno set) of pg_write ret !=
    amount only happening after a "real error" (errno should be set) is
    one that could yet lead to confusion.
    
    a.
    
    
    -- 
    Aidan Van Dyk                                             Create like a god,
    aidan@highrise.ca                                       command like a king,
    http://www.highrise.ca/                                   work like a slave.
    
    
  27. Re: Patch to improve reliability of postgresql on linux nfs

    Florian G. Pflug <fgp@phlo.org> — 2011-09-13T15:02:26Z

    On Sep13, 2011, at 16:25 , Tom Lane wrote:
    > Florian Pflug <fgp@phlo.org> writes:
    >> On Sep13, 2011, at 15:05 , Aidan Van Dyk wrote:
    >>> Personally, I'ld think that's ripe for bugs.   If the contract is that
    >>> ret != amount is the "error" case, then don't return -1 for an error
    >>> *sometimes*.
    > 
    >> Hm, but isn't that how write() works also?
    > 
    > Yeah.  It's not possible to maintain the same error-reporting contract
    > that bare write() has got, unless you're willing to forget about actual
    > errors reported by a non-first write attempt.
    
    Hm, yeah, but we're only replacing the exclusive or in "either sets errno
    *or* returns >= 0 and < amount" by a non-exclusive one. Which, in practice,
    doesn't make much difference for callers. They can (and should) continue to
    check whether they correct amount of bytes has been written, and they may
    still use errno to distinguish different kinds of errors. They should just
    do so upon any error condition, not upon us returning -1.
    
    The important thing, I believe, is that we don't withhold any information
    from callers, which we don't. If write() sets errno, it must return -1,
    so we'll abort and hence leave the errno in place to be inspected by the
    caller. And we faithfully track the actual number of bytes written.
    
    Or am I missing something?
    
    > But if you do want to report such errors, I think you have to push the
    > error reporting logic into the subroutine, which seems a bit messy since
    > there's quite a variety of error message phrasings out there, all of
    > which require information that write() itself does not have.  Also, we
    > do *not* want e.g. gettext() to be invoked unless an error actually
    > occurs and has to be reported.
    
    Yeah, I had the same idea (moving the error reporting into the subroutine)
    when I first looked at the OP's patch, but then figured it'd just complicate
    the API for no good reason.
    
    best regards,
    Florian Pflug
    
    
    
    
  28. Re: Patch to improve reliability of postgresql on linux nfs

    Bruce Momjian <bruce@momjian.us> — 2012-08-16T00:45:40Z

    On Tue, Sep 13, 2011 at 10:32:01AM -0400, Aidan Van Dyk wrote:
    > On Tue, Sep 13, 2011 at 10:14 AM, Florian Pflug <fgp@phlo.org> wrote:
    > 
    > >> Personally, I'ld think that's ripe for bugs.   If the contract is that
    > >> ret != amount is the "error" case, then don't return -1 for an error
    > >> *sometimes*.
    > >
    > > Hm, but isn't that how write() works also? AFAIK (non-interruptible) write()
    > > will return the number of bytes written, which may be less than the requested
    > > number if there's not enough free space, or -1 in case of an error like
    > > an invalid fd being passed.
    > 
    > Looking through the code, it appears as if all the write calls I've
    > seen are checking ret != amount, so it's probably not as big a deal
    > for PG as I fear...
    > 
    > But the subtle change in semantics (from system write ret != amount
    > not necessarily a real error, hence no errno set) of pg_write ret !=
    > amount only happening after a "real error" (errno should be set) is
    > one that could yet lead to confusion.
    
    I assume there is no TODO here.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        http://momjian.us
      EnterpriseDB                             http://enterprisedb.com
    
      + It's impossible for everything to be true. +