Thread

Patch to improve reliability of postgresql on linux nfs

George Barnett <gbarnett@atlassian.com> — 2011-09-09T00:04:58Z
```
Hi Hackers,
```

Re: Patch to improve reliability of postgresql on linux nfs

Josh Berkus <josh@agliodbs.com> — 2011-09-09T00:16:59Z

George,

I'm quoting you here because in the version of your email which got
posted to the list your whole explanation got put below the patch text,
making it hard to find the justification for the patch.  Follows:

> I run a number of postgresql installations on NFS and on the whole I find this to be very reliable.  I have however run into a few issues when there is concurrent writes from multiple processes.
> 
> I see errors such as the following:
> 
> 2011-07-31 22:13:35 EST postgres postgres [local] LOG:  connection authorized: user=postgres database=postgres
> 2011-07-31 22:13:35 EST    ERROR:  could not write block 1 of relation global/2671: wrote only 4096 of 8192 bytes
> 2011-07-31 22:13:35 EST    HINT:  Check free disk space.
> 2011-07-31 22:13:35 EST    CONTEXT:  writing block 1 of relation global/2671
> 2011-07-31 22:13:35 EST [unknown] [unknown]  LOG:  connection received: host=[local]
> 
> I have also seen similar errors coming out of the WAL writer, however they occur at the level PANIC, which is a little more drastic.
> 
> After spending some time with debug logging turned on and even more time staring at strace, I believe this occurs when one process was writing to a data file and it received a SIGINT from another process, eg:
> (These logs are from another similar run)
> 
> [pid  1804] <... fsync resumed> )       = 0
> [pid 10198] kill(1804, SIGINT <unfinished ...>
> [pid  1804] lseek(3, 4915200, SEEK_SET) = 4915200
> [pid  1804] write(3, "c\320\1\0\1\0\0\0\0\0\0\0\0\0K\2\6\1\0\0\0\0\373B\0\0\0\0\2\0m\0"..., 32768 <unfinished ...>
> [pid 10198] <... kill resumed> )        = 0
> [pid  1804] <... write resumed> )       = 4096
> [pid  1804] --- SIGINT (Interrupt) @ 0 (0) ---
> [pid  1804] rt_sigreturn(0x2)           = 4096
> [pid  1804] write(2, "\0\0\373\0\f\7\0\0t2011-08-30 20:29:52.999"..., 260) = 260
> [pid  1804] rt_sigprocmask(SIG_UNBLOCK, [ABRT],  <unfinished ...>
> [pid  1802] <... select resumed> )      = 1 (in [5], left {0, 999000})
> [pid  1804] <... rt_sigprocmask resumed> NULL, 8) = 0
> [pid  1804] tgkill(1804, 1804, SIGABRT) = 0
> [pid  1802] read(5,  <unfinished ...>
> [pid  1804] --- SIGABRT (Aborted) @ 0 (0) ---
> Process 1804 detached
> 
> After finding this, I came up with the following test case which easily replicated our issue:
> 
> #!/bin/bash
> 
> name=$1
> number=1
> while true; do 
>   /usr/bin/psql -c "CREATE USER \"$name$number\" WITH NOSUPERUSER INHERIT NOCREATEROLE NOCREATEDB LOGIN PASSWORD 'pass';"
>   /usr/bin/createdb -E UNICODE -O $name$number $name$number
>   if `grep -q PANIC /data/postgresql/data/pg_log/*`; then
>     exit
>   fi
>   let number=$number+1
> done
> 
> When I run a single copy of this script, I have no issues, however when I start up a few more copies to simultaneously hit the DB, it crashes quiet quickly - usually within 20 or 30 seconds.
> 
> After looking through the code I found that when postgres calls write() it doesn't retry.  In order to address the issue with the PANIC in the WAL writer I set the sync method to o_sync which solved the issue in that part of the code, however I was still seeing failures in other areas of the code (such as the FileWrite function).  Following this, I spoke to an NFS guru who pointed out that writes under linux are not guaranteed to complete unless you open up O_SYNC or similar on the file handle.  I had a look in the libc docs and found this:
> 
> http://www.gnu.org/s/libc/manual/html_node/I_002fO-Primitives.html
> 
> ----
> The write function writes up to size bytes from buffer to the file with descriptor filedes. The data in buffer is not necessarily a character string and a null character is output like any other character.
> 
> The return value is the number of bytes actually written. This may be size, but can always be smaller. Your program should always call write in a loop, iterating until all the data is written.
> ----
> 
> After finding this, I checked a number of other pieces of software that we see no issues with on NFS (such as the JVM) for their usage of write().  I confirmed they write in a while loop and set about patching the postgres source.
> 
> I have made this patch against 8.4.8 and confirmed that it fixes the issue we see on our systems.  I have also checked that make check still passes. 
> 
> As my C is terrible, I would welcome any comments on the implementation of this patch.
> 
> Best regards,
> 
> George


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: Patch to improve reliability of postgresql on linux nfs

Thom Brown <thom@linux.com> — 2011-09-09T08:38:58Z

On 9 September 2011 01:04, George Barnett <gbarnett@atlassian.com> wrote:
> After looking through the code I found that when postgres calls write() it doesn't retry.  In order to address the issue with the PANIC in the WAL writer I set the sync method to o_sync which solved the issue in that part of the code, however I was still seeing failures in other areas of the code (such as the FileWrite function).  Following this, I spoke to an NFS guru who pointed out that writes under linux are not guaranteed to complete unless you open up O_SYNC or similar on the file handle.

Have you run the test with varying wal_sync_method values?  On Linux
the default is fdatasync because historically Linux hasn't supported
O_DSYNC (a wal_sync_method value of open_datasync).  But I believe as
of Kernel 2.6.33 it does support it.  Have you tried modifying this
parameter in your tests?  Are you even using Linux? (you haven't
specified)

-- 
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Patch to improve reliability of postgresql on linux nfs

Peter Eisentraut <peter_e@gmx.net> — 2011-09-09T09:43:07Z

On fre, 2011-09-09 at 10:04 +1000, George Barnett wrote:
> After looking through the code I found that when postgres calls
> write() it doesn't retry.  In order to address the issue with the
> PANIC in the WAL writer I set the sync method to o_sync which solved
> the issue in that part of the code, however I was still seeing
> failures in other areas of the code (such as the FileWrite function).
> Following this, I spoke to an NFS guru who pointed out that writes
> under linux are not guaranteed to complete unless you open up O_SYNC
> or similar on the file handle.

I've had this problem many years ago.  I recall that changing the mount
options for NFS also fixed it.  Could you post what mount options you
are using.

(We eventually moved away from NFS at that time, so I didn't pursue it
further, but my analysis back then matched yours.)

Re: Patch to improve reliability of postgresql on linux nfs

Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-09T14:27:22Z

George Barnett <gbarnett@atlassian.com> writes:
> [ patch to retry writes on NFS ]

I'm having a hard time getting excited about this idea, because IMO
NFS is insufficiently reliable to run a database on, and no patch like
this can fix that.  However, some concrete points:

1. If writes need to be retried, why not reads?  (No, that's not an
invitation to expand the scope of the patch; it's a question about NFS
implementation.)

2. What is the rationale for supposing that a retry a nanosecond later
will help?  If it will help, why didn't the kernel just do that?

3. What about EINTR?  If you believe that this is necessary, then the
kernel logically has to return EINTR when it would like you to retry but
it hasn't been able to write any bytes yet.  If you get a zero return
you have to assume that means out-of-disk-space.

4. As coded, the patch behaves incorrectly if you get a zero return on a
retry.  If we were going to do this, I think we'd need to absorb the
errno-munging currently done by callers into the writeAll function.

On the whole I think you'd be better off lobbying your NFS implementors
to provide something closer to the behavior of every other filesystem on
the planet.  Or checking to see if you need to adjust your NFS
configuration, as the other responders mentioned.

			regards, tom lane

Re: Patch to improve reliability of postgresql on linux nfs

Bernd Helmle <mailings@oopsware.de> — 2011-09-09T15:30:02Z

--On 9. September 2011 10:27:22 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:

> On the whole I think you'd be better off lobbying your NFS implementors
> to provide something closer to the behavior of every other filesystem on
> the planet.  Or checking to see if you need to adjust your NFS
> configuration, as the other responders mentioned.

You really need at least mount options 'hard' _and_ 'nointr' on NFS mounts, 
otherwise you are out of luck. Oracle and DB2 guys recommend those settings and 
without them any millisecond of network glitch could disturb things 
unreasonably.

-- 
Thanks

	Bernd

Re: Patch to improve reliability of postgresql on linux nfs

Noah Misch <noah@leadboat.com> — 2011-09-09T17:44:05Z

On Fri, Sep 09, 2011 at 10:27:22AM -0400, Tom Lane wrote:
> George Barnett <gbarnett@atlassian.com> writes:
> > [ patch to retry writes on NFS ]
> 
> I'm having a hard time getting excited about this idea, because IMO
> NFS is insufficiently reliable to run a database on, and no patch like
> this can fix that.  However, some concrete points:
> 
> 1. If writes need to be retried, why not reads?  (No, that's not an
> invitation to expand the scope of the patch; it's a question about NFS
> implementation.)

To support all POSIX:2008-conforming read() implementations, we must indeed
retry reads.  I suppose the OP never encountered this in practice, though.

> 2. What is the rationale for supposing that a retry a nanosecond later
> will help?  If it will help, why didn't the kernel just do that?

POSIX:2008 unconditionally permits[1] partial writes of requests exceeding 512
bytes (_POSIX_PIPE_BUF).  We shouldn't complain when a kernel provides a
conforming write(), even if it appears that the kernel achieved little by
using some freedom afforded it.

> 3. What about EINTR?  If you believe that this is necessary, then the
> kernel logically has to return EINTR when it would like you to retry but
> it hasn't been able to write any bytes yet.  If you get a zero return
> you have to assume that means out-of-disk-space.

Assuming conforming SA_RESTART for write()/read(), this will not happen.  The
call will restart and resume blocking until it writes something.

nm

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html

Re: Patch to improve reliability of postgresql on linux nfs

Kevin Grittner <kevin.grittner@wicourts.gov> — 2011-09-09T18:38:08Z

Noah Misch <noah@leadboat.com> wrote:

> We shouldn't complain when a kernel provides a conforming write(),
> even if it appears that the kernel achieved little by using some
> freedom afforded it.

I remember we had some compiler warnings in the logging area because
we were making the assumption that no implementation would ever use
that freedom.  I suggested we replace the simple, unchecked calls to
write with a function which did what we expected through an API
conforming loop:

http://archives.postgresql.org/pgsql-hackers/2011-02/msg01719.php

The response was that we could ignore the documented API because we
had "never seen nor heard of it being true for writes to disk
files".  I'm still uncomfortable with that.  Where I have seen
people code to implementation details rather than the documented
API, it has often not turned out well in the long run.

I'd still be willing to put together a patch for that if people buy
into it.

-Kevin

Re: Patch to improve reliability of postgresql on linux nfs

Bruce Momjian <bruce@momjian.us> — 2011-09-09T23:06:49Z

Tom Lane wrote:
> George Barnett <gbarnett@atlassian.com> writes:
> > [ patch to retry writes on NFS ]
> 
> I'm having a hard time getting excited about this idea, because IMO
> NFS is insufficiently reliable to run a database on, and no patch like
> this can fix that.  However, some concrete points:
> 
> 1. If writes need to be retried, why not reads?  (No, that's not an
> invitation to expand the scope of the patch; it's a question about NFS
> implementation.)
> 
> 2. What is the rationale for supposing that a retry a nanosecond later
> will help?  If it will help, why didn't the kernel just do that?
> 
> 3. What about EINTR?  If you believe that this is necessary, then the
> kernel logically has to return EINTR when it would like you to retry but
> it hasn't been able to write any bytes yet.  If you get a zero return
> you have to assume that means out-of-disk-space.
> 
> 4. As coded, the patch behaves incorrectly if you get a zero return on a
> retry.  If we were going to do this, I think we'd need to absorb the
> errno-munging currently done by callers into the writeAll function.
> 
> On the whole I think you'd be better off lobbying your NFS implementors
> to provide something closer to the behavior of every other filesystem on
> the planet.  Or checking to see if you need to adjust your NFS
> configuration, as the other responders mentioned.

Can our NFS documentation be improved (section 17.2.1)?

	http://www.postgresql.org/docs/9.1/static/creating-cluster.html

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: Patch to improve reliability of postgresql on linux nfs

George Barnett <gbarnett@atlassian.com> — 2011-09-12T04:30:17Z

On 10/09/2011, at 1:30 AM, Bernd Helmle wrote:

> --On 9. September 2011 10:27:22 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 
>> On the whole I think you'd be better off lobbying your NFS implementors
>> to provide something closer to the behavior of every other filesystem on
>> the planet.  Or checking to see if you need to adjust your NFS
>> configuration, as the other responders mentioned.
> 
> You really need at least mount options 'hard' _and_ 'nointr' on NFS mounts, otherwise you are out of luck. Oracle and DB2 guys recommend those settings and without them any millisecond of network glitch could disturb things unreasonably.

Hi,

My mount options include hard and intr.

George

Re: Patch to improve reliability of postgresql on linux nfs

Florian G. Pflug <fgp@phlo.org> — 2011-09-12T05:59:32Z

On Sep12, 2011, at 06:30 , George Barnett wrote:
> On 10/09/2011, at 1:30 AM, Bernd Helmle wrote:
> 
>> --On 9. September 2011 10:27:22 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 
>>> On the whole I think you'd be better off lobbying your NFS implementors
>>> to provide something closer to the behavior of every other filesystem on
>>> the planet.  Or checking to see if you need to adjust your NFS
>>> configuration, as the other responders mentioned.
>> 
>> You really need at least mount options 'hard' _and_ 'nointr' on NFS mounts, otherwise you are out of luck. Oracle and DB2 guys recommend those settings and without them any millisecond of network glitch could disturb things unreasonably.
> 
> My mount options include hard and intr.

If you really meant to say "intr" there (and not "nointr") then that probably explains the partial writes.

Still, I agree with Noah and Kevin that we ought to deal more gracefully with this, i.e. resubmit after a partial read() or write(). AFAICS there's nothing to be gained by not doing that, and the increase in code complexity should be negligible. If we do that, however, I believe we might as well handle EINTR correctly, even if SA_RESTART should prevent us from ever seeing that.

best regards,
Florian Pflug

Re: Patch to improve reliability of postgresql on linux nfs

George Barnett <gbarnett@atlassian.com> — 2011-09-12T06:46:53Z

On 12/09/2011, at 3:59 PM, Florian Pflug wrote:

> If you really meant to say "intr" there (and not "nointr") then that probably explains the partial writes.
> 
> Still, I agree with Noah and Kevin that we ought to deal more gracefully with this, i.e. resubmit after a partial read() or write(). AFAICS there's nothing to be gained by not doing that, and the increase in code complexity should be negligible. If we do that, however, I believe we might as well handle EINTR correctly, even if SA_RESTART should prevent us from ever seeing that.

Hi Florian,

You are indeed correct.  Setting nointr also resolves my issue.  I could swear I checked this, but obviously not.

It does still concern me that pgsql did not deal with this as gracefully as other software.  I hope the list will consider a patch to resolve that.

Thanks in advance,

George

Re: Patch to improve reliability of postgresql on linux nfs

Peter Eisentraut <peter_e@gmx.net> — 2011-09-12T12:54:39Z

On mån, 2011-09-12 at 16:46 +1000, George Barnett wrote:
> On 12/09/2011, at 3:59 PM, Florian Pflug wrote:
> 
> > If you really meant to say "intr" there (and not "nointr") then that probably explains the partial writes.
> > 
> > Still, I agree with Noah and Kevin that we ought to deal more gracefully with this, i.e. resubmit after a partial read() or write(). AFAICS there's nothing to be gained by not doing that, and the increase in code complexity should be negligible. If we do that, however, I believe we might as well handle EINTR correctly, even if SA_RESTART should prevent us from ever seeing that.
> 
> 
> Hi Florian,
> 
> You are indeed correct.  Setting nointr also resolves my issue.  I could swear I checked this, but obviously not.
> 
> It does still concern me that pgsql did not deal with this as gracefully as other software.  I hope the list will consider a patch to resolve that.

We have signal handling configured so that system calls are not
interrupted.  So there is ordinarily no reason to do anything more
graceful.  The problem is that NFS is in this case not observing that
setting.  It's debatable whether it's worth supporting that; just saying
that the code is correct as it stands.

Re: Patch to improve reliability of postgresql on linux nfs

Kenneth Marshall <ktm@rice.edu> — 2011-09-12T12:54:52Z

On Mon, Sep 12, 2011 at 04:46:53PM +1000, George Barnett wrote:
> On 12/09/2011, at 3:59 PM, Florian Pflug wrote:
> 
> > If you really meant to say "intr" there (and not "nointr") then that probably explains the partial writes.
> > 
> > Still, I agree with Noah and Kevin that we ought to deal more gracefully with this, i.e. resubmit after a partial read() or write(). AFAICS there's nothing to be gained by not doing that, and the increase in code complexity should be negligible. If we do that, however, I believe we might as well handle EINTR correctly, even if SA_RESTART should prevent us from ever seeing that.
> 
> 
> Hi Florian,
> 
> You are indeed correct.  Setting nointr also resolves my issue.  I could swear I checked this, but obviously not.
> 
> It does still concern me that pgsql did not deal with this as gracefully as other software.  I hope the list will consider a patch to resolve that.
> 
> Thanks in advance,
> 
> George

Hi George,

Many, many, many other software packages expect I/O usage to be the same on
an NFS volume and a local disk volume, including Oracle. Coding every application,
or more likely mis-coding, to handle this gives every application another chance
to get it wrong. If the OS does this, when it gets it right, all of the apps get
it right. I think you should be surprised when other software actually deals with
broken I/O semantics gracefully rather than concerned when one of a pantheon of
programs does not. My two cents.

Regards,
Ken

Re: Patch to improve reliability of postgresql on linux nfs

Florian G. Pflug <fgp@phlo.org> — 2011-09-12T13:27:00Z

On Sep12, 2011, at 14:54 , Peter Eisentraut wrote:
> On mån, 2011-09-12 at 16:46 +1000, George Barnett wrote:
>> On 12/09/2011, at 3:59 PM, Florian Pflug wrote:
>>> Still, I agree with Noah and Kevin that we ought to deal more gracefully with this, i.e. resubmit after a partial read() or write(). AFAICS there's nothing to be gained by not doing that, and the increase in code complexity should be negligible. If we do that, however, I believe we might as well handle EINTR correctly, even if SA_RESTART should prevent us from ever seeing that.
>> 
>> It does still concern me that pgsql did not deal with this as gracefully as other software.  I hope the list will consider a patch to resolve that.
> 
> We have signal handling configured so that system calls are not
> interrupted.  So there is ordinarily no reason to do anything more
> graceful.  The problem is that NFS is in this case not observing that
> setting.  It's debatable whether it's worth supporting that; just saying
> that the code is correct as it stands.

SA_RESTART doesn't protect against partial reads/writes due to signal delivery,
it only removes the need to check for EINTR. In other words, it retries until
at least one byte has been written, not until all bytes have been written.

The GNU LibC documentation has this to say on the subject

  "There is one situation where resumption never happens no matter which
   choice you make: when a data-transfer function such as read or write is
   interrupted by a signal after transferring part of the data. In this case,
   the function returns the number of bytes already transferred, indicating
   partial success."[1]

While it's true that reads and writes are by tradition non-interruptible, I
personally wouldn't bet that they'll stay that way forever. It all depends on
whether the timeouts involved in the communication with a disk are short enough
to mask the difference - once they get too long (or even infinite like in the
case of "hard" NFS mounts) you pay for non-interruptible primitives with
un-killable stuck processes. Since the current trend is to move storage further
away from processing, and to put non-deterministic networks like ethernet between
the two, I expect situations in which read/write primitives are interruptible
to increase, not decrease.

And BTW, the GNU LibC documentations doesn't seem to mention anything about
local reads and writes being non-interruptible. In fact, it even says

  "A signal can arrive and be handled while an I/O primitive such as open or read
   is waiting for an I/O device. If the signal handler returns, the system faces
   the question: what should happen next?"[1]

Had the GNU people faith in local read and writes being non-interruptible, they'd
probably have said "network device" or "remove device", not "I/O device".

best regards,
Florian Pflug

[1] http://www.gnu.org/s/hello/manual/libc/Interrupted-Primitives.html#Interrupted-Primitives

Re: Patch to improve reliability of postgresql on linux nfs

Florian G. Pflug <fgp@phlo.org> — 2011-09-12T13:39:02Z

On Sep12, 2011, at 14:54 , ktm@rice.edu wrote:
> Many, many, many other software packages expect I/O usage to be the same on
> an NFS volume and a local disk volume, including Oracle. Coding every application,
> or more likely mis-coding, to handle this gives every application another chance
> to get it wrong. If the OS does this, when it gets it right, all of the apps get
> it right. I think you should be surprised when other software actually deals with
> broken I/O semantics gracefully rather than concerned when one of a pantheon of
> programs does not. My two cents.

I don't buy that. People seem to be perfectly able to code correct networking
applications (correct from a read/write API POV at least), yet those applications
need to deal with partial reads and writes too.

Really, it's not *that* hard to put a retry loop around "read" and "write".

Also, non-interruptible IO primitives are by no means "right". At best, they're
a compromise between complexity and functionality for I/O devices with rather
short (and bounded) communication timeouts - because in that case, processes are
only blocked un-interruptibly for a short while.

best regards,
Florian Pflug

Re: Patch to improve reliability of postgresql on linux nfs

Robert Haas <robertmhaas@gmail.com> — 2011-09-12T14:37:35Z

On Mon, Sep 12, 2011 at 9:39 AM, Florian Pflug <fgp@phlo.org> wrote:
> Really, it's not *that* hard to put a retry loop around "read" and "write".

+1.  I don't see what we're gaining by digging in our heels on this
one.  Retry loops around read() and write() are pretty routine, and I
wouldn't like to bet this is the only case where not having them could
cause an unnecessary failure.

Now, that having been said, I *really* think we could use some better
documentation on which mount options we believe to be safe, and not
just for NFS.  Right now we have practically nothing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Patch to improve reliability of postgresql on linux nfs

Florian G. Pflug <fgp@phlo.org> — 2011-09-13T11:07:42Z

[CC'ing to the list again - I assume you omitted pgsql-hackers from the
recipient list by accident]

On Sep13, 2011, at 03:00 , George Barnett wrote:
> On 12/09/2011, at 11:39 PM, Florian Pflug wrote:
>> Also, non-interruptible IO primitives are by no means "right". At best, they're
>> a compromise between complexity and functionality for I/O devices with rather
>> short (and bounded) communication timeouts - because in that case, processes are
>> only blocked un-interruptibly for a short while.
> 
> Just to expand on that - I'm now in the situation where I can run my nfs mounts
> 'nointr' and postgres will work, but that means if I lose a storage unit I have
> a number of stuck processes, effectively meaning I need to reboot all my frontend
> servers before I can fail over to backup nfs stores.
> 
> However, if I run the mounts with intr, then if a storage unit fails, I can fail
> over to a backup node (taking a minor loss of data hit I'm willing to accept) but
> postgres breaks under a moderate insert load.
> 
> With the patch I supplied though, I'm able to have most of my cake and eat it.
> 
> I'd be very interested in moving this forward - is there something I can change
> in the patch to make it more acceptable for a merge?

Here are a few comments

Tom already remarked that if we do that for write()s, we ought to do it for read()s
also which I agree with. All other primitives like lseek, close, ... should be taken
care of by SA_RESTART, but I'd be a good idea to verify that.

Also, I don't think that POSIX mandates that errno be reset to 0 if a function returns
successfully, making that "returnCode == 0 && errno == 0" check pretty dubious. I'm not
sure of this was what Tom was getting at with his remark about the ENOSPC handling being
wrong in the retry case.

And I also think that if we do this, we might as well handle EINTR correctly, even if
our use of SA_RESTART should prevent us from ever seeing that. The rules surrounding
EINTR and SA_RESTART for read/write are quite subtle...

If we retry, shouldn't be do CHECK_FOR_INTERRUPTS? Otherwise, processes waiting for
a vanished NFS server would be killable only with SIGKILL, not SIGTERM or SIGINT.
But I'm not sure if it's safe to put that into a generic function like pg_write_nointr.

Finally, WriteAll() seems like a poor name for that function. How about pg_write_nointr()?

Here's my suggested implementation for pg_write_nointr. pg_read_nointr should be similar
(but obviously without the ENOSPC handling)

int pg_write_nointr(int fd, const void *bytes, Size amount)
{
  int written = 0;

  while (amount > 0)
  {
    int ret;

    ret = write(fd, bytes, amount);
    if ((ret < 0) && (errno == EINTR))
    {
      /* interrupted by signal before first byte was written. Retry */
      /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
      CHECK_FOR_INTERRUPTS();
      continue;
    }
    else if (ret < 0)
    {
      /* error occurred. Abort */
      return -1;
    }
    else if (ret == 0)
    {
      /* out of disk space. Abort */
      return written;
    }

    /* made progress */

    /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
    CHECK_FOR_INTERRUPTS();

    written += ret;
    amount -= ret;
    bytes = (const char *) bytes + ret;
  }
}

best regards,
Florian Pflug

Re: Patch to improve reliability of postgresql on linux nfs

Florian G. Pflug <fgp@phlo.org> — 2011-09-13T11:30:34Z

On Sep13, 2011, at 13:07 , Florian Pflug wrote:
> Here's my suggested implementation for pg_write_nointr. pg_read_nointr should be similar
> (but obviously without the ENOSPC handling)
> 
> <wrong pg_write_nointr implementation snipped>

Sorry for the self-reply. I realized only after hitting send that I
got the ENOSPC handling wrong again - we probably ought to check for
ENOSPC as well as ret == 0. Also, it seems preferable to return the
number of bytes actually written instead of -1 if we hit an error during
retry.

With this version, any return value other than <amount> signals an
error, the number of actually written bytes is reported even in the
case of an error (to the best of pg_write_nointr's knowledge), and
errno always indicates the kind of error.

int pg_write_nointr(int fd, const void *bytes, Size amount)
{
 int written = 0;

 while (amount > 0)
 {
   int ret;

   ret = write(fd, bytes, amount);

   if ((ret < 0) && (errno == EINTR))
   {
     /* interrupted by signal before first byte was written. Retry */

     /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
     CHECK_FOR_INTERRUPTS();

     continue;
   }
   else if (ret < 1)
   {
     /* error occurred. Abort */

     if (ret == 0)
       /* out of disk space */
       errno = ENOSPC;

     if (written == 0)
       return -1;
     else
       return written;
   }

   /* made progress */
   written += ret;
   amount -= ret;
   bytes = (const char *) bytes + ret;

   /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
   CHECK_FOR_INTERRUPTS();
 }
}

best regards,
Florian Pflug

Re: Patch to improve reliability of postgresql on linux nfs

Kenneth Marshall <ktm@rice.edu> — 2011-09-13T12:58:22Z

On Tue, Sep 13, 2011 at 01:30:34PM +0200, Florian Pflug wrote:
> On Sep13, 2011, at 13:07 , Florian Pflug wrote:
> > Here's my suggested implementation for pg_write_nointr. pg_read_nointr should be similar
> > (but obviously without the ENOSPC handling)
> > 
> > <wrong pg_write_nointr implementation snipped>
> 
> Sorry for the self-reply. I realized only after hitting send that I
> got the ENOSPC handling wrong again - we probably ought to check for
> ENOSPC as well as ret == 0. Also, it seems preferable to return the
> number of bytes actually written instead of -1 if we hit an error during
> retry.
> 
> With this version, any return value other than <amount> signals an
> error, the number of actually written bytes is reported even in the
> case of an error (to the best of pg_write_nointr's knowledge), and
> errno always indicates the kind of error.
> 
> int pg_write_nointr(int fd, const void *bytes, Size amount)
> {
>  int written = 0;
> 
>  while (amount > 0)
>  {
>    int ret;
> 
>    ret = write(fd, bytes, amount);
> 
>    if ((ret < 0) && (errno == EINTR))
>    {
>      /* interrupted by signal before first byte was written. Retry */
> 
>      /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
>      CHECK_FOR_INTERRUPTS();
> 
>      continue;
>    }
>    else if (ret < 1)
>    {
>      /* error occurred. Abort */
> 
>      if (ret == 0)
>        /* out of disk space */
>        errno = ENOSPC;
> 
>      if (written == 0)
>        return -1;
>      else
>        return written;
>    }
> 
>    /* made progress */
>    written += ret;
>    amount -= ret;
>    bytes = (const char *) bytes + ret;
>    
>    /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
>    CHECK_FOR_INTERRUPTS();
>  }
> }
> 
> best regards,
> Florian Pflug
> 

It will be interesting to see if there are any performance ramifications to
this new write function.

Regards,
Ken

Re: Patch to improve reliability of postgresql on linux nfs

Florian G. Pflug <fgp@phlo.org> — 2011-09-13T13:02:57Z

On Sep13, 2011, at 14:58 , ktm@rice.edu wrote:
> It will be interesting to see if there are any performance ramifications to
> this new write function.

What would those be? For non-interruptible reads and writes, the overhead
comes down to an additional function call (if we don't make pg_write_nointr
inlined) and a few conditional jumps (which branch prediction should be
able to take care of). These are bound to disappear in the noise compared
to the cost of the actual syscall.

best regards,
Florian Pflug

Re: Patch to improve reliability of postgresql on linux nfs

Aidan Van Dyk <aidan@highrise.ca> — 2011-09-13T13:05:38Z

On Tue, Sep 13, 2011 at 7:30 AM, Florian Pflug <fgp@phlo.org> wrote:

>
> Sorry for the self-reply. I realized only after hitting send that I
> got the ENOSPC handling wrong again - we probably ought to check for
> ENOSPC as well as ret == 0. Also, it seems preferable to return the
> number of bytes actually written instead of -1 if we hit an error during
> retry.
>
> With this version, any return value other than <amount> signals an
> error, the number of actually written bytes is reported even in the
> case of an error (to the best of pg_write_nointr's knowledge), and
> errno always indicates the kind of error.

Personally, I'ld think that's ripe for bugs.   If the contract is that
ret != amount is the "error" case, then don't return -1 for an error
*sometimes*.

If you sometimes return -1 for an error, even though ret != amount is
the *real* test, I'm going to guess there will be lots of chance for
code to do:
  if (pg_write_no_intr(...) < 0)
   ...

which will only catch some of the errors, and happily continue with the rest...

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Patch to improve reliability of postgresql on linux nfs

Kenneth Marshall <ktm@rice.edu> — 2011-09-13T13:22:42Z

On Tue, Sep 13, 2011 at 03:02:57PM +0200, Florian Pflug wrote:
> On Sep13, 2011, at 14:58 , ktm@rice.edu wrote:
> > It will be interesting to see if there are any performance ramifications to
> > this new write function.
> 
> What would those be? For non-interruptible reads and writes, the overhead
> comes down to an additional function call (if we don't make pg_write_nointr
> inlined) and a few conditional jumps (which branch prediction should be
> able to take care of). These are bound to disappear in the noise compared
> to the cost of the actual syscall.
> 
> best regards,
> Florian Pflug
> 
That would be my expectation too. It is just always nice to benchmark changes,
just in case. I have had similar simple changes blow out a cache and have a
much greater impact on performance than might be expected from inspection. :)

Regards,
Ken

Re: Patch to improve reliability of postgresql on linux nfs

Florian G. Pflug <fgp@phlo.org> — 2011-09-13T14:14:23Z

On Sep13, 2011, at 15:05 , Aidan Van Dyk wrote:
> On Tue, Sep 13, 2011 at 7:30 AM, Florian Pflug <fgp@phlo.org> wrote:
>> Sorry for the self-reply. I realized only after hitting send that I
>> got the ENOSPC handling wrong again - we probably ought to check for
>> ENOSPC as well as ret == 0. Also, it seems preferable to return the
>> number of bytes actually written instead of -1 if we hit an error during
>> retry.
>> 
>> With this version, any return value other than <amount> signals an
>> error, the number of actually written bytes is reported even in the
>> case of an error (to the best of pg_write_nointr's knowledge), and
>> errno always indicates the kind of error.
> 
> Personally, I'ld think that's ripe for bugs.   If the contract is that
> ret != amount is the "error" case, then don't return -1 for an error
> *sometimes*.

Hm, but isn't that how write() works also? AFAIK (non-interruptible) write()
will return the number of bytes written, which may be less than the requested
number if there's not enough free space, or -1 in case of an error like
an invalid fd being passed.

> If you sometimes return -1 for an error, even though ret != amount is
> the *real* test, I'm going to guess there will be lots of chance for
> code to do:
>  if (pg_write_no_intr(...) < 0)
>   ...
> 
> which will only catch some of the errors, and happily continue with the rest...

Yeah, but that's equally wrong for plain write(), so I'm not sure I share
your concern there. Also, I'm not sure how to improve that. We could always
return -1 in case of an error, and "amount" in case of success, but that makes
it impossible to determine how many bytes where actually written (and also feel
awkward). Or we could return 0 instead of -1 if there was an error and zero
bytes where written. But that feels awkward also...

One additional possibility would be to make the signature

  boolean pg_write_nointr(int fd, const void *bytes, int len, int *written)

and simply return true on success and false on error. Callers who're interested
in the number of bytes actually written (in the case of an error) would need to
pass some non-NULL pointer for <written>, while all others would just pass NULL.

Thoughts?

best regards,
Florian Pflug

Re: Patch to improve reliability of postgresql on linux nfs

Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-13T14:25:15Z

Florian Pflug <fgp@phlo.org> writes:
> On Sep13, 2011, at 15:05 , Aidan Van Dyk wrote:
>> Personally, I'ld think that's ripe for bugs.   If the contract is that
>> ret != amount is the "error" case, then don't return -1 for an error
>> *sometimes*.

> Hm, but isn't that how write() works also?

Yeah.  It's not possible to maintain the same error-reporting contract
that bare write() has got, unless you're willing to forget about actual
errors reported by a non-first write attempt.  Which might not be
totally unreasonable, because presumably something similar is going on
under the hood within write() itself.  Most of the errors one might
think are worth reporting would have had to occur on the first write
attempt anyway.

But if you do want to report such errors, I think you have to push the
error reporting logic into the subroutine, which seems a bit messy since
there's quite a variety of error message phrasings out there, all of
which require information that write() itself does not have.  Also, we
do *not* want e.g. gettext() to be invoked unless an error actually
occurs and has to be reported.

			regards, tom lane

Re: Patch to improve reliability of postgresql on linux nfs

Aidan Van Dyk <aidan@highrise.ca> — 2011-09-13T14:32:01Z

On Tue, Sep 13, 2011 at 10:14 AM, Florian Pflug <fgp@phlo.org> wrote:

>> Personally, I'ld think that's ripe for bugs.   If the contract is that
>> ret != amount is the "error" case, then don't return -1 for an error
>> *sometimes*.
>
> Hm, but isn't that how write() works also? AFAIK (non-interruptible) write()
> will return the number of bytes written, which may be less than the requested
> number if there's not enough free space, or -1 in case of an error like
> an invalid fd being passed.

Looking through the code, it appears as if all the write calls I've
seen are checking ret != amount, so it's probably not as big a deal
for PG as I fear...

But the subtle change in semantics (from system write ret != amount
not necessarily a real error, hence no errno set) of pg_write ret !=
amount only happening after a "real error" (errno should be set) is
one that could yet lead to confusion.

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Patch to improve reliability of postgresql on linux nfs

Florian G. Pflug <fgp@phlo.org> — 2011-09-13T15:02:26Z

On Sep13, 2011, at 16:25 , Tom Lane wrote:
> Florian Pflug <fgp@phlo.org> writes:
>> On Sep13, 2011, at 15:05 , Aidan Van Dyk wrote:
>>> Personally, I'ld think that's ripe for bugs.   If the contract is that
>>> ret != amount is the "error" case, then don't return -1 for an error
>>> *sometimes*.
> 
>> Hm, but isn't that how write() works also?
> 
> Yeah.  It's not possible to maintain the same error-reporting contract
> that bare write() has got, unless you're willing to forget about actual
> errors reported by a non-first write attempt.

Hm, yeah, but we're only replacing the exclusive or in "either sets errno
*or* returns >= 0 and < amount" by a non-exclusive one. Which, in practice,
doesn't make much difference for callers. They can (and should) continue to
check whether they correct amount of bytes has been written, and they may
still use errno to distinguish different kinds of errors. They should just
do so upon any error condition, not upon us returning -1.

The important thing, I believe, is that we don't withhold any information
from callers, which we don't. If write() sets errno, it must return -1,
so we'll abort and hence leave the errno in place to be inspected by the
caller. And we faithfully track the actual number of bytes written.

Or am I missing something?

> But if you do want to report such errors, I think you have to push the
> error reporting logic into the subroutine, which seems a bit messy since
> there's quite a variety of error message phrasings out there, all of
> which require information that write() itself does not have.  Also, we
> do *not* want e.g. gettext() to be invoked unless an error actually
> occurs and has to be reported.

Yeah, I had the same idea (moving the error reporting into the subroutine)
when I first looked at the OP's patch, but then figured it'd just complicate
the API for no good reason.

best regards,
Florian Pflug

Re: Patch to improve reliability of postgresql on linux nfs

Bruce Momjian <bruce@momjian.us> — 2012-08-16T00:45:40Z

On Tue, Sep 13, 2011 at 10:32:01AM -0400, Aidan Van Dyk wrote:
> On Tue, Sep 13, 2011 at 10:14 AM, Florian Pflug <fgp@phlo.org> wrote:
> 
> >> Personally, I'ld think that's ripe for bugs.   If the contract is that
> >> ret != amount is the "error" case, then don't return -1 for an error
> >> *sometimes*.
> >
> > Hm, but isn't that how write() works also? AFAIK (non-interruptible) write()
> > will return the number of bytes written, which may be less than the requested
> > number if there's not enough free space, or -1 in case of an error like
> > an invalid fd being passed.
> 
> Looking through the code, it appears as if all the write calls I've
> seen are checking ret != amount, so it's probably not as big a deal
> for PG as I fear...
> 
> But the subtle change in semantics (from system write ret != amount
> not necessarily a real error, hence no errno set) of pg_write ret !=
> amount only happening after a "real error" (errno should be set) is
> one that could yet lead to confusion.

I assume there is no TODO here.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +