Posted to dev@mesos.apache.org by Benjamin Mahler <be...@gmail.com> on 2015/11/06 03:27:51 UTC

Re: [jira] [Comment Edited] (MESOS-2079) IO.Write test is flaky on OS X 10.10.

Just want to surface this up to the dev@ thread to raise some awareness.
Recently with the SIGPIPE bug from libev [1], we've revisited whether it
makes sense to continue down the path of leaving SIGPIPE unblocked and
trying to handle it case by case.

We originally wanted users of libprocess to decide on their own whether
they want to ignore SIGPIPE. However, we'd like to reconsider:

(a) The amount of code that is needed to work around SIGPIPE is
substantial, especially because on OS X SIGPIPE appears to not be delivered
synchronously [2]. Also, it is not possible to create pipes that don't
surface SIGPIPE (unlike sockets), so in order to safely write to a pipe we
need to wrap write() calls with signal suppression blocks (which we don't
do in general!). You can get a sense of the code from [3] and [4].

(b) SIGPIPE seems to be more of a legacy mechanism to shut down a set of
piped programs, and the general recommendation seems to be to not bother
with it and ignore it. Programs can handle EPIPE as they would any other
error returned by write().
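
To give a rough idea of what a "signal suppression block" around write()
involves, here is a minimal sketch. It is not the code from [3] or [4]:
writeSuppressingSigpipe is a hypothetical name chosen for illustration, and
the approach assumes that SIGPIPE is generated synchronously for the writing
thread, which is the very assumption that appears not to hold on OS X.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <signal.h>
#include <unistd.h>

// Hypothetical helper (not part of libprocess or stout): write to `fd`
// without letting a SIGPIPE caused by this write terminate the process.
// Returns whatever write() returned; errno is left as write() set it
// (EPIPE when the read end of the pipe has been closed).
ssize_t writeSuppressingSigpipe(int fd, const void* data, size_t size)
{
  assert(data != nullptr || size == 0);

  sigset_t sigpipe;
  sigemptyset(&sigpipe);
  sigaddset(&sigpipe, SIGPIPE);

  // If a SIGPIPE is already pending, it was not caused by us and must
  // not be swallowed below.
  sigset_t pending;
  sigemptyset(&pending);
  sigpending(&pending);
  const bool alreadyPending = sigismember(&pending, SIGPIPE) == 1;

  // Block SIGPIPE for the calling thread only.
  sigset_t previous;
  pthread_sigmask(SIG_BLOCK, &sigpipe, &previous);

  const ssize_t result = ::write(fd, data, size);
  const int error = errno;

  if (result == -1 && error == EPIPE && !alreadyPending) {
    // Consume the SIGPIPE our write generated, if one is pending. The
    // check avoids blocking in sigwait() when SIGPIPE is being ignored.
    sigemptyset(&pending);
    sigpending(&pending);
    if (sigismember(&pending, SIGPIPE) == 1) {
      int signo = 0;
      sigwait(&sigpipe, &signo);
    }
  }

  // Restore the caller's signal mask and the errno from write().
  pthread_sigmask(SIG_SETMASK, &previous, nullptr);
  errno = error;
  return result;
}
```

Every pipe write would need to go through something like this, which is the
overhead being weighed against simply ignoring SIGPIPE process-wide.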

Would love to hear if there are any concerns. I will be glad to shepherd
James' changes here.

[1] https://issues.apache.org/jira/browse/MESOS-2768
[2] https://issues.apache.org/jira/browse/MESOS-2079
[3] https://reviews.apache.org/r/39940/diff/1#index_header
[4] https://github.com/apache/mesos/blob/0.25.0/3rdparty/libprocess/3rdparty/stout/include/stout/os/posix/signals.hpp#L101

On Wed, Nov 4, 2015 at 9:20 AM, James Peach (JIRA) <ji...@apache.org> wrote:

>
>     [ https://issues.apache.org/jira/browse/MESOS-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989947#comment-14989947 ]
>
> James Peach edited comment on MESOS-2079 at 11/4/15 5:19 PM:
> -------------------------------------------------------------
>
> These patches globally ignore {{SIGPIPE}} during libprocess initialization,
> document {{SIGPIPE}} behavior a bit more, and remove various signal
> manipulations that were formerly necessary for disabling {{SIGPIPE}}
> delivery.
>
> https://reviews.apache.org/r/39938/
> https://reviews.apache.org/r/39940/
> https://reviews.apache.org/r/39941/
>
>
>
> was (Author: jamespeach):
> https://reviews.apache.org/r/39938/
> https://reviews.apache.org/r/39940/
> https://reviews.apache.org/r/39941/
>
>
> > IO.Write test is flaky on OS X 10.10.
> > -------------------------------------
> >
> >                 Key: MESOS-2079
> >                 URL: https://issues.apache.org/jira/browse/MESOS-2079
> >             Project: Mesos
> >          Issue Type: Task
> >          Components: libprocess, technical debt, test
> >         Environment: OS X 10.10
> > {noformat}
> > $ clang++ --version
> > Apple LLVM version 6.0 (clang-600.0.54) (based on LLVM 3.5svn)
> > Target: x86_64-apple-darwin14.0.0
> > Thread model: posix
> > {noformat}
> >            Reporter: Benjamin Mahler
> >            Assignee: James Peach
> >              Labels: flaky
> >
> > [~benjaminhindman]: If I recall correctly, this is related to MESOS-1658.
> > Unfortunately, we don't have a stacktrace for SIGPIPE currently:
> > {noformat}
> > [ RUN      ] IO.Write
> > make[5]: *** [check-local] Broken pipe: 13
> > {noformat}
> > Running in gdb, seems to always occur here:
> > {code}
> > Program received signal SIGPIPE, Broken pipe.
> > [Switching to process 56827 thread 0x60b]
> > 0x00007fff9a011132 in __psynch_cvwait ()
> > (gdb) where
> > #0  0x00007fff9a011132 in __psynch_cvwait ()
> > #1  0x00007fff903e7ea0 in _pthread_cond_wait ()
> > #2  0x000000010062f27c in Gate::arrive (this=0x101908a10, old=14780) at gate.hpp:82
> > #3  0x0000000100600888 in process::schedule (arg=0x0) at src/process.cpp:1373
> > #4  0x00007fff903e72fc in _pthread_body ()
> > #5  0x00007fff903e7279 in _pthread_start ()
> > #6  0x00007fff903e54b1 in thread_start ()
> > {code}
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>

Re: [jira] [Comment Edited] (MESOS-2079) IO.Write test is flaky on OS X 10.10.

Posted by Benjamin Mahler <be...@gmail.com>.

Yeah, I chatted with Alexander in person to clarify what the actual
semantics of SIGPIPE are. We should be good to go here. Sorry for the delay;
I will get back to these patches.

On Fri, Nov 20, 2015 at 8:52 AM, James Peach <jo...@gmail.com> wrote:

>
> > On Nov 11, 2015, at 12:44 AM, Alexander Rojas <al...@mesosphere.io> wrote:
> >
> > What I meant is that we may not care about SIGPIPE (which tells us a pipe
> > was broken) because we will be notified anyway when we try to write into
> > it (on the writing side), and we will get an EOF on the reading side.
> >
> > The only reason I could see us caring about SIGPIPE is if we want to know
> > that the pipe broke as soon as it happens.
>
> So it sounds like there is no objection to this change? Can we land these
> changes now?

Re: [jira] [Comment Edited] (MESOS-2079) IO.Write test is flaky on OS X 10.10.

Posted by James Peach <jo...@gmail.com>.

> On Nov 11, 2015, at 12:44 AM, Alexander Rojas <al...@mesosphere.io> wrote:
> 
> What I meant is that we may not care about SIGPIPE (which tells us a pipe was broken) because we will be notified anyway when we try to write into it (on the writing side), and we will get an EOF on the reading side.
>
> The only reason I could see us caring about SIGPIPE is if we want to know that the pipe broke as soon as it happens.

So it sounds like there is no objection to this change? Can we land these changes now?

Re: [jira] [Comment Edited] (MESOS-2079) IO.Write test is flaky on OS X 10.10.

Posted by Alexander Rojas <al...@mesosphere.io>.

What I meant is that we may not care about SIGPIPE (which tells us a pipe was broken) because we will be notified anyway when we try to write into it (on the writing side), and we will get an EOF on the reading side.
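
As a small illustration of the reading side: once every write end of a pipe has been closed, read() simply returns 0 and no signal is involved. The sketch below uses hypothetical names (Drained, drain) that do not come from libprocess.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>
#include <unistd.h>

// Hypothetical result type for illustration only.
struct Drained
{
  std::string data;  // Everything read before the stream ended.
  bool eof;          // True if read() returned 0, false on a read error.
};

// Read from `fd` until EOF or an error other than EINTR.
Drained drain(int fd)
{
  assert(fd >= 0);

  Drained result{std::string(), false};
  char buffer[4096];

  while (true) {
    const ssize_t n = ::read(fd, buffer, sizeof(buffer));
    if (n > 0) {
      result.data.append(buffer, static_cast<size_t>(n));
    } else if (n == 0) {
      // EOF: all write ends of the pipe are closed.
      result.eof = true;
      return result;
    } else if (errno != EINTR) {
      // A genuine read error; report what was read so far.
      return result;
    }
  }
}
```

A reader that writes "ping" into a pipe, closes the write end, and then calls drain() on the read end gets the data back with eof set, without any signal handling.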

The only reason I could see us caring about SIGPIPE is if we want to know that the pipe broke as soon as it happens.

> On 06 Nov 2015, at 19:10, Benjamin Mahler <be...@gmail.com> wrote:
> 
> To answer your questions:
> 
> We use pipes when we need to communicate across the process boundary after
> a fork. Look for Subprocess::IO::Pipe for examples. There is plenty of code
> using pipes.
> 
> Sockets aren't an issue as one can avoid SIGPIPE across OS X (SO_NOSIGPIPE)
> and Linux (MSG_NOSIGNAL).
> 
> I'm a bit confused by your comment about the timing of SIGPIPE, which seems
> to suggest that the raising of SIGPIPE is not tied to the bad write call.
> Why do you think this?

Re: [jira] [Comment Edited] (MESOS-2079) IO.Write test is flaky on OS X 10.10.

Posted by Benjamin Mahler <be...@gmail.com>.

To answer your questions:

We use pipes when we need to communicate across the process boundary after
a fork. Look for Subprocess::IO::Pipe for examples. There is plenty of code
using pipes.

Sockets aren't an issue, as one can avoid SIGPIPE on both OS X (the
SO_NOSIGPIPE socket option) and Linux (the MSG_NOSIGNAL send flag).
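
A rough sketch of how the two mechanisms can be combined behind one
interface follows. The helper names (disableSigpipe, sendWithoutSigpipe) are
hypothetical and are not taken from libprocess; the preprocessor checks
assume that each platform exposes its constant as a macro, which is the case
on Linux (MSG_NOSIGNAL) and OS X (SO_NOSIGPIPE).

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <sys/socket.h>
#include <sys/types.h>

// Hypothetical helper: on platforms that have SO_NOSIGPIPE (OS X, BSDs),
// mark the socket so that writes to it never raise SIGPIPE. A no-op that
// reports success on platforms without the option.
bool disableSigpipe(int fd)
{
#if defined(SO_NOSIGPIPE)
  int on = 1;
  return ::setsockopt(fd, SOL_SOCKET, SO_NOSIGPIPE, &on, sizeof(on)) == 0;
#else
  (void) fd;
  return true;
#endif
}

// Hypothetical helper: send on a socket without raising SIGPIPE. Where
// MSG_NOSIGNAL exists (Linux) it is passed per call; elsewhere this
// relies on disableSigpipe() having been called on the socket.
ssize_t sendWithoutSigpipe(int fd, const void* data, size_t size)
{
  assert(fd >= 0);
#if defined(MSG_NOSIGNAL)
  return ::send(fd, data, size, MSG_NOSIGNAL);
#else
  return ::send(fd, data, size, 0);
#endif
}
```

With either mechanism in place, sending to a peer that has gone away fails
with -1 and errno set to EPIPE. There is no equivalent knob for pipes, which
is why they are the problem here.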

I'm a bit confused by your comment about the timing of SIGPIPE, which seems
to suggest that the raising of SIGPIPE is not tied to the bad write call.
Why do you think this?

On Fri, Nov 6, 2015 at 4:37 AM, Alexander Rojas <al...@mesosphere.io>
wrote:

> I have multiple questions here:
>
> 1. Why do we use pipes at all? Or is SIGPIPE also raised when writing to
> sockets? Which leads me to:
> 2. Do we use them only in test cases or is there something actively using
> pipes?
>
> SIGPIPE itself is a weird signal: a failed call to `write` returns -1 and
> sets `errno` to `EPIPE`, so there are two ways to deal with errors when the
> reading process is no longer reading. One is handling the return value and
> `errno` (which usually means ignoring SIGPIPE); the other is ignoring the
> return value and handling SIGPIPE. The difference is that SIGPIPE is raised
> as soon as the OS realizes the pipe is broken, while the error on the write
> happens when you actually try to write to the pipe.
>
> All in all, I prefer to ignore the signal and deal with the return value
> of `write`.

Re: [jira] [Comment Edited] (MESOS-2079) IO.Write test is flaky on OS X 10.10.

Posted by Alexander Rojas <al...@mesosphere.io>.

I have multiple questions here:

1. Why do we use pipes at all? Or is SIGPIPE also raised when writing to sockets? Which leads me to:
2. Do we use them only in test cases or is there something actively using pipes?

SIGPIPE itself is a weird signal: a failed call to `write` returns -1 and sets `errno` to `EPIPE`, so there are two ways to deal with errors when the reading process is no longer reading. One is handling the return value and `errno` (which usually means ignoring SIGPIPE); the other is ignoring the return value and handling SIGPIPE. The difference is that SIGPIPE is raised as soon as the OS realizes the pipe is broken, while the error on the write happens when you actually try to write to the pipe.

All in all, I prefer to ignore the signal and deal with the return value of `write`.
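
A minimal sketch of that preference, assuming SIGPIPE is ignored once at startup and every write checks its result. The names ignoreSigpipe, writeAll, and WriteStatus are made up for this example and are not libprocess APIs.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <signal.h>
#include <unistd.h>

// Ignore SIGPIPE for the whole process. Signal dispositions are shared
// by all threads, so this only needs to happen once during startup.
bool ignoreSigpipe()
{
  struct sigaction action = {};
  sigemptyset(&action.sa_mask);
  action.sa_flags = 0;
  action.sa_handler = SIG_IGN;
  return ::sigaction(SIGPIPE, &action, nullptr) == 0;
}

// Hypothetical status type for illustration only.
enum class WriteStatus { DONE, PEER_CLOSED, FAILED };

// Write the whole buffer, retrying on short writes and EINTR. With
// SIGPIPE ignored, a closed read end shows up as errno == EPIPE.
WriteStatus writeAll(int fd, const char* data, size_t size)
{
  assert(data != nullptr || size == 0);

  size_t written = 0;
  while (written < size) {
    const ssize_t n = ::write(fd, data + written, size - written);
    if (n >= 0) {
      written += static_cast<size_t>(n);
    } else if (errno != EINTR) {
      return errno == EPIPE ? WriteStatus::PEER_CLOSED : WriteStatus::FAILED;
    }
  }
  return WriteStatus::DONE;
}
```

After calling ignoreSigpipe(), writing to a pipe whose read end is closed returns WriteStatus::PEER_CLOSED instead of terminating the process.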

> On 06 Nov 2015, at 03:27, Benjamin Mahler <be...@gmail.com> wrote:
> 
> Just want to surface this up to the dev@ thread to raise some awareness.
> Recently with the SIGPIPE bug from libev [1], we've revisited whether it
> makes sense to continue down the path of leaving SIGPIPE unblocked and
> trying to handle it case by case.
> 
> We originally wanted users of libprocess to decide on their own whether
> they want to ignore SIGPIPE. However, we'd like to reconsider:
> 
> (a) The amount of code that is needed to work around SIGPIPE is
> substantial, especially because on OS X SIGPIPE appears to not be delivered
> synchronously [2]. Also, it is not possible to create pipes that don't
> surface SIGPIPE (unlike sockets), so in order to safely write to a pipe we
> need to wrap write() calls with signal suppression blocks (which we don't
> do in general!). You can get a sense of the code from [3] and [4].
> 
> (b) SIGPIPE seems to be more of a legacy mechanism to shut down a set of
> piped programs and the general recommendation seems to be to not bother
> with it and ignore it. Programs can handle EPIPE as they would any other
> error returned by write().
> 
> Would love to hear if there are any concerns. I will be glad to shepherd
> James' changes here.
> 
> [1] https://issues.apache.org/jira/browse/MESOS-2768
> [2] https://issues.apache.org/jira/browse/MESOS-2079
> [3] https://reviews.apache.org/r/39940/diff/1#index_header
> [4] https://github.com/apache/mesos/blob/0.25.0/3rdparty/libprocess/3rdparty/stout/include/stout/os/posix/signals.hpp#L101