Posted to issues@mesos.apache.org by "Benjamin Mahler (JIRA)" <ji...@apache.org> on 2015/10/12 02:08:05 UTC

[jira] [Updated] (MESOS-2768) SIGPIPE in process::run_in_event_loop()

     [ https://issues.apache.org/jira/browse/MESOS-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-2768:
-----------------------------------
    Environment: CentOS 5
    Component/s: libprocess

Ok, so having ruled out double closes as the culprit, I spent some time digging into libev with [~chzhcn] late last week and found the bug. [~jieyu] helped me validate this by injecting sleeps into libev to be able to trigger the bug deterministically from the tests. Thanks guys!

Note that this issue manifests on older versions of Linux where the eventfd headers are not available, which is the case for CentOS 5. I sent an email to the libev mailing list, which confirmed the bug here: http://lists.schmorp.de/pipermail/libev/2015q4/thread.html

It seems there are a couple of options:

*(1)* Wait for the libev release that includes the fix; this may take some time.

*(2)* Update our patch file to include the fix. This can be done quickly as a stop-gap but will not apply to those that use an unbundled libev.

*(3)* Update libprocess to ignore SIGPIPE temporarily when using ev_async_send. This seems undesirable because it is a hot path and it would introduce yet another block that temporarily ignores SIGPIPE.

*(4)* Update libprocess to ignore SIGPIPE process-wide and document this so that users of libprocess understand that EPIPE must be handled. In retrospect this seems like the right long-term decision, since we've had to inject several SIGPIPE-ignoring blocks and OS X still has quirks. Not to mention that SIGPIPE is unnecessary and is meant primarily for shell-filter-like programs.
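
To make option (4) concrete, here is a minimal sketch of what ignoring SIGPIPE process-wide could look like, assuming a POSIX system. The helper names ignoreSigpipe() and writeToClosedPipe() are hypothetical and are not part of libprocess or libev. Once the signal is ignored, a write to a pipe with no reader fails with EPIPE instead of killing the process, and the caller has to handle that error.

```cpp
#include <cerrno>
#include <signal.h>
#include <unistd.h>

// Hypothetical helper (not a libprocess API): ignore SIGPIPE for the
// whole process. Returns true if the disposition was installed.
bool ignoreSigpipe()
{
  struct sigaction action = {};
  action.sa_handler = SIG_IGN;
  sigemptyset(&action.sa_mask);
  action.sa_flags = 0;
  return sigaction(SIGPIPE, &action, nullptr) == 0;
}

// Hypothetical demonstration: write one byte to a pipe whose read end
// has already been closed. Returns the errno of the failed write, 0 if
// the write unexpectedly succeeded, or -1 if the pipe could not be
// created. This must only be called after ignoreSigpipe(); with the
// default disposition the write would terminate the process.
int writeToClosedPipe()
{
  int fds[2];
  if (pipe(fds) != 0) {
    return -1;
  }

  close(fds[0]); // No reader is left on this pipe.

  const char byte = 1;
  ssize_t written = write(fds[1], &byte, 1);
  int error = (written == -1) ? errno : 0;

  close(fds[1]);
  return error;
}
```

Calling ignoreSigpipe() once at startup and then writeToClosedPipe() should yield EPIPE rather than a signal. The cost of this approach is that every write path in code using libprocess would need to check for EPIPE explicitly.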

The (untested) fix is these two diffs:
http://cvs.schmorp.de/libev/ev.c?r1=1.477&r2=1.478
http://cvs.schmorp.de/libev/ev_epoll.c?r1=1.68&r2=1.69

> SIGPIPE in process::run_in_event_loop()
> ---------------------------------------
>
>                 Key: MESOS-2768
>                 URL: https://issues.apache.org/jira/browse/MESOS-2768
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess
>    Affects Versions: 0.23.0
>         Environment: CentOS 5
>            Reporter: Yan Xu
>            Priority: Critical
>
> Observed in production.
> {noformat:title=slave log}
> I0526 12:17:48.027257 51633 slave.cpp:4077] Received a new estimation of the oversubscribable resources 
> W0526 12:17:48.027257 51636 logging.cpp:91] RAW: Received signal SIGPIPE; escalating to SIGABRT
> *** Aborted at 1432642668 (unix time) try "date -d @1432642668" if you are using GNU date ***
> PC: @     0x7fa58c23eb6d raise
> *** SIGABRT (@0xc9a5) received by PID 51621 (TID 0x7fa58224c940) from PID 51621; stack trace: ***
>     @     0x7fa58c23eca0 (unknown)
>     @     0x7fa58c23eb6d raise
>     @     0x7fa58cc19ba7 mesos::internal::logging::handler()
>     @     0x7fa58c23eca0 (unknown)
>     @     0x7fa58c23da2b __libc_write
>     @     0x7fa58cb57b6f evpipe_write.part.5
>     @     0x7fa58d245070 process::run_in_event_loop<>()
>     @     0x7fa58d2441ba process::EventLoop::delay()
>     @     0x7fa58d1c3c9c process::clock::scheduleTick()
>     @     0x7fa58d1c65b1 process::Clock::timer()
>     @     0x7fa58d23915a process::delay<>()
>     @     0x7fa58d23a740 process::ReaperProcess::wait()
>     @     0x7fa58d21261a process::ProcessManager::resume()
>     @     0x7fa58d2128dc process::schedule()
>     @     0x7fa58c23683d start_thread
>     @     0x7fa58ba28fcd clone
> {noformat}
> {noformat:title=gdb}
> (gdb) bt
> #0  0x00007fa58c23eb6d in raise () from /lib64/libpthread.so.0
> #1  0x00007fa58cc19ba7 in mesos::internal::logging::handler (signal=Unhandled dwarf expression opcode 0xf3
> ) at logging/logging.cpp:92
> #2  <signal handler called>
> #3  0x00007fa58c23da2b in write () from /lib64/libpthread.so.0
> #4  0x00007fa58cb57b6f in evpipe_write (loop=0x7fa58e1e79c0, flag=Unhandled dwarf expression opcode 0xfa
> ) at ev.c:2172
> #5  0x00007fa58d245070 in process::run_in_event_loop<Nothing>(const std::function<process::Future<Nothing>()> &) (f=Unhandled dwarf expression opcode 0xf3
> ) at src/libev.hpp:80
> #6  0x00007fa58d2441ba in process::EventLoop::delay(const Duration &, const std::function<void()> &) (duration=Unhandled dwarf expression opcode 0xf3
> ) at src/libev.cpp:106
> #7  0x00007fa58d1c3c9c in process::clock::scheduleTick (timers=Unhandled dwarf expression opcode 0xf3
> ) at src/clock.cpp:119
> #8  0x00007fa58d1c65b1 in process::Clock::timer(const Duration &, const std::function<void()> &) (duration=Unhandled dwarf expression opcode 0xf3
> ) at src/clock.cpp:254
> #9  0x00007fa58d23915a in process::delay<process::ReaperProcess> (duration=..., pid=Unhandled dwarf expression opcode 0xf3
> ) at ./include/process/delay.hpp:25
> #10 0x00007fa58d23a740 in process::ReaperProcess::wait (this=0x2056920) at src/reap.cpp:93
> #11 0x00007fa58d21261a in process::ProcessManager::resume (this=0x1db8d20, process=0x2056958) at src/process.cpp:2172
> #12 0x00007fa58d2128dc in process::schedule (arg=Unhandled dwarf expression opcode 0xf3
> ) at src/process.cpp:602
> #13 0x00007fa58c23683d in start_thread () from /lib64/libpthread.so.0
> #14 0x00007fa58ba28fcd in clone () from /lib64/libc.so.6
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)