Posted to issues@mesos.apache.org by "Till Toenshoff (JIRA)" <ji...@apache.org> on 2017/03/30 09:41:41 UTC

[jira] [Comment Edited] (MESOS-5748) Potential segfault in `link` and `send` when linking to a remote process

    [ https://issues.apache.org/jira/browse/MESOS-5748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15948759#comment-15948759 ] 

Till Toenshoff edited comment on MESOS-5748 at 3/30/17 9:41 AM:
----------------------------------------------------------------

The problem does not seem to be fixed for me, or it may have been reintroduced recently.

I hit this on macOS after roughly 100-150 repetitions in each of 3 runs.
{noformat}
$ ./3rdparty/libprocess/libprocess-tests --gtest_filter="ProcessRemoteLinkTest.RemoteLinkLeak" --gtest_repeat=-1 --gtest_break_on_failure
{noformat}


{noformat}
Repeating all tests (iteration 119) . . .

Note: Google Test filter = ProcessRemoteLinkTest.RemoteLinkLeak
[ RUN      ] ProcessRemoteLinkTest.RemoteLinkLeak
(libev) select: Invalid argument
*** Aborted at 1490865958 (unix time) try "date -d @1490865958" if you are using GNU date ***
PC: @     0x7fffb7621d42 __pthread_kill
*** SIGABRT (@0x7fffb7621d42) received by PID 59260 (TID 0x700009538000) stack trace: ***
    @     0x7fffb7702b3a _sigtramp
    @     0x7faf310fc080 (unknown)
    @     0x7fffb7587420 abort
    @        0x109a6b51d ev_syserr
    @        0x109a6be16 select_poll
    @        0x109a67635 ev_run
    @        0x109a21f2b ev_loop()
    @        0x109a21e96 process::EventLoop::run()
    @        0x1099448bf _ZNSt3__114__thread_proxyINS_5tupleIJPFvvEEEEEEPvS5_
    @     0x7fffb770c9af _pthread_body
    @     0x7fffb770c8fb _pthread_start
    @     0x7fffb770c101 thread_start
Abort trap: 6
{noformat}

As the stack trace shows, I was testing this with a libev build.


was (Author: tillt):
The problem does not seem to be fixed for me, or it may have been reintroduced recently.

I hit this on macOS after roughly 100-150 repetitions in each of 3 runs.
{noformat}
$ ./3rdparty/libprocess/libprocess-tests --gtest_filter="ProcessRemoteLinkTest.RemoteLinkLeak" --gtest_repeat=-1 --gtest_break_on_failure
{noformat}


{noformat}
Repeating all tests (iteration 119) . . .

Note: Google Test filter = ProcessRemoteLinkTest.RemoteLinkLeak
[ RUN      ] ProcessRemoteLinkTest.RemoteLinkLeak
(libev) select: Invalid argument
*** Aborted at 1490865958 (unix time) try "date -d @1490865958" if you are using GNU date ***
PC: @     0x7fffb7621d42 __pthread_kill
*** SIGABRT (@0x7fffb7621d42) received by PID 59260 (TID 0x700009538000) stack trace: ***
    @     0x7fffb7702b3a _sigtramp
    @     0x7faf310fc080 (unknown)
    @     0x7fffb7587420 abort
    @        0x109a6b51d ev_syserr
    @        0x109a6be16 select_poll
    @        0x109a67635 ev_run
    @        0x109a21f2b ev_loop()
    @        0x109a21e96 process::EventLoop::run()
    @        0x1099448bf _ZNSt3__114__thread_proxyINS_5tupleIJPFvvEEEEEEPvS5_
    @     0x7fffb770c9af _pthread_body
    @     0x7fffb770c8fb _pthread_start
    @     0x7fffb770c101 thread_start
make[6]: *** [check-local] Abort trap: 6
make[5]: *** [check-am] Error 2
make[4]: *** [check-recursive] Error 1
make[3]: *** [check] Error 2
make[2]: *** [check-recursive] Error 1
make[1]: *** [check] Error 2
make: *** [check-recursive] Error 1
{noformat}

As the stack trace shows, I was testing this with a libev build.

> Potential segfault in `link` and `send` when linking to a remote process
> ------------------------------------------------------------------------
>
>                 Key: MESOS-5748
>                 URL: https://issues.apache.org/jira/browse/MESOS-5748
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess
>    Affects Versions: 0.22.0, 0.23.0, 0.24.0, 0.25.0, 0.26.0, 0.27.0, 0.28.0
>            Reporter: Joseph Wu
>            Assignee: Joseph Wu
>              Labels: libprocess, mesosphere
>             Fix For: 0.27.4, 0.28.3, 1.0.0
>
>
> There is a race in the SocketManager, between a remote {{link}} and disconnection of the underlying socket.
> We potentially segfault here: https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1512
> {{\*socket}} dereferences the shared pointer underpinning the {{Socket*}} object.  However, the code above this line actually has ownership of the pointer:
> https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1494-L1499
> If the socket dies during the link, {{ignore_recv_data}} may delete the Socket out from under {{link}}:
> https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1399-L1411
> ----
> The same race exists for {{send}}.
> This race was discovered while running a new test in repetition:
> https://reviews.apache.org/r/49175/
> On OSX, I hit the race consistently every 500-800 repetitions:
> {code}
> 3rdparty/libprocess/libprocess-tests --gtest_filter="ProcessRemoteLinkTest.RemoteLink"  --gtest_break_on_failure --gtest_repeat=1000
> {code}
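
The race described in the issue, where one code path deletes a socket while another is still about to dereference it, can be illustrated with a small sketch. The C++ below is a hypothetical, single-threaded model of that interleaving: {{FakeSocket}}, {{FakeSocketManager}}, {{acquire}} and {{link_then_disconnect}} are names invented for illustration and are not libprocess APIs. The pattern it shows (copying a {{std::shared_ptr}} while holding the lock, so the caller owns a reference) is one common way to avoid this class of use-after-free, and is an assumption here rather than a description of the fix applied in Mesos.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>

// Hypothetical stand-in for a connection object; not the libprocess Socket.
struct FakeSocket {
  explicit FakeSocket(int fd_) : fd(fd_) {}
  int fd;
};

// Hypothetical, heavily simplified socket table guarded by a mutex.
class FakeSocketManager {
public:
  void add(int fd) {
    std::lock_guard<std::mutex> lock(mutex_);
    sockets_[fd] = std::make_shared<FakeSocket>(fd);
  }

  // Models the cleanup path: drops the table's reference to the socket.
  void disconnect(int fd) {
    std::lock_guard<std::mutex> lock(mutex_);
    sockets_.erase(fd);
  }

  // Returns an owning copy taken under the lock, or nullptr if the
  // socket has already been removed from the table.
  std::shared_ptr<FakeSocket> acquire(int fd) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = sockets_.find(fd);
    if (it == sockets_.end()) {
      return nullptr;
    }
    return it->second;
  }

private:
  std::mutex mutex_;
  std::map<int, std::shared_ptr<FakeSocket>> sockets_;
};

// Replays the problematic ordering sequentially: the "link" side takes
// its reference, the "disconnect" side then erases the table entry, and
// the "link" side finally uses the socket.
int link_then_disconnect() {
  FakeSocketManager manager;
  manager.add(7);

  std::shared_ptr<FakeSocket> socket = manager.acquire(7);
  manager.disconnect(7);

  // The local copy keeps the object alive even though the table no
  // longer references it.
  return socket ? socket->fd : -1;
}
```

In this model {{link_then_disconnect()}} returns 7, because the caller's copy keeps the object alive after the table entry is erased. Had the caller kept only a raw pointer obtained from the table, the same ordering would read freed memory.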


