Posted to dev@mesos.apache.org by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org> on 2012/05/07 23:12:51 UTC
[jira] [Commented] (MESOS-190) Slave seg fault when executor exited

    [ https://issues.apache.org/jira/browse/MESOS-190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270008#comment-13270008 ] 

jiraposter@reviews.apache.org commented on MESOS-190:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5057/
-----------------------------------------------------------

Review request for mesos, Benjamin Hindman and John Sirois.


Summary
-------

Fix for: https://issues.apache.org/jira/browse/MESOS-190

Also prevents the slave from retrying status updates to a dead framework indefinitely.


This addresses bug MESOS-190.
    https://issues.apache.org/jira/browse/MESOS-190


Diffs
-----

  src/slave/slave.cpp 09a8396 

Diff: https://reviews.apache.org/r/5057/diff


Testing
-------

Checked with the long-lived framework:

$ ./bin/mesos-master.sh
$ ./bin/mesos-slave.sh --master=localhost:5050
$ ./src/long-lived-framework localhost:5050


Thanks,

Vinod


                
> Slave seg fault when executor exited
> ------------------------------------
>
>                 Key: MESOS-190
>                 URL: https://issues.apache.org/jira/browse/MESOS-190
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Hindman
>            Assignee: Vinod Kone
>            Priority: Blocker
>
> When I restart/kill early or otherwise interrupt my framework from the
> client, I often segfault the slave.  I'm not sure if there is a bug in
> my executor, but it seems Mesos should be more resilient than this.
> Mesos subversion -r 1331158
> I know optimized builds can be tricky to debug, but in this case it
> does look like it was trying to dereference the invalid Task* address
> (note that task matches %rdx, and the crashed assembly code is trying
> to dereference %rdx).
> Any suggestions?
> (gdb) bt
> #0  mesos::internal::slave::Slave::executorExited (this=0x1305820,
>    frameworkId=..., executorId=..., status=0) at slave/slave.cpp:1400
> #1  0x00007f0cf310526d in __call<process::ProcessBase*&, 0, 1> (__args=...,
>    this=<optimized out>) at /usr/include/c++/4.6/tr1/functional:1153
> #2  operator()<process::ProcessBase*> (this=<optimized out>)
>    at /usr/include/c++/4.6/tr1/functional:1207
> #3  std::tr1::_Function_handler<void (process::ProcessBase*),
> std::tr1::_Bind<void (*(std::tr1::_Placeholder<1>,
> std::tr1::shared_ptr<std::tr1::function<void
> (mesos::internal::slave::Slave*)> >))(process::ProcessBase*,
> std::tr1::shared_ptr<std::tr1::function<void
> (mesos::internal::slave::Slave*)> >)> >::_M_invoke(std::tr1::_Any_data
> const&, process::ProcessBase*) (__functor=...,
>    __args#0=<optimized out>) at /usr/include/c++/4.6/tr1/functional:1684
> #4  0x00007f0cf32014a3 in std::tr1::function<void
> (process::ProcessBase*)>::operator()(process::ProcessBase*) const ()
>   from /home/ubuntu/cr/lib/libmesos-0.9.0.so
> #5  0x00007f0cf31f617f in
> process::ProcessBase::visit(process::DispatchEvent const&) () from
> /home/ubuntu/cr/lib/libmesos-0.9.0.so
> #6  0x00007f0cf31f885c in
> process::DispatchEvent::visit(process::EventVisitor*) const () from
> /home/ubuntu/cr/lib/libmesos-0.9.0.so
> #7  0x00007f0cf31f38cf in
> process::ProcessManager::resume(process::ProcessBase*) () from
> /home/ubuntu/cr/lib/libmesos-0.9.0.so
> #8  0x00007f0cf31ec783 in process::schedule(void*) ()
>   from /home/ubuntu/cr/lib/libmesos-0.9.0.so
> #9  0x00007f0cf26e5e9a in start_thread ()
>   from /lib/x86_64-linux-gnu/libpthread.so.0
> #10 0x00007f0cf24134bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #11 0x0000000000000000 in ?? ()
> (gdb) print task
> $1 = (mesos::internal::Task *) 0x3031406576616c73
> (gdb) info register
> rax            0x7f0cf3647cf0   139693599784176
> rbx            0x0      0
> rcx            0x7f0ce8000038   139693408649272
> rdx            0x3031406576616c73       3472627592201333875
> rsi            0x2      2
> rdi            0x7f0cf0613ac0   139693549238976
> rbp            0x7f0ce80034c8   0x7f0ce80034c8
> rsp            0x7f0cf0613c00   0x7f0cf0613c00
> r8             0x7f0ce80009b0   139693408651696
> r9             0x1      1
> r10            0x6      6
> r11            0x1      1
> r12            0x7f0ce8001ca0   139693408656544
> r13            0x7f0ce80056c0   139693408671424
> r14            0x7f0ce8006cc0   139693408677056
> r15            0x1305820        19945504
> rip            0x7f0cf30fecd5   0x7f0cf30fecd5
> <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> const&, mesos::ExecutorID const&, int)+533>
> eflags         0x10206  [ PF IF RF ]
> cs             0xe033   57395
> ss             0xe02b   57387
> ds             0x0      0
> es             0x0      0
> fs             0x0      0
> gs             0x0      0
> disassemble:
>   0x00007f0cf30fecb9 <+505>:   mov    %rax,0x20(%rsp)
>   0x00007f0cf30fecbe <+510>:   xor    %ebx,%ebx
>   0x00007f0cf30fecc0 <+512>:   cmp    0x20(%rsp),%r12
>   0x00007f0cf30fecc5 <+517>:   je     0x7f0cf30fed2e
> <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> const&, mesos::ExecutorID const&, int)+622>
>   0x00007f0cf30fecc7 <+519>:   test   %r12,%r12
>   0x00007f0cf30fecca <+522>:   je     0x7f0cf30ff27d
> <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> const&, mesos::ExecutorID const&, int)+1981>
>   0x00007f0cf30fecd0 <+528>:   mov    0x28(%r12),%rdx
> => 0x00007f0cf30fecd5 <+533>:   mov    0x70(%rdx),%edi
>   0x00007f0cf30fecd8 <+536>:   mov    %rdx,0x8(%rsp)
>   0x00007f0cf30fecdd <+541>:   callq  0x7f0cf3062220
> <_Z...@plt>
>   0x00007f0cf30fece2 <+546>:   test   %al,%al
>   0x00007f0cf30fece4 <+548>:   mov    0x8(%rsp),%rdx
>   0x00007f0cf30fece9 <+553>:   je     0x7f0cf30ff020
> <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> const&, mesos::ExecutorID const&, int)+1376>
>   0x00007f0cf30fecef <+559>:   test   %rbp,%rbp
>   0x00007f0cf30fecf2 <+562>:   je     0x7f0cf30ff244
> <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> const&, mesos::ExecutorID const&, int)+1---Type <return> to continue,
> or q <re
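
A note on the gdb output quoted above: the invalid Task* value 0x3031406576616c73 (also held in %rdx) is made up entirely of printable ASCII bytes. The following is a minimal C++ sketch for reading such a value as text. The helper name decodePointerBytes is hypothetical and not part of Mesos. It assumes the x86-64 little-endian byte order shown in the trace.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of Mesos): treats a 64-bit value as the
// eight bytes it occupies in memory on a little-endian machine such as
// x86-64, lowest-order byte first, and returns them as text.
// Non-printable bytes are replaced with '.'.
std::string decodePointerBytes(std::uint64_t value)
{
  std::string text;
  for (int i = 0; i < 8; ++i) {
    // Extract byte i, starting from the least significant byte.
    unsigned char byte =
      static_cast<unsigned char>((value >> (8 * i)) & 0xff);
    text += (byte >= 0x20 && byte < 0x7f) ? static_cast<char>(byte) : '.';
  }
  return text;
}
```

Calling decodePointerBytes(0x3031406576616c73ULL) returns "slave@10". That the bogus pointer reads as text suggests the memory holding the Task* had been overwritten with string data, for instance after being freed and reused. This is an inference from the dump, not something the report confirms.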
