Posted to dev@mesos.apache.org by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org> on 2012/05/07 23:12:51 UTC
[jira] [Commented] (MESOS-190) Slave seg fault when executor exited
[ https://issues.apache.org/jira/browse/MESOS-190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270008#comment-13270008 ]
jiraposter@reviews.apache.org commented on MESOS-190:
-----------------------------------------------------
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5057/
-----------------------------------------------------------
Review request for mesos, Benjamin Hindman and John Sirois.
Summary
-------
Fix for: https://issues.apache.org/jira/browse/MESOS-190
Also prevents the slave from retrying status updates to a dead framework indefinitely.
This addresses bug MESOS-190.
https://issues.apache.org/jira/browse/MESOS-190
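For readers without the diff at hand, the retry guard described in the summary could look roughly like the sketch below. It is not taken from the change to src/slave/slave.cpp; FrameworkRegistry and shouldRetryStatusUpdate are hypothetical names used only to illustrate the idea of dropping retries once the framework is gone.

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for the slave's bookkeeping of live frameworks.
// The real slave keeps richer per-framework state; this is an assumption.
struct FrameworkRegistry {
  std::set<std::string> active;

  bool isActive(const std::string& frameworkId) const {
    return active.count(frameworkId) > 0;
  }
};

// Decide whether a pending status update should be sent again.
// Retrying only while the framework is still known means updates for a
// dead framework are dropped instead of being retried forever.
bool shouldRetryStatusUpdate(const FrameworkRegistry& registry,
                             const std::string& frameworkId) {
  return registry.isActive(frameworkId);
}
```

With a registry that contains only "fw-1", shouldRetryStatusUpdate(registry, "fw-1") is true, and it turns false as soon as "fw-1" is erased from the set.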
Diffs
-----
src/slave/slave.cpp 09a8396
Diff: https://reviews.apache.org/r/5057/diff
Testing
-------
Checked with the long-lived framework:
$ ./bin/mesos-master.sh
$ ./bin/mesos-slave.sh --master=localhost:5050
$ ./src/long-lived-framework localhost:5050
Thanks,
Vinod
> Slave seg fault when executor exited
> ------------------------------------
>
> Key: MESOS-190
> URL: https://issues.apache.org/jira/browse/MESOS-190
> Project: Mesos
> Issue Type: Bug
> Reporter: Benjamin Hindman
> Assignee: Vinod Kone
> Priority: Blocker
>
> When I restart/kill early or otherwise interrupt my framework from the
> client, I often segfault the slave. I'm not sure if there is a bug in
> my executor, but it seems Mesos should be more resilient than this.
> Mesos subversion -r 1331158
> I know optimized builds can be tricky to debug, but in this case it
> does look like it was trying to dereference the invalid Task* address
> (note that task matches %rdx, and the crashed assembly code is trying
> to dereference %rdx).
> Any suggestions?
> (gdb) bt
> #0 mesos::internal::slave::Slave::executorExited (this=0x1305820,
> frameworkId=..., executorId=..., status=0) at slave/slave.cpp:1400
> #1 0x00007f0cf310526d in __call<process::ProcessBase*&, 0, 1> (__args=...,
> this=<optimized out>) at /usr/include/c++/4.6/tr1/functional:1153
> #2 operator()<process::ProcessBase*> (this=<optimized out>)
> at /usr/include/c++/4.6/tr1/functional:1207
> #3 std::tr1::_Function_handler<void (process::ProcessBase*),
> std::tr1::_Bind<void (*(std::tr1::_Placeholder<1>,
> std::tr1::shared_ptr<std::tr1::function<void
> (mesos::internal::slave::Slave*)> >))(process::ProcessBase*,
> std::tr1::shared_ptr<std::tr1::function<void
> (mesos::internal::slave::Slave*)> >)> >::_M_invoke(std::tr1::_Any_data
> const&, process::ProcessBase*) (__functor=...,
> __args#0=<optimized out>) at /usr/include/c++/4.6/tr1/functional:1684
> #4 0x00007f0cf32014a3 in std::tr1::function<void
> (process::ProcessBase*)>::operator()(process::ProcessBase*) const ()
> from /home/ubuntu/cr/lib/libmesos-0.9.0.so
> #5 0x00007f0cf31f617f in
> process::ProcessBase::visit(process::DispatchEvent const&) () from
> /home/ubuntu/cr/lib/libmesos-0.9.0.so
> #6 0x00007f0cf31f885c in
> process::DispatchEvent::visit(process::EventVisitor*) const () from
> /home/ubuntu/cr/lib/libmesos-0.9.0.so
> #7 0x00007f0cf31f38cf in
> process::ProcessManager::resume(process::ProcessBase*) () from
> /home/ubuntu/cr/lib/libmesos-0.9.0.so
> #8 0x00007f0cf31ec783 in process::schedule(void*) ()
> from /home/ubuntu/cr/lib/libmesos-0.9.0.so
> #9 0x00007f0cf26e5e9a in start_thread ()
> from /lib/x86_64-linux-gnu/libpthread.so.0
> #10 0x00007f0cf24134bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #11 0x0000000000000000 in ?? ()
> (gdb) print task
> $1 = (mesos::internal::Task *) 0x3031406576616c73
> (gdb) info register
> rax 0x7f0cf3647cf0 139693599784176
> rbx 0x0 0
> rcx 0x7f0ce8000038 139693408649272
> rdx 0x3031406576616c73 3472627592201333875
> rsi 0x2 2
> rdi 0x7f0cf0613ac0 139693549238976
> rbp 0x7f0ce80034c8 0x7f0ce80034c8
> rsp 0x7f0cf0613c00 0x7f0cf0613c00
> r8 0x7f0ce80009b0 139693408651696
> r9 0x1 1
> r10 0x6 6
> r11 0x1 1
> r12 0x7f0ce8001ca0 139693408656544
> r13 0x7f0ce80056c0 139693408671424
> r14 0x7f0ce8006cc0 139693408677056
> r15 0x1305820 19945504
> rip 0x7f0cf30fecd5 0x7f0cf30fecd5
> <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> const&, mesos::ExecutorID const&, int)+533>
> eflags 0x10206 [ PF IF RF ]
> cs 0xe033 57395
> ss 0xe02b 57387
> ds 0x0 0
> es 0x0 0
> fs 0x0 0
> gs 0x0 0
> disassemble:
> 0x00007f0cf30fecb9 <+505>: mov %rax,0x20(%rsp)
> 0x00007f0cf30fecbe <+510>: xor %ebx,%ebx
> 0x00007f0cf30fecc0 <+512>: cmp 0x20(%rsp),%r12
> 0x00007f0cf30fecc5 <+517>: je 0x7f0cf30fed2e
> <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> const&, mesos::ExecutorID const&, int)+622>
> 0x00007f0cf30fecc7 <+519>: test %r12,%r12
> 0x00007f0cf30fecca <+522>: je 0x7f0cf30ff27d
> <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> const&, mesos::ExecutorID const&, int)+1981>
> 0x00007f0cf30fecd0 <+528>: mov 0x28(%r12),%rdx
> => 0x00007f0cf30fecd5 <+533>: mov 0x70(%rdx),%edi
> 0x00007f0cf30fecd8 <+536>: mov %rdx,0x8(%rsp)
> 0x00007f0cf30fecdd <+541>: callq 0x7f0cf3062220
> <_Z...@plt>
> 0x00007f0cf30fece2 <+546>: test %al,%al
> 0x00007f0cf30fece4 <+548>: mov 0x8(%rsp),%rdx
> 0x00007f0cf30fece9 <+553>: je 0x7f0cf30ff020
> <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> const&, mesos::ExecutorID const&, int)+1376>
> 0x00007f0cf30fecef <+559>: test %rbp,%rbp
> 0x00007f0cf30fecf2 <+562>: je 0x7f0cf30ff244
> <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> const&, mesos::ExecutorID const&, int)+1---Type <return> to continue,
> or q <re
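A side note on the bogus pointer in the quoted trace: the Task* value 0x3031406576616c73 consists entirely of printable ASCII bytes. Read in x86-64 little-endian memory order, they spell "slave@10". That suggests, but does not prove, that the slot held string data (for example, freed memory reused for a string) rather than a random address. A minimal sketch of the decoding follows; bytesAsText is a hypothetical helper, not part of Mesos.

```cpp
#include <cstdint>
#include <string>

// Render the eight bytes of a 64-bit value in the order they sit in memory
// on a little-endian machine (least significant byte first).
// Bytes outside the printable ASCII range are shown as '.'.
std::string bytesAsText(std::uint64_t value) {
  std::string text;
  for (int i = 0; i < 8; ++i) {
    unsigned char byte =
        static_cast<unsigned char>((value >> (8 * i)) & 0xff);
    text += (byte >= 0x20 && byte < 0x7f) ? static_cast<char>(byte) : '.';
  }
  return text;
}
```

Calling bytesAsText(0x3031406576616c73ULL) returns "slave@10". A pointer that decodes to readable text like this is a common sign of a use-after-free or an overwritten container entry.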
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira