Posted to dev@mesos.apache.org by "Benjamin Hindman (JIRA)" <ji...@apache.org> on 2012/05/07 19:00:51 UTC

[jira] [Created] (MESOS-190) Slave seg fault when executor exited

Benjamin Hindman created MESOS-190:
--------------------------------------

             Summary: Slave seg fault when executor exited
                 Key: MESOS-190
                 URL: https://issues.apache.org/jira/browse/MESOS-190
             Project: Mesos
          Issue Type: Bug
            Reporter: Benjamin Hindman
            Assignee: Vinod Kone
            Priority: Blocker


When I restart, kill early, or otherwise interrupt my framework from the
client, I often segfault the slave.  I'm not sure whether there is a bug in
my executor, but it seems Mesos should be more resilient than this.

Mesos subversion -r 1331158

I know optimized builds can be tricky to debug, but in this case it
does look like the slave was trying to dereference an invalid Task* address
(note that task matches %rdx, and the faulting instruction dereferences
%rdx).

Any suggestions?

(gdb) bt
#0  mesos::internal::slave::Slave::executorExited (this=0x1305820,
   frameworkId=..., executorId=..., status=0) at slave/slave.cpp:1400
#1  0x00007f0cf310526d in __call<process::ProcessBase*&, 0, 1> (__args=...,
   this=<optimized out>) at /usr/include/c++/4.6/tr1/functional:1153
#2  operator()<process::ProcessBase*> (this=<optimized out>)
   at /usr/include/c++/4.6/tr1/functional:1207
#3  std::tr1::_Function_handler<void (process::ProcessBase*),
std::tr1::_Bind<void (*(std::tr1::_Placeholder<1>,
std::tr1::shared_ptr<std::tr1::function<void
(mesos::internal::slave::Slave*)> >))(process::ProcessBase*,
std::tr1::shared_ptr<std::tr1::function<void
(mesos::internal::slave::Slave*)> >)> >::_M_invoke(std::tr1::_Any_data
const&, process::ProcessBase*) (__functor=...,
   __args#0=<optimized out>) at /usr/include/c++/4.6/tr1/functional:1684
#4  0x00007f0cf32014a3 in std::tr1::function<void
(process::ProcessBase*)>::operator()(process::ProcessBase*) const ()
  from /home/ubuntu/cr/lib/libmesos-0.9.0.so
#5  0x00007f0cf31f617f in
process::ProcessBase::visit(process::DispatchEvent const&) () from
/home/ubuntu/cr/lib/libmesos-0.9.0.so
#6  0x00007f0cf31f885c in
process::DispatchEvent::visit(process::EventVisitor*) const () from
/home/ubuntu/cr/lib/libmesos-0.9.0.so
#7  0x00007f0cf31f38cf in
process::ProcessManager::resume(process::ProcessBase*) () from
/home/ubuntu/cr/lib/libmesos-0.9.0.so
#8  0x00007f0cf31ec783 in process::schedule(void*) ()
  from /home/ubuntu/cr/lib/libmesos-0.9.0.so
#9  0x00007f0cf26e5e9a in start_thread ()
  from /lib/x86_64-linux-gnu/libpthread.so.0
#10 0x00007f0cf24134bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x0000000000000000 in ?? ()
(gdb) print task
$1 = (mesos::internal::Task *) 0x3031406576616c73
(gdb) info register
rax            0x7f0cf3647cf0   139693599784176
rbx            0x0      0
rcx            0x7f0ce8000038   139693408649272
rdx            0x3031406576616c73       3472627592201333875
rsi            0x2      2
rdi            0x7f0cf0613ac0   139693549238976
rbp            0x7f0ce80034c8   0x7f0ce80034c8
rsp            0x7f0cf0613c00   0x7f0cf0613c00
r8             0x7f0ce80009b0   139693408651696
r9             0x1      1
r10            0x6      6
r11            0x1      1
r12            0x7f0ce8001ca0   139693408656544
r13            0x7f0ce80056c0   139693408671424
r14            0x7f0ce8006cc0   139693408677056
r15            0x1305820        19945504
rip            0x7f0cf30fecd5   0x7f0cf30fecd5
<mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
const&, mesos::ExecutorID const&, int)+533>
eflags         0x10206  [ PF IF RF ]
cs             0xe033   57395
ss             0xe02b   57387
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0
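
A note on the register dump above: the value shared by task and %rdx is not a plausible heap address, and read as eight little-endian bytes it is printable ASCII. The following is a minimal Python sketch, not part of the original report; the helper name pointer_as_ascii is made up for illustration.

```python
def pointer_as_ascii(value: int, width: int = 8) -> str:
    """Interpret an integer as `width` little-endian bytes and decode as ASCII.

    x86-64 stores the least significant byte first, so a pointer slot that has
    been overwritten by string data reads back in string order this way.
    """
    return value.to_bytes(width, "little").decode("ascii")


# Value of `task` / %rdx copied from the gdb output above.
bad_task = 0x3031406576616C73
print(pointer_as_ascii(bad_task))  # prints "slave@10"
```

The decoded text looks like the start of a longer string, possibly a libprocess PID of the form slave@10.x.x.x. That would be consistent with the memory behind the Task pointer having been freed and reused, although the dump alone does not prove it.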

disassemble:

 0x00007f0cf30fecb9 <+505>:    mov    %rax,0x20(%rsp)
  0x00007f0cf30fecbe <+510>:   xor    %ebx,%ebx
  0x00007f0cf30fecc0 <+512>:   cmp    0x20(%rsp),%r12
  0x00007f0cf30fecc5 <+517>:   je     0x7f0cf30fed2e
<mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
const&, mesos::ExecutorID const&, int)+622>
  0x00007f0cf30fecc7 <+519>:   test   %r12,%r12
  0x00007f0cf30fecca <+522>:   je     0x7f0cf30ff27d
<mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
const&, mesos::ExecutorID const&, int)+1981>
  0x00007f0cf30fecd0 <+528>:   mov    0x28(%r12),%rdx
=> 0x00007f0cf30fecd5 <+533>:   mov    0x70(%rdx),%edi
  0x00007f0cf30fecd8 <+536>:   mov    %rdx,0x8(%rsp)
  0x00007f0cf30fecdd <+541>:   callq  0x7f0cf3062220
<_Z...@plt>
  0x00007f0cf30fece2 <+546>:   test   %al,%al
  0x00007f0cf30fece4 <+548>:   mov    0x8(%rsp),%rdx
  0x00007f0cf30fece9 <+553>:   je     0x7f0cf30ff020
<mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
const&, mesos::ExecutorID const&, int)+1376>
  0x00007f0cf30fecef <+559>:   test   %rbp,%rbp
  0x00007f0cf30fecf2 <+562>:   je     0x7f0cf30ff244
<mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
const&, mesos::ExecutorID const&, int)+1---Type <return> to continue,
or q <re
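
The instruction marked => reads four bytes from 0x70(%rdx). As a rough check of why that faults, here is a small Python sketch that tests whether the effective address is canonical. It assumes 48-bit virtual addressing, which is an assumption about the machine and not something stated in the report; the function name is illustrative.

```python
def is_canonical_x86_64(addr: int, va_bits: int = 48) -> bool:
    """Return True if `addr` is a canonical x86-64 virtual address.

    With `va_bits`-bit virtual addresses, bits 63 down to (va_bits - 1) must
    all be equal: either all zeros or all ones.
    """
    top = addr >> (va_bits - 1)                # bits 63..47 when va_bits == 48
    all_ones = (1 << (64 - va_bits + 1)) - 1   # 17 one-bits when va_bits == 48
    return top == 0 or top == all_ones


rdx = 0x3031406576616C73   # %rdx from `info register`
effective = rdx + 0x70     # address read by `mov 0x70(%rdx),%edi`

print(hex(effective))                       # 0x3031406576616ce3
print(is_canonical_x86_64(effective))       # False
print(is_canonical_x86_64(0x7F0CE8001CA0))  # True (%r12, an ordinary heap address)
```

Under that assumption the load targets a non-canonical address. The CPU raises a general-protection fault for such an access, which Linux delivers to the process as SIGSEGV.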

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Closed] (MESOS-190) Slave seg fault when executor exited

Posted by "Benjamin Hindman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MESOS-190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Hindman closed MESOS-190.
----------------------------------

    

[jira] [Commented] (MESOS-190) Slave seg fault when executor exited

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270008#comment-13270008 ] 

jiraposter@reviews.apache.org commented on MESOS-190:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5057/
-----------------------------------------------------------

Review request for mesos, Benjamin Hindman and John Sirois.


Summary
-------

Fix for: https://issues.apache.org/jira/browse/MESOS-190

Also prevents slave from infinitely re-trying status updates to a dead framework.


This addresses bug MESOS-190.
    https://issues.apache.org/jira/browse/MESOS-190


Diffs
-----

  src/slave/slave.cpp 09a8396 

Diff: https://reviews.apache.org/r/5057/diff


Testing
-------

Checked with long lived framework.

$ ./bin/mesos-master.sh
$ ./bin/mesos-slave.sh --master=localhost:5050
$ ./src/long-lived-framework localhost:5050


Thanks,

Vinod


                

[jira] [Commented] (MESOS-190) Slave seg fault when executor exited

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270617#comment-13270617 ] 

jiraposter@reviews.apache.org commented on MESOS-190:
-----------------------------------------------------



(Updated 2012-05-08 17:09:42.129768)


Review request for mesos, Benjamin Hindman and John Sirois.


Changes
-------

Addressed John's comments. Added a test case.


Diffs (updated)
-----

  src/slave/slave.cpp 09a8396 
  src/tests/fault_tolerance_tests.cpp 6772daf 

Diff: https://reviews.apache.org/r/5057/diff



Thanks,

Vinod


                

[jira] [Commented] (MESOS-190) Slave seg fault when executor exited

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270614#comment-13270614 ] 

jiraposter@reviews.apache.org commented on MESOS-190:
-----------------------------------------------------



bq.  On 2012-05-07 21:50:01, John Sirois wrote:
bq.  > src/slave/slave.cpp, line 1487
bq.  > <https://reviews.apache.org/r/5057/diff/2/?file=107599#file107599line1487>
bq.  >
bq.  >     Is there a test that could be tweaked to ensure this is happening?  Presumably it wasn't before via executorExited?

Added a test.


bq.  On 2012-05-07 21:50:01, John Sirois wrote:
bq.  > src/slave/slave.cpp, line 1483
bq.  > <https://reviews.apache.org/r/5057/diff/2/?file=107599#file107599line1483>
bq.  >
bq.  >     Does this new api call still transition live tasks to LOST/FAILED?

This is a bit nuanced. When a framework is shut down, the slave sends a shutdown message to the executor. One of two things can then happen:

1) EXECUTOR_SHUTDOWN_TIMEOUT_SECONDS elapses before the isolation module reports the lost executor. The slave sends a TASK_LOST
   to the master, but the master drops it on the floor because the framework is dead.

2) The isolation module reports the lost executor before EXECUTOR_SHUTDOWN_TIMEOUT_SECONDS elapses. The slave doesn't send a TASK_LOST.

In either case, the master never sends the TASK_LOST to the dead framework, which is the right thing to do.
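
The two cases can be written out as a small decision function. This is only an illustrative Python model of the behaviour described above, not the code in slave.cpp; the names Outcome and shutdown_outcome, the parameters, and the example durations are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    slave_sends_task_lost: bool       # slave -> master
    master_forwards_task_lost: bool   # master -> framework


def shutdown_outcome(executor_lost_after_secs: float,
                     shutdown_timeout_secs: float) -> Outcome:
    """Model the two cases for a framework that has already been shut down.

    executor_lost_after_secs: hypothetical delay until the isolation module
        reports the executor as lost.
    shutdown_timeout_secs: stands in for EXECUTOR_SHUTDOWN_TIMEOUT_SECONDS.
    """
    # Case 1: the timeout fires first, so the slave reports TASK_LOST.
    # Case 2: the executor is reported lost first, so the slave stays quiet.
    # (A tie is treated as case 2 here; the thread does not say.)
    timeout_fired_first = shutdown_timeout_secs < executor_lost_after_secs
    # The framework is dead, so the master never forwards the update.
    return Outcome(slave_sends_task_lost=timeout_fired_first,
                   master_forwards_task_lost=False)


# Arbitrary example durations, in seconds.
print(shutdown_outcome(executor_lost_after_secs=10.0, shutdown_timeout_secs=3.0))
print(shutdown_outcome(executor_lost_after_secs=1.0, shutdown_timeout_secs=3.0))
```

Whichever event comes first, master_forwards_task_lost stays False, which is the invariant the reply relies on.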


This might be different once slave recovery is implemented, but the logic there for handling status updates is very different. In other words, this fix will
probably go away when we merge the slave recovery work.


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5057/#review7657
-----------------------------------------------------------


On 2012-05-07 21:11:34, Vinod Kone wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/5057/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-05-07 21:11:34)
bq.  
bq.  
bq.  Review request for mesos, Benjamin Hindman and John Sirois.
bq.  
bq.  


                
> Slave seg fault when executor exited
> ------------------------------------
>
>                 Key: MESOS-190
>                 URL: https://issues.apache.org/jira/browse/MESOS-190
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Hindman
>            Assignee: Vinod Kone
>            Priority: Blocker
>
> When I restart/kill early or otherwise interrupt my framework from the
> client, I often segfault the slave.  I'm not sure if there is a bug in
> my executor, but it seems Mesos should be more resilient than this.
> Mesos subversion -r 1331158
> I know optimized builds can be tricky to debug, but in this case it
> does look like it was trying to dereference the invalid Task* address
> (note that task matches %rdx, and the crashed assembly code is trying
> to dereference %rdx).
> Any suggestions?
> (gdb) bt
> #0  mesos::internal::slave::Slave::executorExited (this=0x1305820,
>    frameworkId=..., executorId=..., status=0) at slave/slave.cpp:1400
> #1  0x00007f0cf310526d in __call<process::ProcessBase*&, 0, 1> (__args=...,
>    this=<optimized out>) at /usr/include/c++/4.6/tr1/functional:1153
> #2  operator()<process::ProcessBase*> (this=<optimized out>)
>    at /usr/include/c++/4.6/tr1/functional:1207
> #3  std::tr1::_Function_handler<void (process::ProcessBase*),
> std::tr1::_Bind<void (*(std::tr1::_Placeholder<1>,
> std::tr1::shared_ptr<std::tr1::function<void
> (mesos::internal::slave::Slave*)> >))(process::ProcessBase*,
> std::tr1::shared_ptr<std::tr1::function<void
> (mesos::internal::slave::Slave*)> >)> >::_M_invoke(std::tr1::_Any_data
> const&, process::ProcessBase*) (__functor=...,
>    __args#0=<optimized out>) at /usr/include/c++/4.6/tr1/functional:1684
> #4  0x00007f0cf32014a3 in std::tr1::function<void
> (process::ProcessBase*)>::operator()(process::ProcessBase*) const ()
>   from /home/ubuntu/cr/lib/libmesos-0.9.0.so
> #5  0x00007f0cf31f617f in
> process::ProcessBase::visit(process::DispatchEvent const&) () from
> /home/ubuntu/cr/lib/libmesos-0.9.0.so
> #6  0x00007f0cf31f885c in
> process::DispatchEvent::visit(process::EventVisitor*) const () from
> /home/ubuntu/cr/lib/libmesos-0.9.0.so
> #7  0x00007f0cf31f38cf in
> process::ProcessManager::resume(process::ProcessBase*) () from
> /home/ubuntu/cr/lib/libmesos-0.9.0.so
> #8  0x00007f0cf31ec783 in process::schedule(void*) ()
>   from /home/ubuntu/cr/lib/libmesos-0.9.0.so
> #9  0x00007f0cf26e5e9a in start_thread ()
>   from /lib/x86_64-linux-gnu/libpthread.so.0
> #10 0x00007f0cf24134bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #11 0x0000000000000000 in ?? ()
> (gdb) print task
> $1 = (mesos::internal::Task *) 0x3031406576616c73
> (gdb) info register
> rax            0x7f0cf3647cf0   139693599784176
> rbx            0x0      0
> rcx            0x7f0ce8000038   139693408649272
> rdx            0x3031406576616c73       3472627592201333875
> rsi            0x2      2
> rdi            0x7f0cf0613ac0   139693549238976
> rbp            0x7f0ce80034c8   0x7f0ce80034c8
> rsp            0x7f0cf0613c00   0x7f0cf0613c00
> r8             0x7f0ce80009b0   139693408651696
> r9             0x1      1
> r10            0x6      6
> r11            0x1      1
> r12            0x7f0ce8001ca0   139693408656544
> r13            0x7f0ce80056c0   139693408671424
> r14            0x7f0ce8006cc0   139693408677056
> r15            0x1305820        19945504
> rip            0x7f0cf30fecd5   0x7f0cf30fecd5
> <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> const&, mesos::ExecutorID const&, int)+533>
> eflags         0x10206  [ PF IF RF ]
> cs             0xe033   57395
> ss             0xe02b   57387
> ds             0x0      0
> es             0x0      0
> fs             0x0      0
> gs             0x0      0
> disassemble:
>  0x00007f0cf30fecb9 <+505>:    mov    %rax,0x20(%rsp)
>   0x00007f0cf30fecbe <+510>:   xor    %ebx,%ebx
>   0x00007f0cf30fecc0 <+512>:   cmp    0x20(%rsp),%r12
>   0x00007f0cf30fecc5 <+517>:   je     0x7f0cf30fed2e
> <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> const&, mesos::ExecutorID const&, int)+622>
>   0x00007f0cf30fecc7 <+519>:   test   %r12,%r12
>   0x00007f0cf30fecca <+522>:   je     0x7f0cf30ff27d
> <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> const&, mesos::ExecutorID const&, int)+1981>
>   0x00007f0cf30fecd0 <+528>:   mov    0x28(%r12),%rdx
> => 0x00007f0cf30fecd5 <+533>:   mov    0x70(%rdx),%edi
>   0x00007f0cf30fecd8 <+536>:   mov    %rdx,0x8(%rsp)
>   0x00007f0cf30fecdd <+541>:   callq  0x7f0cf3062220
> <_Z...@plt>
>   0x00007f0cf30fece2 <+546>:   test   %al,%al
>   0x00007f0cf30fece4 <+548>:   mov    0x8(%rsp),%rdx
>   0x00007f0cf30fece9 <+553>:   je     0x7f0cf30ff020
> <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> const&, mesos::ExecutorID const&, int)+1376>
>   0x00007f0cf30fecef <+559>:   test   %rbp,%rbp
>   0x00007f0cf30fecf2 <+562>:   je     0x7f0cf30ff244
> <mesos::internal::slave::Slave::executorExited(mesos::FrameworkID
> const&, mesos::ExecutorID const&, int)+1---Type <return> to continue,
> or q <re
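
A note on the dump above: the invalid Task* value, 0x3031406576616c73, is made up entirely of printable ASCII bytes. The short Python sketch below is not part of the original report. It decodes the register value in little-endian byte order, which is how x86-64 lays the bytes out in memory. The result reads like the start of a string such as a `slave@<ip>:<port>` process identifier. That would be consistent with the pointer having been read from memory that was freed and later reused for string data, but it does not prove it.

```python
# Decode the invalid Task* value from the gdb output as raw bytes.
# x86-64 is little-endian, so the lowest-order byte comes first in memory.
bad_pointer = 0x3031406576616C73

raw = bad_pointer.to_bytes(8, byteorder="little")
decoded = raw.decode("ascii")

print(decoded)  # prints: slave@10
```

The same check works on any suspicious pointer in a core dump: if every byte falls in the printable range, overwritten or reused memory is a reasonable first hypothesis.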

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MESOS-190) Slave seg fault when executor exited

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270868#comment-13270868 ] 

jiraposter@reviews.apache.org commented on MESOS-190:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5057/#review7701
-----------------------------------------------------------

Ship it!


Thanks Vinod.

- Benjamin


On 2012-05-08 17:09:42, Vinod Kone wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/5057/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-05-08 17:09:42)
bq.  
bq.  
bq.  Review request for mesos, Benjamin Hindman and John Sirois.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Fix for: https://issues.apache.org/jira/browse/MESOS-190
bq.  
bq.  Also prevents the slave from retrying status updates to a dead framework indefinitely.
bq.  
bq.  
bq.  This addresses bug MESOS-190.
bq.      https://issues.apache.org/jira/browse/MESOS-190
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    src/slave/slave.cpp 09a8396 
bq.    src/tests/fault_tolerance_tests.cpp 6772daf 
bq.  
bq.  Diff: https://reviews.apache.org/r/5057/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Checked with long lived framework.
bq.  
bq.  $ ./bin/mesos-master.sh
bq.  $ ./bin/mesos-slave.sh --master=localhost:5050
bq.  $ ./src/long-lived-framework localhost:5050
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Vinod
bq.  
bq.
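
The second point in the summary (not retrying status updates to a dead framework forever) can be illustrated with a small sketch. This is not the code from the review: the `Slave` class, its fields, and the framework IDs below are hypothetical, and it only models the idea that pending updates are dropped once their framework is no longer known to the slave.

```python
from dataclasses import dataclass, field


@dataclass
class Slave:
    # Hypothetical model: framework IDs the slave still knows about, and
    # status updates awaiting acknowledgement, keyed by framework ID.
    frameworks: set = field(default_factory=set)
    pending: dict = field(default_factory=dict)
    sent: list = field(default_factory=list)

    def retry_pending_updates(self):
        """Resend unacknowledged updates, dropping those for dead frameworks."""
        for framework_id in list(self.pending):
            if framework_id not in self.frameworks:
                # The framework is gone: stop retrying instead of looping forever.
                del self.pending[framework_id]
                continue
            for update in self.pending[framework_id]:
                self.sent.append((framework_id, update))


slave = Slave(
    frameworks={"fw-live"},
    pending={"fw-live": ["TASK_RUNNING"], "fw-dead": ["TASK_LOST"]},
)
slave.retry_pending_updates()
print(slave.sent)             # [('fw-live', 'TASK_RUNNING')]
print(sorted(slave.pending))  # ['fw-live']
```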


                


        

[jira] [Resolved] (MESOS-190) Slave seg fault when executor exited

Posted by "Benjamin Hindman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MESOS-190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Hindman resolved MESOS-190.
------------------------------------

    Resolution: Fixed
    


        

[jira] [Commented] (MESOS-190) Slave seg fault when executor exited

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270046#comment-13270046 ] 

jiraposter@reviews.apache.org commented on MESOS-190:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5057/#review7657
-----------------------------------------------------------



src/slave/slave.cpp
<https://reviews.apache.org/r/5057/#comment16872>

    Does this new API call still transition live tasks to LOST/FAILED?



src/slave/slave.cpp
<https://reviews.apache.org/r/5057/#comment16873>

    Is there a test that could be tweaked to verify that this is happening? Presumably it wasn't happening before via executorExited?
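
    To make the two questions above concrete, here is a rough sketch of the property such a test might check. It is a toy model in Python, not the slave code under review: the function name, the task dictionary, and the default choice of TASK_LOST are assumptions made for illustration. The property is that after an executor exits, no task it owned remains in a non-terminal state, and tasks that were already terminal are left alone.

```python
# Terminal task states; anything else is treated as still live in this sketch.
TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


def transition_live_tasks(tasks, final_state="TASK_LOST"):
    """Move every non-terminal task to final_state and return the updates.

    `tasks` maps a task ID to its current state (hypothetical structure).
    """
    if final_state not in TERMINAL_STATES:
        raise ValueError("final_state must be a terminal state")
    updates = []
    for task_id, state in tasks.items():
        if state not in TERMINAL_STATES:
            tasks[task_id] = final_state
            updates.append((task_id, final_state))
    return updates


tasks = {"t1": "TASK_RUNNING", "t2": "TASK_FINISHED", "t3": "TASK_STARTING"}
updates = transition_live_tasks(tasks)
print(updates)  # [('t1', 'TASK_LOST'), ('t3', 'TASK_LOST')]
print(all(state in TERMINAL_STATES for state in tasks.values()))  # True
```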


- John


On 2012-05-07 21:11:34, Vinod Kone wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/5057/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-05-07 21:11:34)
bq.  
bq.  
bq.  Review request for mesos, Benjamin Hindman and John Sirois.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Fix for: https://issues.apache.org/jira/browse/MESOS-190
bq.  
bq.  Also prevents the slave from retrying status updates to a dead framework indefinitely.
bq.  
bq.  
bq.  This addresses bug MESOS-190.
bq.      https://issues.apache.org/jira/browse/MESOS-190
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    src/slave/slave.cpp 09a8396 
bq.  
bq.  Diff: https://reviews.apache.org/r/5057/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Checked with long lived framework.
bq.  
bq.  $ ./bin/mesos-master.sh
bq.  $ ./bin/mesos-slave.sh --master=localhost:5050
bq.  $ ./src/long-lived-framework localhost:5050
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Vinod
bq.  
bq.


                
