Posted to dev@mesos.apache.org by "Benjamin Mahler (JIRA)" <ji...@apache.org> on 2014/06/04 22:41:03 UTC

[jira] [Updated] (MESOS-873) Crash in os::killtree on Mavericks

     [ https://issues.apache.org/jira/browse/MESOS-873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-873:
----------------------------------

    Summary: Crash in os::killtree on Mavericks  (was: Crash in os::killtree on Mavericks )

> Crash in os::killtree on Mavericks
> ----------------------------------
>
>                 Key: MESOS-873
>                 URL: https://issues.apache.org/jira/browse/MESOS-873
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess
>         Environment: Mac OS X Mavericks
>            Reporter: Niklas Quarfot Nielsen
>            Assignee: Benjamin Hindman
>             Fix For: 0.19.0
>
>
> This is a crash we experienced on a Mavericks installation. We haven't been able to reproduce it on other machines since, but managed to capture core files from the crashes.
> Here is the stack trace from the crashing thread:
>   thread #2: tid = 0x0001, 0x0000000106816de5 mesos-executor`os::process(int) + 4133, stop reason = signal SIGSTOP
>     frame #0: 0x0000000106816de5 mesos-executor`os::process(int) + 4133
>     frame #1: 0x000000010681734c mesos-executor`os::processes() + 316
>     frame #2: 0x0000000106817752 mesos-executor`os::killtree(int, int, bool, bool) + 66
>     frame #3: 0x0000000106819748 mesos-executor`mesos::internal::CommandExecutorProcess::shutdown(mesos::ExecutorDriver*) + 200
>     frame #4: 0x000000010798be70
>     frame #5: 0x000000010798be60
>     frame #6: 0x0000000106b21c20 libmesos-0.16.0.dylib`process::Event::~Event() + 32
>     frame #7: 0x90c307894810c083
> The reported stop reason is wrong (all threads in the core file are reported as stopped).
> Here is a snippet of the disassembly of the failing frame:
>    0x106817306:  je     0x106817460               ; os::processes() + 592
>    0x10681730c:  movq   16(%rsp), %rax
>    0x106817311:  movq   296(%rsp), %rbx
>    0x106817319:  leaq   16(%rax), %r14
>    0x10681731d:  leaq   128(%rsp), %rax
>    0x106817325:  addq   $8, %r14
>    0x106817329:  movq   %rax, 24(%rsp)
>    0x10681732e:  leaq   384(%rsp), %rbp
>    0x106817336:  cmpq   %rbx, %r14
>    0x106817339:  je     0x106817530               ; os::processes() + 800
>    0x10681733f:  movl   32(%rbx), %esi
>    0x106817342:  movq   24(%rsp), %rdi
>    0x106817347:  callq  0x10681d5a0               ; symbol stub for: os::process(int)
> -> 0x10681734c:  movl   128(%rsp), %esi
>    0x106817353:  testl  %esi, %esi
>    0x106817355:  jne    0x1068173e0               ; os::processes() + 464
>    0x10681735b:  movq   136(%rsp), %rsi
>    0x106817363:  movq   %rbp, %rdi
>    0x106817366:  callq  0x10681d58e               ; symbol stub for: os::Process::Process(os::Process const&)
>    0x10681736b:  movl   $112, %edi
>    0x106817370:  callq  0x10681d9e4               ; symbol stub for: operator new(unsigned long)
> While investigating the crash live in lldb, we concluded that using sysctl to get the argument count was probably the cause of the crash, but we still have no way to validate this.
> We can dig further into the core dump if you have any suspected causes for the failure or suggestions on where to look.
> Also, since we haven't been able to reproduce the crash, we can probably mark this as won't fix if we don't hear of anyone else with the same problem.
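
To illustrate the sysctl hypothesis above, here is a minimal sketch of defensively parsing an argument buffer. It is not the Mesos code. It assumes the argument count comes from a buffer laid out like the result of a KERN_PROCARGS2 sysctl query on OS X (an int argc, the executable path, NUL padding, then the argument strings). The helper name parseProcArgs is hypothetical. The point is that an argc that does not match the buffer contents should lead to a failure return rather than a read past the end of the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not part of Mesos or stout). Assumed buffer layout:
//   [int argc][exec path\0][\0 padding...][argv[0]\0]...[argv[argc-1]\0][env...]
// Returns false, leaving *args empty, if the buffer is truncated or argc
// claims more arguments than the buffer actually holds.
bool parseProcArgs(const char* data, size_t size, std::vector<std::string>* args)
{
  assert(args != nullptr);
  args->clear();

  int argc = 0;
  if (data == nullptr || size < sizeof(argc)) {
    return false;
  }
  std::memcpy(&argc, data, sizeof(argc));
  if (argc < 0) {
    return false;
  }

  size_t pos = sizeof(argc);

  // Skip the executable path, then the NUL padding that follows it.
  while (pos < size && data[pos] != '\0') { pos++; }
  while (pos < size && data[pos] == '\0') { pos++; }

  for (int i = 0; i < argc; i++) {
    if (pos >= size) {
      args->clear();
      return false; // argc is larger than the number of strings present.
    }
    const void* end = std::memchr(data + pos, '\0', size - pos);
    if (end == nullptr) {
      args->clear();
      return false; // Unterminated final string.
    }
    size_t length = static_cast<const char*>(end) - (data + pos);
    args->emplace_back(data + pos, length);
    pos += length + 1;
  }

  return true;
}
```

A caller would pass the raw bytes returned by the sysctl call together with the length the kernel reported, and treat a false return as "arguments unavailable" for that process instead of trusting argc.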



--
This message was sent by Atlassian JIRA
(v6.2#6252)