Posted to dev@mesos.apache.org by "Niklas Quarfot Nielsen (JIRA)" <ji...@apache.org> on 2013/12/07 00:55:35 UTC
[jira] [Created] (MESOS-873) Crash in os::killtree on Mavericks

Niklas Quarfot Nielsen created MESOS-873:
--------------------------------------------

             Summary: Crash in os::killtree on Mavericks 
                 Key: MESOS-873
                 URL: https://issues.apache.org/jira/browse/MESOS-873
             Project: Mesos
          Issue Type: Bug
          Components: libprocess
         Environment: Mac OS X Mavericks
            Reporter: Niklas Quarfot Nielsen


This is a crash we experienced on a Mavericks installation. We haven't been able to reproduce it on other machines since, but managed to capture core files from the crashes.

Here is the stack trace from the crashing thread:

  thread #2: tid = 0x0001, 0x0000000106816de5 mesos-executor`os::process(int) + 4133, stop reason = signal SIGSTOP
    frame #0: 0x0000000106816de5 mesos-executor`os::process(int) + 4133
    frame #1: 0x000000010681734c mesos-executor`os::processes() + 316
    frame #2: 0x0000000106817752 mesos-executor`os::killtree(int, int, bool, bool) + 66
    frame #3: 0x0000000106819748 mesos-executor`mesos::internal::CommandExecutorProcess::shutdown(mesos::ExecutorDriver*) + 200
    frame #4: 0x000000010798be70
    frame #5: 0x000000010798be60
    frame #6: 0x0000000106b21c20 libmesos-0.16.0.dylib`process::Event::~Event() + 32
    frame #7: 0x90c307894810c083

The reported stop reason is wrong (all threads in the core file are reported as stopped).

Here is a snippet of the disassembly around the call into the failing frame (frame #1, os::processes()):
   0x106817306:  je     0x106817460               ; os::processes() + 592
   0x10681730c:  movq   16(%rsp), %rax
   0x106817311:  movq   296(%rsp), %rbx
   0x106817319:  leaq   16(%rax), %r14
   0x10681731d:  leaq   128(%rsp), %rax
   0x106817325:  addq   $8, %r14
   0x106817329:  movq   %rax, 24(%rsp)
   0x10681732e:  leaq   384(%rsp), %rbp
   0x106817336:  cmpq   %rbx, %r14
   0x106817339:  je     0x106817530               ; os::processes() + 800
   0x10681733f:  movl   32(%rbx), %esi
   0x106817342:  movq   24(%rsp), %rdi
   0x106817347:  callq  0x10681d5a0               ; symbol stub for: os::process(int)
-> 0x10681734c:  movl   128(%rsp), %esi
   0x106817353:  testl  %esi, %esi
   0x106817355:  jne    0x1068173e0               ; os::processes() + 464
   0x10681735b:  movq   136(%rsp), %rsi
   0x106817363:  movq   %rbp, %rdi
   0x106817366:  callq  0x10681d58e               ; symbol stub for: os::Process::Process(os::Process const&)
   0x10681736b:  movl   $112, %edi
   0x106817370:  callq  0x10681d9e4               ; symbol stub for: operator new(unsigned long)

While investigating the crash live in lldb, we concluded that using sysctl to get the argument count was probably the reason for the crash, but we have no way to validate this yet.
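
For anyone who wants to look at this code path, here is a minimal sketch of reading a process's argument count and arguments via sysctl on OS X with bounds checks. It is not the Mesos implementation. It assumes the arguments are fetched with KERN_PROCARGS2, whose buffer starts with an int argc, followed by the executable path, NUL padding, and then the NUL-terminated arguments. The names parseProcargs2 and processArguments are hypothetical.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

#ifdef __APPLE__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

// Hypothetical helper: parses a buffer laid out like KERN_PROCARGS2 output
// (assumed layout: [int argc][exec path\0][\0 padding][argv strings\0...]).
// Returns false instead of reading past the end of the buffer when the
// data is truncated or argc does not match the contents.
bool parseProcargs2(const char* data, std::size_t size,
                    std::vector<std::string>* argv)
{
  int argc = 0;
  if (data == nullptr || argv == nullptr || size < sizeof(argc)) {
    return false;
  }

  // Copy argc out rather than casting the pointer, to avoid alignment issues.
  std::memcpy(&argc, data, sizeof(argc));
  if (argc < 0) {
    return false;
  }

  std::size_t pos = sizeof(argc);

  // Skip the executable path, then the NUL padding that follows it.
  while (pos < size && data[pos] != '\0') { ++pos; }
  while (pos < size && data[pos] == '\0') { ++pos; }

  std::vector<std::string> result;
  for (int i = 0; i < argc; ++i) {
    if (pos >= size) {
      return false;  // argc claims more arguments than the buffer holds.
    }
    const void* end = std::memchr(data + pos, '\0', size - pos);
    if (end == nullptr) {
      return false;  // Unterminated argument string.
    }
    const char* terminator = static_cast<const char*>(end);
    result.emplace_back(data + pos, terminator);
    pos = static_cast<std::size_t>(terminator - data) + 1;
  }

  argv->swap(result);
  return true;
}

#ifdef __APPLE__
// Hypothetical wrapper: fetches the raw buffer for `pid` and parses it.
// sysctl may fail (for example for processes owned by another user),
// so every return value is checked.
bool processArguments(pid_t pid, std::vector<std::string>* argv)
{
  int argmaxMib[2] = {CTL_KERN, KERN_ARGMAX};
  int argmax = 0;
  size_t size = sizeof(argmax);
  if (sysctl(argmaxMib, 2, &argmax, &size, nullptr, 0) != 0 || argmax <= 0) {
    return false;
  }

  std::vector<char> buffer(static_cast<size_t>(argmax));
  int argsMib[3] = {CTL_KERN, KERN_PROCARGS2, static_cast<int>(pid)};
  size = buffer.size();
  if (sysctl(argsMib, 3, buffer.data(), &size, nullptr, 0) != 0) {
    return false;
  }

  return parseProcargs2(buffer.data(), size, argv);
}
#endif
```

If the crash does come from this area, comparing the buffer size returned by sysctl against the argc stored in its first bytes in the core dump might be one place to start.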

We can dig further into the core dump if you have any suspected causes for the failure or suggestions on where to look.
Also, since we haven't been able to reproduce the crash, we can probably mark this as won't fix if we don't hear of anyone else with the same problem.



--
This message was sent by Atlassian JIRA
(v6.1#6144)