Posted to issues@mesos.apache.org by "Neil Conway (JIRA)" <ji...@apache.org> on 2015/11/17 07:48:11 UTC
[jira] [Commented] (MESOS-3719) Core dump on /teardown

    [ https://issues.apache.org/jira/browse/MESOS-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15008152#comment-15008152 ] 

Neil Conway commented on MESOS-3719:
------------------------------------

[~kensipe] Hi Ken -- any progress on getting steps to reproduce this crash?

I notice that MESOS-3744 lists the assertion failure as:

{{sorter.cpp:213] Check failed: total.resources.contains(slaveId)}}

> Core dump on /teardown
> ----------------------
>
>                 Key: MESOS-3719
>                 URL: https://issues.apache.org/jira/browse/MESOS-3719
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.24.1
>            Reporter: Ken Sipe
>
> Invoked `/master/teardown` for 2 frameworks. A sample invocation (on the master node, using mesos-dns) is:
> {code}curl -d "frameworkId=20151013-143739-1510211594-5050-1515-0002" -X POST http://master.mesos:5050/master/teardown{code}
> Logs at the master:
> {code}
> Oct 13 15:36:24 ip-10-0-4-90.us-west-2.compute.internal mesos-master[1515]: I1013 15:36:24.677880  1525 http.cpp:321] HTTP POST for /master/teardown from 10.0.4.90:53789 with User-Agent='curl/7.42.1'
> Oct 13 15:36:24 ip-10-0-4-90.us-west-2.compute.internal mesos-master[1515]: I1013 15:36:24.679015  1525 master.cpp:5112] Removing framework 20151013-143739-1510211594-5050-1515-0002 (hdfs) at scheduler-a5388720-fcbf-4fd0-b01e-75712b12c99d@10.0.4.90:53903
> Oct 13 15:36:24 ip-10-0-4-90.us-west-2.compute.internal mesos-master[1515]: I1013 15:36:24.679280  1525 master.cpp:5576] Updating the latest state of task task.journalnode.journalnode.NodeExecutor.1444747955695 of framework 20151013-143739-1510211594-5050-1515-0002 to TASK_KILLED
> Oct 13 15:36:24 ip-10-0-4-90.us-west-2.compute.internal mesos-master[1515]: I1013 15:36:24.679385  1525 master.cpp:5644] Removing task task.journalnode.journalnode.NodeExecutor.1444747955695 with resources cpus(*):0.25; mem(*):691.2 of framework 20151013-143739-1510211594-5050-1515-0002 on
> Oct 13 15:36:24 ip-10-0-4-90.us-west-2.compute.internal mesos-master[1515]: I1013 15:36:24.679404  1521 hierarchical.hpp:814] Recovered cpus(*):0.25; mem(*):691.2 (total: ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, 8182-32000]; cpus(*):4; mem(*):14019; disk(*):32541, a
> Oct 13 15:36:24 ip-10-0-4-90.us-west-2.compute.internal mesos-master[1515]: I1013 15:36:24.679536  1525 master.cpp:5673] Removing executor 'executor.journalnode.NodeExecutor.1444747955695' with resources cpus(*):0.1; mem(*):345.6 of framework 20151013-143739-1510211594-5050-1515-0002 on sl
> Oct 13 15:36:24 ip-10-0-4-90.us-west-2.compute.internal mesos-master[1515]: I1013 15:36:24.679641  1521 hierarchical.hpp:814] Recovered cpus(*):0.1; mem(*):345.6 (total: ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, 8182-32000]; cpus(*):4; mem(*):14019; disk(*):32541, al
> Oct 13 15:36:24 ip-10-0-4-90.us-west-2.compute.internal mesos-master[1515]: F1013 15:36:24.679719  1521 sorter.cpp:213] Check failed: total.resources.contains(slaveId)
> Oct 13 15:36:24 ip-10-0-4-90.us-west-2.compute.internal mesos-master[1515]: *** Check failure stack trace: ***
> Oct 13 15:36:24 ip-10-0-4-90.us-west-2.compute.internal mesos-master[1515]: @     0x7fba6d86c9fd  google::LogMessage::Fail()
> Oct 13 15:36:24 ip-10-0-4-90.us-west-2.compute.internal mesos-master[1515]: @     0x7fba6d86e89d  google::LogMessage::SendToLog()
> Oct 13 15:36:24 ip-10-0-4-90.us-west-2.compute.internal mesos-master[1515]: @     0x7fba6d86c5ec  google::LogMessage::Flush()
> Oct 13 15:36:24 ip-10-0-4-90.us-west-2.compute.internal mesos-master[1515]: @     0x7fba6d86f1be  google::LogMessageFatal::~LogMessageFatal()
> Oct 13 15:36:24 ip-10-0-4-90.us-west-2.compute.internal mesos-master[1515]: @     0x7fba6e3f2910  mesos::internal::master::allocator::DRFSorter::remove()
> Oct 13 15:36:24 ip-10-0-4-90.us-west-2.compute.internal mesos-master[1515]: @     0x7fba6e2cc0bc  mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
> Oct 13 15:36:24 ip-10-0-4-90.us-west-2.compute.internal mesos-master[1515]: @     0x7fba6e977551  process::ProcessManager::resume()
> Oct 13 15:36:24 ip-10-0-4-90.us-west-2.compute.internal mesos-master[1515]: @     0x7fba6e97784f  process::internal::schedule()
> Oct 13 15:36:24 ip-10-0-4-90.us-west-2.compute.internal mesos-master[1515]: I1013 15:36:24.684733  1520 http.cpp:321] HTTP GET for /master/state-summary from 10.0.4.90:53790 with User-Agent='Python-urllib/3.4'
> Oct 13 15:36:24 ip-10-0-4-90.us-west-2.compute.internal mesos-master[1515]: @     0x7fba6d30ebc3  (unknown)
> Oct 13 15:36:24 ip-10-0-4-90.us-west-2.compute.internal mesos-master[1515]: @     0x7fba6cb1266c  (unknown)
> Oct 13 15:36:24 ip-10-0-4-90.us-west-2.compute.internal mesos-master[1515]: @     0x7fba6c8552ed  (unknown)
> Oct 13 15:36:24 ip-10-0-4-90.us-west-2.compute.internal systemd[1]: dcos-mesos-master.service: Main process exited, code=killed, status=6/ABRT
> Oct 13 15:36:24 ip-10-0-4-90.us-west-2.compute.internal systemd[1]: dcos-mesos-master.service: Unit entered failed state.
> Oct 13 15:36:24 ip-10-0-4-90.us-west-2.compute.internal systemd[1]: dcos-mesos-master.service: Failed with result 'signal'.
> Oct 13 15:36:40 ip-10-0-4-90.us-west-2.compute.internal systemd[1]: dcos-mesos-master.service: Service hold-off time over, scheduling restart.
> Oct 13 15:36:40 ip-10-0-4-90.us-west-2.compute.internal systemd[1]: Starting Mesos Master...
> {code}
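
For triage, the fatal entry in a log like the one quoted above can be picked out mechanically. The following is a minimal Python sketch, not part of the original report. It assumes the glog line layout visible in the log (severity letter, MMDD date, time, thread id, then `file:line]` and the message), and the name `find_fatal_checks` is hypothetical.

```python
import re

# Assumed glog layout, inferred from the quoted log lines:
#   <severity><MMDD> <HH:MM:SS.uuuuuu>  <thread id> <file>:<line>] <message>
GLOG_RE = re.compile(
    r"(?P<severity>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<source>[\w.]+):(?P<line>\d+)\] (?P<message>.*)"
)


def find_fatal_checks(log_lines):
    """Return (source file, line number, message) for each fatal (F) entry.

    Lines that do not look like glog entries, such as stack trace frames
    or systemd messages, are skipped.
    """
    hits = []
    for raw in log_lines:
        match = GLOG_RE.search(raw)
        if match and match.group("severity") == "F":
            hits.append(
                (match.group("source"), int(match.group("line")), match.group("message"))
            )
    return hits
```

Fed the lines of the master log (for example, read from a saved journal dump), this would report the `sorter.cpp:213` check failure and ignore the surrounding info-level entries.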


