Posted to dev@mesos.apache.org by Kai Wang <ka...@caltech.edu> on 2013/07/03 17:58:19 UTC
TASK_LOST in running mesos-0.12.0 on real cluster
Hi all,
Has anyone run mesos-0.12.0 successfully on a real cluster?
To run the Hadoop JobTracker on our real Mesos cluster (3 nodes: 192.168.1.1: master, 192.168.1.2: slave1, 192.168.1.3: slave2), I've uploaded the Mesos executor bundle hadoop.tar.gz to hdfs://master:54310/, following the code in Mesos_Scheduler.java.
However, tasks are continuously lost because the executor process exits. From the code in process_based_isolation_module.cpp and the logs below, the executor process appears to be forked successfully but exits almost immediately, without the expected "Got registration for executor 'executorId' of framework" log line from slave.cpp.
In addition, I'm wondering how ExecutorLauncher::fetchExecutors() in launcher.cpp is called and executed, i.e., how the resource is fetched from HDFS according to the config below.
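For reference, my reading of the fetch step is that it dispatches on the URI scheme roughly like this (a sketch of my understanding only, with illustrative names; not the actual Mesos code):

```java
// Sketch of the URI-scheme dispatch I believe ExecutorLauncher::fetchExecutors()
// performs before launching the executor. Illustrative names, not Mesos symbols.
public class FetchSketch {
    static String fetchCommand(String uri, String hadoopHome) {
        if (uri.startsWith("hdfs://") || uri.startsWith("s3://") || uri.startsWith("s3n://")) {
            // HDFS-style URIs go through the Hadoop CLI, so the Hadoop client
            // must work on every slave, not just where the scheduler runs.
            return hadoopHome + "/bin/hadoop fs -copyToLocal '" + uri + "' .";
        } else if (uri.startsWith("http://") || uri.startsWith("https://") || uri.startsWith("ftp://")) {
            return "wget '" + uri + "'";
        }
        return "cp '" + uri + "' .";  // anything else: treated as a path local to the slave
    }

    public static void main(String[] args) {
        System.out.println(fetchCommand("hdfs://master:54310/hadoop.tar.gz", "/usr/lib/hadoop"));
    }
}
```

If that reading is right, the Hadoop client has to be resolvable on every slave for an hdfs:// URI to be fetched at all.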
================= mapred-site.xml ===================
<property>
  <name>mapred.mesos.executor</name>
  <!-- e.g. <value>hdfs://hdfs.name.node:port/hadoop.zip</value> -->
  <value>hdfs://master:54310/hadoop.tar.gz</value>
</property>
==================== Mesos_Scheduler.java ======================
if (master.equals("local")) {
  try {
    commandInfo = CommandInfo.newBuilder()
        .setEnvironment(envBuilder)
        .setValue(new File("bin/mesos-executor").getCanonicalPath())
        .build();
  } catch (IOException e) {
    LOG.fatal("Failed to find Mesos executor ", e);
    System.exit(1);
  }
} else {
  String uri = conf.get("mapred.mesos.executor");
  commandInfo = CommandInfo.newBuilder()
      .setEnvironment(envBuilder)
      .setValue("cd hadoop && ./bin/mesos-executor")
      .addUris(CommandInfo.URI.newBuilder().setValue(uri))
      .build();
}
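For context, my mental model of what the slave does with this CommandInfo is roughly the following (a sketch under my assumptions: fetch, extract known archives, then run the value via a shell; not actual Mesos code):

```java
// Tiny model of the launch sequence I expect the slave to derive from the
// CommandInfo above. Illustrative only. Note that the command value
// "cd hadoop && ./bin/mesos-executor" assumes the tarball extracts to a
// top-level hadoop/ directory inside the work directory.
import java.util.ArrayList;
import java.util.List;

public class LaunchSketch {
    static String basename(String uri) {
        return uri.substring(uri.lastIndexOf('/') + 1);
    }

    static List<String> launchSteps(List<String> uris, String value) {
        List<String> steps = new ArrayList<>();
        for (String uri : uris) {
            steps.add("fetch " + uri);  // e.g. hadoop fs -copyToLocal for hdfs://
            String name = basename(uri);
            if (name.endsWith(".tar.gz") || name.endsWith(".tgz")) {
                steps.add("tar xzf " + name);   // archives are extracted in place
            } else if (name.endsWith(".zip")) {
                steps.add("unzip " + name);
            }
        }
        steps.add("sh -c '" + value + "'");     // finally run the command value
        return steps;
    }

    public static void main(String[] args) {
        launchSteps(List.of("hdfs://master:54310/hadoop.tar.gz"),
                    "cd hadoop && ./bin/mesos-executor")
            .forEach(System.out::println);
    }
}
```

If that model is right, an abort this soon after the fork could mean the fetch or extraction step failed before the executor ever got a chance to register, which the stdout/stderr files in the run directory should show.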
===================== slave2 log =======================================================
I0703 19:44:58.547559 4648 slave.cpp:169] Slave started on 1)@192.168.1.3:50842
I0703 19:44:58.547617 4648 slave.cpp:170] Slave resources: cpus=24; mem=63362; ports=[31000-32000]; disk=882530
I0703 19:44:58.550875 4648 slave.cpp:378] New master detected at master@192.168.1.1:5050
I0703 19:44:58.552201 4632 slave.cpp:361] Successfully attached file '/home/mesos/log/mesos-slave.INFO'
I0703 19:44:58.553238 4649 slave.cpp:398] Registered with master; given slave ID 201307031945-16885952-5050-13298-1
I0703 19:45:58.550350 4629 slave.cpp:1233] Current disk usage 0.36%. Max allowed age: 6.97days
I0703 19:46:58.552299 4650 slave.cpp:1233] Current disk usage 0.36%. Max allowed age: 6.97days
I0703 19:47:58.553762 4638 slave.cpp:1233] Current disk usage 0.36%. Max allowed age: 6.97days
I0703 19:48:16.771438 4647 slave.cpp:475] Got assigned task Task_Tracker_0 for framework 201307031945-16885952-5050-13298-0000
I0703 19:48:16.774071 4647 paths.hpp:235] Created executor directory '/tmp/mesos/slaves/201307031945-16885952-5050-13298-1/frameworks/201307031945-16885952-5050-13298-0000/executors/executor_Task_Tracker_0/runs/11a79052-7eb7-4530-8315-153fef22cee7'
I0703 19:48:16.774608 4647 process_based_isolation_module.cpp:108] Launching executor_Task_Tracker_0 (cd hadoop && ./bin/mesos-executor) in /tmp/mesos/slaves/201307031945-16885952-5050-13298-1/frameworks/201307031945-16885952-5050-13298-0000/executors/executor_Task_Tracker_0/runs/11a79052-7eb7-4530-8315-153fef22cee7 with resources cpus=1; mem=1280' for framework 201307031945-16885952-5050-13298-0000
I0703 19:48:16.775753 4635 slave.cpp:361] Successfully attached file '/tmp/mesos/slaves/201307031945-16885952-5050-13298-1/frameworks/201307031945-16885952-5050-13298-0000/executors/executor_Task_Tracker_0/runs/11a79052-7eb7-4530-8315-153fef22cee7'
I0703 19:48:16.776356 4647 process_based_isolation_module.cpp:153] Forked executor at 5262
I0703 19:48:18.789098 4634 process_based_isolation_module.cpp:344] Telling slave of lost executor executor_Task_Tracker_0 of framework 201307031945-16885952-5050-13298-0000
I0703 19:48:18.789445 4650 slave.cpp:1053] Executor 'executor_Task_Tracker_0' of framework 201307031945-16885952-5050-13298-0000 has terminated with signal Aborted
I0703 19:48:18.792886 4650 slave.cpp:830] Status update: task Task_Tracker_0 of framework 201307031945-16885952-5050-13298-0000 is now in state TASK_LOST
I0703 19:48:18.793547 4646 gc.cpp:97] Scheduling /tmp/mesos/slaves/201307031945-16885952-5050-13298-1/frameworks/201307031945-16885952-5050-13298-0000/executors/executor_Task_Tracker_0/runs/11a79052-7eb7-4530-8315-153fef22cee7 for removal
I0703 19:48:18.797713 4636 slave.cpp:727] Got acknowledgement of status update for task Task_Tracker_0 of framework 201307031945-16885952-5050-13298-0000
I0703 19:48:18.789301 4634 process_utils.hpp:64] Stopping … 5262
============== process_based_isolation_module.cpp ==========================
153   LOG(INFO) << "Forked executor at " << pid;
      // Tell the slave this executor has started.
159   dispatch(slave, &Slave::executorStarted, frameworkId, executorId, pid);

338   void ProcessBasedIsolationModule::processExited(pid_t pid, int status)
344   LOG(INFO) << "Telling slave of lost executor " << executorId
                << " of framework " << frameworkId;
347   dispatch(slave,
               &Slave::executorTerminated,
               frameworkId,
               executorId,
               status,
               false,
               "Executor exited");
====================== master log =====================================
I0703 19:48:22.194702 13320 master.hpp:285] Adding task with resources cpus=0.8; mem=2048; disk=4096; ports=[31000-31000, 32000-32000] on slave 201307031945-16885952-5050-13298-1
I0703 19:48:22.194816 13320 master.cpp:1586] Launching task Task_Tracker_0 of framework 201307031945-16885952-5050-13298-0000 with resources cpus=0.8; mem=2048; disk=4096; ports=[31000-31000, 32000-32000] on slave 201307031945-16885952-5050-13298-1 (slave2)
I0703 19:48:22.195260 13320 master.hpp:314] Removing offer with resources cpus=24; mem=63362; ports=[31000-32000]; disk=882530 on slave 201307031945-16885952-5050-13298-1
I0703 19:48:22.195427 13320 master.cpp:1469] Processing reply for offer 201307031945-16885952-5050-13298-5 on slave 201307031945-16885952-5050-13298-0 (slave1) for framework 201307031945-16885952-5050-13298-0000
I0703 19:48:22.195549 13314 hierarchical_allocator_process.hpp:495] Framework 201307031945-16885952-5050-13298-0000 filtered slave 201307031945-16885952-5050-13298-1 for 5.00secs
I0703 19:48:22.195596 13320 master.hpp:314] Removing offer with resources cpus=24; mem=63362; ports=[31000-32000]; disk=882382 on slave 201307031945-16885952-5050-13298-0
I0703 19:48:22.195886 13314 hierarchical_allocator_process.hpp:495] Framework 201307031945-16885952-5050-13298-0000 filtered slave 201307031945-16885952-5050-13298-0 for 5.00secs
I0703 19:48:23.180142 13302 hierarchical_allocator_process.hpp:597] Performed allocation for 2 slaves in 102.71us
I0703 19:48:24.181670 13317 hierarchical_allocator_process.hpp:597] Performed allocation for 2 slaves in 127.42us
I0703 19:48:24.218245 13321 master.cpp:1022] Status update from slave(1)@192.168.1.3:50842: task Task_Tracker_0 of framework 201307031945-16885952-5050-13298-0000 is now in state TASK_LOST
Thanks a lot for your help!
Kai
Re: TASK_LOST in running mesos-0.12.0 on real cluster
Posted by Vinod Kone <vi...@gmail.com>.
Hi Kai,
Are there any logs in the executor work directory? e.g., /tmp/mesos/slaves/201307031945-16885952-5050-13298-1/frameworks/201307031945-16885952-5050-13298-0000/executors/executor_Task_Tracker_0/runs/11a79052-7eb7-4530-8315-153fef22cee7.
Also, are you capturing the stderr of the slave?
On Wed, Jul 3, 2013 at 8:58 AM, Kai Wang <ka...@caltech.edu> wrote: