Posted to dev@mesos.apache.org by Kai Wang <ka...@caltech.edu> on 2013/07/03 17:58:19 UTC

TASK_LOST in running mesos-0.12.0 on real cluster

Hi all:

Has anyone run mesos-0.12.0 successfully on a real cluster?

In order to run the Hadoop JobTracker on our real Mesos cluster (3 nodes: 192.168.1.1: master, 192.168.1.2: slave1, 192.168.1.3: slave2), I've uploaded the Mesos executor bundle hadoop.tar.gz to hdfs://master:54310/, following the code in Mesos_Scheduler.java (snippet below).

However, the task is continuously lost because the executor process exits. From the code in process_based_isolation_module.cpp and the logs below, I think the executor process is forked successfully but exits shortly afterwards, without the expected "Got registration for executor 'executorId' of framework" log ever being generated by slave.cpp.
In addition, I'm wondering how ExecutorLauncher::fetchExecutors() in launcher.cpp gets called and executed, i.e. how the executor bundle is fetched from HDFS according to the mapred.mesos.executor setting below.
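
To make sure I understand the mechanism, here is my simplified reading of what the slave-side fetch and launch amount to. This is only a sketch of my assumptions, not the actual launcher.cpp code; the helper name fetchAndExtract and the workDir parameter are made up for illustration. Corrections welcome.

======= sketch: my reading of the fetch/launch path (not actual Mesos code) =======
// Assumption: for each URI in CommandInfo, the launcher copies it into the
// executor work directory (using the hadoop CLI for hdfs:// URIs), extracts
// archives, and then runs the command value through a shell from there.
#include <cstdlib>
#include <string>

// Hypothetical helper; the real logic lives in ExecutorLauncher::fetchExecutors().
int fetchAndExtract(const std::string& uri, const std::string& workDir)
{
  // hdfs:// URIs presumably need the hadoop client, so 'hadoop' must be
  // runnable (PATH or HADOOP_HOME) from the slave/launcher process.
  std::string fetch = "hadoop fs -copyToLocal " + uri + " " + workDir;
  if (std::system(fetch.c_str()) != 0) {
    return -1;  // fetch failed; the forked executor would exit here
  }

  // .tar.gz bundles are unpacked inside the work directory.
  std::string extract = "cd " + workDir + " && tar xzf hadoop.tar.gz";
  if (std::system(extract.c_str()) != 0) {
    return -1;  // bad or missing archive
  }

  // Finally the CommandInfo value is run via a shell from the work directory,
  // in my case "cd hadoop && ./bin/mesos-executor".
  std::string command = "cd " + workDir + " && cd hadoop && ./bin/mesos-executor";
  return std::system(command.c_str());
}

If that reading is roughly right, then the fork succeeding (as it does in my log) is still consistent with any of these killing the executor before it registers: 'hadoop' not being runnable from the slave process, the copyToLocal failing, or the tarball not unpacking into a top-level hadoop/ directory containing bin/mesos-executor.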
 
================= mapred-site.xml ===================
<property>
    <name>mapred.mesos.executor</name>
    <!-- <value>hdfs://hdfs.name.node:port/hadoop.zip</value> -->
     <value>hdfs://master:54310/hadoop.tar.gz</value>
 </property>

==================== Mesos_Scheduler.java ======================
      if (master.equals("local")) {
        try {
          commandInfo = CommandInfo.newBuilder()
              .setEnvironment(envBuilder)
              .setValue(new File("bin/mesos-executor").getCanonicalPath())
              .build();
        } catch (IOException e) {
          LOG.fatal("Failed to find Mesos executor ", e);
          System.exit(1);
        }
      } else {
        String uri = conf.get("mapred.mesos.executor");
        commandInfo = CommandInfo.newBuilder()
            .setEnvironment(envBuilder)
            .setValue("cd hadoop && ./bin/mesos-executor")
            .addUris(CommandInfo.URI.newBuilder().setValue(uri)).build();
      }

===================== slave2 log =======================================================
I0703 19:44:58.547559  4648 slave.cpp:169] Slave started on 1)@192.168.1.3:50842
I0703 19:44:58.547617  4648 slave.cpp:170] Slave resources: cpus=24; mem=63362; ports=[31000-32000]; disk=882530
I0703 19:44:58.550875  4648 slave.cpp:378] New master detected at master@192.168.1.1:5050
I0703 19:44:58.552201  4632 slave.cpp:361] Successfully attached file '/home/mesos/log/mesos-slave.INFO'
I0703 19:44:58.553238  4649 slave.cpp:398] Registered with master; given slave ID 201307031945-16885952-5050-13298-1
I0703 19:45:58.550350  4629 slave.cpp:1233] Current disk usage 0.36%. Max allowed age: 6.97days
I0703 19:46:58.552299  4650 slave.cpp:1233] Current disk usage 0.36%. Max allowed age: 6.97days
I0703 19:47:58.553762  4638 slave.cpp:1233] Current disk usage 0.36%. Max allowed age: 6.97days
I0703 19:48:16.771438  4647 slave.cpp:475] Got assigned task Task_Tracker_0 for framework 201307031945-16885952-5050-13298-0000
I0703 19:48:16.774071  4647 paths.hpp:235] Created executor directory '/tmp/mesos/slaves/201307031945-16885952-5050-13298-1/frameworks/201307031945-16885952-5050-13298-0000/executors/executor_Task_Tracker_0/runs/11a79052-7eb7-4530-8315-153fef22cee7'
I0703 19:48:16.774608  4647 process_based_isolation_module.cpp:108] Launching executor_Task_Tracker_0 (cd hadoop && ./bin/mesos-executor) in /tmp/mesos/slaves/201307031945-16885952-5050-13298-1/frameworks/201307031945-16885952-5050-13298-0000/executors/executor_Task_Tracker_0/runs/11a79052-7eb7-4530-8315-153fef22cee7 with resources cpus=1; mem=1280' for framework 201307031945-16885952-5050-13298-0000
I0703 19:48:16.775753  4635 slave.cpp:361] Successfully attached file '/tmp/mesos/slaves/201307031945-16885952-5050-13298-1/frameworks/201307031945-16885952-5050-13298-0000/executors/executor_Task_Tracker_0/runs/11a79052-7eb7-4530-8315-153fef22cee7'
I0703 19:48:16.776356  4647 process_based_isolation_module.cpp:153] Forked executor at 5262
I0703 19:48:18.789098  4634 process_based_isolation_module.cpp:344] Telling slave of lost executor executor_Task_Tracker_0 of framework 201307031945-16885952-5050-13298-0000
I0703 19:48:18.789445  4650 slave.cpp:1053] Executor 'executor_Task_Tracker_0' of framework 201307031945-16885952-5050-13298-0000 has terminated with signal Aborted
I0703 19:48:18.792886  4650 slave.cpp:830] Status update: task Task_Tracker_0 of framework 201307031945-16885952-5050-13298-0000 is now in state TASK_LOST
I0703 19:48:18.793547  4646 gc.cpp:97] Scheduling /tmp/mesos/slaves/201307031945-16885952-5050-13298-1/frameworks/201307031945-16885952-5050-13298-0000/executors/executor_Task_Tracker_0/runs/11a79052-7eb7-4530-8315-153fef22cee7 for removal
I0703 19:48:18.797713  4636 slave.cpp:727] Got acknowledgement of status update for task Task_Tracker_0 of framework 201307031945-16885952-5050-13298-0000
I0703 19:48:18.789301  4634 process_utils.hpp:64] Stopping … 5262

============== process_based_isolation_module.cpp ==========================
153   LOG(INFO) << "Forked executor at " << pid;
    // Tell the slave this executor has started.
159    dispatch(slave, &Slave::executorStarted, frameworkId, executorId, pid);


338 void ProcessBasedIsolationModule::processExited(pid_t pid, int status)
344        LOG(INFO) << "Telling slave of lost executor " << executorId
                  << " of framework " << frameworkId;

347     dispatch(slave,
                 &Slave::executorTerminated,
                 frameworkId,
                 executorId,
                 status,
                 false,
                 "Executor exited");
                                                            
====================== master log =====================================
I0703 19:48:22.194702 13320 master.hpp:285] Adding task with resources cpus=0.8; mem=2048; disk=4096; ports=[31000-31000, 32000-32000] on slave 201307031945-16885952-5050-13298-1
I0703 19:48:22.194816 13320 master.cpp:1586] Launching task Task_Tracker_0 of framework 201307031945-16885952-5050-13298-0000 with resources cpus=0.8; mem=2048; disk=4096; ports=[31000-31000, 32000-32000] on slave 201307031945-16885952-5050-13298-1 (slave2)
I0703 19:48:22.195260 13320 master.hpp:314] Removing offer with resources cpus=24; mem=63362; ports=[31000-32000]; disk=882530 on slave 201307031945-16885952-5050-13298-1
I0703 19:48:22.195427 13320 master.cpp:1469] Processing reply for offer 201307031945-16885952-5050-13298-5 on slave 201307031945-16885952-5050-13298-0 (slave1) for framework 201307031945-16885952-5050-13298-0000
I0703 19:48:22.195549 13314 hierarchical_allocator_process.hpp:495] Framework 201307031945-16885952-5050-13298-0000 filtered slave 201307031945-16885952-5050-13298-1 for 5.00secs
I0703 19:48:22.195596 13320 master.hpp:314] Removing offer with resources cpus=24; mem=63362; ports=[31000-32000]; disk=882382 on slave 201307031945-16885952-5050-13298-0
I0703 19:48:22.195886 13314 hierarchical_allocator_process.hpp:495] Framework 201307031945-16885952-5050-13298-0000 filtered slave 201307031945-16885952-5050-13298-0 for 5.00secs
I0703 19:48:23.180142 13302 hierarchical_allocator_process.hpp:597] Performed allocation for 2 slaves in 102.71us
I0703 19:48:24.181670 13317 hierarchical_allocator_process.hpp:597] Performed allocation for 2 slaves in 127.42us
I0703 19:48:24.218245 13321 master.cpp:1022] Status update from slave(1)@192.168.1.3:50842: task Task_Tracker_0 of framework 201307031945-16885952-5050-13298-0000 is now in state TASK_LOST

Thanks a lot for your help!
Kai


Re: TASK_LOST in running mesos-0.12.0 on real cluster

Posted by Vinod Kone <vi...@gmail.com>.
Hi Kai,

Are there any logs in the executor work directory? e.g., /tmp/mesos/slaves/201307031945-16885952-5050-13298-1/frameworks/201307031945-16885952-5050-13298-0000/executors/executor_Task_Tracker_0/runs/11a79052-7eb7-4530-8315-153fef22cee7.

Also, are you capturing the stderr of the slave?


