Posted to dev@mesos.apache.org by "Jessica J (JIRA)" <ji...@apache.org> on 2012/06/08 17:20:23 UTC

[jira] [Created] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Jessica J created MESOS-206:
-------------------------------

             Summary: Long-running jobs on Hadoop framework do not run to completion
                 Key: MESOS-206
                 URL: https://issues.apache.org/jira/browse/MESOS-206
             Project: Mesos
          Issue Type: Bug
          Components: framework
            Reporter: Jessica J
            Priority: Blocker


When I run the MPI and Hadoop frameworks simultaneously with long-running jobs, the Hadoop jobs fail to complete. The MPI job, which is shorter, completes normally, and the Hadoop framework continues for a while, but eventually, although it appears to still be running, it stops making progress on the jobs. The jobtracker keeps running, but each line of output indicates no map or reduce tasks are actually being executed:

12/06/08 10:55:41 INFO mapred.FrameworkScheduler: Assigning tasks for [slavehost] with 0 map slots and 0 reduce slots

I've examined the master's log and noticed this:

I0608 10:40:43.106740  6317 master.cpp:681] Deactivating framework 201206080825-36284608-5050-6311-0000 as requested by scheduler(1)@[my-ip]:59317

The framework ID is that of the Hadoop framework. This message is followed by messages indicating the slaves "couldn't lookup task [#]" and "couldn't lookup framework 201206080825-36284608-5050-6311-0000."

At first I thought this error was a fluke, since it does not happen with shorter-running jobs or with the Hadoop framework running independently (i.e., no MPI), but I have now consistently reproduced it 4 times.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Posted by "Jessica J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403153#comment-13403153 ] 

Jessica J commented on MESOS-206:
---------------------------------

Yeah, there are a number of these errors for multiple tasks that receive resources, start, and finally arrive at the TASK_FINISHED state. The JobTracker shows the error I pasted above for each "unknown" task; the master log says,

I0628 09:48:01.400383 25789 master.cpp:956] Status update from slave(1)@[slave-ip]:59707: task [task #] of framework 201206280753-36284608-5050-25784-0001 is now in state TASK_FINISHED
W0628 09:48:01.400524 25789 master.cpp:988] Status update from slave(1)@[slave-ip]:59707 ([slave hostname]): error, couldn't lookup task [task #]

These status updates come from multiple slave nodes, as well, so it's not just a single node failing.

The first exception I see in the JobTracker's logs is a FileNotFoundException:

12/06/28 08:17:30 INFO mapred.TaskInProgress: Error from attempt_201206280805_0002_r_000014_1: Error initializing attempt_201206280805_0002_r_000014_1:
java.io.FileNotFoundException: File does not exist: hdfs://namenode:54310/scratch/hadoop/mapred/staging/jessicaj/.staging/job_201206280805_0002/job.jar
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:534)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:213)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:163)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1164)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1145)
        at org.apache.hadoop.mapred.JobLocalizer.localizeJobJarFile(JobLocalizer.java:272)
        at org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:372)
        at org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:362)
        at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:202)
        at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1201)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:416)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
        at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1176)
        at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1091)
        at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2372)
        at java.lang.Thread.run(Thread.java:679)

However, the job tracker starts scheduling tasks "with 0 map slots and 0 reduce slots" (my first indication that something is wrong) a full 5 minutes before this exception occurs, so I'm not sure how things correlate.

                
> Long-running jobs on Hadoop framework do not run to completion
> --------------------------------------------------------------
>
>                 Key: MESOS-206
>                 URL: https://issues.apache.org/jira/browse/MESOS-206
>             Project: Mesos
>          Issue Type: Bug
>          Components: framework
>            Reporter: Jessica J
>            Priority: Blocker
>
> When I run the MPI and Hadoop frameworks simultaneously with long-running jobs, the Hadoop jobs fail to complete. The MPI job, which is shorter, completes normally, and the Hadoop framework continues for a while, but eventually, although it appears to still be running, it stops making progress on the jobs. The jobtracker keeps running, but each line of output indicates no map or reduce tasks are actually being executed:
> 12/06/08 10:55:41 INFO mapred.FrameworkScheduler: Assigning tasks for [slavehost] with 0 map slots and 0 reduce slots
> I've examined the master's log and noticed this:
> I0608 10:40:43.106740  6317 master.cpp:681] Deactivating framework 201206080825-36284608-5050-6311-0000 as requested by scheduler(1)@[my-ip]:59317
> The framework ID is that of the Hadoop framework. This message is followed by messages indicating the slaves "couldn't lookup task [#]" and "couldn't lookup framework 201206080825-36284608-5050-6311-0000."
> I thought the first time that this error was a fluke since it does not happen with shorter running jobs or with the Hadoop framework running independently (i.e., no MPI), but I have now consistently reproduced it 4 times.
> UPDATE: I just had the same issue occur when running Hadoop + Mesos without the MPI framework running simultaneously.


[jira] [Commented] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Posted by "Benjamin Hindman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411663#comment-13411663 ] 

Benjamin Hindman commented on MESOS-206:
----------------------------------------

Did you ever get a chance to wrap all of the Hadoop scheduler callbacks with try/catch to see if the problem went away?
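(For reference, a minimal sketch of the kind of wrapping being suggested, on the assumption that a Throwable escaping a scheduler callback can take down the driver thread and deactivate the framework. Class and method names here are illustrative, not the actual FrameworkScheduler code.)

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch, not the real Hadoop FrameworkScheduler: guard the
// body of a scheduler callback with try/catch so a stray RuntimeException
// (like the "unknown task" one seen in the JobTracker output) is logged
// and swallowed rather than propagating into the Mesos driver.
public class GuardedScheduler {
    private final Set<String> knownTasks = new HashSet<>();

    public void registerTask(String taskId) {
        knownTasks.add(taskId);
    }

    // Returns true if the update was handled, false if it was swallowed.
    public boolean statusUpdate(String taskId, String state) {
        try {
            if (!knownTasks.contains(taskId)) {
                throw new RuntimeException(
                    "Received status update for unknown task value: \"" + taskId + "\"");
            }
            // ... normal per-task bookkeeping would happen here ...
            return true;
        } catch (Throwable t) {
            // Log and keep going instead of letting the exception escape.
            System.err.println("statusUpdate failed: " + t.getMessage());
            return false;
        }
    }
}
```

The same pattern would apply to each callback (resourceOffers, statusUpdate, and so on), with the catch clause logging enough context to diagnose the underlying error.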
                


[jira] [Commented] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Posted by "Jessica J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398440#comment-13398440 ] 

Jessica J commented on MESOS-206:
---------------------------------

I am seeing a number of exceptions from the JobTracker. They're all for the same machine; it appears to be trying to connect repeatedly even after it says it's excluding the node. This is what I'm seeing:

12/06/21 10:01:23 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as [datanode-ip]:50010
12/06/21 10:01:23 INFO hdfs.DFSClient: Abandoning block blk_-9105592800944997506_190983
12/06/21 10:01:23 INFO hdfs.DFSClient: Excluding datanode [datanode-ip]:50010

12/06/21 10:01:23 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection refused
12/06/21 10:01:23 INFO hdfs.DFSClient: Abandoning block blk_8455625917706798102_190988
12/06/21 10:01:23 INFO hdfs.DFSClient: Excluding datanode [datanode-ip]:50010
                


[jira] [Comment Edited] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Posted by "Jessica J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403153#comment-13403153 ] 

Jessica J edited comment on MESOS-206 at 6/28/12 3:19 PM:
----------------------------------------------------------

Yeah, there are a number of these errors for multiple tasks that receive resources, start, and finally arrive at the TASK_FINISHED state. The JobTracker shows the error I pasted above for each "unknown" task; the master log says,

I0628 09:48:01.400383 25789 master.cpp:956] Status update from slave(1)@[slave-ip]:59707: task [task #] of framework 201206280753-36284608-5050-25784-0001 is now in state TASK_FINISHED
W0628 09:48:01.400524 25789 master.cpp:988] Status update from slave(1)@[slave-ip]:59707 ([slave hostname]): error, couldn't lookup task [task #]

These status updates come from multiple slave nodes, as well, so it's not just a single node failing.

The first exception I see in the JobTracker's logs is a FileNotFoundException:

12/06/28 08:17:30 INFO mapred.TaskInProgress: Error from attempt_201206280805_0002_r_000014_1: Error initializing attempt_201206280805_0002_r_000014_1:
java.io.FileNotFoundException: File does not exist: hdfs://namenode:54310/scratch/hadoop/mapred/staging/jessicaj/.staging/job_201206280805_0002/job.jar
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:534)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:213)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:163)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1164)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1145)
        at org.apache.hadoop.mapred.JobLocalizer.localizeJobJarFile(JobLocalizer.java:272)
        at org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:372)
        at org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:362)
        at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:202)
        at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1201)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:416)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
        at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1176)
        at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1091)
        at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2372)
        at java.lang.Thread.run(Thread.java:679)

However, the job tracker starts scheduling tasks "with 0 map slots and 0 reduce slots" (my first indication that something is wrong) a full 5 minutes before this exception occurs, so I'm not sure how things correlate.

The only other exception I can find is in a DataNode's log:

2012-06-28 08:38:05,839 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration([ip-address]:50010, storageID=DS-739474830-[ip-address]-50010-1335530999790, infoPort=50075, ipcPort=50020):Got exception while serving blk_4193084752334304973_9283 to /[another datanode ip-address]:
java.io.IOException: Block blk_4193084752334304973_9283 is not valid.
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:1029)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.getLength(FSDataset.java:992)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.getVisibleLength(FSDataset.java:1002)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:94)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:189)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99) 
        at java.lang.Thread.run(Thread.java:679)


                


[jira] [Commented] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Posted by "Jessica J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398750#comment-13398750 ] 

Jessica J commented on MESOS-206:
---------------------------------

Well, the jobs are still failing. I noticed this additional set of errors (occurring for multiple tasks) coming from the JobTracker:

12/06/21 14:32:51 INFO mapred.FrameworkScheduler: Task 316 is TASK_FINISHED
Exception in thread "Thread-3216" java.lang.RuntimeException: Received status update for unknown task value: "316"
        at org.apache.hadoop.mapred.FrameworkScheduler.statusUpdate(FrameworkScheduler.java:493)

Could this be related?
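(One thing worth noting about an exception like this: it escaped in a worker thread, "Thread-3216", so it kills only that thread and the update it was delivering is silently dropped while the rest of the process keeps running. A generic sketch of that behavior, not Hadoop code:)

```java
// Generic sketch, not Hadoop code: a RuntimeException thrown in a worker
// thread (like "Thread-3216" above) kills only that thread. Nothing is
// rethrown to the caller, so the status update is simply lost.
public class LostUpdateDemo {
    static volatile boolean updateRecorded = false;

    public static boolean deliverUpdate(String taskId) throws InterruptedException {
        updateRecorded = false;
        Thread worker = new Thread(() -> {
            // Simulates the scheduler rejecting an update for an unknown task.
            if (!taskId.equals("known")) {
                throw new RuntimeException(
                    "Received status update for unknown task value: \"" + taskId + "\"");
            }
            updateRecorded = true;
        });
        worker.start();
        worker.join(); // the exception already killed the thread; join just returns
        return updateRecorded;
    }
}
```

In other words, each failed update would stall that task's bookkeeping without crashing the JobTracker itself, which is consistent with the job appearing alive but making no progress.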


                


[jira] [Comment Edited] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Posted by "Jessica J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403173#comment-13403173 ] 

Jessica J edited comment on MESOS-206 at 6/28/12 3:42 PM:
----------------------------------------------------------

It may be clearer if I provide a timeline:

8:05 The master node registers the Hadoop framework and jobs begin, running normally.

8:12 The JobTracker starts launching tasks "with 0 map slots and 0 reduce slots." No prior exceptions can be found in any logs. (Perhaps these are normal job-cleanup tasks?)

8:17 The JobTracker generates a FileNotFoundException

8:37 A DataNode generates 4 IOExceptions for the same block

9:47 The first status update for an "unknown" task shows up in the mesos-master log. The JobTracker indicates a large number (20-30?) of "unknown task" status updates for a full minute.

9:48:19 The jobs make a little more progress. (The JobTracker indicates that tasks are completing successfully and slots are being scheduled with map/reduce tasks.)

9:48:23 ALL jobs are now being scheduled "with 0 map slots and 0 reduce slots."

9:57 I check the Hadoop web UI and notice the number of map tasks and reduce tasks have both dropped to 0. Since no further progress is being made, I kill the framework.
                


[jira] [Comment Edited] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Posted by "Jessica J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403173#comment-13403173 ] 

Jessica J edited comment on MESOS-206 at 6/28/12 6:36 PM:
----------------------------------------------------------

It may be clearer if I provide a timeline:

8:05 The master node registers the Hadoop framework and jobs begin, running normally.

8:12 The JobTracker starts launching tasks "with 0 map slots and 0 reduce slots." No prior exceptions can be found in any logs. (Perhaps these are normal job-cleanup tasks?)

8:17 The JobTracker generates a FileNotFoundException

8:37 A DataNode generates 4 IOExceptions for the same block

9:47 The first status update for an "unknown" task shows up in the mesos-master log; there are 114 of these, which are replicated in the JobTracker log at 9:48:17.

9:48:17 mesos-master log says, "Deactivating framework 201206280753-36284608-5050-25784-0001 as requested by scheduler(1)"

9:48:19 The jobs make a little more progress. (The JobTracker indicates that tasks are completing successfully and slots are being scheduled with map/reduce tasks.)

9:48:23 ALL jobs are now being scheduled "with 0 map slots and 0 reduce slots."

9:57 I check the Hadoop web UI and notice the number of map tasks and reduce tasks have both dropped to 0. Since no further progress is being made, I kill the framework.

From 8:10 to 9:48, the mesos-slave logs contain multiple repetitions of this warning:

W0628 08:11:47.255110 23714 slave.cpp:1027] Status update error: couldn't lookup executor for framework 201206280753-36284608-5050-25784-0001
                
> Long-running jobs on Hadoop framework do not run to completion
> --------------------------------------------------------------
>
>                 Key: MESOS-206
>                 URL: https://issues.apache.org/jira/browse/MESOS-206
>             Project: Mesos
>          Issue Type: Bug
>          Components: framework
>            Reporter: Jessica J
>            Priority: Blocker
>
> When I run the MPI and Hadoop frameworks simultaneously with long-running jobs, the Hadoop jobs fail to complete. The MPI job, which is shorter, completes normally, and the Hadoop framework continues for a while, but eventually, although it appears to still be running, it stops making progress on the jobs. The jobtracker keeps running, but each line of output indicates no map or reduce tasks are actually being executed:
> 12/06/08 10:55:41 INFO mapred.FrameworkScheduler: Assigning tasks for [slavehost] with 0 map slots and 0 reduce slots
> I've examined the master's log and noticed this:
> I0608 10:40:43.106740  6317 master.cpp:681] Deactivating framework 201206080825-36284608-5050-6311-0000 as requested by scheduler(1)@[my-ip]:59317
> The framework ID is that of the Hadoop framework. This message is followed by messages indicating the slaves "couldn't lookup task [#]" and "couldn't lookup framework 201206080825-36284608-5050-6311-0000."
> I thought the first time that this error was a fluke since it does not happen with shorter running jobs or with the Hadoop framework running independently (i.e., no MPI), but I have now consistently reproduced it 4 times.
> UPDATE: I just had the same issue occur when running Hadoop + Mesos without the MPI framework running simultaneously.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Posted by "Benjamin Hindman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291919#comment-13291919 ] 

Benjamin Hindman commented on MESOS-206:
----------------------------------------

My best guess is that an exception is being thrown somewhere within the Hadoop JobTracker and bubbling up through Mesos, which then decides to deactivate the framework. Is there anything in the Hadoop logs that looks like a "description" of an exception?

The immediate fix will be to wrap all Mesos scheduler callbacks with a try/catch so that exceptions don't bubble.
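The suggested workaround can be sketched as a defensive wrapper around the scheduler callbacks. This is a hypothetical illustration, not the actual Hadoop-on-Mesos code: the Scheduler interface and method signatures below are simplified stand-ins, not the real org.apache.mesos API.

```java
// Sketch of the proposed fix: wrap every scheduler callback in try/catch so
// an exception thrown by the JobTracker-side logic cannot propagate into the
// Mesos driver (which would otherwise deactivate the framework).
// The Scheduler interface here is a simplified stand-in, NOT the real
// org.apache.mesos.Scheduler API.
interface Scheduler {
    void resourceOffers(Object driver, Object offers);
    void statusUpdate(Object driver, Object status);
}

class SafeScheduler implements Scheduler {
    private final Scheduler delegate;  // e.g. the Hadoop FrameworkScheduler
    volatile Throwable lastError;      // recorded and logged instead of rethrown

    SafeScheduler(Scheduler delegate) { this.delegate = delegate; }

    @Override
    public void resourceOffers(Object driver, Object offers) {
        try {
            delegate.resourceOffers(driver, offers);
        } catch (RuntimeException e) {
            lastError = e;             // swallow: don't let it bubble into Mesos
            System.err.println("resourceOffers threw: " + e);
        }
    }

    @Override
    public void statusUpdate(Object driver, Object status) {
        try {
            delegate.statusUpdate(driver, status);
        } catch (RuntimeException e) {
            lastError = e;
            System.err.println("statusUpdate threw: " + e);
        }
    }
}
```

With this pattern a faulty callback is logged rather than terminating the framework, which would also make the underlying exception visible in the scheduler's own logs.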
                

        

[jira] [Commented] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Posted by "Jessica J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403173#comment-13403173 ] 

Jessica J commented on MESOS-206:
---------------------------------

It may be clearer if I provide a timeline:

8:05 The master node registers the Hadoop framework and jobs begin, running normally.

8:12 The JobTracker starts launching tasks "with 0 map slots and 0 reduce slots." No prior exceptions can be found in any logs. (Perhaps these are normal job-cleanup tasks?)

8:17 The JobTracker generates a FileNotFoundException

8:37 A DataNode generates 4 IOExceptions for the same block

9:47 The first status update for an "unknown" task shows up in the mesos-master log. The JobTracker indicates a large number (20-30?) of "unknown task" status updates for a full minute.

9:48:19 The jobs make a little more progress. (The JobTracker indicates that tasks are completing successfully and being scheduled with map/reduce tasks.)

9:48:23 ALL jobs are now being scheduled "with 0 map slots and 0 reduce slots."

9:57 I check the Hadoop web UI and notice the number of map tasks and reduce tasks have both dropped to 0. Since no further progress is being made, I kill the framework.

I assume the jobs progress from 8:17 to 9:47, where the first failed status update occurs.
                

        

[jira] [Commented] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Posted by "Jessica J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423259#comment-13423259 ] 

Jessica J commented on MESOS-206:
---------------------------------

This has to be the weirdest bug I've ever encountered. I finally found an answer to the pegged CPU, and it has nothing to do with Mesos and everything to do with Java and last month's leap second. http://answers.mapr.com/questions/2751/high-cpu-utilization-on-cluster-while-idle. I have yet to see if my jobs run to completion, but at least CPU usage is back to normal.
                

        

[jira] [Commented] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Posted by "Jessica J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418324#comment-13418324 ] 

Jessica J commented on MESOS-206:
---------------------------------

After more digging, I've discovered the JobTracker is seriously overusing resources. Top output indicates it's using almost 10 GB of virtual memory (and the number steadily grows larger as the JobTracker continues to run) and almost 800% CPU:

   PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28868 jessicaj  20   0 9069m 1.3g  15m S 737.6  8.2 257:12.30 java

NameNode and RunJar are also using fairly large amounts of virtual memory at 8 GB and 6 GB respectively.

My guess is a serious memory leak...
                

        

[jira] [Commented] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Posted by "Jessica J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398631#comment-13398631 ] 

Jessica J commented on MESOS-206:
---------------------------------

Yes, I spent some time troubleshooting, and I believe it was an HDFS issue (although I am still waiting for the jobs to complete after what I believe was the fix I needed, so I can't say for sure yet). I wonder if there's any way Mesos can be more helpful when the underlying framework fails...?
                

        

[jira] [Commented] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Posted by "Jessica J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423976#comment-13423976 ] 

Jessica J commented on MESOS-206:
---------------------------------

OK, so now that the leap second issue is out of the way (which makes the excessive resource usage irrelevant), my Hadoop jobs are still failing to complete. (Essentially, everything except my last four comments still apply.) Basically, when I first set up a cluster, I can run through my large job a couple times successfully. (I'm doing benchmarks, so I'm running each job multiple times.) On my third try, however, the job fails and the above-mentioned issues come into play (namely, the framework disconnecting and task #s no longer being recognized.) Is Mesos not completely releasing resources (e.g., file handles) that it should be?
                

        

[jira] [Commented] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Posted by "Vinod Kone (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423991#comment-13423991 ] 

Vinod Kone commented on MESOS-206:
----------------------------------

Can you try with the latest Mesos? We have recently landed some fixes that dealt with improper allocation of resources.
                

        

[jira] [Comment Edited] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Posted by "Jessica J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403173#comment-13403173 ] 

Jessica J edited comment on MESOS-206 at 6/28/12 6:16 PM:
----------------------------------------------------------

It may be clearer if I provide a timeline:

8:05 The master node registers the Hadoop framework and jobs begin, running normally.

8:12 The JobTracker starts launching tasks "with 0 map slots and 0 reduce slots." No prior exceptions can be found in any logs. (Perhaps these are normal job-cleanup tasks?)

8:17 The JobTracker generates a FileNotFoundException

8:37 A DataNode generates 4 IOExceptions for the same block

9:47 The first status update for an "unknown" task shows up in the mesos-master log. The JobTracker indicates a large number (20-30?) of "unknown task" status updates for a full minute.

9:48:17 mesos-master log says, "Deactivating framework 201206280753-36284608-5050-25784-0001 as requested by scheduler(1)"

9:48:19 The jobs make a little more progress. (The JobTracker indicates that tasks are completing successfully and being scheduled with map/reduce tasks.)

9:48:23 ALL jobs are now being scheduled "with 0 map slots and 0 reduce slots."

9:57 I check the Hadoop web UI and notice the number of map tasks and reduce tasks have both dropped to 0. Since no further progress is being made, I kill the framework.
                
      was (Author: esohpromatem):
    It may be clearer if I provide a timeline:

8:05 The master node registers the Hadoop framework and jobs begin, running normally.

8:12 The JobTracker starts launching tasks "with 0 map slots and 0 reduce slots." No prior exceptions can be found in any logs. (Perhaps these are normal job-cleanup tasks?)

8:17 The JobTracker generates a FileNotFoundException

8:37 A DataNode generates 4 IOExceptions for the same block

9:47 The first status update for an "unknown" task shows up in the mesos-master log. The JobTracker indicates a large number (20-30?) of "unknown task" status updates for a full minute.

9:48:19 The jobs make a little more progress. (The JobTracker indicates that tasks are completing successfully and being scheduled with map/reduce tasks.)

9:48:23 ALL jobs are now being scheduled "with 0 map slots and 0 reduce slots."

9:57 I check the Hadoop web UI and notice the number of map tasks and reduce tasks have both dropped to 0. Since no further progress is being made, I kill the framework.
                  

        

[jira] [Commented] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Posted by "Jessica J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411535#comment-13411535 ] 

Jessica J commented on MESOS-206:
---------------------------------

I can say with high certainty that this is a Mesos issue and not Hadoop. I don't know whether the problem is specifically in the Hadoop patched code or if it's in the core Mesos functionality, but identical jobs run perfectly when Mesos is not involved (i.e., using a non-Mesos patched Hadoop 0.20.205.0).

I am near completion on a Mesos-related project, but this issue is preventing me from finishing the job. Any input is greatly appreciated.
                

        

[jira] [Comment Edited] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Posted by "Jessica J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418324#comment-13418324 ] 

Jessica J edited comment on MESOS-206 at 7/19/12 2:30 PM:
----------------------------------------------------------

After more digging, I've discovered the JobTracker is seriously overusing resources. Top output indicates it's using almost 10 GB of virtual memory (and the number steadily grows larger as the JobTracker continues to run) and almost 800% CPU:

   PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28868 jessicaj  20   0 9069m 1.3g  15m S 737.6  8.2 257:12.30 java

NameNode and RunJar are also using fairly large amounts of virtual memory at 8 GB and 6 GB respectively.

My guess is a serious memory leak...

Edited to add: The JobTracker memory usage continues to grow significantly even after I kill any jobs I was attempting to run.
                
      was (Author: esohpromatem):
    After more digging, I've discovered the JobTracker is seriously overusing resources. Top output indicates it's using almost 10 GB of virtual memory (and the number steadily grows larger as the JobTracker continues to run) and almost 800% CPU:

   PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28868 jessicaj  20   0 9069m 1.3g  15m S 737.6  8.2 257:12.30 java

NameNode and RunJar are also using fairly large amounts of virtual memory at 8 GB and 6 GB respectively.

My guess is a serious memory leak...
                  

        

[jira] [Commented] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Posted by "Jessica J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413003#comment-13413003 ] 

Jessica J commented on MESOS-206:
---------------------------------

Ben, thanks for your response. Yes, I tried wrapping all the callbacks with try/catch blocks, but the issue is still there. I put LOG statements in each of the catches and am not seeing them in any of my logs.
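The wrapping pattern described here looks roughly like the following (a hedged sketch: `guarded` and the stand-in `Callback` interface are hypothetical illustrations, not the real org.apache.mesos.Scheduler API):

```java
import java.util.logging.Logger;

public class CallbackWrapper {
    private static final Logger LOG = Logger.getLogger("FrameworkScheduler");

    interface Callback { void run() throws Exception; }

    // Wrap each scheduler callback body so an exception is logged rather than
    // propagating back through JNI, where it can silently take down the driver.
    static boolean guarded(String name, Callback cb) {
        try {
            cb.run();
            return true;
        } catch (Throwable t) {
            LOG.severe("Exception in " + name + ": " + t);
            return false;
        }
    }

    public static void main(String[] args) {
        boolean ok = guarded("resourceOffers",
                () -> { throw new RuntimeException("boom"); });
        System.out.println("callback ok: " + ok); // prints "callback ok: false"
    }
}
```

If the catch-side LOG lines never appear, the exception is not being raised inside the wrapped Java callbacks at all, which is consistent with the problem originating elsewhere (e.g. on the native side of the driver).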
                

[jira] [Commented] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398622#comment-13398622 ] 

Matei Zaharia commented on MESOS-206:
-------------------------------------

So that might actually be an HDFS issue, or an issue with machines becoming overloaded. Can you look at the HDFS datanode logs on that node? Maybe the datanode crashed, or maybe the machine was overloaded for some reason and very slow.

The reason the JobTracker uses HDFS, by the way, is that Hadoop clients submit jobs by first writing the JAR and config file to HDFS.
                

[jira] [Commented] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Posted by "Matei Zaharia (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399834#comment-13399834 ] 

Matei Zaharia commented on MESOS-206:
-------------------------------------

Yeah, that's much more likely, but unfortunately I'm not sure what's causing it. Are there any messages earlier in the log about a task called 316? Maybe we launch it and then somehow drop it from the history before it finishes. Also, is this the very first error that happens?
                

[jira] [Commented] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Posted by "Jessica J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427393#comment-13427393 ] 

Jessica J commented on MESOS-206:
---------------------------------

Some new messages in the JobTracker log I'm seeing with the latest code:

12/08/02 10:05:20 INFO mapred.TaskInProgress: Error from attempt_201208020749_0150_r_000047_0: java.lang.IllegalArgumentException: Null user
    at org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:762)
    at org.apache.hadoop.mapred.Child.main(Child.java:241)
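That "Null user" IllegalArgumentException is a precondition check: createRemoteUser rejects a null or empty user name before building the UserGroupInformation. A minimal stand-in for the check (hypothetical sketch, not the actual Hadoop source):

```java
// Stand-in for the precondition that produces the "Null user" message:
// the task's Child process started without a user name in its context.
public class RemoteUserCheck {
    static String createRemoteUser(String user) {
        if (user == null || user.isEmpty()) {
            throw new IllegalArgumentException("Null user");
        }
        return user;
    }

    public static void main(String[] args) {
        System.out.println(createRemoteUser("jessicaj")); // fine
        createRemoteUser(null);                           // throws "Null user"
    }
}
```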

(4 minutes later)

12/08/02 10:09:47 WARN mapred.FrameworkScheduler: SchedulerDriver returned irregular status: DRIVER_ABORTED

All other log activity/error messages remain the same as previously commented. The "irregular status" message comes from resourceOffers(SchedulerDriver d, List<Offer> offers) in FrameworkScheduler.java (line 245) after attempting to launch tasks. If I've followed the code path correctly, this call traces to line 172 (JNIExecutor::launchTask) in org_apache_mesos_MesosExecutorDriver.cpp, where it appears a Java exception is causing the driver to abort. The exception that occurs closest to the time of DRIVER_ABORTED is the first status update for an unknown task:

Exception in thread "Thread-150424" java.lang.RuntimeException: Received status update for unknown task value: "56357"

Any ideas how all this is interrelated, and what's causing it?
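The failure mode described in this comment can be reconstructed roughly as follows (a hypothetical sketch: the task map, register/deliver methods, and DriverStatus enum are stand-ins, not the real Mesos driver internals):

```java
import java.util.HashMap;
import java.util.Map;

public class StatusUpdateDemo {
    enum DriverStatus { DRIVER_RUNNING, DRIVER_ABORTED }

    private final Map<String, String> liveTasks = new HashMap<>();
    private DriverStatus status = DriverStatus.DRIVER_RUNNING;

    void register(String taskId) {
        liveTasks.put(taskId, "RUNNING");
    }

    // A status update arriving for a task ID already dropped from the map
    // raises a RuntimeException, matching the message in the trace above.
    void statusUpdate(String taskId) {
        if (!liveTasks.containsKey(taskId)) {
            throw new RuntimeException(
                "Received status update for unknown task value: \"" + taskId + "\"");
        }
        // ...normal bookkeeping would go here...
    }

    // Driver loop stand-in: an uncaught exception from a callback flips the
    // driver to DRIVER_ABORTED, which the scheduler then reports as the
    // "irregular status" seen in resourceOffers.
    DriverStatus deliver(String taskId) {
        try {
            statusUpdate(taskId);
        } catch (RuntimeException e) {
            status = DriverStatus.DRIVER_ABORTED;
        }
        return status;
    }

    public static void main(String[] args) {
        StatusUpdateDemo d = new StatusUpdateDemo();
        System.out.println(d.deliver("56357")); // task never registered -> DRIVER_ABORTED
    }
}
```

Under this reading, the open question is why the task was dropped from the bookkeeping before its final status update arrived.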
                

[jira] [Updated] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Posted by "Jessica J (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jessica J updated MESOS-206:
----------------------------

    Description: 
When I run the MPI and Hadoop frameworks simultaneously with long-running jobs, the Hadoop jobs fail to complete. The MPI job, which is shorter, completes normally, and the Hadoop framework continues for a while, but eventually, although it appears to still be running, it stops making progress on the jobs. The jobtracker keeps running, but each line of output indicates no map or reduce tasks are actually being executed:

12/06/08 10:55:41 INFO mapred.FrameworkScheduler: Assigning tasks for [slavehost] with 0 map slots and 0 reduce slots

I've examined the master's log and noticed this:

I0608 10:40:43.106740  6317 master.cpp:681] Deactivating framework 201206080825-36284608-5050-6311-0000 as requested by scheduler(1)@[my-ip]:59317

The framework ID is that of the Hadoop framework. This message is followed by messages indicating the slaves "couldn't lookup task [#]" and "couldn't lookup framework 201206080825-36284608-5050-6311-0000."

I thought the first time that this error was a fluke since it does not happen with shorter running jobs or with the Hadoop framework running independently (i.e., no MPI), but I have now consistently reproduced it 4 times.

UPDATE: I just had the same issue occur when running Hadoop + Mesos without the MPI framework running simultaneously.


[jira] [Commented] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Posted by "Jessica J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427345#comment-13427345 ] 

Jessica J commented on MESOS-206:
---------------------------------

Thanks for the response, Vinod. I updated to the latest code base this morning; the job set still will not complete. I got down to about 30 jobs left out of ~200; 12 of them were fully mapped and one was partially reduced when the framework stopped scheduling maps/reduces.
                

[jira] [Commented] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

Posted by "Jessica J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413742#comment-13413742 ] 

Jessica J commented on MESOS-206:
---------------------------------

I updated to the most recent code base this morning to see if the issue was fixed in any of the more recent code changes. The problem is still the same, but because I was starting from scratch with a smaller log file, I noticed another error that I had missed before. Think it's related?

12/07/13 09:24:58 INFO mapred.TaskInProgress: Error from attempt_201207130920_0002_m_000025_0: java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:278)
Caused by: java.io.IOException: Task process exit with nonzero status of 143.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:265)

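Worth noting about that trace: exit status 143 follows the Unix convention of 128 + signal number, i.e. 128 + 15 (SIGTERM). So the child task process was terminated externally rather than exiting on its own:

```java
// Exit status 143 decodes as "killed by signal 15 (SIGTERM)": 128 + 15.
// The task's child JVM was terminated from outside (e.g. by the framework
// or the OS), not by its own System.exit call.
public class ExitCode {
    static final int SIGTERM = 15;

    public static void main(String[] args) {
        int exitStatus = 128 + SIGTERM;
        System.out.println("SIGTERM exit status: " + exitStatus); // prints 143
    }
}
```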
                