Posted to dev@mesos.apache.org by "Jessica J (JIRA)" <ji...@apache.org> on 2012/06/28 17:20:44 UTC

[jira] [Comment Edited] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion

    [ https://issues.apache.org/jira/browse/MESOS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403153#comment-13403153 ] 

Jessica J edited comment on MESOS-206 at 6/28/12 3:19 PM:
----------------------------------------------------------

Yeah, there are a number of these errors for multiple tasks that receive resources, start, and finally arrive at the TASK_FINISHED state. The JobTracker shows the error I pasted above for each "unknown" task; the master log says,

I0628 09:48:01.400383 25789 master.cpp:956] Status update from slave(1)@[slave-ip]:59707: task [task #] of framework 201206280753-36284608-5050-25784-0001 is now in state TASK_FINISHED
W0628 09:48:01.400524 25789 master.cpp:988] Status update from slave(1)@[slave-ip]:59707 ([slave hostname]): error, couldn't lookup task [task #]

These status updates come from multiple slave nodes, as well, so it's not just a single node failing.

The first exception I see in the JobTracker's logs is a FileNotFoundException:

12/06/28 08:17:30 INFO mapred.TaskInProgress: Error from attempt_201206280805_0002_r_000014_1: Error initializing attempt_201206280805_0002_r_000014_1:
java.io.FileNotFoundException: File does not exist: hdfs://namenode:54310/scratch/hadoop/mapred/staging/jessicaj/.staging/job_201206280805_0002/job.jar
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:534)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:213)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:163)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1164)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1145)
        at org.apache.hadoop.mapred.JobLocalizer.localizeJobJarFile(JobLocalizer.java:272)
        at org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:372)
        at org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:362)
        at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:202)
        at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1201)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:416)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
        at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1176)
        at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1091)
        at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2372)
        at java.lang.Thread.run(Thread.java:679)
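
For reference, the step that fails in this trace is the TaskTracker copying the submitted job.jar from the HDFS staging directory to its local disk. A minimal sketch of that call (not the Hadoop source; the class name and the local target path are just illustrative, and the HDFS path is taken from the exception message) looks like this:

// Sketch only: copyToLocalFile() stats the source first (the getFileStatus
// call in the trace), so a staging jar that has already been removed
// surfaces as the FileNotFoundException shown above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalizeJobJarSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Path copied from the exception message above.
        Path jobJar = new Path("hdfs://namenode:54310/scratch/hadoop/mapred/"
                + "staging/jessicaj/.staging/job_201206280805_0002/job.jar");

        // Hypothetical local target; the real TaskTracker lays this out
        // under its own mapred.local.dir.
        Path localJar = new Path("/tmp/job.jar");

        FileSystem fs = jobJar.getFileSystem(conf);
        fs.copyToLocalFile(jobJar, localJar);
    }
}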

However, the JobTracker starts scheduling tasks "with 0 map slots and 0 reduce slots" (my first indication that something is wrong) a full 5 minutes before this exception occurs, so I'm not sure how the two correlate.

The only other exception I can find is in a DataNode's log:

2012-06-28 08:38:05,839 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration([ip-address]:50010, storageID=DS-739474830-[ip-address]-50010-1335530999790, infoPort=50075, ipcPort=50020):Got exception while serving blk_4193084752334304973_9283 to /[another datanode ip-address]:
java.io.IOException: Block blk_4193084752334304973_9283 is not valid.
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:1029)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.getLength(FSDataset.java:992)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.getVisibleLength(FSDataset.java:1002)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:94)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:189)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99) 
        at java.lang.Thread.run(Thread.java:679)


                
> Long-running jobs on Hadoop framework do not run to completion
> --------------------------------------------------------------
>
>                 Key: MESOS-206
>                 URL: https://issues.apache.org/jira/browse/MESOS-206
>             Project: Mesos
>          Issue Type: Bug
>          Components: framework
>            Reporter: Jessica J
>            Priority: Blocker
>
> When I run the MPI and Hadoop frameworks simultaneously with long-running jobs, the Hadoop jobs fail to complete. The MPI job, which is shorter, completes normally, and the Hadoop framework continues for a while, but eventually, although it appears to still be running, it stops making progress on the jobs. The jobtracker keeps running, but each line of output indicates no map or reduce tasks are actually being executed:
> 12/06/08 10:55:41 INFO mapred.FrameworkScheduler: Assigning tasks for [slavehost] with 0 map slots and 0 reduce slots
> I've examined the master's log and noticed this:
> I0608 10:40:43.106740  6317 master.cpp:681] Deactivating framework 201206080825-36284608-5050-6311-0000 as requested by scheduler(1)@[my-ip]:59317
> The framework ID is that of the Hadoop framework. This message is followed by messages indicating the slaves "couldn't lookup task [#]" and "couldn't lookup framework 201206080825-36284608-5050-6311-0000."
> At first I thought this error was a fluke, since it does not happen with shorter-running jobs or with the Hadoop framework running independently (i.e., no MPI), but I have now consistently reproduced it 4 times.
> UPDATE: I just had the same issue occur when running Hadoop + Mesos without the MPI framework running simultaneously.
