Posted to mapreduce-user@hadoop.apache.org by Kai Ju Liu <ka...@tellapart.com> on 2011/08/02 06:47:04 UTC

Re: MapReduce jobs hanging or failing near completion

Hi Arun. Since migrating HDFS off EBS-mounted volumes and onto ephemeral
disks, the problem has actually persisted. Now, however, there is no
evidence of errors on any of the mappers. The job tracker lists one less map
completed than the map total, while the job details show all mappers as
having completed. The jobs "hang" in this state as before.

Is there something in particular I should be looking for on my local disks?
Hadoop fsck shows all clear, but I'll have to wait until morning to take
individual nodes offline to check their disks. Any further details you might
have would be very helpful. Thanks!
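
For reference, a rough sketch of the sort of local-disk checks that apply on a worker node here. The /mnt/hadoop path comes from the spill file in the stack trace quoted below; the /dev/xvdb device name is only a guess for an EC2 ephemeral disk, and SMART data may not be available on virtualized instances:

$ df -h /mnt/hadoop                        # make sure the mapred.local.dir volume isn't full
$ dmesg | grep -iE 'i/o error|xvd'         # look for kernel-level errors on the ephemeral devices
$ sudo smartctl -H /dev/xvdb               # SMART health check (needs the smartmontools package)
$ ls -l /mnt/hadoop/mapred/local           # confirm the TaskTracker's local dirs are present and writable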

Kai Ju

On Tue, Jul 19, 2011 at 1:50 PM, Arun C Murthy <ac...@hortonworks.com> wrote:

> Is this reproducible? If so, I'd urge you to check your local disks...
>
> Arun
>
> On Jul 19, 2011, at 12:41 PM, Kai Ju Liu wrote:
>
> Hi Marcos. The issue appears to be the following. A reduce task is unable
> to fetch results from a map task on HDFS. The map task is re-run, but the
> map task is now unable to retrieve information that it needs to run. Here is
> the error from the second map task:
>
> java.io.FileNotFoundException: /mnt/hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201107171642_0560/attempt_201107171642_0560_m_000292_1/output/spill0.out
> 	at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:176)
> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
> 	at org.apache.hadoop.mapred.Merger$Segment.init(Merger.java:205)
> 	at org.apache.hadoop.mapred.Merger$Segment.access$100(Merger.java:165)
> 	at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:418)
> 	at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381)
> 	at org.apache.hadoop.mapred.Merger.merge(Merger.java:77)
> 	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1547)
> 	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1179)
> 	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:262)
>
> I have been having general difficulties with HDFS on EBS, which pointed me in this direction. Does this sound like a possible hypothesis to you? Thanks!
>
> Kai Ju
>
> P.S. I am migrating off of HDFS on EBS, so I will post back with further results as soon as I have them.
>
> On Thu, Jul 7, 2011 at 6:36 PM, Marcos Ortiz <ml...@uci.cu> wrote:
>
>>
>>
>> On 7/7/2011 8:43 PM, Kai Ju Liu wrote:
>>
>>  Over the past week or two, I've run into an issue where MapReduce jobs
>>> hang or fail near completion. The percent completion of both map and
>>> reduce tasks is often reported as 100%, but the actual number of
>>> completed tasks is less than the total number. It appears that either
>>> tasks backtrack and need to be restarted or the last few reduce tasks
>>> hang interminably on the copy step.
>>>
>>> In certain cases, the jobs actually complete. In other cases, I can't
>>> wait long enough and have to kill the job manually.
>>>
>>> My Hadoop cluster is hosted in EC2 on instances of type c1.xlarge with 4
>>> attached EBS volumes. The instances run Ubuntu 10.04.1 with the
>>> 2.6.32-309-ec2 kernel, and I'm currently using Cloudera's CDH3u0
>>> distribution. Has anyone experienced similar behavior in their clusters,
>>> and if so, had any luck resolving it? Thanks!
>>>
>>  Can you post your NN and DN log files here?
>> Regards
>>
>>  Kai Ju
>>>
>>
>> --
>> Marcos Luís Ortíz Valmaseda
>>  Software Engineer (UCI)
>>  Linux User # 418229
>>  http://marcosluis2186.posterous.com
>>  http://twitter.com/marcosluis2186
>>
>>
>

Re: MapReduce jobs hanging or failing near completion

Posted by Kai Ju Liu <ka...@tellapart.com>.
Hi Arun. The job was stuck for 6+ hours before I killed it. Looking back at
the logs, it looks like there were 12 failed map tasks. Two attempts were
indeed made for each task on different nodes, with the failed attempt
showing "Too many fetch-failures" in the Error column.

One thing possibly of note is that the first attempt always fails after
roughly 5 to 10 minutes and often shows "Failed to connect to <ip>, add to
deadNodes and continue" messages. The second attempt always succeeds in less
than 1 minute. The map task as a whole is still marked as failed, though.
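
A quick way to line these failed attempts up with the JobTracker's side, reusing the JOBTRACKER.log name and the _<clustertimestamp>_<jobid>_ placeholder convention Arun uses elsewhere in the thread (the exact log wording varies by release, so treat the patterns as a starting point):

$ grep "Too many fetch-failures" JOBTRACKER.log | grep _<clustertimestamp>_<jobid>_m_
$ grep "add to deadNodes" <path to the first attempt's task log> | head    # on the TaskTracker that ran the failed attempt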

Kai Ju

On Wed, Aug 3, 2011 at 11:56 AM, Arun C Murthy <ac...@hortonworks.com> wrote:

> How long was your job stuck?
>
> The JT should have re-run the map on a different node. Do you see 'fetch
> failures' messages in the JT logs?
>
> The upcoming hadoop-0.20.204 release (now under discussion/vote) has better
> logging to help diagnose this in the JT logs.
>
> Arun
>
> On Aug 3, 2011, at 10:30 AM, Kai Ju Liu wrote:
>
> Hi Arun. A funny thing happened this morning: one of my jobs got stuck with
> the "fetch failures" messages that you mentioned. There was one pending map
> task remaining and one failed map task that had that error, and the reducers
> were stuck at just under 33.3% completion.
>
> Is there a solution or diagnosis for this situation? I don't know if it's
> related to the other issue I've been having, but it would be great to
> resolve this one for now. Thanks!
>
> Kai Ju
>
> On Tue, Aug 2, 2011 at 10:18 AM, Kai Ju Liu <ka...@tellapart.com> wrote:
>
>> All of the reducers are complete, both on the job tracker page and the job
>> details page. I used to get "fetch failure" messages when HDFS was mounted
>> on EBS volumes, but I haven't seen any since I migrated to physical disks.
>>
>> I'm currently using the fair scheduler, but it doesn't look like I've
>> specified any allocations. Perhaps I'll dig into this further with the
>> Cloudera team to see if there is indeed a problem with the job tracker or
>> scheduler. Otherwise, I'll give 0.20.203 + capacity scheduler a shot.
>>
>> Thanks again for the pointers.
>>
>> Kai Ju
>>
>>
>> On Mon, Aug 1, 2011 at 10:08 PM, Arun C Murthy <ac...@hortonworks.com> wrote:
>>
>>> On Aug 1, 2011, at 9:47 PM, Kai Ju Liu wrote:
>>>
>>> Hi Arun. Since migrating HDFS off EBS-mounted volumes and onto ephemeral
>>> disks, the problem has actually persisted. Now, however, there is no
>>> evidence of errors on any of the mappers. The job tracker lists one less map
>>> completed than the map total, while the job details show all mappers as
>>> having completed. The jobs "hang" in this state as before.
>>>
>>>
>>> Are any of your job's reducers completing? Do you see 'fetch failures'
>>> messages either in JT logs or reducers' (tasks) logs?
>>>
>>> If not, it's clear that the JobTracker/Scheduler (which Scheduler are you
>>> using btw?) are 'losing' tasks, which is a serious bug. You say that you are
>>> running CDH - unfortunately I have no idea what patchsets you run with it. I
>>> can't, off the top of my head, remember the JT/CapacityScheduler losing a
>>> task - but I maintained Yahoo clusters which ran hadoop-0.20.203.
>>>
>>> Here is something worth trying:
>>> $ cat JOBTRACKER.log | grep Assigning | grep
>>> _<clustertimestamp>_<jobid>_m_*
>>>
>>> The JOBTRACKER.log is the JT's log file on the JT host and if your jobid
>>> is job_12345342432_0001, then <clustertimestamp> == 12345342432 and
>>> <jobid> == 0001.
>>>
>>> Good luck.
>>>
>>> Arun
>>>
>>>
>>> Is there something in particular I should be looking for on my local
>>> disks? Hadoop fsck shows all clear, but I'll have to wait until morning to
>>> take individual nodes offline to check their disks. Any further details you
>>> might have would be very helpful. Thanks!
>>>
>>> Kai Ju
>>>
>>> On Tue, Jul 19, 2011 at 1:50 PM, Arun C Murthy <ac...@hortonworks.com> wrote:
>>>
>>>> Is this reproducible? If so, I'd urge you to check your local disks...
>>>>
>>>> Arun
>>>>
>>>> On Jul 19, 2011, at 12:41 PM, Kai Ju Liu wrote:
>>>>
>>>> Hi Marcos. The issue appears to be the following. A reduce task is
>>>> unable to fetch results from a map task on HDFS. The map task is re-run, but
>>>> the map task is now unable to retrieve information that it needs to run.
>>>> Here is the error from the second map task:
>>>>
>>>> java.io.FileNotFoundException: /mnt/hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201107171642_0560/attempt_201107171642_0560_m_000292_1/output/spill0.out
>>>> 	at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:176)
>>>> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
>>>> 	at org.apache.hadoop.mapred.Merger$Segment.init(Merger.java:205)
>>>> 	at org.apache.hadoop.mapred.Merger$Segment.access$100(Merger.java:165)
>>>> 	at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:418)
>>>> 	at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381)
>>>> 	at org.apache.hadoop.mapred.Merger.merge(Merger.java:77)
>>>> 	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1547)
>>>> 	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1179)
>>>> 	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
>>>> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
>>>> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>>> 	at javax.security.auth.Subject.doAs(Subject.java:396)
>>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>>>> 	at org.apache.hadoop.mapred.Child.main(Child.java:262)
>>>>
>>>> I have been having general difficulties with HDFS on EBS, which pointed me in this direction. Does this sound like a possible hypothesis to you? Thanks!
>>>>
>>>> Kai Ju
>>>>
>>>> P.S. I am migrating off of HDFS on EBS, so I will post back with further results as soon as I have them.
>>>>
>>>> On Thu, Jul 7, 2011 at 6:36 PM, Marcos Ortiz <ml...@uci.cu> wrote:
>>>>
>>>>>
>>>>>
>>>>> On 7/7/2011 8:43 PM, Kai Ju Liu wrote:
>>>>>
>>>>>  Over the past week or two, I've run into an issue where MapReduce jobs
>>>>>> hang or fail near completion. The percent completion of both map and
>>>>>> reduce tasks is often reported as 100%, but the actual number of
>>>>>> completed tasks is less than the total number. It appears that either
>>>>>> tasks backtrack and need to be restarted or the last few reduce tasks
>>>>>> hang interminably on the copy step.
>>>>>>
>>>>>> In certain cases, the jobs actually complete. In other cases, I can't
>>>>>> wait long enough and have to kill the job manually.
>>>>>>
>>>>>> My Hadoop cluster is hosted in EC2 on instances of type c1.xlarge with
>>>>>> 4
>>>>>> attached EBS volumes. The instances run Ubuntu 10.04.1 with the
>>>>>> 2.6.32-309-ec2 kernel, and I'm currently using Cloudera's CDH3u0
>>>>>> distribution. Has anyone experienced similar behavior in their
>>>>>> clusters,
>>>>>> and if so, had any luck resolving it? Thanks!
>>>>>>
>>>>>  Can you post your NN and DN log files here?
>>>>> Regards
>>>>>
>>>>>  Kai Ju
>>>>>>
>>>>>
>>>>> --
>>>>> Marcos Luís Ortíz Valmaseda
>>>>>  Software Engineer (UCI)
>>>>>  Linux User # 418229
>>>>>  http://marcosluis2186.posterous.com
>>>>>  http://twitter.com/marcosluis2186
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>

Re: MapReduce jobs hanging or failing near completion

Posted by Arun C Murthy <ac...@hortonworks.com>.
How long was your job stuck?

The JT should have re-run the map on a different node. Do you see 'fetch failures' messages in the JT logs?

The upcoming hadoop-0.20.204 release (now under discussion/vote) has better logging to help diagnose this in the JT logs.
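
For concreteness, a minimal sketch of that check against the JT log, using the same JOBTRACKER.log name and placeholders as the Aug 1 reply (the exact message text varies between releases):

$ grep -i "fetch" JOBTRACKER.log | grep _<clustertimestamp>_<jobid>_ | tail -20
$ grep "Assigning" JOBTRACKER.log | grep _<clustertimestamp>_<jobid>_m_<taskid>    # was the map handed to a second node?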

Arun

On Aug 3, 2011, at 10:30 AM, Kai Ju Liu wrote:

> Hi Arun. A funny thing happened this morning: one of my jobs got stuck with the "fetch failures" messages that you mentioned. There was one pending map task remaining and one failed map task that had that error, and the reducers were stuck at just under 33.3% completion.
> 
> Is there a solution or diagnosis for this situation? I don't know if it's related to the other issue I've been having, but it would be great to resolve this one for now. Thanks!
> 
> Kai Ju
> 
> On Tue, Aug 2, 2011 at 10:18 AM, Kai Ju Liu <ka...@tellapart.com> wrote:
> All of the reducers are complete, both on the job tracker page and the job details page. I used to get "fetch failure" messages when HDFS was mounted on EBS volumes, but I haven't seen any since I migrated to physical disks.
> 
> I'm currently using the fair scheduler, but it doesn't look like I've specified any allocations. Perhaps I'll dig into this further with the Cloudera team to see if there is indeed a problem with the job tracker or scheduler. Otherwise, I'll give 0.20.203 + capacity scheduler a shot.
> 
> Thanks again for the pointers.
> 
> Kai Ju
> 
> 
> On Mon, Aug 1, 2011 at 10:08 PM, Arun C Murthy <ac...@hortonworks.com> wrote:
> On Aug 1, 2011, at 9:47 PM, Kai Ju Liu wrote:
> 
>> Hi Arun. Since migrating HDFS off EBS-mounted volumes and onto ephemeral disks, the problem has actually persisted. Now, however, there is no evidence of errors on any of the mappers. The job tracker lists one less map completed than the map total, while the job details show all mappers as having completed. The jobs "hang" in this state as before.
> 
> Are any of your job's reducers completing? Do you see 'fetch failures' messages either in JT logs or reducers' (tasks) logs?
> 
> If not, it's clear that the JobTracker/Scheduler (which Scheduler are you using btw?) are 'losing' tasks, which is a serious bug. You say that you are running CDH - unfortunately I have no idea what patchsets you run with it. I can't, off the top of my head, remember the JT/CapacityScheduler losing a task - but I maintained Yahoo clusters which ran hadoop-0.20.203.
> 
> Here is something worth trying: 
> $ cat JOBTRACKER.log | grep Assigning | grep _<clustertimestamp>_<jobid>_m_*
> 
> The JOBTRACKER.log is the JT's log file on the JT host and if your jobid is job_12345342432_0001, then <clustertimestamp> == 12345342432 and <jobid> == 0001.
> 
> Good luck.
> 
> Arun
> 
>> 
>> Is there something in particular I should be looking for on my local disks? Hadoop fsck shows all clear, but I'll have to wait until morning to take individual nodes offline to check their disks. Any further details you might have would be very helpful. Thanks!
>> 
>> Kai Ju
>> 
>> On Tue, Jul 19, 2011 at 1:50 PM, Arun C Murthy <ac...@hortonworks.com> wrote:
>> Is this reproducible? If so, I'd urge you to check your local disks...
>> 
>> Arun
>> 
>> On Jul 19, 2011, at 12:41 PM, Kai Ju Liu wrote:
>> 
>>> Hi Marcos. The issue appears to be the following. A reduce task is unable to fetch results from a map task on HDFS. The map task is re-run, but the map task is now unable to retrieve information that it needs to run. Here is the error from the second map task:
>>> java.io.FileNotFoundException: /mnt/hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201107171642_0560/attempt_201107171642_0560_m_000292_1/output/spill0.out
>>> 	at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:176)
>>> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
>>> 	at org.apache.hadoop.mapred.Merger$Segment.init(Merger.java:205)
>>> 	at org.apache.hadoop.mapred.Merger$Segment.access$100(Merger.java:165)
>>> 	at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:418)
>>> 	at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381)
>>> 	at org.apache.hadoop.mapred.Merger.merge(Merger.java:77)
>>> 	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1547)
>>> 	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1179)
>>> 	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
>>> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
>>> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>> 	at javax.security.auth.Subject.doAs(Subject.java:396)
>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>>> 	at org.apache.hadoop.mapred.Child.main(Child.java:262)
>>> 
>>> I have been having general difficulties with HDFS on EBS, which pointed me in this direction. Does this sound like a possible hypothesis to you? Thanks!
>>> 
>>> Kai Ju
>>> 
>>> P.S. I am migrating off of HDFS on EBS, so I will post back with further results as soon as I have them.
>>> On Thu, Jul 7, 2011 at 6:36 PM, Marcos Ortiz <ml...@uci.cu> wrote:
>>> 
>>> 
>>> On 7/7/2011 8:43 PM, Kai Ju Liu wrote:
>>> 
>>> Over the past week or two, I've run into an issue where MapReduce jobs
>>> hang or fail near completion. The percent completion of both map and
>>> reduce tasks is often reported as 100%, but the actual number of
>>> completed tasks is less than the total number. It appears that either
>>> tasks backtrack and need to be restarted or the last few reduce tasks
>>> hang interminably on the copy step.
>>> 
>>> In certain cases, the jobs actually complete. In other cases, I can't
>>> wait long enough and have to kill the job manually.
>>> 
>>> My Hadoop cluster is hosted in EC2 on instances of type c1.xlarge with 4
>>> attached EBS volumes. The instances run Ubuntu 10.04.1 with the
>>> 2.6.32-309-ec2 kernel, and I'm currently using Cloudera's CDH3u0
>>> distribution. Has anyone experienced similar behavior in their clusters,
>>> and if so, had any luck resolving it? Thanks!
>>> 
>>> Can you post your NN and DN log files here?
>>> Regards
>>> 
>>> Kai Ju
>>> 
>>> -- 
>>> Marcos Luís Ortíz Valmaseda
>>>  Software Engineer (UCI)
>>>  Linux User # 418229
>>>  http://marcosluis2186.posterous.com
>>>  http://twitter.com/marcosluis2186
>>> 
>> 
>> 
> 
> 
> 


Re: MapReduce jobs hanging or failing near completion

Posted by Kai Ju Liu <ka...@tellapart.com>.
Hi Arun. A funny thing happened this morning: one of my jobs got stuck with
the "fetch failures" messages that you mentioned. There was one pending map
task remaining and one failed map task that had that error, and the reducers
were stuck at just under 33.3% completion.

Is there a solution or diagnosis for this situation? I don't know if it's
related to the other issue I've been having, but it would be great to
resolve this one for now. Thanks!
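
One observation that may help: in 0.20-era MapReduce the reduce progress bar is split into roughly equal copy, sort, and reduce phases, so reducers pinned just under 33.3% are almost certainly still in the shuffle/copy phase waiting on that one map output. A sketch for confirming it from a stuck reducer's task log (the userlogs path is an assumption; on CDH it may sit under a hadoop-0.20 log directory):

$ grep -iE "fetch|shuffle" <hadoop log dir>/userlogs/attempt_<clustertimestamp>_<jobid>_r_*/syslog | tail -20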

Kai Ju

On Tue, Aug 2, 2011 at 10:18 AM, Kai Ju Liu <ka...@tellapart.com> wrote:

> All of the reducers are complete, both on the job tracker page and the job
> details page. I used to get "fetch failure" messages when HDFS was mounted
> on EBS volumes, but I haven't seen any since I migrated to physical disks.
>
> I'm currently using the fair scheduler, but it doesn't look like I've
> specified any allocations. Perhaps I'll dig into this further with the
> Cloudera team to see if there is indeed a problem with the job tracker or
> scheduler. Otherwise, I'll give 0.20.203 + capacity scheduler a shot.
>
> Thanks again for the pointers.
>
> Kai Ju
>
>
> On Mon, Aug 1, 2011 at 10:08 PM, Arun C Murthy <ac...@hortonworks.com> wrote:
>
>> On Aug 1, 2011, at 9:47 PM, Kai Ju Liu wrote:
>>
>> Hi Arun. Since migrating HDFS off EBS-mounted volumes and onto ephemeral
>> disks, the problem has actually persisted. Now, however, there is no
>> evidence of errors on any of the mappers. The job tracker lists one less map
>> completed than the map total, while the job details show all mappers as
>> having completed. The jobs "hang" in this state as before.
>>
>>
>> Are any of your job's reducers completing? Do you see 'fetch failures'
>> messages either in JT logs or reducers' (tasks) logs?
>>
>> If not, it's clear that the JobTracker/Scheduler (which Scheduler are you
>> using btw?) are 'losing' tasks, which is a serious bug. You say that you are
>> running CDH - unfortunately I have no idea what patchsets you run with it. I
>> can't, off the top of my head, remember the JT/CapacityScheduler losing a
>> task - but I maintained Yahoo clusters which ran hadoop-0.20.203.
>>
>> Here is something worth trying:
>> $ cat JOBTRACKER.log | grep Assigning | grep
>> _<clustertimestamp>_<jobid>_m_*
>>
>> The JOBTRACKER.log is the JT's log file on the JT host and if your jobid
>> is job_12345342432_0001, then <clustertimestamp> == 12345342432 and
>> <jobid> == 0001.
>>
>> Good luck.
>>
>> Arun
>>
>>
>> Is there something in particular I should be looking for on my local
>> disks? Hadoop fsck shows all clear, but I'll have to wait until morning to
>> take individual nodes offline to check their disks. Any further details you
>> might have would be very helpful. Thanks!
>>
>> Kai Ju
>>
>> On Tue, Jul 19, 2011 at 1:50 PM, Arun C Murthy <ac...@hortonworks.com> wrote:
>>
>>> Is this reproducible? If so, I'd urge you to check your local disks...
>>>
>>> Arun
>>>
>>> On Jul 19, 2011, at 12:41 PM, Kai Ju Liu wrote:
>>>
>>> Hi Marcos. The issue appears to be the following. A reduce task is unable
>>> to fetch results from a map task on HDFS. The map task is re-run, but the
>>> map task is now unable to retrieve information that it needs to run. Here is
>>> the error from the second map task:
>>>
>>> java.io.FileNotFoundException: /mnt/hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201107171642_0560/attempt_201107171642_0560_m_000292_1/output/spill0.out
>>> 	at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:176)
>>> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
>>> 	at org.apache.hadoop.mapred.Merger$Segment.init(Merger.java:205)
>>> 	at org.apache.hadoop.mapred.Merger$Segment.access$100(Merger.java:165)
>>> 	at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:418)
>>> 	at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381)
>>> 	at org.apache.hadoop.mapred.Merger.merge(Merger.java:77)
>>> 	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1547)
>>> 	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1179)
>>> 	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
>>> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
>>> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>>> 	at java.security.AccessController.doPrivileged(Native Method)
>>> 	at javax.security.auth.Subject.doAs(Subject.java:396)
>>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>>> 	at org.apache.hadoop.mapred.Child.main(Child.java:262)
>>>
>>> I have been having general difficulties with HDFS on EBS, which pointed me in this direction. Does this sound like a possible hypothesis to you? Thanks!
>>>
>>> Kai Ju
>>>
>>> P.S. I am migrating off of HDFS on EBS, so I will post back with further results as soon as I have them.
>>>
>>> On Thu, Jul 7, 2011 at 6:36 PM, Marcos Ortiz <ml...@uci.cu> wrote:
>>>
>>>>
>>>>
>>>> On 7/7/2011 8:43 PM, Kai Ju Liu wrote:
>>>>
>>>>  Over the past week or two, I've run into an issue where MapReduce jobs
>>>>> hang or fail near completion. The percent completion of both map and
>>>>> reduce tasks is often reported as 100%, but the actual number of
>>>>> completed tasks is less than the total number. It appears that either
>>>>> tasks backtrack and need to be restarted or the last few reduce tasks
>>>>> hang interminably on the copy step.
>>>>>
>>>>> In certain cases, the jobs actually complete. In other cases, I can't
>>>>> wait long enough and have to kill the job manually.
>>>>>
>>>>> My Hadoop cluster is hosted in EC2 on instances of type c1.xlarge with
>>>>> 4
>>>>> attached EBS volumes. The instances run Ubuntu 10.04.1 with the
>>>>> 2.6.32-309-ec2 kernel, and I'm currently using Cloudera's CDH3u0
>>>>> distribution. Has anyone experienced similar behavior in their
>>>>> clusters,
>>>>> and if so, had any luck resolving it? Thanks!
>>>>>
>>>>  Can you post your NN and DN log files here?
>>>> Regards
>>>>
>>>>  Kai Ju
>>>>>
>>>>
>>>> --
>>>> Marcos Luís Ortíz Valmaseda
>>>>  Software Engineer (UCI)
>>>>  Linux User # 418229
>>>>  http://marcosluis2186.posterous.com
>>>>  http://twitter.com/marcosluis2186
>>>>
>>>>
>>>
>>
>>
>

Re: MapReduce jobs hanging or failing near completion

Posted by Kai Ju Liu <ka...@tellapart.com>.
All of the reducers are complete, both on the job tracker page and the job
details page. I used to get "fetch failure" messages when HDFS was mounted
on EBS volumes, but I haven't seen any since I migrated to physical disks.

I'm currently using the fair scheduler, but it doesn't look like I've
specified any allocations. Perhaps I'll dig into this further with the
Cloudera team to see if there is indeed a problem with the job tracker or
scheduler. Otherwise, I'll give 0.20.203 + capacity scheduler a shot.
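
In case it is useful while digging in, a small sketch for double-checking which scheduler the JT is actually running and whether a fair-scheduler allocation file is wired up (the /etc/hadoop/conf path is an assumption for a CDH-style install):

$ grep -A1 "mapred.jobtracker.taskScheduler" /etc/hadoop/conf/mapred-site.xml
$ grep -A1 "mapred.fairscheduler.allocation.file" /etc/hadoop/conf/mapred-site.xml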

Thanks again for the pointers.

Kai Ju

On Mon, Aug 1, 2011 at 10:08 PM, Arun C Murthy <ac...@hortonworks.com> wrote:

> On Aug 1, 2011, at 9:47 PM, Kai Ju Liu wrote:
>
> Hi Arun. Since migrating HDFS off EBS-mounted volumes and onto ephemeral
> disks, the problem has actually persisted. Now, however, there is no
> evidence of errors on any of the mappers. The job tracker lists one less map
> completed than the map total, while the job details show all mappers as
> having completed. The jobs "hang" in this state as before.
>
>
> Are any of your job's reducers completing? Do you see 'fetch failures'
> messages either in JT logs or reducers' (tasks) logs?
>
> If not, it's clear that the JobTracker/Scheduler (which Scheduler are you
> using btw?) are 'losing' tasks, which is a serious bug. You say that you are
> running CDH - unfortunately I have no idea what patchsets you run with it. I
> can't, off the top of my head, remember the JT/CapacityScheduler losing a
> task - but I maintained Yahoo clusters which ran hadoop-0.20.203.
>
> Here is something worth trying:
> $ cat JOBTRACKER.log | grep Assigning | grep
> _<clustertimestamp>_<jobid>_m_*
>
> The JOBTRACKER.log is the JT's log file on the JT host and if your jobid is
> job_12345342432_0001, then <clustertimestamp> == 12345342432 and <jobid>
> == 0001.
>
> Good luck.
>
> Arun
>
>
> Is there something in particular I should be looking for on my local disks?
> Hadoop fsck shows all clear, but I'll have to wait until morning to take
> individual nodes offline to check their disks. Any further details you might
> have would be very helpful. Thanks!
>
> Kai Ju
>
> On Tue, Jul 19, 2011 at 1:50 PM, Arun C Murthy <ac...@hortonworks.com> wrote:
>
>> Is this reproducible? If so, I'd urge you to check your local disks...
>>
>> Arun
>>
>> On Jul 19, 2011, at 12:41 PM, Kai Ju Liu wrote:
>>
>> Hi Marcos. The issue appears to be the following. A reduce task is unable
>> to fetch results from a map task on HDFS. The map task is re-run, but the
>> map task is now unable to retrieve information that it needs to run. Here is
>> the error from the second map task:
>>
>> java.io.FileNotFoundException: /mnt/hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201107171642_0560/attempt_201107171642_0560_m_000292_1/output/spill0.out
>> 	at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:176)
>> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
>> 	at org.apache.hadoop.mapred.Merger$Segment.init(Merger.java:205)
>> 	at org.apache.hadoop.mapred.Merger$Segment.access$100(Merger.java:165)
>> 	at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:418)
>> 	at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381)
>> 	at org.apache.hadoop.mapred.Merger.merge(Merger.java:77)
>> 	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1547)
>> 	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1179)
>> 	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
>> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
>> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>> 	at java.security.AccessController.doPrivileged(Native Method)
>> 	at javax.security.auth.Subject.doAs(Subject.java:396)
>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>> 	at org.apache.hadoop.mapred.Child.main(Child.java:262)
>>
>> I have been having general difficulties with HDFS on EBS, which pointed me in this direction. Does this sound like a possible hypothesis to you? Thanks!
>>
>>
>> Kai Ju
>>
>> P.S. I am migrating off of HDFS on EBS, so I will post back with further results as soon as I have them.
>>
>> On Thu, Jul 7, 2011 at 6:36 PM, Marcos Ortiz <ml...@uci.cu> wrote:
>>
>>>
>>>
>>> On 7/7/2011 8:43 PM, Kai Ju Liu wrote:
>>>
>>>  Over the past week or two, I've run into an issue where MapReduce jobs
>>>> hang or fail near completion. The percent completion of both map and
>>>> reduce tasks is often reported as 100%, but the actual number of
>>>> completed tasks is less than the total number. It appears that either
>>>> tasks backtrack and need to be restarted or the last few reduce tasks
>>>> hang interminably on the copy step.
>>>>
>>>> In certain cases, the jobs actually complete. In other cases, I can't
>>>> wait long enough and have to kill the job manually.
>>>>
>>>> My Hadoop cluster is hosted in EC2 on instances of type c1.xlarge with 4
>>>> attached EBS volumes. The instances run Ubuntu 10.04.1 with the
>>>> 2.6.32-309-ec2 kernel, and I'm currently using Cloudera's CDH3u0
>>>> distribution. Has anyone experienced similar behavior in their clusters,
>>>> and if so, had any luck resolving it? Thanks!
>>>>
>>>>  Can you post your NN and DN log files here?
>>> Regards
>>>
>>>  Kai Ju
>>>>
>>>
>>> --
>>> Marcos Luís Ortíz Valmaseda
>>>  Software Engineer (UCI)
>>>  Linux User # 418229
>>>  http://marcosluis2186.posterous.com
>>>  http://twitter.com/marcosluis2186
>>>
>>>
>>
>
>

Re: MapReduce jobs hanging or failing near completion

Posted by Arun C Murthy <ac...@hortonworks.com>.
On Aug 1, 2011, at 9:47 PM, Kai Ju Liu wrote:

> Hi Arun. Since migrating HDFS off EBS-mounted volumes and onto ephemeral disks, the problem has actually persisted. Now, however, there is no evidence of errors on any of the mappers. The job tracker lists one less map completed than the map total, while the job details show all mappers as having completed. The jobs "hang" in this state as before.

Are any of your job's reducers completing? Do you see 'fetch failures' messages either in JT logs or reducers' (tasks) logs?

If not, it's clear that the JobTracker/Scheduler (which Scheduler are you using btw?) are 'losing' tasks, which is a serious bug. You say that you are running CDH - unfortunately I have no idea what patchsets you run with it. I can't, off the top of my head, remember the JT/CapacityScheduler losing a task - but I maintained Yahoo clusters which ran hadoop-0.20.203.

Here is something worth trying: 
$ cat JOBTRACKER.log | grep Assigning | grep _<clustertimestamp>_<jobid>_m_*

The JOBTRACKER.log is the JT's log file on the JT host and if your jobid is job_12345342432_0001, then <clustertimestamp> == 12345342432 and <jobid> == 0001.
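
Spelled out with the example job id above, plus one extra step to count how many times each map attempt was assigned (the grep -o extraction pattern assumes attempt ids appear verbatim in the Assigning lines, which may differ by version):

$ grep "Assigning" JOBTRACKER.log \
    | grep -o "attempt_12345342432_0001_m_[0-9_]*" \
    | sort | uniq -c | sort -rn | head

A map task that never shows up here was never handed to a TaskTracker at all, which would point at the JT/scheduler rather than the local disks.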

Good luck.

Arun

> 
> Is there something in particular I should be looking for on my local disks? Hadoop fsck shows all clear, but I'll have to wait until morning to take individual nodes offline to check their disks. Any further details you might have would be very helpful. Thanks!
> 
> Kai Ju
> 
> On Tue, Jul 19, 2011 at 1:50 PM, Arun C Murthy <ac...@hortonworks.com> wrote:
> Is this reproducible? If so, I'd urge you to check your local disks...
> 
> Arun
> 
> On Jul 19, 2011, at 12:41 PM, Kai Ju Liu wrote:
> 
>> Hi Marcos. The issue appears to be the following. A reduce task is unable to fetch results from a map task on HDFS. The map task is re-run, but the map task is now unable to retrieve information that it needs to run. Here is the error from the second map task:
>> java.io.FileNotFoundException: /mnt/hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201107171642_0560/attempt_201107171642_0560_m_000292_1/output/spill0.out
>> 	at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:176)
>> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
>> 	at org.apache.hadoop.mapred.Merger$Segment.init(Merger.java:205)
>> 	at org.apache.hadoop.mapred.Merger$Segment.access$100(Merger.java:165)
>> 	at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:418)
>> 	at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381)
>> 	at org.apache.hadoop.mapred.Merger.merge(Merger.java:77)
>> 	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1547)
>> 	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1179)
>> 	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
>> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
>> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>> 	at java.security.AccessController.doPrivileged(Native Method)
>> 	at javax.security.auth.Subject.doAs(Subject.java:396)
>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>> 	at org.apache.hadoop.mapred.Child.main(Child.java:262)
>> 
>> I have been having general difficulties with HDFS on EBS, which pointed me in this direction. Does this sound like a possible hypothesis to you? Thanks!
>> 
>> Kai Ju
>> 
>> P.S. I am migrating off of HDFS on EBS, so I will post back with further results as soon as I have them.
>> On Thu, Jul 7, 2011 at 6:36 PM, Marcos Ortiz <ml...@uci.cu> wrote:
>> 
>> 
>> On 7/7/2011 8:43 PM, Kai Ju Liu wrote:
>> 
>> Over the past week or two, I've run into an issue where MapReduce jobs
>> hang or fail near completion. The percent completion of both map and
>> reduce tasks is often reported as 100%, but the actual number of
>> completed tasks is less than the total number. It appears that either
>> tasks backtrack and need to be restarted or the last few reduce tasks
>> hang interminably on the copy step.
>> 
>> In certain cases, the jobs actually complete. In other cases, I can't
>> wait long enough and have to kill the job manually.
>> 
>> My Hadoop cluster is hosted in EC2 on instances of type c1.xlarge with 4
>> attached EBS volumes. The instances run Ubuntu 10.04.1 with the
>> 2.6.32-309-ec2 kernel, and I'm currently using Cloudera's CDH3u0
>> distribution. Has anyone experienced similar behavior in their clusters,
>> and if so, had any luck resolving it? Thanks!
>> 
>> Can you post your NN and DN log files here?
>> Regards
>> 
>> Kai Ju
>> 
>> -- 
>> Marcos Luís Ortíz Valmaseda
>>  Software Engineer (UCI)
>>  Linux User # 418229
>>  http://marcosluis2186.posterous.com
>>  http://twitter.com/marcosluis2186
>> 
> 
>