Posted to common-user@hadoop.apache.org by Bryan Duxbury <br...@rapleaf.com> on 2009/02/21 00:58:46 UTC

Super-long reduce task timeouts in hadoop-0.19.0

(Repost from the dev list)

I noticed some really odd behavior today while reviewing the job  
history of some of our jobs. Our Ganglia graphs showed really long  
periods of inactivity across the entire cluster, which should  
definitely not be the case - we have a really long string of jobs in  
our workflow that should execute one after another. I figured out  
which jobs were running during those periods of inactivity, and  
discovered that almost all of them had 4-5 failed reduce tasks, with  
the reason for failure being something like:

Task attempt_200902061117_3382_r_000038_0 failed to report status for  
1282 seconds. Killing!

The actual timeout reported varies from 700-5000 seconds. Virtually  
all of our longer-running jobs were affected by this problem. The  
period of inactivity on the cluster seems to correspond to the amount  
of time the job waited for these reduce tasks to fail.

I checked out the tasktracker log for the machines with timed-out  
reduce tasks looking for something that might explain the problem,  
but the only thing I came up with that actually referenced the failed  
task was this log message, which was repeated many times:

2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:  
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find  
taskTracker/jobcache/job_200902061117_3388/attempt_200902061117_3388_r_000066_0/output/file.out
in any of the configured local directories

I'm not sure what this means; can anyone shed some light on this  
message?

Further confusing the issue, on the affected machines, I looked in  
logs/userlogs/<task id>, and to my surprise, the directory and log  
files existed, and the syslog file seemed to contain logs of a  
perfectly good reduce task!

Overall, this seems like a pretty critical bug. It's consuming up to  
50% of the runtime of our jobs in some instances, killing our  
throughput. At the very least, it seems like the reduce task timeout  
period should be MUCH shorter than the current 10-20 minutes.

-Bryan
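
The kill threshold Bryan is describing is controlled by the mapred.task.timeout property, which defaults to 600,000 ms (10 minutes) in this era of Hadoop. As a minimal sketch, assuming the 0.19-era JobConf API (the helper class and the five-minute value below are illustrative, not something from the thread), the timeout can be tuned per job rather than cluster-wide:

    import org.apache.hadoop.mapred.JobConf;

    // Illustrative helper: set the task liveness timeout for a single job.
    // A task that neither reports progress nor emits output for this many
    // milliseconds is declared failed and killed.
    public class TaskTimeout {
      public static void apply(JobConf conf, long millis) {
        conf.setLong("mapred.task.timeout", millis);
      }
    }

    // Usage (hypothetical job setup):
    //   JobConf conf = new JobConf(MyJob.class);
    //   TaskTimeout.apply(conf, 5 * 60 * 1000L);  // five minutes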

Re: Super-long reduce task timeouts in hadoop-0.19.0

Posted by Rasit OZDAS <ra...@gmail.com>.
I agree about the timeout period, Bryan.
Reporter has a progress() method to tell the TaskTracker that the task is
still working, so there should be no need to kill it.
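
For reference, here is a minimal sketch of what Rasit describes, written against the old org.apache.hadoop.mapred API used in the 0.19 line; the class name and the 10,000-record interval are illustrative choices, not anything from the thread:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Sketch: call Reporter.progress() inside a long-running reduce() so the
    // TaskTracker keeps seeing a heartbeat even while no output is emitted.
    public class SummingReducer extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {

      public void reduce(Text key, Iterator<LongWritable> values,
                         OutputCollector<Text, LongWritable> output,
                         Reporter reporter) throws IOException {
        long sum = 0;
        long seen = 0;
        while (values.hasNext()) {
          sum += values.next().get();
          // Illustrative interval: ping the framework every 10,000 records so
          // a very large record-group does not trip the task timeout.
          if (++seen % 10000 == 0) {
            reporter.progress();
          }
        }
        output.collect(key, new LongWritable(sum));
      }
    }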


2009/2/21 Bryan Duxbury <br...@rapleaf.com>

> We didn't customize this value, to my knowledge, so I'd suspect it's the
> default.
> -Bryan
>
>
> On Feb 20, 2009, at 5:00 PM, Ted Dunning wrote:
>
>  How often do your reduce tasks report status?
>>
>> On Fri, Feb 20, 2009 at 3:58 PM, Bryan Duxbury <br...@rapleaf.com> wrote:
>>
>>  (Repost from the dev list)
>>>
>>>
>>> I noticed some really odd behavior today while reviewing the job history
>>> of
>>> some of our jobs. Our Ganglia graphs showed really long periods of
>>> inactivity across the entire cluster, which should definitely not be the
>>> case - we have a really long string of jobs in our workflow that should
>>> execute one after another. I figured out which jobs were running during
>>> those periods of inactivity, and discovered that almost all of them had
>>> 4-5
>>> failed reduce tasks, with the reason for failure being something like:
>>>
>>> Task attempt_200902061117_3382_r_000038_0 failed to report status for
>>> 1282
>>> seconds. Killing!
>>>
>>> The actual timeout reported varies from 700-5000 seconds. Virtually all
>>> of
>>> our longer-running jobs were affected by this problem. The period of
>>> inactivity on the cluster seems to correspond to the amount of time the
>>> job
>>> waited for these reduce tasks to fail.
>>>
>>> I checked out the tasktracker log for the machines with timed-out reduce
>>> tasks looking for something that might explain the problem, but the only
>>> thing I came up with that actually referenced the failed task was this
>>> log
>>> message, which was repeated many times:
>>>
>>> 2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>
>>> taskTracker/jobcache/job_200902061117_3388/attempt_200902061117_3388_r_000066_0/output/file.out
>>> in any of the configured local directories
>>>
>>> I'm not sure what this means; can anyone shed some light on this message?
>>>
>>> Further confusing the issue, on the affected machines, I looked in
>>> logs/userlogs/<task id>, and to my surprise, the directory and log files
>>> existed, and the syslog file seemed to contain logs of a perfectly good
>>> reduce task!
>>>
>>> Overall, this seems like a pretty critical bug. It's consuming up to 50%
>>> of
>>> the runtime of our jobs in some instances, killing our throughput. At the
>>> very least, it seems like the reduce task timeout period should be MUCH
>>> shorter than the current 10-20 minutes.
>>>
>>> -Bryan
>>>
>>>
>>
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
>>
>> 111 West Evelyn Ave. Ste. 202
>> Sunnyvale, CA 94086
>> www.deepdyve.com
>> 408-773-0110 ext. 738
>> 858-414-0013 (m)
>> 408-773-0220 (fax)
>>
>
>


-- 
M. Raşit ÖZDAŞ

Re: Super-long reduce task timeouts in hadoop-0.19.0

Posted by Bryan Duxbury <br...@rapleaf.com>.
We didn't customize this value, to my knowledge, so I'd suspect it's  
the default.
-Bryan

On Feb 20, 2009, at 5:00 PM, Ted Dunning wrote:

> How often do your reduce tasks report status?
>
> On Fri, Feb 20, 2009 at 3:58 PM, Bryan Duxbury <br...@rapleaf.com>  
> wrote:
>
>> (Repost from the dev list)
>>
>>
>> I noticed some really odd behavior today while reviewing the job  
>> history of
>> some of our jobs. Our Ganglia graphs showed really long periods of
>> inactivity across the entire cluster, which should definitely not  
>> be the
>> case - we have a really long string of jobs in our workflow that  
>> should
>> execute one after another. I figured out which jobs were running  
>> during
>> those periods of inactivity, and discovered that almost all of  
>> them had 4-5
>> failed reduce tasks, with the reason for failure being something  
>> like:
>>
>> Task attempt_200902061117_3382_r_000038_0 failed to report status  
>> for 1282
>> seconds. Killing!
>>
>> The actual timeout reported varies from 700-5000 seconds.  
>> Virtually all of
>> our longer-running jobs were affected by this problem. The period of
>> inactivity on the cluster seems to correspond to the amount of  
>> time the job
>> waited for these reduce tasks to fail.
>>
>> I checked out the tasktracker log for the machines with timed-out  
>> reduce
>> tasks looking for something that might explain the problem, but  
>> the only
>> thing I came up with that actually referenced the failed task was  
>> this log
>> message, which was repeated many times:
>>
>> 2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>> taskTracker/jobcache/job_200902061117_3388/ 
>> attempt_200902061117_3388_r_000066_0/output/file.out
>> in any of the configured local directories
>>
>> I'm not sure what this means; can anyone shed some light on this  
>> message?
>>
>> Further confusing the issue, on the affected machines, I looked in
>> logs/userlogs/<task id>, and to my surprise, the directory and log  
>> files
>> existed, and the syslog file seemed to contain logs of a perfectly  
>> good
>> reduce task!
>>
>> Overall, this seems like a pretty critical bug. It's consuming up  
>> to 50% of
>> the runtime of our jobs in some instances, killing our throughput.  
>> At the
>> very least, it seems like the reduce task timeout period should be  
>> MUCH
>> shorter than the current 10-20 minutes.
>>
>> -Bryan
>>
>
>
>
> -- 
> Ted Dunning, CTO
> DeepDyve
>
> 111 West Evelyn Ave. Ste. 202
> Sunnyvale, CA 94086
> www.deepdyve.com
> 408-773-0110 ext. 738
> 858-414-0013 (m)
> 408-773-0220 (fax)


Re: Super-long reduce task timeouts in hadoop-0.19.0

Posted by Ted Dunning <te...@gmail.com>.
How often do your reduce tasks report status?

On Fri, Feb 20, 2009 at 3:58 PM, Bryan Duxbury <br...@rapleaf.com> wrote:

> (Repost from the dev list)
>
>
> I noticed some really odd behavior today while reviewing the job history of
> some of our jobs. Our Ganglia graphs showed really long periods of
> inactivity across the entire cluster, which should definitely not be the
> case - we have a really long string of jobs in our workflow that should
> execute one after another. I figured out which jobs were running during
> those periods of inactivity, and discovered that almost all of them had 4-5
> failed reduce tasks, with the reason for failure being something like:
>
> Task attempt_200902061117_3382_r_000038_0 failed to report status for 1282
> seconds. Killing!
>
> The actual timeout reported varies from 700-5000 seconds. Virtually all of
> our longer-running jobs were affected by this problem. The period of
> inactivity on the cluster seems to correspond to the amount of time the job
> waited for these reduce tasks to fail.
>
> I checked out the tasktracker log for the machines with timed-out reduce
> tasks looking for something that might explain the problem, but the only
> thing I came up with that actually referenced the failed task was this log
> message, which was repeated many times:
>
> 2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200902061117_3388/attempt_200902061117_3388_r_000066_0/output/file.out
> in any of the configured local directories
>
> I'm not sure what this means; can anyone shed some light on this message?
>
> Further confusing the issue, on the affected machines, I looked in
> logs/userlogs/<task id>, and to my surprise, the directory and log files
> existed, and the syslog file seemed to contain logs of a perfectly good
> reduce task!
>
> Overall, this seems like a pretty critical bug. It's consuming up to 50% of
> the runtime of our jobs in some instances, killing our throughput. At the
> very least, it seems like the reduce task timeout period should be MUCH
> shorter than the current 10-20 minutes.
>
> -Bryan
>



-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
408-773-0110 ext. 738
858-414-0013 (m)
408-773-0220 (fax)

Re: Super-long reduce task timeouts in hadoop-0.19.0

Posted by John Lee <j....@gmail.com>.
Sorry, I should add the TaskTracker log messages I'm seeing relating
to such a hung task:

2009-03-26 08:16:32,152 INFO org.apache.hadoop.mapred.TaskTracker:
LaunchTaskAction (registerTask): attempt_200903111131_0766_r_000007_0
2009-03-26 08:16:32,986 INFO org.apache.hadoop.mapred.TaskTracker:
Trying to launch : attempt_200903111131_0766_r_000007_0
2009-03-26 08:51:59,146 INFO org.apache.hadoop.mapred.TaskTracker: In
TaskLauncher, current free slots : 1 and trying to launch
attempt_200903111131_0766_r_000007_0
2009-03-26 08:51:59,361 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_200903111131_0766_r_000007_0: Task
attempt_200903111131_0766_r_000007_0 failed to report status for 2127
seconds. Killing!
2009-03-26 08:51:59,374 INFO org.apache.hadoop.mapred.TaskTracker:
About to purge task: attempt_200903111131_0766_r_000007_0
2009-03-26 08:51:59,375 INFO org.apache.hadoop.mapred.TaskRunner:
attempt_200903111131_0766_r_000007_0 done; removing files.
2009-03-26 08:51:59,464 WARN org.apache.hadoop.mapred.TaskTracker:
Unknown child task finshed: attempt_200903111131_0766_r_000007_0.
Ignored.

And the relevant JobTracker logs.

2009-03-26 08:16:32,150 INFO org.apache.hadoop.mapred.JobTracker:
Adding task 'attempt_200903111131_0766_r_000007_0' to tip
task_200903111131_0766_r_000007, for tracker
'tracker_dev-01.33across.com:localhost/127.0.0.1:58102'
2009-03-26 08:52:04,365 INFO org.apache.hadoop.mapred.TaskInProgress:
Error from attempt_200903111131_0766_r_000007_0: Task
attempt_200903111131_0766_r_000007_0 failed to report status for 2127
seconds. Killing!
2009-03-26 08:52:04,367 INFO org.apache.hadoop.mapred.JobTracker:
Removed completed task 'attempt_200903111131_0766_r_000007_0' from
'tracker_dev-01.33across.com:localhost/127.0.0.1:58102'


On Thu, Mar 26, 2009 at 11:51 AM, John Lee <j....@gmail.com> wrote:
> I have this same issue re: lots of failed reduce tasks.
>
> From the WebUI, it looks like the jobs are failing in the shuffle
> phase. The shuffle phase for the failed attempts took about a third
> the time of the successful attempts.
>
> I have also noted that in 0.19.0, my reduces often get started but
> then remained in the "unassigned" state for a long time before timing
> out. No evidence of these tasks in the local taskTracker dir.
>
> The latter problem sounds like HADOOP-5407, but is the former problem
> (reduces timing out) just a secondary symptom of HADOOP-5407? My
> TaskTrackers aren't hanging, though (other reduce tasks in the same
> job run to completion).
>
> - John
>
> On Sun, Feb 22, 2009 at 3:04 AM, Devaraj Das <dd...@yahoo-inc.com> wrote:
>> Bryan, the message
>>
>> 2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>> taskTracker/jobcache/job_200902061117_3388/
>> attempt_200902061117_3388_r_000066_0/output/file.out in any of the
>> configured local directories
>>
>> is spurious. That had been reported in https://issues.apache.org/jira/browse/HADOOP-4963 and the fix is there in the trunk. I guess I should commit that fix to the 0.20 and 0.19 branches too. Meanwhile, please apply the patch on your repository if you can.
>> Regarding the tasks timing out, do you know whether the reduce tasks were in the shuffle phase or the reducer phase? That you can deduce by looking at the task web UI for the failed tasks, or, the task logs.
>> Also, from your reduce method, do you ensure that progress reports are sent every so often? By default, progress reports are sent for every record-group that the reducer method is invoked with, and, for every record that the reducer emits. If the timeout is not happening in the shuffle, then the problematic part is the reduce method where the timeout could be happening because a lot of time is spent in the processing of a particular record-group, or, the write of the output record to the hdfs is taking a long time.
>>
>>
>> On 2/21/09 5:28 AM, "Bryan Duxbury" <br...@rapleaf.com> wrote:
>>
>> (Repost from the dev list)
>>
>> I noticed some really odd behavior today while reviewing the job
>> history of some of our jobs. Our Ganglia graphs showed really long
>> periods of inactivity across the entire cluster, which should
>> definitely not be the case - we have a really long string of jobs in
>> our workflow that should execute one after another. I figured out
>> which jobs were running during those periods of inactivity, and
>> discovered that almost all of them had 4-5 failed reduce tasks, with
>> the reason for failure being something like:
>>
>> Task attempt_200902061117_3382_r_000038_0 failed to report status for
>> 1282 seconds. Killing!
>>
>> The actual timeout reported varies from 700-5000 seconds. Virtually
>> all of our longer-running jobs were affected by this problem. The
>> period of inactivity on the cluster seems to correspond to the amount
>> of time the job waited for these reduce tasks to fail.
>>
>> I checked out the tasktracker log for the machines with timed-out
>> reduce tasks looking for something that might explain the problem,
>> but the only thing I came up with that actually referenced the failed
>> task was this log message, which was repeated many times:
>>
>> 2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>> taskTracker/jobcache/job_200902061117_3388/
>> attempt_200902061117_3388_r_000066_0/output/file.out in any of the
>> configured local directories
>>
>> I'm not sure what this means; can anyone shed some light on this
>> message?
>>
>> Further confusing the issue, on the affected machines, I looked in
>> logs/userlogs/<task id>, and to my surprise, the directory and log
>> files existed, and the syslog file seemed to contain logs of a
>> perfectly good reduce task!
>>
>> Overall, this seems like a pretty critical bug. It's consuming up to
>> 50% of the runtime of our jobs in some instances, killing our
>> throughput. At the very least, it seems like the reduce task timeout
>> period should be MUCH shorter than the current 10-20 minutes.
>>
>> -Bryan
>>
>>
>

Re: Super-long reduce task timeouts in hadoop-0.19.0

Posted by schubert zhang <zs...@gmail.com>.
Can you try using branch-0.19? I am using it and it seems fine.
0.19.1 also has the following issues:

https://issues.apache.org/jira/browse/HADOOP-5269

https://issues.apache.org/jira/browse/HADOOP-5235

https://issues.apache.org/jira/browse/HADOOP-5367

On Fri, Mar 27, 2009 at 12:09 AM, John Lee <j....@gmail.com> wrote:

> Woops. Sorry about the extra noise.
>

Re: Super-long reduce task timeouts in hadoop-0.19.0

Posted by John Lee <j....@gmail.com>.
Woops. Sorry about the extra noise.

Re: Super-long reduce task timeouts in hadoop-0.19.0

Posted by John Lee <j....@gmail.com>.
I have this same issue re: lots of failed reduce tasks.

From the WebUI, it looks like the jobs are failing in the shuffle
phase. The shuffle phase for the failed attempts took about a third of
the time of the successful attempts.

I have also noted that in 0.19.0, my reduces often get started but
then remain in the "unassigned" state for a long time before timing
out. No evidence of these tasks in the local taskTracker dir.

The latter problem sounds like HADOOP-5407, but is the former problem
(reduces timing out) just a secondary symptom of HADOOP-5407? My
TaskTrackers aren't hanging, though (other reduce tasks in the same
job run to completion).

- John

On Sun, Feb 22, 2009 at 3:04 AM, Devaraj Das <dd...@yahoo-inc.com> wrote:
> Bryan, the message
>
> 2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200902061117_3388/
> attempt_200902061117_3388_r_000066_0/output/file.out in any of the
> configured local directories
>
> is spurious. That had been reported in https://issues.apache.org/jira/browse/HADOOP-4963 and the fix is there in the trunk. I guess I should commit that fix to the 0.20 and 0.19 branches too. Meanwhile, please apply the patch on your repository if you can.
> Regarding the tasks timing out, do you know whether the reduce tasks were in the shuffle phase or the reducer phase? That you can deduce by looking at the task web UI for the failed tasks, or, the task logs.
> Also, from your reduce method, do you ensure that progress reports are sent every so often? By default, progress reports are sent for every record-group that the reducer method is invoked with, and, for every record that the reducer emits. If the timeout is not happening in the shuffle, then the problematic part is the reduce method where the timeout could be happening because a lot of time is spent in the processing of a particular record-group, or, the write of the output record to the hdfs is taking a long time.
>
>
> On 2/21/09 5:28 AM, "Bryan Duxbury" <br...@rapleaf.com> wrote:
>
> (Repost from the dev list)
>
> I noticed some really odd behavior today while reviewing the job
> history of some of our jobs. Our Ganglia graphs showed really long
> periods of inactivity across the entire cluster, which should
> definitely not be the case - we have a really long string of jobs in
> our workflow that should execute one after another. I figured out
> which jobs were running during those periods of inactivity, and
> discovered that almost all of them had 4-5 failed reduce tasks, with
> the reason for failure being something like:
>
> Task attempt_200902061117_3382_r_000038_0 failed to report status for
> 1282 seconds. Killing!
>
> The actual timeout reported varies from 700-5000 seconds. Virtually
> all of our longer-running jobs were affected by this problem. The
> period of inactivity on the cluster seems to correspond to the amount
> of time the job waited for these reduce tasks to fail.
>
> I checked out the tasktracker log for the machines with timed-out
> reduce tasks looking for something that might explain the problem,
> but the only thing I came up with that actually referenced the failed
> task was this log message, which was repeated many times:
>
> 2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200902061117_3388/
> attempt_200902061117_3388_r_000066_0/output/file.out in any of the
> configured local directories
>
> I'm not sure what this means; can anyone shed some light on this
> message?
>
> Further confusing the issue, on the affected machines, I looked in
> logs/userlogs/<task id>, and to my surprise, the directory and log
> files existed, and the syslog file seemed to contain logs of a
> perfectly good reduce task!
>
> Overall, this seems like a pretty critical bug. It's consuming up to
> 50% of the runtime of our jobs in some instances, killing our
> throughput. At the very least, it seems like the reduce task timeout
> period should be MUCH shorter than the current 10-20 minutes.
>
> -Bryan
>
>

Re: Super-long reduce task timeouts in hadoop-0.19.0

Posted by Devaraj Das <dd...@yahoo-inc.com>.
Bryan, the message

2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
taskTracker/jobcache/job_200902061117_3388/attempt_200902061117_3388_r_000066_0/output/file.out
in any of the configured local directories

is spurious. That had been reported in https://issues.apache.org/jira/browse/HADOOP-4963 and the fix is there in the trunk. I guess I should commit that fix to the 0.20 and 0.19 branches too. Meanwhile, please apply the patch on your repository if you can.
Regarding the tasks timing out, do you know whether the reduce tasks were in the shuffle phase or the reduce phase? You can deduce that from the task web UI for the failed tasks, or from the task logs.
Also, does your reduce method ensure that progress reports are sent every so often? By default, a progress report is sent for every record-group the reduce method is invoked with and for every record the reducer emits. If the timeout is not happening in the shuffle, the problematic part is the reduce method itself: either a lot of time is being spent processing a particular record-group, or the write of the output records to HDFS is taking a long time.
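
One common workaround for the situation Devaraj describes (a single record-group, or a slow HDFS write, starving the progress reports) is to keep a heartbeat going from a side thread while the expensive work runs. The helper below is only a sketch of that pattern under that assumption; it is not a built-in Hadoop facility, and the class name and interval are illustrative:

    import org.apache.hadoop.mapred.Reporter;

    // Illustrative helper: pings Reporter.progress() from a daemon thread so a
    // long stretch of per-group work (or a slow output write) does not hit the
    // task timeout. Stop it in a finally block once the work is done.
    public class ProgressHeartbeat {
      private final Thread pinger;
      private volatile boolean running = true;

      public ProgressHeartbeat(final Reporter reporter, final long intervalMs) {
        pinger = new Thread(new Runnable() {
          public void run() {
            while (running) {
              reporter.progress();
              try {
                Thread.sleep(intervalMs);
              } catch (InterruptedException e) {
                return;
              }
            }
          }
        });
        pinger.setDaemon(true);
        pinger.start();
      }

      public void stop() {
        running = false;
        pinger.interrupt();
      }
    }

    // Usage inside reduce() (sketch):
    //   ProgressHeartbeat hb = new ProgressHeartbeat(reporter, 30000L);
    //   try { /* expensive per-group work, output.collect(...) */ }
    //   finally { hb.stop(); }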


On 2/21/09 5:28 AM, "Bryan Duxbury" <br...@rapleaf.com> wrote:

(Repost from the dev list)

I noticed some really odd behavior today while reviewing the job
history of some of our jobs. Our Ganglia graphs showed really long
periods of inactivity across the entire cluster, which should
definitely not be the case - we have a really long string of jobs in
our workflow that should execute one after another. I figured out
which jobs were running during those periods of inactivity, and
discovered that almost all of them had 4-5 failed reduce tasks, with
the reason for failure being something like:

Task attempt_200902061117_3382_r_000038_0 failed to report status for
1282 seconds. Killing!

The actual timeout reported varies from 700-5000 seconds. Virtually
all of our longer-running jobs were affected by this problem. The
period of inactivity on the cluster seems to correspond to the amount
of time the job waited for these reduce tasks to fail.

I checked out the tasktracker log for the machines with timed-out
reduce tasks looking for something that might explain the problem,
but the only thing I came up with that actually referenced the failed
task was this log message, which was repeated many times:

2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
taskTracker/jobcache/job_200902061117_3388/
attempt_200902061117_3388_r_000066_0/output/file.out in any of the
configured local directories

I'm not sure what this means; can anyone shed some light on this
message?

Further confusing the issue, on the affected machines, I looked in
logs/userlogs/<task id>, and to my surprise, the directory and log
files existed, and the syslog file seemed to contain logs of a
perfectly good reduce task!

Overall, this seems like a pretty critical bug. It's consuming up to
50% of the runtime of our jobs in some instances, killing our
throughput. At the very least, it seems like the reduce task timeout
period should be MUCH shorter than the current 10-20 minutes.

-Bryan