Posted to common-user@hadoop.apache.org by Billy Pearson <sa...@pearsonwholesale.com> on 2009/03/26 03:23:29 UTC

reduce task failing after 24 hours waiting

I am seeing on one of my long-running jobs (about 50-60 hours) that after 24
hours all active reduce tasks fail with the error message

java.io.IOException: Task process exit with nonzero status of 255.
 at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

Is there something in the config that I can change to stop this?

Every time, within 1 minute of the 24-hour mark, they all fail at the same time.
This wastes a lot of resources downloading the map outputs and merging them again.

Billy



Re: reduce task failing after 24 hours waiting

Posted by Billy Pearson <sa...@pearsonwholesale.com>.
mapred.jobtracker.retirejob.interval is not in the default config.

Shouldn't this property be in the default config?
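A quick way I can sanity-check what my loaded config actually contains is
something like the sketch below (assuming hadoop-default.xml/hadoop-site.xml
are on the classpath, and assuming the jobtracker falls back to a hard-coded
24-hour default in milliseconds when the property is missing; the class name
is just for illustration):

import org.apache.hadoop.conf.Configuration;

public class CheckRetireInterval {
  public static void main(String[] args) {
    // Loads hadoop-default.xml and hadoop-site.xml from the classpath.
    Configuration conf = new Configuration();

    // null here means the property is not declared in either file.
    System.out.println("declared: "
        + conf.get("mapred.jobtracker.retirejob.interval"));

    // Read it with the assumed built-in fallback of 24 hours in milliseconds.
    long effective = conf.getLong("mapred.jobtracker.retirejob.interval",
                                  24L * 60 * 60 * 1000);
    System.out.println("effective: " + effective + " ms");
  }
}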

Billy



"Amar Kamat" <am...@yahoo-inc.com> wrote in 
message news:49CAFF11.8070400@yahoo-inc.com...
> Amar Kamat wrote:
>> Amareshwari Sriramadasu wrote:
>>> Set mapred.jobtracker.retirejob.interval
>> This is used to retire completed jobs.
>>> and mapred.userlog.retain.hours to a higher value.
>> This is used to discard user logs.
> As Amareshwari pointed out, this might be the cause. Can you increase this 
> value and try?
> Amar
>>> By default, their values are 24 hours. These might be the reason for 
>>> failure, though I'm not sure.
>>>
>>> Thanks
>>> Amareshwari
>>>
>>> Billy Pearson wrote:
>>>> I am seeing on one of my long-running jobs (about 50-60 hours) that after
>>>> 24 hours all active reduce tasks fail with the error message
>>>>
>>>> java.io.IOException: Task process exit with nonzero status of 255.
>>>> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
>>>>
>>>> Is there something in the config that I can change to stop this?
>>>>
>>>> Every time, within 1 minute of the 24-hour mark, they all fail at the same time.
>>>> This wastes a lot of resources downloading the map outputs and merging them
>>>> again.
>> What is the state of the reducer (copy or sort)? Check the
>> jobtracker/tasktracker logs to see what the state of these reducers is
>> and whether a kill signal was issued. Either the jobtracker/tasktracker is
>> issuing a kill signal or the reducers are committing suicide. Were there
>> any failures on the reducer side while pulling the map output? Also, what
>> is the nature of the job? How fast do the maps finish?
>> Amar
>>>>
>>>> Billy
>>>>
>>>>
>>>
>>
>
> 



Re: reduce task failing after 24 hours waiting

Posted by Amar Kamat <am...@yahoo-inc.com>.
Amar Kamat wrote:
> Amareshwari Sriramadasu wrote:
>> Set mapred.jobtracker.retirejob.interval 
> This is used to retire completed jobs.
>> and mapred.userlog.retain.hours to a higher value.
> This is used to discard user logs.
As Amareshwari pointed out, this might be the cause. Can you increase 
this value and try?
Amar
>> By default, their values are 24 hours. These might be the reason for 
>> failure, though I'm not sure.
>>
>> Thanks
>> Amareshwari
>>
>> Billy Pearson wrote:
>>> I am seeing on one of my long-running jobs (about 50-60 hours) that after
>>> 24 hours all active reduce tasks fail with the error message
>>>
>>> java.io.IOException: Task process exit with nonzero status of 255.
>>> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
>>>
>>> Is there something in the config that I can change to stop this?
>>>
>>> Every time, within 1 minute of the 24-hour mark, they all fail at the same time.
>>> This wastes a lot of resources downloading the map outputs and merging them
>>> again.
> What is the state of the reducer (copy or sort)? Check the
> jobtracker/tasktracker logs to see what the state of these reducers is and
> whether a kill signal was issued. Either the jobtracker/tasktracker is
> issuing a kill signal or the reducers are committing suicide. Were there
> any failures on the reducer side while pulling the map output? Also, what
> is the nature of the job? How fast do the maps finish?
> Amar
>>>
>>> Billy
>>>
>>>
>>
>


Re: reduce task failing after 24 hours waiting

Posted by Billy Pearson <sa...@pearsonwholesale.com>.
There are many maps finishing, taking from 4 minutes to 15 minutes (less time
closer to the end of the job), so no timeout there. The state of the reduce
tasks is shuffle; they are grabbing the map outputs as the maps finish. The
current job took 50:43:37, and each of the reduce tasks failed twice in that
time, once at 24 hours in and again at 48 hours in. On the next run, in a few
days, I will test with mapred.jobtracker.retirejob.interval and
mapred.userlog.retain.hours set to 72 hours and see if that solves the problem.
So not a bad guess, though it seems odd that both times it happened within
5 minutes of the 24-hour mark on all the tasks at the same time.
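For reference, a rough sketch of what I mean by the 72-hour settings, assuming
the interval property is read in milliseconds and the retain property in hours
(my reading of the defaults, so worth double-checking). Both really belong in
hadoop-site.xml on the jobtracker/tasktracker nodes; the Configuration object
below is only used to illustrate the values, and the class name is made up:

import org.apache.hadoop.conf.Configuration;

public class SeventyTwoHourOverride {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // 72 hours expressed in milliseconds (assumed unit of this property).
    conf.setLong("mapred.jobtracker.retirejob.interval", 72L * 60 * 60 * 1000);

    // 72 expressed in hours (assumed unit of this property).
    conf.setInt("mapred.userlog.retain.hours", 72);

    System.out.println("mapred.jobtracker.retirejob.interval = "
        + conf.get("mapred.jobtracker.retirejob.interval"));   // 259200000
    System.out.println("mapred.userlog.retain.hours = "
        + conf.get("mapred.userlog.retain.hours"));            // 72
  }
}

If the milliseconds assumption is wrong, setting the interval to a plain 72
would make things much worse, so that is the first thing to verify.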


Looks like from the tasktracker logs I get the WARN below:

org.apache.hadoop.mapred.TaskRunner: attempt_200903212204_0005_r_000001_1 Child Error


Grepping the tasktracker log for one of the reduces that failed (I do not have
debug turned on, so all I have are the info logs):

2009-03-25 18:37:45,473 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903212204_0005_r_000001_1 0.3083758% reduce > copy (2360 of 2551 at 0.87 MB/s) >
2009-03-25 18:37:48,476 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903212204_0005_r_000001_1 0.3083758% reduce > copy (2360 of 2551 at 0.87 MB/s) >
2009-03-25 18:37:49,194 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200903212204_0005/attempt_200903212204_0005_r_000001_1/output/file.out in any of the configured local directories
2009-03-25 18:37:49,480 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200903212204_0005/attempt_200903212204_0005_r_000001_1/output/file.out in any of the configured local directories
2009-03-25 18:37:51,481 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903212204_0005_r_000001_1 0.3083758% reduce > copy (2360 of 2551 at 0.87 MB/s) >
2009-03-25 18:37:54,372 WARN org.apache.hadoop.mapred.TaskRunner: attempt_200903212204_0005_r_000001_1 Child Error
2009-03-25 18:37:54,497 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200903212204_0005/attempt_200903212204_0005_r_000001_1/output/file.out in any of the configured local directories
2009-03-25 18:37:57,400 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200903212204_0005_r_000001_1 done; removing files.
2009-03-25 18:42:25,191 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_200903212204_0005_r_000001_1 task's state:FAILED_UNCLEAN
2009-03-25 18:42:25,192 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_200903212204_0005_r_000001_1
2009-03-25 18:42:25,192 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 1 and trying to launch attempt_200903212204_0005_r_000001_1
2009-03-25 18:42:30,134 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_200903212204_0005_r_437314552 given task: attempt_200903212204_0005_r_000001_1
2009-03-25 18:42:30,196 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200903212204_0005/attempt_200903212204_0005_r_000001_1/output/file.out in any of the configured local directories
2009-03-25 18:42:32,530 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903212204_0005_r_000001_1 0.0%
2009-03-25 18:42:32,555 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903212204_0005_r_000001_1 0.0% cleanup
2009-03-25 18:42:32,567 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_200903212204_0005_r_000001_1 is done.
2009-03-25 18:42:32,567 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_200903212204_0005_r_000001_1 was 0
2009-03-25 18:42:32,568 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200903212204_0005_r_000001_1 done; removing files.


Grepping the jobtracker log for the same task:

2009-03-25 18:37:54,500 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_200903212204_0005_r_000001_1: java.io.IOException: Task process exit with nonzero status of 255.
2009-03-25 18:42:25,186 INFO org.apache.hadoop.mapred.JobTracker: Adding task (cleanup)'attempt_200903212204_0005_r_000001_1' to tip task_200903212204_0005_r_000001, for tracker 'tracker_server-1:localhost.localdomain/127.0.0.1:38816'
2009-03-25 18:42:32,589 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200903212204_0005_r_000001_1' from 'tracker_server-1:localhost.localdomain/127.0.0.1:38816'






"Amar Kamat" <am...@yahoo-inc.com> wrote in 
message news:49CAFD8E.8010700@yahoo-inc.com...
> Amareshwari Sriramadasu wrote:
>> Set mapred.jobtracker.retirejob.interval
> This is used to retire completed jobs.
>> and mapred.userlog.retain.hours to a higher value.
> This is used to discard user logs.
>> By default, their values are 24 hours. These might be the reason for 
>> failure, though I'm not sure.
>>
>> Thanks
>> Amareshwari
>>
>> Billy Pearson wrote:
>>> I am seeing on one of my long-running jobs (about 50-60 hours) that after
>>> 24 hours all active reduce tasks fail with the error message
>>>
>>> java.io.IOException: Task process exit with nonzero status of 255.
>>> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
>>>
>>> Is there something in the config that I can change to stop this?
>>>
>>> Every time, within 1 minute of the 24-hour mark, they all fail at the same time.
>>> This wastes a lot of resources downloading the map outputs and merging them
>>> again.
> What is the state of the reducer (copy or sort)? Check the
> jobtracker/tasktracker logs to see what the state of these reducers is and
> whether a kill signal was issued. Either the jobtracker/tasktracker is
> issuing a kill signal or the reducers are committing suicide. Were there
> any failures on the reducer side while pulling the map output? Also, what
> is the nature of the job? How fast do the maps finish?
> Amar
>>>
>>> Billy
>>>
>>>
>>
>
> 



Re: reduce task failing after 24 hours waiting

Posted by Amar Kamat <am...@yahoo-inc.com>.
Amareshwari Sriramadasu wrote:
> Set mapred.jobtracker.retirejob.interval 
This is used to retire completed jobs.
> and mapred.userlog.retain.hours to a higher value.
This is used to discard user logs.
> By default, their values are 24 hours. These might be the reason for 
> failure, though I'm not sure.
>
> Thanks
> Amareshwari
>
> Billy Pearson wrote:
>> I am seeing on one of my long-running jobs (about 50-60 hours) that after
>> 24 hours all active reduce tasks fail with the error message
>>
>> java.io.IOException: Task process exit with nonzero status of 255.
>> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
>>
>> Is there something in the config that I can change to stop this?
>>
>> Every time, within 1 minute of the 24-hour mark, they all fail at the same time.
>> This wastes a lot of resources downloading the map outputs and merging them
>> again.
What is the state of the reducer (copy or sort)? Check the
jobtracker/tasktracker logs to see what the state of these reducers is and
whether a kill signal was issued. Either the jobtracker/tasktracker is issuing
a kill signal or the reducers are committing suicide. Were there any failures
on the reducer side while pulling the map output? Also, what is the nature of
the job? How fast do the maps finish?
Amar
>>
>> Billy
>>
>>
>


Re: reduce task failing after 24 hours waiting

Posted by Amareshwari Sriramadasu <am...@yahoo-inc.com>.
Set mapred.jobtracker.retirejob.interval and mapred.userlog.retain.hours
to a higher value. By default, their values are 24 hours. These might be
the reason for the failure, though I'm not sure.

Thanks
Amareshwari

Billy Pearson wrote:
> I am seeing on one of my long-running jobs (about 50-60 hours) that after
> 24 hours all active reduce tasks fail with the error message
>
> java.io.IOException: Task process exit with nonzero status of 255.
> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
>
> Is there something in the config that I can change to stop this?
>
> Every time, within 1 minute of the 24-hour mark, they all fail at the same time.
> This wastes a lot of resources downloading the map outputs and merging them
> again.
>
> Billy
>
>