Posted to common-user@hadoop.apache.org by Nathan Marz <na...@rapleaf.com> on 2009/02/25 05:28:57 UTC

FAILED_UNCLEAN?

I have a large job operating on over 2 TB of data, with about 50000  
input splits. For some reason (as yet unknown), tasks started failing  
on two of the machines (which got blacklisted). 13 mappers failed in  
total. Of those 13, 8 of the tasks were able to execute on another  
machine without any issues. 5 of the tasks *did not* get re-executed  
on another machine, and their status is marked as "FAILED_UNCLEAN".  
Anyone have any idea what's going on? Why isn't Hadoop running these  
tasks on other machines?
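For reference, the blacklisting and retry behavior described here is driven by
a couple of per-job knobs. A minimal sketch against the 0.19 JobConf API (the
limits shown are the defaults as I recall them, not something stated in this
thread):

import org.apache.hadoop.mapred.JobConf;

public class BlacklistKnobs {
  public static void main(String[] args) {
    JobConf conf = new JobConf();

    // Task failures on a single TaskTracker before that tracker is
    // blacklisted for this job (believed to default to 4 in 0.19).
    conf.setMaxTaskFailuresPerTracker(4);

    // Attempts per map task before the task, and hence the job, is
    // marked failed (believed to default to 4 as well).
    conf.setMaxMapAttempts(4);

    System.out.println("per-tracker failure limit = "
        + conf.getMaxTaskFailuresPerTracker());
    System.out.println("max map attempts = " + conf.getMaxMapAttempts());
  }
}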

Thanks,
Nathan Marz



Re: FAILED_UNCLEAN?

Posted by Amareshwari Sriramadasu <am...@yahoo-inc.com>.
This is most likely because of HADOOP-5269. By any chance, did you look 
at the TaskTracker web UI? Was it holding on to some FAILED_UNCLEAN tasks?
Can you attach the JobTracker and TaskTracker logs, and the task logs, if 
possible?
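If the job is still alive and the unclean attempts are wedged the way
HADOOP-5269 describes, one possible workaround (not confirmed in this thread)
is to manually fail the stuck attempts so the JobTracker schedules fresh ones
elsewhere; this is what "hadoop job -fail-task" does from the command line. A
minimal sketch against the 0.19 client API, with placeholder IDs:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.TaskAttemptID;

public class FailStuckAttempt {
  public static void main(String[] args) throws Exception {
    // args[0]: job id, e.g. "job_200902242201_0001" (placeholder)
    // args[1]: attempt id, e.g. "attempt_200902242201_0001_m_000042_0"
    JobClient client = new JobClient(new JobConf());
    RunningJob job = client.getJob(JobID.forName(args[0]));

    // shouldFail=true marks the attempt FAILED (it counts against the
    // task's retry limit) instead of KILLED, so a new attempt gets
    // scheduled on another tracker.
    job.killTask(TaskAttemptID.forName(args[1]), true);
  }
}

Failing (rather than killing) the attempt counts against the task's retry
limit, so after a few of these the task either succeeds elsewhere or fails the
job outright.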

Thanks
Amareshwari
Nathan Marz wrote:
> This is on Hadoop 0.19.1. The first time I saw it happen, the job 
> hung. That is, 5 map tasks were "running", but each of those tasks 
> showed only the FAILED_UNCLEAN attempt and no other attempts. When I 
> reran the job, it failed immediately, and some of the tasks again had 
> FAILED_UNCLEAN attempts.
>
> There is one job that runs in parallel with this one, but it has the 
> same priority. That other job had already failed by the time the job 
> I'm describing hung.
>
>
> On Feb 24, 2009, at 10:46 PM, Amareshwari Sriramadasu wrote:
>
>> Nathan Marz wrote:
>>> I have a large job operating on over 2 TB of data, with about 50000 
>>> input splits. For some reason (as yet unknown), tasks started 
>>> failing on two of the machines (which got blacklisted). 13 mappers 
>>> failed in total. Of those 13, 8 of the tasks were able to execute on 
>>> another machine without any issues. 5 of the tasks *did not* get 
>>> re-executed on another machine, and their status is marked as 
>>> "FAILED_UNCLEAN". Anyone have any idea what's going on? Why isn't 
>>> Hadoop running these tasks on other machines?
>>>
>> Did the job fail, get killed, or succeed by the time you saw this 
>> situation? Once the job completes, the unclean attempts will not get 
>> scheduled.
>> If not, are there other jobs of higher priority running at the same 
>> time, preventing the cleanups from being launched?
>> What version of Hadoop are you using? The latest trunk?
>>
>> Thanks
>> Amareshwari
>>> Thanks,
>>> Nathan Marz
>>>
>>>
>>
>
>


Re: FAILED_UNCLEAN?

Posted by Nathan Marz <na...@rapleaf.com>.
This is on Hadoop 0.19.1. The first time I saw it happen, the job 
hung. That is, 5 map tasks were "running", but each of those tasks 
showed only the FAILED_UNCLEAN attempt and no other attempts. When I 
reran the job, it failed immediately, and some of the tasks again had 
FAILED_UNCLEAN attempts.

There is one job that runs in parallel with this one, but it has the 
same priority. That other job had already failed by the time the job 
I'm describing hung.
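One way to see exactly which attempts the JobTracker has recorded for the
hung job, and which trackers they ran on, is to walk the task completion
events. A minimal sketch against the 0.19 client API; the job id argument is
a placeholder:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.TaskCompletionEvent;

public class ListFailedAttempts {
  public static void main(String[] args) throws Exception {
    // args[0] is the job id, e.g. "job_200902242201_0001" (placeholder).
    JobClient client = new JobClient(new JobConf());
    RunningJob job = client.getJob(JobID.forName(args[0]));

    // Page through the completion events, printing each failed attempt
    // and the tracker it ran on.
    int from = 0;
    TaskCompletionEvent[] events = job.getTaskCompletionEvents(from);
    while (events.length > 0) {
      for (TaskCompletionEvent e : events) {
        if (e.getTaskStatus() == TaskCompletionEvent.Status.FAILED) {
          System.out.println(e.getTaskAttemptId() + " failed on "
              + e.getTaskTrackerHttp());
        }
      }
      from += events.length;
      events = job.getTaskCompletionEvents(from);
    }
  }
}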


On Feb 24, 2009, at 10:46 PM, Amareshwari Sriramadasu wrote:

> Nathan Marz wrote:
>> I have a large job operating on over 2 TB of data, with about 50000  
>> input splits. For some reason (as yet unknown), tasks started  
>> failing on two of the machines (which got blacklisted). 13 mappers  
>> failed in total. Of those 13, 8 of the tasks were able to execute  
>> on another machine without any issues. 5 of the tasks *did not* get  
>> re-executed on another machine, and their status is marked as  
>> "FAILED_UNCLEAN". Anyone have any idea what's going on? Why isn't  
>> Hadoop running these tasks on other machines?
>>
> Did the job fail, get killed, or succeed by the time you saw this 
> situation? Once the job completes, the unclean attempts will not get 
> scheduled.
> If not, are there other jobs of higher priority running at the same 
> time, preventing the cleanups from being launched?
> What version of Hadoop are you using? The latest trunk?
>
> Thanks
> Amareshwari
>> Thanks,
>> Nathan Marz
>>
>>
>


Re: FAILED_UNCLEAN?

Posted by Amareshwari Sriramadasu <am...@yahoo-inc.com>.
Nathan Marz wrote:
> I have a large job operating on over 2 TB of data, with about 50000 
> input splits. For some reason (as yet unknown), tasks started failing 
> on two of the machines (which got blacklisted). 13 mappers failed in 
> total. Of those 13, 8 of the tasks were able to execute on another 
> machine without any issues. 5 of the tasks *did not* get re-executed 
> on another machine, and their status is marked as "FAILED_UNCLEAN". 
> Anyone have any idea what's going on? Why isn't Hadoop running these 
> tasks on other machines?
>
Did the job fail, get killed, or succeed by the time you saw this 
situation? Once the job completes, the unclean attempts will not get 
scheduled.
If not, are there other jobs of higher priority running at the same time, 
preventing the cleanups from being launched?
What version of Hadoop are you using? The latest trunk?
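On the priority question: a job's priority is carried in mapred.job.priority
and can be set through JobConf at submission time. A minimal sketch against
the 0.19 API; HIGH is just an example value:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobPriority;

public class PrioritySketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf();

    // Levels are VERY_LOW, LOW, NORMAL (the default), HIGH, VERY_HIGH.
    conf.setJobPriority(JobPriority.HIGH);

    System.out.println("mapred.job.priority = " + conf.getJobPriority());
  }
}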

Thanks
Amareshwari
> Thanks,
> Nathan Marz
>
>