Posted to mapreduce-dev@hadoop.apache.org by Aaron Zimmerman <az...@sproutsocial.com> on 2012/12/26 16:14:36 UTC

INFO org.apache.hadoop.mapred.TaskTracker: attempt_XXXX NaN%

Hi,
  I'm new to hadoop, setting up a new cluster on hadoop 1.0.3 that currently
only has 2 datanode/tasktrackers.  I'll be adding more soon, but I'm worried
about something being configured incorrectly. When I run a moderately
expensive map reduce job (via pig), the job usually fails (though it does
succeed 1/8 times or so).

ERROR 2997: Unable to recreate exception from backed error: Task
attempt_201212171952_0406_m_000020_3 failed to report status for 601
seconds. Killing!

Any time a job runs on the cluster, both task tracker logs output line after
line of 
INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201212171952_0411_m_000000_0 NaN%, with different attempt
identifiers.  

Interspersed with these entries are lines like,
org.apache.hadoop.mapred.TaskTracker: attempt_201212171952_0411_r_000000_0
0.1851852% reduce > copy (5 of 9 at 0.00 MB/s) >

Which makes it look to me like some of the tasks are working, but some of
the tasks just stall out, and perhaps they eventually time out the entire
job?

So maybe my job is just too labor intensive for the cluster, but the task
tracker log entry seems odd, like something is wrong.  Why would it say
NaN%?  I know that I can extend the timeout allotment, but I'd rather not do
that as a permanent solution.  Is there any other config that I could
update?  Has anyone seen that task tracker line before?  I can't find
anything about it via Google, etc.
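
For reference, the knob I'd be extending is mapred.task.timeout; a sketch
of what raising it in mapred-site.xml would look like (a stopgap, not a
fix):

```xml
<!-- mapred-site.xml: raise the per-task status-report timeout
     from the default 600000 ms (10 min) to 30 min. -->
<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value>
</property>
```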

Thanks,

Aaron Zimmerman



Re: INFO org.apache.hadoop.mapred.TaskTracker: attempt_XXXX NaN%

Posted by Aaron Zimmerman <az...@sproutsocial.com>.
What information can I get you?

I'm not sending periodic updates; I'm just using Pig to generate the map
reduce jobs, so perhaps something I'm doing is preventing Pig from
reporting progress?

The map task itself is querying Hbase, essentially a full table scan, the
table currently has 9 regions, so 9 map tasks.

I don't see that line in the task logs, just in the TaskTracker log while
the tasks are running.

It looks to me like the tasks that are reporting NaN progress eventually
fail and are retried, and the retry succeeds.

http://hastebin.com/xuhugipuve.sm - task fails

http://hastebin.com/vusadavano.vhdl - task succeeds
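
One plausible mechanism for the NaN (an assumption on my part, not
confirmed): Hadoop-1 computes map progress as bytes read divided by split
length, and an HBase table split can report a length of 0, so the division
is 0/0. A tiny self-contained sketch (ProgressGuess is a hypothetical
stand-in for illustration, not Hadoop code):

```java
public class ProgressGuess {
    // Hadoop-1-style map progress: fraction of the input split consumed.
    // If the InputFormat reports a zero-length split, the division is
    // 0/0, which for Java floats is NaN.
    static float progress(long bytesRead, long splitLength) {
        return (float) bytesRead / (float) splitLength;
    }

    public static void main(String[] args) {
        // Mirrors the TaskTracker log line shape.
        System.out.println("attempt_X " + progress(0L, 0L) * 100 + "%");
        // prints "attempt_X NaN%"
    }
}
```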


On 12/26/12 1:21 PM, "Vinod Kumar Vavilapalli" <vi...@hortonworks.com>
wrote:

>
>The NaN is very suspicious, perhaps a bug - will need more information
>
>But irrespective, are you sending periodic updates from your map/reduce
>code? The framework has the 10 minute timeout to avoid hung tasks, so the
>user code can report progress via the Reporter interface and avoid the
>task-failures.
>
>HTH,
>
>+Vinod Kumar Vavilapalli
>Hortonworks Inc.
>http://hortonworks.com/
>



Re: INFO org.apache.hadoop.mapred.TaskTracker: attempt_XXXX NaN%

Posted by Vinod Kumar Vavilapalli <vi...@hortonworks.com>.
The NaN is very suspicious, perhaps a bug - will need more information

But irrespective, are you sending periodic updates from your map/reduce code? The framework has the 10 minute timeout to avoid hung tasks, so the user code can report progress via the Reporter interface and avoid the task-failures.

HTH,

+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/
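
A minimal, self-contained sketch of the pattern: call progress() from
inside long-running per-record work. The Reporter interface below is a
hypothetical stand-in mirroring two methods of
org.apache.hadoop.mapred.Reporter, so the snippet compiles without Hadoop
on the classpath:

```java
// Stand-in for org.apache.hadoop.mapred.Reporter (assumption: only the
// two methods used here are mirrored).
interface Reporter {
    void progress();               // heartbeat telling the TaskTracker "still alive"
    void setStatus(String status); // free-form status shown in the job UI
}

public class ScanLoop {
    // Long-running work must call reporter.progress() at least once per
    // mapred.task.timeout window (default 600000 ms), or the TaskTracker
    // kills the attempt with the "failed to report status" error seen in
    // this thread.
    static int scan(int records, Reporter reporter) {
        int processed = 0;
        for (int i = 0; i < records; i++) {
            // ... expensive work per record, e.g. an HBase read ...
            processed++;
            if (processed % 1000 == 0) {
                reporter.progress();
                reporter.setStatus("scanned " + processed + " rows");
            }
        }
        return processed;
    }
}
```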
