Posted to common-user@hadoop.apache.org by Mori Bellamy <mb...@apple.com> on 2008/07/02 00:20:03 UTC

scaling issue, please help

hey all,
i've got a mapreduce task that works on small (~1G) input. when i try  
to run the same task on large (~100G) input, i get the following error  
around when the map tasks are almost done (~98%)

2008-07-01 13:10:59,231 INFO org.apache.hadoop.mapred.ReduceTask:  
task_200807011005_0005_r_000000_0: Got 0 new map-outputs & 0 obsolete  
map-outputs from tasktracker and 0 map-outputs from previous failures
2008-07-01 13:10:59,232 INFO org.apache.hadoop.mapred.ReduceTask:  
task_200807011005_0005_r_000000_0 Got 0 known map output location(s);  
scheduling...
2008-07-01 13:10:59,232 INFO org.apache.hadoop.mapred.ReduceTask:  
task_200807011005_0005_r_000000_0 Scheduled 0 of 0 known outputs (0  
slow hosts and 0 dup hosts)
2008-07-01 13:10:59,232 INFO org.apache.hadoop.mapred.ReduceTask:  
task_200807011005_0005_r_000000_0 Need 1 map output(s)
2008-07-01 13:11:00,231 INFO org.apache.hadoop.mapred.ReduceTask:  
task_200807011005_0005_r_000000_0: Got 0 new map-outputs & 0 obsolete  
map-outputs from tasktracker and 0 map-outputs from previous failures
2008-07-01 13:11:00,231 INFO org.apache.hadoop.mapred.ReduceTask:  
task_200807011005_0005_r_000000_0 Got 0 known map output location(s);  
scheduling...
2008-07-01 13:11:00,231 INFO org.apache.hadoop.mapred.ReduceTask:  
task_200807011005_0005_r_000000_0 Scheduled 0 of 0 known outputs (0  
slow hosts and 0 dup hosts)
2008-07-01 13:11:05,232 INFO org.apache.hadoop.mapred.ReduceTask:  
task_200807011005_0005_r_000000_0 Need 1 map output(s)
2008-07-01 13:11:05,232 INFO org.apache.hadoop.mapred.ReduceTask:  
task_200807011005_0005_r_000000_0: Got 0 new map-outputs & 0 obsolete  
map-outputs from tasktracker and 0 map-outputs from previous failures
2008-07-01 13:11:05,232 INFO org.apache.hadoop.mapred.ReduceTask:  
task_200807011005_0005_r_000000_0 Got 0 known map output location(s);  
scheduling...
2008-07-01 13:11:05,233 INFO org.apache.hadoop.mapred.ReduceTask:  
task_200807011005_0005_r_000000_0 Scheduled 0 of 0 known outputs (0  
slow hosts and 0 dup hosts)
2008-07-01 13:11:10,233 INFO org.apache.hadoop.mapred.ReduceTask:  
task_200807011005_0005_r_000000_0 Need 1 map output(s)

I'm running the task on a cluster of 5 workers, one DFS master, and  
one task tracker. i'm chaining mapreduce tasks, so i'm using  
SequenceFileOutput and SequenceFileInput. this error happens before  
the first link in the chain successfully reduces.
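
for reference, here's roughly how the chain is wired up -- a trimmed-down
sketch using the old org.apache.hadoop.mapred API, with illustrative paths
and identity map/reduce stand-ins, not the actual job code:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ChainDriver {
  public static void main(String[] args) throws Exception {
    // step 1: write SequenceFiles so step 2 can read them directly
    JobConf step1 = new JobConf(ChainDriver.class);
    step1.setJobName("chain-step-1");
    step1.setMapperClass(IdentityMapper.class);    // stand-in for real logic
    step1.setReducerClass(IdentityReducer.class);  // stand-in for real logic
    step1.setOutputKeyClass(Text.class);
    step1.setOutputValueClass(Text.class);
    step1.setOutputFormat(SequenceFileOutputFormat.class);
    FileInputFormat.setInputPaths(step1, new Path("in"));
    FileOutputFormat.setOutputPath(step1, new Path("step1-out"));
    JobClient.runJob(step1);  // blocks until step 1 succeeds or fails

    // step 2: read step 1's SequenceFile output
    JobConf step2 = new JobConf(ChainDriver.class);
    step2.setJobName("chain-step-2");
    step2.setMapperClass(IdentityMapper.class);
    step2.setReducerClass(IdentityReducer.class);
    step2.setOutputKeyClass(Text.class);
    step2.setOutputValueClass(Text.class);
    step2.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.setInputPaths(step2, new Path("step1-out"));
    FileOutputFormat.setOutputPath(step2, new Path("final-out"));
    JobClient.runJob(step2);
  }
}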

does anyone have any insight? thanks!

Re: scaling issue, please help

Posted by Amar Kamat <am...@yahoo-inc.com>.
Mori Bellamy wrote:
> i discovered that some of my code was causing out of bounds 
> exceptions. i cleaned up that code and the map tasks seemed to work. 
> that confuses me -- i'm pretty sure hadoop is resilient to a few map 
> tasks failing (5 out of 13k). before this fix, my remaining 2% of 
> tasks were getting killed.
Mori, I am not sure what the confusion is. Hadoop can be resilient to a 
few task failures, but not by default. The parameters that control this 
are mapred.max.map.failures.percent and mapred.max.reduce.failures.percent. 
Internally, the framework runs every task as a series of attempts, and 
some attempt failures are tolerated as well. If the number of failed 
attempts for a task exceeds the threshold 
(mapred.map.max.attempts/mapred.reduce.max.attempts; default is 4), the 
task is considered failed. If the percentage of failed map/reduce tasks 
exceeds the threshold 
(mapred.max.map.failures.percent/mapred.max.reduce.failures.percent; 
default is 0), the job is considered failed.
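For example, a minimal sketch of loosening these thresholds in the job 
driver (the property names are the ones above; the numbers are 
illustrative, not recommendations):

import org.apache.hadoop.mapred.JobConf;

public class FailureTolerance {
  public static JobConf tolerantConf() {
    JobConf conf = new JobConf();
    // let up to 5% of map/reduce tasks fail without failing the job
    conf.setInt("mapred.max.map.failures.percent", 5);
    conf.setInt("mapred.max.reduce.failures.percent", 5);
    // allow more attempts per task before the task itself is failed
    conf.setInt("mapred.map.max.attempts", 8);     // default 4
    conf.setInt("mapred.reduce.max.attempts", 8);  // default 4
    return conf;
  }
}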
Amar
>
>
> On Jul 1, 2008, at 10:06 PM, Amar Kamat wrote:
>
>> Mori Bellamy wrote:
>>> hey all,
>>> i've got a mapreduce task that works on small (~1G) input. when i 
>>> try to run the same task on large (~100G) input, i get the following 
>>> error around when the map tasks are almost done (~98%)
>>>
>>> 2008-07-01 13:10:59,231 INFO org.apache.hadoop.mapred.ReduceTask: 
>>> task_200807011005_0005_r_000000_0: Got 0 new map-outputs & 0 
>>> obsolete map-outputs from tasktracker and 0 map-outputs from 
>>> previous failures
>>> 2008-07-01 13:10:59,232 INFO org.apache.hadoop.mapred.ReduceTask: 
>>> task_200807011005_0005_r_000000_0 Got 0 known map output 
>>> location(s); scheduling...
>>> 2008-07-01 13:10:59,232 INFO org.apache.hadoop.mapred.ReduceTask: 
>>> task_200807011005_0005_r_000000_0 Scheduled 0 of 0 known outputs (0 
>>> slow hosts and 0 dup hosts)
>>> 2008-07-01 13:10:59,232 INFO org.apache.hadoop.mapred.ReduceTask: 
>>> task_200807011005_0005_r_000000_0 Need 1 map output(s)
>> ...
>> ...
>> These are not error messages. The reducers are stuck because not all 
>> maps are completed. Mori, could you let us know what is happening to 
>> the other 2% of maps? Are they getting executed? Are they still pending 
>> (waiting to run)? Were they killed/failed? Is there any lost tracker?
>>> I'm running the task on a cluster of 5 workers, one DFS master, and 
>>> one task tracker.
>> What do you mean by 5 workers and 1 task tracker?
>>> i'm chaining mapreduce tasks, so i'm using SequenceFileOutput and 
>>> SequenceFileInput. this error happens before the first link in the 
>>> chain successfully reduces.
>> Can you elaborate on this a bit? Are you chaining MR jobs?
>> Amar
>>>
>>> does anyone have any insight? thanks!
>>
>


Re: scaling issue, please help

Posted by Mori Bellamy <mb...@apple.com>.
i discovered that some of my code was causing out of bounds  
exceptions. i cleaned up that code and the map tasks seemed to work.  
that confuses me -- i'm pretty sure hadoop is resilient to a few map  
tasks failing (5 out of 13k). before this fix, my remaining 2% of  
tasks were getting killed.
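
in case it helps anyone else, the fix boiled down to validating each 
record before indexing into it. a sketch of the pattern (not my actual 
mapper; the field layout and names are made up):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class GuardedMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  enum Records { SKIPPED }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String[] fields = value.toString().split("\t");
    if (fields.length < 2) {
      // malformed record: count it and move on instead of throwing
      // an ArrayIndexOutOfBoundsException and killing the attempt
      reporter.incrCounter(Records.SKIPPED, 1);
      return;
    }
    output.collect(new Text(fields[0]), new Text(fields[1]));
  }
}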


On Jul 1, 2008, at 10:06 PM, Amar Kamat wrote:

> Mori Bellamy wrote:
>> hey all,
>> i've got a mapreduce task that works on small (~1G) input. when i  
>> try to run the same task on large (~100G) input, i get the  
>> following error around when the map tasks are almost done (~98%)
>>
>> 2008-07-01 13:10:59,231 INFO org.apache.hadoop.mapred.ReduceTask:  
>> task_200807011005_0005_r_000000_0: Got 0 new map-outputs & 0  
>> obsolete map-outputs from tasktracker and 0 map-outputs from  
>> previous failures
>> 2008-07-01 13:10:59,232 INFO org.apache.hadoop.mapred.ReduceTask:  
>> task_200807011005_0005_r_000000_0 Got 0 known map output  
>> location(s); scheduling...
>> 2008-07-01 13:10:59,232 INFO org.apache.hadoop.mapred.ReduceTask:  
>> task_200807011005_0005_r_000000_0 Scheduled 0 of 0 known outputs (0  
>> slow hosts and 0 dup hosts)
>> 2008-07-01 13:10:59,232 INFO org.apache.hadoop.mapred.ReduceTask:  
>> task_200807011005_0005_r_000000_0 Need 1 map output(s)
> ...
> ...
> These are not error messages. The reducers are stuck because not all 
> maps are completed. Mori, could you let us know what is happening to 
> the other 2% of maps? Are they getting executed? Are they still pending 
> (waiting to run)? Were they killed/failed? Is there any lost tracker?
>> I'm running the task on a cluster of 5 workers, one DFS master, and  
>> one task tracker.
> What do you mean by 5 workers and 1 task tracker?
>> i'm chaining mapreduce tasks, so i'm using SequenceFileOutput and  
>> SequenceFileInput. this error happens before the first link in the  
>> chain successfully reduces.
> Can you elaborate on this a bit? Are you chaining MR jobs?
> Amar
>>
>> does anyone have any insight? thanks!
>


Re: scaling issue, please help

Posted by Amar Kamat <am...@yahoo-inc.com>.
Mori Bellamy wrote:
> hey all,
> i've got a mapreduce task that works on small (~1G) input. when i try 
> to run the same task on large (~100G) input, i get the following error 
> around when the map tasks are almost done (~98%)
>
> 2008-07-01 13:10:59,231 INFO org.apache.hadoop.mapred.ReduceTask: 
> task_200807011005_0005_r_000000_0: Got 0 new map-outputs & 0 obsolete 
> map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-07-01 13:10:59,232 INFO org.apache.hadoop.mapred.ReduceTask: 
> task_200807011005_0005_r_000000_0 Got 0 known map output location(s); 
> scheduling...
> 2008-07-01 13:10:59,232 INFO org.apache.hadoop.mapred.ReduceTask: 
> task_200807011005_0005_r_000000_0 Scheduled 0 of 0 known outputs (0 
> slow hosts and 0 dup hosts)
> 2008-07-01 13:10:59,232 INFO org.apache.hadoop.mapred.ReduceTask: 
> task_200807011005_0005_r_000000_0 Need 1 map output(s)
...
...
These are not error messages. The reducers are stuck because not all maps 
are completed. Mori, could you let us know what is happening to the other 
2% of maps? Are they getting executed? Are they still pending (waiting to 
run)? Were they killed/failed? Is there any lost tracker?
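One quick way to check from the client side is to poll the job; the 
JobTracker web UI shows the same per-task detail. A sketch, assuming the 
JobClient API of this era (the job id is taken from the task ids in your 
log):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class JobProbe {
  public static void main(String[] args) throws Exception {
    // connects to the JobTracker named in the default configuration
    JobClient client = new JobClient(new JobConf());
    RunningJob job = client.getJob("job_200807011005_0005");
    System.out.println("maps:    " + (job.mapProgress() * 100) + "%");
    System.out.println("reduces: " + (job.reduceProgress() * 100) + "%");
    System.out.println("done:    " + job.isComplete());
  }
}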
> I'm running the task on a cluster of 5 workers, one DFS master, and 
> one task tracker.
What do you mean by 5 workers and 1 task tracker?
> i'm chaining mapreduce tasks, so i'm using SequenceFileOutput and 
> SequenceFileInput. this error happens before the first link in the 
> chain successfully reduces.
Can you elaborate on this a bit? Are you chaining MR jobs?
Amar
>
> does anyone have any insight? thanks!


Re: scaling issue, please help

Posted by Allen Wittenauer <aw...@yahoo-inc.com>.


On 7/1/08 3:20 PM, "Mori Bellamy" <mb...@apple.com> wrote:
> i've got a mapreduce task that works on small (~1G) input. when i try
> to run the same task on large (~100G) input, i get the following error
> around when the map tasks are almost done (~98%)

[error list deleted]

> I'm running the task on a cluster of 5 workers, one DFS master, and
> one task tracker. i'm chaining mapreduce tasks, so i'm using
> SequenceFileOutput and SequenceFileInput. this error happens before
> the first link in the chain successfully reduces.
> 
> does anyone have any insight? thanks!

    Any chance your tasks are running out of memory?  I've seen similar
errors when we had our memory watchdog set too low and the tasks were killed
during the shuffle. Woops. :)
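
If that turns out to be the problem, bumping the child JVM heap is a 
one-liner in the job conf. A sketch (the property name is real for this 
era, when the default was -Xmx200m; the value here is illustrative):

import org.apache.hadoop.mapred.JobConf;

public class HeapBump {
  public static JobConf withBiggerHeap() {
    JobConf conf = new JobConf();
    // raise each task JVM's heap; keep it below whatever external
    // memory watchdog / ulimit the cluster enforces
    conf.set("mapred.child.java.opts", "-Xmx512m");
    return conf;
  }
}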