Posted to mapreduce-user@hadoop.apache.org by Artem Yankov <ar...@gmail.com> on 2011/10/25 19:56:15 UTC

Hadoop cluster on EC2: hangs on big chunks of data

Hey,

I set up a hadoop cluster on EC2 using this documentation:
http://wiki.apache.org/hadoop/AmazonEC2

OS: Linux Fedora 8
Hadoop version is 0.20.203.0
java version "1.7.0_01"
heap size: 1 GB (stats always show that only 4% of it is used)
I use mongo-hadoop plugin to get data from mongodb.

Everything seems to work perfectly with small chunks of data: calculations
are fast, I get the results, and tasks seem to be distributed normally
among the slaves.

Then I try to load a huge amount of data (22 million records) and
everything hangs. The first slave receives a map task, but the other slaves
receive nothing. In the logs I constantly see this:

INFO org.apache.hadoop.hdfs.StateChange: *BLOCK* NameSystem.processReport: from x.x.x.x:50010, blocks: 2, processing time: 0 m
I tried different numbers of slaves (at most I ran 25 nodes), but it
doesn't help, because it seems that once the first slave receives a job it
blocks everything else. (Again, everything works fine with small chunks of
data.)

There is no significant CPU or memory load on the master.

Any ideas on what could be the reason for this?

Artem.

Re: Hadoop cluster on EC2: hangs on big chunks of data

Posted by Friso van Vollenhoven <fv...@xebia.com>.
What is your input data? Some types of files are not splittable because they use non-splittable compression codecs (like gzip). Could that be the case here?
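
If you want to check whether this is what is happening, you can count the splits your job would get without actually running it. A minimal sketch against the old 0.20 mapred API (the class name SplitCount and the input path argument are made up, and this assumes a file-based input format):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;

public class SplitCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SplitCount.class);
        // Point this at the same input you give the real job.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        // getInputFormat() defaults to TextInputFormat; configure your
        // own input format first if you use a different one.
        InputFormat format = conf.getInputFormat();
        // The framework schedules one map task per split, so a single
        // split means a single busy mapper no matter how many slaves run.
        InputSplit[] splits = format.getSplits(conf, 1);
        System.out.println("number of splits: " + splits.length);
    }
}

If it prints 1 and your input is a single gzip file, splitting the data into several files (or using an uncompressed or block-compressed format) is the usual way to get more than one mapper.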

Friso

On 25 Oct 2011, at 21:43, Artem Yankov wrote:

It looks like the input data is not being split correctly. It always generates only one map task and gives it to one of the nodes. I tried to pass parameters like -D mapred.max.split.size, but it doesn't seem to have any effect.

So the question would be: how do I specify the maximum number of input records each mapper can receive?

Re: Hadoop cluster on EC2: hangs on big chunks of data

Posted by Artem Yankov <ar...@gmail.com>.
It looks like the input data is not being split correctly. It always generates
only one map task and gives it to one of the nodes. I tried to pass parameters
like -D mapred.max.split.size, but it doesn't seem to have any effect.

So the question would be: how do I specify the maximum number of input records
each mapper can receive?
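
One thing worth checking: -D options only reach the job configuration when the main class runs through ToolRunner/GenericOptionsParser; a driver that builds its own Configuration in main() silently ignores them. A minimal sketch of such a driver (the class name MyJobDriver is made up and the mapper/reducer setup is elided):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJobDriver extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        // getConf() already carries anything passed with -D on the command line.
        Job job = new Job(getConf(), "my job");
        job.setJarByClass(MyJobDriver.class);
        // ... set mapper, reducer and input/output formats here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MyJobDriver(), args));
    }
}

Also, as far as I can tell, mapred.max.split.size is only honored by the new-API FileInputFormat; an input format that computes its own splits (as the mongo-hadoop connector does) will ignore it, so the split settings would have to come from the connector's own configuration properties.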
