Posted to common-user@hadoop.apache.org by souravm <SO...@infosys.com> on 2008/11/14 01:55:30 UTC

Any suggestion on performance improvement ?

Hi,

I'm testing with a 4-node Hadoop HDFS setup.

Each node has 2 GB of memory, a dual-core CPU, and around 30-60 GB of disk space.

I've stored files of different sizes in HDFS, ranging from 10 MB to 5 GB.

I'm querying those files using Pig. What I'm seeing is that even a simple select query (LOAD and FILTER) takes at least 30-40 seconds. The map task on one node takes at least 25 seconds.
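
For reference, the queries are roughly of the following shape (the file and field names here are placeholders, not my real data):

  -- load a tab-delimited file and keep only rows passing a simple filter
  raw = LOAD 'mydata.txt' USING PigStorage('\t') AS (id, name, value);
  filtered = FILTER raw BY value > 100;
  DUMP filtered;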

I've set the JVM max heap size to 1024 MB.

Any suggestions on how to improve performance through Hadoop-level configuration (by changing HDFS and MapReduce parameters)?

Regards,
Sourav


Re: Any suggestion on performance improvement ?

Posted by Aaron Kimball <aa...@cloudera.com>.
It's worth pointing out that Hadoop really isn't designed to run at such a small
scale. Hadoop's performance doesn't really begin to kick in until you've got
tens of GB of data.

The question is sort of like asking "how can I make an 18-wheeler run faster
when carrying only a single bag of groceries."

There is a large amount of overhead associated with starting a Hadoop job; in
particular, starting a bunch of JVMs. The TaskTrackers only poll for new
work every 10 seconds, so every Hadoop job takes at least 10 seconds before
it even gets slotted onto the worker nodes. The remaining 20 seconds or so
are most likely eaten up by similar overheads. These stop being a factor
once you actually have a sizeable amount of data worth reading.

You are correct that adding more nodes won't help. For 60 MB of data, it's
only spawning one task, on one worker node.

You might want to configure Hadoop to run in single-threaded mode on a
single machine, and ditch the cluster entirely. Set 'mapred.job.tracker' to
'local' and 'fs.default.name' to 'file:///some/dir/in/the/local/machine',
and it should run Hadoop entirely within a single JVM.
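
In hadoop-site.xml that amounts to something like the following minimal override (any file:/// URI for the default filesystem will do):

  <!-- run MapReduce in local mode: a single JVM, no JobTracker, no HDFS -->
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>

Pig also has a local mode of its own (pig -x local), which may be the quickest way to experiment with files this small.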

- Aaron

On Fri, Nov 14, 2008 at 11:12 AM, souravm <SO...@infosys.com> wrote:

> Hi Alex,
>
> I get 30-40 seconds of response time for around 60 MB of data. The number
> of Map and Reduce tasks is 1 each. This is because the default HDFS block
> size is 64 MB and Pig assigns one Map task per HDFS block - I believe that
> is optimal.
>
> Now, since this is the unit of performance, I don't think increasing the
> number of nodes would improve it.
>
> Regards,
> Sourav
> -----Original Message-----
> From: Alex Loddengaard [mailto:alex@cloudera.com]
> Sent: Friday, November 14, 2008 9:44 AM
> To: core-user@hadoop.apache.org
> Subject: Re: Any suggestion on performance improvement ?
>
> How big is the data that you're loading and filtering?  Your cluster is
> pretty small, so if you have data on the order of tens or hundreds of
> GBs, then the performance you're describing is probably to be expected.
> How many map and reduce tasks are you running on each node?
>
> Alex
>
> On Thu, Nov 13, 2008 at 4:55 PM, souravm <SO...@infosys.com> wrote:
>
> > Hi,
> >
> > I'm testing with a 4-node Hadoop HDFS setup.
> >
> > Each node has 2 GB of memory, a dual-core CPU, and around 30-60 GB of
> > disk space.
> >
> > I've stored files of different sizes in HDFS, ranging from 10 MB to 5 GB.
> >
> > I'm querying those files using Pig. What I'm seeing is that even a simple
> > select query (LOAD and FILTER) takes at least 30-40 seconds. The map task
> > on one node takes at least 25 seconds.
> >
> > I've set the JVM max heap size to 1024 MB.
> >
> > Any suggestions on how to improve performance through Hadoop-level
> > configuration (by changing HDFS and MapReduce parameters)?
> >
> > Regards,
> > Sourav
> >

RE: Any suggestion on performance improvement ?

Posted by souravm <SO...@infosys.com>.
Hi Alex,

I get 30-40 seconds of response time for around 60 MB of data. The number of Map and Reduce tasks is 1 each. This is because the default HDFS block size is 64 MB and Pig assigns one Map task per HDFS block - I believe that is optimal.
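
For reference, that block size is the dfs.block.size setting in hadoop-site.xml, which I have left at its default - roughly:

  <!-- HDFS block size in bytes; 67108864 bytes = 64 MB, the default -->
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
  </property>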

Now, since this is the unit of performance, I don't think increasing the number of nodes would improve it.

Regards,
Sourav
-----Original Message-----
From: Alex Loddengaard [mailto:alex@cloudera.com] 
Sent: Friday, November 14, 2008 9:44 AM
To: core-user@hadoop.apache.org
Subject: Re: Any suggestion on performance improvement ?

How big is the data that you're loading and filtering?  Your cluster is
pretty small, so if you have data on the order of tens or hundreds of
GBs, then the performance you're describing is probably to be expected.
How many map and reduce tasks are you running on each node?

Alex

On Thu, Nov 13, 2008 at 4:55 PM, souravm <SO...@infosys.com> wrote:

> Hi,
>
> I'm testing with a 4-node Hadoop HDFS setup.
>
> Each node has 2 GB of memory, a dual-core CPU, and around 30-60 GB of
> disk space.
>
> I've stored files of different sizes in HDFS, ranging from 10 MB to 5 GB.
>
> I'm querying those files using Pig. What I'm seeing is that even a simple
> select query (LOAD and FILTER) takes at least 30-40 seconds. The map task
> on one node takes at least 25 seconds.
>
> I've set the JVM max heap size to 1024 MB.
>
> Any suggestions on how to improve performance through Hadoop-level
> configuration (by changing HDFS and MapReduce parameters)?
>
> Regards,
> Sourav
>

Re: Any suggestion on performance improvement ?

Posted by Alex Loddengaard <al...@cloudera.com>.
How big is the data that you're loading and filtering?  Your cluster is
pretty small, so if you have data on the order of tens or hundreds of
GBs, then the performance you're describing is probably to be expected.
How many map and reduce tasks are you running on each node?

Alex

On Thu, Nov 13, 2008 at 4:55 PM, souravm <SO...@infosys.com> wrote:

> Hi,
>
> I'm testing with a 4-node Hadoop HDFS setup.
>
> Each node has 2 GB of memory, a dual-core CPU, and around 30-60 GB of
> disk space.
>
> I've stored files of different sizes in HDFS, ranging from 10 MB to 5 GB.
>
> I'm querying those files using Pig. What I'm seeing is that even a simple
> select query (LOAD and FILTER) takes at least 30-40 seconds. The map task
> on one node takes at least 25 seconds.
>
> I've set the JVM max heap size to 1024 MB.
>
> Any suggestions on how to improve performance through Hadoop-level
> configuration (by changing HDFS and MapReduce parameters)?
>
> Regards,
> Sourav
>