Posted to common-user@hadoop.apache.org by "Juan P." <go...@gmail.com> on 2011/07/07 22:29:43 UTC

Cluster Tuning

Hi guys!

I'd like some help fine tuning my cluster. I currently have 20 boxes exactly
alike. Single core machines with 600MB of RAM. No chance of upgrading the
hardware.

My cluster is made out of 1 NameNode/JobTracker box and 19
DataNode/TaskTracker boxes.

All my config is default except I've set the following in my mapred-site.xml
in an effort to prevent choking my boxes:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>

I'm running a MapReduce job which reads a proxy server log file (2GB), maps
each record to its host, and then in the reduce task accumulates the number
of bytes received from each host.

Currently it's producing about 65000 keys
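
In code terms, the job is roughly the following (a sketch only: the class
name and log column positions are hypothetical, written against the 0.20-era
mapred API):

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  // Sketch of the job described above: one proxy log line per record,
  // emit (host, bytes) from the mapper, sum per host in the reducer.
  // The column indexes are hypothetical; adjust to the real log format.
  public class HostBytes {
    public static class HostMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, LongWritable> out, Reporter r)
          throws IOException {
        String[] fields = line.toString().split("\\s+");
        String host = fields[2];                 // hypothetical host column
        long bytes = Long.parseLong(fields[4]);  // hypothetical bytes column
        out.collect(new Text(host), new LongWritable(bytes));
      }
    }
    public static class SumReducer extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {
      public void reduce(Text host, Iterator<LongWritable> values,
                         OutputCollector<Text, LongWritable> out, Reporter r)
          throws IOException {
        long sum = 0;
        while (values.hasNext()) sum += values.next().get();
        out.collect(host, new LongWritable(sum));
      }
    }
  }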

The whole job takes forever to complete, especially the reduce part. I've
tried different tuning configs but I can't bring it down under 20 mins.

Any ideas?

Thanks for your help!
Pony

Re: Hadoop Production Issue

Posted by Віталій Тимчишин <ti...@gmail.com>.
2011/7/16 jagaran das <ja...@yahoo.co.in>

> Hi,
>
> Due to requirements in our current production CDH3 cluster, we need to copy
> around 11520 small files (12 GB total) to the cluster for one application.
> We have 20 such applications that would run in parallel.
>
> So one set would have 11520 files totaling 12 GB, and we would have 15 such
> sets in parallel.
>
> The total SLA for the whole pipeline (copy to cluster, Pig aggregation,
> copy to local, SQL load) is 15 minutes.
>
>
Have you tried using HARs (Hadoop Archives)?
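
For reference, a HAR packs many small files into one archive that MapReduce
can still read through, and is created with a single command (the paths here
are hypothetical):

  hadoop archive -archiveName input.har -p /user/etl incoming /user/etl/archived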
-- 
Best regards,
 Vitalii Tymchyshyn

Hadoop Production Issue

Posted by jagaran das <ja...@yahoo.co.in>.
Hi,

Due to requirements in our current production CDH3 cluster, we need to copy around 11520 small files (12 GB total) to the cluster for one application. We have 20 such applications that would run in parallel.

So one set would have 11520 files totaling 12 GB, and we would have 15 such sets in parallel.

The total SLA for the whole pipeline (copy to cluster, Pig aggregation, copy to local, SQL load) is 15 minutes.

What we do:

1. Merge files so that we get rid of small files. This is a huge time hit; do we have any other option?
2. Copy to cluster
3. Execute Pig job
4. Copy to local
5. SQL loader

Can we perform the merge and the copy to the cluster from a host other than the NameNode?
We want an out-of-cluster machine running a Java process that would
1. Run periodically
2. Merge files
3. Copy to the cluster (see the sketch below)
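
A minimal sketch of such a client follows; the paths are hypothetical, and
note that any machine with the Hadoop client libraries and config on its
classpath can write to HDFS directly (it does not need to be the NameNode):

  import java.io.*;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;

  // Sketch: concatenate every local file in a directory into a single
  // HDFS file. Paths are hypothetical; run it from cron for periodicity.
  public class MergeAndUpload {
    public static void main(String[] args) throws IOException {
      Configuration conf = new Configuration(); // picks up core-site.xml from the classpath
      FileSystem hdfs = FileSystem.get(conf);
      File[] inputs = new File("/data/incoming").listFiles(); // hypothetical local dir
      if (inputs == null) return;
      Path target = new Path("/user/etl/merged/batch-1");     // hypothetical HDFS path
      FSDataOutputStream out = hdfs.create(target);
      try {
        for (File f : inputs) {
          InputStream in = new BufferedInputStream(new FileInputStream(f));
          try {
            IOUtils.copyBytes(in, out, conf, false); // false: keep the output stream open
          } finally {
            in.close();
          }
        }
      } finally {
        out.close();
      }
    }
  }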

Secondly, can we append to an existing file in the cluster?

Please provide your thoughts, as maintaining the SLA is becoming tough.

Regards,
Jagaran 

Re: Cluster Tuning

Posted by Steve Loughran <st...@apache.org>.
On 08/07/2011 16:25, Juan P. wrote:
> Here's another thought. I realized that the reduce operation in my
> map/reduce jobs is a flash. But it goes really slow until the
> mappers end. Is there a way to configure the cluster to make the reduce wait
> for the map operations to complete? Especially considering my hardware
> constraints.

Take a look to see if it's usually the same machine that's taking too
long; test your HDDs to see if there are any signs of problems in the
SMART messages. Then turn on speculation. The problem with a slow mapper
could be caused by disk trouble or an overloaded server.
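
For reference, speculation is controlled by the speculative-execution flags
that show up later in this thread; re-enabling the map side looks like:

  <property>
    <name>mapred.map.tasks.speculative.execution</name>
    <value>true</value>
  </property>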


Re: Cluster Tuning

Posted by "Juan P." <go...@gmail.com>.
BTW: Here's the Job Output

https://spreadsheets.google.com/spreadsheet/ccc?key=0Av5N1j_JvusDdDdaTG51OE1FOUptZHg5M1Zxc0FZbHc&hl=en_US

On Mon, Jul 11, 2011 at 1:28 PM, Juan P. <go...@gmail.com> wrote:

> Hi guys! Here's my mapred-site.xml
> I've tweaked a few properties but still it's taking about 8-10mins to
> process 4GB of data. Thought maybe you guys could find something you'd
> comment on.
> Thanks!
> Pony
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <configuration>
>   <property>
>     <name>mapred.job.tracker</name>
>     <value>name-node:54311</value>
>   </property>
>   <property>
>     <name>mapred.tasktracker.map.tasks.maximum</name>
>     <value>1</value>
>   </property>
>   <property>
>     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>     <value>1</value>
>   </property>
>   <property>
>     <name>mapred.compress.map.output</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>mapred.map.output.compression.codec</name>
>     <value>org.apache.hadoop.io.compress.GzipCodec</value>
>   </property>
>   <property>
>     <name>mapred.child.java.opts</name>
>     <value>-Xmx400m</value>
>   </property>
>   <property>
>     <name>map.sort.class</name>
>     <value>org.apache.hadoop.util.HeapSort</value>
>   </property>
>   <property>
>     <name>mapred.reduce.slowstart.completed.maps</name>
>     <value>0.85</value>
>   </property>
>   <property>
>     <name>mapred.map.tasks.speculative.execution</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>mapred.reduce.tasks.speculative.execution</name>
>     <value>false</value>
>   </property>
> </configuration>

Re: Cluster Tuning

Posted by "Juan P." <go...@gmail.com>.
Allen,
Say I were to bring the property back to the default of -Xmx200m, which
buffers do you think I should adjust? io.sort.mb? io.sort.factor? How would
you adjust them?

Thanks for your help!
Pony

On Mon, Jul 11, 2011 at 4:41 PM, Allen Wittenauer <aw...@apache.org> wrote:

>
> On Jul 11, 2011, at 9:28 AM, Juan P. wrote:
> >
> >   <property>
> >     <name>mapred.child.java.opts</name>
> >     <value>-Xmx400m</value>
> >   </property>
>
> "Single core machines with 600MB of RAM."
>
>                 2x400m = 800m just for the heap of the map and reduce
> phases, not counting the other memory that the jvm will need.  io buffer
> sizes aren't adjusted downward either, so you're likely looking at a
> swapping + spills = death scenario.  slowstart set to 1 is going to be
> pretty much required.

Re: Cluster Tuning

Posted by Allen Wittenauer <aw...@apache.org>.
On Jul 11, 2011, at 9:28 AM, Juan P. wrote:
> 
>   <property>
>     <name>mapred.child.java.opts</name>
>     <value>-Xmx400m</value>
>   </property>

"Single core machines with 600MB of RAM."

		2x400m = 800m just for the heap of the map and reduce phases, not counting the other memory that the jvm will need.  io buffer sizes aren't adjusted downward either, so you're likely looking at a swapping + spills = death scenario.  slowstart set to 1 is going to be pretty much required.
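
For reference, the knobs in question, with illustrative downward adjustments
for a 600MB box (these values are guesses, not tested recommendations):

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx200m</value>
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>50</value>   <!-- default is 100; the sort buffer must fit inside the task heap -->
  </property>
  <property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>1.0</value>
  </property>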

Re: Cluster Tuning

Posted by "Juan P." <go...@gmail.com>.
Hi guys! Here's my mapred-site.xml
I've tweaked a few properties but still it's taking about 8-10mins to
process 4GB of data. Thought maybe you guys could find something you'd
comment on.
Thanks!
Pony

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>name-node:54311</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx400m</value>
  </property>
  <property>
    <name>map.sort.class</name>
    <value>org.apache.hadoop.util.HeapSort</value>
  </property>
  <property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>0.85</value>
  </property>
  <property>
    <name>mapred.map.tasks.speculative.execution</name>
    <value>false</value>
  </property>
  <property>
    <name>mapred.reduce.tasks.speculative.execution</name>
    <value>false</value>
  </property>
</configuration>


Re: Cluster Tuning

Posted by Bharath Mundlapudi <bh...@yahoo.com>.
Slow start is an important parameter and definitely impacts job runtime. My experience has been that setting it too low or too high can hurt job latencies. If you always run the same job it's easy to pick the right value, but if your cluster is multi-tenant, getting it right requires benchmarking different workloads concurrently.

But your case is interesting: you are running on a single core (how many disks per node?). So setting it toward the higher end of the spectrum, as Joey suggested, makes sense.


-Bharath





________________________________
From: Joey Echeverria <jo...@cloudera.com>
To: common-user@hadoop.apache.org
Sent: Friday, July 8, 2011 9:14 AM
Subject: Re: Cluster Tuning

Set mapred.reduce.slowstart.completed.maps to a number close to 1.0.
1.0 means the maps have to completely finish before the reduce starts
copying any data. I often run jobs with this set to .90-.95.

-Joey


Re: Cluster Tuning

Posted by Joey Echeverria <jo...@cloudera.com>.
Set mapred.reduce.slowstart.completed.maps to a number close to 1.0.
1.0 means the maps have to completely finish before the reduce starts
copying any data. I often run jobs with this set to .90-.95.
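
In mapred-site.xml terms (the same property Pony currently has at 0.85),
that would be something like:

  <property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>0.95</value>
  </property>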

-Joey


-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434

Re: Cluster Tuning

Posted by Robert Evans <ev...@yahoo-inc.com>.
I doubt it is going to make that much of a difference, even with the hardware constraints. All that the reduce is doing during this period is downloading the map output data, merge-sorting it, and possibly spilling parts of it to disk. That takes up some RAM, and if you are swapping a lot then keeping it from running might help, but only if you are really on the edge of the amount of RAM available to the system. Looking at how you can reduce the data you transfer and tuning the heap sizes for the various JVMs will probably have a bigger impact.

--Bobby Evans

On 7/8/11 10:25 AM, "Juan P." <go...@gmail.com> wrote:

Here's another thought. I realized that the reduce operation in my
map/reduce jobs is a flash. But it goes really slow until the
mappers end. Is there a way to configure the cluster to make the reduce wait
for the map operations to complete? Especially considering my hardware
constraints.

Thanks!
Pony



Re: Cluster Tuning

Posted by "Juan P." <go...@gmail.com>.
Here's another thought. I realized that the reduce operation in my
map/reduce jobs is a flash. But it goes really slow until the
mappers end. Is there a way to configure the cluster to make the reduce wait
for the map operations to complete? Especially considering my hardware
constraints.

Thanks!
Pony


Re: Cluster Tuning

Posted by "Juan P." <go...@gmail.com>.
Hey guys,
Thanks all of you for your help.

Joey,
I tweaked my MapReduce to serialize/deserialize only essential values and
added a combiner, and that helped a lot. Previously I had a domain object
being passed between Mapper and Reducer when I only needed a single value.

Esteban,
I think you underestimate the constraints of my cluster. Running multiple
tasks per JVM really kills me in terms of memory. Not to mention that with
a single core there's not much to gain in terms of parallelism (other than
perhaps while a process is waiting on an I/O operation). Still, I gave it a
shot, but even though I kept changing the config I always ended up with a
Java heap space error.

Is it me, or is performance tuning mostly a per-job task? I mean it will, in
the end, depend on the data you are processing (structure, size, whether
it's in one file or many, etc.). If my jobs have different sets of data, in
different formats and organized in different file structures, do you guys
recommend moving some of the configuration to Java code?
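
Per-job settings can indeed live in code rather than in mapred-site.xml; a
minimal sketch (property names are the ones from this thread, values purely
illustrative, and MyJob is a hypothetical driver class):

  // Sketch: settings placed on the JobConf apply to this job only and
  // override the cluster-wide defaults from mapred-site.xml.
  JobConf conf = new JobConf(MyJob.class);
  conf.set("mapred.reduce.slowstart.completed.maps", "0.95");
  conf.setBoolean("mapred.compress.map.output", true);
  conf.set("mapred.child.java.opts", "-Xmx200m");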

Thanks!
Pony


Re: Cluster Tuning

Posted by Ceriasmex <ce...@gmail.com>.
Are you the Esteban I know?



On 07/07/2011, at 15:53, Esteban Gutierrez <es...@cloudera.com> wrote:

> Hi Pony,
> 
> There is a good chance that your boxes are doing some heavy swapping, and
> that is a killer for Hadoop. Have you tried
> mapred.job.reuse.jvm.num.tasks=-1 and limiting the heap on those boxes as
> much as possible?
> 
> Cheers,
> Esteban.
> 
> --
> Get Hadoop!  http://www.cloudera.com/downloads/
> 

Re: Cluster Tuning

Posted by Esteban Gutierrez <es...@cloudera.com>.
Hi Pony,

There is a good chance that your boxes are doing some heavy swapping, and
that is a killer for Hadoop. Have you tried
mapred.job.reuse.jvm.num.tasks=-1 and limiting the heap on those boxes as
much as possible?
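
In mapred-site.xml that setting looks like:

  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>  <!-- -1 means reuse the task JVM an unlimited number of times per job -->
  </property>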

Cheers,
Esteban.

--
Get Hadoop!  http://www.cloudera.com/downloads/




Re: Cluster Tuning

Posted by Joey Echeverria <jo...@cloudera.com>.
Have you tried using a Combiner?

Here's an example of using one:

http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Example%3A+WordCount+v1.0
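
Since summing bytes per host is associative, the reducer class itself can be
registered as the combiner; with the 0.20-era mapred API that's one extra
line in the driver (class names here are hypothetical):

  // Sketch: the combiner pre-aggregates (host, bytes) pairs on the map
  // side, shrinking what gets shuffled to the reducers.
  JobConf conf = new JobConf(HostBytes.class);
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(LongWritable.class);
  conf.setMapperClass(HostBytes.HostMapper.class);
  conf.setCombinerClass(HostBytes.SumReducer.class);  // reuse the reducer as the combiner
  conf.setReducerClass(HostBytes.SumReducer.class);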

-Joey




-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434