Posted to common-user@hadoop.apache.org by Marcus Herou <ma...@tailsweep.com> on 2009/06/26 23:43:42 UTC

Scaling out/up or a mix

Hi.

We have a deployment of 10 Hadoop servers, and I now need more mapping
capacity (no, not just adding more mappers per instance) since I have so many
jobs running. Now I am wondering what I should aim for: memory, CPU or
disk? "How long is a piece of string?", perhaps you would say.

A typical server currently runs at about 15-20% CPU on a quad-core 2.4 GHz
machine with 8 GB RAM and two 500 GB SATA disks in RAID1.

Some specs below.
> mpstat 2 5
Linux 2.6.24-19-server (mapreduce2)     06/26/2009

11:36:13 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
11:36:15 PM  all   22.82    0.00    3.24    1.37    0.62    2.49    0.00   69.45   8572.50
11:36:17 PM  all   13.56    0.00    1.74    1.99    0.62    2.61    0.00   79.48   8075.50
11:36:19 PM  all   14.32    0.00    2.24    1.12    1.12    2.24    0.00   78.95   9219.00
11:36:21 PM  all   14.71    0.00    0.87    1.62    0.25    1.75    0.00   80.80   8489.50
11:36:23 PM  all   12.69    0.00    0.87    1.24    0.50    0.75    0.00   83.96   5495.00
Average:     all   15.62    0.00    1.79    1.47    0.62    1.97    0.00   78.53   7970.30

What I am thinking is: is it wiser to go for many of these cheap boxes with
8 GB of RAM, or should I instead focus on machines that can give more I/O
throughput?

I know these things are hard to answer, but perhaps someone has already
drawn some conclusions the pragmatic way.

Kindly

//Marcus


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/

Re: Scaling out/up or a mix

Posted by Marcus Herou <ma...@tailsweep.com>.
Hi.

The crawlers are _very_ threaded, but no, we use our own threading
framework, since multi-threaded mappers were not available in hadoop-core at
the time.

Crawlers normally just spend a lot of time waiting on remote hosts, which
uses very little CPU but consumes some memory due to the parallelism.
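
To give an idea of the pattern (a generic sketch only, not our actual
framework; all names here are made up):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class FetchPool {
    // Fetch many URLs with a fixed-size pool; the threads spend most of
    // their time blocked on remote hosts, so CPU stays low while memory
    // grows with the pool size and the in-flight responses.
    public static void fetchAll(List<String> urls) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(50);
        for (final String url : urls) {
            pool.submit(new Runnable() {
                public void run() {
                    // open the connection to url, read the body,
                    // hand it off for parsing
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}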

//Marcus

On Sat, Jun 27, 2009 at 6:10 PM, jason hadoop <ja...@gmail.com> wrote:

> How about multi-threaded mappers?
> Multi-threaded mappers are ideal for map tasks that are bound by non-local
> I/O across many distinct endpoints.
> You can also control the thread count on a per-job basis.


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/

Re: Scaling out/up or a mix

Posted by jason hadoop <ja...@gmail.com>.
How about multi-threaded mappers?
Multi-threaded mappers are ideal for map tasks that are bound by non-local
I/O across many distinct endpoints.
You can also control the thread count on a per-job basis.
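
With the old mapred API that looks roughly like this (an untested sketch;
CrawlJob is a placeholder, and the property name should be checked against
your Hadoop version):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

JobConf conf = new JobConf(CrawlJob.class);
// Swap the default MapRunner for the multi-threaded one; each thread
// calls map() concurrently, so the map implementation must be thread-safe.
conf.setMapRunnerClass(MultithreadedMapRunner.class);
// Per-job thread count (the default is 10).
conf.setInt("mapred.map.multithreadedrunner.threads", 20);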

On Sat, Jun 27, 2009 at 8:26 AM, Marcus Herou <ma...@tailsweep.com> wrote:

> The current argument against increasing the number of mappers is that the
> machines will run out of memory, and since a lot of the jobs are crawlers
> I would need more IP addresses so I don't get banned :)
>
> The thing is that we currently run Solr and the datanodes on the very same
> machines, so I can only give the MR tasks about 1 GB of memory, since I
> need Solr to have 4 GB...
>
> I realize I should expect some obvious and just critique of this
> architecture's layout, but I'm a little limited in budget, and so, then,
> is the architecture :)
>
> However, is it wise to have the MR tasks on the same nodes as the
> datanodes, or should I split the architecture? I mean, the datanodes
> perhaps need more disk I/O and the MR nodes more memory and CPU?
>
> Trying to find a sweet-spot hardware spec for those two roles.
>
> //Marcus


-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals

Re: Scaling out/up or a mix

Posted by Marcus Herou <ma...@tailsweep.com>.
The current argument against increasing the number of mappers is that the
machines will run out of memory, and since a lot of the jobs are crawlers I
would need more IP addresses so I don't get banned :)

The thing is that we currently run Solr and the datanodes on the very same
machines, so I can only give the MR tasks about 1 GB of memory, since I need
Solr to have 4 GB...
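
For reference, the knobs I am juggling are roughly these, per node (a sketch
with made-up values; on an 8 GB box, 4 GB for Solr plus ~1 GB of task heaps
plus the datanode/tasktracker daemons leaves little headroom):

<!-- hadoop-site.xml on each node; the values are illustrative only -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>   <!-- map slots per tasktracker -->
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx256m</value>   <!-- heap per task JVM: 4 x 256 MB = 1 GB -->
</property>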

I realize I should expect some obvious and just critique of this
architecture's layout, but I'm a little limited in budget, and so, then, is
the architecture :)

However, is it wise to have the MR tasks on the same nodes as the datanodes,
or should I split the architecture? I mean, the datanodes perhaps need more
disk I/O and the MR nodes more memory and CPU?

Trying to find a sweet-spot hardware spec for those two roles.

//Marcus



On Sat, Jun 27, 2009 at 4:24 AM, Brian Bockelman <bb...@cse.unl.edu> wrote:

> Hey Marcus,
>
> Are you recording the data rates coming out of HDFS?  Since you have such
> low CPU utilization, I'd look at boxes utterly packed with big hard drives
> (also, why are you using RAID1 for Hadoop??).
>
> You can get 1U boxes with 4 drive bays or 2U boxes with 12 drive bays.
> Based on the data rates you see, make the call.
>
> On the other hand, what's the argument against running 3x more mappers per
> box?  It seems that your boxes still have more headroom to use -- there's
> no I/O wait.
>
> Brian


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/

Re: Scaling out/up or a mix

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Marcus,

Are you recording the data rates coming out of HDFS?  Since you have such
low CPU utilization, I'd look at boxes utterly packed with big hard drives
(also, why are you using RAID1 for Hadoop??).

You can get 1U boxes with 4 drive bays or 2U boxes with 12 drive bays.
Based on the data rates you see, make the call.

On the other hand, what's the argument against running 3x more mappers per
box?  It seems that your boxes still have more headroom to use -- there's
no I/O wait.
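
If you are not already graphing this with something like Ganglia, one cheap
proxy is the per-job HDFS counters -- a rough sketch, and the group/counter
names vary between Hadoop versions:

import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.RunningJob;

// After running a job, read back how many bytes it pulled from HDFS;
// divide by the job's wall-clock time for a rough aggregate data rate.
RunningJob job = JobClient.runJob(conf);
Counters counters = job.getCounters();
long hdfsBytesRead =
    counters.getGroup("FileSystemCounters").getCounter("HDFS_BYTES_READ");
System.out.println("HDFS bytes read: " + hdfsBytesRead);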

Brian
