Posted to user@hbase.apache.org by Oleg Ruchovets <or...@gmail.com> on 2010/11/21 14:53:06 UTC

Hadoop/HBase hardware requirement

Hi all,
After testing HBase for a few months with a very light configuration (5
machines, 2 TB disk, 8 GB RAM), we are now planning for production.
Our Load -
1) 50GB of log files to process per day with Map/Reduce jobs.
2) Insert 4-5GB into 3 HBase tables.
3) Run 10-20 scans per day (scanning about 20 regions in a table).
All this should run in parallel.
Our current configuration can't cope with this load and we are having many
stability issues.

This is what we have in mind :
1. Master machine - 32 GB, 4 TB, two quad-core CPUs.
2. Name node - 16 GB, 2 TB, two quad-core CPUs.
We plan to have up to 20 name servers (starting with 5).

We already read
http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/
.

We would appreciate your feedback on our proposed configuration.


Regards Oleg & Lior

Re: Hadoop/HBase hardware requirement

Posted by Oleg Ruchovets <or...@gmail.com>.
On Sun, Nov 21, 2010 at 9:55 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
I'm unclear about the 2TB disk thing, is it 1x2TB or 2x1TB or 4x500GB?
I hope it's the last one, as you want to have as many spindles as possible.

We have 2x1TB.

I would prefer 24GB to 16, this is what we run on and it
works like a charm, and gives more room for those memory hungry jobs.

OK, the question here is: is 24GB critical? Will all of this be stable
using 16GB?


 What kind of stability issues are you having?

It looks like we hit every issue mentioned on the HBase list regarding region
server crashes.


The plan was to run 3 M/R jobs simultaneously, each inserting its results into
one of 3 HBase tables (every table insertion is ~2GB).
In addition there are 10 scans (20 in the future).

Almost all tests failed, so now I run only one job, or only one scan, at a
time. In addition I reduced from 3 parallel reducers to 2. That brings
me some stability, but with that kind of stability it is impossible to go
to production.


  Thanks Oleg



> J-D
>
> On Sun, Nov 21, 2010 at 5:53 AM, Oleg Ruchovets <or...@gmail.com>
> wrote:
> > Hi all,
> > After testing HBase for few months with very light configurations  (5
> > machines, 2 TB disk, 8 GB RAM), we are now planing for production.
> > Our Load -
> > 1) 50GB log files to process per day by Map/Reduce jobs.
> > 2)  Insert 4-5GB to 3 tables in hbase.
> > 3) Run 10-20 scans per day (scanning about 20 regions in a table).
> > All this should run in parallel.
> > Our current configuration can't cope with this load and we are having
> many
> > stability issues.
> >
> > This is what we have in mind :
> > 1. Master machine - 32 GB, 4 TB, Two quad core CPUs.
> > 2. Name node - 16 GB, 2TB, Two quad core CPUs.
> > we plan to have up to 20 name servers (starting with 5).
> >
> > We already read
> >
> http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/
> > .
> >
> > We would appreciate your feedback on our proposed configuration.
> >
> >
> > Regards Oleg & Lior
> >
>

Re: Hadoop/HBase hardware requirement

Posted by Jean-Daniel Cryans <jd...@apache.org>.
I'm unclear about the 2TB disk thing, is it 1x2TB or 2x1TB or 4x500GB?
I hope it's the last one, as you want to have as many spindles as
possible. I would prefer 24GB to 16; this is what we run on and it
works like a charm, and gives more room for those memory-hungry jobs.

What kind of stability issues are you having?

J-D

On Sun, Nov 21, 2010 at 5:53 AM, Oleg Ruchovets <or...@gmail.com> wrote:
> Hi all,
> After testing HBase for few months with very light configurations  (5
> machines, 2 TB disk, 8 GB RAM), we are now planing for production.
> Our Load -
> 1) 50GB log files to process per day by Map/Reduce jobs.
> 2)  Insert 4-5GB to 3 tables in hbase.
> 3) Run 10-20 scans per day (scanning about 20 regions in a table).
> All this should run in parallel.
> Our current configuration can't cope with this load and we are having many
> stability issues.
>
> This is what we have in mind :
> 1. Master machine - 32 GB, 4 TB, Two quad core CPUs.
> 2. Name node - 16 GB, 2TB, Two quad core CPUs.
> we plan to have up to 20 name servers (starting with 5).
>
> We already read
> http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/
> .
>
> We would appreciate your feedback on our proposed configuration.
>
>
> Regards Oleg & Lior
>

Re: Hadoop/HBase hardware requirement

Posted by Oleg Ruchovets <or...@gmail.com>.
Yes, I fully agree with that, but as I understand it there is a limitation:
bulk load works only with one column family.
That is not my case. Can you advise a workaround for using bulk load
with multiple column families?

Thanks Oleg.





> Are these insertions the output of the MR jobs?
>
> If so, I would strongly recommend the bulk load functionality. It is
> somewhere between 10x and 100x more efficient than direct API usage.
>
>
> > 3) Run 10-20 scans per day (scanning about 20 regions in a table).
> > All this should run in parallel.
> > Our current configuration can't cope with this load and we are having
> many
> > stability issues.
> >
> > This is what we have in mind :
> > 1. Master machine - 32 GB, 4 TB, Two quad core CPUs.
> > 2. Name node - 16 GB, 2TB, Two quad core CPUs.
> > we plan to have up to 20 name servers (starting with 5).
> >
> > We already read
> >
> >
> http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/
> > .
> >
> > We would appreciate your feedback on our proposed configuration.
> >
> >
> > Regards Oleg & Lior
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: Hadoop/HBase hardware requirement

Posted by Todd Lipcon <to...@cloudera.com>.
On Sun, Nov 21, 2010 at 5:53 AM, Oleg Ruchovets <or...@gmail.com> wrote:

> Hi all,
> After testing HBase for few months with very light configurations  (5
> machines, 2 TB disk, 8 GB RAM), we are now planing for production.
> Our Load -
> 1) 50GB log files to process per day by Map/Reduce jobs.
> 2)  Insert 4-5GB to 3 tables in hbase.
>

Are these insertions the output of the MR jobs?

If so, I would strongly recommend the bulk load functionality. It is
somewhere between 10x and 100x more efficient than direct API usage.
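
Roughly, the flow looks like the sketch below (a minimal sketch only; the
table name, input format, and column layout are placeholders, not anything
from your setup). The MR job writes HFiles to a staging directory and the
load step moves them into the regions, instead of issuing a Put per row:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadSketch {

  // Hypothetical mapper: parse one log line into a Put instead of writing
  // to the table directly. The reducer and partitioner are set up by
  // configureIncrementalLoad() so the HFiles match the table's regions.
  static class LogLineMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws java.io.IOException, InterruptedException {
      String[] f = line.toString().split("\t");         // placeholder format
      byte[] row = Bytes.toBytes(f[0]);
      Put put = new Put(row);
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("raw"), Bytes.toBytes(f[1]));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "my_table");         // placeholder table

    Job job = new Job(conf, "bulk-load my_table");
    job.setJarByClass(BulkLoadSketch.class);
    job.setMapperClass(LogLineMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    Path hfiles = new Path(args[1]);                      // staging dir for HFiles
    FileOutputFormat.setOutputPath(job, hfiles);

    // Sets the TotalOrderPartitioner, PutSortReducer and HFileOutputFormat
    // according to the table's current region boundaries.
    HFileOutputFormat.configureIncrementalLoad(job, table);

    if (job.waitForCompletion(true)) {
      // Move the generated HFiles into the regions (the fast part).
      new LoadIncrementalHFiles(conf).doBulkLoad(hfiles, table);
    }
  }
}

On the single-column-family limitation raised elsewhere in the thread: as far
as I know, the usual workaround was to run a job like this once per family
(and once per table), with the mapper emitting Puts for only that family.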


> 3) Run 10-20 scans per day (scanning about 20 regions in a table).
> All this should run in parallel.
> Our current configuration can't cope with this load and we are having many
> stability issues.
>
> This is what we have in mind :
> 1. Master machine - 32 GB, 4 TB, Two quad core CPUs.
> 2. Name node - 16 GB, 2TB, Two quad core CPUs.
> we plan to have up to 20 name servers (starting with 5).
>
> We already read
>
> http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/
> .
>
> We would appreciate your feedback on our proposed configuration.
>
>
> Regards Oleg & Lior
>



-- 
Todd Lipcon
Software Engineer, Cloudera

RE: Hadoop/HBase hardware requirement

Posted by Michael Segel <mi...@hotmail.com>.
Just to put things in perspective...

16GB isn't a lot of memory for a data node when it comes to 8-core Nehalem (Intel E5500 CPUs).

I would suggest you take a look at your Ganglia output on your cloud.

From our experience, we're disk I/O bound. Our DNs typically have 8 cores, 32GB, and 4x 2TB 7200 RPM SATA drives.
(Going across 2 racks is also an issue... but that's a different story.)

In terms of general Hadoop (we also run HBase)...
The more disks (spindles), the better.
So if you're going to cap your boxes at 16GB, then look at 4 cores and 4x2TB drives in each box.
I'd even consider a motherboard that could support 2 sockets and only populate 1 socket.

As others have pointed out in different discussions, memory will limit the number of m/r tasks you can run.

4 cores give you 8 virtual CPUs, so you could run 8 m/r tasks at 512MB per task and still have 1GB for the DataNode and 1GB for the TaskTracker.
Then you'd have some headroom to run 1024MB per task, or more if you end up with larger jobs.
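
For reference, a rough sketch of where that budget gets set on a 0.20-era
Hadoop cluster. The values below simply mirror the arithmetic above and are
not a tuned recommendation; the slot maximums are TaskTracker-side settings
read from mapred-site.xml at daemon start, while the child heap can be set
per job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TaskMemorySketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Per-task child JVM heap: 8 concurrent tasks x 512MB = 4GB of a 16GB
    // box, leaving room for the DataNode and TaskTracker daemons (their
    // heap comes from HADOOP_HEAPSIZE in hadoop-env.sh, typically ~1GB each).
    conf.set("mapred.child.java.opts", "-Xmx512m");

    // The slot counts themselves live in mapred-site.xml on each TaskTracker,
    // not in the job; shown here only to name the keys behind the 8-slot
    // budget above:
    //   mapred.tasktracker.map.tasks.maximum    = 6
    //   mapred.tasktracker.reduce.tasks.maximum = 2

    Job job = new Job(conf, "memory-budget-example");
    // ... set mapper/reducer/input/output as usual, then job.waitForCompletion(true);
  }
}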


HTH

-Mike

> Date: Mon, 22 Nov 2010 15:50:05 +0200
> Subject: Re: Hadoop/HBase hardware requirement
> From: liors@infolinks.com
> To: user@hbase.apache.org
> 
> I believe our bottleneck is currently with memory.
> Most of the CPU is dedicated to parsing and working with gzip files (we
> don't have heavy computational tasks).
> But on the other hand, more memory and disk mean we can run more M/R and
> scans which needs more CPU....
> 
> Lior
> 
> 
> On Mon, Nov 22, 2010 at 3:22 PM, Lars George <la...@gmail.com> wrote:
> 
> > Hi Lior,
> >
> > Depends on your load, is it IO or CPU bound? Sounds like IO or Disk
> > from the above, right? I would opt for the more machines! This will
> > spread the load better across the cluster. And you can always add more
> > disks in v2 of your setup.
> >
> > Lars
> >
> > On Mon, Nov 22, 2010 at 1:56 PM, Lior Schachter <li...@infolinks.com>
> > wrote:
> > > And another more concrete question:
> > > lets say that on every machine with two quad core CPUs, 4T and 16GB I can
> > > buy 2 machines with one quad, 2T, 16GB.
> > >
> > > Which configuration should I choose ?
> > >
> > > Lior
> > >
> > > On Mon, Nov 22, 2010 at 2:27 PM, Lior Schachter <li...@infolinks.com>
> > wrote:
> > >
> > >> Hi all, Thanks for your input and assistance.
> > >>
> > >>
> > >> From your answers I understand that:
> > >> 1. more is better but our configuration might work.
> > >> 2. there are small tweaks we can do that will improve our configuration
> > >> (like having 4x500GB disks).
> > >> 3. use monitoring (like Ganglia) to find the bottlenecks.
> > >>
> > >> For me, The question here is how to balance between our current budget
> > and
> > >> system stability (and performance).
> > >> I agree that more memory and more disk space will improve our
> > >> responsiveness but on the other hand our system is NOT expected to be
> > >> real-time (but rather a back office analytics with few hours delay).
> > >>
> > >> This is a crucial point since the proposed configurations we found in
> > the
> > >> web don't distinguish between real-time configurations and back-office
> > >> configurations. To build a real-time cluster with 20 nodes will cost
> > around
> > >> 200-300K (in Israel) this is similar to the price of a quite strong
> > Oracle
> > >> cluster... so my boss (the CTO) was partially right when telling me -
> > but
> > >> you said it would be cheap !! very cheap :)
> > >>
> > >> I believe that more money will come when we show the viability of the
> > >> system... I also read that heterogeneous clusters are common.
> > >>
> > >> It will help a lot if you can provide your configurations and system
> > >> characteristics (maybe in a Wiki page).
> > >> It will also help to get more of the "small tweaks" that you found
> > helpful.
> > >>
> > >>
> > >> Lior Schachter
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Mon, Nov 22, 2010 at 1:33 PM, Lars George <lars.george@gmail.com
> > >wrote:
> > >>
> > >>> Oleg,
> > >>>
> > >>> Do you have Ganglia or some other graphing tool running against the
> > >>> cluster? It gives you metrics that are crucial here, for example the
> > >>> load on Hadoop and its DataNodes as well as insertion rates etc. on
> > >>> HBase. What is also interesting is the compaction queue to see if the
> > >>> cluster is going slow.
> > >>>
> > >>> Did you try loading from an empty system to a loaded one? Or was it
> > >>> already filled and you are trying to add more? Are you spreading the
> > >>> load across servers or are you using sequential keys that tax only one
> > >>> server at a time?
> > >>>
> > >>> 16GB should work, but is not ideal. The various daemons simply need
> > >>> room to breathe. But that said, I have personally started with 12GB
> > >>> even and it worked.
> > >>>
> > >>> Lars
> > >>>
> > >>> On Mon, Nov 22, 2010 at 12:17 PM, Oleg Ruchovets <oruchovets@gmail.com
> > >
> > >>> wrote:
> > >>> > On Sun, Nov 21, 2010 at 10:39 PM, Krishna Sankar <
> > ksankar42@gmail.com
> > >>> >wrote:
> > >>> >
> > >>> >> Oleg & Lior,
> > >>> >>
> > >>> >> Couple of questions & couple of suggestions to ponder:
> > >>> >> A)  When you say 20 Name Servers, I assume you are talking about 20
> > >>> Task
> > >>> >> Servers
> > >>> >>
> > >>> >
> > >>> > Yes
> > >>> >
> > >>> >
> > >>> >> B)  What type are your M/R jobs ? Compute Intensive vs. storage
> > >>> intensive ?
> > >>> >>
> > >>> >
> > >>> > M/R -- most of it -- it is a parsing stuff , result of m/r  5% - 10%
> > >>> stores
> > >>> > to hbase
> > >>> >
> > >>> >
> > >>> >> C)  What is your Data growth ?
> > >>> >>
> > >>> >
> > >>> >  currently we have 50GB per day , it could be ~150GB.
> > >>> >
> > >>> >
> > >>> >> D)  With the current jobs, are you saturating RAM ? CPU ? Or storage
> > ?
> > >>> >>
> > >>> >    Map phase takes 100% CPU consumption since it is a parsing and
> > input
> > >>> > files are  gz.
> > >>> >    Definitely have a memory issues.
> > >>> >
> > >>> >
> > >>> >> Ganglia/Hadoop metrics should tell.
> > >>> >> E)  Also are your jobs long running or short tasks ?
> > >>> >>
> > >>> >    map tasks takes from 5 second to 2 minutes
> > >>> >    reducer (insertion to hbase) takes -- ~3 hours
> > >>> >
> > >>> >
> > >>> >> Suggestions:
> > >>> >> A)  Your name node could be 32 GB, 2TB Disk. Make sure it is an
> > >>> enterprise
> > >>> >> class server and also backup to an NFS mount.
> > >>> >> B)  Also have a decent machine as the checkpoint name node. It could
> > be
> > >>> >> similar to the task nodes
> > >>> >> B)  I assume by Master Machine, you mean Job Tracker. It could be
> > >>> similar
> > >>> >> to the Task Trackers - 16/24 GB memory, with 4-8 TB disk
> > >>> >> C)  As Jean-Daniel pointed out 500GB (with more spindles) is what I
> > >>> would
> > >>> >> also recommend. But it also depends on your primary data,
> > intermediate
> > >>> >> data and final data size. 1 or 2 TB disks are also fine, because
> > they
> > >>> give
> > >>> >> you more strage. I assume you have the default replication of 3
> > >>> >> D)  A 1Gb dedicated network would be good. As there are only ~25
> > >>> machines,
> > >>> >> you can hang them off of a good Gb switch. Consider 10Gb if there is
> > >>> too
> > >>> >> much intermediate data traffic, in the future.
> > >>> >> Cheers
> > >>> >> <k/>
> > >>> >>
> > >>> >> On 11/21/10 Sun Nov 21, 10, "Oleg Ruchovets" <or...@gmail.com>
> > >>> wrote:
> > >>> >>
> > >>> >> >Hi all,
> > >>> >> >After testing HBase for few months with very light configurations
> >  (5
> > >>> >> >machines, 2 TB disk, 8 GB RAM), we are now planing for production.
> > >>> >> >Our Load -
> > >>> >> >1) 50GB log files to process per day by Map/Reduce jobs.
> > >>> >> >2)  Insert 4-5GB to 3 tables in hbase.
> > >>> >> >3) Run 10-20 scans per day (scanning about 20 regions in a table).
> > >>> >> >All this should run in parallel.
> > >>> >> >Our current configuration can't cope with this load and we are
> > having
> > >>> many
> > >>> >> >stability issues.
> > >>> >> >
> > >>> >> >This is what we have in mind :
> > >>> >> >1. Master machine - 32 GB, 4 TB, Two quad core CPUs.
> > >>> >> >2. Name node - 16 GB, 2TB, Two quad core CPUs.
> > >>> >> >we plan to have up to 20 name servers (starting with 5).
> > >>> >> >
> > >>> >> >We already read
> > >>> >> >
> > >>> >>
> > >>>
> > http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-ba
> > >>> >> >sic-hardware-recommendations/
> > >>> >> >.
> > >>> >> >
> > >>> >> >We would appreciate your feedback on our proposed configuration.
> > >>> >> >
> > >>> >> >
> > >>> >> >Regards Oleg & Lior
> > >>> >>
> > >>> >>
> > >>> >>
> > >>> >
> > >>>
> > >>
> > >>
> > >
> >

Re: Hadoop/HBase hardware requirement

Posted by Lior Schachter <li...@infolinks.com>.
I believe our bottleneck is currently with memory.
Most of the CPU is dedicated to parsing and working with gzip files (we
don't have heavy computational tasks).
But on the other hand, more memory and disk mean we can run more M/R jobs and
scans, which need more CPU....

Lior


On Mon, Nov 22, 2010 at 3:22 PM, Lars George <la...@gmail.com> wrote:

> Hi Lior,
>
> Depends on your load, is it IO or CPU bound? Sounds like IO or Disk
> from the above, right? I would opt for the more machines! This will
> spread the load better across the cluster. And you can always add more
> disks in v2 of your setup.
>
> Lars
>
> On Mon, Nov 22, 2010 at 1:56 PM, Lior Schachter <li...@infolinks.com>
> wrote:
> > And another more concrete question:
> > lets say that on every machine with two quad core CPUs, 4T and 16GB I can
> > buy 2 machines with one quad, 2T, 16GB.
> >
> > Which configuration should I choose ?
> >
> > Lior
> >
> > On Mon, Nov 22, 2010 at 2:27 PM, Lior Schachter <li...@infolinks.com>
> wrote:
> >
> >> Hi all, Thanks for your input and assistance.
> >>
> >>
> >> From your answers I understand that:
> >> 1. more is better but our configuration might work.
> >> 2. there are small tweaks we can do that will improve our configuration
> >> (like having 4x500GB disks).
> >> 3. use monitoring (like Ganglia) to find the bottlenecks.
> >>
> >> For me, The question here is how to balance between our current budget
> and
> >> system stability (and performance).
> >> I agree that more memory and more disk space will improve our
> >> responsiveness but on the other hand our system is NOT expected to be
> >> real-time (but rather a back office analytics with few hours delay).
> >>
> >> This is a crucial point since the proposed configurations we found in
> the
> >> web don't distinguish between real-time configurations and back-office
> >> configurations. To build a real-time cluster with 20 nodes will cost
> around
> >> 200-300K (in Israel) this is similar to the price of a quite strong
> Oracle
> >> cluster... so my boss (the CTO) was partially right when telling me -
> but
> >> you said it would be cheap !! very cheap :)
> >>
> >> I believe that more money will come when we show the viability of the
> >> system... I also read that heterogeneous clusters are common.
> >>
> >> It will help a lot if you can provide your configurations and system
> >> characteristics (maybe in a Wiki page).
> >> It will also help to get more of the "small tweaks" that you found
> helpful.
> >>
> >>
> >> Lior Schachter
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Mon, Nov 22, 2010 at 1:33 PM, Lars George <lars.george@gmail.com
> >wrote:
> >>
> >>> Oleg,
> >>>
> >>> Do you have Ganglia or some other graphing tool running against the
> >>> cluster? It gives you metrics that are crucial here, for example the
> >>> load on Hadoop and its DataNodes as well as insertion rates etc. on
> >>> HBase. What is also interesting is the compaction queue to see if the
> >>> cluster is going slow.
> >>>
> >>> Did you try loading from an empty system to a loaded one? Or was it
> >>> already filled and you are trying to add more? Are you spreading the
> >>> load across servers or are you using sequential keys that tax only one
> >>> server at a time?
> >>>
> >>> 16GB should work, but is not ideal. The various daemons simply need
> >>> room to breathe. But that said, I have personally started with 12GB
> >>> even and it worked.
> >>>
> >>> Lars
> >>>
> >>> On Mon, Nov 22, 2010 at 12:17 PM, Oleg Ruchovets <oruchovets@gmail.com
> >
> >>> wrote:
> >>> > On Sun, Nov 21, 2010 at 10:39 PM, Krishna Sankar <
> ksankar42@gmail.com
> >>> >wrote:
> >>> >
> >>> >> Oleg & Lior,
> >>> >>
> >>> >> Couple of questions & couple of suggestions to ponder:
> >>> >> A)  When you say 20 Name Servers, I assume you are talking about 20
> >>> Task
> >>> >> Servers
> >>> >>
> >>> >
> >>> > Yes
> >>> >
> >>> >
> >>> >> B)  What type are your M/R jobs ? Compute Intensive vs. storage
> >>> intensive ?
> >>> >>
> >>> >
> >>> > M/R -- most of it -- it is a parsing stuff , result of m/r  5% - 10%
> >>> stores
> >>> > to hbase
> >>> >
> >>> >
> >>> >> C)  What is your Data growth ?
> >>> >>
> >>> >
> >>> >  currently we have 50GB per day , it could be ~150GB.
> >>> >
> >>> >
> >>> >> D)  With the current jobs, are you saturating RAM ? CPU ? Or storage
> ?
> >>> >>
> >>> >    Map phase takes 100% CPU consumption since it is a parsing and
> input
> >>> > files are  gz.
> >>> >    Definitely have a memory issues.
> >>> >
> >>> >
> >>> >> Ganglia/Hadoop metrics should tell.
> >>> >> E)  Also are your jobs long running or short tasks ?
> >>> >>
> >>> >    map tasks takes from 5 second to 2 minutes
> >>> >    reducer (insertion to hbase) takes -- ~3 hours
> >>> >
> >>> >
> >>> >> Suggestions:
> >>> >> A)  Your name node could be 32 GB, 2TB Disk. Make sure it is an
> >>> enterprise
> >>> >> class server and also backup to an NFS mount.
> >>> >> B)  Also have a decent machine as the checkpoint name node. It could
> be
> >>> >> similar to the task nodes
> >>> >> B)  I assume by Master Machine, you mean Job Tracker. It could be
> >>> similar
> >>> >> to the Task Trackers - 16/24 GB memory, with 4-8 TB disk
> >>> >> C)  As Jean-Daniel pointed out 500GB (with more spindles) is what I
> >>> would
> >>> >> also recommend. But it also depends on your primary data,
> intermediate
> >>> >> data and final data size. 1 or 2 TB disks are also fine, because
> they
> >>> give
> >>> >> you more strage. I assume you have the default replication of 3
> >>> >> D)  A 1Gb dedicated network would be good. As there are only ~25
> >>> machines,
> >>> >> you can hang them off of a good Gb switch. Consider 10Gb if there is
> >>> too
> >>> >> much intermediate data traffic, in the future.
> >>> >> Cheers
> >>> >> <k/>
> >>> >>
> >>> >> On 11/21/10 Sun Nov 21, 10, "Oleg Ruchovets" <or...@gmail.com>
> >>> wrote:
> >>> >>
> >>> >> >Hi all,
> >>> >> >After testing HBase for few months with very light configurations
>  (5
> >>> >> >machines, 2 TB disk, 8 GB RAM), we are now planing for production.
> >>> >> >Our Load -
> >>> >> >1) 50GB log files to process per day by Map/Reduce jobs.
> >>> >> >2)  Insert 4-5GB to 3 tables in hbase.
> >>> >> >3) Run 10-20 scans per day (scanning about 20 regions in a table).
> >>> >> >All this should run in parallel.
> >>> >> >Our current configuration can't cope with this load and we are
> having
> >>> many
> >>> >> >stability issues.
> >>> >> >
> >>> >> >This is what we have in mind :
> >>> >> >1. Master machine - 32 GB, 4 TB, Two quad core CPUs.
> >>> >> >2. Name node - 16 GB, 2TB, Two quad core CPUs.
> >>> >> >we plan to have up to 20 name servers (starting with 5).
> >>> >> >
> >>> >> >We already read
> >>> >> >
> >>> >>
> >>>
> http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-ba
> >>> >> >sic-hardware-recommendations/
> >>> >> >.
> >>> >> >
> >>> >> >We would appreciate your feedback on our proposed configuration.
> >>> >> >
> >>> >> >
> >>> >> >Regards Oleg & Lior
> >>> >>
> >>> >>
> >>> >>
> >>> >
> >>>
> >>
> >>
> >
>

Re: Hadoop/HBase hardware requirement

Posted by Lars George <la...@gmail.com>.
Hi Lior,

It depends on your load: is it IO or CPU bound? It sounds like IO or disk
from the above, right? I would opt for more machines! This will
spread the load better across the cluster. And you can always add more
disks in v2 of your setup.

Lars

On Mon, Nov 22, 2010 at 1:56 PM, Lior Schachter <li...@infolinks.com> wrote:
> And another more concrete question:
> lets say that on every machine with two quad core CPUs, 4T and 16GB I can
> buy 2 machines with one quad, 2T, 16GB.
>
> Which configuration should I choose ?
>
> Lior
>
> On Mon, Nov 22, 2010 at 2:27 PM, Lior Schachter <li...@infolinks.com> wrote:
>
>> Hi all, Thanks for your input and assistance.
>>
>>
>> From your answers I understand that:
>> 1. more is better but our configuration might work.
>> 2. there are small tweaks we can do that will improve our configuration
>> (like having 4x500GB disks).
>> 3. use monitoring (like Ganglia) to find the bottlenecks.
>>
>> For me, The question here is how to balance between our current budget and
>> system stability (and performance).
>> I agree that more memory and more disk space will improve our
>> responsiveness but on the other hand our system is NOT expected to be
>> real-time (but rather a back office analytics with few hours delay).
>>
>> This is a crucial point since the proposed configurations we found in the
>> web don't distinguish between real-time configurations and back-office
>> configurations. To build a real-time cluster with 20 nodes will cost around
>> 200-300K (in Israel) this is similar to the price of a quite strong Oracle
>> cluster... so my boss (the CTO) was partially right when telling me - but
>> you said it would be cheap !! very cheap :)
>>
>> I believe that more money will come when we show the viability of the
>> system... I also read that heterogeneous clusters are common.
>>
>> It will help a lot if you can provide your configurations and system
>> characteristics (maybe in a Wiki page).
>> It will also help to get more of the "small tweaks" that you found helpful.
>>
>>
>> Lior Schachter
>>
>>
>>
>>
>>
>>
>>
>>
>> On Mon, Nov 22, 2010 at 1:33 PM, Lars George <la...@gmail.com>wrote:
>>
>>> Oleg,
>>>
>>> Do you have Ganglia or some other graphing tool running against the
>>> cluster? It gives you metrics that are crucial here, for example the
>>> load on Hadoop and its DataNodes as well as insertion rates etc. on
>>> HBase. What is also interesting is the compaction queue to see if the
>>> cluster is going slow.
>>>
>>> Did you try loading from an empty system to a loaded one? Or was it
>>> already filled and you are trying to add more? Are you spreading the
>>> load across servers or are you using sequential keys that tax only one
>>> server at a time?
>>>
>>> 16GB should work, but is not ideal. The various daemons simply need
>>> room to breathe. But that said, I have personally started with 12GB
>>> even and it worked.
>>>
>>> Lars
>>>
>>> On Mon, Nov 22, 2010 at 12:17 PM, Oleg Ruchovets <or...@gmail.com>
>>> wrote:
>>> > On Sun, Nov 21, 2010 at 10:39 PM, Krishna Sankar <ksankar42@gmail.com
>>> >wrote:
>>> >
>>> >> Oleg & Lior,
>>> >>
>>> >> Couple of questions & couple of suggestions to ponder:
>>> >> A)  When you say 20 Name Servers, I assume you are talking about 20
>>> Task
>>> >> Servers
>>> >>
>>> >
>>> > Yes
>>> >
>>> >
>>> >> B)  What type are your M/R jobs ? Compute Intensive vs. storage
>>> intensive ?
>>> >>
>>> >
>>> > M/R -- most of it -- it is a parsing stuff , result of m/r  5% - 10%
>>> stores
>>> > to hbase
>>> >
>>> >
>>> >> C)  What is your Data growth ?
>>> >>
>>> >
>>> >  currently we have 50GB per day , it could be ~150GB.
>>> >
>>> >
>>> >> D)  With the current jobs, are you saturating RAM ? CPU ? Or storage ?
>>> >>
>>> >    Map phase takes 100% CPU consumption since it is a parsing and input
>>> > files are  gz.
>>> >    Definitely have a memory issues.
>>> >
>>> >
>>> >> Ganglia/Hadoop metrics should tell.
>>> >> E)  Also are your jobs long running or short tasks ?
>>> >>
>>> >    map tasks takes from 5 second to 2 minutes
>>> >    reducer (insertion to hbase) takes -- ~3 hours
>>> >
>>> >
>>> >> Suggestions:
>>> >> A)  Your name node could be 32 GB, 2TB Disk. Make sure it is an
>>> enterprise
>>> >> class server and also backup to an NFS mount.
>>> >> B)  Also have a decent machine as the checkpoint name node. It could be
>>> >> similar to the task nodes
>>> >> B)  I assume by Master Machine, you mean Job Tracker. It could be
>>> similar
>>> >> to the Task Trackers - 16/24 GB memory, with 4-8 TB disk
>>> >> C)  As Jean-Daniel pointed out 500GB (with more spindles) is what I
>>> would
>>> >> also recommend. But it also depends on your primary data, intermediate
>>> >> data and final data size. 1 or 2 TB disks are also fine, because they
>>> give
>>> >> you more strage. I assume you have the default replication of 3
>>> >> D)  A 1Gb dedicated network would be good. As there are only ~25
>>> machines,
>>> >> you can hang them off of a good Gb switch. Consider 10Gb if there is
>>> too
>>> >> much intermediate data traffic, in the future.
>>> >> Cheers
>>> >> <k/>
>>> >>
>>> >> On 11/21/10 Sun Nov 21, 10, "Oleg Ruchovets" <or...@gmail.com>
>>> wrote:
>>> >>
>>> >> >Hi all,
>>> >> >After testing HBase for few months with very light configurations  (5
>>> >> >machines, 2 TB disk, 8 GB RAM), we are now planing for production.
>>> >> >Our Load -
>>> >> >1) 50GB log files to process per day by Map/Reduce jobs.
>>> >> >2)  Insert 4-5GB to 3 tables in hbase.
>>> >> >3) Run 10-20 scans per day (scanning about 20 regions in a table).
>>> >> >All this should run in parallel.
>>> >> >Our current configuration can't cope with this load and we are having
>>> many
>>> >> >stability issues.
>>> >> >
>>> >> >This is what we have in mind :
>>> >> >1. Master machine - 32 GB, 4 TB, Two quad core CPUs.
>>> >> >2. Name node - 16 GB, 2TB, Two quad core CPUs.
>>> >> >we plan to have up to 20 name servers (starting with 5).
>>> >> >
>>> >> >We already read
>>> >> >
>>> >>
>>> http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-ba
>>> >> >sic-hardware-recommendations/
>>> >> >.
>>> >> >
>>> >> >We would appreciate your feedback on our proposed configuration.
>>> >> >
>>> >> >
>>> >> >Regards Oleg & Lior
>>> >>
>>> >>
>>> >>
>>> >
>>>
>>
>>
>

Re: Hadoop/HBase hardware requirement

Posted by Lior Schachter <li...@infolinks.com>.
And another, more concrete question:
let's say that for every machine with two quad-core CPUs, 4TB and 16GB, I can
instead buy 2 machines with one quad-core CPU, 2TB and 16GB each.

Which configuration should I choose?

Lior

On Mon, Nov 22, 2010 at 2:27 PM, Lior Schachter <li...@infolinks.com> wrote:

> Hi all, Thanks for your input and assistance.
>
>
> From your answers I understand that:
> 1. more is better but our configuration might work.
> 2. there are small tweaks we can do that will improve our configuration
> (like having 4x500GB disks).
> 3. use monitoring (like Ganglia) to find the bottlenecks.
>
> For me, The question here is how to balance between our current budget and
> system stability (and performance).
> I agree that more memory and more disk space will improve our
> responsiveness but on the other hand our system is NOT expected to be
> real-time (but rather a back office analytics with few hours delay).
>
> This is a crucial point since the proposed configurations we found in the
> web don't distinguish between real-time configurations and back-office
> configurations. To build a real-time cluster with 20 nodes will cost around
> 200-300K (in Israel) this is similar to the price of a quite strong Oracle
> cluster... so my boss (the CTO) was partially right when telling me - but
> you said it would be cheap !! very cheap :)
>
> I believe that more money will come when we show the viability of the
> system... I also read that heterogeneous clusters are common.
>
> It will help a lot if you can provide your configurations and system
> characteristics (maybe in a Wiki page).
> It will also help to get more of the "small tweaks" that you found helpful.
>
>
> Lior Schachter
>
>
>
>
>
>
>
>
> On Mon, Nov 22, 2010 at 1:33 PM, Lars George <la...@gmail.com>wrote:
>
>> Oleg,
>>
>> Do you have Ganglia or some other graphing tool running against the
>> cluster? It gives you metrics that are crucial here, for example the
>> load on Hadoop and its DataNodes as well as insertion rates etc. on
>> HBase. What is also interesting is the compaction queue to see if the
>> cluster is going slow.
>>
>> Did you try loading from an empty system to a loaded one? Or was it
>> already filled and you are trying to add more? Are you spreading the
>> load across servers or are you using sequential keys that tax only one
>> server at a time?
>>
>> 16GB should work, but is not ideal. The various daemons simply need
>> room to breathe. But that said, I have personally started with 12GB
>> even and it worked.
>>
>> Lars
>>
>> On Mon, Nov 22, 2010 at 12:17 PM, Oleg Ruchovets <or...@gmail.com>
>> wrote:
>> > On Sun, Nov 21, 2010 at 10:39 PM, Krishna Sankar <ksankar42@gmail.com
>> >wrote:
>> >
>> >> Oleg & Lior,
>> >>
>> >> Couple of questions & couple of suggestions to ponder:
>> >> A)  When you say 20 Name Servers, I assume you are talking about 20
>> Task
>> >> Servers
>> >>
>> >
>> > Yes
>> >
>> >
>> >> B)  What type are your M/R jobs ? Compute Intensive vs. storage
>> intensive ?
>> >>
>> >
>> > M/R -- most of it -- it is a parsing stuff , result of m/r  5% - 10%
>> stores
>> > to hbase
>> >
>> >
>> >> C)  What is your Data growth ?
>> >>
>> >
>> >  currently we have 50GB per day , it could be ~150GB.
>> >
>> >
>> >> D)  With the current jobs, are you saturating RAM ? CPU ? Or storage ?
>> >>
>> >    Map phase takes 100% CPU consumption since it is a parsing and input
>> > files are  gz.
>> >    Definitely have a memory issues.
>> >
>> >
>> >> Ganglia/Hadoop metrics should tell.
>> >> E)  Also are your jobs long running or short tasks ?
>> >>
>> >    map tasks takes from 5 second to 2 minutes
>> >    reducer (insertion to hbase) takes -- ~3 hours
>> >
>> >
>> >> Suggestions:
>> >> A)  Your name node could be 32 GB, 2TB Disk. Make sure it is an
>> enterprise
>> >> class server and also backup to an NFS mount.
>> >> B)  Also have a decent machine as the checkpoint name node. It could be
>> >> similar to the task nodes
>> >> B)  I assume by Master Machine, you mean Job Tracker. It could be
>> similar
>> >> to the Task Trackers - 16/24 GB memory, with 4-8 TB disk
>> >> C)  As Jean-Daniel pointed out 500GB (with more spindles) is what I
>> would
>> >> also recommend. But it also depends on your primary data, intermediate
>> >> data and final data size. 1 or 2 TB disks are also fine, because they
>> give
>> >> you more strage. I assume you have the default replication of 3
>> >> D)  A 1Gb dedicated network would be good. As there are only ~25
>> machines,
>> >> you can hang them off of a good Gb switch. Consider 10Gb if there is
>> too
>> >> much intermediate data traffic, in the future.
>> >> Cheers
>> >> <k/>
>> >>
>> >> On 11/21/10 Sun Nov 21, 10, "Oleg Ruchovets" <or...@gmail.com>
>> wrote:
>> >>
>> >> >Hi all,
>> >> >After testing HBase for few months with very light configurations  (5
>> >> >machines, 2 TB disk, 8 GB RAM), we are now planing for production.
>> >> >Our Load -
>> >> >1) 50GB log files to process per day by Map/Reduce jobs.
>> >> >2)  Insert 4-5GB to 3 tables in hbase.
>> >> >3) Run 10-20 scans per day (scanning about 20 regions in a table).
>> >> >All this should run in parallel.
>> >> >Our current configuration can't cope with this load and we are having
>> many
>> >> >stability issues.
>> >> >
>> >> >This is what we have in mind :
>> >> >1. Master machine - 32 GB, 4 TB, Two quad core CPUs.
>> >> >2. Name node - 16 GB, 2TB, Two quad core CPUs.
>> >> >we plan to have up to 20 name servers (starting with 5).
>> >> >
>> >> >We already read
>> >> >
>> >>
>> http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-ba
>> >> >sic-hardware-recommendations/
>> >> >.
>> >> >
>> >> >We would appreciate your feedback on our proposed configuration.
>> >> >
>> >> >
>> >> >Regards Oleg & Lior
>> >>
>> >>
>> >>
>> >
>>
>
>

Re: Hadoop/HBase hardware requirement

Posted by Lior Schachter <li...@infolinks.com>.
Hi Lars,
I agree with every sentence you wrote (and that's why we chose HBase).
However, from a managerial point of view the question of the initial
investment is very important (especially when considering a new technology).

Lior


p.s. The price is in USD ....

On Mon, Nov 22, 2010 at 2:43 PM, Lars George <la...@gmail.com> wrote:

> Hi Lior,
>
> I can only hope you state this in Schekel! But 20 nodes with Hadoop
> can do quite a lot and you cannot compare a single Oracle box with a
> 20 node Hadoop cluster as they serve slightly different use-cases. You
> need to make a commitment to what you want to achieve with HBase and
> that growth is the most important factor. Scaling Oracle is really
> expensive while HBase/Hadoop is not in comparison and costs are
> linear, while with Oracle more exponential.
>
> Lars
>
> On Mon, Nov 22, 2010 at 1:27 PM, Lior Schachter <li...@infolinks.com>
> wrote:
> > Hi all, Thanks for your input and assistance.
> >
> >
> > From your answers I understand that:
> > 1. more is better but our configuration might work.
> > 2. there are small tweaks we can do that will improve our configuration
> > (like having 4x500GB disks).
> > 3. use monitoring (like Ganglia) to find the bottlenecks.
> >
> > For me, The question here is how to balance between our current budget
> and
> > system stability (and performance).
> > I agree that more memory and more disk space will improve our
> responsiveness
> > but on the other hand our system is NOT expected to be real-time (but
> rather
> > a back office analytics with few hours delay).
> >
> > This is a crucial point since the proposed configurations we found in the
> > web don't distinguish between real-time configurations and back-office
> > configurations. To build a real-time cluster with 20 nodes will cost
> around
> > 200-300K (in Israel) this is similar to the price of a quite strong
> Oracle
> > cluster... so my boss (the CTO) was partially right when telling me - but
> > you said it would be cheap !! very cheap :)
> >
> > I believe that more money will come when we show the viability of the
> > system... I also read that heterogeneous clusters are common.
> >
> > It will help a lot if you can provide your configurations and system
> > characteristics (maybe in a Wiki page).
> > It will also help to get more of the "small tweaks" that you found
> helpful.
> >
> >
> > Lior Schachter
> >
> >
> >
> >
> >
> >
> >
> > On Mon, Nov 22, 2010 at 1:33 PM, Lars George <la...@gmail.com>
> wrote:
> >
> >> Oleg,
> >>
> >> Do you have Ganglia or some other graphing tool running against the
> >> cluster? It gives you metrics that are crucial here, for example the
> >> load on Hadoop and its DataNodes as well as insertion rates etc. on
> >> HBase. What is also interesting is the compaction queue to see if the
> >> cluster is going slow.
> >>
> >> Did you try loading from an empty system to a loaded one? Or was it
> >> already filled and you are trying to add more? Are you spreading the
> >> load across servers or are you using sequential keys that tax only one
> >> server at a time?
> >>
> >> 16GB should work, but is not ideal. The various daemons simply need
> >> room to breathe. But that said, I have personally started with 12GB
> >> even and it worked.
> >>
> >> Lars
> >>
> >> On Mon, Nov 22, 2010 at 12:17 PM, Oleg Ruchovets <or...@gmail.com>
> >> wrote:
> >> > On Sun, Nov 21, 2010 at 10:39 PM, Krishna Sankar <ksankar42@gmail.com
> >> >wrote:
> >> >
> >> >> Oleg & Lior,
> >> >>
> >> >> Couple of questions & couple of suggestions to ponder:
> >> >> A)  When you say 20 Name Servers, I assume you are talking about 20
> Task
> >> >> Servers
> >> >>
> >> >
> >> > Yes
> >> >
> >> >
> >> >> B)  What type are your M/R jobs ? Compute Intensive vs. storage
> >> intensive ?
> >> >>
> >> >
> >> > M/R -- most of it -- it is a parsing stuff , result of m/r  5% - 10%
> >> stores
> >> > to hbase
> >> >
> >> >
> >> >> C)  What is your Data growth ?
> >> >>
> >> >
> >> >  currently we have 50GB per day , it could be ~150GB.
> >> >
> >> >
> >> >> D)  With the current jobs, are you saturating RAM ? CPU ? Or storage
> ?
> >> >>
> >> >    Map phase takes 100% CPU consumption since it is a parsing and
> input
> >> > files are  gz.
> >> >    Definitely have a memory issues.
> >> >
> >> >
> >> >> Ganglia/Hadoop metrics should tell.
> >> >> E)  Also are your jobs long running or short tasks ?
> >> >>
> >> >    map tasks takes from 5 second to 2 minutes
> >> >    reducer (insertion to hbase) takes -- ~3 hours
> >> >
> >> >
> >> >> Suggestions:
> >> >> A)  Your name node could be 32 GB, 2TB Disk. Make sure it is an
> >> enterprise
> >> >> class server and also backup to an NFS mount.
> >> >> B)  Also have a decent machine as the checkpoint name node. It could
> be
> >> >> similar to the task nodes
> >> >> B)  I assume by Master Machine, you mean Job Tracker. It could be
> >> similar
> >> >> to the Task Trackers - 16/24 GB memory, with 4-8 TB disk
> >> >> C)  As Jean-Daniel pointed out 500GB (with more spindles) is what I
> >> would
> >> >> also recommend. But it also depends on your primary data,
> intermediate
> >> >> data and final data size. 1 or 2 TB disks are also fine, because they
> >> give
> >> >> you more strage. I assume you have the default replication of 3
> >> >> D)  A 1Gb dedicated network would be good. As there are only ~25
> >> machines,
> >> >> you can hang them off of a good Gb switch. Consider 10Gb if there is
> too
> >> >> much intermediate data traffic, in the future.
> >> >> Cheers
> >> >> <k/>
> >> >>
> >> >> On 11/21/10 Sun Nov 21, 10, "Oleg Ruchovets" <or...@gmail.com>
> >> wrote:
> >> >>
> >> >> >Hi all,
> >> >> >After testing HBase for few months with very light configurations
>  (5
> >> >> >machines, 2 TB disk, 8 GB RAM), we are now planing for production.
> >> >> >Our Load -
> >> >> >1) 50GB log files to process per day by Map/Reduce jobs.
> >> >> >2)  Insert 4-5GB to 3 tables in hbase.
> >> >> >3) Run 10-20 scans per day (scanning about 20 regions in a table).
> >> >> >All this should run in parallel.
> >> >> >Our current configuration can't cope with this load and we are
> having
> >> many
> >> >> >stability issues.
> >> >> >
> >> >> >This is what we have in mind :
> >> >> >1. Master machine - 32 GB, 4 TB, Two quad core CPUs.
> >> >> >2. Name node - 16 GB, 2TB, Two quad core CPUs.
> >> >> >we plan to have up to 20 name servers (starting with 5).
> >> >> >
> >> >> >We already read
> >> >> >
> >> >>
> >>
> http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-ba
> >> >> >sic-hardware-recommendations/
> >> >> >.
> >> >> >
> >> >> >We would appreciate your feedback on our proposed configuration.
> >> >> >
> >> >> >
> >> >> >Regards Oleg & Lior
> >> >>
> >> >>
> >> >>
> >> >
> >>
> >
>

Re: Hadoop/HBase hardware requirement

Posted by Lars George <la...@gmail.com>.
Hi Lior,

I can only hope you state this in Shekels! But 20 nodes with Hadoop
can do quite a lot, and you cannot compare a single Oracle box with a
20-node Hadoop cluster, as they serve slightly different use-cases. You
need to make a commitment to what you want to achieve with HBase, and
growth is the most important factor there. Scaling Oracle is really
expensive while HBase/Hadoop, in comparison, is not; its costs are
linear, while with Oracle they are more exponential.

Lars

On Mon, Nov 22, 2010 at 1:27 PM, Lior Schachter <li...@infolinks.com> wrote:
> Hi all, Thanks for your input and assistance.
>
>
> From your answers I understand that:
> 1. more is better but our configuration might work.
> 2. there are small tweaks we can do that will improve our configuration
> (like having 4x500GB disks).
> 3. use monitoring (like Ganglia) to find the bottlenecks.
>
> For me, The question here is how to balance between our current budget and
> system stability (and performance).
> I agree that more memory and more disk space will improve our responsiveness
> but on the other hand our system is NOT expected to be real-time (but rather
> a back office analytics with few hours delay).
>
> This is a crucial point since the proposed configurations we found in the
> web don't distinguish between real-time configurations and back-office
> configurations. To build a real-time cluster with 20 nodes will cost around
> 200-300K (in Israel) this is similar to the price of a quite strong Oracle
> cluster... so my boss (the CTO) was partially right when telling me - but
> you said it would be cheap !! very cheap :)
>
> I believe that more money will come when we show the viability of the
> system... I also read that heterogeneous clusters are common.
>
> It will help a lot if you can provide your configurations and system
> characteristics (maybe in a Wiki page).
> It will also help to get more of the "small tweaks" that you found helpful.
>
>
> Lior Schachter
>
>
>
>
>
>
>
> On Mon, Nov 22, 2010 at 1:33 PM, Lars George <la...@gmail.com> wrote:
>
>> Oleg,
>>
>> Do you have Ganglia or some other graphing tool running against the
>> cluster? It gives you metrics that are crucial here, for example the
>> load on Hadoop and its DataNodes as well as insertion rates etc. on
>> HBase. What is also interesting is the compaction queue to see if the
>> cluster is going slow.
>>
>> Did you try loading from an empty system to a loaded one? Or was it
>> already filled and you are trying to add more? Are you spreading the
>> load across servers or are you using sequential keys that tax only one
>> server at a time?
>>
>> 16GB should work, but is not ideal. The various daemons simply need
>> room to breathe. But that said, I have personally started with 12GB
>> even and it worked.
>>
>> Lars
>>
>> On Mon, Nov 22, 2010 at 12:17 PM, Oleg Ruchovets <or...@gmail.com>
>> wrote:
>> > On Sun, Nov 21, 2010 at 10:39 PM, Krishna Sankar <ksankar42@gmail.com
>> >wrote:
>> >
>> >> Oleg & Lior,
>> >>
>> >> Couple of questions & couple of suggestions to ponder:
>> >> A)  When you say 20 Name Servers, I assume you are talking about 20 Task
>> >> Servers
>> >>
>> >
>> > Yes
>> >
>> >
>> >> B)  What type are your M/R jobs ? Compute Intensive vs. storage
>> intensive ?
>> >>
>> >
>> > M/R -- most of it -- it is a parsing stuff , result of m/r  5% - 10%
>> stores
>> > to hbase
>> >
>> >
>> >> C)  What is your Data growth ?
>> >>
>> >
>> >  currently we have 50GB per day , it could be ~150GB.
>> >
>> >
>> >> D)  With the current jobs, are you saturating RAM ? CPU ? Or storage ?
>> >>
>> >    Map phase takes 100% CPU consumption since it is a parsing and input
>> > files are  gz.
>> >    Definitely have a memory issues.
>> >
>> >
>> >> Ganglia/Hadoop metrics should tell.
>> >> E)  Also are your jobs long running or short tasks ?
>> >>
>> >    map tasks takes from 5 second to 2 minutes
>> >    reducer (insertion to hbase) takes -- ~3 hours
>> >
>> >
>> >> Suggestions:
>> >> A)  Your name node could be 32 GB, 2TB Disk. Make sure it is an
>> enterprise
>> >> class server and also backup to an NFS mount.
>> >> B)  Also have a decent machine as the checkpoint name node. It could be
>> >> similar to the task nodes
>> >> B)  I assume by Master Machine, you mean Job Tracker. It could be
>> similar
>> >> to the Task Trackers - 16/24 GB memory, with 4-8 TB disk
>> >> C)  As Jean-Daniel pointed out 500GB (with more spindles) is what I
>> would
>> >> also recommend. But it also depends on your primary data, intermediate
>> >> data and final data size. 1 or 2 TB disks are also fine, because they
>> give
>> >> you more strage. I assume you have the default replication of 3
>> >> D)  A 1Gb dedicated network would be good. As there are only ~25
>> machines,
>> >> you can hang them off of a good Gb switch. Consider 10Gb if there is too
>> >> much intermediate data traffic, in the future.
>> >> Cheers
>> >> <k/>
>> >>
>> >> On 11/21/10 Sun Nov 21, 10, "Oleg Ruchovets" <or...@gmail.com>
>> wrote:
>> >>
>> >> >Hi all,
>> >> >After testing HBase for few months with very light configurations  (5
>> >> >machines, 2 TB disk, 8 GB RAM), we are now planing for production.
>> >> >Our Load -
>> >> >1) 50GB log files to process per day by Map/Reduce jobs.
>> >> >2)  Insert 4-5GB to 3 tables in hbase.
>> >> >3) Run 10-20 scans per day (scanning about 20 regions in a table).
>> >> >All this should run in parallel.
>> >> >Our current configuration can't cope with this load and we are having
>> many
>> >> >stability issues.
>> >> >
>> >> >This is what we have in mind :
>> >> >1. Master machine - 32 GB, 4 TB, Two quad core CPUs.
>> >> >2. Name node - 16 GB, 2TB, Two quad core CPUs.
>> >> >we plan to have up to 20 name servers (starting with 5).
>> >> >
>> >> >We already read
>> >> >
>> >>
>> http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-ba
>> >> >sic-hardware-recommendations/
>> >> >.
>> >> >
>> >> >We would appreciate your feedback on our proposed configuration.
>> >> >
>> >> >
>> >> >Regards Oleg & Lior
>> >>
>> >>
>> >>
>> >
>>
>

Re: Hadoop/HBase hardware requirement

Posted by Lior Schachter <li...@infolinks.com>.
Hi all, Thanks for your input and assistance.


From your answers I understand that:
1. more is better but our configuration might work.
2. there are small tweaks we can do that will improve our configuration
(like having 4x500GB disks).
3. use monitoring (like Ganglia) to find the bottlenecks.

For me, the question here is how to balance our current budget against
system stability (and performance).
I agree that more memory and more disk space will improve our responsiveness,
but on the other hand our system is NOT expected to be real-time (it is rather
back-office analytics with a few hours' delay).

This is a crucial point, since the proposed configurations we found on the
web don't distinguish between real-time and back-office configurations.
Building a real-time cluster with 20 nodes will cost around 200-300K
(in Israel), which is similar to the price of a quite strong Oracle
cluster... so my boss (the CTO) was partially right when telling me - but
you said it would be cheap!! Very cheap :)

I believe that more money will come when we show the viability of the
system... I also read that heterogeneous clusters are common.

It will help a lot if you can provide your configurations and system
characteristics (maybe in a Wiki page).
It will also help to get more of the "small tweaks" that you found helpful.


Lior Schachter







On Mon, Nov 22, 2010 at 1:33 PM, Lars George <la...@gmail.com> wrote:

> Oleg,
>
> Do you have Ganglia or some other graphing tool running against the
> cluster? It gives you metrics that are crucial here, for example the
> load on Hadoop and its DataNodes as well as insertion rates etc. on
> HBase. What is also interesting is the compaction queue to see if the
> cluster is going slow.
>
> Did you try loading from an empty system to a loaded one? Or was it
> already filled and you are trying to add more? Are you spreading the
> load across servers or are you using sequential keys that tax only one
> server at a time?
>
> 16GB should work, but is not ideal. The various daemons simply need
> room to breathe. But that said, I have personally started with 12GB
> even and it worked.
>
> Lars
>
> On Mon, Nov 22, 2010 at 12:17 PM, Oleg Ruchovets <or...@gmail.com>
> wrote:
> > On Sun, Nov 21, 2010 at 10:39 PM, Krishna Sankar <ksankar42@gmail.com
> >wrote:
> >
> >> Oleg & Lior,
> >>
> >> Couple of questions & couple of suggestions to ponder:
> >> A)  When you say 20 Name Servers, I assume you are talking about 20 Task
> >> Servers
> >>
> >
> > Yes
> >
> >
> >> B)  What type are your M/R jobs ? Compute Intensive vs. storage
> intensive ?
> >>
> >
> > M/R -- most of it -- it is a parsing stuff , result of m/r  5% - 10%
> stores
> > to hbase
> >
> >
> >> C)  What is your Data growth ?
> >>
> >
> >  currently we have 50GB per day , it could be ~150GB.
> >
> >
> >> D)  With the current jobs, are you saturating RAM ? CPU ? Or storage ?
> >>
> >    Map phase takes 100% CPU consumption since it is a parsing and input
> > files are  gz.
> >    Definitely have a memory issues.
> >
> >
> >> Ganglia/Hadoop metrics should tell.
> >> E)  Also are your jobs long running or short tasks ?
> >>
> >    map tasks takes from 5 second to 2 minutes
> >    reducer (insertion to hbase) takes -- ~3 hours
> >
> >
> >> Suggestions:
> >> A)  Your name node could be 32 GB, 2TB Disk. Make sure it is an
> enterprise
> >> class server and also backup to an NFS mount.
> >> B)  Also have a decent machine as the checkpoint name node. It could be
> >> similar to the task nodes
> >> B)  I assume by Master Machine, you mean Job Tracker. It could be
> similar
> >> to the Task Trackers - 16/24 GB memory, with 4-8 TB disk
> >> C)  As Jean-Daniel pointed out 500GB (with more spindles) is what I
> would
> >> also recommend. But it also depends on your primary data, intermediate
> >> data and final data size. 1 or 2 TB disks are also fine, because they
> give
> >> you more strage. I assume you have the default replication of 3
> >> D)  A 1Gb dedicated network would be good. As there are only ~25
> machines,
> >> you can hang them off of a good Gb switch. Consider 10Gb if there is too
> >> much intermediate data traffic, in the future.
> >> Cheers
> >> <k/>
> >>
> >> On 11/21/10 Sun Nov 21, 10, "Oleg Ruchovets" <or...@gmail.com>
> wrote:
> >>
> >> >Hi all,
> >> >After testing HBase for few months with very light configurations  (5
> >> >machines, 2 TB disk, 8 GB RAM), we are now planing for production.
> >> >Our Load -
> >> >1) 50GB log files to process per day by Map/Reduce jobs.
> >> >2)  Insert 4-5GB to 3 tables in hbase.
> >> >3) Run 10-20 scans per day (scanning about 20 regions in a table).
> >> >All this should run in parallel.
> >> >Our current configuration can't cope with this load and we are having
> many
> >> >stability issues.
> >> >
> >> >This is what we have in mind :
> >> >1. Master machine - 32 GB, 4 TB, Two quad core CPUs.
> >> >2. Name node - 16 GB, 2TB, Two quad core CPUs.
> >> >we plan to have up to 20 name servers (starting with 5).
> >> >
> >> >We already read
> >> >
> >>
> http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-ba
> >> >sic-hardware-recommendations/
> >> >.
> >> >
> >> >We would appreciate your feedback on our proposed configuration.
> >> >
> >> >
> >> >Regards Oleg & Lior
> >>
> >>
> >>
> >
>

Re: Hadoop/HBase hardware requirement

Posted by Lars George <la...@gmail.com>.
Oleg,

Do you have Ganglia or some other graphing tool running against the
cluster? It gives you metrics that are crucial here, for example the
load on Hadoop and its DataNodes as well as insertion rates etc. on
HBase. What is also interesting is the compaction queue to see if the
cluster is going slow.

Did you try loading from an empty system to a loaded one? Or was it
already filled and you are trying to add more? Are you spreading the
load across servers or are you using sequential keys that tax only one
server at a time?
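
If it is the latter, one common way to spread sequential (e.g. time-based)
keys is to salt them. A minimal sketch follows; the bucket count and key
layout are just an illustration, not from your schema, and the trade-off is
that scans then have to fan out over all the buckets:

import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKeySketch {
  private static final int BUCKETS = 16;   // hypothetical, roughly the region count

  // Prefix the row key with a small hash-derived bucket so consecutive
  // writes land on different regions instead of hammering one server.
  static byte[] saltedRowKey(long timestamp, String id) {
    int bucket = Math.abs(id.hashCode() % BUCKETS);
    // "bucket|timestamp|id" -- time ordering is preserved within a bucket
    return Bytes.toBytes(String.format("%02d|%013d|%s", bucket, timestamp, id));
  }

  public static void main(String[] args) {
    System.out.println(Bytes.toString(
        saltedRowKey(System.currentTimeMillis(), "user-42")));
  }
}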

16GB should work, but is not ideal. The various daemons simply need
room to breathe. But that said, I have personally started with 12GB
even and it worked.

Lars

On Mon, Nov 22, 2010 at 12:17 PM, Oleg Ruchovets <or...@gmail.com> wrote:
> On Sun, Nov 21, 2010 at 10:39 PM, Krishna Sankar <ks...@gmail.com>wrote:
>
>> Oleg & Lior,
>>
>> Couple of questions & couple of suggestions to ponder:
>> A)  When you say 20 Name Servers, I assume you are talking about 20 Task
>> Servers
>>
>
> Yes
>
>
>> B)  What type are your M/R jobs ? Compute Intensive vs. storage intensive ?
>>
>
> M/R -- most of it -- it is a parsing stuff , result of m/r  5% - 10% stores
> to hbase
>
>
>> C)  What is your Data growth ?
>>
>
>  currently we have 50GB per day , it could be ~150GB.
>
>
>> D)  With the current jobs, are you saturating RAM ? CPU ? Or storage ?
>>
>    Map phase takes 100% CPU consumption since it is a parsing and input
> files are  gz.
>    Definitely have a memory issues.
>
>
>> Ganglia/Hadoop metrics should tell.
>> E)  Also are your jobs long running or short tasks ?
>>
>    map tasks takes from 5 second to 2 minutes
>    reducer (insertion to hbase) takes -- ~3 hours
>
>
>> Suggestions:
>> A)  Your name node could be 32 GB, 2TB Disk. Make sure it is an enterprise
>> class server and also backup to an NFS mount.
>> B)  Also have a decent machine as the checkpoint name node. It could be
>> similar to the task nodes
>> B)  I assume by Master Machine, you mean Job Tracker. It could be similar
>> to the Task Trackers - 16/24 GB memory, with 4-8 TB disk
>> C)  As Jean-Daniel pointed out 500GB (with more spindles) is what I would
>> also recommend. But it also depends on your primary data, intermediate
>> data and final data size. 1 or 2 TB disks are also fine, because they give
>> you more strage. I assume you have the default replication of 3
>> D)  A 1Gb dedicated network would be good. As there are only ~25 machines,
>> you can hang them off of a good Gb switch. Consider 10Gb if there is too
>> much intermediate data traffic, in the future.
>> Cheers
>> <k/>
>>
>> On 11/21/10 Sun Nov 21, 10, "Oleg Ruchovets" <or...@gmail.com> wrote:
>>
>> >Hi all,
>> >After testing HBase for few months with very light configurations  (5
>> >machines, 2 TB disk, 8 GB RAM), we are now planing for production.
>> >Our Load -
>> >1) 50GB log files to process per day by Map/Reduce jobs.
>> >2)  Insert 4-5GB to 3 tables in hbase.
>> >3) Run 10-20 scans per day (scanning about 20 regions in a table).
>> >All this should run in parallel.
>> >Our current configuration can't cope with this load and we are having many
>> >stability issues.
>> >
>> >This is what we have in mind :
>> >1. Master machine - 32 GB, 4 TB, Two quad core CPUs.
>> >2. Name node - 16 GB, 2TB, Two quad core CPUs.
>> >we plan to have up to 20 name servers (starting with 5).
>> >
>> >We already read
>> >
>> http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-ba
>> >sic-hardware-recommendations/
>> >.
>> >
>> >We would appreciate your feedback on our proposed configuration.
>> >
>> >
>> >Regards Oleg & Lior
>>
>>
>>
>

Re: Hadoop/HBase hardware requirement

Posted by Oleg Ruchovets <or...@gmail.com>.
On Sun, Nov 21, 2010 at 10:39 PM, Krishna Sankar <ks...@gmail.com> wrote:

> Oleg & Lior,
>
> Couple of questions & couple of suggestions to ponder:
> A)  When you say 20 Name Servers, I assume you are talking about 20 Task
> Servers
>

Yes


> B)  What type are your M/R jobs ? Compute Intensive vs. storage intensive ?
>

M/R -- most of it is parsing; 5%-10% of the m/r results are stored
in HBase


> C)  What is your Data growth ?
>

  Currently we have 50GB per day; it could grow to ~150GB.


> D)  With the current jobs, are you saturating RAM ? CPU ? Or storage ?
>
    The map phase takes 100% CPU since it is parsing and the input
files are gzipped.
    We definitely have memory issues.


> Ganglia/Hadoop metrics should tell.
> E)  Also are your jobs long running or short tasks ?
>
    Map tasks take from 5 seconds to 2 minutes.
    The reducer (insertion into HBase) takes ~3 hours.
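
    For what it's worth, a minimal sketch of the client-side write buffering
    we could try while that reducer still uses the direct API (table, family
    and rows below are placeholders; the bulk load suggested elsewhere in
    this thread should be the bigger win):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedInsertSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "my_table");   // hypothetical table
    table.setAutoFlush(false);                     // don't flush on every Put
    table.setWriteBufferSize(8 * 1024 * 1024);     // ~8MB client-side buffer

    for (int i = 0; i < 100000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(i));
      table.put(put);                              // buffered, not one RPC each
    }
    table.flushCommits();                          // push the remainder
    table.close();
  }
}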


> Suggestions:
> A)  Your name node could be 32 GB, 2TB Disk. Make sure it is an enterprise
> class server and also backup to an NFS mount.
> B)  Also have a decent machine as the checkpoint name node. It could be
> similar to the task nodes
> B)  I assume by Master Machine, you mean Job Tracker. It could be similar
> to the Task Trackers - 16/24 GB memory, with 4-8 TB disk
> C)  As Jean-Daniel pointed out 500GB (with more spindles) is what I would
> also recommend. But it also depends on your primary data, intermediate
> data and final data size. 1 or 2 TB disks are also fine, because they give
> you more strage. I assume you have the default replication of 3
> D)  A 1Gb dedicated network would be good. As there are only ~25 machines,
> you can hang them off of a good Gb switch. Consider 10Gb if there is too
> much intermediate data traffic, in the future.
> Cheers
> <k/>
>
> On 11/21/10 Sun Nov 21, 10, "Oleg Ruchovets" <or...@gmail.com> wrote:
>
> >Hi all,
> >After testing HBase for few months with very light configurations  (5
> >machines, 2 TB disk, 8 GB RAM), we are now planing for production.
> >Our Load -
> >1) 50GB log files to process per day by Map/Reduce jobs.
> >2)  Insert 4-5GB to 3 tables in hbase.
> >3) Run 10-20 scans per day (scanning about 20 regions in a table).
> >All this should run in parallel.
> >Our current configuration can't cope with this load and we are having many
> >stability issues.
> >
> >This is what we have in mind :
> >1. Master machine - 32 GB, 4 TB, Two quad core CPUs.
> >2. Name node - 16 GB, 2TB, Two quad core CPUs.
> >we plan to have up to 20 name servers (starting with 5).
> >
> >We already read
> >
> http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-ba
> >sic-hardware-recommendations/
> >.
> >
> >We would appreciate your feedback on our proposed configuration.
> >
> >
> >Regards Oleg & Lior
>
>
>

Re: Hadoop/HBase hardware requirement

Posted by Krishna Sankar <ks...@gmail.com>.
Oleg & Lior,

Couple of questions & couple of suggestions to ponder:
A)  When you say 20 Name Servers, I assume you are talking about 20 Task
Servers
B)  What type are your M/R jobs ? Compute Intensive vs. storage intensive ?
C)  What is your Data growth ?
D)  With the current jobs, are you saturating RAM ? CPU ? Or storage ?
Ganglia/Hadoop metrics should tell.
E)  Also are your jobs long running or short tasks ?
Suggestions:
A)  Your name node could be 32 GB, 2TB Disk. Make sure it is an enterprise
class server and also backup to an NFS mount.
B)  Also have a decent machine as the checkpoint name node. It could be
similar to the task nodes
B)  I assume by Master Machine, you mean Job Tracker. It could be similar
to the Task Trackers - 16/24 GB memory, with 4-8 TB disk
C)  As Jean-Daniel pointed out 500GB (with more spindles) is what I would
also recommend. But it also depends on your primary data, intermediate
data and final data size. 1 or 2 TB disks are also fine, because they give
you more storage. I assume you have the default replication of 3
D)  A 1Gb dedicated network would be good. As there are only ~25 machines,
you can hang them off of a good Gb switch. Consider 10Gb if there is too
much intermediate data traffic, in the future.
Cheers
<k/>

On 11/21/10 Sun Nov 21, 10, "Oleg Ruchovets" <or...@gmail.com> wrote:

>Hi all,
>After testing HBase for few months with very light configurations  (5
>machines, 2 TB disk, 8 GB RAM), we are now planing for production.
>Our Load -
>1) 50GB log files to process per day by Map/Reduce jobs.
>2)  Insert 4-5GB to 3 tables in hbase.
>3) Run 10-20 scans per day (scanning about 20 regions in a table).
>All this should run in parallel.
>Our current configuration can't cope with this load and we are having many
>stability issues.
>
>This is what we have in mind :
>1. Master machine - 32 GB, 4 TB, Two quad core CPUs.
>2. Name node - 16 GB, 2TB, Two quad core CPUs.
>we plan to have up to 20 name servers (starting with 5).
>
>We already read
>http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-ba
>sic-hardware-recommendations/
>.
>
>We would appreciate your feedback on our proposed configuration.
>
>
>Regards Oleg & Lior