Posted to common-user@hadoop.apache.org by James Seigel <ja...@tynt.com> on 2010/04/08 07:50:07 UTC

Distributed Clusters

I am new to this group, and relatively new to hadoop. 

I am looking at building a large cluster.  I was wondering if anyone has any best practices for a cluster in the hundreds of nodes?  As well, has anyone had experience with a cluster spanning multiple data centers.  Is this a bad practice? moderately bad practice?  insane?

Is it better to build the 1000 node cluster in a single data center?  Do you back one of these things up to a second data center or a different 1000 node cluster?

Sorry, I am asking crazy questions...I am just wanting to learn the meta issues and opportunities with making clusters.

Thanks for your ideas!

Cheers
James.



Re: Distributed Clusters

Posted by Allen Wittenauer <aw...@linkedin.com>.
On Apr 7, 2010, at 10:50 PM, James Seigel wrote:

> I am new to this group, and relatively new to hadoop. 

Welcome to the community, James. :)

> I am looking at building a large cluster.  I was wondering if anyone has any best practices for a cluster in the hundreds of nodes?

Take a look at the 'Hadoop 24/7' presentation (on the hadoop wiki preso page) I did for ApacheCon EU last year.  It covers a lot of the "now that I have a grid, what do I do?" situations.

>  As well, has anyone had experience with a cluster spanning multiple data centers.  Is this a bad practice? moderately bad practice?  insane?

Right now, it generally falls into the insane category unless you have REALLY REALLY REALLY low latency and high bandwidth.  The heartbeats between nodes, issues with block placement, etc., make it highly likely that you will saturate the link and/or split the cluster into multiple pieces.

> Is it better to build the 1000 node cluster in a single data center?  Do you back one of these things up to a second data center or a different 1000 node cluster?

We're currently going with a 'multiple grids in one data center' strategy.  Our 'Source of Truth' data is from another source, meaning we could (theoretically) rebuild the grid from that source if we were to get decimated by dinosaurs.  [That source of truth has a much better backup/dr strategy.]

> Sorry, I am asking crazy questions...I am just wanting to learn the meta issues and opportunities with making clusters.

These are pretty normal questions.  We should probably create a faq or something on the wiki.


RE: Distributed Clusters

Posted by Michael Segel <mi...@hotmail.com>.

> > 
> > Is it better to build the 1000 node cluster in a single data center?  
> 
> yes.
> 
> >Do you back one of these things up to a second data center or a different 1000 node cluster?
> 

If you're building your cluster on the West Coast, yes, you had best concern yourself with earthquakes, rolling blackouts and of course the ever-present volcanic activity. ;-) In the Midwest? Not so much. Just some potential revolutionary, right wing conspiracy nut cases in Michigan and Northern Indiana. ;-)  (Ok, we do have tornadoes, floods, and in Chicago plagues of tourists. :-)  So do what Google and Microsoft are doing: build out data centers at 'undisclosed' locations around Chicago. :-)

Ok... on a more serious note...

I think building out two clusters in two data centers only makes sense if you are worried about disaster recovery. Then yes, two clusters in different locations make sense.

Then your two clusters are independent and you have to work out how to keep them in sync.

If you were thinking of having one cloud span 2 data centers? Not really a good idea.
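
One common way to keep two independent clusters in sync is a periodic bulk copy with Hadoop's distcp tool. Below is a minimal sketch that only builds the command line; the namenode host names, port and path are made-up placeholders, and it assumes the `hadoop` launcher is on the PATH of whatever host runs the copy.

```python
import shlex


def distcp_command(src_uri, dst_uri, update=True):
    """Build the argument list for a cluster-to-cluster distcp run.

    With update=True the -update flag is added, so distcp skips files
    that already exist unchanged at the destination.
    """
    cmd = ["hadoop", "distcp"]
    if update:
        cmd.append("-update")
    cmd += [src_uri, dst_uri]
    return cmd


# Hypothetical namenode addresses, for illustration only.
cmd = distcp_command(
    "hdfs://nn-primary.example.com:8020/data",
    "hdfs://nn-dr.example.com:8020/data",
)
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from a scheduled job. How often to run it depends on how much data loss you can tolerate at the second site.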

HTH

-Mike



 		 	   		  

Re: Distributed Clusters

Posted by James Seigel <ja...@tynt.com>.
Thanks for the insights into this stuff so far.  I think we are doing some things right with automating everything and such.  An additional question I have is: I have heard rhetoric about zookeeper being able to help with configurations of hadoop?  I was wondering if anyone is using zookeeper in a way that helps with their deployment of the hadoop cluster?

Cheers
James.


On 2010-04-08, at 4:18 AM, Steve Loughran wrote:

> James Seigel wrote:
>> I am new to this group, and relatively new to hadoop. I am looking at building a large cluster.  I was wondering if anyone has any best practices for a cluster in the hundreds of nodes?  As well, has anyone had experience with a cluster spanning multiple data centers.  Is this a bad practice? moderately bad practice?  insane?
> 
> got some stuff here
> http://wiki.smartfrog.org/wiki/display/sf/Patterns+of+Hadoop+Deployment
> 
> though my clusters are of short life span and smaller. At that kind of scale you need to know how to manage datacenters yourself or talk to people who do (I deny all knowledge, though I will note that in HP consulting and EDS we do have people who can handle this)
> 
>> Is it better to build the 1000 node cluster in a single data center?  
> 
> yes.
> 
>> Do you back one of these things up to a second data center or a different 1000 node cluster?
> 
> depends on your concerns and where the building is.
> 
> -If your facility is in the Bay Area then you want a separate datacentre on a different fault line. If it's in Eastern WA or OR then you worry more about volcanic activity and spec the roof to take 1-2m of volcanic ash. Power comes off the big dams which again may go down if there's an earthquake, but otherwise pretty reliable.
> 
> -if your worry is about continuous availability, you need different sites with different (multiple) power suppliers and multiple data feeds, and more to worry about in terms of keeping things in sync. Data transfer will cost time and money. A big enough cluster (1000 servers can go up to 6-12 PB of storage) takes time to sync: even at the CERN LHC experiments' data rate of 1 PB/month, it would take at least 6 months to get the data into your cluster using a good protocol like GridFTP.
> 
> -single site would make sync easier; 10 Gb Ethernet will still take a while but won't cost you
> 
>> Sorry, I am asking crazy questions...I am just wanting to learn the meta issues and opportunities with making clusters.
> 
> Start small, automate everything, worry about scaling up the management problems. The Hadoop filestore and JT scale well, but you have to get your ops right. That's everything from BIOS upgrades to log file management.

James Seigel
james@tynt.com
http://www.tynt.com
Captain Hammer


Re: Distributed Clusters

Posted by Steve Loughran <st...@apache.org>.
James Seigel wrote:
> I am new to this group, and relatively new to hadoop. 
> 
> I am looking at building a large cluster.  I was wondering if anyone has any best practices for a cluster in the hundreds of nodes?  As well, has anyone had experience with a cluster spanning multiple data centers.  Is this a bad practice? moderately bad practice?  insane?

got some stuff here
http://wiki.smartfrog.org/wiki/display/sf/Patterns+of+Hadoop+Deployment

though my clusters are of short life span and smaller. At that kind of 
scale you need to know how to manage datacenters yourself or talk to 
people who do (I deny all knowledge, though I will note that in HP 
consulting and EDS we do have people who can handle this)

> 
> Is it better to build the 1000 node cluster in a single data center?  

yes.

>Do you back one of these things up to a second data center or a different 1000 node cluster?

depends on your concerns and where the building is.

-If your facility is in the Bay Area then you want a separate datacentre 
on a different fault line. If it's in Eastern WA or OR then you worry 
more about volcanic activity and spec the roof to take 1-2m of volcanic 
ash. Power comes off the big dams which again may go down if there's an 
earthquake, but otherwise pretty reliable.

-if your worry is about continuous availability, you need different 
sites with different (multiple) power suppliers and multiple data feeds, 
and more to worry about in terms of keeping things in sync. Data 
transfer will cost time and money. A big enough cluster (1000 servers 
can go up to 6-12 PB of storage) takes time to sync: even at the CERN 
LHC experiments' data rate of 1 PB/month, it would take at least 6 
months to get the data into your cluster using a good protocol like 
GridFTP.

-single site would make sync easier; 10 Gb Ethernet will still take a 
while but won't cost you
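
The transfer times above can be checked with a back-of-the-envelope calculation. This is a rough sketch, not a measurement: it assumes decimal petabytes, reads the local link as a single 10 Gigabit Ethernet connection running at ideal line rate with no protocol overhead, and ignores the fact that a real cluster would spread the copy over many links.

```python
PB = 10**15                 # bytes in a decimal petabyte (assumption)
SECONDS_PER_DAY = 86_400
# One 10 Gb/s link at ideal line rate, converted from bits to bytes.
LINK_BYTES_PER_SEC = 10 * 10**9 / 8


def months_at_monthly_rate(size_pb, pb_per_month):
    """Months needed to move size_pb petabytes at a fixed PB/month rate."""
    return size_pb / pb_per_month


def days_to_transfer(size_bytes, rate_bytes_per_sec):
    """Days needed to move size_bytes at a sustained byte rate."""
    return size_bytes / rate_bytes_per_sec / SECONDS_PER_DAY


for size_pb in (6, 12):
    wan_months = months_at_monthly_rate(size_pb, 1.0)
    lan_days = days_to_transfer(size_pb * PB, LINK_BYTES_PER_SEC)
    print(f"{size_pb} PB: {wan_months:.0f} months at 1 PB/month, "
          f"{lan_days:.1f} days over one 10 Gb/s link")
```

Under these assumptions 6 PB takes 6 months at 1 PB/month and roughly 56 days over a single saturated 10 Gb/s link; 12 PB doubles both figures.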

> 
> Sorry, I am asking crazy questions...I am just wanting to learn the meta issues and opportunities with making clusters.

Start small, automate everything, worry about scaling up the management 
problems. The Hadoop filestore and JT scale well, but you have to get your 
ops right. That's everything from BIOS upgrades to log file management.

Re: Distributed Clusters

Posted by Ravi Phulari <rp...@yahoo-inc.com>.
Hello James,

> I am new to this group, and relatively new to hadoop.

Welcome to the group!!

> I am looking at building a large cluster.  I was wondering if anyone has any best practices for a cluster in the hundreds of nodes?  As well, has anyone had experience with a cluster spanning multiple data centers.  Is this a bad practice? moderately bad practice?  insane?

You can find answers to most of the questions here - http://wiki.apache.org/hadoop/
I am not sure whether there are clusters spanning multiple data centers. Even if there are such clusters, I am very confident that Hadoop will work on a cluster spanning multiple data centers.

> Is it better to build the 1000 node cluster in a single data center?  Do you back one of these things up to a second data center or a different 1000 node cluster?

If you are completely new to Hadoop then it's better to start with a 100-200 node cluster and learn how it works. You can scale to 1000 or more nodes later.

Regards,
Ravi
--
Hadoop @ Yahoo!