Posted to common-user@hadoop.apache.org by 叶双明 <ye...@gmail.com> on 2008/09/11 11:15:01 UTC

How to manage a large cluster?

Hi, all!

How do you manage a large cluster, e.g. more than 2000 nodes?
How do you configure hostnames and IPs? Should we use DNS?
How do you configure the slaves? Do they all go in the slaves file?
How do you update software on all the nodes?

Any practices, articles, or suggestions are appreciated!
Thanks.

-- 
Sorry for my english!! 明
Please help me to correct my english expression and error in syntax

Re: How to manage a large cluster?

Posted by Paco NATHAN <ce...@gmail.com>.
Thanks, Steve -

Another flexible approach to handling messages across firewalls,
between the jobtracker and worker nodes, etc., would be to place an
AMQP message broker on the jobtracker and another inside our local
network.  We're experimenting with RabbitMQ for that.


On Tue, Sep 16, 2008 at 4:03 AM, Steve Loughran <st...@apache.org> wrote:

>> We use a set of Python scripts to manage a daily, (mostly) automated
>> launch of 100+ EC2 nodes for a Hadoop cluster.  We also run a listener
>> on a local server, so that the Hadoop job can send notification when
>> it completes, and allow the local server to initiate download of
>> results.  Overall, that minimizes the need for having a sysadmin
>> dedicated to the Hadoop jobs -- a small dev team can handle it, while
>> focusing on algorithm development and testing.
>
> 1. We have some components that use google talk to relay messages to local
> boxes behind the firewall. I could imagine hooking up hadoop status events
> to that too.
>
> 2. There's an old paper of mine, "Making Web Services that Work", in which I
> talk about deployment centric development:
> http://www.hpl.hp.com/techreports/2002/HPL-2002-274.html
>
> The idea is that right from the outset, the dev team work on a cluster that
> resembles production, the CI server builds to it automatically, changes get
> pushed out to production semi-automatically (you tag the version you want
> pushed out in SVN, the CI server does the release). The article is focused
> on services exported to third parties, not back end stuff, so it may not all
> apply to hadoop deployments.
>
> -steve

Re: How to manage a large cluster?

Posted by Steve Loughran <st...@apache.org>.
Paco NATHAN wrote:
> We use an EC2 image onto which we install Java, Ant, Hadoop, etc. To
> make it simple, pull those from S3 buckets. That provides a flexible
> pattern for managing the frameworks involved, more so than needing to
> re-do an EC2 image whenever you want to add a patch to Hadoop.
> 
> Given that approach, you can add your Hadoop application code
> similarly. Just upload the current stable build out of SVN, Git,
> whatever, to an S3 bucket.

Nice. Your CI tool could upload the latest release tagged as good and 
the machines could pull it down.

The goal of cluster management is to make the addition/removal of an 
extra node an O(1) problem; you edit one entry in one place to increment 
or decrement the number of machines, and that's it.

If you find you have lots of images to keep alive, then your costs go 
up. Keep the # of images you have to 1 and you will stay in control.
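Steve's O(1) goal can be sketched in a few lines: keep one cluster spec as the single source of truth and derive everything else (slaves file, hostnames) from it. This is a hypothetical sketch; the naming scheme, domain, and file layout are assumptions, not anyone's actual setup.

```python
# Hypothetical sketch: one "cluster spec" as the single source of truth.
# Changing node_count here is the only edit needed to grow or shrink
# the cluster; everything else (slaves file, DNS names) is derived.

CLUSTER_SPEC = {
    "name_prefix": "worker",        # assumed naming scheme
    "domain": "cluster.example.com",
    "node_count": 4,                # the ONE entry you edit
}

def worker_hostnames(spec):
    """Derive every worker hostname from the spec."""
    return [
        "%s%03d.%s" % (spec["name_prefix"], i, spec["domain"])
        for i in range(1, spec["node_count"] + 1)
    ]

def render_slaves_file(spec):
    """Render the Hadoop conf/slaves file content from the same spec."""
    return "\n".join(worker_hostnames(spec)) + "\n"

if __name__ == "__main__":
    print(render_slaves_file(CLUSTER_SPEC), end="")
```

Adding a node is then literally a one-integer change; the CI or config-management run regenerates and pushes the derived files.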

> 
> We use a set of Python scripts to manage a daily, (mostly) automated
> launch of 100+ EC2 nodes for a Hadoop cluster.  We also run a listener
> on a local server, so that the Hadoop job can send notification when
> it completes, and allow the local server to initiate download of
> results.  Overall, that minimizes the need for having a sysadmin
> dedicated to the Hadoop jobs -- a small dev team can handle it, while
> focusing on algorithm development and testing.

1. We have some components that use google talk to relay messages to 
local boxes behind the firewall. I could imagine hooking up hadoop 
status events to that too.

2. There's an old paper of mine, "Making Web Services that Work", in 
which I talk about deployment centric development:
http://www.hpl.hp.com/techreports/2002/HPL-2002-274.html

The idea is that right from the outset, the dev team work on a cluster 
that resembles production, the CI server builds to it automatically, 
changes get pushed out to production semi-automatically (you tag the 
version you want pushed out in SVN, the CI server does the release). The 
article is focused on services exported to third parties, not back end 
stuff, so it may not all apply to hadoop deployments.

-steve




Re: How to manage a large cluster?

Posted by Paco NATHAN <ce...@gmail.com>.
We use an EC2 image onto which we install Java, Ant, Hadoop, etc. To
make it simple, pull those from S3 buckets. That provides a flexible
pattern for managing the frameworks involved, more so than needing to
re-do an EC2 image whenever you want to add a patch to Hadoop.

Given that approach, you can add your Hadoop application code
similarly. Just upload the current stable build out of SVN, Git,
whatever, to an S3 bucket.
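The pattern above amounts to pinning versions in one manifest and letting each fresh EC2 node fetch them at boot, instead of baking a new image per patch. Here is a minimal sketch; the bucket name, file layout, and version numbers are invented for illustration, not Paco's actual setup.

```python
# Hedged sketch of the "pull frameworks from S3" pattern: pin the
# versions you want in one manifest, derive the S3 URLs, and let each
# fresh EC2 node fetch them from its init script at boot.

MANIFEST = {
    "hadoop": "0.18.1",
    "ant": "1.7.1",
    "myapp": "build-142",   # current stable build out of SVN/Git
}

S3_BUCKET = "my-cluster-artifacts"  # hypothetical bucket

def artifact_url(name, version):
    """Compose the S3 URL a node would fetch at boot time."""
    return "https://%s.s3.amazonaws.com/%s-%s.tar.gz" % (S3_BUCKET, name, version)

def fetch_commands(manifest):
    """One wget per pinned artifact, suitable for a node's init script."""
    return ["wget -q %s" % artifact_url(n, v) for n, v in sorted(manifest.items())]

if __name__ == "__main__":
    for cmd in fetch_commands(MANIFEST):
        print(cmd)
```

Upgrading Hadoop or the application is then an upload to the bucket plus a one-line manifest edit, with no image rebuild.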

We use a set of Python scripts to manage a daily, (mostly) automated
launch of 100+ EC2 nodes for a Hadoop cluster.  We also run a listener
on a local server, so that the Hadoop job can send notification when
it completes, and allow the local server to initiate download of
results.  Overall, that minimizes the need for having a sysadmin
dedicated to the Hadoop jobs -- a small dev team can handle it, while
focusing on algorithm development and testing.
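The "listener on a local server" idea can be sketched as follows: the job POSTs a small completion notice, and the listener decides whether (and from where) to pull results. The payload fields and bucket layout here are assumptions for illustration; the actual HTTP wiring (e.g. stdlib http.server calling this from its do_POST handler) is omitted.

```python
# Minimal sketch of the completion-notification listener: the Hadoop
# job sends a small JSON notice when it finishes, and the local server
# reacts by starting the results download.

import json

def plan_download(payload):
    """Decide what to fetch based on the job's completion notice.

    Returns the (hypothetical) S3 prefix to pull results from, or None
    if the job failed and there is nothing to download.
    """
    notice = json.loads(payload)
    if notice.get("status") != "succeeded":
        return None
    return "s3://results-bucket/%s/" % notice["job_id"]

if __name__ == "__main__":
    print(plan_download('{"job_id": "daily-2008-09-16", "status": "succeeded"}'))
```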


>> Or on EC2 and its competitors, just build a new image whenever you
>> need to update Hadoop itself.

Re: How to manage a large cluster?

Posted by 叶双明 <ye...@gmail.com>.
Sorry, but I can't open it:
http://wiki.smartfrog.org/wiki/display/sf/Patterns+of+Hadoop+Deployment

2008/9/13 Steve Loughran <st...@apache.org>

> James Moore wrote:
>
>> On Thu, Sep 11, 2008 at 5:46 AM, Allen Wittenauer <aw...@yahoo-inc.com>
>> wrote:
>>
>>> On 9/11/08 2:39 AM, "Alex Loddengaard" <al...@google.com> wrote:
>>>
>>>> I've never dealt with a large cluster, though I'd imagine it is managed
>>>> the
>>>> same way as small clusters:
>>>>
>>>   Maybe. :)
>>>
>>
> Depends how often you like to be paged, doesn't it :)
>
>
>>    Instead, use a real system configuration management package such as
>>> bcfg2, smartfrog, puppet, cfengine, etc.  [Steve, you owe me for the
>>> plug.
>>> :) ]
>>>
>>
> Yes Allen, I owe you beer at the next apachecon we are both at.
> Actually, I think Y! were one of the sponsors at the UK event, so we owe
> you for that too.
>
>
>  Or on EC2 and its competitors, just build a new image whenever you
>> need to update Hadoop itself.
>>
>
>
> 1. It's still good to have as much automation of your image build as you
> can; if you can build new machine images on demand you can have fun/make a
> mess of things. Look at http://instalinux.com to see the web GUI for
> creating linux images on demand that is used inside HP.
>
> 2. When you try and bring up everything from scratch, you have a
> choreography problem. DNS needs to be up early, and your authentication
> system, the management tools, then the other parts of the system. If you
> have a project where hadoop is integrated with the front end site, for
> example, your app servers have to stay offline until HDFS is live. So it
> does get complex.
>
> 3. The Hadoop nodes are good here in that you aren't required to bring up
> the namenode first; the datanodes will wait; same for the task trackers and
> job tracker. But if you, say, need to point everything at a new hostname for
> the namenode, well, that's a config change that needs to be pushed out,
> somehow.
>
>
>
> I'm adding some stuff on different ways to deploy hadoop here:
>
> http://wiki.smartfrog.org/wiki/display/sf/Patterns+of+Hadoop+Deployment
>
> -steve
>



-- 
Sorry for my english!! 明
Please help me to correct my english expression and error in syntax

Re: How to manage a large cluster?

Posted by Steve Loughran <st...@apache.org>.
James Moore wrote:
> On Thu, Sep 11, 2008 at 5:46 AM, Allen Wittenauer <aw...@yahoo-inc.com> wrote:
>> On 9/11/08 2:39 AM, "Alex Loddengaard" <al...@google.com> wrote:
>>> I've never dealt with a large cluster, though I'd imagine it is managed the
>>> same way as small clusters:
>>    Maybe. :)

Depends how often you like to be paged, doesn't it :)

> 
>>    Instead, use a real system configuration management package such as
>> bcfg2, smartfrog, puppet, cfengine, etc.  [Steve, you owe me for the plug.
>> :) ]

Yes Allen, I owe you beer at the next apachecon we are both at.
Actually, I think Y! were one of the sponsors at the UK event, so we owe 
you for that too.


> Or on EC2 and its competitors, just build a new image whenever you
> need to update Hadoop itself.


1. It's still good to have as much automation of your image build as you 
can; if you can build new machine images on demand you can have 
fun/make a mess of things. Look at http://instalinux.com to see the web 
GUI for creating linux images on demand that is used inside HP.

2. When you try and bring up everything from scratch, you have a 
choreography problem. DNS needs to be up early, and your authentication 
system, the management tools, then the other parts of the system. If you 
have a project where hadoop is integrated with the front end site, for 
example, your app servers have to stay offline until HDFS is live. So 
it does get complex.

3. The Hadoop nodes are good here in that you aren't required to bring 
up the namenode first; the datanodes will wait; same for the task 
trackers and job tracker. But if you, say, need to point everything at a 
new hostname for the namenode, well, that's a config change that needs 
to be pushed out, somehow.
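Point 3 above, repointing every node at a new namenode hostname, boils down to regenerating one property in hadoop-site.xml and letting the config-management tool push it out. A sketch of that generation step (the property name matches Hadoop configs of this era; the template and hostnames are ours):

```python
# Sketch: the config change for a new namenode hostname is one
# regenerated property in hadoop-site.xml, which config management
# then pushes to every node.

SITE_TEMPLATE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://%(namenode)s:%(port)d/</value>
  </property>
</configuration>
"""

def render_site_xml(namenode, port=9000):
    """Render a minimal hadoop-site.xml for a given namenode host."""
    return SITE_TEMPLATE % {"namenode": namenode, "port": port}

if __name__ == "__main__":
    print(render_site_xml("namenode2.cluster.example.com"))
```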



I'm adding some stuff on different ways to deploy hadoop here:

http://wiki.smartfrog.org/wiki/display/sf/Patterns+of+Hadoop+Deployment

-steve

Re: How to manage a large cluster?

Posted by 叶双明 <ye...@gmail.com>.
Er... so that means:

Set up a DNS server, and use hostnames instead of raw IPs?

Configure all nodes in the slaves file, and keep this file only on the namenode and
secondary namenode to prevent accidents?

Use a real system configuration management package to sync software on all
nodes of the cluster?

Thanks, everyone. Alex Loddengaard's creation of the wiki page is commendable;
everyone should help enrich it.
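On the software-sync point: a config-management tool converges each node for you, but the underlying idea can be shown with a small audit pass that compares what each node reports against what the cluster should run. Everything here (package names, versions, report shape) is a hypothetical illustration.

```python
# Hedged sketch of "keep software in sync on all nodes": compare the
# version each node reports (e.g. gathered over ssh) against the
# version the cluster is supposed to run.

EXPECTED = {"hadoop": "0.18.1"}

def out_of_sync(reports, expected=EXPECTED):
    """reports: {hostname: {package: version}}.

    Returns the hostnames that need an update.
    """
    stale = []
    for host, versions in sorted(reports.items()):
        for pkg, want in expected.items():
            if versions.get(pkg) != want:
                stale.append(host)
                break
    return stale

if __name__ == "__main__":
    reports = {
        "worker001": {"hadoop": "0.18.1"},
        "worker002": {"hadoop": "0.17.2"},  # lagging node
    }
    print(out_of_sync(reports))
```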

2008/9/12 Alex Loddengaard <al...@google.com>

> My inexperience has been revealed ;).  I've taken your comments, James and
> Allen, and added them to the wiki:
>
> <http://wiki.apache.org/hadoop/LargeClusterTips>
>
> Alex
>
> On Fri, Sep 12, 2008 at 2:01 AM, James Moore <jamesthepiper@gmail.com
> >wrote:
>
> > On Thu, Sep 11, 2008 at 5:46 AM, Allen Wittenauer <aw...@yahoo-inc.com>
> > wrote:
> > > On 9/11/08 2:39 AM, "Alex Loddengaard" <al...@google.com> wrote:
> > >> I've never dealt with a large cluster, though I'd imagine it is
> managed
> > the
> > >> same way as small clusters:
> > >
> > >    Maybe. :)
> >
> > Add me to the "maybe :)" column.  In my experience, large rarely turns
> > out the same as small.
> >
> > What usually happens is that the developers build the small thing,
> > keeping in mind that good sysadmins are going to need to do some work
> > to turn it into the large thing.  (Never underestimate the value of good
> > system admin people.  Speaking as a developer, good sysadmins will
> > almost always know something about "large" that you haven't thought
> > about.)
> >
> > I think what I was doing on a small cluster (100 machines) would take
> > some modifications to scale up.
> >
> > >> -Use hostnames or ips, whichever is more convenient for you
> > >
> > >    Use hostnames.  Seriously.  Who are you people using raw IPs for
> > things?
> > > :)  Besides, you're going to need it for the eventual support of
> > Kerberos.
> >
> > I suspect lots of people buy arrays by the hour from Amazon, so you're
> > going to have a different batch of IP addresses every
> > $WHATEVER_PERIOD.  Not having to worry about dynamic dns is probably
> > interesting to someone.  (Our plan was to spin up an array of 100 or
> > so servers every N days, work for a few hours, then shut down.)
> >
> > Dynamic DNS sounded like a pain to me only because I'm a really bad
> > system administrator - it may be that it's worth it (or trivial).
> >
> > >    Instead, use a real system configuration management package such as
> > > bcfg2, smartfrog, puppet, cfengine, etc.  [Steve, you owe me for the
> > plug.
> > > :) ]
> >
> > Or on EC2 and its competitors, just build a new image whenever you
> > need to update Hadoop itself.
> >
> > --
> > James Moore | james@restphone.com
> > Ruby and Ruby on Rails consulting
> > blog.restphone.com
> >
>



-- 
Sorry for my english!! 明
Please help me to correct my english expression and error in syntax

Re: How to manage a large cluster?

Posted by Alex Loddengaard <al...@google.com>.
My inexperience has been revealed ;).  I've taken your comments, James and
Allen, and added them to the wiki:

<http://wiki.apache.org/hadoop/LargeClusterTips>

Alex

On Fri, Sep 12, 2008 at 2:01 AM, James Moore <ja...@gmail.com>wrote:

> On Thu, Sep 11, 2008 at 5:46 AM, Allen Wittenauer <aw...@yahoo-inc.com>
> wrote:
> > On 9/11/08 2:39 AM, "Alex Loddengaard" <al...@google.com> wrote:
> >> I've never dealt with a large cluster, though I'd imagine it is managed
> the
> >> same way as small clusters:
> >
> >    Maybe. :)
>
> Add me to the "maybe :)" column.  In my experience, large rarely turns
> out the same as small.
>
> What usually happens is that the developers build the small thing,
> keeping in mind that good sysadmins are going to need to do some work
> to turn it into the large thing.  (Never underestimate the value of good
> system admin people.  Speaking as a developer, good sysadmins will
> almost always know something about "large" that you haven't thought
> about.)
>
> I think what I was doing on a small cluster (100 machines) would take
> some modifications to scale up.
>
> >> -Use hostnames or ips, whichever is more convenient for you
> >
> >    Use hostnames.  Seriously.  Who are you people using raw IPs for
> things?
> > :)  Besides, you're going to need it for the eventual support of
> Kerberos.
>
> I suspect lots of people buy arrays by the hour from Amazon, so you're
> going to have a different batch of IP addresses every
> $WHATEVER_PERIOD.  Not having to worry about dynamic dns is probably
> interesting to someone.  (Our plan was to spin up an array of 100 or
> so servers every N days, work for a few hours, then shut down.)
>
> Dynamic DNS sounded like a pain to me only because I'm a really bad
> system administrator - it may be that it's worth it (or trivial).
>
> >    Instead, use a real system configuration management package such as
> > bcfg2, smartfrog, puppet, cfengine, etc.  [Steve, you owe me for the
> plug.
> > :) ]
>
> Or on EC2 and its competitors, just build a new image whenever you
> need to update Hadoop itself.
>
> --
> James Moore | james@restphone.com
> Ruby and Ruby on Rails consulting
> blog.restphone.com
>

Re: How to manage a large cluster?

Posted by James Moore <ja...@gmail.com>.
On Thu, Sep 11, 2008 at 5:46 AM, Allen Wittenauer <aw...@yahoo-inc.com> wrote:
> On 9/11/08 2:39 AM, "Alex Loddengaard" <al...@google.com> wrote:
>> I've never dealt with a large cluster, though I'd imagine it is managed the
>> same way as small clusters:
>
>    Maybe. :)

Add me to the "maybe :)" column.  In my experience, large rarely turns
out the same as small.

What usually happens is that the developers build the small thing,
keeping in mind that good sysadmins are going to need to do some work
to turn it into the large thing.  (Never underestimate the value of good
system admin people.  Speaking as a developer, good sysadmins will
almost always know something about "large" that you haven't thought
about.)

I think what I was doing on a small cluster (100 machines) would take
some modifications to scale up.

>> -Use hostnames or ips, whichever is more convenient for you
>
>    Use hostnames.  Seriously.  Who are you people using raw IPs for things?
> :)  Besides, you're going to need it for the eventual support of Kerberos.

I suspect lots of people buy arrays by the hour from Amazon, so you're
going to have a different batch of IP addresses every
$WHATEVER_PERIOD.  Not having to worry about dynamic dns is probably
interesting to someone.  (Our plan was to spin up an array of 100 or
so servers every N days, work for a few hours, then shut down.)

Dynamic DNS sounded like a pain to me only because I'm a really bad
system administrator - it may be that it's worth it (or trivial).

>    Instead, use a real system configuration management package such as
> bcfg2, smartfrog, puppet, cfengine, etc.  [Steve, you owe me for the plug.
> :) ]

Or on EC2 and its competitors, just build a new image whenever you
need to update Hadoop itself.

-- 
James Moore | james@restphone.com
Ruby and Ruby on Rails consulting
blog.restphone.com

Re: How to manage a large cluster?

Posted by Allen Wittenauer <aw...@yahoo-inc.com>.
On 9/11/08 2:39 AM, "Alex Loddengaard" <al...@google.com> wrote:
> I've never dealt with a large cluster, though I'd imagine it is managed the
> same way as small clusters:

    Maybe. :)

> -Use hostnames or ips, whichever is more convenient for you

    Use hostnames.  Seriously.  Who are you people using raw IPs for things?
:)  Besides, you're going to need it for the eventual support of Kerberos.

> -All the slaves need to go into the slave file

    We only put this file on the namenode and 2ndary namenode to prevent
accidents.

> -You can update software by using bin/hadoop-daemons.sh.  Something like:
> #bin/hadoop-daemons.sh "rsync (mastersrcpath) (localdestpath)"

    We don't use that because it doesn't take into consideration down nodes
(and you *will* have down nodes!) or deal with nodes that are outside the
grid (such as our gateways/bastion hosts, data loading machines, etc).

    Instead, use a real system configuration management package such as
bcfg2, smartfrog, puppet, cfengine, etc.  [Steve, you owe me for the plug.
:) ]

> I created a wiki page that currently contains one tip for managing large
> clusters.  Could others add to this wiki page?
> 
> <http://wiki.apache.org/hadoop/LargeClusterTips>

    Quite a bit of what we do is covered in the latter half of
http://tinyurl.com/5foamm .  This is a presentation I did at ApacheCon EU
this past April that included some of the behind-the-scenes of the large
clusters at Y!.  At some point I'll probably do an updated version that
includes more adminy things (such as why we push four different types of
Hadoop configurations per grid) while others talk about core Hadoop stuff.


Re: How to manage a large cluster?

Posted by Alex Loddengaard <al...@google.com>.
I've never dealt with a large cluster, though I'd imagine it is managed the
same way as small clusters:

-Use hostnames or ips, whichever is more convenient for you
-All the slaves need to go into the slave file
-You can update software by using bin/hadoop-daemons.sh.  Something like:
#bin/hadoop-daemons.sh "rsync (mastersrcpath) (localdestpath)"
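Per slave, that one-liner amounts to something like the following loop; this is only a sketch, and the master hostname and install paths are hypothetical placeholders.

```python
# Sketch of what the hadoop-daemons.sh rsync line expands to: one
# rsync per slave, each pulling the install tree from the master.
# Hostnames and paths here are hypothetical.

MASTER = "master.cluster.example.com"

def rsync_command(slave, src="/opt/hadoop/", dest="/opt/hadoop/"):
    """The command each slave would run to sync from the master."""
    return "ssh %s rsync -a %s:%s %s" % (slave, MASTER, src, dest)

def update_commands(slaves):
    """One update command per entry in the slaves file."""
    return [rsync_command(s) for s in slaves]

if __name__ == "__main__":
    for cmd in update_commands(["worker001", "worker002"]):
        print(cmd)
```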

I created a wiki page that currently contains one tip for managing large
clusters.  Could others add to this wiki page?

<http://wiki.apache.org/hadoop/LargeClusterTips>

Thanks.  Hope this helps!

Alex

On Thu, Sep 11, 2008 at 5:15 PM, 叶双明 <ye...@gmail.com> wrote:

> Hi, all!
>
> How do you manage a large cluster, e.g. more than 2000 nodes?
> How do you configure hostnames and IPs? Should we use DNS?
> How do you configure the slaves? Do they all go in the slaves file?
> How do you update software on all the nodes?
>
> Any practices, articles, or suggestions are appreciated!
> Thanks.
>
> --
> Sorry for my english!! 明
> Please help me to correct my english expression and error in syntax
>