Posted to common-user@hadoop.apache.org by Bradford Stephens <br...@gmail.com> on 2008/04/29 22:36:49 UTC

Hadoop Cluster Administration Tools?

Greetings,

I'm compiling a list of (free/OSS) tools commonly used to administer Linux
clusters to help my company transition away from Win solutions.

I use Ganglia for monitoring the general stats of the machines (Although I
didn't get the hadoop metrics to work). I also use ntop to check out network
performance (especially with Nutch).

What do you all use to run your Hadoop clusters? I haven't found a good tool
to let me run a command on multiple machines and examine the output, yet.

Cheers,
Bradford

Re: Hadoop Cluster Administration Tools?

Posted by Yingyuan Cheng <yi...@staff.sina.com.cn>.
sshbatch is a tool that simplifies ssh login and cluster management.

http://code.google.com/p/sshbatch/

FYI

--
yingyuan


Ted Dunning wrote:
> One of our sysadmins uses a parallel interactive ssh utility.  It opens
> separate windows for output and accepts input from one of the windows.  When
> editing a config file where it would be a pain to build a sed script, this
> is better than pssh or equivalents.  If the starting points are not
> identical this leads to *really* hosed results, of course.  This problem
> afflicts anything of the sort and makes debugging problems really hard
> because the faults are so strange.
>
> My own preference is to edit the files in question locally and then copy
> them to all targets.  I have to understand the problem slightly better for
> that to work, but that seems to me to be a feature.
>
>
> On 4/29/08 2:52 PM, "Khalil Honsali" <k....@gmail.com> wrote:
>
>   
>> I was wondering if any of C3 or Capify offer the capability of doing
>> interactive distributed shell (if that ever makes sense), I am thinking of
>> the example of Yum's update on fedora, say without the default yes option.
>>
>> K. Honsali
>>
>> 2008/4/30 Bryan Duxbury <br...@rapleaf.com>:
>>
>>     
>>> For commands on multiple machines, you can use Capistrano's shell utility.
>>> An added bonus is that you can write all sorts of more complicated processes
>>> using Ruby if you want to.
>>>
>>> www.capify.org
>>>
>>> -Bryan
>>>
>>>
>>> On Apr 29, 2008, at 1:36 PM, Bradford Stephens wrote:
>>>
>>>  Greetings,
>>>       
>>>> I'm compiling a list of (free/OSS) tools commonly used to administer
>>>> Linux
>>>> clusters to help my company transition away from Win solutions.
>>>>
>>>> I use Ganglia for monitoring the general stats of the machines (Although
>>>> I
>>>> didn't get the hadoop metrics to work). I also use ntop to check out
>>>> network
>>>> performance (especially with Nutch).
>>>>
>>>> What do you all use to run your Hadoop clusters? I haven't found a good
>>>> tool
>>>> to let me run a command on multiple machines and examine the output,
>>>> yet.
>>>>
>>>> Cheers,
>>>> Bradford
>>>>
>>>>         
>>>       
>> -
>>     
>
>   


Re: Hadoop Cluster Administration Tools?

Posted by Ted Dunning <td...@veoh.com>.
One of our sysadmins uses a parallel interactive ssh utility.  It opens
separate windows for output and accepts input from one of the windows.  When
editing a config file where it would be a pain to build a sed script, this
is better than pssh or equivalents.  If the starting points are not
identical this leads to *really* hosed results, of course.  This problem
afflicts anything of the sort and makes debugging problems really hard
because the faults are so strange.

My own preference is to edit the files in question locally and then copy
them to all targets.  I have to understand the problem slightly better for
that to work, but that seems to me to be a feature.


On 4/29/08 2:52 PM, "Khalil Honsali" <k....@gmail.com> wrote:

> I was wondering whether C3 or Capify offers the capability of an
> interactive distributed shell (if that ever makes sense). I am thinking of
> the example of Yum's update on Fedora, say without the default 'yes' option.
> 
> K. Honsali
> 
> 2008/4/30 Bryan Duxbury <br...@rapleaf.com>:
> 
>> For commands on multiple machines, you can use Capistrano's shell utility.
>> An added bonus is that you can write all sorts of more complicated processes
>> using Ruby if you want to.
>> 
>> www.capify.org
>> 
>> -Bryan
>> 
>> 
>> On Apr 29, 2008, at 1:36 PM, Bradford Stephens wrote:
>> 
>>  Greetings,
>>> 
>>> I'm compiling a list of (free/OSS) tools commonly used to administer
>>> Linux
>>> clusters to help my company transition away from Win solutions.
>>> 
>>> I use Ganglia for monitoring the general stats of the machines (Although
>>> I
>>> didn't get the hadoop metrics to work). I also use ntop to check out
>>> network
>>> performance (especially with Nutch).
>>> 
>>> What do you all use to run your Hadoop clusters? I haven't found a good
>>> tool
>>> to let me run a command on multiple machines and examine the output,
>>> yet.
>>> 
>>> Cheers,
>>> Bradford
>>> 
>> 
>> 
> 
> 
> -


Re: Hadoop Cluster Administration Tools?

Posted by Steve Loughran <st...@apache.org>.
Khalil Honsali wrote:
> I was wondering whether C3 or Capify offers the capability of an
> interactive distributed shell (if that ever makes sense). I am thinking of
> the example of Yum's update on Fedora, say without the default 'yes' option.


Yum is trouble.  I've been automating some aspects of creating RPMs and 
installing them, and yum
  -doesn't return meaningful error codes
  -doesn't actually check machine state to see what things are like
The general discussion on the configuration-management list is that you 
can automate rpm installs using the pure "rpm" command and check for 
installed artifacts yourself, but that yum is essentially not what you 
want if you want to stay in control of your machine state.


Khalil - how many machines do you have to look after?

-steve


-- 
Steve Loughran                  http://www.1060.org/blogxter/publish/5
Author: Ant in Action           http://antbook.org/

Re: Hadoop Cluster Administration Tools?

Posted by Khalil Honsali <k....@gmail.com>.
I was wondering whether C3 or Capify offers the capability of an
interactive distributed shell (if that ever makes sense). I am thinking of
the example of Yum's update on Fedora, say without the default 'yes' option.

K. Honsali

2008/4/30 Bryan Duxbury <br...@rapleaf.com>:

> For commands on multiple machines, you can use Capistrano's shell utility.
> An added bonus is that you can write all sorts of more complicated processes
> using Ruby if you want to.
>
> www.capify.org
>
> -Bryan
>
>
> On Apr 29, 2008, at 1:36 PM, Bradford Stephens wrote:
>
>  Greetings,
> >
> > I'm compiling a list of (free/OSS) tools commonly used to administer
> > Linux
> > clusters to help my company transition away from Win solutions.
> >
> > I use Ganglia for monitoring the general stats of the machines (Although
> > I
> > didn't get the hadoop metrics to work). I also use ntop to check out
> > network
> > performance (especially with Nutch).
> >
> > What do you all use to run your Hadoop clusters? I haven't found a good
> > tool
> > to let me run a command on multiple machines and examine the output,
> > yet.
> >
> > Cheers,
> > Bradford
> >
>
>


-

Re: Hadoop Cluster Administration Tools?

Posted by Bryan Duxbury <br...@rapleaf.com>.
For commands on multiple machines, you can use Capistrano's shell  
utility. An added bonus is that you can write all sorts of more  
complicated processes using Ruby if you want to.

www.capify.org

-Bryan

On Apr 29, 2008, at 1:36 PM, Bradford Stephens wrote:

> Greetings,
>
> I'm compiling a list of (free/OSS) tools commonly used to  
> administer Linux
> clusters to help my company transition away from Win solutions.
>
> I use Ganglia for monitoring the general stats of the machines  
> (Although I
> didn't get the hadoop metrics to work). I also use ntop to check  
> out network
> performance (especially with Nutch).
>
> What do you all use to run your Hadoop clusters? I haven't found a  
> good tool
> to let me run a command on multiple machines and examine the  
> output, yet.
>
> Cheers,
> Bradford


Re: Hadoop Cluster Administration Tools?

Posted by Ted Dunning <td...@veoh.com>.
Great idea.

Go for it.  Pick a place and start writing.  It is a wiki, so if you start
it, others will comment on it.


On 5/2/08 5:28 AM, "Khalil Honsali" <k....@gmail.com> wrote:

> Useful information indeed, though a bit complicated for my level, I must say.
> I think it would be more than useful to post these online, say in Hadoop's
> wiki or as an article on cluster resource sites.
> How about it? I can volunteer for this if you wish: a central information
> place on the hadoop wiki for pre-install cluster admin?
> - OS image install
> - ssh setup
> - dsh ant tools setup
> - rpm automation
> - this.next( ? )
> 
> 2008/5/2 Steve Loughran <st...@apache.org>:
> 
>> Allen Wittenauer wrote:
>> 
>>> On 5/1/08 5:00 PM, "Bradford Stephens" <br...@gmail.com>
>>> wrote:
>>> 
>>>> *Very* cool information. As someone who's leading the transition to
>>>> open-source and cluster-orientation  at a company of about 50 people,
>>>> finding good tools for the IT staff to use is essential. Thanks so
>>>> much for
>>>> the continued feedback.
>>>> 
>>> 
>>>    Hmm.  I should upload my slides.
>>> 
>>> 
>>> 
>> That would be excellent! I was trying not to scare people with things like
>> PXE preboot or the challenge of bringing up a farm of 500+ servers when the
>> building has just suffered a power outage. I will let your slides do that.
>> 
>> The key things people have to remember are
>> -you can't do stuff by hand once you have more than one box; you need to
>> have some story for scaling things up. It could be hand creating some
>> machine image that is cloned, it could be using CM tools. If you find
>> yourself trying to ssh in to boxes to configure them by hand, you are in
>> trouble
>> 
>> -once you have enough racks in your cluster, you can abandon any notion of
>> 100% availability. You have to be prepared to deal with failures as an
>> everyday event. The worst failures are not the machines that drop off the
>> net; it's the ones that start misbehaving with memory corruption or a network
>> card that starts flooding the network.
>> 
>> 
> 
> 
> --


Re: Hadoop Cluster Administration Tools?

Posted by Khalil Honsali <k....@gmail.com>.
Useful information indeed, though a bit complicated for my level, I must say.
I think it would be more than useful to post these online, say in Hadoop's
wiki or as an article on cluster resource sites.
How about it? I can volunteer for this if you wish: a central information
place on the hadoop wiki for pre-install cluster admin?
- OS image install
- ssh setup
- dsh ant tools setup
- rpm automation
- this.next( ? )

2008/5/2 Steve Loughran <st...@apache.org>:

> Allen Wittenauer wrote:
>
> > On 5/1/08 5:00 PM, "Bradford Stephens" <br...@gmail.com>
> > wrote:
> >
> > > *Very* cool information. As someone who's leading the transition to
> > > open-source and cluster-orientation  at a company of about 50 people,
> > > finding good tools for the IT staff to use is essential. Thanks so
> > > much for
> > > the continued feedback.
> > >
> >
> >    Hmm.  I should upload my slides.
> >
> >
> >
> That would be excellent! I was trying not to scare people with things like
> PXE preboot or the challenge of bringing up a farm of 500+ servers when the
> building has just suffered a power outage. I will let your slides do that.
>
> The key things people have to remember are
> -you can't do stuff by hand once you have more than one box; you need to
> have some story for scaling things up. It could be hand creating some
> machine image that is cloned, it could be using CM tools. If you find
> yourself trying to ssh in to boxes to configure them by hand, you are in
> trouble
>
> -once you have enough racks in your cluster, you can abandon any notion of
> 100% availability. You have to be prepared to deal with failures as an
> everyday event. The worst failures are not the machines that drop off the
> net; it's the ones that start misbehaving with memory corruption or a network
> card that starts flooding the network.
>
>


--

Re: Hadoop Cluster Administration Tools?

Posted by Steve Loughran <st...@apache.org>.
Allen Wittenauer wrote:
> On 5/1/08 5:00 PM, "Bradford Stephens" <br...@gmail.com> wrote:
>> *Very* cool information. As someone who's leading the transition to
>> open-source and cluster-orientation  at a company of about 50 people,
>> finding good tools for the IT staff to use is essential. Thanks so much for
>> the continued feedback.
> 
>     Hmm.  I should upload my slides.
> 
> 

That would be excellent! I was trying not to scare people with things 
like PXE preboot or the challenge of bringing up a farm of 500+ servers 
when the building has just suffered a power outage. I will let your 
slides do that.

The key things people have to remember are:
-you can't do stuff by hand once you have more than one box; you need to 
have some story for scaling things up. It could be hand-creating a 
machine image that is cloned, or it could be using CM tools. If you find 
yourself trying to ssh in to boxes to configure them by hand, you are in 
trouble

-once you have enough racks in your cluster, you can abandon any notion 
of 100% availability. You have to be prepared to deal with failures as 
an everyday event. The worst failures are not the machines that drop off 
the net; it's the ones that start misbehaving with memory corruption or 
a network card that starts flooding the network.


Re: Hadoop Cluster Administration Tools?

Posted by Allen Wittenauer <aw...@yahoo-inc.com>.
On 5/1/08 5:00 PM, "Bradford Stephens" <br...@gmail.com> wrote:
> *Very* cool information. As someone who's leading the transition to
> open-source and cluster-orientation  at a company of about 50 people,
> finding good tools for the IT staff to use is essential. Thanks so much for
> the continued feedback.

    Hmm.  I should upload my slides.



Re: Hadoop Cluster Administration Tools?

Posted by Bradford Stephens <br...@gmail.com>.
*Very* cool information. As someone who's leading the transition to
open-source and cluster-orientation  at a company of about 50 people,
finding good tools for the IT staff to use is essential. Thanks so much for
the continued feedback.

On Thu, May 1, 2008 at 6:10 AM, Steve Loughran <st...@apache.org> wrote:

> Khalil Honsali wrote:
>
> > Thanks Mr. Steve, and everyone..
> >
> > I actually have just 16 machines (normal P4 PCs), so in case I need to
> > do
> > things manually it takes half an hour (for example when installing
> > sun-java,
> > I had to type that 'yes' for each .bin install)
> > but for now I'm OK with pssh or just a simple custom script; however,
> > I'm
> > afraid things will get complicated soon enough...
> >
> > You said:
> > "you can automate rpm install using pure "rpm" command, and check for
> > installed artifacts yourself"
> > Could you please explain more, I understand you run the same rpm against
> > all
> > machines provided the cluster is homogeneous.
> >
> >
> 1. you can push out the same RPM files to all machines.
>
> 2. if you use rpmbuild (ant's <rpm> task does this), you can build your
> own RPMs and push them out, possibly with scp, then run ssh to install them.
> http://wiki.smartfrog.org/wiki/display/sf/RPM+Files
>
> 3. A lot of linux distros have adopted Yum
> http://wiki.smartfrog.org/wiki/display/sf/Pattern+-+Yum
>
>
> I was discussing Yum support on the Config-Management list last week,
> funnily enough
> http://lopsa.org/pipermail/config-mgmt/2008-April/000662.html
>
> Nobody likes automating it much, as
>  -it doesn't provide much state information
>  -it doesn't let you roll back very easily, or fix what you want
>
> Most people in that group -the CM tool authors - prefer to automate RPM
> install/rollback themselves, so they can stay in control.
>
> Looking at how our build.xml file manages test RPMs - that is, from the
> build VMware image to a clean test image - we <scp> and then <ssh> the
> operations:
>
>
>    <scp remoteToDir="${rpm.ssh.path}"
>        passphrase="${rpm.ssh.passphrase}"
>        keyfile="${rpm.ssh.keyfile}"
>        trust="${rpm.ssh.trust}"
>        verbose="${rpm.ssh.verbose}">
>      <fileset refid="rpm.upload.fileset"/>
>    </scp>
>
>
>
>  <target name="rpm-remote-install-all" depends="rpm-upload">
>    <rootssh
>        command="cd ${rpm.full.ssh.dir};rpm --upgrade --force ${rpm.verbosity} smartfrog-*.rpm"
>        outputProperty="rpm.result.all"/>
>    <validate-rpm-result result="${rpm.result.all}"/>
>  </target>
>
>
> The <rootssh> preset runs a remote root command
>
>    <presetdef name="rpmssh">
>      <sshexec host="${rpm.ssh.server}"
>          username="${rpm.ssh.user}"
>          passphrase="${rpm.ssh.passphrase}"
>          trust="${rpm.ssh.trust}"
>          keyfile="${rpm.ssh.keyfile}"
>          timeout="${ssh.command.timeout}"
>          />
>    </presetdef>
>
>    <presetdef name="rootssh">
>      <rpmssh
>          username="root"
>          timeout="${ssh.rpm.command.timeout}"
>          />
>    </presetdef>
>
> More troublesome is how we check for errors. No simple exit code here,
> instead I have to scan for strings in the response.
>
>    <macrodef name="validate-rpm-result">
>      <attribute name="result"/>
>      <sequential>
>        <echo>
>          @{result}
>        </echo>
>        <fail>
>          <condition>
>            <contains
>                string="@{result}"
>                substring="does not exist"/>
>          </condition>
>          The rpm contains files belonging to an unknown user.
>        </fail>
>      </sequential>
>    </macrodef>
>
> Then, once everything is installed, I do something even scarier - run lots
> of query commands and look for error strings. I do need to automate this
> better; it's on my todo list, and one test project I might use for it would
> be automating the creation of custom hadoop EC2 images, something
> like
>
> -bring up the image
> -push out new RPMs and ssh keys, including JVM versions.
> -create the new AMI
> -set the AMI access rights up.
> -delete the old one.
>
> Like I said, on the todo list.
>
>
>
>
> --
> Steve Loughran                  http://www.1060.org/blogxter/publish/5
> Author: Ant in Action           http://antbook.org/
>

Re: Hadoop Cluster Administration Tools?

Posted by Steve Loughran <st...@apache.org>.
Khalil Honsali wrote:
> Thanks Mr. Steve, and everyone..
> 
> I actually have just 16 machines (normal P4 PCs), so in case I need to do
> things manually it takes half an hour (for example when installing sun-java,
> I had to type that 'yes' for each .bin install)
> but for now I'm OK with pssh or just a simple custom script; however, I'm
> afraid things will get complicated soon enough...
> 
> You said:
> "you can automate rpm install using pure "rpm" command, and check for
> installed artifacts yourself"
> Could you please explain more, I understand you run the same rpm against all
> machines provided the cluster is homogeneous.
> 

1. you can push out the same RPM files to all machines.

2. if you use rpmbuild (ant's <rpm> task does this), you can build your 
own RPMs and push them out, possibly with scp, then run ssh to install them.
http://wiki.smartfrog.org/wiki/display/sf/RPM+Files

3. A lot of linux distros have adopted Yum
http://wiki.smartfrog.org/wiki/display/sf/Pattern+-+Yum


I was discussing Yum support on the Config-Management list last week, 
funnily enough
http://lopsa.org/pipermail/config-mgmt/2008-April/000662.html

Nobody likes automating it much, as
  -it doesn't provide much state information
  -it doesn't let you roll back very easily, or fix what you want

Most people in that group -the CM tool authors - prefer to automate RPM 
install/rollback themselves, so they can stay in control.
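As a hedged sketch of what "automate RPM install/rollback themselves" can look like (the `ensure_pkg` helper and the stub commands below are illustrative, not taken from any of the CM tools mentioned; in real use the query and install commands would be `rpm -q` and `rpm --upgrade`):

```shell
# ensure_pkg: install a package only when it is absent, then report. The
# query step is the "check for installed artifacts yourself" part - rpm -q
# exits non-zero when a package is missing, which is exactly the state
# signal yum fails to give. Query/install are parameters so the flow can
# be exercised without rpm on the box.
ensure_pkg() {
  query=$1 install=$2 pkg=$3
  if $query "$pkg" >/dev/null 2>&1; then
    echo "$pkg: already installed"
  else
    $install "$pkg" && echo "$pkg: installed"
  fi
}

# Local demo with stubs standing in for rpm:
have() { [ "$1" = "jdk" ]; }   # pretend only "jdk" is already present
put()  { :; }                  # pretend installs always succeed
ensure_pkg have put jdk        # -> jdk: already installed
ensure_pkg have put hadoop     # -> hadoop: installed
```

Real use would be `ensure_pkg "rpm -q" "rpm --upgrade" smartfrog` run over ssh on each target, with `rpm --verify` afterwards as the artifact check.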

Looking at how our build.xml file manages test RPMs - that is, from the 
build VMware image to a clean test image - we <scp> and then <ssh> 
the operations:


     <scp remoteToDir="${rpm.ssh.path}"
         passphrase="${rpm.ssh.passphrase}"
         keyfile="${rpm.ssh.keyfile}"
         trust="${rpm.ssh.trust}"
         verbose="${rpm.ssh.verbose}">
       <fileset refid="rpm.upload.fileset"/>
     </scp>



   <target name="rpm-remote-install-all" depends="rpm-upload">
     <rootssh
         command="cd ${rpm.full.ssh.dir};rpm --upgrade --force ${rpm.verbosity} smartfrog-*.rpm"
         outputProperty="rpm.result.all"/>
     <validate-rpm-result result="${rpm.result.all}"/>
   </target>


The <rootssh> preset runs a remote root command

     <presetdef name="rpmssh">
       <sshexec host="${rpm.ssh.server}"
           username="${rpm.ssh.user}"
           passphrase="${rpm.ssh.passphrase}"
           trust="${rpm.ssh.trust}"
           keyfile="${rpm.ssh.keyfile}"
           timeout="${ssh.command.timeout}"
           />
     </presetdef>

     <presetdef name="rootssh">
       <rpmssh
           username="root"
           timeout="${ssh.rpm.command.timeout}"
           />
     </presetdef>

More troublesome is how we check for errors. No simple exit code here; 
instead I have to scan for strings in the response.

     <macrodef name="validate-rpm-result">
       <attribute name="result"/>
       <sequential>
         <echo>
           @{result}
         </echo>
         <fail>
           <condition>
             <contains
                 string="@{result}"
                 substring="does not exist"/>
           </condition>
           The rpm contains files belonging to an unknown user.
         </fail>
       </sequential>
     </macrodef>
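The string-scan that macro performs can be sketched as a plain shell function too (a hypothetical stand-in, matching the same "does not exist" substring the Ant macro checks for):

```shell
# validate_rpm_result: the same check <validate-rpm-result> does - scan
# rpm's captured output for a known error string, since the exit code
# alone does not surface this failure.
validate_rpm_result() {
  case $1 in
    *"does not exist"*)
      echo "the rpm contains files belonging to an unknown user" >&2
      return 1 ;;
    *)
      return 0 ;;
  esac
}

validate_rpm_result "package smartfrog installed" && echo ok   # -> ok
```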

Then, once everything is installed, I do something even scarier - run 
lots of query commands and look for error strings. I do need to automate 
this better; it's on my todo list, and one test project I might use for it 
would be automating the creation of custom Hadoop EC2 images, something 
like:

-bring up the image
-push out new RPMs and ssh keys, including JVM versions.
-create the new AMI
-set the AMI access rights up.
-delete the old one.

Like I said, on the todo list.




-- 
Steve Loughran                  http://www.1060.org/blogxter/publish/5
Author: Ant in Action           http://antbook.org/

Re: Hadoop Cluster Administration Tools?

Posted by Khalil Honsali <k....@gmail.com>.
Thanks Mr. Steve, and everyone.

I actually have just 16 machines (normal P4 PCs), so when I need to do
things manually it takes half an hour (for example, when installing sun-java
I had to type 'yes' for each .bin install).
For now I'm OK with pssh or just a simple custom script; however, I'm
afraid things will get complicated soon enough...

You said:
"you can automate rpm install using pure "rpm" command, and check for
installed artifacts yourself"
Could you please explain more? I understand you run the same rpm against all
machines, provided the cluster is homogeneous.


K. Honsali

2008/4/30 Steve Loughran <st...@apache.org>:

> Bradford Stephens wrote:
>
> > Greetings,
> >
> > I'm compiling a list of (free/OSS) tools commonly used to administer
> > Linux
> > clusters to help my company transition away from Win solutions.
> >
> > I use Ganglia for monitoring the general stats of the machines (Although
> > I
> > didn't get the hadoop metrics to work). I also use ntop to check out
> > network
> > performance (especially with Nutch).
> >
>
> Once you move to larger farms, you have to move away from running stuff by
> hand towards even more automation. You don't really want to work with
> individual machines; just have some central configuration that you adjust and
> let it propagate out. The management tools can detect machines refusing to
> play, and Hadoop should stop placing data and work on them.
>
> -LinuxCOE is how we build images; InstaLinux (http://www.instalinux.com/)
> is a public instance of this. It can create .iso kickstart images that pull
> RPM or deb packages down off local/remote servers.
>
> -Configuration Management becomes your next problem. A lot of the CM tools
> let you declare the state of the machines, they then work to keep the
> machines in that state, detect when they are out of it, and push your
> machines back in to the desired state, or, failing that, start paging you.
> The line between CM and monitoring tools gets kind of blurred.
>
> There are a few open source tools that can do this
>
> http://en.wikipedia.org/wiki/Comparison_of_open_source_configuration_management_software
>
> I'd point you at
>  -Smartfrog (personal bias there,  as I work on it)
>  -puppet
>  -bcfg2
>  -LCFG
>  -Quattor
>
> Then I'd go search the LISA archives to see what other people are up to;
> there are some good papers there. Like this one, "On Designing and Deploying
> Internet-Scale Services":
> http://research.microsoft.com/~jamesrh/TalksAndPapers/JamesRH_Lisa.pdf
>
>
> -steve
>
> --
> Steve Loughran                  http://www.1060.org/blogxter/publish/5
> Author: Ant in Action           http://antbook.org/
>

Re: Hadoop Cluster Administration Tools?

Posted by Steve Loughran <st...@apache.org>.
Bradford Stephens wrote:
> Greetings,
> 
> I'm compiling a list of (free/OSS) tools commonly used to administer Linux
> clusters to help my company transition away from Win solutions.
> 
> I use Ganglia for monitoring the general stats of the machines (Although I
> didn't get the hadoop metrics to work). I also use ntop to check out network
> performance (especially with Nutch).

Once you move to larger farms, you have to move away from running stuff 
by hand towards even more automation. You don't really want to work with 
individual machines; just have some central configuration that you 
adjust and let it propagate out. The management tools can detect 
machines refusing to play, and Hadoop should stop placing data and work 
on them.

-LinuxCOE is how we build images; InstaLinux: http://www.instalinux.com/ 
is a public instance of this. It can create .iso kickstart images that 
pull RPM or deb packages down off local/remote servers.

-Configuration Management becomes your next problem. A lot of the CM 
tools let you declare the state of the machines, they then work to keep 
the machines in that state, detect when they are out of it, and push 
your machines back in to the desired state, or, failing that, start 
paging you. The line between CM and monitoring tools gets kind of blurred.

There are a few open source tools that can do this
http://en.wikipedia.org/wiki/Comparison_of_open_source_configuration_management_software

I'd point you at
  -Smartfrog (personal bias there,  as I work on it)
  -puppet
  -bcfg2
  -LCFG
  -Quattor

Then I'd go search the LISA archives to see what other people are up to; 
there are some good papers there. Like this one, "On Designing and 
Deploying Internet-Scale Services":
http://research.microsoft.com/~jamesrh/TalksAndPapers/JamesRH_Lisa.pdf

-steve

-- 
Steve Loughran                  http://www.1060.org/blogxter/publish/5
Author: Ant in Action           http://antbook.org/

RE: Hadoop Cluster Administration Tools?

Posted by Xavier Stevens <Xa...@fox.com>.
We use C3 and it works pretty well.
 

-----Original Message-----
From: Khalil Honsali [mailto:k.honsali@gmail.com] 
Sent: Tuesday, April 29, 2008 2:34 PM
To: core-user@hadoop.apache.org
Subject: Re: Hadoop Cluster Administration Tools?

As for running a distributed command and getting the output back, I use
either of (in order of preference)
- pssh, dsh, clusterssh
There is also C3 (Cluster Command & Control) but I haven't used it yet.

2008/4/30 Allen Wittenauer <aw...@yahoo-inc.com>:

> On 4/29/08 1:36 PM, "Bradford Stephens" <br...@gmail.com>
> wrote:
> > What do you all use to run your Hadoop clusters? I haven't found a 
> > good
> tool
> > to let me run a command on multiple machines and examine the output,
> yet.
>
>     We basically use a set of custom tools that sit on top of netgroup

> and ssh.  I'm hoping at some point we can share these too, after we 
> spend some time cleaning them up.
>
>


Re: Hadoop Cluster Administration Tools?

Posted by Khalil Honsali <k....@gmail.com>.
As for running a distributed command and getting the output back, I use either
of (in order of preference)
- pssh, dsh, clusterssh
There is also C3 (Cluster Command & Control) but I haven't used it yet.
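For anyone who wants the pattern these tools implement rather than the tools themselves, a minimal sketch (the `run_everywhere` helper and stub runner are hypothetical; real use assumes passwordless ssh to every host):

```shell
# run_everywhere: a tiny stand-in for pssh/dsh. Reads host names from
# stdin, runs the given command on each via the runner, and tags every
# output line with the host name. The runner is a parameter so the loop
# can be exercised locally without a network.
run_everywhere() {
  runner=$1; shift
  remote_cmd=$*
  while read -r host; do
    $runner "$host" "$remote_cmd" 2>&1 | sed "s/^/$host: /"
  done
}

# Real use would look like:
#   printf 'node1\nnode2\n' | run_everywhere ssh uptime
# Local demo with a stub runner standing in for ssh:
stub() { shift; eval "$@"; }
printf 'node1\nnode2\n' | run_everywhere stub echo ok
```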

2008/4/30 Allen Wittenauer <aw...@yahoo-inc.com>:

> On 4/29/08 1:36 PM, "Bradford Stephens" <br...@gmail.com>
> wrote:
> > What do you all use to run your Hadoop clusters? I haven't found a good
> tool
> > to let me run a command on multiple machines and examine the output,
> yet.
>
>     We basically use a set of custom tools that sit on top of netgroup and
> ssh.  I'm hoping at some point we can share these too, after we spend some
> time cleaning them up.
>
>

Re: Hadoop Cluster Administration Tools?

Posted by Allen Wittenauer <aw...@yahoo-inc.com>.
On 4/29/08 1:36 PM, "Bradford Stephens" <br...@gmail.com> wrote:
> What do you all use to run your Hadoop clusters? I haven't found a good tool
> to let me run a command on multiple machines and examine the output, yet.

    We basically use a set of custom tools that sit on top of netgroup and
ssh.  I'm hoping at some point we can share these too, after we spend some
time cleaning them up.