Posted to common-user@hadoop.apache.org by Khalil Honsali <k....@gmail.com> on 2008/05/01 00:33:42 UTC

Re: Hadoop Cluster Administration Tools?

Thanks Mr. Steve, and everyone..

I actually have just 16 machines (normal P4 PCs), so when I need to do
things manually it takes half an hour (for example, when installing sun-java,
I had to type 'yes' for each .bin install).
For now I'm OK with pssh or just a simple custom script; however, I'm
afraid things will get complicated soon enough...
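
That interactive 'yes' prompt can be answered automatically by piping the `yes` command into the installer. A minimal sketch, assuming a hosts file and a JDK installer path that are hypothetical, not from this thread:

```shell
# Sketch only: automate the interactive 'yes' prompt of a self-extracting
# .bin installer on a remote host. The host names and installer file name
# are hypothetical placeholders.
install_on_host() {
  host="$1"; bin="$2"
  # copy the installer over, then pipe `yes` into it so every
  # licence prompt is answered automatically
  scp "$bin" "$host:/tmp/" &&
  ssh "$host" "cd /tmp && yes | sh ./$(basename "$bin")"
}

# e.g. one host per line in hosts.txt:
#   while read h; do install_on_host "$h" jdk-1_5_0-linux-i586.bin; done < hosts.txt
```

The same function drops straight into a pssh or dsh run once the host list grows.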

You said:
"you can automate rpm install using pure "rpm" command, and check for
installed artifacts yourself"
Could you please explain more? I understand you run the same RPM against all
machines, provided the cluster is homogeneous.


K. Honsali

2008/4/30 Steve Loughran <st...@apache.org>:

> Bradford Stephens wrote:
>
> > Greetings,
> >
> > I'm compiling a list of (free/OSS) tools commonly used to administer
> > Linux
> > clusters to help my company transition away from Win solutions.
> >
> > I use Ganglia for monitoring the general stats of the machines (Although
> > I
> > didn't get the hadoop metrics to work). I also use ntop to check out
> > network
> > performance (especially with Nutch).
> >
>
> Once you move to larger farms, you have to move away from running stuff by
> hand to even more automation. You don't really want to work with individual
> machines; instead, have some central configuration that you adjust and let
> propagate out. The management tools can detect machines refusing to play, and
> Hadoop should stop placing data and work on them.
>
> -LinuxCOE is how we build images; InstaLinux (http://www.instalinux.com/) is
> a public instance of this. It can create .iso kickstart images that pull
> RPM or deb packages down off local/remote servers.
>
> -Configuration Management becomes your next problem. A lot of the CM tools
> let you declare the state of the machines, they then work to keep the
> machines in that state, detect when they are out of it, and push your
> machines back in to the desired state, or, failing that, start paging you.
> The line between CM and monitoring tools gets kind of blurred.
>
> There are a few open source tools that can do this
>
> http://en.wikipedia.org/wiki/Comparison_of_open_source_configuration_management_software
>
> I'd point you at
>  -Smartfrog (personal bias there,  as I work on it)
>  -puppet
>  -bcfg2
>  -LCFG
>  -Quattor
>
> Then I'd go search the LISA archives to see what other people are up to;
> there are some good papers there. Like this one, "On Designing and Deploying
> Internet-Scale Services":
> http://research.microsoft.com/~jamesrh/TalksAndPapers/JamesRH_Lisa.pdf
>
>
> -steve
>
> --
> Steve Loughran                  http://www.1060.org/blogxter/publish/5
> Author: Ant in Action           http://antbook.org/
>

Re: Hadoop Cluster Administration Tools?

Posted by Ted Dunning <td...@veoh.com>.
Great idea.

Go for it. Pick a place and start writing. It is a wiki, so if you start
it, others will comment on it.


On 5/2/08 5:28 AM, "Khalil Honsali" <k....@gmail.com> wrote:

> Useful information indeed, though a bit complicated for my level, I must say.
> I think it would be more than useful to post these online, maybe in Hadoop's
> wiki or as an article on cluster resource sites.
> How about it? I can volunteer for this if you wish: a central information
> place on the Hadoop wiki for pre-install cluster admin?
> - OS image install
> - ssh setup
> - dsh ant tools setup
> - rpm automation
> - this.next( ? )
> 
> 2008/5/2 Steve Loughran <st...@apache.org>:
> 
>> Allen Wittenauer wrote:
>> 
>>> On 5/1/08 5:00 PM, "Bradford Stephens" <br...@gmail.com>
>>> wrote:
>>> 
>>>> *Very* cool information. As someone who's leading the transition to
>>>> open-source and cluster-orientation  at a company of about 50 people,
>>>> finding good tools for the IT staff to use is essential. Thanks so
>>>> much for
>>>> the continued feedback.
>>>> 
>>> 
>>>    Hmm.  I should upload my slides.
>>> 
>>> 
>>> 
>> That would be excellent! I was trying not to scare people with things like
>> PXE preboot or the challenge of bringing up a farm of 500+ servers when the
>> building has just suffered a power outage. I will let your slides do that.
>> 
>> The key things people have to remember are
>> -you can't do stuff by hand once you have more than one box; you need to
>> have some story for scaling things up. It could be hand creating some
>> machine image that is cloned, it could be using CM tools. If you find
>> yourself trying to ssh in to boxes to configure them by hand, you are in
>> trouble
>> 
>> -once you have enough racks in your cluster, you can abandon any notion of
>> 100% availability. You have to be prepared to deal with failures as an
>> everyday event. The worst failures are not the machines that drop off the
>> net, it's the ones that start misbehaving with memory corruption or a
>> network card that starts flooding the network.
>> 
>> 
> 
> 
> --


Re: Hadoop Cluster Administration Tools?

Posted by Khalil Honsali <k....@gmail.com>.
Useful information indeed, though a bit complicated for my level, I must say.
I think it would be more than useful to post these online, maybe in Hadoop's
wiki or as an article on cluster resource sites.
How about it? I can volunteer for this if you wish: a central information
place on the Hadoop wiki for pre-install cluster admin?
- OS image install
- ssh setup
- dsh ant tools setup
- rpm automation
- this.next( ? )

2008/5/2 Steve Loughran <st...@apache.org>:

> Allen Wittenauer wrote:
>
> > On 5/1/08 5:00 PM, "Bradford Stephens" <br...@gmail.com>
> > wrote:
> >
> > > *Very* cool information. As someone who's leading the transition to
> > > open-source and cluster-orientation  at a company of about 50 people,
> > > finding good tools for the IT staff to use is essential. Thanks so
> > > much for
> > > the continued feedback.
> > >
> >
> >    Hmm.  I should upload my slides.
> >
> >
> >
> That would be excellent! I was trying not to scare people with things like
> PXE preboot or the challenge of bringing up a farm of 500+ servers when the
> building has just suffered a power outage. I will let your slides do that.
>
> The key things people have to remember are
> -you can't do stuff by hand once you have more than one box; you need to
> have some story for scaling things up. It could be hand creating some
> machine image that is cloned, it could be using CM tools. If you find
> yourself trying to ssh in to boxes to configure them by hand, you are in
> trouble
>
> -once you have enough racks in your cluster, you can abandon any notion of
> 100% availability. You have to be prepared to deal with failures as an
> everyday event. The worst failures are not the machines that drop off the
> net, it's the ones that start misbehaving with memory corruption or a
> network card that starts flooding the network.
>
>


--

Re: Hadoop Cluster Administration Tools?

Posted by Steve Loughran <st...@apache.org>.
Allen Wittenauer wrote:
> On 5/1/08 5:00 PM, "Bradford Stephens" <br...@gmail.com> wrote:
>> *Very* cool information. As someone who's leading the transition to
>> open-source and cluster-orientation  at a company of about 50 people,
>> finding good tools for the IT staff to use is essential. Thanks so much for
>> the continued feedback.
> 
>     Hmm.  I should upload my slides.
> 
> 

That would be excellent! I was trying not to scare people with things 
like PXE preboot or the challenge of bringing up a farm of 500+ servers 
when the building has just suffered a power outage. I will let your 
slides do that.

The key things people have to remember are
-you can't do stuff by hand once you have more than one box; you need to 
have some story for scaling things up. It could be hand creating some 
machine image that is cloned, it could be using CM tools. If you find 
yourself trying to ssh in to boxes to configure them by hand, you are in 
trouble

-once you have enough racks in your cluster, you can abandon any notion
of 100% availability. You have to be prepared to deal with failures as
an everyday event. The worst failures are not the machines that drop off
the net, it's the ones that start misbehaving with memory corruption or
a network card that starts flooding the network.


Re: Hadoop Cluster Administration Tools?

Posted by Allen Wittenauer <aw...@yahoo-inc.com>.
On 5/1/08 5:00 PM, "Bradford Stephens" <br...@gmail.com> wrote:
> *Very* cool information. As someone who's leading the transition to
> open-source and cluster-orientation  at a company of about 50 people,
> finding good tools for the IT staff to use is essential. Thanks so much for
> the continued feedback.

    Hmm.  I should upload my slides.



Re: Hadoop Cluster Administration Tools?

Posted by Bradford Stephens <br...@gmail.com>.
*Very* cool information. As someone who's leading the transition to
open-source and cluster-orientation  at a company of about 50 people,
finding good tools for the IT staff to use is essential. Thanks so much for
the continued feedback.

On Thu, May 1, 2008 at 6:10 AM, Steve Loughran <st...@apache.org> wrote:

> Khalil Honsali wrote:
>
> > Thanks Mr. Steve, and everyone..
> >
> > I actually have just 16 machines (normal P4 PCs), so when I need to do
> > things manually it takes half an hour (for example, when installing
> > sun-java, I had to type 'yes' for each .bin install).
> > For now I'm OK with pssh or just a simple custom script; however, I'm
> > afraid things will get complicated soon enough...
> >
> > You said:
> > "you can automate rpm install using pure "rpm" command, and check for
> > installed artifacts yourself"
> > Could you please explain more? I understand you run the same RPM against
> > all machines, provided the cluster is homogeneous.
> >
> >
> 1. you can push out the same RPM files to all machines.
>
> 2. if you use rpmbuild (ant's <rpm> task does this), you can build your
> own RPMs and push them out, possibly with scp, then run ssh to install them.
> http://wiki.smartfrog.org/wiki/display/sf/RPM+Files
>
> 3. A lot of Linux distros have adopted Yum:
> http://wiki.smartfrog.org/wiki/display/sf/Pattern+-+Yum
>
>
> I was discussing Yum support on the Config-Management list last week,
> funnily enough
> http://lopsa.org/pipermail/config-mgmt/2008-April/000662.html
>
> Nobody likes automating it much, as
>  -it doesn't provide much state information
>  -it doesn't let you roll back very easily, or fix what you want
>
> Most people in that group - the CM tool authors - prefer to automate RPM
> install/rollback themselves, so they can stay in control.
>
> Having a look at how our build.xml file manages test RPMs - that is, from
> the build VMware image to a clean test image - we <scp> and then <ssh> the
> operations:
>
>
>    <scp remoteToDir="${rpm.ssh.path}"
>        passphrase="${rpm.ssh.passphrase}"
>        keyfile="${rpm.ssh.keyfile}"
>        trust="${rpm.ssh.trust}"
>        verbose="${rpm.ssh.verbose}">
>      <fileset refid="rpm.upload.fileset"/>
>    </scp>
>
>
>
>  <target name="rpm-remote-install-all" depends="rpm-upload">
>    <rootssh
>        command="cd ${rpm.full.ssh.dir};rpm --upgrade --force
> ${rpm.verbosity} smartfrog-*.rpm"
>        outputProperty="rpm.result.all"/>
>    <validate-rpm-result result="${rpm.result.all}"/>
>  </target>
>
>
> The <rootssh> preset runs a remote root command
>
>    <presetdef name="rpmssh">
>      <sshexec host="${rpm.ssh.server}"
>          username="${rpm.ssh.user}"
>          passphrase="${rpm.ssh.passphrase}"
>          trust="${rpm.ssh.trust}"
>          keyfile="${rpm.ssh.keyfile}"
>          timeout="${ssh.command.timeout}"
>          />
>    </presetdef>
>
>    <presetdef name="rootssh">
>      <rpmssh
>          username="root"
>          timeout="${ssh.rpm.command.timeout}"
>          />
>    </presetdef>
>
> More troublesome is how we check for errors. No simple exit code here;
> instead I have to scan for strings in the response.
>
>    <macrodef name="validate-rpm-result">
>      <attribute name="result"/>
>      <sequential>
>        <echo>
>          @{result}
>        </echo>
>        <fail>
>          <condition>
>            <contains
>                string="@{result}"
>                substring="does not exist"/>
>          </condition>
>          The rpm contains files belonging to an unknown user.
>        </fail>
>      </sequential>
>    </macrodef>
>
> Then, once everything is installed, I do something even scarier - run lots
> of query commands and look for error strings. I do need to automate this
> better; it's on my todo list, and one of the things I might use as a test
> project would be automating the creation of custom Hadoop EC2 images,
> something like:
>
> -bring up the image
> -push out new RPMs and ssh keys, including JVM versions.
> -create the new AMI
> -set the AMI access rights up.
> -delete the old one.
>
> Like I said, on the todo list.
>
>
>
>
> --
> Steve Loughran                  http://www.1060.org/blogxter/publish/5
> Author: Ant in Action           http://antbook.org/
>

Re: Hadoop Cluster Administration Tools?

Posted by Steve Loughran <st...@apache.org>.
Khalil Honsali wrote:
> Thanks Mr. Steve, and everyone..
> 
> I actually have just 16 machines (normal P4 PCs), so when I need to do
> things manually it takes half an hour (for example, when installing sun-java,
> I had to type 'yes' for each .bin install).
> For now I'm OK with pssh or just a simple custom script; however, I'm
> afraid things will get complicated soon enough...
> 
> You said:
> "you can automate rpm install using pure "rpm" command, and check for
> installed artifacts yourself"
> Could you please explain more? I understand you run the same RPM against all
> machines, provided the cluster is homogeneous.
> 

1. you can push out the same RPM files to all machines.

2. if you use rpmbuild (ant's <rpm> task does this), you can build your 
own RPMs and push them out, possibly with scp, then run ssh to install them.
http://wiki.smartfrog.org/wiki/display/sf/RPM+Files
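
The scp-then-ssh idea in point 2 can be sketched in plain shell; the host names and the package file name below are hypothetical examples, not anything from this thread:

```shell
# Sketch only: copy one RPM to every node and install it over ssh.
# Host names and the package file name are hypothetical placeholders.
push_rpm() {
  rpm_file="$1"; shift          # remaining arguments are the hosts
  for host in "$@"; do
    scp "$rpm_file" "root@$host:/tmp/" &&
    ssh "root@$host" "rpm --upgrade /tmp/$(basename "$rpm_file")"
  done
}

# e.g.: push_rpm smartfrog-3.12.rpm node01 node02 node03
```

This only works cleanly on a homogeneous cluster, which is exactly the assumption in the question above.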

3. A lot of Linux distros have adopted Yum:
http://wiki.smartfrog.org/wiki/display/sf/Pattern+-+Yum


I was discussing Yum support on the Config-Management list last week, 
funnily enough
http://lopsa.org/pipermail/config-mgmt/2008-April/000662.html

Nobody likes automating it much, as
  -it doesn't provide much state information
  -it doesn't let you roll back very easily, or fix what you want

Most people in that group - the CM tool authors - prefer to automate RPM
install/rollback themselves, so they can stay in control.
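
That "stay in control" approach - query RPM state yourself and act only when needed - might look something like this sketch (the package and file names are hypothetical):

```shell
# Sketch only: install an RPM only when `rpm -q` reports it absent, so the
# script is idempotent and safe to re-run across the whole cluster.
# Package and file names are hypothetical placeholders.
ensure_rpm() {
  pkg="$1"; file="$2"
  if rpm -q "$pkg" >/dev/null 2>&1; then
    echo "skip: $pkg present"                      # already installed; do nothing
  else
    rpm --install "$file" && echo "installed: $pkg"
  fi
}

# rollback is the mirror image: `rpm -q` first, then `rpm --erase "$pkg"` if present
```

Because each run reports its own state, the output doubles as the state information that yum doesn't give you.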

Having a look at how our build.xml file manages test RPMs - that is, from
the build VMware image to a clean test image - we <scp> and then <ssh> the
operations:


     <scp remoteToDir="${rpm.ssh.path}"
         passphrase="${rpm.ssh.passphrase}"
         keyfile="${rpm.ssh.keyfile}"
         trust="${rpm.ssh.trust}"
         verbose="${rpm.ssh.verbose}">
       <fileset refid="rpm.upload.fileset"/>
     </scp>



   <target name="rpm-remote-install-all" depends="rpm-upload">
     <rootssh
         command="cd ${rpm.full.ssh.dir};rpm --upgrade --force 
${rpm.verbosity} smartfrog-*.rpm"
         outputProperty="rpm.result.all"/>
     <validate-rpm-result result="${rpm.result.all}"/>
   </target>


The <rootssh> preset runs a remote root command

     <presetdef name="rpmssh">
       <sshexec host="${rpm.ssh.server}"
           username="${rpm.ssh.user}"
           passphrase="${rpm.ssh.passphrase}"
           trust="${rpm.ssh.trust}"
           keyfile="${rpm.ssh.keyfile}"
           timeout="${ssh.command.timeout}"
           />
     </presetdef>

     <presetdef name="rootssh">
       <rpmssh
           username="root"
           timeout="${ssh.rpm.command.timeout}"
           />
     </presetdef>

More troublesome is how we check for errors. No simple exit code here;
instead I have to scan for strings in the response.

     <macrodef name="validate-rpm-result">
       <attribute name="result"/>
       <sequential>
         <echo>
           @{result}
         </echo>
         <fail>
           <condition>
             <contains
                 string="@{result}"
                 substring="does not exist"/>
           </condition>
           The rpm contains files belonging to an unknown user.
         </fail>
       </sequential>
     </macrodef>

Then, once everything is installed, I do something even scarier - run
lots of query commands and look for error strings. I do need to automate
this better; it's on my todo list, and one of the things I might use as a
test project would be automating the creation of custom Hadoop EC2
images, something like:

-bring up the image
-push out new RPMs and ssh keys, including JVM versions.
-create the new AMI
-set the AMI access rights up.
-delete the old one.

Like I said, on the todo list.
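
Those steps could be strung together with the EC2 command-line tools of that era; a rough sketch, where every ID, key path, bucket name and account number is a hypothetical placeholder:

```shell
# Sketch only: refresh a custom Hadoop AMI. All IDs, file names, the bucket
# and the account number are hypothetical placeholders.
refresh_ami() {
  old_ami="$1"; bucket="$2"
  ec2-run-instances "$old_ami" -k my-keypair        # 1. bring up the image
  # 2. ...push new RPMs, ssh keys and JVM onto the instance (scp/ssh as usual)...
  # 3. bundle, upload and register the new image
  ec2-bundle-vol -d /mnt -k pk.pem -c cert.pem -u 123456789012
  ec2-upload-bundle -b "$bucket" -m /mnt/image.manifest.xml -a "$AWS_KEY" -s "$AWS_SECRET"
  new_ami=$(ec2-register "$bucket/image.manifest.xml" | awk '{print $2}')
  # 4. set the access rights (here: public launch permission), 5. retire the old one
  ec2-modify-image-attribute "$new_ami" -l -a all
  ec2-deregister "$old_ami"
}
```

In practice this splits into a remote half and a local half: ec2-bundle-vol has to run on the instance itself, while the register/deregister calls run wherever the API tools are installed.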




-- 
Steve Loughran                  http://www.1060.org/blogxter/publish/5
Author: Ant in Action           http://antbook.org/