Posted to common-user@hadoop.apache.org by Deepak Diwakar <dd...@gmail.com> on 2010/08/09 20:09:27 UTC

Command line config arguments

Hey friends,

I have a doubt. Suppose I want to pass a program-specific config parameter
on the command line and, after reading it, assign it to the desired local
variable. For example, suppose I am passing a threshold value to the wordcount
example so that it tags only those words that cross the threshold. I declare a
static wordcount member called "threshold", which is set once we read the
command-line config value in run().

When I read the value of the threshold in the mapper in standalone mode, it is
set correctly. But when I run the same job in DFS mode and check the value of
the threshold in the mapper, it is not set. In fact, it takes the default value
that was assigned at the time of declaration.

Currently, whenever I have to do such custom program-related config
assignments, I use a sub-program to store this info in a place called the
metastore and then let the slaves (which run the map-reduce tasks) access it
and set the values of the variables accordingly.

Could somebody point me to another way of doing this?

Appreciate help.


Thanks & regards,
- Deepak Diwakar,
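
(For illustration, a rough sketch of the static-field pattern described above;
the class and field names are hypothetical. In standalone mode the driver and
the map tasks share one JVM, so the assignment made in run() is visible to the
mapper. On a distributed cluster each map task runs in its own JVM on a slave
node, which never executes run() and therefore only ever sees the default
value.)

    public class WordCount {
        // Default assigned at declaration; in distributed mode this is the
        // value the remote task JVMs end up seeing.
        static int threshold = 1;

        public int run(String[] args) throws Exception {
            // Visible only in the JVM that executes run(), i.e. the driver.
            threshold = Integer.parseInt(args[0]);
            // ... configure and submit the job ...
            return 0;
        }
    }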

Re: Best practices - Large Hadoop Cluster

Posted by Joe Stein <ch...@allthingshadoop.com>.
Not sure this was mentioned already, but Adobe open-sourced their Puppet implementation, http://github.com/hstack/puppet, as well as a nice post about it: http://hstack.org/hstack-automated-deployment-using-puppet/

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
Twitter: @allthingshadoop
*/

On Aug 11, 2010, at 7:40 AM, Steve Loughran <st...@apache.org> wrote:

> On 10/08/10 21:06, Raj V wrote:
>> Mike
>> 512 nodes, even a minute for each node (ssh-ing to each node, typing an
>> 8-character password, ensuring that everything looks OK) is about 8.5 hours.
>> After that, if something does not work, that is a different level of pain
>> altogether.
>> 
>> Using scp to exchange keys simply does not scale.
>> 
>> My question was simple: how do other people in the group who run large
>> clusters manage this? Brian put it better: what is the best, repeatable way
>> of running hadoop when the cluster is large? I agree, this is not a hadoop
>> question per se, but hadoop is really what I care about now.
>> 
> 
> 
> SSH is great, but you still shouldn't be playing around trying to do things by hand; even those parallel SSH tools break the moment you have a hint of inconsistency between machines.
> 
> 
> Instead, general practice in managing *any large datacentre-scale application*, be it hadoop or not, is to automate things so the machines do the work themselves, leaving sysadmins to deal with important issues like why all packets are being routed via Singapore or whether the HDD failure rate is statistically significant.
> 
> The standard techniques are usually one of:
> 
> * build your own RPMs, deb files, push out stuff with kickstart, change a machine by rebuilding its root disk.
> Strengths: good for clean builds
> Weaknesses: a lot of work, doesn't do recovery
> 
> * Model driven tools. I know most people now say "yes, puppet", but actually cfEngine and bcfg2 have been around for a while, SmartFrog is what we use. In these tools, you specify what you want, they keep an eye on things and push the machines back into the desired state.
> Strengths: recovers from bad state, keeps the machines close to the desired state
> Weaknesses: if the desired state is not consistent, they tend to circle between the various unreachable states.
> 
> * Scripts. People end up doing this without thinking.
> Strengths: take your commands and script them, strong order to operations
> Weaknesses: bad at recovery.
> 
> * VM images, maintained by hand or another technique
> Strengths: OK if you have one gold image that can be pushed out every time a VM is created -and VMs are short lived.
> Weaknesses: Unless your VMs are short lived, you've just created a maintenance nightmare worse than before.
> 
> 
> Hadoop itself is not too bad at handling failures of individual machines, but the general best practices in large cluster management (look at LISA proceedings) are pretty much foundational.
> 
> http://wiki.smartfrog.org/wiki/display/sf/Patterns+of+Hadoop+Deployment
> 
> -Steve

Re: Best practices - Large Hadoop Cluster

Posted by Steve Loughran <st...@apache.org>.
On 10/08/10 21:06, Raj V wrote:
> Mike
> 512 nodes, even a minute for each node (ssh-ing to each node, typing an
> 8-character password, ensuring that everything looks OK) is about 8.5 hours.
> After that, if something does not work, that is a different level of pain
> altogether.
>
> Using scp to exchange keys simply does not scale.
>
> My question was simple: how do other people in the group who run large clusters
> manage this? Brian put it better: what is the best, repeatable way of
> running hadoop when the cluster is large? I agree, this is not a hadoop
> question per se, but hadoop is really what I care about now.
>


SSH is great, but you still shouldn't be playing around trying to do 
things by hand; even those parallel SSH tools break the moment you have 
a hint of inconsistency between machines.


Instead, general practice in managing *any large datacentre-scale 
application*, be it hadoop or not, is to automate things so the machines do 
the work themselves, leaving sysadmins to deal with important issues 
like why all packets are being routed via Singapore or whether the HDD 
failure rate is statistically significant.

The standard techniques are usually one of:

  * build your own RPMs, deb files, push out stuff with kickstart, 
change a machine by rebuilding its root disk.
  Strengths: good for clean builds
  Weaknesses: a lot of work, doesn't do recovery

  * Model driven tools. I know most people now say "yes, puppet", but 
actually cfEngine and bcfg2 have been around for a while, SmartFrog is 
what we use. In these tools, you specify what you want, they keep an eye 
on things and push the machines back into the desired state.
  Strengths: recovers from bad state, keeps the machines close to the 
desired state
  Weaknesses: if the desired state is not consistent, they tend to 
circle between the various unreachable states.

  * Scripts. People end up doing this without thinking.
  Strengths: take your commands and script them, strong order to operations
  Weaknesses: bad at recovery.

* VM images, maintained by hand or another technique
  Strengths: OK if you have one gold image that can be pushed out every 
time a VM is created -and VMs are short lived.
  Weaknesses: Unless your VMs are short lived, you've just created a 
maintenance nightmare worse than before.


Hadoop itself is not too bad at handling failures of individual 
machines, but the general best practices in large cluster management 
(look at LISA proceedings) are pretty much foundational.

http://wiki.smartfrog.org/wiki/display/sf/Patterns+of+Hadoop+Deployment

-Steve

Re: Best practices - Large Hadoop Cluster

Posted by Praveen Yarlagadda <pr...@gmail.com>.
Raj,

If you use cluster ssh, you can do it quickly. You can log onto several
hosts at once and then execute commands on all of them simultaneously. I have
used it to manage 64 nodes.

Thanks,
Praveen



On Tue, Aug 10, 2010 at 4:55 PM, Michael Segel <mi...@hotmail.com> wrote:

>
> Raj...
>
> Ok, one of the things we have at one of my clients is the hadoop user's
> account is actually a centralized account. (User's accounts are mounted as
> they log in to the machine.)
> So you have a single account hadoop for all of the machines.
>
> So when you set up the keys, they are in the ~hadoop account.
>
> So you have a bit of work with 512 nodes, and yeah, it's painful the first
> time.
>
> Like I said, I don't have a cloud of 512 nodes, and when I am building the
> cloud of 20+ machines, setting up ssh is just part of the process.
>
> If you set up hadoop as a system service, then does that mean when you boot
> the machine, your node goes up on its own like other services?
> I personally don't think that's a good idea...
>
> I haven't evaluated puppet; I'm pulled yet again into other things....
>
> So I don't have an answer.
>
> My point was that you can go through and add the user/password keys as part
> of the build process, and while painful, it's not that painful. (Trust me,
> there are worse things that can get dropped on your desk. ;-)
>
> -Mike
>
>
> > Date: Tue, 10 Aug 2010 13:06:51 -0700
> > From: rajvish@yahoo.com
> > Subject: Re: Best practices - Large Hadoop Cluster
> > To: common-user@hadoop.apache.org
> >
> > Mike
> > 512 nodes, even a minute for each node (ssh-ing to each node, typing an
> > 8-character password, ensuring that everything looks OK) is about 8.5
> > hours. After that, if something does not work, that is a different level
> > of pain altogether.
> >
> > Using scp to exchange keys simply does not scale.
> >
> > My question was simple: how do other people in the group who run large
> > clusters manage this? Brian put it better: what is the best, repeatable
> > way of running hadoop when the cluster is large? I agree, this is not a
> > hadoop question per se, but hadoop is really what I care about now.
> >
> > Thanks to others for useful suggestions. I will examine them and post a
> > summary if anyone is interested.
> >
> > Raj
> >
> >
> >
> >
> >
> > ________________________________
> > From: Michael Segel <mi...@hotmail.com>
> > To: common-user@hadoop.apache.org
> > Sent: Tue, August 10, 2010 11:36:14 AM
> > Subject: RE: Best practices - Large Hadoop Cluster
> >
> >
> > I'm a little confused by Raj's problem.
> >
> > If you follow the instructions outlined in the Hadoop books and
> everywhere else
> > about setting up ssh keys, you shouldn't have a problem.
> > I'd just ssh as the hadoop user to each of the nodes before trying to
> start
> > hadoop for the first time.
> >
> > At 512 nodes, I think you may run into other issues... (I don't know, I
> > don't have 512 machines to play with :-( ) And puppet has been recommended
> > a couple of times.
> >
> > Just my $0.02
> >
> > -Mike
> >
> >
> > > Date: Tue, 10 Aug 2010 23:43:12 +0530
> > > From: gokulm@huawei.com
> > > Subject: RE: Best practices - Large Hadoop Cluster
> > > To: common-user@hadoop.apache.org
> > >
> > >
> > > Hi Raj,
> > >
> > >     As per my understanding, the problem is the ssh password each time
> > > you start/stop the cluster. You need passwordless startup/shutdown,
> > > right?
> > >
> > >     Here is my way of overcoming the ssh problem
> > >
> > >     Write a shell script as follows:
> > >
> > >     1. Generate a ssh key from the namenode machine (where you will
> > > start/stop the cluster)
> > >
> > >     2. Read each entry from the conf/slaves file and do the following
> > >
> > >         2.1 add the key you generated in step 1 to the ssh
> > > authorized_keys file of the datanode machine that you got in step 2
> > > something like below script
> > >             cat $HOME/.ssh/public_key_file | ssh username@host '
> > > cat >> $HOME/.ssh/authorized_keys'
> > >
> > >
> > >     3. Repeat step 2 for conf/masters also
> > >
> > >     Note: Password must be specified for the specified username@host
> > > first time since the ssh command given in point 2.1 requires it.
> > >
> > >     Now you can start/stop your hadoop cluster without ssh password
> > > overhead
> > >
> > >
> > >  Thanks,
> > >   Gokul
> > >
> > >
> > >
> > >
> ****************************************************************************
> > > ***********
> > >
> > > -----Original Message-----
> > > From: Raj V [mailto:rajvish@yahoo.com]
> > > Sent: Tuesday, August 10, 2010 7:16 PM
> > > To: common-user@hadoop.apache.org
> > > Subject: Best practices - Large Hadoop Cluster
> > >
> > > I need to start setting up a large - hadoop cluster of 512 nodes . My
> > > biggest
> > > problem is the SSH keys. Is there a simpler way of generating and
> exchanging
> > > ssh
> > > keys among the nodes? Any best practices? If there is none, I could
> > > volunteer to
> > > do it,
> > >
> > > Raj
>
>

RE: Best practices - Large Hadoop Cluster

Posted by Michael Segel <mi...@hotmail.com>.
Raj...

OK, one of the things we have at one of my clients is that the hadoop user's account is actually a centralized account. (Users' accounts are mounted as they log in to the machine.)
So you have a single hadoop account for all of the machines.

So when you set up the keys, they are in the ~hadoop account.

So you have a bit of work with 512 nodes, and yeah, it's painful the first time.

Like I said, I don't have a cloud of 512 nodes, and when I am building the cloud of 20+ machines, setting up ssh is just part of the process.

If you set up hadoop as a system service, then does that mean when you boot the machine, your node goes up on its own like other services? 
I personally don't think that's a good idea...

I haven't evaluated puppet; I'm pulled yet again into other things....

So I don't have an answer.

My point was that you can go through and add the user/password keys as part of the build process, and while painful, it's not that painful. (Trust me, there are worse things that can get dropped on your desk. ;-)

-Mike


> Date: Tue, 10 Aug 2010 13:06:51 -0700
> From: rajvish@yahoo.com
> Subject: Re: Best practices - Large Hadoop Cluster
> To: common-user@hadoop.apache.org
> 
> Mike
> 512 nodes, even a minute for each node (ssh-ing to each node, typing an
> 8-character password, ensuring that everything looks OK) is about 8.5 hours.
> After that, if something does not work, that is a different level of pain
> altogether.
> 
> Using scp to exchange keys simply does not scale.
> 
> My question was simple: how do other people in the group who run large clusters
> manage this? Brian put it better: what is the best, repeatable way of
> running hadoop when the cluster is large? I agree, this is not a hadoop
> question per se, but hadoop is really what I care about now.
> 
> Thanks to others for useful suggestions. I will examine them and post a summary 
> if anyone is interested.
> 
> Raj
> 
> 
> 
> 
> 
> ________________________________
> From: Michael Segel <mi...@hotmail.com>
> To: common-user@hadoop.apache.org
> Sent: Tue, August 10, 2010 11:36:14 AM
> Subject: RE: Best practices - Large Hadoop Cluster
> 
> 
> I'm a little confused by Raj's problem.
> 
> If you follow the instructions outlined in the Hadoop books and everywhere else 
> about setting up ssh keys, you shouldn't have a problem.
> I'd just ssh as the hadoop user to each of the nodes before trying to start 
> hadoop for the first time.
> 
> At 512 nodes, I think you may run into other issues... (I don't know, I don't 
> have 512 machines to play with :-(  ) And puppet has been recommended a couple 
> of times.
> 
> Just my $0.02
> 
> -Mike
> 
> 
> > Date: Tue, 10 Aug 2010 23:43:12 +0530
> > From: gokulm@huawei.com
> > Subject: RE: Best practices - Large Hadoop Cluster
> > To: common-user@hadoop.apache.org
> > 
> > 
> > Hi Raj,
> > 
> >     As per my understanding, the problem is the ssh password each time
> > you start/stop the cluster. You need passwordless startup/shutdown, right?
> > 
> >     Here is my way of overcoming the ssh problem 
> > 
> >     Write a shell script as follows:
> >     
> >     1. Generate a ssh key from the namenode machine (where you will
> > start/stop the cluster)
> > 
> >     2. Read each entry from the conf/slaves file and do the following
> >     
> >         2.1 add the key you generated in step 1 to the ssh
> > authorized_keys file of the datanode machine that you got in step 2
> > something like below script
> >             cat $HOME/.ssh/public_key_file | ssh username@host '
> > cat >> $HOME/.ssh/authorized_keys'
> > 
> > 
> >     3. Repeat step 2 for conf/masters also
> > 
> >     Note: Password must be specified for the specified username@host
> > first time since the ssh command given in point 2.1 requires it. 
> >         
> >     Now you can start/stop your hadoop cluster without ssh password
> > overhead
> > 
> > 
> >  Thanks,
> >   Gokul
> >  
> >    
> >  
> > ****************************************************************************
> > ***********
> > 
> > -----Original Message-----
> > From: Raj V [mailto:rajvish@yahoo.com] 
> > Sent: Tuesday, August 10, 2010 7:16 PM
> > To: common-user@hadoop.apache.org
> > Subject: Best practices - Large Hadoop Cluster
> > 
> > I need to start setting up a large - hadoop cluster of 512 nodes . My
> > biggest 
> > problem is the SSH keys. Is there a simpler way of generating and exchanging
> > ssh 
> > keys among the nodes? Any best practices? If there is none, I could
> > volunteer to 
> > do it,
> > 
> > Raj
 		 	   		  

Re: Best practices - Large Hadoop Cluster

Posted by zGreenfelder <zg...@gmail.com>.
On Tue, Aug 10, 2010 at 4:06 PM, Raj V <ra...@yahoo.com> wrote:
> Mike
> 512 nodes, even a minute for each node ( ssh-ing to each node, typing a 8

> Thanks to others for useful suggestions. I will examine them and post a summary
> if anyone is interested.
>
> Raj
>

I may well be oversimplifying things, and of course security is always
a concern... but wouldn't it make more sense to generate the ssh key
on one of your central admin machines, then create the authorized_keys
file based on that, and have the system network installer (pxe,
kickstart, autoyast or whatever) install the authorized_keys file for
a particular user ID?

So all machines for a given ID would have the exact same authorized_keys file
(and perhaps the same id/key if you want the hosts to be able to cross-access,
e.g. from central admin -> node.x and from node.x -> node.y). The downside
would be that one lost key would give away the whole kingdom, instead of a
single host... but security is always a balance between risk and usability.

Barring that... maybe configure everything to use Kerberos authentication in
sshd and set up/maintain all of that wonderful fun, although that's a whole
can of worms I'd be very reluctant to open, personally.

-- 
Even the Magic 8 ball has an opinion on email clients: Outlook not so good.

Re: Best practices - Large Hadoop Cluster

Posted by Raj V <ra...@yahoo.com>.
Mike
512 nodes, even a minute for each node (ssh-ing to each node, typing an
8-character password, ensuring that everything looks OK) is about 8.5 hours.
After that, if something does not work, that is a different level of pain
altogether.

Using scp to exchange keys simply does not scale.

My question was simple: how do other people in the group who run large clusters
manage this? Brian put it better: what is the best, repeatable way of
running hadoop when the cluster is large? I agree, this is not a hadoop
question per se, but hadoop is really what I care about now.

Thanks to others for useful suggestions. I will examine them and post a summary 
if anyone is interested.

Raj





________________________________
From: Michael Segel <mi...@hotmail.com>
To: common-user@hadoop.apache.org
Sent: Tue, August 10, 2010 11:36:14 AM
Subject: RE: Best practices - Large Hadoop Cluster


I'm a little confused by Raj's problem.

If you follow the instructions outlined in the Hadoop books and everywhere else 
about setting up ssh keys, you shouldn't have a problem.
I'd just ssh as the hadoop user to each of the nodes before trying to start 
hadoop for the first time.

> At 512 nodes, I think you may run into other issues... (I don't know, I don't 
have 512 machines to play with :-(  ) And puppet has been recommended a couple 
of times.

Just my $0.02

-Mike


> Date: Tue, 10 Aug 2010 23:43:12 +0530
> From: gokulm@huawei.com
> Subject: RE: Best practices - Large Hadoop Cluster
> To: common-user@hadoop.apache.org
> 
> 
> Hi Raj,
> 
>     As per my understanding, the problem is the ssh password each time
> you start/stop the cluster. You need passwordless startup/shutdown, right?
> 
>     Here is my way of overcoming the ssh problem 
> 
>     Write a shell script as follows:
>     
>     1. Generate a ssh key from the namenode machine (where you will
> start/stop the cluster)
> 
>     2. Read each entry from the conf/slaves file and do the following
>     
>         2.1 add the key you generated in step 1 to the ssh
> authorized_keys file of the datanode machine that you got in step 2
> something like below script
>             cat $HOME/.ssh/public_key_file | ssh username@host '
> cat >> $HOME/.ssh/authorized_keys'
> 
> 
>     3. Repeat step 2 for conf/masters also
> 
>     Note: Password must be specified for the specified username@host
> first time since the ssh command given in point 2.1 requires it. 
>         
>     Now you can start/stop your hadoop cluster without ssh password
> overhead
> 
> 
>  Thanks,
>   Gokul
>  
>    
>  
> ****************************************************************************
> ***********
> 
> -----Original Message-----
> From: Raj V [mailto:rajvish@yahoo.com] 
> Sent: Tuesday, August 10, 2010 7:16 PM
> To: common-user@hadoop.apache.org
> Subject: Best practices - Large Hadoop Cluster
> 
> I need to start setting up a large - hadoop cluster of 512 nodes . My
> biggest 
> problem is the SSH keys. Is there a simpler way of generating and exchanging
> ssh 
> keys among the nodes? Any best practices? If there is none, I could
> volunteer to 
> do it,
> 
> Raj

RE: Best practices - Large Hadoop Cluster

Posted by Michael Segel <mi...@hotmail.com>.
I'm a little confused by Raj's problem.

If you follow the instructions outlined in the Hadoop books and everywhere else about setting up ssh keys, you shouldn't have a problem.
I'd just ssh as the hadoop user to each of the nodes before trying to start hadoop for the first time.

At 512 nodes, I think you may run into other issues... (I don't know, I don't have 512 machines to play with :-( ) And puppet has been recommended a couple of times.

Just my $0.02

-Mike


> Date: Tue, 10 Aug 2010 23:43:12 +0530
> From: gokulm@huawei.com
> Subject: RE: Best practices - Large Hadoop Cluster
> To: common-user@hadoop.apache.org
> 
> 
> Hi Raj,
> 
> 	As per my understanding, the problem is the ssh password each time
> you start/stop the cluster. You need passwordless startup/shutdown, right?
> 
> 	Here is my way of overcoming the ssh problem 
> 
> 	Write a shell script as follows:
> 	
> 	1. Generate a ssh key from the namenode machine (where you will
> start/stop the cluster)
> 
> 	2. Read each entry from the conf/slaves file and do the following
> 	
> 		2.1 add the key you generated in step 1 to the ssh
> authorized_keys file of the datanode machine that you got in step 2
> something like below script
> 			cat $HOME/.ssh/public_key_file | ssh username@host '
> cat >> $HOME/.ssh/authorized_keys'
> 
> 
> 	3. Repeat step 2 for conf/masters also
> 
> 	Note: Password must be specified for the specified username@host
> first time since the ssh command given in point 2.1 requires it. 
> 		
> 	Now you can start/stop your hadoop cluster without ssh password
> overhead
> 
> 
>  Thanks,
>   Gokul
>  
>    
>  
> ****************************************************************************
> ***********
> 
> -----Original Message-----
> From: Raj V [mailto:rajvish@yahoo.com] 
> Sent: Tuesday, August 10, 2010 7:16 PM
> To: common-user@hadoop.apache.org
> Subject: Best practices - Large Hadoop Cluster
> 
> I need to start setting up a large - hadoop cluster of 512 nodes . My
> biggest 
> problem is the SSH keys. Is there a simpler way of generating and exchanging
> ssh 
> keys among the nodes? Any best practices? If there is none, I could
> volunteer to 
> do it,
> 
> Raj
 		 	   		  

Re: Best practices - Large Hadoop Cluster

Posted by Allen Wittenauer <aw...@linkedin.com>.
On Aug 10, 2010, at 6:46 AM, Raj V wrote:

> I need to start setting up a large - hadoop cluster of 512 nodes . My biggest 
> problem is the SSH keys. Is there a simpler way of generating and exchanging ssh 
> keys among the nodes? Any best practices? If there is none, I could volunteer to 
> do it,

For hosts? DNS SSHFP records

For users?  Use Kerberos + SSH with GSSAPI enabled

Re: Best practices - Large Hadoop Cluster

Posted by Gregor Willemsen <gr...@googlemail.com>.
Hi Raj,

maybe this link is worth looking at: https://fedorahosted.org/func/.
Although it comes from the Red Hat community, the documentation provides
hints for running a setup on Debian systems.

Gregor

2010/8/10 Edward Capriolo <ed...@gmail.com>:
> On Tue, Aug 10, 2010 at 10:01 AM, Brian Bockelman <bb...@cse.unl.edu> wrote:
>> Hi Raj,
>>
>> I believe the best practice is to *not* start up Hadoop over SSH.  Set it up as a system service and let your configuration management software take care of it.
>>
>> You probably want to look at ROCKS or one of its variants, or at least something like puppet or cfEngine.
>>
>> Brian
>>
>> On Aug 10, 2010, at 8:46 AM, Raj V wrote:
>>
>>> I need to start setting up a large - hadoop cluster of 512 nodes . My biggest
>>> problem is the SSH keys. Is there a simpler way of generating and exchanging ssh
>>> keys among the nodes? Any best practices? If there is none, I could volunteer to
>>> do it,
>>>
>>> Raj
>>
>>
>
> Shameless blog plug -alternative to ssh keys-
> http://www.edwardcapriolo.com/roller/edwardcapriolo/date/20100716
>

Re: Best practices - Large Hadoop Cluster

Posted by Edward Capriolo <ed...@gmail.com>.
On Tue, Aug 10, 2010 at 10:01 AM, Brian Bockelman <bb...@cse.unl.edu> wrote:
> Hi Raj,
>
> I believe the best practice is to *not* start up Hadoop over SSH.  Set it up as a system service and let your configuration management software take care of it.
>
> You probably want to look at ROCKS or one of its variants, or at least something like puppet or cfEngine.
>
> Brian
>
> On Aug 10, 2010, at 8:46 AM, Raj V wrote:
>
>> I need to start setting up a large - hadoop cluster of 512 nodes . My biggest
>> problem is the SSH keys. Is there a simpler way of generating and exchanging ssh
>> keys among the nodes? Any best practices? If there is none, I could volunteer to
>> do it,
>>
>> Raj
>
>

Shameless blog plug -alternative to ssh keys-
http://www.edwardcapriolo.com/roller/edwardcapriolo/date/20100716

Re: Best practices - Large Hadoop Cluster

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hi Raj,

I believe the best practice is to *not* start up Hadoop over SSH.  Set it up as a system service and let your configuration management software take care of it.

You probably want to look at ROCKS or one of its variants, or at least something like puppet or cfEngine.

Brian

On Aug 10, 2010, at 8:46 AM, Raj V wrote:

> I need to start setting up a large - hadoop cluster of 512 nodes . My biggest 
> problem is the SSH keys. Is there a simpler way of generating and exchanging ssh 
> keys among the nodes? Any best practices? If there is none, I could volunteer to 
> do it,
> 
> Raj


Re: Best practices - Large Hadoop Cluster

Posted by Brian Bockelman <bb...@cse.unl.edu>.
I think Raj's question was about best practices, not how to do it.  Best practice is definitely *not* to manage configurations one by one.  This is not a Hadoop question, it's a "how do I manage a lot of computers" question.

Best practices is some combination of:
1) Automated way of installing OS / software on all nodes according to a given profile (cobbler / ROCKS / perceus are ways of doing RHEL-variants), then run Hadoop as a system service.  This guarantees you are able to replicate the configuration of a system.
2) Use puppet or cfEngine to enforce system configuration properties (such as running daemons).  This is sometimes more useful for rapidly changing environments.

While it is possible to manage things using passwordless ssh-based commands (in fact, ROCKS will automatically do a nice passwordless ssh setup for you), this is often a few steps from disaster.  It's all too easy to make undocumented changes - and you don't know how many undocumented changes there are until your sysadmin leaves and it becomes a disaster.

So - best practices for the large scale for running service X is not using ssh.  It is to use the service management techniques provided by your operating system or some other service management tool accepted by your organization (for example, SmartFrog from HP Labs goes above and beyond Linux's somewhat antiquated system).  This statement does not change if X="Hadoop".

Brian

On Aug 10, 2010, at 1:13 PM, Gokulakannan M wrote:

> 
> Hi Raj,
> 
> 	As per my understanding, the problem is the ssh password each time
> you start/stop the cluster. You need passwordless startup/shutdown, right?
> 
> 	Here is my way of overcoming the ssh problem 
> 
> 	Write a shell script as follows:
> 	
> 	1. Generate a ssh key from the namenode machine (where you will
> start/stop the cluster)
> 
> 	2. Read each entry from the conf/slaves file and do the following
> 	
> 		2.1 add the key you generated in step 1 to the ssh
> authorized_keys file of the datanode machine that you got in step 2
> something like below script
> 			cat $HOME/.ssh/public_key_file | ssh username@host '
> cat >> $HOME/.ssh/authorized_keys'
> 
> 
> 	3. Repeat step 2 for conf/masters also
> 
> 	Note: Password must be specified for the specified username@host
> first time since the ssh command given in point 2.1 requires it. 
> 		
> 	Now you can start/stop your hadoop cluster without ssh password
> overhead
> 
> 
> Thanks,
>  Gokul
> 
> 
> 
> ****************************************************************************
> ***********
> 
> -----Original Message-----
> From: Raj V [mailto:rajvish@yahoo.com] 
> Sent: Tuesday, August 10, 2010 7:16 PM
> To: common-user@hadoop.apache.org
> Subject: Best practices - Large Hadoop Cluster
> 
> I need to start setting up a large - hadoop cluster of 512 nodes . My
> biggest 
> problem is the SSH keys. Is there a simpler way of generating and exchanging
> ssh 
> keys among the nodes? Any best practices? If there is none, I could
> volunteer to 
> do it,
> 
> Raj


RE: Best practices - Large Hadoop Cluster

Posted by Gokulakannan M <go...@huawei.com>.
Hi Raj,

	As per my understanding, the problem is the ssh password each time
you start/stop the cluster. You need passwordless startup/shutdown, right?

	Here is my way of overcoming the ssh problem 

	Write a shell script as follows:
	
	1. Generate an ssh key on the namenode machine (where you will
start/stop the cluster).

	2. Read each entry from the conf/slaves file and do the following
	
		2.1 Add the key you generated in step 1 to the ssh
authorized_keys file of the datanode machine from step 2, with
something like the script below:
			cat $HOME/.ssh/public_key_file | ssh username@host 'cat >> $HOME/.ssh/authorized_keys'


	3. Repeat step 2 for conf/masters also

	Note: The password must be entered for each username@host the
first time, since the ssh command in step 2.1 requires it.
		
	Now you can start/stop your hadoop cluster without ssh password
overhead


 Thanks,
  Gokul
 
   
 
****************************************************************************
***********

-----Original Message-----
From: Raj V [mailto:rajvish@yahoo.com] 
Sent: Tuesday, August 10, 2010 7:16 PM
To: common-user@hadoop.apache.org
Subject: Best practices - Large Hadoop Cluster

I need to start setting up a large - hadoop cluster of 512 nodes . My
biggest 
problem is the SSH keys. Is there a simpler way of generating and exchanging
ssh 
keys among the nodes? Any best practices? If there is none, I could
volunteer to 
do it,

Raj

Best practices - Large Hadoop Cluster

Posted by Raj V <ra...@yahoo.com>.
I need to start setting up a large hadoop cluster of 512 nodes. My biggest 
problem is the SSH keys. Is there a simpler way of generating and exchanging ssh 
keys among the nodes? Any best practices? If there is none, I could volunteer to 
do it.

Raj

Re: Command line config arguments

Posted by Chaitanya Krishna <ch...@gmail.com>.
Hi Deepak,

  As far as I understand, you want a threshold value that is set once and is
the same across all the slaves.
In that case, you can set it as a configuration property using the
Configuration.set() method in run(). Or, if you are passing it as a
command-line argument while running the job (using '-D'), then it is set for
you in the configuration whenever the job's configuration is created.

  In case you want a different value for each map, you can get the config
parameter and set the local variable in the mapper method itself instead of
in run().

Hope this helps,
Chaitanya.
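
(A minimal sketch of the Configuration-based approach described above; the
property name "wordcount.threshold" and the class names are made up for this
example, and the usual input/output setup is elided. The driver stores the
value in the job's Configuration in run(), and each mapper reads it back in
setup(), so the value travels with the job instead of living in a static field
that only the driver JVM ever sees.)

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class ThresholdWordCount extends Configured implements Tool {

        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {

            private int threshold;

            @Override
            protected void setup(Context context) {
                // Runs in the task JVM on the slave; reads whatever the
                // driver put into the job configuration (default 1).
                threshold = context.getConfiguration().getInt("wordcount.threshold", 1);
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // ... tokenize `value` and use `threshold` as needed ...
            }
        }

        @Override
        public int run(String[] args) throws Exception {
            // ToolRunner has already copied any -D wordcount.threshold=N
            // argument into this Configuration; calling
            // conf.setInt("wordcount.threshold", n) here works just as well.
            Configuration conf = getConf();

            Job job = new Job(conf, "threshold wordcount");
            job.setJarByClass(ThresholdWordCount.class);
            job.setMapperClass(TokenMapper.class);
            // ... set reducer, input/output formats and paths as usual ...
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(), new ThresholdWordCount(), args));
        }
    }

With the Tool/ToolRunner plumbing in place, the same property can be supplied
on the command line, e.g. "hadoop jar wordcount.jar ThresholdWordCount
-D wordcount.threshold=5 <input> <output>" (jar and path names are
placeholders), and the mapper picks it up the same way.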



On Mon, Aug 9, 2010 at 11:39 PM, Deepak Diwakar <dd...@gmail.com> wrote:

> Hey friends,
>
> I have a doubt. Suppose I want to pass a program-specific config parameter
> on the command line and, after reading it, assign it to the desired local
> variable. For example, suppose I am passing a threshold value to the wordcount
> example so that it tags only those words that cross the threshold. I declare a
> static wordcount member called "threshold", which is set once we read the
> command-line config value in run().
>
> When I read the value of the threshold in the mapper in standalone mode, it is
> set correctly. But when I run the same job in DFS mode and check the value of
> the threshold in the mapper, it is not set. In fact, it takes the default value
> that was assigned at the time of declaration.
>
> Currently, whenever I have to do such custom program-related config
> assignments, I use a sub-program to store this info in a place called the
> metastore and then let the slaves (which run the map-reduce tasks) access it
> and set the values of the variables accordingly.
>
> Could somebody point me to another way of doing this?
>
> Appreciate help.
>
>
> Thanks & regards,
> - Deepak Diwakar,
>