Posted to common-user@hadoop.apache.org by Matthias Scherer <Ma...@1und1.de> on 2009/01/21 13:23:31 UTC

Why does Hadoop need ssh access to master and slaves?

Hi all,

We've taken our first steps in evaluating Hadoop. Setting up 2 VMs as a
Hadoop grid was very easy, and it works fine.

Now our operations team wonders why hadoop has to be able to connect to
the master and slaves via password-less ssh?! Can anyone give us an
answer to this question?

Thanks & Regards
Matthias

Re: Why does Hadoop need ssh access to master and slaves?

Posted by Steve Loughran <st...@apache.org>.
Matthias Scherer wrote:
> Hi Steve and Amit,
> 
> Thanks for your answers. I agree with you that key-based ssh is nothing to worry about. But I'm wondering what exactly hadoop does via ssh, that is, which grid administration tasks. Does it restart crashed data nodes or task trackers on the slaves? Or does it transfer data over the grid with ssh access? Where can I find a short description of what exactly hadoop needs ssh for? The documentation says only that I have to configure it.
> 
> Thanks & Regards
> Matthias
> 

SSH is used by the various scripts in bin/ to start and stop clusters.
slaves.sh does the work; the other ones (like hadoop-daemons.sh) use it
to run commands on the machines.
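
To make the fan-out concrete, here is a minimal sketch of what a script like that does. It is not the actual slaves.sh; it assumes a slaves file with one hostname per line and `#` comments, and the hostnames are made up. It only builds the ssh command lines, it does not run them.

```python
import shlex


def parse_slaves(text):
    """Return hostnames from slaves-file text, skipping comments and blanks."""
    hosts = []
    for line in text.splitlines():
        host = line.split("#", 1)[0].strip()  # drop '#' comments
        if host:
            hosts.append(host)
    return hosts


def build_ssh_commands(hosts, remote_argv):
    """Build one non-interactive ssh argv per host."""
    remote = " ".join(shlex.quote(arg) for arg in remote_argv)
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # which is why a script like this needs password-less (key-based) login.
    return [["ssh", "-o", "BatchMode=yes", host, remote] for host in hosts]


if __name__ == "__main__":
    # Hypothetical slaves file content.
    example = "node1.example.com\n# spare\nnode2.example.com  # rack 2\n"
    hosts = parse_slaves(example)
    for argv in build_ssh_commands(
            hosts, ["bin/hadoop-daemon.sh", "start", "datanode"]):
        print(" ".join(argv))
```

Each resulting command could be handed to something like subprocess.run; anything else that can run a command on every node would do the same job.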

The EC2 scripts use SSH to talk to the machines brought up there. When
you ask Amazon for machines, you give it a public key to be added to
root's list of allowed keys; you then use that key to ssh in and run code.
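
As a rough illustration of "adding a public key to the allowed keys list", the sketch below appends a key line to an authorized_keys-style file if it is not already there. The function name and the path you pass in are hypothetical; this is not what the EC2 scripts themselves do.

```python
import os


def authorize_key(public_key_line, authorized_keys_path):
    """Append a public key to an authorized_keys file unless already present.

    Returns True if the key was added, False if it was already listed.
    """
    key = public_key_line.strip()
    content = ""
    if os.path.exists(authorized_keys_path):
        with open(authorized_keys_path) as f:
            content = f.read()
    if key in (line.strip() for line in content.splitlines()):
        return False
    with open(authorized_keys_path, "a") as f:
        if content and not content.endswith("\n"):
            f.write("\n")  # avoid gluing the key onto a previous line
        f.write(key + "\n")
    # With its default StrictModes setting, sshd rejects key files that
    # are writable by other users, so keep the file private.
    os.chmod(authorized_keys_path, 0o600)
    return True
```

Calling it twice with the same key leaves a single entry in the file, so it is safe to re-run during provisioning.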

There is currently no liveness checking or restarting built into the
scripts; you need other tools for that. I am working on this in HADOOP-3628:
https://issues.apache.org/jira/browse/HADOOP-3628

I will be showing some other management options at ApacheCon EU 2009.
Since it is on the same continent and in the same timezone as you, you
may want to consider attending; lots of Hadoop people will be there,
and there are some all-day sessions on Hadoop.
http://eu.apachecon.com/c/aceu2009/sessions/227

One big problem with cluster management is not just recognising failed
nodes but handling them. The actions you take on a VM cluster like EC2
(fix: reboot, then kill that AMI and create a new one) differ from those
on a VMware/Xen-managed cluster, and from those on physical systems
(Y!: phone Allen, us: email paolo). Once we have the health monitoring
in there, different people will need to apply their own policies.
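
A sketch of what "apply their own policies" could look like in code, purely as an illustration: the policy names, action strings and the failure counter are all assumptions, not part of HADOOP-3628. Only the two cases spelled out above are filled in.

```python
def ec2_policy(node, failures):
    """EC2-style handling: reboot first, then replace the machine."""
    return "reboot" if failures <= 1 else "terminate-and-replace"


def physical_policy(node, failures):
    """Physical machines: a person has to be told."""
    return "notify-operator"


# Each site registers the policies that match its own infrastructure.
POLICIES = {"ec2": ec2_policy, "physical": physical_policy}


def handle_failure(cluster_type, node, failures):
    """Pick the site-specific action for a node reported as failed."""
    try:
        policy = POLICIES[cluster_type]
    except KeyError:
        raise ValueError("no failure policy registered for %r" % cluster_type)
    return policy(node, failures)
```

The point of the lookup table is that health monitoring can stay generic while the reaction to a failure is swapped per site.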

-steve

-- 
Steve Loughran                  http://www.1060.org/blogxter/publish/5


Re: Why does Hadoop need ssh access to master and slaves?

Posted by Edward Capriolo <ed...@gmail.com>.
I am looking to create some RA scripts and experiment with starting
hadoop via linux-ha cluster manager.  Linux HA would handle restarting
downed nodes and eliminate the ssh key dependency.

Re: Why does Hadoop need ssh access to master and slaves?

Posted by Matthias Scherer <Ma...@1und1.de>.
Hi Tom,

Thanks for your reply. That's what I wanted to know. And it's good to know that it would not be a show stopper if our ops department wanted to use their own system to control daemons.

Regards
Matthias 



Re: Why does Hadoop need ssh access to master and slaves?

Posted by Tom White <to...@cloudera.com>.
Hi Matthias,

It is not necessary to have SSH set up to run Hadoop, but it does make
things easier. SSH is used by the scripts in the bin directory which
start and stop daemons across the cluster (the slave nodes are defined
in the slaves file), see the start-all.sh script as a starting point.
These scripts are a convenient way to control Hadoop, but there are
other possibilities. If you had another system to control daemons on
your cluster then you wouldn't need SSH.
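
For anyone going the "another system" route, here is a minimal sketch of the local commands such a system would have to run on each node instead of reaching it over SSH. It assumes the bin/hadoop-daemon.sh script and the daemon names of the Hadoop releases current at the time of this thread; the role table and the /opt/hadoop install path are hypothetical.

```python
# Which daemons run where; an assumed layout for a small cluster.
ROLE_DAEMONS = {
    "master": ["namenode", "jobtracker"],
    "slave": ["datanode", "tasktracker"],
}


def local_daemon_commands(role, action, hadoop_home="/opt/hadoop"):
    """Return the argv lists to run locally on a node with the given role."""
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop'")
    if role not in ROLE_DAEMONS:
        raise ValueError("unknown role %r" % role)
    script = hadoop_home.rstrip("/") + "/bin/hadoop-daemon.sh"
    return [[script, action, daemon] for daemon in ROLE_DAEMONS[role]]
```

An init script, a cluster manager agent or a configuration management tool could run these commands on the node itself, so no SSH login from the master is involved.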

Tom


Re: Why does Hadoop need ssh access to master and slaves?

Posted by Matthias Scherer <Ma...@1und1.de>.
Hi Steve and Amit,

Thanks for your answers. I agree with you that key-based ssh is nothing to worry about. But I'm wondering what exactly hadoop does via ssh, that is, which grid administration tasks. Does it restart crashed data nodes or task trackers on the slaves? Or does it transfer data over the grid with ssh access? Where can I find a short description of what exactly hadoop needs ssh for? The documentation says only that I have to configure it.

Thanks & Regards
Matthias



Re: Why does Hadoop need ssh access to master and slaves?

Posted by Steve Loughran <st...@apache.org>.
Amit k. Saha wrote:
> On Wed, Jan 21, 2009 at 5:53 PM, Matthias Scherer
> <Ma...@1und1.de> wrote:
>> Hi all,
>>
>> we've made our first steps in evaluating hadoop. The setup of 2 VMs as a
>> hadoop grid was very easy and works fine.
>>
>> Now our operations team wonders why hadoop has to be able to connect to
>> the master and slaves via password-less ssh?! Can anyone give us an
>> answer to this question?
> 
> 1. There has to be a way to connect to the remote hosts (the slaves and a
> secondary master), and SSH is the secure way to do it
> 2. It has to be password-less to enable automatic logins
> 

SSH is *a* secure way to do it, but not the only way. Other management
tools can bring up hadoop clusters. Hadoop ships with scripted support
for SSH as it is standard with Linux distros and generally the best way
to bring up a remote console.

Matthias,
Your ops team should not be worrying about the SSH security, as long as 
they keep their keys under control.

(a) Key-based SSH is more secure than password-based SSH, as there is no
password for a man-in-the-middle to capture. Passphrase-protected SSH
keys on external USB keys are even better.

(b) Once the cluster is up, the filesystem is pretty vulnerable to
anything on the LAN. You do need to lock down your datacentre, or set up
the firewall/routing of the servers so that only trusted hosts can talk
to the FS. SSH becomes a detail at that point.
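
To show what "only trusted hosts" means as a rule, here is a small sketch of an allow-list check. Real enforcement belongs in the firewall or routing layer, not in a script; the network ranges below are made-up examples.

```python
import ipaddress


def is_trusted(client_ip, trusted_networks):
    """Return True if client_ip falls inside any of the trusted networks."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(net) for net in trusted_networks)


if __name__ == "__main__":
    cluster_nets = ["10.1.0.0/16"]  # hypothetical cluster subnet
    print(is_trusted("10.1.2.3", cluster_nets))
    print(is_trusted("192.168.0.5", cluster_nets))
```

The same list of networks is what you would translate into firewall rules in front of the filesystem ports.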



Re: Why does Hadoop need ssh access to master and slaves?

Posted by "Amit k. Saha" <am...@gmail.com>.
On Wed, Jan 21, 2009 at 5:53 PM, Matthias Scherer
<Ma...@1und1.de> wrote:
> Hi all,
>
> we've made our first steps in evaluating hadoop. The setup of 2 VMs as a
> hadoop grid was very easy and works fine.
>
> Now our operations team wonders why hadoop has to be able to connect to
> the master and slaves via password-less ssh?! Can anyone give us an
> answer to this question?

1. There has to be a way to connect to the remote hosts (the slaves and a
secondary master), and SSH is the secure way to do it
2. It has to be password-less to enable automatic logins

-Amit


>
> Thanks & Regards
> Matthias
>



-- 
Amit Kumar Saha
http://amitksaha.blogspot.com
http://amitsaha.in.googlepages.com/
*Bangalore Open Java Users Group*: http://www.bojug.in