Posted to user@whirr.apache.org by John Conwell <jo...@iamjohn.me> on 2011/06/14 18:26:51 UTC

hadoop security and ssh proxy

I get the whole "security is a good thing" thing, but could someone give me
a description of why, when Whirr configures Hadoop, it sets up the SSH
proxy to disallow all communication to the data/task nodes except via the
name node over the proxy?  If I'm running on EC2, won't correctly setting
up security groups give me enough security?

The reason I ask is that I'm using Whirr through its API to
automate...well...all the cool things Whirr does.  But the key point is
automation.  After a Hadoop cluster is up and running, I'd like the program
to kick off a Hadoop job and monitor its jobs and tasks.  But that means my
program has to launch hadoop-proxy.sh somehow, capture the PID of the
process, kick off my Hadoop job, and then, when done, kill the process via
the PID.  The whole business of calling a shell script, capturing the PID,
persisting it, and killing it, all through my Java automation, just seems a
bit duct-tape-and-baling-wire-ish.


So I'm trying to figure out why we have the whole hadoop-proxy.sh thing in
the first place (specifically within the context of EC2).
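
[Archive note: a runnable sketch of the bookkeeping described above.  The
hadoop-proxy.sh path and the stand-in command ("sleep", used so the sketch
runs without a cluster) are assumptions, and Process.pid() needs Java 9+,
so this shows the shape of the problem rather than exact 2011-era code.]

```java
// Sketch of the manual proxy bookkeeping: launch the script, capture its
// PID, run the job, then kill and reap the process. "sleep 60" stands in
// for the real ./hadoop-proxy.sh so the sketch runs without a cluster.
import java.io.IOException;

public class ManualProxySketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Would be: new ProcessBuilder("./hadoop-proxy.sh").start();
        Process proxy = new ProcessBuilder("sleep", "60").start();
        long pid = proxy.pid(); // the PID that has to be tracked somewhere
        System.out.println("captured a pid: " + (pid > 0)); // captured a pid: true
        // ... kick off the Hadoop job and monitor it here ...
        proxy.destroy();  // kill the proxy once the job is done
        proxy.waitFor();  // reap it so nothing is left behind
        System.out.println("proxy alive: " + proxy.isAlive()); // proxy alive: false
    }
}
```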

-- 

Thanks,
John C

Re: hadoop security and ssh proxy

Posted by John Conwell <jo...@iamjohn.me>.
Oh man, I didn't know there was a HadoopProxy class that actually has start
and stop methods.  I was starting it via Runtime.getRuntime().exec().
That's so much nicer.
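
[Archive note: a minimal sketch of the lifecycle pattern HadoopProxy gives
you.  The real class needs a live ClusterSpec and Cluster from a running
Whirr launch, so a stand-in with the same start()/stop() shape is used
below; the commented call is an assumption about the API drawn from this
thread, not verified usage.]

```java
// Sketch of the start()/stop() lifecycle pattern. Real (assumed) Whirr usage:
//   HadoopProxy proxy = new HadoopProxy(clusterSpec, cluster);
//   proxy.start();
//   try { /* submit and monitor the Hadoop job */ } finally { proxy.stop(); }
// The stand-in below has the same shape so the pattern is runnable as-is.
public class ProxyLifecycleSketch {

    // Stand-in for HadoopProxy: the real class forks the ssh tunnel on
    // start() and kills that process on stop(), so callers never touch PIDs.
    static class Proxy {
        private boolean running;
        void start() { running = true; }
        void stop()  { running = false; }
        boolean isRunning() { return running; }
    }

    public static void main(String[] args) {
        Proxy proxy = new Proxy();
        proxy.start();
        try {
            System.out.println("proxy up: " + proxy.isRunning()); // proxy up: true
            // ... run the Hadoop job against the cluster here ...
        } finally {
            proxy.stop(); // no PID bookkeeping in the calling code
        }
        System.out.println("proxy up: " + proxy.isRunning()); // proxy up: false
    }
}
```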

On Wed, Jun 15, 2011 at 10:41 AM, Andrei Savu <sa...@gmail.com> wrote:

> Also the current trunk has an examples maven submodule. That code is mostly
> extracted from tests.



-- 

Thanks,
John C

Re: hadoop security and ssh proxy

Posted by Andrei Savu <sa...@gmail.com>.
Also the current trunk has an examples maven submodule. That code is mostly
extracted from tests.

Re: hadoop security and ssh proxy

Posted by John Conwell <jo...@iamjohn.me>.
oh cool.  Thanks for the pointer

On Wed, Jun 15, 2011 at 10:28 AM, Tom White <to...@gmail.com> wrote:

> On Wed, Jun 15, 2011 at 10:18 AM, John Conwell <jo...@iamjohn.me> wrote:
> > Ok, that makes sense.  Thanks for the clarification.  It
> > is definitely unwieldy when trying to integrate whirr's API into another
> API
> > to wrap spinning up hadoop clusters, and getting it to work without any
> > manual steps.
>
> Agreed, but it is possible - see the Hadoop integration tests which
> are an example of spinning up a Hadoop cluster from Java in a
> completely automated fashion.
>
> Tom



-- 

Thanks,
John C

Re: hadoop security and ssh proxy

Posted by Tom White <to...@gmail.com>.
On Wed, Jun 15, 2011 at 10:18 AM, John Conwell <jo...@iamjohn.me> wrote:
> Ok, that makes sense.  Thanks for the clarification.  It
> is definitely unwieldy when trying to integrate whirr's API into another API
> to wrap spinning up hadoop clusters, and getting it to work without any
> manual steps.

Agreed, but it is possible - see the Hadoop integration tests which
are an example of spinning up a Hadoop cluster from Java in a
completely automated fashion.

Tom


Re: hadoop security and ssh proxy

Posted by John Conwell <jo...@iamjohn.me>.
Ok, that makes sense.  Thanks for the clarification.  It
is definitely unwieldy when trying to integrate whirr's API into another API
to wrap spinning up hadoop clusters, and getting it to work without any
manual steps.


On Tue, Jun 14, 2011 at 5:13 PM, Tom White <to...@gmail.com> wrote:

> The proxy is not used for security (which would be better provided by
> a firewall), but to make the datanode addresses resolve correctly for
> the client. Without the proxy the datanodes return their internal
> addresses which are not routable by the client (which runs in an
> external network typically).
>
> I agree that it would be better if we could replace the proxy with
> something better, such as
> https://issues.apache.org/jira/browse/WHIRR-81.



-- 

Thanks,
John C

Re: hadoop security and ssh proxy

Posted by Tom White <to...@gmail.com>.
The proxy is not used for security (which would be better provided by
a firewall), but to make the datanode addresses resolve correctly for
the client. Without the proxy the datanodes return their internal
addresses which are not routable by the client (which runs in an
external network typically).

I agree that it would be better if we could replace the proxy with
something better, such as
https://issues.apache.org/jira/browse/WHIRR-81.
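
[Archive note: to make the above concrete, hadoop-proxy.sh opens an SSH
SOCKS tunnel to the namenode (ssh -D), and the Hadoop client configuration
Whirr generates routes RPC traffic through that tunnel, so the cluster's
internal addresses resolve from the client side.  Below is a sketch of the
relevant client-side settings; the local port (6666) is an assumption, as
Whirr writes the actual values into the generated site configuration.]

```xml
<!-- Sketch (not the generated file itself): route Hadoop RPC through the
     SOCKS tunnel that hadoop-proxy.sh opens; the port is an assumed value. -->
<property>
  <name>hadoop.rpc.socket.factory.class.default</name>
  <value>org.apache.hadoop.net.SocksSocketFactory</value>
</property>
<property>
  <name>hadoop.socks.server</name>
  <value>localhost:6666</value>
</property>
```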

On Tue, Jun 14, 2011 at 9:26 AM, John Conwell <jo...@iamjohn.me> wrote:
> The reason I ask is that I'm using Whirr through its API to
> automate...well...all the cool things Whirr does.  But the key point is
> automation.  After a Hadoop cluster is up and running, I'd like the program
> to kick off a Hadoop job and monitor its jobs and tasks.  But that means my
> program has to launch hadoop-proxy.sh somehow, capture the PID of the
> process, kick off my Hadoop job, and then, when done, kill the process via
> the PID.  The whole business of calling a shell script, capturing the PID,
> persisting it, and killing it, all through my Java automation, just seems a
> bit duct-tape-and-baling-wire-ish.

You can run the proxy from Java via HadoopProxy, which handles all
these details for you.


Cheers,
Tom