Posted to common-user@hadoop.apache.org by Andreas Kostyrka <an...@kostyrka.org> on 2008/05/28 22:22:55 UTC

hadoop on EC2

Hi!

I just wondered what other people use to access the hadoop webservers,
when running on EC2?

Ideas that I had:
1.) Opening ports 50030 and so on => not good, data goes unprotected
over the internet. Even if I could enable some form of authentication, it
would still be plain HTTP.

2.) Some kind of tunneling solution. The problem here is that each of my
cluster nodes is in a different subnet, plus there is the dualism between
the internal and external addresses of the nodes.

Any hints? TIA,

Andreas

Re: hadoop on EC2

Posted by Andreas Kostyrka <an...@kostyrka.org>.
That presumes that you have a static source address. Plus, for
nontechnical reasons, changing the firewall rules is nontrivial.
(I'm responsible for the inside of the VMs, but somebody else holds the
EC2 keys; don't ask.)

Andreas

Am Mittwoch, den 28.05.2008, 16:27 -0400 schrieb Jake Thompson:
> What is wrong with opening up the ports only to the hosts that you want to
> have access to them?  This is what I am currently doing: -s 0.0.0.0/0 is
> everyone everywhere, so change it to -s my.ip.add.ress/32
> 
> 
> 
> On Wed, May 28, 2008 at 4:22 PM, Andreas Kostyrka <an...@kostyrka.org>
> wrote:
> 
> > Hi!
> >
> > I just wondered what other people use to access the hadoop webservers,
> > when running on EC2?
> >
> > Ideas that I had:
> > 1.) Opening ports 50030 and so on => not good, data goes unprotected
> > over the internet. Even if I could enable some form of authentication, it
> > would still be plain HTTP.
> >
> > 2.) Some kind of tunneling solution. The problem here is that each of my
> > cluster nodes is in a different subnet, plus there is the dualism between
> > the internal and external addresses of the nodes.
> >
> > Any hints? TIA,
> >
> > Andreas
> >

Re: hadoop on EC2

Posted by Jake Thompson <ja...@jakethompson.com>.
What is wrong with opening up the ports only to the hosts that you want to
have access to them?  This is what I am currently doing: -s 0.0.0.0/0 is
everyone everywhere, so change it to -s my.ip.add.ress/32
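
For reference, a rough sketch of that kind of source restriction (the group
name and address below are placeholders, and whether it is done on the EC2
security group or with iptables on the nodes is your call):

# EC2 security group variant (ec2-authorize from the EC2 API tools):
ec2-authorize default -P tcp -p 50030 -s 203.0.113.7/32
ec2-authorize default -P tcp -p 50060 -s 203.0.113.7/32

# iptables variant, on each node: allow one trusted address, drop the rest
iptables -A INPUT -p tcp --dport 50030 -s 203.0.113.7/32 -j ACCEPT
iptables -A INPUT -p tcp --dport 50030 -j DROP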



On Wed, May 28, 2008 at 4:22 PM, Andreas Kostyrka <an...@kostyrka.org>
wrote:

> Hi!
>
> I just wondered what other people use to access the hadoop webservers,
> when running on EC2?
>
> Ideas that I had:
> 1.) Opening ports 50030 and so on => not good, data goes unprotected
> over the internet. Even if I could enable some form of authentication, it
> would still be plain HTTP.
>
> 2.) Some kind of tunneling solution. The problem here is that each of my
> cluster nodes is in a different subnet, plus there is the dualism between
> the internal and external addresses of the nodes.
>
> Any hints? TIA,
>
> Andreas
>

Re: hadoop on EC2

Posted by Nate Carlson <na...@natecarlson.com>.
On Wed, 28 May 2008, Andreas Kostyrka wrote:
> 1.) Opening ports 50030 and so on => not good, data goes unprotected
> over the internet. Even if I could enable some form of authentication, it
> would still be plain HTTP.

Personally, I set up an Apache server (with https and auth), and then set 
up cgiproxy[1] on it, and only allowed it to forward to the hadoop nodes. 
That way, when you click a link to go to a slave node's logs or whatnot it 
still works.  ;)

I no longer use Hadoop on ec2, but still use the config described above!

[1]http://www.jmarshall.com/tools/cgiproxy/
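
The exact config isn't given in the thread, but the shape of such a front
end would be roughly the following (hostnames, paths and the CGIProxy
details are guesses, not Nate's actual setup):

# password file, plus CGIProxy's nph-proxy.cgi dropped into cgi-bin
htpasswd -c /etc/apache2/hadoop.htpasswd alice
cp nph-proxy.cgi /usr/lib/cgi-bin/

# HTTPS vhost that demands a login before anything reaches the proxy script
cat > /etc/apache2/sites-available/hadoop-proxy <<'EOF'
<VirtualHost *:443>
    SSLEngine on
    SSLCertificateFile    /etc/ssl/certs/proxy.pem
    SSLCertificateKeyFile /etc/ssl/private/proxy.key
    ScriptAlias /proxy/ /usr/lib/cgi-bin/
    <Location /proxy/>
        AuthType Basic
        AuthName "hadoop cluster"
        AuthUserFile /etc/apache2/hadoop.htpasswd
        Require valid-user
    </Location>
</VirtualHost>
EOF
# CGIProxy itself then gets configured so it will only forward to the
# cluster's hosts/ports, per its own documentation.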

------------------------------------------------------------------------
| nate carlson | natecars@natecarlson.com | http://www.natecarlson.com |
|       depriving some poor village of its idiot since 1981            |
------------------------------------------------------------------------

Re: hadoop on EC2

Posted by Chris K Wensel <ch...@wensel.net>.
These are the FoxyProxy wildcards I use

*compute-1.amazonaws.com*
*.ec2.internal*
*.compute-1.internal*

and w/ hadoop 0.17.0, just type (after booting your cluster)

hadoop-ec2 proxy <cluster-name>

to start the tunnel for that cluster

On Jun 3, 2008, at 11:26 PM, James Moore wrote:

> On Tue, Jun 3, 2008 at 5:04 PM, Andreas Kostyrka  
> <an...@kostyrka.org> wrote:
> >> Plus to make it even more painful, you cannot easily run it with one
> >> simple SOCKS server, because you need to defer DNS resolution to the
> >> inside of the cluster, because VM names do resolve to external IPs,
> >> while the webservers we'd all be interested in reside on the internal
> >> 10/8 IPs.
>
> It's easy with foxyproxy.
>
> Run ssh with the -D option:
>
> ssh -D 2324 ec2-75-101-XXX-XX.compute-1.amazonaws.com
>
> Tell FoxyProxy to "use SOCKS proxy for DNS lookups" (tools > foxyproxy >
> more > global settings > use SOCKS proxy for DNS lookups)
>
> Configure foxyproxy with rules for when to use local port 2324.  Use
> wildcards like http*ec2*internal*.  I put a screenshot on my blog -
> http://blog.restphone.com/2008/6/4/foxyproxy-hadoop-and-socks
>
> All the features I cared about worked when set up this way.
>
> (And of course the choice of 2324 isn't special - use any port you  
> like.)
> -- 
> James Moore | james@restphone.com
> Ruby and Ruby on Rails consulting
> blog.restphone.com

Chris K Wensel
chris@wensel.net
http://chris.wensel.net/
http://www.cascading.org/





Re: hadoop on EC2

Posted by James Moore <ja...@gmail.com>.
On Tue, Jun 3, 2008 at 5:04 PM, Andreas Kostyrka <an...@kostyrka.org> wrote:
> Plus to make it even more painful, you cannot easily run it with one simple
> SOCKS server, because you need to defer DNS resolution to the inside of the
> cluster, because VM names do resolve to external IPs, while the webservers
> we'd all be interested in reside on the internal 10/8 IPs.

It's easy with foxyproxy.

Run ssh with the -D option:

ssh -D 2324 ec2-75-101-XXX-XX.compute-1.amazonaws.com

Tell FoxyProxy to "use SOCKS proxy for DNS lookups" (tools > foxyproxy >
more > global settings > use SOCKS proxy for DNS lookups)

Configure foxyproxy with rules for when to use local port 2324.  Use
wildcards like http*ec2*internal*.  I put a screenshot on my blog -
http://blog.restphone.com/2008/6/4/foxyproxy-hadoop-and-socks

All the features I cared about worked when set up this way.

(And of course the choice of 2324 isn't special - use any port you like.)
-- 
James Moore | james@restphone.com
Ruby and Ruby on Rails consulting
blog.restphone.com

Re: hadoop on EC2

Posted by Steve Loughran <st...@apache.org>.
Andreas Kostyrka wrote:
> Well, the basic "trouble" with EC2 is that clusters usually are not networks 
> in the TCP/IP sense.
> 
> This makes it painful to decide which URLs should be resolved where.
> 
> Plus to make it even more painful, you cannot easily run it with one simple
> SOCKS server, because you need to defer DNS resolution to the inside of the
> cluster, because VM names do resolve to external IPs, while the webservers
> we'd all be interested in reside on the internal 10/8 IPs.
> 
> Another fun item is that in many situations you will have multiple islands
> inside EC2 (the contractor working for multiple customers that have EC2
> deployments comes to mind), so you cannot just route everything over one pipe
> into EC2.
> 
> My current setup relies on a very long list of -L ssh tunnel forwards, plus
> iptables rules in the nat OUTPUT chain that make external-ip-of-vm1:50030 get
> redirected to localhost:SOMEPORT, which is forwarded to name-of-vm1:50030 via
> ssh. (Implementation left as an exercise for the reader, or my ugly
> non-error-checking script is available on request :-P)
> 
> If one wanted a more generic solution for redirecting TCP ports via an
> ssh SOCKS tunnel (aka "dynamic port forwarding"), the following components
> would be needed:
> 
> -) a list of rules for what gets forwarded where and how.
> -) a DNS resolver that issues fake IP addresses to capture the "name" of the
> connected host.
> -) a small forwarding script that checks the "real destination IP" to decide
> which IP address/port is being requested. (Hint: current Linux kernels don't
> use getsockname for this anymore; the real destination is nowadays carried as
> a socket option.)
> 
> One of the uglier parts, for which I have found no "real" solution, is the
> fact that one cannot be sure that ssh will be able to listen on a given port.
> 
> Solutions I've found include:
> -) checking the port before issuing ssh (race condition warning: going through
> this hole the whole federation star fleet could get lost.)
> -) using some kind of expect to drive ssh through a pty.
> -) rolling your own ssh tunnel solution. The only lib that comes to mind is
> Twisted, in which case one could ignore the need for the SOCKS protocol.
> 
> But luckily for us, the solution is easier, because we only need to tunnel
> HTTP in the hadoop case, which has the big benefit that we do not need to
> capture the hostname, because HTTP carries the hostname inside the payload.

Do you worry about/address the risk of someone like me bringing up a machine
in the EC2 farm that then port-scans all the near neighbours in the
address space for open HDFS datanode/namenode ports, and strikes up a
conversation with your filesystem?



-- 
Steve Loughran                  http://www.1060.org/blogxter/publish/5
Author: Ant in Action           http://antbook.org/

Re: hadoop on EC2

Posted by Andreas Kostyrka <an...@kostyrka.org>.
Well, the basic "trouble" with EC2 is that clusters usually are not networks 
in the TCP/IP sense.

This makes it painful to decide which URLs should be resolved where.

Plus to make it even more painful, you cannot easily run it with one simple
SOCKS server, because you need to defer DNS resolution to the inside of the
cluster, because VM names do resolve to external IPs, while the webservers
we'd all be interested in reside on the internal 10/8 IPs.

Another fun item is that in many situations you will have multiple islands
inside EC2 (the contractor working for multiple customers that have EC2
deployments comes to mind), so you cannot just route everything over one pipe
into EC2.

My current setup relies on a very long list of -L ssh tunnel forwards, plus
iptables rules in the nat OUTPUT chain that make external-ip-of-vm1:50030 get
redirected to localhost:SOMEPORT, which is forwarded to name-of-vm1:50030 via
ssh. (Implementation left as an exercise for the reader, or my ugly
non-error-checking script is available on request :-P)
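
The script itself isn't in the thread, but per node and port the idea boils
down to something like this (the names, addresses and local port are
placeholders):

# forward an arbitrary local port to one node's web UI, via a reachable gateway
ssh -f -N -L 10030:name-of-vm1:50030 user@<gateway-node>

# then rewrite locally generated traffic aimed at the VM's external address
# so it lands on that forward instead (keeps links in the web UI working)
iptables -t nat -A OUTPUT -p tcp -d <external-ip-of-vm1> --dport 50030 \
    -j REDIRECT --to-ports 10030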

If one wanted a more generic solution for redirecting TCP ports via an
ssh SOCKS tunnel (aka "dynamic port forwarding"), the following components
would be needed:

-) a list of rules for what gets forwarded where and how.
-) a DNS resolver that issues fake IP addresses to capture the "name" of the
connected host.
-) a small forwarding script that checks the "real destination IP" to decide
which IP address/port is being requested. (Hint: current Linux kernels don't
use getsockname for this anymore; the real destination is nowadays carried as
a socket option.)

One of the uglier parts, for which I have found no "real" solution, is the
fact that one cannot be sure that ssh will be able to listen on a given port.

Solutions I've found include:
-) checking the port before issuing ssh (race condition warning: going through
this hole the whole federation star fleet could get lost.)
-) using some kind of expect to drive ssh through a pty.
-) rolling your own ssh tunnel solution. The only lib that comes to mind is
Twisted, in which case one could ignore the need for the SOCKS protocol.

But luckily for us, the solution is easier, because we only need to tunnel
HTTP in the hadoop case, which has the big benefit that we do not need to
capture the hostname, because HTTP carries the hostname inside the payload.

Not tested, but the following should work:
1.) Set up a proxy on the cluster somewhere. Make it do auth (proxy auth might
work too, but depending upon how one makes the browser access the proxy this
might be a bad idea).
2.) Make the client access the proxy for the needed host/port combinations.
FoxyProxy or similar extensions for Firefox come to mind, or some
destination NAT rules on your packet firewall should do the trick.
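
As a sketch of the client side (the proxy host, port and credentials are
placeholders, and squid or tinyproxy are just examples of an authenticating
proxy one could run inside the cluster):

# the proxy, reachable on a port opened in the EC2 group, resolves the
# internal names from inside the cluster; the client just points at it,
# e.g. for a quick command-line check:
http_proxy=http://user:secret@<proxy-node>:3128 curl http://<internal-name>:50030/
# (FoxyProxy can be pointed at the same proxy for just the cluster names)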

Andreas


On Monday 02 June 2008 20:27:53 Chris K Wensel wrote:
> > obviously this isn't the best solution if you need to let many semi
> > trusted users browse your cluster.
>
> Actually, it would be much more secure if the tunnel service ran on a
> trusted server letting your users connect remotely via SOCKS and then
> browse the cluster. These users wouldn't need any AWS keys etc.
>
>
> Chris K Wensel
> chris@wensel.net
> http://chris.wensel.net/
> http://www.cascading.org/



Re: hadoop on EC2

Posted by Chris K Wensel <ch...@wensel.net>.
> obviously this isn't the best solution if you need to let many semi  
> trusted users browse your cluster.


Actually, it would be much more secure if the tunnel service ran on a  
trusted server letting your users connect remotely via SOCKS and then  
browse the cluster. These users wouldn't need any AWS keys etc.


Chris K Wensel
chris@wensel.net
http://chris.wensel.net/
http://www.cascading.org/





Re: hadoop on EC2

Posted by Chris K Wensel <ch...@wensel.net>.
if you use the new scripts in 0.17.0, just run

 > hadoop-ec2 proxy <cluster-name>

This starts an ssh tunnel to your cluster.

Installing FoxyProxy in FF gives you whole-cluster visibility.

obviously this isn't the best solution if you need to let many semi  
trusted users browse your cluster.

On May 28, 2008, at 1:22 PM, Andreas Kostyrka wrote:

> Hi!
>
> I just wondered what other people use to access the hadoop webservers,
> when running on EC2?
>
> Ideas that I had:
> 1.) Opening ports 50030 and so on => not good, data goes unprotected
> over the internet. Even if I could enable some form of authentication, it
> would still be plain HTTP.
>
> 2.) Some kind of tunneling solution. The problem here is that each of my
> cluster nodes is in a different subnet, plus there is the dualism between
> the internal and external addresses of the nodes.
>
> Any hints? TIA,
>
> Andreas

Chris K Wensel
chris@wensel.net
http://chris.wensel.net/
http://www.cascading.org/





Re: hadoop on EC2

Posted by Chris Anderson <jc...@grabb.it>.
On Wed, May 28, 2008 at 2:23 PM, Ted Dunning <td...@veoh.com> wrote:
>
> That doesn't work because the various web pages have links or redirects to
> other pages on other machines.
>
> Also, you would need to ssh to ALL of your cluster to get the file browser
> to work.

True. That makes it a little impractical.

>
> Better to do the proxy thing.
>

This would be a nice addition to the Hadoop EC2 AMI (which is super
helpful, by the way). Thanks to whoever put it together.


-- 
Chris Anderson
http://jchris.mfdz.com

Re: hadoop on EC2

Posted by "Jim R. Wilson" <wi...@gmail.com>.
Recently I spent some time hacking the contrib/ec2 scripts to install
and configure OpenVPN on top of the other installed packages.  Our use
case required that all the slaves running mappers be able to connect
back to our primary MySQL database (firewalled, as you can imagine).
Simultaneously, our webservers had to be able to connect to HBase
running atop the same Hadoop cluster via Thrift.

The scheme I eventually settled on was to have a server cert/key and a
"client" cert/key which would be shared across all the clients - then
make the master node the OpenVPN server, and have all the slave nodes
connect as clients.  Then, if any other box needed access to the
cluster (like our firewalled database and webservers), it would connect
to the master hadoop node, whose EC2 group had UDP 1194 open to
0.0.0.0.  Such a client could then address any hadoop node by its
tunneled VPN IP (10.8.0.x), derived from its AMI instance start ID.
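
Jim's actual configs aren't in the thread, but the shared-client-cert
scheme he describes would look roughly like this (paths and the
10.8.0.0/24 range are illustrative):

# master node: OpenVPN server; one shared "client" cert means duplicate-cn
cat > /etc/openvpn/server.conf <<'EOF'
port 1194
proto udp
dev tun
server 10.8.0.0 255.255.255.0
ca /etc/openvpn/ca.crt
cert /etc/openvpn/server.crt
key /etc/openvpn/server.key
dh /etc/openvpn/dh1024.pem
# all slaves (and other boxes) share the same client cert/key:
duplicate-cn
keepalive 10 60
EOF

# each slave, plus any outside box that needs cluster access: plain client
cat > /etc/openvpn/client.conf <<'EOF'
client
proto udp
dev tun
remote <master-public-dns> 1194
ca /etc/openvpn/ca.crt
cert /etc/openvpn/client.crt
key /etc/openvpn/client.key
EOF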

I almost had it all working - the only piece which was giving me
trouble was actually making the slaves connect back to the master at
instance boot time.  I could have figured it out, but got pulled off
because we decided to move away from ec2 for the time being :/

-- Jim R. Wilson (jimbojw)

On Wed, May 28, 2008 at 4:23 PM, Ted Dunning <td...@veoh.com> wrote:
>
> That doesn't work because the various web pages have links or redirects to
> other pages on other machines.
>
> Also, you would need to ssh to ALL of your cluster to get the file browser
> to work.
>
> Better to do the proxy thing.
>
>
> On 5/28/08 2:16 PM, "Chris Anderson" <jc...@grabb.it> wrote:
>
>> Andreas,
>>
>> If you can ssh into the nodes, you can always set up port-forwarding
>> with ssh -L to bring those ports to your local machine.
>>
>> On Wed, May 28, 2008 at 1:51 PM, Andreas Kostyrka <an...@kostyrka.org>
>> wrote:
>>> What I wonder is what ports do I need to access?
>>>
>>> 50060 on all nodes.
>>> 50030 on the jobtracker.
>>>
>>> Any other ports?
>>>
>>> Andreas
>>>
>>> Am Mittwoch, den 28.05.2008, 13:37 -0700 schrieb Allen Wittenauer:
>>>>
>>>>
>>>> On 5/28/08 1:22 PM, "Andreas Kostyrka" <an...@kostyrka.org> wrote:
>>>>> I just wondered what other people use to access the hadoop webservers,
>>>>> when running on EC2?
>>>>
>>>>     While we don't run on EC2 :), we do protect the hadoop web processes by
>>>> putting a proxy in front of it.  A user connects to the proxy,
>>>> authenticates, and then gets the output from the hadoop process.  All of the
>>>> redirection magic happens via a localhost connection, so no data is leaked
>>>> unprotected.
>>>>
>>>
>>
>>
>
>

Re: hadoop on EC2

Posted by Ted Dunning <td...@veoh.com>.
That doesn't work because the various web pages have links or redirects to
other pages on other machines.

Also, you would need to ssh to ALL of your cluster to get the file browser
to work.

Better to do the proxy thing.


On 5/28/08 2:16 PM, "Chris Anderson" <jc...@grabb.it> wrote:

> Andreas,
> 
> If you can ssh into the nodes, you can always set up port-forwarding
> with ssh -L to bring those ports to your local machine.
> 
> On Wed, May 28, 2008 at 1:51 PM, Andreas Kostyrka <an...@kostyrka.org>
> wrote:
>> What I wonder is what ports do I need to access?
>> 
>> 50060 on all nodes.
>> 50030 on the jobtracker.
>> 
>> Any other ports?
>> 
>> Andreas
>> 
>> Am Mittwoch, den 28.05.2008, 13:37 -0700 schrieb Allen Wittenauer:
>>> 
>>> 
>>> On 5/28/08 1:22 PM, "Andreas Kostyrka" <an...@kostyrka.org> wrote:
>>>> I just wondered what other people use to access the hadoop webservers,
>>>> when running on EC2?
>>> 
>>>     While we don't run on EC2 :), we do protect the hadoop web processes by
>>> putting a proxy in front of it.  A user connects to the proxy,
>>> authenticates, and then gets the output from the hadoop process.  All of the
>>> redirection magic happens via a localhost connection, so no data is leaked
>>> unprotected.
>>> 
>> 
> 
> 


Re: hadoop on EC2

Posted by Andreas Kostyrka <an...@kostyrka.org>.
On Wednesday 28 May 2008 23:16:43 Chris Anderson wrote:
> Andreas,
>
> If you can ssh into the nodes, you can always set up port-forwarding
> with ssh -L to bring those ports to your local machine.

Yes, and the missing part is simple too: iptables with DNAT on OUTPUT :)

I even made a small ugly script for this kind of tunneling.

Andreas

>
> On Wed, May 28, 2008 at 1:51 PM, Andreas Kostyrka <an...@kostyrka.org> 
wrote:
> > What I wonder is what ports do I need to access?
> >
> > 50060 on all nodes.
> > 50030 on the jobtracker.
> >
> > Any other ports?
> >
> > Andreas
> >
> > Am Mittwoch, den 28.05.2008, 13:37 -0700 schrieb Allen Wittenauer:
> >> On 5/28/08 1:22 PM, "Andreas Kostyrka" <an...@kostyrka.org> wrote:
> >> > I just wondered what other people use to access the hadoop webservers,
> >> > when running on EC2?
> >>
> >>     While we don't run on EC2 :), we do protect the hadoop web processes
> >> by putting a proxy in front of it.  A user connects to the proxy,
> >> authenticates, and then gets the output from the hadoop process.  All of
> >> the redirection magic happens via a localhost connection, so no data is
> >> leaked unprotected.



Re: hadoop on EC2

Posted by Chris Anderson <jc...@grabb.it>.
Andreas,

If you can ssh into the nodes, you can always set up port-forwarding
with ssh -L to bring those ports to your local machine.
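
For example (the node names and the choice of local ports are arbitrary):

# JobTracker UI on localhost:50030, one slave's TaskTracker UI on localhost:50060
ssh -f -N \
    -L 50030:localhost:50030 \
    -L 50060:<some-slave-internal-name>:50060 \
    user@<master-public-dns>
# then browse http://localhost:50030/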

On Wed, May 28, 2008 at 1:51 PM, Andreas Kostyrka <an...@kostyrka.org> wrote:
> What I wonder is what ports do I need to access?
>
> 50060 on all nodes.
> 50030 on the jobtracker.
>
> Any other ports?
>
> Andreas
>
> Am Mittwoch, den 28.05.2008, 13:37 -0700 schrieb Allen Wittenauer:
>>
>>
>> On 5/28/08 1:22 PM, "Andreas Kostyrka" <an...@kostyrka.org> wrote:
>> > I just wondered what other people use to access the hadoop webservers,
>> > when running on EC2?
>>
>>     While we don't run on EC2 :), we do protect the hadoop web processes by
>> putting a proxy in front of it.  A user connects to the proxy,
>> authenticates, and then gets the output from the hadoop process.  All of the
>> redirection magic happens via a localhost connection, so no data is leaked
>> unprotected.
>>
>



-- 
Chris Anderson
http://jchris.mfdz.com

Re: hadoop on EC2

Posted by Andreas Kostyrka <an...@kostyrka.org>.
What I wonder is what ports do I need to access?

50060 on all nodes.
50030 on the jobtracker.

Any other ports?

Andreas

Am Mittwoch, den 28.05.2008, 13:37 -0700 schrieb Allen Wittenauer:
> 
> 
> On 5/28/08 1:22 PM, "Andreas Kostyrka" <an...@kostyrka.org> wrote:
> > I just wondered what other people use to access the hadoop webservers,
> > when running on EC2?
> 
>     While we don't run on EC2 :), we do protect the hadoop web processes by
> putting a proxy in front of it.  A user connects to the proxy,
> authenticates, and then gets the output from the hadoop process.  All of the
> redirection magic happens via a localhost connection, so no data is leaked
> unprotected.
> 

Re: hadoop on EC2

Posted by Allen Wittenauer <aw...@yahoo-inc.com>.


On 5/28/08 1:22 PM, "Andreas Kostyrka" <an...@kostyrka.org> wrote:
> I just wondered what other people use to access the hadoop webservers,
> when running on EC2?

    While we don't run on EC2 :), we do protect the hadoop web processes by
putting a proxy in front of it.  A user connects to the proxy,
authenticates, and then gets the output from the hadoop process.  All of the
redirection magic happens via a localhost connection, so no data is leaked
unprotected.
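
Allen's actual setup isn't spelled out, but the general shape of such a
front end, sketched as an Apache reverse proxy with basic auth running on
the same box as the JobTracker (names and paths are placeholders):

# needs mod_proxy/mod_proxy_http enabled; users go in the htpasswd file
cat > /etc/apache2/conf.d/hadoop-ui.conf <<'EOF'
<Location /jobtracker/>
    ProxyPass        http://localhost:50030/
    ProxyPassReverse http://localhost:50030/
    AuthType Basic
    AuthName "hadoop web UI"
    AuthUserFile /etc/apache2/hadoop.htpasswd
    Require valid-user
</Location>
EOF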