Posted to common-user@hadoop.apache.org by Joydeep Sen Sarma <js...@facebook.com> on 2009/05/13 06:11:55 UTC

RE: public IP for datanode on EC2

(raking up real old thread)

After struggling with this issue for some time now - it seems that accessing HDFS on EC2 from outside EC2 is not possible.

This is primarily because of https://issues.apache.org/jira/browse/HADOOP-985. Even if the datanode ports are authorized in EC2 and we set the public hostname via slave.host.name - the namenode uses the internal IP addresses of the datanodes for block locations. DFS clients outside EC2 cannot reach these addresses and report failures reading/writing data blocks.

HDFS/EC2 gurus - would it be reasonable to ask for an option to not use IP addresses (and use datanode hostnames, as before HADOOP-985)?

I really like the idea of being able to use an external node (my personal workstation) to do job submission (which typically requires interacting with HDFS in order to push files into the jobcache etc.). This way I don't need custom AMIs - I can use stock Hadoop AMIs (all the custom software is on the external node). Without the above option, this is not possible currently.
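
For reference, the datanode-side setting mentioned above is the slave.host.name property. A minimal sketch of how it would appear in a datanode's hadoop-site.xml follows - the hostname shown is a placeholder, not taken from this thread.

<!-- Sketch only: advertise an EC2 public hostname for this datanode;
     the value below is a placeholder. -->
<property>
  <name>slave.host.name</name>
  <value>ec2-203-0-113-10.compute-1.amazonaws.com</value>
</property>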

 

-----Original Message-----
From: Julien Nioche [mailto:lists.digitalpebble@gmail.com] 
Sent: Tuesday, September 09, 2008 7:04 AM
To: core-user@hadoop.apache.org
Subject: Re: public IP for datanode on EC2

> I think most people try to avoid allowing remote access for security 
> reasons. If you can add a file, I can mount your filesystem too, maybe 
> even delete things. Whereas with EC2-only filesystems, your files are 
> *only* exposed to everyone else that knows or can scan for your IPAddr and ports.
>

I imagine that access to the ports used by HDFS could be restricted to specific IPs using EC2 security groups (ec2-authorize) or any other firewall mechanism if necessary.

Could anyone confirm that there is no conf parameter I could use to force the address of my DataNodes?

Thanks

Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com

Re: public IP for datanode on EC2

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
Philip Zeyliger wrote:
> 
> You could use ssh to set up a SOCKS proxy between your machine and
> ec2, and setup org.apache.hadoop.net.SocksSocketFactory to be the
> socket factory.
> http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/
> has more information.

Very useful write-up. Regarding the problem with reverse DNS mentioned there 
(that's why you had to add a DNS record for the internal IP): it is fixed in 
https://issues.apache.org/jira/browse/HADOOP-5191 (for HDFS access at 
least). Some mapred parts are still affected (HADOOP-5610). Depending on 
reverse DNS should be avoided.

Ideally, setting fs.default.name to the internal IP should just work for 
clients - both internally and externally (through proxies).

Raghu.
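
For illustration of the setup discussed above, a client-side hadoop-site.xml along the following lines would route Hadoop RPC through a local SOCKS tunnel (opened with something like "ssh -D 1080 <gateway>"). This is a sketch only: the proxy port 1080 is an assumption, and the namenode address is the internal hostname that appears elsewhere in this thread.

<!-- Client-side sketch: route Hadoop RPC through a local SOCKS proxy.
     The proxy port and gateway host are placeholders. -->
<property>
  <name>hadoop.rpc.socket.factory.class.default</name>
  <value>org.apache.hadoop.net.SocksSocketFactory</value>
</property>
<property>
  <name>hadoop.socks.server</name>
  <value>localhost:1080</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://domU-12-31-39-00-51-94.compute-1.internal:50001/</value>
</property>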

RE: public IP for datanode on EC2

Posted by Joydeep Sen Sarma <js...@facebook.com>.
Btw - I figured out the problem.

The jobconf from the remote client had the SOCKS proxy configuration - the JVMs spawned by the TaskTrackers picked this up and tried to connect using the proxy, which of course didn't work.

This was easy to solve - I just had to make the remote initialization script mark hadoop.rpc.socket.factory.class.default as a final variable in the hadoop-site.xml on the server side.

I am assuming that this would be a good thing to do in general (I can't see why server-side traffic should ever be routed through a proxy!).

Filed https://issues.apache.org/jira/browse/HADOOP-5839 to follow up on the issues uncovered here.
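
For illustration, the server-side fix described above amounts to pinning the default socket factory in the cluster's hadoop-site.xml so that a value carried in a submitted jobconf cannot override it. A sketch only (the value shown is Hadoop's stock default socket factory):

<!-- Server-side sketch: mark the socket factory final so a jobconf
     carrying a client's SOCKS setting cannot override it on the
     TaskTrackers. -->
<property>
  <name>hadoop.rpc.socket.factory.class.default</name>
  <value>org.apache.hadoop.net.StandardSocketFactory</value>
  <final>true</final>
</property>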

-----Original Message-----
From: Tom White [mailto:tom@cloudera.com] 
Sent: Thursday, May 14, 2009 7:07 AM
To: core-user@hadoop.apache.org
Subject: Re: public IP for datanode on EC2

Yes, you're absolutely right.

Tom

On Thu, May 14, 2009 at 2:19 PM, Joydeep Sen Sarma <js...@facebook.com> wrote:
> The ec2 documentation point to the use of public 'ip' addresses - whereas using public 'hostnames' seems safe since it resolves to internal addresses from within the cluster (and resolve to public ip addresses from outside).
>
> The only data transfer that I would incur while submitting jobs from outside is the cost of copying the jar files and any other files meant for the distributed cache). That would be extremely small.
>
>
> -----Original Message-----
> From: Tom White [mailto:tom@cloudera.com]
> Sent: Thursday, May 14, 2009 5:58 AM
> To: core-user@hadoop.apache.org
> Subject: Re: public IP for datanode on EC2
>
> Hi Joydeep,
>
> The problem you are hitting may be because port 50001 isn't open,
> whereas from within the cluster any node may talk to any other node
> (because the security groups are set up to do this).
>
> However I'm not sure this is a good approach. Configuring Hadoop to
> use public IP addresses everywhere should work, but you have to pay
> for all data transfer between nodes (see http://aws.amazon.com/ec2/,
> "Public and Elastic IP Data Transfer"). This is going to get expensive
> fast!
>
> So to get this to work well, we would have to make changes to Hadoop
> so it was aware of both public and private addresses, and use the
> appropriate one: clients would use the public address, while daemons
> would use the private address. I haven't looked at what it would take
> to do this or how invasive it would be.
>
> Cheers,
> Tom
>
> On Thu, May 14, 2009 at 1:37 PM, Joydeep Sen Sarma <js...@facebook.com> wrote:
>> I changed the ec2 scripts to have fs.default.name assigned to the public hostname (instead of the private hostname).
>>
>> Now I can submit jobs remotely via the socks proxy (the problem below is resolved) - but the map tasks fail with an exception:
>>
>>
>> 2009-05-14 07:30:34,913 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001. Already tried 9 time(s).
>> 2009-05-14 07:30:34,914 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
>> java.io.IOException: Call to ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001 failed on local exception: Connection refused
>>        at org.apache.hadoop.ipc.Client.call(Client.java:699)
>>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>>        at $Proxy1.getProtocolVersion(Unknown Source)
>>        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
>>        at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
>>        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:177)
>>        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:74)
>>        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
>>        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
>>        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
>>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
>>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
>>        at org.apache.hadoop.mapred.Child.main(Child.java:153)
>>
>>
>> strangely enough - job submissions from nodes within the ec2 cluster work just fine. I looked at the job.xml files of jobs submitted locally and remotely and don't see any relevant differences.
>>
>> Totally foxed now.
>>
>> Joydeep
>>
>> -----Original Message-----
>> From: Joydeep Sen Sarma [mailto:jssarma@facebook.com]
>> Sent: Wednesday, May 13, 2009 9:38 PM
>> To: core-user@hadoop.apache.org
>> Cc: Tom White
>> Subject: RE: public IP for datanode on EC2
>>
>> Thanks Philip. Very helpful (and great blog post)! This seems to make basic dfs command line operations work just fine.
>>
>> However - I am hitting a new error during job submission (running hadoop-0.19.0):
>>
>> 2009-05-14 00:15:34,430 ERROR exec.ExecDriver (SessionState.java:printError(279)) - Job Submission failed with exception 'java.net.UnknownHostException(unknown host: domU-12-31-39-00-51-94.compute-1.internal)'
>> java.net.UnknownHostException: unknown host: domU-12-31-39-00-51-94.compute-1.internal
>>        at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:195)
>>        at org.apache.hadoop.ipc.Client.getConnection(Client.java:791)
>>        at org.apache.hadoop.ipc.Client.call(Client.java:686)
>>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>>        at $Proxy0.getProtocolVersion(Unknown Source)
>>        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:348)
>>        at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
>>        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:176)
>>        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:75)
>>        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
>>        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
>>        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
>>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
>>        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
>>        at org.apache.hadoop.mapred.JobClient.getFs(JobClient.java:469)
>>        at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:603)
>>        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:788)
>>
>>
>> looking at the stack trace and the code - it seems that this is happening because the jobclient asks for the mapred system directory from the jobtracker - which replies back with a path name that's qualified against the fs.default.name setting of the jobtracker. Unfortunately the standard EC2 scripts assign this to the internal hostname of the hadoop master.
>>
>> Is there any downside to using public hostnames instead of the private ones in the ec2 starter scripts?
>>
>> Thanks for the help,
>>
>> Joydeep
>>
>>
>> -----Original Message-----
>> From: Philip Zeyliger [mailto:philip@cloudera.com]
>> Sent: Wednesday, May 13, 2009 2:40 PM
>> To: core-user@hadoop.apache.org
>> Subject: Re: public IP for datanode on EC2
>>
>> On Tue, May 12, 2009 at 9:11 PM, Joydeep Sen Sarma <js...@facebook.com> wrote:
>>> (raking up real old thread)
>>>
>>> After struggling with this issue for sometime now - it seems that accessing hdfs on ec2 from outside ec2 is not possible.
>>>
>>> This is primarily because of https://issues.apache.org/jira/browse/HADOOP-985. Even if datanode ports are authorized in ec2 and we set the public hostname via slave.host.name - the namenode uses the internal IP address of the datanodes for block locations. DFS clients outside ec2 cannot reach these addresses and report failure reading/writing data blocks.
>>>
>>> HDFS/EC2 gurus - would it be reasonable to ask for an option to not use IP addresses (and use datanode host names as pre-985)?
>>>
>>> I really like the idea of being able to use an external node (my personal workstation) to do job submission (which typically requires interacting with HDFS in order to push files into the jobcache etc). This way I don't need custom AMIs - I can use stock hadoop amis (all the custom software is on the external node). Without the above option - this is not possible currently.
>>
>> You could use ssh to set up a SOCKS proxy between your machine and
>> ec2, and setup org.apache.hadoop.net.SocksSocketFactory to be the
>> socket factory.
>> http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/
>> has more information.
>>
>> -- Philip
>>
>

Re: public IP for datanode on EC2

Posted by Tom White <to...@cloudera.com>.
Yes, you're absolutely right.

Tom

On Thu, May 14, 2009 at 2:19 PM, Joydeep Sen Sarma <js...@facebook.com> wrote:
> The ec2 documentation point to the use of public 'ip' addresses - whereas using public 'hostnames' seems safe since it resolves to internal addresses from within the cluster (and resolve to public ip addresses from outside).
>
> The only data transfer that I would incur while submitting jobs from outside is the cost of copying the jar files and any other files meant for the distributed cache). That would be extremely small.
>
>
> -----Original Message-----
> From: Tom White [mailto:tom@cloudera.com]
> Sent: Thursday, May 14, 2009 5:58 AM
> To: core-user@hadoop.apache.org
> Subject: Re: public IP for datanode on EC2
>
> Hi Joydeep,
>
> The problem you are hitting may be because port 50001 isn't open,
> whereas from within the cluster any node may talk to any other node
> (because the security groups are set up to do this).
>
> However I'm not sure this is a good approach. Configuring Hadoop to
> use public IP addresses everywhere should work, but you have to pay
> for all data transfer between nodes (see http://aws.amazon.com/ec2/,
> "Public and Elastic IP Data Transfer"). This is going to get expensive
> fast!
>
> So to get this to work well, we would have to make changes to Hadoop
> so it was aware of both public and private addresses, and use the
> appropriate one: clients would use the public address, while daemons
> would use the private address. I haven't looked at what it would take
> to do this or how invasive it would be.
>
> Cheers,
> Tom
>
> On Thu, May 14, 2009 at 1:37 PM, Joydeep Sen Sarma <js...@facebook.com> wrote:
>> I changed the ec2 scripts to have fs.default.name assigned to the public hostname (instead of the private hostname).
>>
>> Now I can submit jobs remotely via the socks proxy (the problem below is resolved) - but the map tasks fail with an exception:
>>
>>
>> 2009-05-14 07:30:34,913 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001. Already tried 9 time(s).
>> 2009-05-14 07:30:34,914 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
>> java.io.IOException: Call to ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001 failed on local exception: Connection refused
>>        at org.apache.hadoop.ipc.Client.call(Client.java:699)
>>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>>        at $Proxy1.getProtocolVersion(Unknown Source)
>>        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
>>        at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
>>        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:177)
>>        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:74)
>>        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
>>        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
>>        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
>>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
>>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
>>        at org.apache.hadoop.mapred.Child.main(Child.java:153)
>>
>>
>> strangely enough - job submissions from nodes within the ec2 cluster work just fine. I looked at the job.xml files of jobs submitted locally and remotely and don't see any relevant differences.
>>
>> Totally foxed now.
>>
>> Joydeep
>>
>> -----Original Message-----
>> From: Joydeep Sen Sarma [mailto:jssarma@facebook.com]
>> Sent: Wednesday, May 13, 2009 9:38 PM
>> To: core-user@hadoop.apache.org
>> Cc: Tom White
>> Subject: RE: public IP for datanode on EC2
>>
>> Thanks Philip. Very helpful (and great blog post)! This seems to make basic dfs command line operations work just fine.
>>
>> However - I am hitting a new error during job submission (running hadoop-0.19.0):
>>
>> 2009-05-14 00:15:34,430 ERROR exec.ExecDriver (SessionState.java:printError(279)) - Job Submission failed with exception 'java.net.UnknownHostException(unknown host: domU-12-31-39-00-51-94.compute-1.internal)'
>> java.net.UnknownHostException: unknown host: domU-12-31-39-00-51-94.compute-1.internal
>>        at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:195)
>>        at org.apache.hadoop.ipc.Client.getConnection(Client.java:791)
>>        at org.apache.hadoop.ipc.Client.call(Client.java:686)
>>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>>        at $Proxy0.getProtocolVersion(Unknown Source)
>>        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:348)
>>        at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
>>        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:176)
>>        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:75)
>>        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
>>        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
>>        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
>>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
>>        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
>>        at org.apache.hadoop.mapred.JobClient.getFs(JobClient.java:469)
>>        at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:603)
>>        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:788)
>>
>>
>> looking at the stack trace and the code - it seems that this is happening because the jobclient asks for the mapred system directory from the jobtracker - which replies back with a path name that's qualified against the fs.default.name setting of the jobtracker. Unfortunately the standard EC2 scripts assign this to the internal hostname of the hadoop master.
>>
>> Is there any downside to using public hostnames instead of the private ones in the ec2 starter scripts?
>>
>> Thanks for the help,
>>
>> Joydeep
>>
>>
>> -----Original Message-----
>> From: Philip Zeyliger [mailto:philip@cloudera.com]
>> Sent: Wednesday, May 13, 2009 2:40 PM
>> To: core-user@hadoop.apache.org
>> Subject: Re: public IP for datanode on EC2
>>
>> On Tue, May 12, 2009 at 9:11 PM, Joydeep Sen Sarma <js...@facebook.com> wrote:
>>> (raking up real old thread)
>>>
>>> After struggling with this issue for sometime now - it seems that accessing hdfs on ec2 from outside ec2 is not possible.
>>>
>>> This is primarily because of https://issues.apache.org/jira/browse/HADOOP-985. Even if datanode ports are authorized in ec2 and we set the public hostname via slave.host.name - the namenode uses the internal IP address of the datanodes for block locations. DFS clients outside ec2 cannot reach these addresses and report failure reading/writing data blocks.
>>>
>>> HDFS/EC2 gurus - would it be reasonable to ask for an option to not use IP addresses (and use datanode host names as pre-985)?
>>>
>>> I really like the idea of being able to use an external node (my personal workstation) to do job submission (which typically requires interacting with HDFS in order to push files into the jobcache etc). This way I don't need custom AMIs - I can use stock hadoop amis (all the custom software is on the external node). Without the above option - this is not possible currently.
>>
>> You could use ssh to set up a SOCKS proxy between your machine and
>> ec2, and setup org.apache.hadoop.net.SocksSocketFactory to be the
>> socket factory.
>> http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/
>> has more information.
>>
>> -- Philip
>>
>

RE: public IP for datanode on EC2

Posted by Joydeep Sen Sarma <js...@facebook.com>.
The EC2 documentation points to the use of public 'IP' addresses - whereas using public 'hostnames' seems safe, since they resolve to internal addresses from within the cluster (and to public IP addresses from outside).

The only data transfer that I would incur while submitting jobs from outside is the cost of copying the jar files (and any other files meant for the distributed cache). That would be extremely small.


-----Original Message-----
From: Tom White [mailto:tom@cloudera.com] 
Sent: Thursday, May 14, 2009 5:58 AM
To: core-user@hadoop.apache.org
Subject: Re: public IP for datanode on EC2

Hi Joydeep,

The problem you are hitting may be because port 50001 isn't open,
whereas from within the cluster any node may talk to any other node
(because the security groups are set up to do this).

However I'm not sure this is a good approach. Configuring Hadoop to
use public IP addresses everywhere should work, but you have to pay
for all data transfer between nodes (see http://aws.amazon.com/ec2/,
"Public and Elastic IP Data Transfer"). This is going to get expensive
fast!

So to get this to work well, we would have to make changes to Hadoop
so it was aware of both public and private addresses, and use the
appropriate one: clients would use the public address, while daemons
would use the private address. I haven't looked at what it would take
to do this or how invasive it would be.

Cheers,
Tom

On Thu, May 14, 2009 at 1:37 PM, Joydeep Sen Sarma <js...@facebook.com> wrote:
> I changed the ec2 scripts to have fs.default.name assigned to the public hostname (instead of the private hostname).
>
> Now I can submit jobs remotely via the socks proxy (the problem below is resolved) - but the map tasks fail with an exception:
>
>
> 2009-05-14 07:30:34,913 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001. Already tried 9 time(s).
> 2009-05-14 07:30:34,914 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
> java.io.IOException: Call to ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001 failed on local exception: Connection refused
>        at org.apache.hadoop.ipc.Client.call(Client.java:699)
>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>        at $Proxy1.getProtocolVersion(Unknown Source)
>        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
>        at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
>        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:177)
>        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:74)
>        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
>        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
>        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
>        at org.apache.hadoop.mapred.Child.main(Child.java:153)
>
>
> strangely enough - job submissions from nodes within the ec2 cluster work just fine. I looked at the job.xml files of jobs submitted locally and remotely and don't see any relevant differences.
>
> Totally foxed now.
>
> Joydeep
>
> -----Original Message-----
> From: Joydeep Sen Sarma [mailto:jssarma@facebook.com]
> Sent: Wednesday, May 13, 2009 9:38 PM
> To: core-user@hadoop.apache.org
> Cc: Tom White
> Subject: RE: public IP for datanode on EC2
>
> Thanks Philip. Very helpful (and great blog post)! This seems to make basic dfs command line operations work just fine.
>
> However - I am hitting a new error during job submission (running hadoop-0.19.0):
>
> 2009-05-14 00:15:34,430 ERROR exec.ExecDriver (SessionState.java:printError(279)) - Job Submission failed with exception 'java.net.UnknownHostException(unknown host: domU-12-31-39-00-51-94.compute-1.internal)'
> java.net.UnknownHostException: unknown host: domU-12-31-39-00-51-94.compute-1.internal
>        at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:195)
>        at org.apache.hadoop.ipc.Client.getConnection(Client.java:791)
>        at org.apache.hadoop.ipc.Client.call(Client.java:686)
>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>        at $Proxy0.getProtocolVersion(Unknown Source)
>        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:348)
>        at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
>        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:176)
>        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:75)
>        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
>        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
>        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
>        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
>        at org.apache.hadoop.mapred.JobClient.getFs(JobClient.java:469)
>        at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:603)
>        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:788)
>
>
> looking at the stack trace and the code - it seems that this is happening because the jobclient asks for the mapred system directory from the jobtracker - which replies back with a path name that's qualified against the fs.default.name setting of the jobtracker. Unfortunately the standard EC2 scripts assign this to the internal hostname of the hadoop master.
>
> Is there any downside to using public hostnames instead of the private ones in the ec2 starter scripts?
>
> Thanks for the help,
>
> Joydeep
>
>
> -----Original Message-----
> From: Philip Zeyliger [mailto:philip@cloudera.com]
> Sent: Wednesday, May 13, 2009 2:40 PM
> To: core-user@hadoop.apache.org
> Subject: Re: public IP for datanode on EC2
>
> On Tue, May 12, 2009 at 9:11 PM, Joydeep Sen Sarma <js...@facebook.com> wrote:
>> (raking up real old thread)
>>
>> After struggling with this issue for sometime now - it seems that accessing hdfs on ec2 from outside ec2 is not possible.
>>
>> This is primarily because of https://issues.apache.org/jira/browse/HADOOP-985. Even if datanode ports are authorized in ec2 and we set the public hostname via slave.host.name - the namenode uses the internal IP address of the datanodes for block locations. DFS clients outside ec2 cannot reach these addresses and report failure reading/writing data blocks.
>>
>> HDFS/EC2 gurus - would it be reasonable to ask for an option to not use IP addresses (and use datanode host names as pre-985)?
>>
>> I really like the idea of being able to use an external node (my personal workstation) to do job submission (which typically requires interacting with HDFS in order to push files into the jobcache etc). This way I don't need custom AMIs - I can use stock hadoop amis (all the custom software is on the external node). Without the above option - this is not possible currently.
>
> You could use ssh to set up a SOCKS proxy between your machine and
> ec2, and setup org.apache.hadoop.net.SocksSocketFactory to be the
> socket factory.
> http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/
> has more information.
>
> -- Philip
>

Re: public IP for datanode on EC2

Posted by Steve Loughran <st...@apache.org>.
Tom White wrote:
> Hi Joydeep,
> 
> The problem you are hitting may be because port 50001 isn't open,
> whereas from within the cluster any node may talk to any other node
> (because the security groups are set up to do this).
> 
> However I'm not sure this is a good approach. Configuring Hadoop to
> use public IP addresses everywhere should work, but you have to pay
> for all data transfer between nodes (see http://aws.amazon.com/ec2/,
> "Public and Elastic IP Data Transfer"). This is going to get expensive
> fast!
> 
> So to get this to work well, we would have to make changes to Hadoop
> so it was aware of both public and private addresses, and use the
> appropriate one: clients would use the public address, while daemons
> would use the private address. I haven't looked at what it would take
> to do this or how invasive it would be.


I thought that AWS had stopped you from being able to talk to things within 
the cluster using the public IP addresses - which stopped you from using 
DynDNS as your way of bootstrapping discovery.

Here's what may work:
- bring up the EC2 cluster using the local names
- open up the ports
- have the clients talk using the public IP addresses

The problem will arise when the namenode checks the fs name used and it 
doesn't match its expectations - there were some recent patches in the 
code to handle the case where someone talks to the namenode using the 
IP address instead of the hostname; they may work for this situation too.

Personally, I wouldn't trust the NN in the EC2 datacentres to be secure 
to external callers, but that problem already exists within their 
datacentres anyway.

Re: public IP for datanode on EC2

Posted by Tom White <to...@cloudera.com>.
Hi Joydeep,

The problem you are hitting may be because port 50001 isn't open,
whereas from within the cluster any node may talk to any other node
(because the security groups are set up to do this).

However I'm not sure this is a good approach. Configuring Hadoop to
use public IP addresses everywhere should work, but you have to pay
for all data transfer between nodes (see http://aws.amazon.com/ec2/,
"Public and Elastic IP Data Transfer"). This is going to get expensive
fast!

So to get this to work well, we would have to make changes to Hadoop
so it was aware of both public and private addresses, and use the
appropriate one: clients would use the public address, while daemons
would use the private address. I haven't looked at what it would take
to do this or how invasive it would be.

Cheers,
Tom

On Thu, May 14, 2009 at 1:37 PM, Joydeep Sen Sarma <js...@facebook.com> wrote:
> I changed the ec2 scripts to have fs.default.name assigned to the public hostname (instead of the private hostname).
>
> Now I can submit jobs remotely via the socks proxy (the problem below is resolved) - but the map tasks fail with an exception:
>
>
> 2009-05-14 07:30:34,913 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001. Already tried 9 time(s).
> 2009-05-14 07:30:34,914 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
> java.io.IOException: Call to ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001 failed on local exception: Connection refused
>        at org.apache.hadoop.ipc.Client.call(Client.java:699)
>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>        at $Proxy1.getProtocolVersion(Unknown Source)
>        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
>        at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
>        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:177)
>        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:74)
>        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
>        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
>        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
>        at org.apache.hadoop.mapred.Child.main(Child.java:153)
>
>
> strangely enough - job submissions from nodes within the ec2 cluster work just fine. I looked at the job.xml files of jobs submitted locally and remotely and don't see any relevant differences.
>
> Totally foxed now.
>
> Joydeep
>
> -----Original Message-----
> From: Joydeep Sen Sarma [mailto:jssarma@facebook.com]
> Sent: Wednesday, May 13, 2009 9:38 PM
> To: core-user@hadoop.apache.org
> Cc: Tom White
> Subject: RE: public IP for datanode on EC2
>
> Thanks Philip. Very helpful (and great blog post)! This seems to make basic dfs command line operations work just fine.
>
> However - I am hitting a new error during job submission (running hadoop-0.19.0):
>
> 2009-05-14 00:15:34,430 ERROR exec.ExecDriver (SessionState.java:printError(279)) - Job Submission failed with exception 'java.net.UnknownHostException(unknown host: domU-12-31-39-00-51-94.compute-1.internal)'
> java.net.UnknownHostException: unknown host: domU-12-31-39-00-51-94.compute-1.internal
>        at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:195)
>        at org.apache.hadoop.ipc.Client.getConnection(Client.java:791)
>        at org.apache.hadoop.ipc.Client.call(Client.java:686)
>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>        at $Proxy0.getProtocolVersion(Unknown Source)
>        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:348)
>        at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
>        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:176)
>        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:75)
>        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
>        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
>        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
>        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
>        at org.apache.hadoop.mapred.JobClient.getFs(JobClient.java:469)
>        at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:603)
>        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:788)
>
>
> looking at the stack trace and the code - it seems that this is happening because the jobclient asks for the mapred system directory from the jobtracker - which replies back with a path name that's qualified against the fs.default.name setting of the jobtracker. Unfortunately the standard EC2 scripts assign this to the internal hostname of the hadoop master.
>
> Is there any downside to using public hostnames instead of the private ones in the ec2 starter scripts?
>
> Thanks for the help,
>
> Joydeep
>
>
> -----Original Message-----
> From: Philip Zeyliger [mailto:philip@cloudera.com]
> Sent: Wednesday, May 13, 2009 2:40 PM
> To: core-user@hadoop.apache.org
> Subject: Re: public IP for datanode on EC2
>
> On Tue, May 12, 2009 at 9:11 PM, Joydeep Sen Sarma <js...@facebook.com> wrote:
>> (raking up real old thread)
>>
>> After struggling with this issue for sometime now - it seems that accessing hdfs on ec2 from outside ec2 is not possible.
>>
>> This is primarily because of https://issues.apache.org/jira/browse/HADOOP-985. Even if datanode ports are authorized in ec2 and we set the public hostname via slave.host.name - the namenode uses the internal IP address of the datanodes for block locations. DFS clients outside ec2 cannot reach these addresses and report failure reading/writing data blocks.
>>
>> HDFS/EC2 gurus - would it be reasonable to ask for an option to not use IP addresses (and use datanode host names as pre-985)?
>>
>> I really like the idea of being able to use an external node (my personal workstation) to do job submission (which typically requires interacting with HDFS in order to push files into the jobcache etc). This way I don't need custom AMIs - I can use stock hadoop amis (all the custom software is on the external node). Without the above option - this is not possible currently.
>
> You could use ssh to set up a SOCKS proxy between your machine and
> ec2, and setup org.apache.hadoop.net.SocksSocketFactory to be the
> socket factory.
> http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/
> has more information.
>
> -- Philip
>

RE: public IP for datanode on EC2

Posted by Joydeep Sen Sarma <js...@facebook.com>.
I changed the EC2 scripts to have fs.default.name assigned to the public hostname (instead of the private hostname).

Now I can submit jobs remotely via the SOCKS proxy (the problem below is resolved) - but the map tasks fail with an exception:


2009-05-14 07:30:34,913 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001. Already tried 9 time(s).
2009-05-14 07:30:34,914 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: Call to ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001 failed on local exception: Connection refused
        at org.apache.hadoop.ipc.Client.call(Client.java:699)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at $Proxy1.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
        at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:177)
        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:74)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
        at org.apache.hadoop.mapred.Child.main(Child.java:153)


Strangely enough, job submissions from nodes within the EC2 cluster work just fine. I looked at the job.xml files of jobs submitted locally and remotely and don't see any relevant differences.

Totally foxed now.

Joydeep
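
For illustration, the change described at the top of this message boils down to a hadoop-site.xml entry like the following on the cluster - a sketch only, using the public master hostname and namenode port that appear in the log output above:

<!-- Sketch: point fs.default.name at the master's public EC2 hostname,
     which resolves to the internal address from inside the cluster and
     to the public address from outside. -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://ec2-75-101-199-45.compute-1.amazonaws.com:50001/</value>
</property>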

-----Original Message-----
From: Joydeep Sen Sarma [mailto:jssarma@facebook.com] 
Sent: Wednesday, May 13, 2009 9:38 PM
To: core-user@hadoop.apache.org
Cc: Tom White
Subject: RE: public IP for datanode on EC2

Thanks Philip. Very helpful (and great blog post)! This seems to make basic dfs command line operations work just fine.

However - I am hitting a new error during job submission (running hadoop-0.19.0):

2009-05-14 00:15:34,430 ERROR exec.ExecDriver (SessionState.java:printError(279)) - Job Submission failed with exception 'java.net.UnknownHostException(unknown host: domU-12-31-39-00-51-94.compute-1.internal)'
java.net.UnknownHostException: unknown host: domU-12-31-39-00-51-94.compute-1.internal
	at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:195)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:791)
	at org.apache.hadoop.ipc.Client.call(Client.java:686)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
	at $Proxy0.getProtocolVersion(Unknown Source)
	at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:348)
	at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:176)
	at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:75)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
	at org.apache.hadoop.mapred.JobClient.getFs(JobClient.java:469)
	at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:603)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:788)


looking at the stack trace and the code - it seems that this is happening because the jobclient asks for the mapred system directory from the jobtracker - which replies back with a path name that's qualified against the fs.default.name setting of the jobtracker. Unfortunately the standard EC2 scripts assign this to the internal hostname of the hadoop master.

Is there any downside to using public hostnames instead of the private ones in the ec2 starter scripts?

Thanks for the help,

Joydeep


-----Original Message-----
From: Philip Zeyliger [mailto:philip@cloudera.com] 
Sent: Wednesday, May 13, 2009 2:40 PM
To: core-user@hadoop.apache.org
Subject: Re: public IP for datanode on EC2

On Tue, May 12, 2009 at 9:11 PM, Joydeep Sen Sarma <js...@facebook.com> wrote:
> (raking up real old thread)
>
> After struggling with this issue for sometime now - it seems that accessing hdfs on ec2 from outside ec2 is not possible.
>
> This is primarily because of https://issues.apache.org/jira/browse/HADOOP-985. Even if datanode ports are authorized in ec2 and we set the public hostname via slave.host.name - the namenode uses the internal IP address of the datanodes for block locations. DFS clients outside ec2 cannot reach these addresses and report failure reading/writing data blocks.
>
> HDFS/EC2 gurus - would it be reasonable to ask for an option to not use IP addresses (and use datanode host names as pre-985)?
>
> I really like the idea of being able to use an external node (my personal workstation) to do job submission (which typically requires interacting with HDFS in order to push files into the jobcache etc). This way I don't need custom AMIs - I can use stock hadoop amis (all the custom software is on the external node). Without the above option - this is not possible currently.

You could use ssh to set up a SOCKS proxy between your machine and
ec2, and setup org.apache.hadoop.net.SocksSocketFactory to be the
socket factory.
http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/
has more information.

-- Philip

RE: public IP for datanode on EC2

Posted by Joydeep Sen Sarma <js...@facebook.com>.
Thanks Philip. Very helpful (and a great blog post)! This seems to make basic DFS command-line operations work just fine.

However - I am hitting a new error during job submission (running hadoop-0.19.0):

2009-05-14 00:15:34,430 ERROR exec.ExecDriver (SessionState.java:printError(279)) - Job Submission failed with exception 'java.net.UnknownHostException(unknown host: domU-12-31-39-00-51-94.compute-1.internal)'
java.net.UnknownHostException: unknown host: domU-12-31-39-00-51-94.compute-1.internal
	at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:195)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:791)
	at org.apache.hadoop.ipc.Client.call(Client.java:686)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
	at $Proxy0.getProtocolVersion(Unknown Source)
	at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:348)
	at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:176)
	at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:75)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
	at org.apache.hadoop.mapred.JobClient.getFs(JobClient.java:469)
	at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:603)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:788)


Looking at the stack trace and the code, it seems that this is happening because the JobClient asks the JobTracker for the mapred system directory, and the JobTracker replies with a path name that's qualified against its own fs.default.name setting. Unfortunately, the standard EC2 scripts assign this to the internal hostname of the Hadoop master.

Is there any downside to using public hostnames instead of the private ones in the EC2 starter scripts?

Thanks for the help,

Joydeep


-----Original Message-----
From: Philip Zeyliger [mailto:philip@cloudera.com] 
Sent: Wednesday, May 13, 2009 2:40 PM
To: core-user@hadoop.apache.org
Subject: Re: public IP for datanode on EC2

On Tue, May 12, 2009 at 9:11 PM, Joydeep Sen Sarma <js...@facebook.com> wrote:
> (raking up real old thread)
>
> After struggling with this issue for sometime now - it seems that accessing hdfs on ec2 from outside ec2 is not possible.
>
> This is primarily because of https://issues.apache.org/jira/browse/HADOOP-985. Even if datanode ports are authorized in ec2 and we set the public hostname via slave.host.name - the namenode uses the internal IP address of the datanodes for block locations. DFS clients outside ec2 cannot reach these addresses and report failure reading/writing data blocks.
>
> HDFS/EC2 gurus - would it be reasonable to ask for an option to not use IP addresses (and use datanode host names as pre-985)?
>
> I really like the idea of being able to use an external node (my personal workstation) to do job submission (which typically requires interacting with HDFS in order to push files into the jobcache etc). This way I don't need custom AMIs - I can use stock hadoop amis (all the custom software is on the external node). Without the above option - this is not possible currently.

You could use ssh to set up a SOCKS proxy between your machine and
ec2, and setup org.apache.hadoop.net.SocksSocketFactory to be the
socket factory.
http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/
has more information.

-- Philip

Re: public IP for datanode on EC2

Posted by Philip Zeyliger <ph...@cloudera.com>.
On Tue, May 12, 2009 at 9:11 PM, Joydeep Sen Sarma <js...@facebook.com> wrote:
> (raking up real old thread)
>
> After struggling with this issue for sometime now - it seems that accessing hdfs on ec2 from outside ec2 is not possible.
>
> This is primarily because of https://issues.apache.org/jira/browse/HADOOP-985. Even if datanode ports are authorized in ec2 and we set the public hostname via slave.host.name - the namenode uses the internal IP address of the datanodes for block locations. DFS clients outside ec2 cannot reach these addresses and report failure reading/writing data blocks.
>
> HDFS/EC2 gurus - would it be reasonable to ask for an option to not use IP addresses (and use datanode host names as pre-985)?
>
> I really like the idea of being able to use an external node (my personal workstation) to do job submission (which typically requires interacting with HDFS in order to push files into the jobcache etc). This way I don't need custom AMIs - I can use stock hadoop amis (all the custom software is on the external node). Without the above option - this is not possible currently.

You could use ssh to set up a SOCKS proxy between your machine and
EC2, and configure org.apache.hadoop.net.SocksSocketFactory as the
socket factory.
http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/
has more information.

-- Philip