Posted to user@hadoop.apache.org by saurabh pratap singh <sa...@gmail.com> on 2019/09/12 07:04:13 UTC

accessing hdfs cluster through ssh tunnel

Hadoop version: 2.8.5

I have an HDFS cluster set up in a private data center (not exposed to the
internet). In the same data center I have another node (a gateway node),
whose purpose is to provide access to HDFS from an edge machine (outside
the data center) over the public internet. To enable this setup I opened an
SSH tunnel from the edge machine to the namenode host and port (9000)
through the gateway node, something like:

ssh -N -L <local-port>:<namenode-private-ip>:<namenode-port> \
    <gateway-user>@<gateway-host> -i <ssh-key> -vvvv

From the edge machine, hadoop fs -ls hdfs://localhost:<local-port> works
fine, but hadoop fs -put <some-file> hdfs://localhost:<local-port> fails
with the following error message:

org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/<private-ip-of-datanode>:50010]
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
        at org.apache.hadoop.hdfs.DataStreamer.createSocketForPipeline(DataStreamer.java:253)
        at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1725)
        at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1679)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:716)


It looks like the client is trying to write directly to the private IP
address of a datanode. How do I resolve this?

Do let me know if any other information is needed.

Thanks

Re: accessing hdfs cluster through ssh tunnel

Posted by saurabh pratap singh <sa...@gmail.com>.
Hi all,

I was not satisfied with the iptables approach described in my earlier
message, so I tried the Hadoop SOCKS server config on the client end and
used ssh with the -D option, as suggested by Hariharan Iyer (thank you for
that). It worked as expected, without the need to open separate SSH tunnels
for the datanodes.
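
For reference, a minimal sketch of that setup, assuming 1080 as the local
SOCKS port (the hostnames are placeholders; the property names are the
standard Hadoop ones):

ssh -N -D 1080 <gateway-user>@<gateway-host> -i <ssh-key>

and in core-site.xml on the edge machine:

<property>
  <name>hadoop.rpc.socket.factory.class.default</name>
  <value>org.apache.hadoop.net.SocksSocketFactory</value>
</property>
<property>
  <name>hadoop.socks.server</name>
  <value>localhost:1080</value>
</property>

With that in place the client reaches both the namenode and the datanodes
through the one SOCKS tunnel, e.g. hadoop fs -ls
hdfs://<namenode-private-ip>:9000/.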

Thanks.


Re: accessing hdfs cluster through ssh tunnel

Posted by saurabh pratap singh <sa...@gmail.com>.
Thank you all for your help.
The solution that worked for me is as follows. I opened an SSH tunnel to
the namenode, which makes hadoop fs -ls work. To make hadoop fs -put work
(it was timing out because the namenode returns the private IP addresses of
the datanodes, which the edge machine cannot reach), I redirected each
datanode's private IP to <some-port> on localhost of the edge machine by
adding an iptables entry per datanode. In addition, I opened an SSH tunnel
that forwards all traffic from localhost:<some-port> to the datanode
private IPs via the gateway machine.
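
Roughly, that means one tunnel plus one NAT rule per datanode; a sketch
with placeholder addresses (note that on recent Linux kernels, DNAT from
the OUTPUT chain to 127.0.0.1 also requires
net.ipv4.conf.all.route_localnet=1):

# forward a local port to this datanode's transfer port via the gateway
ssh -N -L <some-port>:<datanode-private-ip>:50010 <gateway-user>@<gateway-host> -i <ssh-key>

# redirect locally generated traffic for that datanode into the tunnel
sudo iptables -t nat -A OUTPUT -d <datanode-private-ip> -p tcp --dport 50010 \
    -j DNAT --to-destination 127.0.0.1:<some-port>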

I was hoping there would be some Hadoop configuration I could set so that I
don't have to do all this setup myself. I found the hadoop.socks.server
config, but it didn't work for me. I tried setting hadoop.socks.server to
localhost:<port> (whose traffic is tunneled via the gateway node) and
setting the SOCKS socket factory config in core-site, first on the client
side and then on the server side.
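
From what I read, the server-side part is supposed to be pinning the
cluster daemons to the plain socket factory, so that a SOCKS factory set on
a client never leaks into them; a sketch of that core-site entry on the
cluster side:

<property>
  <name>hadoop.rpc.socket.factory.class.default</name>
  <value>org.apache.hadoop.net.StandardSocketFactory</value>
  <final>true</final>
</property>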





Re: accessing hdfs cluster through ssh tunnel

Posted by Hariharan Iyer <hi...@qubole.com>.
You will have to use a SOCKS proxy (the -D option of the ssh tunnel). In
addition, when invoking the hadoop fs command, you will have to add
-Dsocks.proxyHost and -Dsocks.proxyPort.
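
A sketch of what that might look like, assuming 1080 as the local SOCKS
port (passing the Java SOCKS properties through HADOOP_OPTS is one way to
hand them to the client JVM; the client also needs a SOCKS-aware socket
factory such as org.apache.hadoop.net.SocksSocketFactory configured):

ssh -N -D 1080 <gateway-user>@<gateway-host> -i <ssh-key>

HADOOP_OPTS="-Dsocks.proxyHost=localhost -Dsocks.proxyPort=1080" \
    hadoop fs -ls hdfs://<namenode-private-ip>:9000/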

Thanks,
Hariharan


Re: accessing hdfs cluster through ssh tunnel

Posted by saurabh pratap singh <sa...@gmail.com>.
Thank you so much for your reply.
I have a further question: there are some blogs that describe a similar
setup, like this one:

https://github.com/vkovalchuk/hadoop-2.6.0-windows/wiki/How-to-access-HDFS-behind-firewall-using-SOCKS-proxy

I am just curious how that works.


Re: accessing hdfs cluster through ssh tunnel

Posted by Julien Laurenceau <ju...@pepitedata.com>.
Hi,
Hadoop is designed to avoid proxies, since a proxy would act as a
bottleneck. The namenode is only used to obtain a direct client/datanode
socket, specific to each job.
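
You can see that handoff with fsck, which asks the namenode for the
datanode locations of every block of a file (the path is a placeholder):

hdfs fsck /some/file -files -blocks -locations

Each block is listed together with the datanode addresses that a client
must then reach directly.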


Re: accessing hdfs cluster through ssh tunnel

Posted by "Tony S. Wu" <to...@gmail.com>.
You need connectivity from the edge node to the entire cluster, not just
the namenode. Your topology, unfortunately, probably won’t work too well. A
proper VPN / IPsec tunnel might be a better idea.
