Posted to mapreduce-user@hadoop.apache.org by Ben Clay <rb...@ncsu.edu> on 2011/07/18 21:53:45 UTC

TaskTrackers behind NAT

I'd like to spread Hadoop across two physical clusters, one which is
publicly accessible and the other which is behind a NAT.  The NAT'd machines
will only run TaskTrackers, not HDFS, and not Reducers either (configured
with 0 Reduce slots).  The master node will run in the publicly-available
cluster.


Two questions:


1. Port 50060 needs to be opened for all NAT'd machines, since Reduce tasks
fetch intermediate data from http://<tasktracker>:50060/mapOutput, correct?
I'm getting "Too many fetch-failures" with no open ports, so I assume the
Reduce tasks need to pull the intermediate data instead of the Map tasks
pushing it.
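
For reference, a minimal reachability probe run from a node with reduce slots
would look something like the sketch below (the 10.0.0.5 address is just a
placeholder for one of the NAT'd TaskTrackers, not our actual config):

    import java.net.HttpURLConnection;
    import java.net.URL;

    // Quick check that a NAT'd TaskTracker's HTTP port is reachable from
    // the reduce side.  Any HTTP response at all means the port is open
    // through the NAT; a connect timeout means shuffle fetches will fail too.
    public class CheckMapOutputPort {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://10.0.0.5:50060/");  // placeholder IP
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            conn.connect();
            System.out.println("HTTP " + conn.getResponseCode() + " from " + url);
            conn.disconnect();
        }
    }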


2. Although the NAT'd machines have unique IPs and can reach the outside
network, the DHCP server is not assigning them hostnames.  Therefore, when
they join the
JobTracker I get
"tracker_localhost.localdomain:localhost.localdomain/127.0.0.1" on the
machine list page.  Is there some way to force Hadoop to refer to them via
IP instead of hostname, since I don't have control over the DHCP?  I could
manually assign a hostname via /etc/hosts on each NAT'd machine, but these
are actually VMs and I will have many of them receiving semi-random IPs,
making this an ugly administrative task.
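
To illustrate what these VMs see locally, here's a trivial resolution sketch;
I assume its output (localhost.localdomain / 127.0.0.1 on these machines) is
roughly the name the TaskTracker ends up reporting when it registers:

    import java.net.InetAddress;

    // Prints what local hostname resolution returns on one of the NAT'd VMs.
    // With no DHCP-assigned hostname this comes back as localhost.localdomain
    // and 127.0.0.1, which matches the tracker name shown on the machine list.
    public class ShowLocalName {
        public static void main(String[] args) throws Exception {
            InetAddress addr = InetAddress.getLocalHost();
            System.out.println("hostname  = " + addr.getHostName());
            System.out.println("canonical = " + addr.getCanonicalHostName());
            System.out.println("address   = " + addr.getHostAddress());
        }
    }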


Thanks for any input!


-Ben



Re: TaskTrackers behind NAT

Posted by Allen Wittenauer <aw...@apache.org>.
On Jul 18, 2011, at 12:53 PM, Ben Clay wrote:

> I'd like to spread Hadoop across two physical clusters, one which is
> publicly accessible and the other which is behind a NAT. The NAT'd machines
> will only run TaskTrackers, not HDFS, and not Reducers either (configured
> with 0 Reduce slots).  The master node will run in the publicly-available
> cluster.

	Off the top, I doubt it will work: MR is bi-directional, across many random ports.  So I would suspect there is going to be a lot of hackiness in the network config to make this work.

> 1. Port 50060 needs to be opened for all NAT'd machines, since Reduce tasks
> fetch intermediate data from http://<tasktracker>:50060/mapOutput, correct?
> I'm getting "Too many fetch-failures" with no open ports, so I assume the
> Reduce tasks need to pull the intermediate data instead of the Map tasks
> pushing it.

	Correct. Reduce tasks pull.

> 2. Although the NAT'd machines have unique IPs and can reach the outside
> network, the DHCP server is not assigning them hostnames.  Therefore, when
> they join the
> JobTracker I get
> "tracker_localhost.localdomain:localhost.localdomain/127.0.0.1" on the
> machine list page.  Is there some way to force Hadoop to refer to them via
> IP instead of hostname, since I don't have control over the DHCP? I could
> manually assign a hostname via /etc/hosts on each NAT'd machine, but these
> are actually VMs and I will have many of them receiving semi-random IPs,
> making this an ugly administrative task.


	Short answer: no.

	Long answer: no, fix your DHCP and/or do the /etc/hosts hack.
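
	If you do go the /etc/hosts route, generating the entries is easy enough
to script.  A rough sketch (the node-<ip> naming scheme is arbitrary, pick
whatever you like) that you'd run on each VM, appending its output to
/etc/hosts and setting the hostname to match:

    import java.net.InetAddress;
    import java.net.NetworkInterface;
    import java.util.Collections;

    // Emits an /etc/hosts line for this VM's first non-loopback IPv4 address,
    // e.g. "10.0.0.5   node-10-0-0-5".  Append it to /etc/hosts and set the
    // machine's hostname to the same name so the TaskTracker registers with
    // something other than localhost.localdomain.
    public class HostsEntry {
        public static void main(String[] args) throws Exception {
            for (NetworkInterface nic :
                    Collections.list(NetworkInterface.getNetworkInterfaces())) {
                for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                    if (!addr.isLoopbackAddress()
                            && addr.getHostAddress().indexOf(':') < 0) {
                        String ip = addr.getHostAddress();
                        System.out.println(ip + "\t" + "node-" + ip.replace('.', '-'));
                        return;
                    }
                }
            }
        }
    }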