Posted to mapreduce-user@hadoop.apache.org by Eric Sammer <er...@lifeless.net> on 2010/01/20 23:13:53 UTC
Task tracker reported machine name / IP

All:

I have a situation where I have to rely on less than stellar hosts files
right now. This will be cleaned up in the future. For now, I wanted to
get some verification on how task trackers figure out and communicate
their IP / hostname to the JT.

When a task tracker starts, it performs some voodoo to figure out its
machine name and IP address. Here is where I think things go south for
me. It seems to be in o.a.h.mapred.TaskTracker#initialize(). A config
variable mapreduce.tasktracker.host.name is pulled from the supplied
JobConf in the constructor. It seems that this would allow one to get
around a guessed hostname and IP due to a bad hosts file but nothing I
do seems to affect it in a meaningful way. Setting this in
mapred-site.xml has no effect. I also noticed that TaskTracker uses
o.a.h.net.NetUtils which is a bit strange. There is some notion of a
static host map; is this exposed via configuration somewhere?

I've tried setting the TT HTTP listen address explicitly as well as the
DNS interface property to its proper value, but nothing seems to work.

The exact problem I'm fighting is too many fetch failures during jobs.
It looks like task trackers are trying to fetch mapper outputs from
127.0.0.1.

2010-01-19 21:55:06,791 INFO org.apache.hadoop.mapred.TaskTracker: Starting thread: Map-events fetcher for all reduce tasks on tracker_localhost.localdomain:localhost.localdomain/127.0.0.1:43817
...
2010-01-19 22:06:52,726 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 127.0.0.1:50060, dest: 127.0.0.1:40975, bytes: 0, op: MAPRED_SHUFFLE, cliID: attempt_201001192118_0002_m_000002_0

These log entries seem to indicate that this task tracker selects
localhost.localdomain/127.0.0.1 regardless of any settings. The second
entry looks like the bad fetch of map output I mentioned. Eventually
this job dies with too many fetch failures. If I remove all task
trackers except for one running on the same machine as the JT, jobs
work as expected.

After reading through the code (as best I can) and tracing some of the
machine name resolution bits, it seems as if the machine's configured
hostname (and the IP it resolves to, by whatever means) is the address
advertised by the TT. Is this correct? If not, what am I missing? Is there
any way to force a TT to advertise a specific hostname (and related IP)
regardless of the host's configuration? If not, does anyone else feel
like there should be?
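
To make the suspected behavior concrete, here is a minimal standalone
sketch, not the TaskTracker code itself. It assumes the TT ends up with
whatever the JVM resolves for the local hostname, and it prints that
result next to the non-loopback addresses actually bound to the
machine's interfaces. The class and method names
(AdvertisedAddressCheck, isLoopback, nonLoopbackAddresses) are
hypothetical; only the java.net calls are real.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic class; not part of Hadoop.
class AdvertisedAddressCheck {

    /** Returns true if the given IP literal is a loopback address such as 127.0.0.1. */
    public static boolean isLoopback(String ipLiteral) {
        try {
            // For an IP literal, getByName only validates the format; no lookup is done.
            return InetAddress.getByName(ipLiteral).isLoopbackAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Not a valid IP literal: " + ipLiteral, e);
        }
    }

    /** Collects every non-loopback address bound to a local network interface. */
    public static List<String> nonLoopbackAddresses() throws SocketException {
        List<String> result = new ArrayList<>();
        for (NetworkInterface nif : Collections.list(NetworkInterface.getNetworkInterfaces())) {
            for (InetAddress addr : Collections.list(nif.getInetAddresses())) {
                if (!addr.isLoopbackAddress()) {
                    result.add(nif.getName() + " -> " + addr.getHostAddress());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // What the JVM thinks this machine is called and which IP that name maps to.
        InetAddress local = InetAddress.getLocalHost();
        System.out.println("hostname: " + local.getHostName());
        System.out.println("address:  " + local.getHostAddress());

        if (isLoopback(local.getHostAddress())) {
            // With a hosts file that maps the machine name to 127.0.0.1,
            // this branch is the one I would expect to hit.
            System.out.println("WARNING: local hostname resolves to a loopback address");
        }

        // Addresses other nodes could actually reach, for comparison.
        for (String line : nonLoopbackAddresses()) {
            System.out.println("candidate: " + line);
        }
    }
}
```

Running this on an affected node should show whether the hostname
resolves to a loopback address before any Hadoop configuration comes
into play.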

I completely understand the correct answer is to fix the hosts file or
not depend on it at all, deferring to DNS. But it does seem like this
bit of the code is overly complicated and brittle.

Thoughts?
Thanks.
-- 
Eric Sammer
eric@lifeless.net
http://esammer.blogspot.com