Posted to hdfs-user@hadoop.apache.org by Paul Rogers <pa...@gmail.com> on 2013/11/21 11:20:32 UTC

Re: Issue with map reduce on examples - SOLVED

Hi Guys

This was indeed a DNS issue (specifically a problem with the reverse
lookup).  The resolution was as follows:

I found a Stack Overflow post suggesting using

hadoop-dns-checker <https://github.com/sujee/hadoop-dns-checker>

This showed that whilst forward and reverse lookup for localhost were fine,
the same lookups for the machine's hostname were not working.  Fixing that
resolved the problem.
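For anyone wanting to reproduce the check without pulling in the full tool,
a minimal sketch of the same forward/reverse comparison, using only Python's
standard socket module (hadoop-dns-checker itself does more, and runs the
check across all cluster nodes), might look like this:

```python
import socket

def check_dns(hostname):
    """Forward-resolve hostname, then reverse-resolve the resulting IP,
    and report whether the two names agree -- the consistency that
    hadoop-dns-checker verifies for each host."""
    try:
        ip = socket.gethostbyname(hostname)            # forward lookup
        reverse_name, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
    except OSError as err:
        return {"hostname": hostname, "error": str(err), "consistent": False}
    return {
        "hostname": hostname,
        "ip": ip,
        "reverse": reverse_name,
        # compare short names so e.g. "lt001" matches "lt001.localdomain"
        "consistent": reverse_name.split(".")[0] == hostname.split(".")[0],
    }

print(check_dns("localhost"))           # the healthy case described above
print(check_dns(socket.gethostname()))  # the lookup that was failing here
```

On the broken setup described above, the second call would show the reverse
name disagreeing with (or failing for) the hostname.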

The actual problem was that the host was getting its IP via DHCP (via the
router), which was separate from the (dnsmasq) DNS server.  As such it was
not registering its hostname with the DNS server.  Enabling the dnsmasq
DHCP server and disabling the one on the router fixed the issue - the host
now gets its IP address from dnsmasq and registers its hostname with the
DNS server.  DNS lookup and reverse lookup now work for both localhost and
the host/hostname.
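For reference, the dnsmasq side of that change is roughly the following
(the range, lease time and domain below are illustrative values, not the
exact ones used here), e.g. in /etc/dnsmasq.conf:

```
# /etc/dnsmasq.conf -- illustrative values, adjust for your network
dhcp-range=192.168.1.100,192.168.1.200,12h   # dnsmasq hands out leases
domain=localdomain                           # local DNS domain
expand-hosts                                 # qualify bare hostnames
# dnsmasq automatically adds DHCP clients' hostnames to its DNS,
# so forward and reverse lookups for them then work.
```

A common alternative workaround (not what was done here) is to pin the
hostname and its IP in /etc/hosts on each node, though that has to be kept
in sync by hand.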

Hope that helps someone.  Apologies for the noise.

P


On 20 November 2013 15:10, Paul Rogers <pa...@gmail.com> wrote:

> UPDATE
>
> I think I have some more info.  If I look at the running reduce task
>
>     (
> http://localhost:50030/taskdetails.jsp?tipid=task_201311201256_0001_r_000000
> )
>
>
>
> I see it is assigned to machine /default-rack/hit-nxdomain.opendns.com
>
> If I then try and click on the "Last 4KB Task Logs" link it sends me to
>
>
> http://hit-nxdomain.opendns.com:50060/tasklog?attemptid=attempt_201311201256_0001_r_000000_0&start=-4097
>
> amending this URL to
>
>
> http://localhost:50060/tasklog?attemptid=attempt_201311201256_0001_r_000000_0&start=-4097
>
> then shows the log with many examples of the following:
>
>     2013-11-20 14:59:54,726 INFO org.apache.hadoop.mapred.ReduceTask:
> Penalized(slow) Hosts:
>     2013-11-20 14:59:54,726 INFO org.apache.hadoop.mapred.ReduceTask:
> hit-nxdomain.opendns.com Will be considered after: 814 seconds.
>     2013-11-20 15:00:54,729 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201311201256_0001_r_000000_0 Need another 4 map output(s) where 0
> is already in progress
>     2013-11-20 15:00:54,729 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201311201256_0001_r_000000_0 Scheduled 0 outputs (1 slow hosts and0
> dup hosts)
>     2013-11-20 15:00:54,730 INFO org.apache.hadoop.mapred.ReduceTask:
> Penalized(slow) Hosts:
>     2013-11-20 15:00:54,730 INFO org.apache.hadoop.mapred.ReduceTask:
> hit-nxdomain.opendns.com Will be considered after: 754 seconds.
>
> So it seems that hadoop thinks the task is running on the
> hit-nxdomain.opendns.com host.
>
> The host (localhost) picks its DNS settings up via DHCP, with the router
> set as the DNS server.  The router in turn uses opendns.com to resolve
> external addresses.
>
> Am I right in thinking this is therefore a DNS issue?
>
> Any idea how hadoop has ended up with this host name?
>
> Any idea how to fix it?
>
> Many thanks
>
>
> Paul
>
>
> On 18 November 2013 12:53, Paul Rogers <pa...@gmail.com> wrote:
>
>> Hi All
>>
>> Having some problems with map reduce running in pseudo-distributed mode.
>>  I am running version 1.2.1 on Linux.  I have:
>> 1. created $JAVA_HOME & $HADOOP_HOME and added the respective bin
>> directories to the path;
>> 2. Formatted the dfs;
>> 3. executed start-dfs.sh and start-mapred.sh.
>>
>> Executing jps seems to show everything running that should be running (I
>> think).
>>
>> [paul@lt001 bin]$ jps
>> 8724 TaskTracker
>> 8487 SecondaryNameNode
>> 8841 Jps
>> 8353 DataNode
>> 7239 NameNode
>> 8597 JobTracker
>>
>> I have then tried to run the wordcount and pi examples with similar
>> results, eg:
>>
>> [paul@lt001 bin]$ hadoop jar hadoop/hadoop-examples-1.2.1.jar pi 4 1000
>> Warning: $HADOOP_HOME is deprecated.
>>
>> Number of Maps  = 4
>> Samples per Map = 1000
>> Wrote input for Map #0
>> Wrote input for Map #1
>> Wrote input for Map #2
>> Wrote input for Map #3
>> Starting Job
>> 13/11/18 10:31:38 INFO mapred.FileInputFormat: Total input paths to
>> process : 4
>> 13/11/18 10:31:39 INFO mapred.JobClient: Running job:
>> job_201311181028_0001
>> 13/11/18 10:31:40 INFO mapred.JobClient:  map 0% reduce 0%
>> 13/11/18 10:31:47 INFO mapred.JobClient:  map 50% reduce 0%
>> 13/11/18 10:31:52 INFO mapred.JobClient:  map 100% reduce 0%
>>
>> In each instance the output reaches the map 100% reduce 0% stage then
>> stalls.  No matter how long I wait the job does not advance any further.  I
>> have checked the logs and the one I suspect is indicating the problem
>> is hadoop-paul-tasktracker-lt001.log which has the following output:
>>
>> 2013-11-18 10:31:55,969 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201311181028_0001_r_000000_0 0.0% reduce > copy >
>> 2013-11-18 10:34:59,148 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201311181028_0001_r_000000_0 0.0% reduce > copy >
>> 2013-11-18 10:35:05,196 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201311181028_0001_r_000000_0 0.0% reduce > copy >
>> 2013-11-18 10:35:11,253 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201311181028_0001_r_000000_0 0.0% reduce > copy >
>>
>> ..........
>>
>> 2013-11-18 11:10:03,259 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201311181028_0001_r_000000_0 0.0% reduce > copy >
>> 2013-11-18 11:10:06,290 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201311181028_0001_r_000000_0 0.0% reduce > copy >
>> 2013-11-18 11:10:12,320 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201311181028_0001_r_000000_0 0.0% reduce > copy >
>> 2013-11-18 11:10:18,343 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201311181028_0001_r_000000_0 0.0% reduce > copy >
>> 2013-11-18 11:10:21,369 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201311181028_0001_r_000000_0 0.0% reduce > copy >
>> 2013-11-18 11:10:27,395 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201311181028_0001_r_000000_0 0.0% reduce > copy >
>> 2013-11-18 11:10:33,426 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201311181028_0001_r_000000_0 0.0% reduce > copy >
>> 2013-11-18 11:10:36,463 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201311181028_0001_r_000000_0 0.0% reduce > copy >
>>
>> It seems it is stuck on reduce > copy > but why?  Can anyone help with
>> where to look next?
>>
>> Many thanks
>>
>>
>> Paul
>>
>
>