Posted to common-user@hadoop.apache.org by Yunhong Gu1 <yg...@cs.uic.edu> on 2008/01/23 01:42:00 UTC
reduce > copy ?? (was Reduce hangs)
Hi, All
I tried many possible configurations and I think this is the deepest
cause I can dig out so far. The whole "Reduce hangs" happens because one
tasktracker does not make any progress at all. It sits in "reduce > copy"
forever, as shown below.
This is very easy to reproduce on my machines (two dual-core AMD
Opterons at 3.0 GHz, Debian Linux, kernel 2.6, running in 32-bit mode).
gu@ncdm-8:~/hadoop-0.15.2/logs$ tail hadoop-gu-tasktracker-ncdm-8.log
2008-01-22 18:34:45,591 INFO org.apache.hadoop.mapred.TaskTracker:
task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00 MB/s) >
2008-01-22 18:34:48,596 INFO org.apache.hadoop.mapred.TaskTracker:
task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00 MB/s) >
2008-01-22 18:34:54,605 INFO org.apache.hadoop.mapred.TaskTracker:
task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00 MB/s) >
2008-01-22 18:35:00,619 INFO org.apache.hadoop.mapred.TaskTracker:
task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00 MB/s) >
2008-01-22 18:35:03,627 INFO org.apache.hadoop.mapred.TaskTracker:
task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00 MB/s) >
2008-01-22 18:35:09,644 INFO org.apache.hadoop.mapred.TaskTracker:
task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00 MB/s) >
2008-01-22 18:35:15,661 INFO org.apache.hadoop.mapred.TaskTracker:
task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00 MB/s) >
2008-01-22 18:35:18,670 INFO org.apache.hadoop.mapred.TaskTracker:
task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00 MB/s) >
2008-01-22 18:35:24,687 INFO org.apache.hadoop.mapred.TaskTracker:
task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00 MB/s) >
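Since the copier sits at 0.00 MB/s, one quick check is whether this node
can reach the other tasktracker's HTTP server (which serves the map
output) at all. A minimal sketch; 50060 is the default tasktracker HTTP
port, and ncdm15 / 10.0.0.2 are the other node's hostname and intended
address as they appear later in this thread:

gu@ncdm-8:~$ wget -O /dev/null http://ncdm15:50060/      # fetch by hostname
gu@ncdm-8:~$ wget -O /dev/null http://10.0.0.2:50060/    # fetch by the 10.0.0.x address

If the hostname form hangs or times out while the raw address works, the
tracker is advertising a name that resolves to the wrong interface.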
On Tue, 22 Jan 2008, Yunhong Gu1 wrote:
>
> Thanks, I tried that but it is probably not the reason. I checked the
> network connection using "netstat" and the client is actually connected
> to the correct server address. In addition, "mrbench" works sometimes;
> if it were a network problem, nothing should work at all.
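> For reference, the check was along these lines (assuming the Linux
> net-tools netstat; exact flags may differ on other systems):
>
> gu@ncdm-8:~$ netstat -tnp | grep java   # TCP, numeric, with owning process
>
> which lists the TCP connections of the java processes together with
> their peer addresses, so a connection through the wrong interface would
> stand out immediately.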
>
> I let the "sort" program run longer and got some interesting output:
> the reduce progress can actually decrease and oscillate (see the bottom
> part of the output below).
>
> The error information:
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 08/01/22 13:56:21 WARN mapred.JobClient: Error reading task outputncdm15
>
> What can cause "Error reading task outputncdm15"?
>
> gu@ncdm-8:~/hadoop-0.15.2$ ./bin/hadoop jar hadoop-0.15.2-examples.jar sort
> rand randsort
> Running on 2 nodes to sort from /user/gu/rand into /user/gu/randsort with 8
> reduces.
> Job started: Tue Jan 22 11:56:39 CST 2008
> 08/01/22 11:56:39 INFO mapred.FileInputFormat: Total input paths to process:
> 20
> 08/01/22 11:56:40 INFO mapred.JobClient: Running job: job_200801221154_0005
> 08/01/22 11:56:41 INFO mapred.JobClient: map 0% reduce 0%
> 08/01/22 11:56:54 INFO mapred.JobClient: map 1% reduce 0%
> 08/01/22 11:56:58 INFO mapred.JobClient: map 2% reduce 0%
> 08/01/22 11:57:04 INFO mapred.JobClient: map 3% reduce 0%
> 08/01/22 11:57:08 INFO mapred.JobClient: map 5% reduce 0%
> 08/01/22 11:57:15 INFO mapred.JobClient: map 7% reduce 0%
> 08/01/22 11:57:23 INFO mapred.JobClient: map 8% reduce 0%
> 08/01/22 11:57:25 INFO mapred.JobClient: map 9% reduce 0%
> 08/01/22 11:57:27 INFO mapred.JobClient: map 10% reduce 0%
> 08/01/22 11:57:34 INFO mapred.JobClient: map 11% reduce 0%
> 08/01/22 11:57:35 INFO mapred.JobClient: map 12% reduce 0%
> 08/01/22 11:57:42 INFO mapred.JobClient: map 13% reduce 0%
> 08/01/22 11:57:48 INFO mapred.JobClient: map 14% reduce 0%
> 08/01/22 11:57:49 INFO mapred.JobClient: map 15% reduce 0%
> 08/01/22 11:57:56 INFO mapred.JobClient: map 17% reduce 0%
> 08/01/22 11:58:05 INFO mapred.JobClient: map 19% reduce 0%
> 08/01/22 11:58:12 INFO mapred.JobClient: map 20% reduce 0%
> 08/01/22 11:58:13 INFO mapred.JobClient: map 20% reduce 1%
> 08/01/22 11:58:14 INFO mapred.JobClient: map 21% reduce 1%
> 08/01/22 11:58:19 INFO mapred.JobClient: map 22% reduce 1%
> 08/01/22 11:58:26 INFO mapred.JobClient: map 23% reduce 1%
> 08/01/22 11:58:30 INFO mapred.JobClient: map 24% reduce 1%
> 08/01/22 11:58:32 INFO mapred.JobClient: map 25% reduce 1%
> 08/01/22 11:58:37 INFO mapred.JobClient: map 26% reduce 1%
> 08/01/22 11:58:42 INFO mapred.JobClient: map 27% reduce 1%
> 08/01/22 11:58:44 INFO mapred.JobClient: map 28% reduce 1%
> 08/01/22 11:58:49 INFO mapred.JobClient: map 29% reduce 1%
> 08/01/22 11:58:55 INFO mapred.JobClient: map 30% reduce 1%
> 08/01/22 11:58:57 INFO mapred.JobClient: map 31% reduce 1%
> 08/01/22 11:58:59 INFO mapred.JobClient: map 32% reduce 1%
> 08/01/22 11:59:07 INFO mapred.JobClient: map 33% reduce 1%
> 08/01/22 11:59:08 INFO mapred.JobClient: map 34% reduce 1%
> 08/01/22 11:59:15 INFO mapred.JobClient: map 35% reduce 1%
> 08/01/22 11:59:17 INFO mapred.JobClient: map 36% reduce 2%
> 08/01/22 11:59:20 INFO mapred.JobClient: map 37% reduce 2%
> 08/01/22 11:59:27 INFO mapred.JobClient: map 38% reduce 2%
> 08/01/22 11:59:29 INFO mapred.JobClient: map 39% reduce 2%
> 08/01/22 11:59:34 INFO mapred.JobClient: map 40% reduce 2%
> 08/01/22 11:59:38 INFO mapred.JobClient: map 41% reduce 2%
> 08/01/22 11:59:41 INFO mapred.JobClient: map 42% reduce 2%
> 08/01/22 11:59:42 INFO mapred.JobClient: map 43% reduce 2%
> 08/01/22 11:59:51 INFO mapred.JobClient: map 44% reduce 2%
> 08/01/22 11:59:57 INFO mapred.JobClient: map 46% reduce 2%
> 08/01/22 12:00:01 INFO mapred.JobClient: map 47% reduce 2%
> 08/01/22 12:00:06 INFO mapred.JobClient: map 48% reduce 2%
> 08/01/22 12:00:08 INFO mapred.JobClient: map 49% reduce 2%
> 08/01/22 12:00:14 INFO mapred.JobClient: map 50% reduce 2%
> 08/01/22 12:00:16 INFO mapred.JobClient: map 51% reduce 2%
> 08/01/22 12:00:22 INFO mapred.JobClient: map 52% reduce 2%
> 08/01/22 12:00:27 INFO mapred.JobClient: map 53% reduce 3%
> 08/01/22 12:00:30 INFO mapred.JobClient: map 54% reduce 3%
> 08/01/22 12:00:32 INFO mapred.JobClient: map 55% reduce 3%
> 08/01/22 12:00:37 INFO mapred.JobClient: map 56% reduce 3%
> 08/01/22 12:00:38 INFO mapred.JobClient: map 57% reduce 3%
> 08/01/22 12:00:44 INFO mapred.JobClient: map 58% reduce 3%
> 08/01/22 12:00:49 INFO mapred.JobClient: map 59% reduce 3%
> 08/01/22 12:00:52 INFO mapred.JobClient: map 60% reduce 3%
> 08/01/22 12:00:58 INFO mapred.JobClient: map 61% reduce 3%
> 08/01/22 12:00:59 INFO mapred.JobClient: map 62% reduce 3%
> 08/01/22 12:01:05 INFO mapred.JobClient: map 63% reduce 3%
> 08/01/22 12:01:08 INFO mapred.JobClient: map 64% reduce 3%
> 08/01/22 12:01:13 INFO mapred.JobClient: map 65% reduce 3%
> 08/01/22 12:01:14 INFO mapred.JobClient: map 66% reduce 3%
> 08/01/22 12:01:20 INFO mapred.JobClient: map 67% reduce 3%
> 08/01/22 12:01:25 INFO mapred.JobClient: map 68% reduce 3%
> 08/01/22 12:01:28 INFO mapred.JobClient: map 69% reduce 3%
> 08/01/22 12:01:35 INFO mapred.JobClient: map 71% reduce 4%
> 08/01/22 12:01:39 INFO mapred.JobClient: map 72% reduce 4%
> 08/01/22 12:01:46 INFO mapred.JobClient: map 74% reduce 4%
> 08/01/22 12:01:51 INFO mapred.JobClient: map 75% reduce 4%
> 08/01/22 12:01:54 INFO mapred.JobClient: map 76% reduce 4%
> 08/01/22 12:02:01 INFO mapred.JobClient: map 77% reduce 4%
> 08/01/22 12:02:04 INFO mapred.JobClient: map 78% reduce 4%
> 08/01/22 12:02:10 INFO mapred.JobClient: map 79% reduce 4%
> 08/01/22 12:02:13 INFO mapred.JobClient: map 80% reduce 4%
> 08/01/22 12:02:14 INFO mapred.JobClient: map 81% reduce 4%
> 08/01/22 12:02:17 INFO mapred.JobClient: map 82% reduce 4%
> 08/01/22 12:02:21 INFO mapred.JobClient: map 83% reduce 4%
> 08/01/22 12:02:29 INFO mapred.JobClient: map 84% reduce 4%
> 08/01/22 12:02:31 INFO mapred.JobClient: map 85% reduce 4%
> 08/01/22 12:02:36 INFO mapred.JobClient: map 86% reduce 4%
> 08/01/22 12:02:39 INFO mapred.JobClient: map 87% reduce 4%
> 08/01/22 12:02:43 INFO mapred.JobClient: map 88% reduce 4%
> 08/01/22 12:02:46 INFO mapred.JobClient: map 89% reduce 4%
> 08/01/22 12:02:50 INFO mapred.JobClient: map 90% reduce 4%
> 08/01/22 12:02:56 INFO mapred.JobClient: map 91% reduce 5%
> 08/01/22 12:02:58 INFO mapred.JobClient: map 92% reduce 5%
> 08/01/22 12:03:05 INFO mapred.JobClient: map 93% reduce 5%
> 08/01/22 12:03:07 INFO mapred.JobClient: map 94% reduce 5%
> 08/01/22 12:03:12 INFO mapred.JobClient: map 96% reduce 5%
> 08/01/22 12:03:20 INFO mapred.JobClient: map 97% reduce 5%
> 08/01/22 12:03:24 INFO mapred.JobClient: map 98% reduce 5%
> 08/01/22 12:03:28 INFO mapred.JobClient: map 99% reduce 5%
> 08/01/22 12:03:35 INFO mapred.JobClient: map 100% reduce 5%
> 08/01/22 12:03:45 INFO mapred.JobClient: map 100% reduce 6%
> 08/01/22 12:04:05 INFO mapred.JobClient: map 100% reduce 7%
> 08/01/22 12:04:25 INFO mapred.JobClient: map 100% reduce 8%
> 08/01/22 12:04:52 INFO mapred.JobClient: map 100% reduce 9%
> 08/01/22 12:05:15 INFO mapred.JobClient: map 100% reduce 10%
> 08/01/22 12:05:43 INFO mapred.JobClient: map 100% reduce 11%
> 08/01/22 12:06:13 INFO mapred.JobClient: map 100% reduce 12%
> 08/01/22 12:06:32 INFO mapred.JobClient: map 100% reduce 13%
> 08/01/22 12:07:02 INFO mapred.JobClient: map 100% reduce 14%
> 08/01/22 12:07:25 INFO mapred.JobClient: map 100% reduce 15%
> 08/01/22 12:07:52 INFO mapred.JobClient: map 100% reduce 16%
> 08/01/22 13:56:21 INFO mapred.JobClient: map 100% reduce 13%
> 08/01/22 13:56:21 INFO mapred.JobClient: Task Id :
> task_200801221154_0005_r_000007_0, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 08/01/22 13:56:21 WARN mapred.JobClient: Error reading task outputncdm15
> 08/01/22 13:56:21 WARN mapred.JobClient: Error reading task outputncdm15
> 08/01/22 14:08:11 INFO mapred.JobClient: map 100% reduce 11%
> 08/01/22 14:08:11 INFO mapred.JobClient: Task Id :
> task_200801221154_0005_r_000005_0, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 08/01/22 14:08:11 WARN mapred.JobClient: Error reading task outputncdm15
> 08/01/22 14:08:11 WARN mapred.JobClient: Error reading task outputncdm15
> 08/01/22 14:20:09 INFO mapred.JobClient: map 100% reduce 8%
> 08/01/22 14:20:09 INFO mapred.JobClient: Task Id :
> task_200801221154_0005_r_000004_0, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 08/01/22 14:20:09 WARN mapred.JobClient: Error reading task outputncdm15
> 08/01/22 14:20:09 WARN mapred.JobClient: Error reading task outputncdm15
> 08/01/22 14:23:49 INFO mapred.JobClient: map 100% reduce 5%
> 08/01/22 14:23:49 INFO mapred.JobClient: Task Id :
> task_200801221154_0005_r_000006_0, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 08/01/22 14:23:49 WARN mapred.JobClient: Error reading task outputncdm15
> 08/01/22 14:23:49 WARN mapred.JobClient: Error reading task outputncdm15
> 08/01/22 15:10:19 INFO mapred.JobClient: map 100% reduce 4%
> 08/01/22 15:10:19 INFO mapred.JobClient: Task Id :
> task_200801221154_0005_r_000000_0, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 08/01/22 15:10:42 INFO mapred.JobClient: map 100% reduce 5%
> 08/01/22 15:10:59 INFO mapred.JobClient: map 100% reduce 6%
> 08/01/22 15:11:22 INFO mapred.JobClient: map 100% reduce 7%
> 08/01/22 15:11:55 INFO mapred.JobClient: map 100% reduce 8%
> 08/01/22 15:13:39 INFO mapred.JobClient: map 100% reduce 7%
> 08/01/22 15:13:39 INFO mapred.JobClient: Task Id :
> task_200801221154_0005_r_000003_0, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 08/01/22 15:14:09 INFO mapred.JobClient: map 100% reduce 8%
> 08/01/22 15:14:29 INFO mapred.JobClient: map 100% reduce 9%
> 08/01/22 15:14:52 INFO mapred.JobClient: map 100% reduce 10%
> 08/01/22 15:15:33 INFO mapred.JobClient: map 100% reduce 11%
> 08/01/22 15:37:01 INFO mapred.JobClient: map 100% reduce 9%
> 08/01/22 15:37:01 INFO mapred.JobClient: Task Id :
> task_200801221154_0005_r_000002_0, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 08/01/22 15:37:13 INFO mapred.JobClient: map 100% reduce 10%
> 08/01/22 15:37:33 INFO mapred.JobClient: map 100% reduce 11%
> 08/01/22 15:37:53 INFO mapred.JobClient: map 100% reduce 12%
> 08/01/22 15:38:23 INFO mapred.JobClient: map 100% reduce 13%
> 08/01/22 15:45:51 INFO mapred.JobClient: map 100% reduce 12%
> 08/01/22 15:45:51 INFO mapred.JobClient: Task Id :
> task_200801221154_0005_r_000001_0, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 08/01/22 15:46:11 INFO mapred.JobClient: map 100% reduce 13%
> 08/01/22 15:46:31 INFO mapred.JobClient: map 100% reduce 14%
> 08/01/22 15:46:51 INFO mapred.JobClient: map 100% reduce 15%
> 08/01/22 15:47:24 INFO mapred.JobClient: map 100% reduce 16%
>
>
> On Tue, 22 Jan 2008, Taeho Kang wrote:
>
>> You said you had two network interfaces... and it might be the source of
>> your problem.
>>
>> Try disabling one of your network interfaces, or set
>> "dfs.datanode.dns.interface" and "dfs.datanode.dns.nameserver" in your
>> config file so that the datanode knows which network interface to pick
>> up.
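>> For example (a sketch; "eth1" is only a placeholder for whichever
>> interface carries the 10.0.0.x addresses):
>>
>> <property>
>>   <name>dfs.datanode.dns.interface</name>
>>   <value>eth1</value> <!-- placeholder: the NIC whose address to report -->
>> </property>
>>
>> <property>
>>   <name>dfs.datanode.dns.nameserver</name>
>>   <value>default</value> <!-- "default" is the shipped default value -->
>> </property>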
>>
>> /Taeho
>>
>> On Jan 22, 2008 7:34 AM, Yunhong Gu1 <yg...@cs.uic.edu> wrote:
>>
>>>
>>> Hi, all
>>>
>>> Just to keep this topic updated :) I am still trying to figure out
>>> what happened.
>>>
>>> In my 2-node configuration (namenode and jobtracker on node-1, while
>>> both nodes are datanodes and tasktrackers), the reduce task may
>>> sometimes (but rarely) complete for programs that need a small amount
>>> of CPU time (e.g., mrbench), but for programs with heavy computation
>>> it never finishes. When reduce blocks, it always fails at 16%.
>>>
>>> Eventually I will get this error information:
>>> 08/01/18 15:01:27 WARN mapred.JobClient: Error reading task
>>> outputncdm-IPxxxxxxxxx
>>> 08/01/18 15:01:27 WARN mapred.JobClient: Error reading task
>>> outputncdm-IPxxxxxxxxx
>>> 08/01/18 15:13:38 INFO mapred.JobClient: Task Id :
>>> task_200801181145_0005_r_000000_1, Status : FAILED
>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>> 08/01/18 19:56:38 WARN mapred.JobClient: Error reading task
>>> outputConnection timed out
>>> 08/01/18 19:59:47 WARN mapred.JobClient: Error reading task
>>> outputConnection timed out
>>> 08/01/18 20:09:40 INFO mapred.JobClient: map 100% reduce 100%
>>> java.io.IOException: Job failed!
>>>
>>> I found that "IPxxxxxxxx" is not the correct network address that
>>> Hadoop should read the result from. The servers I use have 2 network
>>> interfaces and I am using the other one. I explicitly filled in the
>>> 10.0.0.x IP addresses in all the configuration files.
>>>
>>> Might this be the reason for the Reduce failure? The Map phase does
>>> work, though.
>>>
>>> Thanks
>>> Yunhong
>>>
>>>
>>> On Sat, 19 Jan 2008, Yunhong Gu1 wrote:
>>>
>>>>
>>>> Oh, so the task running on the other node (ncdm-15) fails and Hadoop
>>>> re-runs the task on the local node (ncdm-8). (I only have two nodes,
>>>> ncdm-8 and ncdm-15. Both the namenode and the jobtracker are running
>>>> on ncdm-8. The program is also started on ncdm-8.)
>>>>
>>>>>> 08/01/18 19:08:27 INFO mapred.JobClient: Task Id :
>>>>>> task_200801181852_0001_m_000001_0, Status : FAILED Too many
>>>>>> fetch-failures
>>>>>> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task
>>>>>> outputncdm15
>>>>
>>>> Any ideas why the task would fail? And why does it take so long for
>>>> Hadoop to detect the failure?
>>>>
>>>> Thanks
>>>> Yunhong
>>>>
>>>> On Sat, 19 Jan 2008, Devaraj Das wrote:
>>>>
>>>>> Hi Yunhong,
>>>>> As per the output it seems the job ran to successful completion
>>>>> (albeit with some failures)...
>>>>> Devaraj
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Yunhong Gu1 [mailto:ygu1@cs.uic.edu]
>>>>>> Sent: Saturday, January 19, 2008 8:56 AM
>>>>>> To: hadoop-user@lucene.apache.org
>>>>>> Subject: Re: Reduce hangs
>>>>>>
>>>>>>
>>>>>>
>>>>>> Yes, it looks like HADOOP-1374
>>>>>>
>>>>>> The program actually failed after a while:
>>>>>>
>>>>>>
>>>>>> gu@ncdm-8:~/hadoop-0.15.2$ ./bin/hadoop jar
>>>>>> hadoop-0.15.2-test.jar mrbench
>>>>>> MRBenchmark.0.0.2
>>>>>> 08/01/18 18:53:08 INFO mapred.MRBench: creating control file:
>>>>>> 1 numLines, ASCENDING sortOrder
>>>>>> 08/01/18 18:53:08 INFO mapred.MRBench: created control file:
>>>>>> /benchmarks/MRBench/mr_input/input_-450753747.txt
>>>>>> 08/01/18 18:53:09 INFO mapred.MRBench: Running job 0:
>>>>>> input=/benchmarks/MRBench/mr_input
>>>>>> output=/benchmarks/MRBench/mr_output/output_1843693325
>>>>>> 08/01/18 18:53:09 INFO mapred.FileInputFormat: Total input
>>>>>> paths to process : 1
>>>>>> 08/01/18 18:53:09 INFO mapred.JobClient: Running job:
>>>>>> job_200801181852_0001
>>>>>> 08/01/18 18:53:10 INFO mapred.JobClient: map 0% reduce 0%
>>>>>> 08/01/18 18:53:17 INFO mapred.JobClient: map 100% reduce 0%
>>>>>> 08/01/18 18:53:25 INFO mapred.JobClient: map 100% reduce 16%
>>>>>> 08/01/18 19:08:27 INFO mapred.JobClient: Task Id :
>>>>>> task_200801181852_0001_m_000001_0, Status : FAILED Too many
>>>>>> fetch-failures
>>>>>> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task
>>>>>> outputncdm15
>>>>>> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task
>>>>>> outputncdm15
>>>>>> 08/01/18 19:08:34 INFO mapred.JobClient: map 100% reduce 100%
>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Job complete:
>>>>>> job_200801181852_0001
>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Counters: 10
>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Job Counters
>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Launched map tasks=3
>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Launched reduce tasks=1
>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Data-local map tasks=2
>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Map-Reduce Framework
>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Map input records=1
>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Map output records=1
>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Map input bytes=2
>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Map output bytes=5
>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Reduce input groups=1
>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Reduce input records=1
>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Reduce output records=1
>>>>>> DataLines Maps Reduces AvgTime (milliseconds)
>>>>>> 1 2 1 926333
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, 18 Jan 2008, Konstantin Shvachko wrote:
>>>>>>
>>>>>>> Looks like we still have this unsolved mysterious problem:
>>>>>>>
>>>>>>> http://issues.apache.org/jira/browse/HADOOP-1374
>>>>>>>
>>>>>>> Could it be related to HADOOP-1246? Arun?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> --Konstantin
>>>>>>>
>>>>>>> Yunhong Gu1 wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> If someone knows how to fix the problem described below, please
>>>>>>>> help me out. Thanks!
>>>>>>>>
>>>>>>>> I am testing Hadoop on a 2-node cluster and the "reduce" always
>>>>>>>> hangs at some stage, even if I use different clusters. My OS is
>>>>>>>> Debian Linux, kernel 2.6 (AMD Opteron w/ 4 GB Mem). The Hadoop
>>>>>>>> version is 0.15.2. The Java version is 1.5.0_01-b08.
>>>>>>>>
>>>>>>>> I simply tried "./bin/hadoop jar hadoop-0.15.2-test.jar mrbench",
>>>>>>>> and when the map stage finishes, the reduce stage hangs somewhere
>>>>>>>> in the middle, sometimes at 0%. I also tried every other mapreduce
>>>>>>>> program I could find in the example jar package, but they all hang.
>>>>>>>>
>>>>>>>> The log file simply prints
>>>>>>>> 2008-01-18 15:15:50,831 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>> task_200801181424_0004_r_000000_0 0.0% reduce > copy >
>>>>>>>> 2008-01-18 15:15:56,841 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>> task_200801181424_0004_r_000000_0 0.0% reduce > copy >
>>>>>>>> 2008-01-18 15:16:02,850 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>> task_200801181424_0004_r_000000_0 0.0% reduce > copy >
>>>>>>>>
>>>>>>>> forever.
>>>>>>>>
>>>>>>>> The program does work if I start Hadoop on a single node.
>>>>>>>>
>>>>>>>> Below is my hadoop-site.xml configuration:
>>>>>>>>
>>>>>>>> <configuration>
>>>>>>>>
>>>>>>>> <property>
>>>>>>>> <name>fs.default.name</name>
>>>>>>>> <value>10.0.0.1:60000</value>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> <property>
>>>>>>>> <name>mapred.job.tracker</name>
>>>>>>>> <value>10.0.0.1:60001</value>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> <property>
>>>>>>>> <name>dfs.data.dir</name>
>>>>>>>> <value>/raid/hadoop/data</value>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> <property>
>>>>>>>> <name>mapred.local.dir</name>
>>>>>>>> <value>/raid/hadoop/mapred</value>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> <property>
>>>>>>>> <name>hadoop.tmp.dir</name>
>>>>>>>> <value>/raid/hadoop/tmp</value>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> <property>
>>>>>>>> <name>mapred.child.java.opts</name>
>>>>>>>> <value>-Xmx1024m</value>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> <property>
>>>>>>>> <name>mapred.tasktracker.tasks.maximum</name>
>>>>>>>> <value>4</value>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> <!--
>>>>>>>> <property>
>>>>>>>> <name>mapred.map.tasks</name>
>>>>>>>> <value>7</value>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> <property>
>>>>>>>> <name>mapred.reduce.tasks</name>
>>>>>>>> <value>3</value>
>>>>>>>> </property>
>>>>>>>> -->
>>>>>>>>
>>>>>>>> <property>
>>>>>>>> <name>fs.inmemory.size.mb</name>
>>>>>>>> <value>200</value>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> <property>
>>>>>>>> <name>dfs.block.size</name>
>>>>>>>> <value>134217728</value>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> <property>
>>>>>>>> <name>io.sort.factor</name>
>>>>>>>> <value>100</value>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> <property>
>>>>>>>> <name>io.sort.mb</name>
>>>>>>>> <value>200</value>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> <property>
>>>>>>>> <name>io.file.buffer.size</name>
>>>>>>>> <value>131072</value>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> </configuration>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Taeho Kang [tkang.blogspot.com]
>> Software Engineer, NHN Corporation, Korea
>>
>
Re: reduce > copy ?? (was Reduce hangs)
Posted by Yunhong Gu1 <yg...@cs.uic.edu>.
This is the Java stack when the problem happens. Not sure if it is a
deadlock. (Hadoop version 0.15.2)
Thread-25[1] threads
Group system:
(java.lang.ref.Reference$ReferenceHandler)0x77b Reference Handler cond. waiting
(java.lang.ref.Finalizer$FinalizerThread)0x77a Finalizer cond. waiting
(java.lang.Thread)0x779 Signal Dispatcher running
Group main:
(java.lang.Thread)0x1 main cond. waiting
(java.lang.Thread)0x778 taskCleanup cond. waiting
(org.mortbay.jetty.servlet.AbstractSessionManager$SessionScavenger)0x777 SessionScavenger running
(org.mortbay.util.ThreadedServer$Acceptor)0x776 Acceptor ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=50060] running
(org.mortbay.util.ThreadPool$PoolThread)0x775 SocketListener0-0 cond. waiting
(org.apache.hadoop.ipc.Server$Listener)0x774 IPC Server listener on 34516 running
(org.apache.hadoop.ipc.Server$Handler)0x773 IPC Server handler 0 on 34516 cond. waiting
(org.apache.hadoop.ipc.Server$Handler)0x772 IPC Server handler 1 on 34516 cond. waiting
(org.apache.hadoop.ipc.Server$Handler)0x771 IPC Server handler 2 on 34516 cond. waiting
(org.apache.hadoop.ipc.Server$Handler)0x770 IPC Server handler 3 on 34516 cond. waiting
(org.apache.hadoop.ipc.Client$ConnectionCuller)0x76f org.apache.hadoop.io.ObjectWritable Connection Culler running
(org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread)0x76e Map-events fetcher for all reduce tasks on tracker_ncdm15:/10.0.0.2:34516 running
(org.apache.hadoop.util.Daemon)0x76d org.apache.hadoop.dfs.DFSClient$LeaseChecker@e0f0ad running
(org.apache.hadoop.mapred.ReduceTaskRunner)0x76c Thread-25 cond. waiting
(java.lang.UNIXProcess$1$1)0x76b process reaper running
(org.apache.hadoop.mapred.ReduceTaskRunner)0x7d0 Thread-110 cond. waiting
(java.lang.UNIXProcess$1$1)0x7d1 process reaper running
(org.apache.hadoop.ipc.Client$Connection)0x7f6 IPC Client connection to /10.0.0.1:60001 cond. waiting
Thread-25[1] thread 0x7d0
Thread-110[1] where
[1] java.lang.Object.wait (native method)
[2] java.lang.Object.wait (Object.java:485)
[3] java.lang.UNIXProcess.waitFor (UNIXProcess.java:165)
[4] org.apache.hadoop.mapred.TaskRunner.runChild (TaskRunner.java:477)
[5] org.apache.hadoop.mapred.TaskRunner.run (TaskRunner.java:343)
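For anyone who wants the same kind of dump without attaching a debugger
(the output above looks like jdb), a sketch assuming the JDK 5 tools are
on the PATH and <pid> is the TaskTracker's process id:

gu@ncdm-8:~$ jps                # list the JVM pids (TaskTracker, task children)
gu@ncdm-8:~$ jstack <pid>       # print that JVM's thread dump to stdout
gu@ncdm-8:~$ kill -QUIT <pid>   # alternative: dump goes to the daemon's .out log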
On Tue, 22 Jan 2008, Yunhong Gu1 wrote:
>
> Hi, All
>
> I tried many possible configurations and I think this is the deepest
> cause I can dig out so far. The whole "Reduce hangs" happens because one
> tasktracker does not make any progress at all. It sits in "reduce > copy"
> forever, as shown below.
>
> This is very easy to reproduce on my machines (two dual-core AMD
> Opterons at 3.0 GHz, Debian Linux, kernel 2.6, running in 32-bit mode).
>
> gu@ncdm-8:~/hadoop-0.15.2/logs$ tail hadoop-gu-tasktracker-ncdm-8.log
> 2008-01-22 18:34:45,591 INFO org.apache.hadoop.mapred.TaskTracker:
> task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00
> MB/s) >
> 2008-01-22 18:34:48,596 INFO org.apache.hadoop.mapred.TaskTracker:
> task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00
> MB/s) >
> 2008-01-22 18:34:54,605 INFO org.apache.hadoop.mapred.TaskTracker:
> task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00
> MB/s) >
> 2008-01-22 18:35:00,619 INFO org.apache.hadoop.mapred.TaskTracker:
> task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00
> MB/s) >
> 2008-01-22 18:35:03,627 INFO org.apache.hadoop.mapred.TaskTracker:
> task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00
> MB/s) >
> 2008-01-22 18:35:09,644 INFO org.apache.hadoop.mapred.TaskTracker:
> task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00
> MB/s) >
> 2008-01-22 18:35:15,661 INFO org.apache.hadoop.mapred.TaskTracker:
> task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00
> MB/s) >
> 2008-01-22 18:35:18,670 INFO org.apache.hadoop.mapred.TaskTracker:
> task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00
> MB/s) >
> 2008-01-22 18:35:24,687 INFO org.apache.hadoop.mapred.TaskTracker:
> task_200801221827_0001_r_000000_0 0.16666667% reduce > copy (1 of 2 at 0.00
> MB/s) >
>
>
> On Tue, 22 Jan 2008, Yunhong Gu1 wrote:
>
>>
>> Thanks, I tried that but it is probably not the reason. I checked the
>> network connection using "netstat" and the client is actually connected
>> to the correct server address. In addition, "mrbench" works sometimes;
>> if it were a network problem, nothing should work at all.
>>
>> I let the "sort" program run longer and got some interesting output:
>> the reduce progress can actually decrease and oscillate (see the bottom
>> part of the output below).
>>
>> The error information:
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/01/22 13:56:21 WARN mapred.JobClient: Error reading task outputncdm15
>>
>> What can cause "Error reading task outputncdm15"?
>>
>> gu@ncdm-8:~/hadoop-0.15.2$ ./bin/hadoop jar hadoop-0.15.2-examples.jar sort
>> rand randsort
>> Running on 2 nodes to sort from /user/gu/rand into /user/gu/randsort with 8
>> reduces.
>> Job started: Tue Jan 22 11:56:39 CST 2008
>> 08/01/22 11:56:39 INFO mapred.FileInputFormat: Total input paths to
>> process: 20
>> 08/01/22 11:56:40 INFO mapred.JobClient: Running job: job_200801221154_0005
>> 08/01/22 11:56:41 INFO mapred.JobClient: map 0% reduce 0%
>> 08/01/22 11:56:54 INFO mapred.JobClient: map 1% reduce 0%
>> 08/01/22 11:56:58 INFO mapred.JobClient: map 2% reduce 0%
>> 08/01/22 11:57:04 INFO mapred.JobClient: map 3% reduce 0%
>> 08/01/22 11:57:08 INFO mapred.JobClient: map 5% reduce 0%
>> 08/01/22 11:57:15 INFO mapred.JobClient: map 7% reduce 0%
>> 08/01/22 11:57:23 INFO mapred.JobClient: map 8% reduce 0%
>> 08/01/22 11:57:25 INFO mapred.JobClient: map 9% reduce 0%
>> 08/01/22 11:57:27 INFO mapred.JobClient: map 10% reduce 0%
>> 08/01/22 11:57:34 INFO mapred.JobClient: map 11% reduce 0%
>> 08/01/22 11:57:35 INFO mapred.JobClient: map 12% reduce 0%
>> 08/01/22 11:57:42 INFO mapred.JobClient: map 13% reduce 0%
>> 08/01/22 11:57:48 INFO mapred.JobClient: map 14% reduce 0%
>> 08/01/22 11:57:49 INFO mapred.JobClient: map 15% reduce 0%
>> 08/01/22 11:57:56 INFO mapred.JobClient: map 17% reduce 0%
>> 08/01/22 11:58:05 INFO mapred.JobClient: map 19% reduce 0%
>> 08/01/22 11:58:12 INFO mapred.JobClient: map 20% reduce 0%
>> 08/01/22 11:58:13 INFO mapred.JobClient: map 20% reduce 1%
>> 08/01/22 11:58:14 INFO mapred.JobClient: map 21% reduce 1%
>> 08/01/22 11:58:19 INFO mapred.JobClient: map 22% reduce 1%
>> 08/01/22 11:58:26 INFO mapred.JobClient: map 23% reduce 1%
>> 08/01/22 11:58:30 INFO mapred.JobClient: map 24% reduce 1%
>> 08/01/22 11:58:32 INFO mapred.JobClient: map 25% reduce 1%
>> 08/01/22 11:58:37 INFO mapred.JobClient: map 26% reduce 1%
>> 08/01/22 11:58:42 INFO mapred.JobClient: map 27% reduce 1%
>> 08/01/22 11:58:44 INFO mapred.JobClient: map 28% reduce 1%
>> 08/01/22 11:58:49 INFO mapred.JobClient: map 29% reduce 1%
>> 08/01/22 11:58:55 INFO mapred.JobClient: map 30% reduce 1%
>> 08/01/22 11:58:57 INFO mapred.JobClient: map 31% reduce 1%
>> 08/01/22 11:58:59 INFO mapred.JobClient: map 32% reduce 1%
>> 08/01/22 11:59:07 INFO mapred.JobClient: map 33% reduce 1%
>> 08/01/22 11:59:08 INFO mapred.JobClient: map 34% reduce 1%
>> 08/01/22 11:59:15 INFO mapred.JobClient: map 35% reduce 1%
>> 08/01/22 11:59:17 INFO mapred.JobClient: map 36% reduce 2%
>> 08/01/22 11:59:20 INFO mapred.JobClient: map 37% reduce 2%
>> 08/01/22 11:59:27 INFO mapred.JobClient: map 38% reduce 2%
>> 08/01/22 11:59:29 INFO mapred.JobClient: map 39% reduce 2%
>> 08/01/22 11:59:34 INFO mapred.JobClient: map 40% reduce 2%
>> 08/01/22 11:59:38 INFO mapred.JobClient: map 41% reduce 2%
>> 08/01/22 11:59:41 INFO mapred.JobClient: map 42% reduce 2%
>> 08/01/22 11:59:42 INFO mapred.JobClient: map 43% reduce 2%
>> 08/01/22 11:59:51 INFO mapred.JobClient: map 44% reduce 2%
>> 08/01/22 11:59:57 INFO mapred.JobClient: map 46% reduce 2%
>> 08/01/22 12:00:01 INFO mapred.JobClient: map 47% reduce 2%
>> 08/01/22 12:00:06 INFO mapred.JobClient: map 48% reduce 2%
>> 08/01/22 12:00:08 INFO mapred.JobClient: map 49% reduce 2%
>> 08/01/22 12:00:14 INFO mapred.JobClient: map 50% reduce 2%
>> 08/01/22 12:00:16 INFO mapred.JobClient: map 51% reduce 2%
>> 08/01/22 12:00:22 INFO mapred.JobClient: map 52% reduce 2%
>> 08/01/22 12:00:27 INFO mapred.JobClient: map 53% reduce 3%
>> 08/01/22 12:00:30 INFO mapred.JobClient: map 54% reduce 3%
>> 08/01/22 12:00:32 INFO mapred.JobClient: map 55% reduce 3%
>> 08/01/22 12:00:37 INFO mapred.JobClient: map 56% reduce 3%
>> 08/01/22 12:00:38 INFO mapred.JobClient: map 57% reduce 3%
>> 08/01/22 12:00:44 INFO mapred.JobClient: map 58% reduce 3%
>> 08/01/22 12:00:49 INFO mapred.JobClient: map 59% reduce 3%
>> 08/01/22 12:00:52 INFO mapred.JobClient: map 60% reduce 3%
>> 08/01/22 12:00:58 INFO mapred.JobClient: map 61% reduce 3%
>> 08/01/22 12:00:59 INFO mapred.JobClient: map 62% reduce 3%
>> 08/01/22 12:01:05 INFO mapred.JobClient: map 63% reduce 3%
>> 08/01/22 12:01:08 INFO mapred.JobClient: map 64% reduce 3%
>> 08/01/22 12:01:13 INFO mapred.JobClient: map 65% reduce 3%
>> 08/01/22 12:01:14 INFO mapred.JobClient: map 66% reduce 3%
>> 08/01/22 12:01:20 INFO mapred.JobClient: map 67% reduce 3%
>> 08/01/22 12:01:25 INFO mapred.JobClient: map 68% reduce 3%
>> 08/01/22 12:01:28 INFO mapred.JobClient: map 69% reduce 3%
>> 08/01/22 12:01:35 INFO mapred.JobClient: map 71% reduce 4%
>> 08/01/22 12:01:39 INFO mapred.JobClient: map 72% reduce 4%
>> 08/01/22 12:01:46 INFO mapred.JobClient: map 74% reduce 4%
>> 08/01/22 12:01:51 INFO mapred.JobClient: map 75% reduce 4%
>> 08/01/22 12:01:54 INFO mapred.JobClient: map 76% reduce 4%
>> 08/01/22 12:02:01 INFO mapred.JobClient: map 77% reduce 4%
>> 08/01/22 12:02:04 INFO mapred.JobClient: map 78% reduce 4%
>> 08/01/22 12:02:10 INFO mapred.JobClient: map 79% reduce 4%
>> 08/01/22 12:02:13 INFO mapred.JobClient: map 80% reduce 4%
>> 08/01/22 12:02:14 INFO mapred.JobClient: map 81% reduce 4%
>> 08/01/22 12:02:17 INFO mapred.JobClient: map 82% reduce 4%
>> 08/01/22 12:02:21 INFO mapred.JobClient: map 83% reduce 4%
>> 08/01/22 12:02:29 INFO mapred.JobClient: map 84% reduce 4%
>> 08/01/22 12:02:31 INFO mapred.JobClient: map 85% reduce 4%
>> 08/01/22 12:02:36 INFO mapred.JobClient: map 86% reduce 4%
>> 08/01/22 12:02:39 INFO mapred.JobClient: map 87% reduce 4%
>> 08/01/22 12:02:43 INFO mapred.JobClient: map 88% reduce 4%
>> 08/01/22 12:02:46 INFO mapred.JobClient: map 89% reduce 4%
>> 08/01/22 12:02:50 INFO mapred.JobClient: map 90% reduce 4%
>> 08/01/22 12:02:56 INFO mapred.JobClient: map 91% reduce 5%
>> 08/01/22 12:02:58 INFO mapred.JobClient: map 92% reduce 5%
>> 08/01/22 12:03:05 INFO mapred.JobClient: map 93% reduce 5%
>> 08/01/22 12:03:07 INFO mapred.JobClient: map 94% reduce 5%
>> 08/01/22 12:03:12 INFO mapred.JobClient: map 96% reduce 5%
>> 08/01/22 12:03:20 INFO mapred.JobClient: map 97% reduce 5%
>> 08/01/22 12:03:24 INFO mapred.JobClient: map 98% reduce 5%
>> 08/01/22 12:03:28 INFO mapred.JobClient: map 99% reduce 5%
>> 08/01/22 12:03:35 INFO mapred.JobClient: map 100% reduce 5%
>> 08/01/22 12:03:45 INFO mapred.JobClient: map 100% reduce 6%
>> 08/01/22 12:04:05 INFO mapred.JobClient: map 100% reduce 7%
>> 08/01/22 12:04:25 INFO mapred.JobClient: map 100% reduce 8%
>> 08/01/22 12:04:52 INFO mapred.JobClient: map 100% reduce 9%
>> 08/01/22 12:05:15 INFO mapred.JobClient: map 100% reduce 10%
>> 08/01/22 12:05:43 INFO mapred.JobClient: map 100% reduce 11%
>> 08/01/22 12:06:13 INFO mapred.JobClient: map 100% reduce 12%
>> 08/01/22 12:06:32 INFO mapred.JobClient: map 100% reduce 13%
>> 08/01/22 12:07:02 INFO mapred.JobClient: map 100% reduce 14%
>> 08/01/22 12:07:25 INFO mapred.JobClient: map 100% reduce 15%
>> 08/01/22 12:07:52 INFO mapred.JobClient: map 100% reduce 16%
>> 08/01/22 13:56:21 INFO mapred.JobClient: map 100% reduce 13%
>> 08/01/22 13:56:21 INFO mapred.JobClient: Task Id :
>> task_200801221154_0005_r_000007_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/01/22 13:56:21 WARN mapred.JobClient: Error reading task outputncdm15
>> 08/01/22 13:56:21 WARN mapred.JobClient: Error reading task outputncdm15
>> 08/01/22 14:08:11 INFO mapred.JobClient: map 100% reduce 11%
>> 08/01/22 14:08:11 INFO mapred.JobClient: Task Id :
>> task_200801221154_0005_r_000005_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/01/22 14:08:11 WARN mapred.JobClient: Error reading task outputncdm15
>> 08/01/22 14:08:11 WARN mapred.JobClient: Error reading task outputncdm15
>> 08/01/22 14:20:09 INFO mapred.JobClient: map 100% reduce 8%
>> 08/01/22 14:20:09 INFO mapred.JobClient: Task Id :
>> task_200801221154_0005_r_000004_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/01/22 14:20:09 WARN mapred.JobClient: Error reading task outputncdm15
>> 08/01/22 14:20:09 WARN mapred.JobClient: Error reading task outputncdm15
>> 08/01/22 14:23:49 INFO mapred.JobClient: map 100% reduce 5%
>> 08/01/22 14:23:49 INFO mapred.JobClient: Task Id :
>> task_200801221154_0005_r_000006_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/01/22 14:23:49 WARN mapred.JobClient: Error reading task outputncdm15
>> 08/01/22 14:23:49 WARN mapred.JobClient: Error reading task outputncdm15
>> 08/01/22 15:10:19 INFO mapred.JobClient: map 100% reduce 4%
>> 08/01/22 15:10:19 INFO mapred.JobClient: Task Id :
>> task_200801221154_0005_r_000000_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/01/22 15:10:42 INFO mapred.JobClient: map 100% reduce 5%
>> 08/01/22 15:10:59 INFO mapred.JobClient: map 100% reduce 6%
>> 08/01/22 15:11:22 INFO mapred.JobClient: map 100% reduce 7%
>> 08/01/22 15:11:55 INFO mapred.JobClient: map 100% reduce 8%
>> 08/01/22 15:13:39 INFO mapred.JobClient: map 100% reduce 7%
>> 08/01/22 15:13:39 INFO mapred.JobClient: Task Id :
>> task_200801221154_0005_r_000003_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/01/22 15:14:09 INFO mapred.JobClient: map 100% reduce 8%
>> 08/01/22 15:14:29 INFO mapred.JobClient: map 100% reduce 9%
>> 08/01/22 15:14:52 INFO mapred.JobClient: map 100% reduce 10%
>> 08/01/22 15:15:33 INFO mapred.JobClient: map 100% reduce 11%
>> 08/01/22 15:37:01 INFO mapred.JobClient: map 100% reduce 9%
>> 08/01/22 15:37:01 INFO mapred.JobClient: Task Id :
>> task_200801221154_0005_r_000002_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/01/22 15:37:13 INFO mapred.JobClient: map 100% reduce 10%
>> 08/01/22 15:37:33 INFO mapred.JobClient: map 100% reduce 11%
>> 08/01/22 15:37:53 INFO mapred.JobClient: map 100% reduce 12%
>> 08/01/22 15:38:23 INFO mapred.JobClient: map 100% reduce 13%
>> 08/01/22 15:45:51 INFO mapred.JobClient: map 100% reduce 12%
>> 08/01/22 15:45:51 INFO mapred.JobClient: Task Id :
>> task_200801221154_0005_r_000001_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/01/22 15:46:11 INFO mapred.JobClient: map 100% reduce 13%
>> 08/01/22 15:46:31 INFO mapred.JobClient: map 100% reduce 14%
>> 08/01/22 15:46:51 INFO mapred.JobClient: map 100% reduce 15%
>> 08/01/22 15:47:24 INFO mapred.JobClient: map 100% reduce 16%
>>
>>
>> On Tue, 22 Jan 2008, Taeho Kang wrote:
>>
>>> You said you had two network interfaces... and it might be the source of
>>> your problem.
>>>
>>> Try disabling one of your network interfaces, or set
>>> "dfs.datanode.dns.interface" and "dfs.datanode.dns.nameserver" in your
>>> config file so that the datanode knows which network interface to pick
>>> up.
>>>
>>> /Taeho
>>>
>>> On Jan 22, 2008 7:34 AM, Yunhong Gu1 <yg...@cs.uic.edu> wrote:
>>>
>>>>
>>>> Hi, all
>>>>
>>>> Just to keep this topic updated :) I am still trying to figure out
>>>> what happened.
>>>>
>>>> In my 2-node configuration (namenode and jobtracker on node-1, while
>>>> both nodes are datanodes and tasktrackers), the reduce task may
>>>> sometimes (but rarely) complete for programs that need a small amount
>>>> of CPU time (e.g., mrbench), but for programs with heavy computation
>>>> it never finishes. When reduce blocks, it always fails at 16%.
>>>>
>>>> Eventually I will get this error information:
>>>> 08/01/18 15:01:27 WARN mapred.JobClient: Error reading task
>>>> outputncdm-IPxxxxxxxxx
>>>> 08/01/18 15:01:27 WARN mapred.JobClient: Error reading task
>>>> outputncdm-IPxxxxxxxxx
>>>> 08/01/18 15:13:38 INFO mapred.JobClient: Task Id :
>>>> task_200801181145_0005_r_000000_1, Status : FAILED
>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>>> 08/01/18 19:56:38 WARN mapred.JobClient: Error reading task
>>>> outputConnection timed out
>>>> 08/01/18 19:59:47 WARN mapred.JobClient: Error reading task
>>>> outputConnection timed out
>>>> 08/01/18 20:09:40 INFO mapred.JobClient: map 100% reduce 100%
>>>> java.io.IOException: Job failed!
>>>>
>>>> I found that "IPxxxxxxxx" is not the correct network address that
>>>> Hadoop should read the result from. The servers I use have 2 network
>>>> interfaces and I am using the other one. I explicitly filled in the
>>>> 10.0.0.x IP addresses in all the configuration files.
>>>>
>>>> Might this be the reason for the Reduce failure? The Map phase does
>>>> work, though.
>>>>
>>>> Thanks
>>>> Yunhong
>>>>
>>>>
>>>> On Sat, 19 Jan 2008, Yunhong Gu1 wrote:
>>>>
>>>>>
>>>>> Oh, so the task running on the other node (ncdm-15) fails and Hadoop
>>>>> re-runs the task on the local node (ncdm-8). (I only have two nodes,
>>>>> ncdm-8 and ncdm-15. Both the namenode and the jobtracker are running
>>>>> on ncdm-8. The program is also started on ncdm-8.)
>>>>>
>>>>>>> 08/01/18 19:08:27 INFO mapred.JobClient: Task Id :
>>>>>>> task_200801181852_0001_m_000001_0, Status : FAILED Too many
>>>>>>> fetch-failures
>>>>>>> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task
>>>>>>> outputncdm15
>>>>>
>>>>> Any ideas why the task would fail? And why does it take so long for
>>>>> Hadoop to detect the failure?
>>>>>
>>>>> Thanks
>>>>> Yunhong
>>>>>
>>>>> On Sat, 19 Jan 2008, Devaraj Das wrote:
>>>>>
>>>>>> Hi Yunhong,
>>>>>> As per the output it seems the job ran to successful completion
>>>>>> (albeit with some failures)...
>>>>>> Devaraj
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Yunhong Gu1 [mailto:ygu1@cs.uic.edu]
>>>>>>> Sent: Saturday, January 19, 2008 8:56 AM
>>>>>>> To: hadoop-user@lucene.apache.org
>>>>>>> Subject: Re: Reduce hangs
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Yes, it looks like HADOOP-1374
>>>>>>>
>>>>>>> The program actually failed after a while:
>>>>>>>
>>>>>>>
>>>>>>> gu@ncdm-8:~/hadoop-0.15.2$ ./bin/hadoop jar
>>>>>>> hadoop-0.15.2-test.jar mrbench
>>>>>>> MRBenchmark.0.0.2
>>>>>>> 08/01/18 18:53:08 INFO mapred.MRBench: creating control file:
>>>>>>> 1 numLines, ASCENDING sortOrder
>>>>>>> 08/01/18 18:53:08 INFO mapred.MRBench: created control file:
>>>>>>> /benchmarks/MRBench/mr_input/input_-450753747.txt
>>>>>>> 08/01/18 18:53:09 INFO mapred.MRBench: Running job 0:
>>>>>>> input=/benchmarks/MRBench/mr_input
>>>>>>> output=/benchmarks/MRBench/mr_output/output_1843693325
>>>>>>> 08/01/18 18:53:09 INFO mapred.FileInputFormat: Total input
>>>>>>> paths to process : 1
>>>>>>> 08/01/18 18:53:09 INFO mapred.JobClient: Running job:
>>>>>>> job_200801181852_0001
>>>>>>> 08/01/18 18:53:10 INFO mapred.JobClient: map 0% reduce 0%
>>>>>>> 08/01/18 18:53:17 INFO mapred.JobClient: map 100% reduce 0%
>>>>>>> 08/01/18 18:53:25 INFO mapred.JobClient: map 100% reduce 16%
>>>>>>> 08/01/18 19:08:27 INFO mapred.JobClient: Task Id :
>>>>>>> task_200801181852_0001_m_000001_0, Status : FAILED Too many
>>>>>>> fetch-failures
>>>>>>> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task
>>>>>>> outputncdm15
>>>>>>> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task
>>>>>>> outputncdm15
>>>>>>> 08/01/18 19:08:34 INFO mapred.JobClient: map 100% reduce 100%
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Job complete:
>>>>>>> job_200801181852_0001
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Counters: 10
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Job Counters
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Launched map tasks=3
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Launched reduce tasks=1
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Data-local map tasks=2
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Map-Reduce Framework
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Map input records=1
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Map output records=1
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Map input bytes=2
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Map output bytes=5
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Reduce input groups=1
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Reduce input records=1
>>>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Reduce output records=1
>>>>>>> DataLines Maps Reduces AvgTime (milliseconds)
>>>>>>> 1 2 1 926333
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, 18 Jan 2008, Konstantin Shvachko wrote:
>>>>>>>
>>>>>>>> Looks like we still have this unsolved mysterious problem:
>>>>>>>>
>>>>>>>> http://issues.apache.org/jira/browse/HADOOP-1374
>>>>>>>>
>>>>>>>> Could it be related to HADOOP-1246? Arun?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> --Konstantin
>>>>>>>>
>>>>>>>> Yunhong Gu1 wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> If someone knows how to fix the problem described below, please
>>>>>>>>> help me out. Thanks!
>>>>>>>>>
>>>>>>>>> I am testing Hadoop on a 2-node cluster and the "reduce" always
>>>>>>>>> hangs at some stage, even if I use different clusters. My OS is
>>>>>>>>> Debian Linux, kernel 2.6 (AMD Opteron w/ 4 GB Mem). The Hadoop
>>>>>>>>> version is 0.15.2. The Java version is 1.5.0_01-b08.
>>>>>>>>>
>>>>>>>>> I simply tried "./bin/hadoop jar hadoop-0.15.2-test.jar mrbench",
>>>>>>>>> and when the map stage finishes, the reduce stage hangs somewhere
>>>>>>>>> in the middle, sometimes at 0%. I also tried every other mapreduce
>>>>>>>>> program I could find in the example jar package, but they all hang.
>>>>>>>>>
>>>>>>>>> The log file simply prints
>>>>>>>>> 2008-01-18 15:15:50,831 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>> task_200801181424_0004_r_000000_0 0.0% reduce > copy >
>>>>>>>>> 2008-01-18 15:15:56,841 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>> task_200801181424_0004_r_000000_0 0.0% reduce > copy >
>>>>>>>>> 2008-01-18 15:16:02,850 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>> task_200801181424_0004_r_000000_0 0.0% reduce > copy >
>>>>>>>>>
>>>>>>>>> forever.
>>>>>>>>>
>>>>>>>>> The program does work if I start Hadoop on a single node.
>>>>>>>>>
>>>>>>>>> Below is my hadoop-site.xml configuration:
>>>>>>>>>
>>>>>>>>> <configuration>
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>> <name>fs.default.name</name>
>>>>>>>>> <value>10.0.0.1:60000</value>
>>>>>>>>> </property>
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>> <name>mapred.job.tracker</name>
>>>>>>>>> <value>10.0.0.1:60001</value>
>>>>>>>>> </property>
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>> <name>dfs.data.dir</name>
>>>>>>>>> <value>/raid/hadoop/data</value>
>>>>>>>>> </property>
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>> <name>mapred.local.dir</name>
>>>>>>>>> <value>/raid/hadoop/mapred</value>
>>>>>>>>> </property>
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>> <name>hadoop.tmp.dir</name>
>>>>>>>>> <value>/raid/hadoop/tmp</value>
>>>>>>>>> </property>
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>> <name>mapred.child.java.opts</name>
>>>>>>>>> <value>-Xmx1024m</value>
>>>>>>>>> </property>
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>> <name>mapred.tasktracker.tasks.maximum</name>
>>>>>>>>> <value>4</value>
>>>>>>>>> </property>
>>>>>>>>>
>>>>>>>>> <!--
>>>>>>>>> <property>
>>>>>>>>> <name>mapred.map.tasks</name>
>>>>>>>>> <value>7</value>
>>>>>>>>> </property>
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>> <name>mapred.reduce.tasks</name>
>>>>>>>>> <value>3</value>
>>>>>>>>> </property>
>>>>>>>>> -->
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>> <name>fs.inmemory.size.mb</name>
>>>>>>>>> <value>200</value>
>>>>>>>>> </property>
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>> <name>dfs.block.size</name>
>>>>>>>>> <value>134217728</value>
>>>>>>>>> </property>
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>> <name>io.sort.factor</name>
>>>>>>>>> <value>100</value>
>>>>>>>>> </property>
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>> <name>io.sort.mb</name>
>>>>>>>>> <value>200</value>
>>>>>>>>> </property>
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>> <name>io.file.buffer.size</name>
>>>>>>>>> <value>131072</value>
>>>>>>>>> </property>
>>>>>>>>>
>>>>>>>>> </configuration>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Taeho Kang [tkang.blogspot.com]
>>> Software Engineer, NHN Corporation, Korea
>>>
>>
>