Posted to common-user@hadoop.apache.org by Sandeep Reddy P <sa...@gmail.com> on 2012/05/22 16:02:23 UTC
Map/Reduce Tasks Fails
Hi,
We have a 5-node cdh3u4 cluster running. When I try to do teragen/terasort,
some of the map tasks are Failed/Killed, and the logs show a similar error on
all machines.
2012-05-22 09:43:50,831 INFO org.apache.hadoop.hdfs.DFSClient:
Exception in createBlockOutputStream 10.0.25.149:50010
java.net.SocketTimeoutException: 69000 millis timeout while waiting
for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/10.0.25.149:55835
remote=/10.0.25.149:50010]
2012-05-22 09:44:25,968 INFO org.apache.hadoop.hdfs.DFSClient:
Abandoning block blk_7260720956806950576_1825
2012-05-22 09:44:25,973 INFO org.apache.hadoop.hdfs.DFSClient:
Excluding datanode 10.0.25.149:50010
2012-05-22 09:46:36,350 WARN org.apache.hadoop.mapred.Task: Parent
died. Exiting attempt_201205211504_0007_m_000016_1.
Are these kinds of errors common? At least 1 map task is failing for the
above reason on every machine. We are using 24 mappers for teragen.
It took us 3 hrs 44 min 17 sec to generate 50 GB of data with 24 mappers,
with 17 failed / 8 killed task attempts,
and 24 min 10 sec for 5 GB of data with 24 mappers, with 9 killed task attempts.
The cluster works fine for small datasets.
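For scale, the run times quoted above imply very low aggregate write throughput. A quick back-of-the-envelope check (plain arithmetic on the figures in this message; the helper function is just an illustration):

```python
# Rough aggregate throughput for the teragen runs described above.

def mb_per_sec(gigabytes: float, seconds: int) -> float:
    """Aggregate write throughput in MB/s (using 1 GB = 1024 MB)."""
    return gigabytes * 1024 / seconds

# 50 GB in 3 h 44 min 17 s
t50 = 3 * 3600 + 44 * 60 + 17          # 13457 seconds
# 5 GB in 24 min 10 s
t5 = 24 * 60 + 10                      # 1450 seconds

print(round(mb_per_sec(50, t50), 1))   # ~3.8 MB/s across the whole cluster
print(round(mb_per_sec(5, t5), 1))     # ~3.5 MB/s
```

A 5-node cluster writing under 4 MB/s in aggregate points at a storage or network bottleneck rather than at MapReduce itself.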
Re: Map/Reduce Tasks Fails
Posted by Sandeep Reddy P <sa...@gmail.com>.
I got similar errors with Apache Hadoop 1.0.0.
Thanks,
Sandeep.
Re: Map/Reduce Tasks Fails
Posted by Arun C Murthy <ac...@hortonworks.com>.
Seems like a question better suited for Cloudera lists...
On May 22, 2012, at 7:02 AM, Sandeep Reddy P wrote:
> Hi,
> We have a 5node cdh3u4 cluster running. When i try to do teragen/terasort
> some of the map tasks are Failed/Killed and the logs show similar error on
> all machines.
>
> 2012-05-22 09:43:50,831 INFO org.apache.hadoop.hdfs.DFSClient:
> Exception in createBlockOutputStream 10.0.25.149:50010
> java.net.SocketTimeoutException: 69000 millis timeout while waiting
> for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.0.25.149:55835
> remote=/10.0.25.149:50010]
> 2012-05-22 09:44:25,968 INFO org.apache.hadoop.hdfs.DFSClient:
> Abandoning block blk_7260720956806950576_1825
> 2012-05-22 09:44:25,973 INFO org.apache.hadoop.hdfs.DFSClient:
> Excluding datanode 10.0.25.149:50010
> 2012-05-22 09:46:36,350 WARN org.apache.hadoop.mapred.Task: Parent
> died. Exiting attempt_201205211504_0007_m_000016_1.
>
>
>
> Are these kind of errors common?? Atleast 1 map task is failing due to
> above reason on all the machines.We are using 24 mappers for teragen.
> For us it took 3hrs 44min 17 sec to generate 50Gb data with 24 mappers
> and 17failed/8 killed task attempts.
>
> 24min 10 sec for 5GB data with 24 mappers and 9 killed Task attempts.
> Cluster works good for small datasets.
--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
Re: Map/Reduce Tasks Fails
Posted by Sandeep Reddy P <sa...@gmail.com>.
Raj,
Here is top output from one datanode while I was getting the error on that machine:
top - 14:10:15 up 23:12, 1 user, load average: 13.45, 12.91, 8.31
Tasks: 187 total, 1 running, 186 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.7%us, 0.4%sy, 0.0%ni, 0.0%id, 98.9%wa, 0.0%hi, 0.1%si,
0.0%st
Mem: 8061608k total, 7927124k used, 134484k free, 19316k buffers
Swap: 2097144k total, 384k used, 2096760k free, 6694656k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1622 hdfs 20 0 1619m 157m 11m S 2.0 2.0 33:55.42 java
14712 mapred 20 0 709m 119m 11m S 1.3 1.5 0:10.06 java
1706 mapred 20 0 1588m 126m 11m S 1.0 1.6 24:51.69 java
14663 mapred 20 0 708m 89m 11m S 1.0 1.1 0:11.23 java
14686 mapred 20 0 714m 106m 11m S 0.7 1.4 0:11.53 java
14762 mapred 20 0 710m 89m 11m S 0.7 1.1 0:10.05 java
14640 mapred 20 0 704m 119m 11m S 0.3 1.5 0:11.36 java
Error Message:
12/05/22 14:09:52 INFO mapred.JobClient: Task Id :
attempt_201205211504_0009_m_000002_0, Status : FAILED
java.io.IOException: All datanodes 10.0.24.175:50010 are bad. Aborting...
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3181)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2100(DFSClient.java:2720)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2892)
attempt_201205211504_0009_m_000002_0: log4j:WARN No appenders could be
found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201205211504_0009_m_000002_0: log4j:WARN Please initialize the
log4j system properly.
But other map tasks are running on the same datanode.
Thanks,
sandeep.
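The Cpu(s) line in the top snapshot above is the striking part: 98.9% of CPU time is stuck in iowait, so the node is waiting on storage, not computing. A small sketch of pulling that figure out of a top snapshot (the parsing relies on top's standard us/sy/ni/id/wa/hi/si/st field labels; it is an illustration, not a monitoring tool):

```python
import re

def iowait_pct(cpu_line: str) -> float:
    """Extract the %wa (iowait) figure from a top 'Cpu(s):' line."""
    m = re.search(r"([\d.]+)%wa", cpu_line)
    if m is None:
        raise ValueError("no %wa field found")
    return float(m.group(1))

# The Cpu(s) line from the snapshot in this message:
line = "Cpu(s): 0.7%us, 0.4%sy, 0.0%ni, 0.0%id, 98.9%wa, 0.0%hi, 0.1%si, 0.0%st"
wa = iowait_pct(line)
print(wa)            # 98.9
print(wa > 50.0)     # True: the node is I/O-bound, not CPU-bound
```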
Re: Map/Reduce Tasks Fails
Posted by Sandeep Reddy P <sa...@gmail.com>.
Raj,
- Network card: VMware generic Gigabit network adapter. As long as
these VMs are only talking to each other, the communication speed will be
close to 1 Gb. The top output is from when the systems are idle.
Re: Map/Reduce Tasks Fails
Posted by Raj Vishwanathan <ra...@yahoo.com>.
Sandeep
How many network interfaces? Is the network shared between iSCSI and M/R communications?
Is this the top output when the system is idle, or when you are getting errors? (I am guessing idle!)
Raj
>________________________________
> From: Sandeep Reddy P <sa...@gmail.com>
>To: common-user@hadoop.apache.org; Raj Vishwanathan <ra...@yahoo.com>
>Sent: Tuesday, May 22, 2012 8:02 AM
>Subject: Re: Map/Reduce Tasks Fails
>
>Hi Raj,
>We are using SAN shared storage used by multiple servers connected over
>iSCSI.
>
>
>TOP from one of the datanode
>
>top - 11:01:04 up 19:53, 1 user, load average: 0.00, 0.00, 0.35
>Tasks: 180 total, 1 running, 179 sleeping, 0 stopped, 0 zombie
>Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si,
>0.0%st
>Mem: 8061608k total, 5010408k used, 3051200k free, 13152k buffers
>Swap: 2097144k total, 272k used, 2096872k free, 4355840k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>1714 mapred 20 0 1582m 129m 11m S 0.7 1.6 5:49.68 java
>14331 root 20 0 15012 1364 988 R 0.3 0.0 0:00.02 top
> 1 root 20 0 19204 1372 1084 S 0.0 0.0 0:00.82 init
> 2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
> 3 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
> 4 root 20 0 0 0 0 S 0.0 0.0 0:00.14 ksoftirqd/0
> 5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
> 6 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
> 7 root RT 0 0 0 0 S 0.0 0.0 0:00.01 migration/1
> 8 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/1
> 9 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/1
> 10 root RT 0 0 0 0 S 0.0 0.0 0:00.04 watchdog/1
> 11 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
> 12 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
> 13 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/2
> 14 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/2
> 15 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
> 16 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
> 17 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/3
> 18 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/3
> 19 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/4
> 20 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/4
> 21 root 20 0 0 0 0 S 0.0 0.0 0:00.02 ksoftirqd/4
> 22 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/4
> 23 root RT 0 0 0 0 S 0.0 0.0 0:00.01 migration/5
> 24 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/5
> 25 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/5
> 26 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/5
> 27 root 20 0 0 0 0 S 0.0 0.0 0:00.00 events/0
> 28 root 20 0 0 0 0 S 0.0 0.0 0:04.27 events/1
> 29 root 20 0 0 0 0 S 0.0 0.0 0:02.39 events/2
> 30 root 20 0 0 0 0 S 0.0 0.0 0:01.46 events/3
> 31 root 20 0 0 0 0 S 0.0 0.0 0:00.11 events/4
> 32 root 20 0 0 0 0 S 0.0 0.0 0:00.84 events/5
> 33 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuset
> 34 root 20 0 0 0 0 S 0.0 0.0 0:00.00 khelper
> 35 root 20 0 0 0 0 S 0.0 0.0 0:00.00 netns
>
>
>
Re: Map/Reduce Tasks Fails
Posted by Sandeep Reddy P <sa...@gmail.com>.
Hi Raj,
We are using SAN shared storage used by multiple servers connected over
iSCSI.
TOP from one of the datanode
top - 11:01:04 up 19:53, 1 user, load average: 0.00, 0.00, 0.35
Tasks: 180 total, 1 running, 179 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si,
0.0%st
Mem: 8061608k total, 5010408k used, 3051200k free, 13152k buffers
Swap: 2097144k total, 272k used, 2096872k free, 4355840k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1714 mapred 20 0 1582m 129m 11m S 0.7 1.6 5:49.68 java
14331 root 20 0 15012 1364 988 R 0.3 0.0 0:00.02 top
1 root 20 0 19204 1372 1084 S 0.0 0.0 0:00.82 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
4 root 20 0 0 0 0 S 0.0 0.0 0:00.14 ksoftirqd/0
5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
6 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
7 root RT 0 0 0 0 S 0.0 0.0 0:00.01 migration/1
8 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/1
9 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/1
10 root RT 0 0 0 0 S 0.0 0.0 0:00.04 watchdog/1
11 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
12 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
13 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/2
14 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/2
15 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
16 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
17 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/3
18 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/3
19 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/4
20 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/4
21 root 20 0 0 0 0 S 0.0 0.0 0:00.02 ksoftirqd/4
22 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/4
23 root RT 0 0 0 0 S 0.0 0.0 0:00.01 migration/5
24 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/5
25 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/5
26 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/5
27 root 20 0 0 0 0 S 0.0 0.0 0:00.00 events/0
28 root 20 0 0 0 0 S 0.0 0.0 0:04.27 events/1
29 root 20 0 0 0 0 S 0.0 0.0 0:02.39 events/2
30 root 20 0 0 0 0 S 0.0 0.0 0:01.46 events/3
31 root 20 0 0 0 0 S 0.0 0.0 0:00.11 events/4
32 root 20 0 0 0 0 S 0.0 0.0 0:00.84 events/5
33 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuset
34 root 20 0 0 0 0 S 0.0 0.0 0:00.00 khelper
35 root 20 0 0 0 0 S 0.0 0.0 0:00.00 netns
Re: Map/Reduce Tasks Fails
Posted by Raj Vishwanathan <ra...@yahoo.com>.
What kind of storage is attached to the data nodes? This kind of error can happen when the CPU is really busy with I/O or interrupts.
Can you run top or dstat on some of the data nodes to see how the system is performing?
Raj
>________________________________
> From: Sandeep Reddy P <sa...@gmail.com>
>To: common-user@hadoop.apache.org
>Sent: Tuesday, May 22, 2012 7:23 AM
>Subject: Re: Map/Reduce Tasks Fails
>
>*Task Trackers* *Name**Host**# running tasks**Max Map Tasks**Max Reduce
>Tasks**Task Failures**Directory Failures**Node Health Status**Seconds Since
>Node Last Healthy**Total Tasks Since Start* *Succeeded Tasks Since
>Start* *Total
>Tasks Last Day* *Succeeded Tasks Last Day* *Total Tasks Last Hour* *Succeeded
>Tasks Last Hour* *Seconds since heartbeat*
>tracker_hadoop2.liaisondevqa.local:localhost/127.0.0.1:56225<http://hadoop2.liaisondevqa.local:50060/>
>hadoop2.liaisondevqa.local062220N/A093 60 59 28 64 38 0
>tracker_hadoop4.liaisondevqa.local:localhost/127.0.0.1:40363<http://hadoop4.liaisondevqa.local:50060/>
>hadoop4.liaisondevqa.local062190N/A091 59 65 33 36 33 0
>tracker_hadoop5.liaisondevqa.local:localhost/127.0.0.1:46605<http://hadoop5.liaisondevqa.local:50060/>
>hadoop5.liaisondevqa.local162210N/A083 47 69 35 45 19 0
>tracker_hadoop3.liaisondevqa.local:localhost/127.0.0.1:37305<http://hadoop3.liaisondevqa.local:50060/>
>hadoop3.liaisondevqa.local062180N/A087 55 55 28 57 34 0 Highest Failures:
>tracker_hadoop2.liaisondevqa.local:localhost/127.0.0.1:56225 with 22
>failures
>
>
>
Re: Map/Reduce Tasks Fails
Posted by Sandeep Reddy P <sa...@gmail.com>.
Task Trackers (columns: # running tasks | max map tasks | max reduce tasks |
task failures | directory failures | node health status | seconds since node
last healthy | total/succeeded tasks since start | total/succeeded tasks last
day | total/succeeded tasks last hour | seconds since heartbeat):

tracker_hadoop2.liaisondevqa.local:localhost/127.0.0.1:56225 (hadoop2.liaisondevqa.local)
  0 | 6 | 2 | 22 | 0 | N/A | 0 | 93/60 | 59/28 | 64/38 | 0
tracker_hadoop4.liaisondevqa.local:localhost/127.0.0.1:40363 (hadoop4.liaisondevqa.local)
  0 | 6 | 2 | 19 | 0 | N/A | 0 | 91/59 | 65/33 | 36/33 | 0
tracker_hadoop5.liaisondevqa.local:localhost/127.0.0.1:46605 (hadoop5.liaisondevqa.local)
  1 | 6 | 2 | 21 | 0 | N/A | 0 | 83/47 | 69/35 | 45/19 | 0
tracker_hadoop3.liaisondevqa.local:localhost/127.0.0.1:37305 (hadoop3.liaisondevqa.local)
  0 | 6 | 2 | 18 | 0 | N/A | 0 | 87/55 | 55/28 | 57/34 | 0

Highest failures: tracker_hadoop2.liaisondevqa.local:localhost/127.0.0.1:56225 with 22 failures
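Summing the per-tracker failure counts makes the skew visible. The numbers below are read from the TaskTracker status above, assuming the run-together counters split as running / max-map / max-reduce / failures / dir-failures (the 22 for hadoop2 is confirmed by the "Highest failures" line):

```python
# Task-failure counts per TaskTracker, as reported in the status dump above.
failures = {
    "hadoop2.liaisondevqa.local": 22,
    "hadoop3.liaisondevqa.local": 18,
    "hadoop4.liaisondevqa.local": 19,
    "hadoop5.liaisondevqa.local": 21,
}

worst = max(failures, key=failures.get)
print(worst, failures[worst])   # hadoop2.liaisondevqa.local 22
print(sum(failures.values()))   # 80 failed attempts across 4 trackers
```

Failures of this magnitude spread nearly evenly across all trackers usually point at a shared resource (here, the shared SAN) rather than at one bad node.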
Re: Map/Reduce Tasks Fails
Posted by Sandeep Reddy P <sa...@gmail.com>.
I see killed maps on almost all machines. I just finished terasort on 5 GB of
data with 9 killed map tasks.
Re: Map/Reduce Tasks Fails
Posted by Raj Vishwanathan <ra...@yahoo.com>.
>________________________________
> From: Harsh J <ha...@cloudera.com>
>To: common-user@hadoop.apache.org
>Sent: Tuesday, May 22, 2012 7:13 AM
>Subject: Re: Map/Reduce Tasks Fails
>
>Sandeep,
>
>Is the same DN 10.0.25.149 reported across all failures? And do you
>notice any machine patterns when observing the failed tasks (i.e. are
>they clumped on any one or a few particular TTs repeatedly)?
>
>On Tue, May 22, 2012 at 7:32 PM, Sandeep Reddy P
><sa...@gmail.com> wrote:
>> Hi,
>> We have a 5node cdh3u4 cluster running. When i try to do teragen/terasort
>> some of the map tasks are Failed/Killed and the logs show similar error on
>> all machines.
>>
>> 2012-05-22 09:43:50,831 INFO org.apache.hadoop.hdfs.DFSClient:
>> Exception in createBlockOutputStream 10.0.25.149:50010
>> java.net.SocketTimeoutException: 69000 millis timeout while waiting
>> for channel to be ready for read. ch :
>> java.nio.channels.SocketChannel[connected local=/10.0.25.149:55835
>> remote=/10.0.25.149:50010]
>> 2012-05-22 09:44:25,968 INFO org.apache.hadoop.hdfs.DFSClient:
>> Abandoning block blk_7260720956806950576_1825
>> 2012-05-22 09:44:25,973 INFO org.apache.hadoop.hdfs.DFSClient:
>> Excluding datanode 10.0.25.149:50010
>> 2012-05-22 09:46:36,350 WARN org.apache.hadoop.mapred.Task: Parent
>> died. Exiting attempt_201205211504_0007_m_000016_1.
>>
>>
>>
>> Are these kind of errors common?? Atleast 1 map task is failing due to
>> above reason on all the machines.We are using 24 mappers for teragen.
>> For us it took 3hrs 44min 17 sec to generate 50Gb data with 24 mappers
>> and 17failed/8 killed task attempts.
>>
>> 24min 10 sec for 5GB data with 24 mappers and 9 killed Task attempts.
>> Cluster works good for small datasets.
>
>
>
>--
>Harsh J
>
>
>
Re: Map/Reduce Tasks Fails
Posted by Harsh J <ha...@cloudera.com>.
Sandeep,
Is the same DN 10.0.25.149 reported across all failures? And do you
notice any machine patterns when observing the failed tasks (i.e. are
they clumped on any one or a few particular TTs repeatedly)?
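One way to answer this from the task logs is to tally how often each datanode appears in DFSClient "Excluding datanode" lines. A quick sketch (the first sample line is from this thread; the second is a fabricated example, reusing the 10.0.24.175 address mentioned elsewhere in the thread, purely to exercise the counter):

```python
import re
from collections import Counter

def excluded_datanodes(lines):
    """Tally datanodes named in DFSClient 'Excluding datanode' log lines."""
    counts = Counter()
    for line in lines:
        m = re.search(r"Excluding datanode ([\d.]+:\d+)", line)
        if m:
            counts[m.group(1)] += 1
    return counts

sample = [
    "2012-05-22 09:44:25,973 INFO org.apache.hadoop.hdfs.DFSClient: "
    "Excluding datanode 10.0.25.149:50010",
    # Fabricated sample line for illustration only:
    "2012-05-22 10:02:11,004 INFO org.apache.hadoop.hdfs.DFSClient: "
    "Excluding datanode 10.0.24.175:50010",
]
print(excluded_datanodes(sample).most_common())
```

If one address dominates the tally across many task logs, that datanode (or its disk/network path) is the suspect; an even spread points at a cluster-wide bottleneck.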
On Tue, May 22, 2012 at 7:32 PM, Sandeep Reddy P
<sa...@gmail.com> wrote:
> Hi,
> We have a 5node cdh3u4 cluster running. When i try to do teragen/terasort
> some of the map tasks are Failed/Killed and the logs show similar error on
> all machines.
>
> 2012-05-22 09:43:50,831 INFO org.apache.hadoop.hdfs.DFSClient:
> Exception in createBlockOutputStream 10.0.25.149:50010
> java.net.SocketTimeoutException: 69000 millis timeout while waiting
> for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.0.25.149:55835
> remote=/10.0.25.149:50010]
> 2012-05-22 09:44:25,968 INFO org.apache.hadoop.hdfs.DFSClient:
> Abandoning block blk_7260720956806950576_1825
> 2012-05-22 09:44:25,973 INFO org.apache.hadoop.hdfs.DFSClient:
> Excluding datanode 10.0.25.149:50010
> 2012-05-22 09:46:36,350 WARN org.apache.hadoop.mapred.Task: Parent
> died. Exiting attempt_201205211504_0007_m_000016_1.
>
>
>
> Are these kind of errors common?? Atleast 1 map task is failing due to
> above reason on all the machines.We are using 24 mappers for teragen.
> For us it took 3hrs 44min 17 sec to generate 50Gb data with 24 mappers
> and 17failed/8 killed task attempts.
>
> 24min 10 sec for 5GB data with 24 mappers and 9 killed Task attempts.
> Cluster works good for small datasets.
--
Harsh J