Posted to common-user@hadoop.apache.org by Sandeep Reddy P <sa...@gmail.com> on 2012/05/22 16:02:23 UTC
Map/Reduce Tasks Fails
Hi,
We have a 5-node cdh3u4 cluster running. When I try to do teragen/terasort,
some of the map tasks are Failed/Killed, and the logs show a similar error on
all machines.
2012-05-22 09:43:50,831 INFO org.apache.hadoop.hdfs.DFSClient:
Exception in createBlockOutputStream 10.0.25.149:50010
java.net.SocketTimeoutException: 69000 millis timeout while waiting
for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/10.0.25.149:55835
remote=/10.0.25.149:50010]
2012-05-22 09:44:25,968 INFO org.apache.hadoop.hdfs.DFSClient:
Abandoning block blk_7260720956806950576_1825
2012-05-22 09:44:25,973 INFO org.apache.hadoop.hdfs.DFSClient:
Excluding datanode 10.0.25.149:50010
2012-05-22 09:46:36,350 WARN org.apache.hadoop.mapred.Task: Parent
died. Exiting attempt_201205211504_0007_m_000016_1.
Are these kinds of errors common? At least 1 map task is failing for the
above reason on every machine. We are using 24 mappers for teragen.
It took us 3 hrs 44 min 17 sec to generate 50 GB of data with 24 mappers,
with 17 failed / 8 killed task attempts,
and 24 min 10 sec for 5 GB of data with 24 mappers, with 9 killed task attempts.
The cluster works fine for small datasets.
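For scale, the run times quoted above imply very low aggregate write throughput. A quick back-of-the-envelope check (plain arithmetic on the figures in this message; the helper function is just an illustration):

```python
# Rough aggregate throughput for the teragen runs described above.

def mb_per_sec(gigabytes: float, seconds: int) -> float:
    """Aggregate write throughput in MB/s (using 1 GB = 1024 MB)."""
    return gigabytes * 1024 / seconds

# 50 GB in 3 h 44 min 17 s
t50 = 3 * 3600 + 44 * 60 + 17          # 13457 seconds
# 5 GB in 24 min 10 s
t5 = 24 * 60 + 10                      # 1450 seconds

print(round(mb_per_sec(50, t50), 1))   # ~3.8 MB/s across the whole cluster
print(round(mb_per_sec(5, t5), 1))     # ~3.5 MB/s
```

A 5-node cluster writing under 4 MB/s in aggregate points at a storage or network bottleneck rather than at MapReduce itself.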
Re: Map/Reduce Tasks Fails
Posted by Sandeep Reddy P <sa...@gmail.com>.
I got similar errors with Apache Hadoop 1.0.0.
Thanks,
Sandeep.
Re: Map/Reduce Tasks Fails
Posted by Arun C Murthy <ac...@hortonworks.com>.
Seems like a question better suited for Cloudera lists...
On May 22, 2012, at 7:02 AM, Sandeep Reddy P wrote:
> Hi,
> We have a 5node cdh3u4 cluster running. When i try to do teragen/terasort
> some of the map tasks are Failed/Killed and the logs show similar error on
> all machines.
>
> 2012-05-22 09:43:50,831 INFO org.apache.hadoop.hdfs.DFSClient:
> Exception in createBlockOutputStream 10.0.25.149:50010
> java.net.SocketTimeoutException: 69000 millis timeout while waiting
> for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.0.25.149:55835
> remote=/10.0.25.149:50010]
> 2012-05-22 09:44:25,968 INFO org.apache.hadoop.hdfs.DFSClient:
> Abandoning block blk_7260720956806950576_1825
> 2012-05-22 09:44:25,973 INFO org.apache.hadoop.hdfs.DFSClient:
> Excluding datanode 10.0.25.149:50010
> 2012-05-22 09:46:36,350 WARN org.apache.hadoop.mapred.Task: Parent
> died. Exiting attempt_201205211504_0007_m_000016_1.
>
>
>
> Are these kind of errors common?? Atleast 1 map task is failing due to
> above reason on all the machines.We are using 24 mappers for teragen.
> For us it took 3hrs 44min 17 sec to generate 50Gb data with 24 mappers
> and 17failed/8 killed task attempts.
>
> 24min 10 sec for 5GB data with 24 mappers and 9 killed Task attempts.
> Cluster works good for small datasets.
--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
Re: Map/Reduce Tasks Fails
Posted by Sandeep Reddy P <sa...@gmail.com>.
Raj,
Here is top output from one datanode while I was getting the error on that machine:
top - 14:10:15 up 23:12, 1 user, load average: 13.45, 12.91, 8.31
Tasks: 187 total, 1 running, 186 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.7%us, 0.4%sy, 0.0%ni, 0.0%id, 98.9%wa, 0.0%hi, 0.1%si,
0.0%st
Mem: 8061608k total, 7927124k used, 134484k free, 19316k buffers
Swap: 2097144k total, 384k used, 2096760k free, 6694656k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1622 hdfs 20 0 1619m 157m 11m S 2.0 2.0 33:55.42 java
14712 mapred 20 0 709m 119m 11m S 1.3 1.5 0:10.06 java
1706 mapred 20 0 1588m 126m 11m S 1.0 1.6 24:51.69 java
14663 mapred 20 0 708m 89m 11m S 1.0 1.1 0:11.23 java
14686 mapred 20 0 714m 106m 11m S 0.7 1.4 0:11.53 java
14762 mapred 20 0 710m 89m 11m S 0.7 1.1 0:10.05 java
14640 mapred 20 0 704m 119m 11m S 0.3 1.5 0:11.36 java
Error Message:
12/05/22 14:09:52 INFO mapred.JobClient: Task Id :
attempt_201205211504_0009_m_000002_0, Status : FAILED
java.io.IOException: All datanodes 10.0.24.175:50010 are bad. Aborting...
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3181)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2100(DFSClient.java:2720)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2892)
attempt_201205211504_0009_m_000002_0: log4j:WARN No appenders could be
found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201205211504_0009_m_000002_0: log4j:WARN Please initialize the
log4j system properly.
But other map tasks are running on the same datanode.
Thanks,
sandeep.
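The Cpu(s) line in the top snapshot above is the striking part: 98.9% of CPU time is stuck in iowait, so the node is waiting on storage, not computing. A small sketch of pulling that figure out of a top snapshot (the parsing relies on top's standard us/sy/ni/id/wa/hi/si/st field labels; it is an illustration, not a monitoring tool):

```python
import re

def iowait_pct(cpu_line: str) -> float:
    """Extract the %wa (iowait) figure from a top 'Cpu(s):' line."""
    m = re.search(r"([\d.]+)%wa", cpu_line)
    if m is None:
        raise ValueError("no %wa field found")
    return float(m.group(1))

# The Cpu(s) line from the snapshot in this message:
line = "Cpu(s): 0.7%us, 0.4%sy, 0.0%ni, 0.0%id, 98.9%wa, 0.0%hi, 0.1%si, 0.0%st"
wa = iowait_pct(line)
print(wa)            # 98.9
print(wa > 50.0)     # True: the node is I/O-bound, not CPU-bound
```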
Re: Map/Reduce Tasks Fails
Posted by Sandeep Reddy P <sa...@gmail.com>.
Raj,
- Network card: VMware generic Gigabit network adapter. As long as
these VMs are only talking to each other, the communication speed will be
close to 1 Gb. The top output is from when the systems are idle.
Re: Map/Reduce Tasks Fails
Posted by Raj Vishwanathan <ra...@yahoo.com>.
Sandeep
How many network interfaces? Is the network shared between iSCSI and M/R communications?
Is this the top output when the system is idle, or when you are getting errors? (I am guessing idle!)
Raj
>________________________________
> From: Sandeep Reddy P <sa...@gmail.com>
>To: common-user@hadoop.apache.org; Raj Vishwanathan <ra...@yahoo.com>
>Sent: Tuesday, May 22, 2012 8:02 AM
>Subject: Re: Map/Reduce Tasks Fails
>
>Hi Raj,
>We are using SAN shared storage used by multiple servers connected over
>iSCSI.
>
>
>TOP from one of the datanode
>
>top - 11:01:04 up 19:53, 1 user, load average: 0.00, 0.00, 0.35
>Tasks: 180 total, 1 running, 179 sleeping, 0 stopped, 0 zombie
>Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si,
>0.0%st
>Mem: 8061608k total, 5010408k used, 3051200k free, 13152k buffers
>Swap: 2097144k total, 272k used, 2096872k free, 4355840k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>1714 mapred 20 0 1582m 129m 11m S 0.7 1.6 5:49.68 java
>14331 root 20 0 15012 1364 988 R 0.3 0.0 0:00.02 top
> 1 root 20 0 19204 1372 1084 S 0.0 0.0 0:00.82 init
> 2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
> 3 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
> 4 root 20 0 0 0 0 S 0.0 0.0 0:00.14 ksoftirqd/0
> 5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
> 6 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
> 7 root RT 0 0 0 0 S 0.0 0.0 0:00.01 migration/1
> 8 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/1
> 9 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/1
> 10 root RT 0 0 0 0 S 0.0 0.0 0:00.04 watchdog/1
> 11 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
> 12 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
> 13 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/2
> 14 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/2
> 15 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
> 16 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
> 17 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/3
> 18 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/3
> 19 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/4
> 20 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/4
> 21 root 20 0 0 0 0 S 0.0 0.0 0:00.02 ksoftirqd/4
> 22 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/4
> 23 root RT 0 0 0 0 S 0.0 0.0 0:00.01 migration/5
> 24 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/5
> 25 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/5
> 26 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/5
> 27 root 20 0 0 0 0 S 0.0 0.0 0:00.00 events/0
> 28 root 20 0 0 0 0 S 0.0 0.0 0:04.27 events/1
> 29 root 20 0 0 0 0 S 0.0 0.0 0:02.39 events/2
> 30 root 20 0 0 0 0 S 0.0 0.0 0:01.46 events/3
> 31 root 20 0 0 0 0 S 0.0 0.0 0:00.11 events/4
> 32 root 20 0 0 0 0 S 0.0 0.0 0:00.84 events/5
> 33 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuset
> 34 root 20 0 0 0 0 S 0.0 0.0 0:00.00 khelper
> 35 root 20 0 0 0 0 S 0.0 0.0 0:00.00 netns
>
>
>
Re: Map/Reduce Tasks Fails
Posted by Sandeep Reddy P <sa...@gmail.com>.
Hi Raj,
We are using SAN shared storage used by multiple servers connected over
iSCSI.
TOP from one of the datanode
top - 11:01:04 up 19:53, 1 user, load average: 0.00, 0.00, 0.35
Tasks: 180 total, 1 running, 179 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si,
0.0%st
Mem: 8061608k total, 5010408k used, 3051200k free, 13152k buffers
Swap: 2097144k total, 272k used, 2096872k free, 4355840k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1714 mapred 20 0 1582m 129m 11m S 0.7 1.6 5:49.68 java
14331 root 20 0 15012 1364 988 R 0.3 0.0 0:00.02 top
1 root 20 0 19204 1372 1084 S 0.0 0.0 0:00.82 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
4 root 20 0 0 0 0 S 0.0 0.0 0:00.14 ksoftirqd/0
5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
6 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
7 root RT 0 0 0 0 S 0.0 0.0 0:00.01 migration/1
8 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/1
9 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/1
10 root RT 0 0 0 0 S 0.0 0.0 0:00.04 watchdog/1
11 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
12 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
13 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/2
14 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/2
15 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
16 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
17 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/3
18 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/3
19 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/4
20 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/4
21 root 20 0 0 0 0 S 0.0 0.0 0:00.02 ksoftirqd/4
22 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/4
23 root RT 0 0 0 0 S 0.0 0.0 0:00.01 migration/5
24 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/5
25 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/5
26 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/5
27 root 20 0 0 0 0 S 0.0 0.0 0:00.00 events/0
28 root 20 0 0 0 0 S 0.0 0.0 0:04.27 events/1
29 root 20 0 0 0 0 S 0.0 0.0 0:02.39 events/2
30 root 20 0 0 0 0 S 0.0 0.0 0:01.46 events/3
31 root 20 0 0 0 0 S 0.0 0.0 0:00.11 events/4
32 root 20 0 0 0 0 S 0.0 0.0 0:00.84 events/5
33 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuset
34 root 20 0 0 0 0 S 0.0 0.0 0:00.00 khelper
35 root 20 0 0 0 0 S 0.0 0.0 0:00.00 netns
Re: Map/Reduce Tasks Fails
Posted by Raj Vishwanathan <ra...@yahoo.com>.
What kind of storage is attached to the data nodes? This kind of error can happen when the CPU is really busy with I/O or interrupts.
Can you run top or dstat on some of the data nodes to see how the system is performing?
Raj
>________________________________
> From: Sandeep Reddy P <sa...@gmail.com>
>To: common-user@hadoop.apache.org
>Sent: Tuesday, May 22, 2012 7:23 AM
>Subject: Re: Map/Reduce Tasks Fails
>
>*Task Trackers* *Name**Host**# running tasks**Max Map Tasks**Max Reduce
>Tasks**Task Failures**Directory Failures**Node Health Status**Seconds Since
>Node Last Healthy**Total Tasks Since Start* *Succeeded Tasks Since
>Start* *Total
>Tasks Last Day* *Succeeded Tasks Last Day* *Total Tasks Last Hour* *Succeeded
>Tasks Last Hour* *Seconds since heartbeat*
>tracker_hadoop2.liaisondevqa.local:localhost/127.0.0.1:56225<http://hadoop2.liaisondevqa.local:50060/>
>hadoop2.liaisondevqa.local062220N/A093 60 59 28 64 38 0
>tracker_hadoop4.liaisondevqa.local:localhost/127.0.0.1:40363<http://hadoop4.liaisondevqa.local:50060/>
>hadoop4.liaisondevqa.local062190N/A091 59 65 33 36 33 0
>tracker_hadoop5.liaisondevqa.local:localhost/127.0.0.1:46605<http://hadoop5.liaisondevqa.local:50060/>
>hadoop5.liaisondevqa.local162210N/A083 47 69 35 45 19 0
>tracker_hadoop3.liaisondevqa.local:localhost/127.0.0.1:37305<http://hadoop3.liaisondevqa.local:50060/>
>hadoop3.liaisondevqa.local062180N/A087 55 55 28 57 34 0 Highest Failures:
>tracker_hadoop2.liaisondevqa.local:localhost/127.0.0.1:56225 with 22
>failures
>
>
>
Re: Map/Reduce Tasks Fails
Posted by Sandeep Reddy P <sa...@gmail.com>.
Task Trackers (columns: # running tasks | max map tasks | max reduce tasks |
task failures | directory failures | node health status | seconds since node
last healthy | total/succeeded tasks since start | total/succeeded tasks last
day | total/succeeded tasks last hour | seconds since heartbeat):

tracker_hadoop2.liaisondevqa.local:localhost/127.0.0.1:56225 (hadoop2.liaisondevqa.local)
  0 | 6 | 2 | 22 | 0 | N/A | 0 | 93/60 | 59/28 | 64/38 | 0
tracker_hadoop4.liaisondevqa.local:localhost/127.0.0.1:40363 (hadoop4.liaisondevqa.local)
  0 | 6 | 2 | 19 | 0 | N/A | 0 | 91/59 | 65/33 | 36/33 | 0
tracker_hadoop5.liaisondevqa.local:localhost/127.0.0.1:46605 (hadoop5.liaisondevqa.local)
  1 | 6 | 2 | 21 | 0 | N/A | 0 | 83/47 | 69/35 | 45/19 | 0
tracker_hadoop3.liaisondevqa.local:localhost/127.0.0.1:37305 (hadoop3.liaisondevqa.local)
  0 | 6 | 2 | 18 | 0 | N/A | 0 | 87/55 | 55/28 | 57/34 | 0

Highest failures: tracker_hadoop2.liaisondevqa.local:localhost/127.0.0.1:56225 with 22 failures
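Summing the per-tracker failure counts makes the skew visible. The numbers below are read from the TaskTracker status above, assuming the run-together counters split as running / max-map / max-reduce / failures / dir-failures (the 22 for hadoop2 is confirmed by the "Highest failures" line):

```python
# Task-failure counts per TaskTracker, as reported in the status dump above.
failures = {
    "hadoop2.liaisondevqa.local": 22,
    "hadoop3.liaisondevqa.local": 18,
    "hadoop4.liaisondevqa.local": 19,
    "hadoop5.liaisondevqa.local": 21,
}

worst = max(failures, key=failures.get)
print(worst, failures[worst])   # hadoop2.liaisondevqa.local 22
print(sum(failures.values()))   # 80 failed attempts across 4 trackers
```

Failures of this magnitude spread nearly evenly across all trackers usually point at a shared resource (here, the shared SAN) rather than at one bad node.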
Re: Map/Reduce Tasks Fails
Posted by Sandeep Reddy P <sa...@gmail.com>.
I see killed maps on almost all machines. I just finished terasort on 5 GB of
data with 9 killed map tasks.
Re: Map/Reduce Tasks Fails
Posted by Raj Vishwanathan <ra...@yahoo.com>.
>________________________________
> From: Harsh J <ha...@cloudera.com>
>To: common-user@hadoop.apache.org
>Sent: Tuesday, May 22, 2012 7:13 AM
>Subject: Re: Map/Reduce Tasks Fails
>
>Sandeep,
>
>Is the same DN 10.0.25.149 reported across all failures? And do you
>notice any machine patterns when observing the failed tasks (i.e. are
>they clumped on any one or a few particular TTs repeatedly)?
>
>On Tue, May 22, 2012 at 7:32 PM, Sandeep Reddy P
><sa...@gmail.com> wrote:
>> Hi,
>> We have a 5node cdh3u4 cluster running. When i try to do teragen/terasort
>> some of the map tasks are Failed/Killed and the logs show similar error on
>> all machines.
>>
>> 2012-05-22 09:43:50,831 INFO org.apache.hadoop.hdfs.DFSClient:
>> Exception in createBlockOutputStream 10.0.25.149:50010
>> java.net.SocketTimeoutException: 69000 millis timeout while waiting
>> for channel to be ready for read. ch :
>> java.nio.channels.SocketChannel[connected local=/10.0.25.149:55835
>> remote=/10.0.25.149:50010]
>> 2012-05-22 09:44:25,968 INFO org.apache.hadoop.hdfs.DFSClient:
>> Abandoning block blk_7260720956806950576_1825
>> 2012-05-22 09:44:25,973 INFO org.apache.hadoop.hdfs.DFSClient:
>> Excluding datanode 10.0.25.149:50010
>> 2012-05-22 09:46:36,350 WARN org.apache.hadoop.mapred.Task: Parent
>> died. Exiting attempt_201205211504_0007_m_000016_1.
>>
>>
>>
>> Are these kind of errors common?? Atleast 1 map task is failing due to
>> above reason on all the machines.We are using 24 mappers for teragen.
>> For us it took 3hrs 44min 17 sec to generate 50Gb data with 24 mappers
>> and 17failed/8 killed task attempts.
>>
>> 24min 10 sec for 5GB data with 24 mappers and 9 killed Task attempts.
>> Cluster works good for small datasets.
>
>
>
>--
>Harsh J
>
>
>
Re: Map/Reduce Tasks Fails
Posted by Harsh J <ha...@cloudera.com>.
Sandeep,
Is the same DN 10.0.25.149 reported across all failures? And do you
notice any machine patterns when observing the failed tasks (i.e. are
they clumped on any one or a few particular TTs repeatedly)?
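One way to answer this from the task logs is to tally how often each datanode appears in DFSClient "Excluding datanode" lines. A quick sketch (the first sample line is from this thread; the second is a fabricated example, reusing the 10.0.24.175 address mentioned elsewhere in the thread, purely to exercise the counter):

```python
import re
from collections import Counter

def excluded_datanodes(lines):
    """Tally datanodes named in DFSClient 'Excluding datanode' log lines."""
    counts = Counter()
    for line in lines:
        m = re.search(r"Excluding datanode ([\d.]+:\d+)", line)
        if m:
            counts[m.group(1)] += 1
    return counts

sample = [
    "2012-05-22 09:44:25,973 INFO org.apache.hadoop.hdfs.DFSClient: "
    "Excluding datanode 10.0.25.149:50010",
    # Fabricated sample line for illustration only:
    "2012-05-22 10:02:11,004 INFO org.apache.hadoop.hdfs.DFSClient: "
    "Excluding datanode 10.0.24.175:50010",
]
print(excluded_datanodes(sample).most_common())
```

If one address dominates the tally across many task logs, that datanode (or its disk/network path) is the suspect; an even spread points at a cluster-wide bottleneck.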
On Tue, May 22, 2012 at 7:32 PM, Sandeep Reddy P
<sa...@gmail.com> wrote:
> Hi,
> We have a 5node cdh3u4 cluster running. When i try to do teragen/terasort
> some of the map tasks are Failed/Killed and the logs show similar error on
> all machines.
>
> 2012-05-22 09:43:50,831 INFO org.apache.hadoop.hdfs.DFSClient:
> Exception in createBlockOutputStream 10.0.25.149:50010
> java.net.SocketTimeoutException: 69000 millis timeout while waiting
> for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.0.25.149:55835
> remote=/10.0.25.149:50010]
> 2012-05-22 09:44:25,968 INFO org.apache.hadoop.hdfs.DFSClient:
> Abandoning block blk_7260720956806950576_1825
> 2012-05-22 09:44:25,973 INFO org.apache.hadoop.hdfs.DFSClient:
> Excluding datanode 10.0.25.149:50010
> 2012-05-22 09:46:36,350 WARN org.apache.hadoop.mapred.Task: Parent
> died. Exiting attempt_201205211504_0007_m_000016_1.
>
>
>
> Are these kind of errors common?? Atleast 1 map task is failing due to
> above reason on all the machines.We are using 24 mappers for teragen.
> For us it took 3hrs 44min 17 sec to generate 50Gb data with 24 mappers
> and 17failed/8 killed task attempts.
>
> 24min 10 sec for 5GB data with 24 mappers and 9 killed Task attempts.
> Cluster works good for small datasets.
--
Harsh J