Posted to general@hadoop.apache.org by himanshu chandola <hi...@yahoo.com> on 2009/09/17 23:08:05 UTC

hadoop hangs on reduce

Hi,
Has anyone seen hadoop getting stuck on reduces? 
I'm using a compiled version of hadoop from cloudera:

Hadoop 0.18.3-14.cloudera.CH0_3
Subversion  -r HEAD
Compiled by root on Mon Jul  6 15:02:31 EDT 2009

I have a MapReduce job, and Hadoop gets stuck at 96.49% reduce with no progress for the last hour. I've looked through the logs and there isn't anything interesting there. Here's the datanode log from the last hour:

>>>>
2009-09-17 14:51:47,033 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.42.1.1:50010, storageID=DS-1052225239-129.170.192.42-50010-1252112893659, infoPort=50075, ipcPort=50020):DataXceiver: java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.42.1.1:50010 remote=/10.42.255.247:41915]
    at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
    at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
    at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1938)
    at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:2032)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1159)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1087)
    at java.lang.Thread.run(Thread.java:619)

2009-09-17 14:52:29,916 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 40 blocks got processed in 7 msecs
2009-09-17 15:15:06,586 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_4472685249728744796_2145
2009-09-17 15:52:28,686 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 40 blocks got processed in 8 msecs
2009-09-17 16:52:30,397 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 40 blocks got processed in 8 msecs
>>>>
while the reduce has been stuck at 96.49% for the last hour:
09/09/17 15:49:49 INFO mapred.JobClient:  map 100% reduce 96%

This is the second time I've run this code, and both times it has hit a barrier in the reduce phase. My reduce step just aggregates the values for each key and dumps them into per-key files using MultipleTextOutputFormat. I've run a problem of similar size before, where the input to the reduce was the same size as this one.
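
For concreteness, the shape of my reduce step is roughly the sketch below
(0.18 mapred API). The class names and the way values are concatenated here
are made up for illustration, not my exact code, but the pattern is the same:
aggregate per key, then route each key to its own file via
MultipleTextOutputFormat:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    public class AggregatingReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

      public void reduce(Text key, Iterator<Text> values,
          OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        StringBuilder agg = new StringBuilder();
        while (values.hasNext()) {
          agg.append(values.next().toString()).append(' ');
          // Report liveness so a long aggregation isn't killed as a hung task.
          reporter.progress();
        }
        output.collect(key, new Text(agg.toString()));
      }

      // Routes each key's records to an output file named after the key.
      public static class PerKeyOutput
          extends MultipleTextOutputFormat<Text, Text> {
        protected String generateFileNameForKeyValue(Text key, Text value,
            String name) {
          return key.toString();
        }
      }
    }

The job conf just sets conf.setReducerClass(AggregatingReducer.class) and
conf.setOutputFormat(AggregatingReducer.PerKeyOutput.class).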

Any help would be greatly appreciated. I can't seem to find any reason why this is happening.

Thanks

Himanshu

 Morpheus: Do you believe in fate, Neo?
Neo: No.
Morpheus: Why Not?
Neo: Because I don't like the idea that I'm not in control of my life.



      

Re: hadoop hangs on reduce

Posted by Steve Loughran <st...@apache.org>.
himanshu chandola wrote:
> Just as an update: I made a dummy map job so that the map outputs a unique key for every input, and hence the input to the reduce is unique too. My reduce jobs still hang, at 76.02% now (I've added a few nodes to my cluster, so I suspect what was earlier 96.49% is now 76.02%). So this is definitely not a memory or I/O issue.
> 
> Do I restart my task trackers? (I've tried that once but it didn't help.)
> 

I see reduce hangs when the TTs can't talk to each other, i.e. when one 
tracker can't fetch map output from the other TTs.

Check the value of mapred.task.tracker.report.address: make sure it is on 
an external address (not 127.0.0.1) and that the port in use is open on 
all the machines.
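
for example, something like this in hadoop-site.xml (just a sketch; the 
default for that property is 127.0.0.1:0, and if your firewall needs a 
fixed port, put one in instead of 0):

    <property>
      <name>mapred.task.tracker.report.address</name>
      <value>0.0.0.0:0</value>
      <!-- 0.0.0.0 binds every interface; port 0 lets the TT pick a free
           port. Use a fixed port if you have to open it in a firewall. -->
    </property>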

-steve


Re: hadoop hangs on reduce

Posted by himanshu chandola <hi...@yahoo.com>.
Just as an update: I made a dummy map job so that the map outputs a unique key for every input, and hence the input to the reduce is unique too. My reduce jobs still hang, at 76.02% now (I've added a few nodes to my cluster, so I suspect what was earlier 96.49% is now 76.02%). So this is definitely not a memory or I/O issue.
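
The dummy map is essentially the following (a sketch on the 0.18 API; the 
class name is made up, and I use the framework-set mapred.task.id attempt 
id to keep keys unique across map tasks):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class UniqueKeyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private String taskId;

      public void configure(JobConf job) {
        // Each task attempt gets a unique id from the framework.
        taskId = job.get("mapred.task.id", "task");
      }

      public void map(LongWritable offset, Text line,
          OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // Attempt id + byte offset is unique across the whole job, so
        // every record reaches the reduce under its own key.
        output.collect(new Text(taskId + ":" + offset.get()), line);
      }
    }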

Do I restart my task trackers? (I've tried that once but it didn't help.)

thanks

 Morpheus: Do you believe in fate, Neo?
Neo: No.
Morpheus: Why Not?
Neo: Because I don't like the idea that I'm not in control of my life.


