Posted to common-user@hadoop.apache.org by Stu Teasdale <st...@drogna.org.uk> on 2012/01/20 16:58:10 UTC

Issues with reduce writing results back to HDFS

Hi all,

I have a fairly simple Hadoop streaming map-reduce job which takes a 
batch of 60-ish log files (each around 5-10MB gzip compressed) and 
processes them on a cdh3u2-based cluster. The map stage of the job 
finishes successfully and reasonably swiftly, but the reduce is taking 
hours to complete. This seems to be caused by intermittent 
failures writing the reduce output back to HDFS.
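
For context, the reducer is a short script of roughly this shape (a 
minimal sketch, not the real code; the "key<TAB>count" record layout 
is an assumption). One thing worth knowing with long-running streaming 
reduces: lines written to stderr in the reporter: format are picked up 
by the framework as status updates, which also register as progress:

#!/usr/bin/env python
# Hypothetical stand-in for the real reducer: sums counts per key
# from sorted "key\tcount" lines on stdin.
import sys

def report(msg):
    # Hadoop streaming treats stderr lines of the form
    # "reporter:status:<message>" as task status updates, which
    # also count as progress with the tasktracker.
    sys.stderr.write("reporter:status:%s\n" % msg)

current_key, total = None, 0
for n, line in enumerate(sys.stdin):
    key, _, value = line.rstrip("\n").partition("\t")
    if key != current_key:
        if current_key is not None:
            sys.stdout.write("%s\t%d\n" % (current_key, total))
        current_key, total = key, 0
    total += int(value)
    if n % 100000 == 0:
        report("processed %d records" % n)
if current_key is not None:
    sys.stdout.write("%s\t%d\n" % (current_key, total))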

The reduce task log shows the following:

2012-01-20 15:39:54,148 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=6700000/6088353/0 in:549=6700000/12191 [rec/s] out:499=6088353/12191 [rec/s]
2012-01-20 15:39:55,110 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=6800000/6177133/0 in:557=6800000/12192 [rec/s] out:506=6177133/12192 [rec/s]
2012-01-20 15:48:10,928 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=6876659/6244505
2012-01-20 15:48:11,160 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=6900000/6264683/0 in:543=6900000/12688 [rec/s] out:493=6264683/12688 [rec/s]
2012-01-20 15:48:39,819 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_6780976120193953642_21228 java.io.IOException: Bad response 1 for block blk_6780976120193953642_21228 from datanode 10.0.7.33:50010
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2664)

2012-01-20 15:48:39,820 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_6780976120193953642_21228 bad datanode[1] 10.0.7.33:50010
2012-01-20 15:48:39,820 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_6780976120193953642_21228 in pipeline 10.0.7.31:50010, 10.0.7.33:50010, 10.0.7.18:50010, 10.0.7.30:50010: bad datanode 10.0.7.33:50010
2012-01-20 15:48:40,424 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=6941238/6298916

and on the datanode that is apparently failing:

2012-01-20 15:38:03,651 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 4855 blocks got processed in 185 msecs
2012-01-20 15:39:48,987 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-7412588566623594115_21213 java.io.EOFException: while trying to read 23494 bytes
2012-01-20 15:39:48,987 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_-7412588566623594115_21213 Interrupted.
2012-01-20 15:39:48,987 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_-7412588566623594115_21213 terminating
2012-01-20 15:39:48,987 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-7412588566623594115_21213 received exception java.io.EOFException: while trying to read 23494 bytes
2012-01-20 15:39:48,988 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.0.7.33:50010, storageID=DS-117577783-10.0.7.33-50010-1311781817523, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 23494 bytes
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:270)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
2012-01-20 15:39:55,810 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_6780976120193953642_21228 src: /10.0.7.31:38948 dest: /10.0.7.33:50010
2012-01-20 15:48:30,516 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-3723929898193965394_21231 src: /10.0.7.11:44630 dest: /10.0.7.33:50010
2012-01-20 15:48:39,820 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_6780976120193953642_21228 java.io.EOFException: while trying to read 65557 bytes
2012-01-20 15:48:39,820 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_6780976120193953642_21228 2 : Thread is interrupted.
2012-01-20 15:48:39,820 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_6780976120193953642_21228 terminating
2012-01-20 15:48:39,820 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_6780976120193953642_21228 received exception java.io.EOFException: while trying to read 65557 bytes
2012-01-20 15:48:39,820 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.0.7.33:50010, storageID=DS-117577783-10.0.7.33-50010-1311781817523, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 65557 bytes
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:270)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:314)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)

So it looks to me as if the datanode never receives the data it is 
expecting from the node running the reduce task, eventually hits some 
sort of timeout, and returns an error.

The machines are not under heavy network or I/O load, and these write 
failures seem to occur intermittently on all the datanodes in the 
(small) cluster as the job runs.
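
In case it's useful, a quick way to check whether the failures really 
are spread evenly is to tally the DFSClient "bad datanode" warnings 
across the reduce attempts' logs. A rough sketch (the log paths are 
whatever your tasktrackers keep; the regex matches the warning format 
above):

#!/usr/bin/env python
# Count pipeline-recovery failures per datanode by scanning task
# logs for the DFSClient "bad datanode <ip:port>" warnings.
import re
import sys
from collections import defaultdict

# Matches e.g. "... bad datanode[1] 10.0.7.33:50010" and
# "... : bad datanode 10.0.7.33:50010"
BAD = re.compile(r"bad datanode(?:\[\d+\])? (\d+\.\d+\.\d+\.\d+:\d+)")

counts = defaultdict(int)
for path in sys.argv[1:]:  # e.g. syslog files of the reduce attempts
    with open(path) as f:
        for line in f:
            m = BAD.search(line)
            if m:
                counts[m.group(1)] += 1

for node, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    sys.stdout.write("%s\t%d\n" % (node, n))

If one address keeps topping the list, that machine is the first 
suspect.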

Any idea what I've messed up?

Thanks,

Stuart
-- 
From the prompt of Stu Teasdale

"By the time they had diminished from 50 to 8, the other dwarves began
to suspect "Hungry."
		-- a Larson cartoon

Re: Issues with reduce writing results back to HDFS

Posted by Stu Teasdale <st...@drogna.org.uk>.
On Fri, Jan 20, 2012 at 03:58:10PM +0000, Stu Teasdale wrote:
> Hi all,
> 
> I have a fairly simple Hadoop streaming map-reduce job which takes a 
> batch of 60-ish log files (each around 5-10MB gzip compressed) and 
> processes them on a cdh3u2-based cluster. The map stage of the job 
> finishes successfully and reasonably swiftly, but the reduce is taking 
> hours to complete. This seems to be caused by intermittent 
> failures writing the reduce output back to HDFS.

Looks like this might be hardware/kernel related. There are issues with 
the network driver/card on one of the machines in our cluster when it's 
under heavy network load, and that machine seemed to be favoured to run 
the reduce task.
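
For anyone who hits the same symptoms: watching the per-interface 
error counters on each node while the job runs would have flagged this 
sooner. A rough Linux-only sketch, reading /proc/net/dev:

#!/usr/bin/env python
# Print receive/transmit error and drop counters for each network
# interface by parsing /proc/net/dev (Linux only).
import sys

with open("/proc/net/dev") as f:
    lines = f.readlines()[2:]  # skip the two header lines

sys.stdout.write("iface      rx_errs  rx_drop  tx_errs  tx_drop\n")
for line in lines:
    iface, _, counters = line.partition(":")
    fields = counters.split()
    # /proc/net/dev receive columns: bytes, packets, errs, drop,
    # fifo, frame, compressed, multicast; then transmit: bytes,
    # packets, errs, drop, fifo, colls, carrier, compressed.
    rx_errs, rx_drop = fields[2], fields[3]
    tx_errs, tx_drop = fields[10], fields[11]
    sys.stdout.write("%-10s %8s %8s %8s %8s\n" %
                     (iface.strip(), rx_errs, rx_drop, tx_errs, tx_drop))

Counters that climb only under heavy load point at the driver/card 
rather than the Hadoop side.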

Stu
-- 
From the prompt of Stu Teasdale

I have more humility in my little finger than you have in your whole ____BODY!
		-- from "Cerebus" #82