Posted to common-user@hadoop.apache.org by Stu Teasdale <st...@drogna.org.uk> on 2012/01/20 16:58:10 UTC
Issues with reduce writing results back to HDFS
Hi all,
I have a fairly simple Hadoop streaming map-reduce job which takes a
batch of 60-ish log files (each around 5-10MB gzip compressed) and
processes them on a cdh3u2-based cluster. The map stage of the job
finishes successfully and reasonably swiftly, but the reduce is taking
hours to complete. This appears to be caused by intermittent
failures to write progress from the reduce back to HDFS.
The reduce task log shows the following:
2012-01-20 15:39:54,148 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=6700000/6088353/0 in:549=6700000/12191 [rec/s] out:499=6088353/12191 [rec/s]
2012-01-20 15:39:55,110 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=6800000/6177133/0 in:557=6800000/12192 [rec/s] out:506=6177133/12192 [rec/s]
2012-01-20 15:48:10,928 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=6876659/6244505
2012-01-20 15:48:11,160 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=6900000/6264683/0 in:543=6900000/12688 [rec/s] out:493=6264683/12688 [rec/s]
2012-01-20 15:48:39,819 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_6780976120193953642_21228java.io.IOException: Bad response 1 for
block blk_6780976120193953642_21228 from datanode 10.0.7.33:50010
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2664)
2012-01-20 15:48:39,820 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_6780976120193953642_21228 bad datanode[1] 10.0.7.33:50010
2012-01-20 15:48:39,820 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_6780976120193953642_21228 in pipeline 10.0.7.31:50010, 10.0.7.33:50010, 10.0.7.18:50010,
10.0.7.30:50010: bad datanode 10.0.7.33:50010
2012-01-20 15:48:40,424 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=6941238/6298916
and on the datanode that is apparently failing:
2012-01-20 15:38:03,651 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 4855 blocks got processed in 185 msecs
2012-01-20 15:39:48,987 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-7412588566623594115_21213 java.io.EOFException: while trying to read
23494 bytes
2012-01-20 15:39:48,987 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_-7412588566623594115_21213 Interrupted.
2012-01-20 15:39:48,987 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_-7412588566623594115_21213 terminating
2012-01-20 15:39:48,987 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-7412588566623594115_21213 received exception java.io.EOFException: while trying to read 23494
bytes
2012-01-20 15:39:48,988 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.0.7.33:50010, storageID=DS-117577783-10.0.7.33-50010-1311781817523, infoPort=50075,
ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 23494 bytes
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:270)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
2012-01-20 15:39:55,810 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_6780976120193953642_21228 src: /10.0.7.31:38948 dest: /10.0.7.33:50010
2012-01-20 15:48:30,516 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-3723929898193965394_21231 src: /10.0.7.11:44630 dest: /10.0.7.33:50010
2012-01-20 15:48:39,820 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_6780976120193953642_21228 java.io.EOFException: while trying to read
65557 bytes
2012-01-20 15:48:39,820 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_6780976120193953642_21228 2 : Thread is interrupted.
2012-01-20 15:48:39,820 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_6780976120193953642_21228 terminating
2012-01-20 15:48:39,820 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_6780976120193953642_21228 received exception java.io.EOFException: while trying to read 65557
bytes
2012-01-20 15:48:39,820 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.0.7.33:50010, storageID=DS-117577783-10.0.7.33-50010-1311781817523, infoPort=50075,
ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 65557 bytes
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:270)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:314)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
So it looks to me as if the datanode never receives the data it
expects from the job node, eventually hits some sort of timeout, and
returns an error.
The machines are not under heavy network or I/O load, and these write
failures seem to occur intermittently across all the datanodes in the
(small) cluster as the job runs.
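For what it's worth, a quick way to see whether the failures really are spread evenly or actually cluster on one node is to count the "bad datanode" warnings per datanode address across the task logs. A minimal sketch (the userlogs path in the example invocation is an assumption — point it at wherever your tasktracker writes task attempt logs):

```shell
# Sketch: read DFSClient warnings on stdin and print a per-datanode
# count of "bad datanode" mentions, most-blamed node first.
count_bad_datanodes() {
  grep 'bad datanode' \
    | grep -o '[0-9][0-9.]*:50010' \
    | sort | uniq -c | sort -rn
}

# Example invocation (log path is illustrative):
# cat /var/log/hadoop/userlogs/*/syslog | count_bad_datanodes
```

If one address dominates the output, the problem is probably that machine rather than the cluster as a whole.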
Any idea what I've messed up?
Thanks,
Stuart
--
From the prompt of Stu Teasdale
"By the time they had diminished from 50 to 8, the other dwarves began
to suspect 'Hungry.'"
-- a Larson cartoon
Re: Issues with reduce writing results back to HDFS
Posted by Stu Teasdale <st...@drogna.org.uk>.
On Fri, Jan 20, 2012 at 03:58:10PM +0000, Stu Teasdale wrote:
> Hi all,
>
> I have a fairly simple Hadoop streaming map-reduce job which takes a
> batch of 60-ish log files (each around 5-10MB gzip compressed) and
> processes them on a cdh3u2-based cluster. The map stage of the job
> finishes successfully and reasonably swiftly, but the reduce is taking
> hours to complete. This appears to be caused by intermittent
> failures to write progress from the reduce back to HDFS.
Looks like this might be hardware/kernel related. There are issues with
the network driver/card on one of the machines in our cluster when it's
under heavy network load, and that machine seemed to be favoured for
running the reduce task.
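As a cheap sanity check for that theory, the per-interface error counters that the kernel exposes under sysfs are a quick first look. A sketch, assuming a reasonably modern Linux sysfs layout (the function takes the sysfs net directory as an argument so it can be pointed at a copy):

```shell
# Sketch: flag any network interface whose sysfs error counters are
# non-zero. Defaults to the live sysfs tree; pass another directory
# with the same layout to inspect a copied snapshot instead.
check_nic_errors() {
  sys_net=${1:-/sys/class/net}
  for dev in "$sys_net"/*; do
    [ -f "$dev/statistics/rx_errors" ] || continue
    rx=$(cat "$dev/statistics/rx_errors")
    tx=$(cat "$dev/statistics/tx_errors")
    if [ "$rx" -gt 0 ] || [ "$tx" -gt 0 ]; then
      echo "$(basename "$dev"): rx_errors=$rx tx_errors=$tx"
    fi
  done
}

check_nic_errors
```

A flaky driver under load usually also leaves complaints in dmesg, and `ethtool -S <iface>` exposes finer-grained counters (CRC errors, dropped frames) where the driver supports it.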
Stu
--
From the prompt of Stu Teasdale
I have more humility in my little finger than you have in your whole ____BODY!
-- from "Cerebus" #82