Posted to common-user@hadoop.apache.org by jason hadoop <ja...@gmail.com> on 2009/02/02 02:58:05 UTC

Re: problem with completion notification from block movement

The DataNodes use multiple threads with locking, and one of the assumptions
is that the block report (once per hour by default) takes little time. The
DataNode pauses while the block report is running, and if the report happens
to take a while, weird things start to happen.
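
If the hourly report itself is the trigger, the interval is tunable; from
memory the knob is dfs.blockreport.intervalMsec in hadoop-site.xml (treat the
name and default as from memory and double-check against your
hadoop-default.xml), e.g.:

<property>
  <!-- How often each DataNode sends a full block report to the NameNode.
       Default is 3600000 ms (one hour); stretching it means fewer pauses,
       at the cost of the NameNode's block map going staler. -->
  <name>dfs.blockreport.intervalMsec</name>
  <value>7200000</value>
</property>

That just spreads the pain out, though; it doesn't make the report itself any
cheaper.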


On Fri, Jan 30, 2009 at 8:59 AM, Karl Kleinpaste <ka...@conviva.com> wrote:

> We have a small test cluster, a double master (NameNode+JobTracker) plus
> 2 slaves, running 0.18.1.  We are seeing an intermittent problem where
> our application logs failures out of DFSClient, thus:
>
> 2009-01-30 01:59:42,072 WARN org.apache.hadoop.dfs.DFSClient:
> DFSOutputStream ResponseProcessor exception for block
> blk_7603130349014268849_2349933
> java.net.SocketTimeoutException: 66000 millis timeout while waiting for
> channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.0.10.102:54700
> remote=/10.0.10.108:50010]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:162)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
>         at java.io.DataInputStream.readFully(DataInputStream.java:176)
>         at java.io.DataInputStream.readLong(DataInputStream.java:380)
>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2044)
>
> (Apologies for paste formatting.  I hate Evolution.)
>
> Our application here is our "JobConsole," which is responsible for
> taking notifications from an external data-generating application: The
> external app scribbles files into DFS and then tells JobConsole about
> them.  JobConsole submits jobs to crunch that data in response to the
> external app's notifications of data availability.  JobConsole runs on
> the master node.
>
> Chasing that block identifier through our JobConsole log plus the
> DataNode logs on the slaves, we have an odd timeline, which is this:
> 01:58:32        slave (.108, above): receiving blk from master (.102)
> 01:58:35        other slave (.107): receiving blk from .108
> 01:58:36        .107: received blk
> 01:58:38        .107: terminate PacketResponder
> 01:59:42        JobConsole (.102): 66s t.o. + Error Recovery (above)
> 01:59:42        .107: invoke recoverBlock on that blk
> 02:01:15        .108: received blk + terminate PacketResponder
> 03:03:24        .108: deleting blk, from Linux pathname in DFS storage
>
> What's clear from this is that .108 got the block quickly, because it
> was immediately in a position to send a copy to .107, which responded
> promptly enough to confirm that it was in possession.  But .108's
> DataNode sat on the block for nearly 3 minutes before announcing what
> appears to have been ordinary completion and responder termination.
> After the first minute-plus of that long period, JobConsole gave up and
> did a recovery operation, which appears to have worked.  If .108's
> DataNode sent a notification when it finally logged completed reception,
> no doubt there was nobody listening for it any more.
>
> What's particularly of interest to us is that the NameNode log shows us
> that the data being moved is job.jar:
>
> 2009-01-30 01:58:32,353 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> NameSystem.allocateBlock:
> /usr/local/rinera/hadoop/hadoop-runtime/0/mapred/system/job_200901291752_3021/job.jar.
> blk_7603130349014268849_2349933
>
> Note block name and timestamp.
>
> Does anyone else have knowledge of or history with such glitches?  We've
> recently begun seeing a number of problems in communication between task
> management processes and DFS that we had not seen before, and we're
> trying, so far without success, to nail down where they're coming from.
>
>

Re: problem with completion notification from block movement

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
Karl Kleinpaste wrote:
> On Sun, 2009-02-01 at 17:58 -0800, jason hadoop wrote:
>> The DataNodes use multiple threads with locking, and one of the
>> assumptions is that the block report (once per hour by default) takes
>> little time. The DataNode pauses while the block report is running,
>> and if the report happens to take a while, weird things start to happen.
> 
> Thank you for responding, this is very informative for us.
> 
> Having looked through the source code with a co-worker regarding the
> periodic scan, and then checking the logs once again, we find reports
> of this sort:
> 
> BlockReport of 1158499 blocks got processed in 308860 msecs
> BlockReport of 1159840 blocks got processed in 237925 msecs
> BlockReport of 1161274 blocks got processed in 177853 msecs
> BlockReport of 1162408 blocks got processed in 285094 msecs
> BlockReport of 1164194 blocks got processed in 184478 msecs
> BlockReport of 1165673 blocks got processed in 226401 msecs
> 
> The 3rd of these exactly straddles the particular example timeline I
> discussed in my original email about this question.  I suspect I'll find
> more of the same as I look through other related errors.

You could ask for a "complete fix" in
https://issues.apache.org/jira/browse/HADOOP-4584 . I don't think the
current patch there fixes your problem.

Raghu.

> --karl
> 


Re: problem with completion notification from block movement

Posted by Karl Kleinpaste <ka...@conviva.com>.
On Mon, 2009-02-02 at 20:06 -0800, jason hadoop wrote:
> This can be made significantly worse by your underlying host file
> system and the disks that support it.

Oh, yes, we know...  It was a late-realized mistake just yesterday that
we weren't using noatime on that cluster's slaves.

The attached graph is instructive.  We have nightly-rotated DataNode logs
all the way back to when this test cluster was created in November.  This
morning, on one node, I sampled the first 10 BlockReport scan lines from
each day's log, up through the current hour today, and handed them to
gnuplot.  The seriously erratic behavior that begins around the 900K-1M
block point is very disturbing.
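
In case anyone wants to reproduce the plot, it was roughly the following,
modulo the exact log file names and the awk field arithmetic for our log
format:

# pull (block count, elapsed msecs) out of each day's rotated DataNode log
for f in hadoop-*-datanode-*.log.*; do
  grep -h -m 10 "BlockReport of" "$f"
done | awk '{print $(NF-6), $(NF-1)}' > blockreports.dat

gnuplot <<'EOF'
set terminal png
set output 'blockreports.png'
plot 'blockreports.dat' using 1:2 with points title 'block report msecs vs blocks'
EOF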

Immediate solutions for us include noatime, nodiratime, a BIOS upgrade for
the disks, and eliminating enough small files (blocks) in DFS to get the
total block count below 400K.
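
For keeping an eye on that total while we prune, something like this ought
to do, assuming I'm remembering the fsck summary format correctly:

hadoop fsck / 2>/dev/null | grep -E 'Total (files|blocks)'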

Re: problem with completion notification from block movement

Posted by jason hadoop <ja...@gmail.com>.
This can be made significantly worse by your underlying host file system and
the disks that support it.
Disabling atime updates via noatime should buy you an immediate 10% gain on
the block report time.
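
Something along these lines, untested, with /data1 and /data2 standing in
for wherever your DataNode partitions are mounted:

# remount live with access-time updates off
mount -o remount,noatime,nodiratime /data1
mount -o remount,noatime,nodiratime /data2

# and make it stick across reboots in /etc/fstab, e.g.
#   /dev/sdb1  /data1  ext3  defaults,noatime,nodiratime  0 2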
Not using a RAID 5 controller under the covers should buy you a chunk too.

I haven't tried it, but ext3 with the filetype and dir_index features might
be faster.
Disable the journals on your file systems, or put the journal on a different
device.
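
On ext3 that would look roughly like the following (unmount the filesystem
first, and note that tune2fs spells the features filetype and dir_index; the
devices here are only placeholders):

umount /data1

# hashed directory indexes plus the filetype feature
tune2fs -O dir_index,filetype /dev/sdb1
# rebuild existing directories so they actually get indexed
e2fsck -fD /dev/sdb1

# either drop the journal entirely (effectively ext2)...
tune2fs -O ^has_journal /dev/sdb1
# ...or recreate it on a separate device instead:
#   mke2fs -O journal_dev /dev/sdc1
#   tune2fs -J device=/dev/sdc1 /dev/sdb1

mount /data1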

The goal of the above set of operations is to try to make it faster for an
operation to get file system meta data for each file in a large directory.

Using the deadline IO scheduler might help, or otherwise tuning the OS level
access to prioritize small reads over large writes.
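
For the scheduler, something like this, with sdb standing in for whatever
device holds the block storage:

# see what's available and what's current
cat /sys/block/sdb/queue/scheduler
# switch to deadline at runtime
echo deadline > /sys/block/sdb/queue/scheduler
# or set it for every disk at boot with the kernel parameter elevator=deadline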

noatime reduces the number of writes generated by the scan;
avoiding RAID 5 reduces the number of reads needed for ancillary write
operations.

In general, the collected wisdom says to use multiple individual drives for
the block storage, with a comma-separated list for the dfs.data.dir
parameter, where each entry in the list is on a separate drive that
preferably does only DataNode service, e.g.:

<property>
  <name>dfs.data.dir</name>
  <value>/drive1,/drive2,/drive3,/drive4</value>
</property>

Before I left Attributor, there was a thought of running a continuous find
over the dfs.data.dir directories to try to force the kernel to keep the
inodes in memory, but I think they abandoned that strategy.
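
The idea was just a loop along these lines, with the paths being whatever
your dfs.data.dir entries are:

# walk the block directories periodically so their inodes and dentries
# stay warm in the kernel caches
while true; do
  find /drive1 /drive2 /drive3 /drive4 -type f > /dev/null 2>&1
  sleep 300
done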



On Mon, Feb 2, 2009 at 10:23 AM, Karl Kleinpaste <ka...@conviva.com> wrote:

> On Sun, 2009-02-01 at 17:58 -0800, jason hadoop wrote:
> > The DataNodes use multiple threads with locking, and one of the
> > assumptions is that the block report (once per hour by default) takes
> > little time. The DataNode pauses while the block report is running,
> > and if the report happens to take a while, weird things start to happen.
>
> Thank you for responding, this is very informative for us.
>
> Having looked through the source code with a co-worker regarding the
> periodic scan, and then checking the logs once again, we find reports
> of this sort:
>
> BlockReport of 1158499 blocks got processed in 308860 msecs
> BlockReport of 1159840 blocks got processed in 237925 msecs
> BlockReport of 1161274 blocks got processed in 177853 msecs
> BlockReport of 1162408 blocks got processed in 285094 msecs
> BlockReport of 1164194 blocks got processed in 184478 msecs
> BlockReport of 1165673 blocks got processed in 226401 msecs
>
> The 3rd of these exactly straddles the particular example timeline I
> discussed in my original email about this question.  I suspect I'll find
> more of the same as I look through other related errors.
>
> --karl
>
>

Re: problem with completion notification from block movement

Posted by Karl Kleinpaste <ka...@conviva.com>.
On Sun, 2009-02-01 at 17:58 -0800, jason hadoop wrote:
> The DataNodes use multiple threads with locking, and one of the
> assumptions is that the block report (once per hour by default) takes
> little time. The DataNode pauses while the block report is running,
> and if the report happens to take a while, weird things start to happen.

Thank you for responding, this is very informative for us.

Having looked through the source code with a co-worker regarding the
periodic scan, and then checking the logs once again, we find reports
of this sort:

BlockReport of 1158499 blocks got processed in 308860 msecs
BlockReport of 1159840 blocks got processed in 237925 msecs
BlockReport of 1161274 blocks got processed in 177853 msecs
BlockReport of 1162408 blocks got processed in 285094 msecs
BlockReport of 1164194 blocks got processed in 184478 msecs
BlockReport of 1165673 blocks got processed in 226401 msecs

The 3rd of these exactly straddles the particular example timeline I
discussed in my original email about this question.  I suspect I'll find
more of the same as I look through other related errors.

--karl