Posted to common-user@hadoop.apache.org by "Bharadia, Akshay" <ak...@amazon.com> on 2012/05/28 12:23:18 UTC

Help with DFSClient Exception.

Hi,

We are frequently observing the following exception on our cluster:
java.io.IOException: DFSClient_attempt_201205232329_28133_r_000002_0 could not complete file /output/tmp/test/_temporary/_attempt_201205232329_28133_r_000002_0/part-r-00002.  Giving up.
The exception occurs while writing a file.  We are using Hadoop 0.20.2 on a ~250-node cluster, and on average one machine goes down every three days.

Detailed stack trace:
12/05/27 23:26:54 INFO mapred.JobClient: Task Id : attempt_201205232329_28133_r_000002_0, Status : FAILED
java.io.IOException: DFSClient_attempt_201205232329_28133_r_000002_0 could not complete file /output/tmp/test/_temporary/_attempt_201205232329_28133_r_000002_0/part-r-00002.  Giving up.
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3331)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3240)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
        at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:106)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)


Our investigation:
We have the minimum replication factor set to 2.  As mentioned here (http://kazman.shidler.hawaii.edu/ArchDocDecomposition.html): "A call to complete() will not return true until all the file's blocks have been replicated the minimum number of times.  Thus, DataNode failures may cause a client to call complete() several times before succeeding."  So the client is expected to retry complete() several times.
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal() does call complete() and retries it up to 20 times, but even then the file's blocks are not replicated the minimum number of times.  The retry count is not configurable.  Changing the minimum replication factor to 1 is also not a good option, since jobs are continuously running on our cluster.
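For reference, here is a simplified sketch of the retry behaviour we are describing, based on our reading of the 0.20.2 source.  This is not the actual Hadoop code: CompleteCall below is a hypothetical stand-in for ClientProtocol.complete(src, clientName), and the sleep interval is only indicative; the 20-attempt limit is the hard-coded one in closeInternal().

    import java.io.IOException;

    // Sketch of how DFSOutputStream.closeInternal() polls complete():
    // the NameNode only reports true once every block of the file has
    // reached the minimum replication, and the client gives up after a
    // fixed number of attempts.
    public class CompleteRetrySketch {

      // Hypothetical stand-in for ClientProtocol.complete(src, clientName).
      interface CompleteCall {
        boolean complete() throws IOException;
      }

      static void waitForComplete(CompleteCall namenodeComplete, String src)
          throws IOException, InterruptedException {
        int retriesLeft = 20;          // hard-coded in the client, not configurable
        while (!namenodeComplete.complete()) {
          if (--retriesLeft == 0) {
            // Mirrors the "could not complete file ... Giving up." error we see.
            throw new IOException("could not complete file " + src + ".  Giving up.");
          }
          Thread.sleep(400);           // short pause before asking the NameNode again
        }
      }
    }

Since both the retry count and the delay are fixed in the client, the only knobs we can see are the minimum replication factor (dfs.replication.min on the NameNode, if we read the config correctly) and the reliability of the DataNodes themselves.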

Is there any solution or workaround for this problem?
What minimum replication factor is generally used in industry?

Let me know if any further inputs are required.

Thanks,
-Akshay