Posted to common-user@hadoop.apache.org by Bobby Dennett <bd...@gmail.com> on 2010/10/15 21:36:55 UTC

After adding nodes to 0.20.2 cluster, getting "Could not complete file" errors and hung JobTracker

Hi all,

We are currently in the process of replacing the servers in our Hadoop
0.20.2 production cluster and in the last couple of days have
experienced an error similar to the following (from the JobTracker
log) several times, which then appears to hang the JobTracker:

2010-10-15 04:13:38,980 INFO org.apache.hadoop.mapred.JobInProgress:
Job job_201010140844_0510 has completed successfully.
2010-10-15 04:13:44,192 INFO org.apache.hadoop.hdfs.DFSClient: Could
not complete file
/user/kaduindexer-18509/us/201010150300/dealdocid_pre_merged_1/_logs/history/phx-phadoop34_1287060250080_job_201010140844_0510_se_DocID_Merge_1_201010150300
retrying...
2010-10-15 04:13:44,592 INFO org.apache.hadoop.hdfs.DFSClient: Could
not complete file
/user/kaduindexer-18509/us/201010150300/dealdocid_pre_merged_1/_logs/history/phx-phadoop34_1287060250080_job_201010140844_0510_se_DocID_Merge_1_201010150300
retrying...
2010-10-15 04:13:44,993 INFO org.apache.hadoop.hdfs.DFSClient: Could
not complete file
/user/kaduindexer-18509/us/201010150300/dealdocid_pre_merged_1/_logs/history/phx-phadoop34_1287060250080_job_201010140844_0510_se_DocID_Merge_1_201010150300
retrying...
2010-10-15 04:13:45,393 INFO org.apache.hadoop.hdfs.DFSClient: Could
not complete file
/user/kaduindexer-18509/us/201010150300/dealdocid_pre_merged_1/_logs/history/phx-phadoop34_1287060250080_job_201010140844_0510_se_DocID_Merge_1_201010150300
retrying...
2010-10-15 04:13:45,794 INFO org.apache.hadoop.hdfs.DFSClient: Could
not complete file
/user/kaduindexer-18509/us/201010150300/dealdocid_pre_merged_1/_logs/history/phx-phadoop34_1287060250080_job_201010140844_0510_se_DocID_Merge_1_201010150300
retrying...

We had not seen an issue like this until we added 6 new nodes to our
existing 65-node cluster. The only other recent configuration change
was to set up include/exclude files for DFS and MapReduce to "enable"
Hadoop's node decommissioning functionality (sketched below).

Once we encounter this issue (which has happened twice in the last 24
hours), we end up needing to restart the MapReduce processes, which we
cannot do on a frequent basis. After the last occurrence, I increased
mapred.job.tracker.handler.count to 60 and am waiting to see whether
it has any impact.
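For reference, that change amounts to the following in mapred-site.xml
(the 0.20 default for this property is 10, and a JobTracker restart is
needed for it to take effect):

  <!-- mapred-site.xml: number of RPC handler threads in the JobTracker -->
  <property>
    <name>mapred.job.tracker.handler.count</name>
    <value>60</value>
  </property>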

Has anyone else seen this behavior before? Are there any
recommendations for trying to prevent this from happening in the
future?

Thanks in advance,
-Bobby