Posted to user@hadoop.apache.org by alex bohr <al...@gmail.com> on 2013/10/25 20:17:27 UTC

DFSClient: Could not complete write history logs

Hi,
I've suddenly been having the JobTracker freeze up every couple hours when
it goes into a loop trying to write Job history files.

I get the error in various jobs, but it's always while writing the
"_logs/history" files.

I'm running MRv1: Hadoop 2.0.0-cdh4.4.0

Here's a sample error:
"2013-10-25 01:59:54,445 INFO org.apache.hadoop.hdfs.DFSClient: Could not
complete
/user/etl/pipeline/stage02/b0c6fc02-1729-4a57-8799-553f4dd789a4/_logs/history/job_201310242314_0013_1382663618303_gxetl_GX-ETL.Bucketer
retrying.."

I have to stop and restart the jobtracker and then it happens again, and
the intervals between errors have been getting shorter.

I see this ticket:
https://issues.apache.org/jira/browse/HDFS-1059
But I ran fsck and the report says 0 corrupt and 0 under-replicated blocks.

I also found this thread:
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201110.mbox/%3CCAF8-MNf7P_Kr8SNHBng1cDJ70vGET58_V+JNMA21OWymrc1aVA@mail.gmail.com%3E

I'm not familiar with the different IO schedulers, so before I change this
on all our datanodes: *does anyone recommend using deadline instead of
CFQ?*
We are using the ext4 file system on our datanodes, which have 24 drives (we
checked for bad drives, found one that wasn't responding, and pulled
it from the config for that machine, but the errors keep happening).
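
For anyone who wants to check which scheduler a datanode is currently using
before changing anything: on Linux, each block device reports it in
/sys/block/<device>/queue/scheduler, with the active scheduler shown in square
brackets (for example "noop deadline [cfq]"). Below is a minimal Python sketch,
not something from the original setup; the function names are hypothetical, and
it only reads the current setting without modifying it.

```python
import glob
import os
import re


def active_scheduler(line):
    """Return the scheduler named in [brackets], or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", line)
    return match.group(1) if match else None


def report_schedulers(pattern="/sys/block/*/queue/scheduler"):
    """Map each block device name to its active IO scheduler."""
    result = {}
    for path in sorted(glob.glob(pattern)):
        # .../<device>/queue/scheduler -> <device>
        device = os.path.basename(os.path.dirname(os.path.dirname(path)))
        with open(path) as handle:
            result[device] = active_scheduler(handle.read())
    return result


if __name__ == "__main__":
    for device, scheduler in report_schedulers().items():
        print(device, scheduler)
```

Running it on each datanode should show whether all 24 drives are on the same
scheduler before and after any change.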

Any other advice on addressing this infinite loop, beyond the IO scheduler,
is much appreciated.
Thanks,
Alex

Re: DFSClient: Could not complete write history logs

Posted by alex bohr <al...@gmail.com>.

I should add that we recently changed some mapred-site properties on the
JobTracker to reduce how much job history it keeps in memory.
Is it possible these settings are too aggressive and the JobTracker is
removing old jobs from memory while it's trying to write the status of a
running job?

Here are the properties we recently changed:
    <property>
      <name>mapred.job.tracker.retiredjobs.cache.size</name>
      <value>100</value>
    </property>
    <property>
      <name>mapreduce.job.user.name</name>
      <value>hdfs</value>
    </property>
    <property>
      <name>mapred.jobtracker.completeuserjobs.maximum</name>
      <value>25</value>
    </property>
    <property>
      <name>mapred.jobtracker.retirejob.interval</name>
      <value>86400000</value>
    </property>
    <property>
      <name>mapred.jobtracker.retirejob.check</name>
      <value>3600000</value>
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>10.4.41.207:9001</value>
    </property>
</configuration>


I've used "mapred.job.tracker.retiredjobs.cache.size" previously and I'm
very certain it was originally responsible for preventing weekly crashes of
the JobTracker, but we introduced the other settings for the first time.
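
The two retire-job values above are easier to sanity-check in hours. Here is a
minimal Python sketch, assuming both properties are expressed in milliseconds
(that is my understanding for MRv1, but it is worth confirming against the
mapred-default.xml of your CDH version). The fragment wraps the two posted
properties in a <configuration> root so it parses, and the helper names are
hypothetical.

```python
import xml.etree.ElementTree as ET

# Two of the properties from the message, wrapped in a root element.
FRAGMENT = """
<configuration>
  <property>
    <name>mapred.jobtracker.retirejob.interval</name>
    <value>86400000</value>
  </property>
  <property>
    <name>mapred.jobtracker.retirejob.check</name>
    <value>3600000</value>
  </property>
</configuration>
"""

MS_PER_HOUR = 60 * 60 * 1000


def load_properties(xml_text):
    """Return a {name: value} dict from a Hadoop-style configuration."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


def ms_to_hours(value):
    """Convert a millisecond value (string or int) to hours."""
    return int(value) / MS_PER_HOUR


props = load_properties(FRAGMENT)
for name, value in props.items():
    print(f"{name} = {value} ms = {ms_to_hours(value):g} h")
```

Under that assumption, the retire interval works out to 24 hours and the check
interval to 1 hour.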

Thanks



