Posted to common-user@hadoop.apache.org by Todd Lipcon <to...@cloudera.com> on 2009/05/14 20:14:52 UTC

Re: Shorten interval between datanode going down and being detected as dead by namenode?

Hi Nesvarbu,

It sounds like your problem might be related to the following JIRA:

https://issues.apache.org/jira/browse/HADOOP-5713

Here's the relevant code from FSNamesystem.java:

    long heartbeatInterval = conf.getLong("dfs.heartbeat.interval", 3) * 1000;
    this.heartbeatRecheckInterval = conf.getInt(
        "heartbeat.recheck.interval", 5 * 60 * 1000); // 5 minutes
    this.heartbeatExpireInterval = 2 * heartbeatRecheckInterval +
      10 * heartbeatInterval;
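
For reference, plugging the defaults into that formula (a 3-second heartbeat
and a 5-minute recheck) gives:

    heartbeatExpireInterval = 2 * 300,000 ms + 10 * 3,000 ms
                            = 630,000 ms (about 10.5 minutes)

which matches the roughly 10 minutes you're seeing before the namenode marks
the datanode dead.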

It looks like you specified dfs.heartbeat.recheck.interval, but the code reads
heartbeat.recheck.interval (no "dfs." prefix), so your override had no effect.
This naming inconsistency is unfortunate :(
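
Something like this should do the trick (a sketch only, I haven't tested these
exact numbers; the value is in milliseconds and purely illustrative):

    <property>
      <name>heartbeat.recheck.interval</name>
      <!-- milliseconds; illustrative value -->
      <value>15000</value>
    </property>

Combined with your 1-second dfs.heartbeat.interval, that would bring the
expiry down to 2 * 15,000 + 10 * 1,000 = 40,000 ms, i.e. about 40 seconds.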

-Todd

On Fri, May 8, 2009 at 2:13 PM, nesvarbu No <ne...@gmail.com> wrote:

> Hi All,
>
> I've been testing HDFS with a 3-datanode cluster, and I've noticed that if I
> stop 1 datanode I can still read all the files, but the "hadoop dfs
> -copyFromLocal" command fails. In the namenode web interface I can see that
> it still thinks the datanode is alive; it only detects that it's dead after
> about 10 minutes. After reading the list archives I've tried modifying the
> heartbeat intervals using these options:
>
> <property>
>  <name>dfs.heartbeat.interval</name>
>  <value>1</value>
>  <description>Determines datanode heartbeat interval in
> seconds.</description>
> </property>
>
> <property>
>  <name>dfs.heartbeat.recheck.interval</name>
>  <value>1</value>
>  <description>Determines the datanode heartbeat recheck
> interval.</description>
> </property>
>
> <property>
>  <name>dfs.namenode.decommission.interval</name>
>  <value>1</value>
>  <description>Determines the namenode decommission check
> interval.</description>
> </property>
>
> It still takes 10 minutes to detect the dead node. Is there a way to shorten
> this interval? (I thought that if I set the replication factor to 2 and have
> 3 nodes (basically one spare), writes wouldn't fail, but they still do.)
>