Posted to common-user@hadoop.apache.org by nesvarbu No <ne...@gmail.com> on 2009/05/08 23:13:02 UTC

Shorten interval between datanode going down and being detected as dead by namenode?

Hi All,

I've been testing HDFS on a 3-datanode cluster, and I've noticed that if I
stop 1 datanode I can still read all the files, but the "hadoop dfs
-copyFromLocal" command fails. In the namenode web interface I can see that
it still lists that datanode as alive, and only detects that it's dead
after about 10 minutes. After reading the list archives I tried shortening
the heartbeat intervals with these options:

<property>
  <name>dfs.heartbeat.interval</name>
  <value>1</value>
  <description>Determines datanode heartbeat interval in
seconds.</description>
</property>

<property>
  <name>dfs.heartbeat.recheck.interval</name>
  <value>1</value>
  <description>Determines how often the namenode rechecks for
expired datanode heartbeats.</description>
</property>

<property>
  <name>dfs.namenode.decommission.interval</name>
  <value>1</value>
  <description>Determines how often the namenode checks
decommission progress, in seconds.</description>
</property>

The namenode still takes about 10 minutes to detect the dead node. Is there
a way to shorten this interval? (I thought that with replication set to 2
and 3 nodes, i.e. one spare, writes wouldn't fail, but they still do.)

Re: Shorten interval between datanode going down and being detected as dead by namenode?

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Nesvarbu,

It sounds like your problem might be related to the following JIRA:

https://issues.apache.org/jira/browse/HADOOP-5713

Here's the relevant code from FSNamesystem.java:

    long heartbeatInterval = conf.getLong("dfs.heartbeat.interval", 3) * 1000;
    this.heartbeatRecheckInterval = conf.getInt(
        "heartbeat.recheck.interval", 5 * 60 * 1000); // 5 minutes
    this.heartbeatExpireInterval = 2 * heartbeatRecheckInterval +
      10 * heartbeatInterval;
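
For what it's worth, plugging the defaults into that formula shows where
your ~10 minutes comes from. A minimal sketch of the same arithmetic (not
Hadoop code, just the computation):

    // Reproduces the FSNamesystem expiry arithmetic with the default values.
    public class HeartbeatExpiry {
        public static void main(String[] args) {
            long heartbeatInterval = 3L * 1000;            // dfs.heartbeat.interval: 3 s
            long heartbeatRecheckInterval = 5 * 60 * 1000; // heartbeat.recheck.interval: 5 min
            long heartbeatExpireInterval =
                2 * heartbeatRecheckInterval + 10 * heartbeatInterval;
            // Prints 630000 ms, i.e. 630 seconds = 10.5 minutes.
            System.out.println(heartbeatExpireInterval + " ms");
        }
    }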

It looks like you specified dfs.heartbeat.recheck.interval instead of
heartbeat.recheck.interval. This inconsistency is unfortunate :(
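
Assuming that's the issue, setting the key under the name the code actually
reads should shorten the detection window. The value here is illustrative
(per the snippet above, the recheck interval is read in milliseconds):

    <property>
      <name>heartbeat.recheck.interval</name>
      <value>15000</value>
    </property>

Combined with your dfs.heartbeat.interval of 1, the formula gives
2 * 15000 + 10 * 1000 = 40000 ms, so the namenode should mark the datanode
dead after roughly 40 seconds.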

-Todd
