Posted to common-user@hadoop.apache.org by Murali Krishna <mu...@yahoo-inc.com> on 2008/06/15 15:47:10 UTC

All datanodes getting marked as dead

Hi,

            I was running an M/R job on a 90+ node cluster. While the
job was running, all of the data nodes seem to have been marked as dead.
The only major error I saw in the name node log is 'java.io.IOException:
Too many open files'. The job may try to open thousands of files.

            After some time, there were a lot of exceptions saying 'could
only be replicated to 0 nodes, instead of 1'. So it looks like none of the
data nodes are responding now; the job has failed since it couldn't write.
I can see the following in the data node logs:

            2008-06-15 02:38:28,477 WARN org.apache.hadoop.dfs.DataNode:
java.net.SocketTimeoutException: timed out waiting for rpc response

        at org.apache.hadoop.ipc.Client.call(Client.java:484)

        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)

        at org.apache.hadoop.dfs.$Proxy0.sendHeartbeat(Unknown Source)

 

All processes (datanodes + namenode) are still running (the dfs health
status page shows all nodes as dead).

 

Some questions:

*         Is this kind of behavior expected when the name node runs out of
file handles?

*         Why are the data nodes not able to send the heartbeat (is it
related to the name node not having enough handles)?

*         What happens to the data in HDFS when all the data nodes
fail to send the heartbeat and the name node is in this state?

*         Is the solution just to increase the number of file handles
and restart the cluster?

 

Thanks,

Murali


Re: All datanodes getting marked as dead

Posted by Dhruba Borthakur <dh...@gmail.com>.
You are running out of file handles on the namenode.  When this
happens, the namenode cannot receive heartbeats from datanodes, because
those heartbeats arrive over TCP/IP socket connections and the namenode
does not have any free file descriptors left to accept them. Your data
is still safe on the datanodes. If you increase the number of handles
on the namenode, all datanodes will re-join the cluster and things
should be fine.
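One rough way to keep an eye on this (my own sketch, not something in
Hadoop itself; it assumes a Sun/OpenJDK JVM on a Unix-like box, and the
class name and 90% threshold are just for illustration) is to read the
JVM's open/max file descriptor counts from the OperatingSystemMXBean:

    import java.lang.management.ManagementFactory;
    import java.lang.management.OperatingSystemMXBean;
    import com.sun.management.UnixOperatingSystemMXBean;

    // Hypothetical helper: reports how close this JVM is to its file
    // descriptor limit. Run it inside (or alongside) the namenode JVM.
    public class FdWatcher {
        public static void main(String[] args) {
            OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
            if (os instanceof UnixOperatingSystemMXBean) {
                UnixOperatingSystemMXBean unixOs = (UnixOperatingSystemMXBean) os;
                long open = unixOs.getOpenFileDescriptorCount();
                long max = unixOs.getMaxFileDescriptorCount();
                System.out.println("open fds: " + open + " / " + max);
                if (open > 0.9 * max) {
                    System.err.println("WARNING: file descriptors nearly exhausted");
                }
            } else {
                System.out.println("fd counts not available on this platform/JVM");
            }
        }
    }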

what OS platform is the namenode running on?
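If it is Linux, a quick JVM-independent check (again just a sketch; it
relies on /proc, and the PID argument is only for illustration) is to
count the entries under /proc/<pid>/fd for the namenode process:

    import java.io.File;

    // Linux-only: counts the descriptors a process currently holds by
    // listing /proc/<pid>/fd. Pass the namenode PID, or nothing for "self".
    public class ProcFdCount {
        public static void main(String[] args) {
            String pid = args.length > 0 ? args[0] : "self";
            String[] fds = new File("/proc/" + pid + "/fd").list();
            if (fds == null) {
                System.err.println("cannot read /proc/" + pid + "/fd (not Linux, or no permission)");
            } else {
                System.out.println("process " + pid + " has " + fds.length + " open file descriptors");
            }
        }
    }

The limit itself is usually raised with ulimit -n for the user running
the namenode (or in /etc/security/limits.conf) before restarting it.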

thanks,
dhruba
