Posted to common-user@hadoop.apache.org by Karl Kleinpaste <ka...@conviva.com> on 2008/12/10 17:38:10 UTC

question: NameNode hanging on startup as it intends to leave safe mode

We have a cluster of 21 nodes with a total capacity of about 55T, and
twice in the last couple of weeks we have had a problem on NameNode
startup.  We are running 0.18.1.  DFS is currently just under half
full, at about 25T.

The symptom is that NameNode logs a normal startup: it analyzes its
expected DFS content, reports the number of files known, and begins to
accept reports from the slaves' DataNodes about the blocks they hold.
During this time, NameNode is in safe mode pending adequate block
discovery from the slaves.  As the fraction of reported blocks rises,
it eventually hits the required 0.9990 threshold and announces that it
will leave safe mode in 30 seconds.
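
For readers unfamiliar with the threshold described above, here is a
minimal sketch of the ratio check. It is not Hadoop's actual code: the
class and method names are hypothetical, and the block counts are made
up for illustration. The 0.9990 threshold and the 30-second delay
appear to correspond to the dfs.safemode.threshold.pct and
dfs.safemode.extension settings.

```java
public class SafeModeThreshold {
    // Hypothetical helper, not taken from the NameNode source: returns true
    // once the fraction of reported blocks reaches the configured threshold.
    static boolean thresholdReached(long reportedBlocks, long totalBlocks, double threshold) {
        if (totalBlocks == 0) {
            return true; // empty namespace: there are no blocks to wait for
        }
        return (double) reportedBlocks / totalBlocks >= threshold;
    }

    public static void main(String[] args) {
        long totalBlocks = 10_000L;   // illustrative number only
        double threshold = 0.999;     // the threshold quoted in the message

        // Still waiting on block reports: 9,000 of 10,000 is below 0.999.
        System.out.println(thresholdReached(9_000L, totalBlocks, threshold));
        // Enough blocks reported: 9,995 of 10,000 is above 0.999.
        System.out.println(thresholdReached(9_995L, totalBlocks, threshold));
    }
}
```

Once a check like this passes, NameNode waits out the extension period
before leaving safe mode, which is the point at which the hang
described below occurs.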

The problem occurs when, at the point of logging "0 seconds to leave
safe mode," NameNode hangs: It uses no more CPU; it logs nothing
further; it stops responding on its port 50070 web interface; "hadoop
fs" commands report no contact with NameNode; "netstat -atp" shows a
number of open connections on 9000 and 50070, indicating the connections
are being accepted, but NameNode never processes them.
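
The netstat observation is consistent with how TCP listeners generally
behave: the kernel completes the handshake and queues the connection in
the listen backlog even if the process never gets around to servicing
it. A minimal sketch of that effect, with no Hadoop code involved and a
hypothetical class name, is a listener that never calls accept():

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

public class AcceptedButUnserved {
    // Returns true if a client can connect to a listening socket whose owner
    // never calls accept(), yet gets no reply within timeoutMillis.
    static boolean connectsButGetsNoReply(int timeoutMillis) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 asks the OS for any free port; the listener stands in for a hung server.
        try (ServerSocket listener = new ServerSocket(0, 50, loopback);
             Socket client = new Socket(loopback, listener.getLocalPort())) {
            // The connect above succeeded even though accept() was never called.
            client.setSoTimeout(timeoutMillis);
            client.getOutputStream().write("ping\n".getBytes(StandardCharsets.US_ASCII));
            client.getOutputStream().flush();
            try {
                client.getInputStream().read(); // blocks: nobody is serving us
                return false;                   // got data or EOF, so not the hung case
            } catch (SocketTimeoutException expected) {
                return true;                    // connected, but never answered
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("connected but unanswered: " + connectsButGetsNoReply(500));
    }
}
```

From the client's side this looks the same as what is described above:
the connection opens, and then nothing comes back.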

This has happened twice in the last 2 weeks and it has us fairly
concerned.  Both times, simply restarting has been enough, and
NameNode came up successfully on the second attempt.  Is anyone else
familiar with this sort of hang, and do you know of any solutions?

Re: question: NameNode hanging on startup as it intends to leave safe mode

Posted by Karl Kleinpaste <ka...@conviva.com>.

On Wed, 2008-12-10 at 11:52 -0800, Konstantin Shvachko wrote:
> This is probably related to HADOOP-4795.

Thanks for the observation and reference.  However, my sense is that the
bug report you reference describes NameNode going into an infinite-loop
spin, whereas the situation we have faced is NameNode being stuck/hung,
as though deadlocked on some resource: using no CPU, and reacting to no
outside events other than accepting incoming connections without
proceeding to handle them.

Can your referenced bug manifest itself as a NameNode hang as well as a
spin?
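
One general way to tell a hang from a spin in a JVM is a thread dump
(for example "jstack <pid>" or "kill -QUIT <pid>"): a spin shows a
runnable thread burning CPU, while a lock-ordering deadlock shows
threads blocked on each other. The sketch below is only an illustration
of the second case using the standard java.lang.management API. It
deliberately creates its own deadlock; nothing in it comes from the
NameNode, and nothing in this thread confirms that the hang reported
here was a deadlock.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class DeadlockProbe {
    // Starts a daemon thread that takes `first`, waits until both threads
    // hold their first lock, then blocks forever trying to take `second`.
    static void lockInOrder(Object first, Object second, CountDownLatch bothHoldFirst) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                bothHoldFirst.countDown();
                try {
                    bothHoldFirst.await();
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {
                    // never reached: the other thread holds `second`
                }
            }
        });
        t.setDaemon(true); // do not keep the JVM alive once main returns
        t.start();
    }

    // Creates a two-thread deadlock and returns how many deadlocked threads
    // the JVM reports, or 0 if none are detected within waitMillis.
    static int countDeadlockedThreads(long waitMillis) throws InterruptedException {
        Object a = new Object();
        Object b = new Object();
        CountDownLatch latch = new CountDownLatch(2);
        lockInOrder(a, b, latch);
        lockInOrder(b, a, latch); // opposite order guarantees the cycle

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long deadline = System.currentTimeMillis() + waitMillis;
        while (System.currentTimeMillis() < deadline) {
            long[] ids = mx.findMonitorDeadlockedThreads(); // null when no deadlock
            if (ids != null) {
                return ids.length;
            }
            Thread.sleep(20);
        }
        return 0;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlocked threads: " + countDeadlockedThreads(2000));
    }
}
```

The deadlocked threads use no CPU, which matches the "stuck" behavior
described above rather than a spin. A thread dump taken during the next
occurrence would show which of the two is actually happening.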

Re: question: NameNode hanging on startup as it intends to leave safe mode

Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.

This is probably related to HADOOP-4795.
http://issues.apache.org/jira/browse/HADOOP-4795

We are testing it on 0.18 now. Should be committed soon.
Please let us know if it is something else.

Thanks,
--Konstantin
