Posted to common-user@hadoop.apache.org by Balanagireddy Mudiam <ba...@gmail.com> on 2010/05/08 00:17:20 UTC

HDFS safemode recovery take more than an hour

Hi,

We are running our cluster on Amazon EC2, using the Cloudera
scripts to set up Hadoop. On the master node, we start the services below.

    $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start namenode'
    $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start secondarynamenode'
    $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start jobtracker'

    $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop dfsadmin -safemode wait'
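
For comparison, here is a minimal sketch of an alternative to the blocking
"-safemode wait" call: poll "hadoop dfsadmin -safemode get" with an upper time
limit, so the start script logs progress and can give up instead of hanging.
This is not part of the Cloudera scripts. The function name, the SAFEMODE_GET
variable, and the timings are assumptions made for illustration, and the
$AS_HADOOP wrapper is left out.

```shell
#!/bin/sh
# Command used to query safe mode. Override SAFEMODE_GET to try the loop
# without a cluster. It is expanded unquoted, so paths with spaces will break.
SAFEMODE_GET=${SAFEMODE_GET:-"$HADOOP_HOME/bin/hadoop dfsadmin -safemode get"}

# wait_safemode_off MAX_SECONDS POLL_SECONDS
# Returns 0 once the status output contains "OFF", 1 if roughly MAX_SECONDS
# pass first. Elapsed time is approximate: it counts only the sleeps.
wait_safemode_off() {
  sm_max=$1
  sm_poll=$2
  sm_elapsed=0
  while [ "$sm_elapsed" -lt "$sm_max" ]; do
    sm_status=$($SAFEMODE_GET 2>&1)
    echo "[${sm_elapsed}s] $sm_status"
    case $sm_status in
      *OFF*) return 0 ;;
    esac
    sleep "$sm_poll"
    sm_elapsed=$((sm_elapsed + sm_poll))
  done
  return 1
}
```

A start script could call, for example, "wait_safemode_off 3600 30" and report
an error if it returns non-zero. POLL_SECONDS should be at least 1, otherwise
the loop never reaches the limit.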

On the slave machines, we run the services below.

    $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start datanode'
    $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start tasktracker'

The main problem we are facing is that HDFS safemode recovery is taking
more than an hour, and this is delaying our job completion.

Below are the main log messages.

1. domU-12-31-39-0A-34-61.compute-1.internal 10/05/05 20:44:19 INFO
ipc.Client: Retrying connect to server:
ec2-184-73-64-64.compute-1.amazonaws.com/10.192.11.240:8020. Already
tried 21 time(s).
2. The reported blocks 283634 needs additional 322258 blocks to reach
the threshold 0.9990 of total blocks 606499. Safe mode will be turned
off automatically.

The first message appears in the tasktracker logs because the jobtracker
has not started. The jobtracker did not start because of the HDFS safemode
recovery.

The second message is logged during the recovery process.
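
The numbers in the second message are consistent with each other, which can be
checked with a little arithmetic. The sketch below uses the values copied from
that log line; the variable names are illustrative, and truncating the product
to an integer is an assumption that happens to reproduce the logged figure.

```shell
#!/bin/sh
# Values copied from the safe mode log message.
total=606499
reported=283634
threshold=0.9990

# Blocks required to leave safe mode, assumed to be threshold * total
# truncated to an integer (awk's int() truncates toward zero).
needed=$(awk -v t="$total" -v p="$threshold" 'BEGIN { printf "%d", int(t * p) }')

# Blocks the datanodes still have to report.
additional=$((needed - reported))

# Share of all blocks reported so far, as a percentage.
percent=$(awk -v r="$reported" -v t="$total" 'BEGIN { printf "%.1f", 100 * r / t }')

echo "blocks needed to leave safe mode: $needed"
echo "additional blocks still to be reported: $additional"
echo "reported so far: $percent%"
```

The result for "additional" is 322258, the same figure as in the log. At that
point the namenode had heard about slightly less than half of its blocks.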

Is there something I am doing wrong?
How long does a normal HDFS safemode recovery take?
Would it speed things up if we did not start the tasktrackers until the
jobtracker has started?
Are there any known Hadoop problems on Amazon clusters?

Thanks for your help.

Regards
Bala Mudiam

Re: HDFS safemode recovery take more than an hour

Posted by Allen Wittenauer <aw...@linkedin.com>.

On Jun 11, 2010, at 6:04 PM, Bhupesh Bansal wrote:

> How you doing? Heard finally moving away from Solaris and moving to linux :)
> Hope things are going well for you !

HP apparently doesn't want us to eval their hardware (at least, by their non response), so at this rate we aren't. :( Maybe they are afraid I'll make it break. ;)   [I'll likely stick to Solaris on the NN and JT due to much more sane large page support.  That really needs to get fixed in the Linux kernel.]

> I think I found the source of my problems. The issue is that on Amazon EC2,
> when I start my cluster (1 namenode, 16 datanodes), the datanodes are not
> able to talk to the namenode at all (I tried telnet from a datanode to the
> namenode). It gets fixed progressively and magically over about 30-40
> minutes, until all of them are able to talk, hence safemode taking 40 minutes.

Oh, weird.  I have no practical experience with EC2, so can't really offer any guidance.  Tom or someone else might be able to tho.

Re: HDFS safemode recovery take more than an hour

Posted by Bhupesh Bansal <bb...@tellapart.com>.

Allen,

How you doing? Heard finally moving away from Solaris and moving to linux :)
Hope things are going well for you !

I think I found the source of my problems. The issue is that on Amazon EC2,
when I start my cluster (1 namenode, 16 datanodes), the datanodes are not
able to talk to the namenode at all (I tried telnet from a datanode to the
namenode). It gets fixed progressively and magically over about 30-40
minutes, until all of them are able to talk, hence safemode taking 40 minutes.
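
A check of this kind can be scripted so that each datanode logs when the
namenode port first becomes reachable. The sketch below is a minimal example,
assuming bash is installed (the probe relies on bash's /dev/tcp redirection)
along with the coreutils "timeout" command. The function names and the host
name in the usage note are placeholders; port 8020 is the one shown in the log
message earlier in the thread.

```shell
#!/bin/sh
# port_open HOST PORT [TIMEOUT_SECONDS]
# Succeeds if a TCP connection to HOST:PORT can be opened within the timeout.
# The probe runs in a child bash because /dev/tcp is a bash feature.
port_open() {
  timeout "${3:-3}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# wait_for_port HOST PORT MAX_TRIES SLEEP_SECONDS
# Retries until the port is reachable, printing one timestamped line per
# attempt. Returns 0 on success, 1 if every attempt fails.
wait_for_port() {
  wp_host=$1
  wp_port=$2
  wp_tries=$3
  wp_pause=$4
  wp_i=1
  while [ "$wp_i" -le "$wp_tries" ]; do
    if port_open "$wp_host" "$wp_port"; then
      echo "$(date '+%F %T') $wp_host:$wp_port reachable after $wp_i attempt(s)"
      return 0
    fi
    echo "$(date '+%F %T') $wp_host:$wp_port not reachable (attempt $wp_i of $wp_tries)"
    sleep "$wp_pause"
    wp_i=$((wp_i + 1))
  done
  return 1
}
```

For example, "wait_for_port namenode.example.internal 8020 120 30" would probe
every 30 seconds for up to an hour. Comparing the timestamps across datanodes
shows whether the delay is in the network or in HDFS itself.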

We are running a secondary namenode and do regular scps to safeguard the
data.

Best
Bhupesh

On Fri, Jun 11, 2010 at 5:57 PM, Allen Wittenauer [via Lucene] wrote:

>
> (removing hadoop-user@lucene)
>
> On Jun 11, 2010, at 5:08 PM, Bhupesh Bansal wrote:
>
> > I am also seeing similar issues. I am not clear how the secondary
> > namenode will help here.
> > AFAIK the secondary namenode checkpoints and saves namenode snapshots
> > periodically, and the namenode does not check with the secondary
> > namenode for any data inconsistencies.
>
>
> You can copy the checkpoint over to the primary.  This is better than no
> backup at all. :)
>

-- 
View this message in context: http://lucene.472066.n3.nabble.com/HDFS-safemode-recovery-take-more-than-an-hour-tp784779p889964.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.

Re: HDFS safemode recovery take more than an hour

Posted by Allen Wittenauer <aw...@linkedin.com>.

(removing hadoop-user@lucene)

On Jun 11, 2010, at 5:08 PM, Bhupesh Bansal wrote:

> I am also seeing similar issues. I am not clear how the secondary
> namenode will help here.
> AFAIK the secondary namenode checkpoints and saves namenode snapshots
> periodically, and the namenode does not check with the secondary
> namenode for any data inconsistencies.

You can copy the checkpoint over to the primary.  This is better than no backup at all. :)

Re: HDFS safemode recovery take more than an hour

Posted by Bhupesh Bansal <bb...@tellapart.com>.

Steve,

I am also seeing similar issues. I am not clear how the secondary
namenode will help here.
AFAIK the secondary namenode checkpoints and saves namenode snapshots
periodically, and the namenode does not check with the secondary
namenode for any data inconsistencies.

Best
Bhupesh

Re: HDFS safemode recovery take more than an hour

Posted by Steve Loughran <st...@apache.org>.

Balanagireddy Mudiam wrote:

> How long does a normal HDFS safemode recovery take?

If you don't have a secondary namenode set up, the namenode has to replay
its logged operations, so the time to recover depends on how long the
cluster has been up. 40 minutes is entirely possible. Incidentally, don't
panic and kill the process; that only makes things worse.

> Will there be any speedup, by not starting task trackers till job
> tracker is started?

no

-steve