
Posted to common-user@hadoop.apache.org by Bhupesh Bansal <bb...@tellapart.com> on 2010/06/12 02:08:20 UTC

Re: HDFS safemode recovery take more than an hour

Steve, 

I am also seeing similar issues, but I am not clear how the secondary
namenode helps here.
AFAIK the secondary namenode checkpoints and saves namenode snapshots
periodically, and the namenode does not check with the secondary namenode
for any data inconsistencies.

Best
Bhupesh
-- 
View this message in context: http://lucene.472066.n3.nabble.com/HDFS-safemode-recovery-take-more-than-an-hour-tp784779p889900.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.

Re: HDFS safemode recovery take more than an hour

Posted by Allen Wittenauer <aw...@linkedin.com>.

On Jun 11, 2010, at 6:04 PM, Bhupesh Bansal wrote:

> How are you doing? I heard you are finally moving away from Solaris to Linux :)
> Hope things are going well for you!

HP apparently doesn't want us to eval their hardware (at least, judging by their non-response), so at this rate we aren't. :( Maybe they are afraid I'll make it break. ;)   [I'll likely stick to Solaris on the NN and JT due to much more sane large page support.  That really needs to get fixed in the Linux kernel.]

> I think I found the source of my problems. The issue is in Amazon EC2: when I
> start my cluster (1 namenode, 16 datanodes), the datanodes are not able to
> talk to the namenode at all (I tried telnet from a datanode to the namenode).
> It gets fixed progressively and magically in about 30-40 mins, once all of
> them are able to talk, hence safemode taking 40 mins.

Oh, weird.  I have no practical experience with EC2, so can't really offer any guidance.  Tom or someone else might be able to tho.

Re: HDFS safemode recovery take more than an hour

Posted by Bhupesh Bansal <bb...@tellapart.com>.

Allen,

How are you doing? I heard you are finally moving away from Solaris to Linux :)
Hope things are going well for you!

I think I found the source of my problems. The issue is in Amazon EC2: when I
start my cluster (1 namenode, 16 datanodes), the datanodes are not able to
talk to the namenode at all (I tried telnet from a datanode to the namenode).
It gets fixed progressively and magically in about 30-40 mins, once all of
them are able to talk, hence safemode taking 40 mins.
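
The telnet check described above can be scripted so the recovery is easier to
watch. This is only a rough sketch, not something used in this thread: the
function names are made up for illustration, and the host and port in the
usage comment are assumptions to be replaced with the cluster's real namenode
address.

```python
import socket
import time


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def wait_for_namenode(host, port, attempts=10, delay=5.0):
    """Probe host:port up to `attempts` times.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if can_connect(host, port):
            return attempt
        if attempt < attempts:
            time.sleep(delay)
    return None


# Hypothetical usage from a datanode (host and port are assumptions):
# wait_for_namenode("namenode.example.internal", 8020, attempts=60, delay=30)
```

Running this on each datanode right after the cluster starts would show how
long each one takes to reach the namenode. It only tests that the TCP port is
reachable, which is the same thing the telnet test shows.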

We are running a secondary namenode and do regular scps to safeguard the
data.

Best
Bhupesh

On Fri, Jun 11, 2010 at 5:57 PM, Allen Wittenauer [via Lucene] <ml-node+889956-634000226-291170@n3.nabble.com> wrote:

>
> (removing hadoop-user@lucene)
>
> On Jun 11, 2010, at 5:08 PM, Bhupesh Bansal wrote:
>
> > I am also seeing similar issues, but I am not clear how the secondary
> > namenode helps here.
> > AFAIK the secondary namenode checkpoints and saves namenode snapshots
> > periodically, and the namenode does not check with the secondary namenode
> > for any data inconsistencies.
>
>
> You can copy the checkpoint over to the primary.  This is better than no
> backup at all. :)
>

-- 
View this message in context: http://lucene.472066.n3.nabble.com/HDFS-safemode-recovery-take-more-than-an-hour-tp784779p889964.html

Re: HDFS safemode recovery take more than an hour

Posted by Allen Wittenauer <aw...@linkedin.com>.

(removing hadoop-user@lucene)

On Jun 11, 2010, at 5:08 PM, Bhupesh Bansal wrote:

> I am also seeing similar issues, but I am not clear how the secondary
> namenode helps here.
> AFAIK the secondary namenode checkpoints and saves namenode snapshots
> periodically, and the namenode does not check with the secondary namenode
> for any data inconsistencies.

You can copy the checkpoint over to the primary.  This is better than no backup at all. :)
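
As a rough illustration of "copy the checkpoint over", here is a minimal
sketch. The thread does not say where the checkpoint lives or how it gets
copied, so both directory arguments are hypothetical placeholders, and the
refusal to overwrite a non-empty target is a safety choice of this sketch
rather than anything Hadoop enforces. It assumes the checkpoint has already
been fetched onto the primary (for example with scp) and that the namenode is
stopped while the copy runs.

```python
import os
import shutil


def restore_checkpoint(checkpoint_dir, name_dir):
    """Copy a checkpoint directory into an empty namenode directory.

    Both paths are hypothetical; returns the sorted list of top-level
    entries that were copied.
    """
    if not os.path.isdir(checkpoint_dir):
        raise FileNotFoundError(f"no checkpoint at {checkpoint_dir}")
    # Refuse to clobber existing metadata: move it aside by hand first.
    if os.path.isdir(name_dir) and os.listdir(name_dir):
        raise FileExistsError(f"{name_dir} is not empty; move it aside first")
    os.makedirs(name_dir, exist_ok=True)

    copied = []
    for entry in sorted(os.listdir(checkpoint_dir)):
        src = os.path.join(checkpoint_dir, entry)
        dst = os.path.join(name_dir, entry)
        if os.path.isdir(src):
            shutil.copytree(src, dst)
        else:
            shutil.copy2(src, dst)  # copy2 also preserves timestamps
        copied.append(entry)
    return copied
```

Anything written to HDFS after the last checkpoint would not be in the copy,
which is why this is only "better than no backup at all".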