Posted to hdfs-user@hadoop.apache.org by Panshul Whisper <ou...@gmail.com> on 2013/01/15 02:25:08 UTC

hadoop namenode recovery

Hello,

Is there a standard way to prevent a NameNode crash from bringing down a
Hadoop cluster? Or what is the standard or best practice for overcoming
Hadoop's single point of failure problem?

I am not ready to take chances on a production server with the Hadoop 2.0
Alpha release, which claims to have solved the problem. Are there any other
things I can do to either prevent the failure or recover from it in a very
short time?

Thanking You,

-- 
Regards,
Ouch Whisper
010101010101

Re: hadoop namenode recovery

Posted by Harsh J <ha...@cloudera.com>.
Randy,

A very slow NFS is certainly troublesome, and slows down the write
performance for every edit waiting to get logged to disk (there's a
logSync_avg_time metric that could be monitored for this), and therefore a
dedicated NFS mount is required if you are unwilling to use the proper
HA-HDFS setup.
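
To watch for exactly this slowdown, the sync-time metric can be polled from the NameNode's metrics output. The bean and attribute names below (`NameNodeActivity`, `SyncsAvgTime`, the `/jmx` servlet of metrics2) are assumptions that vary by Hadoop version — older metrics v1 builds expose `logSync_avg_time` via the `/metrics` servlet instead — so verify the names against your release. A minimal sketch of parsing the JMX payload:

```python
import json

# Trimmed sample of a NameNode /jmx payload. The bean name
# "Hadoop:service=NameNode,name=NameNodeActivity" and the attribute
# "SyncsAvgTime" are assumptions -- check them against your Hadoop version.
SAMPLE = json.dumps({
    "beans": [
        {"name": "Hadoop:service=NameNode,name=NameNodeActivity",
         "SyncsNumOps": 12345,
         "SyncsAvgTime": 7.2},
    ]
})

def edit_sync_avg_ms(jmx_json, threshold_ms=50.0):
    """Return (avg_ms, is_slow) for the NN edit-log sync time."""
    beans = json.loads(jmx_json)["beans"]
    for bean in beans:
        if bean.get("name", "").endswith("name=NameNodeActivity"):
            avg = float(bean["SyncsAvgTime"])
            return avg, avg > threshold_ms
    raise KeyError("NameNodeActivity bean not found")

avg, slow = edit_sync_avg_ms(SAMPLE)
print(avg, slow)  # -> 7.2 False
```

In a live cluster you would fetch the payload from the NN's HTTP port (e.g. `http://<nn-host>:50070/jmx`; the port depends on your config) and alert when the average climbs, which usually points at a slow edits directory such as a congested NFS mount.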

If the write fails (due to NFS client timeout, or unavailability), then as
I mentioned, NN removes it from the write queue and carries on with the
remaining disks. Use of soft mounted NFS is important here, otherwise (i.e.
hard mounts) the NFS itself will end up hanging the NN. On a soft mounted
NFS the write will not block forever.
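
The soft-mount plus redundant-directory setup described above can be sketched roughly as follows. The paths, mount options, and property spellings (`dfs.name.dir` / `dfs.name.dir.restore` on 1.x; `dfs.namenode.name.dir` / `dfs.namenode.name.dir.restore` on 2.x) are assumptions to verify against your release:

```xml
<!-- /etc/fstab entry for the NFS mount: "soft" makes a dead server return
     an I/O error instead of hanging the NN; timeo is in tenths of a second:

     nnbackup:/export/nn  /mnt/nn-nfs  nfs  soft,timeo=100,retrans=3  0 0 -->

<!-- hdfs-site.xml: one local directory plus the NFS directory -->
<property>
  <!-- dfs.namenode.name.dir on 2.x -->
  <name>dfs.name.dir</name>
  <value>/data/dfs/name,/mnt/nn-nfs/name</value>
</property>
<property>
  <!-- re-add a previously failed storage directory once it is healthy -->
  <name>dfs.name.dir.restore</name>
  <value>true</value>
</property>
```

If the NN host is lost entirely, the current image and edits can then be copied off the NFS directory onto a replacement machine, as described earlier in the thread.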


On Thu, Jan 17, 2013 at 3:23 PM, randy <ra...@comcast.net> wrote:

> I've seen NFS get in a state many times where the mount is still there,
> but it can't be written to or accessed. What happens in that case? If the
> network is congested or slow, does that slow down the overall NN
> performance?
>
> Thanks,
> randy
>
>
> On 01/15/2013 11:14 PM, Harsh J wrote:
>
>> The NFS mount is to be soft-mounted; so if the NFS goes down, the NN
>> ejects it out and continues with the local disk. If auto-restore is
>> configured, it will re-add the NFS if it's detected as good again later.
>>
>>
>> On Wed, Jan 16, 2013 at 7:04 AM, randy <randysch@comcast.net> wrote:
>>
>>     What happens to the NN and/or performance if there's a problem with
>>     the NFS server? Or the network?
>>
>>     Thanks,
>>     randy
>>
>>
>>     On 01/14/2013 11:36 PM, Harsh J wrote:
>>
>>         It's very rare to observe an NN crash due to a software bug in
>>         production. Most of the time it's a hardware fault you should
>>         worry about.
>>
>>         On 1.x, or any non-HA-carrying release, the best you can get to
>>         safeguard against a total loss is to have redundant disk volumes
>>         configured, one preferably over a dedicated remote NFS mount.
>>         This way
>>         the NN is recoverable after the node goes down, since you can
>>         retrieve a
>>         current copy from another machine (i.e. via the NFS mount) and
>>         set a new
>>         node up to replace the older NN and continue along.
>>
>>         A load balancer will not work as the NN is not a simple
>>         webserver - it
>>         maintains state which you cannot sync. We wrote HA-HDFS features
>> to
>>         address the very concern you have.
>>
>>         If you want true, painless HA, branch-2 is your best bet at this
>>         point.
>>         An upcoming 2.0.3 release should include the QJM based HA
>>         features that
>>         is painless to setup and very reliable to use (over other
>>         options), and
>>         works with commodity level hardware. FWIW, we've (my team and I)
>>         been
>>         supporting several users and customers who're running the 2.x
>>         based HA
>>         in production and other types of environments and it has been
>>         greatly
>>         stable in our experience. There are also some folks in the
>> community
>>         running 2.x based HDFS for HA/else.
>>
>>
>>         On Tue, Jan 15, 2013 at 6:55 AM, Panshul Whisper
>>         <ouchwhisper@gmail.com> wrote:
>>
>>              Hello,
>>
>>              Is there a standard way to prevent the failure of Namenode
>>         crash in
>>              a Hadoop cluster?
>>              or what is the standard or best practice for overcoming the
>>         Single
>>              point failure problem of Hadoop.
>>
>>              I am not ready to take chances on a production server with
>>         Hadoop
>>              2.0 Alpha release, which claims to have solved the problem.
>> Are
>>              there any other things I can do to either prevent the
>>         failure or
>>              recover from the failure in a very short time.
>>
>>              Thanking You,
>>
>>              --
>>              Regards,
>>              Ouch Whisper
>>              010101010101
>>
>>
>>
>>
>>         --
>>         Harsh J
>>
>>
>>
>>
>>
>> --
>> Harsh J
>>
>
>


-- 
Harsh J

Re: hadoop namenode recovery

Posted by Michel Segel <mi...@hotmail.com>.
MapR was the first vendor to remove the NN as a SPOF.
They did this with their 1.0 release when it first came out. The downside is that their release is proprietary and very different, in terms of the underlying architecture, from Apache-based releases.

Hortonworks relies on VMware as a key piece of their release.

If you want HA, you are going to have to look at a specific vendor's approach.


Sent from a remote device. Please excuse any typos...

Mike Segel

On Jan 17, 2013, at 3:53 AM, randy <ra...@comcast.net> wrote:

> I've seen NFS get in a state many times where the mount is still there, but it can't be written to or accessed. What happens in that case? If the network is congested or slow, does that slow down the overall NN performance?
> 
> Thanks,
> randy
> 
> On 01/15/2013 11:14 PM, Harsh J wrote:
>> The NFS mount is to be soft-mounted; so if the NFS goes down, the NN
>> ejects it out and continues with the local disk. If auto-restore is
>> configured, it will re-add the NFS if it's detected as good again later.
>> 
>> 
>> On Wed, Jan 16, 2013 at 7:04 AM, randy <randysch@comcast.net> wrote:
>> 
>>    What happens to the NN and/or performance if there's a problem with
>>    the NFS server? Or the network?
>> 
>>    Thanks,
>>    randy
>> 
>> 
>>    On 01/14/2013 11:36 PM, Harsh J wrote:
>> 
>>        It's very rare to observe an NN crash due to a software bug in
>>        production. Most of the time it's a hardware fault you should
>>        worry about.
>> 
>>        On 1.x, or any non-HA-carrying release, the best you can get to
>>        safeguard against a total loss is to have redundant disk volumes
>>        configured, one preferably over a dedicated remote NFS mount.
>>        This way
>>        the NN is recoverable after the node goes down, since you can
>>        retrieve a
>>        current copy from another machine (i.e. via the NFS mount) and
>>        set a new
>>        node up to replace the older NN and continue along.
>> 
>>        A load balancer will not work as the NN is not a simple
>>        webserver - it
>>        maintains state which you cannot sync. We wrote HA-HDFS features to
>>        address the very concern you have.
>> 
>>        If you want true, painless HA, branch-2 is your best bet at this
>>        point.
>>        An upcoming 2.0.3 release should include the QJM based HA
>>        features that
>>        is painless to setup and very reliable to use (over other
>>        options), and
>>        works with commodity level hardware. FWIW, we've (my team and I)
>>        been
>>        supporting several users and customers who're running the 2.x
>>        based HA
>>        in production and other types of environments and it has been
>>        greatly
>>        stable in our experience. There are also some folks in the community
>>        running 2.x based HDFS for HA/else.
>> 
>> 
>>        On Tue, Jan 15, 2013 at 6:55 AM, Panshul Whisper
>>        <ouchwhisper@gmail.com>
>>        wrote:
>> 
>>             Hello,
>> 
>>             Is there a standard way to prevent the failure of Namenode
>>        crash in
>>             a Hadoop cluster?
>>             or what is the standard or best practice for overcoming the
>>        Single
>>             point failure problem of Hadoop.
>> 
>>             I am not ready to take chances on a production server with
>>        Hadoop
>>             2.0 Alpha release, which claims to have solved the problem. Are
>>             there any other things I can do to either prevent the
>>        failure or
>>             recover from the failure in a very short time.
>> 
>>             Thanking You,
>> 
>>             --
>>             Regards,
>>             Ouch Whisper
>>             010101010101
>> 
>> 
>> 
>> 
>>        --
>>        Harsh J
>> 
>> 
>> 
>> 
>> 
>> --
>> Harsh J
> 
> 


Re: hadoop namenode recovery

Posted by randy <ra...@comcast.net>.
I've seen NFS get in a state many times where the mount is still there, 
but it can't be written to or accessed. What happens in that case? If 
the network is congested or slow, does that slow down the overall NN 
performance?

Thanks,
randy

On 01/15/2013 11:14 PM, Harsh J wrote:
> The NFS mount is to be soft-mounted; so if the NFS goes down, the NN
> ejects it out and continues with the local disk. If auto-restore is
> configured, it will re-add the NFS if its detected good again later.
>
>
> On Wed, Jan 16, 2013 at 7:04 AM, randy <randysch@comcast.net
> <ma...@comcast.net>> wrote:
>
>     What happens to the NN and/or performance if there's a problem with
>     the NFS server? Or the network?
>
>     Thanks,
>     randy
>
>
>     On 01/14/2013 11:36 PM, Harsh J wrote:
>
>         Its very rare to observe an NN crash due to a software bug in
>         production. Most of the times its a hardware fault you should
>         worry about.
>
>         On 1.x, or any non-HA-carrying release, the best you can get to
>         safeguard against a total loss is to have redundant disk volumes
>         configured, one preferably over a dedicated remote NFS mount.
>         This way
>         the NN is recoverable after the node goes down, since you can
>         retrieve a
>         current copy from another machine (i.e. via the NFS mount) and
>         set a new
>         node up to replace the older NN and continue along.
>
>         A load balancer will not work as the NN is not a simple
>         webserver - it
>         maintains state which you cannot sync. We wrote HA-HDFS features to
>         address the very concern you have.
>
>         If you want true, painless HA, branch-2 is your best bet at this
>         point.
>         An upcoming 2.0.3 release should include the QJM based HA
>         features that
>         is painless to setup and very reliable to use (over other
>         options), and
>         works with commodity level hardware. FWIW, we've (my team and I)
>         been
>         supporting several users and customers who're running the 2.x
>         based HA
>         in production and other types of environments and it has been
>         greatly
>         stable in our experience. There are also some folks in the community
>         running 2.x based HDFS for HA/else.
>
>
>         On Tue, Jan 15, 2013 at 6:55 AM, Panshul Whisper
>         <ouchwhisper@gmail.com <ma...@gmail.com>
>         <mailto:ouchwhisper@gmail.com <ma...@gmail.com>>__>
>         wrote:
>
>              Hello,
>
>              Is there a standard way to prevent the failure of Namenode
>         crash in
>              a Hadoop cluster?
>              or what is the standard or best practice for overcoming the
>         Single
>              point failure problem of Hadoop.
>
>              I am not ready to take chances on a production server with
>         Hadoop
>              2.0 Alpha release, which claims to have solved the problem. Are
>              there any other things I can do to either prevent the
>         failure or
>              recover from the failure in a very short time.
>
>              Thanking You,
>
>              --
>              Regards,
>              Ouch Whisper
>              010101010101
>
>
>
>
>         --
>         Harsh J
>
>
>
>
>
> --
> Harsh J


RE: hadoop namenode recovery

Posted by Rakesh R <ra...@huawei.com>.
Hi,



I feel the most reliable approach is using the NN-HA features with shared storage. The idea is to have two NameNodes: both the Active and the Standby (secondary) NameNode point to the shared device and write the edit logs to it. When the Active crashes, the Standby takes over, becomes Active, and continues serving clients reliably without much interruption.





One possible approach uses BookKeeper as the shared storage device:
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailability.html#BookKeeper_as_a_Shared_storage_EXPERIMENTAL



http://zookeeper.apache.org/doc/trunk/bookkeeperOverview.html



Thanks,

Rakesh

________________________________

From: Harsh J [harsh@cloudera.com]
Sent: Wednesday, January 16, 2013 9:44 AM
To: <us...@hadoop.apache.org>
Subject: Re: hadoop namenode recovery

The NFS mount is to be soft-mounted; so if the NFS goes down, the NN ejects it out and continues with the local disk. If auto-restore is configured, it will re-add the NFS if its detected good again later.


On Wed, Jan 16, 2013 at 7:04 AM, randy <ra...@comcast.net>> wrote:
What happens to the NN and/or performance if there's a problem with the NFS server? Or the network?

Thanks,
randy


On 01/14/2013 11:36 PM, Harsh J wrote:
Its very rare to observe an NN crash due to a software bug in
production. Most of the times its a hardware fault you should worry about.

On 1.x, or any non-HA-carrying release, the best you can get to
safeguard against a total loss is to have redundant disk volumes
configured, one preferably over a dedicated remote NFS mount. This way
the NN is recoverable after the node goes down, since you can retrieve a
current copy from another machine (i.e. via the NFS mount) and set a new
node up to replace the older NN and continue along.

A load balancer will not work as the NN is not a simple webserver - it
maintains state which you cannot sync. We wrote HA-HDFS features to
address the very concern you have.

If you want true, painless HA, branch-2 is your best bet at this point.
An upcoming 2.0.3 release should include the QJM based HA features that
is painless to setup and very reliable to use (over other options), and
works with commodity level hardware. FWIW, we've (my team and I) been
supporting several users and customers who're running the 2.x based HA
in production and other types of environments and it has been greatly
stable in our experience. There are also some folks in the community
running 2.x based HDFS for HA/else.


On Tue, Jan 15, 2013 at 6:55 AM, Panshul Whisper <ou...@gmail.com>
<ma...@gmail.com>>> wrote:

    Hello,

    Is there a standard way to prevent the failure of Namenode crash in
    a Hadoop cluster?
    or what is the standard or best practice for overcoming the Single
    point failure problem of Hadoop.

    I am not ready to take chances on a production server with Hadoop
    2.0 Alpha release, which claims to have solved the problem. Are
    there any other things I can do to either prevent the failure or
    recover from the failure in a very short time.

    Thanking You,

    --
    Regards,
    Ouch Whisper
    010101010101




--
Harsh J




--
Harsh J
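In configuration terms, the shared-device idea above boils down to listing the shared (NFS) directory alongside a local one in the NameNode's storage directories. A hedged sketch for a 1.x-style hdfs-site.xml follows (paths are illustrative; the 2.x HA setups use different properties):

```xml
<!-- hdfs-site.xml sketch: redundant NN metadata dirs, one local, one on a
     soft-mounted NFS share. Paths are illustrative. -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/nn,/mnt/nn-meta</value>
</property>
<property>
  <!-- Re-add a previously failed storage dir once it is healthy again
       (the "auto-restore" behaviour mentioned elsewhere in this thread). -->
  <name>dfs.name.dir.restore</name>
  <value>true</value>
</property>
```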

Re: hadoop namenode recovery

Posted by Harsh J <ha...@cloudera.com>.
The NFS mount is to be soft-mounted; so if the NFS goes down, the NN ejects
it out and continues with the local disk. If auto-restore is configured, it
will re-add the NFS if it's detected as good again later.


On Wed, Jan 16, 2013 at 7:04 AM, randy <ra...@comcast.net> wrote:

> What happens to the NN and/or performance if there's a problem with the
> NFS server? Or the network?
>
> Thanks,
> randy
>
>
> On 01/14/2013 11:36 PM, Harsh J wrote:
>
>> Its very rare to observe an NN crash due to a software bug in
>> production. Most of the times its a hardware fault you should worry about.
>>
>> On 1.x, or any non-HA-carrying release, the best you can get to
>> safeguard against a total loss is to have redundant disk volumes
>> configured, one preferably over a dedicated remote NFS mount. This way
>> the NN is recoverable after the node goes down, since you can retrieve a
>> current copy from another machine (i.e. via the NFS mount) and set a new
>> node up to replace the older NN and continue along.
>>
>> A load balancer will not work as the NN is not a simple webserver - it
>> maintains state which you cannot sync. We wrote HA-HDFS features to
>> address the very concern you have.
>>
>> If you want true, painless HA, branch-2 is your best bet at this point.
>> An upcoming 2.0.3 release should include the QJM based HA features that
>> is painless to setup and very reliable to use (over other options), and
>> works with commodity level hardware. FWIW, we've (my team and I) been
>> supporting several users and customers who're running the 2.x based HA
>> in production and other types of environments and it has been greatly
>> stable in our experience. There are also some folks in the community
>> running 2.x based HDFS for HA/else.
>>
>>
>> On Tue, Jan 15, 2013 at 6:55 AM, Panshul Whisper <ouchwhisper@gmail.com
>> <ma...@gmail.com>**> wrote:
>>
>>     Hello,
>>
>>     Is there a standard way to prevent the failure of Namenode crash in
>>     a Hadoop cluster?
>>     or what is the standard or best practice for overcoming the Single
>>     point failure problem of Hadoop.
>>
>>     I am not ready to take chances on a production server with Hadoop
>>     2.0 Alpha release, which claims to have solved the problem. Are
>>     there any other things I can do to either prevent the failure or
>>     recover from the failure in a very short time.
>>
>>     Thanking You,
>>
>>     --
>>     Regards,
>>     Ouch Whisper
>>     010101010101
>>
>>
>>
>>
>> --
>> Harsh J
>>
>
>


-- 
Harsh J

Re: hadoop namenode recovery

Posted by Harsh J <ha...@cloudera.com>.
The NFS mount is to be soft-mounted; so if the NFS goes down, the NN ejects
it out and continues with the local disk. If auto-restore is configured, it
will re-add the NFS if its detected good again later.


On Wed, Jan 16, 2013 at 7:04 AM, randy <ra...@comcast.net> wrote:

> What happens to the NN and/or performance if there's a problem with the
> NFS server? Or the network?
>
> Thanks,
> randy
>
>
> On 01/14/2013 11:36 PM, Harsh J wrote:
>
>> Its very rare to observe an NN crash due to a software bug in
>> production. Most of the times its a hardware fault you should worry about.
>>
>> On 1.x, or any non-HA-carrying release, the best you can get to
>> safeguard against a total loss is to have redundant disk volumes
>> configured, one preferably over a dedicated remote NFS mount. This way
>> the NN is recoverable after the node goes down, since you can retrieve a
>> current copy from another machine (i.e. via the NFS mount) and set a new
>> node up to replace the older NN and continue along.
>>
>> A load balancer will not work as the NN is not a simple webserver - it
>> maintains state which you cannot sync. We wrote HA-HDFS features to
>> address the very concern you have.
>>
>> If you want true, painless HA, branch-2 is your best bet at this point.
>> An upcoming 2.0.3 release should include the QJM based HA features that
>> is painless to setup and very reliable to use (over other options), and
>> works with commodity level hardware. FWIW, we've (my team and I) been
>> supporting several users and customers who're running the 2.x based HA
>> in production and other types of environments and it has been greatly
>> stable in our experience. There are also some folks in the community
>> running 2.x based HDFS for HA/else.
>>
>>
>> On Tue, Jan 15, 2013 at 6:55 AM, Panshul Whisper <ouchwhisper@gmail.com
>> <ma...@gmail.com>**> wrote:
>>
>>     Hello,
>>
>>     Is there a standard way to prevent the failure of Namenode crash in
>>     a Hadoop cluster?
>>     or what is the standard or best practice for overcoming the Single
>>     point failure problem of Hadoop.
>>
>>     I am not ready to take chances on a production server with Hadoop
>>     2.0 Alpha release, which claims to have solved the problem. Are
>>     there any other things I can do to either prevent the failure or
>>     recover from the failure in a very short time.
>>
>>     Thanking You,
>>
>>     --
>>     Regards,
>>     Ouch Whisper
>>     010101010101
>>
>>
>>
>>
>> --
>> Harsh J
>>
>
>


-- 
Harsh J

Re: hadoop namenode recovery

Posted by randy <ra...@comcast.net>.
What happens to the NN and/or performance if there's a problem with the 
NFS server? Or the network?

Thanks,
randy

On 01/14/2013 11:36 PM, Harsh J wrote:
> Its very rare to observe an NN crash due to a software bug in
> production. Most of the times its a hardware fault you should worry about.
>
> On 1.x, or any non-HA-carrying release, the best you can get to
> safeguard against a total loss is to have redundant disk volumes
> configured, one preferably over a dedicated remote NFS mount. This way
> the NN is recoverable after the node goes down, since you can retrieve a
> current copy from another machine (i.e. via the NFS mount) and set a new
> node up to replace the older NN and continue along.
>
> A load balancer will not work as the NN is not a simple webserver - it
> maintains state which you cannot sync. We wrote HA-HDFS features to
> address the very concern you have.
>
> If you want true, painless HA, branch-2 is your best bet at this point.
> An upcoming 2.0.3 release should include the QJM based HA features that
> is painless to setup and very reliable to use (over other options), and
> works with commodity level hardware. FWIW, we've (my team and I) been
> supporting several users and customers who're running the 2.x based HA
> in production and other types of environments and it has been greatly
> stable in our experience. There are also some folks in the community
> running 2.x based HDFS for HA/else.
>
>
> On Tue, Jan 15, 2013 at 6:55 AM, Panshul Whisper <ouchwhisper@gmail.com
> <ma...@gmail.com>> wrote:
>
>     Hello,
>
>     Is there a standard way to prevent the failure of Namenode crash in
>     a Hadoop cluster?
>     or what is the standard or best practice for overcoming the Single
>     point failure problem of Hadoop.
>
>     I am not ready to take chances on a production server with Hadoop
>     2.0 Alpha release, which claims to have solved the problem. Are
>     there any other things I can do to either prevent the failure or
>     recover from the failure in a very short time.
>
>     Thanking You,
>
>     --
>     Regards,
>     Ouch Whisper
>     010101010101
>
>
>
>
> --
> Harsh J


Re: hadoop namenode recovery

Posted by Harsh J <ha...@cloudera.com>.
It's very rare to observe an NN crash due to a software bug in production.
Most of the time it's a hardware fault you should worry about.

On 1.x, or any non-HA-carrying release, the best you can get to safeguard
against a total loss is to have redundant disk volumes configured, one
preferably over a dedicated remote NFS mount. This way the NN is
recoverable after the node goes down, since you can retrieve a current copy
from another machine (i.e. via the NFS mount) and set a new node up to
replace the older NN and continue along.
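A hedged sketch of what the redundant-volume setup described above looks like in
hdfs-site.xml on a 1.x NameNode; the directory paths are illustrative, not from
this thread:

```xml
<!-- hdfs-site.xml (Hadoop 1.x): redundant NN metadata directories.
     Paths are illustrative; one local disk plus one dedicated NFS mount. -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/nn,/mnt/nn-nfs/dfs/nn</value>
</property>
<property>
  <!-- Re-add a previously failed storage directory once it is healthy again -->
  <name>dfs.name.dir.restore</name>
  <value>true</value>
</property>
```

The NN writes the fsimage and edit log to every listed directory, so any one
surviving copy is enough to rebuild the NameNode on a replacement machine.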

A load balancer will not work as the NN is not a simple webserver - it
maintains state which you cannot sync. We wrote HA-HDFS features to address
the very concern you have.

If you want true, painless HA, branch-2 is your best bet at this point. An
upcoming 2.0.3 release should include the QJM-based HA feature, which is
painless to set up, very reliable to use (compared to other options), and works
with commodity-level hardware. FWIW, we've (my team and I) been supporting
several users and customers who're running the 2.x based HA in production
and other types of environments and it has been very stable in our
experience. There are also some folks in the community running 2.x based
HDFS for HA and other features.
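For reference, a minimal sketch of the QJM-based HA configuration mentioned
above, as it appears in 2.x hdfs-site.xml. The nameservice ID and hostnames are
illustrative assumptions; the full setup needs additional properties (fencing,
the client failover proxy provider, JournalNode edits directories) covered in
the HDFS HA documentation.

```xml
<!-- Minimal QJM HA sketch (Hadoop 2.x); hostnames and IDs are illustrative -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>nn2.example.com:8020</value>
</property>
<property>
  <!-- Edits go to a quorum of JournalNodes instead of an NFS mount -->
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>
```

Because edits are committed to a majority of JournalNodes, no single machine
(and no NFS filer) is a single point of failure for the metadata.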


On Tue, Jan 15, 2013 at 6:55 AM, Panshul Whisper <ou...@gmail.com>wrote:

> Hello,
>
> Is there a standard way to prevent the failure of Namenode crash in a
> Hadoop cluster?
> or what is the standard or best practice for overcoming the Single point
> failure problem of Hadoop.
>
> I am not ready to take chances on a production server with Hadoop 2.0
> Alpha release, which claims to have solved the problem. Are there any other
> things I can do to either prevent the failure or recover from the failure
> in a very short time.
>
> Thanking You,
>
> --
> Regards,
> Ouch Whisper
> 010101010101
>



-- 
Harsh J

Re: hadoop namenode recovery

Posted by nagarjuna kanamarlapudi <na...@gmail.com>.
I am not sure if this is possible in the 0.2x or 1.0 releases of Hadoop.

On Tuesday, January 15, 2013, Panshul Whisper wrote:

> Hello,
> I have another idea.... regarding solving the single point failure of
> Hadoop...
> What If I have multiple Name Nodes setup and running behind a load
> balancer in the cluster. So this way I can have multiple Name Nodes at the
> same IP Address of the load balancer. Which resolves the problem of
> failure, If one Name Node goes down, others are working.
>
> Please suggest.... this is just a vague idea..!!
>
> Thanx
>
>
> On Mon, Jan 14, 2013 at 7:31 PM, Panshul Whisper <ou...@gmail.com>wrote:
>
> Hello Bejoy,
>
> Thank you for the information.
> about the Hadoop HA 2.x releases, they are in Alpha phase and I cannot use
> them for production. For my requirements, the cluster is supposed to be
> extremely Available. Availability is of highest concern. I have looked into
> different distributions as well.. such as Hortonworks, they also have the
> same problem of Single point of failure. And are waiting for Apache to
> release the Hadoop 2.x.
>
> I was wondering, if I can somehow configure two Name Nodes on the same
> Network with the same IP Address, but the second name node is redirected
> only after the failure of the primary, that might help in automatic
> resolution of this problem. all the slaves are connecting to the Name Node
> with a network alias in their /etc/hosts file.
> I am trying to implement something like this in the cluster:
> http://networksandservers.blogspot.de/2011/04/failover-clustering-i.html
>
> please suggest if this is possible.
>
> Thanks for your time.
> Regards,
> Panshul.
>
>
> On Mon, Jan 14, 2013 at 7:11 PM, <be...@gmail.com> wrote:
>
> **
> Hi Panshul
>
> SecondaryNameNode is rather known as check point node. At periodic
> intervals it merges the editlog from NN with FS image to prevent the edit
> log from growing too large. This is its main functionality.
>
> At any point the SNN would have the latest fs image but not the updated
> edit log. If NN goes down and if you don't have an updated copy of edit log
> you can use the fsImage from SNN for restoring. In that case you lose the
> transactions in edit log.
>
> SNN is not a backup NN it is just a check point node.
>
> Two or more NN are not possible in 1.x releases but federation makes it
> possible with 2.x releases. Federation is for different purpose, you should
> be looking at hadoop HA currently with 2.x releases.
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> ------------------------------
> *From: * Panshul Whisper <ou...@gmail.com>
> *Date: *Mon, 14 Jan 2013 19:04:24 -0800
> *To: *<us...@hadoop.apache.org>; <be...@gmail.com>
> *Subject: *Re: hadoop namenode recovery
>
> thank you for the reply.
>
> Is there a way with which I can configure my cluster to switch to the
> Secondary Name Node automatically in case of the Primary Name Node failure?
>  When I run my current Hadoop, I see the primary and secondary both Name
> nodes running. I was wondering what is that Secondary Name Node for? and
> where is it configured?
> I was also wondering, is it possible to have two or more Name nodes
> running in the same cluster?
>
> Thanks,
> Regards,
> Panshul.
>
>
> On Mon, Jan 14, 2013 at 6:50 PM, <be...@gmail.com> wrote:
>
> **
> Hi Panshul,
>
> Usually for reliability there will be multiple dfs.name.dir configured. Of
> which one would be a remote location such as a nfs mount.
> So that even if the NN machine crashes on a whole you still have the fs
> image and edit log in nfs mount. This can be utilized for reconstructing
> the NN back again.
>
>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> ------------------------------
> *From: * Panshul Whisper <ou...@gmail.com>
> *Date: *Mon, 14 Jan 2013 17:25:08 -0800
> *To: *<
>
>

-- 
Sent from iPhone

Re: hadoop namenode recovery

Posted by anil gupta <an...@gmail.com>.
Inline

On Mon, Jan 14, 2013 at 7:48 PM, Panshul Whisper <ou...@gmail.com>wrote:

> Hello,
> I have another idea.... regarding solving the single point failure of
> Hadoop...
> What If I have multiple Name Nodes setup and running behind a load
> balancer in the cluster. So this way I can have multiple Name Nodes at the
> same IP Address of the load balancer. Which resolves the problem of
> failure, If one Name Node goes down, others are working.
>
This won't work, since the DataNodes always need to be aware of the active
NameNode to send heartbeats and for other communication.
Hortonworks as well as Cloudera both have solutions for the Single Point of
Failure of the Namenode. You will have to analyze the solutions and pick one.

>
> Please suggest.... this is just a vague idea..!!
>
> Thanx
>
>
> On Mon, Jan 14, 2013 at 7:31 PM, Panshul Whisper <ou...@gmail.com>wrote:
>
>> Hello Bejoy,
>>
>> Thank you for the information.
>> about the Hadoop HA 2.x releases, they are in Alpha phase and I cannot
>> use them for production. For my requirements, the cluster is supposed to be
>> extremely Available. Availability is of highest concern. I have looked into
>> different distributions as well.. such as Hortonworks, they also have the
>> same problem of Single point of failure. And are waiting for Apache to
>> release the Hadoop 2.x.
>>
>> I was wondering, if I can somehow configure two Name Nodes on the same
>> Network with the same IP Address, but the second name node is redirected
>> only after the failure of the primary, that might help in automatic
>> resolution of this problem. all the slaves are connecting to the Name Node
>> with a network alias in their /etc/hosts file.
>> I am trying to implement something like this in the cluster:
>> http://networksandservers.blogspot.de/2011/04/failover-clustering-i.html
>>
>> please suggest if this is possible.
>>
>> Thanks for your time.
>> Regards,
>> Panshul.
>>
>>
>> On Mon, Jan 14, 2013 at 7:11 PM, <be...@gmail.com> wrote:
>>
>>> **
>>> Hi Panshul
>>>
>>> SecondaryNameNode is rather known as check point node. At periodic
>>> intervals it merges the editlog from NN with FS image to prevent the edit
>>> log from growing too large. This is its main functionality.
>>>
>>> At any point the SNN would have the latest fs image but not the updated
>>> edit log. If NN goes down and if you don't have an updated copy of edit log
>>> you can use the fsImage from SNN for restoring. In that case you lose the
>>> transactions in edit log.
>>>
>>> SNN is not a backup NN it is just a check point node.
>>>
>>> Two or more NN are not possible in 1.x releases but federation makes it
>>> possible with 2.x releases. Federation is for different purpose, you should
>>> be looking at hadoop HA currently with 2.x releases.
>>> Regards
>>> Bejoy KS
>>>
>>> Sent from remote device, Please excuse typos
>>> ------------------------------
>>> *From: * Panshul Whisper <ou...@gmail.com>
>>> *Date: *Mon, 14 Jan 2013 19:04:24 -0800
>>> *To: *<us...@hadoop.apache.org>; <be...@gmail.com>
>>> *Subject: *Re: hadoop namenode recovery
>>>
>>> thank you for the reply.
>>>
>>> Is there a way with which I can configure my cluster to switch to the
>>> Secondary Name Node automatically in case of the Primary Name Node failure?
>>>  When I run my current Hadoop, I see the primary and secondary both Name
>>> nodes running. I was wondering what is that Secondary Name Node for? and
>>> where is it configured?
>>> I was also wondering, is it possible to have two or more Name nodes
>>> running in the same cluster?
>>>
>>> Thanks,
>>> Regards,
>>> Panshul.
>>>
>>>
>>> On Mon, Jan 14, 2013 at 6:50 PM, <be...@gmail.com> wrote:
>>>
>>>> **
>>>> Hi Panshul,
>>>>
>>>> Usually for reliability there will be multiple dfs.name.dir configured.
>>>> Of which one would be a remote location such as a nfs mount.
>>>> So that even if the NN machine crashes on a whole you still have the fs
>>>> image and edit log in nfs mount. This can be utilized for reconstructing
>>>> the NN back again.
>>>>
>>>>
>>>> Regards
>>>> Bejoy KS
>>>>
>>>> Sent from remote device, Please excuse typos
>>>> ------------------------------
>>>> *From: * Panshul Whisper <ou...@gmail.com>
>>>> *Date: *Mon, 14 Jan 2013 17:25:08 -0800
>>>> *To: *<us...@hadoop.apache.org>
>>>> *ReplyTo: * user@hadoop.apache.org
>>>> *Subject: *hadoop namenode recovery
>>>>
>>>> Hello,
>>>>
>>>> Is there a standard way to prevent the failure of Namenode crash in a
>>>> Hadoop cluster?
>>>> or what is the standard or best practice for overcoming the Single
>>>> point failure problem of Hadoop.
>>>>
>>>> I am not ready to take chances on a production server with Hadoop 2.0
>>>> Alpha release, which claims to have solved the problem. Are there any other
>>>> things I can do to either prevent the failure or recover from the failure
>>>> in a very short time.
>>>>
>>>> Thanking You,
>>>>
>>>> --
>>>> Regards,
>>>> Ouch Whisper
>>>> 010101010101
>>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Ouch Whisper
>>> 010101010101
>>>
>>
>>
>>
>> --
>> Regards,
>> Ouch Whisper
>> 010101010101
>>
>
>
>
> --
> Regards,
> Ouch Whisper
> 010101010101
>



-- 
Thanks & Regards,
Anil Gupta

Re: hadoop namenode recovery

Posted by nagarjuna kanamarlapudi <na...@gmail.com>.
I am not sure if this is possible as in 0.2X or 1.0 releases of Hadoop .

On Tuesday, January 15, 2013, Panshul Whisper wrote:

> Hello,
> I have another idea.... regarding solving the single point failure of
> Hadoop...
> What If I have multiple Name Nodes setup and running behind a load
> balancer in the cluster. So this way I can have multiple Name Nodes at the
> same IP Address of the load balancer. Which resolves the problem of
> failure, If one Name Node goes down, others are working.
>
> Please suggest.... this is just a vague idea..!!
>
> Thanx
>
>
> On Mon, Jan 14, 2013 at 7:31 PM, Panshul Whisper <ou...@gmail.com>wrote:
>
> Hello Bejoy,
>
> Thank you for the information.
> about the Hadoop HA 2.x releases, they are in Alpha phase and I cannot use
> them for production. For my requirements, the cluster is supposed to be
> extremely Available. Availability is of highest concern. I have looked into
> different distributions as well.. such as Hortonworks, they also have the
> same problem of Single point of failure. And are waiting for Apache to
> release the Hadoop 2.x.
>
> I was wondering, if I can somehow configure two Name Nodes on the same
> Network with the same IP Address, but the second name node is redirected
> only after the failure of the primary, that might help in automatic
> resolution of this problem. all the slaves are connecting to the Name Node
> with a network alias in their /etc/hosts file.
> I am trying to implement something like this in the cluster:
> http://networksandservers.blogspot.de/2011/04/failover-clustering-i.html
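That alias approach amounts to something like the following in each slave's /etc/hosts (a sketch with assumed example addresses; note that the repoint is manual and clients cache the resolved address, so this alone is not a real failover mechanism):

```
# /etc/hosts on every slave node (assumed example addresses)
10.0.0.10   namenode    # alias used in fs.default.name; repointed to the
                        # standby's IP (e.g. 10.0.0.11) after a failure
```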
>
> please suggest if this is possible.
>
> Thanks for your time.
> Regards,
> Panshul.
>
>
> On Mon, Jan 14, 2013 at 7:11 PM, <be...@gmail.com> wrote:
>
> **
> Hi Panshul
>
> SecondaryNameNode is rather known as check point node. At periodic
> intervals it merges the editlog from NN with FS image to prevent the edit
> log from growing too large. This is its main functionality.
>
> At any point the SNN would have the latest fs image but not the updated
> edit log. If NN goes down and if you don't have an updated copy of edit log
> you can use the fsImage from SNN for restoring. In that case you lose the
> transactions in edit log.
>
> SNN is not a backup NN it is just a check point node.
>
> Two or more NN are not possible in 1.x releases but federation makes it
> possible with 2.x releases. Federation is for different purpose, you should
> be looking at hadoop HA currently with 2.x releases.
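For reference, HA in the 2.x line is configured along these lines in hdfs-site.xml (a sketch; the nameservice and host names are assumed placeholders):

```xml
<!-- Sketch of HDFS HA (2.x): one logical nameservice backed by two NNs. -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
```

Clients then address the logical nameservice ("mycluster") rather than a single NameNode host.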
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> ------------------------------
> *From: * Panshul Whisper <ou...@gmail.com>
> *Date: *Mon, 14 Jan 2013 19:04:24 -0800
> *To: *<us...@hadoop.apache.org>; <be...@gmail.com>
> *Subject: *Re: hadoop namenode recovery
>
> thank you for the reply.
>
> Is there a way with which I can configure my cluster to switch to the
> Secondary Name Node automatically in case of the Primary Name Node failure?
>  When I run my current Hadoop, I see the primary and secondary both Name
> nodes running. I was wondering what is that Secondary Name Node for? and
> where is it configured?
> I was also wondering, is it possible to have two or more Name nodes
> running in the same cluster?
>
> Thanks,
> Regards,
> Panshul.
>
>
> On Mon, Jan 14, 2013 at 6:50 PM, <be...@gmail.com> wrote:
>
> **
> Hi Panshul,
>
> Usually for reliability there will be multiple dfs.name.dir configured. Of
> which one would be a remote location such as a nfs mount.
> So that even if the NN machine crashes on a whole you still have the fs
> image and edit log in nfs mount. This can be utilized for reconstructing
> the NN back again.
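A minimal sketch of that configuration in hdfs-site.xml (1.x property name; the NFS path /mnt/nfs/namenode is an assumed example):

```xml
<!-- hdfs-site.xml: redundant NameNode metadata directories.
     The NN writes its fsimage/edits to every comma-separated directory;
     one local disk plus one soft-mounted NFS mount is a typical setup. -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/nn,/mnt/nfs/namenode</value>
</property>
```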
>
>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> ------------------------------
> *From: * Panshul Whisper <ou...@gmail.com>
> *Date: *Mon, 14 Jan 2013 17:25:08 -0800
> *To: *<
>
>

-- 
Sent from iPhone

Re: hadoop namenode recovery

Posted by anil gupta <an...@gmail.com>.
Inline

On Mon, Jan 14, 2013 at 7:48 PM, Panshul Whisper <ou...@gmail.com>wrote:

> Hello,
> I have another idea.... regarding solving the single point failure of
> Hadoop...
> What If I have multiple Name Nodes setup and running behind a load
> balancer in the cluster. So this way I can have multiple Name Nodes at the
> same IP Address of the load balancer. Which resolves the problem of
> failure, If one Name Node goes down, others are working.
>
This won't work, since the DataNodes always need to be aware of the active
NameNode to send heartbeats and for other communication.
Hortonworks as well as Cloudera both have solutions for the single point of
failure of the NameNode. You will have to analyze the solutions and pick one.

>
> Please suggest.... this is just a vague idea..!!
>
> Thanx
>
>
> On Mon, Jan 14, 2013 at 7:31 PM, Panshul Whisper <ou...@gmail.com>wrote:
>
>> Hello Bejoy,
>>
>> Thank you for the information.
>> about the Hadoop HA 2.x releases, they are in Alpha phase and I cannot
>> use them for production. For my requirements, the cluster is supposed to be
>> extremely Available. Availability is of highest concern. I have looked into
>> different distributions as well.. such as Hortonworks, they also have the
>> same problem of Single point of failure. And are waiting for Apache to
>> release the Hadoop 2.x.
>>
>> I was wondering, if I can somehow configure two Name Nodes on the same
>> Network with the same IP Address, but the second name node is redirected
>> only after the failure of the primary, that might help in automatic
>> resolution of this problem. all the slaves are connecting to the Name Node
>> with a network alias in their /etc/hosts file.
>> I am trying to implement something like this in the cluster:
>> http://networksandservers.blogspot.de/2011/04/failover-clustering-i.html
>>
>> please suggest if this is possible.
>>
>> Thanks for your time.
>> Regards,
>> Panshul.
>>
>>
>> On Mon, Jan 14, 2013 at 7:11 PM, <be...@gmail.com> wrote:
>>
>>> **
>>> Hi Panshul
>>>
>>> SecondaryNameNode is rather known as check point node. At periodic
>>> intervals it merges the editlog from NN with FS image to prevent the edit
>>> log from growing too large. This is its main functionality.
>>>
>>> At any point the SNN would have the latest fs image but not the updated
>>> edit log. If NN goes down and if you don't have an updated copy of edit log
>>> you can use the fsImage from SNN for restoring. In that case you lose the
>>> transactions in edit log.
>>>
>>> SNN is not a backup NN it is just a check point node.
>>>
>>> Two or more NN are not possible in 1.x releases but federation makes it
>>> possible with 2.x releases. Federation is for different purpose, you should
>>> be looking at hadoop HA currently with 2.x releases.
>>> Regards
>>> Bejoy KS
>>>
>>> Sent from remote device, Please excuse typos
>>> ------------------------------
>>> *From: * Panshul Whisper <ou...@gmail.com>
>>> *Date: *Mon, 14 Jan 2013 19:04:24 -0800
>>> *To: *<us...@hadoop.apache.org>; <be...@gmail.com>
>>> *Subject: *Re: hadoop namenode recovery
>>>
>>> thank you for the reply.
>>>
>>> Is there a way with which I can configure my cluster to switch to the
>>> Secondary Name Node automatically in case of the Primary Name Node failure?
>>>  When I run my current Hadoop, I see the primary and secondary both Name
>>> nodes running. I was wondering what is that Secondary Name Node for? and
>>> where is it configured?
>>> I was also wondering, is it possible to have two or more Name nodes
>>> running in the same cluster?
>>>
>>> Thanks,
>>> Regards,
>>> Panshul.
>>>
>>>
>>> On Mon, Jan 14, 2013 at 6:50 PM, <be...@gmail.com> wrote:
>>>
>>>> **
>>>> Hi Panshul,
>>>>
>>>> Usually for reliability there will be multiple dfs.name.dir configured.
>>>> Of which one would be a remote location such as a nfs mount.
>>>> So that even if the NN machine crashes on a whole you still have the fs
>>>> image and edit log in nfs mount. This can be utilized for reconstructing
>>>> the NN back again.
>>>>
>>>>
>>>> Regards
>>>> Bejoy KS
>>>>
>>>> Sent from remote device, Please excuse typos
>>>> ------------------------------
>>>> *From: * Panshul Whisper <ou...@gmail.com>
>>>> *Date: *Mon, 14 Jan 2013 17:25:08 -0800
>>>> *To: *<us...@hadoop.apache.org>
>>>> *ReplyTo: * user@hadoop.apache.org
>>>> *Subject: *hadoop namenode recovery
>>>>
>>>> Hello,
>>>>
>>>> Is there a standard way to prevent the failure of Namenode crash in a
>>>> Hadoop cluster?
>>>> or what is the standard or best practice for overcoming the Single
>>>> point failure problem of Hadoop.
>>>>
>>>> I am not ready to take chances on a production server with Hadoop 2.0
>>>> Alpha release, which claims to have solved the problem. Are there any other
>>>> things I can do to either prevent the failure or recover from the failure
>>>> in a very short time.
>>>>
>>>> Thanking You,
>>>>
>>>> --
>>>> Regards,
>>>> Ouch Whisper
>>>> 010101010101
>>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Ouch Whisper
>>> 010101010101
>>>
>>
>>
>>
>> --
>> Regards,
>> Ouch Whisper
>> 010101010101
>>
>
>
>
> --
> Regards,
> Ouch Whisper
> 010101010101
>



-- 
Thanks & Regards,
Anil Gupta

Re: hadoop namenode recovery

Posted by Panshul Whisper <ou...@gmail.com>.
Hello,
I have another idea regarding solving the single point of failure of
Hadoop:
What if I have multiple NameNodes set up and running behind a load balancer
in the cluster? This way I can have multiple NameNodes behind the single IP
address of the load balancer, which resolves the problem of failure: if one
NameNode goes down, the others keep working.

Please suggest; this is just a vague idea!

Thanks


On Mon, Jan 14, 2013 at 7:31 PM, Panshul Whisper <ou...@gmail.com>wrote:

> Hello Bejoy,
>
> Thank you for the information.
> about the Hadoop HA 2.x releases, they are in Alpha phase and I cannot use
> them for production. For my requirements, the cluster is supposed to be
> extremely Available. Availability is of highest concern. I have looked into
> different distributions as well.. such as Hortonworks, they also have the
> same problem of Single point of failure. And are waiting for Apache to
> release the Hadoop 2.x.
>
> I was wondering, if I can somehow configure two Name Nodes on the same
> Network with the same IP Address, but the second name node is redirected
> only after the failure of the primary, that might help in automatic
> resolution of this problem. all the slaves are connecting to the Name Node
> with a network alias in their /etc/hosts file.
> I am trying to implement something like this in the cluster:
> http://networksandservers.blogspot.de/2011/04/failover-clustering-i.html
>
> please suggest if this is possible.
>
> Thanks for your time.
> Regards,
> Panshul.
>
>
> On Mon, Jan 14, 2013 at 7:11 PM, <be...@gmail.com> wrote:
>
>> **
>> Hi Panshul
>>
>> SecondaryNameNode is rather known as check point node. At periodic
>> intervals it merges the editlog from NN with FS image to prevent the edit
>> log from growing too large. This is its main functionality.
>>
>> At any point the SNN would have the latest fs image but not the updated
>> edit log. If NN goes down and if you don't have an updated copy of edit log
>> you can use the fsImage from SNN for restoring. In that case you lose the
>> transactions in edit log.
>>
>> SNN is not a backup NN it is just a check point node.
>>
>> Two or more NN are not possible in 1.x releases but federation makes it
>> possible with 2.x releases. Federation is for different purpose, you should
>> be looking at hadoop HA currently with 2.x releases.
>> Regards
>> Bejoy KS
>>
>> Sent from remote device, Please excuse typos
>> ------------------------------
>> *From: * Panshul Whisper <ou...@gmail.com>
>> *Date: *Mon, 14 Jan 2013 19:04:24 -0800
>> *To: *<us...@hadoop.apache.org>; <be...@gmail.com>
>> *Subject: *Re: hadoop namenode recovery
>>
>> thank you for the reply.
>>
>> Is there a way with which I can configure my cluster to switch to the
>> Secondary Name Node automatically in case of the Primary Name Node failure?
>>  When I run my current Hadoop, I see the primary and secondary both Name
>> nodes running. I was wondering what is that Secondary Name Node for? and
>> where is it configured?
>> I was also wondering, is it possible to have two or more Name nodes
>> running in the same cluster?
>>
>> Thanks,
>> Regards,
>> Panshul.
>>
>>
>> On Mon, Jan 14, 2013 at 6:50 PM, <be...@gmail.com> wrote:
>>
>>> **
>>> Hi Panshul,
>>>
>>> Usually for reliability there will be multiple dfs.name.dir configured.
>>> Of which one would be a remote location such as a nfs mount.
>>> So that even if the NN machine crashes on a whole you still have the fs
>>> image and edit log in nfs mount. This can be utilized for reconstructing
>>> the NN back again.
>>>
>>>
>>> Regards
>>> Bejoy KS
>>>
>>> Sent from remote device, Please excuse typos
>>> ------------------------------
>>> *From: * Panshul Whisper <ou...@gmail.com>
>>> *Date: *Mon, 14 Jan 2013 17:25:08 -0800
>>> *To: *<us...@hadoop.apache.org>
>>> *ReplyTo: * user@hadoop.apache.org
>>> *Subject: *hadoop namenode recovery
>>>
>>> Hello,
>>>
>>> Is there a standard way to prevent the failure of Namenode crash in a
>>> Hadoop cluster?
>>> or what is the standard or best practice for overcoming the Single point
>>> failure problem of Hadoop.
>>>
>>> I am not ready to take chances on a production server with Hadoop 2.0
>>> Alpha release, which claims to have solved the problem. Are there any other
>>> things I can do to either prevent the failure or recover from the failure
>>> in a very short time.
>>>
>>> Thanking You,
>>>
>>> --
>>> Regards,
>>> Ouch Whisper
>>> 010101010101
>>>
>>
>>
>>
>> --
>> Regards,
>> Ouch Whisper
>> 010101010101
>>
>
>
>
> --
> Regards,
> Ouch Whisper
> 010101010101
>



-- 
Regards,
Ouch Whisper
010101010101

Re: hadoop namenode recovery

Posted by Panshul Whisper <ou...@gmail.com>.
Hello,

I have another idea for addressing Hadoop's single point of failure:
what if I set up multiple Name Nodes running behind a load balancer in
the cluster? That way all the Name Nodes would sit behind the load
balancer's single IP address, and if one Name Node goes down, the others
keep serving requests.

Please suggest; this is just a vague idea!

Thanks


On Mon, Jan 14, 2013 at 7:31 PM, Panshul Whisper <ou...@gmail.com>wrote:

> Hello Bejoy,
>
> Thank you for the information.
> about the Hadoop HA 2.x releases, they are in Alpha phase and I cannot use
> them for production. For my requirements, the cluster is supposed to be
> extremely Available. Availability is of highest concern. I have looked into
> different distributions as well.. such as Hortonworks, they also have the
> same problem of Single point of failure. And are waiting for Apache to
> release the Hadoop 2.x.
>
> I was wondering, if I can somehow configure two Name Nodes on the same
> Network with the same IP Address, but the second name node is redirected
> only after the failure of the primary, that might help in automatic
> resolution of this problem. all the slaves are connecting to the Name Node
> with a network alias in their /etc/hosts file.
> I am trying to implement something like this in the cluster:
> http://networksandservers.blogspot.de/2011/04/failover-clustering-i.html
>
> please suggest if this is possible.
>
> Thanks for your time.
> Regards,
> Panshul.
>
>
> On Mon, Jan 14, 2013 at 7:11 PM, <be...@gmail.com> wrote:
>
>> **
>> Hi Panshul
>>
>> SecondaryNameNode is rather known as check point node. At periodic
>> intervals it merges the editlog from NN with FS image to prevent the edit
>> log from growing too large. This is its main functionality.
>>
>> At any point the SNN would have the latest fs image but not the updated
>> edit log. If NN goes down and if you don't have an updated copy of edit log
>> you can use the fsImage from SNN for restoring. In that case you lose the
>> transactions in edit log.
>>
>> SNN is not a backup NN it is just a check point node.
>>
>> Two or more NN are not possible in 1.x releases but federation makes it
>> possible with 2.x releases. Federation is for different purpose, you should
>> be looking at hadoop HA currently with 2.x releases.
>> Regards
>> Bejoy KS
>>
>> Sent from remote device, Please excuse typos
>> ------------------------------
>> *From: * Panshul Whisper <ou...@gmail.com>
>> *Date: *Mon, 14 Jan 2013 19:04:24 -0800
>> *To: *<us...@hadoop.apache.org>; <be...@gmail.com>
>> *Subject: *Re: hadoop namenode recovery
>>
>> thank you for the reply.
>>
>> Is there a way with which I can configure my cluster to switch to the
>> Secondary Name Node automatically in case of the Primary Name Node failure?
>>  When I run my current Hadoop, I see the primary and secondary both Name
>> nodes running. I was wondering what is that Secondary Name Node for? and
>> where is it configured?
>> I was also wondering, is it possible to have two or more Name nodes
>> running in the same cluster?
>>
>> Thanks,
>> Regards,
>> Panshul.
>>
>>
>> On Mon, Jan 14, 2013 at 6:50 PM, <be...@gmail.com> wrote:
>>
>>> **
>>> Hi Panshul,
>>>
>>> Usually for reliability there will be multiple dfs.name.dir configured.
>>> Of which one would be a remote location such as a nfs mount.
>>> So that even if the NN machine crashes on a whole you still have the fs
>>> image and edit log in nfs mount. This can be utilized for reconstructing
>>> the NN back again.
>>>
>>>
>>> Regards
>>> Bejoy KS
>>>
>>> Sent from remote device, Please excuse typos
>>> ------------------------------
>>> *From: * Panshul Whisper <ou...@gmail.com>
>>> *Date: *Mon, 14 Jan 2013 17:25:08 -0800
>>> *To: *<us...@hadoop.apache.org>
>>> *ReplyTo: * user@hadoop.apache.org
>>> *Subject: *hadoop namenode recovery
>>>
>>> Hello,
>>>
>>> Is there a standard way to prevent the failure of Namenode crash in a
>>> Hadoop cluster?
>>> or what is the standard or best practice for overcoming the Single point
>>> failure problem of Hadoop.
>>>
>>> I am not ready to take chances on a production server with Hadoop 2.0
>>> Alpha release, which claims to have solved the problem. Are there any other
>>> things I can do to either prevent the failure or recover from the failure
>>> in a very short time.
>>>
>>> Thanking You,
>>>
>>> --
>>> Regards,
>>> Ouch Whisper
>>> 010101010101
>>>
>>
>>
>>
>> --
>> Regards,
>> Ouch Whisper
>> 010101010101
>>
>
>
>
> --
> Regards,
> Ouch Whisper
> 010101010101
>



-- 
Regards,
Ouch Whisper
010101010101

Re: hadoop namenode recovery

Posted by Panshul Whisper <ou...@gmail.com>.
Hello Bejoy,

Thank you for the information.

Regarding the Hadoop HA 2.x releases: they are in the Alpha phase, so I
cannot use them in production. For my requirements the cluster is supposed
to be extremely available; availability is the highest concern. I have
looked into different distributions as well, such as Hortonworks, but they
have the same single point of failure and are also waiting for Apache to
release Hadoop 2.x.

I was wondering if I can somehow configure two Name Nodes on the same
network with the same IP address, where traffic is redirected to the second
Name Node only after the failure of the primary; that might help resolve
this problem automatically. All the slaves connect to the Name Node through
a network alias in their /etc/hosts file.
I am trying to implement something like this in the cluster:
http://networksandservers.blogspot.de/2011/04/failover-clustering-i.html
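For reference, a sketch of the alias-based indirection described above (the
IP address, alias name, and port below are made up for illustration); a
failover tool such as Heartbeat/Pacemaker would then be responsible for
moving the virtual IP between the primary and standby NameNode machines:

```text
# /etc/hosts on every node in the cluster (example address and alias)
192.168.1.100   nn-alias     # virtual IP, moved between NN hosts on failover

# core-site.xml on every node then points at the alias, never a raw host:
#   <property>
#     <name>fs.default.name</name>
#     <value>hdfs://nn-alias:8020</value>
#   </property>
```

Note that this only redirects clients; the standby NN would still need an
up-to-date fsimage and edit log (e.g. via a shared dfs.name.dir on NFS)
before it could take over.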

Please suggest if this is possible.

Thanks for your time.
Regards,
Panshul.


On Mon, Jan 14, 2013 at 7:11 PM, <be...@gmail.com> wrote:

> **
> Hi Panshul
>
> SecondaryNameNode is rather known as check point node. At periodic
> intervals it merges the editlog from NN with FS image to prevent the edit
> log from growing too large. This is its main functionality.
>
> At any point the SNN would have the latest fs image but not the updated
> edit log. If NN goes down and if you don't have an updated copy of edit log
> you can use the fsImage from SNN for restoring. In that case you lose the
> transactions in edit log.
>
> SNN is not a backup NN it is just a check point node.
>
> Two or more NN are not possible in 1.x releases but federation makes it
> possible with 2.x releases. Federation is for different purpose, you should
> be looking at hadoop HA currently with 2.x releases.
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> ------------------------------
> *From: * Panshul Whisper <ou...@gmail.com>
> *Date: *Mon, 14 Jan 2013 19:04:24 -0800
> *To: *<us...@hadoop.apache.org>; <be...@gmail.com>
> *Subject: *Re: hadoop namenode recovery
>
> thank you for the reply.
>
> Is there a way with which I can configure my cluster to switch to the
> Secondary Name Node automatically in case of the Primary Name Node failure?
> When I run my current Hadoop, I see the primary and secondary both Name
> nodes running. I was wondering what is that Secondary Name Node for? and
> where is it configured?
> I was also wondering, is it possible to have two or more Name nodes
> running in the same cluster?
>
> Thanks,
> Regards,
> Panshul.
>
>
> On Mon, Jan 14, 2013 at 6:50 PM, <be...@gmail.com> wrote:
>
>> **
>> Hi Panshul,
>>
>> Usually for reliability there will be multiple dfs.name.dir configured.
>> Of which one would be a remote location such as a nfs mount.
>> So that even if the NN machine crashes on a whole you still have the fs
>> image and edit log in nfs mount. This can be utilized for reconstructing
>> the NN back again.
>>
>>
>> Regards
>> Bejoy KS
>>
>> Sent from remote device, Please excuse typos
>> ------------------------------
>> *From: * Panshul Whisper <ou...@gmail.com>
>> *Date: *Mon, 14 Jan 2013 17:25:08 -0800
>> *To: *<us...@hadoop.apache.org>
>> *ReplyTo: * user@hadoop.apache.org
>> *Subject: *hadoop namenode recovery
>>
>> Hello,
>>
>> Is there a standard way to prevent the failure of Namenode crash in a
>> Hadoop cluster?
>> or what is the standard or best practice for overcoming the Single point
>> failure problem of Hadoop.
>>
>> I am not ready to take chances on a production server with Hadoop 2.0
>> Alpha release, which claims to have solved the problem. Are there any other
>> things I can do to either prevent the failure or recover from the failure
>> in a very short time.
>>
>> Thanking You,
>>
>> --
>> Regards,
>> Ouch Whisper
>> 010101010101
>>
>
>
>
> --
> Regards,
> Ouch Whisper
> 010101010101
>



-- 
Regards,
Ouch Whisper
010101010101

Re: hadoop namenode recovery

Posted by be...@gmail.com.
Hi Panshul

The SecondaryNameNode is better described as a checkpoint node. At periodic intervals it merges the edit log from the NN into the fsimage, which prevents the edit log from growing too large. This is its main functionality.

At any point the SNN has the latest checkpointed fsimage but not the up-to-date edit log. If the NN goes down and you don't have an up-to-date copy of the edit log, you can restore from the SNN's fsimage; in that case you lose the transactions recorded in the edit log since the last checkpoint.

The SNN is not a backup NN; it is just a checkpoint node.

Two or more NNs are not possible in the 1.x releases, but federation makes it possible in the 2.x releases. Federation serves a different purpose, though; for failover you should be looking at Hadoop HA, currently in the 2.x releases.
Regards 
Bejoy KS

Sent from remote device, Please excuse typos
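To make the redundant-metadata setup from the quoted message below concrete:
multiple dfs.name.dir entries are configured in hdfs-site.xml on the
NameNode. A minimal sketch for a 1.x cluster (the paths are examples only):

```xml
<!-- hdfs-site.xml: the NN writes its fsimage and edit log to every
     directory in this comma-separated list, treating them as redundant
     copies. A common pattern is one local disk plus one NFS mount; if a
     directory becomes unwritable, the NN drops it from its write set and
     carries on with the remaining ones. -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/name,/mnt/nfs/dfs/name</value>
</property>
```

The NFS export should be soft-mounted, so that an unresponsive NFS server
times out instead of hanging the NameNode on every edit-log sync.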

-----Original Message-----
From: Panshul Whisper <ou...@gmail.com>
Date: Mon, 14 Jan 2013 19:04:24 
To: <us...@hadoop.apache.org>; <be...@gmail.com>
Subject: Re: hadoop namenode recovery

thank you for the reply.

Is there a way with which I can configure my cluster to switch to the
Secondary Name Node automatically in case of the Primary Name Node failure?
When I run my current Hadoop, I see the primary and secondary both Name
nodes running. I was wondering what is that Secondary Name Node for? and
where is it configured?
I was also wondering, is it possible to have two or more Name nodes running
in the same cluster?

Thanks,
Regards,
Panshul.


On Mon, Jan 14, 2013 at 6:50 PM, <be...@gmail.com> wrote:

> **
> Hi Panshul,
>
> Usually for reliability there will be multiple dfs.name.dir configured. Of
> which one would be a remote location such as a nfs mount.
> So that even if the NN machine crashes on a whole you still have the fs
> image and edit log in nfs mount. This can be utilized for reconstructing
> the NN back again.
>
>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> ------------------------------
> *From: * Panshul Whisper <ou...@gmail.com>
> *Date: *Mon, 14 Jan 2013 17:25:08 -0800
> *To: *<us...@hadoop.apache.org>
> *ReplyTo: * user@hadoop.apache.org
> *Subject: *hadoop namenode recovery
>
> Hello,
>
> Is there a standard way to prevent the failure of Namenode crash in a
> Hadoop cluster?
> or what is the standard or best practice for overcoming the Single point
> failure problem of Hadoop.
>
> I am not ready to take chances on a production server with Hadoop 2.0
> Alpha release, which claims to have solved the problem. Are there any other
> things I can do to either prevent the failure or recover from the failure
> in a very short time.
>
> Thanking You,
>
> --
> Regards,
> Ouch Whisper
> 010101010101
>



-- 
Regards,
Ouch Whisper
010101010101


Re: hadoop namenode recovery

Posted by be...@gmail.com.
Hi Panshul

SecondaryNameNode is rather known as check point node. At periodic intervals it merges the editlog from NN with FS image to prevent the edit log from growing too large. This is its main functionality.

At any point the SNN would have the latest fs image but not the updated edit log. If NN goes down and if you don't have an updated copy of edit log you can use the fsImage from SNN for restoring. In that case you lose the transactions in edit log.

SNN is not a backup NN it is just a check point node.
 
Two or more NN are not possible in 1.x releases but federation makes it possible with 2.x releases. Federation is for different purpose, you should be looking at hadoop HA currently with 2.x releases. 
Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-----Original Message-----
From: Panshul Whisper <ou...@gmail.com>
Date: Mon, 14 Jan 2013 19:04:24 
To: <us...@hadoop.apache.org>; <be...@gmail.com>
Subject: Re: hadoop namenode recovery

thank you for the reply.

Is there a way with which I can configure my cluster to switch to the
Secondary Name Node automatically in case of the Primary Name Node failure?
When I run my current Hadoop, I see the primary and secondary both Name
nodes running. I was wondering what is that Secondary Name Node for? and
where is it configured?
I was also wondering, is it possible to have two or more Name nodes running
in the same cluster?

Thanks,
Regards,
Panshul.


On Mon, Jan 14, 2013 at 6:50 PM, <be...@gmail.com> wrote:

> **
> Hi Panshul,
>
> Usually for reliability there will be multiple dfs.name.dir configured. Of
> which one would be a remote location such as a nfs mount.
> So that even if the NN machine crashes on a whole you still have the fs
> image and edit log in nfs mount. This can be utilized for reconstructing
> the NN back again.
>
>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> ------------------------------
> *From: * Panshul Whisper <ou...@gmail.com>
> *Date: *Mon, 14 Jan 2013 17:25:08 -0800
> *To: *<us...@hadoop.apache.org>
> *ReplyTo: * user@hadoop.apache.org
> *Subject: *hadoop namenode recovery
>
> Hello,
>
> Is there a standard way to prevent the failure of Namenode crash in a
> Hadoop cluster?
> or what is the standard or best practice for overcoming the Single point
> failure problem of Hadoop.
>
> I am not ready to take chances on a production server with Hadoop 2.0
> Alpha release, which claims to have solved the problem. Are there any other
> things I can do to either prevent the failure or recover from the failure
> in a very short time.
>
> Thanking You,
>
> --
> Regards,
> Ouch Whisper
> 010101010101
>



-- 
Regards,
Ouch Whisper
010101010101


Re: hadoop namenode recovery

Posted by be...@gmail.com.
Hi Panshul

SecondaryNameNode is rather known as check point node. At periodic intervals it merges the editlog from NN with FS image to prevent the edit log from growing too large. This is its main functionality.

At any point the SNN would have the latest fs image but not the updated edit log. If NN goes down and if you don't have an updated copy of edit log you can use the fsImage from SNN for restoring. In that case you lose the transactions in edit log.

SNN is not a backup NN it is just a check point node.
 
Two or more NN are not possible in 1.x releases but federation makes it possible with 2.x releases. Federation is for different purpose, you should be looking at hadoop HA currently with 2.x releases. 
Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-----Original Message-----
From: Panshul Whisper <ou...@gmail.com>
Date: Mon, 14 Jan 2013 19:04:24 
To: <us...@hadoop.apache.org>; <be...@gmail.com>
Subject: Re: hadoop namenode recovery

thank you for the reply.

Is there a way with which I can configure my cluster to switch to the
Secondary Name Node automatically in case of the Primary Name Node failure?
When I run my current Hadoop, I see the primary and secondary both Name
nodes running. I was wondering what is that Secondary Name Node for? and
where is it configured?
I was also wondering, is it possible to have two or more Name nodes running
in the same cluster?

Thanks,
Regards,
Panshul.


On Mon, Jan 14, 2013 at 6:50 PM, <be...@gmail.com> wrote:

> **
> Hi Panshul,
>
> Usually for reliability there will be multiple dfs.name.dir configured. Of
> which one would be a remote location such as a nfs mount.
> So that even if the NN machine crashes on a whole you still have the fs
> image and edit log in nfs mount. This can be utilized for reconstructing
> the NN back again.
>
>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> ------------------------------
> *From: * Panshul Whisper <ou...@gmail.com>
> *Date: *Mon, 14 Jan 2013 17:25:08 -0800
> *To: *<us...@hadoop.apache.org>
> *ReplyTo: * user@hadoop.apache.org
> *Subject: *hadoop namenode recovery
>
> Hello,
>
> Is there a standard way to prevent the failure of Namenode crash in a
> Hadoop cluster?
> or what is the standard or best practice for overcoming the Single point
> failure problem of Hadoop.
>
> I am not ready to take chances on a production server with Hadoop 2.0
> Alpha release, which claims to have solved the problem. Are there any other
> things I can do to either prevent the failure or recover from the failure
> in a very short time.
>
> Thanking You,
>
> --
> Regards,
> Ouch Whisper
> 010101010101
>



-- 
Regards,
Ouch Whisper
010101010101


Re: hadoop namenode recovery

Posted by Panshul Whisper <ou...@gmail.com>.
Thank you for the reply.

Is there a way to configure my cluster to switch over to the Secondary Name Node automatically in case of a Primary Name Node failure?
When I run my current Hadoop, I see both the primary and secondary Name Nodes running. What is that Secondary Name Node for, and where is it configured?
I was also wondering: is it possible to have two or more Name Nodes running in the same cluster?

Thanks,
Regards,
Panshul.


On Mon, Jan 14, 2013 at 6:50 PM, <be...@gmail.com> wrote:

> **
> Hi Panshul,
>
> Usually for reliability there will be multiple dfs.name.dir configured. Of
> which one would be a remote location such as a nfs mount.
> So that even if the NN machine crashes on a whole you still have the fs
> image and edit log in nfs mount. This can be utilized for reconstructing
> the NN back again.
>
>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> ------------------------------
> *From: * Panshul Whisper <ou...@gmail.com>
> *Date: *Mon, 14 Jan 2013 17:25:08 -0800
> *To: *<us...@hadoop.apache.org>
> *ReplyTo: * user@hadoop.apache.org
> *Subject: *hadoop namenode recovery
>
> Hello,
>
> Is there a standard way to prevent the failure of Namenode crash in a
> Hadoop cluster?
> or what is the standard or best practice for overcoming the Single point
> failure problem of Hadoop.
>
> I am not ready to take chances on a production server with Hadoop 2.0
> Alpha release, which claims to have solved the problem. Are there any other
> things I can do to either prevent the failure or recover from the failure
> in a very short time.
>
> Thanking You,
>
> --
> Regards,
> Ouch Whisper
> 010101010101
>



-- 
Regards,
Ouch Whisper
010101010101


Re: hadoop namenode recovery

Posted by be...@gmail.com.
Hi Panshul,

Usually, for reliability, multiple dfs.name.dir locations are configured, of which one is a remote location such as an NFS mount.
That way, even if the NN machine crashes entirely, you still have the fs image and edit log on the NFS mount. These can be used to reconstruct the NN.
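(As a sketch, the corresponding hdfs-site.xml entry on Hadoop 1.x would look something like the following; the directory paths are illustrative. dfs.name.dir takes a comma-separated list, and the NN writes the fsimage and edits to every directory in the list.)

```xml
<property>
  <name>dfs.name.dir</name>
  <!-- two local volumes plus one NFS mount, all kept in sync by the NN -->
  <value>/data/1/dfs/nn,/data/2/dfs/nn,/mnt/nfs/dfs/nn</value>
</property>
```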



Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-----Original Message-----
From: Panshul Whisper <ou...@gmail.com>
Date: Mon, 14 Jan 2013 17:25:08 
To: <us...@hadoop.apache.org>
Reply-To: user@hadoop.apache.org
Subject: hadoop namenode recovery

Hello,

Is there a standard way to prevent the failure of Namenode crash in a
Hadoop cluster?
or what is the standard or best practice for overcoming the Single point
failure problem of Hadoop.

I am not ready to take chances on a production server with Hadoop 2.0 Alpha
release, which claims to have solved the problem. Are there any other
things I can do to either prevent the failure or recover from the failure
in a very short time.

Thanking You,

-- 
Regards,
Ouch Whisper
010101010101


Re: hadoop namenode recovery

Posted by Harsh J <ha...@cloudera.com>.
It's very rare to observe an NN crash due to a software bug in production.
Most of the time it's a hardware fault you should worry about.

On 1.x, or any non-HA-carrying release, the best you can do to safeguard
against a total loss is to configure redundant disk volumes, preferably
with one on a dedicated remote NFS mount. This way the NN is recoverable
after the node goes down, since you can retrieve a current copy of the
metadata from another machine (i.e. via the NFS mount) and set up a new
node to replace the old NN and continue along.
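(A hedged example of what such a dedicated NFS mount might look like; the filer hostname, export and mount point are made up. The key detail, discussed later in this thread, is soft-mounting with a bounded timeout, so that a dead NFS server cannot hang the NN; the NN then ejects the failed directory from its write list and carries on with the remaining dfs.name.dir volumes.)

```shell
# Soft-mount the metadata export: writes fail after (retrans * timeo)
# rather than blocking forever, as a hard mount would.
mount -t nfs -o soft,intr,timeo=30,retrans=3 filer-host:/export/nn /mnt/nfs/dfs/nn
```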

A load balancer will not work, as the NN is not a simple webserver - it
maintains state which you cannot sync. We wrote the HA-HDFS features to
address the very concern you have.

If you want true, painless HA, branch-2 is your best bet at this point. The
upcoming 2.0.3 release should include the QJM-based HA feature, which is
painless to set up, very reliable to use (compared to the other options),
and works with commodity-level hardware. FWIW, we (my team and I) have been
supporting several users and customers who are running the 2.x-based HA in
production and other types of environments, and it has been very stable in
our experience. There are also some folks in the community running 2.x-based
HDFS for HA and other features.
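(To give a flavour of the QJM-based setup, a minimal hdfs-site.xml sketch follows; the nameservice name, NameNode IDs, and JournalNode hosts are placeholders, and a complete setup also needs the per-NN RPC/HTTP addresses and fencing configuration.)

```xml
<!-- Logical nameservice spanning the active and standby NNs -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<!-- Quorum Journal Manager: edits are written to a quorum of JournalNodes -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
</property>
<!-- Lets clients discover and fail over to the active NN -->
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```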


On Tue, Jan 15, 2013 at 6:55 AM, Panshul Whisper <ou...@gmail.com> wrote:

> Hello,
>
> Is there a standard way to prevent the failure of Namenode crash in a
> Hadoop cluster?
> or what is the standard or best practice for overcoming the Single point
> failure problem of Hadoop.
>
> I am not ready to take chances on a production server with Hadoop 2.0
> Alpha release, which claims to have solved the problem. Are there any other
> things I can do to either prevent the failure or recover from the failure
> in a very short time.
>
> Thanking You,
>
> --
> Regards,
> Ouch Whisper
> 010101010101
>



-- 
Harsh J
