Posted to common-user@hadoop.apache.org by Cagdas Gerede <ca...@gmail.com> on 2008/03/13 21:50:56 UTC

Fault Tolerance: Inquiry for approaches to solve single point of failure when name node fails

I have a question. As we know, the name node is a single point of failure.
In a production environment, I imagine the name node would run in a data
center. If that data center fails, how would you put a new name node in
place in another data center so that it can take over with minimum
interruption?

I was wondering if anyone has any experience/ideas/comments on this.

Thanks

-Cagdas

Re: Fault Tolerance: Inquiry for approaches to solve single point of failure when name node fails

Posted by Eddie C <ed...@gmail.com>.

According to the documentation, you can configure the name node to write
its data to multiple places via the configuration file.

I would write the data to two separate directly attached disk arrays and
attach two servers to both arrays, with the second server acting as a
cold or hot spare. I would handle the failover with some type of
clustering software, such as Linux-HA.
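
For illustration, a minimal sketch of what such a configuration might look
like. It assumes the dfs.name.dir property in hadoop-site.xml, which takes a
comma-separated list of directories; the name node then keeps a copy of its
name table in each one. The two mount points are hypothetical stand-ins for
the two disk arrays.

```xml
<!-- hadoop-site.xml fragment (sketch): both paths are hypothetical -->
<property>
  <name>dfs.name.dir</name>
  <!-- one directory per disk array; the name table is written to both -->
  <value>/mnt/array1/dfs/name,/mnt/array2/dfs/name</value>
</property>
```

With a layout like this, the spare server could start a name node from
whichever array is still readable.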

On Thu, Mar 13, 2008 at 5:05 PM, Cagdas Gerede <ca...@gmail.com> wrote:
> > If your data center fails, then you probably have to worry more about how to get your data.
>
>  I am assuming multiple data centers. I know that, thanks to HDFS
>  replication, the data in the other data center will be enough.
>  However, as far as I can see, HDFS has no support for replicating
>  the namenode.
>  Is this true?
>  If there is no automated support, and I need to do this replication
>  with custom code or manual intervention,
>  what are the steps?
>
>  Any help is appreciated.
>
>  Cagdas
>

Re: Fault Tolerance: Inquiry for approaches to solve single point of failure when name node fails

Posted by Ted Dunning <td...@veoh.com>.

Using HDFS to replicate data between data centers is problematic because
the map-reduce programs will tend to send data across the link between the
data centers (and a cluster can generate a LOT of traffic).

My guess is that you want two clusters with a scripted mirroring capability.
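
One way the scripted mirroring might be sketched is a small wrapper around
"hadoop distcp", which copies data between clusters. This is only an
illustration under stated assumptions: the host names, port, and paths are
hypothetical, the helper names are made up for this example, and the hadoop
command is assumed to be on the PATH.

```python
import subprocess


def build_mirror_command(src_namenode, dst_namenode, path):
    """Build a `hadoop distcp` command that copies `path` between clusters.

    `src_namenode` and `dst_namenode` are host:port strings (hypothetical).
    """
    src = "hdfs://%s%s" % (src_namenode, path)
    dst = "hdfs://%s%s" % (dst_namenode, path)
    return ["hadoop", "distcp", src, dst]


def mirror(src_namenode, dst_namenode, paths, run=subprocess.call):
    """Run one distcp job per path and return the paths whose copy failed.

    `run` takes a command list and returns an exit code; it defaults to
    subprocess.call so that a stub can be swapped in for dry runs.
    """
    failed = []
    for path in paths:
        cmd = build_mirror_command(src_namenode, dst_namenode, path)
        if run(cmd) != 0:
            failed.append(path)
    return failed


if __name__ == "__main__":
    # Hypothetical clusters and directories; adjust before running.
    bad = mirror("nn-a.example.com:9000", "nn-b.example.com:9000",
                 ["/data/logs", "/data/reports"])
    if bad:
        print("mirroring failed for: %s" % ", ".join(bad))
```

A script like this could be run periodically (from cron, for example) so
that the second cluster lags the first by at most one interval.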

On 3/13/08 2:05 PM, "Cagdas Gerede" <ca...@gmail.com> wrote:

>> If your data center fails, then you probably have to worry more about how to
>> get your data.
> 
> I am assuming multiple data centers. I know that, thanks to HDFS
> replication, the data in the other data center will be enough.
> However, as far as I can see, HDFS has no support for replicating
> the namenode.
> Is this true?
> If there is no automated support, and I need to do this replication
> with custom code or manual intervention,
> what are the steps?
> 
> Any help is appreciated.
> 
> Cagdas

Re: Fault Tolerance: Inquiry for approaches to solve single point of failure when name node fails

Posted by Cagdas Gerede <ca...@gmail.com>.

> If your data center fails, then you probably have to worry more about how to get your data.

I am assuming multiple data centers. I know that, thanks to HDFS
replication, the data in the other data center will be enough.
However, as far as I can see, HDFS has no support for replicating
the namenode.
Is this true?
If there is no automated support, and I need to do this replication
with custom code or manual intervention,
what are the steps?

Any help is appreciated.

Cagdas

Re: Fault Tolerance: Inquiry for approaches to solve single point of failure when name node fails

Posted by Ted Dunning <td...@veoh.com>.

If your data center fails, then you probably have to worry more about how to
get your data.  The namenode state is far smaller and easier to move around
and, besides, if you replicate the data elsewhere, you probably already have
a namenode next to the replicated data.

Namenode failure (without the data center failure) IS a real scenario that
needs to be addressed if losing your cluster is a terribly bad thing.  There
are numerous provisions for that, but there isn't yet a truly HA solution.

On 3/13/08 1:50 PM, "Cagdas Gerede" <ca...@gmail.com> wrote:

> I have a question. As we know, the name node is a single point of failure.
> In a production environment, I imagine the name node would run in a data
> center. If that data center fails, how would you put a new name node in
> place in another data center so that it can take over with minimum
> interruption?
> 
> I was wondering if anyone has any experience/ideas/comments on this.
> 
> Thanks
> 
> -Cagdas