Posted to common-user@hadoop.apache.org by Stas Oskin <st...@gmail.com> on 2009/08/06 19:46:12 UTC

HADOOP-4539 question

Hi.

I checked this ticket and I like what I found.

I had a question about it, and hoped someone could answer it:

If I have a NN and a BN, and the NN fails, how will the DFS clients know how
to connect to the new IP?

Will it be a config-level setting?

Or does it need to be achieved via external Linux HA scripts?

Thanks!

Re: HADOOP-4539 question

Posted by Todd Lipcon <to...@cloudera.com>.
On Mon, Sep 21, 2009 at 7:50 AM, Edward Capriolo <ed...@gmail.com> wrote:

>
>
> >Storing the only copy of the NN data into NFS would make the NFS server an
> > SPOF, and you still need to solve the problems of
>
> @Steve correct. It is hair-splitting, but Stas asked if there was an
> approach that did not use DRBD. Linux-HA + NFS, or Linux-HA plus SAN,
> does not use DRBD. Implicitly, I think he meant: is there any approach
> that does not rely on "shared storage"? But DRBD and Linux-HA are
> separate entities, although they are often employed together.
>

Well, if you want to look at it another way, DRBD is just shared storage
that happens to work with a pair of nodes rather than an external device.
It's still a shared block device that's synchronized, right?

When discussing HA it's easy to conflate the failover mechanism and the
shared storage mechanism. Linux-HA is just a failover mechanism, with
configuration that can determine which node gets to be the master, and
hopefully enough magic that you won't have two of them (split brain
syndrome). When the standby namenode needs to become master, it has to get
the data somehow, and that's where you need some shared storage. As people
above mentioned, DRBD is but one of several viable options.
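
For the curious, the DRBD side of such a pair is just a resource definition
replicating the metadata partition between the two masters. A minimal sketch
(hostnames, disks, and addresses here are invented for illustration):

# /etc/drbd.conf - replicate the NN metadata partition between the two masters
resource nn-meta {
  protocol C;              # synchronous: a write completes only when both nodes have it
  on nn-primary {
    device    /dev/drbd0;  # the replicated device the NN metadata dir lives on
    disk      /dev/sdb1;   # local backing disk
    address   10.0.0.1:7788;
    meta-disk internal;
  }
  on nn-standby {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}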

Regarding vanilla NFS for the shared storage, I wouldn't consider it a SPOF
- since the namenode can sync its edit log to multiple volumes, you can have
it write to its local disk as well as the NFS server. If the NFS server goes
down, the NN keeps running. If the NN goes down, the NFS server still has
the edit log. It's only if both of them go down that you are out of luck. If
both go down it's probably because your datacenter lost power, and then
you're screwed anyway, to put it bluntly :)
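
To make that concrete, a sketch of the multi-volume setup (assuming the NFS
export is mounted at /mnt/nfs/namenode; dfs.name.dir is the 0.20-era property
name):

<!-- hdfs-site.xml: the NN writes its image and edit log to every listed dir -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/nn,/mnt/nfs/namenode</value>
</property>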

-Todd

Re: HADOOP-4539 question

Posted by Edward Capriolo <ed...@gmail.com>.
On Mon, Sep 21, 2009 at 6:03 AM, Steve Loughran <st...@apache.org> wrote:
> Edward Capriolo wrote:
>
>>
>> Just for reference: Linux-HA and some other tools deal with split-brain
>> decisions by requiring a quorum. A quorum involves having a third party,
>> or having more than 50% of the nodes agree.
>>
>> An issue with Linux-HA and Hadoop is that Linux-HA is only
>> supported/tested on clusters of up to 16 nodes.
>
> Usually odd numbers; stops a 50%-50% split.
>
>> That is not a hard
>> limit, but no one claims to have done it on 1000 or so nodes.
>
> If the voting algorithm requires communication with every node then there is
> an implicit limit.
>
>
>> You
>> could just install Linux-HA on a random sampling of 10 nodes across
>> your network. That would in theory create an effective quorum.
>
>
>
>>
>> There are other HA approaches that do not involve DRBD. One is to store
>> your name node table on a SAN or an NFS server. Terracotta is another
>> option that you might want to look at. But no, at the moment there is
>> no fail-over built into Hadoop.
>
> Storing the only copy of the NN data into NFS would make the NFS server an
> SPOF, and you still need to solve the problems of
> - detecting NN failure and deciding who else is in charge
> - making another node the NN by giving it the same hostname/IPAddr as the
> one that went down.
>
> That is what the Linux-HA stuff promises
>
> -steve
>

>> An issue with Linux-HA and Hadoop is that Linux-HA is only
>> supported/tested on clusters of up to 16 nodes.
>
> Usually odd numbers; stops a 50%-50% split.

@Steve correct. I was getting at the fact that unless you have your HA
cluster manager on every node in the cluster, it may be making a
decision that is correct for its configuration, but not the optimal
decision for the cluster. The only way for Linux-HA to make optimal
decisions is to install it on every node in the Hadoop cluster.

Linux-HA has been tested on more than 16 nodes. I had a thread about
this on the Linux-HA mailing list: 16 is not a hard limit, but no one
has attempted much larger, and their target is definitely not in the
thousands.

>Storing the only copy of the NN data into NFS would make the NFS server an
> SPOF, and you still need to solve the problems of

@Steve correct. It is hair-splitting, but Stas asked if there was an
approach that did not use DRBD. Linux-HA + NFS, or Linux-HA plus SAN,
does not use DRBD. Implicitly, I think he meant: is there any approach
that does not rely on "shared storage"? But DRBD and Linux-HA are
separate entities, although they are often employed together.

Re: HADOOP-4539 question

Posted by Stas Oskin <st...@gmail.com>.
Hi.

Just wanted to share my thoughts on this:

So far DRBD looks like a good enough solution. My only problem is that it
requires me to operate dedicated machines (physical or virtual) for the
Hadoop Namenode, in an active/passive configuration.

I'm interested in HADOOP-4539 mostly because it would enable me to run the
Namenode together with other services, and it could open a way to
Active/Active HA in Hadoop (as the next iteration of 4539).

Regards.

2009/9/21 Steve Loughran <st...@apache.org>

> Edward Capriolo wrote:
>
>
>> Just for reference: Linux-HA and some other tools deal with split-brain
>> decisions by requiring a quorum. A quorum involves having a third party,
>> or having more than 50% of the nodes agree.
>>
>> An issue with Linux-HA and Hadoop is that Linux-HA is only
>> supported/tested on clusters of up to 16 nodes.
>>
>
> Usually odd numbers; stops a 50%-50% split.
>
>  That is not a hard
>> limit, but no one claims to have done it on 1000 or so nodes.
>>
>
> If the voting algorithm requires communication with every node then there
> is an implicit limit.
>
>
>  You
>> could just install Linux-HA on a random sampling of 10 nodes across
>> your network. That would in theory create an effective quorum.
>>
>
>
>
>
>> There are other HA approaches that do not involve DRBD. One is to store
>> your name node table on a SAN or an NFS server. Terracotta is another
>> option that you might want to look at. But no, at the moment there is
>> no fail-over built into Hadoop.
>>
>
> Storing the only copy of the NN data into NFS would make the NFS server an
> SPOF, and you still need to solve the problems of
> - detecting NN failure and deciding who else is in charge
> - making another node the NN by giving it the same hostname/IPAddr as the
> one that went down.
>
> That is what the Linux-HA stuff promises
>
> -steve
>

Re: HADOOP-4539 question

Posted by Steve Loughran <st...@apache.org>.
Edward Capriolo wrote:

> 
> Just for reference: Linux-HA and some other tools deal with split-brain
> decisions by requiring a quorum. A quorum involves having a third party,
> or having more than 50% of the nodes agree.
> 
> An issue with Linux-HA and Hadoop is that Linux-HA is only
> supported/tested on clusters of up to 16 nodes.

Usually odd numbers; stops a 50%-50% split.

>That is not a hard
> limit, but no one claims to have done it on 1000 or so nodes.

If the voting algorithm requires communication with every node then 
there is an implicit limit.


> You
> could just install Linux-HA on a random sampling of 10 nodes across
> your network. That would in theory create an effective quorum.



> 
> There are other HA approaches that do not involve DRBD. One is to store
> your name node table on a SAN or an NFS server. Terracotta is another
> option that you might want to look at. But no, at the moment there is
> no fail-over built into Hadoop.

Storing the only copy of the NN data into NFS would make the NFS server
an SPOF, and you still need to solve the problems of
- detecting NN failure and deciding who else is in charge
- making another node the NN by giving it the same hostname/IPAddr as the
one that went down.

That is what the Linux-HA stuff promises

-steve

Re: HADOOP-4539 question

Posted by Edward Capriolo <ed...@gmail.com>.
On Sun, Sep 20, 2009 at 7:38 PM, Stas Oskin <st...@gmail.com> wrote:
> Hi.
>
> Just wanted to find out about the status of this feature.
>
> Any idea what release this is planned for?
>
> Regards.
>
> 2009/8/17 Edward Capriolo <ed...@gmail.com>
>
>> There are some native HA-like solutions that feature clustering,
>> electing a DC, and messaging. Check out Shoal. I tinkered with building
>> a Linux-HA-like kit over Shoal.
>>
>> On 8/13/09, Konstantin Shvachko <sh...@yahoo-inc.com> wrote:
>> > There is no "native" HA solution for HDFS at the moment.
>> > "External" HA solutions, like Cloudera's, may exist.
>> > I cannot speak for everybody, but I know of at least one different approach.
>> >
>> > --Konstantin
>> >
>> > Stas Oskin wrote:
>> >> Hi.
>> >>
>> >>> This is exactly the goal (long term): to evolve the BN into a
>> >>> StandbyNode, which will be able to take over when the main NN dies
>> >>> without restarting anything else.
>> >>> And the only remaining step is to implement the fail-over mechanism.
>> >>>
>> >>>
>> >>
>> >> Just to clarify: for the near future, the only HA option is the
>> >> Cloudera DRBD approach.
>> >>
>> >> Correct?
>> >>
>> >
>>
>

Just for reference: Linux-HA and some other tools deal with split-brain
decisions by requiring a quorum. A quorum involves having a third party,
or having more than 50% of the nodes agree.
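
(The arithmetic: a partition may only act while it can still see a strict
majority of the configured voters - a sketch:)

// majority quorum: act only when more than half of all voters are visible
static boolean hasQuorum(int visibleNodes, int totalNodes) {
    return visibleNodes > totalNodes / 2;  // e.g. 5 voters: need at least 3
}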

An issue with Linux-HA and Hadoop is that Linux-HA is only
supported/tested on clusters of up to 16 nodes. That is not a hard
limit, but no one claims to have done it on 1000 or so nodes. You
could just install Linux-HA on a random sampling of 10 nodes across
your network. That would in theory create an effective quorum.

There are other HA approaches that do not involve DRBD. One is to store
your name node table on a SAN or an NFS server. Terracotta is another
option that you might want to look at. But no, at the moment there is
no fail-over built into Hadoop.

Re: HADOOP-4539 question

Posted by Stas Oskin <st...@gmail.com>.
Hi.

Just wanted to find out about the status of this feature.

Any idea what release this is planned for?

Regards.

2009/8/17 Edward Capriolo <ed...@gmail.com>

> There are some native HA-like solutions that feature clustering,
> electing a DC, and messaging. Check out Shoal. I tinkered with building
> a Linux-HA-like kit over Shoal.
>
> On 8/13/09, Konstantin Shvachko <sh...@yahoo-inc.com> wrote:
> > There is no "native" HA solution for HDFS at the moment.
> > "External" HA solutions, like Cloudera's, may exist.
> > I cannot speak for everybody, but I know of at least one different approach.
> >
> > --Konstantin
> >
> > Stas Oskin wrote:
> >> Hi.
> >>
> >>> This is exactly the goal (long term): to evolve the BN into a
> >>> StandbyNode, which will be able to take over when the main NN dies
> >>> without restarting anything else.
> >>> And the only remaining step is to implement the fail-over mechanism.
> >>>
> >>>
> >>
> >> Just to clarify: for the near future, the only HA option is the
> >> Cloudera DRBD approach.
> >>
> >> Correct?
> >>
> >
>

Re: HADOOP-4539 question

Posted by Edward Capriolo <ed...@gmail.com>.
There are some native HA-like solutions that feature clustering,
electing a DC, and messaging. Check out Shoal. I tinkered with building
a Linux-HA-like kit over Shoal.

On 8/13/09, Konstantin Shvachko <sh...@yahoo-inc.com> wrote:
> There is no "native" HA solution for HDFS at the moment.
> "External" HA solutions, like Cloudera's, may exist.
> I cannot speak for everybody, but I know of at least one different approach.
>
> --Konstantin
>
> Stas Oskin wrote:
>> Hi.
>>
>>> This is exactly the goal (long term): to evolve the BN into a
>>> StandbyNode, which will be able to take over when the main NN dies
>>> without restarting anything else.
>>> And the only remaining step is to implement the fail-over mechanism.
>>>
>>>
>>
>> Just to clarify: for the near future, the only HA option is the Cloudera
>> DRBD approach.
>>
>> Correct?
>>
>

Re: HADOOP-4539 question

Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.
There is no "native" HA solution for HDFS at the moment.
"External" HA solutions, like Cloudera's, may exist.
I cannot speak for everybody, but I know of at least one different approach.

--Konstantin

Stas Oskin wrote:
> Hi.
> 
>> This is exactly the goal (long term): to evolve the BN into a
>> StandbyNode, which will be able to take over when the main NN dies
>> without restarting anything else.
>> And the only remaining step is to implement the fail-over mechanism.
>>
>>
> 
> Just to clarify: for the near future, the only HA option is the Cloudera
> DRBD approach.
> 
> Correct?
> 

Re: HADOOP-4539 question

Posted by Stas Oskin <st...@gmail.com>.
Hi.

>
> This is exactly the goal (long term): to evolve the BN into a StandbyNode,
> which will be able to take over when the main NN dies without restarting
> anything else.
> And the only remaining step is to implement the fail-over mechanism.
>
>

Just to clarify: for the near future, the only HA option is the Cloudera
DRBD approach.

Correct?

Re: HADOOP-4539 question

Posted by Todd Lipcon <to...@cloudera.com>.
On Thu, Aug 13, 2009 at 10:37 AM, Konstantin Shvachko <sh...@yahoo-inc.com> wrote:

> Steve,
>
> There are other groups that claimed they work on an HA solution.
> We had discussions about it not so long ago on this list.
> Is it possible that your colleagues present their design?
> As you point out, the issue gets fairly complex fast,
> particularly because of the split-brain problem you describe.
>

IMHO the split-brain problem is why failover has to either be triggered
manually, or has to be done by an external system like Linux-HA where you
can get multiple media connecting the two masters. In the past I've done
this for firewalls and DB servers using a null modem serial connection plus
a crossover plus pings over the LAN - with 3 separate heartbeats it's very
tough to get a split brain. If you absolutely must avoid it, you can also
trigger a "STONITH" policy: http://linux-ha.org/STONITH
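
As a sketch, the ha.cf fragment for that kind of setup looks roughly like
this (devices, interfaces, and addresses are made up; the stonith line is
the classic baytech example from the Heartbeat docs):

# /etc/ha.d/ha.cf - three independent heartbeat paths between the two masters
serial /dev/ttyS0        # null modem serial link
baud   19200
bcast  eth1              # crossover cable between the pair
ping   10.0.0.254        # LAN ping node, e.g. the default gateway
auto_failback off
node   nn-primary
node   nn-standby
# optional: power-fence the peer before taking over
stonith_host * baytech 10.0.1.3 admin secret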


>
> There are several jiras dedicated to the problem already.
> You can post your design there or create a new one.
>
> > Looking at the facebook/google "multi-master" solution, I think they
> > don't worry about consistency, just let the masters drift apart.
>
> Not sure I follow this.
> What facebook/google "multi-master" solution?
> Why would they not worry about consistency?
> Consistency of what?
>
> Thanks,
> --Konstantin
>
>
> Steve Loughran wrote:
>
>> Konstantin Shvachko wrote:
>>
>>> And the only remaining step is to implement the fail-over mechanism.
>>>
>>
>> :)
>>
>> Colleagues of mine work on HA stuff; I try and steer clear of it as it
>> gets complex fast. Test case: what happens when a network failure splits
>> the datacentre in two? You now have two clusters, each with half the data
>> and possibly a primary/secondary master in each one. Then leave the
>> partition up for a while, do inconsistent operations on each, then have
>> the network come back up. Then work out how to merge the state.
>>
>> Looking at the facebook/google "multi-master" solution, I think they don't
>> worry about consistency, just let the masters drift apart.
>>
>> see also Johan's recent talk on HDFS:
>> http://www.slideshare.net/steve_l/hdfs
>>
>>

Re: HADOOP-4539 question

Posted by Steve Loughran <st...@apache.org>.
Konstantin Shvachko wrote:
> Steve,
> 
> There are other groups that claimed they work on an HA solution.
> We had discussions about it not so long ago on this list.
> Is it possible that your colleagues present their design?
> As you point out, the issue gets fairly complex fast,
> particularly because of the split-brain problem you describe.

Konstantin, if we had an HA HDFS, you'd know about it, not least because 
I'd be trying to get it checked in.

I was just describing the general datacentre partitioning problem that 
crops up in all HA databases.

> 
> There are several jiras dedicated to the problem already.
> You can post your design there or create a new one.
> 
>  > Looking at the facebook/google "multi-master" solution, I think they
>  > don't worry about consistency, just let the masters drift apart.
> 
> Not sure I follow this.
> What facebook/google "multi-master" solution?

Johan mentions it : http://www.slideshare.net/steve_l/hdfs


> Why would they not worry about consistency?
> Consistency of what?

Imagine you have >1 NN, each with a view of the world as reported by the
DNs - the map of where blocks live. If you don't do failover, but maintain
separate directory structures in each NN, then you can have the two NNs'
indices diverge, without worrying about reconciling them later.

-steve

Re: HADOOP-4539 question

Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.
Steve,

There are other groups that claimed they work on an HA solution.
We had discussions about it not so long ago on this list.
Is it possible that your colleagues present their design?
As you point out, the issue gets fairly complex fast,
particularly because of the split-brain problem you describe.

There are several jiras dedicated to the problem already.
You can post your design there or create a new one.

 > Looking at the facebook/google "multi-master" solution, I think they
 > don't worry about consistency, just let the masters drift apart.

Not sure I follow this.
What facebook/google "multi-master" solution?
Why would they not worry about consistency?
Consistency of what?

Thanks,
--Konstantin

Steve Loughran wrote:
> Konstantin Shvachko wrote:
>> And the only remaining step is to implement the fail-over mechanism.
> 
> :)
> 
> Colleagues of mine work on HA stuff; I try and steer clear of it as it
> gets complex fast. Test case: what happens when a network failure splits
> the datacentre in two? You now have two clusters, each with half the data
> and possibly a primary/secondary master in each one. Then leave the
> partition up for a while, do inconsistent operations on each, then have
> the network come back up. Then work out how to merge the state.
> 
> Looking at the facebook/google "multi-master" solution, I think they 
> don't worry about consistency, just let the masters drift apart.
> 
> see also Johan's recent talk on HDFS: 
> http://www.slideshare.net/steve_l/hdfs
> 

Re: HADOOP-4539 question

Posted by Steve Loughran <st...@apache.org>.
Konstantin Shvachko wrote:
> And the only remaining step is to implement the fail-over mechanism.

:)

Colleagues of mine work on HA stuff; I try and steer clear of it as it
gets complex fast. Test case: what happens when a network failure splits
the datacentre in two? You now have two clusters, each with half the data
and possibly a primary/secondary master in each one. Then leave the
partition up for a while, do inconsistent operations on each, then have
the network come back up. Then work out how to merge the state.

Looking at the facebook/google "multi-master" solution, I think they 
don't worry about consistency, just let the masters drift apart.

see also Johan's recent talk on HDFS: http://www.slideshare.net/steve_l/hdfs

Re: HADOOP-4539 question

Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.
 > Gotcha - I thought the long-term goal for the BN was to eventually have it
 > work as a "warm standby" that could convert into a NN without restart.

This is exactly the goal (long term): to evolve the BN into a StandbyNode,
which will be able to take over when the main NN dies without restarting
anything else.
And the only remaining step is to implement the fail-over mechanism.

--Konstantin

Todd Lipcon wrote:
On Wed, Aug 12, 2009 at 12:06 PM, Konstantin Shvachko <sh...@yahoo-inc.com> wrote:
> 
>> Stas,
>>
>> There is no HA solution currently for Hadoop.
>> You can do things like Cloudera describes.
>> Their solution works with 2 real name-nodes.
>> No Backup node involved.
>>
>> As for the Backup node, I don't really understand Todd's comment,
>> but the fact is that the Backup node (BN) is not a standby
>> node. The failover procedure is not implemented for BN,
>> so neither clients nor data-nodes fail over anywhere
>> when the main name-node (NN) dies; they don't have a clue.
>>
> 
> Gotcha - I thought the long-term goal for the BN was to eventually have it
> work as a "warm standby" that could convert into a NN without restart.
> 
> My mistake
> 
> -Todd
> 
> 
>> The purpose of the BN is
>> 1) to keep an up-to-date image of the namespace in memory.
>> This does not include block locations.
>> BN does not know where file blocks are.
>> 2) to make periodic checkpoints, like SecondaryNameNode did,
>> but more efficiently, since BN does not need to load image
>> and edits from NN, its namespace is already up-to-date.
>>
>> There is provision to transform BN to a real standby node,
>> with failover, but it has not been implemented yet.
>>
>> Hope this clarifies things.
>>
>> Thanks,
>> --Konstantin
>>
>>
>>
>> Todd Lipcon wrote:
>>
>>> On Wed, Aug 12, 2009 at 3:42 AM, Stas Oskin <st...@gmail.com> wrote:
>>>
>>>  Hi.
>>>>
>>>>> You can also use a utility like Linux-HA (aka heartbeat) to handle IP
>>>>> address failover. It will even send gratuitous ARPs to make sure the
>>>>> new MAC address gets registered after a failover. Check out this blog
>>>>> for info about a setup like this:
>>>>>
>>>>> http://www.cloudera.com/blog/2009/07/22/hadoop-ha-configuration/
>>>>>
>>>>> Hope that helps
>>>>>
>>>> Thanks, exactly what I looked for :).
>>>>
>>>> I presume that with the coming BN, there won't be a need for DRBD, am I
>>>> correct?
>>>>
>>>>
>>> I haven't followed that development closely, but I believe that's the
>>> case.
>>> The BackupNode will stream the FSEditLog writes as they occur while
>>> replaying them into its own FSNamesystem. Then during a failover a real
>>> NameNode starts on that FSNamesystem "ready to go". As for how the
>>> BackupNode keeps track of block locations, I'm not sure - is there a
>>> replication stream between BlockManagers too? Or is the cluster in a
>>> broken
>>> state until all of the DNs have processed new block reports?
>>>
>>> -Todd
>>>
>>>
> 

Re: HADOOP-4539 question

Posted by Todd Lipcon <to...@cloudera.com>.
On Wed, Aug 12, 2009 at 12:06 PM, Konstantin Shvachko <sh...@yahoo-inc.com> wrote:

> Stas,
>
> There is no HA solution currently for Hadoop.
> You can do things like Cloudera describes.
> Their solution works with 2 real name-nodes.
> No Backup node involved.
>
> As for the Backup node, I don't really understand Todd's comment,
> but the fact is that the Backup node (BN) is not a standby
> node. The failover procedure is not implemented for BN,
> so neither clients nor data-nodes fail over anywhere
> when the main name-node (NN) dies; they don't have a clue.
>

Gotcha - I thought the long-term goal for the BN was to eventually have it
work as a "warm standby" that could convert into a NN without restart.

My mistake

-Todd


>
> The purpose of the BN is
> 1) to keep an up-to-date image of the namespace in memory.
> This does not include block locations.
> BN does not know where file blocks are.
> 2) to make periodic checkpoints, like SecondaryNameNode did,
> but more efficiently, since BN does not need to load image
> and edits from NN, its namespace is already up-to-date.
>
> There is provision to transform BN to a real standby node,
> with failover, but it has not been implemented yet.
>
> Hope this clarifies things.
>
> Thanks,
> --Konstantin
>
>
>
> Todd Lipcon wrote:
>
>> On Wed, Aug 12, 2009 at 3:42 AM, Stas Oskin <st...@gmail.com> wrote:
>>
>>  Hi.
>>>
>>>
>>>> You can also use a utility like Linux-HA (aka heartbeat) to handle IP
>>>> address failover. It will even send gratuitous ARPs to make sure the
>>>> new MAC address gets registered after a failover. Check out this blog
>>>> for info about a setup like this:
>>>>
>>>> http://www.cloudera.com/blog/2009/07/22/hadoop-ha-configuration/
>>>>
>>>> Hope that helps
>>>>
>>> Thanks, exactly what I looked for :).
>>>
>>> I presume that with the coming BN, there won't be a need for DRBD, am I
>>> correct?
>>>
>>>
>> I haven't followed that development closely, but I believe that's the
>> case.
>> The BackupNode will stream the FSEditLog writes as they occur while
>> replaying them into its own FSNamesystem. Then during a failover a real
>> NameNode starts on that FSNamesystem "ready to go". As for how the
>> BackupNode keeps track of block locations, I'm not sure - is there a
>> replication stream between BlockManagers too? Or is the cluster in a
>> broken
>> state until all of the DNs have processed new block reports?
>>
>> -Todd
>>
>>

Re: HADOOP-4539 question

Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.
Stas,

There is no HA solution currently for Hadoop.
You can do things like Cloudera describes.
Their solution works with 2 real name-nodes.
No Backup node involved.

As for the Backup node, I don't really understand Todd's comment,
but the fact is that the Backup node (BN) is not a standby
node. The failover procedure is not implemented for BN,
so neither clients nor data-nodes fail over anywhere
when the main name-node (NN) dies; they don't have a clue.

The purpose of the BN is
1) to keep an up-to-date image of the namespace in memory.
This does not include block locations.
BN does not know where file blocks are.
2) to make periodic checkpoints, like SecondaryNameNode did,
but more efficiently, since BN does not need to load image
and edits from NN, its namespace is already up-to-date.

There is provision to transform BN to a real standby node,
with failover, but it has not been implemented yet.
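
For reference, here is a sketch of how a BN gets wired up on trunk (property
names as I recall them - check the docs for your build):

<!-- hdfs-site.xml on the backup node -->
<property>
  <name>dfs.backup.address</name>
  <value>backup-host:50100</value>   <!-- where the BN receives the NN's edit stream -->
</property>
<property>
  <name>dfs.backup.http.address</name>
  <value>backup-host:50105</value>
</property>

# then start it with:
bin/hdfs namenode -backup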

Hope this clarifies things.

Thanks,
--Konstantin


Todd Lipcon wrote:
> On Wed, Aug 12, 2009 at 3:42 AM, Stas Oskin <st...@gmail.com> wrote:
> 
>> Hi.
>>
>>
>>> You can also use a utility like Linux-HA (aka heartbeat) to handle IP
>>> address failover. It will even send gratuitous ARPs to make sure the
>>> new MAC address gets registered after a failover. Check out this blog for info
>>> about a setup like this:
>>>
>>> http://www.cloudera.com/blog/2009/07/22/hadoop-ha-configuration/
>>>
>>> Hope that helps
>>>
>> Thanks, exactly what I looked for :).
>>
>> I presume that with the coming BN, there won't be a need for DRBD, am I
>> correct?
>>
> 
> I haven't followed that development closely, but I believe that's the case.
> The BackupNode will stream the FSEditLog writes as they occur while
> replaying them into its own FSNamesystem. Then during a failover a real
> NameNode starts on that FSNamesystem "ready to go". As for how the
> BackupNode keeps track of block locations, I'm not sure - is there a
> replication stream between BlockManagers too? Or is the cluster in a broken
> state until all of the DNs have processed new block reports?
> 
> -Todd
> 

Re: HADOOP-4539 question

Posted by Todd Lipcon <to...@cloudera.com>.
On Wed, Aug 12, 2009 at 3:42 AM, Stas Oskin <st...@gmail.com> wrote:

> Hi.
>
>
> > You can also use a utility like Linux-HA (aka heartbeat) to handle IP
> > address failover. It will even send gratuitous ARPs to make sure the
> > new MAC address gets registered after a failover. Check out this blog for info
> > about a setup like this:
> >
> > http://www.cloudera.com/blog/2009/07/22/hadoop-ha-configuration/
> >
> > Hope that helps
> >
>
> Thanks, exactly what I looked for :).
>
> I presume that with the coming BN, there won't be a need for DRBD, am I
> correct?
>

I haven't followed that development closely, but I believe that's the case.
The BackupNode will stream the FSEditLog writes as they occur while
replaying them into its own FSNamesystem. Then during a failover a real
NameNode starts on that FSNamesystem "ready to go". As for how the
BackupNode keeps track of block locations, I'm not sure - is there a
replication stream between BlockManagers too? Or is the cluster in a broken
state until all of the DNs have processed new block reports?

-Todd

Re: HADOOP-4539 question

Posted by Stas Oskin <st...@gmail.com>.
Hi.


> You can also use a utility like Linux-HA (aka heartbeat) to handle IP
> address failover. It will even send gratuitous ARPs to make sure the
> new MAC address gets registered after a failover. Check out this blog for info
> about a setup like this:
>
> http://www.cloudera.com/blog/2009/07/22/hadoop-ha-configuration/
>
> Hope that helps
>

Thanks, exactly what I looked for :).

I presume that with the coming BN, there won't be a need for DRBD, am I
correct?

Regards.

Re: HADOOP-4539 question

Posted by Todd Lipcon <to...@cloudera.com>.
Hey Stas,

You can also use a utility like Linux-HA (aka heartbeat) to handle IP
address failover. It will even send gratuitous ARPs to make sure the
new MAC address gets registered after a failover. Check out this blog for info
about a setup like this:

http://www.cloudera.com/blog/2009/07/22/hadoop-ha-configuration/
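
The core of the setup is a floating service IP owned by whichever node is
active; in Heartbeat 1.x terms that is roughly one haresources line
(hostname and address invented for illustration). The IPaddr resource is
what sends the gratuitous ARPs on takeover:

# /etc/ha.d/haresources - nn-primary preferentially owns the NN service IP
nn-primary IPaddr::10.0.0.50/24/eth0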

Hope that helps
-Todd

On Tue, Aug 11, 2009 at 3:45 AM, Steve Loughran <st...@apache.org> wrote:

> Stas Oskin wrote:
>
>> Hi.
>>
>> What is the recommended utility for this?
>>
>> Thanks.
>>
>
> for those of us whose hosts are virtual and who have control over the
> infrastructure it's fairly simple: bring up a new VM on a different blade
> with the same base image and hostname.
>
> If you have a non-virtual cluster, you need a machine that you can bring up
> with that same hostname; either have something sitting around (switched off)
> waiting for the call of duty, or you rename a node and reboot it.
>
> If you own DNS, bring up all the nodes (and the clients) with the JVM
> property networkaddress.cache.ttl set to something low (like
> 60s), and then you should be able to bring up a node with the same name but
> a different IPAddress. This is useful if you can't control the IPAddr of a
> node, but you can at least change the DNS entry
>
>
>
>> 2009/8/7 Steve Loughran <st...@apache.org>
>>
>>  Stas Oskin wrote:
>>>
>>>  Hi.
>>>>
>>>> I checked this ticket and I like what I found.
>>>>
>>>> I had a question about it, and hoped someone could answer it:
>>>>
>>>> If I have a NN and a BN, and the NN fails, how will the DFS clients know
>>>> how to connect to the new IP?
>>>>
>>>> Will it be a config-level setting?
>>>>
>>>> Or does it need to be achieved via external Linux HA scripts?
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>  right now the new NN has to come up with the same hostname and IP
>>> address
>>> as the original
>>>
>>>
>>
>
> --
> Steve Loughran                  http://www.1060.org/blogxter/publish/5
> Author: Ant in Action           http://antbook.org/
>

Re: HADOOP-4539 question

Posted by Steve Loughran <st...@apache.org>.
Stas Oskin wrote:
> Hi.
> 
> What is the recommended utility for this?
> 
> Thanks.

for those of us whose hosts are virtual and who have control over the 
infrastructure it's fairly simple: bring up a new VM on a different blade
with the same base image and hostname.

If you have a non-virtual cluster, you need a machine that you can bring 
up with that same hostname; either have something sitting around 
(switched off) waiting for the call of duty, or you rename a node and 
reboot it.

If you own DNS, bring up all the nodes (and the clients) with the JVM
property networkaddress.cache.ttl set to something low (like
60s), and then you should be able to bring up a node with the same name 
but a different IPAddress. This is useful if you can't control the 
IPAddr of a node, but you can at least change the DNS entry
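
For the record, networkaddress.cache.ttl is a java.security Security
property rather than a -D system property, so it goes in
$JAVA_HOME/lib/security/java.security, or can be set programmatically
before the first lookup - e.g.:

import java.security.Security;

public class ShortDnsTtl {
    public static void main(String[] args) throws Exception {
        // cache successful lookups for only 60s, so a renamed node's new
        // address is picked up quickly; must run before any name resolution
        Security.setProperty("networkaddress.cache.ttl", "60");
        System.out.println(java.net.InetAddress.getByName("localhost"));
    }
}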

> 
> 2009/8/7 Steve Loughran <st...@apache.org>
> 
>> Stas Oskin wrote:
>>
>>> Hi.
>>>
>>> I checked this ticket and I like what I found.
>>>
>>> I had a question about it, and hoped someone could answer it:
>>>
>>> If I have a NN and a BN, and the NN fails, how will the DFS clients know
>>> how to connect to the new IP?
>>>
>>> Will it be a config-level setting?
>>>
>>> Or does it need to be achieved via external Linux HA scripts?
>>>
>>> Thanks!
>>>
>>>
>> right now the new NN has to come up with the same hostname and IP address
>> as the original
>>
> 


-- 
Steve Loughran                  http://www.1060.org/blogxter/publish/5
Author: Ant in Action           http://antbook.org/

Re: HADOOP-4539 question

Posted by Stas Oskin <st...@gmail.com>.
Hi.

What is the recommended utility for this?

Thanks.

2009/8/7 Steve Loughran <st...@apache.org>

> Stas Oskin wrote:
>
>> Hi.
>>
>> I checked this ticket and I like what I found.
>>
>> I had a question about it, and hoped someone could answer it:
>>
>> If I have a NN and a BN, and the NN fails, how will the DFS clients know
>> how to connect to the new IP?
>>
>> Will it be a config-level setting?
>>
>> Or does it need to be achieved via external Linux HA scripts?
>>
>> Thanks!
>>
>>
> right now the new NN has to come up with the same hostname and IP address
> as the original
>

Re: HADOOP-4539 question

Posted by Steve Loughran <st...@apache.org>.
Stas Oskin wrote:
> Hi.
> 
> I checked this ticket and I like what I found.
> 
> I had a question about it, and hoped someone could answer it:
> 
> If I have a NN and a BN, and the NN fails, how will the DFS clients know how
> to connect to the new IP?
> 
> Will it be a config-level setting?
> 
> Or does it need to be achieved via external Linux HA scripts?
> 
> Thanks!
> 

right now the new NN has to come up with the same hostname and IP 
address as the original