Posted to dev@hbase.apache.org by Andrew Purtell <ap...@apache.org> on 2009/08/06 19:25:53 UTC

roadmap: data integrity

I updated the roadmap up on the wiki:


* Data integrity
    * Ensure that proper append() support in HDFS actually closes the 
      WAL last block write hole
    * HBase-FSCK (HBASE-7) -- Suggest making this a blocker for 0.21

I have had several recent conversations on my travels with people in
Fortune 100 companies (based on this list:
http://www.wageproject.org/content/fortune/index.php).

You and I know we can set up well engineered HBase 0.20 clusters that 
will be operationally solid for a wide range of use cases, but given
those aforementioned discussions there are certain sectors which would
say HBASE-7 is #1 before HBase is "bank ready". Not until we can say:

  - Yes, when the client sees data has been committed, it actually has
been written and replicated on spinning or solid state media in all
cases.

  - Yes, we go to great lengths to recover data if ${deity} forbid you 
crush some underprovisioned cluster with load or some bizarre bug or
system fault happens.

HBASE-1295 is also required for business continuity reasons, but this
is already a priority item for some HBase committers. 

The question, I think, is whether the above aligns with project goals.
Making HBase-FSCK a blocker will probably knock something someone
wants for the 0.21 timeframe off the list.

   - Andy



Re: roadmap: data integrity

Posted by Andrew Purtell <ap...@apache.org>.
Good to see there's direct edit replication support; that can make
things easier. 

I've seen people use DRBD or NFS to replicate edits currently. 

Namenode failover is a "solvable" issue with traditional HA: OS level
heartbeats, fencing, failover -- e.g. an HA infrastructure daemon starts
NN instance on node B if heartbeat from node A is lost and takes a
power control operation on A to make sure it is dead. On both nodes the
infrastructure daemons trigger the OS watchdog if the NN process dies.
Combine this with automatic IP address reassignment. Then, page the
operators. Add another node C for additional redundancy, and make sure
all of the alternatives are on separate racks and power rails, and make
sure the L2 and L3 topology is also HA (e.g. bonded ethernet to
redundant switches at L2, mesh routing at L3, etc.). If the cluster is
not super huge it can all be spanned at L2 over redundant switches. L3
redundancy is trickier. A typical configuration could have a lot of OSPF
stub networks -- depends on how L2 is partitioned -- which can make the
routing table difficult for operators to sort out.

I've seen this type of thing work for myself, ~15 seconds from 
(simulated) fault on NN node A to the new NN up and responding to DN
reconnections on node B, with 0.19. 

You can build in additional assurance of fast failover by running
redundant watchdog processes alongside a few datanodes which repeatedly
ping the NN via the namenode protocol and trigger fencing and failover
if it stops responding.
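
A minimal sketch of such a pinger, under assumptions: the fencing script
path is hypothetical, and the liveness probe here is just a cheap
metadata call through the standard FileSystem API rather than the raw
namenode protocol:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch of an NN liveness pinger. Runs beside a few datanodes; if
    // the NN stops answering cheap metadata calls, it triggers fencing
    // and failover via an external script (path is hypothetical).
    public class NamenodePinger {
      private static final int MAX_FAILURES = 3;
      private static final long INTERVAL_MS = 5000L;

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        int failures = 0;
        while (true) {
          try {
            FileSystem fs = FileSystem.get(conf);
            fs.exists(new Path("/"));  // cheap round trip to the NN
            failures = 0;
          } catch (Exception e) {
            if (++failures >= MAX_FAILURES) {
              // Fence node A (power control), start the NN on node B.
              Runtime.getRuntime().exec(
                  "/usr/local/bin/fence-and-failover.sh");
              return;
            }
          }
          Thread.sleep(INTERVAL_MS);
        }
      }
    }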

One wrinkle is that the new namenode starts up in safe mode. As long as
HBase can handle temporary periods where the cluster is in safe mode
after NN failover, it can ride it out.
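
For illustration, a sketch of waiting out that window, assuming the
0.20-era DistributedFileSystem.setSafeMode(SAFEMODE_GET) query (the
same call "hadoop dfsadmin -safemode get" uses under the hood):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.FSConstants;

    // Block until the (new) namenode has left safe mode.
    public class SafeModeWait {
      public static void waitOnSafeMode(Configuration conf, long pollMs)
          throws Exception {
        FileSystem fs = FileSystem.get(conf);
        if (!(fs instanceof DistributedFileSystem)) return;
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        // SAFEMODE_GET only queries; it does not change NN state.
        while (dfs.setSafeMode(FSConstants.SafeModeAction.SAFEMODE_GET)) {
          Thread.sleep(pollMs);  // still in safe mode; poll again
        }
      }
    }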

This is ugly, but I believe it is an accepted and valid systems
engineering solution for the NN SPOF issue for the folks I mentioned
in my previous email, something they would be familiar with. Edit
replication support in HDFS 0.21 makes it a little less work to
achieve and maybe a little faster to execute, so that's an
improvement.

It may be overstating it a little to say that the NN SPOF is not a
concern for HBase, but, in my opinion, we need to address the WAL and
lack-of-FSCK issues first before worrying about it. HBase can lose data
all on its own.

   - Andy





________________________________
From: Jean-Daniel Cryans <jd...@apache.org>
To: hbase-dev@hadoop.apache.org
Sent: Friday, August 7, 2009 3:25:19 AM
Subject: Re: roadmap: data integrity

https://issues.apache.org/jira/browse/HADOOP-4539

This issue was closed long ago. But Steve Loughran just said on the
hadoop mailing list that the new NN has to come up with the same
IP/hostname as the failed one.

J-D

On Fri, Aug 7, 2009 at 2:37 AM, Ryan Rawson<ry...@gmail.com> wrote:
> WAL is a major issue, but another one that is coming up fast is the
> SPOF that is the namenode.
>
> Right now, namenode aside, I can rolling restart my entire cluster,
> including rebooting the machines if I needed to. But not so with the
> namenode, because if it goes AWOL, all sorts of bad can happen.
>
> I hope that HDFS 0.21 addresses both these issues.  Can we get
> positive confirmation that this is being worked on?
>
> -ryan
>
> On Thu, Aug 6, 2009 at 10:25 AM, Andrew Purtell<ap...@apache.org> wrote:
>> I updated the roadmap up on the wiki:
>>
>>
>> * Data integrity
>>    * Ensure that proper append() support in HDFS actually closes the
>>      WAL last block write hole
>>    * HBase-FSCK (HBASE-7) -- Suggest making this a blocker for 0.21
>>
>> I have had several recent conversations on my travels with people in
>> Fortune 100 companies (based on this list:
>> http://www.wageproject.org/content/fortune/index.php).
>>
>> You and I know we can set up well engineered HBase 0.20 clusters that
>> will be operationally solid for a wide range of use cases, but given
>> those aforementioned discussions there are certain sectors which would
>> say HBASE-7 is #1 before HBase is "bank ready". Not until we can say:
>>
>>  - Yes, when the client sees data has been committed, it actually has
>> been written and replicated on spinning or solid state media in all
>> cases.
>>
>>  - Yes, we go to great lengths to recover data if ${deity} forbid you
>> crush some underprovisioned cluster with load or some bizarre bug or
>> system fault happens.
>>
>> HBASE-1295 is also required for business continuity reasons, but this
>> is already a priority item for some HBase committers.
>>
>> The question, I think, is whether the above aligns with project goals.
>> Making HBase-FSCK a blocker will probably knock something someone
>> wants for the 0.21 timeframe off the list.
>>
>>   - Andy
>>
>>
>>
>




Re: roadmap: data integrity

Posted by Jean-Daniel Cryans <jd...@apache.org>.
https://issues.apache.org/jira/browse/HADOOP-4539

This issue was closed long ago. But Steve Loughran just said on the
hadoop mailing list that the new NN has to come up with the same
IP/hostname as the failed one.

J-D

On Fri, Aug 7, 2009 at 2:37 AM, Ryan Rawson<ry...@gmail.com> wrote:
> WAL is a major issue, but another one that is coming up fast is the
> SPOF that is the namenode.
>
> Right now, namenode aside, I can rolling restart my entire cluster,
> including rebooting the machines if I needed to. But not so with the
> namenode, because if it goes AWOL, all sorts of bad can happen.
>
> I hope that HDFS 0.21 addresses both these issues.  Can we get
> positive confirmation that this is being worked on?
>
> -ryan
>
> On Thu, Aug 6, 2009 at 10:25 AM, Andrew Purtell<ap...@apache.org> wrote:
>> I updated the roadmap up on the wiki:
>>
>>
>> * Data integrity
>>    * Ensure that proper append() support in HDFS actually closes the
>>      WAL last block write hole
>>    * HBase-FSCK (HBASE-7) -- Suggest making this a blocker for 0.21
>>
>> I have had several recent conversations on my travels with people in
>> Fortune 100 companies (based on this list:
>> http://www.wageproject.org/content/fortune/index.php).
>>
>> You and I know we can set up well engineered HBase 0.20 clusters that
>> will be operationally solid for a wide range of use cases, but given
>> those aforementioned discussions there are certain sectors which would
>> say HBASE-7 is #1 before HBase is "bank ready". Not until we can say:
>>
>>  - Yes, when the client sees data has been committed, it actually has
>> been written and replicated on spinning or solid state media in all
>> cases.
>>
>>  - Yes, we go to great lengths to recover data if ${deity} forbid you
>> crush some underprovisioned cluster with load or some bizarre bug or
>> system fault happens.
>>
>> HBASE-1295 is also required for business continuity reasons, but this
>> is already a priority item for some HBase committers.
>>
>> The question, I think, is whether the above aligns with project goals.
>> Making HBase-FSCK a blocker will probably knock something someone
>> wants for the 0.21 timeframe off the list.
>>
>>   - Andy
>>
>>
>>
>

Re: roadmap: data integrity

Posted by Ryan Rawson <ry...@gmail.com>.
WAL is a major issue, but another one that is coming up fast is the
SPOF that is the namenode.

Right now, namenode aside, I can rolling restart my entire cluster,
including rebooting the machines if I needed to. But not so with the
namenode, because if it goes AWOL, all sorts of bad can happen.

I hope that HDFS 0.21 addresses both these issues.  Can we get
positive confirmation that this is being worked on?

-ryan

On Thu, Aug 6, 2009 at 10:25 AM, Andrew Purtell<ap...@apache.org> wrote:
> I updated the roadmap up on the wiki:
>
>
> * Data integrity
>    * Ensure that proper append() support in HDFS actually closes the
>      WAL last block write hole
>    * HBase-FSCK (HBASE-7) -- Suggest making this a blocker for 0.21
>
> I have had several recent conversations on my travels with people in
> Fortune 100 companies (based on this list:
> http://www.wageproject.org/content/fortune/index.php).
>
> You and I know we can set up well engineered HBase 0.20 clusters that
> will be operationally solid for a wide range of use cases, but given
> those aforementioned discussions there are certain sectors which would
> say HBASE-7 is #1 before HBase is "bank ready". Not until we can say:
>
>  - Yes, when the client sees data has been committed, it actually has
> been written and replicated on spinning or solid state media in all
> cases.
>
>  - Yes, we go to great lengths to recover data if ${deity} forbid you
> crush some underprovisioned cluster with load or some bizarre bug or
> system fault happens.
>
> HBASE-1295 is also required for business continuity reasons, but this
> is already a priority item for some HBase committers.
>
> The question, I think, is whether the above aligns with project goals.
> Making HBase-FSCK a blocker will probably knock something someone
> wants for the 0.21 timeframe off the list.
>
>   - Andy
>
>
>

Re: roadmap: data integrity

Posted by Andrew Purtell <ap...@apache.org>.
One scenario I've seen in practice is HFiles corrupted by an incomplete
write -- no trailer -- for example after an incomplete memstore flush. But
one can still scan what is available to recover the KVs and write a new
storefile with a complete and valid structure.
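
As a sketch of the recovery shape only: the stock HFile.Reader needs
the trailer to open a file, so recoverKeyValues() below stands in for
the custom trailer-less block scan, and the writer call only
approximates the 0.20-era HFile API:

    import java.util.List;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.io.hfile.HFile;

    // Sketch of salvaging a storefile that has no trailer.
    public class StoreFileSalvage {
      public static void salvage(FileSystem fs, Path broken, Path fixed)
          throws Exception {
        List<KeyValue> kvs = recoverKeyValues(fs, broken); // hypothetical
        // Writer signature approximates the 0.20-era HFile API.
        HFile.Writer w = new HFile.Writer(fs, fixed, 64 * 1024, "none",
            KeyValue.KEY_COMPARATOR);
        for (KeyValue kv : kvs) {
          w.append(kv);  // rewrite in order, complete and valid
        }
        w.close();
      }

      private static List<KeyValue> recoverKeyValues(FileSystem fs,
          Path p) {
        // Hypothetical: walk data blocks directly, stopping at the
        // first block that fails its checksum or does not parse.
        throw new UnsupportedOperationException("trailer-less scan");
      }
    }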

Other scenarios all involve reconciliation between META contents and what
is actually on disk. Maybe a failed split where the daughters were 
created but META was not updated. On disk, maybe one daughter was fully
created but the other is incomplete. These scenarios all involve region
inspection and sanity checking, and decisions whether to roll a failed
split forward or back. Also, maybe even total META reconstruction, if it
was hosed somehow. 

> if a regionserver crashes during an upload, how do I know what has been
> lost?  From where do I restart the upload?

If we can guarantee that when a client-side flush completes successfully
everything has for sure been written, then the uploader can track this.
It can control its flush strategy according to its own needs and can
consider each successful flush a checkpoint. Right?
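
A minimal sketch of that checkpointing uploader, using the 0.20 client
API; the table name is made up, and markCheckpoint() is hypothetical --
it would persist the last durably flushed position wherever the
uploader keeps its state:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;

    // Treat each successful client-side flush as a checkpoint: after a
    // crash, restart the upload from the last checkpoint.
    public class CheckpointingUploader {
      public static void upload(Iterable<Put> rows) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "uploads");
        table.setAutoFlush(false);  // buffer puts client-side
        long count = 0;
        for (Put p : rows) {
          table.put(p);
          if (++count % 10000 == 0) {
            table.flushCommits();   // returns once the puts are committed
            markCheckpoint(count);  // hypothetical: persist position
          }
        }
        table.flushCommits();
        markCheckpoint(count);
      }

      private static void markCheckpoint(long pos) { /* app-specific */ }
    }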

   - Andy




________________________________
From: stack <st...@duboce.net>
To: hbase-dev@hadoop.apache.org
Sent: Friday, August 7, 2009 9:13:21 AM
Subject: Re: roadmap: data integrity

On Thu, Aug 6, 2009 at 10:25 AM, Andrew Purtell <ap...@apache.org> wrote:

> I updated the roadmap up on the wiki:
>
>
> * Data integrity
>    * Ensure that proper append() support in HDFS actually closes the
>      WAL last block write hole
>    * HBase-FSCK (HBASE-7) -- Suggest making this a blocker for 0.21
>
> I have had several recent conversations on my travels with people in
> Fortune 100 companies (based on this list:
> http://www.wageproject.org/content/fortune/index.php).


I like that link's topic matter.


> The question, I think, is whether the above aligns with project goals.

> Making HBase-FSCK a blocker will probably knock something someone
> wants for the 0.21 timeframe off the list.
>


I think the topic of integrity is a good one to raise at this time.  It's
about time for a (re)visit.  Is there enough information in the filesystem
for an hbase-fsck tool to do its reconstruction work?

Regions now have .regioninfo files written to them on creation, and the
hfiles have first, last, and sequenceids as metadata in them.  What else do
we need to fully reconstruct tables when, ${deity} forbid (<-I like this
one), there is a catastrophic crash?

A requirement of any hbase-fsck is that it finish promptly (MR job?).  It
should not be one of those tools that chews for hours on end, spinning
disks while a progress bar crawls to completion.

One area that for sure could do with review is log splitting and then replay
of edits on region redeploy.  We've not given this the attention it deserves
to ensure we are not dropping edits, mostly because up to now there has been
no working flush/append and so we've just presumed loss.

Another interesting question I was asked recently was, if a regionserver
crashes during an upload, how do I know what has been lost?  From where do I
restart the upload?

St.Ack




Re: roadmap: data integrity

Posted by stack <st...@duboce.net>.
On Thu, Aug 6, 2009 at 10:25 AM, Andrew Purtell <ap...@apache.org> wrote:

> I updated the roadmap up on the wiki:
>
>
> * Data integrity
>    * Ensure that proper append() support in HDFS actually closes the
>      WAL last block write hole
>    * HBase-FSCK (HBASE-7) -- Suggest making this a blocker for 0.21
>
> I have had several recent conversations on my travels with people in
> Fortune 100 companies (based on this list:
> http://www.wageproject.org/content/fortune/index.php).


I like that link's topic matter.


> The question, I think, is whether the above aligns with project goals.

> Making HBase-FSCK a blocker will probably knock something someone
> wants for the 0.21 timeframe off the list.
>


I think the topic of integrity is a good one to raise at this time.  It's
about time for a (re)visit.  Is there enough information in the filesystem
for an hbase-fsck tool to do its reconstruction work?

Regions now have .regioninfo files written to them on creation, and the
hfiles have first, last, and sequenceids as metadata in them.  What else do
we need to fully reconstruct tables when, ${deity} forbid (<-I like this
one), there is a catastrophic crash?
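
For illustration, a sketch of the reconstruction walk, assuming the 0.20
layout where each region directory under a table holds a .regioninfo
file containing a Writable-serialized HRegionInfo:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HRegionInfo;

    // Rebuild a table's region list straight from the filesystem. A
    // real hbase-fsck would then cross-check these against META and
    // the hfile first/last keys.
    public class RegionInfoWalk {
      public static void walk(Configuration conf, Path tableDir)
          throws Exception {
        FileSystem fs = FileSystem.get(conf);
        FileStatus[] regions = fs.listStatus(tableDir);
        if (regions == null) return;
        for (FileStatus regionDir : regions) {
          if (!regionDir.isDir()) continue;
          Path infoFile = new Path(regionDir.getPath(), ".regioninfo");
          if (!fs.exists(infoFile)) continue;  // candidate for repair
          FSDataInputStream in = fs.open(infoFile);
          HRegionInfo hri = new HRegionInfo();
          hri.readFields(in);  // HRegionInfo is a Writable
          in.close();
          System.out.println(hri.getRegionNameAsString());
        }
      }
    }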

A requirement of any hbase-fsck is that it finish promptly (MR job?).  It
should not be one of those tools that chews for hours on end, spinning
disks while a progress bar crawls to completion.

One area that for sure could do with review is log splitting and then replay
of edits on region redeploy.  We've not given this the attention it deserves
to ensure we are not dropping edits, mostly because up to now there has been
no working flush/append and so we've just presumed loss.

Another interesting question I was asked recently was, if a regionserver
crashes during an upload, how do I know what has been lost?  From where do I
restart the upload?

St.Ack