Posted to dev@hbase.apache.org by Clint Morgan <cl...@troove.net> on 2009/10/01 00:35:03 UTC

Re: append (hadoop-4379), was -> Re: roadmap: data integrity

I got this working: did a hard kill of the regionserver, and recovery worked.

I used the hadoop/hdfs/branches/HDFS-265 branch and was banging my head
trying to get it to work. Saw that hlog was reflectively calling
SequenceFile.Writer.syncFs(). This method did not exist (in
hadoop/common/branches/branch-0.21), so I naively changed it to call sync().
But this is a different kind of sync...

To get it to work I added the Writer.syncFs() method which just calls
out.sync().
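
For reference, the method I added looks roughly like this (a sketch
against SequenceFile.Writer; "out" is the writer's underlying
FSDataOutputStream):

  /** Flush all currently written data down to the file system. */
  public void syncFs() throws IOException {
    if (out != null) {
      out.sync();  // FSDataOutputStream.sync(): flush to the
    }              // filesystem, not SequenceFile's record sync()
  }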

On Sat, Aug 8, 2009 at 7:51 PM, Andrew Purtell <ap...@apache.org> wrote:

> I realized too late I was not running Hadoop with DEBUG, only HBase.
>
> I'll try again next month, when it will not hurt to lose data.
>
>   - Andy
>
>
>
>
> ________________________________
> From: stack <st...@duboce.net>
> To: hbase-dev@hadoop.apache.org
> Sent: Saturday, August 8, 2009 6:34:07 PM
> Subject: Re: append (hadoop-4379), was -> Re: roadmap: data integrity
>
> Didn't mean to be so short.  I'd suggest putting your experience up in
> HDFS-200/HADOOP-4379.  The lads there would be interested in what you've
> found.
> St.Ack
>
> On Sat, Aug 8, 2009 at 9:36 AM, Andrew Purtell <ap...@apache.org>
> wrote:
>
> > Cluster down hard after RS failure. Master stuck indefinitely splitting
> > logs.
> > Endless instances of this message, once per second:
> >
> > org.apache.hadoop.hdfs.DFSClient: Could not complete file
> > /hbase/content/1965559571/oldlogfile.lo retrying...
> >
> > Turning off "dfs.support.append".
> >
> >   - Andy
> >
> >
> >
> >
> > ________________________________
> > From: stack <st...@duboce.net>
> > To: hbase-dev@hadoop.apache.org
> > Sent: Friday, August 7, 2009 12:34:40 PM
> > Subject: Re: append (hadoop-4379), was -> Re: roadmap: data integrity
> >
> > You are a good man Andrew.
> > St.Ack
> >
> > On Fri, Aug 7, 2009 at 10:27 AM, Andrew Purtell <ap...@apache.org>
> > wrote:
> >
> > > I'm going to join you in testing this, stack, taking the below as the
> > > config recipe.
> > >
> > >   - Andy
> > >
> > >
> > >
> > >
> > > ________________________________
> > > From: stack <st...@duboce.net>
> > > To: hbase-dev@hadoop.apache.org
> > > Sent: Friday, August 7, 2009 9:54:53 AM
> > > Subject: append (hadoop-4379), was -> Re: roadmap: data integrity
> > >
> > > Here is a quick note on the current state of my testing of HADOOP-4379
> > > (support for 'append' in hadoop 0.20.x).
> > >
> > > On my small test cluster, I am not able to break the latest patch posted
> > > by Dhruba under heavy loading.  It seems to basically work.  On
> > > regionserver crash, the master runs the log split, and when it comes to
> > > the last in the set of regionserver logs for splitting, the one that is
> > > inevitably unclosed because the process crashed, we are able to recover
> > > most edits in this last file (in my testing, it seemed to be all edits
> > > up to the last flush of the regionserver process).
> > >
> > > The upshot is that, tentatively, we may have a "working" append in the
> > > 0.20 timeframe (in 0.21, we should have
> > > https://issues.apache.org/jira/browse/HDFS-265).  I'll keep testing, but
> > > I'd suggest it's time for others to try it out.
> > >
> > > With HADOOP-4379, the process recovering non-closed log files -- the
> > > master in our case -- must successfully open the file in append mode and
> > > then close it.  Once closed, new readers can purportedly see up to the
> > > last flush.  The open for append can take a little while before it will
> > > go through (the complaint is that another process holds the file's
> > > lease).  Meantime, the opening process must retry.  In my experience
> > > it's taking 2-10 seconds.
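> > >
> > > The recovery step boils down to a loop like the following (an
> > > illustrative sketch only, not the actual HLog#splitLog code; the
> > > method name is made up):
> > >
> > >   // Force lease recovery of an unclosed log: open for append,
> > >   // then close; retry while the old lease is still held.
> > >   void recoverLog(FileSystem fs, Path p) throws IOException {
> > >     while (true) {
> > >       try {
> > >         FSDataOutputStream out = fs.append(p);
> > >         out.close();
> > >         return;  // closed; readers can now see up to last flush
> > >       } catch (IOException e) {
> > >         // typically "another process holds the file's lease"
> > >         try {
> > >           Thread.sleep(1000);  // retry every second
> > >         } catch (InterruptedException ie) {
> > >           throw new IOException("interrupted waiting on lease");
> > >         }
> > >       }
> > >     }
> > >   }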
> > >
> > > Support for appends is off by default in hadoop even after HADOOP-4379
> > > has been applied.  To enable it, you need to set dfs.support.append.
> > > Set it everywhere -- all over hadoop and in hbase-site.xml -- so that
> > > hbase/DFSClient can see the attribute.
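> > >
> > > Concretely, something like this stanza in hadoop's hdfs-site.xml and
> > > again in hbase-site.xml (sketch):
> > >
> > >   <property>
> > >     <name>dfs.support.append</name>
> > >     <value>true</value>
> > >   </property>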
> > >
> > > HBase TRUNK will recognize whether the bundled hadoop supports append
> > > via introspection (SequenceFile has a new syncFs method once HADOOP-4379
> > > has been applied).  If an append-supporting hadoop is present, and
> > > dfs.support.append is set in the hbase context, then hbase, when running
> > > HLog#splitLog, will try to open files for append.  On regionserver
> > > crash, you can see the master's HLog#splitLog loop retrying the open for
> > > append until it is successful (you'll see in the master log the
> > > complaint that the lease on the file is held by another process).  We
> > > retry every second.
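> > >
> > > The introspection amounts to a reflection check along these lines (a
> > > sketch; the method name here is made up):
> > >
> > >   // Does this hadoop's SequenceFile.Writer have syncFs()?
> > >   // If so, HADOOP-4379 has been applied.
> > >   boolean isAppendSupported() {
> > >     try {
> > >       SequenceFile.Writer.class.getMethod("syncFs");
> > >       return true;
> > >     } catch (NoSuchMethodException e) {
> > >       return false;  // stock hadoop 0.20, no append support
> > >     }
> > >   }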
> > >
> > > Successful recovery of all edits is uncovering new, interesting issues.
> > > In my testing I was killing not just the regionserver alone but also the
> > > regionserver and datanode together.  In the latter case, what I would
> > > see is that the namenode would continue to assign the dead datanode
> > > work, at least until its lease expired.  Fair enough, says you, but the
> > > datanode lease is ten minutes by default.  I set it down in my tests
> > > using heartbeat.recheck.interval (there is a pregnant comment in
> > > HADOOP-4379 w/ clientside code where Ruyue Ma says they get around this
> > > issue by having the client pass the namenode the datanodes it knows are
> > > dead when asking for an extra block).  We might want to recommend
> > > setting it down in general.
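> > >
> > > E.g., in hdfs-site.xml (value illustrative; the interval is in
> > > milliseconds):
> > >
> > >   <property>
> > >     <name>heartbeat.recheck.interval</name>
> > >     <value>15000</value>
> > >   </property>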
> > >
> > > Other issues are hbase bugs we see once all edits are recovered.  I've
> > > been filing issues on these over the last few days.
> > >
> > > St.Ack
> > >
> > >
> > >
> > >
> > >
> > > On Fri, Aug 7, 2009 at 9:03 AM, Andrew Purtell <ap...@apache.org>
> > > wrote:
> > >
> > > > Good to see there's direct edit replication support; that can make
> > > > things easier.
> > > >
> > > > I've seen people use DRBD or NFS to replicate edits currently.
> > > >
> > > > Namenode failover is a "solvable" issue with traditional HA: OS level
> > > > heartbeats, fencing, fail over -- e.g. an HA infrastructure daemon
> > > > starts the NN instance on node B if the heartbeat from node A is lost,
> > > > and takes a power control operation on A to make sure it is dead. On
> > > > both nodes the infrastructure daemons trigger the OS watchdog if the
> > > > NN process dies. Combine this with automatic IP address reassignment.
> > > > Then, page the operators. Add another node C for additional
> > > > redundancy, and make sure all of the alternatives are on separate
> > > > racks and power rails, and make sure the L2 and L3 topology is also HA
> > > > (e.g. bonded ethernet to redundant switches at L2, mesh routing at L3,
> > > > etc.) If the cluster is not super huge it can all be spanned at L2
> > > > over redundant switches. L3 redundancy is trickier. A typical
> > > > configuration could have a lot of OSPF stub networks -- depends how L2
> > > > is partitioned -- which can make the routing table difficult for
> > > > operators to sort out.
> > > >
> > > > I've seen this type of thing work for myself, ~15 seconds from
> > > > (simulated) fault on NN node A to the new NN up and responding to DN
> > > > reconnections on node B, with 0.19.
> > > >
> > > > You can build in additional assurance of fast failover by running
> > > > redundant processes alongside a few datanodes which repeatedly ping
> > > > the NN via the namenode protocol and trigger fencing and failover if
> > > > it stops responding.
> > > >
> > > > One wrinkle is that the new namenode starts up in safe mode. As long
> > > > as HBase can handle temporary periods where the cluster goes into safe
> > > > mode after NN failover, it can ride it out.
> > > >
> > > > This is ugly, but it is, I believe, an accepted and valid systems
> > > > engineering solution to the NN SPOF issue for the folks I mentioned in
> > > > my previous email, something they would be familiar with. Edit
> > > > replication support in HDFS 0.21 makes it a little less work to
> > > > achieve and maybe a little faster to execute, so that's an
> > > > improvement.
> > > >
> > > > It may be overstating it a little to say that the NN SPOF is not a
> > > > concern for HBase, but, in my opinion, we need to address the WAL and
> > > > lack-of-FSCK issues first before being concerned about it. HBase can
> > > > lose data all on its own.
> > > >
> > > >   - Andy
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > ________________________________
> > > > From: Jean-Daniel Cryans <jd...@apache.org>
> > > > To: hbase-dev@hadoop.apache.org
> > > > Sent: Friday, August 7, 2009 3:25:19 AM
> > > > Subject: Re: roadmap: data integrity
> > > >
> > > > https://issues.apache.org/jira/browse/HADOOP-4539
> > > >
> > > > This issue was closed long ago. But Steve Loughran just said on the
> > > > hadoop mailing list that the new NN has to come up with the same
> > > > IP/hostname as the failed one.
> > > >
> > > > J-D
> > > >
> > > > On Fri, Aug 7, 2009 at 2:37 AM, Ryan Rawson<ry...@gmail.com>
> wrote:
> > > > > WAL is a major issue, but another one that is coming up fast is the
> > > > > SPOF that is the namenode.
> > > > >
> > > > > Right now, namenode aside, I can rolling-restart my entire cluster,
> > > > > including rebooting the machines if I need to. But not so with the
> > > > > namenode, because if it goes AWOL, all sorts of bad things can
> > > > > happen.
> > > > >
> > > > > I hope that HDFS 0.21 addresses both these issues.  Can we get
> > > > > positive confirmation that this is being worked on?
> > > > >
> > > > > -ryan
> > > > >
> > > > > On Thu, Aug 6, 2009 at 10:25 AM, Andrew Purtell<
> apurtell@apache.org>
> > > > wrote:
> > > > >> I updated the roadmap up on the wiki:
> > > > >>
> > > > >>
> > > > >> * Data integrity
> > > > >>    * Ensure that proper append() support in HDFS actually closes
> > > > >>      the WAL last-block write hole
> > > > >>    * HBase-FSCK (HBASE-7) -- suggest making this a blocker for 0.21
> > > > >>
> > > > >> I have had several recent conversations on my travels with people
> > > > >> in Fortune 100 companies (based on this list:
> > > > >> http://www.wageproject.org/content/fortune/index.php).
> > > > >>
> > > > >> You and I know we can set up well-engineered HBase 0.20 clusters
> > > > >> that will be operationally solid for a wide range of use cases, but
> > > > >> given those aforementioned discussions there are certain sectors
> > > > >> which would say HBASE-7 is #1 before HBase is "bank ready". Not
> > > > >> until we can say:
> > > > >>
> > > > >>  - Yes, when the client sees data has been committed, it actually
> > > > >> has been written and replicated on spinning or solid state media in
> > > > >> all cases.
> > > > >>
> > > > >>  - Yes, we go to great lengths to recover data if, ${deity} forbid,
> > > > >> you crush some underprovisioned cluster with load or some bizarre
> > > > >> bug or system fault happens.
> > > > >>
> > > > >> HBASE-1295 is also required for business continuity reasons, but
> > > > >> this is already a priority item for some HBase committers.
> > > > >>
> > > > >> The question, I think, is whether the above aligns with project
> > > > >> goals. Making HBase-FSCK a blocker will probably knock something
> > > > >> someone wants for the 0.21 timeframe off the list.
> > > > >>
> > > > >>   - Andy
> > > > >>
> > > > >>
> > > > >>
> > > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> > >
> >
> >
> >
> >
> >
>
>
>
>
>

Re: append (hadoop-4379), was -> Re: roadmap: data integrity

Posted by stack <st...@duboce.net>.
If you kill the datanode and regionserver together, you may run into the
issue Cosmin just put up a patch for in HDFS-630.
St.Ack
P.S. That's good news that it's working for you, Clint.


Re: append (hadoop-4379), was -> Re: roadmap: data integrity

Posted by Ryan Rawson <ry...@gmail.com>.
I've been working on HDFS-200 for a while, and I see similar
experiences.  The questions I have for HDFS-265 are: how performant
is it? How expensive are the syncs? And just how good is the recovery?

Next time try kill -9ing the regionserver and the datanode on the same server.

-ryan
