Posted to user@hbase.apache.org by "Buckley,Ron" <bu...@oclc.org> on 2014/10/02 19:18:33 UTC

Recovering hbase after a failure

We just had an event where, on our main hbase instance, the /hbase directory got moved out from under the running system (Human error).

HBase was really unhappy about that, but we were able to recover it fairly easily and get back going.

As far as I can tell, all the data and tables came back correct. But, I'm pretty concerned that there may be some hidden corruption or data loss.

'hbase hbck'  runs clean and there are no new complaints in the logs.

Can anyone think of anything else we should look at?
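
For anyone landing on this thread later, the checks discussed in the replies boil down to a short list. A rough sketch (the table name is a placeholder; exact invocations vary by HBase/Hadoop version):

```shell
# Metadata consistency: a clean run reports 0 inconsistencies
hbase hbck

# Gross sanity check: compare directory/file counts under /hbase
# against a known-good copy, e.g. a DR cluster
hdfs dfs -count /hbase

# Row-level spot check: RowCounter runs a MapReduce scan over the table
hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'my_table'
```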





RE: Recovering hbase after a failure

Posted by "Buckley,Ron" <bu...@oclc.org>.
Nick,

Good ideas.    Compared file and region counts with our DR site.   Things look OK.  Going to run some rowcounters too. 

Feels like we got off easy.

Ron

-----Original Message-----
From: Nick Dimiduk [mailto:ndimiduk@gmail.com] 
Sent: Thursday, October 02, 2014 1:27 PM
To: hbase-user
Subject: Re: Recovering hbase after a failure

Hi Ron,

Yikes!

Do you have any basic metrics regarding the amount of data in the system -- size of store files before the incident, number of records, &c?

You could sift through the HDFS audit log and see if any files that were there previously have not been restored.

-n

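
Nick's audit-log suggestion can be scripted. A minimal sketch (the key=value layout mirrors the default FSNamesystem audit format, but the paths are made up for illustration, and only create/delete events are replayed; a real pass would also handle rename):

```python
import re

# Hypothetical sample of HDFS audit log lines
AUDIT_LINES = """\
2014-10-01 09:00:01 INFO FSNamesystem.audit: allowed=true ugi=hbase ip=/10.0.0.1 cmd=create src=/hbase/t1/r1/cf/file1 dst=null perm=hbase:hbase:rw-r--r--
2014-10-01 09:05:12 INFO FSNamesystem.audit: allowed=true ugi=hbase ip=/10.0.0.1 cmd=create src=/hbase/t1/r1/cf/file2 dst=null perm=hbase:hbase:rw-r--r--
2014-10-01 09:09:33 INFO FSNamesystem.audit: allowed=true ugi=hbase ip=/10.0.0.2 cmd=delete src=/hbase/t1/r1/cf/file1 dst=null perm=null
""".splitlines()

def expected_paths(lines):
    """Replay create/delete events to estimate which files should exist now."""
    live = set()
    for line in lines:
        fields = dict(re.findall(r"(\w+)=(\S+)", line))
        if fields.get("cmd") == "create":
            live.add(fields["src"])
        elif fields.get("cmd") == "delete":
            live.discard(fields["src"])
    return live

# What is actually present after recovery (e.g. from `hdfs dfs -ls -R /hbase`);
# empty here to simulate a file that never came back
current = set()

missing = sorted(expected_paths(AUDIT_LINES) - current)
print(missing)  # -> ['/hbase/t1/r1/cf/file2']
```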

RE: Recovering hbase after a failure

Posted by "Buckley,Ron" <bu...@oclc.org>.
There were a bunch of new WALs, but they were all empty. 
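
A quick way to double-check that observation, assuming the pre-0.96 layout where each region server's WALs live under /hbase/.logs (the path differs in later versions):

```shell
# Per-server byte totals for the WAL directories; all zeros is
# consistent with "they were all empty"
hdfs dfs -du -s /hbase/.logs/*
```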

-----Original Message-----
From: Esteban Gutierrez [mailto:esteban@cloudera.com] 
Sent: Thursday, October 02, 2014 2:27 PM
To: user@hbase.apache.org
Subject: Re: Recovering hbase after a failure

Thanks for sharing the details Ron.

Did you move any WAL that might have been created back to the original .logs directory? If some RSs rolled their WALs at the time of the first mv, those logs should have been replayed after merging the content of the original /hbase dir with the content of /hbase during the crash. If not, then you probably have some missing data that needs to be replayed from those logs.

esteban.


--
Cloudera, Inc.


On Thu, Oct 2, 2014 at 11:17 AM, Buckley,Ron <bu...@oclc.org> wrote:

> FWIW, in case something like this happens to someone else.
>
> To recover this, the first thing I tried was to just mv the /hbase
> directory back.   That doesn't work.
>
> To get back going, we had to completely shut down and restart.
>
> Also, once the original /hbase got mv'd, a few of the region servers did
> some flushes before they aborted.   Those RS's actually created a new
> /hbase, with new table directories, but only containing the data from
> the flush.
>
>


Re: Recovering hbase after a failure

Posted by Esteban Gutierrez <es...@cloudera.com>.
On Thu, Oct 2, 2014 at 3:12 PM, Andrew Purtell <ap...@apache.org> wrote:

> On Thu, Oct 2, 2014 at 3:02 PM, Esteban Gutierrez <es...@cloudera.com>
> wrote:
>
> > Another possibility is that we could
> > live with createNonRecursive until FileSystem becomes fully deprecated and
> > we can migrate to FileContext, perhaps for HBase 3.x?
> >
>
> Sure
>

Great!


>
>
> > HBASE-11045 goes in
> > the opposite direction to this but the discussion is in essence the same
> > problem.
> >
>
> Yes. Although I don't read it as going in the opposite direction. I read it
> as coming to the conclusion that there is no good alternative to
> createNonRecursive, which we need. Perhaps someone working with colleagues
> who have HDFS commit privileges can work out something suitable.
>
>
That's correct, no good alternative to replace the deprecated method; the
other approaches should be good for now. Probably I wasn't clear enough :-)


>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>


Re: Recovering hbase after a failure

Posted by Andrew Purtell <ap...@apache.org>.
On Thu, Oct 2, 2014 at 3:02 PM, Esteban Gutierrez <es...@cloudera.com>
wrote:

> I get that isDirectory is not atomic and not the best solution, but at
> least can provide an alternative to fail the operation without using the
> deprecated API or altering FileSystem
>

This is not an alternative solution because it's not atomic. Might as well
do nothing, because a simple quirk of timing would produce the same result
as if no changes were made.



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)
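
The objection here is the classic check-then-act (TOCTOU) race. As a local-filesystem analogy (plain Python against a temp directory, not the HDFS API), this sketch contrasts a racy isDirectory-style probe with a create call that, like createNonRecursive, refuses to manufacture missing parent directories and therefore fails atomically at create time:

```python
import os
import tempfile

root = tempfile.mkdtemp()
parent = os.path.join(root, "hbase", "table1")  # deliberately never created
target = os.path.join(parent, "flushfile")

def racy_create(path):
    """isDirectory-then-create: the parent can be renamed away right
    between the check and the open, so the check guarantees nothing."""
    if not os.path.isdir(os.path.dirname(path)):
        raise IOError("parent missing")
    # <-- a concurrent rename/delete of the parent can land here
    with open(path, "w") as f:
        f.write("data")

def non_recursive_create(path):
    """Analog of createNonRecursive: os.open() never creates missing
    parents, so a vanished parent fails atomically with ENOENT."""
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    os.close(fd)

try:
    non_recursive_create(target)  # parent does not exist
    outcome = "created"
except OSError:
    outcome = "refused"

print(outcome)  # -> refused
```

The probe costs an extra round trip but buys no guarantee; the atomic variant makes the filesystem itself enforce the invariant.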

Re: Recovering hbase after a failure

Posted by Esteban Gutierrez <es...@cloudera.com>.
I get that isDirectory is not atomic and not the best solution, but at
least it can provide an alternative to fail the operation without using the
deprecated API or altering FileSystem. Another possibility is that we could
live with createNonRecursive until FileSystem becomes fully deprecated and
we can migrate to FileContext, perhaps for HBase 3.x? HBASE-11045 goes in
the opposite direction to this, but the discussion is in essence the same
problem.

thanks!
esteban.


--
Cloudera, Inc.



Re: Recovering hbase after a failure

Posted by Andrew Purtell <ap...@apache.org>.
14 if you count createNewFile :-)

http://search-hadoop.com/m/282AcZLDAp1. Maybe you could tap Andrew or Colin
on the shoulder Esteban?



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: Recovering hbase after a failure

Posted by Andrew Purtell <ap...@apache.org>.
It's not the round trip, it's the atomicity of the operation. Consider a
rename happening between the isDirectory call and the subsequent create
call. What would you have achieved by introducing the isDirectory check? I
skimmed the FileSystem javadoc for 2.4.1 and none of the 13 non-deprecated
create methods can provide the same semantics of createNonRecursive, shame.





-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: Recovering hbase after a failure

Posted by Esteban Gutierrez <es...@cloudera.com>.
I'm not sure if we should use the deprecated API, calling isDirectory
shouldn't be that expensive in the NN but it will add another RPC call per
flush.

esteban.

--
Cloudera, Inc.


On Thu, Oct 2, 2014 at 11:26 AM, Andrew Purtell <ap...@apache.org> wrote:

> On Thu, Oct 2, 2014 at 11:17 AM, Buckley,Ron <bu...@oclc.org> wrote:
>
> > Also, once the original /hbase got mv'd, a few of the region servers did
> > some flush's before they aborted.   Those RS's actually created a new
> > /hbase, with new table directories, but only containing the data from the
> > flush.
>
>
> Sounds like we should be creating flush files with createNonRecursive (even
> though it's deprecated)

Re: Recovering hbase after a failure

Posted by Esteban Gutierrez <es...@cloudera.com>.
Depending on which version they are using, the RS should retry the
operation to HDFS as we currently do. Eventually clients should be rejected
due to maxing out the call queue. The question is for how long we should
keep the RS up until HDFS or the filesystem structure is back. Worst case
scenario, we could provide a last-resort option to drain the memstore or
the WAL before the RS goes down when there is no filesystem available.

esteban.

--
Cloudera, Inc.


On Thu, Oct 2, 2014 at 11:39 AM, Nick Dimiduk <nd...@gmail.com> wrote:

> In this case, didn't the RS creating the directories and flushing the files
> prevent data loss? Had the flush aborted due to lack of directories, that
> flush data would have been lost entirely.

Re: Recovering hbase after a failure

Posted by Nick Dimiduk <nd...@gmail.com>.
Ah yes, of course there is.

On Thu, Oct 2, 2014 at 12:11 PM, Andrew Purtell <an...@gmail.com>
wrote:

> Is there not the WAL to handle a failed flush?
>

Re: Recovering hbase after a failure

Posted by Andrew Purtell <an...@gmail.com>.
Is there not the WAL to handle a failed flush?



> On Oct 2, 2014, at 11:39 AM, Nick Dimiduk <nd...@gmail.com> wrote:
> 
> In this case, didn't the RS creating the directories and flushing the files
> prevent data loss? Had the flush aborted due to lack of directories, that
> flush data would have been lost entirely.
> 
>> On Thu, Oct 2, 2014 at 11:26 AM, Andrew Purtell <ap...@apache.org> wrote:
>> 
>> On Thu, Oct 2, 2014 at 11:17 AM, Buckley,Ron <bu...@oclc.org> wrote:
>> 
>>> Also, once the original /hbase got mv'd, a few of the region servers did
>>> some flush's before they aborted.   Those RS's actually created a new
>>> /hbase, with new table directories, but only containing the data from the
>>> flush.
>> 
>> 
>> Sounds like we should be creating flush files with createNonRecursive (even
>> though it's deprecated)
>> 
>> 
>>> On Thu, Oct 2, 2014 at 11:17 AM, Buckley,Ron <bu...@oclc.org> wrote:
>>> 
>>> FWIW, in case something like this happens to someone else.
>>> 
>>> To recover this, the first thing I tried was to just mv the /hbase
>>> directory back.   That doesn’t work.
>>> 
>>> To get back going had to completely shut down and restart.
>>> 
>>> Also, once the original /hbase got mv'd, a few of the region servers did
>>> some flush's before they aborted.   Those RS's actually created a new
>>> /hbase, with new table directories, but only containing the data from the
>>> flush.
>>> 
>>> 
>>> -----Original Message-----
>>> From: Buckley,Ron
>>> Sent: Thursday, October 02, 2014 2:09 PM
>>> To: hbase-user
>>> Subject: RE: Recovering hbase after a failure
>>> 
>>> Nick,
>>> 
>>> Good ideas.    Compared  file and region counts with our DR site.
>> Things
>>> looks OK.  Going to run some rowcounter's too.
>>> 
>>> Feels like we got off easy.
>>> 
>>> Ron
>>> 
>>> -----Original Message-----
>>> From: Nick Dimiduk [mailto:ndimiduk@gmail.com]
>>> Sent: Thursday, October 02, 2014 1:27 PM
>>> To: hbase-user
>>> Subject: Re: Recovering hbase after a failure
>>> 
>>> Hi Ron,
>>> 
>>> Yikes!
>>> 
>>> Do you have any basic metrics regarding the amount of data in the system
>>> -- size of store files before the incident, number of records, &c?
>>> 
>>> You could sift through the HDFS audit log and see if any files that were
>>> there previously have not been restored.
>>> 
>>> -n
>>> 
>>>> On Thu, Oct 2, 2014 at 10:18 AM, Buckley,Ron <bu...@oclc.org> wrote:
>>>> 
>>>> We just had an event where, on our main hbase instance, the /hbase
>>>> directory got moved out from under the running system (Human error).
>>>> 
>>>> HBase was really unhappy about that, but we were able to recover it
>>>> fairly easily and get back going.
>>>> 
>>>> As far as I can tell, all the data and tables came back correct. But,
>>>> I'm pretty concerned that there may be some hidden corruption or data
>>> loss.
>>>> 
>>>> 'hbase hbck'  runs clean and there are no new complaints in the logs.
>>>> 
>>>> Can anyone think of anything else we should look at?
>> 
>> 
>> 
>> --
>> Best regards,
>> 
>>   - Andy
>> 
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> (via Tom White)
>> 

Re: Recovering hbase after a failure

Posted by Nick Dimiduk <nd...@gmail.com>.
In this case, didn't the RS creating the directories and flushing the files
prevent data loss? Had the flush aborted due to lack of directories, that
flush data would have been lost entirely.

On Thu, Oct 2, 2014 at 11:26 AM, Andrew Purtell <ap...@apache.org> wrote:

> On Thu, Oct 2, 2014 at 11:17 AM, Buckley,Ron <bu...@oclc.org> wrote:
>
> > Also, once the original /hbase got mv'd, a few of the region servers did
> > some flush's before they aborted.   Those RS's actually created a new
> > /hbase, with new table directories, but only containing the data from the
> > flush.
>
>
> Sounds like we should be creating flush files with createNonRecursive (even
> though it's deprecated)
>
>
> On Thu, Oct 2, 2014 at 11:17 AM, Buckley,Ron <bu...@oclc.org> wrote:
>
> > FWIW, in case something like this happens to someone else.
> >
> > To recover this, the first thing I tried was to just mv the /hbase
> > directory back.   That doesn’t work.
> >
> > To get back going had to completely shut down and restart.
> >
> > Also, once the original /hbase got mv'd, a few of the region servers did
> > some flush's before they aborted.   Those RS's actually created a new
> > /hbase, with new table directories, but only containing the data from the
> > flush.
> >
> >
> > -----Original Message-----
> > From: Buckley,Ron
> > Sent: Thursday, October 02, 2014 2:09 PM
> > To: hbase-user
> > Subject: RE: Recovering hbase after a failure
> >
> > Nick,
> >
> > Good ideas.    Compared  file and region counts with our DR site.
>  Things
> > looks OK.  Going to run some rowcounter's too.
> >
> > Feels like we got off easy.
> >
> > Ron
> >
> > -----Original Message-----
> > From: Nick Dimiduk [mailto:ndimiduk@gmail.com]
> > Sent: Thursday, October 02, 2014 1:27 PM
> > To: hbase-user
> > Subject: Re: Recovering hbase after a failure
> >
> > Hi Ron,
> >
> > Yikes!
> >
> > Do you have any basic metrics regarding the amount of data in the system
> > -- size of store files before the incident, number of records, &c?
> >
> > You could sift through the HDFS audit log and see if any files that were
> > there previously have not been restored.
> >
> > -n
> >
> > On Thu, Oct 2, 2014 at 10:18 AM, Buckley,Ron <bu...@oclc.org> wrote:
> >
> > > We just had an event where, on our main hbase instance, the /hbase
> > > directory got moved out from under the running system (Human error).
> > >
> > > HBase was really unhappy about that, but we were able to recover it
> > > fairly easily and get back going.
> > >
> > > As far as I can tell, all the data and tables came back correct. But,
> > > I'm pretty concerned that there may be some hidden corruption or data
> > loss.
> > >
> > > 'hbase hbck'  runs clean and there are no new complaints in the logs.
> > >
> > > Can anyone think of anything else we should look at?
> > >
> > >
> > >
> > >
> > >
> >
>
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>

Re: Recovering hbase after a failure

Posted by Andrew Purtell <ap...@apache.org>.
On Thu, Oct 2, 2014 at 11:17 AM, Buckley,Ron <bu...@oclc.org> wrote:

> Also, once the original /hbase got mv'd, a few of the region servers did
> some flush's before they aborted.   Those RS's actually created a new
> /hbase, with new table directories, but only containing the data from the
> flush.


Sounds like we should be creating flush files with createNonRecursive (even
though it's deprecated).
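For illustration, here is a local-filesystem analogy (synthetic temp paths, not HBase code): the default recursive create silently rebuilds missing parent directories, which is how an aborting RS's flush can materialize a fresh /hbase tree, whereas a non-recursive create would fail instead:

```shell
# Local-FS analogy of the two create semantics (synthetic paths only).
tmp=$(mktemp -d)

# Recursive create (default FileSystem.create behavior): missing
# parents are silently rebuilt -- this is how a flush can resurrect
# a brand-new /hbase tree.
mkdir -p "$tmp/hbase/table/region" && touch "$tmp/hbase/table/region/flushfile"

rm -rf "$tmp/hbase"   # simulate the errant mv of /hbase

# Non-recursive create (createNonRecursive semantics): refuses when
# the parent directory is gone, so the flush would abort instead.
if touch "$tmp/hbase/table/region/flushfile2" 2>/dev/null; then
  echo "created"
else
  echo "refused: parent missing"   # prints this branch
fi

rm -rf "$tmp"
```

The trade-off mirrors the discussion above: failing the flush loses that flush's data unless the WAL covers it, but succeeding recreates directory trees the operator deliberately (or accidentally) removed.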


On Thu, Oct 2, 2014 at 11:17 AM, Buckley,Ron <bu...@oclc.org> wrote:

> FWIW, in case something like this happens to someone else.
>
> To recover this, the first thing I tried was to just mv the /hbase
> directory back.   That doesn’t work.
>
> To get back going had to completely shut down and restart.
>
> Also, once the original /hbase got mv'd, a few of the region servers did
> some flush's before they aborted.   Those RS's actually created a new
> /hbase, with new table directories, but only containing the data from the
> flush.
>
>
> -----Original Message-----
> From: Buckley,Ron
> Sent: Thursday, October 02, 2014 2:09 PM
> To: hbase-user
> Subject: RE: Recovering hbase after a failure
>
> Nick,
>
> Good ideas.    Compared  file and region counts with our DR site.   Things
> looks OK.  Going to run some rowcounter's too.
>
> Feels like we got off easy.
>
> Ron
>
> -----Original Message-----
> From: Nick Dimiduk [mailto:ndimiduk@gmail.com]
> Sent: Thursday, October 02, 2014 1:27 PM
> To: hbase-user
> Subject: Re: Recovering hbase after a failure
>
> Hi Ron,
>
> Yikes!
>
> Do you have any basic metrics regarding the amount of data in the system
> -- size of store files before the incident, number of records, &c?
>
> You could sift through the HDFS audit log and see if any files that were
> there previously have not been restored.
>
> -n
>
> On Thu, Oct 2, 2014 at 10:18 AM, Buckley,Ron <bu...@oclc.org> wrote:
>
> > We just had an event where, on our main hbase instance, the /hbase
> > directory got moved out from under the running system (Human error).
> >
> > HBase was really unhappy about that, but we were able to recover it
> > fairly easily and get back going.
> >
> > As far as I can tell, all the data and tables came back correct. But,
> > I'm pretty concerned that there may be some hidden corruption or data
> loss.
> >
> > 'hbase hbck'  runs clean and there are no new complaints in the logs.
> >
> > Can anyone think of anything else we should look at?
> >
> >
> >
> >
> >
>



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

RE: Recovering hbase after a failure

Posted by "Buckley,Ron" <bu...@oclc.org>.
FWIW, in case something like this happens to someone else.

To recover this, the first thing I tried was to just mv the /hbase directory back.   That doesn’t work.

To get back going, we had to completely shut down and restart.

Also, once the original /hbase got mv'd, a few of the region servers did some flushes before they aborted.   Those RSs actually created a new /hbase, with new table directories, but only containing the data from the flush.


-----Original Message-----
From: Buckley,Ron 
Sent: Thursday, October 02, 2014 2:09 PM
To: hbase-user
Subject: RE: Recovering hbase after a failure

Nick,

Good ideas.  Compared file and region counts with our DR site.  Things look OK.  Going to run some rowcounters too.

Feels like we got off easy.

Ron

-----Original Message-----
From: Nick Dimiduk [mailto:ndimiduk@gmail.com]
Sent: Thursday, October 02, 2014 1:27 PM
To: hbase-user
Subject: Re: Recovering hbase after a failure

Hi Ron,

Yikes!

Do you have any basic metrics regarding the amount of data in the system -- size of store files before the incident, number of records, &c?

You could sift through the HDFS audit log and see if any files that were there previously have not been restored.

-n

On Thu, Oct 2, 2014 at 10:18 AM, Buckley,Ron <bu...@oclc.org> wrote:

> We just had an event where, on our main hbase instance, the /hbase 
> directory got moved out from under the running system (Human error).
>
> HBase was really unhappy about that, but we were able to recover it 
> fairly easily and get back going.
>
> As far as I can tell, all the data and tables came back correct. But, 
> I'm pretty concerned that there may be some hidden corruption or data loss.
>
> 'hbase hbck'  runs clean and there are no new complaints in the logs.
>
> Can anyone think of anything else we should look at?
>
>
>
>
>

Re: Recovering hbase after a failure

Posted by Nick Dimiduk <nd...@gmail.com>.
Hi Ron,

Yikes!

Do you have any basic metrics regarding the amount of data in the system --
size of store files before the incident, number of records, &c?

You could sift through the HDFS audit log and see if any files that were
there previously have not been restored.

-n
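A sketch of that sift, run here against synthetic sample lines in roughly the NameNode audit-log format (on a real cluster, point the grep at your hdfs-audit.log; the path and field layout vary by version and distro):

```shell
# Synthetic sample of NameNode audit-log lines (format approximated);
# on a real cluster, grep your actual hdfs-audit.log instead.
tmp=$(mktemp -d)
cat > "$tmp/hdfs-audit.log" <<'EOF'
2014-10-02 12:01:13,001 INFO FSNamesystem.audit: allowed=true ugi=hbase ip=/10.0.0.5 cmd=create src=/hbase/t1/r1/f1 dst=null
2014-10-02 12:05:42,118 INFO FSNamesystem.audit: allowed=true ugi=admin ip=/10.0.0.9 cmd=rename src=/hbase dst=/hbase.bak
2014-10-02 12:07:03,500 INFO FSNamesystem.audit: allowed=true ugi=admin ip=/10.0.0.9 cmd=delete src=/hbase.bak/t1/r1/f1 dst=null
EOF

# Surface renames/deletes touching /hbase around the incident window;
# anything deleted but never re-created is a candidate for lost data.
grep -E 'cmd=(rename|delete) src=/hbase' "$tmp/hdfs-audit.log"

rm -rf "$tmp"
```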

On Thu, Oct 2, 2014 at 10:18 AM, Buckley,Ron <bu...@oclc.org> wrote:

> We just had an event where, on our main hbase instance, the /hbase
> directory got moved out from under the running system (Human error).
>
> HBase was really unhappy about that, but we were able to recover it fairly
> easily and get back going.
>
> As far as I can tell, all the data and tables came back correct. But, I'm
> pretty concerned that there may be some hidden corruption or data loss.
>
> 'hbase hbck'  runs clean and there are no new complaints in the logs.
>
> Can anyone think of anything else we should look at?
>
>
>
>
>

RE: Recovering hbase after a failure

Posted by "Buckley,Ron" <bu...@oclc.org>.
Esteban,

Thanks. No WAL replay errors. Just about all the region servers logged a DroppedSnapshotException and then aborted. I think we're good as far as that goes.

Ron

-----Original Message-----
From: Esteban Gutierrez [mailto:esteban@cloudera.com] 
Sent: Thursday, October 02, 2014 1:26 PM
To: user@hbase.apache.org
Subject: Re: Recovering hbase after a failure

Hi Ron,

Look into dropped snapshot exceptions in the logs and puts or deletes that skip the WAL. If everything is good there then clients should have handled the unavailability of HBase and there should not be any dataloss from the server side. Also double check if after the crash there were not errors replaying the WAL.

esteban.




--
Cloudera, Inc.


On Thu, Oct 2, 2014 at 10:18 AM, Buckley,Ron <bu...@oclc.org> wrote:

> We just had an event where, on our main hbase instance, the /hbase 
> directory got moved out from under the running system (Human error).
>
> HBase was really unhappy about that, but we were able to recover it 
> fairly easily and get back going.
>
> As far as I can tell, all the data and tables came back correct. But, 
> I'm pretty concerned that there may be some hidden corruption or data loss.
>
> 'hbase hbck'  runs clean and there are no new complaints in the logs.
>
> Can anyone think of anything else we should look at?
>
>
>
>
>

Re: Recovering hbase after a failure

Posted by Esteban Gutierrez <es...@cloudera.com>.
Hi Ron,

Look into dropped snapshot exceptions in the logs, and into puts or deletes
that skip the WAL. If everything is good there, then clients should have
handled the unavailability of HBase and there should not be any data loss
from the server side. Also double-check that there were no errors replaying
the WAL after the crash.
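A rough way to run those two checks (synthetic log lines below stand in for real logs; on an actual deployment, point the greps at your region server and master logs under whatever log directory your install uses):

```shell
# Synthetic region-server log lines (format approximated); adjust the
# paths and patterns for your deployment's actual logs.
tmp=$(mktemp -d)
cat > "$tmp/regionserver.log" <<'EOF'
2014-10-02 13:05:01,000 FATAL regionserver.HRegionServer: ABORTING region server: Replay of WAL required. Forcing server shutdown
DroppedSnapshotException: region: t1,,1412100000000
EOF

# 1) Flush aborts: which logs mention DroppedSnapshotException?
grep -l 'DroppedSnapshotException' "$tmp"/*.log

# 2) WAL replay trouble after restart: count error lines mentioning
#    log splitting or replay (0 here, which is what you want to see).
grep -ciE 'error.*(splitting|replay)' "$tmp"/*.log || true

rm -rf "$tmp"
```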

esteban.




--
Cloudera, Inc.


On Thu, Oct 2, 2014 at 10:18 AM, Buckley,Ron <bu...@oclc.org> wrote:

> We just had an event where, on our main hbase instance, the /hbase
> directory got moved out from under the running system (Human error).
>
> HBase was really unhappy about that, but we were able to recover it fairly
> easily and get back going.
>
> As far as I can tell, all the data and tables came back correct. But, I'm
> pretty concerned that there may be some hidden corruption or data loss.
>
> 'hbase hbck'  runs clean and there are no new complaints in the logs.
>
> Can anyone think of anything else we should look at?
>
>
>
>
>