Posted to user@hbase.apache.org by Brennon Church <br...@getjar.com> on 2013/04/12 07:50:50 UTC

Lost regions question

Hello,

I had an interesting problem come up recently.  We have a few thousand
regions across 8 datanode/regionservers.  I made a change, increasing
the heap size for hadoop from 128M to 2048M, which ended up bringing the
cluster to a complete halt after about 1 hour.  I reverted back to 128M
and turned things back on again, but didn't realize at the time that I
came up with 9 fewer regions than I started with.  Upon further
investigation, I found that all 9 missing regions were from splits that
occurred while the cluster was running, after the heap change was made
and before things came to a halt.  There was a 10th region (5 splits
involved in total) that managed to get recovered.  The really odd thing
is that for the other 9 regions, the original parent regions, which as
far as I can tell from the logs had been deleted, were re-opened when
things were restarted.  The daughter regions were gone.  Interestingly,
I found the orphaned datablocks still intact, and in at least some cases
have been able to extract the data from them and will hopefully re-add
it to the tables.
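
In case it is useful to anyone, the orphaned files can be located and
dumped with something along these lines (the table, region and file
names below are placeholders, not our real ones):

    # list the region directories that exist on hdfs for the table
    hadoop fs -ls /hbase/some_table

    # dump the key/values held in a suspect store file
    # (the HFile tool ships with HBase)
    hbase org.apache.hadoop.hbase.io.hfile.HFile -p -f \
        /hbase/some_table/orphan_region_dir/family/some_hfile

That at least shows whether the data in those files is still readable.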

My question is this.  Does anyone know, based on the rather muddled
description I've given above, what could possibly have happened here?
My best guess is that the bad state hdfs was in caused some critical
component of the split process to be missed, which resulted in a
reference to the parent regions sticking around and in the references
to the daughter regions being lost.

Thanks for any insight you can provide.

--Brennon




Re: Lost regions question

Posted by Leonid Fedotov <lf...@hortonworks.com>.
Try to run "hbase hbck -fix"
It should do the job.
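
For example, something along these lines (run the read-only check
first so you can see what it reports before letting it make changes):

    # report inconsistencies without changing anything
    hbase hbck -details

    # then attempt automatic repair
    hbase hbck -fix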

Thank you!

Sincerely,
Leonid Fedotov

On Apr 12, 2013, at 9:56 AM, Brennon Church wrote:

> hbck does show the hdfs files there without associated regions.  I probably could have recovered had I noticed just after this happened, but given that we've been running like this for over a week, and that there is the potential for collisions between the missing and new data, I'm probably just going to manually reinsert it all using the hdfs files.
> 
> Hadoop version is 1.0.1, btw.
> 
> Thanks.
> 
> --Brennon
> 
> On 4/11/13 11:05 PM, Ted Yu wrote:
>> Brennon:
>> Have you run hbck to diagnose the problem ?
>> 
>> Since the issue might have involved hdfs, browsing DataNode log(s) may
>> provide some clue as well.
>> 
>> What hadoop version are you using ?
>> 
>> Cheers
>> 
>> On Thu, Apr 11, 2013 at 10:58 PM, ramkrishna vasudevan <
>> ramkrishna.s.vasudevan@gmail.com> wrote:
>> 
>>> When you say that the parent regions got reopened does that mean that you
>>> did not lose any data(any data could not be read).  The reason am asking is
>>> if after the parent got split into daughters and the data was written to
>>> daughters and if the daughters related files could not be opened you could
>>> have ended up in not able to read the data.
>>> 
>>> Some logs could tell us what made the parent to get reopened rather than
>>> daughters.  Another thing i would like to ask is was the cluster brought
>>> down abruptly by killing the RS.
>>> 
>>> Which version of HBase?
>>> 
>>> Regards
>>> Ram
>>> 
>>> 
>>> 
>>> 
>>> On Fri, Apr 12, 2013 at 11:20 AM, Brennon Church <br...@getjar.com>
>>> wrote:
>>> 
>>>> Hello,
>>>> 
>>>> I had an interesting problem come up recently.  We have a few thousand
>>>> regions across 8 datanode/regionservers.  I made a change, increasing the
>>>> heap size for hadoop from 128M to 2048M which ended up bringing the
>>> cluster
>>>> to a complete halt after about 1 hour.  I reverted back to 128M and
>>> turned
>>>> things back on again but didn't realize at the time that I came up with 9
>>>> fewer regions than I started.  Upon further investigation, I found that
>>> all
>>>> 9 missing regions were from splits that occurred while the cluster was
>>>> running after making the heap change and before it came to a halt.  There
>>>> was a 10th regions (5 splits involved in total) that managed to get
>>>> recovered.  The really odd thing is that in the case of the other 9
>>>> regions, the original parent regions, which as far as I can tell in the
>>>> logs were deleted, were re-opened upon restarting things once again.  The
>>>> daughter regions were gone.  Interestingly, I found the orphaned
>>> datablocks
>>>> still intact, and in at least some cases have been able to extract the
>>> data
>>>> from them and will hopefully re-add it to the tables.
>>>> 
>>>> My question is this.  Does anyone know based on the rather muddled
>>>> description I've given above, what could have possibly happened here?  My
>>>> best guess is that the bad state that hdfs was in caused some critical
>>>> component of the split process to be missed, which resulted a reference
>>> to
>>>> the parent regions sticking around and losing the references to the
>>>> daughter regions.
>>>> 
>>>> Thanks for any insight you can provide.
>>>> 
>>>> --Brennon
>>>> 
>>>> 
>>>> 
>>>> 
> 
> 


Re: Lost regions question

Posted by Brennon Church <br...@getjar.com>.
hbck does show the hdfs files there without associated regions.  I 
probably could have recovered had I noticed just after this happened, 
but given that we've been running like this for over a week, and that 
there is the potential for collisions between the missing and new data, 
I'm probably just going to manually reinsert it all using the hdfs files.
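
If anyone else ends up in the same spot, one possible way to do that
(not necessarily exactly what I'll run, and the path/table/family names
here are placeholders) is to copy the orphaned store files into a
staging directory laid out per column family and bulk load them back in:

    # one subdirectory per column family, containing the recovered HFiles
    hadoop fs -mkdir /tmp/recover/f1
    hadoop fs -cp '/hbase/some_table/orphan_region_dir/f1/*' /tmp/recover/f1/

    # bulk load the staged files into the existing table
    hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
        /tmp/recover some_table

The cells keep their original timestamps, so anything written since the
failure should still win on collisions.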

Hadoop version is 1.0.1, btw.

Thanks.

--Brennon

On 4/11/13 11:05 PM, Ted Yu wrote:
> Brennon:
> Have you run hbck to diagnose the problem ?
>
> Since the issue might have involved hdfs, browsing DataNode log(s) may
> provide some clue as well.
>
> What hadoop version are you using ?
>
> Cheers
>
> On Thu, Apr 11, 2013 at 10:58 PM, ramkrishna vasudevan <
> ramkrishna.s.vasudevan@gmail.com> wrote:
>
>> When you say that the parent regions got reopened does that mean that you
>> did not lose any data(any data could not be read).  The reason am asking is
>> if after the parent got split into daughters and the data was written to
>> daughters and if the daughters related files could not be opened you could
>> have ended up in not able to read the data.
>>
>> Some logs could tell us what made the parent to get reopened rather than
>> daughters.  Another thing i would like to ask is was the cluster brought
>> down abruptly by killing the RS.
>>
>> Which version of HBase?
>>
>> Regards
>> Ram
>>
>>
>>
>>
>> On Fri, Apr 12, 2013 at 11:20 AM, Brennon Church <br...@getjar.com>
>> wrote:
>>
>>> Hello,
>>>
>>> I had an interesting problem come up recently.  We have a few thousand
>>> regions across 8 datanode/regionservers.  I made a change, increasing the
>>> heap size for hadoop from 128M to 2048M which ended up bringing the
>> cluster
>>> to a complete halt after about 1 hour.  I reverted back to 128M and
>> turned
>>> things back on again but didn't realize at the time that I came up with 9
>>> fewer regions than I started.  Upon further investigation, I found that
>> all
>>> 9 missing regions were from splits that occurred while the cluster was
>>> running after making the heap change and before it came to a halt.  There
>>> was a 10th regions (5 splits involved in total) that managed to get
>>> recovered.  The really odd thing is that in the case of the other 9
>>> regions, the original parent regions, which as far as I can tell in the
>>> logs were deleted, were re-opened upon restarting things once again.  The
>>> daughter regions were gone.  Interestingly, I found the orphaned
>> datablocks
>>> still intact, and in at least some cases have been able to extract the
>> data
>>> from them and will hopefully re-add it to the tables.
>>>
>>> My question is this.  Does anyone know based on the rather muddled
>>> description I've given above, what could have possibly happened here?  My
>>> best guess is that the bad state that hdfs was in caused some critical
>>> component of the split process to be missed, which resulted a reference
>> to
>>> the parent regions sticking around and losing the references to the
>>> daughter regions.
>>>
>>> Thanks for any insight you can provide.
>>>
>>> --Brennon
>>>
>>>
>>>
>>>



Re: Lost regions question

Posted by Ted Yu <yu...@gmail.com>.
Brennon:
Have you run hbck to diagnose the problem?

Since the issue might have involved hdfs, browsing DataNode log(s) may
provide some clue as well.
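
Something like the following on each datanode, for example (log file
location and names depend on your install):

    grep -iE "exception|error" /var/log/hadoop/hadoop-*-datanode-*.log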

What hadoop version are you using?

Cheers

On Thu, Apr 11, 2013 at 10:58 PM, ramkrishna vasudevan <
ramkrishna.s.vasudevan@gmail.com> wrote:

> When you say that the parent regions got reopened does that mean that you
> did not lose any data(any data could not be read).  The reason am asking is
> if after the parent got split into daughters and the data was written to
> daughters and if the daughters related files could not be opened you could
> have ended up in not able to read the data.
>
> Some logs could tell us what made the parent to get reopened rather than
> daughters.  Another thing i would like to ask is was the cluster brought
> down abruptly by killing the RS.
>
> Which version of HBase?
>
> Regards
> Ram
>
>
>
>
> On Fri, Apr 12, 2013 at 11:20 AM, Brennon Church <br...@getjar.com>
> wrote:
>
> > Hello,
> >
> > I had an interesting problem come up recently.  We have a few thousand
> > regions across 8 datanode/regionservers.  I made a change, increasing the
> > heap size for hadoop from 128M to 2048M which ended up bringing the
> cluster
> > to a complete halt after about 1 hour.  I reverted back to 128M and
> turned
> > things back on again but didn't realize at the time that I came up with 9
> > fewer regions than I started.  Upon further investigation, I found that
> all
> > 9 missing regions were from splits that occurred while the cluster was
> > running after making the heap change and before it came to a halt.  There
> > was a 10th regions (5 splits involved in total) that managed to get
> > recovered.  The really odd thing is that in the case of the other 9
> > regions, the original parent regions, which as far as I can tell in the
> > logs were deleted, were re-opened upon restarting things once again.  The
> > daughter regions were gone.  Interestingly, I found the orphaned
> datablocks
> > still intact, and in at least some cases have been able to extract the
> data
> > from them and will hopefully re-add it to the tables.
> >
> > My question is this.  Does anyone know based on the rather muddled
> > description I've given above, what could have possibly happened here?  My
> > best guess is that the bad state that hdfs was in caused some critical
> > component of the split process to be missed, which resulted a reference
> to
> > the parent regions sticking around and losing the references to the
> > daughter regions.
> >
> > Thanks for any insight you can provide.
> >
> > --Brennon
> >
> >
> >
> >
>

Re: Lost regions question

Posted by Ted Yu <yu...@gmail.com>.
Brennon:
Can you try hbck to see if the problem is repaired?

Thanks

On Fri, Apr 12, 2013 at 9:27 AM, ramkrishna vasudevan <
ramkrishna.s.vasudevan@gmail.com> wrote:

> Oh..sorry to hear that .  But i think it should be there in the system but
> not allowing you to access.  We should be able to bring it back.
>
> One set of logs that would be of interest is that of the RS and master when
> the split happened.
>
> And the main thing would be that when you restarted your cluster and the
> Master again came back. That is where the system does some self
> rectification after it sees if there were some partial splits.
>
> Regards
> Ram
>
>
> On Fri, Apr 12, 2013 at 9:34 PM, Brennon Church <br...@getjar.com>
> wrote:
>
> > Hello,
> >
> > We lost the data when the parent regions got reopened.  My guess, and
> it's
> > only that, is that the regions were  essentially empty when they started
> up
> > again in these cases.  We definitely lost data from the tables.
> >
> > I've looked through the hdfs and hbase logs and can't find any obvious
> > difference between a successful split and these failed ones.  All steps
> > show up the same in all cases.  After the handled split message that
> listed
> > the parent and daughter regions, the next reference is to the parent
> > regions once again as hbase is started back up after the failure.  No
> > further reference to the daughters is made.
> >
> > I couldn't cleanly shut several of the regionservers down, so they were
> > abruptly killed, yes.
> >
> > HBase version is 0.92.0, and hadoop is 1.0.1.
> >
> > Thanks.
> >
> > --Brennon
> >
> >
> > On 4/11/13 10:58 PM, ramkrishna vasudevan wrote:
> >
> >> When you say that the parent regions got reopened does that mean that
> you
> >> did not lose any data(any data could not be read).  The reason am asking
> >> is
> >> if after the parent got split into daughters and the data was written to
> >> daughters and if the daughters related files could not be opened you
> could
> >> have ended up in not able to read the data.
> >>
> >> Some logs could tell us what made the parent to get reopened rather than
> >> daughters.  Another thing i would like to ask is was the cluster brought
> >> down abruptly by killing the RS.
> >>
> >> Which version of HBase?
> >>
> >> Regards
> >> Ram
> >>
> >>
> >>
> >>
> >> On Fri, Apr 12, 2013 at 11:20 AM, Brennon Church <br...@getjar.com>
> >> wrote:
> >>
> >>  Hello,
> >>>
> >>> I had an interesting problem come up recently.  We have a few thousand
> >>> regions across 8 datanode/regionservers.  I made a change, increasing
> the
> >>> heap size for hadoop from 128M to 2048M which ended up bringing the
> >>> cluster
> >>> to a complete halt after about 1 hour.  I reverted back to 128M and
> >>> turned
> >>> things back on again but didn't realize at the time that I came up
> with 9
> >>> fewer regions than I started.  Upon further investigation, I found that
> >>> all
> >>> 9 missing regions were from splits that occurred while the cluster was
> >>> running after making the heap change and before it came to a halt.
>  There
> >>> was a 10th regions (5 splits involved in total) that managed to get
> >>> recovered.  The really odd thing is that in the case of the other 9
> >>> regions, the original parent regions, which as far as I can tell in the
> >>> logs were deleted, were re-opened upon restarting things once again.
>  The
> >>> daughter regions were gone.  Interestingly, I found the orphaned
> >>> datablocks
> >>> still intact, and in at least some cases have been able to extract the
> >>> data
> >>> from them and will hopefully re-add it to the tables.
> >>>
> >>> My question is this.  Does anyone know based on the rather muddled
> >>> description I've given above, what could have possibly happened here?
>  My
> >>> best guess is that the bad state that hdfs was in caused some critical
> >>> component of the split process to be missed, which resulted a reference
> >>> to
> >>> the parent regions sticking around and losing the references to the
> >>> daughter regions.
> >>>
> >>> Thanks for any insight you can provide.
> >>>
> >>> --Brennon
> >>>
> >>>
> >>>
> >>>
> >>>
> >
>

Re: Lost regions question

Posted by ramkrishna vasudevan <ra...@gmail.com>.
Oh, sorry to hear that.  But I think the data should still be there in the
system, just not accessible to you at the moment.  We should be able to
bring it back.

One set of logs that would be of interest is those of the RS and master
from when the split happened.

And the main thing would be the logs from when you restarted your cluster
and the Master came back up.  That is where the system does some
self-rectification, after it checks whether there were any partial splits.
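
If you still have them, a rough way to pull out the relevant lines
(log file names below are only examples and depend on your setup):

    # on the regionserver that hosted the parent region
    grep -i split hbase-*-regionserver-*.log*

    # on the master, around the time of the restart
    grep -iE "split|daughter" hbase-*-master-*.log*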

Regards
Ram


On Fri, Apr 12, 2013 at 9:34 PM, Brennon Church <br...@getjar.com> wrote:

> Hello,
>
> We lost the data when the parent regions got reopened.  My guess, and it's
> only that, is that the regions were  essentially empty when they started up
> again in these cases.  We definitely lost data from the tables.
>
> I've looked through the hdfs and hbase logs and can't find any obvious
> difference between a successful split and these failed ones.  All steps
> show up the same in all cases.  After the handled split message that listed
> the parent and daughter regions, the next reference is to the parent
> regions once again as hbase is started back up after the failure.  No
> further reference to the daughters is made.
>
> I couldn't cleanly shut several of the regionservers down, so they were
> abruptly killed, yes.
>
> HBase version is 0.92.0, and hadoop is 1.0.1.
>
> Thanks.
>
> --Brennon
>
>
> On 4/11/13 10:58 PM, ramkrishna vasudevan wrote:
>
>> When you say that the parent regions got reopened does that mean that you
>> did not lose any data(any data could not be read).  The reason am asking
>> is
>> if after the parent got split into daughters and the data was written to
>> daughters and if the daughters related files could not be opened you could
>> have ended up in not able to read the data.
>>
>> Some logs could tell us what made the parent to get reopened rather than
>> daughters.  Another thing i would like to ask is was the cluster brought
>> down abruptly by killing the RS.
>>
>> Which version of HBase?
>>
>> Regards
>> Ram
>>
>>
>>
>>
>> On Fri, Apr 12, 2013 at 11:20 AM, Brennon Church <br...@getjar.com>
>> wrote:
>>
>>  Hello,
>>>
>>> I had an interesting problem come up recently.  We have a few thousand
>>> regions across 8 datanode/regionservers.  I made a change, increasing the
>>> heap size for hadoop from 128M to 2048M which ended up bringing the
>>> cluster
>>> to a complete halt after about 1 hour.  I reverted back to 128M and
>>> turned
>>> things back on again but didn't realize at the time that I came up with 9
>>> fewer regions than I started.  Upon further investigation, I found that
>>> all
>>> 9 missing regions were from splits that occurred while the cluster was
>>> running after making the heap change and before it came to a halt.  There
>>> was a 10th regions (5 splits involved in total) that managed to get
>>> recovered.  The really odd thing is that in the case of the other 9
>>> regions, the original parent regions, which as far as I can tell in the
>>> logs were deleted, were re-opened upon restarting things once again.  The
>>> daughter regions were gone.  Interestingly, I found the orphaned
>>> datablocks
>>> still intact, and in at least some cases have been able to extract the
>>> data
>>> from them and will hopefully re-add it to the tables.
>>>
>>> My question is this.  Does anyone know based on the rather muddled
>>> description I've given above, what could have possibly happened here?  My
>>> best guess is that the bad state that hdfs was in caused some critical
>>> component of the split process to be missed, which resulted a reference
>>> to
>>> the parent regions sticking around and losing the references to the
>>> daughter regions.
>>>
>>> Thanks for any insight you can provide.
>>>
>>> --Brennon
>>>
>>>
>>>
>>>
>>>
>

Re: Lost regions question

Posted by Brennon Church <br...@getjar.com>.
Hello,

We lost the data when the parent regions got reopened.  My guess, and
it's only that, is that the regions were essentially empty when they
started up again in these cases.  We definitely lost data from the tables.

I've looked through the hdfs and hbase logs and can't find any obvious 
difference between a successful split and these failed ones.  All steps 
show up the same in all cases.  After the handled split message that 
listed the parent and daughter regions, the next reference is to the 
parent regions once again as hbase is started back up after the 
failure.  No further reference to the daughters is made.
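
In case it is useful, one way to compare what .META. currently lists for
a table with what is actually sitting on hdfs (the table name below is a
placeholder):

    # regions .META. knows about
    echo "scan '.META.', {COLUMNS => 'info:regioninfo'}" | hbase shell

    # region directories present on hdfs
    hadoop fs -ls /hbase/some_table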

I couldn't cleanly shut several of the regionservers down, so they were 
abruptly killed, yes.

HBase version is 0.92.0, and hadoop is 1.0.1.

Thanks.

--Brennon

On 4/11/13 10:58 PM, ramkrishna vasudevan wrote:
> When you say that the parent regions got reopened does that mean that you
> did not lose any data(any data could not be read).  The reason am asking is
> if after the parent got split into daughters and the data was written to
> daughters and if the daughters related files could not be opened you could
> have ended up in not able to read the data.
>
> Some logs could tell us what made the parent to get reopened rather than
> daughters.  Another thing i would like to ask is was the cluster brought
> down abruptly by killing the RS.
>
> Which version of HBase?
>
> Regards
> Ram
>
>
>
>
> On Fri, Apr 12, 2013 at 11:20 AM, Brennon Church <br...@getjar.com> wrote:
>
>> Hello,
>>
>> I had an interesting problem come up recently.  We have a few thousand
>> regions across 8 datanode/regionservers.  I made a change, increasing the
>> heap size for hadoop from 128M to 2048M which ended up bringing the cluster
>> to a complete halt after about 1 hour.  I reverted back to 128M and turned
>> things back on again but didn't realize at the time that I came up with 9
>> fewer regions than I started.  Upon further investigation, I found that all
>> 9 missing regions were from splits that occurred while the cluster was
>> running after making the heap change and before it came to a halt.  There
>> was a 10th regions (5 splits involved in total) that managed to get
>> recovered.  The really odd thing is that in the case of the other 9
>> regions, the original parent regions, which as far as I can tell in the
>> logs were deleted, were re-opened upon restarting things once again.  The
>> daughter regions were gone.  Interestingly, I found the orphaned datablocks
>> still intact, and in at least some cases have been able to extract the data
>> from them and will hopefully re-add it to the tables.
>>
>> My question is this.  Does anyone know based on the rather muddled
>> description I've given above, what could have possibly happened here?  My
>> best guess is that the bad state that hdfs was in caused some critical
>> component of the split process to be missed, which resulted a reference to
>> the parent regions sticking around and losing the references to the
>> daughter regions.
>>
>> Thanks for any insight you can provide.
>>
>> --Brennon
>>
>>
>>
>>


Re: Lost regions question

Posted by ramkrishna vasudevan <ra...@gmail.com>.
When you say that the parent regions got reopened, does that mean that you
did not lose any data (i.e. that no data became unreadable)?  The reason I
am asking is that if the parent got split into daughters and the data was
written to the daughters, and the daughter-related files could then not be
opened, you could have ended up unable to read that data.

Some logs could tell us what made the parent get reopened rather than the
daughters.  Another thing I would like to ask is: was the cluster brought
down abruptly by killing the RS?

Which version of HBase?

Regards
Ram




On Fri, Apr 12, 2013 at 11:20 AM, Brennon Church <br...@getjar.com> wrote:

> Hello,
>
> I had an interesting problem come up recently.  We have a few thousand
> regions across 8 datanode/regionservers.  I made a change, increasing the
> heap size for hadoop from 128M to 2048M which ended up bringing the cluster
> to a complete halt after about 1 hour.  I reverted back to 128M and turned
> things back on again but didn't realize at the time that I came up with 9
> fewer regions than I started.  Upon further investigation, I found that all
> 9 missing regions were from splits that occurred while the cluster was
> running after making the heap change and before it came to a halt.  There
> was a 10th regions (5 splits involved in total) that managed to get
> recovered.  The really odd thing is that in the case of the other 9
> regions, the original parent regions, which as far as I can tell in the
> logs were deleted, were re-opened upon restarting things once again.  The
> daughter regions were gone.  Interestingly, I found the orphaned datablocks
> still intact, and in at least some cases have been able to extract the data
> from them and will hopefully re-add it to the tables.
>
> My question is this.  Does anyone know based on the rather muddled
> description I've given above, what could have possibly happened here?  My
> best guess is that the bad state that hdfs was in caused some critical
> component of the split process to be missed, which resulted a reference to
> the parent regions sticking around and losing the references to the
> daughter regions.
>
> Thanks for any insight you can provide.
>
> --Brennon
>
>
>
>