Posted to dev@hbase.apache.org by Jean-Daniel Cryans <jd...@apache.org> on 2012/07/18 17:53:13 UTC

Wondering what hbck should do in this situation

Hey devs,

I encountered an "interesting" situation with hbck in 0.94: we had a
region that was on HDFS but wasn't in .META., and hbck decided to add
it back:

ERROR: Region { meta => null, hdfs =>
hdfs://sfor3s24:10101/hbase/url_stumble_summary/159952764, deployed =>
 } on HDFS, but not listed in META or deployed on any region server
12/07/17 23:46:03 INFO util.HBaseFsck: Patching .META. with
.regioninfo: {NAME =>
'url_stumble_summary,25467315:2009-12-28,1271922074820', STARTKEY =>
'25467315:2009-12-28', ENDKEY => '25821137:2010-03-08', ENCODED =>
159952764,}

Then when it tried to assign the region it got bounced between region servers:

Trying to reassign region...
12/07/17 23:46:04 INFO util.HBaseFsckRepair: Region still in
transition, waiting for it to become assigned: {NAME =>
'url_stumble_summary,25467315:2009-12-28,1271922074820', STARTKEY =>
'25467315:2009-12-28', ENDKEY => '25821137:2010-03-08', ENCODED =>
159952764,}
12/07/17 23:46:05 INFO util.HBaseFsckRepair: Region still in
transition, waiting for it to become assigned: {NAME =>
'url_stumble_summary,25467315:2009-12-28,1271922074820', STARTKEY =>
'25467315:2009-12-28', ENDKEY => '25821137:2010-03-08', ENCODED =>
159952764,}
etc

It turns out that this region contained only references (as in post-split
references) to a region that didn't exist anymore, so when the region
was being opened it failed on opening those referenced files:

2012-07-18 00:00:27,454 ERROR
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed
open of region=url_stumble_summary,25467315:2009-12-28,1271922074820.159952764,
starting to roll back the global memstore size.
java.io.IOException: java.io.IOException:
java.io.FileNotFoundException: File does not exist:
/hbase/url_stumble_summary/208247386/default/2354161894779228084
	at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:550)
	at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:463)
	at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3729)
	at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3677)
...
Caused by: java.io.IOException: java.io.FileNotFoundException: File
does not exist:
/hbase/url_stumble_summary/208247386/default/2354161894779228084
	at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:405)
	at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:258)
	at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:2918)
...
Caused by: java.io.FileNotFoundException: File does not exist:
/hbase/url_stumble_summary/208247386/default/2354161894779228084
	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1822)
	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1813)
	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:544)
	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:187)
	at org.apache.hadoop.fs.FilterFileSystem.open(FilterFileSystem.java:102)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
	at org.apache.hadoop.hbase.io.hfile.HFile.createReaderWithEncoding(HFile.java:547)
	at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1252)
	at org.apache.hadoop.hbase.io.HalfStoreFileReader.<init>(HalfStoreFileReader.java:66)
...


At first I was confused about why it was looking for another region,
until I saw the HalfStoreFileReader :)

So this is a case where hbck made the cluster worse, because the only
way to get rid of this region is to force-unassign it, delete it from
.META., and then possibly also delete it from HDFS.

I'm wondering how this could be done better. Should we do more checks
before including that sort of region? Like, at least make sure we can
open it? And then what? Just report it?
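
Roughly the kind of pre-check I'm thinking of, as a sketch only: walk
the region's family directories, resolve each reference-file name by
hand (a daughter's reference file is named <hfile>.<parentEncodedRegion>
and points at /hbase/<table>/<parentEncodedRegion>/<family>/<hfile>,
which matches the path in the log above), and bail out if the parent
file is gone. The class, method and variable names are made up for
illustration, this is not the actual hbck code:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OrphanRegionCheck {
  /**
   * Returns true only if every reference file under regionDir points at a
   * parent store file that still exists.  Assumes the 0.94-era layout
   * /hbase/<table>/<encodedRegion>/<family>/<file> and daughter reference
   * files named <hfileName>.<parentEncodedRegion>.
   */
  static boolean referencesResolvable(FileSystem fs, Path tableDir,
      Path regionDir) throws IOException {
    for (FileStatus family : fs.listStatus(regionDir)) {
      if (!family.isDir()) continue;  // skip .regioninfo and other files
      for (FileStatus file : fs.listStatus(family.getPath())) {
        String name = file.getPath().getName();
        int dot = name.lastIndexOf('.');
        if (dot <= 0) continue;       // plain hfile, nothing to resolve
        String hfile = name.substring(0, dot);
        String parentRegion = name.substring(dot + 1);
        Path referred = new Path(new Path(new Path(tableDir, parentRegion),
            family.getPath().getName()), hfile);
        if (!fs.exists(referred)) {
          System.err.println("Dangling reference " + file.getPath()
              + " -> missing " + referred);
          return false;
        }
      }
    }
    return true;
  }

  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    Path tableDir = new Path(args[0]);   // e.g. /hbase/url_stumble_summary
    Path regionDir = new Path(args[1]);  // e.g. the orphan region dir
    System.out.println("resolvable: "
        + referencesResolvable(fs, tableDir, regionDir));
  }
}

If that returned false, hbck could just report the orphan (or offer to
sideline it) instead of patching .META. and kicking off the assign loop
above.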

Thx for reading this far,

J-D

RE: Wondering what hbck should do in this situation

Posted by "Ramkrishna.S.Vasudevan" <ra...@huawei.com>.
J-D

Just going through the explanation, I feel that the region that had
references is a parent region, so shouldn't it have an entry in META
saying it is SPLIT and OFFLINE?

Maybe while fixing those cases where we find something in HDFS and not in
META we need to check whether the region has been split?

Was there any reason why the CatalogJanitor was not able to pick this
region up for cleanup?

I may be wrong here, J-D; just going through the explanation, I think this
could be the scenario.

Thanks for bringing this up; we would add this to our internal testing as
well.

Regards
Ram


Re: Wondering what hbck should do in this situation

Posted by Ted Yu <yu...@gmail.com>.
Adding a check on whether the referenced files can be found would help.
If any of the referenced files isn't found, report and don't repair.

Cheers


RE: Wondering what hbck should do in this situation

Posted by "Ramkrishna.S.Vasudevan" <ra...@huawei.com>.
J-D
Correction: if META does not have an entry, then we cannot know whether
the region has been split or not. Apologies for that.

I think we need to check for reference files, and if the opening fails we
need to report it. That should be the way.
But we should also confirm whether this region was split properly, right?

Regards
Ram


Re: Wondering what hbck should do in this situation

Posted by Jonathan Hsieh <jo...@cloudera.com>.
We actually ran into something similar on an upgrade from HBase 0.90 to
HBase 0.92 -- a few regions would bounce around between region servers,
failing after going into the FAILED_OPEN RIT state.

Here were the repair cases we considered:
1) What do you do if the parent file is not present?  Sideline the
reference files and bulk load any data files.  Without the original file
we cannot really save anything.  If the parent is not present, it may have
been moved, but its data is still present.
2) What do you do if the parent file is present?  I think you can sideline
the reference files.  The original file is present somewhere in HDFS, so
that means the data is not lost.

Another related idea is to have a quarantine directory for regions/files
that are repeatedly ill-behaved.  For example, if we tried to read a
reference file multiple times and failed, quarantine the file and try
again.  We had another case -- we ran into a truncated HFile, and the same
strategy would have gotten the cluster working (while still leaving the
possibility of data recovery).
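
To make the sideline/quarantine idea concrete, here's a rough sketch of
the move itself -- the .sideline directory name and layout are just
assumptions for illustration, not an existing hbck option:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Sideline {
  /**
   * Move an ill-behaved file (a dangling reference, a truncated hfile, ...)
   * out of the region so the region can open, keeping the bytes around for
   * later inspection.  rootDir would be the hbase root dir (e.g. /hbase);
   * the .sideline layout simply mirrors table/region/family.
   */
  static Path sidelineFile(FileSystem fs, Path rootDir, Path file,
      String table, String region, String family) throws IOException {
    Path dest = new Path(rootDir, ".sideline/" + table + "/" + region
        + "/" + family + "/" + file.getName());
    fs.mkdirs(dest.getParent());
    if (!fs.rename(file, dest)) {
      throw new IOException("Failed to sideline " + file + " to " + dest);
    }
    return dest;
  }
}

The region can then open without the offending file, and the sidelined
bytes stay around for manual inspection or recovery later.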

Jon.


-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: Wondering what hbck should do in this situation

Posted by lars hofhansl <lh...@yahoo.com>.
+1 on a dry-run option. All that's needed might just be a bit more logging on a normal "non-fix" run.

Interactive can be very simple with just some y/n decision points. Unix's fsck could be a potential guideline.
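
For example, both could hang off a small gate in front of every fix hbck wants to apply -- the flag and method names below are invented for this sketch, not existing hbck options:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class RepairGate {
  private final boolean dryRun;       // -dryRun: log the fix, change nothing
  private final boolean interactive;  // -interactive: ask y/n before each fix

  RepairGate(boolean dryRun, boolean interactive) {
    this.dryRun = dryRun;
    this.interactive = interactive;
  }

  /** Returns true only if the fix should actually be applied. */
  boolean shouldApply(String description) throws IOException {
    if (dryRun) {
      System.out.println("WOULD FIX: " + description);
      return false;
    }
    if (interactive) {
      System.out.print("Fix? " + description + " [y/N] ");
      String answer =
          new BufferedReader(new InputStreamReader(System.in)).readLine();
      return answer != null && answer.trim().equalsIgnoreCase("y");
    }
    return true;  // today's behaviour: just do it
  }
}

With -dryRun it only logs what it would have done, with -interactive it asks before each fix, and with neither it behaves like today.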


The issue at hand seems just like an oversight in the current implementation, though.

-- Lars



Re: Wondering what hbck should do in this situation

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Thanks for the jira, Jimmy. It seems to me that we should aim for a
dry-run feature first and then consider the interactive part. At least it
would give the user an opportunity to fix problems that would otherwise
make things worse.

J-D


Re: Wondering what hbck should do in this situation

Posted by Jimmy Xiang <jx...@cloudera.com>.
HBASE-5324 is the one I filed on interactive hbck.  We can use it if
there is no duplicate one.

Thanks,
Jimmy


Re: Wondering what hbck should do in this situation

Posted by Jonathan Hsieh <jo...@cloudera.com>.
Jimmy and I have been adding features essentially as we've needed them,
including some options that limit fixes to particular tables and limit the
kinds of fixes that are applied.

There is a jira for making the repairs interactive -- either an hbck
shell, or an interactive mode that provides a series of y/n questions.
I'd be amenable to any of these kinds of improvements.

Jon.


-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: Wondering what hbck should do in this situation

Posted by Jean-Daniel Cryans <jd...@apache.org>.
On Wed, Jul 18, 2012 at 9:56 PM, Ramkrishna.S.Vasudevan
<ra...@huawei.com> wrote:
> J-D
> Correction: if META does not have an entry, then we cannot know whether
> the region has been split or not. Apologies for that.
>
> I think we need to check for reference files, and if the opening fails we
> need to report it. That should be the way.
> But we should also confirm whether this region was split properly, right?

That's what I'm wondering about. It seems to me that hbck is currently
overly aggressive about fixing things (see also HBASE-6417, where it
merged .META.). So should we build in all the heuristics to detect
problems and then add the corner cases afterwards as people find them? Or
should we let the users decide what should be fixed? It could be that we
should ask the users more questions. I'm thinking out loud here.

J-D