Posted to user@hbase.apache.org by Mat Hofschen <ho...@gmail.com> on 2011/07/06 17:22:06 UTC

Hbck errors in 0.90.3

Hello,

I am wondering what the best way is to fix errors reported with hbck in
0.90.3.
We did a migration from 0.20.4 to 0.90.3 by copying over hbase tables from
0.20.4 to a new cluster with 0.90.3. Then we used add_table.rb to create
META table from scratch. (we stopped all writes to source cluster, flushed
everything, then stopped hbase before copying over).

With hbck there are a few errors (52). Now I am wondering how to fix these.
For example hbck complains about two regions starting with the same key.
Next it complains about 2 regions overlapping. From looking at the META
table there seems to be a "parent region" and two "child regions from a
split". All three regions are registered, producing the two errors.
I examined the old 0.20.4 cluster META table, and it has exactly the same
problem (only there is no hbck to output the error).
So I am assuming that a split on 0.20.4 somehow got into trouble and
produced this error.

How would I go about fixing these problems? I tried to use Merge but got an
NPE.

Also, what happens to a write operation that adds a key that would fit into
two regions? Into which region is the key actually inserted? Would it pick
the first matching region found in META? Then I am probably in trouble,
because all three regions contain valid data.
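For illustration, here is a toy model of the lookup (a hypothetical sketch, not HBase's actual client code) showing why only one of the overlapping regions would receive such a write: the client resolves a row key to a single region, so whichever overlapping entry it finds first absorbs the data.

```python
# Toy model of client region lookup against an overlapping META.
# Hypothetical sketch of the failure mode, not HBase's real lookup
# code: the client resolves a row key to exactly one region, so when
# a parent and both daughters are all registered, only one of them
# receives the write.

def locate_region(regions, row_key):
    """Return the first region whose key range covers row_key.

    regions: (name, start_key, end_key) tuples in META sort order;
    an empty end_key means 'to the end of the table'.
    """
    for name, start, end in regions:
        if start <= row_key and (end == "" or row_key < end):
            return name
    return None

# Parent plus both daughters registered, as hbck reported.
meta = [
    ("parent",    "a", "z"),  # pre-split region, never offlined
    ("daughter1", "a", "m"),
    ("daughter2", "m", "z"),
]

print(locate_region(meta, "c"))  # the parent shadows daughter1 here
```

Which of the overlapping entries actually wins in a real cluster depends on META row ordering, so the "first match" rule above is an assumption.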

One more question: How does HBase mark regions as offline in META, for
example if a split has occurred but the parent is still not removed?

Thanks for your help.
Matthias

Re: Hbck errors in 0.90.3

Posted by Matthias Hofschen <ho...@gmail.com>.
Hi Stack,

finally the migration worked. We copied table data from the 0.20.4 hbase
cloud to the cdh3u1 cloud (0.90.3 hbase) using the mozilla approach of
copying the files at the hdfs level:
http://blog.mozilla.com/data/2011/02/04/migrating-hbase-in-the-trenches/

As described in the mozilla post, this approach minimizes downtime. We copied
the data (3TB, 10 tables) 3 times while the old cluster was still running.
Then we disabled all writes to the old cluster and stopped hbase. The last
copy process then only took 10 minutes.

Then we deleted .META. on the new cluster and rebuilt it with the
add_table.rb script (this might not be necessary; see the comments on the
mozilla blog). Then we ran hbase hbck. Most reported errors were related to
(old) empty table dirs in hdfs (no data, only oldlogs present). These we
deleted from hdfs. The remaining errors we fixed by hand (25 errors out of
27000 regions). Most of these were cases where the parent region and both
child regions were present.
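For anyone repeating this, the sweep for dead table dirs can be scripted. The sketch below is hypothetical: it walks a local mirror of the layout and treats 'oldlogs' as the only junk name, whereas on a real cluster you would drive it from hadoop fs listings instead.

```python
# Rough sketch of the sweep done by hand above: find table
# directories that hold no actual store files (only log leftovers).
# Paths, layout, and the 'oldlogs' junk name are assumptions for
# illustration -- a real cluster check would parse 'hadoop fs -lsr'
# output rather than walk a local directory tree.

import os

def empty_table_dirs(hbase_root, junk_names=("oldlogs",)):
    """Return table dirs under hbase_root containing nothing but junk."""
    empty = []
    for table in sorted(os.listdir(hbase_root)):
        table_dir = os.path.join(hbase_root, table)
        if not os.path.isdir(table_dir):
            continue
        has_data = False
        for dirpath, dirnames, filenames in os.walk(table_dir):
            # any file outside the junk dirs counts as real data
            if filenames and os.path.basename(dirpath) not in junk_names:
                has_data = True
                break
        if not has_data:
            empty.append(table)
    return empty
```

Anything it reports is then a candidate for deletion from hdfs.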

For the next larger migration we hope to use the new replication.

Cheers Matthias


On Mon, Jul 11, 2011 at 9:31 PM, Stack <st...@duboce.net> wrote:

> On Mon, Jul 11, 2011 at 7:39 AM, Mat Hofschen <ho...@gmail.com> wrote:
> > Hi Stack,
> > the scan of META does not contain any 'offline' or 'split' attributes.
>
> OK.  So the daughters and parents are all 'online'.
>
> > After executing add_table I restart hbase. Have not used disable/enable.
> >
>
> add_table.rb seems to be picking up parent and daughters?  Remove the
> daughter from the 0.90.x copy of the data.  They should not have been
> taking writes if parent was online.
>

We have actually ended up deleting the parent because the parents were not
taking any writes any more on the old cluster.
(check by looking at dates in dfs)

>
> > What actually happens when I copy the META from old cluster to new
> > cluster? META table contains references to old cluster machines. Because
> > of that we are using add_table. Is there another way to reuse the META
> > table between the two clusters?
> >
>
> Well, are the old machines online?  The new cluster is trying to
> contact them and failing?  Can you block the new cluster talking to
> the old (IIRC, if the old cluster is reachable, we'll try and talk to
> it and fail because of mismatch in rpc versions... I don't think we
> assume it down unless we get a socket timeout or some such failure).
>
> St.Ack
>

Re: Hbck errors in 0.90.3

Posted by Stack <st...@duboce.net>.
On Mon, Jul 11, 2011 at 7:39 AM, Mat Hofschen <ho...@gmail.com> wrote:
> Hi Stack,
> the scan of META does not contain any 'offline' or 'split' attributes.

OK.  So the daughters and parents are all 'online'.

> After executing add_table I restart hbase. Have not used disable/enable.
>

add_table.rb seems to be picking up parent and daughters?  Remove the
daughter from the 0.90.x copy of the data.  They should not have been
taking writes if parent was online.

> What actually happens when I copy the META from old cluster to new cluster?
> META table contains references to old cluster machines. Because of that we
> are using add_table. Is there another way to reuse the META table between
> the two clusters?
>

Well, are the old machines online?  The new cluster is trying to
contact them and failing?  Can you block the new cluster talking to
the old (IIRC, if the old cluster is reachable, we'll try and talk to
it and fail because of mismatch in rpc versions... I don't think we
assume it down unless we get a socket timeout or some such failure).

St.Ack

Re: Hbck errors in 0.90.3

Posted by Mat Hofschen <ho...@gmail.com>.
Hi Stack,
the scan of META does not contain any 'offline' or 'split' attributes. Still
parent region is present in META. I tried to merge the parent with one child
but received the following error (on 0.90.3):
11/07/06 11:59:35 FATAL util.Merge: Merge failed
java.lang.NullPointerException
    at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)
    at org.apache.hadoop.hbase.util.Writables.getHRegionInfo(Writables.java:119)
    at org.apache.hadoop.hbase.util.Merge.mergeTwoRegions(Merge.java:219)
    at org.apache.hadoop.hbase.util.Merge.run(Merge.java:110)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.hbase.util.Merge.main(Merge.java:379)

After executing add_table I restart hbase. Have not used disable/enable.

What actually happens when I copy the META from old cluster to new cluster?
META table contains references to old cluster machines. Because of that we
are using add_table. Is there another way to reuse the META table between
the two clusters?

Thanks for your help
Matthias


On Thu, Jul 7, 2011 at 6:28 PM, Stack <st...@duboce.net> wrote:

> On Thu, Jul 7, 2011 at 2:39 AM, Mat Hofschen <ho...@gmail.com> wrote:
> > looking at the old 0.20.4 cluster the parent region is not written to any
> > more. (no data on filesystem) In META table I can not identify that this
> > parent region is offlined though. Where can I find that key? Why is the
> > region not being written to any more if there is no offline flag set?
> >
>
> Dump the meta region:
>
> echo 'scan ".META."' | ./bin/hbase shell &> /tmp/meta.txt
>
> Then look in the outputted file for the parent region (has same start
> key as the first daughter and same end key as second daughter).  Look
> through the output for the 'offline' and 'split' attributes.
>
> > So by copying over the data to new cloud and using add_table script the
> > information that the region was offlined is lost. I guess this is one of
> > the problems of using copy on dfs level.
> > The new cluster is therefore inconsistent at this point with data written
> > to the parent region and not the child regions.
> >
>
> When you do add_table.rb, you then disable and then enable the table
> to get it online?  Or how do you get the regions assigned?  Restart?
> Maybe this is the issue (enable/disable changed in how it works
> between 0.20 and 0.90 -- the offline attribute is not relied on as
> much).
>
> Its odd that your parent region over in 0.20.4 has no data.  It should
> have a bunch (daughters reference the data up in parent until they are
> done with it... the data in parent is not deleted until the parent
> itself is deleted).
>
> Is there data under the daughters in the filesystem?
>
> You tried merging the daughter regions and then merging the resultant
> region with the parent region?
>
> > Is there a way to reuse the META table from old cloud and avoid using the
> > add_table script?
> >
>
> You could copy the .META. from old cluster.
>
> St.Ack
>

Re: Hbck errors in 0.90.3

Posted by Stack <st...@duboce.net>.
On Thu, Jul 7, 2011 at 2:39 AM, Mat Hofschen <ho...@gmail.com> wrote:
> looking at the old 0.20.4 cluster the parent region is not written to any
> more. (no data on filesystem) In META table I can not identify that this
> parent region is offlined though. Where can I find that key? Why is the
> region not being written to any more if there is no offline flag set?
>

Dump the meta region:

echo 'scan ".META."' | ./bin/hbase shell &> /tmp/meta.txt

Then look in the outputted file for the parent region (has same start
key as the first daughter and same end key as second daughter).  Look
through the output for the 'offline' and 'split' attributes.
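The pattern described above (parent has the same start key as the first daughter and the same end key as the second) can also be checked mechanically once the dump is reduced to key ranges. This is a hypothetical sketch that assumes the (name, start, end) tuples were already extracted from the scan output:

```python
# Sketch of the check described above: from a META dump reduced to
# (region, start_key, end_key) rows, flag any region whose range is
# exactly covered by two other regions -- a parent left behind by an
# unfinished split. Parsing the raw shell output is left out; the
# tuples are assumed already extracted.

def find_split_parents(regions):
    """Return (parent, daughter1, daughter2) triples."""
    hits = []
    for p_name, p_start, p_end in regions:
        for d1_name, d1_start, d1_end in regions:
            if d1_name == p_name or d1_start != p_start or d1_end == p_end:
                continue  # need same start key but a shorter range
            for d2_name, d2_start, d2_end in regions:
                if (d2_name not in (p_name, d1_name)
                        and d2_start == d1_end and d2_end == p_end):
                    hits.append((p_name, d1_name, d2_name))
    return hits

regions = [
    ("parent",    "a", "z"),
    ("daughter1", "a", "m"),
    ("daughter2", "m", "z"),
]
print(find_split_parents(regions))  # flags the parent/daughter trio
```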

> So by copying over the data to new cloud and using add_table script the
> information that the region was offlined is lost. I guess this is one of the
> problems of using copy on dfs level.
> The new cluster is therefore inconsistent at this point with data written to
> the parent region and not the child regions.
>

When you do add_table.rb, you then disable and then enable the table
to get it online?  Or how do you get the regions assigned?  Restart?
Maybe this is the issue (enable/disable changed in how it works
between 0.20 and 0.90 -- the offline attribute is not relied on as
much).

Its odd that your parent region over in 0.20.4 has no data.  It should
have a bunch (daughters reference the data up in parent until they are
done with it... the data in parent is not deleted until the parent
itself is deleted).

Is there data under the daughters in the filesystem?

You tried merging the daughter regions and then merging the resultant
region with the parent region?

> Is there a way to reuse the META table from old cloud and avoid using the
> add_table script?
>

You could copy the .META. from old cluster.

St.Ack

Re: Hbck errors in 0.90.3

Posted by Mat Hofschen <ho...@gmail.com>.
Hi Stack,

Looking at the old 0.20.4 cluster, the parent region is not written to any
more (no data on the filesystem). In the META table I cannot identify that
this parent region is offlined, though. Where can I find that key? Why is
the region not being written to any more if there is no offline flag set?

So by copying the data over to the new cloud and using the add_table script,
the information that the region was offlined is lost. I guess this is one of
the problems of copying at the dfs level.
The new cluster is therefore inconsistent at this point, with data written
to the parent region and not the child regions.

Basically we are trying to accomplish a "blue-green" migration. Once the new
cloud is proven to be stable we will switch off the old cloud. In the
meantime though we need to write all data to both clouds. And therefore we
need to have a defined starting point with the data from old cloud copied
somehow to new cloud with minimum downtime. (we are using the mozilla
approach to copy over).

Is there a way to reuse the META table from old cloud and avoid using the
add_table script?

Thanks for your help
Matthias


On Wed, Jul 6, 2011 at 9:37 PM, Stack <st...@duboce.net> wrote:

> On Wed, Jul 6, 2011 at 8:22 AM, Mat Hofschen <ho...@gmail.com> wrote:
> > With hbck there are a few errors (52). Now I am wondering how to fix
> > these.
> > For example hbck complains about two regions starting with the same key.
> > Next it complains about 2 regions overlapping. From looking at the META
> > table there seems to be a "parent region" and two "child regions from a
> > split". All three regions are registered, producing the two errors.
> > I examined the old 0.20.4 cluster META table, and it has exactly the same
> > problem (only there is no hbck to output the error).
> > So I am assuming that a split on 0.20.4 somehow got into trouble and
> > produced this error.
> >
> > How would I go about fixing these problems? I tried to use Merge but
> > got an NPE.
> >
>
> So the parent is not offline?  You can tell a region is offline by
> fetching it from .META. in the shell and look for the 'offline'
> attribute.
>
> hbase> get '.META.', 'ROW_OF_PARENT_REGION_IN_META'
>
> ... or just scan .META. and find your region.
>
> If you look at the daughters, do they have any content (Check
> filesystem... look for files)?  I'd think not since we'll be returning
> the parent as the place to write when we look for which region to
> insert into (I suppose the daughters could have data in memory but
> unlikely if we are returning the parent region as place for clients to
> write).
>
> If daughters have no data, remove them from .META. and from filesystem.
>
> hbase> deleteall '.META.', 'DAUGHTER1_IN_META'
> hbase> deleteall '.META.', 'DAUGHTER2_IN_META'
>
> That should take care of that overlap.
>
> Yes, probably an incomplete split over 0.20.4
>
> (I can't believe how many folks ran 0.20.4; it had a serious deadlock
> issue that seemed easy to trigger at least on this end!)
>
> > Also what happens to a write operation that adds a key that would fit
> > into two regions? Into which region is the key actually inserted? Would
> > it pick the first matching region found in META? Then I am probably in
> > trouble because all three regions contain valid data.
> >
>
>
> It'd likely go into the first.
>
> How do you figure all regions have valid data?
>
>
> > One more question: How does HBase mark regions as offline in META, for
> > example if a split has occurred but the parent is still not removed?
> >
>
> See above.  You'll see the 'offline' attribute if region is offline
> (note, we do not show an 'online' attribute in shell if region is
> 'online').
>
> St.Ack
>

Re: Hbck errors in 0.90.3

Posted by Stack <st...@duboce.net>.
On Wed, Jul 6, 2011 at 8:22 AM, Mat Hofschen <ho...@gmail.com> wrote:
> With hbck there are a few errors (52). Now I am wondering how to fix these.
> For example hbck complains about two regions starting with the same key.
> Next it complains about 2 regions overlapping. From looking at the META
> table there seems to be a "parent region" and two "child regions from a
> split". All three regions are registered, producing the two errors.
> I examined the old 0.20.4 cluster META table, and it has exactly the same
> problem (only there is no hbck to output the error).
> So I am assuming that a split on 0.20.4 somehow got into trouble and
> produced this error.
>
> How would I go about fixing these problems? I tried to use Merge but got an
> NPE.
>

So the parent is not offline?  You can tell a region is offline by
fetching it from .META. in the shell and look for the 'offline'
attribute.

hbase> get '.META.', 'ROW_OF_PARENT_REGION_IN_META'

... or just scan .META. and find your region.
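If the dump is large, grepping for the markers helps. Here is a hypothetical helper; the exact 'OFFLINE =>' / 'SPLIT =>' strings depend on the shell version, so adjust them to what your output actually shows:

```python
# Quick helper for eyeballing a META dump (e.g. /tmp/meta.txt from a
# shell scan) for the 'offline' and 'split' markers. The exact shell
# output format is version-dependent; the 'OFFLINE => true' and
# 'SPLIT => true' strings below are an assumption -- adjust them to
# whatever your dump actually contains.

def flagged_regions(meta_dump_lines):
    """Return regioninfo lines that carry offline/split markers."""
    markers = ("OFFLINE => true", "SPLIT => true")
    return [line for line in meta_dump_lines
            if "regioninfo" in line and any(m in line for m in markers)]
```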

If you look at the daughters, do they have any content (Check
filesystem... look for files)?  I'd think not since we'll be returning
the parent as the place to write when we look for which region to
insert into (I suppose the daughters could have data in memory but
unlikely if we are returning the parent region as place for clients to
write).

If daughters have no data, remove them from .META. and from filesystem.

hbase> deleteall '.META.', 'DAUGHTER1_IN_META'
hbase> deleteall '.META.', 'DAUGHTER2_IN_META'

That should take care of that overlap.

Yes, probably an incomplete split on 0.20.4.

(I can't believe how many folks ran 0.20.4; it had a serious deadlock
issue that seemed easy to trigger at least on this end!)

> Also what happens to a write operation that adds a key that would fit into 2
> regions. Into which region is the key actually inserted. Would it pick the
> first matching region found in META? Then I am probably in trouble because
> all three regions contain valid data.
>


It'd likely go into the first.

How do you figure all regions have valid data?


> One more question: How does HBase mark regions as offline in META, for
> example if a split has occurred but the parent is still not removed?
>

See above.  You'll see the 'offline' attribute if region is offline
(note, we do not show an 'online' attribute in shell if region is
'online').

St.Ack