Posted to issues@hbase.apache.org by "Andrew Purtell (JIRA)" <ji...@apache.org> on 2011/07/03 09:26:22 UTC

[jira] [Created] (HBASE-4058) Extend TestHBaseFsck with a complete .META. recovery scenario

Extend TestHBaseFsck with a complete .META. recovery scenario
-------------------------------------------------------------

                 Key: HBASE-4058
                 URL: https://issues.apache.org/jira/browse/HBASE-4058
             Project: HBase
          Issue Type: Improvement
            Reporter: Andrew Purtell
             Fix For: 0.92.0


We should have a unit test that launches a minicluster and constructs a few tables, then deletes META files on disk, then bounces the master, then recovers the result with HBCK. Perhaps it is possible to extend TestHBaseFsck to do this.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (HBASE-4058) Extend TestHBaseFsck with a complete .META. recovery scenario

Posted by "Andrew Purtell (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell reassigned HBASE-4058:
-------------------------------------

    Assignee:     (was: Andrew Purtell)

> Extend TestHBaseFsck with a complete .META. recovery scenario
> -------------------------------------------------------------
>
>                 Key: HBASE-4058
>                 URL: https://issues.apache.org/jira/browse/HBASE-4058
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Andrew Purtell
>             Fix For: 0.92.0
>
>
> We should have a unit test that launches a minicluster and constructs a few tables, then deletes META files on disk, then bounces the master, then recovers the result with HBCK. Perhaps it is possible to extend TestHBaseFsck to do this.


[jira] [Commented] (HBASE-4058) Extend TestHBaseFsck with a complete .META. recovery scenario

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062138#comment-13062138 ] 

stack commented on HBASE-4058:
------------------------------

Here is the thread that prompted this issue: http://search-hadoop.com/m/J27Y72CrGiD/%2522hbck+-fix%2522&subj=hbck+fix

So, one thought I had was rebuilding .META. from a scan of .META. with a timestamp from before the catastrophe.  This is not going to be bullet-proof for the case where the .META. storefiles themselves have been damaged or lost.

So, we need a new add_table type fixup.  Wayne in the thread describes it as:

{quote}
Bugs and human error will bring on problems and nothing will
ever change that, but not having tools to help recover out of the hole is
where I think it is lacking...The hbase .META. table
(and -ROOT-?) are the core how HBase manages things. If this gets out of
whack all is lost...Something like a recovery mode that goes through and
sees what is out there and rebuilds the meta based on it. With corrupted
data and lost regions etc. etc. like any relational database there should be
one or more recovery modes that goes through everything and rebuilds it
consistently. Data may be lost but at least the cluster will be left in a
100% consistent/clean state. Manual editing of .META. is not something
anyone should do (especially me). It is prone to human error...it should be
easy to have well tested recover tools that can do the hard work for us.
{quote}



> Extend TestHBaseFsck with a complete .META. recovery scenario
> -------------------------------------------------------------
>
>                 Key: HBASE-4058
>                 URL: https://issues.apache.org/jira/browse/HBASE-4058
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>             Fix For: 0.92.0
>
>
> We should have a unit test that launches a minicluster and constructs a few tables, then deletes META files on disk, then bounces the master, then recovers the result with HBCK. Perhaps it is possible to extend TestHBaseFsck to do this.


[jira] [Commented] (HBASE-4058) Extend TestHBaseFsck with a complete .META. recovery scenario

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062197#comment-13062197 ] 

stack commented on HBASE-4058:
------------------------------

I took a look at the logs Wayne posted.  The master shows a few regionservers losing their leases, and it's having trouble connecting to a particular server.  The regionserver snippet posted shows a regionserver aborting because it can't roll its WAL. It gets an EOFE.  The datanode snippet shows connection refused when trying to connect to the same server (130) that the master is trying to contact (NN?).

It's hard to tell much from the snippets posted.

> Extend TestHBaseFsck with a complete .META. recovery scenario
> -------------------------------------------------------------
>
>                 Key: HBASE-4058
>                 URL: https://issues.apache.org/jira/browse/HBASE-4058
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>             Fix For: 0.92.0
>
>
> We should have a unit test that launches a minicluster and constructs a few tables, then deletes META files on disk, then bounces the master, then recovers the result with HBCK. Perhaps it is possible to extend TestHBaseFsck to do this.


[jira] [Commented] (HBASE-4058) Extend TestHBaseFsck with a complete .META. recovery scenario

Posted by "Lars Hofhansl (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234849#comment-13234849 ] 

Lars Hofhansl commented on HBASE-4058:
--------------------------------------

@Stack: Are you still planning this for 0.94?
                
> Extend TestHBaseFsck with a complete .META. recovery scenario
> -------------------------------------------------------------
>
>                 Key: HBASE-4058
>                 URL: https://issues.apache.org/jira/browse/HBASE-4058
>             Project: HBase
>          Issue Type: Improvement
>          Components: hbck
>            Reporter: Andrew Purtell
>            Assignee: stack
>             Fix For: 0.94.0
>
>
> We should have a unit test that launches a minicluster and constructs a few tables, then deletes META files on disk, then bounces the master, then recovers the result with HBCK. Perhaps it is possible to extend TestHBaseFsck to do this.


[jira] [Updated] (HBASE-4058) Extend TestHBaseFsck with a complete .META. recovery scenario

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-4058:
--------------------------

    Fix Version/s:     (was: 0.92.0)
                   0.94.0

Moving to 0.94 since there is no owner for this issue at the moment.

> Extend TestHBaseFsck with a complete .META. recovery scenario
> -------------------------------------------------------------
>
>                 Key: HBASE-4058
>                 URL: https://issues.apache.org/jira/browse/HBASE-4058
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Andrew Purtell
>             Fix For: 0.94.0
>
>
> We should have a unit test that launches a minicluster and constructs a few tables, then deletes META files on disk, then bounces the master, then recovers the result with HBCK. Perhaps it is possible to extend TestHBaseFsck to do this.


[jira] [Assigned] (HBASE-4058) Extend TestHBaseFsck with a complete .META. recovery scenario

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack reassigned HBASE-4058:
----------------------------

    Assignee: stack

> Extend TestHBaseFsck with a complete .META. recovery scenario
> -------------------------------------------------------------
>
>                 Key: HBASE-4058
>                 URL: https://issues.apache.org/jira/browse/HBASE-4058
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Andrew Purtell
>            Assignee: stack
>             Fix For: 0.94.0
>
>
> We should have a unit test that launches a minicluster and constructs a few tables, then deletes META files on disk, then bounces the master, then recovers the result with HBCK. Perhaps it is possible to extend TestHBaseFsck to do this.


[jira] [Commented] (HBASE-4058) Extend TestHBaseFsck with a complete .META. recovery scenario

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062170#comment-13062170 ] 

stack commented on HBASE-4058:
------------------------------

So, reading Wayne's blow-by-blow, to 'fix' his HDFS he ran 'fsck -move', which moves corrupt files to /lost+found.  I wonder how many of the 65 corrupt files found were from hbase and how many of these were from under .META. (65 corrupt files and 173 missing blocks.... that's a lot of 'missing' data).  Assuming an extreme, that there are missing blocks in .META., this would imply we need to be able to rebuild .META. by reading the filesystem content.  The tool should be able to figure what's a daughter from what's a parent, and it should write the .META. without overlaps and with holes plugged.  Finally, it should make some sort of report on the type of surgery effected, listing put-aside regions that it could not make sense of.

We currently don't have such a tool.
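
The "without overlaps and with holes plugged" requirement above can be sketched as a coverage check over one table's region key ranges. This is only a minimal illustration under stated assumptions, not hbck code: it assumes each region found on the filesystem has already been reduced to a (start_key, end_key) pair, with an empty string standing for the open first start key or last end key, and it compares keys as Python strings rather than as byte arrays. The name find_holes_and_overlaps is hypothetical.

```python
from typing import List, Tuple

# (start_key, end_key); "" means an open first start key or last end key.
Region = Tuple[str, str]


def find_holes_and_overlaps(regions: List[Region]):
    """Return (holes, overlaps) for one table's regions.

    holes: key ranges (a, b) not covered by any region.
    overlaps: (earlier, later) pairs where `later` starts before the
    furthest end key seen so far, which belongs to `earlier`.
    """
    if not regions:
        # Nothing on disk: the whole key space is one hole.
        return [("", "")], []

    # Sort by start key; "" sorts first, like the open first start key.
    ordered = sorted(regions, key=lambda r: r[0])
    holes, overlaps = [], []

    if ordered[0][0] != "":
        holes.append(("", ordered[0][0]))

    furthest = ordered[0]  # region whose end key reaches furthest so far
    for region in ordered[1:]:
        start, end = region
        far_end = furthest[1]
        if far_end == "" or start < far_end:
            overlaps.append((furthest, region))
        elif start > far_end:
            holes.append((far_end, start))
        # Advance if this region reaches further than anything before it.
        if far_end != "" and (end == "" or end > far_end):
            furthest = region

    if furthest[1] != "":
        holes.append((furthest[1], ""))
    return holes, overlaps
```

A rebuild tool along these lines would run such a check per table and list every hole and overlap in its report before writing anything to .META.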

> Extend TestHBaseFsck with a complete .META. recovery scenario
> -------------------------------------------------------------
>
>                 Key: HBASE-4058
>                 URL: https://issues.apache.org/jira/browse/HBASE-4058
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>             Fix For: 0.92.0
>
>
> We should have a unit test that launches a minicluster and constructs a few tables, then deletes META files on disk, then bounces the master, then recovers the result with HBCK. Perhaps it is possible to extend TestHBaseFsck to do this.


[jira] [Commented] (HBASE-4058) Extend TestHBaseFsck with a complete .META. recovery scenario

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13071506#comment-13071506 ] 

stack commented on HBASE-4058:
------------------------------

Assigning myself. This is a pretty critical one.  The online merge I just added back to 0.92 is a prereq for this one.

> Extend TestHBaseFsck with a complete .META. recovery scenario
> -------------------------------------------------------------
>
>                 Key: HBASE-4058
>                 URL: https://issues.apache.org/jira/browse/HBASE-4058
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Andrew Purtell
>            Assignee: stack
>             Fix For: 0.94.0
>
>
> We should have a unit test that launches a minicluster and constructs a few tables, then deletes META files on disk, then bounces the master, then recovers the result with HBCK. Perhaps it is possible to extend TestHBaseFsck to do this.


[jira] [Updated] (HBASE-4058) Extend TestHBaseFsck with a complete .META. recovery scenario

Posted by "Jonathan Hsieh (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Hsieh updated HBASE-4058:
----------------------------------

    Component/s: hbck

> Extend TestHBaseFsck with a complete .META. recovery scenario
> -------------------------------------------------------------
>
>                 Key: HBASE-4058
>                 URL: https://issues.apache.org/jira/browse/HBASE-4058
>             Project: HBase
>          Issue Type: Improvement
>          Components: hbck
>            Reporter: Andrew Purtell
>            Assignee: stack
>             Fix For: 0.94.0
>
>
> We should have a unit test that launches a minicluster and constructs a few tables, then deletes META files on disk, then bounces the master, then recovers the result with HBCK. Perhaps it is possible to extend TestHBaseFsck to do this.


[jira] [Commented] (HBASE-4058) Extend TestHBaseFsck with a complete .META. recovery scenario

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062317#comment-13062317 ] 

stack commented on HBASE-4058:
------------------------------

So I see our fixup tool first running an evaluation of the state of meta and then offering the admin choices.  For overlapping regions, we should make it so hbck will run an online merge of the regions that span the broken key area.  If there are holes, offer to plug them.  We should also offer the option to rebuild from the fs, per table or for all tables under hbase.rootdir.
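
The evaluate-then-offer flow above might look roughly like the following. This is a hedged sketch, not hbck's design: it assumes the evaluation step has already produced a list of holes as (start_key, end_key) pairs and a list of overlapping region pairs, with an empty string meaning an open key, and it only proposes actions for the admin to accept or reject. The names Action and propose_actions are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (start_key, end_key); "" means an open first start key or last end key.
Region = Tuple[str, str]


@dataclass(frozen=True)
class Action:
    kind: str        # "merge" for an overlap, "plug" for a hole
    start_key: str
    end_key: str


def propose_actions(
    holes: List[Region],
    overlaps: List[Tuple[Region, Region]],
) -> List[Action]:
    """Turn evaluation findings into choices to offer the admin."""
    actions = []
    for a, b in overlaps:
        # A merge must span both regions: smallest start, largest end.
        start = min(a[0], b[0])  # "" sorts lowest, so an open start wins
        end = "" if "" in (a[1], b[1]) else max(a[1], b[1])
        actions.append(Action("merge", start, end))
    for start, end in holes:
        actions.append(Action("plug", start, end))
    return actions
```

Nothing is applied here; the returned list is what the tool would print and ask the admin to confirm, per table or for everything under hbase.rootdir.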

> Extend TestHBaseFsck with a complete .META. recovery scenario
> -------------------------------------------------------------
>
>                 Key: HBASE-4058
>                 URL: https://issues.apache.org/jira/browse/HBASE-4058
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Andrew Purtell
>             Fix For: 0.92.0
>
>
> We should have a unit test that launches a minicluster and constructs a few tables, then deletes META files on disk, then bounces the master, then recovers the result with HBCK. Perhaps it is possible to extend TestHBaseFsck to do this.


[jira] [Resolved] (HBASE-4058) Extend TestHBaseFsck with a complete .META. recovery scenario

Posted by "stack (Resolved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-4058.
--------------------------

    Resolution: Won't Fix

This issue is subsumed by Jon's work on uber-hbck.
                
> Extend TestHBaseFsck with a complete .META. recovery scenario
> -------------------------------------------------------------
>
>                 Key: HBASE-4058
>                 URL: https://issues.apache.org/jira/browse/HBASE-4058
>             Project: HBase
>          Issue Type: Improvement
>          Components: hbck
>            Reporter: Andrew Purtell
>            Assignee: stack
>             Fix For: 0.94.0
>
>
> We should have a unit test that launches a minicluster and constructs a few tables, then deletes META files on disk, then bounces the master, then recovers the result with HBCK. Perhaps it is possible to extend TestHBaseFsck to do this.


[jira] [Assigned] (HBASE-4058) Extend TestHBaseFsck with a complete .META. recovery scenario

Posted by "Andrew Purtell (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell reassigned HBASE-4058:
-------------------------------------

    Assignee: Andrew Purtell

> Extend TestHBaseFsck with a complete .META. recovery scenario
> -------------------------------------------------------------
>
>                 Key: HBASE-4058
>                 URL: https://issues.apache.org/jira/browse/HBASE-4058
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>             Fix For: 0.92.0
>
>
> We should have a unit test that launches a minicluster and constructs a few tables, then deletes META files on disk, then bounces the master, then recovers the result with HBCK. Perhaps it is possible to extend TestHBaseFsck to do this.


[jira] [Commented] (HBASE-4058) Extend TestHBaseFsck with a complete .META. recovery scenario

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062215#comment-13062215 ] 

stack commented on HBASE-4058:
------------------------------

Dan Harvey, who is still on 0.20.x, had a similar issue this month.  He added four new servers to his cluster.  These new servers were not resolving properly.  What we were seeing is that on startup, I believe, these new servers would be assigned their portion of the regions on checkin.  Then, the basescanner would run -- it's 0.20.x hbase -- and it would not recognize the address the new servers were writing to .META., so it would think the regions unassigned and would assign them elsewhere.  So, we have double-assignment, and at the same time there was splitting and compactions running.  His .META. had holes and overlaps.

In his case, not all tables were honked.  Just the big ones.  I wonder if an improved add_table.rb would work in this case; i.e. do the same rewrite of the .META. content for a single table based off the content in the filesystem rather than trying to fix up the .META. table in place.

Let me try adding add_table.rb to hbck.  Let me add the option of running per table and then a global option that restores all tables.

Dan sent me the .META. dir content.  It looks like this:

{code}
-rw-r--r--@ 1 Stack  staff         0 Jul  7 08:26 281906331022358506
-rw-r--r--@ 1 Stack  staff  94283152 Jul  7 08:26 5233066973300534672
-rw-r--r--@ 1 Stack  staff         0 Jul  7 08:26 6803125877105432645
-rw-r--r--@ 1 Stack  staff         0 Jul  7 08:26 8650632001596730954
{code}

i.e. three zero-length files.  I wonder how these were written (I asked him for a dir listing from the actual cluster).

> Extend TestHBaseFsck with a complete .META. recovery scenario
> -------------------------------------------------------------
>
>                 Key: HBASE-4058
>                 URL: https://issues.apache.org/jira/browse/HBASE-4058
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Andrew Purtell
>             Fix For: 0.92.0
>
>
> We should have a unit test that launches a minicluster and constructs a few tables, then deletes META files on disk, then bounces the master, then recovers the result with HBCK. Perhaps it is possible to extend TestHBaseFsck to do this.
