Posted to issues@hbase.apache.org by "Jonathan Hsieh (JIRA)" <ji...@apache.org> on 2012/11/30 17:52:00 UTC

[jira] [Created] (HBASE-7245) Recovery on failed restore.

Jonathan Hsieh created HBASE-7245:
-------------------------------------

             Summary: Recovery on failed restore.
                 Key: HBASE-7245
                 URL: https://issues.apache.org/jira/browse/HBASE-7245
             Project: HBase
          Issue Type: Sub-task
            Reporter: Jonathan Hsieh


Restore updates both the file system and meta. An inopportune failure before meta is completely updated could leave an inconsistent state that would require hbck to fix.

We should define what the semantics are for recovering from this. Some suggestions:

1) Fail forward (on seeing a log entry saying restore's meta edits were not completed, gather the information necessary to rebuild them from the fs, and complete the meta edits).
2) Fail backward (on seeing such a log entry, delete the incomplete snapshot region entries from meta).

I think I prefer 1: if two processes somehow end up updating (say the original master didn't die and a new one started up), the updates would be idempotent. If we used 2, we could still have a race and still end up in a bad state.
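To make the idempotence argument for option 1 concrete, a rough Java sketch of the fail-forward path on master startup. All of the helper names below are hypothetical stand-ins for the real fs/meta plumbing, not actual HBase APIs:

{code:java}
import java.io.IOException;
import java.util.List;

/** Hypothetical fail-forward (option 1) sketch; not actual HBase code. */
abstract class FailForwardRecovery {

  /** True if a restore logged a start marker but never logged completion. */
  abstract boolean restoreMetaEditsIncomplete() throws IOException;

  /** Rebuild the region entries from what is actually on the filesystem. */
  abstract List<String> buildRegionEntriesFromFs() throws IOException;

  /** Must be idempotent: re-applying the same entries to meta is a no-op. */
  abstract void completeMetaEdits(List<String> regionEntries) throws IOException;

  /** Run on master startup. Two processes can safely run this concurrently
   *  only because every step is idempotent. */
  void recoverIfNeeded() throws IOException {
    if (!restoreMetaEditsIncomplete()) {
      return; // the restore finished cleanly; nothing to do
    }
    // Fail forward: finish the meta edits from the fs state instead of
    // rolling back, so a second recovering process does no extra harm.
    completeMetaEdits(buildRegionEntriesFromFs());
  }
}
{code}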


[jira] [Comment Edited] (HBASE-7245) Recovery on failed snapshot restore

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508872#comment-13508872 ] 

Ted Yu edited comment on HBASE-7245 at 12/3/12 5:20 PM:
--------------------------------------------------------

The subject of this JIRA is snapshot restore, so the first two scenarios above can be handled in a separate JIRA.

The operation directive can be created in ZooKeeper.
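As a minimal sketch of that directive using the stock ZooKeeper client API (the znode path and payload format here are made up for illustration):

{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkDirectiveSketch {
  /** Records the in-flight operation for a table; the path is hypothetical. */
  public static void writeDirective(ZooKeeper zk, String table, byte[] directive)
      throws KeeperException, InterruptedException {
    // PERSISTENT (not EPHEMERAL) so the znode survives the session of the
    // master that wrote it. The parent path must already exist.
    zk.create("/hbase/recovering-ops/" + table, directive,
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
  }
}
{code}

On startup the master would list the children under the directive path and finish or undo whatever it finds.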
                

[jira] [Updated] (HBASE-7245) Recovery on failed snapshot restore

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-7245:
--------------------------

    Summary: Recovery on failed snapshot restore  (was: Recovery on failed restore.)


[jira] [Commented] (HBASE-7245) Recovery on failed snapshot restore

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509013#comment-13509013 ] 

Ted Yu commented on HBASE-7245:
-------------------------------

The discussion from HBASE-6721 is related here. Francis started by storing group information on HDFS, then switched to storing it in a table; storing it in ZooKeeper is still under review.

I am fine with storing the operation directive on HDFS.
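A comparably small sketch of the HDFS variant using the stock FileSystem API; the file name and its location under the table dir are illustrative only:

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDirectiveSketch {
  /** Persists the directive on HDFS so it survives a zk wipe. */
  public static void writeDirective(Configuration conf, Path tableDir,
      byte[] directive) throws IOException {
    FileSystem fs = tableDir.getFileSystem(conf);
    Path file = new Path(tableDir, ".recovering-op"); // hypothetical name
    try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
      out.write(directive);
    }
  }
}
{code}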
                

[jira] [Commented] (HBASE-7245) Recovery on failed restore.

Posted by "Matteo Bertozzi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508857#comment-13508857 ] 

Matteo Bertozzi commented on HBASE-7245:
----------------------------------------

The operations that have this kind of problem are:
 * create table: remove the table on failure (rollback); the user has already received the failure
 * delete table: finish removing the table (rollforward); restoring the table is impossible
 * clone table: remove the table on failure (rollback); same as create table
 * restore table: finish restoring the table (rollforward)
 * snapshot: remove the tmp folder (rollback)

One simple solution is to drop an "operation lock" file in the table folder; on master startup, if the file is present, read the serialized operation enum and execute the rollback/rollforward, as in the sketch below. (Note that if the master is not down, the recovery can be done by catching the exception.)
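A rough sketch of that startup dispatch; the enum and the read/rollback/rollforward plumbing below are hypothetical stand-ins, not HBase code:

{code:java}
import java.io.IOException;

/** The five operations above; the enum name is what the lock file would serialize. */
enum TableOperation { CREATE, DELETE, CLONE, RESTORE, SNAPSHOT }

abstract class OperationLockRecovery {
  /** Returns the recorded operation, or null if no lock file is present. */
  abstract TableOperation readLockFile(String tableDir) throws IOException;
  abstract void rollback(String tableDir) throws IOException;
  abstract void rollforward(String tableDir) throws IOException;

  /** Called on master startup for each table folder. */
  void recover(String tableDir) throws IOException {
    TableOperation op = readLockFile(tableDir);
    if (op == null) {
      return; // no interrupted operation for this table
    }
    switch (op) {
      case CREATE:   // the user already received the failure
      case CLONE:    // same as create table
      case SNAPSHOT: // remove the tmp folder
        rollback(tableDir);
        break;
      case DELETE:   // restoring the table is impossible: finish removing it
      case RESTORE:  // finish the restore
        rollforward(tableDir);
        break;
    }
  }
}
{code}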
                

[jira] [Commented] (HBASE-7245) Recovery on failed restore.

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13507680#comment-13507680 ] 

Ted Yu commented on HBASE-7245:
-------------------------------

+1 on option 1.
                

[jira] [Assigned] (HBASE-7245) Recovery on failed restore.

Posted by "Matteo Bertozzi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matteo Bertozzi reassigned HBASE-7245:
--------------------------------------

    Assignee: Matteo Bertozzi
    

[jira] [Commented] (HBASE-7245) Recovery on failed snapshot restore

Posted by "Matteo Bertozzi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509005#comment-13509005 ] 

Matteo Bertozzi commented on HBASE-7245:
----------------------------------------

If we use ZooKeeper, what happens when the user erases the zk content before starting the master? We would still have the bad state on disk and in meta, but no hint about the unfinished operation.

If I understood correctly, we currently use ZooKeeper for "ephemeral" state only (except for replication); do we want to move to something that relies fully on ZooKeeper?
                

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira