Posted to issues@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2011/03/14 23:45:31 UTC

[jira] Created: (HBASE-3638) If a FS bootstrap, need to also ensure ZK is cleaned

If a FS bootstrap, need to also ensure ZK is cleaned
----------------------------------------------------

                 Key: HBASE-3638
                 URL: https://issues.apache.org/jira/browse/HBASE-3638
             Project: HBase
          Issue Type: Bug
            Reporter: stack
            Priority: Minor


In a test environment that runs a repeated cycle of start, operate, kill HBase, I noticed that we were doing a bootstrap on startup but then picking up the previous cycle's ZK state.  It made for a mess in the test.

The last thing seen on the previous cycle was:

{code}
2011-03-11 06:33:36,708 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=X.X.X.60020,1299853933073, region=1028785192/.META.
{code}

Then, in the messed up cycle I saw:

{code}
2011-03-11 06:42:48,530 INFO org.apache.hadoop.hbase.master.MasterFileSystem: BOOTSTRAP: creating ROOT and first META regions
.....

{code}

Then, after setting the watcher on .META., we get:

{code}
2011-03-11 06:42:58,301 INFO org.apache.hadoop.hbase.master.AssignmentManager: Processing region .META.,,1.1028785192 in state RS_ZK_REGION_OPENED
2011-03-11 06:42:58,302 WARN org.apache.hadoop.hbase.master.AssignmentManager: Region in transition 1028785192 references a server no longer up X.X.X; letting RIT timeout so will be assigned elsewhere
{code}

We're all confused.

We should at least clear our ZK state if a bootstrap happened.
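
A minimal sketch of that idea, as a toy model rather than HBase code: ZooKeeper is stood in for by a dict of znode path to data, and the filesystem by a set of paths. The function names, the "/hbase" root directory and base znode, and the "/hbase/unassigned" layout are all assumptions made for illustration.

```python
# Toy model only: "znodes" is a dict of znode path -> data and "fs_paths" is a
# set of filesystem paths. All names and paths here are illustrative assumptions.

def clear_subtree(znodes, base):
    """Delete base and every znode beneath it; return how many were removed."""
    doomed = [p for p in znodes if p == base or p.startswith(base + "/")]
    for path in doomed:
        del znodes[path]
    return len(doomed)


def master_startup(fs_paths, znodes, root_dir="/hbase", base_znode="/hbase"):
    """Bootstrap the filesystem if needed and, if so, drop stale ZK state too."""
    bootstrapped = root_dir not in fs_paths
    if bootstrapped:
        # Stands in for "BOOTSTRAP: creating ROOT and first META regions".
        fs_paths.add(root_dir)
        # A fresh filesystem means any region-in-transition data left in ZK
        # describes a previous incarnation, so remove it before assignment.
        clear_subtree(znodes, base_znode)
    return bootstrapped


if __name__ == "__main__":
    zk = {
        "/hbase": b"",
        "/hbase/unassigned": b"",
        "/hbase/unassigned/1028785192": b"RS_ZK_REGION_OPENED",
        "/other-app/config": b"keep me",
    }
    print(master_startup(set(), zk), sorted(zk))
```

The point of the sketch is the ordering: the cleanup is tied to the bootstrap decision, so a normal restart over an existing filesystem leaves ZK untouched.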

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (HBASE-3638) If a FS bootstrap, need to also ensure ZK is cleaned

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006754#comment-13006754 ] 

stack commented on HBASE-3638:
------------------------------

HBASE-3637 has more background to this issue.

[jira] [Commented] (HBASE-3638) If a FS bootstrap, need to also ensure ZK is cleaned

Posted by "Shrijeet Paliwal (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183925#comment-13183925 ] 

Shrijeet Paliwal commented on HBASE-3638:
-----------------------------------------

Here is the relevant portion of the log.

The master always gets stuck in this state, even if you restart all the HBase services across the cluster.
{noformat}
2012-01-10 21:28:03,382 WARN org.apache.hadoop.hbase.master.AssignmentManager: Region in transition 1028785192 references a server no longer up txa-18.rfiserve.net,60020,1326125886539; letting RIT timeout so will be assigned elsewhere
2012-01-10 21:28:06,787 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  .META.,,1.1028785192 state=OPENING, ts=1326241230066
2012-01-10 21:28:06,788 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPENING for too long, reassigning region=.META.,,1.1028785192
2012-01-10 21:28:16,787 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  .META.,,1.1028785192 state=OPENING, ts=1326241230066
2012-01-10 21:28:16,787 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPENING for too long, reassigning region=.META.,,1.1028785192
2012-01-10 21:28:26,787 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  .META.,,1.1028785192 state=OPENING, ts=1326241230066
2012-01-10 21:28:26,787 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPENING for too long, reassigning region=.META.,,1.1028785192
2012-01-10 21:28:36,787 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  .META.,,1.1028785192 state=OPENING, ts=1326241230066
2012-01-10 21:28:36,787 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPENING for too long, reassigning region=.META.,,1.1028785192
2012-01-10 21:28:46,788 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  .META.,,1.1028785192 state=OPENING, ts=1326241230066
2012-01-10 21:28:46,788 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPENING for too long, reassigning region=.META.,,1.1028785192
2012-01-10 21:28:56,788 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  .META.,,1.1028785192 state=OPENING, ts=1326241230066
{noformat}
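
For anyone wanting to spot this loop in a master log, here is a rough sketch of a scan for it. It assumes the exact message format shown in the excerpt above; the function name `stuck_regions` and the threshold of 3 repeats are my own choices, not anything from HBase.

```python
import re
from collections import Counter

# Matches the "has been OPENING for too long" lines in the excerpt above.
# The pattern assumes that exact message format.
STUCK = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} INFO "
    r"org\.apache\.hadoop\.hbase\.master\.AssignmentManager: "
    r"Region has been OPENING for too long, reassigning region=(?P<region>\S+)$"
)


def stuck_regions(lines, threshold=3):
    """Return {region: count} for regions reassigned at least `threshold` times."""
    counts = Counter()
    for line in lines:
        match = STUCK.match(line.strip())
        if match:
            counts[match.group("region")] += 1
    return {region: n for region, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    import sys

    # Usage: python stuck_regions.py < hbase-master.log
    for region, n in stuck_regions(sys.stdin).items():
        print(f"{region} reassigned {n} times while OPENING")
```

A region that keeps showing up here, as .META. does in the excerpt, is being reassigned every timeout period without ever leaving OPENING.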


bq. What do you think Stack, can master pick a stale ZK state which is not a leftover from previous HBase install, in other words a stale state created by itself?

By this I was referring to the comment Todd made in the related JIRA, where he said:

bq. Notably, it wasn't clearing ZK between runs. So some leftover RIT data from a previous HBase incarnation may be confusing this one's master.

He floated one possibility: leftover RIT data from a previous incarnation. I am wondering what other possibilities there are.
                
[jira] [Commented] (HBASE-3638) If a FS bootstrap, need to also ensure ZK is cleaned

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183903#comment-13183903 ] 

stack commented on HBASE-3638:
------------------------------

bq.  We did not do an FS bootstrap (I assume you mean cleaning /hbase directory from hdfs by FS bootstrap).

Yes.

bq. What do you think Stack, can master pick a stale ZK state which is not a leftover from previous HBase install, in other words a stale state created by itself?

I don't follow.

In your case, it seems dumb that we'd let a region hang out in region-in-transition even though its corresponding server is no longer up.

Were we trying to process the OPENED and failing because the server was not online?
                
[jira] [Commented] (HBASE-3638) If a FS bootstrap, need to also ensure ZK is cleaned

Posted by "Shrijeet Paliwal (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183875#comment-13183875 ] 

Shrijeet Paliwal commented on HBASE-3638:
-----------------------------------------

We just hit this issue today in production. We did not do an FS bootstrap (I assume that by FS bootstrap you mean cleaning the /hbase directory from HDFS). It was a regular day: an RS was throwing not serving exceptions, so I went ahead and restarted it. It was not an RS serving META or ROOT. Following this RS restart, hbck started reporting holes in regions.

Later, for some unexplainable, crazy and panicky reason, I restarted the Master and all the other region servers. This is the point where the master started complaining that META is in OPENED state in ZK for a server which no longer exists. And, as Todd explained in the other JIRA, the master went into an unending loop.

The workaround was to clear all files from the ZK data directory.

What do you think Stack, can master pick a *stale* ZK state which is not a leftover from previous HBase install, in other words a stale state created by itself?
                
[jira] [Commented] (HBASE-3638) If a FS bootstrap, need to also ensure ZK is cleaned

Posted by "Shrijeet Paliwal (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183929#comment-13183929 ] 

Shrijeet Paliwal commented on HBASE-3638:
-----------------------------------------

To avoid ambiguity, I should add that the log I pasted is from the time when the master was initializing.
                