Posted to dev@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2008/10/11 08:48:44 UTC

[jira] Created: (HBASE-921) region close and open processed out of order; makes for disagreement between master and regionserver on region state

region close and open processed out of order; makes for disagreement between master and regionserver on region state
--------------------------------------------------------------------------------------------------------------------

Key: HBASE-921
URL: https://issues.apache.org/jira/browse/HBASE-921
Project: Hadoop HBase
Issue Type: Bug
Affects Versions: 0.18.0
Reporter: stack
Priority: Blocker
Fix For: 0.18.1, 0.19.0

Master assigns region X successfully. It then decides to close it because it wants it opened elsewhere as part of region rebalancing. Both the open and close operations are reported back to the master. Both have operation processing components that are added to the todo list to be processed in another thread outside of the master's main loop.

The close operation does the bulk of its work inline with the master's main processing loop. Its todo component does some work if the region is offlined but otherwise nothing of consequence, whereas the open's todo component does the important meta catalog table update with the new location information.

It's been fairly common here on our cluster for the master todo queue to be occupied processing the shutdown of a regionserver. It takes a long time to process the shutdown of a regionserver when thousands of regions are involved. This delays the processing of the open and close todos. In effect, the open runs after the close. The region goes into limbo. Only a restart of the 'hosting' regionserver 'fixes' this state.
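
The ordering problem described above can be reproduced in a toy model. This is only a sketch, not HBase code: the class, the method names, the map standing in for the meta catalog table, and the server address are all hypothetical, and the close's inline work is assumed to amount to dropping the region's location.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Toy model of the open/close ordering problem; every name is hypothetical. */
class TodoOrderingSketch {
    /** Stand-in for the meta catalog table: region name -> server address. */
    static final Map<String, String> meta = new HashMap<>();
    /** Stand-in for the master's todo queue (drained by another thread in the real master). */
    static final Queue<Runnable> todo = new ArrayDeque<>();

    /** The open is reported: its meta update is deferred to the todo queue. */
    static void reportOpen(String region, String server) {
        todo.add(() -> meta.put(region, server));
    }

    /** The close is reported: the bulk of its work happens inline (assumed here). */
    static void reportClose(String region) {
        meta.remove(region);   // inline work: region is no longer deployed
        todo.add(() -> { });   // deferred part does nothing of consequence
    }

    /** Runs the scenario and returns what meta claims about regionX afterwards. */
    static String simulate() {
        meta.clear();
        todo.clear();
        // A slow regionserver-shutdown task sits at the head of the queue.
        todo.add(() -> { });
        reportOpen("regionX", "rs1.example.org:60020");
        reportClose("regionX");
        // The queue is drained only now, so the open's update lands after the close.
        while (!todo.isEmpty()) {
            todo.poll().run();
        }
        return meta.get("regionX");
    }

    public static void main(String[] args) {
        System.out.println("meta says regionX is on: " + simulate());
    }
}
```

In this model, meta ends up pointing at a server that has already closed the region, which is the limbo state described above.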

This is a particular case of the general HBASE-543 issue. It's happening a lot here on our cluster, so I will hack up a fix for this, get it into TRUNK, and backport it to 0.18.1.

Jim Firby here had a good idea for conditions like this. Clients should be able to say "I've asked for a region's location 10 times now and Mr. Master, you've given me the same response ten times in a row and each time, the answer was wrong. Revisit any notion that said region is at said location". Mr. Master would then go off and do something drastic like close and reassign the region.
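
A rough sketch of the client-side bookkeeping this idea would need, assuming a simple per-region counter. The class and method names are made up for illustration; nothing like this exists in the HBase client as far as this issue states.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical client-side tracker for repeated wrong region locations. */
class StaleLocationTracker {
    private final int threshold;
    private final Map<String, String> lastLocation = new HashMap<>();
    private final Map<String, Integer> failures = new HashMap<>();

    StaleLocationTracker(int threshold) {
        this.threshold = threshold;
    }

    /**
     * Call after a lookup returned `location` and the region was not there.
     * Returns true once the same wrong answer has been seen `threshold`
     * times in a row, i.e. when the client should complain to the master.
     */
    boolean recordWrongAnswer(String region, String location) {
        String previous = lastLocation.put(region, location);
        // A different answer than last time restarts the streak at 1.
        int count = location.equals(previous)
                ? failures.getOrDefault(region, 0) + 1
                : 1;
        failures.put(region, count);
        return count >= threshold;
    }

    /** Call when the region was found where advertised; clears the streak. */
    void recordSuccess(String region) {
        lastLocation.remove(region);
        failures.remove(region);
    }
}
```

With a threshold of 10, the tenth identical wrong answer in a row is the point at which the client would ask the master to revisit the region's location.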

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HBASE-921) region close and open processed out of order; makes for disagreement between master and regionserver on region state

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Kellerman updated HBASE-921:
--------------------------------

    Issue Type: Sub-task  (was: Bug)
        Parent: HBASE-678

> region close and open processed out of order; makes for disagreement between master and regionserver on region state
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-921
>                 URL: https://issues.apache.org/jira/browse/HBASE-921
>             Project: Hadoop HBase
>          Issue Type: Sub-task
>    Affects Versions: 0.18.0
>            Reporter: stack
>            Assignee: Jim Kellerman
>            Priority: Blocker
>             Fix For: 0.18.1, 0.19.0
>
>         Attachments: 921-0.18.0.patch

[jira] Updated: (HBASE-921) region close and open processed out of order; makes for disagreement between master and regionserver on region state

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-921:
------------------------

    Fix Version/s:     (was: 0.18.1)
                       (was: 0.19.0)
                   0.18.2


[jira] Commented: (HBASE-921) region close and open processed out of order; makes for disagreement between master and regionserver on region state

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639293#action_12639293 ] 

stack commented on HBASE-921:
-----------------------------

Patch looks good to me (it's a bit hard to follow what's going on, but that's not the patch's fault).  If unit tests pass, commit I'd say.


[jira] Commented: (HBASE-921) region close and open processed out of order; makes for disagreement between master and regionserver on region state

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652502#action_12652502 ] 

stack commented on HBASE-921:
-----------------------------

Did the regionserver say why it was closing?  An exception or OOME?  Or was it a request from the master?  Should we keep this issue in 0.19.0, Andrew?  If so, steps to replicate or a log from the time of the problem would help.


[jira] Assigned: (HBASE-921) region close and open processed out of order; makes for disagreement between master and regionserver on region state

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Kellerman reassigned HBASE-921:
-----------------------------------

    Assignee: Jim Kellerman  (was: stack)


[jira] Commented: (HBASE-921) region close and open processed out of order; makes for disagreement between master and regionserver on region state

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638753#action_12638753 ] 

Jean-Daniel Cryans commented on HBASE-921:
------------------------------------------

This issue seems very much like HBASE-851 and Firby's idea is like the bandaid I proposed there.


[jira] Reopened: (HBASE-921) region close and open processed out of order; makes for disagreement between master and regionserver on region state

Posted by "Andrew Purtell (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell reopened HBASE-921:
----------------------------------

      Assignee:     (was: Jim Kellerman)

I believe I saw an instance of this issue again today. 

What about Jim Firby's original suggestion? "Clients should be able to say 'I've asked for a region's location 10 times now and Mr. Master, you've given me the same response ten times in a row and each time, the answer was wrong. Revisit any notion that said region is at said location'. Mr. Master would then go off and do something drastic like close and reassign the region."

Or the Master can sanity check on its own. If my latest patch to HBASE-1018 goes in, the master can look at the HServerLoad and note that not all expected regions are found there. 
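
As a sketch of what such a master-side sanity check might boil down to: a set difference between the regions the master expects on a server and the regions that server reports. Plain sets of region names stand in for whatever HServerLoad would carry; the class and method below are hypothetical and do not use the real HServerLoad API.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for a master-side check; not part of HBase. */
class RegionSanityCheck {
    /**
     * Returns the regions the master expects on a server that the
     * server's load report does not list. Both arguments are plain
     * sets of region names, an assumption made for this sketch.
     */
    static Set<String> missingRegions(Set<String> expected, Set<String> reported) {
        Set<String> missing = new TreeSet<>(expected);
        missing.removeAll(reported);
        return missing;
    }
}
```

Any region in the returned set would be a candidate for the master to re-examine and possibly reassign.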


[jira] Commented: (HBASE-921) region close and open processed out of order; makes for disagreement between master and regionserver on region state

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650738#action_12650738 ] 

Jim Kellerman commented on HBASE-921:
-------------------------------------

The problem is that the master does not know where any regions are located except for root and meta regions. It is up to the client to figure out that a region isn't where it's supposed to be, and at that point the client should rescan the meta to find the region.
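
A minimal sketch of that client behaviour, assuming a cached location that is dropped and looked up again when the region turns out not to be there. The class, the `metaScan` lookup, and the `isServing` probe are placeholders invented for this example, not the real client API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiPredicate;
import java.util.function.Function;

/** Hypothetical client-side locator that rescans meta on a wrong location. */
class RescanningLocator {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> metaScan;      // region -> server, per meta
    private final BiPredicate<String, String> isServing;  // (server, region) -> really there?

    RescanningLocator(Function<String, String> metaScan,
                      BiPredicate<String, String> isServing) {
        this.metaScan = metaScan;
        this.isServing = isServing;
    }

    /** Returns a server confirmed to serve the region, or null after maxTries. */
    String locate(String region, int maxTries) {
        for (int attempt = 0; attempt < maxTries; attempt++) {
            // Use the cached location if present, otherwise scan meta.
            String server = cache.computeIfAbsent(region, metaScan);
            if (server != null && isServing.test(server, region)) {
                return server;
            }
            // The region is not where it is supposed to be: forget it and rescan.
            cache.remove(region);
        }
        return null;
    }
}
```

Note that rescanning only helps once meta itself has been corrected; if meta keeps returning the same stale location, the loop simply runs out of tries.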


[jira] Resolved: (HBASE-921) region close and open processed out of order; makes for disagreement between master and regionserver on region state

Posted by "Andrew Purtell (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell resolved HBASE-921.
----------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 0.18.2)

The regionserver sent the CLOSE because a DFS error prevented deployment. Master state races are being addressed elsewhere; for example, HBASE-1098 provides a workaround. In addition, HBASE-1038 and the ZK work will impact this area.


[jira] Commented: (HBASE-921) region close and open processed out of order; makes for disagreement between master and regionserver on region state

Posted by "Andrew Purtell (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651736#action_12651736 ] 

Andrew Purtell commented on HBASE-921:
--------------------------------------

What I saw was a CLOSE on the region server and nothing in the master log. 

I opened HBASE-1038 for Jim's suggestion.


[jira] Resolved: (HBASE-921) region close and open processed out of order; makes for disagreement between master and regionserver on region state

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Kellerman resolved HBASE-921.
---------------------------------

    Resolution: Fixed

Committed to branch and trunk.

> region close and open processed out of order; makes for disagreement between master and regionserver on region state
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-921
>                 URL: https://issues.apache.org/jira/browse/HBASE-921
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.18.0
>            Reporter: stack
>            Assignee: Jim Kellerman
>            Priority: Blocker
>             Fix For: 0.18.1, 0.19.0
>
>         Attachments: 921-0.18.0.patch

[jira] Commented: (HBASE-921) region close and open processed out of order; makes for disagreement between master and regionserver on region state

Posted by "Andrew Purtell (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650747#action_12650747 ] 

Andrew Purtell commented on HBASE-921:
--------------------------------------

The problem is that META has out-of-date information on where the region is located, because the master left the server location information in META in an incorrect state for whatever reason. This is a problem with region (re)assignment. The master is not adequately testing its view of region assignments against reality.

I think either Jim Firby's suggestion should be implemented, or the additional information in HSL from the patch for HBASE-1018 should be used. Either option allows the master to get feedback on region assignment state from the regionservers.
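
As a rough illustration of Jim Firby's suggestion, here is a minimal sketch of how a master could count consecutive client complaints about one region location and decide when to revisit it. Everything in it is hypothetical: StaleLocationTracker, reportWrongLocation and reportSuccess are invented names, not HBase APIs, and the threshold of 10 is simply taken from the wording of the suggestion.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; not part of HBase. Counts consecutive "wrong location" complaints per region. */
final class StaleLocationTracker {
    /** The server most recently complained about for a region, and how many times in a row. */
    private static final class Complaint {
        String server;
        int count;
    }

    private final int threshold;
    private final Map<String, Complaint> complaints = new HashMap<>();

    StaleLocationTracker(int threshold) {
        this.threshold = threshold;
    }

    /**
     * A client reports that the location it was given for a region was wrong.
     * Returns true once the same location has been reported wrong 'threshold'
     * times in a row, meaning the master should revisit the assignment.
     */
    boolean reportWrongLocation(String region, String server) {
        Complaint c = complaints.computeIfAbsent(region, r -> new Complaint());
        if (!server.equals(c.server)) {
            // A different location was handed out: start counting afresh.
            c.server = server;
            c.count = 0;
        }
        c.count++;
        if (c.count >= threshold) {
            complaints.remove(region); // reset so the next complaint starts a new run
            return true;
        }
        return false;
    }

    /** A client reached the region at the advertised location: forget earlier complaints. */
    void reportSuccess(String region) {
        complaints.remove(region);
    }
}
```

What the master does when reportWrongLocation returns true (for example, close and reassign the region) is left out on purpose; that is the part that would need real design work.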

> region close and open processed out of order; makes for disagreement between master and regionserver on region state
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-921
>                 URL: https://issues.apache.org/jira/browse/HBASE-921
>             Project: Hadoop HBase
>          Issue Type: Sub-task
>    Affects Versions: 0.18.0
>            Reporter: stack
>            Priority: Blocker
>             Fix For: 0.18.1, 0.19.0
>
>         Attachments: 921-0.18.0.patch

[jira] Assigned: (HBASE-921) region close and open processed out of order; makes for disagreement between master and regionserver on region state

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack reassigned HBASE-921:
---------------------------

    Assignee: stack

> region close and open processed out of order; makes for disagreement between master and regionserver on region state
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-921
>                 URL: https://issues.apache.org/jira/browse/HBASE-921
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.18.0
>            Reporter: stack
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.18.1, 0.19.0

[jira] Updated: (HBASE-921) region close and open processed out of order; makes for disagreement between master and regionserver on region state

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Kellerman updated HBASE-921:
--------------------------------

    Attachment: 921-0.18.0.patch

> region close and open processed out of order; makes for disagreement between master and regionserver on region state
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-921
>                 URL: https://issues.apache.org/jira/browse/HBASE-921
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.18.0
>            Reporter: stack
>            Assignee: Jim Kellerman
>            Priority: Blocker
>             Fix For: 0.18.1, 0.19.0
>
>         Attachments: 921-0.18.0.patch

[jira] Commented: (HBASE-921) region close and open processed out of order; makes for disagreement between master and regionserver on region state

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639110#action_12639110 ] 

Jim Kellerman commented on HBASE-921:
-------------------------------------

In HBase-0.1.x, all region closes in HMaster went through the toDo queue to preserve ordering. Later, an "optimization" was introduced so that simple closes were processed immediately, effectively allowing normal closes to jump ahead in the queue. Here is a snippet from HMaster.processMsgs circa HBase-0.1.0:

{code}
      case HMsg.MSG_REPORT_CLOSE:
        ... (code irrelevant to this issue not shown)

          // NOTE: we cannot put the region into unassignedRegions as that
          //       could create a race with the pending close if it gets 
          //       reassigned before the close is processed.
          unassignedRegions.remove(region);
          try {
            toDoQueue.put(new ProcessRegionClose(region, reassignRegion,
                deleteRegion));
...
{code}
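
To make the ordering argument concrete, here is a small self-contained model of it. This is a sketch under stated assumptions, not HMaster code: ToDoOrderingSketch, metaLocation, reportOpen, reportClose and drain are all hypothetical names, and the "META update" is reduced to a map write. It only shows that when both reports go through one FIFO queue the close is applied after the open, whereas applying the close inline lets the delayed open overwrite it.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical model of the master's deferred-work queue; none of these names come from HMaster. */
final class ToDoOrderingSketch {
    /** The master's recorded location: region name -> hosting server (absent = not deployed). */
    private final Map<String, String> metaLocation = new HashMap<>();
    /** Deferred operations, drained later in FIFO order by a single worker. */
    private final Queue<Runnable> toDoQueue = new ArrayDeque<>();
    /** If true, closes bypass the queue (the "optimization"); if false, they are queued. */
    private final boolean closeInline;

    ToDoOrderingSketch(boolean closeInline) {
        this.closeInline = closeInline;
    }

    /** A regionserver reports it opened the region; the location update is always deferred. */
    void reportOpen(String region, String server) {
        toDoQueue.add(() -> metaLocation.put(region, server));
    }

    /** A regionserver reports it closed the region. */
    void reportClose(String region) {
        Runnable clearLocation = () -> metaLocation.remove(region);
        if (closeInline) {
            clearLocation.run();          // jumps ahead of anything still queued
        } else {
            toDoQueue.add(clearLocation); // keeps its place behind the earlier open
        }
    }

    /** Runs everything that was deferred, oldest first. */
    void drain() {
        Runnable op;
        while ((op = toDoQueue.poll()) != null) {
            op.run();
        }
    }

    String locationOf(String region) {
        return metaLocation.get(region);
    }

    /** Open then close of the same region, with the queue drained only afterwards. */
    static String simulate(boolean closeInline) {
        ToDoOrderingSketch master = new ToDoOrderingSketch(closeInline);
        master.reportOpen("regionX", "rs1");
        master.reportClose("regionX");
        master.drain(); // the queue was "busy", so deferred work runs late
        return master.locationOf("regionX");
    }

    public static void main(String[] args) {
        System.out.println("all queued:   " + simulate(false));
        System.out.println("close inline: " + simulate(true));
    }
}
```

With everything queued, the region ends up with no recorded location, which matches the regionserver that closed it. With the inline close, the late open leaves the master pointing at rs1 even though rs1 has already closed the region.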

> region close and open processed out of order; makes for disagreement between master and regionserver on region state
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-921
>                 URL: https://issues.apache.org/jira/browse/HBASE-921
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.18.0
>            Reporter: stack
>            Assignee: Jim Kellerman
>            Priority: Blocker
>             Fix For: 0.18.1, 0.19.0

[jira] Commented: (HBASE-921) region close and open processed out of order; makes for disagreement between master and regionserver on region state

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639471#action_12639471 ] 

stack commented on HBASE-921:
-----------------------------

Rong-en up on IRC:
{code}
[23:33]	<rafan>	btw, jim's 921 passes unit test here
{code}

> region close and open processed out of order; makes for disagreement between master and regionserver on region state
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-921
>                 URL: https://issues.apache.org/jira/browse/HBASE-921
>             Project: Hadoop HBase
>          Issue Type: Sub-task
>    Affects Versions: 0.18.0
>            Reporter: stack
>            Assignee: Jim Kellerman
>            Priority: Blocker
>             Fix For: 0.18.1, 0.19.0
>
>         Attachments: 921-0.18.0.patch

[jira] Commented: (HBASE-921) region close and open processed out of order; makes for disagreement between master and regionserver on region state

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651729#action_12651729 ] 

stack commented on HBASE-921:
-----------------------------

Andrew: Are you sure the issue you saw was processing of messages out of order? To do the Jim Firby suggestion, should we open a new issue and close this one? Should it be in 0.19?

> region close and open processed out of order; makes for disagreement between master and regionserver on region state
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-921
>                 URL: https://issues.apache.org/jira/browse/HBASE-921
>             Project: Hadoop HBase
>          Issue Type: Sub-task
>    Affects Versions: 0.18.0
>            Reporter: stack
>            Priority: Blocker
>             Fix For: 0.18.1, 0.19.0
>
>         Attachments: 921-0.18.0.patch