Posted to issues@hbase.apache.org by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org> on 2011/09/13 23:49:09 UTC

[jira] [Created] (HBASE-4395) EnableTableHandler races with itself

EnableTableHandler races with itself
------------------------------------

                 Key: HBASE-4395
                 URL: https://issues.apache.org/jira/browse/HBASE-4395
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.90.4
            Reporter: Jean-Daniel Cryans
            Priority: Blocker
             Fix For: 0.90.5


Very often when we try to enable a big table we get something like:

{quote}
2011-09-02 12:21:56,619 FATAL org.apache.hadoop.hbase.master.HMaster: Unexpected state trying to OFFLINE; huge_ass_region_name state=PENDING_OPEN, ts=1314991316616
java.lang.IllegalStateException
        at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1074)
        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1030)
        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:858)
        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:838)
        at org.apache.hadoop.hbase.master.handler.EnableTableHandler$BulkEnabler$1.run(EnableTableHandler.java:154)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
2011-09-02 12:21:56,620 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
{quote}

The issue is that EnableTableHandler invokes BulkEnabler multiple times, and by the time it makes a second call it may be working from a stale list of still-unassigned regions: it then tries to set a region offline in ZK just after that region's state has changed. Case in point:

{quote}
2011-09-02 12:21:56,616 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region huge_ass_region_name to sv4r23s16,60020,1314880035029
2011-09-02 12:21:56,619 FATAL org.apache.hadoop.hbase.master.HMaster: Unexpected state trying to OFFLINE; huge_ass_region_name state=PENDING_OPEN, ts=1314991316616
{quote}

Here the first line is the first assign, done by the first thread; the second line is the second thread that got to process the same region around the same time, only 3ms apart. After that the master dies, and recovery is painful: on restart it fails over a table stuck in the enabling state, which is ungodly slow.

I'm pretty sure there's a window where double assignments are possible.

Talking with Stack, it doesn't really make sense to call BulkEnabler multiple times since the list of regions is static (the table is disabled!). We should just call it once and wait. There's also a lot of cleanup to do in EnableTableHandler, since parts of it still refer to disabling the table (copy-paste leftovers, I guess).
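The race can be boiled down to a tiny state-machine sketch (hypothetical names and simplified states, not actual HBase code): forcing a region OFFLINE is only legal from CLOSED, so a second worker that reaches the same region after the first has already moved it to PENDING_OPEN trips the IllegalStateException and takes the master down.

```java
import java.util.concurrent.atomic.AtomicReference;

public class RegionStateRace {
    enum State { CLOSED, OFFLINE, PENDING_OPEN }

    // Hypothetical stand-in for setOfflineInZooKeeper(): only a CLOSED
    // region may be forced OFFLINE; anything else is an unexpected state.
    static void setOffline(AtomicReference<State> region) {
        if (!region.compareAndSet(State.CLOSED, State.OFFLINE)) {
            throw new IllegalStateException(
                "Unexpected state trying to OFFLINE; state=" + region.get());
        }
    }

    // Hypothetical stand-in for assign(): force the region OFFLINE,
    // then hand it to a region server, which moves it to PENDING_OPEN.
    static void assign(AtomicReference<State> region) {
        setOffline(region);
        region.set(State.PENDING_OPEN);
    }

    public static void main(String[] args) {
        AtomicReference<State> region = new AtomicReference<>(State.CLOSED);
        assign(region); // first BulkEnabler worker: CLOSED -> PENDING_OPEN
        try {
            assign(region); // second worker, working from a stale region list
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```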

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-4395) EnableTableHandler races with itself

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-4395:
--------------------------------------

    Attachment: HBASE-4395-0.90-v2.patch

Fixed patch attached.


[jira] [Commented] (HBASE-4395) EnableTableHandler races with itself

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13104077#comment-13104077 ] 

stack commented on HBASE-4395:
------------------------------

bq. Looking more into this: deep inside BulkEnabler, waitUntilDone() is called with a timeout of 5 minutes. I guess the outer loop is how we take action when we time out. In hindsight I don't think it's a good idea.

Yeah.  The notion was that you'd hold off on timing out while we got the regions out. After this configurable amount of time -- 5 minutes being the default -- we'd let the timeout monitor cut in to take care of any stragglers.

That was the original thinking.

bq. Since we already have access to both the number of regions in the table and the number of online regions, I think we could do a trick where we count the latter in a loop and, as long as we make progress, we don't time out.

This is a better idea.


[jira] [Commented] (HBASE-4395) EnableTableHandler races with itself

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105819#comment-13105819 ] 

Hudson commented on HBASE-4395:
-------------------------------

Integrated in HBase-TRUNK #2217 (See [https://builds.apache.org/job/HBase-TRUNK/2217/])
    HBASE-4395  EnableTableHandler races with itself

jdcryans : 
Files : 
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/EnableTableHandler.java



[jira] [Commented] (HBASE-4395) EnableTableHandler races with itself

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13104197#comment-13104197 ] 

Ted Yu commented on HBASE-4395:
-------------------------------

In the loop of waitUntilDone():
{code}
+        // Punt on the timeout as long we make progress
+        if (regions.size() < lastNumberOfRegions) {
+          lastNumberOfRegions = regions.size();
+          remaining += waitingTimeForEvents;
+        }
         remaining = timeout - (System.currentTimeMillis() - startTime);
{code}
I think regions.size() should increase in the above loop, so I don't understand the if condition.
Also, remaining is recomputed on the last line anyway, so I don't see why it is updated inside the if block.


[jira] [Commented] (HBASE-4395) EnableTableHandler races with itself

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105041#comment-13105041 ] 

stack commented on HBASE-4395:
------------------------------

regions.isEmpty() is cheaper than "if (regions.size() == 0) {".

You get the size of the region list to do logging a few lines later... might as well cache it?

You can address on commit.

+1 on patch.




[jira] [Updated] (HBASE-4395) EnableTableHandler races with itself

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-4395:
--------------------------------------

    Attachment: HBASE-4395-trunk.patch

Patch for trunk; it includes fixes for the points Stack raised.


[jira] [Resolved] (HBASE-4395) EnableTableHandler races with itself

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans resolved HBASE-4395.
---------------------------------------

      Resolution: Fixed
    Release Note: EnableTableHandler used to issue multiple bulk enables on a table; it now issues a single one and eventually times out if not all the regions are assigned and no progress is being made.
    Hadoop Flags: [Reviewed]

Committed to branch and trunk.


[jira] [Commented] (HBASE-4395) EnableTableHandler races with itself

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13104056#comment-13104056 ] 

Jean-Daniel Cryans commented on HBASE-4395:
-------------------------------------------

Looking more into this: deep inside BulkEnabler, waitUntilDone() is called with a timeout of 5 minutes. I guess the outer loop is how we take action when we time out. In hindsight I don't think it's a good idea.

Since we already have access to both the number of regions in the table and the number of online regions, I think we could do a trick where we count the latter in a loop and, as long as we make progress, we don't time out. If we do time out, the best we can do is log an angry message in the master log... I don't think looping forever is better, and the same goes for calling BulkEnabler multiple times.
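A minimal, self-contained sketch of that trick (hypothetical names, using a plain counter in place of the AssignmentManager): extend the overall deadline whenever the online-region count grows, so a slow but healthy enable never trips the timeout while a stalled one still does.

```java
import java.util.function.IntSupplier;

public class WaitWithProgress {
    // Wait until onlineCount() reaches totalRegions. Each time progress is
    // observed, the overall timeout (not just "remaining") is extended by
    // extensionMs, so only a stalled assignment can actually time out.
    static boolean waitForRegionsOnline(int totalRegions, IntSupplier onlineCount,
                                        long timeoutMs, long extensionMs)
            throws InterruptedException {
        long start = System.currentTimeMillis();
        long timeout = timeoutMs;
        int lastOnline = onlineCount.getAsInt();
        while (onlineCount.getAsInt() < totalRegions) {
            long remaining = timeout - (System.currentTimeMillis() - start);
            if (remaining <= 0) {
                return false; // timed out with no progress: log angrily here
            }
            Thread.sleep(Math.min(10L, remaining)); // poll interval
            int online = onlineCount.getAsInt();
            if (online > lastOnline) { // made progress: punt on the timeout
                lastOnline = online;
                timeout += extensionMs;
            }
        }
        return true;
    }
}
```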

> EnableTableHandler races with itself
> ------------------------------------
>
>                 Key: HBASE-4395
>                 URL: https://issues.apache.org/jira/browse/HBASE-4395
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.5
>
>
> Very often when we try to enable a big table we get something like:
> {quote}
> 2011-09-02 12:21:56,619 FATAL org.apache.hadoop.hbase.master.HMaster: Unexpected state trying to OFFLINE; huge_ass_region_name state=PENDING_OPEN, ts=1314991316616
> java.lang.IllegalStateException
>         at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1074)
>         at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1030)
>         at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:858)
>         at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:838)
>         at org.apache.hadoop.hbase.master.handler.EnableTableHandler$BulkEnabler$1.run(EnableTableHandler.java:154)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> 2011-09-02 12:21:56,620 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
> {quote}
> The issue is that EnableTableHandler calls multiple BulkEnabler and it's possible that by the time it calls it a second time, using a stale list of still-not-enabled regions, that it tries to set one region offline in ZK but just after its state changed. Case in point:
> {quote}
> 2011-09-02 12:21:56,616 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region huge_ass_region_name to sv4r23s16,60020,1314880035029
> 2011-09-02 12:21:56,619 FATAL org.apache.hadoop.hbase.master.HMaster: Unexpected state trying to OFFLINE; huge_ass_region_name state=PENDING_OPEN, ts=1314991316616
> {quote}
> Here the first line is the assign done by the first thread, and the second line is the second thread that got to process the same region around the same time, only 3 ms apart. After that the master dies, and it's pretty sad when it restarts because it fails over an enabling table, which is ungodly slow.
> I'm pretty sure there's a window where double assignments are possible.
> Talking with Stack, it doesn't really make sense to run BulkEnabler multiple times since the list of regions is static (the table is disabled!). We should just call it once and wait. Also there's a lot of cleanup to do in EnableTableHandler since it still refers to disabling the table (copy-paste leftovers, I guess).
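The race described above can be reduced to a minimal sketch. This is a hypothetical simplification, not the actual HBase code: `EnableRaceSketch`, its `State` enum, and `assign()` are illustrative stand-ins for the region-state transition that `setOfflineInZooKeeper()` guards. Two BulkEnabler-style runs sharing a stale list both try to offline the same region; the second finds it already PENDING_OPEN and fails the same way the master did.

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch of the race (illustrative names, not real HBase classes).
public class EnableRaceSketch {
  enum State { OFFLINE, PENDING_OPEN }

  // Only a region still OFFLINE may be offlined-in-ZK and assigned;
  // this mirrors the check that threw IllegalStateException above.
  static void assign(AtomicReference<State> region) {
    if (!region.compareAndSet(State.OFFLINE, State.PENDING_OPEN)) {
      throw new IllegalStateException(
          "Unexpected state trying to OFFLINE; state=" + region.get());
    }
  }

  public static void main(String[] args) {
    AtomicReference<State> region = new AtomicReference<>(State.OFFLINE);
    assign(region);       // first BulkEnabler run: OFFLINE -> PENDING_OPEN
    try {
      assign(region);     // second run, stale list: region already moved on
      System.out.println("no race");
    } catch (IllegalStateException e) {
      System.out.println("race: " + e.getMessage());
    }
  }
}
```

The 3 ms gap in the log is exactly this window: the second thread read the region list before the first thread's assign landed.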

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4395) EnableTableHandler races with itself

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13104735#comment-13104735 ] 

Jean-Daniel Cryans commented on HBASE-4395:
-------------------------------------------

bq. I think regions.size() should increase in the above loop. So I don't understand the condition for if above.

Yeah, that was a last-minute change. I actually tested with "regions.size() > lastNumberOfRegions" and then thought that number was going down; I was confused with regionsToAssign().

bq. Also, remaining is calculated lastly. I don't know why remaining is updated in the if block.

Derp, sorry, it should be the timeout that's incremented.


[jira] [Commented] (HBASE-4395) EnableTableHandler races with itself

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13104842#comment-13104842 ] 

Ted Yu commented on HBASE-4395:
-------------------------------

+1 on patch version 2.


[jira] [Updated] (HBASE-4395) EnableTableHandler races with itself

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-4395:
--------------------------------------

    Attachment: HBASE-4395-0.90.patch

Removes the loop and adds the timeout bumping. Tried it on a largish table (1k regions) and it worked fine. Also passes TestAdmin.
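The shape of the fix can be sketched roughly as follows. This is a hypothetical simplification, not the patch itself: `SingleBulkEnableSketch`, `online`, and `waitUntilDone()` are made-up names. The key idea is that the bulk assign is issued exactly once against the static region list, and the wait loop extends its deadline whenever regions keep coming online, instead of re-running BulkEnabler against a stale list.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Sketch of "call it once and wait, bumping the timeout on progress".
public class SingleBulkEnableSketch {
  // Stand-in for the assignment manager's view of online regions.
  static final List<String> online = new CopyOnWriteArrayList<>();

  static boolean waitUntilDone(List<String> regions, long timeoutMs)
      throws InterruptedException {
    long remaining = timeoutMs;
    int lastOnline = 0;
    while (remaining > 0 && online.size() < regions.size()) {
      Thread.sleep(10);
      remaining -= 10;
      if (online.size() > lastOnline) {
        // Progress was made: extend the deadline rather than kicking
        // off another bulk assign with a stale region list.
        lastOnline = online.size();
        remaining = timeoutMs;
      }
    }
    return online.size() == regions.size();
  }

  public static void main(String[] args) throws Exception {
    List<String> regions = List.of("r1", "r2", "r3");
    // Simulate region servers slowly opening regions after the one
    // and only bulk assign.
    Thread opener = new Thread(() -> {
      for (String r : regions) {
        try { Thread.sleep(30); } catch (InterruptedException e) { return; }
        online.add(r);
      }
    });
    opener.start();
    boolean done = waitUntilDone(regions, 100);
    opener.join();
    System.out.println(done ? "enabled" : "timed out");
  }
}
```

With a single assign pass there is no second thread to race against, and the bumped timeout keeps a slow-but-progressing enable of a large table from being declared failed.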


[jira] [Assigned] (HBASE-4395) EnableTableHandler races with itself

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans reassigned HBASE-4395:
-----------------------------------------

    Assignee: Jean-Daniel Cryans
