You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@hbase.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2010/11/17 05:58:13 UTC

[jira] Created: (HBASE-3243) Disable Table closed region on wrong host

Disable Table closed region on wrong host
-----------------------------------------

                 Key: HBASE-3243
                 URL: https://issues.apache.org/jira/browse/HBASE-3243
             Project: HBase
          Issue Type: Bug
          Components: master
    Affects Versions: 0.90.0
            Reporter: Todd Lipcon
            Priority: Blocker
             Fix For: 0.90.0


I ran some YCSB benchmarks which resulted in about 150 regions worth of data overnight. Then I disabled the table, and the master for some reason closed one region on the wrong server. The server ignored this, but the region remained open on a different server, which later flipped out when it tried to flush due to hlog accumulation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3243) Disable Table closed region on wrong host

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934530#action_12934530 ] 

Jonathan Gray commented on HBASE-3243:
--------------------------------------

Been through this several times on the code paths and have found nothing other than what is in my patch.

There is a weird potential race condition around the master handling of a split (we offline parent and online regions one after the other w/ no locking between operations).  But that does not appear to be the issue here.

I'm yet to really dig through logs though and will do that next.

> Disable Table closed region on wrong host
> -----------------------------------------
>
>                 Key: HBASE-3243
>                 URL: https://issues.apache.org/jira/browse/HBASE-3243
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.90.0
>
>         Attachments: hbase-3243-logs.tar.bz2, HBASE-3243-v1.patch
>
>
> I ran some YCSB benchmarks which resulted in about 150 regions worth of data overnight. Then I disabled the table, and the master for some reason closed one region on the wrong server. The server ignored this, but the region remained open on a different server, which later flipped out when it tried to flush due to hlog accumulation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3243) Disable Table closed region on wrong host

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964926#action_12964926 ] 

Jonathan Gray commented on HBASE-3243:
--------------------------------------

Yes.  That was wrong, we never use this.servers as a lock, we always manipulate this.regions and this.servers under the this.regions lock.

The removal from regionPlans was just moved underneath it, I didn't take that part out.  It's in the next chunk of the diff.

> Disable Table closed region on wrong host
> -----------------------------------------
>
>                 Key: HBASE-3243
>                 URL: https://issues.apache.org/jira/browse/HBASE-3243
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.90.0
>
>         Attachments: hbase-3243-logs.tar.bz2, HBASE-3243-v1.patch, hri.diff
>
>
> I ran some YCSB benchmarks which resulted in about 150 regions worth of data overnight. Then I disabled the table, and the master for some reason closed one region on the wrong server. The server ignored this, but the region remained open on a different server, which later flipped out when it tried to flush due to hlog accumulation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3243) Disable Table closed region on wrong host

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932818#action_12932818 ] 

Todd Lipcon commented on HBASE-3243:
------------------------------------

No idea if it's related but it seems like AssignmentManager's {{regions}} member is accessed without synchronization sometimes... could end up getting some incorrect data due to this.

> Disable Table closed region on wrong host
> -----------------------------------------
>
>                 Key: HBASE-3243
>                 URL: https://issues.apache.org/jira/browse/HBASE-3243
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.90.0
>
>
> I ran some YCSB benchmarks which resulted in about 150 regions worth of data overnight. Then I disabled the table, and the master for some reason closed one region on the wrong server. The server ignored this, but the region remained open on a different server, which later flipped out when it tried to flush due to hlog accumulation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HBASE-3243) Disable Table closed region on wrong host

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3243:
-------------------------

    Attachment: 3243-v2.patch

Here's what I applied.

> Disable Table closed region on wrong host
> -----------------------------------------
>
>                 Key: HBASE-3243
>                 URL: https://issues.apache.org/jira/browse/HBASE-3243
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.90.0
>
>         Attachments: 3243-v2.patch, hbase-3243-logs.tar.bz2, HBASE-3243-v1.patch, hri.diff
>
>
> I ran some YCSB benchmarks which resulted in about 150 regions worth of data overnight. Then I disabled the table, and the master for some reason closed one region on the wrong server. The server ignored this, but the region remained open on a different server, which later flipped out when it tried to flush due to hlog accumulation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3243) Disable Table closed region on wrong host

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964918#action_12964918 ] 

Jonathan Gray commented on HBASE-3243:
--------------------------------------

+1 to your proposal stack

I've also gone through this several times and have not been able to come up with anything besides what is contained in these patches.

Todd, do your best to break the new RC :)

> Disable Table closed region on wrong host
> -----------------------------------------
>
>                 Key: HBASE-3243
>                 URL: https://issues.apache.org/jira/browse/HBASE-3243
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.90.0
>
>         Attachments: hbase-3243-logs.tar.bz2, HBASE-3243-v1.patch, hri.diff
>
>
> I ran some YCSB benchmarks which resulted in about 150 regions worth of data overnight. Then I disabled the table, and the master for some reason closed one region on the wrong server. The server ignored this, but the region remained open on a different server, which later flipped out when it tried to flush due to hlog accumulation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3243) Disable Table closed region on wrong host

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934535#action_12934535 ] 

Jonathan Gray commented on HBASE-3243:
--------------------------------------

Well try running again with my patch.  Or you could even run it again without to see if it happens again and we could get another set of logs.  I guess run it with the patch and then if it doesn't ever happen again we can punt the issue or resolve it until we see it again.

> Disable Table closed region on wrong host
> -----------------------------------------
>
>                 Key: HBASE-3243
>                 URL: https://issues.apache.org/jira/browse/HBASE-3243
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.90.0
>
>         Attachments: hbase-3243-logs.tar.bz2, HBASE-3243-v1.patch
>
>
> I ran some YCSB benchmarks which resulted in about 150 regions worth of data overnight. Then I disabled the table, and the master for some reason closed one region on the wrong server. The server ignored this, but the region remained open on a different server, which later flipped out when it tried to flush due to hlog accumulation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3243) Disable Table closed region on wrong host

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932825#action_12932825 ] 

Todd Lipcon commented on HBASE-3243:
------------------------------------

Just grasping at straws here... but another thought is this:
The keys of the {{regions}} map are HRegionInfo, where compareTo() includes calling down to compareTo() on HTableDescriptor. HTableDescriptor's compareTo eventually delegates to hashcode on the {{values}} instance... do we end up changing the {{values}} instance when a table gets disabled? Perhaps this is then changing the sort order in the TreeMap and confusing things?

> Disable Table closed region on wrong host
> -----------------------------------------
>
>                 Key: HBASE-3243
>                 URL: https://issues.apache.org/jira/browse/HBASE-3243
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.90.0
>
>
> I ran some YCSB benchmarks which resulted in about 150 regions worth of data overnight. Then I disabled the table, and the master for some reason closed one region on the wrong server. The server ignored this, but the region remained open on a different server, which later flipped out when it tried to flush due to hlog accumulation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HBASE-3243) Disable Table closed region on wrong host

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3243:
-------------------------

    Attachment: hri.diff

This little patch makes HRI compare use tablename only rather than full compareTo.  Full compareTo is OTT and looks error-prone.

I've spent some time looking at this issue and I can't figure how it came about.  I think we should apply the two patches that are in this issue, down the issues priority, and then cut a new RC.  We then test the new RC and see if we can reproduce.  If we do, recast this issue as a blocker.  Hows that sound?

> Disable Table closed region on wrong host
> -----------------------------------------
>
>                 Key: HBASE-3243
>                 URL: https://issues.apache.org/jira/browse/HBASE-3243
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.90.0
>
>         Attachments: hbase-3243-logs.tar.bz2, HBASE-3243-v1.patch, hri.diff
>
>
> I ran some YCSB benchmarks which resulted in about 150 regions worth of data overnight. Then I disabled the table, and the master for some reason closed one region on the wrong server. The server ignored this, but the region remained open on a different server, which later flipped out when it tried to flush due to hlog accumulation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3243) Disable Table closed region on wrong host

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932814#action_12932814 ] 

Todd Lipcon commented on HBASE-3243:
------------------------------------

Here are some relevant logs... the region in question is user1072135606,1289903283192.8a2c525797cad9fd9882cc6304362bd1

Here's the birth of the region on haus01:
2010-11-16 02:28:10,471 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Instantiated usertable,user1072135606,1289903283192.8a2c525797cad9fd9882cc6304362bd1.
2010-11-16 02:28:11,432 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: Region split, META updated, and report to master. Parent=usertable,user1057689679,1289901563182.24d8aec47045a86b97b3508ed16ced29., new regions: usertable,user1057689679,1289903283192.b313a3864e1eb1aaf2a72dcf2c2814b3., usertable,user1072135606,1289903283192.8a2c525797cad9fd9882cc6304362bd1.. Split took 8sec

Master gets split:
2010-11-16 02:28:11,450 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: usertable,user1057689679,1289901563182.24d8aec47045a86b97b3508ed16ced29.: Daughters; usertable,user1057689679,1289903283192.b313a3864e1eb1aaf2a72dcf2c2814b3., usertable,user1072135606,1289903283192.8a2c525797cad9fd9882cc6304362bd1. from haus01.sf.cloudera.com,60020,1289890926766

.. then plenty of successful flushes/compactions on haus01, no mention on any other region server. The benchmark finished around 3:20am, and no mention of this region in any logs from that point until I disabled the table this evening:

Master says:
2010-11-16 18:16:17,223 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region usertable,user1072135606,1289903283192.8a2c525797cad9fd9882cc6304362bd1. (offlining)

Region server *haus02* (NOTE: this is the wrong regionserver!!) says:
2010-11-16 18:16:17,238 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received close region: usertable,user1072135606,1289903283192.8a2c525797cad9fd9882cc6304362bd1.
2010-11-16 18:16:17,239 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Received close for region we are not serving; 8a2c525797cad9fd9882cc6304362bd1

Master says:
2010-11-16 18:16:17,240 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to send CLOSE to serverName=haus02.sf.cloudera.com,60020,1289890927217, load=(requests=1029, regions=28, usedHeap=4772, maxHeap=8185) for region usertable,user1072135606,1289903283192.8a2c525797cad9fd9882cc6304362bd1. but failed, setting region as OFFLINE and reassigning
...
2010-11-16 18:16:17,248 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=usertable,user1072135606,1289903283192.8a2c525797cad9fd9882cc6304362bd1. state=PENDING_CLOSE, ts=1289960177227
2010-11-16 18:16:17,248 INFO org.apache.hadoop.hbase.master.AssignmentManager: Table usertable disabling; skipping assign of usertable,user1072135606,1289903283192.8a2c525797cad9fd9882cc6304362bd1.
2010-11-16 18:16:17,248 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x12c537d84e10000 Deleting existing unassigned node for 8a2c525797cad9fd9882cc6304362bd1 that is in expected state RS_ZK_REGION_CLOSED
2010-11-16 18:16:17,248 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x12c537d84e10000 Unable to get data of znode /hbase/unassigned/8a2c525797cad9fd9882cc6304362bd1 because node does not exist (not necessarily an error)


So basically, the master randomly got the wrong regionserver for this region, and there's little indication why.

> Disable Table closed region on wrong host
> -----------------------------------------
>
>                 Key: HBASE-3243
>                 URL: https://issues.apache.org/jira/browse/HBASE-3243
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.90.0
>
>
> I ran some YCSB benchmarks which resulted in about 150 regions worth of data overnight. Then I disabled the table, and the master for some reason closed one region on the wrong server. The server ignored this, but the region remained open on a different server, which later flipped out when it tried to flush due to hlog accumulation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3243) Disable Table closed region on wrong host

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932947#action_12932947 ] 

Jonathan Gray commented on HBASE-3243:
--------------------------------------

This is very weird.  Can you put up the full logs somewhere?

> Disable Table closed region on wrong host
> -----------------------------------------
>
>                 Key: HBASE-3243
>                 URL: https://issues.apache.org/jira/browse/HBASE-3243
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.90.0
>
>         Attachments: HBASE-3243-v1.patch
>
>
> I ran some YCSB benchmarks which resulted in about 150 regions worth of data overnight. Then I disabled the table, and the master for some reason closed one region on the wrong server. The server ignored this, but the region remained open on a different server, which later flipped out when it tried to flush due to hlog accumulation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3243) Disable Table closed region on wrong host

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964924#action_12964924 ] 

stack commented on HBASE-3243:
------------------------------

Did you intend to do this in your patch Jon?

{code}
@@ -1359,11 +1361,6 @@
     }
     synchronized (this.regions) {
       this.regions.remove(hri);
-    }
-    synchronized (this.regionPlans) {
-      this.regionPlans.remove(hri.getEncodedName());
-    }
-    synchronized (this.servers) {
       for (List<HRegionInfo> regions : this.servers.values()) {
         for (int i=0;i<regions.size();i++) {
           if (regions.get(i).equals(hri)) {
{code}

> Disable Table closed region on wrong host
> -----------------------------------------
>
>                 Key: HBASE-3243
>                 URL: https://issues.apache.org/jira/browse/HBASE-3243
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.90.0
>
>         Attachments: hbase-3243-logs.tar.bz2, HBASE-3243-v1.patch, hri.diff
>
>
> I ran some YCSB benchmarks which resulted in about 150 regions worth of data overnight. Then I disabled the table, and the master for some reason closed one region on the wrong server. The server ignored this, but the region remained open on a different server, which later flipped out when it tried to flush due to hlog accumulation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HBASE-3243) Disable Table closed region on wrong host

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3243:
-------------------------

         Priority: Major  (was: Blocker)
    Fix Version/s:     (was: 0.90.0)
                   0.90.1

Moving out of 0.90... to 0.90.1 and made it major for now until we see it again.

> Disable Table closed region on wrong host
> -----------------------------------------
>
>                 Key: HBASE-3243
>                 URL: https://issues.apache.org/jira/browse/HBASE-3243
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: Todd Lipcon
>             Fix For: 0.90.1
>
>         Attachments: 3243-v2.patch, hbase-3243-logs.tar.bz2, HBASE-3243-v1.patch, hri.diff
>
>
> I ran some YCSB benchmarks which resulted in about 150 regions worth of data overnight. Then I disabled the table, and the master for some reason closed one region on the wrong server. The server ignored this, but the region remained open on a different server, which later flipped out when it tried to flush due to hlog accumulation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3243) Disable Table closed region on wrong host

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934531#action_12934531 ] 

Todd Lipcon commented on HBASE-3243:
------------------------------------

Thanks for looking at this, Jon. I can run some tests with my "crazy config" which generates lots and lots of regions, and see if it happens again. Any good log messages you can suggest adding for this test?

> Disable Table closed region on wrong host
> -----------------------------------------
>
>                 Key: HBASE-3243
>                 URL: https://issues.apache.org/jira/browse/HBASE-3243
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.90.0
>
>         Attachments: hbase-3243-logs.tar.bz2, HBASE-3243-v1.patch
>
>
> I ran some YCSB benchmarks which resulted in about 150 regions worth of data overnight. Then I disabled the table, and the master for some reason closed one region on the wrong server. The server ignored this, but the region remained open on a different server, which later flipped out when it tried to flush due to hlog accumulation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3243) Disable Table closed region on wrong host

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965020#action_12965020 ] 

Todd Lipcon commented on HBASE-3243:
------------------------------------

Sounds good to me.

> Disable Table closed region on wrong host
> -----------------------------------------
>
>                 Key: HBASE-3243
>                 URL: https://issues.apache.org/jira/browse/HBASE-3243
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: Todd Lipcon
>             Fix For: 0.90.1
>
>         Attachments: 3243-v2.patch, hbase-3243-logs.tar.bz2, HBASE-3243-v1.patch, hri.diff
>
>
> I ran some YCSB benchmarks which resulted in about 150 regions worth of data overnight. Then I disabled the table, and the master for some reason closed one region on the wrong server. The server ignored this, but the region remained open on a different server, which later flipped out when it tried to flush due to hlog accumulation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3243) Disable Table closed region on wrong host

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933094#action_12933094 ] 

Todd Lipcon commented on HBASE-3243:
------------------------------------

bq. Looking at this more, I'm not sure synchronization is the issue here because TreeMap appears to only be not thread-safe when there are mutations. The two critical pieces of code where a conflict could happen are where we read the server a region is assigned to, and where we set the server a region is assigned to

What about removals? I thought I saw a couple places with remove() that were unsynchronized. Will take a look at your patch momentarily.

bq. This is very weird. Can you put up the full logs somewhere

Yep, will upload them here, it's just fake data, nothing secret.

> Disable Table closed region on wrong host
> -----------------------------------------
>
>                 Key: HBASE-3243
>                 URL: https://issues.apache.org/jira/browse/HBASE-3243
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.90.0
>
>         Attachments: HBASE-3243-v1.patch
>
>
> I ran some YCSB benchmarks which resulted in about 150 regions worth of data overnight. Then I disabled the table, and the master for some reason closed one region on the wrong server. The server ignored this, but the region remained open on a different server, which later flipped out when it tried to flush due to hlog accumulation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3243) Disable Table closed region on wrong host

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933628#action_12933628 ] 

stack commented on HBASE-3243:
------------------------------

This is an odd one.  I don't see anything jumping out at me.  Need to dig more.

> Disable Table closed region on wrong host
> -----------------------------------------
>
>                 Key: HBASE-3243
>                 URL: https://issues.apache.org/jira/browse/HBASE-3243
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.90.0
>
>         Attachments: hbase-3243-logs.tar.bz2, HBASE-3243-v1.patch
>
>
> I ran some YCSB benchmarks which resulted in about 150 regions worth of data overnight. Then I disabled the table, and the master for some reason closed one region on the wrong server. The server ignored this, but the region remained open on a different server, which later flipped out when it tried to flush due to hlog accumulation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HBASE-3243) Disable Table closed region on wrong host

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HBASE-3243:
-------------------------------

    Attachment: hbase-3243-logs.tar.bz2

Here are all the logs that contain that region ID.

> Disable Table closed region on wrong host
> -----------------------------------------
>
>                 Key: HBASE-3243
>                 URL: https://issues.apache.org/jira/browse/HBASE-3243
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.90.0
>
>         Attachments: hbase-3243-logs.tar.bz2, HBASE-3243-v1.patch
>
>
> I ran some YCSB benchmarks which resulted in about 150 regions worth of data overnight. Then I disabled the table, and the master for some reason closed one region on the wrong server. The server ignored this, but the region remained open on a different server, which later flipped out when it tried to flush due to hlog accumulation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HBASE-3243) Disable Table closed region on wrong host

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-3243:
---------------------------------

    Attachment: HBASE-3243-v1.patch

Fixes to {{regions}} synchronization including one misuse of {{servers}} modification under it's own lock rather than the {{regions}} lock.

Looking at this more, I'm not sure synchronization is the issue here because TreeMap appears to only be not thread-safe when there are mutations.  The two critical pieces of code where a conflict could happen are where we read the server a region is assigned to, and where we set the server a region is assigned to.

AssignmentManager:525 in {{regionOnline()}}
{noformat}
    synchronized (this.regions) {
      // Add check
      HServerInfo hsi = this.regions.get(regionInfo);
      if (hsi != null) LOG.warn("Overwriting " + regionInfo.getEncodedName() +
        " on " + hsi);
      this.regions.put(regionInfo, serverInfo);
      addToServers(serverInfo, regionInfo);
    }
{noformat}

AssignmentManager:1008 in {{unassign()}}
{noformat}
    synchronized (this.regions) {
      server = regions.get(region);
    }
{noformat}

Seems like this is okay as-is.  We should definitely still apply this patch to fix the synchronization issues but I'm not sure it is the cause.

Digging more.

> Disable Table closed region on wrong host
> -----------------------------------------
>
>                 Key: HBASE-3243
>                 URL: https://issues.apache.org/jira/browse/HBASE-3243
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.90.0
>
>         Attachments: HBASE-3243-v1.patch
>
>
> I ran some YCSB benchmarks which resulted in about 150 regions worth of data overnight. Then I disabled the table, and the master for some reason closed one region on the wrong server. The server ignored this, but the region remained open on a different server, which later flipped out when it tried to flush due to hlog accumulation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3243) Disable Table closed region on wrong host

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932894#action_12932894 ] 

Jonathan Gray commented on HBASE-3243:
--------------------------------------

Dug into code around compareTo() and the values instance.  Pretty sure we are not touching that while a cluster is running, and definitely not during disabling.  We've actually even taken the 'offline' flag out of META as well, it's a node in ZK now that signals the state of a table (enabling/disabling/disabled).

It is interesting that HRI comparator uses the full HTD comparator.  It should probably just use the tableName itself to compare though considering HTD is a member it might make sense in some circumstances to do the full HTD.compareTo()?

Looking around the code, it does look like there are multiple places we're using regions w/o a lock in the disable table path.  Pretty sure this is the cause.  Patch soon.

> Disable Table closed region on wrong host
> -----------------------------------------
>
>                 Key: HBASE-3243
>                 URL: https://issues.apache.org/jira/browse/HBASE-3243
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.90.0
>
>
> I ran some YCSB benchmarks which resulted in about 150 regions worth of data overnight. Then I disabled the table, and the master for some reason closed one region on the wrong server. The server ignored this, but the region remained open on a different server, which later flipped out when it tried to flush due to hlog accumulation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.