Posted to issues@hbase.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2010/09/07 03:17:33 UTC

[jira] Created: (HBASE-2964) Deadlock when RS tries to RPC to itself inside SplitTransaction

Deadlock when RS tries to RPC to itself inside SplitTransaction
---------------------------------------------------------------

                 Key: HBASE-2964
                 URL: https://issues.apache.org/jira/browse/HBASE-2964
             Project: HBase
          Issue Type: Bug
          Components: ipc, regionserver
    Affects Versions: 0.90.0
            Reporter: Todd Lipcon
            Priority: Blocker


In testing the 0.89.20100830 rc, I ran into a deadlock with the following situation:

- All of the IPC Handler threads are blocked on the region lock, which is held by CompactSplitThread.
- CompactSplitThread is in the process of trying to edit META to create the offline parent. META happens to be hosted on the same server that is executing the split.

Therefore, the CompactSplitThread is trying to connect back to itself, but all of the handler threads are blocked, so the IPC never happens. Thus, the entire RS gets deadlocked.
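
A minimal Java sketch of the cycle described above. It assumes a fixed-size handler pool and a single ReentrantReadWriteLock standing in for the region lock. SelfRpcDeadlockDemo and HANDLER_COUNT are hypothetical names, not HBase code. The calling thread plays CompactSplitThread: it holds the write lock, every handler blocks on the read lock, and the "RPC to itself" is queued behind them. A timeout is used only so the demo terminates instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class SelfRpcDeadlockDemo {
    // Hypothetical handler count; a real RS sizes its handler pool from configuration.
    static final int HANDLER_COUNT = 3;

    /** Returns true if the self-directed "META edit" could not run while the write lock was held. */
    static boolean simulate() {
        ReentrantReadWriteLock regionLock = new ReentrantReadWriteLock();
        ExecutorService handlers = Executors.newFixedThreadPool(HANDLER_COUNT);
        CountDownLatch busy = new CountDownLatch(HANDLER_COUNT);
        boolean deadlocked = false;

        // The calling thread plays CompactSplitThread and takes the region write lock.
        regionLock.writeLock().lock();
        try {
            // Client requests occupy every handler thread and block on the read lock.
            for (int i = 0; i < HANDLER_COUNT; i++) {
                handlers.submit(() -> {
                    busy.countDown();
                    regionLock.readLock().lock();
                    regionLock.readLock().unlock();
                });
            }
            busy.await();

            // The split now "RPCs to itself": the META edit needs a free handler thread.
            Future<?> metaEdit = handlers.submit(() -> { /* edit META */ });
            try {
                metaEdit.get(500, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                deadlocked = true; // without a timeout this thread would wait forever
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            // Releasing the lock lets the blocked handlers, then the META edit, finish.
            regionLock.writeLock().unlock();
            handlers.shutdown();
        }
        return deadlocked;
    }

    public static void main(String[] args) {
        System.out.println("deadlocked = " + simulate());
    }
}
```

Running main prints deadlocked = true. Once the write lock is released, the queued handlers and then the META edit drain normally.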

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-2964) Deadlock when RS tries to RPC to itself inside SplitTransaction

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907294#action_12907294 ] 

Todd Lipcon commented on HBASE-2964:
------------------------------------

+1 to stack's patch from Review Board. Imported about 550G overnight and it worked OK.

[jira] Commented: (HBASE-2964) Deadlock when RS tries to RPC to itself inside SplitTransaction

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906908#action_12906908 ] 

HBase Review Board commented on HBASE-2964:
-------------------------------------------

Message from: "Todd Lipcon" <to...@cloudera.com>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/798/
-----------------------------------------------------------

Review request for hbase and stack.


Summary
-------

Moves all RPCs outside of the region writeLock - the writeLock is now only used long enough to set the 'closing' flag. When we drop the lock any waiters will see 'closing' upon acquiring the lock, and thus throw NSRE.

In the case that we abort the split, it will reopen the region as before. Accessors will have gotten NSRE but will just come back to the same region eventually.
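
A simplified sketch of the locking pattern the summary describes. It assumes a region with a 'closing' flag guarded by a ReentrantReadWriteLock. Region, split and the local NotServingRegionException class are hypothetical stand-ins; HBase has its own NotServingRegionException, and this is not HRegion's actual code. The write lock is held only to flip the flag. Handlers that were waiting acquire the read lock afterwards, see 'closing', and fail with NSRE instead of holding up the split.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

class ClosingFlagSketch {
    /** Local stand-in for HBase's NotServingRegionException (NSRE). */
    static class NotServingRegionException extends RuntimeException {
        NotServingRegionException(String msg) { super(msg); }
    }

    /** Hypothetical, much-simplified region. */
    static class Region {
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private boolean closing = false; // guarded by lock

        /** Client accessors take the read lock, then check the flag. */
        String get(String row) {
            lock.readLock().lock();
            try {
                if (closing) {
                    throw new NotServingRegionException("region is closing");
                }
                return "value-of-" + row;
            } finally {
                lock.readLock().unlock();
            }
        }

        /** The write lock is held only long enough to flip the flag. */
        void markClosing() {
            lock.writeLock().lock();
            try {
                closing = true;
            } finally {
                lock.writeLock().unlock();
            }
        }
    }

    static void split(Region parent) {
        parent.markClosing();
        // Lock already released: META edits (RPCs, possibly to this same server) go here,
        // and any handler that was waiting on the lock now fails fast with NSRE.
        assert !parent.lock.isWriteLocked();
    }
}
```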


This addresses bug HBASE-2964.
    http://issues.apache.org/jira/browse/HBASE-2964


Diffs
-----

  src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java 3507c0d 

Diff: http://review.cloudera.org/r/798/diff


Testing
-------

YCSB testing on my cluster - it used to deadlock due to this bug within an hour. I ran a 5 hour load test overnight and it worked OK.


Thanks,

Todd




[jira] Commented: (HBASE-2964) Deadlock when RS tries to RPC to itself inside SplitTransaction

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906926#action_12906926 ] 

HBase Review Board commented on HBASE-2964:
-------------------------------------------

Message from: stack@duboce.net

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/798/#review1110
-----------------------------------------------------------



src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
<http://review.cloudera.org/r/798/#comment3770>

    Let me make a version of this patch that takes care of rollback: currently rollback expects the lock to be held on entry, which will no longer be the case after close if the above is applied.


- stack





[jira] Commented: (HBASE-2964) Deadlock when RS tries to RPC to itself inside SplitTransaction

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907303#action_12907303 ] 

HBase Review Board commented on HBASE-2964:
-------------------------------------------

Message from: stack@duboce.net


bq.  On 2010-09-07 18:33:16, Todd Lipcon wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java, line 207
bq.  > <http://review.cloudera.org/r/798/diff/2/?file=11132#file11132line207>
bq.  >
bq.  >     maybe now we can do an:
bq.  >     
bq.  >     assert !this.parent.lock.writeLock().isHeldByCurrentThread() : "Unsafe to hold write lock while performing RPCs";

I'll add in this assert


- stack


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/798/#review1122
-----------------------------------------------------------





[jira] Commented: (HBASE-2964) Deadlock when RS tries to RPC to itself inside SplitTransaction

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906667#action_12906667 ] 

stack commented on HBASE-2964:
------------------------------

I agree this is a blocker for 0.90.x.

[jira] Resolved: (HBASE-2964) Deadlock when RS tries to RPC to itself inside SplitTransaction

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-2964.
--------------------------

     Hadoop Flags: [Reviewed]
    Fix Version/s: 0.90.0
       Resolution: Fixed

Thanks for the review and for testing, Todd. (Applied to TRUNK and to the 0.89.20100830 branch.)

[jira] Commented: (HBASE-2964) Deadlock when RS tries to RPC to itself inside SplitTransaction

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906631#action_12906631 ] 

Todd Lipcon commented on HBASE-2964:
------------------------------------

Fixing this is a little tricky. We could short-circuit the IPC path when we detect that a region is hosted in the same process, and thus avoid going through the handlers (this is what the datanode does in the block recovery code). However, you can still have a situation where two regionservers are trying to talk to each other and end up in a deadlock.
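
A rough sketch of the short-circuit idea. It assumes the server knows its own address and can hand out its in-process implementation; RegionServerApi, LocalServer, connect and remoteProxy are hypothetical names, and the address string in the usage is made up. When the target address is the local one, the caller invokes the implementation directly, so no IPC handler thread is needed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ShortCircuitSketch {
    /** Hypothetical RPC interface for editing META rows. */
    interface RegionServerApi {
        void put(String row, String value);
    }

    /** Stand-in for the in-process implementation. */
    static class LocalServer implements RegionServerApi {
        final Map<String, String> rows = new ConcurrentHashMap<>();

        public void put(String row, String value) {
            rows.put(row, value);
        }
    }

    private final String selfAddress;
    private final RegionServerApi self;

    ShortCircuitSketch(String selfAddress, RegionServerApi self) {
        this.selfAddress = selfAddress;
        this.self = self;
    }

    /**
     * If the target is this process, return the implementation itself (a plain method
     * call, no handler thread); otherwise fall back to a hypothetical remote proxy.
     */
    RegionServerApi connect(String address) {
        if (address.equals(selfAddress)) {
            return self;
        }
        return remoteProxy(address);
    }

    private RegionServerApi remoteProxy(String address) {
        // Placeholder: a real client would build an RPC proxy here.
        throw new UnsupportedOperationException("remote RPC not modelled: " + address);
    }
}
```

This only removes the self-RPC case; as the comment points out, two regionservers whose handlers are all blocked on each other can still deadlock.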

Another option is to add a timeout to these RPCs, abort the split and try again later if it fails.
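
A sketch of the timeout-and-abort option, assuming the META edit can be run as a task on some executor. TimedSplitSketch, trySplit and rollback are hypothetical names. Future.get with a timeout bounds the wait; on timeout the task is cancelled and the split is rolled back so it can be retried later.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedSplitSketch {
    boolean rolledBack = false;

    /**
     * Runs the META edit with a bounded wait. On timeout or failure the split is
     * abandoned and rolled back; the caller can schedule a retry later.
     */
    boolean trySplit(ExecutorService rpcPool, Runnable metaEdit, long timeoutMs) {
        Future<?> f = rpcPool.submit(metaEdit);
        try {
            f.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            f.cancel(true); // interrupt the stuck RPC if it is running
            rollback();
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            f.cancel(true);
            rollback();
            return false;
        }
    }

    void rollback() {
        // Hypothetical: reopen the parent region, clean up daughter references, etc.
        rolledBack = true;
    }
}
```

The timeout value is a tuning knob: too short, and slow but healthy META updates would abort splits needlessly.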

Another thing that might help is to have the start of the split transaction flag the table as "going offline". Before taking the read lock, other accessors of the table can check for this flag and immediately throw NSRE rather than blocking while the split is in progress.
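
A sketch of the fail-fast flag, assuming a volatile flag that the split sets before it starts. FailFastSketch, goingOffline and startSplit are hypothetical names, and the flag is modelled on a single object rather than on a whole table. Accessors check the flag before taking the read lock, so a handler thread is freed immediately instead of parking on the lock.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

class FailFastSketch {
    /** Local stand-in for HBase's NotServingRegionException (NSRE). */
    static class NotServingRegionException extends RuntimeException {
        NotServingRegionException(String msg) { super(msg); }
    }

    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // Hypothetical flag; volatile so readers see it without taking the lock.
    volatile boolean goingOffline = false;

    void startSplit() {
        goingOffline = true;
    }

    String get(String row) {
        // Check before blocking: throw NSRE right away instead of waiting on the lock.
        if (goingOffline) {
            throw new NotServingRegionException("region going offline");
        }
        lock.readLock().lock();
        try {
            return "value-of-" + row;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

This narrows the window rather than closing it: a handler that passed the check just before the flag was set can still block on the lock.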

[jira] Commented: (HBASE-2964) Deadlock when RS tries to RPC to itself inside SplitTransaction

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907052#action_12907052 ] 

HBase Review Board commented on HBASE-2964:
-------------------------------------------

Message from: "Todd Lipcon" <to...@cloudera.com>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/798/#review1122
-----------------------------------------------------------


Seems to make sense. Let me try it on a cluster before I +1 it.


src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
<http://review.cloudera.org/r/798/#comment3823>

    maybe now we can do an:
    
    assert !this.parent.lock.writeLock().isHeldByCurrentThread() : "Unsafe to hold write lock while performing RPCs";


- Todd
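
A sketch of how the proposed assertion could be factored out. It assumes the region's lock is a java.util.concurrent.locks.ReentrantReadWriteLock, which the writeLock().isHeldByCurrentThread() call suggests; RpcGuardSketch, safeToRpc and editMeta are hypothetical names. Java assertions are evaluated only when the JVM runs with -ea, so the check costs nothing in production but catches regressions in tests.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

class RpcGuardSketch {
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    /** True when the calling thread does not hold this region's write lock. */
    boolean safeToRpc() {
        return !lock.writeLock().isHeldByCurrentThread();
    }

    void editMeta() {
        // Same check as in the review comment; evaluated only with -ea.
        assert safeToRpc() : "Unsafe to hold write lock while performing RPCs";
        // ... hypothetical META update RPC ...
    }
}
```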





[jira] Commented: (HBASE-2964) Deadlock when RS tries to RPC to itself inside SplitTransaction

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906819#action_12906819 ] 

Todd Lipcon commented on HBASE-2964:
------------------------------------

Overnight test completed OK with that patch. I think we should rebuild the rc with this if Stack thinks it looks good.

[jira] Assigned: (HBASE-2964) Deadlock when RS tries to RPC to itself inside SplitTransaction

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack reassigned HBASE-2964:
----------------------------

    Assignee: Todd Lipcon

[jira] Commented: (HBASE-2964) Deadlock when RS tries to RPC to itself inside SplitTransaction

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906963#action_12906963 ] 

HBase Review Board commented on HBASE-2964:
-------------------------------------------

Message from: stack@duboce.net

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/798/
-----------------------------------------------------------

(Updated 2010-09-07 13:38:39.968517)


Review request for hbase and stack.


Changes
-------

This version removes the taking of this.parent.lock from SplitTransaction completely. It's not needed: down in the parent close, it takes out the write lock.

In the past, we had a split lock and a close lock (splitLock and splitsAndClosesLock). The split lock was held across the split, both while daughter regions were calculated and during the close, the actual split, and the .META. update. As part of lock pruning, an error was made in HBASE-2641: splitsAndClosesLock was used where splitLock had been used previously, even expanding the scope of what splitLock used to cover.

Looking at it, splitLock could have served some purpose in preventing two threads from contending over a split (splits create objects in the filesystem and move things around), but we don't really need this in current HBase, since only CompactSplitThread runs splits, even in the new master regime where a client can call splitRegion. Later, when we want to run multiple concurrent split transactions, we'll need to re-examine this.


Summary
-------

Moves all RPCs outside of the region writeLock - the writeLock is now only used long enough to set the 'closing' flag. When we drop the lock any waiters will see 'closing' upon acquiring the lock, and thus throw NSRE.

In the case that we abort the split, it will reopen the region as before. Accessors will have gotten NSRE but will just come back to the same region eventually.


This addresses bug HBASE-2964.
    http://issues.apache.org/jira/browse/HBASE-2964


Diffs (updated)
-----

  src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java a692125 
  src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java 3507c0d 
  src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransaction.java a245d97 

Diff: http://review.cloudera.org/r/798/diff


Testing
-------

YCSB testing on my cluster - it used to deadlock due to this bug within an hour. I ran a 5 hour load test overnight and it worked OK.


Thanks,

Todd




[jira] Updated: (HBASE-2964) Deadlock when RS tries to RPC to itself inside SplitTransaction

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HBASE-2964:
-------------------------------

    Attachment: hbase-2964.txt

I also had to move the "new HTable" call outside of the lock, since the HTable constructor does an RPC.

This patch seems to fix the issue for me. Running an overnight load test - if it's still going in the morning I'd say we're good :)

[jira] Commented: (HBASE-2964) Deadlock when RS tries to RPC to itself inside SplitTransaction

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906675#action_12906675 ] 

Todd Lipcon commented on HBASE-2964:
------------------------------------

As noted on the list, this seems to be due to HBASE-2461.

Prior to 2461, when we split, we would close the region before doing any of the writes to META, and didn't hold any locks while doing the META updates. Now we keep the write lock all the way through, even after closing the region.

I think simply moving the writeLock().unlock() call up to just after this.parent.close(false) in SplitTransaction should fix this issue. I'm testing that change on my test cluster now.
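
A minimal before/after sketch of the proposed reordering. It assumes close() and editMeta() stand in for this.parent.close(false) and the META updates; UnlockAfterCloseSketch and its log are hypothetical. Each "RPC" records whether the current thread still held the write lock when it ran.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class UnlockAfterCloseSketch {
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    final List<String> log = new ArrayList<>();

    void close() {
        log.add("close"); // stands in for this.parent.close(false)
    }

    void editMeta() {
        // Records whether the write lock was still held when the "RPC" ran.
        log.add(lock.isWriteLockedByCurrentThread() ? "meta-under-lock" : "meta-unlocked");
    }

    /** Problematic ordering: META RPCs happen while the write lock is still held. */
    void splitHoldingLock() {
        lock.writeLock().lock();
        try {
            close();
            editMeta();
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Proposed ordering: release the lock right after close, then do the RPCs. */
    void splitUnlockAfterClose() {
        lock.writeLock().lock();
        try {
            close();
        } finally {
            lock.writeLock().unlock();
        }
        editMeta();
    }
}
```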

[jira] Commented: (HBASE-2964) Deadlock when RS tries to RPC to itself inside SplitTransaction

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906943#action_12906943 ] 

stack commented on HBASE-2964:
------------------------------

Hmmm... now I'm thinking instead that we punt locking up here in SplitTransaction completely. The core issue comes from an incorrect mapping of the old splitLock onto the new region 'lock'. Looking at what was done under the old splitLock, it all looks safe in the face of concurrency. Down in the region close, it's already taking out the region write lock. Let me make a different kind of patch.
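
A sketch of the alternative described here. It assumes Region.close() acquires and releases the write lock itself, so the split code takes no region lock at all; NoSplitLockSketch, Region and execute are hypothetical names. By the time the META edits run, the region's write lock is not held.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

class NoSplitLockSketch {
    /** Hypothetical region whose close() manages the write lock itself. */
    static class Region {
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        boolean closed = false; // guarded by lock

        void close() {
            lock.writeLock().lock();
            try {
                closed = true; // flush, close stores, etc. would happen here
            } finally {
                lock.writeLock().unlock();
            }
        }
    }

    /** Hypothetical split that takes no region lock of its own. */
    static boolean execute(Region parent) {
        // Daughter calculation would go here; per the discussion, the work formerly
        // done under splitLock looks safe without a lock.
        parent.close();
        // META edits would go here: the region lock is not held, so a self-RPC
        // cannot deadlock on it. Return whether that is actually the case.
        return !parent.lock.isWriteLocked();
    }
}
```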
