Posted to dev@hbase.apache.org by Todd Lipcon <to...@cloudera.com> on 2010/09/07 19:46:38 UTC

Review Request: Fix RPC deadlock when splitting regions on same RS as meta under heavy load

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/798/
-----------------------------------------------------------

Review request for hbase and stack.


Summary
-------

Moves all RPCs outside of the region writeLock - the writeLock is now only used long enough to set the 'closing' flag. When we drop the lock any waiters will see 'closing' upon acquiring the lock, and thus throw NSRE.

In the case that we abort the split, it will reopen the region as before. Accessors will have gotten NSRE but will just come back to the same region eventually.
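
As an illustration only, the pattern described above might look roughly like the following. This is a minimal sketch, not the code in the patch: SketchRegion, NotServingRegionSketchException, markClosing, reopen, get and split are hypothetical names standing in for the real HRegion/SplitTransaction logic, and the Runnable stands in for whatever RPCs the split performs.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a region; NOT the real HRegion class.
class SketchRegion {
    // Hypothetical stand-in for HBase's NotServingRegionException (NSRE).
    static class NotServingRegionSketchException extends RuntimeException {
        NotServingRegionSketchException(String msg) { super(msg); }
    }

    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private boolean closing = false; // guarded by lock

    // Hold the write lock only long enough to set the 'closing' flag.
    void markClosing() {
        lock.writeLock().lock();
        try {
            closing = true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Abort path: clear the flag so the region serves requests again.
    void reopen() {
        lock.writeLock().lock();
        try {
            closing = false;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Accessors take the read lock, then check the flag. Anyone who was
    // waiting on the lock sees 'closing' once they get in, and fails fast.
    String get(String row) {
        lock.readLock().lock();
        try {
            if (closing) {
                throw new NotServingRegionSketchException("region is closing");
            }
            return "value-for-" + row;
        } finally {
            lock.readLock().unlock();
        }
    }

    // Split: flip the flag under the lock, then do the RPCs with no lock held.
    void split(Runnable rpcs) {
        markClosing();
        rpcs.run(); // runs outside the write lock
    }
}
```

In this sketch a caller that hits the exception is expected to retry later; if the split was aborted and reopen() was called, the retry lands on the same region.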


This addresses bug HBASE-2964.
    http://issues.apache.org/jira/browse/HBASE-2964


Diffs
-----

  src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java 3507c0d 

Diff: http://review.cloudera.org/r/798/diff


Testing
-------

YCSB testing on my cluster - it used to deadlock due to this bug within an hour. I ran a 5 hour load test overnight and it worked OK.


Thanks,

Todd

Re: Review Request: Fix RPC deadlock when splitting regions on same RS as meta under heavy load

Posted by st...@duboce.net.


> On 2010-09-07 18:33:16, Todd Lipcon wrote:
> > src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java, line 207
> > <http://review.cloudera.org/r/798/diff/2/?file=11132#file11132line207>
> >
> >     maybe now we can do an:
> >     
> >     assert !this.parent.lock.writeLock().isHeldByCurrentThread() : "Unsafe to hold write lock while performing RPCs";

I'll add this assert.


- stack


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/798/#review1122
-----------------------------------------------------------


On 2010-09-07 13:38:39, Todd Lipcon wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://review.cloudera.org/r/798/
> -----------------------------------------------------------
> 
> (Updated 2010-09-07 13:38:39)
> 
> 
> Review request for hbase and stack.
> 
> 
> Summary
> -------
> 
> Moves all RPCs outside of the region writeLock - the writeLock is now only used long enough to set the 'closing' flag. When we drop the lock any waiters will see 'closing' upon acquiring the lock, and thus throw NSRE.
> 
> In the case that we abort the split, it will reopen the region as before. Accessors will have gotten NSRE but will just come back to the same region eventually.
> 
> 
> This addresses bug HBASE-2964.
>     http://issues.apache.org/jira/browse/HBASE-2964
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java a692125 
>   src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java 3507c0d 
>   src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransaction.java a245d97 
> 
> Diff: http://review.cloudera.org/r/798/diff
> 
> 
> Testing
> -------
> 
> YCSB testing on my cluster - it used to deadlock due to this bug within an hour. I ran a 5 hour load test overnight and it worked OK.
> 
> 
> Thanks,
> 
> Todd
> 
>

Re: Review Request: Fix RPC deadlock when splitting regions on same RS as meta under heavy load

Posted by Todd Lipcon <to...@cloudera.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/798/#review1122
-----------------------------------------------------------


Seems to make sense. Let me try it on a cluster before I +1 it


src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
<http://review.cloudera.org/r/798/#comment3823>

    maybe now we can do an:
    
    assert !this.parent.lock.writeLock().isHeldByCurrentThread() : "Unsafe to hold write lock while performing RPCs";
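
    For illustration, a guard like this could be wrapped in a small helper. This is a sketch under assumptions: RpcGuardSketch and runRpcOutsideWriteLock are hypothetical names, not anything in the patch, and the lock is assumed to be a java.util.concurrent.locks.ReentrantReadWriteLock (which is what makes writeLock().isHeldByCurrentThread() available). Note that a Java assert only fires when the JVM runs with assertions enabled (-ea).

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical helper; not part of the patch under review.
class RpcGuardSketch {
    // Runs the given RPC after checking that the calling thread does not
    // hold the write lock. The check is a no-op unless the JVM runs with -ea.
    static void runRpcOutsideWriteLock(ReentrantReadWriteLock lock, Runnable rpc) {
        assert !lock.writeLock().isHeldByCurrentThread()
            : "Unsafe to hold write lock while performing RPCs";
        rpc.run();
    }
}
```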


- Todd



Re: Review Request: Fix RPC deadlock when splitting regions on same RS as meta under heavy load

Posted by st...@duboce.net.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/798/
-----------------------------------------------------------

(Updated 2010-09-07 13:38:39.968517)

Review request for hbase and stack.

Changes
-------

This version removes the taking of this.parent.lock from SplitTransaction completely. It's not needed: the parent close takes out the write lock itself.

In the past, we had a split lock and a close lock (splitLock and splitsAndClosesLock). The split lock was held across the split: while daughter regions were calculated, and during the close, the actual split, and the update of .META. As part of the lock pruning in HBASE-2641, an error was made: splitsAndClosesLock was used where splitLock had been used previously, and its scope was even expanded beyond what splitLock used to cover.

Looking at it, splitLock could have served some purpose in preventing two threads from contending over a split (splits make objects in the filesystem and move stuff around), but we don't really need it in current HBase since only CompactSplitThread runs splits, even in the new master regime where a client can call splitRegion. Later, when we want to run multiple concurrent split transactions, we'll need to reexamine this.

Summary
-------

Moves all RPCs outside of the region writeLock - the writeLock is now only used long enough to set the 'closing' flag. When we drop the lock any waiters will see 'closing' upon acquiring the lock, and thus throw NSRE.

In the case that we abort the split, it will reopen the region as before. Accessors will have gotten NSRE but will just come back to the same region eventually.

This addresses bug HBASE-2964.
http://issues.apache.org/jira/browse/HBASE-2964

Diffs (updated)
-----

src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java a692125
src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java 3507c0d
src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransaction.java a245d97

Diff: http://review.cloudera.org/r/798/diff

Testing
-------

YCSB testing on my cluster - it used to deadlock due to this bug within an hour. I ran a 5 hour load test overnight and it worked OK.

Thanks,

Todd

Re: Review Request: Fix RPC deadlock when splitting regions on same RS as meta under heavy load

Posted by st...@duboce.net.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/798/#review1110
-----------------------------------------------------------



src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
<http://review.cloudera.org/r/798/#comment3770>

    Let me make a version of this patch that takes care of rollback -- currently rollback expects the lock to be held on entry; this will no longer be the case after close if the above is applied.


- stack

