Posted to issues@hbase.apache.org by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org> on 2011/05/17 20:21:47 UTC

[jira] [Created] (HBASE-3893) HRegion.internalObtainRowLock shouldn't wait forever

HRegion.internalObtainRowLock shouldn't wait forever
----------------------------------------------------

                 Key: HBASE-3893
                 URL: https://issues.apache.org/jira/browse/HBASE-3893
             Project: HBase
          Issue Type: Improvement
    Affects Versions: 0.90.2
            Reporter: Jean-Daniel Cryans
            Priority: Blocker
             Fix For: 0.90.4


We just had a weird episode where one user was trying to insert a lot of data with overlapping keys into a single region (all of that is a separate problem), and the region server rapidly filled up all its handlers and queues with those calls. It wasn't quite deadlocked, but close.

Worse, now that we have a 60-second socket timeout, the clients were eventually hitting the timeout and then retrying another call to the same region server.

We should have a timeout on lockedRows.wait() in HRegion.internalObtainRowLock so that we survive this situation better.
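
To make the proposal concrete, here is a minimal sketch of a bounded wait. The shape is simplified (the real method hands back a lock id, and release calls lockedRows.notifyAll()); the 30-second figure is only an example:

{code}
// Minimal sketch of a bounded row-lock wait; names mirror HRegion but the
// method is simplified to return a boolean instead of a lock id.
// Requires java.util.SortedSet/TreeSet and org.apache.hadoop.hbase.util.Bytes.
private final SortedSet<byte[]> lockedRows =
    new TreeSet<byte[]>(Bytes.BYTES_COMPARATOR);
private final int rowLockWaitDuration = 30000; // example timeout, in ms

private boolean internalObtainRowLock(final byte[] row) {
  synchronized (lockedRows) {
    long deadline = System.currentTimeMillis() + rowLockWaitDuration;
    while (lockedRows.contains(row)) {
      long remaining = deadline - System.currentTimeMillis();
      if (remaining <= 0) {
        return false; // give up instead of parking the handler forever
      }
      try {
        // Today this is lockedRows.wait() with no timeout; release calls
        // lockedRows.notifyAll() to wake the waiters.
        lockedRows.wait(remaining);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    lockedRows.add(row);
    return true;
  }
}
{code}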

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-3893) HRegion.internalObtainRowLock shouldn't wait forever

Posted by "Dave Latham (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Latham updated HBASE-3893:
-------------------------------

    Attachment: concurrentRowLocks.patch

Here's a patch I've been playing with.

In the existing code, each time any lock is released, all 20 threads trying to acquire a lock wake up, contend for the monitor, and check for their lock in that TreeSet (15 byte[] comparisons), whether or not their particular row was unlocked.

This patch replaces the set with a concurrent hash map. To use one, we must wrap the byte array in another object that gives it a hash identity based on its contents rather than its instance. However, every row lock already creates a couple of objects (the Integer lockId, as well as the tree node), so the object creation overhead is worth it.

The patch also wakes a thread only when its particular row is unlocked, as sketched below.
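
Roughly, the idea looks like this. This is a sketch only; the class and method names are invented here and are not taken from the attached patch:

{code}
// Sketch: a content-hashed byte[] wrapper plus a per-row latch, so that an
// unlock wakes only the waiters for that row. Timeout bookkeeping across
// retries is simplified. Uses java.util.Arrays and java.util.concurrent's
// ConcurrentHashMap, CountDownLatch and TimeUnit.
final class RowKey {
  private final byte[] bytes;
  private final int hash;
  RowKey(byte[] bytes) {
    this.bytes = bytes;
    this.hash = Arrays.hashCode(bytes); // identity from contents, not instance
  }
  @Override public boolean equals(Object o) {
    return o instanceof RowKey && Arrays.equals(bytes, ((RowKey) o).bytes);
  }
  @Override public int hashCode() { return hash; }
}

final ConcurrentHashMap<RowKey, CountDownLatch> lockedRows =
    new ConcurrentHashMap<RowKey, CountDownLatch>();

boolean lockRow(byte[] row, long timeoutMs) throws InterruptedException {
  RowKey key = new RowKey(row);
  CountDownLatch myLatch = new CountDownLatch(1);
  for (;;) {
    CountDownLatch held = lockedRows.putIfAbsent(key, myLatch);
    if (held == null) return true; // nobody held the row; we do now
    // Block only on this row's latch; unrelated unlocks don't wake us.
    if (!held.await(timeoutMs, TimeUnit.MILLISECONDS)) return false;
  }
}

void unlockRow(byte[] row) {
  CountDownLatch latch = lockedRows.remove(new RowKey(row));
  if (latch != null) latch.countDown(); // wakes only this row's waiters
}
{code}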

Some further considerations:
 - On release, should we throw an exception if the client attempts to release a lock id that doesn't exist, or just log it?
 - Do we really need to generate lock ids? Can we trust HBase client implementations not to allow arbitrary lock releases? If not, then for locks that are acquired and released only internally to the region server, we should still be able to trust that code to use the row key rather than generate another lock id.
 - When an HRegion is doing a miniBatch of thousands of rows, is it really best to attempt to acquire thousands of locks and hold them all while doing the write? That one is probably a separate JIRA.

This patch has not yet been tested, but I wanted to put it up for discussion since other people are looking at the issue.


> HRegion.internalObtainRowLock shouldn't wait forever
> ----------------------------------------------------
>
>                 Key: HBASE-3893
>                 URL: https://issues.apache.org/jira/browse/HBASE-3893
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.2
>            Reporter: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: concurrentRowLocks.patch, regionserver_rowLock_set_contention.threads.txt
>
>
> We just had a weird episode where one user was trying to insert a lot of data with overlapping keys into a single region (all of that is a separate problem), and the region server rapidly filled up all its handlers and queues with those calls. It wasn't quite deadlocked, but close.
> Worse, now that we have a 60-second socket timeout, the clients were eventually hitting the timeout and then retrying another call to the same region server.
> We should have a timeout on lockedRows.wait() in HRegion.internalObtainRowLock so that we survive this situation better.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-3893) HRegion.internalObtainRowLock shouldn't wait forever

Posted by "Dave Latham (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Latham updated HBASE-3893:
-------------------------------

    Attachment:     (was: concurrentRowLocks.patch)

> HRegion.internalObtainRowLock shouldn't wait forever
> ----------------------------------------------------
>
>                 Key: HBASE-3893
>                 URL: https://issues.apache.org/jira/browse/HBASE-3893
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.2
>            Reporter: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>
> We just had a weird episode where one user was trying to insert a lot of data with overlapping keys into a single region (all of that is a separate problem), and the region server rapidly filled up all its handlers and queues with those calls. It wasn't quite deadlocked, but close.
> Worse, now that we have a 60-second socket timeout, the clients were eventually hitting the timeout and then retrying another call to the same region server.
> We should have a timeout on lockedRows.wait() in HRegion.internalObtainRowLock so that we survive this situation better.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3893) HRegion.internalObtainRowLock shouldn't wait forever

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061666#comment-13061666 ] 

Hudson commented on HBASE-3893:
-------------------------------

Integrated in HBase-TRUNK #2011 (See [https://builds.apache.org/job/HBase-TRUNK/2011/])
    HBASE-3893  HRegion.internalObtainRowLock shouldn't wait forever

tedyu : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/trunk/CHANGES.txt


> HRegion.internalObtainRowLock shouldn't wait forever
> ----------------------------------------------------
>
>                 Key: HBASE-3893
>                 URL: https://issues.apache.org/jira/browse/HBASE-3893
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.2
>            Reporter: Jean-Daniel Cryans
>            Assignee: Ted Yu
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: 3893-v2.txt, 3893.txt
>
>
> We just had a weird episode where one user was trying to insert a lot of data with overlapping keys into a single region (all of that is a separate problem), and the region server rapidly filled up all its handlers and queues with those calls. It wasn't quite deadlocked, but close.
> Worse, now that we have a 60-second socket timeout, the clients were eventually hitting the timeout and then retrying another call to the same region server.
> We should have a timeout on lockedRows.wait() in HRegion.internalObtainRowLock so that we survive this situation better.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-3893) HRegion.internalObtainRowLock shouldn't wait forever

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-3893:
--------------------------------------

    Comment: was deleted

(was: We've also run into this issue a couple of times.  I'm attaching a sample thread dump.

I examined a heap dump as well, and saw about 160K locks in the TreeSet of row locks.)

> HRegion.internalObtainRowLock shouldn't wait forever
> ----------------------------------------------------
>
>                 Key: HBASE-3893
>                 URL: https://issues.apache.org/jira/browse/HBASE-3893
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.2
>            Reporter: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>
> We just had a weird episode where one user was trying to insert a lot of data with overlapping keys into a single region (all of that is a separate problem), and the region server rapidly filled up all its handlers and queues with those calls. It wasn't quite deadlocked, but close.
> Worse, now that we have a 60-second socket timeout, the clients were eventually hitting the timeout and then retrying another call to the same region server.
> We should have a timeout on lockedRows.wait() in HRegion.internalObtainRowLock so that we survive this situation better.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3893) HRegion.internalObtainRowLock shouldn't wait forever

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061625#comment-13061625 ] 

Ted Yu commented on HBASE-3893:
-------------------------------

Integrated to branch and TRUNK.

Thanks for the review Stack.

> HRegion.internalObtainRowLock shouldn't wait forever
> ----------------------------------------------------
>
>                 Key: HBASE-3893
>                 URL: https://issues.apache.org/jira/browse/HBASE-3893
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.2
>            Reporter: Jean-Daniel Cryans
>            Assignee: Ted Yu
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: 3893-v2.txt, 3893.txt
>
>
> We just had a weird episode where one user was trying to insert a lot of data with overlapping keys into a single region (all of that is a separate problem), and the region server rapidly filled up all its handlers and queues with those calls. It wasn't quite deadlocked, but close.
> Worse, now that we have a 60-second socket timeout, the clients were eventually hitting the timeout and then retrying another call to the same region server.
> We should have a timeout on lockedRows.wait() in HRegion.internalObtainRowLock so that we survive this situation better.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-3893) HRegion.internalObtainRowLock shouldn't wait forever

Posted by "Dave Latham (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Latham updated HBASE-3893:
-------------------------------

    Attachment: regionserver_rowLock_set_contention.threads.txt

We've also run into this issue a couple of times.  I'm attaching a sample thread dump.

I examined a heap dump as well, and saw about 160K locks in the TreeSet of row locks.

> HRegion.internalObtainRowLock shouldn't wait forever
> ----------------------------------------------------
>
>                 Key: HBASE-3893
>                 URL: https://issues.apache.org/jira/browse/HBASE-3893
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.2
>            Reporter: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: regionserver_rowLock_set_contention.threads.txt
>
>
> We just had a weird episode where one user was trying to insert a lot of data with overlapping keys into a single region (all of that is a separate problem), and the region server rapidly filled up all its handlers and queues with those calls. It wasn't quite deadlocked, but close.
> Worse, now that we have a 60-second socket timeout, the clients were eventually hitting the timeout and then retrying another call to the same region server.
> We should have a timeout on lockedRows.wait() in HRegion.internalObtainRowLock so that we survive this situation better.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-3893) HRegion.internalObtainRowLock shouldn't wait forever

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-3893:
--------------------------

    Release Note: 
A new configuration property, hbase.rowlock.wait.duration, has been introduced; it controls how long, in milliseconds, to wait when acquiring a row lock.
If the row lock cannot be acquired within this duration, no row lock is obtained.
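
For example, the default could be overridden programmatically (the 10-second value here is arbitrary, for illustration; the property can equally be set in hbase-site.xml):

{code}
// Hypothetical example: shorten the row-lock wait to 10 seconds.
org.apache.hadoop.conf.Configuration conf =
    org.apache.hadoop.hbase.HBaseConfiguration.create();
conf.setInt("hbase.rowlock.wait.duration", 10000); // value is in milliseconds
{code}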

> HRegion.internalObtainRowLock shouldn't wait forever
> ----------------------------------------------------
>
>                 Key: HBASE-3893
>                 URL: https://issues.apache.org/jira/browse/HBASE-3893
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.2
>            Reporter: Jean-Daniel Cryans
>            Assignee: Ted Yu
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: 3893-v2.txt, 3893.txt
>
>
> We just had a weird episode where one user was trying to insert a lot of data with overlapping keys into a single region (all of that is a separate problem), and the region server rapidly filled up all its handlers and queues with those calls. It wasn't quite deadlocked, but close.
> Worse, now that we have a 60-second socket timeout, the clients were eventually hitting the timeout and then retrying another call to the same region server.
> We should have a timeout on lockedRows.wait() in HRegion.internalObtainRowLock so that we survive this situation better.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3893) HRegion.internalObtainRowLock shouldn't wait forever

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034975#comment-13034975 ] 

Jean-Daniel Cryans commented on HBASE-3893:
-------------------------------------------

I deleted Dave's comments since he opened HBASE-3894.

> HRegion.internalObtainRowLock shouldn't wait forever
> ----------------------------------------------------
>
>                 Key: HBASE-3893
>                 URL: https://issues.apache.org/jira/browse/HBASE-3893
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.2
>            Reporter: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>
> We just had a weird episode where one user was trying to insert a lot of data with overlapping keys into a single region (all of that is a separate problem), and the region server rapidly filled up all its handlers and queues with those calls. It wasn't quite deadlocked, but close.
> Worse, now that we have a 60-second socket timeout, the clients were eventually hitting the timeout and then retrying another call to the same region server.
> We should have a timeout on lockedRows.wait() in HRegion.internalObtainRowLock so that we survive this situation better.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-3893) HRegion.internalObtainRowLock shouldn't wait forever

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-3893:
--------------------------------------

    Comment: was deleted

(was: Here's a patch I've been playing with.

In the existing code, each time any lock is released, all 20 threads trying to acquire a lock wake up, contend for the monitor, and check for their lock in that TreeSet (15 byte[] comparisons), whether or not their particular row was unlocked.

This patch replaces the set with a concurrent hash map. To use one, we must wrap the byte array in another object that gives it a hash identity based on its contents rather than its instance. However, every row lock already creates a couple of objects (the Integer lockId, as well as the tree node), so the object creation overhead is worth it.

The patch also wakes a thread only when its particular row is unlocked.

Some further considerations:
 - On release, should we throw an exception if the client attempts to release a lock id that doesn't exist, or just log it?
 - Do we really need to generate lock ids? Can we trust HBase client implementations not to allow arbitrary lock releases? If not, then for locks that are acquired and released only internally to the region server, we should still be able to trust that code to use the row key rather than generate another lock id.
 - When an HRegion is doing a miniBatch of thousands of rows, is it really best to attempt to acquire thousands of locks and hold them all while doing the write? That one is probably a separate JIRA.

This patch has not yet been tested, but I wanted to put it up for discussion since other people are looking at the issue.
)

> HRegion.internalObtainRowLock shouldn't wait forever
> ----------------------------------------------------
>
>                 Key: HBASE-3893
>                 URL: https://issues.apache.org/jira/browse/HBASE-3893
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.2
>            Reporter: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>
> We just had a weird episode where one user was trying to insert a lot of data with overlapping keys into a single region (all of that is a separate problem), and the region server rapidly filled up all its handlers and queues with those calls. It wasn't quite deadlocked, but close.
> Worse, now that we have a 60-second socket timeout, the clients were eventually hitting the timeout and then retrying another call to the same region server.
> We should have a timeout on lockedRows.wait() in HRegion.internalObtainRowLock so that we survive this situation better.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (HBASE-3893) HRegion.internalObtainRowLock shouldn't wait forever

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu resolved HBASE-3893.
---------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]

> HRegion.internalObtainRowLock shouldn't wait forever
> ----------------------------------------------------
>
>                 Key: HBASE-3893
>                 URL: https://issues.apache.org/jira/browse/HBASE-3893
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.2
>            Reporter: Jean-Daniel Cryans
>            Assignee: Ted Yu
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: 3893-v2.txt, 3893.txt
>
>
> We just had a weird episode where one user was trying to insert a lot of data with overlapping keys into a single region (all of that is a separate problem), and the region server rapidly filled up all its handlers and queues with those calls. It wasn't quite deadlocked, but close.
> Worse, now that we have a 60-second socket timeout, the clients were eventually hitting the timeout and then retrying another call to the same region server.
> We should have a timeout on lockedRows.wait() in HRegion.internalObtainRowLock so that we survive this situation better.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-3893) HRegion.internalObtainRowLock shouldn't wait forever

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-3893:
--------------------------

    Attachment: 3893.txt

Patch for TRUNK.
Running test suite for 0.90 codebase.

> HRegion.internalObtainRowLock shouldn't wait forever
> ----------------------------------------------------
>
>                 Key: HBASE-3893
>                 URL: https://issues.apache.org/jira/browse/HBASE-3893
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.2
>            Reporter: Jean-Daniel Cryans
>            Assignee: Ted Yu
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: 3893.txt
>
>
> We just had a weird episode where one user was trying to insert a lot of data with overlapping keys into a single region (all of that is a separate problem), and the region server rapidly filled up all its handlers and queues with those calls. It wasn't quite deadlocked, but close.
> Worse, now that we have a 60-second socket timeout, the clients were eventually hitting the timeout and then retrying another call to the same region server.
> We should have a timeout on lockedRows.wait() in HRegion.internalObtainRowLock so that we survive this situation better.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3893) HRegion.internalObtainRowLock shouldn't wait forever

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061605#comment-13061605 ] 

stack commented on HBASE-3893:
------------------------------

+1 on v2

> HRegion.internalObtainRowLock shouldn't wait forever
> ----------------------------------------------------
>
>                 Key: HBASE-3893
>                 URL: https://issues.apache.org/jira/browse/HBASE-3893
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.2
>            Reporter: Jean-Daniel Cryans
>            Assignee: Ted Yu
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: 3893-v2.txt, 3893.txt
>
>
> We just had a weird episode where one user was trying to insert a lot of data with overlapping keys into a single region (all of that is a separate problem), and the region server rapidly filled up all its handlers and queues with those calls. It wasn't quite deadlocked, but close.
> Worse, now that we have a 60-second socket timeout, the clients were eventually hitting the timeout and then retrying another call to the same region server.
> We should have a timeout on lockedRows.wait() in HRegion.internalObtainRowLock so that we survive this situation better.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3893) HRegion.internalObtainRowLock shouldn't wait forever

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061526#comment-13061526 ] 

stack commented on HBASE-3893:
------------------------------

Why do this:

{code}
+    this.rowLockWaitDuration = 30000;
{code}

when later you do this:

{code}
+    this.rowLockWaitDuration = conf.getInt("hbase.rowlock.wait.duration", 30000);
{code}


Otherwise, +1 on commit.


> HRegion.internalObtainRowLock shouldn't wait forever
> ----------------------------------------------------
>
>                 Key: HBASE-3893
>                 URL: https://issues.apache.org/jira/browse/HBASE-3893
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.2
>            Reporter: Jean-Daniel Cryans
>            Assignee: Ted Yu
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: 3893.txt
>
>
> We just had a weird episode where one user was trying to insert a lot of data with overlapping keys into a single region (all of that is a separate problem), and the region server rapidly filled up all its handlers and queues with those calls. It wasn't quite deadlocked, but close.
> Worse, now that we have a 60-second socket timeout, the clients were eventually hitting the timeout and then retrying another call to the same region server.
> We should have a timeout on lockedRows.wait() in HRegion.internalObtainRowLock so that we survive this situation better.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-3893) HRegion.internalObtainRowLock shouldn't wait forever

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-3893:
--------------------------

    Attachment: 3893-v2.txt

Defined a constant for the 30-second wait duration.
Since rowLockWaitDuration is declared final, it must be initialized in the ctor.
In the first ctor, conf isn't available, so I assign the constant directly.
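
That is, the shape is roughly as follows; this is a sketch, and the constant name and ctor signatures are illustrative rather than quoted from the patch:

{code}
// Sketch of the v2 arrangement: one named constant, a final field, and two
// ctors that initialize it.
public static final int DEFAULT_ROWLOCK_WAIT_DURATION = 30000; // ms

private final int rowLockWaitDuration;

// Ctor without a Configuration: fall back to the constant.
HRegion() {
  this.rowLockWaitDuration = DEFAULT_ROWLOCK_WAIT_DURATION;
}

// Ctor with a Configuration: read the property, defaulting to the constant.
HRegion(org.apache.hadoop.conf.Configuration conf /* other args elided */) {
  this.rowLockWaitDuration =
      conf.getInt("hbase.rowlock.wait.duration", DEFAULT_ROWLOCK_WAIT_DURATION);
}
{code}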

> HRegion.internalObtainRowLock shouldn't wait forever
> ----------------------------------------------------
>
>                 Key: HBASE-3893
>                 URL: https://issues.apache.org/jira/browse/HBASE-3893
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.2
>            Reporter: Jean-Daniel Cryans
>            Assignee: Ted Yu
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: 3893-v2.txt, 3893.txt
>
>
> We just had a weird episode where one user was trying to insert a lot of data with overlapping keys into a single region (all of that is a separate problem), and the region server rapidly filled up all its handlers and queues with those calls. It wasn't quite deadlocked, but close.
> Worse, now that we have a 60-second socket timeout, the clients were eventually hitting the timeout and then retrying another call to the same region server.
> We should have a timeout on lockedRows.wait() in HRegion.internalObtainRowLock so that we survive this situation better.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-3893) HRegion.internalObtainRowLock shouldn't wait forever

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-3893:
--------------------------

    Release Note: 
A new configuration property, hbase.rowlock.wait.duration, has been introduced; it controls how long, in milliseconds, to wait when acquiring a row lock.
The default value is 30 seconds.
If the row lock cannot be acquired within this duration, no row lock is obtained.

  was:
A new configuration property, hbase.rowlock.wait.duration, has been introduced; it controls how long, in milliseconds, to wait when acquiring a row lock.
If the row lock cannot be acquired within this duration, no row lock is obtained.


> HRegion.internalObtainRowLock shouldn't wait forever
> ----------------------------------------------------
>
>                 Key: HBASE-3893
>                 URL: https://issues.apache.org/jira/browse/HBASE-3893
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.2
>            Reporter: Jean-Daniel Cryans
>            Assignee: Ted Yu
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: 3893-v2.txt, 3893.txt
>
>
> We just had a weird episode where one user was trying to insert a lot of data with overlapping keys into a single region (all of that is a separate problem), and the region server rapidly filled up all its handlers and queues with those calls. It wasn't quite deadlocked, but close.
> Worse, now that we have a 60-second socket timeout, the clients were eventually hitting the timeout and then retrying another call to the same region server.
> We should have a timeout on lockedRows.wait() in HRegion.internalObtainRowLock so that we survive this situation better.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-3893) HRegion.internalObtainRowLock shouldn't wait forever

Posted by "Dave Latham (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Latham updated HBASE-3893:
-------------------------------

    Attachment:     (was: regionserver_rowLock_set_contention.threads.txt)

> HRegion.internalObtainRowLock shouldn't wait forever
> ----------------------------------------------------
>
>                 Key: HBASE-3893
>                 URL: https://issues.apache.org/jira/browse/HBASE-3893
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.2
>            Reporter: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>
> We just had a weird episode where one user was trying to insert a lot of data with overlapping keys into a single region (all of that is a separate problem), and the region server rapidly filled up all its handlers and queues with those calls. It wasn't quite deadlocked, but close.
> Worse, now that we have a 60-second socket timeout, the clients were eventually hitting the timeout and then retrying another call to the same region server.
> We should have a timeout on lockedRows.wait() in HRegion.internalObtainRowLock so that we survive this situation better.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3893) HRegion.internalObtainRowLock shouldn't wait forever

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061714#comment-13061714 ] 

Hudson commented on HBASE-3893:
-------------------------------

Integrated in HBase-TRUNK #2012 (See [https://builds.apache.org/job/HBase-TRUNK/2012/])
    HBASE-3893 account for new int field for FIXED_OVERHEAD

tedyu : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
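
The follow-up is needed because HRegion self-reports its heap footprint, and the new int field changes that fixed cost. Schematically (a sketch; EXISTING_TERMS stands in for the pre-existing sizes and is not a real constant):

{code}
// Sketch: FIXED_OVERHEAD tallies the fixed heap cost of an HRegion instance,
// so the new int field has to be counted in it.
public static final long FIXED_OVERHEAD = ClassSize.align(
    EXISTING_TERMS           // object header, references, other primitives
    + Bytes.SIZEOF_INT);     // the new rowLockWaitDuration field
{code}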


> HRegion.internalObtainRowLock shouldn't wait forever
> ----------------------------------------------------
>
>                 Key: HBASE-3893
>                 URL: https://issues.apache.org/jira/browse/HBASE-3893
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.2
>            Reporter: Jean-Daniel Cryans
>            Assignee: Ted Yu
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: 3893-v2.txt, 3893.txt
>
>
> We just had a weird episode where one user was trying to insert a lot of data with overlapping keys into a single region (all of that is a separate problem), and the region server rapidly filled up all its handlers and queues with those calls. It wasn't quite deadlocked, but close.
> Worse, now that we have a 60-second socket timeout, the clients were eventually hitting the timeout and then retrying another call to the same region server.
> We should have a timeout on lockedRows.wait() in HRegion.internalObtainRowLock so that we survive this situation better.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (HBASE-3893) HRegion.internalObtainRowLock shouldn't wait forever

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu reassigned HBASE-3893:
-----------------------------

    Assignee: Ted Yu

> HRegion.internalObtainRowLock shouldn't wait forever
> ----------------------------------------------------
>
>                 Key: HBASE-3893
>                 URL: https://issues.apache.org/jira/browse/HBASE-3893
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.2
>            Reporter: Jean-Daniel Cryans
>            Assignee: Ted Yu
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: 3893.txt
>
>
> We just had a weird episode where one user was trying to insert a lot of data with overlapping keys into a single region (all of that is a separate problem), and the region server rapidly filled up all its handlers and queues with those calls. It wasn't quite deadlocked, but close.
> Worse, now that we have a 60-second socket timeout, the clients were eventually hitting the timeout and then retrying another call to the same region server.
> We should have a timeout on lockedRows.wait() in HRegion.internalObtainRowLock so that we survive this situation better.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira