Posted to dev@hbase.apache.org by "ryan rawson (JIRA)" <ji...@apache.org> on 2009/12/01 23:21:20 UTC

[jira] Created: (HBASE-2023) Client sync block can cause 1 thread of a multi-threaded client to block all others

Client sync block can cause 1 thread of a multi-threaded client to block all others
-----------------------------------------------------------------------------------

                 Key: HBASE-2023
                 URL: https://issues.apache.org/jira/browse/HBASE-2023
             Project: Hadoop HBase
          Issue Type: Bug
    Affects Versions: 0.20.2
            Reporter: ryan rawson


Take a highly multithreaded client, processing a few thousand requests a second.  If a table goes offline, one thread will get stuck in "locateRegionInMeta" which is located inside the following sync block:

        synchronized(userRegionLock){
          return locateRegionInMeta(META_TABLE_NAME, tableName, row, useCache);
        }

So when other threads need to find a region (EVEN IF IT'S CACHED!!!), they will hit this sync block and wait.

This can become an issue on a busy Thrift server (where I first noticed the problem): one offline region can prevent access to all other regions!

Potential solution: narrow this lock, or perhaps just get rid of it completely.
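
To make the failure mode concrete, here is a minimal sketch of the pattern described above. It is not HBase code: the `Locator` class, the row names, and the latches are all hypothetical stand-ins, and the "offline region" is simulated by a latch that holds the lock until the demo releases it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class GlobalLockDemo {
    // Hypothetical stand-in for the client's region locator; not HBase code.
    static class Locator {
        private final Object userRegionLock = new Object();
        private final Map<String, String> cache = new ConcurrentHashMap<>();
        final CountDownLatch stuckLookupStarted = new CountDownLatch(1);
        final CountDownLatch regionBackOnline = new CountDownLatch(1);

        Locator() {
            cache.put("cachedRow", "server-A");
        }

        String locateRegion(String row) throws InterruptedException {
            synchronized (userRegionLock) {      // every caller funnels through this lock
                String cached = cache.get(row);
                if (cached != null) {
                    return cached;
                }
                stuckLookupStarted.countDown();
                regionBackOnline.await();        // simulates retrying against an offline region
                cache.put(row, "server-B");
                return "server-B";
            }
        }
    }

    // Returns true if a cached lookup finished while another lookup was stuck.
    static boolean cachedLookupFinishesWhileStuck(long waitMillis) {
        Locator locator = new Locator();
        CountDownLatch cachedDone = new CountDownLatch(1);
        Thread stuck = new Thread(() -> {
            try {
                locator.locateRegion("offlineRow");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread reader = new Thread(() -> {
            try {
                locator.locateRegion("cachedRow");
                cachedDone.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        try {
            stuck.start();
            locator.stuckLookupStarted.await();   // the stuck thread now holds the lock
            reader.start();
            boolean finished = cachedDone.await(waitMillis, TimeUnit.MILLISECONDS);
            locator.regionBackOnline.countDown(); // let both threads exit
            stuck.join();
            reader.join();
            return finished;
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("cached lookup finished while another thread was stuck: "
                + cachedLookupFinishesWhileStuck(200));
    }
}
```

In this sketch the cached lookup cannot finish until the stuck lookup releases the lock, so the demo reports `false`, which is the behavior the report complains about.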

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-2023) Client sync block can cause 1 thread of a multi-threaded client to block all others

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788190#action_12788190 ] 

Joydeep Sen Sarma commented on HBASE-2023:
------------------------------------------

ok - a couple of follow-on questions:

- would you advise 0.21/trunk for testing instead?
- we haven't run ZK on separate nodes yet. i just wanted to confirm whether that could be exacerbating this problem.


[jira] Commented: (HBASE-2023) Client sync block can cause 1 thread of a multi-threaded client to block all others

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806620#action_12806620 ] 

Jean-Daniel Cryans commented on HBASE-2023:
-------------------------------------------

Maybe a low-hanging fruit would be to narrow the synchronization to the table level:

{code}
synchronized(getTableLocations(tableName)){
  return locateRegionInMeta(META_TABLE_NAME, tableName, row, useCache);
}
{code}

This way you can even disable a table without stopping all requests from coming in.
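
A rough sketch of how table-level locking might behave, assuming `getTableLocations` returns one stable cache object per table that can double as that table's lock. The `Locator` class, the table names, and the latches below are hypothetical and only illustrate the lock placement.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class PerTableLockDemo {
    // Hypothetical locator; only the lock placement mirrors the suggestion above.
    static class Locator {
        private final Map<String, Map<String, String>> tableLocations = new ConcurrentHashMap<>();
        final CountDownLatch stuckLookupStarted = new CountDownLatch(1);
        final CountDownLatch tableBackOnline = new CountDownLatch(1);

        // One location cache per table; the cache object doubles as that table's lock.
        Map<String, String> getTableLocations(String tableName) {
            return tableLocations.computeIfAbsent(tableName, t -> new ConcurrentHashMap<>());
        }

        String locateRegion(String tableName, String row) throws InterruptedException {
            Map<String, String> locations = getTableLocations(tableName);
            synchronized (locations) {           // blocks only callers of the same table
                String cached = locations.get(row);
                if (cached != null) {
                    return cached;
                }
                if (tableName.equals("disabledTable")) {
                    stuckLookupStarted.countDown();
                    tableBackOnline.await();     // simulates a lookup stuck on a disabled table
                }
                String server = "server-for-" + tableName;
                locations.put(row, server);
                return server;
            }
        }
    }

    // Returns true if a lookup on another table finished while one table's lookup was stuck.
    static boolean otherTableProceedsWhileStuck(long waitMillis) {
        Locator locator = new Locator();
        CountDownLatch otherDone = new CountDownLatch(1);
        Thread stuck = new Thread(() -> {
            try {
                locator.locateRegion("disabledTable", "row1");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread other = new Thread(() -> {
            try {
                locator.locateRegion("liveTable", "row1");
                otherDone.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        try {
            stuck.start();
            locator.stuckLookupStarted.await();  // the stuck thread holds disabledTable's lock
            other.start();
            boolean finished = otherDone.await(waitMillis, TimeUnit.MILLISECONDS);
            locator.tableBackOnline.countDown(); // let the stuck thread exit
            stuck.join();
            other.join();
            return finished;
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("other table served while one table was stuck: "
                + otherTableProceedsWhileStuck(2000));
    }
}
```

Note that threads looking up rows in the same table would still queue behind each other in this sketch, even for cached rows, so it narrows the problem rather than removing it.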


[jira] Commented: (HBASE-2023) Client sync block can cause 1 thread of a multi-threaded client to block all others

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838145#action_12838145 ] 

Jean-Daniel Cryans commented on HBASE-2023:
-------------------------------------------

@Karthik

That sounds good: only one client hits the META/ROOT region at a time, without blocking the others for seconds.


[jira] Commented: (HBASE-2023) Client sync block can cause 1 thread of a multi-threaded client to block all others

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788194#action_12788194 ] 

Jean-Daniel Cryans commented on HBASE-2023:
-------------------------------------------

Well, the master rewrite code isn't in 0.21 yet; currently the main advantage in trunk is HDFS-265. WRT ZK, as long as you make sure that the quorum members aren't IO starved (e.g. they have their own disk) and there's no swap, you should be good.


[jira] Commented: (HBASE-2023) Client sync block can cause 1 thread of a multi-threaded client to block all others

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788165#action_12788165 ] 

Jean-Daniel Cryans commented on HBASE-2023:
-------------------------------------------

Yes, region splits (which take at least 6 seconds in the pre-0.21 architecture) will generate NotServingRegionException from the RS that was holding the parent of the split. So if 1 out of 10 threads (in the same JVM) goes to write to that location, it will block all threads for that time.


[jira] Commented: (HBASE-2023) Client sync block can cause 1 thread of a multi-threaded client to block all others

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788448#action_12788448 ] 

Jean-Daniel Cryans commented on HBASE-2023:
-------------------------------------------

So for this issue I see some kind of trade-off. 

 - If all threads synchronize before the method, stuff in cache won't be picked up until another thread is done looking for another row. On the plus side, the thread waiting in line could be needing the new location that will be put in the cache by the thread holding the lock.

 - If the synchronization is narrower, e.g. placed after looking up the cache, threads with a cache hit won't be blocked, but several threads looking for a location in .META. could be looking for the same row and yet will all go through that code.

 - If no synchronization, it's like the previous situation but all threads will query .META. around the same time.

I don't like putting more load on .META. and I don't like having clients waiting sometimes for nothing.
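
The extra .META. load in the unsynchronized case can be sketched as follows. This is a contrived simulation, not HBase code: the .META. query is modeled as a latch that holds every caller until all of them are in flight, which stands in for a remote call slow enough that the lookups overlap. All names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

class UnsynchronizedLookupDemo {
    // Counts simulated .META. queries when `threads` callers look up one uncached row with no lock.
    static int metaQueriesForOneRow(int threads) {
        Map<String, String> cache = new ConcurrentHashMap<>();
        AtomicInteger metaQueries = new AtomicInteger();
        CountDownLatch allInFlight = new CountDownLatch(threads);

        Runnable lookup = () -> {
            if (cache.get("row") != null) {
                return;                       // cache hit: no query needed
            }
            metaQueries.incrementAndGet();    // simulated .META. query starts
            allInFlight.countDown();
            try {
                allInFlight.await();          // models a query slow enough that all callers overlap
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            cache.put("row", "server-A");     // each caller caches the same answer
        };

        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(lookup);
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return metaQueries.get();
    }

    public static void main(String[] args) {
        System.out.println("meta queries for one row: " + metaQueriesForOneRow(10));
    }
}
```

Under that assumption every caller misses the cache and issues its own query, so ten threads produce ten queries for a single row.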


[jira] Commented: (HBASE-2023) Client sync block can cause 1 thread of a multi-threaded client to block all others

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788825#action_12788825 ] 

stack commented on HBASE-2023:
------------------------------

Can you add line numbers from the code to your comments above so I can follow along, please, J-D?  Thanks.




[jira] Commented: (HBASE-2023) Client sync block can cause 1 thread of a multi-threaded client to block all others

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788157#action_12788157 ] 

Joydeep Sen Sarma commented on HBASE-2023:
------------------------------------------

i am looking at some pauses while loading data in and trying to figure out if this is applicable. we have multiple machines loading data - each multithreaded - each thread writing to a different range. all get paused at the same time once in a while. there's no cpu/io going on in the region servers when this happens. (next time i reproduce - i will get a jstack dump on the regionservers).

can this happen on region splits? (I sure wasn't taking any table offline/online during the test).


[jira] Commented: (HBASE-2023) Client sync block can cause 1 thread of a multi-threaded client to block all others

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806629#action_12806629 ] 

stack commented on HBASE-2023:
------------------------------

I wonder if it'd be possible to do a mock regionserver implementation and then do a test that had thousands of clients in the one JVM?  The mock would then periodically hold up the lookup in locateRegionInMeta.  See how much it affects other threads?
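
One possible shape for such a test, as a sketch only: the `MockLookup` class, the thread counts, and the stall parameters are hypothetical, and the clients are plain threads sharing one lock rather than real HBase clients talking to a mock regionserver.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

class StallingLookupHarness {
    // Hypothetical mock of the remote lookup: every stallEvery-th call sleeps for stallMillis.
    static class MockLookup {
        private final int stallEvery;
        private final long stallMillis;
        private final AtomicInteger calls = new AtomicInteger();

        MockLookup(int stallEvery, long stallMillis) {
            this.stallEvery = stallEvery;
            this.stallMillis = stallMillis;
        }

        String locate(String row) throws InterruptedException {
            if (calls.incrementAndGet() % stallEvery == 0) {
                Thread.sleep(stallMillis);
            }
            return "server-for-" + row;
        }
    }

    // Runs `clients` threads, each doing `lookupsPerClient` lookups under one shared lock.
    // Returns {completed lookups, worst observed latency in milliseconds}.
    static long[] run(int clients, int lookupsPerClient, MockLookup mock) {
        Object userRegionLock = new Object();
        AtomicLong completed = new AtomicLong();
        AtomicLong worstNanos = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(clients);
        List<Future<?>> futures = new ArrayList<>();
        for (int c = 0; c < clients; c++) {
            final int clientId = c;
            futures.add(pool.submit(() -> {
                for (int i = 0; i < lookupsPerClient; i++) {
                    long start = System.nanoTime();
                    synchronized (userRegionLock) {   // the lock placement under test
                        mock.locate("row-" + clientId + "-" + i);
                    }
                    worstNanos.accumulateAndGet(System.nanoTime() - start, Math::max);
                    completed.incrementAndGet();
                }
                return null;
            }));
        }
        try {
            for (Future<?> future : futures) {
                future.get();
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return new long[] { completed.get(), worstNanos.get() / 1_000_000 };
    }

    public static void main(String[] args) {
        long[] result = run(8, 50, new MockLookup(100, 20));
        System.out.println("completed=" + result[0] + " worstLatencyMs=" + result[1]);
    }
}
```

Swapping the `synchronized` block for a different locking scheme and comparing the worst latency across runs would give a rough measure of how much one stalled lookup affects the other threads.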


[jira] Commented: (HBASE-2023) Client sync block can cause 1 thread of a multi-threaded client to block all others

Posted by "Karthik Ranganathan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838137#action_12838137 ] 

Karthik Ranganathan commented on HBASE-2023:
--------------------------------------------

Kannan and I took a look at this issue and came up with yet another possibility in addition to the 3 JD mentioned:

Move the synchronized block inside the try/catch in the retry loop, just around the getClosestRowBefore() call. Each thread then gives up the lock before sleeping to retry, which lets other threads make their calls when one particular region is offline. In addition, if useCache is true, we can look at the cache and return the region right away without ever entering the synchronized section. So the new workflow in locateRegionInMeta() will look as follows:

1. If useCache is true and the region is in the cache, return the region. If not, we have to make a remote call.
2. for the number of retries
3.   wait for lock
4.   check cache again (someone could have filled the cache while we were waiting). Return if found.
5.   make the remote call
6.   release lock
7.   return on success, otherwise usual error handling/sleep, goto 2

I can work on the fix if this sounds good to you guys.
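
The numbered workflow above could look roughly like this. It is a sketch under stated assumptions, not the committed patch: `MetaClient`, `Locator`, the retry count, and the pause are hypothetical, the remote call is reduced to a single method, and the re-check in step 4 is assumed to honor `useCache`.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class NarrowLockDemo {
    // Hypothetical stand-in for the remote .META. call made via getClosestRowBefore().
    interface MetaClient {
        String getClosestRowBefore(String row) throws IOException;
    }

    static class Locator {
        private final Object userRegionLock = new Object();
        private final Map<String, String> cache = new ConcurrentHashMap<>();
        private final MetaClient meta;
        private final int numRetries;
        private final long pauseMillis;

        Locator(MetaClient meta, int numRetries, long pauseMillis) {
            this.meta = meta;
            this.numRetries = numRetries;
            this.pauseMillis = pauseMillis;
        }

        String locateRegionInMeta(String row, boolean useCache)
                throws IOException, InterruptedException {
            if (useCache) {                                    // step 1: lock-free cache hit
                String cached = cache.get(row);
                if (cached != null) {
                    return cached;
                }
            }
            IOException lastFailure = null;
            for (int tries = 0; tries < numRetries; tries++) { // step 2
                try {
                    synchronized (userRegionLock) {            // step 3
                        if (useCache) {                        // step 4 (assumed to honor useCache)
                            String cached = cache.get(row);
                            if (cached != null) {
                                return cached;
                            }
                        }
                        String location = meta.getClosestRowBefore(row); // step 5
                        cache.put(row, location);
                        return location;                       // steps 6-7: lock released on exit
                    }
                } catch (IOException e) {
                    lastFailure = e;                           // the lock is already released here
                }
                if (tries < numRetries - 1) {
                    Thread.sleep(pauseMillis);                 // sleep without holding the lock
                }
            }
            throw new IOException("could not locate region for row " + row, lastFailure);
        }
    }

    // Two lookups of one row: the remote call fails twice, then succeeds; the second lookup hits the cache.
    static int metaCallsForTwoLookups() {
        AtomicInteger calls = new AtomicInteger();
        MetaClient flaky = row -> {
            if (calls.incrementAndGet() < 3) {
                throw new IOException("region offline");
            }
            return "server-A";
        };
        Locator locator = new Locator(flaky, 5, 10);
        try {
            locator.locateRegionInMeta("row", true);
            locator.locateRegionInMeta("row", true);
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return calls.get();
    }

    public static void main(String[] args) {
        System.out.println("remote calls for two lookups: " + metaCallsForTwoLookups());
    }
}
```

The point of the placement is that the sleep between retries happens outside the synchronized block, so a thread retrying against an offline region holds the lock only for the duration of each remote call.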



[jira] Commented: (HBASE-2023) Client sync block can cause 1 thread of a multi-threaded client to block all others

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788834#action_12788834 ] 

Jean-Daniel Cryans commented on HBASE-2023:
-------------------------------------------

Ok so in the same order:

 - This is the current situation, synchronization at line 613 of HCM.
 - Narrowing down the sync block we could put it at line 637 and cover the rest of the locateRegionInMeta method.
 - Removing the sync means getting rid of synchronized at line 613.





[jira] Resolved: (HBASE-2023) Client sync block can cause 1 thread of a multi-threaded client to block all others

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-2023.
--------------------------

       Resolution: Fixed
    Fix Version/s: 0.21.0
                   0.20.4
         Assignee: Karthik Ranganathan
     Hadoop Flags: [Reviewed]

Committed to branch and trunk.  Thanks for the patch, Karthik (I added you as a contributor, if you don't mind).

> Client sync block can cause 1 thread of a multi-threaded client to block all others
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-2023
>                 URL: https://issues.apache.org/jira/browse/HBASE-2023
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.20.2
>            Reporter: ryan rawson
>            Assignee: Karthik Ranganathan
>             Fix For: 0.20.4, 0.21.0
>
>         Attachments: HBASE-2023_0.20.3.patch