Posted to issues@hbase.apache.org by "Wayne (JIRA)" <ji...@apache.org> on 2011/01/12 18:01:51 UTC

[jira] Created: (HBASE-3438) Cluster Wide Pauses

Cluster Wide Pauses
-------------------

                 Key: HBASE-3438
                 URL: https://issues.apache.org/jira/browse/HBASE-3438
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.89.20100924
         Environment: CentOS 5.5, 10 Nodes, 24GB RAM, 4 1TB Disks, 8GB Heap
            Reporter: Wayne


Under heavy write load the entire cluster seems to pause with all nodes pausing writes/reads for several seconds at a time. This seems to be worse with larger region sizes. One possible explanation is that a single node gets caught/paused/stuck during a split and that all other nodes are waiting on that one node so it looks like a cluster wide pause.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3438) Cluster Wide Pauses

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980891#action_12980891 ] 

Jonathan Gray commented on HBASE-3438:
--------------------------------------

Can you add instrumentation in your clients?  Is there something in common with the operation they get blocked on?  (same row or same region, perhaps)
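
A minimal sketch of what such client-side instrumentation might look like, assuming a Python client wrapper. The names `OpTimer`, `timed`, and `slow_by_prefix` are hypothetical and not part of any HBase client API. It times each call and tallies the slow ones by table and row-key prefix, which is only a rough proxy for "same region":

```python
import time
from collections import Counter


class OpTimer:
    """Records client calls that take longer than a threshold (hypothetical helper)."""

    def __init__(self, slow_threshold_s=1.0, clock=time.monotonic):
        self.slow_threshold_s = slow_threshold_s
        self.clock = clock
        self.slow_ops = []  # (table, row_key, elapsed_seconds)

    def timed(self, table, row_key, op):
        """Run op() and remember the call if it met or exceeded the threshold."""
        start = self.clock()
        try:
            return op()
        finally:
            elapsed = self.clock() - start
            if elapsed >= self.slow_threshold_s:
                self.slow_ops.append((table, row_key, elapsed))

    def slow_by_prefix(self, prefix_len=7):
        """Count slow calls per (table, row-key prefix) to spot a shared hot spot."""
        return Counter((t, k[:prefix_len]) for t, k, _ in self.slow_ops)
```

If most slow calls during a pause share one table and one key range, that points at a single unavailable region rather than a truly cluster-wide stall.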

[jira] Commented: (HBASE-3438) Cluster Wide Pauses

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12998550#comment-12998550 ] 

stack commented on HBASE-3438:
------------------------------

@Wayne Does HBASE-3483 fix this issue?

[jira] Commented: (HBASE-3438) Cluster Wide Pauses

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980854#action_12980854 ] 

stack commented on HBASE-3438:
------------------------------

And Ted, you have one regionserver only?

[jira] Commented: (HBASE-3438) Cluster Wide Pauses

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980886#action_12980886 ] 

stack commented on HBASE-3438:
------------------------------

Reads would be blocked as well (if the region is not online, they have nothing to read). Tell us more about the size of the cells being inserted and whether there is much variance. I'm trying to reproduce your pause over here on a ten-node cluster. Thanks.

[jira] Commented: (HBASE-3438) Cluster Wide Pauses

Posted by "Wayne (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981744#action_12981744 ] 

Wayne commented on HBASE-3438:
------------------------------

After 2 more days of testing I think I can confirm your assumption. When we are writing to multiple tables we never see cluster-wide pauses. This is a slow split, in conjunction with meta table updates, that causes all workers to be stuck on a hot region. Why would a split take 10 seconds, and what can be done to minimize this pause?

[jira] [Resolved] (HBASE-3438) Cluster Wide Pauses

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon resolved HBASE-3438.
--------------------------------

    Resolution: Duplicate

Resolving this as a dup of HBASE-3483; I haven't seen this to be a problem of late.

[jira] Commented: (HBASE-3438) Cluster Wide Pauses

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980826#action_12980826 ] 

Jonathan Gray commented on HBASE-3438:
--------------------------------------

What do the operations coming in to the cluster look like?  Is each client issuing requests that are randomly distributed across all regions/nodes?  If so, then the unavailability of a single region could end up looking like a pause against all your clients (eventually each client piles up waiting on the offline region).  This could be worse if you have larger regions, because each region could stay unavailable for longer and the likelihood that a client hits that region increases.
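
The pile-up argument can be made concrete with a back-of-the-envelope calculation. This is only a sketch: it assumes each row in a batch lands in a uniformly random region, independently of the others, and the function names and example numbers (2000 regions, 1000 rows per batch) are hypothetical rather than measured on the reporter's cluster. Under that assumption, the chance that a batch of k rows touches one specific region out of N is 1 - (1 - 1/N)^k:

```python
def p_batch_hits_region(num_regions, rows_per_batch):
    """Probability that at least one row of a batch lands in one specific region,
    assuming rows are spread uniformly and independently across regions."""
    return 1.0 - (1.0 - 1.0 / num_regions) ** rows_per_batch


def p_all_clients_blocked(num_regions, rows_per_batch, batches_per_client, num_clients):
    """Probability that every client issues at least one batch touching the
    offline region while it is unavailable (same uniformity assumption)."""
    p_batch = p_batch_hits_region(num_regions, rows_per_batch)
    p_client = 1.0 - (1.0 - p_batch) ** batches_per_client
    return p_client ** num_clients


# Hypothetical numbers: 2000 regions, 1000 row mutations per batch.
# A single batch then has roughly a 39% chance of touching the offline region.
print(round(p_batch_hits_region(2000, 1000), 3))
```

With batches that large, a client only needs to issue a handful of batches during the outage before it is almost certain to block on the offline region, which is why many clients can appear to stall at nearly the same moment.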

[jira] Commented: (HBASE-3438) Cluster Wide Pauses

Posted by "Wayne (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987593#action_12987593 ] 

Wayne commented on HBASE-3438:
------------------------------

We now assume this is caused by a memstore flush. See https://issues.apache.org/jira/browse/HBASE-3483.


[jira] Commented: (HBASE-3438) Cluster Wide Pauses

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980907#action_12980907 ] 

Jonathan Gray commented on HBASE-3438:
--------------------------------------

Sounds like a large number of small, evenly distributed writes.  With every client hitting every region frequently, you can certainly run into an issue where data unavailability of a single region could cause a "global pause".

Lots of work has been done to make this faster, and there's plenty more work that can be done.  Like Todd said, be sure to check on your client retries and you might turn up client-side debugging to look for more signs.  You are also going to pay the cost of a re-lookup in META.  There's been some discussion for a while about clients more proactively learning of new region locations via ZK (this could also trigger a retry, negating the need to do frequent retries after an NSRE).

[jira] Commented: (HBASE-3438) Cluster Wide Pauses

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980901#action_12980901 ] 

Todd Lipcon commented on HBASE-3438:
------------------------------------

Another thing to consider is our backoff policy on client retries. If the client hits an NSRE (NotServingRegionException) due to a split, might it be backing off too quickly to multi-second sleep times?
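
To see how quickly retry sleeps can add up, here is a small sketch that tabulates a backoff schedule. The base pause and the multiplier table below are illustrative assumptions, not values taken from this thread or from a specific HBase release; substitute whatever your client is actually configured with.

```python
def backoff_schedule(base_pause_ms, multipliers):
    """Return a list of (sleep_before_retry_ms, cumulative_wait_ms), one per retry."""
    schedule = []
    total = 0
    for m in multipliers:
        sleep_ms = base_pause_ms * m
        total += sleep_ms
        schedule.append((sleep_ms, total))
    return schedule


# Illustrative values only (assumptions, see the note above).
BASE_PAUSE_MS = 1000
MULTIPLIERS = [1, 1, 1, 2, 2, 4]

for attempt, (sleep_ms, total_ms) in enumerate(
        backoff_schedule(BASE_PAUSE_MS, MULTIPLIERS), start=1):
    print(f"retry {attempt}: sleep {sleep_ms} ms, waited {total_ms} ms so far")
```

With a schedule like this, a region that is offline only slightly longer than one sleep interval still costs the client the full next interval, so a short split can surface as a multi-second pause on the client side.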

[jira] Commented: (HBASE-3438) Cluster Wide Pauses

Posted by "Wayne (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980876#action_12980876 ] 

Wayne commented on HBASE-3438:
------------------------------

We have 10 nodes with 4 dedicated writers per node. These writers are basically pushing randomly distributed writes via 40 "threads" constantly, 24/7. If a single region was locked I guess it could cause a pause for all of them, but they would then go into a wait state like dominoes, one after the other. Instead, it appears to occur all at once, and reads are blocked as well, which is the biggest concern. There are also 40 read workers. It seems a little crazy to me to think 80 threads are all waiting for a single region when there are thousands of regions. A region split only pauses writes anyway, correct? Are reads blocked as well?

[jira] Commented: (HBASE-3438) Cluster Wide Pauses

Posted by "Wayne (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980905#action_12980905 ] 

Wayne commented on HBASE-3438:
------------------------------

I did not think a region split would block reads, just writes. There is very little variance in the cell size. We have 8 tables, 3 of which are currently used, each with 4 CFs, and only 1 CF is ever written to for a given row key. For loads, the batch mutate is given 10k values at a time, which are grouped into a list of row mutations. The number of row mutations in a batch varies widely, from a few to thousands. All 10k values are passed to 3 CFs as there are 2 secondary CFs. Below is a typical row key / col / ts / value.

P.12345_D.1234567890 / P:D.1234567800_M.123 / 12948623468629950 / 52.5
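
The grouping step described above might look roughly like the following sketch. It is not the reporter's loader: the function names are hypothetical, and it assumes that the column string follows the usual `family:qualifier` form, so that `P` in the sample is the column family.

```python
from collections import defaultdict


def group_into_row_mutations(values):
    """Group (row_key, column, timestamp, value) tuples by row key,
    keeping the original order of cells within each row."""
    rows = defaultdict(list)
    for row_key, column, ts, value in values:
        rows[row_key].append((column, ts, value))
    return dict(rows)


def split_column(column):
    """Split 'family:qualifier' into (family, qualifier)."""
    family, _, qualifier = column.partition(":")
    return family, qualifier


# The first tuple is the sample from the comment above; the second is made up
# here purely to show two cells landing in the same row mutation.
batch = [
    ("P.12345_D.1234567890", "P:D.1234567800_M.123", 12948623468629950, 52.5),
    ("P.12345_D.1234567890", "P:D.1234567800_M.124", 12948623468629951, 53.0),
]
mutations = group_into_row_mutations(batch)
print(len(mutations), "row mutation(s)")
```

Because the number of distinct row keys in a 10k-value batch varies, the number of regions a single batch touches varies too, which affects how likely a batch is to include a region that is mid-split.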

