Posted to issues@hbase.apache.org by "Andrew Purtell (JIRA)" <ji...@apache.org> on 2014/09/10 20:51:34 UTC

[jira] [Commented] (HBASE-11935) Unbounded creation of Replication Failover workers

    [ https://issues.apache.org/jira/browse/HBASE-11935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128924#comment-14128924 ] 

Andrew Purtell commented on HBASE-11935:
----------------------------------------

We could get unbounded ReplicationSource allocation in ReplicationSourceManager.NodeFailoverWorker.run, via this call chain:
{noformat}
  ReplicationTrackerZKImpl.OtherRegionServerWatcher.nodeDeleted ->
      ReplicationSourceManager.regionServerRemoved ->
      ReplicationSourceManager.transferQueues ->
      NodeFailoverWorker.run ->
      ReplicationSourceManager.getReplicationSource ->
      new ReplicationSource
{noformat}

> Unbounded creation of Replication Failover workers
> --------------------------------------------------
>
>                 Key: HBASE-11935
>                 URL: https://issues.apache.org/jira/browse/HBASE-11935
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Lars Hofhansl
>            Assignee: Jesse Yates
>            Priority: Critical
>             Fix For: 2.0.0, 0.98.7, 0.94.24, 0.99.1
>
>         Attachments: hbase-11935-0.98-v0.patch
>
>
> We just ran into a production incident with TCP SYN storms on port 2181 (ZooKeeper).
> In our case the slave cluster was not running. When we bounced the primary cluster we saw an "unbounded" number of failover threads, all hammering the slave cluster's ZK hosts (which were not running ZK at the time), causing an overall degradation of network performance between datacenters.
> Looking at the code we noticed that the thread pool handling of the Failover workers was probably not behaving as intended.
> Patch coming soon.
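
For illustration only, and not the attached patch: a minimal Java sketch of one way to keep the number of concurrently running failover workers bounded. It assumes a plain java.util.concurrent.ThreadPoolExecutor with core size equal to maximum size and an unbounded queue, so extra tasks wait in the queue instead of each getting a thread. The class name BoundedFailoverPool, the helper names, and the sleeping dummy task are hypothetical stand-ins; nothing here is taken from ReplicationSourceManager.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; not HBase code.
public class BoundedFailoverPool {

    /** Builds a pool that never runs more than maxWorkers tasks at once. */
    static ThreadPoolExecutor newBoundedPool(int maxWorkers) {
        // core == max with an unbounded queue: tasks beyond maxWorkers
        // are queued rather than given a new thread.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            maxWorkers, maxWorkers,
            100, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<Runnable>());
        // Let idle threads exit so an idle pool holds no threads.
        pool.allowCoreThreadTimeOut(true);
        return pool;
    }

    /** Submits dummy tasks and returns the peak number seen running at once. */
    static int peakConcurrency(int maxWorkers, int tasks) throws InterruptedException {
        ThreadPoolExecutor pool = newBoundedPool(maxWorkers);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(5); // stand-in for the real failover work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    running.decrementAndGet();
                    done.countDown();
                }
            });
        }
        done.await();
        pool.shutdown();
        return peak.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("peak workers: " + peakConcurrency(2, 20));
    }
}
```

With this configuration the peak never exceeds maxWorkers, however many region server removals are queued. The trade-off is that the queue itself is unbounded, so a long outage would accumulate pending tasks rather than threads.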



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)