Posted to dev@hbase.apache.org by "Ian Friedman (JIRA)" <ji...@apache.org> on 2014/03/04 22:48:01 UTC

[jira] [Created] (HBASE-10673) [replication] Recovering a large number of queues can cause OOME

Ian Friedman created HBASE-10673:
------------------------------------

             Summary: [replication] Recovering a large number of queues can cause OOME
                 Key: HBASE-10673
                 URL: https://issues.apache.org/jira/browse/HBASE-10673
             Project: HBase
          Issue Type: Bug
          Components: Replication
    Affects Versions: 0.94.5
            Reporter: Ian Friedman


Recently we had an accidental shutdown of approximately 400 RegionServers in our cluster. Our cluster runs master-master replication to our secondary datacenter. Once we got the nodes sorted out and added back to the cluster, we noticed RSes continuing to die from OOMEs. Investigating the heap dumps from the dead RSes, I found them full of WALEdits, along with a large number of ReplicationSource threads spawned by the NodeFailoverWorker to replicate the many recovered queues of the dead RSes.

After doing some digging into the code and with a lot of help from [~jdcryans], we found that when a lot of RSes die simultaneously, it is possible for the node failover process on a very small number of regionservers to pick up the majority of the queues. This causes a large number of HLogs to be opened and read simultaneously.
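
To make the failover race concrete, here is a minimal sketch of how a surviving RS could race to claim a dead RS's replication queues via an atomic ZooKeeper create. This is not the actual NodeFailoverWorker code; the znode layout, the lock-znode approach, and the ReplicationQueueClaimer class name are all made up for illustration:

{code:java}
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

/** Hypothetical sketch: claim the replication queues of dead region servers. */
public class ReplicationQueueClaimer {
  private final ZooKeeper zk;
  private final String rsParent = "/hbase/replication/rs"; // made-up layout

  public ReplicationQueueClaimer(ZooKeeper zk) {
    this.zk = zk;
  }

  /** Try to grab every dead RS's queues; whoever creates the lock znode first wins. */
  public void claimDeadServers(List<String> liveServers) throws Exception {
    for (String rs : zk.getChildren(rsParent, false)) {
      if (liveServers.contains(rs)) {
        continue; // still alive, nothing to recover
      }
      String lockPath = rsParent + "/" + rs + "/lock";
      try {
        // Atomic create: only one surviving RS can win this race per dead RS.
        zk.create(lockPath, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
            CreateMode.PERSISTENT);
        // Winner: fork a worker that opens and replays the dead RS's HLog queues.
        // A single fast scanner can win most of these races and end up with
        // hundreds of recovered queues, which is the OOME scenario described above.
      } catch (KeeperException.NodeExistsException e) {
        // Lost the race to another RS; skip this dead server.
      }
    }
  }
}
{code}

Because every survivor loops over the same list of dead RSes, whichever RS scans first (or simply has the least load at that moment) can win nearly all of the claims.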

as per [~jdcryans]:
{quote}
Well, there's a race to grab the queues to recover. The surviving region servers just scan the znode listing all the RSes to determine which ones are dead (and when one finds a dead RS it forks off a NodeFailoverWorker), so if one RS got the notification that a single RS died but in reality a bunch of them died, it could win a lot of those races.

My guess is after that one died, its queues were redistributed, and maybe you had another hero RS, but as more and more died the queues eventually got manageable since a single RS cannot win all the races. Something like this:

- 400 die, one RS manages to grab 350
- That RS dies, one other RS manages to grab 300 of the 350 (the other queues going to other region servers)
- That second RS dies, hero #3 grabs 250
- etc.

...

We don't have the concept of a limit on recovered queues, let alone the concept of respecting it. Aggregating the queues into one is one solution, but it would perform poorly, as shown by your use case, where one region server would be responsible for replicating the data of 300 others. It would scale better if the queues were more evenly distributed (and that was the original intent, except that in practice the dynamics are different).
{quote}
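
The cascade in the quote is easier to see with numbers. Below is a small, self-contained toy model of that redistribution; the MANAGEABLE threshold and the per-RS cap are made-up knobs for illustration (as noted above, no such limit exists today), included only to show why the pile-up eventually stops on its own and how a cap on recovered queues would bound it up front:

{code:java}
/** Toy model of the failover cascade described in the quote (numbers illustrative). */
public class RecoveryCascadeSim {

  // Queues a single RS can recover and replicate without running out of heap
  // (a made-up threshold for this sketch).
  static final int MANAGEABLE = 50;

  public static void main(String[] args) {
    runCascade(400, Integer.MAX_VALUE); // today: no limit on recovered queues
    runCascade(400, 10);                // with a hypothetical per-RS cap
  }

  static void runCascade(int orphanedQueues, int capPerRs) {
    int round = 1;
    int peak = 0;
    while (orphanedQueues > 0) {
      // The fastest surviving RS ("hero") wins most of the races; model it as
      // grabbing all but MANAGEABLE of the orphaned queues, bounded by the cap.
      int grabbed = Math.min(Math.max(orphanedQueues - MANAGEABLE, 0), capPerRs);
      peak = Math.max(peak, grabbed);
      System.out.printf("round %d: hero recovers %d of %d orphaned queues%n",
          round++, grabbed, orphanedQueues);
      if (grabbed <= MANAGEABLE) {
        break; // small enough for every survivor to replicate without OOMEing
      }
      // Otherwise the hero OOMEs too and its recovered queues become orphaned
      // again, while the queues that went to other survivors drain normally.
      orphanedQueues = grabbed;
    }
    System.out.printf("peak recovered queues on a single RS: %d%n%n", peak);
  }
}
{code}

The uncapped run prints the 350, 300, 250, ... progression from the quote, peaking at 350 recovered queues on one RS; the capped run never puts more than 10 recovered queues on any single RS, so the overload and the resulting cascade never start.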



--
This message was sent by Atlassian JIRA
(v6.2#6252)