Posted to dev@hbase.apache.org by "Jerry He (JIRA)" <ji...@apache.org> on 2016/09/08 04:12:21 UTC
[jira] [Created] (HBASE-16581) Optimize Replication queue transfers after server fail over
Jerry He created HBASE-16581:
--------------------------------
Summary: Optimize Replication queue transfers after server fail over
Key: HBASE-16581
URL: https://issues.apache.org/jira/browse/HBASE-16581
Project: HBase
Issue Type: Improvement
Reporter: Jerry He
Assignee: Jerry He
Currently, if a region server fails, its replication queue is picked up by another region server. The problem is that this queue can be huge and can contain nested queues inherited from multiple or cascading earlier server failures.
We had such a case in production. From zk_dump:
{code}
...
/hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1467603735059: 18748267
/hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1471723778060:
/hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1468258960080:
/hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1468204958990:
/hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1469701010649:
/hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1470409989238:
/hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1471838985073:
/hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1467142915090: 57804890
/hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1472181000614:
/hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1471464567365:
/hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1469486466965:
/hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1467787339841: 47812951
...
{code}
There were hundreds of WALs hanging under this queue, coming from different region servers, and they took a long time to replicate.
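To get a feel for how such a backlog is composed, the WAL znodes from a dump like the one above can be tallied by the server that wrote them. The following is only a rough sketch: the class and method names are hypothetical, and it assumes that a WAL znode name is the URL-encoded server name followed by a dot and a timestamp, as the entries above suggest.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of HBase.
public class WalOriginTally {

    // Assumption: a WAL znode is "<url-encoded server name>.<timestamp>".
    // Decode it and strip the trailing ".<timestamp>" to get the origin server.
    static String originServer(String walZnode) {
        String decoded;
        try {
            decoded = URLDecoder.decode(walZnode, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
        int lastDot = decoded.lastIndexOf('.');
        return lastDot < 0 ? decoded : decoded.substring(0, lastDot);
    }

    // Counts how many WAL znodes belong to each origin server.
    static Map<String, Integer> countByOrigin(List<String> walZnodes) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String wal : walZnodes) {
            counts.merge(originServer(wal), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The last path component of two entries from the dump above.
        List<String> wals = List.of(
            "r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1467603735059",
            "r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1471723778060");
        countByOrigin(wals).forEach((server, n) -> System.out.println(server + " -> " + n));
    }
}
```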
We should have a better strategy that lets the live region servers each grab part of this nested queue and replicate in parallel.
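One way to picture the idea is a simple round-robin assignment of the pending work items to the live region servers. This is only a minimal sketch under stated assumptions: the class, method, and server names are hypothetical, the "items" could be WALs or whole sub-queues, and the sketch ignores how a server would atomically claim its share in ZooKeeper.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not the HBase implementation.
public class QueueSplitSketch {

    // Assigns items to live servers in round-robin order, so each live
    // server ends up with roughly items.size() / liveServers.size() items.
    static Map<String, List<String>> split(List<String> items, List<String> liveServers) {
        if (liveServers.isEmpty()) {
            throw new IllegalArgumentException("need at least one live server");
        }
        Map<String, List<String>> plan = new LinkedHashMap<>();
        for (String server : liveServers) {
            plan.put(server, new ArrayList<>());
        }
        for (int i = 0; i < items.size(); i++) {
            String owner = liveServers.get(i % liveServers.size());
            plan.get(owner).add(items.get(i));
        }
        return plan;
    }

    public static void main(String[] args) {
        // Placeholder item and server names, for illustration only.
        List<String> items = List.of("wal-1", "wal-2", "wal-3", "wal-4", "wal-5");
        List<String> live = List.of("rsA", "rsB");
        split(items, live).forEach((server, share) -> System.out.println(server + " -> " + share));
    }
}
```

Note that handing WALs from the same originating server to different workers may change the order in which edits are shipped, so a real design would have to decide at what granularity the queue can safely be divided.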
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)