Posted to issues@hbase.apache.org by "Duo Zhang (Jira)" <ji...@apache.org> on 2021/07/27 01:41:00 UTC
[jira] [Assigned] (HBASE-26120) New replication gets stuck or data loss when multiwal groups more than 10
[ https://issues.apache.org/jira/browse/HBASE-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Duo Zhang reassigned HBASE-26120:
---------------------------------
Assignee: Duo Zhang
> New replication gets stuck or data loss when multiwal groups more than 10
> -------------------------------------------------------------------------
>
> Key: HBASE-26120
> URL: https://issues.apache.org/jira/browse/HBASE-26120
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 1.7.1, 2.4.5
> Reporter: Jasee Tao
> Assignee: Duo Zhang
> Priority: Critical
>
> {code:java}
> void preLogRoll(Path newLog) throws IOException {
>   recordLog(newLog);
>   String logName = newLog.getName();
>   String logPrefix = DefaultWALProvider.getWALPrefixFromWALName(logName);
>   synchronized (latestPaths) {
>     Iterator<Path> iterator = latestPaths.iterator();
>     while (iterator.hasNext()) {
>       Path path = iterator.next();
>       if (path.getName().contains(logPrefix)) {
>         iterator.remove();
>         break;
>       }
>     }
>     this.latestPaths.add(newLog);
>   }
> }
> {code}
> ReplicationSourceManager uses _latestPaths_ to track the latest WAL of each WAL group, and all of these WALs are enqueued for replication when a new replication peer is added.
> If we set hbase.wal.regiongrouping.numgroups > 10, say 12, the WAL group names range from _regionserver.null0.timestamp_ to _regionserver.null11.timestamp_. *_String.contains_* is used in _preLogRoll_ to replace the old log of the same group, so when _regionserver.null1.ts_ arrives, _regionserver.null11.ts_ may be replaced instead, and *_latestPaths_ grows with wrong logs*.
> Replication then gets partly stuck because _regionserver.null1.ts_ no longer exists on HDFS, and data may fail to be replicated to the slave cluster because _regionserver.null11.ts_ is not in the replication queue at startup.
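> The collision is easy to reproduce with plain strings. The following is a minimal standalone sketch (not HBase code; the WAL names and the prefixOf helper are illustrative stand-ins for the real names and for DefaultWALProvider.getWALPrefixFromWALName) showing why *_String.contains_* matches across groups once the group index reaches two digits, while an exact comparison of the extracted prefixes does not:
> {code:java}
> // Illustrative sketch of the prefix collision; not HBase source.
> public class WalPrefixCollision {
>
>   // Assumption: the WAL prefix is everything before the trailing ".<timestamp>",
>   // mirroring what getWALPrefixFromWALName returns for these names.
>   static String prefixOf(String walName) {
>     return walName.substring(0, walName.lastIndexOf('.'));
>   }
>
>   public static void main(String[] args) {
>     String newLog  = "regionserver.null1.1627354800001";  // group 1 rolls a new WAL
>     String tracked = "regionserver.null11.1627354800000"; // latest WAL of group 11
>
>     String newPrefix = prefixOf(newLog); // "regionserver.null1"
>
>     // Buggy check from preLogRoll: group 11's WAL name contains group 1's prefix,
>     // so the wrong entry can be evicted from latestPaths.
>     System.out.println(tracked.contains(newPrefix));         // true  (collision)
>
>     // Comparing the extracted prefixes for equality avoids the collision.
>     System.out.println(prefixOf(tracked).equals(newPrefix)); // false (correct)
>   }
> }
> {code}
> One way to avoid the collision would be to compare the extracted WAL prefixes for equality, or to key _latestPaths_ by WAL group prefix instead of scanning path names with _contains_.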
> Because of [ZOOKEEPER-706|https://issues.apache.org/jira/browse/ZOOKEEPER-706], if there are too many logs under the zk node _/hbase/replication/rs/regionserver/peer_, remove_peer may fail to delete this znode, and other regionservers cannot pick up this queue for replication failover.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)