Posted to issues@hbase.apache.org by "Duo Zhang (Jira)" <ji...@apache.org> on 2021/07/27 01:41:00 UTC
[jira] [Assigned] (HBASE-26120) New replication gets stuck or data loss when multiwal groups more than 10
[ https://issues.apache.org/jira/browse/HBASE-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Duo Zhang reassigned HBASE-26120:
---------------------------------
Assignee: Duo Zhang
> New replication gets stuck or data loss when multiwal groups more than 10
> -------------------------------------------------------------------------
>
> Key: HBASE-26120
> URL: https://issues.apache.org/jira/browse/HBASE-26120
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 1.7.1, 2.4.5
> Reporter: Jasee Tao
> Assignee: Duo Zhang
> Priority: Critical
>
> {code:java}
> void preLogRoll(Path newLog) throws IOException {
>   recordLog(newLog);
>   String logName = newLog.getName();
>   String logPrefix = DefaultWALProvider.getWALPrefixFromWALName(logName);
>   synchronized (latestPaths) {
>     Iterator<Path> iterator = latestPaths.iterator();
>     while (iterator.hasNext()) {
>       Path path = iterator.next();
>       if (path.getName().contains(logPrefix)) {
>         iterator.remove();
>         break;
>       }
>     }
>     this.latestPaths.add(newLog);
>   }
> }
> {code}
> ReplicationSourceManager uses _latestPaths_ to track the latest WAL of each WAL group, and all of these WALs are enqueued for replication when a new replication peer is added.
> If we set hbase.wal.regiongrouping.numgroups > 10, say 12, the WAL group names range from _regionserver.null0.timestamp_ to _regionserver.null11.timestamp_. *_String.contains_* is used in _preLogRoll_ to replace the old log of the same group, so when _regionserver.null1.ts_ arrives, _regionserver.null11.ts_ may be replaced instead, and *_latestPaths_ grows with wrong logs*.
> Replication then gets partly stuck because _regionserver.null1.ts_ no longer exists on HDFS, and data may fail to be replicated to the slave cluster because _regionserver.null11.ts_ is not in the replication queue at startup.
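> The collision is easy to reproduce with plain strings. The following is a minimal standalone sketch (not HBase code; the WAL names and the prefixOf helper are illustrative stand-ins for the real names and for DefaultWALProvider.getWALPrefixFromWALName) showing why *_String.contains_* matches across groups once the group index reaches two digits, while an exact comparison of the extracted prefixes does not:
> {code:java}
> // Illustrative sketch of the prefix collision; not HBase source.
> public class WalPrefixCollision {
>
>   // Assumption: the WAL prefix is everything before the trailing ".<timestamp>",
>   // mirroring what getWALPrefixFromWALName returns for these names.
>   static String prefixOf(String walName) {
>     return walName.substring(0, walName.lastIndexOf('.'));
>   }
>
>   public static void main(String[] args) {
>     String newLog  = "regionserver.null1.1627354800001";  // group 1 rolls a new WAL
>     String tracked = "regionserver.null11.1627354800000"; // latest WAL of group 11
>
>     String newPrefix = prefixOf(newLog); // "regionserver.null1"
>
>     // Buggy check from preLogRoll: group 11's WAL name contains group 1's prefix,
>     // so the wrong entry can be evicted from latestPaths.
>     System.out.println(tracked.contains(newPrefix));         // true  (collision)
>
>     // Comparing the extracted prefixes for equality avoids the collision.
>     System.out.println(prefixOf(tracked).equals(newPrefix)); // false (correct)
>   }
> }
> {code}
> One way to avoid the collision would be to compare the extracted WAL prefixes for equality, or to key _latestPaths_ by WAL group prefix instead of scanning path names with _contains_.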
> Because of [ZOOKEEPER-706|https://issues.apache.org/jira/browse/ZOOKEEPER-706], if there are too many logs under the zk node _/hbase/replication/rs/regionserver/peer_, remove_peer may fail to delete this znode, and other regionservers cannot pick up this queue for replication failover.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)