Posted to issues@hbase.apache.org by "Jasee Tao (Jira)" <ji...@apache.org> on 2021/07/26 02:58:00 UTC

[jira] [Created] (HBASE-26120) New replication gets stuck or data loss when multiwal groups more than 10

Jasee Tao created HBASE-26120:
---------------------------------

             Summary: New replication gets stuck or data loss when multiwal groups more than 10
                 Key: HBASE-26120
                 URL: https://issues.apache.org/jira/browse/HBASE-26120
             Project: HBase
          Issue Type: Bug
          Components: Replication
    Affects Versions: 2.4.5, 1.7.1
            Reporter: Jasee Tao


{code:java}
void preLogRoll(Path newLog) throws IOException {
  recordLog(newLog);
  String logName = newLog.getName();
  String logPrefix = DefaultWALProvider.getWALPrefixFromWALName(logName);
  synchronized (latestPaths) {
    Iterator<Path> iterator = latestPaths.iterator();
    while (iterator.hasNext()) {
      Path path = iterator.next();
      if (path.getName().contains(logPrefix)) {
        iterator.remove();
        break;
      }
    }
    this.latestPaths.add(newLog);
  }
}
{code}
ReplicationSourceManager uses latestPaths to track the latest WAL of each WAL group, and all of them are enqueued for replication when a new replication peer is added.

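If hbase.wal.regiongrouping.numgroups is set above 10, say 11, the WAL groups are named regionserver.null0.timestamp through regionserver.null10.timestamp. preLogRoll uses String.contains to find the old log of the same group, so the prefix regionserver.null1 also matches any two-digit group that starts with it, such as regionserver.null10 or, with more groups, regionserver.null11. As a result, when regionserver.null1.ts is rolled, regionserver.null11.ts may be removed instead, and latestPaths keeps growing with the wrong logs.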

Replication then gets partly stuck because the old regionserver.null1.ts no longer exists on HDFS, and data may not be replicated to the slave cluster because regionserver.null11.ts is not in the replication queue at startup.
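
The mismatch can be reproduced outside HBase with a small sketch. This is not the HBase code: walPrefix is a hypothetical stand-in for DefaultWALProvider.getWALPrefixFromWALName that assumes the prefix is everything before the last '.', plain strings replace Path, and the list order is chosen so the wrong entry is visited first. rollWithExactPrefix shows one possible fix (comparing prefixes for equality), not necessarily the change that should go into the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class LatestPathsDemo {
  // Hypothetical stand-in for getWALPrefixFromWALName:
  // assumes the prefix is the name up to the last '.' (the timestamp part).
  static String walPrefix(String walName) {
    return walName.substring(0, walName.lastIndexOf('.'));
  }

  // Mirrors the reported logic: remove the first entry whose name
  // contains the new log's prefix, then add the new log.
  static void rollWithContains(List<String> latest, String newLog) {
    String prefix = walPrefix(newLog);
    Iterator<String> it = latest.iterator();
    while (it.hasNext()) {
      if (it.next().contains(prefix)) {
        it.remove();
        break;
      }
    }
    latest.add(newLog);
  }

  // One possible fix (an assumption): replace only the entry whose
  // prefix is exactly equal to the new log's prefix.
  static void rollWithExactPrefix(List<String> latest, String newLog) {
    String prefix = walPrefix(newLog);
    latest.removeIf(name -> walPrefix(name).equals(prefix));
    latest.add(newLog);
  }

  public static void main(String[] args) {
    List<String> buggy = new ArrayList<>(
        Arrays.asList("regionserver.null11.100", "regionserver.null1.100"));
    rollWithContains(buggy, "regionserver.null1.200");
    // null11 was dropped and the old null1 log is still tracked:
    // [regionserver.null1.100, regionserver.null1.200]
    System.out.println(buggy);

    List<String> fixed = new ArrayList<>(
        Arrays.asList("regionserver.null11.100", "regionserver.null1.100"));
    rollWithExactPrefix(fixed, "regionserver.null1.200");
    // [regionserver.null11.100, regionserver.null1.200]
    System.out.println(fixed);
  }
}
```

With the substring match, the entry for group null11 disappears while the old null1 log stays behind, which matches both symptoms above: a queue entry pointing at a log that is gone, and a live log that is never enqueued.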


