You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "Ted Yu (JIRA)" <ji...@apache.org> on 2018/11/01 02:21:00 UTC
[jira] [Commented] (HBASE-21387) Race condition in snapshot cache
refreshing leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671033#comment-16671033 ]
Ted Yu commented on HBASE-21387:
--------------------------------
In the old code, to simulate the race condition, we can use CountDownLatch.
Here is a sketch:
{code}
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/
index 358b4ea..2941400 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java
@@ -27,6 +27,7 @@ import java.util.Map;
import java.util.Set;
import java.util.Timer;
import java.util.TimerTask;
+import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;
import org.apache.hadoop.conf.Configuration;
@@ -92,6 +93,8 @@ public class SnapshotFileCache implements Stoppable {
private final SnapshotFileInspector fileInspector;
private final Path snapshotDir;
private final Set<String> cache = new HashSet<>();
+ private final CountDownLatch latchRefresh = new CountDownLatch(1);
+ private final CountDownLatch latchContains = new CountDownLatch(1);
/**
* This is a helper map of information about the snapshot directories so we don't need to rescan
* them if they haven't changed since the last time we looked.
@@ -180,16 +183,18 @@ public class SnapshotFileCache implements Stoppable {
// cache, but that seems overkill at the moment and isn't necessarily a bottleneck.
public synchronized Iterable<FileStatus> getUnreferencedFiles(Iterable<FileStatus> files,
final SnapshotManager snapshotManager)
- throws IOException {
+ throws IOException, InterruptedException {
List<FileStatus> unReferencedFiles = Lists.newArrayList();
List<String> snapshotsInProgress = null;
boolean refreshed = false;
for (FileStatus file : files) {
String fileName = file.getPath().getName();
if (!refreshed && !cache.contains(fileName)) {
+ latchRefresh.await();
refreshCache();
refreshed = true;
}
+ latchContains.await();
if (cache.contains(fileName)) {
continue;
}
@@ -226,9 +231,11 @@ public class SnapshotFileCache implements Stoppable {
// 1. update the modified time
this.lastModifiedTime = dirStatus.getModificationTime();
+ latchRefresh.countDown();
// 2.clear the cache
this.cache.clear();
+ latchContains.countDown();
Map<String, SnapshotDirectoryInfo> known = new HashMap<>();
// 3. check each of the snapshot directories
{code}
With the fix, the race condition is gone.
> Race condition in snapshot cache refreshing leads to loss of snapshot files
> ---------------------------------------------------------------------------
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
> Issue Type: Bug
> Reporter: Ted Yu
> Assignee: Ted Yu
> Priority: Major
> Labels: snapshot
> Attachments: 21387.v1.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] snapshot.SnapshotReferenceUtil: Can't find hfile: 44f6c3c646e84de6a63fe30da4fcb3aa in the real (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) or archive (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) directory for the primary table.
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] cleaner.HFileCleaner: Removing: hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa from archive
> {code}
> The root cause is race condition surrounding SnapshotFileCache#refreshCache().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
> // if the snapshot directory wasn't modified since we last check, we are done
> if (dirStatus.getModificationTime() <= this.lastModifiedTime) return;
> // 1. update the modified time
> this.lastModifiedTime = dirStatus.getModificationTime();
> // 2.clear the cache
> this.cache.clear();
> {code}
> Suppose the RefreshCacheTask runs past the if check and sets this.lastModifiedTime
> The cleaner executes refreshCache and returns immediately since this.lastModifiedTime matches the modification time of the directory.
> Now RefreshCacheTask clears the cache. By the time the cleaner performs cache lookup, the cache is empty.
> Therefore cleaner puts the file into unReferencedFiles - leading to data loss.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)