You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "Ted Yu (JIRA)" <ji...@apache.org> on 2018/11/01 02:21:00 UTC

[jira] [Commented] (HBASE-21387) Race condition in snapshot cache refreshing leads to loss of snapshot files

    [ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671033#comment-16671033 ] 

Ted Yu commented on HBASE-21387:
--------------------------------

In the old code, to simulate the race condition, we can use CountDownLatch.
Here is a sketch:
{code}
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/
index 358b4ea..2941400 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java
@@ -27,6 +27,7 @@ import java.util.Map;
 import java.util.Set;
 import java.util.Timer;
 import java.util.TimerTask;
+import java.util.concurrent.CountDownLatch;
 import java.util.concurrent.locks.ReentrantLock;

 import org.apache.hadoop.conf.Configuration;
@@ -92,6 +93,8 @@ public class SnapshotFileCache implements Stoppable {
   private final SnapshotFileInspector fileInspector;
   private final Path snapshotDir;
   private final Set<String> cache = new HashSet<>();
+  private final CountDownLatch latchRefresh = new CountDownLatch(1);
+  private final CountDownLatch latchContains = new CountDownLatch(1);
   /**
    * This is a helper map of information about the snapshot directories so we don't need to rescan
    * them if they haven't changed since the last time we looked.
@@ -180,16 +183,18 @@ public class SnapshotFileCache implements Stoppable {
   // cache, but that seems overkill at the moment and isn't necessarily a bottleneck.
   public synchronized Iterable<FileStatus> getUnreferencedFiles(Iterable<FileStatus> files,
       final SnapshotManager snapshotManager)
-      throws IOException {
+      throws IOException, InterruptedException {
     List<FileStatus> unReferencedFiles = Lists.newArrayList();
     List<String> snapshotsInProgress = null;
     boolean refreshed = false;
     for (FileStatus file : files) {
       String fileName = file.getPath().getName();
       if (!refreshed && !cache.contains(fileName)) {
+        latchRefresh.await();
         refreshCache();
         refreshed = true;
       }
+      latchContains.await();
       if (cache.contains(fileName)) {
         continue;
       }
@@ -226,9 +231,11 @@ public class SnapshotFileCache implements Stoppable {

     // 1. update the modified time
     this.lastModifiedTime = dirStatus.getModificationTime();
+    latchRefresh.countDown();

     // 2.clear the cache
     this.cache.clear();
+    latchContains.countDown();
     Map<String, SnapshotDirectoryInfo> known = new HashMap<>();

     // 3. check each of the snapshot directories
{code}
With the fix, the race condition is gone.

> Race condition in snapshot cache refreshing leads to loss of snapshot files
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-21387
>                 URL: https://issues.apache.org/jira/browse/HBASE-21387
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>            Priority: Major
>              Labels: snapshot
>         Attachments: 21387.v1.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] snapshot.SnapshotReferenceUtil: Can't find hfile: 44f6c3c646e84de6a63fe30da4fcb3aa in the real (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) or archive (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] cleaner.HFileCleaner: Removing: hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa from archive
> {code}
> The root cause is race condition surrounding SnapshotFileCache#refreshCache().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>     // if the snapshot directory wasn't modified since we last check, we are done
>     if (dirStatus.getModificationTime() <= this.lastModifiedTime) return;
>     // 1. update the modified time
>     this.lastModifiedTime = dirStatus.getModificationTime();
>     // 2.clear the cache
>     this.cache.clear();
> {code}
> Suppose the RefreshCacheTask runs past the if check and sets this.lastModifiedTime
> The cleaner executes refreshCache and returns immediately since this.lastModifiedTime matches the modification time of the directory.
> Now RefreshCacheTask clears the cache. By the time the cleaner performs cache lookup, the cache is empty.
> Therefore cleaner puts the file into unReferencedFiles - leading to data loss.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)