You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2021/11/09 16:13:55 UTC

[GitHub] [accumulo] keith-turner commented on a change in pull request #2320: Periodically verify tablet metadata

keith-turner commented on a change in pull request #2320:
URL: https://github.com/apache/accumulo/pull/2320#discussion_r745777393



##########
File path: server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Tablet.java
##########
@@ -1403,6 +1402,23 @@ private void closeConsistencyCheck() {
     // TODO check lastFlushID and lostCompactID - ACCUMULO-1290
   }
 
+  private void compareToDataInMemory(TabletMetadata tabletMetadata) {
+    if (!tabletMetadata.getFilesMap().equals(getDatafileManager().getDatafileSizes())) {
+      String msg = "Data files in " + extent + " differ from in-memory data "
+          + tabletMetadata.getFilesMap() + " " + getDatafileManager().getDatafileSizes();
+      log.error(msg);
+      throw new RuntimeException(msg);
+    }
+  }
+
+  public void compareTabletInfo(Pair<Long,TabletMetadata> tabletInfo) {
+    // if the counter didn't change, compare metadata to what is in memory
+    if (tabletInfo.getFirst() == this.getUpdateCounter()) {
+      this.compareToDataInMemory(tabletInfo.getSecond());
+    }
+    // if counter did change, don't compare metadata and try again later
+  }

Review comment:
       Need to synchronize this method on the tablet lock and check if the tablet is closed.  We gathered a snapshot of all online tablets at a point in time, it possible that some of the tablets in that snapshot are closed when we try to inpsect them.  The reason we need to syncronize on the tablet lock is that we are inspecting multiple tablet instance vars (closed state, update counter, and files map) and we want to consider all of the instance vars atomically. W/o sync the closed state can change after we inspect it, the update counter can change after we inspect it, etc.
   
   ```suggestion
     public synchronized void compareTabletInfo(Pair<Long,TabletMetadata> tabletInfo) {
        if(isClosing() || isClosed()) {
             return;
        }
     
       // if the counter didn't change, compare metadata to what is in memory
       if (tabletInfo.getFirst() == this.getUpdateCounter()) {
         this.compareToDataInMemory(tabletInfo.getSecond());
       }
       // if counter did change, don't compare metadata and try again later
     }
   ```

##########
File path: server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Tablet.java
##########
@@ -874,6 +877,7 @@ public void flush(long tableFlushID) {
         } else {
           initiateMinor = true;
         }
+        updateCounter.getAndIncrement();

Review comment:
       I think it would be better to place this counter in the DataFileManager class.  Also would not make it an atomic long, would just make it a long.  The variable it tightly coupled to the map dataFileSizes in DataFileManager.  Any time that map is update, the counter needs to be incremented.  This map update and counter increment should all be done in the same sync block, so there is no need to make the counter an atomic long.  Making it an atomic long gives the impression that it can be accessed on its own when in reality it should only be accessed in the same sync that controls access to the map.
   
   The DataFileManager class relies on being accessed when the tablet lock is held, so it does not really need its own synch.
   
   So could add the counter to the DataFileManager class and updated whenever the map is updated and also add a method to get the counter from DataFileManager.  Rely on the tablet lock for sync, there is another comment related to this.

##########
File path: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java
##########
@@ -791,6 +791,27 @@ public void run() {
       }
     }, 0, 5000, TimeUnit.MILLISECONDS);
 
+    ThreadPools.createGeneralScheduledExecutorService(aconf).scheduleWithFixedDelay(() -> {
+      final SortedMap<KeyExtent,Tablet> onlineTabletsSnapshot = onlineTablets.snapshot();
+      final SortedMap<KeyExtent,Pair<Long,TabletMetadata>> tabletValidationInfo = new TreeMap<>();
+
+      // gather update counters and metadata for all tablets
+      for (var entry : onlineTabletsSnapshot.entrySet()) {
+        KeyExtent keyExtent = entry.getKey();
+        Long counter = entry.getValue().getUpdateCounter();
+        TabletMetadata tm = getContext().getAmple().readTablet(keyExtent,
+            TabletMetadata.ColumnType.FILES, TabletMetadata.ColumnType.LOGS,
+            TabletMetadata.ColumnType.ECOMP, TabletMetadata.ColumnType.PREV_ROW);

Review comment:
       This will read tablets one at a time from the metadata table resulting in an RPC per tablet.  Should use the readTablets ample method to read the tablet which will use a batch scanner which can result in a lot less RPCs.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org