You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2022/03/22 18:16:40 UTC

[GitHub] [accumulo] milleruntime opened a new pull request #2583: Add trace and debug log to consistency check

milleruntime opened a new pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583


   * Closes #2577 
   * Add trace span and time measurement around consistency check
   so we get an idea of how long metadata scans are taking every 30-60 minutes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] keith-turner commented on a change in pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
keith-turner commented on a change in pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#discussion_r833403206



##########
File path: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java
##########
@@ -818,18 +823,36 @@ public void run() {
             updateCounts.put(ke, tablet.getUpdateCount());
           });
 
-          // gather metadata for all tablets readTablets()
-          try (TabletsMetadata tabletsMetadata =
-              getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
-                  .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
-
-            // for each tablet, compare its metadata to what is held in memory
-            tabletsMetadata.forEach(tabletMetadata -> {
-              KeyExtent extent = tabletMetadata.getExtent();
-              Tablet tablet = onlineTabletsSnapshot.get(extent);
-              Long counter = updateCounts.get(extent);
-              tablet.compareTabletInfo(counter, tabletMetadata);
-            });
+          Instant start = Instant.now();
+          Duration duration = Duration.ZERO;
+          int tabletCount = 0;
+          Span mdScanSpan = TraceUtil.startSpan(this.getClass(), "metadataScan");
+          try (Scope scope = mdScanSpan.makeCurrent()) {
+            // gather metadata for all tablets readTablets()
+            try (TabletsMetadata tabletsMetadata =
+                getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
+                    .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
+              mdScanSpan.end();
+              duration = Duration.between(start, Instant.now());
+
+              // for each tablet, compare its metadata to what is held in memory
+              for (var tabletMetadata : tabletsMetadata) {
+                tabletCount++;
+                KeyExtent extent = tabletMetadata.getExtent();
+                Tablet tablet = onlineTabletsSnapshot.get(extent);
+                Long counter = updateCounts.get(extent);
+                tablet.compareTabletInfo(counter, tabletMetadata);
+              }
+
+              log.debug("Metadata scan took {}ms for {} tablets read.", duration.toMillis(),
+                  tabletCount);
+            }
+          } finally {
+            if (duration.toMinutes() > 1) {
+              log.warn(
+                  "Metadata scan took {}ms for {} tablets. Performance of all activities will be severely hindered!",
+                  duration.toMillis(), tabletCount);
+            }

Review comment:
       With `duration` initialized to zero, this code is structured so that it will be a no-op if there is an exception.  Since the code will always be a no-op in the case of an exception it does not seems like putting it in a finally block is useful.  Could this be moved out of the finally block and placed after the metadata read?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on a change in pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
milleruntime commented on a change in pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#discussion_r833538140



##########
File path: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java
##########
@@ -818,18 +823,36 @@ public void run() {
             updateCounts.put(ke, tablet.getUpdateCount());
           });
 
-          // gather metadata for all tablets readTablets()
-          try (TabletsMetadata tabletsMetadata =
-              getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
-                  .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
-
-            // for each tablet, compare its metadata to what is held in memory
-            tabletsMetadata.forEach(tabletMetadata -> {
-              KeyExtent extent = tabletMetadata.getExtent();
-              Tablet tablet = onlineTabletsSnapshot.get(extent);
-              Long counter = updateCounts.get(extent);
-              tablet.compareTabletInfo(counter, tabletMetadata);
-            });
+          Instant start = Instant.now();
+          Duration duration = Duration.ZERO;
+          int tabletCount = 0;
+          Span mdScanSpan = TraceUtil.startSpan(this.getClass(), "metadataScan");
+          try (Scope scope = mdScanSpan.makeCurrent()) {
+            // gather metadata for all tablets readTablets()
+            try (TabletsMetadata tabletsMetadata =
+                getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
+                    .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
+              mdScanSpan.end();
+              duration = Duration.between(start, Instant.now());
+
+              // for each tablet, compare its metadata to what is held in memory
+              for (var tabletMetadata : tabletsMetadata) {
+                tabletCount++;
+                KeyExtent extent = tabletMetadata.getExtent();
+                Tablet tablet = onlineTabletsSnapshot.get(extent);
+                Long counter = updateCounts.get(extent);
+                tablet.compareTabletInfo(counter, tabletMetadata);
+              }
+
+              log.debug("Metadata scan took {}ms for {} tablets read.", duration.toMillis(),
+                  tabletCount);
+            }
+          } finally {
+            if (duration.toMinutes() > 1) {
+              log.warn(
+                  "Metadata scan took {}ms for {} tablets. Performance of all activities will be severely hindered!",
+                  duration.toMillis(), tabletCount);
+            }

Review comment:
       Doing some local testing with Uno, this is the best (or worst) I was able to do:
   <pre>
   tserver2_ip-10-113-12-25.log:2022-03-23T13:07:31,950 [tserver.TabletServer] DEBUG: Metadata scan took 1019ms for 42303 tablets read.
   </pre>




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime edited a comment on pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
milleruntime edited a comment on pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#issuecomment-1076255248


   I was thinking that it might be nice to make the consistency check frequency configurable (something like `tserver.health.check.frequency` with the default 30-60mins). Having it check more often on a  tserver would be a nice health check when debugging. Thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] keith-turner commented on a change in pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
keith-turner commented on a change in pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#discussion_r834308231



##########
File path: core/src/main/java/org/apache/accumulo/core/conf/Property.java
##########
@@ -656,6 +656,8 @@
       "1.4.0"),
   TSERV_BULK_TIMEOUT("tserver.bulk.timeout", "5m", PropertyType.TIMEDURATION,
       "The time to wait for a tablet server to process a bulk import request.", "1.4.3"),
+  TSERV_HEALTH_CHECK_FREQ("tserver.health.check.interval", "30m", PropertyType.TIMEDURATION,

Review comment:
       As follow on work may be good to investigate if anything else should use this property.  There is a check that perodically looks for stuck compactions in the tablet server, maybe it could use this property too.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
milleruntime commented on pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#issuecomment-1076222342


   > Exposing this through metrics rather than tracing would provide better utility for OPs to monitor, alert and trend.
   
   OPs? I don't know metrics code, any tips how to do this in the tserver?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on a change in pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
milleruntime commented on a change in pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#discussion_r834522621



##########
File path: core/src/main/java/org/apache/accumulo/core/conf/Property.java
##########
@@ -656,6 +656,8 @@
       "1.4.0"),
   TSERV_BULK_TIMEOUT("tserver.bulk.timeout", "5m", PropertyType.TIMEDURATION,
       "The time to wait for a tablet server to process a bulk import request.", "1.4.3"),
+  TSERV_HEALTH_CHECK_FREQ("tserver.health.check.interval", "30m", PropertyType.TIMEDURATION,

Review comment:
       The `startWatching()` method does get called, once in the constructor of `MajorCompactor`. And it looks like the other times that the `run()` gets called are just for single, one time checks.
   https://github.com/apache/accumulo/blob/dca3ae8d056a24510aa9d86347d9d6338613c4b3/server/compactor/src/main/java/org/apache/accumulo/compactor/Compactor.java#L757-L759




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on a change in pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
milleruntime commented on a change in pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#discussion_r834347013



##########
File path: core/src/main/java/org/apache/accumulo/core/conf/Property.java
##########
@@ -656,6 +656,8 @@
       "1.4.0"),
   TSERV_BULK_TIMEOUT("tserver.bulk.timeout", "5m", PropertyType.TIMEDURATION,
       "The time to wait for a tablet server to process a bulk import request.", "1.4.3"),
+  TSERV_HEALTH_CHECK_FREQ("tserver.health.check.interval", "30m", PropertyType.TIMEDURATION,

Review comment:
       The `run()` method does get called though... I am not sure if `startWatching()` should be called instead though.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] keith-turner commented on a change in pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
keith-turner commented on a change in pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#discussion_r832628526



##########
File path: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java
##########
@@ -809,27 +814,40 @@ public void run() {
     // Periodically check that metadata of tablets matches what is held in memory
     ThreadPools.watchCriticalScheduledTask(ThreadPools.getServerThreadPools()
         .createGeneralScheduledExecutorService(aconf).scheduleWithFixedDelay(() -> {
-          final SortedMap<KeyExtent,Tablet> onlineTabletsSnapshot = onlineTablets.snapshot();
-
-          Map<KeyExtent,Long> updateCounts = new HashMap<>();
+          Instant start = Instant.now();
+          Span mdScanSpan = TraceUtil.startSpan(this.getClass(), "metadataScan");
+          try (Scope scope = mdScanSpan.makeCurrent()) {
+            final SortedMap<KeyExtent,Tablet> onlineTabletsSnapshot = onlineTablets.snapshot();
 
-          // gather updateCounts for each tablet
-          onlineTabletsSnapshot.forEach((ke, tablet) -> {
-            updateCounts.put(ke, tablet.getUpdateCount());
-          });
+            Map<KeyExtent,Long> updateCounts = new HashMap<>();
 
-          // gather metadata for all tablets readTablets()
-          try (TabletsMetadata tabletsMetadata =
-              getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
-                  .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
-
-            // for each tablet, compare its metadata to what is held in memory
-            tabletsMetadata.forEach(tabletMetadata -> {
-              KeyExtent extent = tabletMetadata.getExtent();
-              Tablet tablet = onlineTabletsSnapshot.get(extent);
-              Long counter = updateCounts.get(extent);
-              tablet.compareTabletInfo(counter, tabletMetadata);
+            // gather updateCounts for each tablet
+            onlineTabletsSnapshot.forEach((ke, tablet) -> {
+              updateCounts.put(ke, tablet.getUpdateCount());
             });
+
+            // gather metadata for all tablets readTablets()
+            try (TabletsMetadata tabletsMetadata =
+                getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
+                    .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
+
+              // for each tablet, compare its metadata to what is held in memory
+              tabletsMetadata.forEach(tabletMetadata -> {
+                KeyExtent extent = tabletMetadata.getExtent();
+                Tablet tablet = onlineTabletsSnapshot.get(extent);
+                Long counter = updateCounts.get(extent);
+                tablet.compareTabletInfo(counter, tabletMetadata);
+              });
+            }
+          } finally {
+            mdScanSpan.end();
+            Instant end = Instant.now();
+            Duration duration = Duration.between(start, end);
+            log.debug("Metadata scan took {}ms", duration.toMillis());
+            if (duration.toMinutes() > 5) {
+              log.warn(
+                  "Metadata scan is taking too long. Performance of all activities will be severely hindered!");

Review comment:
       would be nice if the logs massage included how long it took and how many exents were looked up.

##########
File path: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java
##########
@@ -809,27 +814,40 @@ public void run() {
     // Periodically check that metadata of tablets matches what is held in memory
     ThreadPools.watchCriticalScheduledTask(ThreadPools.getServerThreadPools()
         .createGeneralScheduledExecutorService(aconf).scheduleWithFixedDelay(() -> {
-          final SortedMap<KeyExtent,Tablet> onlineTabletsSnapshot = onlineTablets.snapshot();
-
-          Map<KeyExtent,Long> updateCounts = new HashMap<>();
+          Instant start = Instant.now();

Review comment:
       This lambda is starting to get long, wonder if it should be pulled out to a function.

##########
File path: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java
##########
@@ -809,27 +814,40 @@ public void run() {
     // Periodically check that metadata of tablets matches what is held in memory
     ThreadPools.watchCriticalScheduledTask(ThreadPools.getServerThreadPools()
         .createGeneralScheduledExecutorService(aconf).scheduleWithFixedDelay(() -> {
-          final SortedMap<KeyExtent,Tablet> onlineTabletsSnapshot = onlineTablets.snapshot();
-
-          Map<KeyExtent,Long> updateCounts = new HashMap<>();
+          Instant start = Instant.now();
+          Span mdScanSpan = TraceUtil.startSpan(this.getClass(), "metadataScan");
+          try (Scope scope = mdScanSpan.makeCurrent()) {
+            final SortedMap<KeyExtent,Tablet> onlineTabletsSnapshot = onlineTablets.snapshot();
 
-          // gather updateCounts for each tablet
-          onlineTabletsSnapshot.forEach((ke, tablet) -> {
-            updateCounts.put(ke, tablet.getUpdateCount());
-          });
+            Map<KeyExtent,Long> updateCounts = new HashMap<>();
 
-          // gather metadata for all tablets readTablets()
-          try (TabletsMetadata tabletsMetadata =
-              getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
-                  .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
-
-            // for each tablet, compare its metadata to what is held in memory
-            tabletsMetadata.forEach(tabletMetadata -> {
-              KeyExtent extent = tabletMetadata.getExtent();
-              Tablet tablet = onlineTabletsSnapshot.get(extent);
-              Long counter = updateCounts.get(extent);
-              tablet.compareTabletInfo(counter, tabletMetadata);
+            // gather updateCounts for each tablet
+            onlineTabletsSnapshot.forEach((ke, tablet) -> {
+              updateCounts.put(ke, tablet.getUpdateCount());
             });
+
+            // gather metadata for all tablets readTablets()
+            try (TabletsMetadata tabletsMetadata =
+                getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
+                    .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
+
+              // for each tablet, compare its metadata to what is held in memory
+              tabletsMetadata.forEach(tabletMetadata -> {
+                KeyExtent extent = tabletMetadata.getExtent();
+                Tablet tablet = onlineTabletsSnapshot.get(extent);
+                Long counter = updateCounts.get(extent);
+                tablet.compareTabletInfo(counter, tabletMetadata);

Review comment:
       This method gets a lock on each tablet.  It should not hold things up (the tablet lock is never supposed to be held for long operations like I/O, but not sure if thats true), but it could possibly.  Anyway this could contribute to the time.  May want to time just the readTablets() call in the code above.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] EdColeman commented on pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
EdColeman commented on pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#issuecomment-1075584456


   Exposing this through metrics rather than tracing would provide better utility for OPs to monitor, alert and trend.  
   
   Tracing would be helpful if times could be correlated with other activities - but I am not sure that is possible. First, this is running as its own thread, without a parent as part of some other activity, The other issue this will only be a periodic snapshot and not have insight into global activities - and even if it did, I don't know how that would be expressed. 
   
   For example, if during bulk ingest or maybe metadata table compaction it may be "normal" if the scan time increased from a baseline "idle".


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime edited a comment on pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
milleruntime edited a comment on pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#issuecomment-1076255248


   I was thinking that it might be nice to make the consistency check frequency configurable (something like `tserver.health.check.frequency` with the default 30-60mins). Having it check more often on a  tserver would be a nice health check when debugging. IThoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] keith-turner edited a comment on pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
keith-turner edited a comment on pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#issuecomment-1076448080


   This seems like a good short term way to get a sense of metadata table read performance.  Longer term it may be better to instrument the scanner and batch scanner to support emitting metrics.  Maybe the scanner and batch scanner could have a client property with a value that is a list of tables they would emit metrics for.  Then this could be flipped on and we could see metadata table read performance from across the cluster in a metrics system. 
   
   If this more general solution was ever implemented it would be good to remove this code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on a change in pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
milleruntime commented on a change in pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#discussion_r834340812



##########
File path: core/src/main/java/org/apache/accumulo/core/conf/Property.java
##########
@@ -656,6 +656,8 @@
       "1.4.0"),
   TSERV_BULK_TIMEOUT("tserver.bulk.timeout", "5m", PropertyType.TIMEDURATION,
       "The time to wait for a tablet server to process a bulk import request.", "1.4.3"),
+  TSERV_HEALTH_CHECK_FREQ("tserver.health.check.interval", "30m", PropertyType.TIMEDURATION,

Review comment:
       Is it `CompactionWatcher`? I noticed this method never gets called: https://github.com/apache/accumulo/blob/7fb1b5df9e3abe18304039fb3e781b2617eec4b5/server/base/src/main/java/org/apache/accumulo/server/compaction/CompactionWatcher.java#L124




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] keith-turner commented on a change in pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
keith-turner commented on a change in pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#discussion_r832628526



##########
File path: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java
##########
@@ -809,27 +814,40 @@ public void run() {
     // Periodically check that metadata of tablets matches what is held in memory
     ThreadPools.watchCriticalScheduledTask(ThreadPools.getServerThreadPools()
         .createGeneralScheduledExecutorService(aconf).scheduleWithFixedDelay(() -> {
-          final SortedMap<KeyExtent,Tablet> onlineTabletsSnapshot = onlineTablets.snapshot();
-
-          Map<KeyExtent,Long> updateCounts = new HashMap<>();
+          Instant start = Instant.now();
+          Span mdScanSpan = TraceUtil.startSpan(this.getClass(), "metadataScan");
+          try (Scope scope = mdScanSpan.makeCurrent()) {
+            final SortedMap<KeyExtent,Tablet> onlineTabletsSnapshot = onlineTablets.snapshot();
 
-          // gather updateCounts for each tablet
-          onlineTabletsSnapshot.forEach((ke, tablet) -> {
-            updateCounts.put(ke, tablet.getUpdateCount());
-          });
+            Map<KeyExtent,Long> updateCounts = new HashMap<>();
 
-          // gather metadata for all tablets readTablets()
-          try (TabletsMetadata tabletsMetadata =
-              getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
-                  .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
-
-            // for each tablet, compare its metadata to what is held in memory
-            tabletsMetadata.forEach(tabletMetadata -> {
-              KeyExtent extent = tabletMetadata.getExtent();
-              Tablet tablet = onlineTabletsSnapshot.get(extent);
-              Long counter = updateCounts.get(extent);
-              tablet.compareTabletInfo(counter, tabletMetadata);
+            // gather updateCounts for each tablet
+            onlineTabletsSnapshot.forEach((ke, tablet) -> {
+              updateCounts.put(ke, tablet.getUpdateCount());
             });
+
+            // gather metadata for all tablets readTablets()
+            try (TabletsMetadata tabletsMetadata =
+                getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
+                    .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
+
+              // for each tablet, compare its metadata to what is held in memory
+              tabletsMetadata.forEach(tabletMetadata -> {
+                KeyExtent extent = tabletMetadata.getExtent();
+                Tablet tablet = onlineTabletsSnapshot.get(extent);
+                Long counter = updateCounts.get(extent);
+                tablet.compareTabletInfo(counter, tabletMetadata);
+              });
+            }
+          } finally {
+            mdScanSpan.end();
+            Instant end = Instant.now();
+            Duration duration = Duration.between(start, end);
+            log.debug("Metadata scan took {}ms", duration.toMillis());
+            if (duration.toMinutes() > 5) {
+              log.warn(
+                  "Metadata scan is taking too long. Performance of all activities will be severely hindered!");

Review comment:
       would be nice if the log message included how long it took and how many extents were looked up.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
milleruntime commented on pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#issuecomment-1076255248


   I was thinking that it might be nice to make the consistency check frequency configurable. Having it check more often on a  tserver would be a nice health check when debugging. Thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on a change in pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
milleruntime commented on a change in pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#discussion_r833551626



##########
File path: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java
##########
@@ -818,18 +823,36 @@ public void run() {
             updateCounts.put(ke, tablet.getUpdateCount());
           });
 
-          // gather metadata for all tablets readTablets()
-          try (TabletsMetadata tabletsMetadata =
-              getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
-                  .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
-
-            // for each tablet, compare its metadata to what is held in memory
-            tabletsMetadata.forEach(tabletMetadata -> {
-              KeyExtent extent = tabletMetadata.getExtent();
-              Tablet tablet = onlineTabletsSnapshot.get(extent);
-              Long counter = updateCounts.get(extent);
-              tablet.compareTabletInfo(counter, tabletMetadata);
-            });
+          Instant start = Instant.now();
+          Duration duration = Duration.ZERO;
+          int tabletCount = 0;
+          Span mdScanSpan = TraceUtil.startSpan(this.getClass(), "metadataScan");
+          try (Scope scope = mdScanSpan.makeCurrent()) {
+            // gather metadata for all tablets readTablets()
+            try (TabletsMetadata tabletsMetadata =
+                getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
+                    .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
+              mdScanSpan.end();
+              duration = Duration.between(start, Instant.now());
+
+              // for each tablet, compare its metadata to what is held in memory
+              for (var tabletMetadata : tabletsMetadata) {
+                tabletCount++;
+                KeyExtent extent = tabletMetadata.getExtent();
+                Tablet tablet = onlineTabletsSnapshot.get(extent);
+                Long counter = updateCounts.get(extent);
+                tablet.compareTabletInfo(counter, tabletMetadata);
+              }
+
+              log.debug("Metadata scan took {}ms for {} tablets read.", duration.toMillis(),
+                  tabletCount);
+            }
+          } finally {
+            if (duration.toMinutes() > 1) {
+              log.warn(
+                  "Metadata scan took {}ms for {} tablets. Performance of all activities will be severely hindered!",
+                  duration.toMillis(), tabletCount);
+            }

Review comment:
       > Could move the debug stmt closer to the metadata read in case there is an exception.
   
   Then I would have to remove the `tabletCount` from the debug message. I guess I could have 2 separate debug statements.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on a change in pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
milleruntime commented on a change in pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#discussion_r833116378



##########
File path: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java
##########
@@ -809,27 +814,40 @@ public void run() {
     // Periodically check that metadata of tablets matches what is held in memory
     ThreadPools.watchCriticalScheduledTask(ThreadPools.getServerThreadPools()
         .createGeneralScheduledExecutorService(aconf).scheduleWithFixedDelay(() -> {
-          final SortedMap<KeyExtent,Tablet> onlineTabletsSnapshot = onlineTablets.snapshot();
-
-          Map<KeyExtent,Long> updateCounts = new HashMap<>();
+          Instant start = Instant.now();
+          Span mdScanSpan = TraceUtil.startSpan(this.getClass(), "metadataScan");
+          try (Scope scope = mdScanSpan.makeCurrent()) {
+            final SortedMap<KeyExtent,Tablet> onlineTabletsSnapshot = onlineTablets.snapshot();
 
-          // gather updateCounts for each tablet
-          onlineTabletsSnapshot.forEach((ke, tablet) -> {
-            updateCounts.put(ke, tablet.getUpdateCount());
-          });
+            Map<KeyExtent,Long> updateCounts = new HashMap<>();
 
-          // gather metadata for all tablets readTablets()
-          try (TabletsMetadata tabletsMetadata =
-              getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
-                  .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
-
-            // for each tablet, compare its metadata to what is held in memory
-            tabletsMetadata.forEach(tabletMetadata -> {
-              KeyExtent extent = tabletMetadata.getExtent();
-              Tablet tablet = onlineTabletsSnapshot.get(extent);
-              Long counter = updateCounts.get(extent);
-              tablet.compareTabletInfo(counter, tabletMetadata);
+            // gather updateCounts for each tablet
+            onlineTabletsSnapshot.forEach((ke, tablet) -> {
+              updateCounts.put(ke, tablet.getUpdateCount());
             });
+
+            // gather metadata for all tablets readTablets()
+            try (TabletsMetadata tabletsMetadata =
+                getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
+                    .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
+
+              // for each tablet, compare its metadata to what is held in memory
+              tabletsMetadata.forEach(tabletMetadata -> {
+                KeyExtent extent = tabletMetadata.getExtent();
+                Tablet tablet = onlineTabletsSnapshot.get(extent);
+                Long counter = updateCounts.get(extent);
+                tablet.compareTabletInfo(counter, tabletMetadata);
+              });
+            }
+          } finally {
+            mdScanSpan.end();
+            Instant end = Instant.now();
+            Duration duration = Duration.between(start, end);
+            log.debug("Metadata scan took {}ms", duration.toMillis());
+            if (duration.toMinutes() > 5) {
+              log.warn(
+                  "Metadata scan is taking too long. Performance of all activities will be severely hindered!");

Review comment:
       Ah yeah, good idea.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] keith-turner commented on a change in pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
keith-turner commented on a change in pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#discussion_r834344272



##########
File path: core/src/main/java/org/apache/accumulo/core/conf/Property.java
##########
@@ -656,6 +656,8 @@
       "1.4.0"),
   TSERV_BULK_TIMEOUT("tserver.bulk.timeout", "5m", PropertyType.TIMEDURATION,
       "The time to wait for a tablet server to process a bulk import request.", "1.4.3"),
+  TSERV_HEALTH_CHECK_FREQ("tserver.health.check.interval", "30m", PropertyType.TIMEDURATION,

Review comment:
       Yeah that is the code I was thinking about.  There is already a property there, the warn time prop.  So probably would not make sense to use this new prop there.  The frequency of the compaction check should probably be 1/2 the warn time prop.
   
   > I noticed this method never gets called: 
   
   That is not good.
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] keith-turner commented on a change in pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
keith-turner commented on a change in pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#discussion_r834305694



##########
File path: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java
##########
@@ -803,38 +811,49 @@ public void run() {
         }
       }
     }, 0, 5, TimeUnit.SECONDS);
-    ThreadPools.watchNonCriticalScheduledTask(future);
+    watchNonCriticalScheduledTask(future);
 
-    int tabletCheckFrequency = 30 + random.nextInt(31); // random 30-60 minute delay
+    long tabletCheckFrequency = aconf.getTimeInMillis(Property.TSERV_HEALTH_CHECK_FREQ);
     // Periodically check that metadata of tablets matches what is held in memory
-    ThreadPools.watchCriticalScheduledTask(ThreadPools.getServerThreadPools()
-        .createGeneralScheduledExecutorService(aconf).scheduleWithFixedDelay(() -> {
-          final SortedMap<KeyExtent,Tablet> onlineTabletsSnapshot = onlineTablets.snapshot();
-
-          Map<KeyExtent,Long> updateCounts = new HashMap<>();
-
-          // gather updateCounts for each tablet
-          onlineTabletsSnapshot.forEach((ke, tablet) -> {
-            updateCounts.put(ke, tablet.getUpdateCount());
-          });
-
-          // gather metadata for all tablets readTablets()
-          try (TabletsMetadata tabletsMetadata =
-              getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
-                  .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
-
-            // for each tablet, compare its metadata to what is held in memory
-            tabletsMetadata.forEach(tabletMetadata -> {
-              KeyExtent extent = tabletMetadata.getExtent();
-              Tablet tablet = onlineTabletsSnapshot.get(extent);
-              Long counter = updateCounts.get(extent);
-              tablet.compareTabletInfo(counter, tabletMetadata);
-            });
+    watchCriticalFixedDelay(aconf, tabletCheckFrequency, () -> {
+      final SortedMap<KeyExtent,Tablet> onlineTabletsSnapshot = onlineTablets.snapshot();
+
+      Map<KeyExtent,Long> updateCounts = new HashMap<>();
+
+      // gather updateCounts for each tablet
+      onlineTabletsSnapshot.forEach((ke, tablet) -> {
+        updateCounts.put(ke, tablet.getUpdateCount());
+      });
+
+      Instant start = Instant.now();
+      Duration duration;
+      int tabletCount = 0;
+      Span mdScanSpan = TraceUtil.startSpan(this.getClass(), "metadataScan");
+      try (Scope scope = mdScanSpan.makeCurrent()) {
+        // gather metadata for all tablets readTablets()
+        try (TabletsMetadata tabletsMetadata =
+            getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
+                .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
+          mdScanSpan.end();
+          duration = Duration.between(start, Instant.now());
+
+          // for each tablet, compare its metadata to what is held in memory
+          for (var tabletMetadata : tabletsMetadata) {
+            tabletCount++;
+            KeyExtent extent = tabletMetadata.getExtent();
+            Tablet tablet = onlineTabletsSnapshot.get(extent);
+            Long counter = updateCounts.get(extent);
+            tablet.compareTabletInfo(counter, tabletMetadata);
           }
-        }, tabletCheckFrequency, tabletCheckFrequency, TimeUnit.MINUTES));
+
+          log.debug("Metadata scan took {}ms for {} tablets read.", duration.toMillis(),
+              tabletCount);

Review comment:
       Could make the following change and move the log stmt further up.  I realize its the number looked up and not the number returned.  That highlights something I have noticed before with this check, I don't think it detects tablets that do not exist in the metadata table at all but are being served by a tablet.
   
   ```suggestion
             log.debug("Metadata scan took {}ms for {} tablets read.", duration.toMillis(),
                 onlineTabletsSnapshot.keySet().size());
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on a change in pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
milleruntime commented on a change in pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#discussion_r833389600



##########
File path: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java
##########
@@ -809,27 +814,40 @@ public void run() {
     // Periodically check that metadata of tablets matches what is held in memory
     ThreadPools.watchCriticalScheduledTask(ThreadPools.getServerThreadPools()
         .createGeneralScheduledExecutorService(aconf).scheduleWithFixedDelay(() -> {
-          final SortedMap<KeyExtent,Tablet> onlineTabletsSnapshot = onlineTablets.snapshot();
-
-          Map<KeyExtent,Long> updateCounts = new HashMap<>();
+          Instant start = Instant.now();
+          Span mdScanSpan = TraceUtil.startSpan(this.getClass(), "metadataScan");
+          try (Scope scope = mdScanSpan.makeCurrent()) {
+            final SortedMap<KeyExtent,Tablet> onlineTabletsSnapshot = onlineTablets.snapshot();
 
-          // gather updateCounts for each tablet
-          onlineTabletsSnapshot.forEach((ke, tablet) -> {
-            updateCounts.put(ke, tablet.getUpdateCount());
-          });
+            Map<KeyExtent,Long> updateCounts = new HashMap<>();
 
-          // gather metadata for all tablets readTablets()
-          try (TabletsMetadata tabletsMetadata =
-              getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
-                  .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
-
-            // for each tablet, compare its metadata to what is held in memory
-            tabletsMetadata.forEach(tabletMetadata -> {
-              KeyExtent extent = tabletMetadata.getExtent();
-              Tablet tablet = onlineTabletsSnapshot.get(extent);
-              Long counter = updateCounts.get(extent);
-              tablet.compareTabletInfo(counter, tabletMetadata);
+            // gather updateCounts for each tablet
+            onlineTabletsSnapshot.forEach((ke, tablet) -> {
+              updateCounts.put(ke, tablet.getUpdateCount());
             });
+
+            // gather metadata for all tablets readTablets()
+            try (TabletsMetadata tabletsMetadata =
+                getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
+                    .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
+
+              // for each tablet, compare its metadata to what is held in memory
+              tabletsMetadata.forEach(tabletMetadata -> {
+                KeyExtent extent = tabletMetadata.getExtent();
+                Tablet tablet = onlineTabletsSnapshot.get(extent);
+                Long counter = updateCounts.get(extent);
+                tablet.compareTabletInfo(counter, tabletMetadata);

Review comment:
       Done in latest commits.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] keith-turner commented on pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
keith-turner commented on pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#issuecomment-1076448080


   This seems like a good short term way to get a sense of metadata table read performance.  Longer term it may be better to instrument the scanner and batch scanner to support emitting metrics.  Maybe the scanner and batch scanner could have a client property with a value that is a list of tables they would emit metrics for.  Then this could be flipped on and we could see metadata table read performance from across the cluster in a metrics system.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on a change in pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on a change in pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#discussion_r835563694



##########
File path: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java
##########
@@ -803,38 +811,46 @@ public void run() {
         }
       }
     }, 0, 5, TimeUnit.SECONDS);
-    ThreadPools.watchNonCriticalScheduledTask(future);
+    watchNonCriticalScheduledTask(future);
 
-    int tabletCheckFrequency = 30 + random.nextInt(31); // random 30-60 minute delay
+    long tabletCheckFrequency = aconf.getTimeInMillis(Property.TSERV_HEALTH_CHECK_FREQ);
     // Periodically check that metadata of tablets matches what is held in memory
-    ThreadPools.watchCriticalScheduledTask(ThreadPools.getServerThreadPools()
-        .createGeneralScheduledExecutorService(aconf).scheduleWithFixedDelay(() -> {
-          final SortedMap<KeyExtent,Tablet> onlineTabletsSnapshot = onlineTablets.snapshot();
-
-          Map<KeyExtent,Long> updateCounts = new HashMap<>();
-
-          // gather updateCounts for each tablet
-          onlineTabletsSnapshot.forEach((ke, tablet) -> {
-            updateCounts.put(ke, tablet.getUpdateCount());
-          });
-
-          // gather metadata for all tablets readTablets()
-          try (TabletsMetadata tabletsMetadata =
-              getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
-                  .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
-
-            // for each tablet, compare its metadata to what is held in memory
-            tabletsMetadata.forEach(tabletMetadata -> {
-              KeyExtent extent = tabletMetadata.getExtent();
-              Tablet tablet = onlineTabletsSnapshot.get(extent);
-              Long counter = updateCounts.get(extent);
-              tablet.compareTabletInfo(counter, tabletMetadata);
-            });
+    watchCriticalFixedDelay(aconf, tabletCheckFrequency, () -> {
+      final SortedMap<KeyExtent,Tablet> onlineTabletsSnapshot = onlineTablets.snapshot();
+
+      Map<KeyExtent,Long> updateCounts = new HashMap<>();
+
+      // gather updateCounts for each tablet
+      onlineTabletsSnapshot.forEach((ke, tablet) -> {
+        updateCounts.put(ke, tablet.getUpdateCount());
+      });
+
+      Instant start = Instant.now();
+      Duration duration;
+      Span mdScanSpan = TraceUtil.startSpan(this.getClass(), "metadataScan");
+      try (Scope scope = mdScanSpan.makeCurrent()) {
+        // gather metadata for all tablets readTablets()
+        try (TabletsMetadata tabletsMetadata =
+            getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
+                .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
+          mdScanSpan.end();
+          duration = Duration.between(start, Instant.now());
+          log.debug("Metadata scan took {}ms for {} tablets read.", duration.toMillis(),

Review comment:
       Duration probably has a toString that might be better to use. Or use DurationFormat?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on a change in pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
milleruntime commented on a change in pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#discussion_r834335277



##########
File path: core/src/main/java/org/apache/accumulo/core/conf/Property.java
##########
@@ -656,6 +656,8 @@
       "1.4.0"),
   TSERV_BULK_TIMEOUT("tserver.bulk.timeout", "5m", PropertyType.TIMEDURATION,
       "The time to wait for a tablet server to process a bulk import request.", "1.4.3"),
+  TSERV_HEALTH_CHECK_FREQ("tserver.health.check.interval", "30m", PropertyType.TIMEDURATION,

Review comment:
       Any idea where this other check is located?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
milleruntime commented on pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#issuecomment-1077611507


   @keith-turner @ctubbsii I think this PR is good to go.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on a change in pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
milleruntime commented on a change in pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#discussion_r833119505



##########
File path: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java
##########
@@ -809,27 +814,40 @@ public void run() {
     // Periodically check that metadata of tablets matches what is held in memory
     ThreadPools.watchCriticalScheduledTask(ThreadPools.getServerThreadPools()
         .createGeneralScheduledExecutorService(aconf).scheduleWithFixedDelay(() -> {
-          final SortedMap<KeyExtent,Tablet> onlineTabletsSnapshot = onlineTablets.snapshot();
-
-          Map<KeyExtent,Long> updateCounts = new HashMap<>();
+          Instant start = Instant.now();

Review comment:
       I agree. It is very hard to read and decipher. I think the two ThreadPools methods could be consolidated. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on a change in pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
milleruntime commented on a change in pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#discussion_r833423661



##########
File path: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java
##########
@@ -818,18 +823,36 @@ public void run() {
             updateCounts.put(ke, tablet.getUpdateCount());
           });
 
-          // gather metadata for all tablets readTablets()
-          try (TabletsMetadata tabletsMetadata =
-              getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
-                  .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
-
-            // for each tablet, compare its metadata to what is held in memory
-            tabletsMetadata.forEach(tabletMetadata -> {
-              KeyExtent extent = tabletMetadata.getExtent();
-              Tablet tablet = onlineTabletsSnapshot.get(extent);
-              Long counter = updateCounts.get(extent);
-              tablet.compareTabletInfo(counter, tabletMetadata);
-            });
+          Instant start = Instant.now();
+          Duration duration = Duration.ZERO;
+          int tabletCount = 0;
+          Span mdScanSpan = TraceUtil.startSpan(this.getClass(), "metadataScan");
+          try (Scope scope = mdScanSpan.makeCurrent()) {
+            // gather metadata for all tablets readTablets()
+            try (TabletsMetadata tabletsMetadata =
+                getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
+                    .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
+              mdScanSpan.end();
+              duration = Duration.between(start, Instant.now());
+
+              // for each tablet, compare its metadata to what is held in memory
+              for (var tabletMetadata : tabletsMetadata) {
+                tabletCount++;
+                KeyExtent extent = tabletMetadata.getExtent();
+                Tablet tablet = onlineTabletsSnapshot.get(extent);
+                Long counter = updateCounts.get(extent);
+                tablet.compareTabletInfo(counter, tabletMetadata);
+              }
+
+              log.debug("Metadata scan took {}ms for {} tablets read.", duration.toMillis(),
+                  tabletCount);
+            }
+          } finally {
+            if (duration.toMinutes() > 1) {
+              log.warn(
+                  "Metadata scan took {}ms for {} tablets. Performance of all activities will be severely hindered!",
+                  duration.toMillis(), tabletCount);
+            }

Review comment:
       I am not sure if we want to keep the warning. I don't know if it will ever get above 1 minute ore how useful it is once the performance gets that bad. At least with the debug statement we can get an idea of the performance on a large cluster.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] keith-turner commented on a change in pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
keith-turner commented on a change in pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#discussion_r833427237



##########
File path: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java
##########
@@ -818,18 +823,36 @@ public void run() {
             updateCounts.put(ke, tablet.getUpdateCount());
           });
 
-          // gather metadata for all tablets readTablets()
-          try (TabletsMetadata tabletsMetadata =
-              getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
-                  .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
-
-            // for each tablet, compare its metadata to what is held in memory
-            tabletsMetadata.forEach(tabletMetadata -> {
-              KeyExtent extent = tabletMetadata.getExtent();
-              Tablet tablet = onlineTabletsSnapshot.get(extent);
-              Long counter = updateCounts.get(extent);
-              tablet.compareTabletInfo(counter, tabletMetadata);
-            });
+          Instant start = Instant.now();
+          Duration duration = Duration.ZERO;
+          int tabletCount = 0;
+          Span mdScanSpan = TraceUtil.startSpan(this.getClass(), "metadataScan");
+          try (Scope scope = mdScanSpan.makeCurrent()) {
+            // gather metadata for all tablets readTablets()
+            try (TabletsMetadata tabletsMetadata =
+                getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
+                    .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
+              mdScanSpan.end();
+              duration = Duration.between(start, Instant.now());
+
+              // for each tablet, compare its metadata to what is held in memory
+              for (var tabletMetadata : tabletsMetadata) {
+                tabletCount++;
+                KeyExtent extent = tabletMetadata.getExtent();
+                Tablet tablet = onlineTabletsSnapshot.get(extent);
+                Long counter = updateCounts.get(extent);
+                tablet.compareTabletInfo(counter, tabletMetadata);
+              }
+
+              log.debug("Metadata scan took {}ms for {} tablets read.", duration.toMillis(),
+                  tabletCount);
+            }
+          } finally {
+            if (duration.toMinutes() > 1) {
+              log.warn(
+                  "Metadata scan took {}ms for {} tablets. Performance of all activities will be severely hindered!",
+                  duration.toMillis(), tabletCount);
+            }

Review comment:
       yeah maybe the warning could be dropped with the debug stmt.  Could move the debug stmt closer to the metadata read in case there is an exception.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime merged pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
milleruntime merged pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
milleruntime commented on pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#issuecomment-1075529206


   I did some quick testing locally using Uno and CI and this seems like a good gauge for general cluster performance. With the cluster sitting idle, scan times were in the range of 7 - 10 ms. With CI running, the times were around 200 - 400 ms. I modified the thread checker time to check every minute.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on a change in pull request #2583: Add trace and debug log to consistency check

Posted by GitBox <gi...@apache.org>.
milleruntime commented on a change in pull request #2583:
URL: https://github.com/apache/accumulo/pull/2583#discussion_r833120813



##########
File path: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java
##########
@@ -809,27 +814,40 @@ public void run() {
     // Periodically check that metadata of tablets matches what is held in memory
     ThreadPools.watchCriticalScheduledTask(ThreadPools.getServerThreadPools()
         .createGeneralScheduledExecutorService(aconf).scheduleWithFixedDelay(() -> {
-          final SortedMap<KeyExtent,Tablet> onlineTabletsSnapshot = onlineTablets.snapshot();
-
-          Map<KeyExtent,Long> updateCounts = new HashMap<>();
+          Instant start = Instant.now();
+          Span mdScanSpan = TraceUtil.startSpan(this.getClass(), "metadataScan");
+          try (Scope scope = mdScanSpan.makeCurrent()) {
+            final SortedMap<KeyExtent,Tablet> onlineTabletsSnapshot = onlineTablets.snapshot();
 
-          // gather updateCounts for each tablet
-          onlineTabletsSnapshot.forEach((ke, tablet) -> {
-            updateCounts.put(ke, tablet.getUpdateCount());
-          });
+            Map<KeyExtent,Long> updateCounts = new HashMap<>();
 
-          // gather metadata for all tablets readTablets()
-          try (TabletsMetadata tabletsMetadata =
-              getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
-                  .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
-
-            // for each tablet, compare its metadata to what is held in memory
-            tabletsMetadata.forEach(tabletMetadata -> {
-              KeyExtent extent = tabletMetadata.getExtent();
-              Tablet tablet = onlineTabletsSnapshot.get(extent);
-              Long counter = updateCounts.get(extent);
-              tablet.compareTabletInfo(counter, tabletMetadata);
+            // gather updateCounts for each tablet
+            onlineTabletsSnapshot.forEach((ke, tablet) -> {
+              updateCounts.put(ke, tablet.getUpdateCount());
             });
+
+            // gather metadata for all tablets readTablets()
+            try (TabletsMetadata tabletsMetadata =
+                getContext().getAmple().readTablets().forTablets(onlineTabletsSnapshot.keySet())
+                    .fetch(FILES, LOGS, ECOMP, PREV_ROW).build()) {
+
+              // for each tablet, compare its metadata to what is held in memory
+              tabletsMetadata.forEach(tabletMetadata -> {
+                KeyExtent extent = tabletMetadata.getExtent();
+                Tablet tablet = onlineTabletsSnapshot.get(extent);
+                Long counter = updateCounts.get(extent);
+                tablet.compareTabletInfo(counter, tabletMetadata);

Review comment:
       Interesting. Yeah I should to get a more pure measurement.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org