Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2021/02/10 21:40:46 UTC

[GitHub] [accumulo] DomGarguilo opened a new pull request #1888: Fixes #1791 - Flaky test: SuspendedTabletsIT

DomGarguilo opened a new pull request #1888:
URL: https://github.com/apache/accumulo/pull/1888


   Fixes #1791 
   Added a few fixes to increase the reliability of this test:
   
   - Increased the suspend duration of the tablets
   - Created the table with splits instead of adding splits after table creation. This should increase stability of balancing.
   - Increased time given for migrations to finish
   - Other small changes


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on pull request #1888: Fixes #1791 - Flaky test: SuspendedTabletsIT

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1888:
URL: https://github.com/apache/accumulo/pull/1888#issuecomment-768674007


   I tried building and running just this single test, and it timed out on my machine. After adding `-Dtimeout.factor=3`, I got past the timeout, but then I ran into:
   
   <details>
   <summary>
   
   ```java
   [ERROR] Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 605.616 s <<< FAILURE! - in org.apache.accumulo.test.master.SuspendedTabletsIT
   [ERROR] crashAndResumeTserver(org.apache.accumulo.test.master.SuspendedTabletsIT)  Time elapsed: 347.964 s  <<< FAILURE!
   ```
   
   </summary>
   
   ```java
   java.lang.AssertionError: Scanning of metadata failed, aborting
   	at org.junit.Assert.fail(Assert.java:89)
   	at org.apache.accumulo.test.master.SuspendedTabletsIT$TabletLocations.retrieve(SuspendedTabletsIT.java:306)
   	at org.apache.accumulo.test.master.SuspendedTabletsIT.suspensionTestBody(SuspendedTabletsIT.java:208)
   	at org.apache.accumulo.test.master.SuspendedTabletsIT.crashAndResumeTserver(SuspendedTabletsIT.java:101)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
   	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
   	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
   	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
   	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
   	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
   	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
   	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
   	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   	at java.base/java.lang.Thread.run(Thread.java:834)
   ```
   
   </details>
   
   <details>
   
   <summary>
   
   Interestingly, the scanning of the metadata failed in this case because a file for the metadata was deleted by the garbage collector about a minute before the TabletServer needed it to scan for metadata. I'm not sure how that happened:
   
   </summary>
   
   ```java
   2021-01-27T17:59:14,161 [gc.SimpleGarbageCollector] DEBUG: Deleting file:/home/christopher/git/apache/accumulo/accumulo/test/target/mini-tests/org.apache.accumulo.test.master.SuspendedTabletsIT_crashAndResumeTserver/accumulo/tables/
   ```
   
   ```java
   2021-01-27T18:00:10,691 [tserver.FileManager] ERROR: Failed to open file file:/home/christopher/git/apache/accumulo/accumulo/test/target/mini-tests/org.apache.accumulo.test.master.SuspendedTabletsIT_crashAndResumeTserver/accumulo/tables/!0/table_info/F0000038.rf java.io.FileNotFoundException: File file:/home/christopher/git/apache/accumulo/accumulo/test/target/mini-tests/org.apache.accumulo.test.master.SuspendedTabletsIT_crashAndResumeTserver/accumulo/tables/!0/table_info/F0000038.rf does not exist
   2021-01-27T18:00:10,693 [tserver.FileManager] ERROR: Failed to open file file:/home/christopher/git/apache/accumulo/accumulo/test/target/mini-tests/org.apache.accumulo.test.master.SuspendedTabletsIT_crashAndResumeTserver/accumulo/tables/!0/table_info/F0000038.rf java.io.FileNotFoundException: File file:/home/christopher/git/apache/accumulo/accumulo/test/target/mini-tests/org.apache.accumulo.test.master.SuspendedTabletsIT_crashAndResumeTserver/accumulo/tables/!0/table_info/F0000038.rf does not exist
   2021-01-27T18:00:10,693 [tserver.FileManager] ERROR: Failed to open file file:/home/christopher/git/apache/accumulo/accumulo/test/target/mini-tests/org.apache.accumulo.test.master.SuspendedTabletsIT_crashAndResumeTserver/accumulo/tables/!0/table_info/F0000038.rf java.io.FileNotFoundException: File file:/home/christopher/git/apache/accumulo/accumulo/test/target/mini-tests/org.apache.accumulo.test.master.SuspendedTabletsIT_crashAndResumeTserver/accumulo/tables/!0/table_info/F0000038.rf does not exist
   2021-01-27T18:00:10,694 [problems.ProblemReports] DEBUG: Filing problem report !0 FILE_READ file:/home/christopher/git/apache/accumulo/accumulo/test/target/mini-tests/org.apache.accumulo.test.master.SuspendedTabletsIT_crashAndResumeTserver/accumulo/tables/!0/table_info/F0000038.rf
   2021-01-27T18:00:10,694 [scan.LookupTask] WARN : lookup failed for tablet !0;~<                     
   java.io.IOException: Failed to open file:/home/christopher/git/apache/accumulo/accumulo/test/target/mini-tests/org.apache.accumulo.test.master.SuspendedTabletsIT_crashAndResumeTserver/accumulo/tables/!0/table_info/F0000038.rf
     at org.apache.accumulo.tserver.FileManager.reserveReaders(FileManager.java:331) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.tserver.FileManager$ScanFileManager.openFiles(FileManager.java:492) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.tserver.FileManager$ScanFileManager.openFiles(FileManager.java:501) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.tserver.tablet.ScanDataSource.createIterator(ScanDataSource.java:164) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.tserver.tablet.ScanDataSource.iterator(ScanDataSource.java:120) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.iteratorsImpl.system.SourceSwitchingIterator.seek(SourceSwitchingIterator.java:228) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.tserver.tablet.Tablet.lookup(Tablet.java:493) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.tserver.tablet.Tablet.lookup(Tablet.java:646) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.tserver.scan.LookupTask.run(LookupTask.java:117) [accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.tserver.session.ScanSession$ScanMeasurer.run(ScanSession.java:54) [accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57) [htrace-core-3.2.0-incubating.jar:3.2.0-incubating]
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]          
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]          
     at java.lang.Thread.run(Thread.java:834) [?:?]                                                    
   Caused by: java.io.UncheckedIOException: java.io.FileNotFoundException: File file:/home/christopher/git/apache/accumulo/accumulo/test/target/mini-tests/org.apache.accumulo.test.master.SuspendedTabletsIT_crashAndResumeTserver/accumulo/tables/!0/table_info/F0000038.rf does not exist
     at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader$BCFileLoader.load(CachableBlockFile.java:227) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.blockfile.cache.lru.SynchronousLoadingBlockCache.getBlock(SynchronousLoadingBlockCache.java:127) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.blockfile.cache.lru.SynchronousLoadingBlockCache.resolveDependencies(SynchronousLoadingBlockCache.java:64) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.blockfile.cache.lru.SynchronousLoadingBlockCache.getBlock(SynchronousLoadingBlockCache.java:109) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:381) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.rfile.RFile$Reader.<init>(RFile.java:1164) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.rfile.RFile$Reader.<init>(RFile.java:1256) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.rfile.RFileOperations.getReader(RFileOperations.java:55) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.rfile.RFileOperations.openReader(RFileOperations.java:70) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.DispatchingFileFactory.openReader(DispatchingFileFactory.java:85) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.FileOperations$ReaderBuilder.build(FileOperations.java:449) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.tserver.FileManager.reserveReaders(FileManager.java:309) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     ... 13 more                                                                                       
   Caused by: java.io.FileNotFoundException: File file:/home/christopher/git/apache/accumulo/accumulo/test/target/mini-tests/org.apache.accumulo.test.master.SuspendedTabletsIT_crashAndResumeTserver/accumulo/tables/!0/table_info/F0000038.rf does not exist
     at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:668) ~[hadoop-client-api-3.3.0.jar:?]
     at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:989) ~[hadoop-client-api-3.3.0.jar:?]
     at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:658) ~[hadoop-client-api-3.3.0.jar:?]
     at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:460) ~[hadoop-client-api-3.3.0.jar:?]
     at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:155) ~[hadoop-client-api-3.3.0.jar:?]
     at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:356) ~[hadoop-client-api-3.3.0.jar:?]
     at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:945) ~[hadoop-client-api-3.3.0.jar:?]     
     at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$CachableBuilder.lambda$fsPath$0(CachableBlockFile.java:92) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBCFile(CachableBlockFile.java:167) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader$BCFileLoader.load(CachableBlockFile.java:225) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.blockfile.cache.lru.SynchronousLoadingBlockCache.getBlock(SynchronousLoadingBlockCache.java:127) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.blockfile.cache.lru.SynchronousLoadingBlockCache.resolveDependencies(SynchronousLoadingBlockCache.java:64) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.blockfile.cache.lru.SynchronousLoadingBlockCache.getBlock(SynchronousLoadingBlockCache.java:109) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:381) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.rfile.RFile$Reader.<init>(RFile.java:1164) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.rfile.RFile$Reader.<init>(RFile.java:1256) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.rfile.RFileOperations.getReader(RFileOperations.java:55) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.rfile.RFileOperations.openReader(RFileOperations.java:70) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.DispatchingFileFactory.openReader(DispatchingFileFactory.java:85) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.FileOperations$ReaderBuilder.build(FileOperations.java:449) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.tserver.FileManager.reserveReaders(FileManager.java:309) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     ... 13 more    
   ```
   
   </details>





[GitHub] [accumulo] ctubbsii commented on a change in pull request #1888: Fixes #1791 - Flaky test: SuspendedTabletsIT

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on a change in pull request #1888:
URL: https://github.com/apache/accumulo/pull/1888#discussion_r579404037



##########
File path: test/src/main/java/org/apache/accumulo/test/manager/SuspendedTabletsIT.java
##########
@@ -117,7 +117,12 @@ public void shutdownAndResumeTserver() throws Exception {
     suspensionTestBody((ctx, locs, count) -> {
       Set<TServerInstance> tserversSet = new HashSet<>();
       for (TabletLocationState tls : locs.locationStates.values()) {
-        if (tls.current != null) {
+
+        TabletLocator tl = TabletLocator.getLocator(ctx, MetadataTable.ID);
+        String metadataTserver =
+            tl.locateTablet(ctx, tls.extent.toMetaRow(), false, false).tablet_location;
+        // if the server does not hold the metadata, add it to the list to be shutdown
+        if (tls.current != null && !tls.current.toString().startsWith(metadataTserver)) {

Review comment:
       It looks like this code checks whether the tablet has a currently assigned location, and *only* if that location is *NOT* also hosting that tablet's own metadata is the tserver considered okay to shut down.
   
   However, it seems like there are two problems with this:
   
   1. the tserver could still be added to the list if another tablet adds it, so it could still cause problems, and
   2. this could result in no tservers shutting down, therefore not actually checking the conditions the test is intended to check







[GitHub] [accumulo] DomGarguilo commented on pull request #1888: Fixes #1791 - Flaky test: SuspendedTabletsIT

Posted by GitBox <gi...@apache.org>.
DomGarguilo commented on pull request #1888:
URL: https://github.com/apache/accumulo/pull/1888#issuecomment-778462710


   @Manno15 and I have done some more investigation.
   
   The main issue we have found stems from the way the tservers are shut down. This portion of the test often hangs between the shutdowns of the two tservers. This time difference can lead to the suspend time expiring, which leaves the tablets free to migrate. There are two parts to the test: one that shuts down the servers and one that crashes the servers. The version that crashes the servers leaves no time between server deaths for this migration-inducing gap. It seems that the rest of the test behaves as expected the majority of the time, and the flaky process used to shut down the servers is causing the error.
   
   I'm not sure what a solution to this might be, since the test's purpose is to ensure that suspending tablets behaves as expected whether servers are crashed or shut down, meaning we need to test with server shutdown.





[GitHub] [accumulo] DomGarguilo commented on pull request #1888: Fixes #1791 - Flaky test: SuspendedTabletsIT

Posted by GitBox <gi...@apache.org>.
DomGarguilo commented on pull request #1888:
URL: https://github.com/apache/accumulo/pull/1888#issuecomment-782299921


   > Would it be worth adding a custom balancer config to ensure the tserver hosting the metadata is left alone, but that we still shut down the other tservers?
   
   I tried to look into this a bit but couldn't find a great way to do it. This is the simplest way I could figure out; however, if it seems like a custom balancer would be a better solution, I can try that instead.
   
   > Or, should we just make sure the suspend time is sufficiently long enough that any metadata tablets that got shut down are reassigned and recovered, before the suspend time expires? So that way, it doesn't matter if the metadata was hosted on a tserver that was shut down?
   
   With the current solution, the tserver with the metadata is not shut down. Are you suggesting we include the tserver with the metadata on it in the list of servers to be shut down?
   





[GitHub] [accumulo] DomGarguilo commented on pull request #1888: Fixes #1791 - Flaky test: SuspendedTabletsIT

Posted by GitBox <gi...@apache.org>.
DomGarguilo commented on pull request #1888:
URL: https://github.com/apache/accumulo/pull/1888#issuecomment-782162252


   Upon further investigation, @Manno15 and I confirmed that the error causing this test to be flaky stemmed from shutting down the tserver hosting the metadata. The changes in the newest commits prevent shutting down the tserver with the metadata on it. This should be ready for review now.





[GitHub] [accumulo] ctubbsii commented on a change in pull request #1888: Fixes #1791 - Flaky test: SuspendedTabletsIT

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on a change in pull request #1888:
URL: https://github.com/apache/accumulo/pull/1888#discussion_r579480765



##########
File path: test/src/main/java/org/apache/accumulo/test/manager/SuspendedTabletsIT.java
##########
@@ -117,7 +117,12 @@ public void shutdownAndResumeTserver() throws Exception {
     suspensionTestBody((ctx, locs, count) -> {
       Set<TServerInstance> tserversSet = new HashSet<>();
       for (TabletLocationState tls : locs.locationStates.values()) {
-        if (tls.current != null) {
+
+        TabletLocator tl = TabletLocator.getLocator(ctx, MetadataTable.ID);
+        String metadataTserver =
+            tl.locateTablet(ctx, tls.extent.toMetaRow(), false, false).tablet_location;
+        // if the server does not hold the metadata, add it to the list to be shutdown
+        if (tls.current != null && !tls.current.toString().startsWith(metadataTserver)) {

Review comment:
       > > 1. the tserver could still be added to the list if another tablet adds it, so it could still cause problems, and
   > 
   > I don't see how this could happen. The name of the tserver with the metadata is compared with the tserver that the current tablet is on. Therefore any tablet on that server should yield the same result.
   > 
   > > 2. this could result in no tservers shutting down, therefore not actually checking the conditions the test is intended to check
   > 
   > This would only happen if the metadata was not on any of the servers. I'm not sure this could happen since the metadata would need to be on one of the available servers from my understanding.
   > 
   
   My understanding is that it is looping through all tablet locations, each assigned to `tls` as it loops. For each tablet (`tls.extent`), it tries to find the location of the metadata tablet that is hosting that tablet's metadata (locating `tls.extent.toMetaRow()`). It will *not* add the tserver if that metadata location is the same as the tablet's current location.
   
   However, consider tablets `A` and `B`, whose metadata tablets are in `m(A)` and `m(B)`:
   
   ```
   tserver1 contains A, B, and m(A)
   tserver2 contains m(B)
   ```
   
   As you go through this loop, we see tablet `A` first, but since `m(A)` and `A` are both hosted on `tserver1`, it is not added to the list to shut down. Next, going through the loop, we reach tablet `B`, but since `m(B)` is hosted on `tserver2`, then `tserver1` is added to the list to shut down.
   
   `tserver1` is still added to the list to shut down at the end of the loop, not by tablet `A`, but by tablet `B`. To get your strategy to work, you would have to add *all* tservers whose metadata tablets were hosted on the same tserver to a block list, and then do `tserversSet.removeAll(noShutdownList)` at the end of this loop. However, you could easily get into a situation where all tservers are shut down (`tserver1 = {A, B}, tserver2 = {m(A), m(B)}`), or none are (`tserver1 = {A, m(A)}, tserver2 = {B, m(B)}`).
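
The scenario above can be reproduced with a minimal, self-contained sketch. It does not use the Accumulo API: `TabletInfo`, `perTabletCheck`, and `blockListCheck` are hypothetical names introduced only for illustration, and server names are plain strings. The first method mirrors the per-tablet condition in the diff; the second is one possible reading of the block-list idea with `removeAll` at the end of the loop.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ShutdownSelectionSketch {

  /** Hypothetical stand-in for one tablet: its own host and the host of its metadata tablet. */
  static final class TabletInfo {
    final String name;
    final String currentServer;
    final String metadataServer;

    TabletInfo(String name, String currentServer, String metadataServer) {
      this.name = name;
      this.currentServer = currentServer;
      this.metadataServer = metadataServer;
    }
  }

  /** Per-tablet check: add the host unless it also hosts this particular tablet's metadata. */
  static Set<String> perTabletCheck(List<TabletInfo> tablets) {
    Set<String> tserversSet = new HashSet<>();
    for (TabletInfo t : tablets) {
      if (t.currentServer != null && !t.currentServer.equals(t.metadataServer)) {
        tserversSet.add(t.currentServer);
      }
    }
    return tserversSet;
  }

  /** Block-list variant: remember servers that co-host a tablet and its metadata, remove them last. */
  static Set<String> blockListCheck(List<TabletInfo> tablets) {
    Set<String> tserversSet = new HashSet<>();
    Set<String> noShutdownList = new HashSet<>();
    for (TabletInfo t : tablets) {
      if (t.currentServer == null) {
        continue;
      }
      if (t.currentServer.equals(t.metadataServer)) {
        noShutdownList.add(t.currentServer);
      } else {
        tserversSet.add(t.currentServer);
      }
    }
    tserversSet.removeAll(noShutdownList);
    return tserversSet;
  }

  public static void main(String[] args) {
    // tserver1 contains A, B, and m(A); tserver2 contains m(B)
    List<TabletInfo> tablets = Arrays.asList(
        new TabletInfo("A", "tserver1", "tserver1"),
        new TabletInfo("B", "tserver1", "tserver2"));
    System.out.println(perTabletCheck(tablets)); // prints [tserver1]
    System.out.println(blockListCheck(tablets)); // prints []
  }
}
```

With this layout, the per-tablet check still selects `tserver1` because of tablet `B`, while the block-list variant selects nothing, which illustrates both concerns: the metadata host can still be shut down, or no tserver is shut down at all.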
   
   > It may be a good idea to add an assert to make sure two servers are on track to be shutdown.
   
   I thought this test only ran two total tservers.







[GitHub] [accumulo] DomGarguilo commented on pull request #1888: Fixes #1791 - Flaky test: SuspendedTabletsIT

Posted by GitBox <gi...@apache.org>.
DomGarguilo commented on pull request #1888:
URL: https://github.com/apache/accumulo/pull/1888#issuecomment-773538393


   From your results, this test is still indeed flaky despite my changes. A few of these errors I have never received while running this test on my machine. I will leave my analysis of the errors below:  
   
   **The failed assert on line 253:** 
   The test keeps track of when it can first scan the metadata table after it kills the servers. A comment at this point claims that a tablet cannot be suspended before the metadata table can be scanned. It also keeps track of when all suspended tablets are re-hosted. The test asserts that the suspend time is less than the time between those two points. It seems this assert checks that the suspend time is roughly the duration we think it will be (between the first scan of the metadata table and re-hosting). Either this is an incorrect assumption, the suspend time is lasting longer than it should, or the tablets are re-hosted before they should be.
   
   **The failed assert on line 243:**
   This checks that the suspended tablet locations on a server are the same as the tablet locations on that server after it is restarted, that is, that the tablets are reassigned to the same server in the same location after the restart. If it fails, either tablets were lost or migrated, or the way the test checks this is incorrect.
   
   **The failed assert on line 225:** 
   This is the most common assert I have received (maybe because it happens first in the test, blocking the others). This occurs when the tablet locations, before and after servers are suspended, differ. This means that the tablets are migrating when they are not supposed to, or the way the test measures the locations is flawed.
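
As a rough illustration of the timing check described for line 253, here is a minimal sketch under stated assumptions. It is not the test's actual code: `rehostedAfterSuspendExpired` and its parameters are hypothetical names, timestamps are assumed to be `System.nanoTime()`-style values, and the comparison is assumed to be strict.

```java
import java.util.concurrent.TimeUnit;

final class SuspendTimingSketch {

  /**
   * Returns true when the time between the first successful metadata scan and the
   * re-hosting of all suspended tablets exceeds the configured suspend duration.
   */
  static boolean rehostedAfterSuspendExpired(long firstScanNanos, long rehostNanos,
      long suspendSeconds) {
    long elapsedNanos = rehostNanos - firstScanNanos;
    return TimeUnit.SECONDS.toNanos(suspendSeconds) < elapsedNanos;
  }

  public static void main(String[] args) {
    long firstScan = 0L;
    long rehostAfterSix = TimeUnit.SECONDS.toNanos(6);
    long rehostAfterThree = TimeUnit.SECONDS.toNanos(3);
    // Re-hosted 6 s after the first scan with a 5 s suspend duration: check passes.
    System.out.println(rehostedAfterSuspendExpired(firstScan, rehostAfterSix, 5)); // prints true
    // Re-hosted after only 3 s: tablets came back before the suspension could expire.
    System.out.println(rehostedAfterSuspendExpired(firstScan, rehostAfterThree, 5)); // prints false
  }
}
```

A failure of a check like this is consistent with the third explanation above, tablets being re-hosted earlier than the suspend duration allows, but the sketch alone cannot distinguish it from a wrong assumption about where the interval starts.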
   





[GitHub] [accumulo] Manno15 commented on pull request #1888: Fixes #1791 - Flaky test: SuspendedTabletsIT

Posted by GitBox <gi...@apache.org>.
Manno15 commented on pull request #1888:
URL: https://github.com/apache/accumulo/pull/1888#issuecomment-778491267


   > The main issue we have found stems from the way the tservers are shutdown. This portion of the test often hangs between the time two tservers are shut down
   
   This seems to be the part of the test with the highest resource and performance dependency: it will pass in IntelliJ but fail in the terminal.
   
   It does stand to reason that increasing the SuspendDuration will increase reliability, since there is less of a chance that the tablets become unsuspended before the tservers are properly recovered. However, this will increase the amount of time the test takes, since the latter part of each test waits for the suspend duration to run out to see whether those tablets are properly unsuspended
   (https://github.com/apache/accumulo/blob/87548d42c7bc02d567918f8333f0be9ed24698e8/test/src/main/java/org/apache/accumulo/test/manager/SuspendedTabletsIT.java#L248). To combat this, we would have to increase the timeout.
   
   I would suggest splitting that feature of the test out into its own test. This way, we can have a longer duration for the initial issue mentioned above and then a table with a shorter SuspendDuration (since it is a per-table property, I believe) specifically for testing that tablets get reassigned once that duration ends. This will hopefully make things more reliable without increasing the amount of time the test takes by too much.
   
   
   
   





[GitHub] [accumulo] DomGarguilo commented on pull request #1888: Fixes #1791 - Flaky test: SuspendedTabletsIT

Posted by GitBox <gi...@apache.org>.
DomGarguilo commented on pull request #1888:
URL: https://github.com/apache/accumulo/pull/1888#issuecomment-786062253


   The changes in my most recent commit should account for the mentioned concerns. This should be ready for review again.





[GitHub] [accumulo] ctubbsii edited a comment on pull request #1888: Fixes #1791 - Flaky test: SuspendedTabletsIT

Posted by GitBox <gi...@apache.org>.
ctubbsii edited a comment on pull request #1888:
URL: https://github.com/apache/accumulo/pull/1888#issuecomment-770085271


   I ran the full ITs with this PR on Jenkins and the only test that failed was this one:
   
   ```java
   java.lang.AssertionError
   	at org.junit.Assert.fail(Assert.java:87)
   	at org.junit.Assert.assertTrue(Assert.java:42)
   	at org.junit.Assert.assertTrue(Assert.java:53)
   	at org.apache.accumulo.test.master.SuspendedTabletsIT.suspensionTestBody(SuspendedTabletsIT.java:253)
   	at org.apache.accumulo.test.master.SuspendedTabletsIT.shutdownAndResumeTserver(SuspendedTabletsIT.java:118)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	...
   ```
   
   <details>
   <summary>More output from several subsequent runs:</summary>
   
   ```java
   [INFO] Running org.apache.accumulo.test.master.SuspendedTabletsIT
   [ERROR] Tests run: 9, Failures: 3, Errors: 5, Skipped: 0, Time elapsed: 1,333.444 s <<< FAILURE! - in org.apache.accumulo.test.master.SuspendedTabletsIT
   [ERROR] shutdownAndResumeTserver(org.apache.accumulo.test.master.SuspendedTabletsIT)  Time elapsed: 182.585 s  <<< FAILURE!
   java.lang.AssertionError
   	at org.junit.Assert.fail(Assert.java:87)
   	at org.junit.Assert.assertTrue(Assert.java:42)
   	at org.junit.Assert.assertTrue(Assert.java:53)
   	at org.apache.accumulo.test.master.SuspendedTabletsIT.suspensionTestBody(SuspendedTabletsIT.java:253)
   	at org.apache.accumulo.test.master.SuspendedTabletsIT.shutdownAndResumeTserver(SuspendedTabletsIT.java:118)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
   	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
   	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
   	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
   	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
   	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
   	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
   	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
   	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   	at java.base/java.lang.Thread.run(Thread.java:834)
   
   [ERROR] shutdownAndResumeTserver(org.apache.accumulo.test.master.SuspendedTabletsIT)  Time elapsed: 125.98 s  <<< FAILURE!
   java.lang.AssertionError: expected:<[1;21;20, 1;1<, 1;10;1, 1;19;18, 1;13;12, 1;24;23, 1;26;25, 1;3;29, 1;17;16, 1;8;7]> but was:<[1;1<, 1;10;1, 1;19;18, 1;3;29, 1;21;20, 1;12;11, 1;13;12, 1;24;23, 1;4;3, 1;26;25, 1;27;26, 1;17;16, 1;8;7]>
   	at org.junit.Assert.fail(Assert.java:89)
   	at org.junit.Assert.failNotEquals(Assert.java:835)
   	at org.junit.Assert.assertEquals(Assert.java:120)
   	at org.junit.Assert.assertEquals(Assert.java:146)
   	at org.apache.accumulo.test.master.SuspendedTabletsIT.suspensionTestBody(SuspendedTabletsIT.java:243)
   	at org.apache.accumulo.test.master.SuspendedTabletsIT.shutdownAndResumeTserver(SuspendedTabletsIT.java:118)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
   	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
   	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
   	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
   	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
   	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
   	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
   	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
   	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   	at java.base/java.lang.Thread.run(Thread.java:834)
   
   [ERROR] shutdownAndResumeTserver(org.apache.accumulo.test.master.SuspendedTabletsIT)  Time elapsed: 194.216 s  <<< FAILURE!
   java.lang.AssertionError: expected:<[1;9;8, 1;1<, 1;19;18, 1;23;22, 1;4;3, 1;14;13, 1;16;15, 1;7;6, 1;17;16]> but was:<[1;9;8, 1;1<, 1;11;10, 1;19;18, 1;23;22, 1;4;3, 1;14;13, 1;16;15, 1;7;6, 1;17;16]>
   	at org.junit.Assert.fail(Assert.java:89)
   	at org.junit.Assert.failNotEquals(Assert.java:835)
   	at org.junit.Assert.assertEquals(Assert.java:120)
   	at org.junit.Assert.assertEquals(Assert.java:146)
   	at org.apache.accumulo.test.master.SuspendedTabletsIT.suspensionTestBody(SuspendedTabletsIT.java:225)
   	at org.apache.accumulo.test.master.SuspendedTabletsIT.shutdownAndResumeTserver(SuspendedTabletsIT.java:118)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
   	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
   	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
   	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
   	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
   	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
   	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
   	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
   	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   	at java.base/java.lang.Thread.run(Thread.java:834)
   
   [ERROR] shutdownAndResumeTserver(org.apache.accumulo.test.master.SuspendedTabletsIT)  Time elapsed: 24.635 s  <<< ERROR!
   org.apache.accumulo.core.client.AccumuloException: Failed to connect to zookeeper (localhost:35593) within 2x zookeeper timeout period 5000
   	at org.apache.accumulo.core.clientImpl.TableOperationsImpl.doFateOperation(TableOperationsImpl.java:402)
   	at org.apache.accumulo.core.clientImpl.TableOperationsImpl.doFateOperation(TableOperationsImpl.java:354)
   	at org.apache.accumulo.core.clientImpl.TableOperationsImpl.doTableFateOperation(TableOperationsImpl.java:1692)
   	at org.apache.accumulo.core.clientImpl.TableOperationsImpl.create(TableOperationsImpl.java:245)
   	at org.apache.accumulo.test.master.SuspendedTabletsIT.suspensionTestBody(SuspendedTabletsIT.java:179)
   	at org.apache.accumulo.test.master.SuspendedTabletsIT.shutdownAndResumeTserver(SuspendedTabletsIT.java:118)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
   	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
   	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
   	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
   	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
   	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
   	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
   	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
   	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   	at java.base/java.lang.Thread.run(Thread.java:834)
   Caused by: java.lang.RuntimeException: Failed to connect to zookeeper (localhost:35593) within 2x zookeeper timeout period 5000
   	at org.apache.accumulo.fate.zookeeper.ZooSession.connect(ZooSession.java:162)
   	at org.apache.accumulo.fate.zookeeper.ZooSession.getSession(ZooSession.java:215)
   	at org.apache.accumulo.fate.zookeeper.ZooSession.getAnonymousSession(ZooSession.java:188)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.getZooKeeper(ZooReader.java:57)
   	at org.apache.accumulo.fate.zookeeper.ZooCache.getZooKeeper(ZooCache.java:150)
   	at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:414)
   	at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:387)
   	at org.apache.accumulo.fate.zookeeper.ZooCache$ZooRunnable.retry(ZooCache.java:279)
   	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:442)
   	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:372)
   	at org.apache.accumulo.core.clientImpl.ClientContext.getInstanceID(ClientContext.java:414)
   	at org.apache.accumulo.core.clientImpl.ClientContext.getInstanceID(ClientContext.java:403)
   	at org.apache.accumulo.core.clientImpl.ClientContext.getMasterLocations(ClientContext.java:364)
   	at org.apache.accumulo.core.clientImpl.MasterClient.getConnection(MasterClient.java:59)
   	at org.apache.accumulo.core.clientImpl.MasterClient.getConnectionWithRetry(MasterClient.java:49)
   	at org.apache.accumulo.core.clientImpl.TableOperationsImpl.beginFateOperation(TableOperationsImpl.java:257)
   	at org.apache.accumulo.core.clientImpl.TableOperationsImpl.doFateOperation(TableOperationsImpl.java:364)
   	... 19 more
   
   [ERROR] shutdownAndResumeTserver(org.apache.accumulo.test.master.SuspendedTabletsIT)  Time elapsed: 300.013 s  <<< ERROR!
   org.junit.runners.model.TestTimedOutException: test timed out after 300 seconds
   	at java.base@11.0.9.1/java.net.SocketInputStream.socketRead0(Native Method)
   	at java.base@11.0.9.1/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
   	at java.base@11.0.9.1/java.net.SocketInputStream.read(SocketInputStream.java:168)
   	at java.base@11.0.9.1/java.net.SocketInputStream.read(SocketInputStream.java:140)
   	at java.base@11.0.9.1/java.io.BufferedInputStream.fill(BufferedInputStream.java:252)
   	at java.base@11.0.9.1/java.io.BufferedInputStream.read1(BufferedInputStream.java:292)
   	at java.base@11.0.9.1/java.io.BufferedInputStream.read(BufferedInputStream.java:351)
   	at app//org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
   	at app//org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
   	at app//org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:132)
   	at app//org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:100)
   	at app//org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
   	at app//org.apache.accumulo.core.clientImpl.ThriftTransportPool$CachedTTransport.readAll(ThriftTransportPool.java:546)
   	at app//org.apache.thrift.protocol.TCompactProtocol.readByte(TCompactProtocol.java:637)
   	at app//org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:505)
   	at app//org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
   	at app//org.apache.accumulo.core.master.thrift.MasterClientService$Client.recv_shutdownTabletServer(MasterClientService.java:404)
   	at app//org.apache.accumulo.core.master.thrift.MasterClientService$Client.shutdownTabletServer(MasterClientService.java:388)
   	at app//org.apache.accumulo.test.master.SuspendedTabletsIT.lambda$shutdownAndResumeTserver$1(SuspendedTabletsIT.java:132)
   	at app//org.apache.accumulo.test.master.SuspendedTabletsIT$$Lambda$208/0x00000008402d7440.execute(Unknown Source)
   	at app//org.apache.accumulo.core.clientImpl.MasterClient.executeGeneric(MasterClient.java:139)
   	at app//org.apache.accumulo.core.clientImpl.MasterClient.executeVoid(MasterClient.java:190)
   	at app//org.apache.accumulo.test.master.SuspendedTabletsIT.lambda$shutdownAndResumeTserver$2(SuspendedTabletsIT.java:130)
   	at app//org.apache.accumulo.test.master.SuspendedTabletsIT$$Lambda$143/0x00000008401e7c40.eliminateTabletServers(Unknown Source)
   	at app//org.apache.accumulo.test.master.SuspendedTabletsIT.suspensionTestBody(SuspendedTabletsIT.java:204)
   	at app//org.apache.accumulo.test.master.SuspendedTabletsIT.shutdownAndResumeTserver(SuspendedTabletsIT.java:118)
   	at java.base@11.0.9.1/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at java.base@11.0.9.1/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at java.base@11.0.9.1/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.base@11.0.9.1/java.lang.reflect.Method.invoke(Method.java:566)
   	at app//org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
   	at app//org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
   	at app//org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
   	at app//org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
   	at app//org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
   	at app//org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
   	at app//org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
   	at app//org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
   	at java.base@11.0.9.1/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   	at java.base@11.0.9.1/java.lang.Thread.run(Thread.java:834)
   
   [ERROR] shutdownAndResumeTserver(org.apache.accumulo.test.master.SuspendedTabletsIT)  Time elapsed: 300.013 s  <<< ERROR!
   java.lang.Exception: Appears to be stuck in thread Time-limited test-SendThread(localhost:42101)
   	at java.base@11.0.9.1/sun.nio.ch.EPoll.wait(Native Method)
   	at java.base@11.0.9.1/sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:120)
   	at java.base@11.0.9.1/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:124)
   	at java.base@11.0.9.1/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:136)
   	at app//org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:347)
   	at app//org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1223)
   
   [ERROR] shutdownAndResumeTserver(org.apache.accumulo.test.master.SuspendedTabletsIT)  Time elapsed: 300.007 s  <<< ERROR!
   org.junit.runners.model.TestTimedOutException: test timed out after 300 seconds
   	at java.base@11.0.9.1/java.net.SocketInputStream.socketRead0(Native Method)
   	at java.base@11.0.9.1/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
   	at java.base@11.0.9.1/java.net.SocketInputStream.read(SocketInputStream.java:168)
   	at java.base@11.0.9.1/java.net.SocketInputStream.read(SocketInputStream.java:140)
   	at java.base@11.0.9.1/java.io.BufferedInputStream.fill(BufferedInputStream.java:252)
   	at java.base@11.0.9.1/java.io.BufferedInputStream.read1(BufferedInputStream.java:292)
   	at java.base@11.0.9.1/java.io.BufferedInputStream.read(BufferedInputStream.java:351)
   	at app//org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
   	at app//org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
   	at app//org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:132)
   	at app//org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:100)
   	at app//org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
   	at app//org.apache.accumulo.core.clientImpl.ThriftTransportPool$CachedTTransport.readAll(ThriftTransportPool.java:546)
   	at app//org.apache.thrift.protocol.TCompactProtocol.readByte(TCompactProtocol.java:637)
   	at app//org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:505)
   	at app//org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
   	at app//org.apache.accumulo.core.master.thrift.MasterClientService$Client.recv_shutdownTabletServer(MasterClientService.java:404)
   	at app//org.apache.accumulo.core.master.thrift.MasterClientService$Client.shutdownTabletServer(MasterClientService.java:388)
   	at app//org.apache.accumulo.test.master.SuspendedTabletsIT.lambda$shutdownAndResumeTserver$1(SuspendedTabletsIT.java:132)
   	at app//org.apache.accumulo.test.master.SuspendedTabletsIT$$Lambda$208/0x00000008402d7440.execute(Unknown Source)
   	at app//org.apache.accumulo.core.clientImpl.MasterClient.executeGeneric(MasterClient.java:139)
   	at app//org.apache.accumulo.core.clientImpl.MasterClient.executeVoid(MasterClient.java:190)
   	at app//org.apache.accumulo.test.master.SuspendedTabletsIT.lambda$shutdownAndResumeTserver$2(SuspendedTabletsIT.java:130)
   	at app//org.apache.accumulo.test.master.SuspendedTabletsIT$$Lambda$143/0x00000008401e7c40.eliminateTabletServers(Unknown Source)
   	at app//org.apache.accumulo.test.master.SuspendedTabletsIT.suspensionTestBody(SuspendedTabletsIT.java:204)
   	at app//org.apache.accumulo.test.master.SuspendedTabletsIT.shutdownAndResumeTserver(SuspendedTabletsIT.java:118)
   	at java.base@11.0.9.1/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at java.base@11.0.9.1/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at java.base@11.0.9.1/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.base@11.0.9.1/java.lang.reflect.Method.invoke(Method.java:566)
   	at app//org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
   	at app//org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
   	at app//org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
   	at app//org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
   	at app//org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
   	at app//org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
   	at app//org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
   	at app//org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
   	at java.base@11.0.9.1/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   	at java.base@11.0.9.1/java.lang.Thread.run(Thread.java:834)
   
   [ERROR] shutdownAndResumeTserver(org.apache.accumulo.test.master.SuspendedTabletsIT)  Time elapsed: 300.007 s  <<< ERROR!
   java.lang.Exception: Appears to be stuck in thread Time-limited test-SendThread(localhost:35265)
   	at java.base@11.0.9.1/sun.nio.ch.EPoll.wait(Native Method)
   	at java.base@11.0.9.1/sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:120)
   	at java.base@11.0.9.1/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:124)
   	at java.base@11.0.9.1/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:136)
   	at app//org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:347)
   	at app//org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1223)
   ```
   
   </details>





[GitHub] [accumulo] ctubbsii commented on pull request #1888: Fixes #1791 - Flaky test: SuspendedTabletsIT

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1888:
URL: https://github.com/apache/accumulo/pull/1888#issuecomment-770085271


   I ran the full ITs with this PR on Jenkins and the only test that failed was this one:
   
   ```java
   java.lang.AssertionError
   	at org.junit.Assert.fail(Assert.java:87)
   	at org.junit.Assert.assertTrue(Assert.java:42)
   	at org.junit.Assert.assertTrue(Assert.java:53)
   	at org.apache.accumulo.test.master.SuspendedTabletsIT.suspensionTestBody(SuspendedTabletsIT.java:253)
   	at org.apache.accumulo.test.master.SuspendedTabletsIT.shutdownAndResumeTserver(SuspendedTabletsIT.java:118)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	...
   ```





[GitHub] [accumulo] DomGarguilo closed pull request #1888: Fixes #1791 - Flaky test: SuspendedTabletsIT

Posted by GitBox <gi...@apache.org>.
DomGarguilo closed pull request #1888:
URL: https://github.com/apache/accumulo/pull/1888


   





[GitHub] [accumulo] DomGarguilo commented on pull request #1888: Fixes #1791 - Flaky test: SuspendedTabletsIT

Posted by GitBox <gi...@apache.org>.
DomGarguilo commented on pull request #1888:
URL: https://github.com/apache/accumulo/pull/1888#issuecomment-769239757


   @ctubbsii interesting. I have not gotten this error and have not yet been able to reproduce it. My guess is that this happens when the test hangs in its `retrieve()` method, which scans the tablet locations. The test often hits the retrieval timeout and then continues successfully, so I'm guessing that if the scan hangs here for too long, the files are GC'd, which leads to the error. I will look into this.
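
A minimal sketch of the timeout-and-retry pattern described above, for illustration only. It is not the test's actual `retrieve()` code: `TimedRetrieveSketch` and `retrieveWithRetry` are hypothetical names, the scan is modeled as a plain `Callable`, and the sketch assumes the scan responds to thread interruption.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimedRetrieveSketch {

  /**
   * Runs the scan on a helper thread, waiting at most timeoutMillis per attempt.
   * A timed-out attempt is cancelled (interrupted) and the scan is tried again.
   * Hypothetical helper, not part of Accumulo or of SuspendedTabletsIT.
   */
  static <T> T retrieveWithRetry(Callable<T> scan, long timeoutMillis, int maxAttempts)
      throws Exception {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        Future<T> future = executor.submit(scan);
        try {
          return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
          // Interrupt the hung scan so the next attempt can start on the same thread.
          future.cancel(true);
        }
      }
      throw new TimeoutException("scan did not finish within " + maxAttempts + " attempts");
    } finally {
      // Stop the helper thread whether the scan succeeded or not.
      executor.shutdownNow();
    }
  }
}
```

With this shape, a scan that hangs once and then succeeds still lets the caller continue, which matches the "hits the retrieval timeout and then continues successfully" behavior; a scan that hangs on every attempt surfaces as a `TimeoutException` instead of stalling the whole test.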





[GitHub] [accumulo] DomGarguilo commented on pull request #1888: Fixes #1791 - Flaky test: SuspendedTabletsIT

Posted by GitBox <gi...@apache.org>.
DomGarguilo commented on pull request #1888:
URL: https://github.com/apache/accumulo/pull/1888#issuecomment-782366827


   > I'm pointing out that metadata can be split across multiple tservers
   
   I was overlooking this while developing this attempted fix. I now see how this is faulty and will continue looking for a solution to address this issue.





[GitHub] [accumulo] ctubbsii commented on pull request #1888: Fixes #1791 - Flaky test: SuspendedTabletsIT

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1888:
URL: https://github.com/apache/accumulo/pull/1888#issuecomment-769449213


   > I will look into this.
   
   Thanks. This might be an entirely separate issue to track, and may be independent of this SuspendedTabletsIT. It definitely shouldn't be the case that Accumulo can delete a file that is later still needed. If you are able to confirm in your investigation that this is a separate issue (or at least, if you are unable to rule out that possibility), feel free to create a new issue to track it.





[GitHub] [accumulo] ctubbsii merged pull request #1888: Fixes #1791 - Flaky test: SuspendedTabletsIT

Posted by GitBox <gi...@apache.org>.
ctubbsii merged pull request #1888:
URL: https://github.com/apache/accumulo/pull/1888


   





[GitHub] [accumulo] DomGarguilo commented on a change in pull request #1888: Fixes #1791 - Flaky test: SuspendedTabletsIT

Posted by GitBox <gi...@apache.org>.
DomGarguilo commented on a change in pull request #1888:
URL: https://github.com/apache/accumulo/pull/1888#discussion_r579428538



##########
File path: test/src/main/java/org/apache/accumulo/test/manager/SuspendedTabletsIT.java
##########
@@ -117,7 +117,12 @@ public void shutdownAndResumeTserver() throws Exception {
     suspensionTestBody((ctx, locs, count) -> {
       Set<TServerInstance> tserversSet = new HashSet<>();
       for (TabletLocationState tls : locs.locationStates.values()) {
-        if (tls.current != null) {
+
+        TabletLocator tl = TabletLocator.getLocator(ctx, MetadataTable.ID);
+        String metadataTserver =
+            tl.locateTablet(ctx, tls.extent.toMetaRow(), false, false).tablet_location;
+        // if the server does not hold the metadata, add it to the list to be shutdown
+        if (tls.current != null && !tls.current.toString().startsWith(metadataTserver)) {

Review comment:
       >1. the tserver could still be added to the list if another tablet adds it, so it could still cause problems, and
   
   I don't see how this could happen. The name of the tserver with the metadata is compared with the tserver that the current tablet is on. Therefore any tablet on that server should yield the same result.
    
   >2. this could result in no tservers shutting down, therefore not actually checking the conditions the test is intended to check
   
   This would only happen if the metadata were not on any of the servers. I don't think this could happen since, from my understanding, the metadata would need to be on one of the available servers.
   
   It may be a good idea to add an assert to make sure two servers are on track to be shut down.
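
A minimal sketch of such a check, under stated assumptions: servers are modeled as plain strings rather than `TServerInstance`, the metadata is assumed to sit on a single known server, and `ShutdownSelectionSketch`, `selectServersToShutDown`, and `checkSelection` are hypothetical names that do not exist in the test.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class ShutdownSelectionSketch {

  /**
   * Collects the servers that currently host a tablet, skipping the one server
   * assumed to hold the metadata. Simplified stand-in for the loop in the diff.
   */
  static Set<String> selectServersToShutDown(Map<String, String> tabletToServer,
      String metadataServer) {
    Set<String> selected = new HashSet<>();
    for (String server : tabletToServer.values()) {
      // A null server models an unassigned tablet (tls.current == null in the diff).
      if (server != null && !server.equals(metadataServer)) {
        selected.add(server);
      }
    }
    return selected;
  }

  /** Fails fast if the selection would not exercise the test as intended. */
  static void checkSelection(Set<String> selected, int expected) {
    if (selected.size() != expected) {
      throw new AssertionError(
          "expected " + expected + " servers to shut down but selected " + selected);
    }
  }

  public static void main(String[] args) {
    Map<String, String> locations = new HashMap<>();
    locations.put("tablet-a", "ts1:9997");
    locations.put("tablet-b", "ts2:9997");
    locations.put("tablet-c", "ts3:9997");
    locations.put("tablet-d", null); // unassigned tablet
    Set<String> selected = selectServersToShutDown(locations, "ts3:9997");
    checkSelection(selected, 2);
    System.out.println("servers to shut down: " + selected.size());
  }
}
```

Checking the size before any shutdown is requested would turn a silent "nothing was shut down" run into an immediate, descriptive failure.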
   







[GitHub] [accumulo] DomGarguilo commented on pull request #1888: Fixes #1791 - Flaky test: SuspendedTabletsIT

Posted by GitBox <gi...@apache.org>.
DomGarguilo commented on pull request #1888:
URL: https://github.com/apache/accumulo/pull/1888#issuecomment-773539060


   It seems the results differ depending on which machine this test is run on.





[GitHub] [accumulo] DomGarguilo commented on pull request #1888: Fixes #1791 - Flaky test: SuspendedTabletsIT

Posted by GitBox <gi...@apache.org>.
DomGarguilo commented on pull request #1888:
URL: https://github.com/apache/accumulo/pull/1888#issuecomment-780058577


   It was suggested that we look into using a custom balancer in this test. The speculation is that the hanging shutdowns we are seeing may be caused by shutting down the server that is hosting the metadata. A custom balancer may allow us to make sure that the server hosting the metadata table is not one of the servers we attempt to shut down in the test.





[GitHub] [accumulo] ctubbsii commented on pull request #1888: Fixes #1791 - Flaky test: SuspendedTabletsIT

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1888:
URL: https://github.com/apache/accumulo/pull/1888#issuecomment-782359653


   > With the current solution, the tserver with the metadata is not shutdown. Are you suggesting we include the tserver with the metadata on it in the list of servers to be shutdown?
   
   I'm pointing out that metadata can be split across multiple tservers as well, and is not guaranteed to be on only one server. If you want to enforce that somehow, you'd have to use a balancer, or some other mechanism, to prevent the metadata tablets from being on the tserver being shut down, while also guaranteeing that the tserver being shut down actually has tablets for the test table.
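
A minimal sketch of why a single-server check is not enough, using plain strings instead of Accumulo types. `MetadataAwareSelectionSketch` and `safeToShutDown` are hypothetical names, and the server lists are made-up inputs, not output from a real cluster.

```java
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class MetadataAwareSelectionSketch {

  /** Servers that host at least one test tablet and no metadata tablet. */
  static Set<String> safeToShutDown(Collection<String> testTabletServers,
      Collection<String> metadataTabletServers) {
    Set<String> candidates = new TreeSet<>(testTabletServers);
    // Drop every server that hosts any metadata tablet, not just the first one found.
    candidates.removeAll(metadataTabletServers);
    return candidates;
  }

  public static void main(String[] args) {
    // One entry per test tablet, naming the server that hosts it.
    List<String> testTablets = List.of("ts1", "ts2", "ts3", "ts1");

    // Metadata on a single server: two candidates remain.
    System.out.println(safeToShutDown(testTablets, List.of("ts3")));

    // Metadata split across every server: no candidate is left to shut down.
    System.out.println(safeToShutDown(testTablets, List.of("ts1", "ts2", "ts3")));
  }
}
```

The second case is the one a balancer (or some other placement mechanism) would have to rule out, since filtering alone can leave the test with nothing to shut down.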

