Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2021/02/08 18:24:49 UTC

[GitHub] [accumulo] DomGarguilo opened a new issue #1916: Garbage Collector deleted metadata file before it should have

DomGarguilo opened a new issue #1916:
URL: https://github.com/apache/accumulo/issues/1916


   **Describe the bug**
   While investigating a flaky test ([SuspendedTabletsIT](https://github.com/apache/accumulo/pull/1888)), it was found that the garbage collector deleted a metadata file that the test still needed, which caused the test to fail as soon as it attempted to read the metadata. This error was reported in [this comment](https://github.com/apache/accumulo/pull/1888#issuecomment-768674007). I cannot find anything that suggests SuspendedTabletsIT directly caused this error; it seems the error lies within the garbage collector's behavior. This error has occurred only once while running this test, and I was unable to reproduce it.
   
   **Logs**
   
   _Originally posted by @ctubbsii in https://github.com/apache/accumulo/issues/1888#issuecomment-768674007_
   
   <details>
   
   ```java
   [ERROR] Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 605.616 s <<< FAILURE! - in org.apache.accumulo.test.master.SuspendedTabletsIT
   [ERROR] crashAndResumeTserver(org.apache.accumulo.test.master.SuspendedTabletsIT)  Time elapsed: 347.964 s  <<< FAILURE!
   ```
   
   
   ```java
   java.lang.AssertionError: Scanning of metadata failed, aborting
   	at org.junit.Assert.fail(Assert.java:89)
   	at org.apache.accumulo.test.master.SuspendedTabletsIT$TabletLocations.retrieve(SuspendedTabletsIT.java:306)
   	at org.apache.accumulo.test.master.SuspendedTabletsIT.suspensionTestBody(SuspendedTabletsIT.java:208)
   	at org.apache.accumulo.test.master.SuspendedTabletsIT.crashAndResumeTserver(SuspendedTabletsIT.java:101)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
   	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
   	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
   	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
   	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
   	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
   	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
   	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
   	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   	at java.base/java.lang.Thread.run(Thread.java:834)
   ```
   
   </details>
   
   <details>
   
   
   ```java
   2021-01-27T17:59:14,161 [gc.SimpleGarbageCollector] DEBUG: Deleting file:/home/christopher/git/apache/accumulo/accumulo/test/target/mini-tests/org.apache.accumulo.test.master.SuspendedTabletsIT_crashAndResumeTserver/accumulo/tables/
   ```
   
   ```java
   2021-01-27T18:00:10,691 [tserver.FileManager] ERROR: Failed to open file file:/home/christopher/git/apache/accumulo/accumulo/test/target/mini-tests/org.apache.accumulo.test.master.SuspendedTabletsIT_crashAndResumeTserver/accumulo/tables/!0/table_info/F0000038.rf java.io.FileNotFoundException: File file:/home/christopher/git/apache/accumulo/accumulo/test/target/mini-tests/org.apache.accumulo.test.master.SuspendedTabletsIT_crashAndResumeTserver/accumulo/tables/!0/table_info/F0000038.rf does not exist
   2021-01-27T18:00:10,693 [tserver.FileManager] ERROR: Failed to open file file:/home/christopher/git/apache/accumulo/accumulo/test/target/mini-tests/org.apache.accumulo.test.master.SuspendedTabletsIT_crashAndResumeTserver/accumulo/tables/!0/table_info/F0000038.rf java.io.FileNotFoundException: File file:/home/christopher/git/apache/accumulo/accumulo/test/target/mini-tests/org.apache.accumulo.test.master.SuspendedTabletsIT_crashAndResumeTserver/accumulo/tables/!0/table_info/F0000038.rf does not exist
   2021-01-27T18:00:10,693 [tserver.FileManager] ERROR: Failed to open file file:/home/christopher/git/apache/accumulo/accumulo/test/target/mini-tests/org.apache.accumulo.test.master.SuspendedTabletsIT_crashAndResumeTserver/accumulo/tables/!0/table_info/F0000038.rf java.io.FileNotFoundException: File file:/home/christopher/git/apache/accumulo/accumulo/test/target/mini-tests/org.apache.accumulo.test.master.SuspendedTabletsIT_crashAndResumeTserver/accumulo/tables/!0/table_info/F0000038.rf does not exist
   2021-01-27T18:00:10,694 [problems.ProblemReports] DEBUG: Filing problem report !0 FILE_READ file:/home/christopher/git/apache/accumulo/accumulo/test/target/mini-tests/org.apache.accumulo.test.master.SuspendedTabletsIT_crashAndResumeTserver/accumulo/tables/!0/table_info/F0000038.rf
   2021-01-27T18:00:10,694 [scan.LookupTask] WARN : lookup failed for tablet !0;~<                     
   java.io.IOException: Failed to open file:/home/christopher/git/apache/accumulo/accumulo/test/target/mini-tests/org.apache.accumulo.test.master.SuspendedTabletsIT_crashAndResumeTserver/accumulo/tables/!0/table_info/F0000038.rf
     at org.apache.accumulo.tserver.FileManager.reserveReaders(FileManager.java:331) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.tserver.FileManager$ScanFileManager.openFiles(FileManager.java:492) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.tserver.FileManager$ScanFileManager.openFiles(FileManager.java:501) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.tserver.tablet.ScanDataSource.createIterator(ScanDataSource.java:164) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.tserver.tablet.ScanDataSource.iterator(ScanDataSource.java:120) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.iteratorsImpl.system.SourceSwitchingIterator.seek(SourceSwitchingIterator.java:228) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.tserver.tablet.Tablet.lookup(Tablet.java:493) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.tserver.tablet.Tablet.lookup(Tablet.java:646) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.tserver.scan.LookupTask.run(LookupTask.java:117) [accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.tserver.session.ScanSession$ScanMeasurer.run(ScanSession.java:54) [accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57) [htrace-core-3.2.0-incubating.jar:3.2.0-incubating]
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]          
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]          
     at java.lang.Thread.run(Thread.java:834) [?:?]                                                    
   Caused by: java.io.UncheckedIOException: java.io.FileNotFoundException: File file:/home/christopher/git/apache/accumulo/accumulo/test/target/mini-tests/org.apache.accumulo.test.master.SuspendedTabletsIT_crashAndResumeTserver/accumulo/tables/!0/table_info/F0000038.rf does not exist
     at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader$BCFileLoader.load(CachableBlockFile.java:227) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.blockfile.cache.lru.SynchronousLoadingBlockCache.getBlock(SynchronousLoadingBlockCache.java:127) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.blockfile.cache.lru.SynchronousLoadingBlockCache.resolveDependencies(SynchronousLoadingBlockCache.java:64) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.blockfile.cache.lru.SynchronousLoadingBlockCache.getBlock(SynchronousLoadingBlockCache.java:109) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:381) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.rfile.RFile$Reader.<init>(RFile.java:1164) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.rfile.RFile$Reader.<init>(RFile.java:1256) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.rfile.RFileOperations.getReader(RFileOperations.java:55) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.rfile.RFileOperations.openReader(RFileOperations.java:70) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.DispatchingFileFactory.openReader(DispatchingFileFactory.java:85) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.FileOperations$ReaderBuilder.build(FileOperations.java:449) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.tserver.FileManager.reserveReaders(FileManager.java:309) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     ... 13 more                                                                                       
   Caused by: java.io.FileNotFoundException: File file:/home/christopher/git/apache/accumulo/accumulo/test/target/mini-tests/org.apache.accumulo.test.master.SuspendedTabletsIT_crashAndResumeTserver/accumulo/tables/!0/table_info/F0000038.rf does not exist
     at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:668) ~[hadoop-client-api-3.3.0.jar:?]
     at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:989) ~[hadoop-client-api-3.3.0.jar:?]
     at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:658) ~[hadoop-client-api-3.3.0.jar:?]
     at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:460) ~[hadoop-client-api-3.3.0.jar:?]
     at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:155) ~[hadoop-client-api-3.3.0.jar:?]
     at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:356) ~[hadoop-client-api-3.3.0.jar:?]
     at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:945) ~[hadoop-client-api-3.3.0.jar:?]     
     at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$CachableBuilder.lambda$fsPath$0(CachableBlockFile.java:92) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBCFile(CachableBlockFile.java:167) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader$BCFileLoader.load(CachableBlockFile.java:225) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.blockfile.cache.lru.SynchronousLoadingBlockCache.getBlock(SynchronousLoadingBlockCache.java:127) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.blockfile.cache.lru.SynchronousLoadingBlockCache.resolveDependencies(SynchronousLoadingBlockCache.java:64) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.blockfile.cache.lru.SynchronousLoadingBlockCache.getBlock(SynchronousLoadingBlockCache.java:109) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:381) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.rfile.RFile$Reader.<init>(RFile.java:1164) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.rfile.RFile$Reader.<init>(RFile.java:1256) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.rfile.RFileOperations.getReader(RFileOperations.java:55) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.rfile.RFileOperations.openReader(RFileOperations.java:70) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.DispatchingFileFactory.openReader(DispatchingFileFactory.java:85) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.core.file.FileOperations$ReaderBuilder.build(FileOperations.java:449) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     at org.apache.accumulo.tserver.FileManager.reserveReaders(FileManager.java:309) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
     ... 13 more    
   ```
   
   </details>
   
   **Additional context**
   SuspendedTabletsIT sometimes hangs while running and times out. This error occurred during a run in which the timeout had been extended. The metadata file was deleted at minute 9 of the test, which failed at 10 minutes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on issue #1916: Garbage Collector deleted metadata file before it should have

Posted by GitBox <gi...@apache.org>.
milleruntime commented on issue #1916:
URL: https://github.com/apache/accumulo/issues/1916#issuecomment-871383637


   I noticed in `TabletIterator` that the boolean `returnPrevEndRow` gets set to false for the GC but is true the other times it gets used in the code. This results in a strange call to remove in the `next()` method.
   https://github.com/apache/accumulo/blob/f4f43febbc3e68013d8a1bcd46d8b44275e2e55e/server/base/src/main/java/org/apache/accumulo/server/util/TabletIterator.java#L177-L180


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii closed issue #1916: Garbage Collector deleted metadata file before it should have

Posted by GitBox <gi...@apache.org>.
ctubbsii closed issue #1916:
URL: https://github.com/apache/accumulo/issues/1916


   





[GitHub] [accumulo] milleruntime commented on issue #1916: Garbage Collector deleted metadata file before it should have

Posted by GitBox <gi...@apache.org>.
milleruntime commented on issue #1916:
URL: https://github.com/apache/accumulo/issues/1916#issuecomment-843413614


   > I'm not sure it's a race condition with metadata writes. I'm thinking this has something to do with a logic error involving WALs and the garbage collector. In both SuspendedTabletsIT and in #608, there was a tserver failure and WAL recovery.
   
   So you think it's more likely to be another recovery bug? Specifically, a recovery bug when recovering the metadata table? I know this IT used to have issues when crashing a tserver hosting metadata tablets. If this was fixed in #1888, or if it shows up too infrequently in SuspendedTabletsIT, then perhaps we need another test, specifically for this bug.





[GitHub] [accumulo] ctubbsii commented on issue #1916: Garbage Collector deleted metadata file before it should have

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #1916:
URL: https://github.com/apache/accumulo/issues/1916#issuecomment-947332423


   > Unsure if this issue needs to remain open or can be closed
   
   We can close this, since we haven't seen it in a while, and it is likely covered by the other issues you mentioned. We can always revisit it if new evidence arrives that leads us to believe the work in those other issues is insufficient.





[GitHub] [accumulo] milleruntime commented on issue #1916: Garbage Collector deleted metadata file before it should have

Posted by GitBox <gi...@apache.org>.
milleruntime commented on issue #1916:
URL: https://github.com/apache/accumulo/issues/1916#issuecomment-869750066


   FYI it looks like the code in question in `ScannerIterator` changed significantly in 2.0. The Queue that was holding the exceptions was replaced by a class in `ScannerImpl` called `Reporter`. Most of the changes took place in https://github.com/apache/accumulo/pull/905





[GitHub] [accumulo] ctubbsii commented on issue #1916: Garbage Collector deleted metadata file before it should have

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #1916:
URL: https://github.com/apache/accumulo/issues/1916#issuecomment-842627077


   I wonder if this is a duplicate of #608





[GitHub] [accumulo] milleruntime commented on issue #1916: Garbage Collector deleted metadata file before it should have

Posted by GitBox <gi...@apache.org>.
milleruntime commented on issue #1916:
URL: https://github.com/apache/accumulo/issues/1916#issuecomment-843404555


   > I think this might be a duplicate of #608
   
   It is possible but one big difference I noticed is that `SuspendedTabletsIT` does not use Bulk Import. If there is a race condition between the Garbage Collector and Metadata tablet mutations, then it might not make a difference.





[GitHub] [accumulo] ctubbsii commented on issue #1916: Garbage Collector deleted metadata file before it should have

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #1916:
URL: https://github.com/apache/accumulo/issues/1916#issuecomment-843408096


   > > I think this might be a duplicate of #608
   > 
   > It is possible but one big difference I noticed is that `SuspendedTabletsIT` does not use Bulk Import. If there is a race condition between the Garbage Collector and Metadata tablet mutations, then it might not make a difference.
   
   I'm not sure it's a race condition with metadata writes. I'm thinking this has something to do with a logic error involving WALs and the garbage collector. In both SuspendedTabletsIT and in #608, there was a tserver failure and WAL recovery.





[GitHub] [accumulo] mjwall commented on issue #1916: Garbage Collector deleted metadata file before it should have

Posted by GitBox <gi...@apache.org>.
mjwall commented on issue #1916:
URL: https://github.com/apache/accumulo/issues/1916#issuecomment-947058508


   I can't be sure that what I found is exactly what @DomGarguilo experienced when running SuspendedTabletsIT, but there are two different scenarios that have the same effect: candidates that are still in use are removed during Accumulo GC.
   
   1. https://github.com/apache/accumulo/issues/1377 where entire tables are missed while scanning for references and the consistency checks fail to notice.  A PR is up for this at https://github.com/apache/accumulo/pull/2293 to add more consistency checks.
   2. https://github.com/apache/accumulo/issues/2322 where I think a hardware failure and an admin process of not stopping services led to a TabletDeletedException that is assumed to be normal enough not to fail the current GC cycle.  I would like to change that and have the GC cycle bail out, but I am waiting for community input on that issue.
   
   Unsure if this issue needs to remain open or can be closed
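
A minimal sketch of the "bail out" behavior proposed for the second scenario, under stated assumptions: the class `GcCycleSketch`, the exception `ReferenceScanException`, and the method `runCycle` are hypothetical names made up for illustration and are not Accumulo APIs. The idea is only that a failed reference scan leaves the collector with an incomplete view, so the cycle deletes nothing instead of treating the partial result as complete.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Supplier;

// Hypothetical, simplified model of one GC cycle; not Accumulo code.
class GcCycleSketch {

  // Stand-in for any failure while scanning references (for example, a tablet vanishing mid-scan).
  static class ReferenceScanException extends RuntimeException {
    ReferenceScanException(String message) {
      super(message);
    }
  }

  // Returns the files that may be deleted, or Optional.empty() if the cycle was abandoned.
  static Optional<Set<String>> runCycle(Set<String> candidates,
      Supplier<List<String>> referenceScan) {
    List<String> references;
    try {
      references = referenceScan.get();
    } catch (ReferenceScanException e) {
      // The reference list may be incomplete, so delete nothing in this cycle.
      return Optional.empty();
    }
    Set<String> remaining = new TreeSet<>(candidates);
    remaining.removeAll(references); // referenced files are not garbage
    return Optional.of(remaining);
  }

  public static void main(String[] args) {
    Set<String> candidates = Set.of("F0000038.rf");

    // Successful scan: the file is still referenced, so nothing is deleted.
    System.out.println(runCycle(candidates, () -> List.of("F0000038.rf"))); // Optional[[]]

    // Failed scan: the cycle is abandoned rather than deleting on partial information.
    System.out.println(runCycle(candidates, () -> {
      throw new ReferenceScanException("tablet deleted during scan");
    })); // Optional.empty
  }
}
```

The trade-off in this sketch is that a persistent scan failure would stall garbage collection entirely, which is presumably part of what the community input on that issue would need to weigh.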





[GitHub] [accumulo] mjwall commented on issue #1916: Garbage Collector deleted metadata file before it should have

Posted by GitBox <gi...@apache.org>.
mjwall commented on issue #1916:
URL: https://github.com/apache/accumulo/issues/1916#issuecomment-873429251


   Found [1260](https://github.com/apache/accumulo/issues/1260) where @keith-turner has some fixes around this area in 1.10.  Working to replicate in 1.9 line, then will try the test in 1.10 to see.





[GitHub] [accumulo] ctubbsii removed a comment on issue #1916: Garbage Collector deleted metadata file before it should have

Posted by GitBox <gi...@apache.org>.
ctubbsii removed a comment on issue #1916:
URL: https://github.com/apache/accumulo/issues/1916#issuecomment-842627077


   I wonder if this is a duplicate of #608





[GitHub] [accumulo] ctubbsii commented on issue #1916: Garbage Collector deleted metadata file before it should have

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #1916:
URL: https://github.com/apache/accumulo/issues/1916#issuecomment-843496960


   > So you think it's more likely to be another recovery bug?
   
   That's my suspicion, but I don't have strong evidence for this.
   
   > Specifically, a recovery bug when recovering the metadata table?
   
   I don't know about that. It could be a generic recovery bug, and we just tend to notice when it affects the metadata table, because it affects Accumulo behavior, rather than user data.
   
   > ... perhaps we need another test, specifically for this bug.
   
   I've only seen this specific issue once. If it is the same as #608, then it has only ever occurred a handful of times to my knowledge. Either way, it seems very hard to reproduce. If we can come up with a test case that reproduces this specific bug, that would be excellent progress towards solving it.





[GitHub] [accumulo] mjwall edited a comment on issue #1916: Garbage Collector deleted metadata file before it should have

Posted by GitBox <gi...@apache.org>.
mjwall edited a comment on issue #1916:
URL: https://github.com/apache/accumulo/issues/1916#issuecomment-873429251


   Found [PR 1379](https://github.com/apache/accumulo/pull/1379) and [issue 1260](https://github.com/apache/accumulo/issues/1260) where @keith-turner has some fixes around this area in 1.10.  Working to replicate in the 1.9 line, then will try the test in 1.10 to see.








[GitHub] [accumulo] mjwall commented on issue #1916: Garbage Collector deleted metadata file before it should have

Posted by GitBox <gi...@apache.org>.
mjwall commented on issue #1916:
URL: https://github.com/apache/accumulo/issues/1916#issuecomment-868465742


   @ivakegg found something in the [TabletIterator](https://github.com/apache/accumulo/blob/rel/1.9.3/server/base/src/main/java/org/apache/accumulo/server/util/TabletIterator.java) that could miss a section of the metadata table during scanning.  This would cause those candidates not to be removed from the candidateMap when GC is [checking](https://github.com/apache/accumulo/blob/rel/1.9.3/server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java#L176) and therefore still be part of the [candidateMap](https://github.com/apache/accumulo/blob/rel/1.9.3/server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java#L311) where references are removed.  This hypothesis is consistent with what was seen several times on a large cluster.  
   
   Part of working out how this could happen is understanding how the client code handles a scan failure.  Stepping through the code, I hit this section in [ScannerIterator](https://github.com/apache/accumulo/blob/rel/1.9.3/core/src/main/java/org/apache/accumulo/core/client/impl/ScannerIterator.java#L93) which appears to swallow the errors.  The first group of exceptions is logged at TRACE, but the systems where we have seen this issue log at DEBUG, so those messages would never appear.
   
   So if what I am seeing is correct, something as simple as a scan timeout in an unfortunate metadata range in the TabletIterator would not log anything, and the consistency checks would not catch the issue.
   
   Working to reproduce and prove this locally. 
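
The hypothesis above can be illustrated with a small, self-contained sketch. It is a simplified model, not the actual `GarbageCollectionAlgorithm`: the class `GcCandidateSketch`, the method `filesToDelete`, and the file names other than `F0000038.rf` (taken from the logs in this issue) are assumptions made up for the example. It shows only the core idea that a candidate is protected by a reference seen during the metadata scan, so a scan that silently skips a range leaves a live file among the deletions.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified model of the candidate/reference check; not Accumulo code.
class GcCandidateSketch {

  // Returns the candidates that survive the reference check and would therefore be deleted.
  static Set<String> filesToDelete(Set<String> candidates, List<String> scannedReferences) {
    Set<String> remaining = new TreeSet<>(candidates);
    // Every reference seen during the metadata scan protects its file from deletion.
    for (String reference : scannedReferences) {
      remaining.remove(reference);
    }
    return remaining;
  }

  public static void main(String[] args) {
    Set<String> candidates = new TreeSet<>(Set.of("F0000038.rf", "F0000012.rf"));

    // Complete scan: F0000038.rf is still referenced, so only F0000012.rf is deleted.
    List<String> fullScan = List.of("F0000038.rf", "A0000040.rf");
    System.out.println(filesToDelete(candidates, fullScan)); // [F0000012.rf]

    // Scan that silently skipped the range holding the F0000038.rf reference:
    // the in-use file stays in the candidate set and is deleted too.
    List<String> partialScan = List.of("A0000040.rf");
    System.out.println(filesToDelete(candidates, partialScan)); // [F0000012.rf, F0000038.rf]
  }
}
```

Nothing in this model distinguishes "no reference exists" from "the reference was never scanned", which is why a swallowed scan error would be indistinguishable from a clean run.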





[GitHub] [accumulo] ctubbsii commented on issue #1916: Garbage Collector deleted metadata file before it should have

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #1916:
URL: https://github.com/apache/accumulo/issues/1916#issuecomment-842627946


   I think this might be a duplicate of #608

