Posted to notifications@accumulo.apache.org by "cshannon (via GitHub)" <gi...@apache.org> on 2023/02/05 18:56:49 UTC

[GitHub] [accumulo] cshannon commented on issue #608: Garbage Collector removed referenced files

cshannon commented on issue #608:
URL: https://github.com/apache/accumulo/issues/608#issuecomment-1418236284

   I did some more testing this past Friday/Saturday on this and still couldn't get any errors to show up. I did, however, see in the logs that the iterator actually detected inconsistencies in the metadata table and fixed itself (just like I said in my previous comment). I talked to @keith-turner about this a bit and he suggested I try testing against 1.10 (and not just 2.1 and main). I also realized I was mostly testing using the mini accumulo cluster, which does not use HDFS. So today I spent some time re-running my tests against 1.10 using Uno (so it's a real cluster with HDFS), and I still didn't get any errors; everything worked as it should.
   
   One idea I had that could possibly help the problem would be to require garbage collection to run more than once and compare results before actually removing files. For example, GC could run the scan to get the file candidates with references multiple times in the same GC run, then compare results and take a superset (or fail if inconsistent). Another option is to require more than one GC run before actually deleting. Something like: when GC runs and comes up with the files it is about to delete, we just mark them (in metadata, probably) as to-be-deleted, and a subsequent run does the actual delete only if it detects the file was marked in a previous run, so we know we had at least multiple runs where we think we should delete. We could even make it configurable to require X number of positive hits to do a deletion, or have a time delay, etc.
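   To illustrate the "require X positive hits" idea, here is a minimal, hypothetical sketch (not Accumulo's actual GC code; the class and method names are made up). A candidate file is only deleted after it has been flagged as unreferenced in N consecutive GC runs, and a file that regains a reference loses its accumulated hits:

   ```java
   import java.util.HashMap;
   import java.util.HashSet;
   import java.util.Map;
   import java.util.Set;

   // Hypothetical tracker for the deferred-delete idea: delete a file only
   // after it has shown up as an unreferenced candidate in `requiredHits`
   // consecutive GC runs.
   public class DeferredDeleteTracker {
       private final int requiredHits; // configurable threshold (the "X")
       private final Map<String, Integer> hitCounts = new HashMap<>();

       public DeferredDeleteTracker(int requiredHits) {
           this.requiredHits = requiredHits;
       }

       // Called once per GC run with that run's unreferenced-file candidates.
       // Returns the subset that has now been flagged enough times to delete.
       public Set<String> recordRun(Set<String> candidates) {
           // A file that is no longer a candidate (a reference reappeared)
           // loses its accumulated hits -- this is the safety property.
           hitCounts.keySet().retainAll(candidates);

           Set<String> toDelete = new HashSet<>();
           for (String file : candidates) {
               int hits = hitCounts.merge(file, 1, Integer::sum);
               if (hits >= requiredHits) {
                   toDelete.add(file);
                   hitCounts.remove(file); // deleted; stop tracking
               }
           }
           return toDelete;
       }

       public static void main(String[] args) {
           DeferredDeleteTracker tracker = new DeferredDeleteTracker(2);
           // Run 1: both files look deletable, but neither has 2 hits yet.
           System.out.println(tracker.recordRun(Set.of("f1", "f2"))); // []
           // Run 2: f2 turned out to be referenced after all; only f1,
           // which was flagged in both runs, gets deleted.
           System.out.println(tracker.recordRun(Set.of("f1"))); // [f1]
       }
   }
   ```

   In the real system the hit counts would presumably live in the metadata table rather than in memory, which is where the chicken/egg concern below comes in.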
   
   Running the scan more than once to detect the file references would only actually help if the problem were transient and non-deterministic, isolated to a single scan, and wouldn't just happen again in a future scan, which is hard to say because we don't know the actual problem. There's also a bit of a chicken/egg problem: if we are writing the GC result back to metadata, but metadata scans are the thing that is inconsistent, that could itself be a problem. Also, this delayed-deletion behavior can sort of already be accomplished by using HDFS trash (files can be recovered from there), and there is already #3140 to help make the HDFS trash option better, but having it built in to GC could also be nice if it would help prevent early deletion in the first place.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org