Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2022/05/25 00:58:03 UTC

[GitHub] [accumulo] ctubbsii opened a new issue, #2729: Explore using hard links to eliminate the need for file garbage collection

ctubbsii opened a new issue, #2729:
URL: https://github.com/apache/accumulo/issues/2729

   **Is your feature request related to a problem? Please describe.**
   Compactions, splits, merges, and table clones are tricky and complicated because we have multiple references to the same files, making it difficult to know when it is safe to delete a file. We keep track of files in use and when we're done with them, we only mark them as candidates for deletion. We rely on a separate garbage collection service to ensure that a file is no longer in use before we can safely delete it. Even then, the garbage collection process can be slow, risky, and if it crashes, it may leave behind unreferenced files.
   
   **Describe the solution you'd like**
   HDFS has a kind of [HardLink](https://hadoop.apache.org/docs/r2.8.2/hadoop-project-dist/hadoop-common/api/org/apache/hadoop/fs/HardLink.html) feature that we may be able to leverage to avoid garbage collection entirely. I have not tested how it works in practice. In theory, whenever we split a tablet, clone a table, or even bulk import files, we could create a hard link with a unique file name for each new reference, rather than simply copying the same reference. This would probably increase the memory footprint of the Hadoop NameNode, but it would enable dramatic simplification of Accumulo, so it would probably be worth it. When we are done with a file, we could just delete it immediately, because no other reference would use that file name. The actual blocks would still be referenced by the other hard links and would not be deleted. We can let Hadoop reclaim the blocks when the last hard link is deleted.
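
As a rough illustration of the idea above, here is a minimal sketch that uses `java.nio.file.Files.createLink` on a local filesystem as a stand-in. It is an assumption that HDFS could offer equivalent semantics; that is the open question of this issue. The class name `UniqueLinkSketch`, the helper `linkForEachOwner`, and the `F<uuid>.rf` naming scheme are hypothetical and not part of Accumulo.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

class UniqueLinkSketch {

  // Hypothetical helper: gives each of `count` new owners (for example, the
  // child tablets of a split) its own uniquely named hard link to `source`.
  static List<Path> linkForEachOwner(Path source, int count) throws IOException {
    List<Path> links = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      // Assumed naming scheme: a random UUID keeps every name globally unique.
      Path link = source.resolveSibling("F" + UUID.randomUUID() + ".rf");
      // createLink(link, existing) adds a new name for the same underlying data.
      Files.createLink(link, source);
      links.add(link);
    }
    return links;
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("unique-link-sketch");
    Path source = Files.write(dir.resolve("parent.rf"), new byte[] {1, 2, 3});
    List<Path> links = linkForEachOwner(source, 2);
    // The previous owner is done with its name and deletes it right away;
    // no reference checking is needed because nobody else uses that name.
    Files.delete(source);
    for (Path link : links) {
      System.out.println(link.getFileName() + " size=" + Files.size(link));
    }
  }
}
```

Each owner only ever deletes its own name, so the delete needs no coordination with other owners in this sketch. Local filesystems that do not support hard links throw `UnsupportedOperationException` from `createLink`.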
   
   **Describe alternatives you've considered**
   Keep doing file-based garbage collection and hoping for the best.
   
   **Additional context**
   Doing this could simplify the implementation of "no-chop merges" described in #1327 because each file would reference only a single range in its metadata.
   
   To implement this, we may need some kind of global locking per file, to ensure a file can't be deleted while hard links are being created.
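
One possible shape for such per-file locking is sketched below, under strong assumptions: it works only inside a single JVM and uses local-filesystem hard links, whereas a real implementation would need a lock shared across processes. The class `PerFileLockSketch` and its methods are hypothetical names for illustration only.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class PerFileLockSketch {

  // One lock per file path. Entries are never evicted in this sketch.
  private final ConcurrentHashMap<Path, ReadWriteLock> locks = new ConcurrentHashMap<>();

  private ReadWriteLock lockFor(Path file) {
    return locks.computeIfAbsent(file.toAbsolutePath().normalize(),
        p -> new ReentrantReadWriteLock());
  }

  // Link creators share the read lock, so several links to the same file
  // can be created concurrently.
  void link(Path existing, Path newName) throws IOException {
    ReadWriteLock lock = lockFor(existing);
    lock.readLock().lock();
    try {
      Files.createLink(newName, existing);
    } finally {
      lock.readLock().unlock();
    }
  }

  // Deletion takes the write lock, so it waits for any in-flight link
  // creation on the same file to finish first.
  void delete(Path file) throws IOException {
    ReadWriteLock lock = lockFor(file);
    lock.writeLock().lock();
    try {
      Files.delete(file);
    } finally {
      lock.writeLock().unlock();
    }
  }
}
```

A `link` call that arrives after the file was deleted fails with an `IOException` from `createLink`, which the caller would have to handle.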
   
   We'd need to test to make sure that the original file can still be deleted... that it's treated like any other hard link, and that we can make hard links of hard links, etc.
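
The checks described above could look roughly like the following sketch. It runs against a local filesystem through `java.nio.file.Files.createLink`, so it only shows the behavior we would want to confirm; whether HDFS behaves the same way is exactly what would need testing. The class name `LinkChainCheck` and the file names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class LinkChainCheck {

  // Returns true if the data is still readable through a link of a link
  // after both the original name and the intermediate link are deleted.
  static boolean chainSurvivesDeletes() throws IOException {
    Path dir = Files.createTempDirectory("link-chain-check");
    Path original = dir.resolve("original.rf");
    Path first = dir.resolve("first.rf");
    Path second = dir.resolve("second.rf");
    byte[] data = "rfile bytes".getBytes(StandardCharsets.UTF_8);

    Files.write(original, data);
    Files.createLink(first, original); // link to the original
    Files.createLink(second, first);   // link of a link

    Files.delete(original); // the original is deleted like any other name
    Files.delete(first);

    boolean ok = Arrays.equals(data, Files.readAllBytes(second));

    // Clean up the last name and the now-empty directory.
    Files.delete(second);
    Files.delete(dir);
    return ok;
  }

  public static void main(String[] args) throws IOException {
    System.out.println("chain survives deletes: " + chainSurvivesDeletes());
  }
}
```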
   
   We might still want a garbage collection service to lazily clean up files, but we'd no longer have to do complicated reference checking for candidates, if we could rely on file names being globally unique.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] ctubbsii commented on issue #2729: Explore using hard links to eliminate the need for file garbage collection

Posted by "ctubbsii (via GitHub)" <gi...@apache.org>.
ctubbsii commented on issue #2729:
URL: https://github.com/apache/accumulo/issues/2729#issuecomment-1681183574

   I don't know if it's true, but some documentation I read indicated that a simple copy in HDFS within the same filesystem is essentially a hard link already. I don't know if we can rely on that, though, because some FileSystem implementations may do IO. If there's an API specific to hard links, using it would be safer than relying on a particular copy behavior.




[GitHub] [accumulo] ctubbsii commented on issue #2729: Explore using hard links to eliminate the need for file garbage collection

Posted by "ctubbsii (via GitHub)" <gi...@apache.org>.
ctubbsii commented on issue #2729:
URL: https://github.com/apache/accumulo/issues/2729#issuecomment-1681181086

   #3700 was created as a duplicate. It expands on a few details but otherwise suggests the identical solution.

