You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@accumulo.apache.org by "ctubbsii (via GitHub)" <gi...@apache.org> on 2023/03/06 15:20:33 UTC

[GitHub] [accumulo] ctubbsii commented on a diff in pull request #3140: Add option to only skip trash for Import files

ctubbsii commented on code in PR #3140:
URL: https://github.com/apache/accumulo/pull/3140#discussion_r1126585073


##########
core/src/main/java/org/apache/accumulo/core/conf/Property.java:
##########
@@ -746,6 +746,8 @@ public enum Property {
       "The number of threads used to delete RFiles and write-ahead logs", "1.3.5"),
   GC_TRASH_IGNORE("gc.trash.ignore", "false", PropertyType.BOOLEAN,
       "Do not use the Trash, even if it is configured.", "1.5.0"),
+  GC_TRASH_IGNORE_IMPORTS_ONLY("gc.trash.ignore.imports.only", "false", PropertyType.BOOLEAN,
+      "Skip Trash for Import files when the trash is in use.", "3.0.0"),

Review Comment:
   Having a per-table property might be useful, but the unintuitiveness of two separate properties to control three modes of operation would still be an issue that needs addressing. The per-table option wasn't what was being proposed, and it didn't occur to me. I can see the merits of it, but it's not so simple to implement, since the accumulo-gc operates asynchronously, and doesn't know the table from which a file came... it can only make educated guesses based on the file's path, but that's not reliable, since it could have been used by multiple tables, or manually placed through some import process that didn't include a table id in the path... which is only done by convention, not a guarantee.
   
   My view that Accumulo is managing the trash is based on our long-held desire to try to decouple ourselves from HDFS, and I'm recognizing the trash as an HDFS-specific feature. The more that filesystem-related behavior is handled within the filesystem config, and Accumulo-related behavior is handled within Accumulo's config, the better, I think. While this may be less convenient initially, I think it makes greater sense as the overall system architecture grows in complexity. Replication (number of block replicas) and WALOG durability (hsync, hflush) are also filesystem-specific behaviors, but those knobs already exist. I'm more inclined to scrutinize the knobs that further tightly couple us to HDFS when they are new, so we're not digging ourselves a deeper hole that is harder to get out of (to decouple). But this is just a heuristic I apply to help me think about the consequences of a change... it's not a fixed prerequisite for me to support a change or anything like that.
   
   I was also thinking about an "on-delete" hook or similar... but I don't think I'd be in favor of that, because I think if a user wanted that, they could just subclass an existing FileSystem and add that feature to its delete method. If every user did that, it would be a hassle... but making a new FineGrainedTrashPolicyFileSystem only needs to be done once. I'm not suggesting this is the best option... just one possible alternative solution to the property changes being proposed here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org