Posted to common-issues@hadoop.apache.org by "Sean Mackrory (JIRA)" <ji...@apache.org> on 2017/02/03 00:42:51 UTC

[jira] [Commented] (HADOOP-14041) CLI command to prune old metadata

    [ https://issues.apache.org/jira/browse/HADOOP-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15850820#comment-15850820 ] 

Sean Mackrory commented on HADOOP-14041:
----------------------------------------

Been thinking about it some more, and cleaning up directories is very tricky. One problem is that we don't put a mod_time on directories (presumably just because S3 doesn't?), so it's impossible to distinguish between a directory that has existed for a long time and has had all of its contents pruned, vs. a directory that was just created recently and had no contents to prune (yet). Putting a mod_time on a directory could be done in 2 ways: we could use it as a creation time, or as the time when its list of children last changed. If it's only used for deciding when to prune old metadata, using it as a creation time allows us to clean very old directories that don't have more recent children, without the overhead of updating it every time we add or modify a child. But that might be a bit of a departure from the meaning expressed by "modification time".
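
To make that trade-off concrete, here is a minimal sketch (plain Java with illustrative names, not the actual S3Guard classes) of the two options: option 1 writes the timestamp once at creation, while option 2 pays an extra metadata write on every child change:

    /**
     * Hypothetical directory entry illustrating the two mod_time options
     * above; names are illustrative only, not real S3Guard types.
     */
    class DirEntry {
        final String path;
        long modTime;  // option 1: set once at creation; option 2: refreshed on child changes

        DirEntry(String path) {
            this.path = path;
            this.modTime = System.currentTimeMillis();  // both options start here
        }

        /** Option 2 only: every child add/modify costs an extra metadata write. */
        void onChildChanged() {
            this.modTime = System.currentTimeMillis();
        }

        /** A directory is prunable once it is old enough and empty. */
        boolean isPrunable(long cutoffMillis, boolean isEmpty) {
            return isEmpty && modTime < cutoffMillis;
        }
    }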

I'm thinking a couple of things:

1) For now, I think I'll just prune directories that did have contents but are now completely empty post-prune. Later, maybe we can add mod_time for directories and also clean up directories that are old enough to be pruned and are empty, even though they didn't have children removed in the prune. The more I think about it, the more I think that case will be rare and not worth adding mod_time to all directories just to clean up more nicely.

2) Having thought about the gap between identifying files to prune and identifying which directories to prune, it's probably better to do this in very small batches. It's okay for this prune command to take longer to run because we're making many round trips. The benefit is that we minimize the window in which files can get created in a directory that is being cleaned up and might be considered empty. It also minimizes impact on other workloads.

So ultimately I'm thinking the best way to do this is to clean up directories that did have children but had them all pruned (and THEIR parents, if the same is now true of the parent directory), and to do this in very small batches or even individually (see the sketch below). The more I think about it, the less I think it's worth adding mod_time to directories to handle this more completely. Would love to hear others' input, though.
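
To illustrate, here is a rough sketch of that batched approach, assuming a hypothetical MetadataStore interface (the real S3Guard API differs): it prunes expired files in small batches, then walks each affected directory upward, deleting it for as long as it remains empty:

    import java.util.*;

    /** Hypothetical stand-in for the metadata store; not the real S3Guard API. */
    interface MetadataStore {
        /** Up to 'limit' file entries with mod_time older than the cutoff. */
        List<String> listExpiredFiles(long cutoffMillis, int limit);
        void delete(String path);
        boolean isEmptyDirectory(String dirPath);
        String getParent(String path);  // null at the root
    }

    class PruneTool {
        static final int BATCH_SIZE = 25;  // small batches shrink the race window

        static void prune(MetadataStore store, long cutoffMillis) {
            List<String> batch;
            while (!(batch = store.listExpiredFiles(cutoffMillis, BATCH_SIZE)).isEmpty()) {
                // Track parents whose children we removed in this batch.
                Set<String> touchedDirs = new HashSet<>();
                for (String file : batch) {
                    store.delete(file);
                    String parent = store.getParent(file);
                    if (parent != null) {
                        touchedDirs.add(parent);
                    }
                }
                // Remove directories emptied by this batch, walking upward
                // while the same becomes true of each parent in turn.
                for (String dir : touchedDirs) {
                    while (dir != null && store.isEmptyDirectory(dir)) {
                        store.delete(dir);
                        dir = store.getParent(dir);
                    }
                }
            }
        }
    }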

> CLI command to prune old metadata
> ---------------------------------
>
>                 Key: HADOOP-14041
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14041
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Sean Mackrory
>            Assignee: Sean Mackrory
>         Attachments: HADOOP-14041-HADOOP-13345.001.patch
>
>
> Add a CLI command that allows users to specify an age at which to prune metadata that hasn't been modified for an extended period of time. Since the primary use-case targeted at the moment is list consistency, it would make sense (especially when authoritative=false) to prune metadata that is expected to have become consistent a long time ago.


