Posted to common-issues@hadoop.apache.org by "Dmytro Molkov (JIRA)" <ji...@apache.org> on 2010/05/12 02:23:42 UTC
[jira] Updated: (HADOOP-6761) Improve Trash Emptier

     [ https://issues.apache.org/jira/browse/HADOOP-6761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmytro Molkov updated HADOOP-6761:
----------------------------------

    Attachment: HADOOP-6761.patch

The fix itself is very simple; please see the attached patch. It introduces two configuration parameters instead of one, and the rest of the code should work as is, since we only delete something whose timestamp is older than now - interval. In my example of 24 hours of retention and 1 hour checkpointing, that means only the 25th hour would get deleted.
Checkpointing happens every time the Emptier starts, so that part should work fine too.
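
To illustrate the "older than now - interval" rule with two separate intervals, here is a minimal sketch. It is not the code from the patch: the class and method names are hypothetical, time is modeled as plain minutes, and the strict "older than" comparison is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from HADOOP-6761.patch.
public class TrashIntervalSketch {

    /** Returns the checkpoints (in minutes) strictly older than now - deletionInterval. */
    static List<Long> expired(List<Long> checkpoints, long nowMin, long deletionIntervalMin) {
        List<Long> result = new ArrayList<>();
        for (long ts : checkpoints) {
            if (ts < nowMin - deletionIntervalMin) {
                result.add(ts);
            }
        }
        return result;
    }

    /** Rough upper bound on how far back Trash data reaches just before a deletion pass. */
    static long worstCaseRetainedMin(long deletionIntervalMin, long checkpointIntervalMin) {
        return deletionIntervalMin + checkpointIntervalMin;
    }

    public static void main(String[] args) {
        long day = 24 * 60;
        long hour = 60;

        // One checkpoint per hour, for hours 0 through 24.
        List<Long> checkpoints = new ArrayList<>();
        for (long t = 0; t <= day; t += hour) {
            checkpoints.add(t);
        }

        // At hour 25 only the oldest checkpoint falls outside the 24-hour retention.
        System.out.println(expired(checkpoints, 25 * hour, day));

        // Hours retained with 24h/1h intervals versus a single 24h interval.
        System.out.println(worstCaseRetainedMin(day, hour) / 60);
        System.out.println(worstCaseRetainedMin(day, day) / 60);
    }
}
```

Under these assumptions each hourly pass removes one hour-sized chunk, which matches the 25 hours versus 48 hours comparison in the issue description.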

The only problem is writing a test for this. Since the Trash and the Emptier have minute-long granularity, and the checkpoint name format is a timestamp with minute precision, the test would need to run for a couple of minutes to finish.
Can anyone think of a clean way to override these so that we can have a quick test that covers everything?
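
One possible direction, offered only as a sketch and not as something the patch does, is to pass the time source in so that a test can advance a fake clock by a minute instantly. Everything below is hypothetical: the class, the `checkpointName` helper, and the `yyMMddHHmm` pattern are assumptions chosen to mimic a minute-precision checkpoint name.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical sketch of an injectable clock; not part of the Trash code.
public class FakeClockSketch {

    /** Formats a minute-precision checkpoint name from the supplied clock (pattern is assumed). */
    static String checkpointName(LongSupplier clockMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyMMddHHmm", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC")); // fixed zone keeps the output deterministic
        return fmt.format(new Date(clockMillis.getAsLong()));
    }

    public static void main(String[] args) {
        // The test owns the clock, so no real waiting is needed.
        AtomicLong fakeNow = new AtomicLong(0L);
        LongSupplier clock = fakeNow::get;

        String first = checkpointName(clock);
        fakeNow.addAndGet(60_000L); // jump forward one minute
        String second = checkpointName(clock);

        // The two names differ even though no wall-clock time has passed.
        System.out.println(first + " -> " + second);
    }
}
```

Production code would pass something like `System::currentTimeMillis`, while a test would pass the fake clock and step it forward between checkpoints.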

Thanks

> Improve Trash Emptier
> ---------------------
>
>                 Key: HADOOP-6761
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6761
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Dmytro Molkov
>         Attachments: HADOOP-6761.patch
>
>
> There are two inefficiencies in the Trash functionality right now that have caused some problems for us.
> First, if you configure your trash interval to be one day (24 hours), you eventually store two days' worth of data: the Current data and the previous timestamped checkpoint, which will not be deleted until the end of the interval.
> The other problem is the accumulation of a lot of data in Trash before the Emptier wakes up. If a couple of million files are trashed and the Emptier does the deletion on HDFS, the NameNode will freeze until everything is removed (this particular problem will hopefully be addressed by HDFS-1143).
> My proposal is to have two configuration intervals: one for deleting the trashed data and another for checkpointing. This way, for intervals of one day and one hour, for example, we will store only 25 hours of data instead of the current 48, and the deletions will happen in smaller chunks every hour of the day instead of one huge deletion at the end of the day.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.