Posted to mapreduce-issues@hadoop.apache.org by "Sandy Ryza (JIRA)" <ji...@apache.org> on 2012/09/25 00:18:08 UTC

[jira] [Created] (MAPREDUCE-4680) Job history cleaner should only check timestamps of files in old enough directories

Sandy Ryza created MAPREDUCE-4680:
-------------------------------------

             Summary: Job history cleaner should only check timestamps of files in old enough directories
                 Key: MAPREDUCE-4680
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4680
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: jobhistoryserver
    Affects Versions: 2.0.0-alpha
            Reporter: Sandy Ryza


Job history files are stored in yyyy/mm/dd folders. Currently, the job history cleaner checks the modification date of each file in every one of these folders to see whether the file is past the maximum age. The load on HDFS could be reduced by only checking the ages of files in directories that are old enough, as determined by their names.
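
For illustration, the name-based check could look something like the sketch below. This is not the actual jobhistoryserver code; it is only written against the generic Path API, and the class and method names are made up:

import java.text.ParseException;
import java.text.SimpleDateFormat;

import org.apache.hadoop.fs.Path;

/**
 * Illustrative sketch only (not the real jobhistoryserver code): decide from a
 * yyyy/mm/dd directory's name whether any file inside it could already be past
 * the maximum age, so that file timestamps are only fetched for directories
 * that are old enough.
 */
public class DateDirNameCheck {

  /** dayDir is expected to be a .../yyyy/mm/dd path. */
  public static boolean mayContainExpiredFiles(Path dayDir, long maxAgeMs)
      throws ParseException {
    Path monthDir = dayDir.getParent();
    Path yearDir = monthDir.getParent();
    String date = yearDir.getName() + "/" + monthDir.getName() + "/" + dayDir.getName();
    long startOfDay = new SimpleDateFormat("yyyy/MM/dd").parse(date).getTime();
    long cutoff = System.currentTimeMillis() - maxAgeMs;
    // The oldest a file in this directory can be is "written at the very start
    // of the day"; if even that moment is newer than the cutoff, nothing in the
    // directory needs to be examined.
    return startOfDay <= cutoff;
  }
}

Only directories for which this returns true would then need a listStatus and a per-file modification-time check.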

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4680) Job history cleaner should only check timestamps of files in old enough directories

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13462781#comment-13462781 ] 

Robert Joseph Evans commented on MAPREDUCE-4680:
------------------------------------------------

This is something that I have wanted to do for a while, and now that DEBUG_MODE has been removed it makes a lot of sense to do it. +1 for the idea. It would be good to make sure that we delete the yyyy/mm/dd directories as well, because currently we leak them.
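
Something along these lines could take care of that (just a sketch against the generic FileSystem API, not the real cleaner code; the names are made up): once a day directory has had its files cleaned out, delete it and keep walking up through the month and year directories as long as they are empty.

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch only: once a yyyy/mm/dd directory has had its history files cleaned
 * out, delete it, and keep walking up to the month and year directories as
 * long as they are empty too.
 */
public class EmptyDateDirPruner {
  public static void pruneEmptyDateDirs(FileSystem fs, Path dayDir)
      throws IOException {
    Path dir = dayDir;
    for (int level = 0; level < 3 && dir != null; level++) {
      if (fs.listStatus(dir).length > 0) {
        break; // still has children, so its ancestors are not empty either
      }
      // Non-recursive delete: only removes the (now empty) directory itself.
      fs.delete(dir, false);
      dir = dir.getParent();
    }
  }
}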
                

[jira] [Commented] (MAPREDUCE-4680) Job history cleaner should only check timestamps of files in old enough directories

Posted by "Sandy Ryza (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463143#comment-13463143 ] 

Sandy Ryza commented on MAPREDUCE-4680:
---------------------------------------

I just looked at the code again, and I think I misunderstood it the first time, so I wanted to make sure we're on the same page. Currently, all the yyyy/mm/dd directories are gathered and sorted in ascending order by time. We then go through them and delete files until we reach a directory that is young enough, at which point we halt. I had thought that job history files inside dd/ directories that were too young were being examined, but they are not.
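
As far as I can tell, the flow amounts to something like the simplified sketch below (not the actual code, just my reading of it; the class name is made up and the sort is shown on modification time):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Simplified sketch of the behavior described above, not the real code. */
public class CurrentFlowSketch {
  public static void clean(FileSystem fs, Path historyRoot, long maxAgeMs)
      throws IOException {
    long cutoff = System.currentTimeMillis() - maxAgeMs;

    // Gathering every yyyy/mm/dd directory costs a listStatus on the root,
    // on each year directory and on each month directory.
    List<FileStatus> dayDirs = new ArrayList<FileStatus>();
    for (FileStatus year : fs.listStatus(historyRoot)) {
      for (FileStatus month : fs.listStatus(year.getPath())) {
        for (FileStatus day : fs.listStatus(month.getPath())) {
          dayDirs.add(day);
        }
      }
    }

    // Oldest directories first.
    Collections.sort(dayDirs, new Comparator<FileStatus>() {
      public int compare(FileStatus a, FileStatus b) {
        return Long.compare(a.getModificationTime(), b.getModificationTime());
      }
    });

    for (FileStatus day : dayDirs) {
      if (day.getModificationTime() > cutoff) {
        break; // young enough: halt, nothing newer needs checking
      }
      for (FileStatus file : fs.listStatus(day.getPath())) {
        if (file.getModificationTime() <= cutoff) {
          fs.delete(file.getPath(), false);
        }
      }
    }
  }
}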

The load on HDFS could be decreased further by skipping directories whose names show they can't contain anything past the max age: for example, if the max age is 2 years and it's 2012, there is no need to list anything under the 2011 dir (and the same goes for the month directories). But would this be worthwhile? It would only make a difference if the max history age were greater than a month (the default is a week), in which case it could save a listStatus call for each month of age.
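
Concretely, the extra pruning would just be a comparison against the cutoff derived from the directory name, along the lines of this sketch (illustrative only; the helper names are made up):

import java.util.Calendar;

/**
 * Sketch only: given the cutoff (now minus the max history age), decide from a
 * year or year/month directory name alone whether anything under it could
 * already be expired. If not, the subtree needs no listStatus at all.
 */
public class NameBasedPruning {

  /** True if some file under yyyy/ could be older than the cutoff. */
  public static boolean yearMayContainExpired(int year, long cutoffMs) {
    Calendar startOfYear = Calendar.getInstance();
    startOfYear.clear();
    startOfYear.set(year, Calendar.JANUARY, 1);
    // The oldest possible file under this year was written at the year's
    // start; if even that is newer than the cutoff, skip the whole subtree.
    return startOfYear.getTimeInMillis() <= cutoffMs;
  }

  /** Same idea one level down, for a yyyy/mm directory. */
  public static boolean monthMayContainExpired(int year, int month, long cutoffMs) {
    Calendar startOfMonth = Calendar.getInstance();
    startOfMonth.clear();
    startOfMonth.set(year, month - 1, 1); // Calendar months are 0-based
    return startOfMonth.getTimeInMillis() <= cutoffMs;
  }
}

With the example above (a 2 year max age, running in 2012), the check for 2011 comes out false, so that whole subtree would never be listed.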

If the extra pruning isn't worthwhile, I could still make the cleaner delete the old folders.
                